Skip to content

Download using xet #1305

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 35 commits into from
May 6, 2025
Merged

Download using xet #1305

merged 35 commits into from
May 6, 2025

Conversation

coyotte508
Copy link
Member

@coyotte508 coyotte508 commented Mar 21, 2025

cc @hanouticelina @Wauplin @bpronan @assafvayner

downloadFile has a xet: true optional param, that when set will download the file with the xet protocol if possible.

Breaking changes

  • downloadFile returns a Blob
  • fileDownloadInfo's return format is changed:
    • downloadLink => url, and it's present every time
    • optional xet prop for xet files

Concerns

https://huggingface.co/spaces/coyotte508/hub-xet uses quite a bit of CPU when downloading a full xet file, maybe we need to introduce multi-threading (with workers?) or optimize perf

Especially since we use it to parse safetensors data now cc @Kakulukian

Performance work

Before releasing out of experimental, some things need to be addressed & tested on different engines (node, browser):

  • CPU usage, eg how fast it is to de-chunk 10GB of local data. Can it be significantly improved with moving some of the code to WASM.
  • Stream backpressure, how much RAM does get used, especially due to different handling of Web streams, and if we can make it consistent across
  • Http calls are sequentially made, to save RAM, but it can hurt in high-ping high-bandwidth situations. Not a problem if we download multiple files in parallel, but when downloading one large file, we probably don't care about reading data on the wire and can just promise-queue a bunch of http calls and use some RAM. We can probably change the behavior of the lib depending on the downloaded file's size.

Being able to upload xet files would be nice too, experimentally as well (eg for hub-ci tests)

@assafvayner
Copy link
Collaborator

uses quite a bit of CPU when downloading a full xet file, maybe we need to introduce multi-threading (with workers?) or optimize perf

We (myself and @seanses) are also looking at introducing web-workers for multithreading (and general interface for processing uploads) wasm xet-core. Promising stuff so far I think.

@coyotte508 coyotte508 marked this pull request as ready for review April 8, 2025 14:20
@coyotte508 coyotte508 requested a review from xenova as a code owner April 8, 2025 14:20
@coyotte508
Copy link
Member Author

Merging this one, and major version!

@coyotte508 coyotte508 merged commit a27b5b4 into main May 6, 2025
5 checks passed
@coyotte508 coyotte508 deleted the download-xet branch May 6, 2025 10:47
@julien-c
Copy link
Member

julien-c commented May 6, 2025

Being able to upload xet files would be nice too, experimentally as well

but for uploads from the Web we will/would use the Wasm version, is that correct? not the pure JS version?

@coyotte508
Copy link
Member Author

Being able to upload xet files would be nice too, experimentally as well

but for uploads from the Web we will/would use the Wasm version, is that correct? not the pure JS version?

wasm for chunking

@julien-c
Copy link
Member

julien-c commented May 6, 2025

^yes

Deep-unlearning pushed a commit that referenced this pull request May 13, 2025
cc @hanouticelina @Wauplin @bpronan @assafvayner 

`downloadFile` has a `xet: true` optional param, that when set will
download the file with the xet protocol if possible.

## Breaking changes

- `downloadFile` returns a `Blob`
- `fileDownloadInfo`'s return format is changed:
	- `downloadLink` => `url`, and it's present every time
	-  optional `xet` prop for xet files

## Concerns

https://huggingface.co/spaces/coyotte508/hub-xet uses quite a bit of CPU
when downloading a full xet file, maybe we need to introduce
multi-threading (with workers?) or optimize perf

Especially since we use it to parse safetensors data now cc @Kakulukian 

## Performance work

Before releasing out of experimental, some things need to be addressed &
tested on different engines (node, browser):

- CPU usage, eg how fast it is to de-chunk 10GB of local data. Can it be
significantly improved with moving some of the code to WASM.
- Stream backpressure, how much RAM does get used, especially due to
different handling of Web streams, and if we can make it consistent
across
- Http calls are sequentially made, to save RAM, but it can hurt in
high-ping high-bandwidth situations. Not a problem if we download
multiple files in parallel, but when downloading one large file, we
probably don't care about reading data on the wire and can just
promise-queue a bunch of http calls and use some RAM. We can probably
change the behavior of the lib depending on the downloaded file's size.

Being able to upload xet files would be nice too, experimentally as well
(eg for hub-ci tests)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants