Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download using xet #1305

Open
wants to merge 21 commits into
base: main
Choose a base branch
from
Open

Download using xet #1305

wants to merge 21 commits into from

Conversation

coyotte508
Copy link
Member

@coyotte508 coyotte508 commented Mar 21, 2025

cc @hanouticelina @Wauplin @bpronan @assafvayner

downloadFile has a xet: true optional param, that when set will download the file with the xet protocol if possible.

Breaking changes

  • downloadFile returns a Blob
  • fileDownloadInfo's return format is changed:
    • downloadLink => url, and it's present every time
    • optional xet prop for xet files

Concerns

https://huggingface.co/spaces/coyotte508/hub-xet uses quite a bit of CPU when downloading a full xet file, maybe we need to introduce multi-threading (with workers?) or optimize perf

Especially since we use it to parse safetensors data now cc @Kakulukian

Performance work

Before releasing out of experimental, some things need to be addressed & tested on different engines (node, browser):

  • CPU usage, eg how fast it is to de-chunk 10GB of local data. Can it be significantly improved with moving some of the code to WASM.
  • Stream backpressure, how much RAM does get used, especially due to different handling of Web streams, and if we can make it consistent across
  • Http calls are sequentially made, to save RAM, but it can hurt in high-ping high-bandwidth situations. Not a problem if we download multiple files in parallel, but when downloading one large file, we probably don't care about reading data on the wire and can just promise-queue a bunch of http calls and use some RAM. We can probably change the behavior of the lib depending on the downloaded file's size.

Being able to upload xet files would be nice too, experimentally as well (eg for hub-ci tests)

@assafvayner
Copy link
Collaborator

uses quite a bit of CPU when downloading a full xet file, maybe we need to introduce multi-threading (with workers?) or optimize perf

We (myself and @seanses) are also looking at introducing web-workers for multithreading (and general interface for processing uploads) wasm xet-core. Promising stuff so far I think.

@coyotte508 coyotte508 marked this pull request as ready for review April 8, 2025 14:20
@coyotte508 coyotte508 requested a review from xenova as a code owner April 8, 2025 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants