Download using xet #1305
We (myself and @seanses) are also looking at introducing web workers for multithreading (and a general interface for processing uploads) in wasm xet-core. Promising stuff so far, I think.
Merging this one, and major version!
But for uploads from the Web we will/would use the Wasm version, is that correct? Not the pure JS version?
wasm for chunking
^yes
cc @hanouticelina @Wauplin @bpronan @assafvayner

`downloadFile` has an optional `xet: true` param that, when set, downloads the file with the xet protocol if possible.

## Breaking changes

- `downloadFile` returns a `Blob`
- `fileDownloadInfo`'s return format is changed:
  - `downloadLink` => `url`, and it's present every time
  - optional `xet` prop for xet files

## Concerns

https://huggingface.co/spaces/coyotte508/hub-xet uses quite a bit of CPU when downloading a full xet file, so maybe we need to introduce multi-threading (with workers?) or optimize perf.

Especially since we use it to parse safetensors data now, cc @Kakulukian

## Performance work

Before releasing out of experimental, some things need to be addressed and tested on different engines (node, browser):

- CPU usage, e.g. how fast it is to de-chunk 10GB of local data. Can it be significantly improved by moving some of the code to WASM?
- Stream backpressure: how much RAM gets used, especially given the different handling of Web streams, and whether we can make it consistent across engines.
- HTTP calls are made sequentially to save RAM, but that can hurt in high-ping, high-bandwidth situations. Not a problem if we download multiple files in parallel, but when downloading one large file, we probably don't care about reading data on the wire and can just promise-queue a bunch of HTTP calls and use some RAM. We can probably change the behavior of the lib depending on the downloaded file's size.

Being able to upload xet files would be nice too, experimentally as well (e.g. for hub-ci tests).
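For the last performance point, here is a rough sketch of what "promise-queue a bunch of HTTP calls" could look like. It is not code from this PR: `runWithConcurrency` and `pickConcurrency` are hypothetical helpers, and the 1 GB threshold and the parallelism of 4 are made-up defaults that would need measuring. The idea is to keep a bounded number of requests in flight while still returning results in order, and to fall back to sequential calls for small files.

```typescript
// Hypothetical helper, not part of the library: runs `tasks` with at most
// `limit` of them in flight at once and returns results in task order.
async function runWithConcurrency<T>(
  tasks: ReadonlyArray<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<T>(tasks.length);
  let next = 0; // index of the next task to start, shared by all workers

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      const task = tasks[index]!;
      // Storing by index keeps the output ordered even if tasks finish out of order.
      results[index] = await task();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Hypothetical policy: stay sequential (low RAM) for small files and allow
// several parallel calls for large ones. Both defaults are assumptions.
function pickConcurrency(
  fileSize: number,
  threshold = 1_000_000_000,
  parallel = 4
): number {
  return fileSize >= threshold ? parallel : 1;
}
```

Each task would wrap one ranged HTTP call. RAM use is then bounded by roughly `limit` times the size of one response, which is the trade-off described above.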