Commit 58a077f

add instruction on how to get HF dataset URI
1 parent a6e2483 commit 58a077f

2 files changed, 11 additions and 4 deletions

README.md

Lines changed: 9 additions & 0 deletions
@@ -243,6 +243,15 @@ dataset = StreamingDataset('s3://my-bucket/my-data', cache_dir="/path/to/cache")
 
 To use your favorite Hugging Face dataset with LitData, simply pass its URL to `StreamingDataset`.
 
+<details>
+<summary>How to get HF dataset URI?</summary>
+
+- To get the HF dataset URI, `HF: use dataset -> polars -> HF_URI without filename`.
+- For `hf://datasets/open-thoughts/OpenThoughts-114k/data/train-*.parquet`: remove `train-*.parquet`.
+- Use **`hf://datasets/open-thoughts/OpenThoughts-114k/data`**.
+
+</details>
+
 ```python
 import litdata as ld
 
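The filename-trimming step described in the README hunk above can be sketched as a small helper. This is only an illustration: `hf_dataset_dir` is a hypothetical name, not part of LitData, and it assumes the URI to trim ends in a `.parquet` filename or glob.

```python
import posixpath


def hf_dataset_dir(uri: str) -> str:
    """Hypothetical helper: drop a trailing parquet filename from an hf:// URI."""
    if not uri.startswith("hf://"):
        raise ValueError(f"Expected an hf:// URI, got: {uri}")
    # Split off the last path segment, e.g. "train-*.parquet".
    head, tail = posixpath.split(uri)
    # Only trim when the last segment looks like a parquet file or glob;
    # otherwise assume the URI already points at the dataset directory.
    return head if tail.endswith(".parquet") else uri.rstrip("/")


file_uri = "hf://datasets/open-thoughts/OpenThoughts-114k/data/train-*.parquet"
dataset_uri = hf_dataset_dir(file_uri)
print(dataset_uri)
```

The resulting `dataset_uri` is the value the README says to pass to `StreamingDataset`, for example `ld.StreamingDataset(dataset_uri)`.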

src/litdata/streaming/downloader.py

Lines changed: 2 additions & 4 deletions
@@ -205,10 +205,8 @@ def download_file(self, remote_filepath: str, local_filepath: str) -> None:
         temp_path = local_filepath + ".tmp" # Avoid partial writes
         try:
             with self.fs.open(remote_filepath, "rb") as cloud_file, open(temp_path, "wb") as local_file:
-                data = cloud_file.read()
-                if isinstance(data, str):
-                    raise ValueError(f"Expected parquet data in bytes format. But found str. {remote_filepath}")
-                local_file.write(data)
+                for chunk in iter(lambda: cloud_file.read(4096), b""): # Stream in 4KB chunks local_file.
+                    local_file.write(chunk)
 
             os.rename(temp_path, local_filepath) # Atomic move after successful write
 