Conversation

@JoelNiklaus (Contributor)

The HuggingFaceDatasetWriter only uploads data at the end of the run. When running inference on a large dataset, we may want to continuously upload data to the Hub so we can monitor progress. We implemented this in DataForge; this PR upstreams that writer to datatrove.

@JoelNiklaus requested a review from guipenedo on December 8, 2025, 15:45
@guipenedo (Collaborator)

How different is this from running the ParquetWriter with an hf:// path? Doesn't it also stream?

@JoelNiklaus (Contributor Author)

I have tried the HuggingFaceDatasetWriter, which inherits from ParquetWriter. Unfortunately, it does not seem to stream; it appears to upload everything only at the end.

It stages files to LFS during the run but defers the actual commit (making them visible in the repo) until close() is called. Each call to upload_files() pushes bytes to Hugging Face's LFS backend via preupload_lfs_files() and appends to an internal operations list, yet the single create_commit() that finalizes those operations only fires at teardown. This design minimizes commit noise and avoids race conditions for typical batch jobs, but it means external observers (like a progress monitor) see zero documents until the entire job finishes.

# From datatrove/pipeline/writers/huggingface.py

def upload_files(self, *filenames):
    # ...
    preupload_lfs_files(self.dataset, repo_type="dataset", additions=additions, revision=self.revision)
    # ...
    self.operations.extend(additions)  # staged, not committed

def close(self, rank: int = 0):
    # ...
    create_commit(                      # <-- only here do files become visible
        self.dataset,
        repo_type="dataset",
        operations=self.operations,
        commit_message=f"DataTrove upload ({len(self.operations)} files)",
        revision=self.revision,
    )
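
For illustration, here is a hedged sketch of committing per batch with huggingface_hub so files become visible while the job is still running. This is a simplified illustration, not the actual DataForge code; upload_and_commit is a hypothetical helper and the arguments are placeholders.

# Hypothetical helper, not part of datatrove: commit right after each pre-upload.
from huggingface_hub import CommitOperationAdd, create_commit, preupload_lfs_files

def upload_and_commit(repo_id: str, additions: list[CommitOperationAdd], revision: str | None = None):
    # Push the LFS bytes first, exactly as upload_files() already does.
    preupload_lfs_files(repo_id, repo_type="dataset", additions=additions, revision=revision)
    # Then commit immediately instead of accumulating operations until close(),
    # trading more commit noise for visible progress on the repo.
    create_commit(
        repo_id,
        repo_type="dataset",
        operations=additions,
        commit_message=f"Intermediate upload ({len(additions)} files)",
        revision=revision,
    )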

@guipenedo (Collaborator)

This is fully intended: the Hub has a hard time handling many commits, and that block was created to upload large (complete) datasets efficiently. I can't guarantee it, but I think the plain ParquetWriter with an hf:// output path should work as you intended. If it doesn't, we can add this new block.

@guipenedo (Collaborator)

Follow-up: assuming you are checkpointing with the inference runner and want to upload each checkpoint as it is ready, you don't really need streaming. If you use the base ParquetWriter with an hf:// path, each checkpoint's file will be uploaded as it completes (similar to the JsonlWriter behaviour when uploading to S3 and so on).
The only thing is to use the base ParquetWriter instead of the HuggingFaceDatasetWriter, which was built specifically to optimize the upload of large (completed) datasets.
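
To make this concrete, here is a minimal sketch of the suggested setup (untested; the input path and repo id are placeholders):

# Base ParquetWriter writing straight to a Hub dataset repo via an hf:// path;
# each shard is uploaded when its file is closed, so progress is visible during the run.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import ParquetWriter

pipeline = [
    JsonlReader("data/inputs/"),  # placeholder input source
    ParquetWriter("hf://datasets/username/my-inference-outputs"),  # placeholder repo id
]

LocalPipelineExecutor(pipeline=pipeline, tasks=1).run()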

@JoelNiklaus (Contributor Author)

I see, thanks for the explanation. Yes, that makes sense; I just tested it and was able to reproduce the behavior.

@JoelNiklaus closed this on December 9, 2025