Conversation

@JoelNiklaus (Contributor)

The HuggingFaceDatasetWriter only uploads data at the end of the run. When running inference on a large dataset, we may want to continuously upload data to the Hub so we can monitor progress. We implemented this in DataForge; this PR upstreams that writer to datatrove.

@JoelNiklaus requested a review from guipenedo on December 8, 2025, 15:45
@guipenedo (Collaborator)

How different is this from running the ParquetWriter with an hf:// path? Doesn't it also stream?

@JoelNiklaus (Contributor Author)

I have tried the HuggingFaceDatasetWriter, which inherits from ParquetWriter. Unfortunately, it does not seem to stream; it appears to upload everything only at the end.

It stages files to LFS during the run but defers the actual commit (making them visible in the repo) until close() is called. Each call to upload_files() pushes bytes to Hugging Face's LFS backend via preupload_lfs_files() and appends to an internal operations list, yet the single create_commit() that finalizes those operations only fires at teardown. This design minimizes commit noise and avoids race conditions for typical batch jobs, but it means external observers (like a progress monitor) see zero documents until the entire job finishes.

# From datatrove/pipeline/writers/huggingface.py

def upload_files(self, *filenames):
    # ...
    preupload_lfs_files(self.dataset, repo_type="dataset", additions=additions, revision=self.revision)
    # ...
    self.operations.extend(additions)  # staged, not committed

def close(self, rank: int = 0):
    # ...
    create_commit(                      # <-- only here do files become visible
        self.dataset,
        repo_type="dataset",
        operations=self.operations,
        commit_message=f"DataTrove upload ({len(self.operations)} files)",
        revision=self.revision,
    )
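
For illustration, here is a hedged sketch of committing per batch with huggingface_hub so files become visible while the job is still running. This is a simplified illustration, not the actual DataForge code; upload_and_commit is a hypothetical helper and the arguments are placeholders.

# Hypothetical helper, not part of datatrove: commit right after each pre-upload.
from huggingface_hub import CommitOperationAdd, create_commit, preupload_lfs_files

def upload_and_commit(repo_id: str, additions: list[CommitOperationAdd], revision: str | None = None):
    # Push the LFS bytes first, exactly as upload_files() already does.
    preupload_lfs_files(repo_id, repo_type="dataset", additions=additions, revision=revision)
    # Then commit immediately instead of accumulating operations until close(),
    # trading more commit noise for visible progress on the repo.
    create_commit(
        repo_id,
        repo_type="dataset",
        operations=additions,
        commit_message=f"Intermediate upload ({len(additions)} files)",
        revision=revision,
    )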

@guipenedo (Collaborator)

This is fully intended: the Hub has a hard time handling many commits, and that block was created to upload large (complete) datasets efficiently. I can't guarantee it, but I think the plain ParquetWriter with an hf:// output path should work as you intended. If it doesn't, we can add this new block.

@guipenedo (Collaborator)

Follow-up: assuming you are checkpointing with the inference runner and want to upload each checkpoint as it is ready, you don't really need streaming. If you use the base ParquetWriter with an hf:// path, each checkpoint's file will be uploaded as it completes (similar to the JsonlWriter behaviour when uploading to S3 and so on).
The only thing is to use the base ParquetWriter instead of the HuggingFaceDatasetWriter, which was built specifically to optimize the upload of large (completed) datasets.
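
To make this concrete, here is a minimal sketch of the suggested setup (untested; the input path and repo id are placeholders):

# Base ParquetWriter writing straight to a Hub dataset repo via an hf:// path;
# each shard is uploaded when its file is closed, so progress is visible during the run.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import ParquetWriter

pipeline = [
    JsonlReader("data/inputs/"),  # placeholder input source
    ParquetWriter("hf://datasets/username/my-inference-outputs"),  # placeholder repo id
]

LocalPipelineExecutor(pipeline=pipeline, tasks=1).run()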

@JoelNiklaus (Contributor Author)

I see, thanks for the explanation. Yes, that makes sense; I just tested it and was able to reproduce the behavior.

@JoelNiklaus closed this on December 9, 2025