Fixes aggregation of image datasets #2717
base: main
Conversation
Pull request overview
This PR fixes a bug in image dataset aggregation where HuggingFace Image() feature types were being lost during the aggregation process, causing images to be stored with generic struct schemas instead. The fix ensures proper preservation of image schemas by passing feature metadata through the aggregation pipeline.
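To make the failure mode concrete, here is a minimal reproduction sketch (the column name, file names, and array size are illustrative and not taken from this PR): an episode parquet written by datasets keeps the Image() type in its schema metadata, but round-tripping it through pandas without re-attaching the features drops that typing, which is the kind of schema loss this PR addresses.

```python
# Illustrative sketch only; "image", the file names, and the array size are made up.
import datasets
import numpy as np
import pandas as pd
from PIL import Image as PILImage

features = datasets.Features({"image": datasets.Image()})
frame = PILImage.fromarray(np.zeros((8, 8, 3), dtype=np.uint8))
ds = datasets.Dataset.from_dict({"image": [frame]}, features=features)
ds.to_parquet("episode.parquet")  # schema metadata still records Image()

# Aggregating through pandas loses the HuggingFace schema metadata ...
df = pd.read_parquet("episode.parquet")
df.to_parquet("aggregated.parquet")

# ... so the image column is read back as a generic struct, not Image().
print(datasets.Dataset.from_parquet("aggregated.parquet").features)
```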
Key changes:
- Modified aggregate_data() and append_or_create_parquet_file() to retrieve and pass the HuggingFace features schema for image datasets
- Added special handling for reading and writing parquet files containing images using datasets.Dataset.from_parquet() to preserve the image format
- Added comprehensive test coverage with test_aggregate_image_datasets() to verify schema preservation and data integrity
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/datasets/test_aggregate.py | Added comprehensive test for image dataset aggregation including schema validation and data integrity checks |
| src/lerobot/datasets/aggregate.py | Updated aggregation logic to retrieve and pass HuggingFace features schema when processing image datasets, ensuring proper Image() type preservation in parquet files |
```diff
 dst_path.parent.mkdir(parents=True, exist_ok=True)
 if contains_images:
-    to_parquet_with_hf_images(df, dst_path)
+    to_parquet_with_hf_images(df, dst_path, features=hf_features)
```
Copilot (AI) · Dec 24, 2025
The function to_parquet_with_hf_images is being called with a features parameter, but the current function signature in utils.py only accepts (df: pandas.DataFrame, path: Path) and does not have a features parameter. This will cause a TypeError at runtime. The function signature needs to be updated to accept and use the features parameter to properly preserve the HuggingFace Image schema.
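For context, a hedged sketch of what such a signature change in utils.py might look like (this is not the actual implementation from the PR, only an illustration of accepting and forwarding the schema):

```python
from pathlib import Path

import datasets
import pandas as pd


def to_parquet_with_hf_images(
    df: pd.DataFrame, path: Path, features: datasets.Features | None = None
) -> None:
    # Build a datasets.Dataset with the explicit features so Image() columns
    # keep their image typing when the parquet file is written.
    ds = datasets.Dataset.from_pandas(df, features=features, preserve_index=False)
    ds.to_parquet(str(path))
```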
```diff
 if contains_images:
-    to_parquet_with_hf_images(final_df, target_path)
+    to_parquet_with_hf_images(final_df, target_path, features=hf_features)
```
Copilot (AI) · Dec 24, 2025
The function to_parquet_with_hf_images is being called with a features parameter, but the current function signature in utils.py only accepts (df: pandas.DataFrame, path: Path) and does not have a features parameter. This will cause a TypeError at runtime. The function signature needs to be updated to accept and use the features parameter to properly preserve the HuggingFace Image schema.
```python
df = pd.read_parquet(src_path)
if contains_images:
    # Use HuggingFace datasets to read source data to preserve image format
    src_ds = datasets.Dataset.from_parquet(str(src_path))
```
Copilot (AI) · Dec 24, 2025
When reading image datasets using datasets.Dataset.from_parquet, the features parameter should be passed to ensure image columns are properly loaded with the correct schema. Without this, the image data might not be correctly preserved during the read-update-write cycle. Consider using datasets.Dataset.from_parquet(str(src_path), features=hf_features) to maintain schema consistency.
Suggested change:

```diff
-src_ds = datasets.Dataset.from_parquet(str(src_path))
+src_ds = datasets.Dataset.from_parquet(str(src_path), features=hf_features)
```
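For reference, the hf_features object in these suggestions is presumably a datasets.Features mapping that declares the image columns as Image(); a small illustration (column names and the file name are invented, not taken from the PR):

```python
import datasets

# Hypothetical schema; real column names come from the dataset being aggregated.
hf_features = datasets.Features(
    {
        "observation.images.top": datasets.Image(),
        "frame_index": datasets.Value("int64"),
    }
)

# Passing it at read time keeps the image columns typed as Image().
src_ds = datasets.Dataset.from_parquet("episode_000000.parquet", features=hf_features)
```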
```python
existing_df = pd.read_parquet(dst_path)
if contains_images:
    # Use HuggingFace datasets to read existing data to preserve image format
    existing_ds = datasets.Dataset.from_parquet(str(dst_path))
```
Copilot (AI) · Dec 24, 2025
When reading existing image datasets using datasets.Dataset.from_parquet, the features parameter should be passed to ensure image columns are properly loaded with the correct schema. Without this, the image data might not be correctly preserved during the read-merge-write cycle. Consider using datasets.Dataset.from_parquet(str(dst_path), features=hf_features) to maintain schema consistency.
Suggested change:

```diff
-existing_ds = datasets.Dataset.from_parquet(str(dst_path))
+existing_ds = datasets.Dataset.from_parquet(str(dst_path), features=hf_features)
```
Title
Fixes aggregation of image datasets

Type / Scope
aggregate_datasets. Also affects lerobot-edit-dataset.

Summary / Motivation
Fixes aggregate_datasets loses Image feature schema for image datasets #2715. Also adds tests to ensure this edge case is properly covered.

Related issues
aggregate_datasets loses Image feature schema for image datasets #2715

What changed

How was this tested
How was this tested
How to run locally (reviewer)
Run the relevant tests:
Run these tests to confirm no breaking changes on closely related parts of the library.
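The PR text does not spell out the commands; plausible invocations, with the new test name taken from this PR and the location of test_dataset_tools.py assumed to mirror test_aggregate.py:

```bash
pytest tests/datasets/test_aggregate.py -k test_aggregate_image_datasets
pytest tests/datasets/test_dataset_tools.py
```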
Checklist (required before merge)
- pre-commit run -a
- pytest (run test_aggregate.py, test_dataset_tools.py)

Reviewer notes