Fixes aggregation of image datasets #2717

fracapuano · 2025-12-24T18:17:08Z

Title

Fixes aggregation of image datasets

Type / Scope

Type: Bug
Scope: Bug fix for aggregate_datasets. Also affects lerobot-edit-dataset

Summary / Motivation

Fixes #2715 ("aggregate_datasets loses Image feature schema for image datasets"). Also adds tests to ensure this edge case is properly covered.
As a side note: I'd advocate for removing image datasets altogether :)

Related issues

Fixes / Closes: aggregate_datasets loses Image feature schema for image datasets #2715

What changed

Short, concrete bullets of the modifications (files/behaviour).
Short note if this introduces breaking changes and migration steps.

How was this tested

Tests added: list new tests or test files.

How to run locally (reviewer)

Run the relevant tests:

pytest tests/datasets/test_aggregate.py::test_aggregate_image_datasets

Run these tests to confirm no breaking changes on closely related parts of the library.

pytest tests/datasets/test_aggregate.py
========================================== test session starts ===========================================
platform darwin -- Python 3.10.13, pytest-8.4.1, pluggy-1.6.0
rootdir: /Users/fracapuano/Desktop/personal/lerobot
configfile: pyproject.toml
plugins: timeout-2.4.0, anyio-4.10.0, cov-6.2.1, mock-serial-0.0.1, hydra-core-1.3.2
collected 4 items                                                                                        

Testing with DEVICE='cpu'

tests/datasets/test_aggregate.py ....                                                              [100%]

=========================================== 4 passed in 15.24s ===========================================
(lerobot) ➜  lerobot git:(fix/fracapuano-aggregate-image-datasets) ✗ pytest tests/datasets/test_dataset_tools.py
========================================== test session starts ===========================================
platform darwin -- Python 3.10.13, pytest-8.4.1, pluggy-1.6.0
rootdir: /Users/fracapuano/Desktop/personal/lerobot
configfile: pyproject.toml
plugins: timeout-2.4.0, anyio-4.10.0, cov-6.2.1, mock-serial-0.0.1, hydra-core-1.3.2
collected 40 items                                                                                       

Testing with DEVICE='cpu'

tests/datasets/test_dataset_tools.py ........................................                      [100%]

========================================== 40 passed in 44.80s ===========================================

Checklist (required before merge)

Linting/formatting run (pre-commit run -a)
All tests pass locally (pytest) (run test_aggregate.py, test_dataset_tools.py)
NA [] Documentation updated
CI is green

Reviewer notes

Anything the reviewer should focus on (performance, edge-cases, specific files) or general notes.
Anyone in the community is free to review the PR.

Copilot

Pull request overview

This PR fixes a bug in image dataset aggregation where HuggingFace Image() feature types were being lost during the aggregation process, causing images to be stored with generic struct schemas instead. The fix ensures proper preservation of image schemas by passing feature metadata through the aggregation pipeline.

Key changes:

Modified aggregate_data() and append_or_create_parquet_file() to retrieve and pass HuggingFace features schema for image datasets
Added special handling for reading and writing parquet files containing images using datasets.Dataset.from_parquet() to preserve image format
Added comprehensive test coverage with test_aggregate_image_datasets() to verify schema preservation and data integrity
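
To illustrate why the schema has to be passed explicitly, here is a minimal toy model in plain Python. It is not the lerobot or datasets implementation: `infer_type`, `write_table`, `aggregate`, and the `"Image"` tag are all hypothetical stand-ins. The assumption is that a generic writer can only infer a column type from the raw values, so an encoded image (a dict with `bytes` and `path`) is indistinguishable from any other struct unless the feature metadata travels with the data.

```python
from typing import Any, Optional

IMAGE = "Image"  # hypothetical stand-in for a typed image feature


def infer_type(value: Any) -> str:
    """Guess a column type from a raw value, as a generic writer would."""
    if isinstance(value, dict):
        # Without a type tag, an encoded image looks like any other struct.
        return "struct<" + ", ".join(sorted(value)) + ">"
    return type(value).__name__


def write_table(rows: list, features: Optional[dict] = None) -> dict:
    """Return a toy 'file' (a dict) holding the rows plus a schema.

    Explicit features take precedence; anything else is inferred from row 0.
    """
    inferred = {name: infer_type(value) for name, value in rows[0].items()}
    schema = {**inferred, **(features or {})}
    return {"schema": schema, "rows": rows}


def aggregate(sources: list, features: Optional[dict] = None) -> dict:
    """Concatenate the rows of several tables and write them back out."""
    merged = [row for src in sources for row in src["rows"]]
    return write_table(merged, features=features)


features = {"observation.image": IMAGE}
frame = {"observation.image": {"bytes": b"\x89PNG", "path": None}, "index": 0}
a = write_table([frame], features=features)
b = write_table([{**frame, "index": 1}], features=features)

lossy = aggregate([a, b])                      # schema is re-inferred
fixed = aggregate([a, b], features=features)   # schema is passed through

print(lossy["schema"]["observation.image"])  # struct<bytes, path>
print(fixed["schema"]["observation.image"])  # Image
```

In this toy model the rows are identical in both cases; only the schema attached to the output differs, which matches the symptom described in the overview.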

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
tests/datasets/test_aggregate.py	Added comprehensive test for image dataset aggregation including schema validation and data integrity checks
src/lerobot/datasets/aggregate.py	Updated aggregation logic to retrieve and pass HuggingFace features schema when processing image datasets, ensuring proper Image() type preservation in parquet files

Copilot · 2025-12-24T18:20:48Z

src/lerobot/datasets/aggregate.py

        dst_path.parent.mkdir(parents=True, exist_ok=True)
        if contains_images:
-            to_parquet_with_hf_images(df, dst_path)
+            to_parquet_with_hf_images(df, dst_path, features=hf_features)


The function to_parquet_with_hf_images is being called with a features parameter, but the current function signature in utils.py only accepts (df: pandas.DataFrame, path: Path) and does not have a features parameter. This will cause a TypeError at runtime. The function signature needs to be updated to accept and use the features parameter to properly preserve the HuggingFace Image schema.
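
The failure mode described in this comment can be reproduced with a small self-contained sketch. The two functions below are hypothetical stand-ins with empty bodies, not the real `to_parquet_with_hf_images`; they only mirror the two signatures under discussion (without and with an optional `features` parameter).

```python
from pathlib import Path
from typing import Any, Callable


def write_without_features(df: Any, path: Path) -> None:
    """Hypothetical stand-in for the signature that lacks `features`."""


def write_with_features(df: Any, path: Path, features: Any = None) -> None:
    """Hypothetical stand-in for a signature with an optional `features`."""


def accepts_features(fn: Callable[..., None]) -> bool:
    """Return True if `fn` can be called with a `features` keyword argument."""
    try:
        fn(None, Path("out.parquet"), features={"image": "Image"})
    except TypeError:
        # Python raises TypeError for an unexpected keyword argument.
        return False
    return True


print(accepts_features(write_without_features))  # False
print(accepts_features(write_with_features))     # True
```

Giving `features` a default of `None` keeps existing two-argument call sites working while letting the aggregation code pass the schema through.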

Copilot · 2025-12-24T18:20:48Z

src/lerobot/datasets/aggregate.py


    if contains_images:
-        to_parquet_with_hf_images(final_df, target_path)
+        to_parquet_with_hf_images(final_df, target_path, features=hf_features)


The function to_parquet_with_hf_images is being called with a features parameter, but the current function signature in utils.py only accepts (df: pandas.DataFrame, path: Path) and does not have a features parameter. This will cause a TypeError at runtime. The function signature needs to be updated to accept and use the features parameter to properly preserve the HuggingFace Image schema.

Copilot · 2025-12-24T18:20:48Z

src/lerobot/datasets/aggregate.py

-        df = pd.read_parquet(src_path)
+        if contains_images:
+            # Use HuggingFace datasets to read source data to preserve image format
+            src_ds = datasets.Dataset.from_parquet(str(src_path))


When reading image datasets using datasets.Dataset.from_parquet, the features parameter should be passed to ensure image columns are properly loaded with the correct schema. Without this, the image data might not be correctly preserved during the read-update-write cycle. Consider using datasets.Dataset.from_parquet(str(src_path), features=hf_features) to maintain schema consistency.

Suggested change

-            src_ds = datasets.Dataset.from_parquet(str(src_path))
+            src_ds = datasets.Dataset.from_parquet(str(src_path), features=hf_features)

Copilot · 2025-12-24T18:20:49Z

src/lerobot/datasets/aggregate.py

-        existing_df = pd.read_parquet(dst_path)
+        if contains_images:
+            # Use HuggingFace datasets to read existing data to preserve image format
+            existing_ds = datasets.Dataset.from_parquet(str(dst_path))


When reading existing image datasets using datasets.Dataset.from_parquet, the features parameter should be passed to ensure image columns are properly loaded with the correct schema. Without this, the image data might not be correctly preserved during the read-merge-write cycle. Consider using datasets.Dataset.from_parquet(str(dst_path), features=hf_features) to maintain schema consistency.

Suggested change

-            existing_ds = datasets.Dataset.from_parquet(str(dst_path))
+            existing_ds = datasets.Dataset.from_parquet(str(dst_path), features=hf_features)

fracapuano added 2 commits December 24, 2025 19:07

fix: use features when aggregating image based datasets

6d56a83

add: test asserting for data type

34210ee

Copilot AI review requested due to automatic review settings December 24, 2025 18:17

github-actions bot added the dataset and tests labels Dec 24, 2025

Copilot started reviewing on behalf of fracapuano December 24, 2025 18:17

Copilot AI reviewed Dec 24, 2025

add: features param to writing dataset

c7ec775
