@fracapuano
Contributor

Title

Fixes aggregation of image datasets

Type / Scope

  • Type: Bug
  • Scope: Bug fix for aggregate_datasets. Also affects lerobot-edit-dataset

Summary / Motivation

Related issues

What changed

  • Short, concrete bullets of the modifications (files/behaviour).
  • Short note if this introduces breaking changes and migration steps.

How was this tested

  • Tests added: list new tests or test files.

How to run locally (reviewer)

  • Run the relevant tests:

    pytest tests/datasets/test_aggregate.py::test_aggregate_image_datasets

Run these tests to confirm no breaking changes on closely related parts of the library.

pytest tests/datasets/test_aggregate.py
========================================== test session starts ===========================================
platform darwin -- Python 3.10.13, pytest-8.4.1, pluggy-1.6.0
rootdir: /Users/fracapuano/Desktop/personal/lerobot
configfile: pyproject.toml
plugins: timeout-2.4.0, anyio-4.10.0, cov-6.2.1, mock-serial-0.0.1, hydra-core-1.3.2
collected 4 items                                                                                        

Testing with DEVICE='cpu'

tests/datasets/test_aggregate.py ....                                                              [100%]

=========================================== 4 passed in 15.24s ===========================================
(lerobot) ➜  lerobot git:(fix/fracapuano-aggregate-image-datasets) ✗ pytest tests/datasets/test_dataset_tools.py
========================================== test session starts ===========================================
platform darwin -- Python 3.10.13, pytest-8.4.1, pluggy-1.6.0
rootdir: /Users/fracapuano/Desktop/personal/lerobot
configfile: pyproject.toml
plugins: timeout-2.4.0, anyio-4.10.0, cov-6.2.1, mock-serial-0.0.1, hydra-core-1.3.2
collected 40 items                                                                                       

Testing with DEVICE='cpu'

tests/datasets/test_dataset_tools.py ........................................                      [100%]

========================================== 40 passed in 44.80s ===========================================

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest) (run test_aggregate.py, test_dataset_tools.py)
  • N/A: Documentation updated
  • CI is green

Reviewer notes

  • Anything the reviewer should focus on (performance, edge-cases, specific files) or general notes.
  • Anyone in the community is free to review the PR.

Copilot AI review requested due to automatic review settings December 24, 2025 18:17
@github-actions (bot) added the "dataset" and "tests" labels on Dec 24, 2025
Copilot AI left a comment

Pull request overview

This PR fixes a bug in image dataset aggregation where HuggingFace Image() feature types were being lost during the aggregation process, causing images to be stored with generic struct schemas instead. The fix ensures proper preservation of image schemas by passing feature metadata through the aggregation pipeline.

Key changes:

  • Modified aggregate_data() and append_or_create_parquet_file() to retrieve and pass HuggingFace features schema for image datasets
  • Added special handling for reading and writing parquet files containing images using datasets.Dataset.from_parquet() to preserve image format
  • Added comprehensive test coverage with test_aggregate_image_datasets() to verify schema preservation and data integrity

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/datasets/test_aggregate.py | Added comprehensive test for image dataset aggregation including schema validation and data integrity checks |
| src/lerobot/datasets/aggregate.py | Updated aggregation logic to retrieve and pass HuggingFace features schema when processing image datasets, ensuring proper Image() type preservation in parquet files |

    dst_path.parent.mkdir(parents=True, exist_ok=True)
    if contains_images:
-       to_parquet_with_hf_images(df, dst_path)
+       to_parquet_with_hf_images(df, dst_path, features=hf_features)
Copilot AI Dec 24, 2025

The function to_parquet_with_hf_images is being called with a features parameter, but the current function signature in utils.py only accepts (df: pandas.DataFrame, path: Path) and does not have a features parameter. This will cause a TypeError at runtime. The function signature needs to be updated to accept and use the features parameter to properly preserve the HuggingFace Image schema.
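
The failure mode described in this comment can be reproduced without lerobot. The sketch below uses two stand-in functions (hypothetical, not the real `utils.py` code): one with the two-argument signature the comment describes, and one extended with an optional `features` keyword, which is one backwards-compatible way to accept the new argument.

```python
import inspect


def to_parquet_with_hf_images_old(df, path):
    """Stand-in with the two-argument signature described in the review comment."""


def to_parquet_with_hf_images_new(df, path, features=None):
    """Stand-in whose signature adds an optional `features` keyword."""
    return features


def accepts_keyword(func, name):
    """Return True if `func` declares a parameter called `name`."""
    return name in inspect.signature(func).parameters


# Calling the old signature with `features=` raises TypeError at call time.
try:
    to_parquet_with_hf_images_old(None, "out.parquet", features={"image": "Image"})
    old_call_failed = False
except TypeError:
    old_call_failed = True

print(old_call_failed)
print(accepts_keyword(to_parquet_with_hf_images_new, "features"))
```

Defaulting `features` to `None` keeps existing two-argument call sites working while letting the aggregation code pass the schema through.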

Copilot uses AI. Check for mistakes.

    if contains_images:
-       to_parquet_with_hf_images(final_df, target_path)
+       to_parquet_with_hf_images(final_df, target_path, features=hf_features)
Copilot AI Dec 24, 2025

The function to_parquet_with_hf_images is being called with a features parameter, but the current function signature in utils.py only accepts (df: pandas.DataFrame, path: Path) and does not have a features parameter. This will cause a TypeError at runtime. The function signature needs to be updated to accept and use the features parameter to properly preserve the HuggingFace Image schema.

    df = pd.read_parquet(src_path)
    if contains_images:
        # Use HuggingFace datasets to read source data to preserve image format
        src_ds = datasets.Dataset.from_parquet(str(src_path))
Copilot AI Dec 24, 2025

When reading image datasets using datasets.Dataset.from_parquet, the features parameter should be passed to ensure image columns are properly loaded with the correct schema. Without this, the image data might not be correctly preserved during the read-update-write cycle. Consider using datasets.Dataset.from_parquet(str(src_path), features=hf_features) to maintain schema consistency.

Suggested change

-   src_ds = datasets.Dataset.from_parquet(str(src_path))
+   src_ds = datasets.Dataset.from_parquet(str(src_path), features=hf_features)

    existing_df = pd.read_parquet(dst_path)
    if contains_images:
        # Use HuggingFace datasets to read existing data to preserve image format
        existing_ds = datasets.Dataset.from_parquet(str(dst_path))
Copilot AI Dec 24, 2025

When reading existing image datasets using datasets.Dataset.from_parquet, the features parameter should be passed to ensure image columns are properly loaded with the correct schema. Without this, the image data might not be correctly preserved during the read-merge-write cycle. Consider using datasets.Dataset.from_parquet(str(dst_path), features=hf_features) to maintain schema consistency.

Suggested change

-   existing_ds = datasets.Dataset.from_parquet(str(dst_path))
+   existing_ds = datasets.Dataset.from_parquet(str(dst_path), features=hf_features)

Development

Successfully merging this pull request may close these issues.

aggregate_datasets loses Image feature schema for image datasets
