
Conversation

Contributor

@AlbertvanHouten AlbertvanHouten commented Oct 20, 2025

This pull request introduces the new experimental dataset into the dataset classes of OTX. Much of the logic, such as the management of image color channels and the tiling implementation, has been migrated to Datumaro.

Summary

How to test

Checklist

  • The PR title and description are clear and descriptive
  • I have manually tested the changes
  • All changes are covered by automated tests
  • All related issues are linked to this PR (if applicable)
  • Documentation has been updated (if applicable)

AlbertvanHouten and others added 16 commits September 4, 2025 10:10
Signed-off-by: Albert van Houten <[email protected]>
Co-authored-by: Leonardo Lai <[email protected]>
…4751)

Signed-off-by: Albert van Houten <[email protected]>
Co-authored-by: Grégoire Payen de La Garanderie <[email protected]>
…ing_extensions into feature/datumaro

# Conflicts:
#	library/src/otx/data/dataset/base_new.py
#	library/src/otx/data/dataset/classification_new.py
#	library/src/otx/data/dataset/detection_new.py
#	library/src/otx/data/dataset/instance_segmentation_new.py
#	library/src/otx/data/dataset/keypoint_detection_new.py
#	library/src/otx/data/dataset/segmentation_new.py
#	library/src/otx/data/entity/sample.py
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Co-authored-by: Albert van Houten <[email protected]>
@github-actions github-actions bot added the DEPENDENCY, TEST, and BUILD labels Oct 20, 2025
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
…ing_extensions into feature/datumaro

Signed-off-by: Albert van Houten <[email protected]>

# Conflicts:
#	library/pyproject.toml
#	library/src/otx/data/dataset/anomaly.py
#	library/src/otx/data/dataset/base.py
#	library/src/otx/data/dataset/classification.py
#	library/src/otx/data/dataset/detection.py
#	library/src/otx/data/dataset/instance_segmentation.py
#	library/src/otx/data/dataset/keypoint_detection.py
#	library/src/otx/data/dataset/segmentation.py
#	library/src/otx/data/dataset/tile.py
#	library/src/otx/data/factory.py
#	library/src/otx/data/module.py
#	library/src/otx/data/transform_libs/torchvision.py
#	library/tests/unit/data/samplers/test_class_incremental_sampler.py
#	library/tests/unit/data/utils/test_utils.py
@AlbertvanHouten AlbertvanHouten changed the title Feature/datumaro Experimental datumaro implementation Oct 29, 2025
@AlbertvanHouten AlbertvanHouten marked this pull request as ready for review October 29, 2025 09:24
@AlbertvanHouten AlbertvanHouten requested a review from a team as a code owner October 29, 2025 09:24
Copilot AI review requested due to automatic review settings October 29, 2025 09:24
Contributor

Copilot AI left a comment


Pull Request Overview

This PR implements an experimental Datumaro dataset integration for OTX, transitioning from legacy Datumaro components to the new experimental Dataset API. The changes introduce a new sample-based architecture while maintaining compatibility with existing OTX functionality.

Key changes:

  • Migration from legacy Datumaro components to experimental Dataset API with schema-based conversion
  • Introduction of new OTXSample-based data entities with PyTree registration for TorchVision compatibility
  • Replacement of legacy polygon handling with numpy ragged arrays for better performance
  • Comprehensive test updates and new test implementations for the updated dataset architecture
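The ragged-array representation mentioned above can be sketched with numpy object arrays, where each element is an independent (N, 2) point array. This is a minimal illustration with hypothetical helper names, not the actual OTX code:

```python
import numpy as np

def make_ragged_polygons(point_lists):
    """Pack variable-length polygons into a numpy object array (a ragged array)."""
    ragged = np.empty(len(point_lists), dtype=object)
    for i, pts in enumerate(point_lists):
        ragged[i] = np.asarray(pts, dtype=np.float32).reshape(-1, 2)
    return ragged

def rescale_polygons(polygons, scale_x, scale_y):
    """Return a new ragged array with every polygon scaled per axis."""
    rescaled = np.empty(len(polygons), dtype=object)
    for i, poly in enumerate(polygons):
        rescaled[i] = poly * np.array([scale_x, scale_y], dtype=np.float32)
    return rescaled

polygons = make_ragged_polygons([[(0, 0), (4, 0), (4, 2)], [(1, 1), (3, 1), (3, 3), (1, 3)]])
scaled = rescale_polygons(polygons, 2.0, 0.5)
print(scaled[0])  # [[0. 0.] [8. 0.] [8. 1.]]
```

Compared with one Python object per polygon, this keeps each polygon's points in a contiguous float32 buffer, so per-polygon operations stay vectorized even though polygon lengths differ.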

Reviewed Changes

Copilot reviewed 59 out of 59 changed files in this pull request and generated 4 comments.

Summary per file:

  • pyproject.toml: Updates Datumaro dependency to experimental branch version
  • library/src/otx/data/dataset/*.py: Implements new dataset classes using experimental Datumaro with sample-based architecture
  • library/src/otx/data/entity/sample.py: Adds new OTXSample classes with PyTree registration for transform compatibility
  • library/tests/unit/types/test_label.py: Updates label tests to use new hierarchical label categories
  • library/tests/unit/data/transform_libs/test_torchvision.py: Converts polygon handling from Datumaro objects to numpy arrays


rescaled_polygons[i] = scaled_points
else:
# Handle empty or invalid polygons
rescaled_polygons[i] = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)

Copilot AI Oct 29, 2025


[nitpick] Creating a dummy polygon with three identical points may not be the best approach for handling empty/invalid polygons. Consider using an empty array or a clearly marked invalid polygon structure that can be properly filtered out later in the pipeline.

Suggested change
rescaled_polygons[i] = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)
rescaled_polygons[i] = np.empty((0, 2), dtype=np.float32)

Contributor


Why exactly 3 points, by the way? Why not just an empty tensor or a zero tensor?

Contributor Author


I'm not entirely sure why Greg applied it in this manner, but I guess there is a validation step somewhere that requires 3 points for a valid polygon.
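For context, the two sentinels being compared here behave differently downstream. A minimal sketch (hypothetical helper, not the OTX pipeline): an empty `(0, 2)` array fails a point-count check, while the three-identical-points dummy passes it but is caught by an area check.

```python
import numpy as np

DUMMY = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)  # degenerate 3-point sentinel

def filter_valid(polygons):
    """Keep polygons with at least 3 points and non-zero area."""
    def area(poly):
        # Shoelace formula for the area of a simple polygon.
        x, y = poly[:, 0], poly[:, 1]
        return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
    return [p for p in polygons if len(p) >= 3 and area(p) > 0]

triangle = np.array([[0, 0], [4, 0], [0, 3]], dtype=np.float32)
empty = np.empty((0, 2), dtype=np.float32)

# Both sentinels are dropped: the empty array by the point count,
# the dummy by the zero-area test. Only the real triangle survives.
print(len(filter_valid([triangle, DUMMY, empty])))  # 1
```

So if a validation step only counts points, the dummy sentinel survives it where an empty array would not, which may be why it was chosen.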

# Conflicts:
#	library/tests/unit/data/dataset/test_base.py
#	library/tests/unit/data/test_tiling.py
Contributor Author


@kprokofi please review if these changes make sense. The old logic only worked when tiling with square images.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the padding does nothing here. We resize and pad to the same size.

Contributor Author

@AlbertvanHouten AlbertvanHouten Dec 19, 2025


Without the padding, tiling only works if the image is square (which it was previously in the integration test). I have replaced it with another dataset: the polygons in the old dataset were self-intersecting, which isn't supported, and the new dataset has rectangular images, which is what broke this particular tiling recipe.
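One way to see why padding matters for non-square images: unless each dimension is padded up to a whole number of tile strides, the last row or column of tiles overruns the image. A minimal sketch (hypothetical tile-grid arithmetic, not the OTX tiler):

```python
import math

def padded_size(image_size, tile_size, overlap):
    """Pad each dimension up so a whole number of overlapping tiles covers it."""
    stride = int(tile_size * (1 - overlap))
    h, w = image_size
    n_h = max(1, math.ceil((h - tile_size) / stride) + 1)  # tiles along height
    n_w = max(1, math.ceil((w - tile_size) / stride) + 1)  # tiles along width
    return (stride * (n_h - 1) + tile_size, stride * (n_w - 1) + tile_size)

# A square image can fit the tile grid exactly, hiding the missing padding ...
print(padded_size((512, 512), 256, 0.5))  # (512, 512)
# ... while a rectangular image needs padding along one axis.
print(padded_size((512, 600), 256, 0.5))  # (512, 640)
```

With tile size 256 and 50% overlap the stride is 128, so 512 is covered exactly while 600 must be padded to 640. This is why a square-only integration test would never exercise the padding path.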

Comment on lines +144 to +150
# Unit test models separately using --forked flag so that each is run in a separate process.
# Running these models after each other from the same process can cause segfaults
- name: Unit testing models
working-directory: library
run: uv run pytest tests/unit --cov --cov-report=xml
env:
OMP_NUM_THREADS: "1"
run: uv run pytest tests/unit/backend/native/models --cov --cov-report=xml --cov-append --forked
Contributor


Do you have any idea what causes the segfaults?

Contributor Author


It's unclear to me. I've tried a few different things to see if they would fix it, like cleaning up PyTorch memory usage, but this was the only way I could fix it.

Contributor

@kprokofi kprokofi left a comment


First round of comments


Contributor

@kprokofi kprokofi left a comment


Second round of comments


Comment on lines 112 to 114
"RandomPhotometricDistort",
"RandomGaussianBlur",
"RandomGaussianNoise",
Contributor


Why did you remove TopdownAffine from the configurable augs list? We actually provide the possibility to turn this augmentation on/off in Geti.

Signed-off-by: Albert van Houten <[email protected]>
@github-actions github-actions bot added the DOC label Jan 6, 2026
@AlbertvanHouten
Contributor Author

I believe all comments are addressed now, @kprokofi. Please recheck the PR when you have time.

"""OTXDataItemSample is a base class for OTX data items."""

subset: Subset = subset_field()
image: np.ndarray | tv_tensors.Image | torch.Tensor = image_field(dtype=pl.UInt8(), channels_first=False)
Contributor


All other samples have channels_first=False. Why does SegmentationSample differ?

Contributor Author


I think it's due to the different transformations being called for the segmentation task. Without channels first, the predicted masks will have the H/W channels swapped.
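The channel-order pitfall discussed here is easy to demonstrate with plain numpy: for a non-square image, code that expects CHW but receives HWC silently reads the wrong axes as height and width. A minimal illustrative sketch, not the OTX transforms:

```python
import numpy as np

hwc = np.zeros((480, 640, 3), dtype=np.uint8)  # height=480, width=640, channels last

# Correct conversion to channels-first: move the channel axis up front,
# so H and W keep their meaning in the last two positions.
chw = np.transpose(hwc, (2, 0, 1))
print(chw.shape)  # (3, 480, 640)

# A consumer that assumes channels-first reads the last two axes as (H, W).
# Fed the still-HWC array, it gets (640, 3) instead of (480, 640),
# which is how masks end up with garbled spatial dimensions.
h, w = hwc.shape[-2:]
print((h, w))  # (640, 3)
```

A square image hides the bug because swapping H and W changes nothing; it only surfaces once the height and width differ.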

leoll2
leoll2 previously approved these changes Jan 7, 2026
Signed-off-by: Albert van Houten <[email protected]>
Contributor

@kprokofi kprokofi left a comment


Thank you for your work!

@AlbertvanHouten AlbertvanHouten added this pull request to the merge queue Jan 9, 2026
Merged via the queue into develop with commit 4376a84 Jan 9, 2026
45 checks passed

Labels

BUILD, DEPENDENCY, DOC, Geti Tune Backend, TEST


Development

Successfully merging this pull request may close these issues.

Integrate OTX with the new dataset class

8 participants