Experimental datumaro implementation #4920
Conversation
Pull Request Overview
This PR implements an experimental Datumaro dataset integration for OTX, transitioning from legacy Datumaro components to the new experimental Dataset API. The changes introduce a new sample-based architecture while maintaining compatibility with existing OTX functionality.
Key changes:
- Migration from legacy Datumaro components to experimental Dataset API with schema-based conversion
- Introduction of new OTXSample-based data entities with PyTree registration for TorchVision compatibility
- Replacement of legacy polygon handling with numpy ragged arrays for better performance
- Comprehensive test updates and new test implementations for the updated dataset architecture
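The ragged-array polygon representation mentioned above can be sketched as follows. This is a minimal illustration, not the actual OTX code: the `rescale_polygons` helper is hypothetical, but it shows how variable-length polygons can be held in a NumPy object array (one `(N, 2)` array per polygon) instead of Datumaro `Polygon` objects:

```python
import numpy as np

def rescale_polygons(polygons: np.ndarray, scale_x: float, scale_y: float) -> np.ndarray:
    """Rescale a ragged collection of polygons stored in a NumPy object array.

    Each element is an (N, 2) float32 array of (x, y) points; N may differ per
    polygon, which is why an object-dtype ("ragged") array is used rather than
    a single rectangular array.
    """
    rescaled = np.empty(len(polygons), dtype=object)
    scale = np.array([scale_x, scale_y], dtype=np.float32)
    for i, points in enumerate(polygons):
        # Scale x and y coordinates independently.
        rescaled[i] = (points * scale).astype(np.float32)
    return rescaled

# Two polygons with different vertex counts in one ragged array.
polys = np.empty(2, dtype=object)
polys[0] = np.array([[0, 0], [4, 0], [4, 4]], dtype=np.float32)
polys[1] = np.array([[1, 1], [2, 1], [2, 2], [1, 2]], dtype=np.float32)
out = rescale_polygons(polys, 0.5, 2.0)
```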
Reviewed Changes
Copilot reviewed 59 out of 59 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pyproject.toml | Updates Datumaro dependency to experimental branch version |
| library/src/otx/data/dataset/*.py | Implements new dataset classes using experimental Datumaro with sample-based architecture |
| library/src/otx/data/entity/sample.py | Adds new OTXSample classes with PyTree registration for transform compatibility |
| library/tests/unit/types/test_label.py | Updates label tests to use new hierarchical label categories |
| library/tests/unit/data/transform_libs/test_torchvision.py | Converts polygon handling from Datumaro objects to numpy arrays |
```python
        rescaled_polygons[i] = scaled_points
    else:
        # Handle empty or invalid polygons
        rescaled_polygons[i] = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)
```
**Copilot** (AI) commented on Oct 29, 2025:
[nitpick] Creating a dummy polygon with three identical points may not be the best approach for handling empty/invalid polygons. Consider using an empty array or a clearly marked invalid polygon structure that can be properly filtered out later in the pipeline.
```diff
- rescaled_polygons[i] = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)
+ rescaled_polygons[i] = np.empty((0, 2), dtype=np.float32)
```
Why exactly 3 points, BTW? Why not just an empty tensor or a zero tensor?
I'm not entirely sure why Greg applied it in this manner, but I guess there is a validation step somewhere that requires 3 points for a valid polygon.
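One way to see the trade-off discussed in this thread: if downstream validation only checks the point count, a three-identical-point placeholder passes even though it is degenerate, while an empty `(0, 2)` array can be filtered out explicitly. A minimal sketch, assuming a point-count validity check (the `filter_valid_polygons` helper is hypothetical, not OTX code):

```python
import numpy as np

def filter_valid_polygons(polygons: list[np.ndarray]) -> list[np.ndarray]:
    """Keep only polygons with at least 3 points.

    Note: a degenerate placeholder of three identical points still passes this
    count-based check, whereas an empty (0, 2) array is dropped cleanly.
    """
    return [p for p in polygons if p.shape[0] >= 3]

dummy = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)  # degenerate, but 3 points
empty = np.empty((0, 2), dtype=np.float32)                    # filtered out
real = np.array([[0, 0], [1, 0], [1, 1]], dtype=np.float32)   # a genuine triangle

kept = filter_valid_polygons([dummy, empty, real])
```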
@kprokofi please review if these changes make sense. The old logic only worked when tiling with square images.
I think the padding does nothing here. We resize and pad to the same size.
Without the padding, tiling only works if the image is square (which it was in the previous integration test). I have replaced it with another dataset: the old dataset's polygons were self-intersecting, which isn't supported, and the new dataset has rectangular images, which is what broke this specific tiling recipe.
```diff
  # Unit test models separately using --forked flag so that each is run in a separate process.
  # Running these models after each other from the same process can cause segfaults
  - name: Unit testing models
    working-directory: library
-   run: uv run pytest tests/unit --cov --cov-report=xml
    env:
      OMP_NUM_THREADS: "1"
+   run: uv run pytest tests/unit/backend/native/models --cov --cov-report=xml --cov-append --forked
```
Do you have any idea what causes the segfaults?
It's unclear to me. I tried a few different things to fix it, such as clearing PyTorch memory, but running each model's tests in a separate process was the only fix that worked.
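The isolation the `--forked` flag provides can be illustrated with a small sketch. This uses plain `subprocess` as a stand-in for what `pytest --forked` does per test via `os.fork()`: each run gets a fresh address space, so leaked native-library state cannot accumulate across tests, and a hard crash in a child surfaces as a non-zero exit code instead of killing the whole runner:

```python
import subprocess
import sys

def run_in_subprocess(code: str) -> int:
    """Execute a Python snippet in a fresh interpreter and return its exit code.

    Because the child has its own address space, native-library state cannot
    leak between runs, and a segfault in one snippet cannot take down the
    parent process.
    """
    return subprocess.run([sys.executable, "-c", code]).returncode

ok = run_in_subprocess("print('fine')")
# A hard crash (simulated with os.abort) only kills the child process.
crash = run_in_subprocess("import os; os.abort()")
```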
**kprokofi** left a comment:
First round of comments
**kprokofi** left a comment:
Second round of comments
```python
"RandomPhotometricDistort",
"RandomGaussianBlur",
"RandomGaussianNoise",
```
Why did you remove TopdownAffine from the configurable augmentations list? We actually provide the possibility to turn this augmentation on/off in Geti.
I believe all comments should be handled now @kprokofi. Please recheck the PR when you have time.
| """OTXDataItemSample is a base class for OTX data items.""" | ||
|
|
||
| subset: Subset = subset_field() | ||
| image: np.ndarray | tv_tensors.Image | torch.Tensor = image_field(dtype=pl.UInt8(), channels_first=False) |
All other samples have channels_first=False. Why does SegmentationSample differ?
I think it's due to different transformations being called for the segmentation task. Without channels first, the predicted masks have their H/W dimensions swapped.
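The channel-order issue being discussed can be sketched as follows. This is a generic HWC-to-CHW conversion, not the actual OTX `image_field` implementation: with a non-square image, misinterpreting the layout silently swaps height and width, which is exactly how masks end up transposed:

```python
import numpy as np

def to_channels_first(image: np.ndarray) -> np.ndarray:
    """Convert an HWC image (channels last) to CHW (channels first)."""
    assert image.ndim == 3, "expected an (H, W, C) array"
    # Move the channel axis to the front; make memory contiguous afterwards.
    return np.ascontiguousarray(image.transpose(2, 0, 1))

img_hwc = np.zeros((480, 640, 3), dtype=np.uint8)  # H=480, W=640, C=3
img_chw = to_channels_first(img_hwc)
```

For a non-square image like this one, reading a CHW tensor as if it were HWC (or vice versa) produces shapes that no longer match the original image, so downstream mask predictions come out with H and W exchanged.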
**kprokofi** left a comment:
Thank you for your work!
This pull request introduces the new experimental dataset API into the dataset classes of OTX. A lot of logic has been migrated to Datumaro, such as the management of image color channels and the tiling implementation.
Summary
How to test
Checklist