Design for dataset I/E #5126
Conversation
Pull request overview
This PR introduces a comprehensive design document for dataset import/export functionality and project forking capabilities in Geti Tune. The design establishes a staging area architecture that enables flexible dataset transfer workflows through REST APIs and background jobs.
Key Changes:
- Comprehensive design document for dataset import/export operations with staging area architecture
- REST API specification updates to support new job types and endpoints for dataset operations
- Documentation of task compatibility rules for cross-project dataset transfers
Reviewed changes
Copilot reviewed 2 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| application/docs/dataset-ie.md | New comprehensive design document covering dataset import/export workflows, staging area architecture, task compatibility, job definitions, and REST API specifications |
| application/docs/api.md | Updated Jobs API section with new endpoints, standardized paths, and added dataset operation job types |
application/docs/dataset-ie.md (Outdated)
> - Datasets can be arbitrarily large, and most of the operations scale linearly with dataset size.
>   - Example of operations: uploading, downloading, archiving, extracting, parsing, transforming.
> - Metadata about the dataset (e.g., type of annotations, labels, number of items) can be obtained without fully loading
>   the dataset into memory, provided that the dataset is stored in Datumaro format.
I don't have all the context, but I believe this is not the only benefit of dealing with the Datumaro format. Should we push for it across the app, e.g., visually, by adding a "Recommended" tag everywhere?
Correct, the Datumaro format is very convenient for several reasons. In Geti Tune, for example, it was the only format that could seamlessly support all task types and videos natively. We can definitely add a "Recommended" tag in the dataset export UI.
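For illustration, a minimal sketch of reading such metadata with Datumaro without materializing the full dataset in memory; the path is a placeholder, and exact behavior depends on the Datumaro version:

```python
# Minimal sketch (not from the design doc): Datumaro imports lazily, so item
# counts and label metadata are available without loading media into memory.
# The path below is a placeholder.
import datumaro as dm

dataset = dm.Dataset.import_from("exported_dataset/", format="datumaro")

print(len(dataset))           # number of items, without decoding any media
print(dataset.categories())   # label/category metadata per annotation type
```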
application/docs/dataset-ie.md (Outdated)
> - Metadata about the dataset (e.g., type of annotations, labels, number of items) can be obtained without fully loading
>   the dataset into memory, provided that the dataset is stored in Datumaro format.
> - Datasets may be imported to projects of different task types and labels. If so, annotations need to be adapted
>   accordingly (see [Task compatibility](task-compatibility.md) for more details) and the labels must be explicitly re-mapped.
This was definitely a hard scenario to tackle in the previous Geti. What if, in the first iteration, we make it easy for everyone and just tell the user that either A) the annotations are all gone, or B) the chosen task must be X, where X is the appropriate one for the imported dataset? WDYT? We can iterate later and make the system smarter, but to start, I think it would help tremendously if we don't have to tackle this.
This will be much easier with the new Datumaro implementation. It will take care of, e.g., converting from polygons to bounding boxes. Of course, not all scenarios make sense to convert, as there is missing information if you were to go from a classification to a segmentation task. I guess this is defined in the task-compatibility file (@leoll2 the link to that file isn't working, should it still be added?)
As Albert said, this feature comes out of the box with the new Datumaro. It won't require much effort other than testing the various combinations (we only support 4, actually).
I fixed the link; it points to a section later in this document.
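For context, a hedged sketch of the kind of out-of-the-box conversion discussed above, using Datumaro's built-in transforms; the label names are placeholders, and how Geti Tune actually wires these into the import job is an assumption:

```python
# Sketch only: "shapes_to_boxes" and "remap_labels" are built-in Datumaro
# transforms; the invocation shown here is illustrative, not Geti Tune's code.
import datumaro as dm

dataset = dm.Dataset.import_from("exported_dataset/", format="datumaro")

# Convert polygons/masks to bounding boxes (e.g., segmentation -> detection).
dataset = dataset.transform("shapes_to_boxes")

# Re-map dataset labels onto the target project's label set (placeholder names).
dataset = dataset.transform("remap_labels", mapping={"cat": "animal", "dog": "animal"})
```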
> - `train`
> - `prepare_dataset_for_import`
> - `import_dataset_to_existing_project`
Would it be possible to have only one endpoint for this and move the choice of action to the backend? This would make it simpler at least for the client (UI or a user calling the API). Can I not, as a user, just call the same endpoint and let the backend figure out whether the project is new or existing? (We could even pass a flag if necessary, but that defeats the purpose a little bit.)
If you mean combining prepare_dataset_for_import and import_dataset_to_existing_project, it is not possible because we need user input after uploading and scanning the dataset, namely to decide which labels to include and how to map them.
If you mean combining import_dataset_to_existing_project and import_dataset_as_new_project, that's technically possible, but the payloads are quite different (see the API section), so two distinct endpoints are preferable IMHO.
I meant the latter. If we can somehow find a union of all data we need that fits both, then we could merge both. But not a big deal in any case.
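To make the payload difference concrete, here are two hypothetical request sketches; the endpoint paths and field names are invented for illustration and are not the API defined in this PR:

```python
# Hypothetical payloads only: endpoint paths, IDs, and field names below are
# placeholders, not taken from the design document.
import requests

BASE = "https://geti.example.com/api/v1"

# Importing into an existing project requires a target project and an explicit
# mapping from dataset labels onto the project's existing labels.
requests.post(f"{BASE}/jobs/import_dataset_to_existing_project", json={
    "dataset_id": "dataset-123",
    "project_id": "project-456",
    "label_mapping": {"cat": "animal", "dog": "animal"},
})

# Importing as a new project has no target project yet, so the payload instead
# carries the new project's name, task type, and the labels to create.
requests.post(f"{BASE}/jobs/import_dataset_as_new_project", json={
    "dataset_id": "dataset-123",
    "project_name": "animals",
    "task_type": "detection",
    "labels": ["cat", "dog"],
})
```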
> If the target is a new project, this job also creates the project.
> - **import_dataset_as_new_project**: creates a new project, then loads a Datumaro dataset and inserts its items into
>   the dataset of the newly created project, applying optional filtering and annotation conversion.
> - **stage_dataset**: loads a project's dataset (or dataset revision) as a Datumaro dataset, applying optional filtering,
Why is staging, an intermediary internal step, exposed? What is the use case for the user to do this instead of just importing/exporting?
It is used in the "dataset copy/fork" use case. Users can basically take the dataset of an existing project and copy it into a new/existing project, optionally filtering and remapping labels.
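Conceptually, the staging step could look like the following sketch, assuming the staged dataset is persisted in Datumaro format; the paths, filter expression, and label names are illustrative assumptions:

```python
# Conceptual sketch of the stage_dataset job; all paths and names are
# placeholders, and the actual job implementation may differ.
import datumaro as dm

source = dm.Dataset.import_from("projects/src/dataset/", format="datumaro")

# Optional filtering with Datumaro's XPath-like expressions,
# e.g., keep only items that have at least one annotation.
staged = source.filter("/item/annotation")

# Optional label remapping before insertion into the target project.
staged = staged.transform("remap_labels", mapping={"defect": "anomaly"})

# Persist to the staging area, where an import job can later pick it up.
staged.export("staging/<staging-id>/", format="datumaro", save_media=True)
```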
Summary
This PR proposes a design for the import/export features (#5113), as well as for the creation of new projects reusing existing datasets (#4988).
Resolves #5113
Resolves #4988