
Conversation

leoll2 (Contributor) commented Jan 6, 2026

Summary

This PR proposes a design for the import/export features (#5113), as well as for the creation of new projects reusing existing datasets (#4988).

Resolves #5113
Resolves #4988

Checklist

  • The PR title and description are clear and descriptive
  • I have manually tested the changes
  • All changes are covered by automated tests
  • All related issues are linked to this PR (if applicable)
  • Documentation has been updated (if applicable)

leoll2 self-assigned this Jan 6, 2026
github-actions bot added the DOC (Improvements or additions to documentation) label Jan 6, 2026
leoll2 force-pushed the leonardo/dataset-ie-design branch from 9ddb49b to 01393f7 on January 6, 2026 at 11:11
github-actions bot commented Jan 6, 2026

Docker Image Sizes

CPU

| Image | Size |
| --- | --- |
| geti-tune-cpu:pr-5126 | 2.88G |
| geti-tune-cpu:sha-11a99f1 | 2.88G |

GPU

| Image | Size |
| --- | --- |
| geti-tune-gpu:pr-5126 | 10.66G |
| geti-tune-gpu:sha-11a99f1 | 10.66G |

XPU

| Image | Size |
| --- | --- |
| geti-tune-xpu:pr-5126 | 8.72G |
| geti-tune-xpu:sha-11a99f1 | 8.72G |

leoll2 marked this pull request as ready for review January 6, 2026 12:25
leoll2 requested a review from jpggvilaca as a code owner January 6, 2026 12:25
Copilot AI review requested due to automatic review settings January 6, 2026 12:25
Copilot AI left a comment

Pull request overview

This PR introduces a comprehensive design document for dataset import/export functionality and project forking capabilities in Geti Tune. The design establishes a staging area architecture that enables flexible dataset transfer workflows through REST APIs and background jobs.

Key Changes:

  • Comprehensive design document for dataset import/export operations with staging area architecture
  • REST API specification updates to support new job types and endpoints for dataset operations
  • Documentation of task compatibility rules for cross-project dataset transfers

Reviewed changes

Copilot reviewed 2 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| application/docs/dataset-ie.md | New comprehensive design document covering dataset import/export workflows, staging area architecture, task compatibility, job definitions, and REST API specifications |
| application/docs/api.md | Updated Jobs API section with new endpoints, standardized paths, and added dataset operation job types |


- Datasets can be arbitrarily large, and most of the operations scale linearly with dataset size.
- Examples of operations: uploading, downloading, archiving, extracting, parsing, transforming.
- Metadata about the dataset (e.g., type of annotations, labels, number of items) can be obtained without fully loading
the dataset into memory, provided that the dataset is stored in Datumaro format.
Contributor

I don't have all the knowledge, but I believe this is not the only benefit of dealing with the Datumaro format. Should we push for this across the app? Visually speaking, adding a "Recommended" tag everywhere?

Contributor Author

Correct, the Datumaro format is very convenient for several reasons. In Geti Tune, for example, it was the only format that could seamlessly support all task types and videos natively. We can definitely add a "Recommended" tag in the dataset export UI.
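
For context, a minimal sketch of the metadata inspection described above, assuming the Datumaro Python API; the path is illustrative. Media are loaded lazily, so listing labels and counting items does not require reading image data into memory.

```python
import datumaro as dm
from datumaro.components.annotation import AnnotationType

# Import a dataset stored in Datumaro format (illustrative path).
# Media are loaded lazily, so no image data is read here.
dataset = dm.Dataset.import_from("exported/dataset", format="datumaro")

# Labels and item counts come from the dataset's metadata alone.
label_categories = dataset.categories()[AnnotationType.label]
print("labels:", [category.name for category in label_categories.items])
print("items:", len(dataset))
```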

- Metadata about the dataset (e.g., type of annotations, labels, number of items) can be obtained without fully loading
the dataset into memory, provided that the dataset is stored in Datumaro format.
- Datasets may be imported into projects with different task types and labels. If so, annotations need to be adapted
accordingly (see [Task compatibility](task-compatibility.md) for more details) and the labels must be explicitly re-mapped.
Contributor

This was definitely a hard scenario to tackle in the previous Geti. What if, on the first iteration, we make it easy for everyone and just tell the user that either A) the annotations are all gone, or B) the chosen task must be X, where X is the appropriate one for the dataset they imported? WDYT? We can iterate later and make the system smarter, but to start, I think it would help tremendously if we don't have to tackle this.

Contributor

This will be much easier with the new Datumaro implementation. It will take care of, e.g., converting polygons to bounding boxes. Of course, not all conversions make sense, as information is missing if you were to go from classification to a segmentation task. I guess this is defined in the task-compatibility file (@leoll2 the link to that file isn't working, should it still be added?)

Contributor Author

As Albert said, this feature comes out of the box with the new Datumaro. It won't require much effort other than testing the various combinations (we only support 4, actually).

I fixed the link; it points to a section later in this document.
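
For reference, the conversion discussed here corresponds roughly to a Datumaro transform. A minimal sketch, assuming the Datumaro Python API; paths are illustrative, and the actual job wiring in Geti Tune is not specified in this thread:

```python
import datumaro as dm

# Load the staged dataset (illustrative path).
dataset = dm.Dataset.import_from("staged/dataset", format="datumaro")

# Convert shape annotations (e.g. polygons) into bounding boxes, as needed
# when importing into a detection project.
dataset = dataset.transform("shapes_to_boxes")

dataset.export("converted/dataset", format="datumaro")
```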


- `train`
- `prepare_dataset_for_import`
- `import_dataset_to_existing_project`
Contributor

Would it be possible to have only one endpoint for this and move the action choice to the backend? This would make things simpler, at least for the client (the UI, or a user calling the API directly). Can I not, as a user, just call the same endpoint and let the backend figure out whether the project is new or existing? (We could even pass a flag if necessary, but that defeats the purpose a little bit.)

Contributor Author

If you mean combining prepare_dataset_for_import and import_dataset_to_existing_project, it is not possible, because we need user input after uploading and scanning the dataset, namely to decide which labels to include and how to map them.

If you mean combining import_dataset_to_existing_project and import_dataset_as_new_project, that's technically possible, but the payload is quite different (see the API section), so two distinct endpoints are preferable imho.

Contributor

I meant the latter. If we can somehow find a union of all data we need that fits both, then we could merge both. But not a big deal in any case.
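
To illustrate why the payloads diverge, here are hypothetical request bodies for the two jobs; every field name below is an assumption made for illustration, not the actual API contract:

```python
# Hypothetical payloads, for illustration only; all field names are assumptions.

# import_dataset_to_existing_project: the project and its labels already
# exist, so the payload centers on mapping dataset labels to project labels.
import_to_existing = {
    "dataset_id": "staged-dataset-id",
    "project_id": "existing-project-id",
    "label_mapping": {"dog": "animal", "cat": "animal"},
}

# import_dataset_as_new_project: no project exists yet, so the payload must
# also describe the project to create (name, task type, labels to keep).
import_as_new = {
    "dataset_id": "staged-dataset-id",
    "project_name": "my-new-project",
    "task_type": "detection",
    "labels": ["dog", "cat"],
}
```

A merged endpoint would need the union of these fields plus mode-dependent validation, which is the trade-off this thread weighs.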

If the target is a new project, this job also creates the project.
- **import_dataset_as_new_project**: creates a new project, then loads a Datumaro dataset and inserts its items into
the dataset of the newly created project, applying optional filtering and annotation conversion.
- **stage_dataset**: loads a project's dataset (or dataset revision) as a Datumaro dataset, applying optional filtering,
Contributor

Why is staging, an intermediate internal step, exposed? What is the use case for the user to do this instead of just importing/exporting?

Contributor Author

It is used in the "dataset copy/fork" use case. Users can basically take the dataset of an existing project and copy it into a new/existing project, optionally filtering and remapping labels.
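
The filtering and remapping mentioned here map roughly onto Datumaro operations. A minimal sketch, assuming the Datumaro Python API; paths, the filter expression, and label names are illustrative:

```python
import datumaro as dm

# Load the source project's dataset from the staging area (illustrative path).
dataset = dm.Dataset.import_from("staging/source-project", format="datumaro")

# Keep only items that have at least one annotation (XPath-style filter).
dataset = dataset.filter("/item[annotation]")

# Remap the source labels onto the target project's labels.
dataset = dataset.transform(
    "remap_labels", mapping={"dog": "animal", "cat": "animal"}
)

# Hand the result to the target project's import step (illustrative path).
dataset.export("staging/target-project", format="datumaro")
```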


Labels

DOC Improvements or additions to documentation


Development

Successfully merging this pull request may close these issues.

- Write design for dataset I/E
- Create new project reusing existing dataset

4 participants