
Conversation

leoll2 (Contributor) commented Jan 6, 2026

Summary

This PR proposes a design for the import/export features (#5113), as well as for the creation of new projects reusing existing datasets (#4988).

Resolves #5113
Resolves #4988

Checklist

  • The PR title and description are clear and descriptive
  • I have manually tested the changes
  • All changes are covered by automated tests
  • All related issues are linked to this PR (if applicable)
  • Documentation has been updated (if applicable)

leoll2 self-assigned this Jan 6, 2026
github-actions bot added the DOC (Improvements or additions to documentation) label Jan 6, 2026
leoll2 force-pushed the leonardo/dataset-ie-design branch from 9ddb49b to 01393f7 on January 6, 2026 at 11:11
github-actions bot commented Jan 6, 2026

Docker Image Sizes

CPU

| Image | Size |
| --- | --- |
| geti-tune-cpu:pr-5126 | 2.88G |
| geti-tune-cpu:sha-11a99f1 | 2.88G |

GPU

| Image | Size |
| --- | --- |
| geti-tune-gpu:pr-5126 | 10.66G |
| geti-tune-gpu:sha-11a99f1 | 10.66G |

XPU

| Image | Size |
| --- | --- |
| geti-tune-xpu:pr-5126 | 8.72G |
| geti-tune-xpu:sha-11a99f1 | 8.72G |

leoll2 marked this pull request as ready for review January 6, 2026 12:25
leoll2 requested a review from jpggvilaca as a code owner January 6, 2026 12:25
Copilot AI review requested due to automatic review settings January 6, 2026 12:25
Copilot AI left a comment

Pull request overview

This PR introduces a comprehensive design document for dataset import/export functionality and project forking capabilities in Geti Tune. The design establishes a staging area architecture that enables flexible dataset transfer workflows through REST APIs and background jobs.

Key Changes:

  • Comprehensive design document for dataset import/export operations with staging area architecture
  • REST API specification updates to support new job types and endpoints for dataset operations
  • Documentation of task compatibility rules for cross-project dataset transfers

Reviewed changes

Copilot reviewed 2 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| application/docs/dataset-ie.md | New comprehensive design document covering dataset import/export workflows, staging area architecture, task compatibility, job definitions, and REST API specifications |
| application/docs/api.md | Updated Jobs API section with new endpoints, standardized paths, and added dataset operation job types |


- Datasets can be arbitrarily large, and most of the operations scale linearly with dataset size.
- Examples of operations: uploading, downloading, archiving, extracting, parsing, transforming.
- Metadata about the dataset (e.g., type of annotations, labels, number of items) can be obtained without fully loading
the dataset into memory, provided that the dataset is stored in Datumaro format.
Contributor

I don't have all the knowledge, but I believe this is not the only benefit of dealing with the Datumaro format. Should we push for this across the app? Visually speaking, adding a "Recommended" tag everywhere?

Contributor Author

Correct, the Datumaro format is very convenient for several reasons. In Geti Tune, for example, it was the only format that could seamlessly support all task types and videos natively. We can definitely add a "Recommended" tag in the dataset export UI.
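
For context, a minimal sketch of the metadata inspection described above, assuming the Datumaro Python API; the path is illustrative. Media are loaded lazily, so listing labels and counting items does not require reading image data into memory.

```python
import datumaro as dm
from datumaro.components.annotation import AnnotationType

# Import a dataset stored in Datumaro format (illustrative path).
# Media are loaded lazily, so no image data is read here.
dataset = dm.Dataset.import_from("exported/dataset", format="datumaro")

# Labels and item counts come from the dataset's metadata alone.
label_categories = dataset.categories()[AnnotationType.label]
print("labels:", [category.name for category in label_categories.items])
print("items:", len(dataset))
```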

- Metadata about the dataset (e.g., type of annotations, labels, number of items) can be obtained without fully loading
the dataset into memory, provided that the dataset is stored in Datumaro format.
- Datasets may be imported into projects with different task types and labels. If so, annotations need to be adapted
accordingly (see [Task compatibility](task-compatibility.md) for more details) and the labels must be explicitly re-mapped.
Contributor

This was definitely a hard scenario to tackle in the previous Geti. What if, on the first iteration, we make it easy for everyone and just tell the user that either A) the annotations are all gone, or B) the chosen task must be X, where X is the appropriate one for the dataset they imported? WDYT? We can iterate later and make the system smarter, but to start, I think it would help tremendously if we don't have to tackle this.

Contributor

This will be much easier with the new Datumaro implementation. It will take care of, e.g., converting polygons to bounding boxes. Of course, not all conversions make sense, as information is missing if you were to go from classification to a segmentation task. I guess this is defined in the task-compatibility file (@leoll2 the link to that file isn't working, should it still be added?)

Contributor Author

As Albert said, this feature comes out of the box with the new Datumaro. It won't require much effort other than testing the various combinations (we only support 4, actually).

I fixed the link; it points to a section later in this document.
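
For reference, the conversion discussed here corresponds roughly to a Datumaro transform. A minimal sketch, assuming the Datumaro Python API; paths are illustrative, and the actual job wiring in Geti Tune is not specified in this thread:

```python
import datumaro as dm

# Load the staged dataset (illustrative path).
dataset = dm.Dataset.import_from("staged/dataset", format="datumaro")

# Convert shape annotations (e.g. polygons) into bounding boxes, as needed
# when importing into a detection project.
dataset = dataset.transform("shapes_to_boxes")

dataset.export("converted/dataset", format="datumaro")
```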


- `train`
- `prepare_dataset_for_import`
- `import_dataset_to_existing_project`
Contributor

Would it be possible to have only one endpoint for this and move the action choice to the backend? This would make things simpler, at least for the client (the UI, or a user calling the API directly). Can I not, as a user, just call the same endpoint and let the backend figure out whether the project is new or existing? (We could even pass a flag if necessary, but that defeats the purpose a little bit.)

Contributor Author

If you mean combining prepare_dataset_for_import and import_dataset_to_existing_project, it is not possible, because we need user input after uploading and scanning the dataset, namely to decide which labels to include and how to map them.

If you mean combining import_dataset_to_existing_project and import_dataset_as_new_project, that's technically possible, but the payload is quite different (see the API section), so two distinct endpoints are preferable imho.

Contributor

I meant the latter. If we can somehow find a union of all data we need that fits both, then we could merge both. But not a big deal in any case.
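
To illustrate why the payloads diverge, here are hypothetical request bodies for the two jobs; every field name below is an assumption made for illustration, not the actual API contract:

```python
# Hypothetical payloads, for illustration only; all field names are assumptions.

# import_dataset_to_existing_project: the project and its labels already
# exist, so the payload centers on mapping dataset labels to project labels.
import_to_existing = {
    "dataset_id": "staged-dataset-id",
    "project_id": "existing-project-id",
    "label_mapping": {"dog": "animal", "cat": "animal"},
}

# import_dataset_as_new_project: no project exists yet, so the payload must
# also describe the project to create (name, task type, labels to keep).
import_as_new = {
    "dataset_id": "staged-dataset-id",
    "project_name": "my-new-project",
    "task_type": "detection",
    "labels": ["dog", "cat"],
}
```

A merged endpoint would need the union of these fields plus mode-dependent validation, which is the trade-off this thread weighs.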

If the target is a new project, this job also creates the project.
- **import_dataset_as_new_project**: creates a new project, then loads a Datumaro dataset and inserts its items into
the dataset of the newly created project, applying optional filtering and annotation conversion.
- **stage_dataset**: loads a project's dataset (or dataset revision) as a Datumaro dataset, applying optional filtering,
Contributor

Why is staging, an intermediate internal step, exposed? What is the use case for the user to do this instead of just importing/exporting?

Contributor Author

It is used in the "dataset copy/fork" use case. Users can basically take the dataset of an existing project and copy it into a new/existing project, optionally filtering and remapping labels.
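
The filtering and remapping mentioned here map roughly onto Datumaro operations. A minimal sketch, assuming the Datumaro Python API; paths, the filter expression, and label names are illustrative:

```python
import datumaro as dm

# Load the source project's dataset from the staging area (illustrative path).
dataset = dm.Dataset.import_from("staging/source-project", format="datumaro")

# Keep only items that have at least one annotation (XPath-style filter).
dataset = dataset.filter("/item[annotation]")

# Remap the source labels onto the target project's labels.
dataset = dataset.transform(
    "remap_labels", mapping={"dog": "animal", "cat": "animal"}
)

# Hand the result to the target project's import step (illustrative path).
dataset.export("staging/target-project", format="datumaro")
```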


Labels

DOC Improvements or additions to documentation


Development

Successfully merging this pull request may close these issues.

- Write design for dataset I/E
- Create new project reusing existing dataset

4 participants