feat(dataset): add UUID field to dataset records by amritghimire · Pull Request #1769 · datachain-ai/datachain

amritghimire · 2026-05-13T12:46:34Z

Add automatic UUID generation for datasets with:

UUID column in datasets table with unique constraint
Automatic UUID4 generation for new datasets
Support for explicit UUID specification during creation
Database migration support with UUID backfill
Comprehensive test coverage for UUID functionality

This enables unique dataset identification across different DataChain instances and improves dataset tracking capabilities.

Add automatic UUID generation for datasets with: - UUID column in datasets table with unique constraint - Automatic UUID4 generation for new datasets - Support for explicit UUID specification during creation - Database migration support with UUID backfill - Comprehensive test coverage for UUID functionality This enables unique dataset identification across different DataChain instances and improves dataset tracking capabilities.

cloudflare-workers-and-pages · 2026-05-13T12:48:26Z

Deploying datachain with Cloudflare Pages

Latest commit:	`204b907`
Status:	✅ Deploy successful!
Preview URL:	https://5964e01d.datachain-2g6.pages.dev
Branch Preview URL:	https://amrit-feat-add-dataset-uuid.datachain-2g6.pages.dev

View logs

codecov · 2026-05-14T16:02:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copilot

Pull request overview

This PR introduces a UUID identifier on dataset records to support stable, globally unique dataset identification across DataChain instances, including generation on creation, optional explicit UUID assignment, and SQLite upgrade/backfill behavior.

Changes:

Added a required uuid field to DatasetRecord and DatasetListRecord, and threaded it through parsing/serialization and test fixtures.
Extended metastore dataset creation to generate UUID4 by default while allowing explicit UUID input, and added a unique UUID index to the datasets schema.
Implemented SQLite-side UUID backfill logic on initialization and added functional tests covering UUID generation/preservation and upgrade backfill.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
tests/unit/test_query_steps_hash.py	Updates dataset record construction to include UUID.
tests/func/test_read_dataset_remote.py	Adds UUID to remote dataset fixtures used by functional tests.
tests/func/test_pull.py	Adds UUID to remote dataset fixture used by pull tests.
tests/func/test_metastore.py	Adds assertions and new tests for dataset UUID creation, explicit UUID preservation, and upgrade backfill.
tests/conftest.py	Updates dataset-related fixtures to populate UUID fields.
src/datachain/dataset.py	Adds UUID fields to dataset dataclasses and updates parse logic to include UUID.
src/datachain/data_storage/sqlite.py	Adds SQLite dataset UUID backfill routine and adjusts column-add default handling.
src/datachain/data_storage/metastore.py	Adds dataset UUID column + unique index and extends `create_dataset()` to accept/generate UUID.

Comments suppressed due to low confidence (2)

src/datachain/data_storage/sqlite.py:567

duplicate_uuids is built with scalar_subquery(), but this subquery can return multiple UUIDs. Using a scalar subquery inside IN (...) can put SQLite into a scalar context and fail when more than one duplicate exists. Use a non-scalar subquery for IN (e.g., pass the select(...) directly / as a subquery) so multiple duplicate UUIDs are handled correctly.

        duplicate_uuids = (
            select(d.c.uuid)
            .group_by(d.c.uuid)
            .having(func.count() > 1)
            .scalar_subquery()

src/datachain/data_storage/sqlite.py:559

_migrate_table_schema() creates indexes before _backfill_dataset_uuids() runs. If any existing rows have duplicate/empty UUIDs, index creation can fail and be silently skipped (see _create_table_indexes), and the later backfill won’t recreate it. Consider running UUID backfill before attempting to create the UUID unique index, or re-attempt index creation after backfill.

        # Auto-migrate: add missing columns based on schema definitions
        for table in self._metastore_tables:
            self._migrate_table_schema(table)

        self._backfill_dataset_uuids()

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Fix NOT NULL constraint to apply regardless of default clause - Add comprehensive tests for add_column edge cases covering: - Boolean defaults (True->1, False->0) - String defaults with quote escaping - Callable defaults (no SQL DEFAULT clause) - Unsupported default types (RuntimeError) Improves test coverage for missing branches in add_column method.

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/datachain/data_storage/sqlite.py:575

_migrate_table_schema() creates indexes before _backfill_dataset_uuids() runs. If there are pre-existing duplicate/empty UUIDs, the unique index creation can fail (and is swallowed), and it is never retried after backfill, leaving UUID uniqueness unenforced. Consider backfilling/fixing UUIDs before creating the unique index (or explicitly retry index creation for datasets after backfill, possibly dropping/recreating the index).

    def _init_tables(self) -> None:
        """Initialize tables with automatic schema migration."""
        for table in self._metastore_tables:
            self.db.create_table(table, if_not_exists=True)
            self.default_table_names.append(table.name)

        # Auto-migrate: add missing columns based on schema definitions
        for table in self._metastore_tables:
            self._migrate_table_schema(table)

        self._backfill_dataset_uuids()

    def _backfill_dataset_uuids(self) -> None:
        d = self._datasets
        duplicate_uuids = (
            select(d.c.uuid)
            .group_by(d.c.uuid)
            .having(func.count() > 1)
            .scalar_subquery()
        )
        bad_rows_query = select(d.c.id).where(
            d.c.uuid.is_(None) | (d.c.uuid == "") | d.c.uuid.in_(duplicate_uuids)
        )
        for (dataset_id,) in self.db.execute(bad_rows_query).fetchall():
            self.db.execute(
                update(d).where(d.c.id == dataset_id).values(uuid=str(uuid4()))
            )

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

amritghimire self-assigned this May 13, 2026

amritghimire and others added 5 commits May 13, 2026 19:22

Merge branch 'main' into amrit/feat/add-dataset-uuid

ab85d59

Merge branch 'main' into amrit/feat/add-dataset-uuid

81d347d

Address tests

888fa90

Merge branch 'main' into amrit/feat/add-dataset-uuid

4add203

Fix test

8924304

amritghimire requested a review from Copilot May 15, 2026 03:56

Copilot started reviewing on behalf of amritghimire May 15, 2026 03:57 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

Comment thread src/datachain/data_storage/sqlite.py Outdated

Comment thread tests/func/test_metastore.py

amritghimire requested a review from Copilot May 15, 2026 04:25

Copilot started reviewing on behalf of amritghimire May 15, 2026 04:26 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

Comment thread src/datachain/data_storage/sqlite.py

Comment thread tests/unit/test_sqlite_add_column.py

Fix test

14776be

amritghimire requested a review from Copilot May 15, 2026 04:39

Copilot started reviewing on behalf of amritghimire May 15, 2026 04:39 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

Comment thread src/datachain/data_storage/metastore.py Outdated

Comment thread src/datachain/data_storage/sqlite.py

Comment thread src/datachain/data_storage/sqlite.py

Address pr comment

204b907

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(dataset): add UUID field to dataset records#1769

feat(dataset): add UUID field to dataset records#1769
amritghimire wants to merge 9 commits into
mainfrom
amrit/feat/add-dataset-uuid

amritghimire commented May 13, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented May 13, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 14, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

amritghimire commented May 13, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying datachain with Cloudflare Pages

Uh oh!

codecov Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cloudflare-workers-and-pages Bot commented May 13, 2026 •

edited

Loading

codecov Bot commented May 14, 2026 •

edited

Loading