Skip to content

feat(dataset): add UUID field to dataset records#1769

Open
amritghimire wants to merge 9 commits into
mainfrom
amrit/feat/add-dataset-uuid
Open

feat(dataset): add UUID field to dataset records#1769
amritghimire wants to merge 9 commits into
mainfrom
amrit/feat/add-dataset-uuid

Conversation

@amritghimire
Copy link
Copy Markdown
Contributor

Add automatic UUID generation for datasets with:

  • UUID column in datasets table with unique constraint
  • Automatic UUID4 generation for new datasets
  • Support for explicit UUID specification during creation
  • Database migration support with UUID backfill
  • Comprehensive test coverage for UUID functionality

This enables unique dataset identification across different DataChain instances and improves dataset tracking capabilities.

Add automatic UUID generation for datasets with:
- UUID column in datasets table with unique constraint
- Automatic UUID4 generation for new datasets
- Support for explicit UUID specification during creation
- Database migration support with UUID backfill
- Comprehensive test coverage for UUID functionality

This enables unique dataset identification across different
DataChain instances and improves dataset tracking capabilities.
@amritghimire amritghimire self-assigned this May 13, 2026
@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented May 13, 2026

Deploying datachain with  Cloudflare Pages  Cloudflare Pages

Latest commit: 204b907
Status: ✅  Deploy successful!
Preview URL: https://5964e01d.datachain-2g6.pages.dev
Branch Preview URL: https://amrit-feat-add-dataset-uuid.datachain-2g6.pages.dev

View logs

@codecov
Copy link
Copy Markdown

codecov Bot commented May 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a UUID identifier on dataset records to support stable, globally unique dataset identification across DataChain instances, including generation on creation, optional explicit UUID assignment, and SQLite upgrade/backfill behavior.

Changes:

  • Added a required uuid field to DatasetRecord and DatasetListRecord, and threaded it through parsing/serialization and test fixtures.
  • Extended metastore dataset creation to generate UUID4 by default while allowing explicit UUID input, and added a unique UUID index to the datasets schema.
  • Implemented SQLite-side UUID backfill logic on initialization and added functional tests covering UUID generation/preservation and upgrade backfill.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit/test_query_steps_hash.py Updates dataset record construction to include UUID.
tests/func/test_read_dataset_remote.py Adds UUID to remote dataset fixtures used by functional tests.
tests/func/test_pull.py Adds UUID to remote dataset fixture used by pull tests.
tests/func/test_metastore.py Adds assertions and new tests for dataset UUID creation, explicit UUID preservation, and upgrade backfill.
tests/conftest.py Updates dataset-related fixtures to populate UUID fields.
src/datachain/dataset.py Adds UUID fields to dataset dataclasses and updates parse logic to include UUID.
src/datachain/data_storage/sqlite.py Adds SQLite dataset UUID backfill routine and adjusts column-add default handling.
src/datachain/data_storage/metastore.py Adds dataset UUID column + unique index and extends create_dataset() to accept/generate UUID.
Comments suppressed due to low confidence (2)

src/datachain/data_storage/sqlite.py:567

  • duplicate_uuids is built with scalar_subquery(), but this subquery can return multiple UUIDs. Using a scalar subquery inside IN (...) can put SQLite into a scalar context and fail when more than one duplicate exists. Use a non-scalar subquery for IN (e.g., pass the select(...) directly / as a subquery) so multiple duplicate UUIDs are handled correctly.
        duplicate_uuids = (
            select(d.c.uuid)
            .group_by(d.c.uuid)
            .having(func.count() > 1)
            .scalar_subquery()

src/datachain/data_storage/sqlite.py:559

  • _migrate_table_schema() creates indexes before _backfill_dataset_uuids() runs. If any existing rows have duplicate/empty UUIDs, index creation can fail and be silently skipped (see _create_table_indexes), and the later backfill won’t recreate it. Consider running UUID backfill before attempting to create the UUID unique index, or re-attempt index creation after backfill.
        # Auto-migrate: add missing columns based on schema definitions
        for table in self._metastore_tables:
            self._migrate_table_schema(table)

        self._backfill_dataset_uuids()

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/datachain/data_storage/sqlite.py Outdated
Comment thread tests/func/test_metastore.py
- Fix NOT NULL constraint to apply regardless of default clause
- Add comprehensive tests for add_column edge cases covering:
  - Boolean defaults (True->1, False->0)
  - String defaults with quote escaping
  - Callable defaults (no SQL DEFAULT clause)
  - Unsupported default types (RuntimeError)

Improves test coverage for missing branches in add_column method.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/datachain/data_storage/sqlite.py:575

  • _migrate_table_schema() creates indexes before _backfill_dataset_uuids() runs. If there are pre-existing duplicate/empty UUIDs, the unique index creation can fail (and is swallowed), and it is never retried after backfill, leaving UUID uniqueness unenforced. Consider backfilling/fixing UUIDs before creating the unique index (or explicitly retry index creation for datasets after backfill, possibly dropping/recreating the index).
    def _init_tables(self) -> None:
        """Initialize tables with automatic schema migration."""
        for table in self._metastore_tables:
            self.db.create_table(table, if_not_exists=True)
            self.default_table_names.append(table.name)

        # Auto-migrate: add missing columns based on schema definitions
        for table in self._metastore_tables:
            self._migrate_table_schema(table)

        self._backfill_dataset_uuids()

    def _backfill_dataset_uuids(self) -> None:
        d = self._datasets
        duplicate_uuids = (
            select(d.c.uuid)
            .group_by(d.c.uuid)
            .having(func.count() > 1)
            .scalar_subquery()
        )
        bad_rows_query = select(d.c.id).where(
            d.c.uuid.is_(None) | (d.c.uuid == "") | d.c.uuid.in_(duplicate_uuids)
        )
        for (dataset_id,) in self.db.execute(bad_rows_query).fetchall():
            self.db.execute(
                update(d).where(d.c.id == dataset_id).values(uuid=str(uuid4()))
            )

Comment thread src/datachain/data_storage/sqlite.py
Comment thread tests/unit/test_sqlite_add_column.py
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comment thread src/datachain/data_storage/metastore.py Outdated
Comment thread src/datachain/data_storage/sqlite.py
Comment thread src/datachain/data_storage/sqlite.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants