feat(dataset): add UUID field to dataset records#1769
Conversation
Add automatic UUID generation for datasets with: - UUID column in datasets table with unique constraint - Automatic UUID4 generation for new datasets - Support for explicit UUID specification during creation - Database migration support with UUID backfill - Comprehensive test coverage for UUID functionality This enables unique dataset identification across different DataChain instances and improves dataset tracking capabilities.
Deploying datachain with
|
| Latest commit: |
204b907
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://5964e01d.datachain-2g6.pages.dev |
| Branch Preview URL: | https://amrit-feat-add-dataset-uuid.datachain-2g6.pages.dev |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
This PR introduces a UUID identifier on dataset records to support stable, globally unique dataset identification across DataChain instances, including generation on creation, optional explicit UUID assignment, and SQLite upgrade/backfill behavior.
Changes:
- Added a required
uuidfield toDatasetRecordandDatasetListRecord, and threaded it through parsing/serialization and test fixtures. - Extended metastore dataset creation to generate UUID4 by default while allowing explicit UUID input, and added a unique UUID index to the datasets schema.
- Implemented SQLite-side UUID backfill logic on initialization and added functional tests covering UUID generation/preservation and upgrade backfill.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_query_steps_hash.py | Updates dataset record construction to include UUID. |
| tests/func/test_read_dataset_remote.py | Adds UUID to remote dataset fixtures used by functional tests. |
| tests/func/test_pull.py | Adds UUID to remote dataset fixture used by pull tests. |
| tests/func/test_metastore.py | Adds assertions and new tests for dataset UUID creation, explicit UUID preservation, and upgrade backfill. |
| tests/conftest.py | Updates dataset-related fixtures to populate UUID fields. |
| src/datachain/dataset.py | Adds UUID fields to dataset dataclasses and updates parse logic to include UUID. |
| src/datachain/data_storage/sqlite.py | Adds SQLite dataset UUID backfill routine and adjusts column-add default handling. |
| src/datachain/data_storage/metastore.py | Adds dataset UUID column + unique index and extends create_dataset() to accept/generate UUID. |
Comments suppressed due to low confidence (2)
src/datachain/data_storage/sqlite.py:567
duplicate_uuidsis built withscalar_subquery(), but this subquery can return multiple UUIDs. Using a scalar subquery insideIN (...)can put SQLite into a scalar context and fail when more than one duplicate exists. Use a non-scalar subquery forIN(e.g., pass theselect(...)directly / as a subquery) so multiple duplicate UUIDs are handled correctly.
duplicate_uuids = (
select(d.c.uuid)
.group_by(d.c.uuid)
.having(func.count() > 1)
.scalar_subquery()
src/datachain/data_storage/sqlite.py:559
_migrate_table_schema()creates indexes before_backfill_dataset_uuids()runs. If any existing rows have duplicate/empty UUIDs, index creation can fail and be silently skipped (see_create_table_indexes), and the later backfill won’t recreate it. Consider running UUID backfill before attempting to create the UUID unique index, or re-attempt index creation after backfill.
# Auto-migrate: add missing columns based on schema definitions
for table in self._metastore_tables:
self._migrate_table_schema(table)
self._backfill_dataset_uuids()
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Fix NOT NULL constraint to apply regardless of default clause - Add comprehensive tests for add_column edge cases covering: - Boolean defaults (True->1, False->0) - String defaults with quote escaping - Callable defaults (no SQL DEFAULT clause) - Unsupported default types (RuntimeError) Improves test coverage for missing branches in add_column method.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/datachain/data_storage/sqlite.py:575
_migrate_table_schema()creates indexes before_backfill_dataset_uuids()runs. If there are pre-existing duplicate/empty UUIDs, the unique index creation can fail (and is swallowed), and it is never retried after backfill, leaving UUID uniqueness unenforced. Consider backfilling/fixing UUIDs before creating the unique index (or explicitly retry index creation fordatasetsafter backfill, possibly dropping/recreating the index).
def _init_tables(self) -> None:
"""Initialize tables with automatic schema migration."""
for table in self._metastore_tables:
self.db.create_table(table, if_not_exists=True)
self.default_table_names.append(table.name)
# Auto-migrate: add missing columns based on schema definitions
for table in self._metastore_tables:
self._migrate_table_schema(table)
self._backfill_dataset_uuids()
def _backfill_dataset_uuids(self) -> None:
d = self._datasets
duplicate_uuids = (
select(d.c.uuid)
.group_by(d.c.uuid)
.having(func.count() > 1)
.scalar_subquery()
)
bad_rows_query = select(d.c.id).where(
d.c.uuid.is_(None) | (d.c.uuid == "") | d.c.uuid.in_(duplicate_uuids)
)
for (dataset_id,) in self.db.execute(bad_rows_query).fetchall():
self.db.execute(
update(d).where(d.c.id == dataset_id).values(uuid=str(uuid4()))
)
Add automatic UUID generation for datasets with:
This enables unique dataset identification across different DataChain instances and improves dataset tracking capabilities.