Ilongin/12872 dataset soft delete #1770

Draft
ilongin wants to merge 6 commits into main from ilongin/12872-dataset-soft-delete

@ilongin ilongin commented May 14, 2026

Closes datachain-ai/studio#12872.

catalog.remove_dataset_version no longer hard-deletes a COMPLETE user dataset version. Instead it:

  • drops the warehouse rows table,
  • sets status = REMOVED + removed_at = now(),
  • rewrites the version column to "1.0.0~removed-<id>" so the (dataset_id, version) uniqueness slot is freed for reuse,
  • keeps the version row and all its dataset_dependencies so dependents can still render lineage.

Non-COMPLETE versions (CREATED/FAILED/STALE/REMOVING leftovers from the GC path) and internal datasets (lst__*, session_*) continue to hard-delete.
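The branching described above can be sketched roughly as follows. This is a minimal illustration rather than the PR's code: `VersionRow`, `is_internal`, and `remove_version` are hypothetical names, and apart from REMOVED = 8 the status values are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import IntEnum
from typing import Optional

REMOVED_VERSION_SUFFIX = "~removed-"  # constant named in this PR


class DatasetStatus(IntEnum):
    # Only REMOVED = 8 is stated in the PR; the other values are placeholders.
    CREATED = 1
    COMPLETE = 4
    REMOVED = 8


@dataclass
class VersionRow:  # hypothetical stand-in for a dataset version row
    id: int
    version: str
    status: DatasetStatus
    removed_at: Optional[datetime] = None


def is_internal(dataset_name: str) -> bool:
    # Assumption: internal datasets are recognised by these name prefixes.
    return dataset_name.startswith(("lst__", "session_"))


def remove_version(dataset_name: str, row: VersionRow) -> Optional[VersionRow]:
    """Return the tombstone for a soft delete, or None for a hard delete."""
    soft = row.status == DatasetStatus.COMPLETE and not is_internal(dataset_name)
    if not soft:
        return None  # the caller would delete the row and its dependencies
    # Soft delete: mangle the version, mark the row REMOVED, stamp the time.
    row.version = f"{row.version}{REMOVED_VERSION_SUFFIX}{row.id}"
    row.status = DatasetStatus.REMOVED
    row.removed_at = datetime.now(timezone.utc)
    return row
```

In both branches the warehouse rows table would be dropped first; the sketch only covers what happens to the metadata row.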

Key changes

  • DatasetStatus.REMOVED = 8 + DatasetVersion.removed_at + DatasetVersion.display_version (strips the suffix added on removal).
  • New constant REMOVED_VERSION_SUFFIX = "~removed-".
  • DatasetRecord.live_versions filters tombstones; latest_version / latest_major_version / latest_compatible_version / DatasetListRecord.latest_version skip REMOVED.
  • metastore.update_dataset_version: positional lookup arg renamed version → dataset_version so the version= column can be updated via **kwargs (the rewrite on removal).
  • Checkpoint and delta paths detect REMOVED and fall back to recreate / rebuild instead of trying to read a tombstone.
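To make the suffix handling concrete, here is a small sketch of how stripping the suffix and skipping tombstones could work. The function names are illustrative and not necessarily those in the PR, and the sketch detects tombstones by suffix only, whereas the PR also marks them with status = REMOVED.

```python
from typing import Optional

REMOVED_VERSION_SUFFIX = "~removed-"  # constant introduced by this PR


def display_version(stored: str) -> str:
    """Strip the removal suffix, if any, to recover the original semver."""
    return stored.partition(REMOVED_VERSION_SUFFIX)[0]


def is_tombstone(stored: str) -> bool:
    # Assumption: suffix-based detection, for illustration only.
    return REMOVED_VERSION_SUFFIX in stored


def latest_live_version(stored_versions: list[str]) -> Optional[str]:
    """Highest semver among the versions that are not tombstones."""
    live = [v for v in stored_versions if not is_tombstone(v)]
    if not live:
        return None
    # Compare numerically so that 1.10.0 sorts above 1.2.0.
    return max(live, key=lambda v: tuple(int(p) for p in v.split(".")))
```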

Behavior notes

  • Slot reuse works: re-saving under the same name (auto-bump or explicit version="1.0.0") succeeds — the renamed tombstone keeps its row but doesn't occupy the live slot.
  • dc.datasets() / read_dataset() hide REMOVED tombstones automatically via the existing include_incomplete=False filter.
  • Schema migration is column-only (removed_at); auto-migration in _migrate_table_schema handles it.
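The slot-reuse behavior can be reproduced in isolation with an in-memory SQLite table. This is only a sketch under assumed names (the `versions` table and its columns are simplified stand-ins for the real metastore schema, and 4 is a placeholder for COMPLETE); it shows that renaming the tombstone lets a new row take the same (dataset_id, version) pair without touching the unique constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE versions ("
    " id INTEGER PRIMARY KEY,"
    " dataset_id INTEGER NOT NULL,"
    " version TEXT NOT NULL,"
    " status INTEGER NOT NULL,"
    " UNIQUE (dataset_id, version))"
)

# Save version 1.0.0 (status 4 stands in for COMPLETE).
conn.execute("INSERT INTO versions VALUES (1, 7, '1.0.0', 4)")

# Soft delete: mangle the version with the row id and mark it REMOVED (8).
conn.execute(
    "UPDATE versions SET version = version || '~removed-' || id, status = 8"
    " WHERE id = 1"
)

# Re-saving 1.0.0 now succeeds because the unique slot is free again.
conn.execute("INSERT INTO versions VALUES (2, 7, '1.0.0', 4)")

rows = conn.execute("SELECT version, status FROM versions ORDER BY id").fetchall()
```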

@ilongin ilongin marked this pull request as draft May 14, 2026 08:58

codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 84.44444% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/datachain/dataset.py | 80.76% | 3 Missing and 2 partials ⚠️ |
| src/datachain/catalog/catalog.py | 84.61% | 1 Missing and 1 partial ⚠️ |

self.warehouse.drop_dataset_rows_table(dataset, version)

if soft:
    # Rename the `version` column to free the (dataset_id, version)
Contributor Author

There are 2 alternatives to this:

  1. Permanently reserve a version number even after that dataset version is removed, e.g. if someone creates v1.0.5, removes it, and re-creates it, the new one will be v1.0.6 instead of v1.0.5.
  2. Add a conditional unique constraint (both SQLite and PG support this) that enforces uniqueness of dataset_id + version ONLY when status is not REMOVED.

The problem with the first one is that it's not really good UX for users IMO. The problem with the second one, which is the best option, is that it needs a DB migration, which is not easy in local SQLite especially, because the table has to be dropped and re-created when constraints change.
This is why I decided on the current solution, where every time we soft-delete a dataset version we also rewrite its version, e.g. 1.0.4 -> 1.0.4~removed-12345
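For reference, the second alternative (a conditional unique constraint) could look roughly like this in SQLite, expressed as a partial unique index. This is a sketch on a fresh table with assumed names (`versions`, `uq_live_version`, status 4 as a placeholder for COMPLETE); it does not address the migration cost mentioned above, which arises when an existing table-level constraint has to be replaced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE versions ("
    " id INTEGER PRIMARY KEY,"
    " dataset_id INTEGER NOT NULL,"
    " version TEXT NOT NULL,"
    " status INTEGER NOT NULL)"
)
# Uniqueness is enforced only for rows that are not REMOVED (status 8).
conn.execute(
    "CREATE UNIQUE INDEX uq_live_version"
    " ON versions (dataset_id, version) WHERE status != 8"
)


def save(version: str) -> int:
    """Insert a live version for dataset 1 and return its row id."""
    cur = conn.execute(
        "INSERT INTO versions (dataset_id, version, status) VALUES (1, ?, 4)",
        (version,),
    )
    return cur.lastrowid


first = save("1.0.5")
conn.execute("UPDATE versions SET status = 8 WHERE id = ?", (first,))
second = save("1.0.5")  # allowed: the tombstone is outside the index

try:
    save("1.0.5")  # a second live 1.0.5 violates the partial index
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

With this approach the version column keeps its original value, so no suffix stripping would be needed for display.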


cloudflare-workers-and-pages Bot commented May 15, 2026

Deploying datachain with Cloudflare Pages

Latest commit: 236af57
Status: ✅  Deploy successful!
Preview URL: https://401766be.datachain-2g6.pages.dev
Branch Preview URL: https://ilongin-12872-dataset-soft-d.datachain-2g6.pages.dev

@amritghimire amritghimire left a comment

I think the more general approach to soft delete is to add a new column, deleted_at, so that we can clear items that have been in the trash for a long time, if needed, by filtering on time. Also, a simple filter selecting rows where deleted_at is null would do the work, and so on.

ilongin commented May 15, 2026

> I think the more general approach to soft delete is to add a new column, deleted_at, so that we can clear items that have been in the trash for a long time, if needed, by filtering on time. Also, a simple filter selecting rows where deleted_at is null would do the work, and so on.

I did add a removed_at column.
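As an illustration of what a timestamp column enables, the following sketch filters live rows with `removed_at IS NULL` and purges tombstones older than a cutoff. The table layout and the 30-day retention period are assumptions made for the example, not something this PR defines.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE versions (id INTEGER PRIMARY KEY, version TEXT, removed_at TEXT)"
)

now = datetime(2026, 5, 15, tzinfo=timezone.utc)  # fixed for reproducibility
conn.executemany(
    "INSERT INTO versions VALUES (?, ?, ?)",
    [
        (1, "1.0.0", None),  # live
        (2, "1.0.1~removed-2", (now - timedelta(days=90)).isoformat()),
        (3, "1.0.2~removed-3", (now - timedelta(days=1)).isoformat()),
    ],
)

# Live rows are simply those without a removal timestamp.
live_ids = [r[0] for r in conn.execute(
    "SELECT id FROM versions WHERE removed_at IS NULL ORDER BY id"
)]

# Purge tombstones older than the (assumed) 30-day retention period.
# ISO-8601 strings in the same timezone compare correctly as text, and
# NULL never satisfies the comparison, so live rows are untouched.
cutoff = (now - timedelta(days=30)).isoformat()
purged = conn.execute(
    "DELETE FROM versions WHERE removed_at < ?", (cutoff,)
).rowcount

remaining_ids = [r[0] for r in conn.execute("SELECT id FROM versions ORDER BY id")]
```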

@ilongin ilongin requested a review from amritghimire May 15, 2026 07:03
# Rename the `version` column to free the (dataset_id, version)
# uniqueness slot so a future save can reclaim it. The original
# semver is still recoverable via DatasetVersion.display_version.
mangled_version = f"{version}{REMOVED_VERSION_SUFFIX}{v.id}"
Contributor

sounds too complicated tbh

Why not just preserve the version as-is (and not allow recreating it)? Also add the ability to remove it completely if really needed (including metadata; allowed for admins only initially).

self._projects.c.name == project_name,
self._dataset_version_jobs.c.job_id.in_(job_ancestry),
self._dataset_version_jobs.c.is_creator.is_(True),
self._datasets_versions.c.status != DatasetStatus.REMOVED,
Contributor

Can we make it the default in _datasets_versions_select itself, for example?

I really don't like it when we have to "remember" places to plug in very specific filters, especially exclusions. It is impossible to remember / manage, and it usually ends up as a hack.

Comment thread src/datachain/dataset.py

assert dataset_name is not None

# The `version` column on REMOVED tombstones carries a mangle suffix
Contributor

this all looks like a quick hack :(

Comment thread src/datachain/delta.py
# The version the dep points at may have been soft-deleted (REMOVED
# tombstone). Without a readable previous version we can't compute a
# diff; fall back to normal dataset creation, same as a missing dep.
if not source_ds.has_version(source_ds_dep.version):
Contributor

We have quite a few places where we filter out "None" (removed deps), e.g. pipelines. What do we want to do in those places?

Contributor

It might still have that version, but it could be a new one (recreated, since we allow that).
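The concern can be shown with a small sketch. Assuming, hypothetically, that a dependency could record the id of the version row it was built from, a lookup by version string matches the recreated version while a lookup by id does not. All names here are illustrative.

```python
from dataclasses import dataclass

REMOVED_VERSION_SUFFIX = "~removed-"


@dataclass(frozen=True)
class Version:  # hypothetical stand-in for a dataset version row
    id: int
    version: str
    removed: bool = False


def has_version(versions, version_str):
    """String-based check: true for any live row with that version string."""
    return any(v.version == version_str and not v.removed for v in versions)


def has_live_version_id(versions, version_id):
    """Identity-based check: true only if that exact row is still live."""
    return any(v.id == version_id and not v.removed for v in versions)


# The dependency was recorded against row 10, version "1.0.0".
dep_version, dep_version_id = "1.0.0", 10

versions = [
    Version(10, f"1.0.0{REMOVED_VERSION_SUFFIX}10", removed=True),  # tombstone
    Version(11, "1.0.0"),  # recreated under the same version string
]

matches_by_string = has_version(versions, dep_version)
matches_by_id = has_live_version_id(versions, dep_version_id)
```

Here the string check passes even though the data the dependency originally pointed at is gone, which is the case a delta computation would need to treat as a missing previous version.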
