Ilongin/12872 dataset soft delete#1770
self.warehouse.drop_dataset_rows_table(dataset, version)

if soft:
    # Rename the `version` column to free the (dataset_id, version)
There are two alternatives to this:
- Completely reserve a specific version even though the dataset version is removed, e.g. if someone creates `v1.0.5`, removes it, and re-creates it, the new one will be `v1.0.6` instead of `v1.0.5`.
- Add a conditional unique constraint (both SQLite and PG support this) to keep `dataset_id` + `version` unique ONLY if the status is not REMOVED.

The problem with the first one is that it's not really good UX for users, IMO. The problem with the second one, which is the best option, is that it needs a DB migration, which is not easy in local SQLite especially, since the table has to be dropped and re-created if constraints change.
This is why I decided on the current solution, where every time we soft-delete a dataset version we update its version, e.g. 1.0.4 -> 1.0.4-removed-12345
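For comparison, the second alternative (a conditional unique constraint) can be sketched with a partial unique index in SQLite. This is only an illustration of the idea, not what the PR implements: the table layout, the index name `uq_live_version`, and the `COMPLETE = 4` value are assumptions; only `REMOVED = 8` comes from the PR description.

```python
import sqlite3

REMOVED = 8   # DatasetStatus.REMOVED as stated in this PR
COMPLETE = 4  # placeholder value for a live status (assumption)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets_versions ("
    " id INTEGER PRIMARY KEY,"
    " dataset_id INTEGER NOT NULL,"
    " version TEXT NOT NULL,"
    " status INTEGER NOT NULL)"
)
# Partial unique index: uniqueness is enforced only for rows that are not REMOVED.
conn.execute(
    "CREATE UNIQUE INDEX uq_live_version"
    " ON datasets_versions (dataset_id, version)"
    f" WHERE status != {REMOVED}"
)


def insert_version(dataset_id, version, status):
    """Insert a version row and return its id."""
    cur = conn.execute(
        "INSERT INTO datasets_versions (dataset_id, version, status)"
        " VALUES (?, ?, ?)",
        (dataset_id, version, status),
    )
    return cur.lastrowid


first = insert_version(1, "1.0.5", COMPLETE)
# Soft delete: only the status changes, the version string stays intact.
conn.execute(
    "UPDATE datasets_versions SET status = ? WHERE id = ?", (REMOVED, first)
)
# Re-creating the same version succeeds because the tombstone is outside the index.
second = insert_version(1, "1.0.5", COMPLETE)

# A second live row with the same (dataset_id, version) is still rejected.
try:
    insert_version(1, "1.0.5", COMPLETE)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

On a fresh table this is cheap; the migration cost mentioned above comes from replacing a constraint that already exists on a populated table.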
amritghimire
left a comment
I think the more general approach to soft delete is adding a new column, `deleted_at`, so that items which have been in the trash for a long time can be cleared, if needed, by filtering on time. Also, a simple filter in the selection for `deleted_at` being null would do the work, and so on.
I did add |
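A minimal sketch of the `deleted_at` approach suggested above, using SQLite directly. The table layout and the helper names (`soft_delete`, `live_versions`, `purge_trash`) are hypothetical and only illustrate the two filters the comment mentions: hiding soft-deleted rows and purging old ones by time.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets_versions ("
    " id INTEGER PRIMARY KEY, version TEXT NOT NULL, deleted_at TEXT)"
)


def soft_delete(version_id, when):
    """Mark a row as deleted by stamping deleted_at (ISO 8601, UTC)."""
    conn.execute(
        "UPDATE datasets_versions SET deleted_at = ? WHERE id = ?",
        (when.isoformat(), version_id),
    )


def live_versions():
    """Return versions that have not been soft-deleted."""
    rows = conn.execute(
        "SELECT version FROM datasets_versions"
        " WHERE deleted_at IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]


def purge_trash(older_than):
    """Hard-delete rows soft-deleted before older_than; return the row count."""
    cur = conn.execute(
        "DELETE FROM datasets_versions"
        " WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        (older_than.isoformat(),),
    )
    return cur.rowcount


now = datetime(2025, 1, 31, tzinfo=timezone.utc)  # fixed time for the example
conn.executemany(
    "INSERT INTO datasets_versions (version) VALUES (?)",
    [("1.0.0",), ("1.0.1",), ("1.0.2",)],
)
soft_delete(1, now - timedelta(days=60))
soft_delete(2, now - timedelta(days=1))
purged = purge_trash(now - timedelta(days=30))
```

The string comparison in `purge_trash` is only valid because every timestamp is written in the same UTC ISO 8601 format.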
…ain-ai/datachain into ilongin/12872-dataset-soft-delete
# Rename the `version` column to free the (dataset_id, version)
# uniqueness slot so a future save can reclaim it. The original
# semver is still recoverable via DatasetVersion.display_version.
mangled_version = f"{version}{REMOVED_VERSION_SUFFIX}{v.id}"
sounds too complicated tbh
Why not just preserve the version as-is (and not allow recreating it)? Also add the ability to completely remove it if really needed (including metadata; allowed for admins only initially).
self._projects.c.name == project_name,
self._dataset_version_jobs.c.job_id.in_(job_ancestry),
self._dataset_version_jobs.c.is_creator.is_(True),
self._datasets_versions.c.status != DatasetStatus.REMOVED,
Can we make it the default in `_datasets_versions_select` itself, for example?
I really don't like it when we have to "remember" places to plug in very specific filters, especially exclusions; it is impossible to remember / manage. It is all usually a hack.
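One way to read this suggestion is to put the exclusion into the shared select helper and make callers opt in to tombstones. The sketch below is a toy stand-in built on `sqlite3`: the `VersionStore` class, the `include_removed` flag, and the `COMPLETE = 4` value are assumptions, and the real `_datasets_versions_select` in the metastore may have a different signature and return type.

```python
import sqlite3

REMOVED = 8   # DatasetStatus.REMOVED as stated in this PR
COMPLETE = 4  # placeholder value for a live status (assumption)


class VersionStore:
    """Toy stand-in for the metastore; not DataChain's real class."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE datasets_versions ("
            " id INTEGER PRIMARY KEY,"
            " version TEXT NOT NULL,"
            " status INTEGER NOT NULL)"
        )

    def add(self, version, status):
        self.conn.execute(
            "INSERT INTO datasets_versions (version, status) VALUES (?, ?)",
            (version, status),
        )

    def _datasets_versions_select(
        self, where=None, params=(), include_removed=False
    ):
        """Select versions; REMOVED rows are excluded unless explicitly requested."""
        clauses, args = [], []
        if not include_removed:
            clauses.append("status != ?")
            args.append(REMOVED)
        if where:
            clauses.append(f"({where})")
            args.extend(params)
        sql = "SELECT version FROM datasets_versions"
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        sql += " ORDER BY id"
        return [row[0] for row in self.conn.execute(sql, args)]


store = VersionStore()
store.add("1.0.0", COMPLETE)
store.add("1.0.1", REMOVED)
default_rows = store._datasets_versions_select()
all_rows = store._datasets_versions_select(include_removed=True)
```

With the default in one place, individual queries no longer need their own `status != REMOVED` condition, and the few call sites that need tombstones say so explicitly.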
assert dataset_name is not None

# The `version` column on REMOVED tombstones carries a mangle suffix
this all looks like a quick hack :(
# The version the dep points at may have been soft-deleted (REMOVED
# tombstone). Without a readable previous version we can't compute a
# diff; fall back to normal dataset creation, same as a missing dep.
if not source_ds.has_version(source_ds_dep.version):
We have quite a few places where we filter out "None" (removed deps), e.g. pipelines. What do we want to do in those places?
# The version the dep points at may have been soft-deleted (REMOVED
# tombstone). Without a readable previous version we can't compute a
# diff; fall back to normal dataset creation, same as a missing dep.
if not source_ds.has_version(source_ds_dep.version):
It might still have the version, but it can be a new one (recreated, since we allow it).
Closes datachain-ai/studio#12872.
`catalog.remove_dataset_version` no longer hard-deletes a `COMPLETE` user dataset version. Instead it:
- sets `status = REMOVED` + `removed_at = now()`,
- rewrites the `version` column to `"1.0.0~removed-<id>"` so the `(dataset_id, version)` uniqueness slot is freed for reuse,
- keeps `dataset_dependencies` so dependents can still render lineage.

Non-COMPLETE versions (`CREATED`/`FAILED`/`STALE`/`REMOVING` leftovers from the GC path) and internal datasets (`lst__*`, `session_*`) continue to hard-delete.

Key changes
- `DatasetStatus.REMOVED = 8` + `DatasetVersion.removed_at` + `DatasetVersion.display_version` (strips the suffix added on removal).
- `REMOVED_VERSION_SUFFIX = "~removed-"`.
- `DatasetRecord.live_versions` filters tombstones; `latest_version` / `latest_major_version` / `latest_compatible_version` / `DatasetListRecord.latest_version` skip REMOVED.
- `metastore.update_dataset_version`: positional lookup arg renamed `version` → `dataset_version` so the `version=` column can be updated via `**kwargs` (the rewrite on removal).

Behavior notes
- Re-saving the same version (`version="1.0.0"`) succeeds: the renamed tombstone keeps its row but doesn't occupy the live slot.
- `dc.datasets()` / `read_dataset()` hide REMOVED tombstones automatically via the existing `include_incomplete=False` filter.
- One new column (`removed_at`); auto-migration in `_migrate_table_schema` handles it.
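The rename-on-removal scheme described in this PR can be sketched in isolation as follows. The `Version` dataclass, `soft_remove`, and the module-level `live_versions` function are simplified, hypothetical stand-ins for `DatasetVersion`, `catalog.remove_dataset_version`, and `DatasetRecord.live_versions`; only the suffix constant, the `REMOVED = 8` value, and the mangling format are taken from the PR. `COMPLETE = 4` is a placeholder.

```python
from dataclasses import dataclass

REMOVED = 8   # DatasetStatus.REMOVED as stated in this PR
COMPLETE = 4  # placeholder value for a live status (assumption)
REMOVED_VERSION_SUFFIX = "~removed-"


@dataclass
class Version:
    """Simplified stand-in for DatasetVersion."""

    id: int
    version: str
    status: int

    @property
    def display_version(self) -> str:
        # Strip the removal suffix (and the id after it), if present.
        return self.version.split(REMOVED_VERSION_SUFFIX, 1)[0]


def soft_remove(v: Version) -> None:
    """Turn a version into a tombstone and free its (dataset_id, version) slot."""
    v.version = f"{v.version}{REMOVED_VERSION_SUFFIX}{v.id}"
    v.status = REMOVED


def live_versions(versions):
    """Return only the versions that are not tombstones."""
    return [v for v in versions if v.status != REMOVED]


old = Version(id=7, version="1.0.0", status=COMPLETE)
soft_remove(old)
# The same semver can now be reused by a new row.
new = Version(id=8, version="1.0.0", status=COMPLETE)
```

Here `old.version` becomes `"1.0.0~removed-7"` while `old.display_version` still yields `"1.0.0"`, and `live_versions([old, new])` returns only the re-created row.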