-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adopt dandischema.digests.zarr.get_checksum()
to dandi-cli: use zarr_checksum library constructs instead of copies in dandischema
#1371
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,6 +24,7 @@ | |
|
||
from dandischema.digests.dandietag import DandiETag | ||
from fscacher import PersistentCache | ||
from zarr_checksum.checksum import ZarrChecksum, ZarrChecksumManifest | ||
from zarr_checksum.tree import ZarrChecksumTree | ||
|
||
from .threaded_walk import threaded_walk | ||
|
@@ -134,3 +135,31 @@ def md5file_nocache(filepath: str | Path) -> str: | |
present in Zarrs | ||
""" | ||
return Digester(["md5"])(filepath)["md5"] | ||
|
||
|
||
def checksum_zarr_dir( | ||
files: dict[str, tuple[str, int]], directories: dict[str, tuple[str, int]] | ||
) -> str: | ||
""" | ||
Calculate the Zarr checksum of a directory only from information about the | ||
files and subdirectories immediately within it. | ||
|
||
:param files: | ||
A mapping from names of files in the directory to pairs of their MD5 | ||
digests and sizes | ||
:param directories: | ||
A mapping from names of subdirectories in the directory to pairs of | ||
their Zarr checksums and the sum of the sizes of all files recursively | ||
within them | ||
""" | ||
manifest = ZarrChecksumManifest( | ||
files=[ | ||
ZarrChecksum(digest=digest, name=name, size=size) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. note to myself here: that instead of using dandi-schema's ZarrJSONChecksumSerializer as it was used in original defined in dandi-schema |
||
for name, (digest, size) in files.items() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. note to myself (worth adding as a comment???): if previously we explicitly sorted in now it is not needed since ZarrChecksumManifest.generate_digest would do it: https://github.com/dandi/zarr_checksum//blob/HEAD/zarr_checksum/checksum.py#L80 |
||
], | ||
directories=[ | ||
ZarrChecksum(digest=digest, name=name, size=size) | ||
for name, (digest, size) in directories.items() | ||
], | ||
) | ||
return manifest.generate_digest().digest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need to "manually" loop here via
dirpath.iterdir()
or could make use of https://github.com/dandi/zarr_checksum#python? i.e. could we reuse
yield_files_local
to collectfiles
and then just pass tocompute_zarr_checksum
instead of reimplementing it aschecksum_zarr_dir
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yield_files_local()
appears to only yield files recognized by thezarr
module, while our current code recognizes all files that are actually in a*.zarr
directory tree, including "extraneous" files.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, that sounds like a possible cause of differences/confusion which we ideally clarify/unify while at it.
I started writing items below, but then looked at the zarr-python implementation of traversal at https://github.com/zarr-developers/zarr-python/blob/HEAD/zarr/storage.py#L1174 and it seems to list all files really. Although "not guaranteed", I guess, but are you sure that it actually excludes any "extraneous" files @jwodder?
some thoughts I started to compose:
zarr_checksum
's default (and only ATM) behavior is somewhat inline with that, while our own support ofzarr
diverges from it.zarr
-- but I guess zarr validation doesn't fail (does not care) about them right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed the code was doing something zarr-specific based on the fact that it was using the
zarr
library. Ifzarr_checksum
just checksums every file present, then there's no need to bring inzarr
and its heavy dependencies. (CC @AlmightyYakob)I believe extraneous files are ignored by Zarr validation.
I am not aware of any, and I do not know how to identify such files in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem that
NestedDirectoryStore.keys
essentially callsos.walk
on the local file path. I also assumed more was at play.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, my main and remaining concern is the duplication and thus possible divergence of implementations/behavior. As this PR makes it already better (we will avoid using duplicate implementations within dandi-schema and use more of zarr_checksum ones) and would take us one step closer to strip zarr digest specifics from dandi-schema, let's proceed forward as is.
But while reviewing this, here is some thoughts, issues I filed etc:
zarr_checksum
intofolder_checksum
with a set of format-specific (e.g. zarr) adapters? zarr_checksum#50dandi-cli
we go for theos.walk
directly, but then useexclude_from_zarr
to exclude.git
etc. Shouldn't such excludes be defined inzarr_checksum
instead? Commented onzarr
dependency if possible zarr_checksum#41 (comment)zarr_checksum
viaNestedDirectoryStore.keys
which ATM also usesos.walk
without any filtering etc, but that could change later and we might not even spot it since we cannot foresee how it might changeShouldn't we centralize the walker for zarr, e.g. within
zarr_checksum
?:dandi-cli
implementation is argued to be warranted because ofzarr_checksum
should just become as efficient, ie carry such an implementation. @jwodder - do you see why that should not or could not be done?EntryUploadTracker
, which (CMIIW) is used (along withto_delete
) to collect first a list of zarr subparts which changed and thus need to be uploaded (deleted), to avoid a full upload. And that is during that (threaded!) walk we are also computing the checksum eventually over all elements of the zarr folder - uploaded or not.But both of those aspects are unrelated to the code touched by this PR! . Here we do regular non-threaded walk and nohow interact with upload. In my original comment I pointed out to the fact that we seems to just need to have custom looping / summarization to derive digest and other stats relevant for our ZarrStat and contained by it LocalZarrEntry's. But those at large mimic the
ZarrDirectoryDigest
andZarrChecksum
. But unlike those ofzarr_checksum
not containing entire hierarchy,ZarrStat
actually does contain all the files flattened while recursively traversing. So, for now I guess we are doomed to have this duplicated implementation but I think in the longer run we might want to see how we could avoid it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yarikoptic
Which efficient threaded walk are you referring to?
get_zarr_checksum()
or the digesting-for-uploading machinery? I'm fine with the former being inzarr_checksum
, but the latter currently does not belong there.Are you going to create an issue about this? If not, your concern is just going to go into the void.
The entire
ZarrAsset.stat()
code is currently unused by the library. (I'd look up what commit removed the last use, but I can't getgit log -S
to honor any word boundary syntax I know of.)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like 47c524c which seemed was a big refactoring which stopped using
.stat().
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that commit only got rid of stat'ing Zarr entries on the Archive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are on the same page here. filed dandi/zarr_checksum#52