Add CLI for converting v2 metadata to v3 #3257

Status: Open. Wants to merge 66 commits into base: main.
Changes from 58 commits
Commits
45bb4e5
add rough cli converter structure
K-Meech Jul 1, 2025
456c9e7
allow zstd, gzip and numcodecs zarr 3 compression
K-Meech Jul 1, 2025
242a338
convert filters to v3
K-Meech Jul 1, 2025
1045c33
create BytesCodec with correct endian
K-Meech Jul 1, 2025
4e2442f
handle C vs F order in v2 metadata
K-Meech Jul 1, 2025
c63f0b8
save group and array metadata to file
K-Meech Jul 2, 2025
2947ce4
create overall conversion functions for store, array or group
K-Meech Jul 2, 2025
ba81755
add minimal typer cli
K-Meech Jul 3, 2025
67f9580
add initial tests for converter
K-Meech Jul 3, 2025
0d7c2c8
add tests for conversion of groups and nested groups and arrays
K-Meech Jul 3, 2025
cf39580
add tests for conversion of compressors and filters
K-Meech Jul 3, 2025
11499e7
test conversion of order and endianness
K-Meech Jul 3, 2025
90b0996
add tests for edge cases of incorrect codecs
K-Meech Jul 3, 2025
85159bb
add tests for / separator
K-Meech Jul 4, 2025
53ba166
draft of metadata remover and add test for internal paths
K-Meech Jul 7, 2025
d4cdc04
add clear command to cli with tests
K-Meech Jul 7, 2025
dfdc729
add test for metadata removal with path
K-Meech Jul 7, 2025
ad60991
add verbose logging option
K-Meech Jul 7, 2025
66bae0d
add dry run option to cli
K-Meech Jul 8, 2025
97df9bf
add test for dry-run
K-Meech Jul 8, 2025
42e0435
add zarr-converter script and enable cli dep in tests
K-Meech Jul 9, 2025
9e20b39
use v2 chunk key encoding type
K-Meech Jul 9, 2025
6586e66
Merge branch 'main' of github.com:K-Meech/zarr-python into km/v2-v3-c…
K-Meech Jul 14, 2025
ce409a3
update endianness of test data type
K-Meech Jul 14, 2025
fb7136b
Merge branch 'main' of github.com:K-Meech/zarr-python into km/v2-v3-c…
K-Meech Jul 16, 2025
6585f24
check converted arrays can be accessed
K-Meech Jul 16, 2025
46e958d
Merge branch 'main' of github.com:K-Meech/zarr-python into km/v2-v3-c…
K-Meech Jul 16, 2025
08fc138
remove uses of pathlib walk, as it didn't exist in python 3.11
K-Meech Jul 16, 2025
3540434
include tags in checkout for gpu test, to avoid numcodecs.zarr3 reque…
K-Meech Jul 16, 2025
0889979
rename cli commands from review comments
K-Meech Jul 23, 2025
d906dba
remove path option
K-Meech Jul 23, 2025
5e03e3c
allow metadata to be written to a separate store location
K-Meech Jul 24, 2025
89aa095
add overwrite and remove-v2-metadata options
K-Meech Jul 24, 2025
ade9c3b
add force option
K-Meech Jul 24, 2025
218e8a8
use v2, v3 format for CLI
K-Meech Jul 24, 2025
49787f6
split into convert_group and convert_array functions
K-Meech Jul 24, 2025
488485c
update command names in converter tests
K-Meech Jul 24, 2025
18487c9
update test filename to reflect command name change
K-Meech Jul 24, 2025
a5cd760
fix tests for sub-groups
K-Meech Jul 24, 2025
bde452f
add tests for --force
K-Meech Jul 24, 2025
671c5e3
add test for migrating to separate output location
K-Meech Jul 24, 2025
0281cc1
add test for remove-v2-metadata option
K-Meech Jul 25, 2025
2ffe854
update test names to match command name
K-Meech Jul 25, 2025
432eae6
add test for --remove-v2-metadata with separate output location
K-Meech Jul 25, 2025
7cb42c5
merge upstream changes
K-Meech Jul 25, 2025
6e6788d
separate cli fixtures from the tests
K-Meech Jul 25, 2025
4abc84a
add test for overwrite option in separate location
K-Meech Jul 25, 2025
0bdd6f8
fix failing test
K-Meech Jul 25, 2025
f2fa389
small fixes to tests
K-Meech Jul 25, 2025
4d98121
Merge pull request #1 from K-Meech/km/v2-v2-conversion-review
K-Meech Jul 28, 2025
649bb20
fix pre-commit errors
K-Meech Jul 28, 2025
dba4073
update docstrings with review comments
K-Meech Aug 1, 2025
b702060
pass filters and compressors to processing functions, rather than ful…
K-Meech Aug 1, 2025
b900a0e
use Store as input rather than StoreLike
K-Meech Aug 1, 2025
42aa7db
move conversion functions into public api
K-Meech Aug 1, 2025
d3fc21e
Merge branch 'main' of github.com:K-Meech/zarr-python into km/v2-v3-c…
K-Meech Aug 1, 2025
5c05c0c
merge upstream changes
K-Meech Aug 4, 2025
f62fe31
fail on discovery of consolidated metadata
K-Meech Aug 4, 2025
71067ba
minor changes from review
K-Meech Aug 6, 2025
34e97f0
use same logger throughout zarr-python
K-Meech Aug 6, 2025
9f6b875
add release notes and docs for the cli
K-Meech Aug 6, 2025
1362cc6
tidy up formatting of zarr.metadata api docs
K-Meech Aug 6, 2025
4ae3491
Merge branch 'main' of github.com:K-Meech/zarr-python into km/v2-v3-c…
K-Meech Aug 6, 2025
f301172
fix failing tests
K-Meech Aug 6, 2025
0449ef7
add a section about --verbose to the docs
K-Meech Aug 7, 2025
14b9cfd
Merge branch 'main' into km/v2-v3-conversion
d-v-b Aug 11, 2025
2 changes: 2 additions & 0 deletions .github/workflows/gpu_test.yml
@@ -30,6 +30,8 @@ jobs:

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # grab all branches and tags
      # - name: cuda-toolkit
      #   uses: Jimver/[email protected]
      #   id: cuda-toolkit
6 changes: 5 additions & 1 deletion pyproject.toml
@@ -68,6 +68,7 @@ remote = [
gpu = [
    "cupy-cuda12x",
]
cli = ["typer"]
# Development extras
test = [
    "coverage>=7.10",
@@ -113,6 +114,9 @@ docs = [
    'pytest'
]

[project.scripts]
zarr = "zarr._cli.cli:app"


[project.urls]
"Bug Tracker" = "https://github.com/zarr-developers/zarr-python/issues"
@@ -157,7 +161,7 @@ deps = ["minimal", "optional"]

[tool.hatch.envs.test.overrides]
matrix.deps.dependencies = [
-    {value = "zarr[remote, remote_tests, test, optional]", if = ["optional"]}
+    {value = "zarr[remote, remote_tests, test, optional, cli]", if = ["optional"]}
]

[tool.hatch.envs.test.scripts]
Empty file added src/zarr/_cli/__init__.py
Empty file.
189 changes: 189 additions & 0 deletions src/zarr/_cli/cli.py
@@ -0,0 +1,189 @@
import logging
from enum import Enum
from typing import Annotated, Literal, cast

import typer

import zarr.metadata.migrate_v3 as migrate_metadata
from zarr.core.sync import sync
from zarr.storage._common import make_store

app = typer.Typer()

logger = logging.getLogger(__name__)
Review comment (Contributor):

Is it deliberate that this is a new logger, instead of importing the logger object from zarr? I don't think it matters too much, but re-using zarr._logger might save some code duplication, because you could remove the functions in this file that configure the logger.
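
A minimal sketch of what that suggestion could look like, using only the standard library. It assumes the package exposes a single logger named "zarr" (the comment calls it `zarr._logger`, which is not shown in this diff); `set_cli_verbosity` and the variable names are hypothetical, not part of the PR.

```python
import logging

# Assumption: one package-level logger named "zarr"; stands in for `zarr._logger`.
package_logger = logging.getLogger("zarr")


def set_cli_verbosity(*, verbose: bool) -> None:
    """Hypothetical helper: adjust the shared logger's level instead of
    configuring a second, module-specific logger."""
    package_logger.setLevel(logging.INFO if verbose else logging.WARNING)


# A logger created inside the package is a child of "zarr" and has no level
# of its own, so it inherits whatever level the shared logger is given.
cli_logger = logging.getLogger("zarr._cli.cli")

set_cli_verbosity(verbose=True)
print(cli_logger.getEffectiveLevel() == logging.INFO)
```

With this arrangement the CLI module would only need to flip one level on the shared logger, which is the duplication the comment is pointing at.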



def _set_logging_config(*, verbose: bool) -> None:
    if verbose:
        lvl = logging.INFO
    else:
        lvl = logging.WARNING
    fmt = "%(message)s"
    logging.basicConfig(level=lvl, format=fmt)


def _set_verbose_level() -> None:
    logging.getLogger().setLevel(logging.INFO)


class ZarrFormat(str, Enum):
    v2 = "v2"
    v3 = "v3"


class ZarrFormatV3(str, Enum):
    """Limit CLI choice to only v3"""

    v3 = "v3"

@app.command()  # type: ignore[misc]
def migrate(
    zarr_format: Annotated[
        ZarrFormatV3,
        typer.Argument(
            help="Zarr format to migrate to. Currently only 'v3' is supported.",
        ),
    ],
    input_store: Annotated[
        str,
        typer.Argument(
            help=(
                "Input Zarr to migrate - should be a store, path to directory in file system or name of zip file "
                "e.g. 'data/example-1.zarr', 's3://example-bucket/example'..."
            )
        ),
    ],
    output_store: Annotated[
        str | None,
        typer.Argument(
            help=(
                "Output location to write generated metadata (no array data will be copied). If not provided, "
                "metadata will be written to input_store. Should be a store, path to directory in file system "
                "or name of zip file e.g. 'data/example-1.zarr', 's3://example-bucket/example'..."
            )
        ),
    ] = None,
    dry_run: Annotated[
        bool,
        typer.Option(
            help="Enable a dry-run: files that would be converted are logged, but no new files are created or changed."
        ),
    ] = False,
    overwrite: Annotated[
        bool,
        typer.Option(
            help="Remove any existing v3 metadata at the output location, before migration starts."
        ),
    ] = False,
    force: Annotated[
        bool,
        typer.Option(
            help=(
                "Only used when --overwrite is given. Allows v3 metadata to be removed when no valid "
                "v2 metadata exists at the output location."
            )
        ),
    ] = False,
    remove_v2_metadata: Annotated[
        bool,
        typer.Option(
            help="Remove v2 metadata (if any) from the output location, after migration is complete."
        ),
    ] = False,
) -> None:
    """Migrate all v2 metadata in a zarr hierarchy to v3. This will create a zarr.json file for each level
    (every group / array). v2 files (.zarray, .zattrs etc.) will be left as-is.
    """
    if dry_run:
        _set_verbose_level()
        logger.info(
            "Dry run enabled - no new files will be created or changed. Log of files that would be created on a real run:"
        )

    input_zarr_store = sync(make_store(input_store, mode="r+"))

    if output_store is not None:
        output_zarr_store = sync(make_store(output_store, mode="w-"))
        write_store = output_zarr_store
    else:
        output_zarr_store = None
        write_store = input_zarr_store

    if overwrite:
        sync(migrate_metadata.remove_metadata(write_store, 3, force=force, dry_run=dry_run))

    migrate_metadata.migrate_v2_to_v3(
        input_store=input_zarr_store, output_store=output_zarr_store, dry_run=dry_run
    )

    if remove_v2_metadata:
        # There should always be valid v3 metadata at the output location after migration, so force=False
        sync(migrate_metadata.remove_metadata(write_store, 2, force=False, dry_run=dry_run))


@app.command()  # type: ignore[misc]
def remove_metadata(
    zarr_format: Annotated[
        ZarrFormat,
        typer.Argument(help="Which format's metadata to remove - v2 or v3."),
    ],
    store: Annotated[
        str,
        typer.Argument(
            help="Store or path to directory in file system or name of zip file e.g. 'data/example-1.zarr', 's3://example-bucket/example'..."
        ),
    ],
    force: Annotated[
        bool,
        typer.Option(
            help=(
                "Allow metadata to be deleted when no valid alternative exists e.g. allow deletion of v2 metadata, "
                "when no v3 metadata is present."
            )
        ),
    ] = False,
    dry_run: Annotated[
        bool,
        typer.Option(
            help="Enable a dry-run: files that would be deleted are logged, but no files are removed or changed."
        ),
    ] = False,
) -> None:
    """Remove all v2 (.zarray, .zattrs, .zgroup, .zmetadata) or v3 (zarr.json) metadata files from the given Zarr.
    Note - this will remove metadata files at all levels of the hierarchy (every group and array).
    """
    if dry_run:
        _set_verbose_level()
        logger.info(
            "Dry run enabled - no files will be deleted or changed. Log of files that would be deleted on a real run:"
        )
    input_zarr_store = sync(make_store(store, mode="r+"))

    sync(
        migrate_metadata.remove_metadata(
            store=input_zarr_store,
            zarr_format=cast(Literal[2, 3], int(zarr_format[1:])),
            force=force,
            dry_run=dry_run,
        )
    )


@app.callback()  # type: ignore[misc]
def main(
    verbose: Annotated[
        bool,
        typer.Option(
            help="enable verbose logging - will print info about metadata files being deleted / saved."
        ),
    ] = False,
) -> None:
    """
    See available commands below - access help for individual commands with zarr COMMAND --help.
    """
    _set_logging_config(verbose=verbose)


if __name__ == "__main__":
    app()
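
The `remove_metadata` command above turns the CLI choice into an integer with `int(zarr_format[1:])`. This works because `ZarrFormat` mixes in `str`, so each member can be sliced like the string it wraps. A small standalone sketch of that conversion follows; `format_number` is a hypothetical helper name used only for illustration, and the enum is redefined here so the block runs on its own.

```python
from enum import Enum
from typing import Literal, cast


class ZarrFormat(str, Enum):
    """Same shape as the enum in cli.py, repeated so this sketch is self-contained."""

    v2 = "v2"
    v3 = "v3"


def format_number(fmt: ZarrFormat) -> Literal[2, 3]:
    # Hypothetical helper: drop the leading "v" and parse the remaining digit.
    # Slicing the member directly works because it is also a str instance.
    return cast(Literal[2, 3], int(fmt[1:]))


print(format_number(ZarrFormat.v2), format_number(ZarrFormat("v3")))
```

The `cast` only informs the type checker; it performs no runtime validation, so the safety of the conversion rests on the enum limiting the accepted values to "v2" and "v3".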
3 changes: 3 additions & 0 deletions src/zarr/metadata/__init__.py
@@ -0,0 +1,3 @@
from zarr.metadata.migrate_v3 import migrate_to_v3, migrate_v2_to_v3, remove_metadata

__all__ = ["migrate_to_v3", "migrate_v2_to_v3", "remove_metadata"]
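
The body of `remove_metadata` is not part of this excerpt, so the following is only a rough sketch of the dry-run pattern described in the CLI help text: collect the metadata files, log them, and delete them only when `dry_run` is false. It assumes a plain local directory rather than a zarr store, and `remove_v2_metadata_files` is a hypothetical helper, not the function exported above.

```python
import logging
import os
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)

# File names listed in the CLI docstring for v2 metadata.
V2_METADATA_NAMES = {".zarray", ".zattrs", ".zgroup", ".zmetadata"}


def remove_v2_metadata_files(root: Path, *, dry_run: bool = False) -> list[Path]:
    """Hypothetical local-filesystem helper: find v2 metadata files under
    `root` and delete them unless `dry_run` is set. Returns the files found."""
    targets: list[Path] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in V2_METADATA_NAMES:
                targets.append(Path(dirpath) / name)
    for path in sorted(targets):
        if dry_run:
            logger.info("Would delete %s", path)
        else:
            logger.info("Deleting %s", path)
            path.unlink()
    return sorted(targets)


# Usage on a throwaway directory: the dry run reports files but leaves them in place.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / ".zgroup").write_text("{}")
    reported = remove_v2_metadata_files(demo_root, dry_run=True)
    print(len(reported), (demo_root / ".zgroup").exists())
```

Returning the list of matched files in both modes keeps the dry run and the real run on the same code path, which makes the dry-run log a faithful preview of what a real run would touch.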