diff --git a/_data/navigation.yml b/_data/navigation.yml index 1eaf5d1..bf1f259 100644 --- a/_data/navigation.yml +++ b/_data/navigation.yml @@ -19,7 +19,19 @@ sidebar: url: "#sponsorship" - title: "Videos" url: "#videos" - - title: Subpages + - title: Technical + children: + - title: "Components" + url: '/components' + - title: "Flexibility" + url: '/flexibility' + - title: "Implementations" + url: '/implementations' + - title: "Specification" + url: https://zarr-specs.readthedocs.io/ + - title: "ZEPs" + url: '/zeps' + - title: Community children: - title: "Adopters" url: "/adopters" @@ -31,13 +43,7 @@ sidebar: url: '/conventions' - title: "Datasets" url: '/datasets' - - title: "Implementations" - url: '/implementations' - title: "Office Hours" url: "/office-hours" - title: "Slides" - url: "/slides" - - title: "Specification" - url: https://zarr-specs.readthedocs.io/ - - title: "ZEPs" - url: '/zeps' + url: "/slides" \ No newline at end of file diff --git a/components/index.md b/components/index.md new file mode 100644 index 0000000..7f8e97a --- /dev/null +++ b/components/index.md @@ -0,0 +1,46 @@ +--- +layout: single +author_profile: false +title: Zarr Components +sidebar: + title: "Components" + nav: sidebar +--- + +Zarr consists of several components, both abstract and concrete. +These span both the physical storage layer and the conceptual structural layer. +Zarr-related projects all use the Zarr Protocol (and hence data model), described by the [Zarr Specification](https://zarr-specs.readthedocs.io/), but otherwise may choose to implement other layers however they wish. + +## Abstract components + +These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system. 
+ +**Protocol**: All zarr-related projects use the Zarr Protocol, described in the [Zarr Specification](https://zarr-specs.readthedocs.io/), which allows transfer of chunked array data and metadata between devices (or between memory regions of the same device). +The protocol works by serializing and de-serializing array data as byte streams and storing both this data and accompanying metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface). +A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps. + +**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html). +It consists of a hierarchical tree of groups and arrays, with optional arbitrary metadata at every node. This model is completely domain-agnostic. + +**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format". +Most, but not all, zarr implementations will serialize to this format. + +**Extensions**: Zarr provides a core set of generally-useful features, but extensions to this core are encouraged. These might take the form of domain-specific [metadata conventions](https://zarr.dev/conventions/), new codecs, or additions to the data model via [extension points](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points). These can be abstract, or enforced by implementations or client libraries however they like, but generally should be opt-in. 
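The Protocol, Data Model and Format described above can be illustrated with a small standard-library sketch. It is not the zarr-python API and not spec-compliant: the metadata fields are simplified, edge chunks are not padded, and the helper names `write_array` and `read_array` are hypothetical. It assumes only what the text states (chunks are serialized to bytes, passed through a codec, and stored with metadata in a key-value store) plus the `zarr.json` and `c/<index>` key names used by Zarr v3.

```python
import json
import struct
import zlib


def write_array(store, name, values, chunk_len):
    """Split a 1-D list of floats into chunks and store each under its own key."""
    # Simplified metadata document (hypothetical fields, not the real zarr.json schema).
    metadata = {"shape": [len(values)], "chunk_shape": [chunk_len], "codec": "zlib"}
    store[f"{name}/zarr.json"] = json.dumps(metadata).encode()
    for start in range(0, len(values), chunk_len):
        chunk = values[start:start + chunk_len]
        raw = struct.pack(f"<{len(chunk)}d", *chunk)  # serialize to little-endian bytes
        store[f"{name}/c/{start // chunk_len}"] = zlib.compress(raw)  # apply a codec


def read_array(store, name):
    """Reassemble the array by decoding every chunk listed by the metadata."""
    metadata = json.loads(store[f"{name}/zarr.json"])
    length, chunk_len = metadata["shape"][0], metadata["chunk_shape"][0]
    values = []
    for index in range((length + chunk_len - 1) // chunk_len):
        raw = zlib.decompress(store[f"{name}/c/{index}"])
        values.extend(struct.unpack(f"<{len(raw) // 8}d", raw))
    return values


# Any mutable mapping can play the role of the abstract key-value store.
store = {}
write_array(store, "temperature", [0.5, 1.5, 2.5, 3.5, 4.5], chunk_len=2)
```

Writing the same keys unaltered as paths below a directory, instead of into a `dict`, is the idea behind the "Native Zarr Format" paragraph above.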
+ +## Concrete components + +Concrete implementations of the abstract components can be implemented in any language. +The canonical reference implementation is [Zarr-Python](https://github.com/zarr-developers/zarr-python), but there are many [other implementations](https://zarr.dev/implementations/). +Zarr-Python contains reference examples of useful constructs that can be re-implemented in other languages. + +**Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular python realization of the specification's Key-Value Store interface, using a `Store` ABC, which is based on a `MutableMapping`-like API. +This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store. + +**Store Implementations**: Zarr-python's [`zarr.storage`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains concrete implementations of the `Store` ABC for interacting with particular storage systems. +The zarr-python store implementations which write to local filesystems or object storage write data in the Native Zarr Format. +It's expected that most users of zarr from python will just use one of these implementations. + +**User API**: Zarr-python's [`zarr.api`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains functions and classes for interacting with any concrete implementation of the `zarr.abc.Store` interface. +This allows user applications to use a standard zarr API to read and write from a variety of common storage systems. + +These various components allow for a huge amount of [flexibility](https://zarr.dev/flexibility/). 
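The layering of an abstract base class, store implementations and a user API can be sketched as follows. This is an illustration only: the `Store`, `DictStore` and `FolderStore` classes and the `save_chunk` / `load_chunk` functions are hypothetical stand-ins, and the real `zarr.abc` interface has different methods and signatures.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Store(ABC):
    """Hypothetical key-value store interface (not the actual zarr.abc API)."""

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def set(self, key: str, value: bytes) -> None: ...


class DictStore(Store):
    """Keeps every value in memory."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value


class FolderStore(Store):
    """Maps keys unaltered to paths below a root directory."""

    def __init__(self, root):
        self._root = Path(root)

    def get(self, key):
        return (self._root / key).read_bytes()

    def set(self, key, value):
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)


# "User API" layer: written once against the interface, works with any store.
def save_chunk(store: Store, array_name: str, index: int, payload: bytes) -> None:
    store.set(f"{array_name}/c/{index}", payload)


def load_chunk(store: Store, array_name: str, index: int) -> bytes:
    return store.get(f"{array_name}/c/{index}")
```

Because `save_chunk` and `load_chunk` depend only on the interface, swapping `DictStore()` for `FolderStore(some_directory)` changes where the bytes live without changing the calling code.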
\ No newline at end of file diff --git a/flexibility/index.md b/flexibility/index.md new file mode 100644 index 0000000..e02ece0 --- /dev/null +++ b/flexibility/index.md @@ -0,0 +1,58 @@ +--- +layout: single +author_profile: false +title: Zarr's Flexibility +sidebar: + title: "Flexibility" + nav: sidebar +--- + +One of Zarr's greatest strengths is its flexibility, or "hackability". +This largely comes from the separation of distinct [Zarr Components](https://zarr.dev/components/), but there is a range of other properties that make zarr flexible too. + +## Types of flexibility + +This flexibility comes in several forms: +- The Zarr protocol is device agnostic. +- The Zarr data model is domain agnostic. +- Key-value stores are an almost universal abstraction in data systems, and so can almost always be mapped to existing system interfaces. +- The Zarr format on-disk is extremely simple. +- Storing each chunk under a different key allows implementations to scale their IO throughput in a variety of simple ways. +- The reference Zarr implementation is written in Python, a very hackable language, with ABCs you can use when creating new store implementations. +- Components are separated: the protocol, file format, standard API, ABC, and store implementations are all separate. +- There is no requirement to use more than one zarr component - individual projects can achieve powerful functionality by intelligently using only some of the Zarr components. +- You can define your own codecs. +- You are free to create your own domain-specific metadata standard and enforce it upon zarr stores however you like. +- Zarr v3 has nascent support for other extension points, including defining your own type of chunk grid, data types, and more. +- [Zarr Enhancement Proposals](https://zarr.dev/zeps/) (or "ZEPs") provide a mechanism for enhancing or adding to the specification in a community-standardized way.
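As one illustration of the point above about scaling IO throughput, the sketch below fetches several chunk keys concurrently. It assumes only that the store behaves like a read-only mapping from keys to bytes; `fetch_chunks` is a hypothetical helper, not part of any zarr library, and a real implementation would choose its own concurrency strategy.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_chunks(store, keys, max_workers=4):
    """Fetch many chunk keys concurrently and return a {key: bytes} dict."""
    keys = list(keys)  # materialize once so the keys can be paired with results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each key with its value.
        values = pool.map(store.__getitem__, keys)
        return dict(zip(keys, values))


# A plain dict stands in for a remote key-value store here.
store = {f"c/{i}": bytes([i]) * 4 for i in range(8)}
wanted = fetch_chunks(store, ["c/1", "c/5"])
```

Since each chunk lives under its own key, no coordination between the workers is needed beyond collecting the results.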
+ +## Examples + +Here are a few zarr-related software projects, which each make use of a selected subset of different zarr components to achieve interesting functionality. +These particular projects are more than simply zarr implementations written in a different language (you can find a [list of implementations here](https://zarr.dev/implementations/)). + +- **MongoDBStore** is a concrete store implementation in python, which stores values in a MongoDB NoSQL database under zarr keys. +It is therefore spec-compliant, and can be interacted with via the zarr-python user API, but does not write data in the native zarr format. + +- [**VirtualiZarr**](https://github.com/zarr-developers/VirtualiZarr) provides a concrete store implementation in python (the `ManifestStore`) which stores, inside "chunk manifests", references to the locations and byte ranges of chunks that reside inside files stored in other binary formats such as netCDF. +These references are generated by "readers", which do the job of parsing the file structure and mapping the contents to the zarr data model. +VirtualiZarr therefore eschews the native zarr format but still provides spec-compliant access to non-zarr-formatted data using zarr-python's API, without duplicating the original data. +The manifests effectively act as an indirection layer between the zarr-spec-compliant key interface, and the actual location of the chunks in storage. + +- [**NCZarr**](https://docs.unidata.ucar.edu/nug/current/nczarr_head.html) and [**Lindi**](https://github.com/NeurodataWithoutBorders/lindi) can both in some sense be considered as the opposite of VirtualiZarr - they allow interacting with zarr-formatted data on disk via a non-zarr API. +Lindi maps zarr's data model to the HDF data model and allows access to it via the `h5py` library through the [`LindiH5pyFile`](https://github.com/NeurodataWithoutBorders/lindi/blob/b125c111880dd830f2911c1bc2084b2de94f6d71/lindi/LindiH5pyFile/LindiH5pyFile.py#L28) class.
+[NCZarr](https://docs.unidata.ucar.edu/nug/current/nczarr_head.html) allows interacting with zarr-formatted data via the netcdf-c library. +Note that both libraries implement optional additional optimizations by going beyond the zarr specification and format on disk, which is not recommended. + +- [**TensorStore**](https://github.com/google/tensorstore) is a general storage library written in C++ that can write to the Zarr format (so is a spec-compliant non-python "native" store implementation) but also to other array formats such as N5. +As it can write to multiple different storage systems, it effectively has its own set of concrete store implementations. +Additional features are provided, notably using an Optionally-Cooperative Distributed B+Tree (OCDBT) on top of a base key-value store to implement ACID transactions. +It still stores all data using the native Zarr Format, but versions keys at the store level. + +- [**Icechunk**](https://icechunk.io/) is a cloud-native tensor storage engine which also provides ACID transactions, but does so via indirection between a zarr-spec-compliant key-value store interface and a specialized non-zarr-native storage layout on-disk (for which Icechunk has its own format specification). +Whilst the core icechunk client is written in rust, the `icechunk-python` client implements a concrete subclass of the zarr-python `Store` ABC. +Therefore libraries such as [xarray](https://xarray.dev/) can use the zarr-python user API to read and write to icechunk stores, effectively treating them as version-controlled zarr stores. +Icechunk also integrates with VirtualiZarr as a serialization format for byte range references. +Together they allow data stored in non-zarr formats to be committed to a persistent icechunk store and read back later via the zarr-python API without duplicating the original data chunks. + +We also have a full list of [zarr implementations](https://zarr.dev/implementations/).
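The manifest-based indirection described for VirtualiZarr above can be sketched in a few lines. This is a toy under stated assumptions, not VirtualiZarr's `ManifestStore` or Icechunk's format: the class name `ManifestBackedStore`, the manifest layout `{key: (path, offset, length)}` and the file `legacy.bin` are all hypothetical.

```python
import tempfile
from pathlib import Path


class ManifestBackedStore:
    """Hypothetical read-only store whose keys resolve to byte ranges in existing files."""

    def __init__(self, manifest):
        # manifest maps a zarr-style key to (path, offset, length).
        self._manifest = manifest

    def __getitem__(self, key):
        path, offset, length = self._manifest[key]
        with open(path, "rb") as handle:
            handle.seek(offset)
            return handle.read(length)


def demo():
    """Read two chunks out of a made-up binary file without copying it."""
    with tempfile.TemporaryDirectory() as tmp:
        legacy = Path(tmp) / "legacy.bin"
        # A 6-byte header followed by two 8-byte chunks.
        legacy.write_bytes(b"HEADER" + b"chunk-0!" + b"chunk-1!")
        manifest = {
            "data/c/0": (legacy, 6, 8),
            "data/c/1": (legacy, 14, 8),
        }
        store = ManifestBackedStore(manifest)
        return store["data/c/0"], store["data/c/1"]
```

The caller only ever sees zarr-style keys; where the bytes actually live is decided entirely by the manifest, which is why the original file never needs to be rewritten.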
\ No newline at end of file diff --git a/index.md b/index.md index 6a98119..10c9503 100644 --- a/index.md +++ b/index.md @@ -32,28 +32,28 @@ can be represented as a key-value store, including most commonly POSIX file systems and cloud object storage but also zip files as well as relational and document databases. -See the following GitHub repositories for more information: - -* [Zarr Python](https://github.com/zarr-developers/zarr) -* [Zarr Specs](https://github.com/zarr-developers/zarr-specs) -* [Numcodecs](https://github.com/zarr-developers/numcodecs) -* [Z5](https://github.com/constantinpape/z5) -* [N5](https://github.com/saalfeldlab/n5) -* [Zarr.jl](https://github.com/meggart/Zarr.jl) -* [ndarray.scala](https://github.com/lasersonlab/ndarray.scala) +For more details, read about the various [components of Zarr](https://zarr.dev/components/), +see the canonical [Zarr-Python](https://github.com/zarr-developers/zarr-python) implementation, +or look through [other Zarr implementations](https://zarr.dev/implementations/) for one in your preferred language. ## Applications -* Simple and fast serialization of NumPy-like arrays, accessible from languages including Python, C, C++, Rust, Javascript and Java -* Multi-scale n-dimensional image storage, e.g. in light and electron microscopy -* Geospatial rasters, e.g. following the NetCDF / CF metadata conventions +* Multi-scale n-dimensional image storage, e.g. in light and electron microscopy. +* Genomics data, e.g. for quantitative and population genetics. +* Gridded scientific data in various domains, such as CFD or Plasma Physics. +* Geospatial rasters, e.g. following the NetCDF data model. +* Checkpointing ML model weights. ## Features +* Serialize NumPy-like arrays in a simple and fast way. +* Access from languages including Python, C, C++, Rust, JavaScript and Java. * Chunk multi-dimensional arrays along any dimension. +* Compress array chunks via an extensible system of compressors.
* Store arrays in memory, on disk, inside a Zip file, on S3, etc. * Read and write arrays concurrently from multiple threads or processes. * Organize arrays into hierarchies via annotatable groups. +* Extend easily thanks to the [flexible design](https://zarr.dev/flexibility/). ## Sponsorship