@d-v-b
Contributor

@d-v-b d-v-b commented Oct 14, 2025

Opening this PR with a draft of a rectilinear chunk grid spec

Member

@LDeakin LDeakin left a comment

Looks good to me. Do we want to include a recommendation that implementations SHOULD use run length encoding where appropriate when saving metadata?

Implementation here: zarrs/zarrs#284

@d-v-b d-v-b marked this pull request as ready for review October 18, 2025 18:47
@d-v-b
Contributor Author

d-v-b commented Oct 18, 2025

This is ready for review. I would like to include a small, complete array that demonstrates this chunk grid, and a JSON schema for the metadata.

Member

@normanrz normanrz left a comment

Thanks, @d-v-b!
This PR fulfills all requirements to be merged and is a great addition to zarr-extensions.
Schema JSON and examples would be great. Do you wish to add these in this PR or in a later one?

@d-v-b
Contributor Author

d-v-b commented Oct 20, 2025

If @jbms, @LDeakin, and @manzt are all OK with how this looks now, then I'm happy to merge and add the schema + example data in a subsequent PR. But I am also wondering if we want to handle the origin of the chunk grid here, while this PR is still open. See #30.

@normanrz
Member

I leave that to you. Let me know when you want this PR merged.

@jbms
Contributor

jbms commented Oct 20, 2025

If we want to support negative chunk indices (in order to allow expansion in the negative direction) then we need to be able to specify the sizes of the negative chunks also.

For a regular grid it is sufficient to have a single grid_origin parameter that specifies the start of chunk 0 in array index space.

But for the rectilinear grid we need to specify both the position in array index space and the position in chunk index space of the start of the chunk size list.
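
A minimal sketch of the difference between the two cases, assuming half-open chunk extents. The function names and the `array_origin` / `chunk_origin` parameters are hypothetical illustrations and are not part of the draft spec.

```python
from bisect import bisect_right
from itertools import accumulate


def regular_chunk_index(i: int, chunk_size: int, grid_origin: int = 0) -> int:
    """Chunk index of array index i on a regular grid whose chunk 0 starts at grid_origin."""
    # Floor division rounds toward negative infinity, so array indices to the
    # left of the origin map to negative chunk indices.
    return (i - grid_origin) // chunk_size


def rectilinear_chunk_index(
    i: int, edge_lengths: list[int], array_origin: int, chunk_origin: int
) -> int:
    """Chunk index of array index i on a rectilinear grid.

    array_origin: array index where the first listed chunk starts (assumed name).
    chunk_origin: chunk index assigned to the first listed chunk (assumed name).
    """
    # boundaries[k] is the array index at which the k-th listed chunk starts.
    boundaries = list(accumulate(edge_lengths, initial=array_origin))
    if not boundaries[0] <= i < boundaries[-1]:
        raise IndexError(f"index {i} is outside the listed chunks")
    return chunk_origin + bisect_right(boundaries, i) - 1
```

With edge lengths `[2, 3, 5]`, `array_origin=-5` and `chunk_origin=-2`, the listed chunks cover `[-5, -3)`, `[-3, 0)` and `[0, 5)` and are numbered -2, -1 and 0, which is why both positions are needed.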

@jbms
Contributor

jbms commented Oct 20, 2025

Another thing that has been suggested in the past is to allow the logical size of a chunk to differ from the physical size, i.e. allow chunks to be stored with unused padding at both the start and end.

For every chunk, you need to specify the physical size, logical size, and offset of the start of the logical chunk within the physical chunk.

This would allow you to insert and remove elements in the middle of a dimension without having to re-encode chunks, just rename them. And by using kerchunk or icechunk or OCDBT the renaming could be done "virtually".

To avoid even having to rename you could allow an arbitrary virtual to physical chunk index map, where you remap the chunk index also.

@d-v-b
Contributor Author

d-v-b commented Oct 20, 2025

Speaking for Zarr Python, these would require some work to implement -- in particular, we will need to introduce a new array indexing API to work with negative indices that aren't referenced to the end of an array. So for that reason I'm inclined to keep this chunk grid simple for now, provided we are confident we can safely extend it in the future.

I think using additional keys in the configuration + defining new variants of the "kind" field to overload the meaning of chunk_shapes should cover the additional flexibility. Defining a totally new chunk grid could also work, but I think as long as the changes generalize this chunk grid, we should aim to update this spec to a new version with changes that will be breaking for naive implementations.

@jbms
Contributor

jbms commented Oct 20, 2025

The kind should arguably be per-dimension rather than apply to all dimensions. Perhaps the kind could be indicated by the type of the json value instead.

@d-v-b
Contributor Author

d-v-b commented Oct 20, 2025

The kind should arguably be per-dimension rather than apply to all dimensions.

Our intention for "kind" was that it defines the semantics for the chunk_shapes field. E.g., if "kind" was set to "reference", then chunk_shapes might be a path or URI, and thus per-dimension metadata would not be meaningful without resolving that reference. We could use the type of the chunk_shapes field to express this, but I think an explicit "kind" field gives us more flexibility here.

@jbms
Contributor

jbms commented Oct 20, 2025

The kind should arguably be per-dimension rather than apply to all dimensions.

Our intention for "kind" was that it defines the semantics for the chunk_shapes field. E.g., if "kind" was set to "reference", then chunk_shapes might be a path or URI, and thus per-dimension metadata would not be meaningful without resolving that reference. We could use the type of the chunk_shapes field to express this, but I think an explicit "kind" field gives us more flexibility here.

What I meant is that you may want the chunk sizes for one dimension to be stored in a separate 1-d array, but another dimension might have uniform chunking or sizes specified inline in the metadata.

@mkitti
Contributor

mkitti commented Nov 30, 2025

What if the chunk edge lengths were just an array, and then we defined a codec chain to describe how the chunk edge lengths were encoded? Run length encoding could be an array-to-array codec applied to the chunk edge lengths.

@LDeakin
Member

LDeakin commented Nov 30, 2025

What if the chunk edge lengths were just an array

This metadata-only variant with RLE seemed sufficient for many use cases, and we just really wanted to get this moving at the summit. It is potentially a significant change for implementations to support irregular gridding, let alone a chunk grid that does I/O. The intention with "kind" was that chunk edge lengths/offsets in a separate array could be a future proposal.

Some things to think about if chunk edge lengths were a separate array

  • Currently, all zarr extension points can be fully initialised through reading the zarr.json
    • Implementations would have to change their APIs to enable extension initialisation with a store reference
  • It seems sensible to me that chunk edge lengths could be encoded as subarrays (child arrays of an array)
    • This is currently disallowed in the spec - I want this changed, but haven't drafted anything
  • Chunk shapes/offsets could be lazily loaded if chunk grids supported store access during normal chunk querying operations
    • This is not disallowed in the spec, but would be an abstraction change for likely all implementations
    • RLE is a poor fit for humongous irregular chunk grids encoded as a separate array for lazy loading, because partial decoding is unsupported. Also, you'd probably want to store chunk edge offsets rather than edge lengths, so that arbitrary chunk extents could be queried efficiently.
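
As a rough illustration of the last point, the sketch below shows why offsets are convenient for extent queries: one chunk's extent is a constant-time lookup, and the chunks overlapping a range can be found by binary search. The helper names are hypothetical, and the query assumes `0 <= start < stop <= offsets[-1]`.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def lengths_to_offsets(edge_lengths: list[int]) -> list[int]:
    """Cumulative offsets; offsets[k] is where chunk k starts, offsets[-1] is the total."""
    return list(accumulate(edge_lengths, initial=0))


def chunk_extent(offsets: list[int], k: int) -> tuple[int, int]:
    """Half-open extent [start, stop) of chunk k, read directly from the offsets."""
    return offsets[k], offsets[k + 1]


def chunks_overlapping(offsets: list[int], start: int, stop: int) -> range:
    """Chunk indices that intersect the half-open array range [start, stop)."""
    first = bisect_right(offsets, start) - 1  # chunk containing index `start`
    last = bisect_left(offsets, stop) - 1     # chunk containing index `stop - 1`
    return range(first, last + 1)
```

With edge lengths alone, the same queries would need a running sum over all preceding chunks.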

@mkitti
Contributor

mkitti commented Nov 30, 2025

I'm not suggesting that the chunk edge lengths be an external zarr array accessed via a store. It can and probably should be written as a JSON array as part of the configuration. Nonetheless, we can still think of that JSON array conceptually as an array and that run-length-encoding is a codec that operates on that array.

Run-length-encoding is likely useful in other places. Could we factor it out as its own codec so that we can use it elsewhere? It might also be clearer to then explicitly mark this array of chunk edge lengths as being run-length-encoded. For example, let's say I transpose the shard index so that I now have an array of offsets followed by an array of chunk sizes (nbytes). If I were not using any variable length compression codecs, those chunk sizes might all be the same and easily run length encoded.

@LDeakin
Member

LDeakin commented Nov 30, 2025

Oh gotcha! Well, that is a lot simpler.

I'd suggest doing run-length encoding as an array-to-bytes codec that concatenates separately encoded data and run lengths. Similar to how sharding deals with the index and data.

Spitballing (named for clarity):

{
  "chunk_key_encoding": {
    "name": "rectilinear",
    "configuration": {
      "compressed_edge_length_bytes_per_dimension": [
        "BASE64BYTESDIM0",
        "BASE64BYTESDIM1",
        "BASE64BYTESDIM2",
        "BASE64BYTESDIM3"
      ],
      "number_of_chunks_per_dimension": [
        50,
        40,
        30,
        20
      ],
      "compressed_edge_length_codecs": [
        {
          "name": "run_length_encoding",
          "configuration": {
            "data_codecs": [
              ...
            ],
            "run_length_codecs": [
              ...
            ]
          }
        }
      ]
    }
  }
}

@d-v-b
Contributor Author

d-v-b commented Dec 1, 2025

We could take the array analogy further and use a subset of (or all of) the array metadata document.

continuing to spitball:

{
  "chunk_key_encoding": {
    "name": "rectilinear",
    "configuration": {
      "chunk_edge_lengths": {
        "node_type": "array",
        "data_type": "uint64",
        "shape": [
          50,
          40,
          30,
          20
        ],
        "chunks": [
          "BASE64BYTESDIM0",
          "BASE64BYTESDIM1",
          "BASE64BYTESDIM2",
          "BASE64BYTESDIM3"
        ],
        "codecs": [
          {
            "name": "run_length_encoding",
            "configuration": {
              "data_codecs": [
                ...
              ],
              "run_length_codecs": [
                ...
              ]
            }
          }
        ]
      }
    }
  }
}

xref #22

@LDeakin
Member

LDeakin commented Dec 1, 2025

Indeed! This particular case is a bit quirky since it is a bunch of 1D arrays though

@d-v-b
Contributor Author

d-v-b commented Dec 1, 2025

yeah... unless other people feel differently, I don't think bringing more explicit Zarr array semantics into the rectilinear chunk grid definition solves an acute problem, so maybe we consider deferring that until later

@mkitti
Contributor

mkitti commented Dec 1, 2025

The reason I would want to think of run-length-encoding as an array-to-array codec is that I would still want the option of representing the RLE-coded array as a JSON array as currently described in the spec.

I see though that this is a special case because the run lengths and the data being encoded are both integers, allowing for a homogeneous array.

We can think of the original array as a 1D array of integers. The run-length-encoding is a 1D array of (run_length, data) tuples. Since they are both integers it could be thought of as a 2D array. If we wanted two distinct arrays of run_lengths and data we might use the transpose codec. For the heterogeneous tuple case, this operation would typically be thought of as unzip in functional languages.
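
A small sketch of that view, using made-up values. The tuples here follow the (run_length, data) order used in this comment, which differs from the [value, count] order of the inline syntax in the draft.

```python
# Run-length encoding viewed as a 1D array of (run_length, data) tuples.
pairs = [(3, 4), (1, 2)]  # three 4s followed by one 2

# "Unzip" (a transpose of the N x 2 view) yields two separate 1D arrays.
run_lengths, data = (list(column) for column in zip(*pairs))

# Decoding repeats each data value run_length times.
decoded = [value for count, value in zip(run_lengths, data) for _ in range(count)]
```

Here `run_lengths` is `[3, 1]`, `data` is `[4, 2]`, and `decoded` is `[4, 4, 4, 2]`.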

The original description of the syntax here reduces 1-runs to a single integer. That seems to be a second codec that converts a homogeneous 1D array into a 1D array of a sum type. Now, the conversation in optional about sum types becomes relevant.

@d-v-b
Contributor Author

d-v-b commented Dec 1, 2025

this is what the current example JSON looks like:

{
    ...
    "shape": [6, 6, 6, 6, 6],
    "chunk_grid": {
        "name": "rectilinear",
        "configuration": {
            "kind": "inline",
            "chunk_shapes": [
                4, // integer. expands to [4, 4]
                [1, 2, 3], // explicit list of edge lengths. expands to itself.
                [[4, 2]], // run-length encoded. expands to [4, 4].
                [[1, 3], 3], // run-length encoded and explicit list. expands to [1, 1, 1, 3]
                [4, 4, 4] // explicit list with overflow chunks
            ]
        }
    }
}

What exactly is the deficiency with this representation that your proposal would solve? I agree that it would be nice if we had a general purpose array -> bytes RLE codec, but I don't see how that solves a problem with this chunk grid declaration. On the contrary, I think simplicity should be a design goal here, and treating the arrays of edge lengths like generic Zarr arrays doesn't seem worth the complexity.
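
For reference, a minimal sketch of how a reader might expand the inline forms in the example above. `expand_axis` is a hypothetical helper, not part of any implementation, and it assumes that a bare integer is repeated until the axis is covered (consistent with `4` expanding to `[4, 4]` for an axis of length 6).

```python
def expand_axis(spec, axis_length: int) -> list[int]:
    """Expand one entry of chunk_shapes into an explicit list of edge lengths."""
    if isinstance(spec, int):
        # Bare integer: regular edge length, repeated to cover the axis.
        n_chunks = -(-axis_length // spec)  # ceiling division without floats
        return [spec] * n_chunks
    edges: list[int] = []
    for item in spec:
        if isinstance(item, int):
            edges.append(item)  # explicit edge length
        else:
            value, count = item  # run-length pair [value, count]
            edges.extend([value] * count)
    return edges


shape = [6, 6, 6, 6, 6]
chunk_shapes = [4, [1, 2, 3], [[4, 2]], [[1, 3], 3], [4, 4, 4]]
expanded = [expand_axis(spec, n) for spec, n in zip(chunk_shapes, shape)]
# expanded == [[4, 4], [1, 2, 3], [4, 4], [1, 1, 1, 3], [4, 4, 4]]
```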

@mkitti
Contributor

mkitti commented Dec 1, 2025

I think we could keep this simple while also factoring out this particular RLE codec.

Have an optional configuration parameter called chunk_shape_codecs that defaults to a codec chain of one codec: int_array_run_length_encoding. Describe the codec exactly as you have here, verbatim. The syntax would not change at all, but now we have a way to extend or modify the behavior.

Perhaps in the future there may be a desire to use a distinct notation for writing this information. Then another codec could be specified.

In the meantime, I could see the application for reusing int_array_run_length_encoding to define a shard index or something similar.

@d-v-b
Contributor Author

d-v-b commented Dec 1, 2025

The syntax would not change at all, but now we have a way to extend or modify the behavior.

we already have a way to extend or modify this chunk grid -- by introducing new variants of the configuration, which are tagged with the "kind" field.

@mkitti
Contributor

mkitti commented Dec 1, 2025

Maybe we should change "inline" to "default" then because the possible alternate encodings I can think of are still "inline".

An alternate encoding might be specifying that I always want the last chunk in each dimension to be truncated. I might parameterize that using the total length and the initial chunk size. I suspect this is a common case. Is that a different "kind"?

My main point is that I think this run length encoding scheme is more general than rectilinear chunking so it would be useful to describe that elsewhere so it could be reused in other places. That suggests to me that some additional modularity is possible.

@d-v-b
Contributor Author

d-v-b commented Dec 1, 2025

My main point is that I think this run length encoding scheme is more general than rectilinear chunking so it would be useful to describe that elsewhere so it could be reused in other places.

So far the utility of RLE elsewhere in Zarr is not a convincing argument for making this particular chunk grid more elaborate. I think it would make sense for the rectilinear chunk grid to be parametrized by codecs if:

  • there were a variety of codecs we expected people to use
  • adding a codecs field (or equivalent) makes the chunk grid easier to work with.

At the moment, we have two encodings (explicit edge length and RLE), which can be distinguished via JSON types. We vaguely anticipated alternative encodings (like references to Zarr arrays, or JSON-encoded compressed bytestrings) but all of these would be distinguishable via the "kind" attribute as well as the type of the field. What is missing from this picture that we would gain by adding a codecs parameter?

@mkitti
Contributor

mkitti commented Dec 2, 2025

A common case for a rectilinear chunk grid is to truncate the last chunk in each dimension to exactly match the overall shape of the array. N5, for example, has end-chunks that may be smaller than the otherwise regular chunk shape.

In the example below, I have an array that is 1000 x 1000 because I have a camera from a manufacturer that advertised a 1 megapixel camera and takes SI prefixes seriously. However, I want 64 x 32 chunks but I do not want to pad the chunks that cross the boundary of the 1000 x 1000 shape. Thus the very last chunk will be 40 x 8. To encode that with the run length encoding scheme, I could write the following.

{
    // ...
    "shape": [1000, 1000],
    "chunk_grid": {
        "name": "rectilinear",
        "configuration": {
            "kind": "inline",
            "chunk_shapes": [
                [[64, 15], 40],
                [[32, 31], 8]
            ]
        }
    }
}

While it is nice that I can compactly describe the chunk sizes, I have to admit that I had to grab a calculator to calculate the numbers 15, 40, 31, and 8.

In [1]: 1000 // 64
Out[1]: 15

In [2]: 1000 % 64
Out[2]: 40

In [3]: 1000 // 32
Out[3]: 31

In [4]: 1000 % 32
Out[4]: 8

Why am I manually doing a calculation to enter a parameter? Rather, I could just enter the values that I do know as parameters: 64 and 32.

{
    // ...
    "shape": [1000, 1000],
    "chunk_grid": {
        "name": "rectilinear",
        "configuration": {
            "kind": "inline_regular_and_remainder",
            "regular_chunk_shape": [64, 32]
        }
    }
}

While I could do this with a different kind as above, the only thing I really wanted to change is the encoding of each of the 1D integer arrays to use known parameters rather than having to calculate them. Another variation could be having the irregularly shaped chunk come first rather than last in each dimension. Furthermore, the encodings that I use to describe those 1D integer arrays may be useful for encoding other metadata.
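
The calculation in question is a single `divmod` per dimension. As a sketch, with a hypothetical helper name, the inline run-length form can be derived from the axis length and the regular chunk size like this:

```python
def regular_and_remainder(axis_length: int, chunk_size: int) -> list:
    """Inline form for `chunk_size`-long chunks followed by one truncated end chunk."""
    count, remainder = divmod(axis_length, chunk_size)
    spec: list = []
    if count:
        spec.append([chunk_size, count])  # run-length pair [value, count]
    if remainder:
        spec.append(remainder)  # truncated last chunk
    return spec


chunk_shapes = [regular_and_remainder(1000, 64), regular_and_remainder(1000, 32)]
# chunk_shapes == [[[64, 15], 40], [[32, 31], 8]]
```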

Over in zarr-developers/zarr-specs#368, I was thinking about the current shard indexes as well as introducing inline shard indices. Perhaps it would also make sense for shards to also have rectilinear chunk grids. Without compression of the chunks, the regularity of the chunk byte sizes is related to the regularity of the chunk shapes. If the chunks are contiguous in storage, then the differences in their offsets would also follow a similar regularity suggesting that delta encoding combined with run-length-encoding may be useful. While the sharding_indexed codec currently prohibits variable length compression, including run-length-encoding, perhaps some short run-length-encoded sequences may be eventually permissible due to the ease of retrieving values from short run length encodings.
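
To make the delta-plus-run-length idea concrete, here is a sketch over made-up offsets for four contiguous, uncompressed chunks of 4096 bytes each. The function names are hypothetical and the layout is not the actual `sharding_indexed` index format.

```python
from itertools import groupby


def delta_encode(offsets: list[int]) -> list[int]:
    """Keep the first offset, then store successive differences."""
    if not offsets:
        return []
    return [offsets[0]] + [b - a for a, b in zip(offsets, offsets[1:])]


def run_length_encode(values: list[int]) -> list[list[int]]:
    """Collapse consecutive equal values into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


offsets = [0, 4096, 8192, 12288]   # hypothetical byte offsets within a shard
deltas = delta_encode(offsets)     # [0, 4096, 4096, 4096]
encoded = run_length_encode(deltas)  # [[0, 1], [4096, 3]]
```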

Rather than defining a bunch of one-off scattered metadata encodings throughout the specification and the extensions, I was wondering if developing a set of utility codecs for describing repetitive sequences could provide for a more uniform and consistent design for linked concepts.

@LDeakin
Member

LDeakin commented Dec 2, 2025

Why am I manually doing a calculation to enter a parameter?

Just leave it up to the implementation to calculate that? The user can just specify their edge lengths. The RLE encoding in metadata is effectively an implementation detail from the user perspective.
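
A sketch of what that could look like on the writing side: the user supplies explicit edge lengths and the implementation compresses them into the inline form, keeping runs of length one as bare integers. `compress_edge_lengths` is a hypothetical name, not the API of zarrs or Zarr Python.

```python
from itertools import groupby


def compress_edge_lengths(edges: list[int]) -> list:
    """Compress explicit edge lengths into the inline run-length form."""
    spec: list = []
    for value, group in groupby(edges):
        count = len(list(group))
        # A run of one stays a bare integer; longer runs become [value, count].
        spec.append(value if count == 1 else [value, count])
    return spec


user_edges = [64] * 15 + [40]  # what the user specifies
metadata_form = compress_edge_lengths(user_edges)  # [[64, 15], 40]
```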

A common case for rectilinear chunk grid is to truncate the last chunk in each dimension

Indeed, and yes a more compact "kind" could be added for this specific case, or a flag added to regular, or even a dedicated chunk grid? I added regular_bounded to zarrs as a dedicated chunk grid for this use case.

Perhaps it would also make sense for shards to also have rectilinear chunk grids

And how are the subchunk shapes defined if the shards are irregularly gridded? Sharding will move to a regular_bounded like internal grid with #34 resolved.
