Rectilinear (variable-length) chunk grid #25
Looks good to me. Do we want to include a recommendation that implementations SHOULD use run length encoding where appropriate when saving metadata?
Implementation here: zarrs/zarrs#284
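As an illustration of the run-length-encoding recommendation above (a minimal sketch, not taken from the linked implementation), an encoder could collapse repeated edge lengths when writing metadata. The function name `rle_encode_edges` is hypothetical; the output uses the `[length, count]` pair notation, with single occurrences kept as bare integers, that appears in the example JSON later in this thread.

```python
from itertools import groupby


def rle_encode_edges(edge_lengths):
    """Compress a list of chunk edge lengths into the mixed inline form.

    Runs of two or more equal lengths become [length, count] pairs;
    a length that occurs only once in a row stays a bare integer.
    """
    encoded = []
    for length, run in groupby(edge_lengths):
        count = sum(1 for _ in run)
        encoded.append([length, count] if count > 1 else length)
    return encoded


print(rle_encode_edges([1, 1, 1, 3]))  # [[1, 3], 3]
print(rle_encode_edges([4, 4]))        # [[4, 2]]
print(rle_encode_edges([1, 2, 3]))     # [1, 2, 3]
```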
|
This is ready for review. I would like to include a small, complete array that demonstrates this chunk grid, and a JSON schema for the metadata. |
normanrz
left a comment
Thanks, @d-v-b!
This PR fulfills all requirements to be merged and is a great addition to zarr-extensions.
Schema JSON and examples would be great. Do you wish to add these in this PR or in a later one?
|
I leave that to you. Let me know when you want this PR merged. |
|
If we want to support negative chunk indices (in order to allow expansion in the negative direction) then we need to be able to specify the sizes of the negative chunks also. For a regular grid it is sufficient to have a single grid_origin parameter that specifies the start of chunk 0 in array index space. But for the rectilinear grid we need to specify both the position in array index space and the position in chunk index space of the start of the chunk size list. |
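To make the two-origin idea concrete, here is a rough one-dimensional sketch. The parameter names `array_origin` and `chunk_index_origin` are assumptions made for illustration and are not part of the proposed spec: the first says where the chunk size list starts in array index space, the second says which chunk index the first listed size belongs to.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_chunk(index, chunk_sizes, array_origin=0, chunk_index_origin=0):
    """Map an array index to (chunk_index, offset_within_chunk) along one dimension.

    chunk_sizes[0] is the size of chunk number `chunk_index_origin`,
    and that chunk starts at `array_origin` in array index space.
    """
    # boundaries[i] is the array index at which the i-th listed chunk starts;
    # the final entry is one past the end of the last listed chunk.
    boundaries = list(accumulate(chunk_sizes, initial=array_origin))
    if not boundaries[0] <= index < boundaries[-1]:
        raise IndexError(f"index {index} is outside the described chunks")
    i = bisect_right(boundaries, index) - 1
    return chunk_index_origin + i, index - boundaries[i]


# Three chunks of sizes 2, 3, 4 covering array indices -5..3,
# numbered -2, -1, 0 in chunk index space.
sizes = [2, 3, 4]
print(locate_chunk(-4, sizes, array_origin=-5, chunk_index_origin=-2))  # (-2, 1)
print(locate_chunk(-1, sizes, array_origin=-5, chunk_index_origin=-2))  # (-1, 2)
print(locate_chunk(3, sizes, array_origin=-5, chunk_index_origin=-2))   # (0, 3)
```

With a regular grid the size list collapses to one number, which is why a single origin is enough there.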
|
Another thing that has been suggested in the past is to allow the logical size of a chunk to differ from the physical size, i.e. allow chunks to be stored with unused padding at both the start and end. For every chunk, you need to specify the physical size, logical size, and offset of the start of the logical chunk within the physical chunk. This would allow you to insert and remove elements in the middle of a dimension without having to re-encode chunks, just rename them. And by using kerchunk or icechunk or OCDBT the renaming could be done "virtually". To avoid even having to rename you could allow an arbitrary virtual to physical chunk index map, where you remap the chunk index also. |
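A toy one-dimensional sketch of the physical/logical split described above. The class `PaddedChunk` and its field names are hypothetical, chosen only to mirror the three quantities named in the comment (physical size, logical size, offset); nothing here comes from an existing implementation.

```python
from dataclasses import dataclass


@dataclass
class PaddedChunk:
    data: list          # stored elements; len(data) is the physical size
    offset: int         # where the logical chunk starts inside `data`
    logical_size: int   # number of elements that are actually part of the array

    def logical(self):
        """Return only the elements that belong to the array."""
        return self.data[self.offset:self.offset + self.logical_size]


chunk = PaddedChunk(data=[0, 1, 2, 3, 4, 5, 6, 7], offset=2, logical_size=4)
print(chunk.logical())  # [2, 3, 4, 5]

# Dropping the first logical element only touches metadata;
# the stored data is not re-encoded.
chunk.offset += 1
chunk.logical_size -= 1
print(chunk.logical())  # [3, 4, 5]
```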
|
Speaking for Zarr Python, these would require some work to implement -- in particular, we will need to introduce a new array indexing API to work with negative indices that aren't referenced to the end of an array. So for that reason I'm inclined to keep this chunk grid simple for now, provided we are confident we can safely extend it in the future. I think using additional keys in the configuration + defining new variants of the "kind" field to overload the meaning of |
|
The kind should arguably be per-dimension rather than apply to all dimensions. Perhaps the kind could be indicated by the type of the JSON value instead. |
Our intention for |
What I meant is that you may want the chunk sizes for one dimension to be stored in a separate 1-d array, but another dimension might have uniform chunking or sizes specified inline in the metadata. |
|
What if the chunk edge lengths were just an array, and then we defined a codec chain to describe how the chunk edge lengths were encoded? Run length encoding could be an array-to-array codec applied to the chunk edge lengths. |
This metadata-only variant with RLE seemed sufficient for many use cases, and we just really wanted to get this moving at the summit. It is potentially a significant change for implementations to support irregular gridding, let alone a chunk grid that does I/O. The intention with Some things to think about if chunk edge lengths were a separate array
|
|
I'm not suggesting that the chunk edge lengths be an external zarr array accessed via a store. It can and probably should be written as a JSON array as part of the configuration. Nonetheless, we can still think of that JSON array conceptually as an array and of run-length-encoding as a codec that operates on that array. Run-length-encoding is likely useful in other places. Could we factor it out as its own codec so that we can use it elsewhere? It might also be clearer to then explicitly mark this array of chunk edge lengths as being run-length-encoded. For example, let's say I transpose the shard index so that I now have an array of offsets followed by an array of chunk sizes (nbytes). If I were not using any variable length compression codecs, those chunk sizes might all be the same and easily run length encoded. |
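A small sketch of the transposed shard index example, under the assumption that the index is a list of `(offset, nbytes)` pairs, one per chunk. The helper name `run_lengths` and the sample numbers are made up for illustration.

```python
from itertools import groupby


def run_lengths(values):
    """Run-length encode a sequence as a list of (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# Hypothetical index for four uncompressed chunks of 4096 bytes each.
index = [(0, 4096), (4096, 4096), (8192, 4096), (12288, 4096)]

# Transpose: all offsets first, then all chunk sizes.
offsets, nbytes = (list(column) for column in zip(*index))

print(offsets)              # [0, 4096, 8192, 12288]
print(run_lengths(nbytes))  # [(4096, 4)]
```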
|
Oh gotcha! Well, that is a lot simpler. I'd suggest doing run-length encoding as an array-to-bytes codec that concatenates separately encoded data and run lengths, similar to how sharding deals with the index and data. Spitballing (named for clarity):
{
"chunk_key_encoding": {
"name": "rectilinear",
"configuration": {
"compressed_edge_length_bytes_per_dimension": [
"BASE64BYTESDIM0",
"BASE64BYTESDIM1",
"BASE64BYTESDIM2",
"BASE64BYTESDIM3"
],
"number_of_chunks_per_dimension": [
50,
40,
30,
20
],
"compressed_edge_length_codecs": [
{
"name": "run_length_encoding",
"configuration": {
"data_codecs": [
...
],
"run_length_codecs": [
...
]
}
}
]
}
}
} |
|
We could take the array analogy further and use a subset (or all) of the array metadata document. Continuing to spitball:
{
"chunk_key_encoding": {
"name": "rectilinear",
"configuration": {
"chunk_edge_lengths" : {
"node_type": "array",
"data_type": "uint64",
"shape": [
50,
40,
30,
20
],
"chunks": [
"BASE64BYTESDIM0",
"BASE64BYTESDIM1",
"BASE64BYTESDIM2",
"BASE64BYTESDIM3"
],
"codecs": [
{
"name": "run_length_encoding",
"configuration": {
"data_codecs": [
...
],
"run_length_codecs": [
...
]
}
}
]
}
}
}
}
xref #22 |
|
Indeed! This particular case is a bit quirky since it is a bunch of 1D arrays though |
|
yeah... unless other people feel differently, I don't think bringing more explicit Zarr array semantics into the rectilinear chunk grid definition solves an acute problem, so maybe we consider deferring that until later |
|
The reason I would want to think of run-length-encoding as an array-to-array codec is that I would still want the option of representing the RLE-coded array as a JSON array as currently described in the spec. I see, though, that this is a special case because the run length and the data being encoded are both integers, allowing for a homogeneous array. We can think of the original array as a 1D array of integers. The run-length-encoding is a 1D array of The original description of the syntax here reduces 1-runs to a single integer. That seems to be a second codec that converts a homogeneous 1D array into a 1D array of a sum type. Now, the conversation in |
|
This is what the current example JSON looks like:
{
...
"shape": [6, 6, 6, 6, 6],
"chunk_grid": {
"name": "rectilinear",
"configuration": {
"kind": "inline",
"chunk_shapes": [
4, // integer. expands to [4, 4]
[1, 2, 3], // explicit list of edge lengths. expands to itself.
[[4, 2]], // run-length encoded. expands to [4, 4].
[[1, 3], 3], // run-length encoded and explicit list. expands to [1, 1, 1, 3]
[4, 4, 4] // explicit list with overflow chunks
]
}
}
}
What exactly is the deficiency with this representation that your proposal would solve? I agree that it would be nice if we had a general purpose array -> bytes RLE codec, but I don't see how that solves a problem with this chunk grid declaration. On the contrary, I think simplicity should be a design goal here, and treating the arrays of edge lengths like generic Zarr arrays doesn't seem worth the complexity. |
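For readers following the example JSON in this comment, the expansion rules noted in its inline comments can be sketched roughly as follows. The function name `expand_chunk_shapes` is hypothetical, and the handling of a bare integer (repeat until the axis is covered) is inferred from the "expands to [4, 4]" note for an axis of length 6.

```python
import math


def expand_chunk_shapes(spec, extent):
    """Expand one dimension's entry of `chunk_shapes` into explicit edge lengths.

    spec is either a bare integer (regular chunking along this axis) or a
    list whose items are integers or [length, count] run-length pairs.
    """
    if isinstance(spec, int):
        # Regular chunking: enough chunks to cover the axis, last one may overflow.
        return [spec] * math.ceil(extent / spec)
    edges = []
    for item in spec:
        if isinstance(item, int):
            edges.append(item)              # explicit edge length
        else:
            length, count = item            # run-length encoded pair
            edges.extend([length] * count)
    return edges


shape = [6, 6, 6, 6, 6]
chunk_shapes = [4, [1, 2, 3], [[4, 2]], [[1, 3], 3], [4, 4, 4]]
for spec, extent in zip(chunk_shapes, shape):
    print(expand_chunk_shapes(spec, extent))
# [4, 4]
# [1, 2, 3]
# [4, 4]
# [1, 1, 1, 3]
# [4, 4, 4]
```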
|
I think we could keep this simple while also factoring out this particular RLE codec. Have an optional configuration parameter called Perhaps in the future there may be a desire to use a distinct notation for writing this information. Then another codec could be specified. In the meantime, I could see the application for reusing |
we already have a way to extend or modify this chunk grid -- by introducing new variants of the configuration, which are tagged with the |
|
Maybe we should change "inline" to "default" then because the possible alternate encodings I can think of are still "inline". An alternate encoding might be specifying that I always want the last chunk in each dimension to be truncated. I might parameterize that using the total length and the initial chunk size. I suspect this is a common case. Is that a different "kind"? My main point is that I think this run length encoding scheme is more general than rectilinear chunking so it would be useful to describe that elsewhere so it could be reused in other places. That suggests to me that some additional modularity is possible. |
So far the utility of RLE elsewhere in Zarr is not a convincing argument for making this particular chunk grid more elaborate. I think it would make sense for the rectilinear chunk grid to be parametrized by codecs if:
At the moment, we have two encodings (explicit edge length and RLE), which can be distinguished via JSON types. We vaguely anticipated alternative encodings (like references to Zarr arrays, or JSON-encoded compressed bytestrings) but all of these would be distinguishable via the |
|
A common case for a rectilinear chunk grid is to truncate the last chunk in each dimension to exactly match the overall shape of the array. N5, for example, has end-chunks that may be smaller than the otherwise regular chunk shape. In the example below, I have an array that is 1000 x 1000 because I have a camera from a manufacturer that advertised a 1 megapixel camera and takes SI prefixes seriously. However, I want 64 x 32 chunks but I do not want to pad the chunks that cross the boundary of the 1000 x 1000 shape. Thus the very last chunk will be 40 x 8. To encode that with the run length encoding scheme, I could write the following.
{
// ...
"shape": [1000, 1000],
"chunk_grid": {
"name": "rectilinear",
"configuration": {
"kind": "inline",
"chunk_shapes": [
[[64, 15], 40],
[[32, 31], 8]
]
}
}
}
While it is nice that I can compactly describe the chunk sizes, I have to admit that I had to grab a calculator to calculate the numbers:
In [1]: 1000 // 64
Out[1]: 15
In [2]: 1000 % 64
Out[2]: 40
In [3]: 1000 // 32
Out[3]: 31
In [4]: 1000 % 32
Out[4]: 8
Why am I manually doing a calculation to enter a parameter? Instead, I could just enter the values that I do know as parameters: 64 and 32.
{
// ...
"shape": [1000, 1000],
"chunk_grid": {
"name": "rectilinear",
"configuration": {
"kind": "inline_regular_and_remainder",
"regular_chunk_shape": [64, 32]
}
}
}
While I could do this with a different Over in zarr-developers/zarr-specs#368, I was thinking about the current shard indexes as well as introducing inline shard indices. Perhaps it would also make sense for shards to have rectilinear chunk grids. Without compression of the chunks, the regularity of the chunk byte sizes is related to the regularity of the chunk shapes. If the chunks are contiguous in storage, then the differences in their offsets would also follow a similar regularity, suggesting that delta encoding combined with run-length-encoding may be useful. While the Rather than defining a bunch of one-off scattered metadata encodings throughout the specification and the extensions, I was wondering if developing a set of utility codecs for describing repetitive sequences could provide for a more uniform and consistent design for linked concepts. |
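The calculator step in this comment is small enough to sketch directly. The helper name `regular_with_remainder` is hypothetical; it derives the run-length-encoded entry for one dimension from the array extent and the regular chunk size, using the `[length, count]` notation from the example above.

```python
def regular_with_remainder(extent, chunk_size):
    """Build the RLE entry for regular chunks plus a truncated final chunk."""
    full_chunks, remainder = divmod(extent, chunk_size)
    encoded = [[chunk_size, full_chunks]] if full_chunks else []
    if remainder:
        encoded.append(remainder)  # the shorter last chunk, if any
    return encoded


shape = [1000, 1000]
regular_chunk_shape = [64, 32]
chunk_shapes = [
    regular_with_remainder(extent, size)
    for extent, size in zip(shape, regular_chunk_shape)
]
print(chunk_shapes)  # [[[64, 15], 40], [[32, 31], 8]]
```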
Just leave it up to the implementation to calculate that? The user can just specify their edge lengths. The RLE encoding in metadata is effectively an implementation detail from the user perspective.
Indeed, and yes a more compact "
And how are the subchunk shapes defined if the shards are irregularly gridded? Sharding will move to a |
Opening this PR with a draft of a rectilinear chunk grid spec