Description
When reading from an icechunk repo on AWS S3 using the zarr or xarray libraries, it seems icechunk issues one single-range HTTP GET request for each chunk read, even when requesting multiple chunks which are stored contiguously in the same S3 object (using Zarr sharding).
This could be optimized by:
- Coalescing reads of neighbouring ranges from the same object into one single-range GET request
- Coalescing reads of non-neighbouring ranges from the same object into one multi-range GET request
Am I missing some writing or reading option that would enable the above optimization?
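To make the proposal concrete, here is a minimal sketch of the coalescing step in plain Python. It is not icechunk's code or API: `coalesce_ranges`, `to_range_header`, and the `max_gap` parameter are hypothetical names used only for illustration. Ranges are half-open `(start, end)` pairs, as in the icechunk trace output, while the HTTP `Range` header is inclusive on both ends. A non-zero `max_gap` is a swapped-in alternative to multi-range requests: it merges non-neighbouring ranges into one single-range request at the cost of reading the bytes in between.

```python
from typing import Iterable, List, Tuple

ByteRange = Tuple[int, int]  # half-open [start, end), as in the icechunk logs


def coalesce_ranges(ranges: Iterable[ByteRange], max_gap: int = 0) -> List[ByteRange]:
    """Merge byte ranges of one object whose gap is at most max_gap bytes.

    Hypothetical helper, not part of icechunk or zarr.
    """
    merged: List[ByteRange] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Adjacent, overlapping, or close enough: extend the previous range.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def to_range_header(start: int, end: int) -> str:
    """Convert a half-open range to an inclusive HTTP Range header value."""
    return f"bytes={start}-{end - 1}"


# The two chunk reads issued for x[1:3] in the reproduction below.
requests = coalesce_ranges([(0, 17), (17, 34)])
headers = [to_range_header(s, e) for s, e in requests]
```

With the two chunk ranges from the reproduction, `requests` holds one range and `headers` holds one `Range` value covering both chunks.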
Reproduction script
import icechunk
import zarr
import numpy as np
import os
# Generate data
s3_storage = icechunk.s3_storage(
    bucket="earthdaily-pathfinders-scaleai",
    prefix="scratch/jakob/icechunk/test2",
    from_env=True,
)
s3_repo = icechunk.Repository.create(
    s3_storage,
    config=icechunk.RepositoryConfig(inline_chunk_threshold_bytes=1),
)
session = s3_repo.writable_session("main")
zarr.create_array(session.store, name="x", chunks=(1,), shards=(10,), data=np.arange(10))
session.commit("data")
# Enable logging.
os.environ["ICECHUNK_LOG"]="icechunk=trace"
icechunk.set_logs_filter(None)
# Read some data
x = zarr.open_array(session.store, path="/x")
x[1:3]
Output from the x[1:3] statement:
2025-10-29T18:45:44.219733Z TRACE icechunk::asset_manager: Downloading chunk, chunk_id: E2NTYJ67BFBJG2220840, range: 153..317
at icechunk/src/asset_manager.rs:371
in icechunk::asset_manager::fetch_chunk with chunk_id: E2NTYJ67BFBJG2220840, range: 153..317
in icechunk::store::get with key: "x/c/0", byte_range: Until(164)
2025-10-29T18:45:44.743241Z TRACE icechunk::asset_manager: Downloading chunk, chunk_id: E2NTYJ67BFBJG2220840, range: 0..17
at icechunk/src/asset_manager.rs:371
in icechunk::asset_manager::fetch_chunk with chunk_id: E2NTYJ67BFBJG2220840, range: 0..17
in icechunk::store::get with key: "x/c/0", byte_range: Bounded(0..17)
2025-10-29T18:45:44.832065Z TRACE icechunk::asset_manager: Downloading chunk, chunk_id: E2NTYJ67BFBJG2220840, range: 17..34
at icechunk/src/asset_manager.rs:371
in icechunk::asset_manager::fetch_chunk with chunk_id: E2NTYJ67BFBJG2220840, range: 17..34
in icechunk::store::get with key: "x/c/0", byte_range: Bounded(17..34)
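One possible reading of the first request (`range: 153..317`, `byte_range: Until(164)`) is that it fetches the shard index stored at the end of the shard. The arithmetic below checks this under an assumption not stated in the report: that the array uses the default Zarr v3 sharding index layout, i.e. two uint64 values (offset and length) per inner chunk followed by a 4-byte crc32c checksum. `shard_index_nbytes` is a hypothetical helper, not a zarr or icechunk function.

```python
def shard_index_nbytes(chunks_per_shard: int, checksum_nbytes: int = 4) -> int:
    """Size of a shard index, assuming (offset, nbytes) uint64 pairs plus a checksum."""
    return chunks_per_shard * 2 * 8 + checksum_nbytes


shard_nbytes = 317                      # end of the range in the first log entry
index_nbytes = shard_index_nbytes(10)   # shards=(10,) with chunks=(1,) gives 10 inner chunks
index_start = shard_nbytes - index_nbytes

print(index_nbytes, index_start)
```

If that assumption holds, the computed values match the 164-byte suffix read starting at byte 153, and the remaining two requests are the actual chunk reads for `x[1:3]`.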
More verbose logging shows actual HTTP requests made:
os.environ["ICECHUNK_LOG"]="trace"
icechunk.set_logs_filter(None)
x[1:3]
Output (excerpt):
...
GET
/scratch/jakob/icechunk/test2/chunks/E2NTYJ67BFBJG2220840
x-id=GetObject
host:earthdaily-pathfinders-scaleai.s3.us-east-1.amazonaws.com
range:bytes=0-16
...
GET
/scratch/jakob/icechunk/test2/chunks/E2NTYJ67BFBJG2220840
x-id=GetObject
host:earthdaily-pathfinders-scaleai.s3.us-east-1.amazonaws.com
range:bytes=17-33
...
The two GET requests cover exactly consecutive byte ranges (0-16 and 17-33) of the same S3 object, so a single request for bytes=0-33 would have returned the same data.