Coalescing chunk reads using HTTP multi-part range requests #1316

@jleben

When reading from an icechunk repo on AWS S3 with the zarr or xarray libraries, icechunk appears to issue one single-range HTTP GET request per chunk read, even when the requested chunks are stored contiguously in the same S3 object (using Zarr sharding).

This could be optimized by:

  • Coalescing reads of neighbouring ranges from the same object into a single-range GET request
  • Coalescing reads of non-neighbouring ranges from the same object into a multi-part range GET request
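
To illustrate the first bullet, here is a minimal sketch of range coalescing in plain Python. It is not icechunk code: `coalesce_ranges` and `max_gap` are hypothetical names, and byte ranges are assumed to be half-open `(start, end)` pairs within a single object, in the same style as the `range: 0..17` entries in the trace output further down.

```python
from typing import Iterable

# Half-open byte range [start, end) within one object (an assumption of this sketch).
ByteRange = tuple[int, int]


def coalesce_ranges(ranges: Iterable[ByteRange], max_gap: int = 0) -> list[ByteRange]:
    """Merge ranges that overlap, touch, or lie within max_gap bytes of each other."""
    merged: list[ByteRange] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of adding a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# The two chunk reads 0..17 and 17..34 collapse into one range; 153..317 stays separate.
print(coalesce_ranges([(153, 317), (0, 17), (17, 34)]))  # [(0, 34), (153, 317)]
```

A non-zero `max_gap` would trade a few wasted bytes for fewer requests; what value makes sense is a tuning question this sketch does not answer.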

Am I missing some writing or reading option that would enable the above optimization?

Reproduction script

import icechunk
import zarr
import numpy as np
import os

# Generate data
s3_storage = icechunk.s3_storage(
    bucket="earthdaily-pathfinders-scaleai", 
    prefix="scratch/jakob/icechunk/test2",
    from_env=True,
)
s3_repo = icechunk.Repository.create(
    s3_storage,
    config=icechunk.RepositoryConfig(inline_chunk_threshold_bytes=1)
)
session = s3_repo.writable_session("main")
zarr.create_array(session.store, name="x", chunks=(1,), shards=(10,), data=np.arange(10))
session.commit("data")

# Enable logging. 
os.environ["ICECHUNK_LOG"]="icechunk=trace"
icechunk.set_logs_filter(None)

# Read some data
x = zarr.open_array(session.store, path="/x")
x[1:3]

Output from the x[1:3] statement:

  2025-10-29T18:45:44.219733Z TRACE icechunk::asset_manager: Downloading chunk, chunk_id: E2NTYJ67BFBJG2220840, range: 153..317
    at icechunk/src/asset_manager.rs:371
    in icechunk::asset_manager::fetch_chunk with chunk_id: E2NTYJ67BFBJG2220840, range: 153..317
    in icechunk::store::get with key: "x/c/0", byte_range: Until(164)

  2025-10-29T18:45:44.743241Z TRACE icechunk::asset_manager: Downloading chunk, chunk_id: E2NTYJ67BFBJG2220840, range: 0..17
    at icechunk/src/asset_manager.rs:371
    in icechunk::asset_manager::fetch_chunk with chunk_id: E2NTYJ67BFBJG2220840, range: 0..17
    in icechunk::store::get with key: "x/c/0", byte_range: Bounded(0..17)

  2025-10-29T18:45:44.832065Z TRACE icechunk::asset_manager: Downloading chunk, chunk_id: E2NTYJ67BFBJG2220840, range: 17..34
    at icechunk/src/asset_manager.rs:371
    in icechunk::asset_manager::fetch_chunk with chunk_id: E2NTYJ67BFBJG2220840, range: 17..34
    in icechunk::store::get with key: "x/c/0", byte_range: Bounded(17..34)

More verbose logging shows actual HTTP requests made:

os.environ["ICECHUNK_LOG"]="trace"
icechunk.set_logs_filter(None)
x[1:3]

Output (excerpt):

...
GET
/scratch/jakob/icechunk/test2/chunks/E2NTYJ67BFBJG2220840
x-id=GetObject
host:earthdaily-pathfinders-scaleai.s3.us-east-1.amazonaws.com
range:bytes=0-16
...
GET
/scratch/jakob/icechunk/test2/chunks/E2NTYJ67BFBJG2220840
x-id=GetObject
host:earthdaily-pathfinders-scaleai.s3.us-east-1.amazonaws.com
range:bytes=17-33
...

This shows two separate GET requests for exactly consecutive byte ranges (0-16 and 17-33) of the same S3 object.
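
For reference, a small sketch of how coalesced ranges would map onto the `Range` header. The helper name `range_header` is hypothetical and not part of icechunk. It assumes half-open `(start, end)` input as in the trace output, and converts to the inclusive byte positions used by HTTP. Whether a given storage backend honours more than one range per request is an assumption that would need checking.

```python
def range_header(ranges):
    """Format half-open (start, end) byte ranges as an HTTP Range header value."""
    # HTTP byte ranges are inclusive on both ends, hence end - 1.
    return "bytes=" + ",".join(f"{start}-{end - 1}" for start, end in ranges)


# What is sent today: one request per chunk.
print(range_header([(0, 17)]))   # bytes=0-16
print(range_header([(17, 34)]))  # bytes=17-33

# What a single coalesced request for both chunks would send.
print(range_header([(0, 34)]))   # bytes=0-33

# A multi-range value covering non-neighbouring ranges.
print(range_header([(0, 34), (153, 317)]))  # bytes=0-33,153-316
```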
