Skip to content

Avoid memory copy in obstore write #2972

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 14, 2025
6 changes: 4 additions & 2 deletions src/zarr/storage/_obstore.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@
from collections.abc import Iterable
from typing import TYPE_CHECKING, Any, TypedDict

import numpy as np

from zarr.abc.store import (
ByteRequest,
OffsetByteRequest,
Expand Down Expand Up @@ -145,15 +147,15 @@ async def set(self, key: str, value: Buffer) -> None:

self._check_writable()

buf = value.to_bytes()
buf = value.as_numpy_array().view(np.uint8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worthwhile to make this a standard method? This would make it easier to use throughout the codebase

Some possible names:

  • to_bytelike
  • to_binary
  • to_uint8s
  • ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess i'm also curious to know why we have a to_bytes() method that makes a copy, when .as_numpy_array().view(np.uint8) achieves the same effective result (an iterable of bytes) without a copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was suggested in #2925, so I've added an as_bytes_like method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bytes objects always own their memory. So creating a new bytes object means making a copy

If we prefer to move away from this approach, we could add a deprecation cycle to move from to_bytes to as_bytes_like

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think FsspecStore works with BytesLike since it didn't accept a memoryview object when I tried it. So we probably need to keep to_bytes().

Copy link
Member

@jakirkham jakirkham Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this would be a good issue to raise with fsspec if we can reproduce with it directly

If not, a Zarr issue would be welcome

Ideally these copies should be avoidable in the FsspecStore case as well

Edit: Ofc this is non-blocking, just would like to improve the performance in all cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some possible names:

I'd also suggest as_buffer_like, because in my head "bytes" in Python is a type that always owns its own memory, while "buffer" doesn't necessarily own its memory, as in the buffer protocol or collections.abc.Buffer.

await obs.put_async(self.store, key, buf)

async def set_if_not_exists(self, key: str, value: Buffer) -> None:
# docstring inherited
import obstore as obs

self._check_writable()
buf = value.to_bytes()
buf = value.as_numpy_array().view(np.uint8)
with contextlib.suppress(obs.exceptions.AlreadyExistsError):
await obs.put_async(self.store, key, buf, mode="create")

Expand Down