Support concurrent loading of variables #8965
Comments
Would that be compatible with async stores?
This idea of passing an arbitrary concurrent executor to xarray seems potentially related to #7810, which suggests allowing …
This is a tricky issue. One problem we have in our stack is that we currently outsource nearly all actual parallelism to Dask. (The one exception to this is fsspec's async capabilities, which are hidden behind a separate thread housing an async event loop.) Ideally, there would be one single runtime responsible for actually implementing concurrent data access and I/O. If all the libraries implemented async methods, then that could be placed completely in the user's responsibility, i.e. you could write code like

```python
async def my_processing_function():
    await xr.open_dataset(...)
    # which would call
    await zarr.open_group(...)
    # which would call
    await object_store.get_object(...)
```

The user would be responsible for starting an event loop and running the coroutine. The event loop would manage the concurrency for the whole stack, and everything would be fine. In Zarr we are in the process of adding the async methods. That raises the question: should Xarray add them too? If not, then Xarray has to decide how to call async code. It could use the fsspec approach of managing an async event loop on another thread. It could manage a thread pool of its own. How would these interact with Dask / fsspec / Zarr / etc.? The futures approach proposed here is one example of how to add concurrency within Xarray. I feel like this conundrum really illustrates the limitations of the modularity that we value so much in our stack. I have no idea what the "right" answer is. However, my perspective has been greatly influenced by writing Tokio Rust code, which does not suffer from this delegation problem. It's a very different situation from Python.
FWIW this appears to do what I wanted with Zarr at least, i.e. issue concurrent loads per variable:

```python
import numpy as np
import xarray as xr

def concurrent_compute(ds: xr.Dataset) -> xr.Dataset:
    from concurrent.futures import ThreadPoolExecutor, as_completed

    copy = ds.copy()

    def load_variable_data(name: str, var: xr.Variable) -> tuple[str, np.ndarray]:
        return (name, var.compute().data)

    with ThreadPoolExecutor(max_workers=None) as executor:
        futures = [
            executor.submit(load_variable_data, k, v) for k, v in copy.variables.items()
        ]
        for future in as_completed(futures):
            name, loaded = future.result()
            copy.variables[name].data = loaded
    return copy

concurrent_compute(ds)
```
If we wanted to load "coordinate" variables from disk concurrently, we'd need to update this loop similarly: xarray/xarray/core/coordinates.py, lines 1090 to 1105 at 66e13a2.
Is your feature request related to a problem?
Today, if users want to concurrently load multiple variables in a DataArray or Dataset, they have to use dask. It struck me that it'd be pretty easy for `.load` to gain an `executor` kwarg that accepts anything that follows the `concurrent.futures` executor interface, and parallelize this loop: xarray/xarray/core/dataset.py, lines 853 to 857 at b003674.
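To make the proposal concrete, here is a stdlib-only sketch of what parallelizing such a loop over an injected executor could look like. The helper name `load_with_executor` and its signature are hypothetical, not existing or proposed xarray API; the toy `variables` dict and `load_fn` stand in for lazy variables and their load step:

```python
# Hypothetical sketch: parallelize a "load each lazy variable" loop over
# any concurrent.futures-style executor, falling back to a serial loop.
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Optional

def load_with_executor(
    variables: dict,
    load_fn: Callable,
    executor: Optional[Executor] = None,
) -> dict:
    if executor is None:
        # No executor supplied: behave like today's serial loop.
        return {k: load_fn(v) for k, v in variables.items()}
    # Submit one load per variable, then gather results by name.
    futures = {k: executor.submit(load_fn, v) for k, v in variables.items()}
    return {k: f.result() for k, f in futures.items()}

lazy = {"a": 1, "b": 2}
with ThreadPoolExecutor() as pool:
    print(load_with_executor(lazy, lambda v: v * 10, pool))  # {'a': 10, 'b': 20}
```

Because the kwarg accepts anything implementing the `concurrent.futures` executor interface, a user could pass a `ThreadPoolExecutor`, a `ProcessPoolExecutor`, or a third-party compatible executor without xarray owning the concurrency policy.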