What happened?
Despite efforts, using HDF5 as a memory-mapped backing store is not working: the data still end up fully loaded in RAM. Memory mapping is critical for this use case.
Potential explanations (as per codex, all sounding pretty reasonable):
- The `BaseDataset` API hard-requires in-memory `numpy.ndarray` objects for `dataobj` and `affine`, and `from_filename` eagerly converts every HDF5 dataset into full NumPy arrays before constructing the object. This prevents use of HDF5 datasets or memory-mapped arrays as backends and guarantees that all volumes occupy RAM, defeating the intended low-memory design. We should allow `BaseDataset` to use lazy/backed arrays instead of forcing `numpy.ndarray`.
- Writing to HDF5 via `to_filename` always materializes every field (including the full data array) in memory and never updates `_filepath` to serve as a backing store, so even after writing, there is no mechanism to drop the in-memory copy or reopen lazily. This duplicative write path increases peak memory usage rather than reducing it. We should rework `to_filename` to create and reuse on-disk backing stores.
- The DWI initializer removes b=0 volumes via boolean masking (`self.dataobj = self.dataobj[..., ~b0_mask]`), which copies the full array; the b=0 reference is also computed from the in-memory data. Combined with the base class constraints, this further increases transient memory during the construction of diffusion datasets.
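
To make the first and third points concrete, here is a minimal sketch of what a lazily backed `dataobj` could look like. It is only an illustration under assumptions, not the project's code: the `LazyVolumes` class, the `"dataobj"` dataset name inside the file, the one-volume-per-chunk layout, and the toy array sizes are all hypothetical. The idea is to keep an `h5py.Dataset` handle instead of a `numpy.ndarray`, and to record which volumes to keep as an index array rather than applying the boolean mask to the data.

```python
import os
import tempfile

import h5py
import numpy as np


class LazyVolumes:
    """Hypothetical stand-in for a lazily backed ``dataobj`` (not the project's API)."""

    def __init__(self, filepath, name="dataobj", keep=None):
        # Opening the file and getting the dataset handle reads no voxel data.
        self._file = h5py.File(filepath, "r")
        self._dset = self._file[name]
        n_vols = self._dset.shape[-1]
        # Store indices of retained volumes (e.g. non-b=0) instead of masking the data.
        self._keep = np.arange(n_vols) if keep is None else np.flatnonzero(keep)

    @property
    def shape(self):
        # Shape as seen by the caller, after dropping excluded volumes.
        return self._dset.shape[:-1] + (len(self._keep),)

    def __getitem__(self, i):
        # Read exactly one 3D volume from disk, on demand.
        return self._dset[..., int(self._keep[i])]

    def close(self):
        self._file.close()


# Toy data: six 4x4x4 volumes, of which the first and fourth are b=0 (assumed).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "dwi.h5")
data = np.arange(4 * 4 * 4 * 6, dtype="float32").reshape(4, 4, 4, 6)
b0_mask = np.array([True, False, False, True, False, False])

with h5py.File(path, "w") as f:
    # One volume per chunk, so reading a single volume touches a single chunk.
    f.create_dataset("dataobj", data=data, chunks=(4, 4, 4, 1))

vols = LazyVolumes(path, keep=~b0_mask)
lazy_shape = vols.shape  # computed from metadata only
first = vols[0]          # only this volume is loaded into RAM
vols.close()
```

With this kind of wrapper, only the volume being accessed is resident at any time, and excluding the b=0 volumes costs an index array rather than a copy of the whole series. Whether `BaseDataset` can accept such an object depends on relaxing the `numpy.ndarray` type requirement described above.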
What command did you use?
What version of the software are you running?
main
How are you running this software?
Local installation ("bare-metal")
Is your data BIDS valid?
Yes
Are you reusing any previously computed results?
No
Please copy and paste any relevant log output.
Additional information / screenshots
No response