Our HDF5 implementation continues to be a memory hog #347

@oesteban

What happened?

Despite previous efforts, using HDF5 as a memory-mapping backend is still failing, and memory mapping is critical for this use case.

Potential explanations (suggested by Codex; all sound reasonable):

  • The BaseDataset API hard-requires in-memory numpy.ndarray objects for dataobj and affine, and from_filename eagerly converts every HDF5 dataset into full NumPy arrays before constructing the object. This prevents use of HDF5 datasets or memory-mapped arrays as backends and guarantees that all volumes occupy RAM, defeating the intended low-memory design. We should allow BaseDataset to use lazy/backed arrays instead of forcing numpy.ndarray.
  • Writing to HDF5 via to_filename always materializes every field (including the full data array) in memory and never updates _filepath to serve as a backing store, so even after writing, there is no mechanism to drop the in-memory copy or reopen lazily. This duplicative write path increases peak memory usage rather than reducing it. We should rework to_filename to create and reuse on-disk backing stores.
  • The DWI initializer removes b=0 volumes via boolean masking (self.dataobj = self.dataobj[..., ~b0_mask]), which copies the full array; the b=0 reference is also computed from the in-memory data. Combined with the base class constraints, this further increases transient memory during the construction of diffusion datasets.
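To illustrate the first and third points, here is a minimal sketch, not the project's actual API. It uses numpy.memmap as a stand-in for a disk-backed store (an h5py dataset would presumably be sliced in a similar way, but that is an assumption). The LazyVolumes class, the demo function, and the dwi.dat file name are all hypothetical. The sketch shows that boolean masking allocates a new in-memory array, whereas keeping a small integer index of the non-b=0 volumes lets each volume be read only when requested.

```python
import os
import tempfile

import numpy as np


class LazyVolumes:
    """Hypothetical helper: a lazy selection of volumes from a disk-backed 4D array."""

    def __init__(self, backing, keep):
        self._backing = backing             # disk-backed array (np.memmap in this sketch)
        self._index = np.flatnonzero(keep)  # small integer index; no image data is copied

    def __len__(self):
        return len(self._index)

    def __getitem__(self, i):
        # Materialize a single 3D volume only when it is requested
        return np.array(self._backing[..., int(self._index[i])])


def demo():
    shape = (4, 4, 4, 6)
    b0_mask = np.array([True, False, False, True, False, False])
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "dwi.dat")  # hypothetical backing file
        data = np.memmap(path, dtype="float32", mode="w+", shape=shape)
        data[:] = np.arange(np.prod(shape), dtype="float32").reshape(shape)
        data.flush()

        # Boolean masking is advanced indexing: it allocates a new in-memory array
        masked = data[..., ~b0_mask]
        masking_copies = not np.shares_memory(masked, data)

        # Lazy alternative: keep the backing store and index volumes on demand
        lazy = LazyVolumes(data, ~b0_mask)
        n_volumes = len(lazy)
        same_values = all(
            np.array_equal(lazy[i], masked[..., i]) for i in range(n_volumes)
        )

        del lazy, data  # release the memory map before the directory is removed
    return masking_copies, n_volumes, same_values
```

With this pattern, peak memory is bounded by one volume at a time rather than by the full 4D array, at the cost of one read per access.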

What command did you use?

n/a

What version of the software are you running?

main

How are you running this software?

Local installation ("bare-metal")

Is your data BIDS valid?

Yes

Are you reusing any previously computed results?

No

Please copy and paste any relevant log output.

Additional information / screenshots

No response

Metadata

Labels

bug

Status

Backlog