
Inconsistent File Hashes When Resaving NetCDF Files with Chunks #10028


Closed · 5 tasks done

NoraLoose opened this issue Feb 5, 2025 · 10 comments

Comments

@NoraLoose

What happened?

When resaving a NetCDF file using xarray, the resulting file has a consistent hash if opened without chunks. However, when the dataset is opened with chunks and resaved, the file hash changes with each save, even if the data remains unchanged. This behavior suggests non-deterministic output when working with chunked datasets.

What did you expect to happen?

I expect that resaving a dataset, whether opened with or without chunks, should produce deterministic file output if the data remains unchanged. This is particularly important for workflows that rely on file integrity checks.

Minimal Complete Verifiable Example

import hashlib
import xarray as xr
import numpy as np

def calculate_file_hash(filepath, hash_algorithm="sha256"):
    """Calculate the hash of a file using the specified hash algorithm."""
    hash_func = hashlib.new(hash_algorithm)
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_func.update(chunk)
    return hash_func.hexdigest()

# Create and save the dataset
ds = xr.Dataset(
    data_vars=dict(
        omega=(["nc"], [0.00014052, 0.00014544]),
        lon=(["ny", "nx"], np.array([
            [216.83334, 217.],
            [216.83334, 217.]
        ], dtype=np.float32))
    ),
    coords=dict(
        nc=[0, 1],
        ny=[0, 1],
        nx=[0, 1]
    )
)
fname = "test_data.nc"
ds.to_netcdf(fname)
print("Original file hash:", calculate_file_hash(fname))

# Resave without chunks
ds_without_chunks = xr.open_dataset(fname, chunks=None)
fname_resaved_without_chunks = "test_data_resaved_without_chunks.nc"
ds_without_chunks.to_netcdf(fname_resaved_without_chunks)
print("Resaved without chunks:", calculate_file_hash(fname_resaved_without_chunks))

# Resave with chunks (inconsistent hash)
ds_with_chunks = xr.open_dataset(fname, chunks={"nc": 1})
fname_resaved_with_chunks = "test_data_resaved_with_chunks.nc"
ds_with_chunks.to_netcdf(fname_resaved_with_chunks)
print("Resaved with chunks (first save):", calculate_file_hash(fname_resaved_with_chunks))

fname_resaved_once_more_with_chunks = "test_data_resaved_once_more_with_chunks.nc"
ds_with_chunks.to_netcdf(fname_resaved_once_more_with_chunks)
print("Resaved with chunks (second save):", calculate_file_hash(fname_resaved_once_more_with_chunks))

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Original file hash:                   34a4af2dd6ec064dc1812ca0e1bbeb0feb3c698c78969c2bbeb25b0ec6fd7af0
Resaved without chunks:               34a4af2dd6ec064dc1812ca0e1bbeb0feb3c698c78969c2bbeb25b0ec6fd7af0
Resaved with chunks (first save):     64bdecb4517884dbce259e6207da93be2b9f9d1d40a7a9fa1c0ef8752ec7a8d0
Resaved with chunks (second save):    b90ef43fd7d15a476d0db24e334218f19c30a47ddb9b354538d1fb571c2322e5

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.8 | packaged by conda-forge | (main, Dec 5 2024, 14:24:40) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.21-150400.24.46-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.4-development

xarray: 2024.10.0
pandas: 2.2.3
numpy: 1.26.4
scipy: 1.14.1
netCDF4: 1.7.2
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.4
cftime: 1.6.4.post1
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2025.1.0
distributed: 2025.1.0
matplotlib: 3.9.2
cartopy: 0.24.1
seaborn: None
numbagg: None
fsspec: 2024.12.0
cupy: None
pint: None
sparse: 0.15.4
flox: None
numpy_groupies: None
setuptools: 69.5.1
pip: 24.0
conda: None
pytest: 8.3.4
mypy: None
IPython: 8.29.0
sphinx: None

NoraLoose added the bug and needs triage labels Feb 5, 2025

welcome bot commented Feb 5, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@dcherian
Contributor

dcherian commented Feb 5, 2025

Do you see a difference in ncdump -sh? My theory is that the order in which dimensions are added to the netCDF file is not deterministic, since it uses Dataset.dims, which is a set.
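
A minimal sketch of one way to check that theory, assuming the resaved files from the MVCE above are still on disk (this uses netCDF4-python, which is already in the reported environment):

import netCDF4

# Compare the on-disk dimension ordering of the two resaved files
for path in ["test_data_resaved_without_chunks.nc", "test_data_resaved_with_chunks.nc"]:
    with netCDF4.Dataset(path) as nc:
        print(path, list(nc.dimensions))

If the dimension names print in a different order for the two files, that would support the non-deterministic-ordering hypothesis.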

@NoraLoose
Author

Thanks for your quick response @dcherian!

I don't see a difference in ncdump -sh. Do you? 👀

Test data saved without chunks:

netcdf test_data_resaved_without_chunks {
dimensions:
        nc = 2 ;
        ny = 2 ;
        nx = 2 ;
variables:
        double omega(nc) ;
                omega:_FillValue = NaN ;
                omega:_Storage = "contiguous" ;
                omega:_Endianness = "little" ;
        float lon(ny, nx) ;
                lon:_FillValue = NaNf ;
                lon:_Storage = "contiguous" ;
                lon:_Endianness = "little" ;
        int64 nc(nc) ;
                nc:_Storage = "contiguous" ;
                nc:_Endianness = "little" ;
        int64 ny(ny) ;
                ny:_Storage = "contiguous" ;
                ny:_Endianness = "little" ;
        int64 nx(nx) ;
                nx:_Storage = "contiguous" ;
                nx:_Endianness = "little" ;

// global attributes:
                :_NCProperties = "version=2,netcdf=4.9.4-development,hdf5=1.14.2" ;
                :_SuperblockVersion = 2 ;
                :_IsNetcdf4 = 1 ;
                :_Format = "netCDF-4" ;
}

Test data saved with chunks:

netcdf test_data_resaved_with_chunks {
dimensions:
        nc = 2 ;
        ny = 2 ;
        nx = 2 ;
variables:
        double omega(nc) ;
                omega:_FillValue = NaN ;
                omega:_Storage = "contiguous" ;
                omega:_Endianness = "little" ;
        float lon(ny, nx) ;
                lon:_FillValue = NaNf ;
                lon:_Storage = "contiguous" ;
                lon:_Endianness = "little" ;
        int64 nc(nc) ;
                nc:_Storage = "contiguous" ;
                nc:_Endianness = "little" ;
        int64 ny(ny) ;
                ny:_Storage = "contiguous" ;
                ny:_Endianness = "little" ;
        int64 nx(nx) ;
                nx:_Storage = "contiguous" ;
                nx:_Endianness = "little" ;

// global attributes:
                :_NCProperties = "version=2,netcdf=4.9.4-development,hdf5=1.14.2" ;
                :_SuperblockVersion = 2 ;
                :_IsNetcdf4 = 1 ;
                :_Format = "netCDF-4" ;
}

@dcherian
Contributor

dcherian commented Feb 5, 2025

Ah, I missed the dask dependency. Reminds me of #7522.

dcherian removed the bug and needs triage labels Feb 5, 2025
@kmuehlbauer
Contributor

@NoraLoose Can you please check this with h5dump -H --properties, too?

@dcherian
Contributor

dcherian commented Feb 5, 2025

If I set compute=False to skip writing the data_vars, then I get:

Original file hash: 1d35260b587abdc285bd3f9efb7e3553b9ab388111540ffb8e2edc3887009491
Resaved without chunks: 1d35260b587abdc285bd3f9efb7e3553b9ab388111540ffb8e2edc3887009491
Resaved with chunks (first save): 5d586e508777596bfc1bbb0c75bb36a802dbf0e674f617fc111e69a87af7eaf9
Resaved with chunks (second save): 5d586e508777596bfc1bbb0c75bb36a802dbf0e674f617fc111e69a87af7eaf9

So like #7522, there's something about the parallel write that changes the metadata.

EDIT: Sadly, this implies the issue is hard to fix. You could instead read the Dataset back with decode_cf=False and hash all the arrays individually.
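
A minimal sketch of that workaround, hashing the raw on-disk values rather than the file bytes (the helper name hash_dataset_arrays is just for illustration):

import hashlib
import xarray as xr

def hash_dataset_arrays(filepath):
    """Hash the variable values only, so the file layout cannot affect the result."""
    hash_obj = hashlib.sha256()
    # decode_cf=False returns the raw on-disk values, without CF decoding
    with xr.open_dataset(filepath, decode_cf=False) as ds:
        for name in sorted(ds.variables):  # sorted for a deterministic order
            hash_obj.update(name.encode())  # include the variable name
            hash_obj.update(ds[name].values.tobytes())
    return hash_obj.hexdigest()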

@NoraLoose
Author

Thanks @dcherian and @kmuehlbauer!

I see some differences when using h5dump -H --properties, specifically in the OFFSET value in STORAGE_LAYOUT:

Without chunks:

   DATASET "lon" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 16
         OFFSET 8208
      } ...

   DATASET "nc" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 16
         OFFSET 8224
      }...

With chunks (first save):

   DATASET "lon" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 16
         OFFSET 6905
      }...

   DATASET "nc" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 16
         OFFSET 6841
      }

@NoraLoose
Author

You could instead read back the Dataset with decode_cf=False and hash all the arrays individually.

@dcherian Agreed, the layout metadata needs to be left out of the comparison. If I change my calculate_file_hash function to the following,

import hashlib
import h5py

def calculate_file_hash(filepath):
    """Hash only the array data in the file, ignoring storage-layout metadata."""
    hash_obj = hashlib.sha256()
    with h5py.File(filepath, "r") as f:
        # Iterate over the datasets at the root of the file
        for dataset_name in f:
            # Hash only the raw array bytes; layout details such as OFFSET
            # never enter the hash, so no attributes need to be stripped
            data = f[dataset_name][()]
            hash_obj.update(data.tobytes())
    return hash_obj.hexdigest()

I get consistent file hashes throughout:

Original file hash: ebd2c91af1b048262960a2b7181c9120873e82fd1b12b33ecf26c743f85a83a6
Resaved without chunks: ebd2c91af1b048262960a2b7181c9120873e82fd1b12b33ecf26c743f85a83a6
Resaved with chunks (first save): ebd2c91af1b048262960a2b7181c9120873e82fd1b12b33ecf26c743f85a83a6
Resaved with chunks (second save): ebd2c91af1b048262960a2b7181c9120873e82fd1b12b33ecf26c743f85a83a6

@dcherian
Contributor

dcherian commented Feb 6, 2025

Nice! If you are going to do this, I recommend using xxhash to save some time.
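
For reference, a sketch of the same data-only hash with xxhash instead of SHA-256, assuming the xxhash package is installed (the function name calculate_file_hash_fast is just for illustration; only the hash object changes):

import h5py
import xxhash  # pip install xxhash

def calculate_file_hash_fast(filepath):
    """Same data-only hash as above, but with the faster non-cryptographic xxHash."""
    hash_obj = xxhash.xxh64()
    with h5py.File(filepath, "r") as f:
        for dataset_name in f:
            hash_obj.update(f[dataset_name][()].tobytes())
    return hash_obj.hexdigest()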

@NoraLoose
Author

Great, I think we can close this issue. Thanks again for your input @dcherian and @kmuehlbauer!

@dcherian dcherian closed this as completed Feb 6, 2025