[BUG] use_datastore not robust for invalid assets #344

Open
anton-seaice opened this issue Feb 18, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@anton-seaice
Collaborator

Describe the bug

When I run use_datastore on experiment output which contains invalid assets, the datastore is always regenerated, even when there are no changes to the experiment output.

The expected behaviour is that it finds the existing datastore and does not regenerate it.

In general, unmodified configurations won't include invalid assets, so I don't think it's a priority to address this.

To Reproduce

Run

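# Import paths are assumed from the package layout shown in the output below
from access_nri_intake.experiment import use_datastore
from access_nri_intake.source import builders

use_datastore(
    '/g/data/tm70/as2285/payu/MOM6-CICE6/archive/',
    builder=builders.AccessOm3Builder
)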

Additional context

This is the output:

Datastore found in /g/data/tm70/as2285/payu/MOM6-CICE6/archive, verifying datastore integrity...
Parsing experiment dir...

/g/data/xp65/public/apps/med_conda/envs/analysis3-25.02/lib/python3.11/site-packages/access_nri_intake/experiment/main.py:97: DataStoreWarning: Experiment directory and datastore do not match (missing files from datastore). Datastore regeneration required...
  ds_info.valid = verify_ds_current(ds_info, found_experiment_files)

Building esm-datastore...
Sucessfully built esm-datastore!
Saving esm-datastore to /g/data/tm70/as2285/payu/MOM6-CICE6/archive
Successfully wrote ESM catalog json file to: file:///g/data/tm70/as2285/payu/MOM6-CICE6/archive/experiment_datastore.json
Hashing catalog to prevent unnecessary rebuilds.
This may take some time...

/g/data/xp65/public/apps/med_conda/envs/analysis3-25.02/lib/python3.11/site-packages/access_nri_intake/source/builders.py:200: UserWarning: Unable to parse 3002 assets. A list of these assets can be found in `.invalid_assets` attribute.
  self.get_assets().validate_parser().parse().clean_dataframe()
/g/data/xp65/public/apps/med_conda/envs/analysis3-25.02/lib/python3.11/site-packages/intake_esm/cat.py:187: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  data = self.dict().copy()
/g/data/xp65/public/apps/med_conda/envs/analysis3-25.02/lib/python3.11/site-packages/pydantic/deprecated/decorator.py:226: UserWarning: Unable to parse 3002 assets/files. A list of these assets can be found in /jobfs/135511769.gadi-pbs/experiment_datastore_invalid_assets.csv.
  return self.raw_function(**d, **var_kwargs)
  return self.raw_function(**d, **var_kwargs)

Catalog sucessfully hashed!
Datastore sucessfully written to /g/data/tm70/as2285/payu/MOM6-CICE6/archive/experiment_datastore.json!
Please note that this has not added the datastore to the access-nri-intake catalog.
To add to catalog, please run 'scaffold_catalog_entry' for help on how to do so.
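
One plausible reading of the output above is that files which cannot be parsed are found in the experiment directory but never reach the datastore, so a plain "is every found file in the datastore?" check fails on every run. The sketch below is a toy model of that comparison, not the access-nri-intake implementation: `datastore_is_current`, its parameters, and the file names are all hypothetical.

```python
def datastore_is_current(found_files, datastore_paths, known_invalid=frozenset()):
    """Return True when every discovered file is accounted for.

    Hypothetical helper for illustration only; it is not part of
    access-nri-intake and does not reproduce its logic.
    """
    # Files on disk that are neither catalogued nor already known to be unparseable
    missing = set(found_files) - set(datastore_paths) - set(known_invalid)
    return not missing


found = {"a.nc", "b.nc", "restart.bin"}  # everything in the experiment dir
catalogued = {"a.nc", "b.nc"}            # only parseable assets are catalogued
invalid = {"restart.bin"}                # recorded as unparseable at build time

# Without knowledge of invalid assets, the check always reports a mismatch
naive = datastore_is_current(found, catalogued)

# Remembering the invalid assets lets an unchanged experiment pass the check
tolerant = datastore_is_current(found, catalogued, known_invalid=invalid)

# A genuinely new file should still trigger a rebuild
with_new_file = datastore_is_current(found | {"c.nc"}, catalogued, known_invalid=invalid)
```

Under these assumptions, recording the list of invalid assets alongside the datastore would be one way to avoid the repeated rebuild while still detecting new output.
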
@charles-turner-1
Collaborator

Is this datastore likely to stay where it is? I'm thinking about adding it as an end-to-end test.

@anton-seaice
Collaborator Author

No - definitely not. There probably is a longer-term dataset which includes files not picked up by intake, but I don't know of one.

@marc-white
Collaborator

@charles-turner-1 just taking a quick look at this, I'm a bit confused as to how use_datastore got to the point of complaining about there being files missing - as far as I can tell, there isn't a manifest hash file in /g/data/tm70/as2285/payu/MOM6-CICE6/archive/, so shouldn't it have come up with the "No hash file found" warning?

@charles-turner-1
Collaborator

charles-turner-1 commented Feb 23, 2025

Aha - I made the hash files hidden, so ls -al will show it - I can see it in there.
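
For anyone checking this from a notebook rather than a shell, a minimal Python equivalent of looking for dot-files with ls -al might look like the following. The helper name `hidden_entries` and the file names in the demo are hypothetical; the real hash file name is not stated in this thread.

```python
import tempfile
from pathlib import Path


def hidden_entries(directory):
    """Return the sorted names of entries in `directory` that start with a dot."""
    return sorted(p.name for p in Path(directory).iterdir() if p.name.startswith("."))


# Demo in a throwaway directory; both file names are placeholders
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "experiment_datastore.json").touch()
    (Path(tmp) / ".example_hash").touch()
    names = hidden_entries(tmp)
```

Pointing `hidden_entries` at the archive directory would list any hidden files there, including a hash file if one exists.
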

Projects
Status: Backlog