
error on import-synda for existing db #1

Open
AtefBN opened this issue Apr 11, 2023 · 2 comments

Comments

@AtefBN
Collaborator

AtefBN commented Apr 11, 2023

`(esgpull) -bash-4.2$ esgpull self import-synda /gpfscmip/gpfsdata/esgf/synda-cmn/db/CMIP5/sdt.db
Found 810229 files to import, proceed? [y/n]: y
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--AttributeError: 'NoneType' object has no attribute 'upper'
See /gpfscmip/gpfsdata/esgf/esgpull1/log/esgpull-import_synda-2023-04-11_08-46-39.log for error log.
Aborted!
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
(esgpull) -bash-4.2$ cat /gpfscmip/gpfsdata/esgf/esgpull1/log/esgpull-import_synda-2023-04-11_08-46-39.log
[2023-04-11 10:46:46] DEBUG root
Locals:
{
'self': SyndaFile(
file_id=297,
url='http://aims3.llnl.gov/thredds/fileServer/cmip5_css01_data/cmip5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella/areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
file_functional_id='cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.v20130314.areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
filename='areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
local_path='CMIP5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella/areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
data_node='aims3.llnl.gov',
checksum=None,
checksum_type=None,
duration=None,
size=42760,
rate=None,
start_date=None,
end_date=None,
crea_date='2020-11-03 14:44:18.596992',
status='done',
error_msg=None,
sdget_status=None,
sdget_error_msg=None,
priority=1000,
tracking_id='7181939e-4b39-4eaf-a4be-85eae5b5a9e9',
model='FGOALS-g2',
project='CMIP5',
variable='areacella',
last_access_date=None,
dataset_id=97,
insertion_group_id=1,
timestamp='2013-03-12T17:25:11Z'
),
'file_id': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.v20130314.areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
'dataset_id': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.v20130314',
'dataset_master': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0',
'version': 'v20130314',
'master_id': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
'url': 'https://aims3.llnl.gov/thredds/fileServer/cmip5_css01_data/cmip5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella/areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
'local_path': 'CMIP5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella'
}
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--

[2023-04-11 10:46:46] ERROR root

Traceback (most recent call last):
File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/tui.py", line 154, in logging
yield
File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/cli/self.py", line 235, in import_synda
nb_imported = esg.import_synda(url=path, track=True, ask=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/esgpull.py", line 227, in import_synda
file = synda_file.to_file()
^^^^^^^^^^^^^^^^^^^^
File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/models/synda_file.py", line 64, in to_file
checksum_type=self.checksum_type.upper(),
^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'upper'
`

@svenrdz
Collaborator

svenrdz commented Apr 12, 2023

It seems that this file (filename = areacella_fx_FGOALS-g2_lgm_r0i0p0.nc) has neither a checksum nor a checksum_type attribute in the synda CMIP5 database you are importing, and esgpull currently requires both.
The query run by esgpull on the ESGF search API confirms that both attributes are also missing from the file's metadata: https://esgf-node.ipsl.upmc.fr/esg-search/search?type=File&offset=0&limit=1&format=application%2Fsolr%2Bjson&fields=%2A&query=title%3Aareacella_fx_FGOALS-g2_lgm_r0i0p0.nc&distrib=true&latest=true&retracted=false
I am guessing synda used the same query to fill its database at the time this file was added.
Now if I increase the limit parameter for this query (numFound tells us 4 replicas exist in this case), checksum and checksum_type do exist in the next 3 replicas' metadata.

Knowing this, 2 things could be done during import to handle missing information:

  • look up the metadata from all replicas for each incomplete file and use the most complete one, but I don't know if we can guarantee there will always be at least one index node with the full metadata,
  • skip files with incomplete metadata from the imported database

The first solution looks more complete, but it could seriously slow down the import and still does not guarantee that the missing info gets filled in. The second is easy to set up, but it will definitely introduce divergence between the filesystem and the database, since skipped files stay on disk without being tracked.
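To make the two options concrete, here is a minimal sketch of the selection logic only, not esgpull code. The names `REQUIRED_FIELDS`, `is_complete` and `pick_replica` are hypothetical, the replica records are made up, and storing field values as lists is an assumption about how the search results are laid out. It returns the first replica with both fields, or `None` when every replica is incomplete, in which case the caller would fall back to skipping the file.

```python
from typing import Any, Optional

# Fields that the import needs, per the discussion above.
REQUIRED_FIELDS = ("checksum", "checksum_type")


def is_complete(doc: dict[str, Any]) -> bool:
    """True if every required field is present and non-empty."""
    return all(doc.get(field) for field in REQUIRED_FIELDS)


def pick_replica(docs: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return the first replica with complete metadata, or None to signal 'skip'."""
    for doc in docs:
        if is_complete(doc):
            return doc
    return None


# Hypothetical replica records; real search results carry many more fields.
replicas = [
    {"data_node": "node-a.example.org"},  # checksum fields missing
    {
        "data_node": "node-b.example.org",
        "checksum": ["abc123"],
        "checksum_type": ["SHA256"],
    },
]
chosen = pick_replica(replicas)
```

The cost concern still applies: fetching all replicas for each incomplete file means one extra search request per such file, which this sketch does not address.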

@meteorologist15

I also encountered this error. My solution was simply to skip the files that were missing metadata. Since only a few files were affected, this was an acceptable loss for me. I added a try/except block that logs a little extra information and keeps the error from halting the program. I may submit a pull request soon with my proposed code changes:

In esgpull.py:

        nb_imported = 0
        for start in iter_idx_range:
            stop = min(len(synda_ids), start + size)
            ids = synda_ids[start:stop]
            synda_files = synda.scalars(sql.synda_file.with_ids(*ids))
            files: list[File] = []
            for synda_file in synda_files:
                try:
                    file = synda_file.to_file()
                except AttributeError as e:
                    logger.warning(e)
                    warn_msg = f"Skipping {synda_file.filename} due to missing database metadata. Continuing to the next file"
                    print(warn_msg)
                    logger.warning(warn_msg)
                    continue

                if file.sha not in shas:
                    file.queries.append(self.legacy_query)
                    files.append(file)
                    synda_shas.add(file.sha)
            if files:
                nb_imported += len(files)
                self.db.add(*files)
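One possible variant, sketched under assumptions: catching `AttributeError` around the whole conversion could also hide unrelated bugs, so the check below tests the two fields explicitly before converting. `Record` and `split_importable` are hypothetical stand-ins, not esgpull's `SyndaFile` or any existing function. Returning the skipped records lets the caller report exactly which files ended up outside the database.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class Record:
    """Hypothetical stand-in for a synda row; not esgpull's SyndaFile."""

    filename: str
    checksum: Optional[str]
    checksum_type: Optional[str]


def split_importable(records: list[Record]) -> tuple[list[Record], list[Record]]:
    """Separate records with full checksum metadata from those to skip."""
    importable: list[Record] = []
    skipped: list[Record] = []
    for record in records:
        if record.checksum is None or record.checksum_type is None:
            # Log and keep going instead of aborting the whole import.
            logger.warning("Skipping %s: missing checksum metadata", record.filename)
            skipped.append(record)
        else:
            importable.append(record)
    return importable, skipped


# Made-up rows for illustration.
rows = [
    Record("a.nc", "abc123", "sha256"),
    Record("b.nc", None, None),
]
importable, skipped = split_importable(rows)
```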
