Inconsistent results of materials query #964

Open
fxcoudert opened this issue Jan 25, 2025 · 9 comments · May be fixed by #974
Labels
bug Something isn't working

Comments

@fxcoudert

Python version

3.12.8

Pymatgen version

2025.1.24

Operating system version

macOS 15.2

Current behavior

The following code:

with MPRester(apikey) as mpr:
    mp_data = mpr.materials.summary.search(
        fields=["material_id", "deprecated", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"]
    )
    print("Number of materials found:", len(mp_data))
    print("Database version", mpr.get_database_version())

returns

Number of materials found: 178580
Database version 2024.12.18

Of these, 169385 are non-deprecated materials, as returned by:

> sum(1 for x in mp_data if not x.deprecated)

This is consistent with the number shown on the web portal. Good. But now consider this:

with MPRester(apikey) as mpr:
    mp_data = mpr.materials.summary.search(
        deprecated=False,
        fields=["material_id", "deprecated", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"]
    )
    print("Number of materials found:", len(mp_data))
    print("Database version", mpr.get_database_version())

It is exactly the same query, except that I ask for only non-deprecated materials by passing deprecated=False. But it now returns:

Number of materials found: 153902
Database version 2024.12.18

Expected Behavior

I expect the two routes to return the same number (and the same list) of non-deprecated materials.
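To see which materials differ between the two routes, and not only how many, the results can be compared by material_id. The following is a minimal sketch, not part of the original report: compare_routes is a hypothetical helper, and the stand-in documents only mimic the material_id and deprecated attributes of the returned documents.

```python
from types import SimpleNamespace


def compare_routes(all_docs, filtered_docs):
    """Return (IDs found only by client-side filtering, IDs found only server-side)."""
    # Route 1: fetch everything, then drop deprecated entries locally.
    client_side = {str(d.material_id) for d in all_docs if not d.deprecated}
    # Route 2: the result of passing deprecated=False to the search.
    server_side = {str(d.material_id) for d in filtered_docs}
    return client_side - server_side, server_side - client_side


# Stand-in documents (made-up IDs) so the sketch runs without an API key.
all_docs = [
    SimpleNamespace(material_id="mp-1", deprecated=False),
    SimpleNamespace(material_id="mp-2", deprecated=True),
    SimpleNamespace(material_id="mp-3", deprecated=False),
]
filtered_docs = [SimpleNamespace(material_id="mp-1", deprecated=False)]

only_client, only_server = compare_routes(all_docs, filtered_docs)
print(sorted(only_client), sorted(only_server))
```

With real data, all_docs and filtered_docs would be the mp_data lists from the two queries above.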

Minimal example

Relevant files to reproduce this bug

No response

@fxcoudert fxcoudert added the bug Something isn't working label Jan 25, 2025
@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Jan 28, 2025

Hi @fxcoudert! Just to check --- was this meant for pymatgen or for https://github.com/materialsproject/api? I want to make sure you get the quickest feedback possible.

CC @tschaume in case it's relevant to him.

@tschaume tschaume transferred this issue from materialsproject/pymatgen Jan 28, 2025
@tschaume
Member

Thanks @Andrew-S-Rosen! The difference is due to the new ~15k GNoMe materials, which are included in the API response if the user has accepted the GNoMe terms on the website. You can set include_gnome=False in mpr.materials.summary.search() to exclude GNoMe materials regardless of whether the terms have been accepted. HTH
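A minimal sketch of how the include_gnome flag mentioned above could be used to compare counts with and without GNoMe materials. The helper name count_non_deprecated is hypothetical; the search arguments are the ones quoted in this thread, and the commented usage assumes MPRester is imported from mp_api.client and that apikey holds a valid key.

```python
FIELDS = ["material_id", "deprecated"]


def count_non_deprecated(mpr, include_gnome):
    """Count non-deprecated summary documents, with or without GNoMe materials."""
    docs = mpr.materials.summary.search(
        deprecated=False,
        include_gnome=include_gnome,  # flag named in the comment above
        fields=FIELDS,
    )
    return len(docs)


# Usage (needs network access and an API key, so it is left commented out):
# from mp_api.client import MPRester
# with MPRester(apikey) as mpr:
#     print("with GNoMe:   ", count_non_deprecated(mpr, include_gnome=True))
#     print("without GNoMe:", count_non_deprecated(mpr, include_gnome=False))
```

If the two counts differ by roughly the number of GNoMe materials, that would suggest the terms have been accepted for the account in use.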

@Andrew-S-Rosen
Member

The all-knowing Patrick has spoken!!

@fxcoudert
Author

fxcoudert commented Jan 28, 2025

Thanks @tschaume. I don't know if I have accepted the new terms or not, but what I am sure of is that both queries were made at the same time, with the same function. So whether the terms were accepted or not, shouldn't the numbers be consistent? (169k in the first case, 154k in the second case)

PS: how can I check whether I have accepted the new terms or not? I can't seem to find the information in the dashboard for my account.

Re. @Andrew-S-Rosen: I have no idea if it is an API bug or a pymatgen bug. I have only queried through the pymatgen functions and have not tried calling the API directly from other code.

@tschaume
Member

tschaume commented Mar 4, 2025

@fxcoudert I agree we should and can make this a lot more transparent. If you see the group TERMS:ACCEPT-NC listed under "Groups" on your dashboard, you've accepted the non-commercial terms for GNoMe and should be able to access the GNoMe explorer.

@yang-ruoxi We might have to add an explicit line on the dashboard that indicates whether the user has accepted the GNoMe terms or not.

@tsmathis Would you mind taking a look at the example code in this issue and see if you can reproduce it? We might have to double-check the deprecated fields for the GNoMe data. Thanks!

@tsmathis

tsmathis commented Mar 4, 2025

@tschaume, the results here are reproducible. This is a side effect of user group access control behavior mixing with bulk download behavior in the client. I'll link you to my Slack messages from when I investigated this a little while ago; we can discuss from there.

@tschaume tschaume linked a pull request Mar 5, 2025 that will close this issue
@tschaume
Member

tschaume commented Mar 5, 2025

@fxcoudert I started PR #974 to address this inconsistency. It's still a work in progress and will need some data reorg on our end. We're hoping we can get this out with our next data release.

@fxcoudert
Author

@tschaume quick question: once I have run a query and gotten structures back, how can I identify whether a specific structure is in the GNoMe dataset or not? I thought it would be somewhere in the metadata, for example as struct.builder_meta.license, but that one always has the value 'BY-C' (which is actually weird, because it's not a valid license code?)

@tschaume
Member

tschaume commented Mar 6, 2025

@fxcoudert both the builder_meta.batch_id and builder_meta.license fields in a SummaryDoc will help with that. The batch_id for GNoMe materials is gnome_r2scan_statics and its license is BY-NC. The two license options BY-C and BY-NC refer to the Creative Commons licenses. HTH
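A minimal sketch of how the batch_id described above could be used to separate GNoMe entries from the rest of a query result. The helpers is_gnome and split_by_gnome are hypothetical names, the stand-in documents only mimic the builder_meta attributes, and a real query would presumably need "builder_meta" in its fields list for these attributes to be populated (an assumption, not confirmed in this thread).

```python
from types import SimpleNamespace

GNOME_BATCH_ID = "gnome_r2scan_statics"  # batch_id quoted in the comment above


def is_gnome(doc):
    """True if the document's builder_meta.batch_id marks it as a GNoMe material."""
    meta = getattr(doc, "builder_meta", None)
    return getattr(meta, "batch_id", None) == GNOME_BATCH_ID


def split_by_gnome(docs):
    """Split documents into (GNoMe, non-GNoMe) lists, preserving order."""
    gnome = [d for d in docs if is_gnome(d)]
    other = [d for d in docs if not is_gnome(d)]
    return gnome, other


# Stand-in documents with made-up IDs; real ones would come from the search.
docs = [
    SimpleNamespace(material_id="mp-1",
                    builder_meta=SimpleNamespace(batch_id=None, license="BY-C")),
    SimpleNamespace(material_id="mp-2",
                    builder_meta=SimpleNamespace(batch_id=GNOME_BATCH_ID, license="BY-NC")),
]
gnome, other = split_by_gnome(docs)
print(len(gnome), len(other))
```

Checking the license field for BY-NC would be an alternative, but the batch_id is the more specific of the two markers mentioned.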


4 participants