Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fetch_entity_names method #230

Merged
merged 7 commits into from
Apr 8, 2025

Conversation

jm-rivera
Copy link
Contributor

@jm-rivera jm-rivera commented Mar 31, 2025

This PR is part of a group of PRs which will bring some key features from the Data Commons website API to the client library (e.g #229, #231)

Fetch entity names: this is an implementation of a few DC website features:

  • The name endpoint of the internal API. /api/place/name
  • The internal code to fetch the english names from the name property.
  • The internal code to fetch the i18n name using the nameWithLanguage property (the website automatically resolves the user's locale, this method takes it as an argument)

In short, this PR:

  • Adds a fetch_entity_names method to NodeEndpoint. This method takes one or more entity_dcids and fetches their name. It defaults to English, but other languages (fr, es, etc) can be selected using the optional language argument. Optionally, users can request to fallback to a particular language if the requested language is not available (to avoid sparse results in some languages).

Example usage:

from datacommons_client import DataCommonsClient

dc = DataCommonsClient(dc_instance="datacommons.one.org")

names = dc.node.fetch_entity_names(
    entity_dcids=[
        "africa",
        "country/GTM",
        "country/USA",
        "wikidataId/Q2608785",
    ],
    language="de",
)


{'africa': 'Afrika',
 'country/GTM': 'Guatemala',
 'country/USA': 'Vereinigte Staaten'}

And with the optional fallback to english:

names = dc.node.fetch_entity_names(
    entity_dcids=[
        "africa",
        "country/GTM",
        "country/USA",
        "wikidataId/Q2608785",
    ],
    language="de",
    fallback_language="en",
)

{'africa': 'Afrika',
 'country/GTM': 'Guatemala',
 'country/USA': 'Vereinigte Staaten',
 'wikidataId/Q2608785': 'La Democracia'}

In the background, this method uses name or nameWithLanguage depending on what the user requests. name is much cheaper than nameWithLanguage since it only contains one value per entity. The method only requests and parses nameWithLanguage if a non-english language is selected.

A previous version used english as the fallback, available via a boolean. The latest version implements a more generic, Optional, fallback_language which takes a string (default to None for no fallback)

Copy link
Contributor

@dwnoble dwnoble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jorge!

I think this feature is useful, and is a cleaner implementation than we have on the website today.

But, there's still some ambiguity around:

  • If i request a particular language and have english as a fallback, how do i tell from the response if i'm seeing the requested language or the fallback?
  • Some smaller places may not have a "name" field today in the KG- example: https://datacommons.org/browser/wikidataId/Q4104190 . This is arguably a data issue - maybe in this example we should be setting "name" to the Bulgarian name, but the issue remains. I think on the website we return the DCID instead of the name, but im not sure if that's the best behavior.

What do you think?

- magic strings to constants
- change fallback logic to any language (not en boolean)
@jm-rivera
Copy link
Contributor Author

jm-rivera commented Apr 1, 2025

Thanks @dwnoble!

  • If i request a particular language and have english as a fallback, how do i tell from the response if i'm seeing the requested language or the fallback?

It's a very fair question. I don't have a simple suggestion... but can see two potential routes:

  • It's on my list to implement logging. Right now everything happens a bit too silently for my liking. So a solution could be logging a info or warning when the fallback is used.
  • we add an additional key to the response, which tracks any entities for which the fallback was used. That means we still keep a convenient mapping of entity_id -> name, but if the 'fallback' key is present (or if it has items, depending on how we implement), then the fallback was used for the countries contained in that list.

What do you think of either option? Or maybe you see another route?

  • Some smaller places may not have a "name" field today in the KG- example: https://datacommons.org/browser/wikidataId/Q4104190 . This is arguably a data issue - maybe in this example we should be setting "name" to the Bulgarian name, but the issue remains. I think on the website we return the DCID instead of the name, but im not sure if that's the best behavior.

In most of the REST API, if something doesn't exist it comes back empty. We don't currently infer or fill missing data on the client library side. I think I'd be in favour of keeping that behaviour consistent.

@jm-rivera jm-rivera requested a review from dwnoble April 4, 2025 06:23
Copy link
Contributor

@dwnoble dwnoble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dwnoble!

  • If i request a particular language and have english as a fallback, how do i tell from the response if i'm seeing the requested language or the fallback?

It's a very fair question. I don't have a simple suggestion... but can see two potential routes:

  • It's on my list to implement logging. Right now everything happens a bit too silently for my liking. So a solution could be logging a info or warning when the fallback is used.
  • we add an additional key to the response, which tracks any entities for which the fallback was used. That means we still keep a convenient mapping of entity_id -> name, but if the 'fallback' key is present (or if it has items, depending on how we implement), then the fallback was used for the countries contained in that list.

What do you think of either option? Or maybe you see another route?

  • Some smaller places may not have a "name" field today in the KG- example: https://datacommons.org/browser/wikidataId/Q4104190 . This is arguably a data issue - maybe in this example we should be setting "name" to the Bulgarian name, but the issue remains. I think on the website we return the DCID instead of the name, but im not sure if that's the best behavior.

In most of the REST API, if something doesn't exist it comes back empty. We don't currently infer or fill missing data on the client library side. I think I'd be in favour of keeping that behaviour consistent.

Agreed that things happen a bit too silently right now, and that our website api currently makes some assumptions that are probably not obvious to users. Extra logging sounds good, but I'd lean toward adding more metadata to the response. What do you think about something like this?

{
  "africa": {
    "value": "Afrika",
    "property": "nameWithLanguage", // Could be 'name' if English was requested or used as fallback
    "language": "de" // The actual language of the value returned
  },
  "wikidataId/Q2608785": { // Example using fallback
     "value": "La Democracia", // The fallback name
     "property": "name", // Assuming 'name' is English and was the fallback
     // Optional: Explicitly note the request vs result?
     // "requestedLanguage": "de"
  }
  // ... other entities
}

@jm-rivera
Copy link
Contributor Author

@dwnoble Here's an implementation (with additional tests) for the type of return you suggested.

Agreed that things happen a bit too silently right now, and that our website api currently makes some assumptions that are probably not obvious to users. Extra logging sounds good, but I'd lean toward adding more metadata to the response. What do you think about something like this?

It's definitely more explicit, and overall I think it's the right way to go. At first, I wanted something more directly usable for analysts who may want to quickly add names to the data they got. With this more complete dictionary, they would have to unpack {k: v['value'] for k,v in result.items()} to have a dictionary that directly takes the dcid and maps it to the name. It's a small inconvenience, but at least there would be no ambiguity.

Thanks for the suggestion, Dan!

@jm-rivera jm-rivera requested a review from dwnoble April 7, 2025 17:41
@dwnoble dwnoble closed this Apr 7, 2025
@dwnoble dwnoble reopened this Apr 7, 2025
@dwnoble
Copy link
Contributor

dwnoble commented Apr 7, 2025

Accidentally hit close- apologies :)

@jm-rivera jm-rivera requested a review from dwnoble April 7, 2025 19:13
@jm-rivera jm-rivera merged commit eb6adef into datacommonsorg:master Apr 8, 2025
2 checks passed
@jm-rivera jm-rivera deleted the add-fetch-names branch April 8, 2025 08:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants