Determine whether DATS Identifiers must always be URIs. #8

Open
jonathancrabtree opened this issue Jun 27, 2018 · 7 comments

@jonathancrabtree
Contributor

jonathancrabtree commented Jun 27, 2018

The following three DATS schemas all define "identifier" as having format "uri":

  • identifier_info_schema.json
  • alternate_identifier_info_schema.json
  • related_identifier_info_schema.json

However, in https://github.com/biocaddie/WG3-MetadataSpecifications/blob/master/json-instances/Uniprot-P77967.json there are identifiers like these:

{
"identifier": "http://identifiers.org/uniprot/P77967",
"identifierSource": "http://identifiers.org"
},
{
"identifier": "BA000022",
"identifierSource": "embl"
},
{
"identifier": "S74805",
"identifierSource": "PIR"
},
{
"identifier": "1NP7",
"identifierSource": "PDBsum"
},

I was under the impression that valid URIs have to specify a scheme, followed by ":", and that isn't the case for any of these ids except the very first one. What's the story here? Am I misunderstanding the meaning of "uri" or is this not DATS-compliant? I believe it does validate although the validator probably isn't checking the string format.
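
For illustration only (not part of the DATS validator), here is a minimal sketch of the scheme check described above. It assumes that "has a scheme followed by a colon" is a reasonable first approximation of "is a URI"; the helper name `has_uri_scheme` is hypothetical, and the pattern follows the scheme grammar in RFC 3986.

```python
import re

# RFC 3986 scheme: a letter, then letters, digits, "+", "-" or ".", then ":".
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def has_uri_scheme(identifier: str) -> bool:
    """Return True if the string starts with an RFC 3986 scheme and a colon."""
    return SCHEME_RE.match(identifier) is not None


# Identifiers taken from the Uniprot-P77967.json excerpt above.
identifiers = [
    "http://identifiers.org/uniprot/P77967",
    "BA000022",
    "S74805",
    "1NP7",
]
results = {i: has_uri_scheme(i) for i in identifiers}
for identifier, ok in results.items():
    print(identifier, "->", "has scheme" if ok else "no scheme")
```

This only tests for a scheme prefix, so it is weaker than full RFC 3986 validation: any string of the form `prefix:local` would pass.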

@agbeltran
Contributor

It seems that the validator is not picking up the "format": "uri" option; it should accept only URIs that conform to RFC 3986. I will check the implementation of the validator package. However, we would probably need to expand identifier to support any string: while the recommendation is to use URIs whenever available, datasets and other entities do not always have URIs as identifiers. Thanks for checking.

@jonathancrabtree
Contributor Author

OK, that sounds like a reasonable compromise. I wonder if it wouldn't be even better to support both a string id and a URI id, although the problem with that is that it wouldn't be backwards-compatible unless the string-only option is retained.

I had initially created the DATS JSON for the mouse reference genome using only string ids but then realized the URI restriction and had to go back and update them all. But as I was doing so I couldn't help but think it might be better to use the "raw" ids and leave it up to the DATS consumer to find the corresponding URIs if needed. My main concern with using the URI as the only id in the DATS is that most data sources do not appear to consider the URI the primary id, meaning that if you want to extract the "real" id you have to parse the URL, however trivial that operation might be. A corollary of this observation is that there's some danger that the URIs in my DATS JSON will change, particularly in cases where I simply went to the data source web site and looked to see what URL popped up when I did a search with the primary identifier.
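
To make the "parse the URL" step concrete, here is a minimal sketch. It assumes the primary id is the last path segment of the URI, which holds for the identifiers.org example earlier in this thread but is not guaranteed for every data source; `local_id` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def local_id(uri: str) -> str:
    """Return the last path segment of a URI, assumed to be the source's own id."""
    path = urlsplit(uri).path
    # Drop any trailing slash, then keep what follows the final "/".
    return path.rstrip("/").rsplit("/", 1)[-1]


print(local_id("http://identifiers.org/uniprot/P77967"))  # prints P77967
```

URLs that carry the id in a query string rather than in the path would need source-specific handling, which is part of the concern raised above.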

@agbeltran
Contributor

The original intention was to support both: string ID and URI ID, given that sometimes there are accession numbers with no URI ID and in other cases, there are URI IDs.

The problem with the validator is that it requires the rfc3987 library, which I hadn't realized before. As the validator wasn't really testing against the URI format, I think it would be fine to fix the schemas to accept both strings and URIs without affecting the current implementations.

I will make the changes in the schemas and push the fix to the validator.

@cmungall

Not sure if I should start a new ticket. I notice in the MGI JSON:

    "identifier": {
      "@type": "Identifier",
      "@id": "",
      "identifier": "http://www.informatics.jax.org/marker/MGI:5589833",
      "identifierSource": "MGI"
    },
    "alternateIdentifiers": [
      {
        "@type": "AlternateIdentifier",
        "@id": "",
        "identifier": "https://www.ncbi.nlm.nih.gov/gene/?term=102632652",
        "identifierSource": "NCBI_Gene"
      },
      {
        "@type": "AlternateIdentifier",
        "@id": "",
        "identifier": "http://www.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000110943",
        "identifierSource": "ENSEMBL"
      }

If URIs are to play the role of identifiers, then it's important for URIs produced by different systems to match when they denote the same entity, and this requires a consistent way of rendering the ID as a URL; HTTP params like the one you have in http://www.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000110943 are a bad smell here.

Some alternate ways to write this URL:

From a perspective of resolution of web pages, it's not super-important which is used, but for URIs as identifiers we need to pick one.

An alternative is to use the CURIE in the JSON and have this expand via the JSON-LD context.

JSON-LD contexts for ID expansion here: https://github.com/prefixcommons/biocontext

For GO we are standardizing on http://identifiers.org/ensembl/ENSMUSG00000110943 type URIs in our RDF. @jmcmurry has advocated for http://identifiers.org/ENSEMBL:ENSMUSG00000110943 but this is not well documented on the identifiers.org site and there may potentially be problems with colons in URIs. As a community we need to coalesce around a standard for our RDF/JSON-LD to link up.
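
As a rough illustration of CURIE expansion, here is a minimal sketch that mimics what a JSON-LD context does with a prefix map. The `PREFIXES` dictionary and the `expand_curie` helper are assumptions made for this example, not the contents of the biocontext repository; the URI form produced is the slash-separated identifiers.org style mentioned above.

```python
# Hypothetical prefix map in the spirit of a JSON-LD context (assumption).
PREFIXES = {
    "ensembl": "http://identifiers.org/ensembl/",
    "uniprot": "http://identifiers.org/uniprot/",
}


def expand_curie(curie: str, prefixes: dict = PREFIXES) -> str:
    """Expand 'prefix:local' into a full URI using the prefix map."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"no known prefix in {curie!r}")
    return prefixes[prefix] + local


print(expand_curie("ensembl:ENSMUSG00000110943"))
# prints http://identifiers.org/ensembl/ENSMUSG00000110943
```

With a shared prefix map, two systems that store the same CURIE expand it to the same URI, which is the matching property asked for above.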

@proccaserra
Contributor

@cmungall this is correct. In the first examples, we didn't focus on this aspect. Identifiers.org or n2t would be the way to go as these have been adopted by dcppc. This is an implementation decision we (DCPPC) would need to agree on indeed.

@jmcmurry

jmcmurry commented Jul 3, 2018

If you really don't want colons, stick with http://identifiers.org/ensembl/ENSMUSG00000110943

@agbeltran
Contributor

Just a note to indicate that the identifier-related schemas have been relaxed to support any string and that the validator code now checks against URI constraints correctly.

5 participants