Multiple datePublished values returned by schema_org_datePublished SparQL query #332

@iannesbitt

@datadavev and I noticed an issue when sending certain Pangaea schema.org documents (example, source) to Solr. It arises when datePublished is defined both at the root level and at the leaf level in workExample nodes. Judging by the bean below, the same problem seems to have come up several years ago for datePublished defined in hasPart nodes in addition to the root, and the bean already contains a filter that excludes those hasPart > datePublished values:

    <bean id="schema_org_datePublished" class="org.dataone.cn.indexer.annotation.SparqlField">
        <constructor-arg name="name" value="pubDate" />
        <constructor-arg name="query">
            <value>
                <![CDATA[
                    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                    PREFIX SO:   <http://schema.org/>

                    SELECT
                        ( str(?datePublished) as ?pubDate)
                    WHERE {
                        ?datasetId rdf:type SO:Dataset .
                        ?datasetId SO:datePublished ?datePublished
                        # Don't include referenced sub-Datasets (i.e. a Dataset in a 'hasPart' property)
                        FILTER NOT EXISTS { ?id SO:hasPart ?datasetId . }
                    }
                ]]>
            </value>
        </constructor-arg>
        <property name="converter" ref="dateConverter" />
    </bean>

This filter misses other nodes with datePublished, for example:

{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "datePublished": "2015-01-16",
  ...
  "workExample": {
    "@id": "ftp://ftp.bsrn.awi.de/man/man0810.dat.gz",
    "@type": [
      "CreativeWork",
      "Dataset"
    ],
    "additionalType": "dataset",
    "creator": {
      "@type": "Person",
      "email": "greg.sample@noaa.gov",
      "familyName": "Sample",
      "givenName": "Greg",
      "name": "Greg Sample"
    },
    "datePublished": "2015",
    "identifier": "ftp://ftp.bsrn.awi.de/man/man0810.dat.gz",
    "name": "BSRN Station-to-archive file for station Momote (2010-08)",
    "url": "ftp://ftp.bsrn.awi.de/man/man0810.dat.gz"
  }
}

For this document the query returns two dates, which are coerced to a list: [2015-01-16T00:00:00.000Z, 2015-01-01T00:00:00.000Z]. This is not allowed in the Solr schema.

A simple solution is to add another filter to exclude results from workExample nodes:

                    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                    PREFIX SO:   <http://schema.org/>

                    SELECT
                        ( str(?datePublished) as ?pubDate)
                    WHERE {
                        ?datasetId rdf:type SO:Dataset .
                        ?datasetId SO:datePublished ?datePublished
                        # Don't include referenced sub-Datasets (i.e. a Dataset in a 'hasPart' property)
                        FILTER NOT EXISTS { ?id SO:hasPart ?datasetId . }
+                       FILTER NOT EXISTS { ?id SO:workExample ?datasetId . }
                    }
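
To illustrate the effect of the extra filter without running the indexer, here is a minimal plain-Python sketch that approximates the query's selection logic on a trimmed-down version of the document above. It is not the indexer's code and does not execute SPARQL: `pub_dates` and `is_dataset` are hypothetical helpers, the root `"@type": "Dataset"` is an assumption (the original document elides it), and walking the JSON tree only approximates `FILTER NOT EXISTS` over the RDF graph.

```python
# Trimmed stand-in for the Pangaea document. The root "@type" is assumed.
doc = {
    "@type": "Dataset",
    "datePublished": "2015-01-16",
    "workExample": {
        "@type": ["CreativeWork", "Dataset"],
        "datePublished": "2015",
    },
}


def is_dataset(node):
    """True if the node's @type (string or list) includes Dataset."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "Dataset" in types


def pub_dates(node, excluded=("hasPart",), parent_prop=None):
    """Collect datePublished from Dataset nodes that were not reached
    through one of the excluded properties (hypothetical helper)."""
    found = []
    if is_dataset(node) and parent_prop not in excluded and "datePublished" in node:
        found.append(node["datePublished"])
    for prop, value in node.items():
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, dict):
                found.extend(pub_dates(child, excluded, prop))
    return found


# Current behaviour: only hasPart children are excluded.
before = pub_dates(doc)
# Proposed behaviour: workExample children are excluded as well.
after = pub_dates(doc, excluded=("hasPart", "workExample"))
print(before)
print(after)
```

With only `hasPart` excluded, the sketch collects both `2015-01-16` and `2015`; once `workExample` is excluded too, only the root value `2015-01-16` remains, which is the single value the pubDate field expects.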

I can open a PR with this addition.
