@datadavev and I noticed an issue when sending certain Pangaea schema.org documents (example, source) to Solr. It arises when datePublished is defined both at the root level and in nested workExample nodes. Judging by the bean below, the same problem was apparently encountered several years ago for datePublished values in hasPart nodes, and the bean already excludes these hasPart > datePublished values:
<bean id="schema_org_datePublished" class="org.dataone.cn.indexer.annotation.SparqlField">
<constructor-arg name="name" value="pubDate" />
<constructor-arg name="query">
<value>
<![CDATA[
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX SO: <http://schema.org/>
SELECT
( str(?datePublished) as ?pubDate)
WHERE {
?datasetId rdf:type SO:Dataset .
?datasetId SO:datePublished ?datePublished
# Don't include referenced sub-Datasets (i.e. a Dataset in a 'hasPart' property)
FILTER NOT EXISTS { ?id SO:hasPart ?datasetId . }
}
]]>
</value>
</constructor-arg>
<property name="converter" ref="dateConverter" />
</bean>
This filter misses other nested nodes that define datePublished, such as workExample. For example:
{
"@context": {
"@vocab": "https://schema.org/"
},
"datePublished": "2015-01-16",
...
"workExample": {
"@id": "ftp://ftp.bsrn.awi.de/man/man0810.dat.gz",
"@type": [
"CreativeWork",
"Dataset"
],
"additionalType": "dataset",
"creator": {
"@type": "Person",
"email": "greg.sample@noaa.gov",
"familyName": "Sample",
"givenName": "Greg",
"name": "Greg Sample"
},
"datePublished": "2015",
"identifier": "ftp://ftp.bsrn.awi.de/man/man0810.dat.gz",
"name": "BSRN Station-to-archive file for station Momote (2010-08)",
"url": "ftp://ftp.bsrn.awi.de/man/man0810.dat.gz"
}
}
Indexing this document yields two dates, coerced to a list: [2015-01-16T00:00:00.000Z, 2015-01-01T00:00:00.000Z], which is not allowed in the Solr schema.
A simple solution is to add another filter to exclude results from workExample nodes:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX SO: <http://schema.org/>
SELECT
( str(?datePublished) as ?pubDate)
WHERE {
?datasetId rdf:type SO:Dataset .
?datasetId SO:datePublished ?datePublished
# Don't include referenced sub-Datasets (i.e. a Dataset in a 'hasPart' property)
FILTER NOT EXISTS { ?id SO:hasPart ?datasetId . }
+ FILTER NOT EXISTS { ?id SO:workExample ?datasetId . }
}
I can open a PR with this addition.
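To illustrate the effect of the extra filter, here is a minimal Python sketch that mimics the query logic over a hand-written set of triples for the example document above. It is not the indexer's code: the node label "root", the helper pub_dates, and the assumption that the root node is typed as a Dataset (that part of the example is elided) are all hypothetical, and the real query runs against an RDF graph.

```python
# Hypothetical, simplified triples for the example document.
# "root" is a made-up label for the top-level node, assumed to be a Dataset.
RDF_TYPE = "rdf:type"
PART = "ftp://ftp.bsrn.awi.de/man/man0810.dat.gz"

triples = {
    ("root", RDF_TYPE, "SO:Dataset"),
    ("root", "SO:datePublished", "2015-01-16"),
    ("root", "SO:workExample", PART),
    (PART, RDF_TYPE, "SO:CreativeWork"),
    (PART, RDF_TYPE, "SO:Dataset"),
    (PART, "SO:datePublished", "2015"),
}


def pub_dates(triples, excluded_predicates):
    """Return datePublished values of Dataset nodes that are not the
    object of any predicate in excluded_predicates (the NOT EXISTS filters)."""
    datasets = {s for s, p, o in triples if p == RDF_TYPE and o == "SO:Dataset"}
    referenced = {o for s, p, o in triples if p in excluded_predicates}
    return sorted(
        o
        for s, p, o in triples
        if p == "SO:datePublished" and s in datasets and s not in referenced
    )


# Existing filter: only hasPart children are excluded, so both dates match.
current = pub_dates(triples, {"SO:hasPart"})
# Proposed filter: workExample children are excluded as well.
proposed = pub_dates(triples, {"SO:hasPart", "SO:workExample"})

print(current)   # ['2015', '2015-01-16']
print(proposed)  # ['2015-01-16']
```

With only the hasPart exclusion, both the root and the workExample node match, which corresponds to the two-element list above. Adding workExample to the exclusions leaves just the root date.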