Skip to content

sh:pattern does not allow testing IRIs #230

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
afs opened this issue Feb 5, 2025 · 17 comments
Open

sh:pattern does not allow testing IRIs #230

afs opened this issue Feb 5, 2025 · 17 comments
Labels
Core For SHACL 1.2 Core spec Errata Fixes for acknowledged errors in the specifications after publication

Comments

@afs
Copy link
Contributor

afs commented Feb 5, 2025

sh:pattern is defined by:

The values of sh:pattern in a shape are valid pattern arguments for the SPARQL REGEX function.

Right. And the first argument is a defined as literal.

# this is valid
BIND(regex(str(<>), 'foo') as ?foo)

# but this is not
BIND(regex(<>, 'foo') as ?foo)

Originally posted by @tpluscode in #228

Isn't this highlighted line a test of sh:pattern against an IRI ...

and this line the corresponding XFAIL result?

Originally posted by @ajnelson-nist in #228

@afs afs added Core For SHACL 1.2 Core spec Errata Fixes for acknowledged errors in the specifications after publication labels Feb 5, 2025
ajnelson-nist added a commit that referenced this issue Feb 5, 2025
Per discussion on Issues 228 and 230.

References:
* #228
* #230

Signed-off-by: Alex Nelson <[email protected]>
@afs afs pinned this issue Feb 5, 2025
@ajnelson-nist
Copy link
Contributor

PR #231 opened for group review and discussion.

@VladimirAlexiev
Copy link
Contributor

Like @tfrancart I was always certain that pattern can test any nodes.
No fix is needed to its definition since "valid pattern arguments" refers to the value of sh:pattern , I.e. to the regex (second argument of regex())

But a clarification is needed about the first argument

  • it is converted to string
  • (new) it is not recommended to use sh:pattern for numbers, dates or booleans since various different lexical representations can map to the same value. Eg all these pairs designate the same value:
    • 001 and 1 (xsd:integer)
    • 0.1 and 0.100 (xsd:decimal)
    • 0.1 and 0.100000009 (xsd:float)
    • 2025-01-31 and 002025-01-31 (xsd:date)
  • (new) it is not recommended to use sh:pattern for blank nodes, unless you know they were generated using specified blank node names

@tpluscode
Copy link
Contributor

  • (new) it is not recommended to use sh:pattern for blank nodes, unless you know they were generated using specified blank node names

This is surely not a good idea. Blank node labels are almost certain to change any time a graph is parsed or transfered.

IMO, the only meaningful improvement appears to be the addition of explicit handling of IRIs. I would represent this in SPARQL as

BIN(REGEX(IF(ISIRI(?input), STR(?input), ?input), ?pattern) as ?five)

@ajnelson-nist
Copy link
Contributor

  • (new) it is not recommended to use sh:pattern for blank nodes, unless you know they were generated using specified blank node names

This is surely not a good idea. Blank node labels are almost certain to change any time a graph is parsed or transfered.

On the other hand, I could see some use cases involving documentation quality control where this would be desirable. I'm not sure if supporting that just makes a huge foot-cannon, though.

@afs
Copy link
Contributor Author

afs commented Feb 5, 2025

If Core is being unlinked from SPARQL, it will be odd to continue to rely on STR(). I can't find any using of "lexical form" in SHACL-Core 1.0 - it does use the terminolgy "string representation", quite often with "(as defined by the SPARQL str function)" so make that a definition then clear-up.

(I hope there will continue to be a WG-approved (non-normative) list of constraint as SPARQL. An appendix?)

@HolgerKnublauch
Copy link
Contributor

The intention was always to allow sh:pattern to be used on value nodes that are IRIs.

I see nothing wrong with the spec. It states that the values of sh:pattern must be strings (which they are) and that the value nodes need to be converted to their STR.

What specifically needs to be changed in the spec? (We can of course always add a non-normative clarification to explain that IRIs are fine).

I think it's fine to continue to reference definitions from SPARQL in Core, if that makes the contracts clear, easier to implement and avoids having to reinvent things.

@afs
Copy link
Contributor Author

afs commented Feb 6, 2025

What specifically needs to be changed in the spec? (We can of course always add a non-normative clarification to explain that IRIs are fine).

I think it's fine to continue to reference definitions from SPARQL in Core, if that makes the contracts clear, easier to implement and avoids having to reinvent things.

Its the "1.1" effect! What was fine in a 1.0 needs refining as it becomes successful.

Imagine implementing the core profile on its own.

string representation should be <dfn> defined once and then used (all places). 12
For definitions of a constraint, the doc should be using <a>string representation</a> and referring to what it cares about.
All the "(as defined by the SPARQL str function)" can then be replaced.

https://www.w3.org/TR/xpath-functions/#func-matches
https://www.w3.org/TR/xpath-functions/#regex-syntax

Clarity and implementation help are non-normative (as are tests as far as definitions are concerned).

Footnotes

  1. Details matter! What about surrogates? lone or in pairs?

  2. https://www.w3.org/TR/rdf12-concepts/#dfn-rdf-string

@VladimirAlexiev
Copy link
Contributor

@tpluscode

Blank node labels are almost certain to change any time a graph is parsed or transfered.

Yes. But you can fix the name in CONSTRUCT or UPDATE.
Anyway, I also agree that checking for specific bnode names is a bad practice.

the only meaningful improvement appears to be the addition of explicit handling of IRIs.

So do you want to FORBID it for numbers and dates?

@tpluscode
Copy link
Contributor

@tpluscode

Blank node labels are almost certain to change any time a graph is parsed or transfered.

Yes. But you can fix the name in CONSTRUCT or UPDATE. Anyway, I also agree that checking for specific bnode names is a bad practice.

the only meaningful improvement appears to be the addition of explicit handling of IRIs.

So do you want to FORBID it for numbers and dates?

Not sure, really. Matching the plain literal value can just work is most cases. Dates are indeed in a bit better position because you're less likely to have malformed value, or something that isn't ISO at all. With plain str() there is always the risk of unexpected results but maybe it's not reason enough to forbid sh:pattern.

@bergos
Copy link
Contributor

bergos commented Feb 7, 2025

I don't see why we should exclude blank nodes. Generally, I would also consider it bad practice to match against blank nodes, but there could be specific cases where that may make sense. If we don't support blank nodes, the expected behavior would be that any attempt to match against a blank node should generate a failed validation result. That means any implementation that adds support for it would be no longer according to the specification.

I propose to add comment that one may run into issues trying to do pattern match against blank nodes. And if one wants to match only against specific node types, the sh:nodeKind constraint should be used.

@afs
Copy link
Contributor Author

afs commented Feb 7, 2025

It would be a change in behavior from SHACL 1.0 - i.e. a SHACL 1.0 processor would behave differently to a SHACL 1.2 processor for the same input so if blank nodes are now allowed, it needs to be noted in the spec.

Personally, I think support should at most be optional. If a SHACL processor can't support it/chooses not to support it, then that should not mean it can be described as "non conformant". In spec speak "MAY", not "MUST".

To support it in Jena would require data graphs to be read with special label preserving. Blank node labels are only valid during parsing and not normally kept. To store additional label-as-seen information would be a database format change.

(ShEx has blank node label matching. Jena does not run the tests for that feature.)

@gkellogg
Copy link
Member

gkellogg commented Feb 7, 2025

It would be a change in behavior from SHACL 1.0 - i.e. a SHACL 1.0 processor would behave differently to a SHACL 1.2 processor for the same input so if blank nodes are now allowed, it needs to be noted in the spec.

Personally, I think support should at most be optional. If a SHACL processor can't support it/chooses not to support it, then that should not mean it can be described as "non conformant". In spec speak "MAY", not "MUST".

To support it in Jena would require data graphs to be read with special label preserving. Blank node labels are only valid during parsing and not normally kept. To store additional label-as-seen information would be a database format change.

(ShEx has blank node label matching. Jena does not run the tests for that feature.)

Blank Node identifiers/labels are often misunderstood to be part of the blank node – they’re not, they are only a way to talk about a given blank node within a concrete serialization. The only way to reference/talk about a blank node within another graph is indirectly by describing it in relation to nodes that can be remotely identified. I think it’s a mistake for a spec to pretend that blank node identifiers are stable.

@bergos
Copy link
Contributor

bergos commented Feb 8, 2025

It would be a change in behavior from SHACL 1.0 - i.e. a SHACL 1.0 processor would behave differently to a SHACL 1.2 processor for the same input so if blank nodes are now allowed, it needs to be noted in the spec.

I thought this discussion popped up as people have different interpretations of what is currently written in the spec, and therefore, a new definition will be created. I'm OK with keeping it like it is now. I think it's very clear.

Personally, I think support should at most be optional. If a SHACL processor can't support it/chooses not to support it, then that should not mean it can be described as "non conformant". In spec speak "MAY", not "MUST".

If there should be a new definition, I would be for opening it for blank nodes. Optional is OK for me. But we should face reality, and blank nodes are not always used as intended. At some point, I may have to deal with a PR that either makes shacl-engine spec-incompatible or introduces a flag just for that case.

Blank Node identifiers/labels are often misunderstood to be part of the blank node – they’re not

I don't know the current situation, but about 2 years ago, the default config of Stardog kept the blank node identifiers of the parser in the store. That led to name clashes very quickly when importing two or more N-Triples files. I think that's clearly wrong, and if they want to support it, it should not be the default, but that's the reality of how blank nodes are used in some cases.

@gkellogg
Copy link
Member

gkellogg commented Feb 8, 2025

Blank Node identifiers/labels are often misunderstood to be part of the blank node – they’re not

I don't know the current situation, but about 2 years ago, the default config of Stardog kept the blank node identifiers of the parser in the store. That led to name clashes very quickly when importing two or more N-Triples files. I think that's clearly wrong, and if they want to support it, it should not be the default, but that's the reality of how blank nodes are used in some cases.

Yes, but we’re writing specifications, not documenting implementation behavior. In RDF Concepts, blank nodes do not have persistent labels. RDF.rb also retains labels, but needs to do more work when graphs merge to avoid identifier collisions. A Note, or similar, on how implementations might extend the spec to handle persistent identifiers would be reasonable, but I don’t think the spec should call it an optional feature; that’s just promoting bad practice.

@bergos
Copy link
Contributor

bergos commented Feb 9, 2025

OK, it looks like there was a lot of noise, but maybe that's because we mixed the topics of the definition of sh:pattern and replacing references to the SPARQL specification without our own definitions. I created a separate issue for that topic: #235

Can we vote on the topic sh:pattern? 👍 to keep the current definition, 👎 to change it.

@bergos
Copy link
Contributor

bergos commented Feb 9, 2025

Regarding the uniform formatting of numbers or dates, I would propose the usage of Node Expressions:

[
  sh:path [
      xsd:dateTime [
          sh:path ex:date
        ]
    ];
  sh:pattern "..."
].

@VladimirAlexiev
Copy link
Contributor

  • @afs I also think the current spec is ambiguous.
  • @gkellogg I side with you that sh:pattern over blanks should be forbidden.
  • @bergos "👍 to keep the current definition, 👎 to change it": we all want to change it to clarify that IRI testing is ok. So just one thumb cannot express whether to forbid it for blanks
  • @bergos "I would propose the usage of Node Expressions": what is this? On first sight looks like a monstrosity to me, but maybe it's a very clever idea? (But please answer in a new issue if you want to pursue that)

@afs afs unpinned this issue Mar 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Core For SHACL 1.2 Core spec Errata Fixes for acknowledged errors in the specifications after publication
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants