URI/IRI "Normalization" and Compatibility #1349
We specify normalization to improve the likelihood that a URI or IRI is recognized as a known resource. Comparison and equivalence is the context in which normalization is defined in RFC 3986. Normalization is a well-understood industry term, and while it is hard to be exact about it, JSON Schema should not try to step in on behalf of other specifications. The one place where we can do this is to note that for documents of media type Otherwise, the point of the normalization requirement is to avoid requiring JSON Schema implementations to implement normalization themselves. There is nothing in JSON Schema that requires or encourages an implementation to detect an insufficiently normalized URI and error on it. Failing to normalize your
All IRIs can be mapped to URIs. |
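To make the IRI-to-URI mapping concrete, here is a minimal Python sketch in the spirit of RFC 3987 section 3.1. It assumes the IRI is already a Unicode string and simply percent-encodes every non-ASCII character as UTF-8. The function name `iri_to_uri` and the example IRI are hypothetical, and internationalized host names (which would need IDNA handling) are deliberately out of scope.

```python
from urllib.parse import quote

# ASCII characters that are left untouched: the RFC 3986 reserved set plus "%"
# (so that existing percent-encodings are not double-encoded).
# Letters, digits, "-", ".", "_" and "~" are never quoted by urllib.
_KEEP = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters of an IRI as UTF-8 octets.

    Assumption of this sketch: the host part is plain ASCII.
    """
    return quote(iri, safe=_KEEP, encoding="utf-8")


def is_plain_uri(iri: str) -> bool:
    """Rough check: an IRI that is pure ASCII needs no mapping step."""
    return iri.isascii()


if __name__ == "__main__":
    print(iri_to_uri("https://example.com/schémas/ä#définition"))
    print(is_plain_uri("https://json-schema.org/draft/2020-12/schema"))
```

A URI that is already pure ASCII passes through unchanged, which is why every URI is also a valid IRI while the reverse needs this mapping.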
I agree with everything you say about "normalization". The question is whether the phrase "MUST be normalized" expresses well what you wrote in your reply. I am aware that IRIs can be mapped to URIs. If that is all, then you could have just stuck with URIs :). The question was more whether schemas written for the 2020-12 specification are "future proof". I believe the answer is: technically no (but hopefully mostly kind of). |
I'm open to suggestions on better ways to word this. The RFC 2119 usage in the core spec is not great; some of it is old and might have been written in a different context, as we changed a lot of the wording over the last few drafts. What we want is to convey that non-normalized URIs (or IRIs in the future) will likely not behave correctly. We do not need (or in my opinion want, although someone may disagree) to require JSON Schema implementations to enforce normalization. I admit I don't know the best way of communicating that with proper formal language.
Can you elaborate on that? 2020-12 only allows URIs, the next release will allow IRIs. All URIs are valid IRIs, so what is the expected breakage? The only thing that will be at all breaking is that |
My resolution: I have chosen to use the JSON schema specification (for myself) AS IF...
Challenges about the third point (apart from language):
I do not expect any practically relevant problems. Technically the upcoming change in the output structure could be considered "more breaking" (and I definitely do not want to stand in the way of such evolution). I am excited to see that a concept of stability is emerging (https://json-schema.org/blog/posts/future-of-json-schema). |
While, practically speaking, this only impacts how people choose URIs for their meta-schemas/vocabularies (not validators that merely consume schemas), I'd like to point out the process is fairly well defined, and RFC 3986 actually has lots to say on exactly how you normalize a URI:
And JSON Schema Core normatively references RFC 3986, so it means the same as if we incorporated the text directly. |
It is just not the intention of RFC 3986 to define a normal form (your examples come from a section called "Comparison Ladder"). I know this is an unfortunate decision by the RFC 3986 authors (who probably had their reasons), but the JSON Schema specification should acknowledge this. It seems like the JSON Schema specification makes unintended use of RFC 3986 without even a comment.
... exactly :) You are right that the phrase "MUST be normalized" only occurs in the
I do not think that the "normalized" "restrictions" in the JSON schema specification have a clear meaning. Removing them (or rephrasing them as non-normative comments of some sort) probably has no real effect (just a bit of polish). |
Please elaborate on this point... I think it's fair to assume when spec talks about something, it is conveying intent. Here it provides a specific definition for "normalized" that's exactly the meaning we're looking for: Use uppercase pct-encoded sequences, remove unnecessary dot components, etc. You'll have to explain how it's possible to interpret this in any other way, with a specific example.
RFC 3986 has this to say:
The effect of this is you need to preserve extra leading dot segments when applying them to relative references, instead of removing them. The meaning isn't ambiguous, just a little bit buried.
We're using the BCP 14 "SHOULD" and "MUST" language, which imposes an interoperability requirement... in this case, it's imposing a requirement on how you write schemas (rather than how you parse them or use them in validation). Practically speaking, it says if you don't follow this requirement, then interoperability with other implementations won't necessarily be guaranteed. It might still work fine for now, but it might break sometime in the distant future, we don't know. |
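For readers who want to see the steps discussed above (uppercase percent-encodings, removal of dot segments, leaving relative references alone), here is a rough Python sketch of the syntax-based steps from RFC 3986 section 6.2.2. The function names are hypothetical, the order of the steps is a choice of this sketch rather than something the RFC mandates, and scheme-based normalization (default ports, empty paths) is not attempted.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _normalize_pct(text: str) -> str:
    """Decode percent-encoded unreserved characters, uppercase the rest."""
    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def remove_dot_segments(path: str) -> str:
    """The remove_dot_segments routine of RFC 3986 section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./") or path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../") or path == "/..":
            path = "/" + path[4:]
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to output.
            cut = path.find("/", 1)
            cut = len(path) if cut == -1 else cut
            output.append(path[:cut])
            path = path[cut:]
    return "".join(output)


def normalize_uri(uri: str) -> str:
    parts = urlsplit(uri)  # urlsplit already lowercases the scheme
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + _normalize_pct(hostport.lower())
    path = _normalize_pct(parts.path)
    if parts.scheme:
        # Only absolute URIs get dot segments removed; a relative reference
        # keeps its leading ".." so it still resolves correctly later.
        path = remove_dot_segments(path)
    return urlunsplit((parts.scheme, netloc, path,
                       _normalize_pct(parts.query),
                       _normalize_pct(parts.fragment)))


if __name__ == "__main__":
    print(normalize_uri("HTTP://Example.COM/a/./b/../c%7euser?q=%3a#Frag"))
    print(normalize_uri("../%7efoo/./bar"))
```

Note that the relative reference keeps its dot segments here; whether that is the "right" reading for unresolved references is exactly what the thread is debating.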
Surely next you are going to tell me that you can also do scheme based normalization on an unresolved URI-reference without a scheme :). My wording in the original issue might have been too harsh / misleading. RFC 3986 uses the term "normalization" in a consistent, understandable way. Nonetheless I still believe:
|
I agree with @awwright that normalization is sufficiently well defined. I don't think I've ever seen a URI library that doesn't normalize URIs, and I've definitely never heard of normalization implementations that normalize in a way that is incompatible with other implementations. There can be slight variations, such as whether uppercase or lowercase letters are used for percent-encoded characters, but that doesn't matter as long as you normalize both URIs you're comparing using the same library. Technically, the spec requires schema authors rather than implementations to normalize URIs, which I've always found to be awkward. Implementations still need to normalize before comparison to account for variations in normalization, so asking schema authors to normalize their URIs doesn't really seem necessary. I'd argue that the spec doesn't need to say anything about normalization, but not because normalization isn't well defined. It's the opposite, really. It doesn't need to be mentioned because normalization is just part of the process for comparing URIs as defined in RFC 3986. We don't need to say any more, just point to RFC 3986. |
Yes ... and (from RFC 3986)
... and (from RFC 3986)
Whereas the JSON schema specification says
Before we get into a nonsense discussion: I am not saying that the JSON schema specification demands comparing unresolved URI-references. I am saying that RFC 3986 and RFC 3987 present normalization as a way of broadening equivalence while explicitly discouraging equivalence testing on unresolved URI-references. "Normalizing" unresolved URI-references is certainly not something that the URI/IRI RFCs encourage. It is also not so clear how to even do it (some people might not do scheme based normalization if there is no scheme; some might find a creative, custom way of doing it anyways, similar to I do not believe that the question whether a URI has been normalized is decidable based on RFC 3986 (or RFC 3987). At least nobody has added a |
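As a small illustration of why equivalence testing on unresolved references is discouraged, the following Python sketch resolves the same relative reference against two different base URIs using the standard library's `urljoin`. The base URIs and the `resolve` helper are made up for the example.

```python
from urllib.parse import urljoin


def resolve(base: str, reference: str) -> str:
    """Resolve a URI-reference against a base URI (RFC 3986 section 5.2)."""
    return urljoin(base, reference)


reference = "../item.json"

# Hypothetical base URIs of two different schema documents.
first = resolve("https://example.com/schemas/a/root.json", reference)
second = resolve("https://example.com/other/b/root.json", reference)

if __name__ == "__main__":
    print(first)
    print(second)
    print(first == second)
```

The two unresolved references are character-for-character identical, yet they identify different resources once resolved, so any comparison (and any normalization meant to support it) is only meaningful after resolution.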
I completely agree that that statement doesn't make sense. |
The action here is to clean up the text a bit. I do somewhat agree with @jdesrosiers' statement in the last comment about how the requirement should be on implementations not authors. |
The current core specification (2020-12) requires that "$schema" (and some other) URIs "MUST be normalized". While RFC 3986 makes suggestions for "normalization" steps, it does not define a normal form for URIs (IRIs analogous). "normalized"/"normal"/"normalization"/... (of URIs/IRIs) is not well-defined. I propose that these terms are either eliminated from the specification or defined.
My current personal preference: Eliminate the "normalization" constraint.
A thought about the switch from URIs to IRIs:
An implementation could/should reject a schema if the resolution of "$id" yields an IRI that is not also a URI AND the schema's meta schema declares
Is this true now? Can this still occur in the future? Is rejection still possible/desired? What about this: