Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update jsonschema to 0.28.1, support drafts 2019-09 and 2020-12 #730

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

uncenter
Copy link
Contributor

@uncenter uncenter commented Jan 9, 2025

Closes #497, jsonschema now supports both of drafts (#497 (comment)).

self.cache
.get_schema(url)
.get_schema(&Url::parse(uri.as_str())?)
Copy link
Contributor Author

@uncenter uncenter Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since jsonschema now uses fluent_uri:Uri instead of url::Url, we convert the Uri back to Url to avoid refactoring other major areas of Taplo to use fluent_uri.

Copy link

@Stranger6667 Stranger6667 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I forked taplo to implement it myself, but didn't manage to finish all the required changes there.

I've added a small note, but otherwise, everything looks great!

P.S. It is actually super useful to see how the new jsonschema versions are used in the real world. Specifically seeing what is done to handle the absence of async retrieval gives me a feeling that this feature should be higher on my todo list than now.

} else {
None
}
let errors = validator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize now that in 0.28 all the external schemas retrieving happens during the validator compilation, not during validation, hence it this logic should be moved to create_validator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, can you elaborate? How has the behavior changed from what this comment used to describe? Should I wait until async retrieval is implemented in jsonschema as you mentioned?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! jsonschema used to retrieve external documents on demand, during the validate call (or is_valid or apply) in order to perform the validation for $ref and related reference keywords. To make it work, the $ref implementation had some caching in place which implied a certain performance penalty (rwlock + inserts).

Some time ago I realized that it could be done just once when we create a validator (e.g. via jsonschema::validator_for) and all that caching won't be needed during validation as we'll have everything already retrieved, so I moved all that retrieval logic to the building stage.

I think it is not necessary to wait for async retrieval in jsonschema, as the loop approach would still work - it just needs to be applied inside create_validator (which requires making a few extra functions async here). If there are any external schemas missing, then the call to ValidationOptions::build will fail with the same validation errors as implemented in this PR

Copy link

@Stranger6667 Stranger6667 Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e. the crucial change to this comment is that it is not about reaching some validation path, but rather discovering an external referencing during the building process.

Because all the resolving became eager and now happens much earlier in the process.

Here is also a plan I have for async retrieval:

  • 0.29 will have ValidationOptions::with_registry that will expect a registry with all the needed schemas
  • This jsonschema::Registry will have an async constructor that will accept resources and traverse them all until all the external schemas are retrieved and all the inner indices are built.

So, the retrieval will happen in batches where each batch will load previously unseen external resources that are reachable right now. Then the output of this batch will be traversed in the same way and if there are some new resources, there will be another batch, and so on.

Practically speaking:

let registry = jsonschema::non_blocking::Registry::new("http://127.0.0.1/dependent.json", json!({"$ref": "http://127.0.0.1/another.json"})).await?;

let validator = jsonschema::options()
    .with_registry(registry)
    .build({"$ref": "http://127.0.0.1/dependent.json"})?;

And maybe there could be just an async builder

let validator = jsonschema::non_blocking::options()
    .build({"$ref": "http://127.0.0.1/dependent.json"})
    .await?;

// Or

let validator = jsonschema::non_blocking::validator_for({"$ref": "http://127.0.0.1/dependent.json"})
    .await?;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some time ago I realized that it could be done just once when we create a validator (e.g. via jsonschema::validator_for) and all that caching won't be needed during validation as we'll have everything already retrieved, so I moved all that retrieval logic to the building stage.

If I map over the errors from the initial validator build, will all of the retrieval errors surface at once or do I still need to continually loop and rebuild the validator until each and every remote reference is resolved?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to loop because external schemas can have references to other external schemas too and we don't know it until we load them.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With jsonschema==0.29 & the resolve-async feature enabled, the following should be enough on the validation build step:

jsonschema::async_options()
    .with_retriever(CacheSchemaRetriever {
        cache: self.cache().clone(),
    })
    .with_format("semver", formats::semver)
    .with_format("semver-requirement", formats::semver_req)
    .build(schema)
    .await
    .map_err(|err| anyhow!("invalid schema: {err}"))
  • retriever impl requires a few extra changes - using async fn and caching the retrieved value.

Generally, the built-in retriever could work, but as I see here the cache is shared for all validators. I think in the future it could be done by having a shared Registry to avoid re-processing already processed external resources, but there is some extra work to be done on the jsonschema side. Right now I think the new jsonschema version should already simplify the logic quite a bit.

Copy link

@Stranger6667 Stranger6667 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment about the overall validation flow, the current approach won't work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Discrepancy in validation with JSON Schema draft 2020-12
2 participants