Description
Background
Our schema API requires OAuth 2.0 authentication, which the DataHub CLI json-schema source doesn't support natively. We're trying to work around this by using uri_replace_pattern to redirect schema references through an internal authenticating proxy that handles the OAuth flow. However, the feature seems to be broken for nested $ref resolution, making it impossible to ingest schemas that reference other schemas.
Describe the bug
The uri_replace_pattern configuration in the JSON Schema source does not work for nested $ref resolution. While the custom loader is passed to _load_json_schema(), the title_swapping_callback that handles nested references uses its own loader (self.loader) which ignores the configured uri_replace_pattern, causing authentication failures when schemas reference other schemas via HTTPS URLs.
To Reproduce
Steps to reproduce the behavior:
- Create a JSON schema with a $ref to an external HTTPS URL that requires authentication:
{
  "$id": "https://schemas.example.com/types/person/1.0",
  "properties": {
    "status": {
      "$ref": "https://schemas.example.com/enums/status/1.0"
    }
  }
}
- Configure the JSON Schema source with uri_replace_pattern to proxy the requests:
{
  "source": {
    "type": "json-schema",
    "config": {
      "path": "/path/to/schemas",
      "platform": "myplatform",
      "uri_replace_pattern": {
        "match": "https://schemas.example.com/",
        "replace": "https://internal-proxy.local/schema?uri=https://schemas.example.com/"
      }
    }
  }
}
- Run ingestion:
datahub ingest run -c config.json
- Observe error:
HTTPError: 401 Client Error: Unauthorized for url: https://schemas.example.com/enums/status/1.0
Expected behavior
The uri_replace_pattern should apply to all $ref resolutions, including nested references. The URL should be transformed to https://internal-proxy.local/schema?uri=https://schemas.example.com/enums/status/1.0 before making the HTTP request.
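The expected rewrite can be sketched in a few lines of Python. This assumes uri_replace_pattern behaves as a plain substring replacement; the function name apply_uri_replace_pattern is hypothetical and is not part of DataHub.

```python
def apply_uri_replace_pattern(uri: str, match: str, replace: str) -> str:
    """Rewrite a $ref URI the way the config is expected to (assumed: plain substring replace)."""
    return uri.replace(match, replace)


# Values taken from the reproduction config above.
MATCH = "https://schemas.example.com/"
REPLACE = "https://internal-proxy.local/schema?uri=https://schemas.example.com/"

rewritten = apply_uri_replace_pattern(
    "https://schemas.example.com/enums/status/1.0", MATCH, REPLACE
)
print(rewritten)
```

With this rewrite applied to the nested $ref, the request would go to the proxy rather than directly to schemas.example.com.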
Root Cause
In json_schema.py (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/schema/json_schema.py), lines 252-257, the code patches jsonref.JsonRef.callback with title_swapping_callback:
with unittest.mock.patch("jsonref.JsonRef.callback", title_swapping_callback):
    (schema_dict, schema_string) = self._load_json_schema(
        os.path.join(root_dir, file_name),
        loader=ref_loader,
        use_id_as_base_uri=self.config.use_id_as_base_uri,
    )
However, title_swapping_callback (in json_ref_patch.py line 16) calls self.loader(uri) directly, which uses the default jsonref.jsonloader instead of the custom stringreplaceloader configured via uri_replace_pattern.
Environment:
- DataHub version: 1.4.0
- Source: json-schema
- Deployment: Kubernetes
Additional context
The stringreplaceloader is correctly created and passed as ref_loader to _load_json_schema(), but the callback mechanism bypasses it. A fix would need to ensure title_swapping_callback uses the configured loader instead of the default one.
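A minimal, self-contained sketch of the behavior a fix should produce is below. It does not use jsonref or DataHub code: RefSketch, make_replacing_loader, and recording_fetch are hypothetical stand-ins, and the enum payload is a placeholder. The point is only that the object resolving a nested reference should fetch through the loader it was configured with, so the rewrite reaches every request.

```python
from typing import Any, Callable, List

Loader = Callable[..., Any]


def make_replacing_loader(match: str, replace: str, base_loader: Loader) -> Loader:
    """Wrap base_loader so every URI is rewritten before it is fetched."""

    def loader(uri: str, **kwargs: Any) -> Any:
        return base_loader(uri.replace(match, replace), **kwargs)

    return loader


class RefSketch:
    """Hypothetical stand-in for a lazily resolved $ref (not jsonref.JsonRef)."""

    def __init__(self, uri: str, loader: Loader) -> None:
        self.uri = uri
        self.loader = loader  # the configured loader, not a library default

    def resolve(self) -> Any:
        # Nested references go through the same configured loader.
        return self.loader(self.uri)


requested: List[str] = []


def recording_fetch(uri: str, **kwargs: Any) -> dict:
    """Fake network fetch that only records which URL was requested."""
    requested.append(uri)
    return {"enum": ["active", "inactive"]}  # placeholder payload


ref_loader = make_replacing_loader(
    "https://schemas.example.com/",
    "https://internal-proxy.local/schema?uri=https://schemas.example.com/",
    recording_fetch,
)
nested = RefSketch("https://schemas.example.com/enums/status/1.0", ref_loader)
resolved = nested.resolve()
print(requested)
```

In the real source, the equivalent change would be to make sure the reference objects created during loading carry ref_loader, or that title_swapping_callback is given it explicitly; which of the two is appropriate depends on how jsonref propagates the loader internally.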