JSON Schema source: uri_replace_pattern not applied to nested $ref resolution #16238

@shonigbaum

Background
Our schema API requires OAuth 2.0 authentication, which the DataHub CLI json-schema source doesn't support natively. We're trying to work around this by using uri_replace_pattern to redirect schema references through an internal authenticating proxy that handles the OAuth flow. However, the feature seems to be broken for nested $ref resolution, making it impossible to ingest schemas that reference other schemas.

Describe the bug
The uri_replace_pattern configuration in the JSON Schema source is not applied to nested $ref resolution. Although the custom loader is passed to _load_json_schema(), the title_swapping_callback that handles nested references uses its own loader (self.loader), which ignores the configured uri_replace_pattern. As a result, authentication fails when schemas reference other schemas via HTTPS URLs.

To Reproduce
Steps to reproduce the behavior:

  1. Create a JSON schema with a $ref to an external HTTPS URL that requires authentication:
{
 "$id": "https://schemas.example.com/types/person/1.0",
 "properties": {
   "status": {
     "$ref": "https://schemas.example.com/enums/status/1.0"
   }
 }
}
  2. Configure the JSON Schema source with uri_replace_pattern to proxy the requests:
{
 "source": {
   "type": "json-schema",
   "config": {
     "path": "/path/to/schemas",
     "platform": "myplatform",
     "uri_replace_pattern": {
       "match": "https://schemas.example.com/",
       "replace": "https://internal-proxy.local/schema?uri=https://schemas.example.com/"
     }
   }
 }
}
  3. Run ingestion: datahub ingest run -c config.json

  4. Observe error: HTTPError: 401 Client Error: Unauthorized for url: https://schemas.example.com/enums/status/1.0

Expected behavior
The uri_replace_pattern should apply to all $ref resolutions, including nested references. The URL should be transformed to https://internal-proxy.local/schema?uri=https://schemas.example.com/enums/status/1.0 before making the HTTP request.
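
To make the expected rewrite concrete, here is a minimal sketch. It assumes the pattern is applied as a plain substring replacement; the function name apply_uri_replace_pattern is hypothetical and is not part of DataHub.

```python
def apply_uri_replace_pattern(uri: str, match: str, replace: str) -> str:
    """Rewrite a URI by plain substring replacement (assumed semantics)."""
    return uri.replace(match, replace)


# Values taken from the reproduction config in this report.
rewritten = apply_uri_replace_pattern(
    "https://schemas.example.com/enums/status/1.0",
    match="https://schemas.example.com/",
    replace="https://internal-proxy.local/schema?uri=https://schemas.example.com/",
)
print(rewritten)
```

This is the URL the nested $ref request would be expected to hit, rather than the original schemas.example.com address.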

Root Cause
In json_schema.py (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/schema/json_schema.py), lines 252-257, the code patches jsonref.JsonRef.callback with title_swapping_callback:

with unittest.mock.patch("jsonref.JsonRef.callback", title_swapping_callback):
   (schema_dict, schema_string) = self._load_json_schema(
       os.path.join(root_dir, file_name),
       loader=ref_loader,
       use_id_as_base_uri=self.config.use_id_as_base_uri,
   )

However, title_swapping_callback (in json_ref_patch.py line 16) calls self.loader(uri) directly, which uses the default jsonref.jsonloader instead of the custom stringreplaceloader configured via uri_replace_pattern.

Environment:

  • DataHub version: 1.4.0
  • Source: json-schema
  • Deployment: Kubernetes

Additional context
The stringreplaceloader is correctly created and passed as ref_loader to _load_json_schema(), but the callback mechanism bypasses it. A fix would need to ensure title_swapping_callback uses the configured loader instead of the default one.
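
A rough sketch of the direction a fix could take, using only the standard library. Everything here is illustrative: replacing_loader, FakeRef, and fake_fetch are hypothetical stand-ins and not DataHub or jsonref code, and no network calls are made. The point is only that a reference object resolving through a loader built from the configured pattern ends up requesting the rewritten URL.

```python
import functools
from typing import Any, Callable, Dict, List

Loader = Callable[[str], Dict[str, Any]]


def replacing_loader(match: str, replace: str, delegate: Loader, uri: str) -> Dict[str, Any]:
    """Rewrite the URI, then hand it to the underlying loader."""
    return delegate(uri.replace(match, replace))


class FakeRef:
    """Hypothetical stand-in for a reference object that resolves via self.loader."""

    def __init__(self, uri: str, loader: Loader) -> None:
        self.uri = uri
        self.loader = loader

    def resolve(self) -> Dict[str, Any]:
        return self.loader(self.uri)


requested: List[str] = []


def fake_fetch(uri: str) -> Dict[str, Any]:
    """Record the requested URI instead of making an HTTP request."""
    requested.append(uri)
    return {"type": "string"}


# Build the loader once from the configured pattern and give it to every ref,
# so nested references cannot fall back to an unconfigured default.
ref_loader = functools.partial(
    replacing_loader,
    "https://schemas.example.com/",
    "https://internal-proxy.local/schema?uri=https://schemas.example.com/",
    fake_fetch,
)

nested = FakeRef("https://schemas.example.com/enums/status/1.0", loader=ref_loader)
nested.resolve()
print(requested[0])
```

Whether the real fix belongs in title_swapping_callback itself or in how the loader is propagated to nested references is left open here.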

Labels: bug
