Description
[Bug] JSONStreamDatasource fixes schema from the first batch and fails on later null -> concrete type evolution
Summary
Data-Juicer's custom JSON stream reader appears to lock the schema from the first batch in `data_juicer.core.data.ray_dataset.JSONStreamDatasource._read_stream`.
When an early batch infers a nested field as `null` (for example, `meta.url = null`) and a later batch introduces a concrete type for the same field (for example, `string`), the reader does not reconcile schemas across batches and can fail with errors such as:
`Couldn't cast array of type ... to null`
or an Arrow schema mismatch during ingestion.
This is a correctness + stability bug in DJ's custom JSON streaming ingestion path, not just a configuration issue.
- Component: `data_juicer.core.data.ray_dataset.JSONStreamDatasource._read_stream`
- Severity suggestion: P1
- Impact: the pipeline / ingestion stage can fail on otherwise valid mixed-null JSONL
Environment
- OS: Ubuntu 24.04.2 LTS
- Kernel: Linux 5.15.0-119-generic x86_64
- Python: 3.12
- Ray: 2.54.0
- PyArrow: 23.0.1
- Data-Juicer: 1.5.0
- Reproduced in: standalone script
Why this looks like a DJ bug
In DJ source, `RayDataset.read_json()` is explicitly marked as a temporary solution for JSON stream reading and carries a TODO to replace it with `ray.data.read_json_stream` once available.
In `JSONStreamDatasource._read_stream`, `schema` starts as `None` and is then set from the first batch:

```python
schema = None
while True:
    batch = reader.read_next_batch()
    table = pyarrow.Table.from_batches([batch], schema=schema)
    if schema is None:
        schema = table.schema
    yield table
```

This means the first batch determines the schema used for all later batches. If the first batch infers a field as `null`, later batches with concrete types may fail instead of being reconciled.
Expected behavior
Valid mixed-null JSONL should remain readable in stream mode.
Examples of acceptable behavior:
- allow `null -> string/int/struct` evolution across batches when the data is otherwise valid;
- reconcile / widen the schema across batches;
- or, at minimum, raise a clearer error stating that the custom stream path does not support this schema evolution.
Actual behavior
The custom DJ stream path can fail on inputs where:
- earlier stream batches contain only `null` for a field;
- later batches introduce a concrete type for the same field.
Observed errors include:
`Couldn't cast array of type ... to null`
and related Arrow schema mismatch failures.
Minimal reproduction
"""Minimal repro for Data-Juicer JSONStreamDatasource schema-lock bug.
Expected behavior on affected DJ versions:
- DJ read_json_stream fails on mixed null/non-null nested field values.
- Ray native read_json succeeds on the same file.
"""
from __future__ import annotations
import json
import tempfile
from pathlib import Path
import pyarrow.json as paj
import ray
from data_juicer.core.data.ray_dataset import read_json_stream
def _write_payload(path: Path) -> None:
rows = []
# First batches contain null-only nested field.
for i in range(30):
rows.append({"id": i, "meta": {"url": None}})
# Later batch introduces concrete type.
rows.append({"id": 999, "meta": {"url": "https://example.com"}})
with path.open("w", encoding="utf-8") as f:
for row in rows:
f.write(json.dumps(row, ensure_ascii=False) + "\n")
def main() -> None:
ray.init(ignore_reinit_error=True)
try:
tmp_dir = Path(tempfile.mkdtemp(prefix="dj_repro_"))
payload = tmp_dir / "mixed_schema.jsonl"
_write_payload(payload)
print("payload =", payload)
print("\n[DJ read_json_stream]")
try:
ds = read_json_stream(
str(payload),
override_num_blocks=1,
# Small block size forces multiple read batches in stream mode.
read_options=paj.ReadOptions(use_threads=False, block_size=256),
)
print("dj count =", ds.count())
print("dj schema =", ds.schema())
except Exception as exc: # noqa: BLE001
print("dj error =", type(exc).__name__)
print(exc)
print("\n[Ray native read_json]")
ds2 = ray.data.read_json(str(payload), override_num_blocks=1)
print("ray count =", ds2.count())
print("ray schema =", ds2.schema())
finally:
ray.shutdown()
if __name__ == "__main__":
main()Reproduction scripts
- DJ bug repro (DJ fails, Ray native succeeds on the same input): `tmp/debug/repro_dj_jsonstream_schema_lock.py`
  Run: `conda run -n LLMData --no-capture-output python /Users/markwu/Pycharm/LLMdata/tmp/debug/repro_dj_jsonstream_schema_lock.py`
- Ray `_backfill_missing_fields` TypeError repro: `tmp/debug/repro_ray_backfill_typeerror.py`
  Run: `conda run -n LLMData --no-capture-output python /Users/markwu/Pycharm/LLMdata/tmp/debug/repro_ray_backfill_typeerror.py`
Downstream mitigations currently used (workarounds, not root fix)
We currently apply multiple downstream mitigations in our pipeline wrapper to keep production stable:
- Bypass DJ `read_json_stream` in favor of `ray.data.read_json` when possible, to avoid `JSONStreamDatasource` stream-read failures on mixed-schema JSONL.
- Auto-infer and sanitize an explicit Arrow schema (drop null-only unstable paths, keep known stable fields typed), then pass it through parsing options.
- Patch BTS deduplicator `map_batches` kwargs to forward per-op `num_cpus`, so Ray task resource requests are honored.
- Patch `ray.init` caps and tune the Ray Data context (`read_op_min_num_blocks`, `min_parallelism`, block sizing, streaming buffer) for stability under large runs.
These mitigations are implemented in:
`finance_cleaning/utils/dj_process_capped.py`
These workarounds unblock production jobs, but they do not fix the underlying schema-handling behavior in DJ's custom JSON stream reader.
Possible fix directions
- Avoid fixing the schema from the first batch in `JSONStreamDatasource._read_stream`.
- Reconcile schemas across batches instead of reusing the first inferred schema unchanged.
- For valid cases such as `null -> string/int`, allow widening / permissive promotion across batches.
- Prefer the official Ray JSON reading path when equivalent functionality is available, to reduce behavior drift between DJ custom ingestion and Ray native ingestion.
Additional context
We currently see two distinct classes of failures in this area:
- DJ JSON stream schema-lock bug: first-batch schema fixation causes `cast ... to null` / schema mismatch failures on later concrete values.
- Ray Data struct backfill robustness bug: a separate failure mode in `_backfill_missing_fields` during struct reconciliation.
This issue is specifically about the first one: DJ's custom JSON stream ingestion behavior.