Description
PR #106395 added strict Arrow input validation. The new checkBinaryOffsetsBuffer (in src/Processors/Formats/Impl/ArrowColumnToCHColumn.cpp) requires a String/Binary column's offsets buffer to hold length + 1 entries, i.e. ≥ 4 bytes even when the column is empty:
    const size_t required = sizeof(typename ArrowBinaryArray::offset_type)
        * static_cast<size_t>(chunk.offset() + chunk.length() + 1); // +1 even when length == 0
    if (unlikely(buffer_size < required))
        throw Exception(ErrorCodes::INCORRECT_DATA,
            "Arrow buffer too small for column '{}': {} bytes available, {} required", ...);
So an empty String/Binary column whose offsets buffer is 0 bytes is rejected:
Code: 117. DB::Exception: Arrow buffer too small for column 'key':
0 bytes available, 4 required (INCORRECT_DATA)
A 0-byte offsets buffer for an empty variable-width array is not strictly spec-compliant (the spec wants a single 0), but it is exactly what Apache Arrow Java < 19.0.0 emits for an empty Array(String) / Map(String, …) child (BaseVariableWidthVector writes offsetBuffer.writerIndex(0) when valueCount == 0; fixed in Arrow Java 19.0.0 via apache/arrow-java#989). Apache Spark bundles Arrow Java, so consumers such as the Spark connector cannot control the produced bytes. The same streams are read without error by arrow-cpp, pyarrow and arrow-rs (and by ClickHouse itself before #106395).
This only affects String/Binary columns. The numeric/fixed-size readers compute required = elem_size * (offset + length) (no + 1), so an empty numeric column needs 0 bytes and passes; only the binary path adds the + 1. The failing path is reached whenever every collection in a batch is empty (every map is {} or every array is []), because the inner string child then collapses to zero elements.
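The size arithmetic above can be modelled in a few lines. This is only a Python sketch, not the ClickHouse code: the function names are made up for illustration, and a 4-byte (int32) offset is assumed, as used by Arrow's String/Binary types.

```python
# Illustrative model of the two size formulas described above.
# All function names here are hypothetical; they do not exist in ClickHouse.

def required_binary_bytes(offset: int, length: int, offset_size: int = 4) -> int:
    """Bytes the offsets buffer must hold for a String/Binary chunk (one extra entry)."""
    return offset_size * (offset + length + 1)


def required_fixed_bytes(offset: int, length: int, elem_size: int) -> int:
    """Bytes the data buffer must hold for a fixed-width chunk (no extra entry)."""
    return elem_size * (offset + length)


def passes(buffer_size: int, required: int) -> bool:
    """Mirror of the `buffer_size < required` rejection test."""
    return buffer_size >= required


# Empty column backed by a 0-byte buffer:
print(required_binary_bytes(0, 0))                          # 4
print(passes(0, required_binary_bytes(0, 0)))               # False (rejected)
print(required_fixed_bytes(0, 0, elem_size=4))              # 0
print(passes(0, required_fixed_bytes(0, 0, elem_size=4)))   # True (accepted)
```

With length == 0 the binary formula still demands one offset entry, which is where the "0 bytes available, 4 required" in the error message comes from.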
How to reproduce
ClickHouse version: 26.6.1 (current master / head; introduced by #106395).
Both scripts below are also available as a gist: https://gist.github.com/ShimonSte/14be2c7a57cb5c4212db9ca296fc0874
1. Minimal, no Spark (pip install pyarrow requests)
#!/usr/bin/env python3
import sys, pyarrow as pa, requests

CH = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:8123/"
AUTH = {"X-ClickHouse-User": "default",
        "X-ClickHouse-Key": sys.argv[2] if len(sys.argv) > 2 else ""}

def q(body, query=None):
    return requests.post(CH, headers=AUTH, params={"query": query} if query else {}, data=body)

def ipc(batch):
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as w:
        w.write_batch(batch)
    return sink.getvalue().to_pybytes()

def insert(data, label):
    r = q(data, "INSERT INTO t_empty_map FORMAT ArrowStream")
    print(f" {label:34s} -> HTTP {r.status_code}: "
          + ("OK" if r.status_code == 200 else r.text.strip().split(chr(10))[0]))

q(b"DROP TABLE IF EXISTS t_empty_map")
q(b"CREATE TABLE t_empty_map (id Int32, value Map(String, Int32)) ENGINE = Memory")

offsets = pa.array([0, 0, 0], type=pa.int32())  # two rows, both empty maps
items = pa.array([], type=pa.int32())

# (a) pyarrow default: spec's single 0 offset -> 4-byte offsets buffer
key_default = pa.array([], type=pa.string())
insert(ipc(pa.record_batch([pa.array([1, 2], type=pa.int32()),
                            pa.MapArray.from_arrays(offsets, key_default, items)], names=["id", "value"])),
       "pyarrow default (4-byte offsets)")

# (b) Arrow Java < 19 style: empty key child with a 0-byte offsets buffer
key_zero = pa.Array.from_buffers(pa.string(), 0, [None, pa.py_buffer(b""), pa.py_buffer(b"")])
insert(ipc(pa.record_batch([pa.array([1, 2], type=pa.int32()),
                            pa.MapArray.from_arrays(offsets, key_zero, items)], names=["id", "value"])),
       "Arrow-Java <19 (0-byte offsets)")
Output against current head (26.6.1.853):
pyarrow default (4-byte offsets) -> HTTP 200: OK
Arrow-Java <19 (0-byte offsets) -> HTTP 400: Code: 117. DB::Exception: Arrow buffer too small for column 'key': 0 bytes available, 4 required ... (INCORRECT_DATA)
2. Real Apache Arrow Java (version contrast)
ReproduceArrow.java (in the gist) builds the same Map(String, Int32) batch of two empty maps with a real Arrow Java MapVector and inserts it. Running the same code against two Arrow Java versions:
===== arrow-vector 18.1.0 (bundled by Spark 4.0) =====
key child valueCount=0, offsets buffer=0 bytes
INSERT -> HTTP 400 REJECTED: Code: 117 ... Arrow buffer too small for column 'key': 0 bytes available, 4 required (INCORRECT_DATA)
===== arrow-vector 19.0.0 (apache/arrow-java#989 fix) =====
key child valueCount=0, offsets buffer=4 bytes
INSERT -> HTTP 200 OK
Same ClickHouse build; only the Arrow Java version differs. The offsets buffer goes from 0 to 4 bytes and the insert flips from rejected to accepted.
Expected behavior
An empty String/Binary column (length == 0) should be accepted, as it was before #106395 and as arrow-cpp/pyarrow/arrow-rs accept it. No value_offset(i) is read for a zero-length column, so an absent/0-byte offsets buffer is harmless.
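To illustrate why no offset is read for a zero-length column, here is a minimal pure-Python sketch of decoding a variable-width column from its offsets and data buffers. The function `read_strings` is hypothetical and stands in for a generic Arrow reader, not for ClickHouse's implementation; little-endian int32 offsets are assumed.

```python
import struct


def read_strings(offsets_buf: bytes, data_buf: bytes, length: int) -> list:
    """Decode `length` UTF-8 strings from int32 offsets plus a data buffer."""
    out = []
    for i in range(length):
        # Value i spans data_buf[offsets[i]:offsets[i + 1]].
        start, end = struct.unpack_from("<ii", offsets_buf, 4 * i)
        out.append(data_buf[start:end].decode("utf-8"))
    return out


# Two values, "ab" and "c": three offsets (length + 1) are needed.
print(read_strings(struct.pack("<3i", 0, 2, 3), b"abc", 2))  # ['ab', 'c']

# Zero values: the loop body never runs, so a 0-byte offsets buffer is never touched.
print(read_strings(b"", b"", 0))  # []
```

The offsets buffer is only dereferenced inside the per-value loop, so with length == 0 its size never matters to the reader.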
Question
Would you be open to skipping the offsets-buffer check (or treating it as satisfied) when length == 0? I'm not certain this is the right place to fix it - the producer is Arrow Java bundled inside Spark, which we can't control, and the stream is read fine by every other Arrow implementation.
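For concreteness, the relaxation I have in mind could look roughly like the following. This is a Python model of the idea only, not a patch: the function name mirrors checkBinaryOffsetsBuffer but the signature is invented, ValueError stands in for the INCORRECT_DATA exception, and 4-byte offsets are assumed.

```python
def check_binary_offsets_buffer(buffer_size: int, offset: int, length: int,
                                offset_size: int = 4, column: str = "key") -> None:
    """Hypothetical relaxed check: skip validation when there is nothing to read."""
    if length == 0:
        # No value offsets are read for an empty chunk, so any buffer size is acceptable.
        return
    required = offset_size * (offset + length + 1)
    if buffer_size < required:
        raise ValueError(
            f"Arrow buffer too small for column '{column}': "
            f"{buffer_size} bytes available, {required} required")


check_binary_offsets_buffer(buffer_size=0, offset=0, length=0)  # accepted: empty column
check_binary_offsets_buffer(buffer_size=8, offset=0, length=1)  # accepted: 2 offsets present
try:
    check_binary_offsets_buffer(buffer_size=0, offset=0, length=1)
except ValueError as e:
    print(e)  # non-empty columns are still validated strictly
```

Non-empty columns would keep the strict check from #106395; only the length == 0 case is exempted.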
Related downstream issue: ClickHouse/spark-clickhouse-connector#556
Version info
26.6.1.1123, 26.5.4.4, 26.4.5.61, 26.3.15.3, 25.8.25.31