Arrow: reading rejects empty String/Binary column with a 0-byte offsets buffer (length == 0) #107749

Description

@ShimonSte

PR #106395 added strict Arrow input validation. The new checkBinaryOffsetsBuffer (in src/Processors/Formats/Impl/ArrowColumnToCHColumn.cpp) requires a String/Binary column's offsets buffer to hold length + 1 entries, i.e. ≥ 4 bytes even when the column is empty:

const size_t required = sizeof(typename ArrowBinaryArray::offset_type)
    * static_cast<size_t>(chunk.offset() + chunk.length() + 1);   // +1 even when length == 0
if (unlikely(buffer_size < required))
    throw Exception(ErrorCodes::INCORRECT_DATA,
        "Arrow buffer too small for column '{}': {} bytes available, {} required", ...);

So an empty String/Binary column whose offsets buffer is 0 bytes is rejected:

Code: 117. DB::Exception: Arrow buffer too small for column 'key':
  0 bytes available, 4 required (INCORRECT_DATA)

A 0-byte offsets buffer for an empty variable-width array is not strictly spec-compliant (the spec wants a single 0), but it is exactly what Apache Arrow Java < 19.0.0 emits for an empty Array(String) / Map(String, …) child: BaseVariableWidthVector writes offsetBuffer.writerIndex(0) when valueCount == 0. This was fixed in Arrow Java 19.0.0 via apache/arrow-java#989. Apache Spark bundles Arrow Java, so consumers such as the Spark connector cannot control the produced bytes. The same streams are read without error by arrow-cpp, pyarrow and arrow-rs (and by ClickHouse itself before #106395).
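
To make the two layouts concrete, here is a minimal pure-Python sketch of how a 32-bit offsets buffer is laid out for a variable-width column. The helper name encode_offsets is hypothetical and exists only for illustration; it is not part of pyarrow or ClickHouse, and it assumes little-endian int32 offsets.

```python
import struct

def encode_offsets(values):
    """Build little-endian int32 offsets for a variable-width column.

    Hypothetical helper: a column with N values gets N + 1 offsets,
    starting at 0, so an empty column still gets the single offset 0.
    """
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    return struct.pack(f"<{len(offsets)}i", *offsets)

two_values = encode_offsets([b"a", b"bc"])   # offsets 0, 1, 3
spec_empty = encode_offsets([])              # single offset 0
java_pre19_empty = b""                       # 0-byte buffer described in this issue

print(len(two_values), len(spec_empty), len(java_pre19_empty))
```

The spec-style empty column carries 4 bytes of offsets, while the Arrow Java < 19 form carries none, which is the only difference the new check reacts to.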

This only affects String/Binary columns. The numeric/fixed-size readers compute required = elem_size * (offset + length) (no + 1), so an empty numeric column needs 0 bytes and passes; only the binary path adds the + 1. The failing case is reached whenever an entire batch's collections are empty (every map {} or every array []), so the inner string child collapses to zero elements.
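
The difference between the two size formulas can be sketched in a few lines of Python. This only models the arithmetic quoted above; the function names are hypothetical, and OFFSET_SIZE = 4 assumes the 32-bit offset_type of String/Binary.

```python
OFFSET_SIZE = 4  # assumed sizeof(int32_t), the offset_type of String/Binary

def required_binary_offsets(offset, length):
    # Binary path: one offset per element plus the trailing one.
    return OFFSET_SIZE * (offset + length + 1)

def required_fixed_width(elem_size, offset, length):
    # Numeric/fixed-size path: no extra entry.
    return elem_size * (offset + length)

for length in (0, 1, 3):
    print(length,
          required_binary_offsets(0, length),
          required_fixed_width(4, 0, length))
```

At length 0 the binary formula yields 4 required bytes while the fixed-width formula yields 0, so only the String/Binary child of an all-empty batch can fail the check.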

How to reproduce

ClickHouse version: 26.6.1 (current master / head; introduced by #106395).

Both scripts below are also available as a gist: https://gist.github.com/ShimonSte/14be2c7a57cb5c4212db9ca296fc0874

1. Minimal, no Spark (pip install pyarrow requests)

#!/usr/bin/env python3
import sys, pyarrow as pa, requests

CH = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:8123/"
AUTH = {"X-ClickHouse-User": "default",
        "X-ClickHouse-Key": sys.argv[2] if len(sys.argv) > 2 else ""}

def q(body, query=None):
    return requests.post(CH, headers=AUTH, params={"query": query} if query else {}, data=body)

def ipc(batch):
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as w:
        w.write_batch(batch)
    return sink.getvalue().to_pybytes()

def insert(data, label):
    r = q(data, "INSERT INTO t_empty_map FORMAT ArrowStream")
    print(f"  {label:34s} -> HTTP {r.status_code}: "
          + ("OK" if r.status_code == 200 else r.text.strip().split(chr(10))[0]))

q(b"DROP TABLE IF EXISTS t_empty_map")
q(b"CREATE TABLE t_empty_map (id Int32, value Map(String, Int32)) ENGINE = Memory")

offsets = pa.array([0, 0, 0], type=pa.int32())   # two rows, both empty maps
items   = pa.array([], type=pa.int32())

# (a) pyarrow default: spec's single 0 offset -> 4-byte offsets buffer
key_default = pa.array([], type=pa.string())
insert(ipc(pa.record_batch([pa.array([1, 2], type=pa.int32()),
      pa.MapArray.from_arrays(offsets, key_default, items)], names=["id", "value"])),
      "pyarrow default (4-byte offsets)")

# (b) Arrow Java < 19 style: empty key child with a 0-byte offsets buffer
key_zero = pa.Array.from_buffers(pa.string(), 0, [None, pa.py_buffer(b""), pa.py_buffer(b"")])
insert(ipc(pa.record_batch([pa.array([1, 2], type=pa.int32()),
      pa.MapArray.from_arrays(offsets, key_zero, items)], names=["id", "value"])),
      "Arrow-Java <19 (0-byte offsets)")

Output against current head (26.6.1.853):

  pyarrow default (4-byte offsets)   -> HTTP 200: OK
  Arrow-Java <19 (0-byte offsets)    -> HTTP 400: Code: 117. DB::Exception: Arrow buffer too small for column 'key': 0 bytes available, 4 required ... (INCORRECT_DATA)

2. Real Apache Arrow Java (version contrast)

ReproduceArrow.java (in the gist) builds the same Map(String, Int32) batch of two empty maps with a real Arrow Java MapVector and inserts it. Running the same code against two Arrow Java versions:

===== arrow-vector 18.1.0 (bundled by Spark 4.0) =====
key child valueCount=0, offsets buffer=0 bytes
INSERT -> HTTP 400 REJECTED: Code: 117 ... Arrow buffer too small for column 'key': 0 bytes available, 4 required (INCORRECT_DATA)

===== arrow-vector 19.0.0 (apache/arrow-java#989 fix) =====
key child valueCount=0, offsets buffer=4 bytes
INSERT -> HTTP 200 OK

Same ClickHouse build, only the Arrow Java version differs: the offsets buffer goes 0 → 4 bytes and the insert flips from rejected to accepted.

Expected behavior

An empty String/Binary column (length == 0) should be accepted, as it was before #106395 and as arrow-cpp/pyarrow/arrow-rs accept it. No value_offset(i) is read for a zero-length column, so an absent/0-byte offsets buffer is harmless.

Question

Would you be open to skipping the offsets-buffer check (or treating it as satisfied) when length == 0? I'm not certain this is the right place to fix it: the producer is Arrow Java bundled inside Spark, which we can't control, and the stream is read fine by every other Arrow implementation.
Related downstream issue: ClickHouse/spark-clickhouse-connector#556
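
As a rough illustration of the relaxation asked about above, here is a Python model of the check with an early return for empty columns. It is a sketch under assumptions, not the ClickHouse implementation: the function name, the allow_empty switch and the use of ValueError are all hypothetical, and the real check lives in C++ in checkBinaryOffsetsBuffer.

```python
def check_offsets_buffer(buffer_size, offset, length, *,
                         allow_empty=True, offset_size=4):
    """Model of the offsets-buffer size check (hypothetical names).

    With allow_empty=True, a zero-length column is treated as satisfied,
    since no offset is ever read for it.
    """
    if allow_empty and length == 0:
        return
    required = offset_size * (offset + length + 1)
    if buffer_size < required:
        raise ValueError(
            f"Arrow buffer too small: {buffer_size} bytes available, "
            f"{required} required")

# Empty column with a 0-byte offsets buffer: accepted by the relaxed check.
check_offsets_buffer(0, 0, 0)

# The strict behaviour described in this issue rejects the same input.
try:
    check_offsets_buffer(0, 0, 0, allow_empty=False)
except ValueError as e:
    print(e)
```

Non-empty columns still go through the full length + 1 check, so the relaxation in this sketch only changes the outcome for length == 0.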
