Arrow: reading rejects empty String/Binary column with a 0-byte offsets buffer (length == 0)

## Description

PR [#106395](<https://github.com/ClickHouse/ClickHouse/issues/106395>) added strict Arrow input validation. The new `checkBinaryOffsetsBuffer` (in `src/Processors/Formats/Impl/ArrowColumnToCHColumn.cpp`) requires a String/Binary column's offsets buffer to hold `length + 1` entries, i.e. ≥ 4 bytes **even when the column is empty**:

```cpp
const size_t required = sizeof(typename ArrowBinaryArray::offset_type)
    * static_cast<size_t>(chunk.offset() + chunk.length() + 1);   // +1 even when length == 0
if (unlikely(buffer_size < required))
    throw Exception(ErrorCodes::INCORRECT_DATA,
        "Arrow buffer too small for column '{}': {} bytes available, {} required", ...);
```

So an empty String/Binary column whose offsets buffer is 0 bytes is rejected:

```
Code: 117. DB::Exception: Arrow buffer too small for column 'key':
  0 bytes available, 4 required (INCORRECT_DATA)
```

A 0-byte offsets buffer for an empty variable-width array is not strictly spec-compliant (the spec wants a single `0`), but it is exactly what **Apache Arrow Java < 19.0.0 emits** for an empty `Array(String)` / `Map(String, …)` child (`BaseVariableWidthVector` writes `offsetBuffer.writerIndex(0)` when `valueCount == 0`; fixed in Arrow Java 19.0.0 via [apache/arrow-java#989](<https://github.com/apache/arrow-java/issues/989>)). Apache Spark bundles Arrow Java, so consumers such as the Spark connector cannot control the produced bytes. The same streams are read without error by **arrow-cpp, pyarrow and arrow-rs** (and by ClickHouse itself before [#106395](<https://github.com/ClickHouse/ClickHouse/issues/106395>)).
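To make the layout concrete, here is a minimal sketch, in plain Python with no Arrow dependency, of how a 32-bit offsets buffer follows from a list of values. The helper name `build_offsets` is hypothetical and is not part of any Arrow library. It only illustrates the `length + 1` rule, under which an empty column yields a single `0` entry (4 bytes) rather than 0 bytes.

```python
import struct

def build_offsets(values):
    """Return a little-endian int32 offsets buffer for a list of byte strings.

    Hypothetical helper for illustration: entry i is the start of value i,
    and the final entry is the total data length, so there are len(values) + 1
    entries in total.
    """
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    return struct.pack(f"<{len(offsets)}i", *offsets)

# Three values -> four offsets; zero values -> one offset (the single 0).
print(len(build_offsets([b"ab", b"", b"cde"])))
print(len(build_offsets([])))
```

Arrow Java < 19.0.0 corresponds to the case where the empty column's buffer has length 0 instead of the 4 bytes produced here.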

This only affects **String/Binary** columns. The numeric/fixed-size readers compute `required = elem_size * (offset + length)` (no `+ 1`), so an empty numeric column needs 0 bytes and passes; only the binary path adds the `+ 1`. The failing path is reached whenever every collection in a batch is empty (every map `{}` or every array `[]`), so the inner string child collapses to zero elements.
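The difference between the two formulas can be tabulated with a short sketch. The function names are hypothetical, and the 4-byte sizes assume 32-bit offsets and an `Int32` value column. The arithmetic mirrors the expressions quoted above, not the actual ClickHouse code.

```python
OFFSET_SIZE = 4  # assumption: 32-bit offsets (String/Binary, not the Large variants)

def required_offsets_bytes(offset, length, offset_size=OFFSET_SIZE):
    # Binary path: one extra entry marks the end of the last value.
    return offset_size * (offset + length + 1)

def required_fixed_bytes(elem_size, offset, length):
    # Numeric/fixed-size path: exactly one element per row, no extra entry.
    return elem_size * (offset + length)

for length in (0, 1, 2):
    print(length,
          required_offsets_bytes(0, length),
          required_fixed_bytes(4, 0, length))
```

At `length == 0` the binary path still demands 4 bytes while the fixed-size path demands none, which is why only String/Binary children trip the check.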

## How to reproduce

ClickHouse version: 26.6.1 (current master / `head`; introduced by [#106395](<https://github.com/ClickHouse/ClickHouse/issues/106395>)).

Both scripts below are also available as a gist: [https://gist.github.com/ShimonSte/14be2c7a57cb5c4212db9ca296fc0874](<https://gist.github.com/ShimonSte/14be2c7a57cb5c4212db9ca296fc0874>)

### 1. Minimal, no Spark (`pip install pyarrow requests`)

```python
#!/usr/bin/env python3
import sys, pyarrow as pa, requests

CH = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:8123/"
AUTH = {"X-ClickHouse-User": "default",
        "X-ClickHouse-Key": sys.argv[2] if len(sys.argv) > 2 else ""}

def q(body, query=None):
    return requests.post(CH, headers=AUTH, params={"query": query} if query else {}, data=body)

def ipc(batch):
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as w:
        w.write_batch(batch)
    return sink.getvalue().to_pybytes()

def insert(data, label):
    r = q(data, "INSERT INTO t_empty_map FORMAT ArrowStream")
    print(f"  {label:34s} -> HTTP {r.status_code}: "
          + ("OK" if r.status_code == 200 else r.text.strip().split(chr(10))[0]))

q(b"DROP TABLE IF EXISTS t_empty_map")
q(b"CREATE TABLE t_empty_map (id Int32, value Map(String, Int32)) ENGINE = Memory")

offsets = pa.array([0, 0, 0], type=pa.int32())   # two rows, both empty maps
items   = pa.array([], type=pa.int32())

# (a) pyarrow default: spec's single 0 offset -> 4-byte offsets buffer
key_default = pa.array([], type=pa.string())
insert(ipc(pa.record_batch([pa.array([1, 2], type=pa.int32()),
      pa.MapArray.from_arrays(offsets, key_default, items)], names=["id", "value"])),
      "pyarrow default (4-byte offsets)")

# (b) Arrow Java < 19 style: empty key child with a 0-byte offsets buffer
key_zero = pa.Array.from_buffers(pa.string(), 0, [None, pa.py_buffer(b""), pa.py_buffer(b"")])
insert(ipc(pa.record_batch([pa.array([1, 2], type=pa.int32()),
      pa.MapArray.from_arrays(offsets, key_zero, items)], names=["id", "value"])),
      "Arrow-Java <19 (0-byte offsets)")
```

Output against current `head` (26.6.1.853):

```
  pyarrow default (4-byte offsets)   -> HTTP 200: OK
  Arrow-Java <19 (0-byte offsets)    -> HTTP 400: Code: 117. DB::Exception: Arrow buffer too small for column 'key': 0 bytes available, 4 required ... (INCORRECT_DATA)
```

### 2. Real Apache Arrow Java (version contrast)

`ReproduceArrow.java` (in the gist) builds the same `Map(String, Int32)` batch of two empty maps with a real Arrow Java `MapVector` and inserts it. Running the *same code* against two Arrow Java versions:

```
===== arrow-vector 18.1.0 (bundled by Spark 4.0) =====
key child valueCount=0, offsets buffer=0 bytes
INSERT -> HTTP 400 REJECTED: Code: 117 ... Arrow buffer too small for column 'key': 0 bytes available, 4 required (INCORRECT_DATA)

===== arrow-vector 19.0.0 (apache/arrow-java#989 fix) =====
key child valueCount=0, offsets buffer=4 bytes
INSERT -> HTTP 200 OK
```

Same ClickHouse build, only the Arrow Java version differs: the offsets buffer goes 0 → 4 bytes and the insert flips from rejected to accepted.

## Expected behavior

An empty String/Binary column (`length == 0`) should be accepted, as it was before [#106395](<https://github.com/ClickHouse/ClickHouse/issues/106395>) and as arrow-cpp/pyarrow/arrow-rs accept it. No `value_offset(i)` is read for a zero-length column, so an absent/0-byte offsets buffer is harmless.
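As a rough illustration of the requested behavior, the sketch below models the check in Python with an early return for zero-length columns. It is a hypothetical model (the name `check_offsets_buffer` and the use of `ValueError` are assumptions made here), not the ClickHouse implementation and not necessarily how the eventual fix is written.

```python
def check_offsets_buffer(buffer_size, offset, length, offset_size=4):
    """Validate an offsets buffer size; hypothetical relaxed rule.

    A zero-length column never reads any offset, so its buffer size
    is not checked at all.
    """
    if length == 0:
        return
    required = offset_size * (offset + length + 1)
    if buffer_size < required:
        raise ValueError(
            f"Arrow buffer too small: {buffer_size} bytes available, "
            f"{required} required")

check_offsets_buffer(buffer_size=0, offset=0, length=0)   # accepted: empty column
check_offsets_buffer(buffer_size=8, offset=0, length=1)   # accepted: 2 offsets present
try:
    check_offsets_buffer(buffer_size=0, offset=0, length=1)
except ValueError as e:
    print("rejected:", e)
```

Non-empty columns keep the strict `length + 1` requirement, so the validation added for malformed input is unchanged everywhere except the empty case.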

## Question

Would you be open to skipping the offsets-buffer check (or treating it as satisfied) when `length == 0`? I'm not certain this is the right place to fix it, but the producer is Arrow Java bundled inside Spark, which we can't control, and the same stream is read without error by arrow-cpp, pyarrow and arrow-rs.

Related downstream issue: ClickHouse/spark-clickhouse-connector#556


### Version info
- Resolved by: #107764
- Merged into: `26.6.1.1123`
- Backported to: `26.5.4.4`, `26.4.5.61`, `26.3.15.3`, `25.8.25.31`
