
test(integration-tests): draft / regression repro — 0-column RecordBatch via custom TableProvider#65

Draft
Curricane wants to merge 1 commit into systemxlabs:main from Curricane:feat/zero-col-batch-regression

@Curricane Curricane commented May 12, 2026

Collaborator

Status

Draft / regression repro — not asking for merge yet.

This PR adds a reproducer / regression-guard fixture for the 0-column
RecordBatch boundary that panicked data-fabric's dist execution path
(data-fabric task #7/#10). The fix in data-fabric was on the
TableProvider side (normalize_projection(), commit 6e7b30a3);
this PR adds a regression guard on the dist side so that future dist
changes can't silently regress on this shape regardless of upstream
TableProvider hygiene.

The new test is gated #[ignore] so default cargo test skips it and
main CI stays green. If dist already handles the 0-col batch shape
stably, we can later drop the gate and treat this as a normal PR; if it
panics, the failure stack is reproducible here as a regression artifact.

What this PR adds

  1. ZeroColTable / ZeroColExec in integration-tests/src/data.rs.

    • A custom TableProvider that intentionally does NOT normalize
      projection = Some(empty) to Some(vec![0]).
    • Two variants are registered on the dist server's session context:
      • zero_col_small — 1 partition / 2 rows. Drives the small-table /
        single-task code path (historically: dist/src/util.rs:113
        assert_eq! on column counts when consolidating partial
        aggregates).
      • zero_col_large — 4 partitions × 2 rows = 8 rows total. Drives
        the repartition / coalesce path (historically: arrow-select
        coalesce.rs:462 assert_eq! when feeding 0-column batches through
        RepartitionExec / CoalescePartitionsExec).
  2. integration-tests/tests/zero_col_regression.slt — exercises
    both fixtures via:

    • COUNT(*) on each table (the lowering that emits
      projection = Some(vec![])).
    • SELECT * sanity check (non-empty projection through the same
      provider).
    • COUNT(*) WHERE always-false (empty result through empty
      projection).
    • COUNT(*) over a Partitioned-mode hash join on zero_col_large
      to force a RepartitionExec above the 0-col scan stream.
    • A cross-table scalar subquery to confirm 0-col emit doesn't poison
      adjacent MemTable batches.
  3. A separate gated runner sqllogictest_zero_col_regression in
    integration-tests/tests/sqllogictest.rs, marked #[ignore] and
    documented with run instructions. The main sqllogictest test is
    unchanged.
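The fixture shape described above can be modelled without DataFusion. The sketch below is a std-only illustration, not the code in integration-tests/src/data.rs: `ModelBatch`, `ModelZeroColTable`, `count_star`, and the row values are hypothetical stand-ins, and only the `Option<&Vec<usize>>` projection parameter is borrowed from the shape of `TableProvider::scan`.

```rust
/// Stand-in for an Arrow RecordBatch: columns plus an explicit row count.
#[derive(Debug, Clone, PartialEq)]
pub struct ModelBatch {
    pub columns: Vec<Vec<i64>>,
    pub num_rows: usize,
}

/// Hypothetical model of a single-column table split into partitions.
pub struct ModelZeroColTable {
    /// partitions[p] holds the values of the only column in partition p.
    pub partitions: Vec<Vec<i64>>,
}

impl ModelZeroColTable {
    /// Scan one partition. An empty projection is deliberately NOT rewritten
    /// to `vec![0]`, so it yields zero columns with a nonzero row count.
    pub fn scan(&self, partition: usize, projection: Option<&Vec<usize>>) -> ModelBatch {
        let data = &self.partitions[partition];
        let columns = match projection {
            // No projection: return the table's only column.
            None => vec![data.clone()],
            // Explicit projection: one output column per requested index.
            Some(indices) => indices
                .iter()
                .map(|&i| {
                    assert_eq!(i, 0, "the model table has a single column");
                    data.clone()
                })
                .collect(),
        };
        ModelBatch { columns, num_rows: data.len() }
    }
}

/// Model of COUNT(*): scan every partition with an empty projection and
/// sum the row counts carried by the 0-column batches.
pub fn count_star(table: &ModelZeroColTable) -> usize {
    let empty: Vec<usize> = Vec::new();
    (0..table.partitions.len())
        .map(|p| table.scan(p, Some(&empty)).num_rows)
        .sum()
}

/// 1 partition with 2 rows (values are illustrative).
pub fn zero_col_small() -> ModelZeroColTable {
    ModelZeroColTable { partitions: vec![vec![1, 2]] }
}

/// 4 partitions with 2 rows each, 8 rows total (values are illustrative).
pub fn zero_col_large() -> ModelZeroColTable {
    ModelZeroColTable {
        partitions: vec![vec![1, 2], vec![3, 4], vec![5, 6], vec![7, 8]],
    }
}
```

In this model, `count_star(&zero_col_large())` sums four batches that each carry no columns and two rows, which is the shape the .slt file pushes through dist.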

Repro instructions

# Default cargo test — the new file does NOT run.
cargo test -p datafusion-dist-integration-tests --test sqllogictest

# Run only the new regression test.
cargo test -p datafusion-dist-integration-tests --test sqllogictest \
  -- --ignored sqllogictest_zero_col_regression

Local validation status — needs a docker compose-capable environment

The code compiles and lints cleanly, but the new test has not yet been
run end-to-end on the machine that authored this PR.

What was verified locally:

  • cargo check -p datafusion-dist-integration-tests --tests — clean.
  • cargo clippy --workspace --tests -- -D warnings — clean.
  • cargo fmt --check — clean.
  • cargo test -p datafusion-dist-integration-tests --test sqllogictest
    (default suite, without --ignored) compiles and the unchanged
    sqllogictest test still runs.

What is blocked here:

The integration-tests harness uses Docker Compose v2
(docker compose -p ... up/down) inside integration-tests/src/docker.rs
to bring up the dist server / postgres / etc. The authoring environment
is a WSL distro where:

  • docker compose (v2 plugin) is not installed.
  • Docker Desktop's docker-compose (v1) WSL integration is not
    enabled in this distro.

As a result, setup_containers() immediately fails at the
down -v --remove-orphans precondition with:

unknown shorthand flag: 'p' in -p

thread 'sqllogictest_zero_col_regression' panicked at
  integration-tests/src/docker.rs:10:9:
Stopping docker compose in .../integration-tests, project name:
integration-tests-containers failed: ExitStatus(unix_wait_status(32000))

This is an environment gap, not a dist-side panic. It would block
the existing sqllogictest test in exactly the same way; it is not
specific to the new fixture.
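A preflight probe would make this kind of environment gap easier to tell apart from a real test failure. The sketch below is not part of integration-tests/src/docker.rs: `compose_v2_available` and `looks_like_missing_compose_plugin` are hypothetical helpers, and the sketch assumes that `docker compose version` exits non-zero when the v2 plugin is absent.

```rust
use std::process::Command;

/// Returns true when `docker compose version` exits successfully, which is
/// taken here as a sign that the Compose v2 plugin is reachable through the
/// `docker` CLI. Any spawn error (for example, no `docker` binary on PATH)
/// is treated as "not available".
pub fn compose_v2_available() -> bool {
    Command::new("docker")
        .args(&["compose", "version"])
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}

/// Recognises the stderr line quoted above, which appears when `docker`
/// parses `-p` as one of its own flags instead of handing it to Compose.
pub fn looks_like_missing_compose_plugin(stderr: &str) -> bool {
    stderr.contains("unknown shorthand flag: 'p' in -p")
}
```

A harness could call `compose_v2_available()` before attempting the compose teardown and report a missing plugin explicitly rather than surfacing an opaque exit status.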

What needs to happen next

Someone with a docker compose-capable environment (any of: Linux box
with the docker-compose-plugin package, macOS with Docker Desktop,
WSL with Docker Desktop's WSL integration enabled, or the project's CI
runners) needs to run exactly:

cargo test -p datafusion-dist-integration-tests --test sqllogictest \
  -- --ignored sqllogictest_zero_col_regression

and post the outcome on this PR. Based on that outcome we then decide:

  • Pass → drop #[ignore] in a follow-up commit, mark this PR as a
    normal regression-guard, ready for review.
  • Panic / fail → keep the gate, paste the failure stack here, and
    open a separate dist-side implementation fix task; this PR stays as
    the reproducer artifact for that fix.

Why this can't be done with a MemTable fixture

DataFusion's built-in MemoryExec / DataSourceExec paths already
normalize projection = Some(empty) via with_row_count internally —
they never emit a true 0-column batch into downstream stages. The
five count(*) cases that landed in #63 all run against MemTable and
therefore do not exercise the dist 0-col path. A custom
TableProvider that mirrors the data-fabric pre-fix shape is the only
way to repro the boundary inside this repo.

Outcome interpretation

  • If the new test passes: dist handles the shape stably today.
    Flip the gate (remove #[ignore]) in a follow-up commit and treat
    this as a normal regression-guard PR.
  • If the new test panics: the failure stack and panic site (most
    likely dist/src/util.rs:113 or arrow-select coalesce.rs:462) are
    the actionable artifacts for a dist-side fix. The draft stays draft
    until the dist-side fix lands.

Out of scope

  • The data-fabric normalize_projection() fix is upstream of dist
    and is not modified or assumed here. Dist's regression guard must be
    independent of upstream TableProvider hygiene.
  • Outer-join unmatched-emit (issue #64, "LEFT JOIN with no probe
    matches returns 0 rows instead of preserved build-side rows"):
    separate investigation.
  • The indexlake check_insert_batch_field is_nullable mismatch
    (separate thread / separate task) — kept independent per review
    guidance.

References

…(gated)

Add a `ZeroColTable` / `ZeroColExec` fixture in `integration-tests/src/data.rs`
and a gated `tests/zero_col_regression.slt` that drives a
0-column / `row_count > 0` RecordBatch through the dist execution path.

Why this matters
----------------
A custom `TableProvider` that does not normalize `projection = Some(empty)`
into `Some(vec![0])` will, for `COUNT(*)`, emit a RecordBatch with 0
columns but `row_count > 0`. This shape has historically panicked dist in
two places:

  - `dist/src/util.rs:113` on the small-table / single-task path when
    consolidating partial aggregates.
  - `arrow-select coalesce.rs:462` on the repartition path when
    `CoalescePartitionsExec` feeds the empty-column buffer.
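
Both sites are described as `assert_eq!` checks on column counts. The std-only sketch below models that kind of check as a recoverable error instead of a panic. It is not the code at either site: `ModelBatch` and `concat_checked` are hypothetical names, and the consolidation logic is an assumption made for illustration.

```rust
/// Stand-in for an Arrow RecordBatch: columns plus an explicit row count.
#[derive(Debug, Clone, PartialEq)]
pub struct ModelBatch {
    pub columns: Vec<Vec<i64>>,
    pub num_rows: usize,
}

/// Concatenate batches that are all expected to have `expected_columns`
/// columns. A mismatch (for example a 0-column batch where one column was
/// expected) is returned as an error; an `assert_eq!` in the same position
/// would panic instead.
pub fn concat_checked(
    batches: &[ModelBatch],
    expected_columns: usize,
) -> Result<ModelBatch, String> {
    let mut columns: Vec<Vec<i64>> = vec![Vec::new(); expected_columns];
    let mut num_rows = 0;
    for (i, batch) in batches.iter().enumerate() {
        if batch.columns.len() != expected_columns {
            return Err(format!(
                "batch {}: expected {} columns, got {} ({} rows)",
                i,
                expected_columns,
                batch.columns.len(),
                batch.num_rows
            ));
        }
        // Append each source column onto the matching output column.
        for (dst, src) in columns.iter_mut().zip(&batch.columns) {
            dst.extend_from_slice(src);
        }
        // The row count is tracked separately so 0-column batches still count.
        num_rows += batch.num_rows;
    }
    Ok(ModelBatch { columns, num_rows })
}
```

With `expected_columns = 0`, two 0-column batches of 2 rows each combine into one 0-column batch of 4 rows; with `expected_columns = 1`, the same input yields an error that names the offending batch.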

The `data-fabric` project landed a TableProvider-side fix
(`normalize_projection()`, commit `6e7b30a3`) that converts the empty
projection vector into `Some(vec![0])` before the plan ever reaches
dist, sidestepping both panics. That fix is purely upstream of dist —
the dist side still needs its own regression guard so that any future
dist change that re-introduces a panic on this exact shape is caught
here regardless of upstream TableProvider hygiene.
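
The upstream rewrite described above can be sketched in a few lines. Only the name `normalize_projection` and the `Some(empty)` to `Some(vec![0])` behaviour come from this description; the signature below is an assumption and the `data-fabric` implementation may differ.

```rust
/// Hypothetical sketch: replace an empty projection with a projection of
/// column 0 so that a scan never produces a 0-column batch. Assumes the
/// table has at least one column.
pub fn normalize_projection(projection: Option<Vec<usize>>) -> Option<Vec<usize>> {
    match projection {
        // COUNT(*)-style lowering: no columns requested, so pick column 0.
        Some(indices) if indices.is_empty() => Some(vec![0]),
        // `None` (all columns) and non-empty projections pass through unchanged.
        other => other,
    }
}
```

The `ZeroColTable` fixture deliberately skips this step, which is why it can still emit the 0-column shape into dist.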

What this PR adds
-----------------
- `ZeroColTable`: a `TableProvider` that intentionally does NOT
  normalize empty projections. Two registered variants:
    * `zero_col_small` — 1 partition / 2 rows, drives the small-table
      path.
    * `zero_col_large` — 4 partitions × 2 rows, drives the
      repartition / coalesce path (also exercised through a
      `Partitioned` hash join to force `RepartitionExec`).
- `tests/zero_col_regression.slt` — exercises both via `COUNT(*)`,
  `SELECT *` sanity, an empty-result-set query, the partitioned join,
  and a cross-table scalar subquery.
- A separate `#[tokio::test] #[ignore]` runner
  (`sqllogictest_zero_col_regression`) so the default `cargo test` run
  does not exercise this and main CI stays green until dist is proven
  stable against the shape. Run on demand with
  `cargo test --test sqllogictest -- --ignored sqllogictest_zero_col_regression`.

This is intentionally a draft PR / regression-repro: if dist already
handles the shape stably, the test passes and we can flip the gate
later; if dist panics, the failure stack is reproducible here.