[BUG][SPARK] First streaming micro-batch overwrites batch data on tables with metadata.iceberg.storage enabled #7457

@gregdiy

Description

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

1.3.1

Environment

  • Spark version: 3.5.2

Table Properties

'metadata.iceberg.storage' = 'hive-catalog'
'metadata.iceberg.storage-location' = 'table-location'
'merge-engine' = 'deduplicate'
'bucket' = '[your value]'
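For context, such a table might be created in Spark SQL roughly as follows. This is a sketch only: the columns and the bucket count of 4 are hypothetical stand-ins, not values from this report; the table name matches the reproduction steps below.

```sql
-- Hypothetical Paimon primary-key table with Iceberg compatibility enabled.
-- Columns and bucket count are placeholders.
CREATE TABLE paimon_catalog.tables.table1 (
    id   BIGINT,
    data STRING
) TBLPROPERTIES (
    'primary-key' = 'id',
    'merge-engine' = 'deduplicate',
    'bucket' = '4',
    'metadata.iceberg.storage' = 'hive-catalog',
    'metadata.iceberg.storage-location' = 'table-location'
);
```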

Steps to Reproduce

  1. Create Paimon primary key table with metadata.iceberg.storage = hive-catalog
  2. Write batch data via Spark saveAsTable with mode(append) — 400K rows commit successfully
  3. Verify row count — 400K rows confirmed
  4. Start Spark Structured Streaming job writing to same table via foreachBatch with saveAsTable and mode(append)
  5. First micro-batch completes — table now contains only the streaming micro-batch rows (~100 rows)

Expected Behavior

Streaming micro-batches should append to existing batch data. Both batch-only and stream-only work correctly in isolation.

Actual Behavior

The first streaming micro-batch after a batch write overwrites the entire table, leaving only the rows written by that micro-batch. Subsequent micro-batches append correctly to each other.

Snapshot Sequence Observed

snapshot 1 - APPEND    (batch,  400K rows)
snapshot 2 - APPEND    (stream, 100 rows)
snapshot 3 - OVERWRITE (stream, 100 rows) ← table reset here
snapshot 4 - APPEND    (stream, 100 rows)
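A sequence like this can be verified from Paimon's snapshots system table; a sketch using the table name from the reproduction (commit_kind is the column that distinguishes APPEND from OVERWRITE commits):

```sql
-- List commits to locate the unexpected OVERWRITE snapshot.
SELECT snapshot_id, commit_user, commit_kind, commit_time
FROM paimon_catalog.tables.`table1$snapshots`
ORDER BY snapshot_id;
```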

What Was Tried

  • Setting streaming-read-overwrite = false on the table: did not resolve the issue
  • Setting metadata.iceberg.sync-interval = -1: prevents the overwrite, but disables Iceberg sync, which is a required feature

Notes

  • Stream-only writes work correctly with no overwrites
  • Batch-only writes work correctly
  • The issue only occurs when a streaming job starts after a batch write on a table with metadata.iceberg.storage enabled
  • Disabling metadata.iceberg.sync-interval is not an acceptable workaround as live Iceberg metadata is a core requirement

Compute Engine

  • Apache Spark 3.5.2 (Paimon 1.3.1)

Minimal reproduce step

  1. Create a Paimon primary key table with metadata.iceberg.storage = hive-catalog

  2. Write batch data via Spark saveAsTable with mode("append"):

       df.write.format("paimon").mode("append").saveAsTable("paimon_catalog.tables.table1")

     → confirms 400K rows

  3. Start a Spark Structured Streaming job writing to the same table via foreachBatch:

       def write_batch(df, epoch_id):
           df.write.format("paimon").mode("append").saveAsTable("paimon_catalog.tables.table1")

       streaming_df.writeStream.foreachBatch(write_batch).start()

  4. After the first micro-batch completes, query the table:

       SELECT COUNT(*) FROM paimon_catalog.tables.table1

     → returns only the streaming micro-batch row count; the batch data is gone

What doesn't meet your expectations?

After the first streaming micro-batch completes, the table contains only the rows
written by that micro-batch. The 400K rows written by the batch job are gone.

Expected: streaming micro-batches append to existing batch data, final row count = 400K + streaming rows

Actual: first micro-batch overwrites the entire table, final row count = micro-batch rows only (~100 rows)

Snapshot sequence observed:
snapshot 1 - APPEND (batch job, 400K rows)
snapshot 2 - APPEND (stream micro-batch 1, 100 rows)
snapshot 3 - OVERWRITE (stream micro-batch 2, 100 rows) ← table reset occurs here
snapshot 4 - APPEND (stream micro-batch 3, 100 rows)

This issue ONLY occurs when:

  • metadata.iceberg.storage = hive-catalog is enabled on the table
  • A batch write precedes the start of the streaming job

Stream-only and batch-only both work correctly in isolation.
The following property changes were tried and did not provide an acceptable fix:

  • streaming-read-overwrite = false (no effect on the overwrite)
  • metadata.iceberg.sync-interval = -1 (prevents the overwrite but disables Iceberg sync, which is a required feature)
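For completeness, the attempted property changes can be applied with standard Spark SQL ALTER TABLE statements (a sketch using the reproduction's table name):

```sql
-- Tried: disable streaming read of overwrite snapshots (did not resolve).
ALTER TABLE paimon_catalog.tables.table1
    SET TBLPROPERTIES ('streaming-read-overwrite' = 'false');

-- Tried: disable Iceberg metadata sync (avoids the overwrite, but sync is required).
ALTER TABLE paimon_catalog.tables.table1
    SET TBLPROPERTIES ('metadata.iceberg.sync-interval' = '-1');
```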

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
