Bug: First streaming micro-batch overwrites batch data when metadata.iceberg.storage is enabled

Search before asking
- I searched in the issues and found nothing similar.
Environment
- Paimon version: 1.3.1
- Spark version: 3.5.2
Table Properties
'metadata.iceberg.storage' = 'hive-catalog'
'metadata.iceberg.storage-location' = 'table-location'
'merge-engine' = 'deduplicate'
'bucket' = '[your value]'

Steps to Reproduce
- Create a Paimon primary key table with metadata.iceberg.storage = hive-catalog
- Write batch data via Spark saveAsTable with mode(append) — 400K rows commit successfully
- Verify row count — 400K rows confirmed
- Start a Spark Structured Streaming job writing to the same table via foreachBatch with saveAsTable and mode(append)
- First micro-batch completes — table now contains only the streaming micro-batch rows (~100 rows)
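In plain-Python terms (a toy model, not Spark or Paimon code — names and row counts simply mirror the report), the append-only behavior these steps expect looks like this:

```python
# Toy model of the EXPECTED append-only behavior (plain Python, no Spark/Paimon).
table = ["batch_row"] * 400_000        # rows committed by the initial batch write

def write_batch(rows, epoch_id):
    """Stand-in for the foreachBatch sink: mode("append") should only add rows."""
    table.extend(rows)

for epoch_id in range(3):              # three streaming micro-batches of ~100 rows
    write_batch(["stream_row"] * 100, epoch_id)

print(len(table))                      # expected: 400300 (400K batch + 300 stream)
```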
Expected Behavior
Streaming micro-batches should append to existing batch data. Both batch-only and stream-only work correctly in isolation.
Actual Behavior
The first streaming micro-batch after a batch write overwrites the entire table, leaving only the rows written by that micro-batch. Subsequent micro-batches append correctly to each other.
Snapshot Sequence Observed
append - snapshot 1 (batch, 400K rows)
append - snapshot 2 (stream, 100 rows)
overwrite - snapshot 3 (stream, 100 rows) ← table reset here
append - snapshot 4 (stream, 100 rows)
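The observed sequence can be modeled with a toy commit loop (plain Python, not Paimon internals) to make the data loss explicit:

```python
# Toy model of the OBSERVED commit sequence (plain Python, not Paimon internals).
table = []
history = []

def commit(kind, rows):
    global table
    if kind == "append":
        table = table + rows
    else:                                   # "overwrite" replaces all existing data
        table = rows
    history.append((kind, len(table)))

commit("append", ["batch"] * 400_000)       # snapshot 1
commit("append", ["mb1"] * 100)             # snapshot 2
commit("overwrite", ["mb2"] * 100)          # snapshot 3 — 400K+100 rows discarded
commit("append", ["mb3"] * 100)             # snapshot 4

print(history[-2:])                         # [('overwrite', 100), ('append', 200)]
```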
What Was Tried
- Setting streaming-read-overwrite = false on the table — did not resolve
- Setting metadata.iceberg.sync-interval = -1 — resolves the overwrite but disables Iceberg sync, which is a required feature
Notes
- Stream-only writes work correctly with no overwrites
- Batch-only writes work correctly
- The issue only occurs when a streaming job starts after a batch write on a table with metadata.iceberg.storage enabled
- Disabling metadata.iceberg.sync-interval is not an acceptable workaround, as live Iceberg metadata is a core requirement
Compute Engine
Apache Spark 3.5.2
Minimal reproduce step
- Create a Paimon primary key table with metadata.iceberg.storage = hive-catalog

- Write batch data via Spark saveAsTable with mode(append):

  df.write.format("paimon").mode("append").saveAsTable("paimon_catalog.tables.table1")

  → confirms 400K rows

- Start a Spark Structured Streaming job writing to the same table via foreachBatch:

  def write_batch(df, epoch_id):
      df.write.format("paimon").mode("append").saveAsTable("paimon_catalog.tables.table1")

  streaming_df.writeStream.foreachBatch(write_batch).start()

- After the first micro-batch completes, query the table:

  SELECT COUNT(*) FROM paimon_catalog.tables.table1

  → returns only the streaming micro-batch row count; the batch data is gone
What doesn't meet your expectations?
After the first streaming micro-batch completes, the table contains only the rows
written by that micro-batch. The 400K rows written by the batch job are gone.
Expected: streaming micro-batches append to existing batch data, final row count = 400K + streaming rows
Actual: first micro-batch overwrites the entire table, final row count = micro-batch rows only (~100 rows)
Snapshot sequence observed:
snapshot 1 - APPEND (batch job, 400K rows)
snapshot 2 - APPEND (stream micro-batch 1, 100 rows)
snapshot 3 - OVERWRITE (stream micro-batch 2, 100 rows) ← table reset occurs here
snapshot 4 - APPEND (stream micro-batch 3, 100 rows)
This issue ONLY occurs when:
- metadata.iceberg.storage = hive-catalog is enabled on the table
- A batch write precedes the start of the streaming job
Stream-only and batch-only both work correctly in isolation.
The following properties did not resolve the issue:
- streaming-read-overwrite = false
- metadata.iceberg.sync-interval = -1 (resolves the overwrite but disables Iceberg sync, which is a required feature)
Anything else?
No response
Are you willing to submit a PR?
- I'm willing to submit a PR!