[da-vinci] Flush Global RT DIV state on graceful shutdown by KaiSernLim · Pull Request #2836 · linkedin/venice

KaiSernLim · 2026-05-29T00:15:06Z

Problem Statement

Venice's Global RT DIV feature persists DIV (Data Integrity Validation) state through two consumer/leader-driven paths, both gated by byte-count thresholds during steady-state ingestion (the RT DIV produce-to-VT path and the VT DIV snapshot path).

On a graceful shutdown, executeShutdownRunnable() and updateOffsetMetadataAndSyncOffset() both return early when isGlobalRtDivEnabled() is true. This is deliberate: Global RT DIV is consumer-driven, so the drainer sync is disabled to avoid interfering with it. As a result, nothing flushes the in-memory RT/VT DIV deltas accumulated since the last threshold-triggered sync. On restart the server replays from the last synced point, causing a post-restart bootstrap slowdown. The non-Global-RT-DIV (OffsetRecord) path already closes this gap via the drainer SYNC_OFFSET command.

Solution

Add a single on-demand DIV sync, forceGlobalRtDivSync(pcs), that flushes both the RT and VT DIV state, invoked during graceful shutdown — mirroring what the OffsetRecord path already does. It is gated by the existing isServerIngestionCheckpointDuringGracefulShutdownEnabled server config (no new flag).

Leader: for each RT source broker whose latest-consumed RT position (LCRP) is not EARLIEST, produce a GlobalRtDivState via a position-based sendGlobalRtDivMessage variant. The produce callback already chains the VT DIV + LCVP sync, so both halves are covered.
Follower / no-RT-progress: enqueue a new waitable SYNC_GLOBAL_RT_DIV drainer command that snapshots VT DIV in the drainer thread and syncs it to the OffsetRecord, guarded against persisting EARLIEST. RT DIV is already durable from when the follower consumed GlobalRtDivState.
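The branch selection described in the two cases above can be pictured with a small sketch. It is illustrative only, not the PR's code: the class name, the `plan` method, the string actions, and the `-1L` stand-in for EARLIEST are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ForceSyncPlanSketch {
  // Hypothetical sentinel standing in for the EARLIEST position.
  static final long EARLIEST = -1L;

  /** Returns the actions a forced sync would take for one partition. */
  static List<String> plan(boolean isLeader, Map<String, Long> lcrpByBroker) {
    List<String> actions = new ArrayList<>();
    if (isLeader) {
      for (Map.Entry<String, Long> entry : lcrpByBroker.entrySet()) {
        // Skip brokers with no RT progress: there is nothing to flush for them.
        if (entry.getValue() != EARLIEST) {
          actions.add("produce GlobalRtDivState for " + entry.getKey());
        }
      }
    }
    // Follower, or leader with no RT progress: fall back to the VT-only sync.
    if (actions.isEmpty()) {
      actions.add("enqueue VT DIV snapshot sync");
    }
    return actions;
  }

  public static void main(String[] args) {
    Map<String, Long> positions = new LinkedHashMap<>();
    positions.put("brokerA", 42L);
    positions.put("brokerB", EARLIEST);
    System.out.println(plan(true, positions));
    System.out.println(plan(false, positions));
  }
}
```

In this sketch a leader with progress on only one broker produces a single message, while a follower always takes the snapshot path.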

forceGlobalRtDivSync returns an aggregate CompletableFuture. The shutdown await reuses waitForSyncOffsetCmd's timeout/cancel semantics (bounded by getShutdownSyncOffsetTimeoutMs()), so shutdown never hangs on a produce/await failure. The flush runs inside shutdownPartitionConsumptionStates() — after consumerBatchUnsubscribeAllTopics (no new RT records arriving) and before closeVeniceWriters (VeniceWriter still alive) — exactly the window where the "don't interfere" concern no longer applies.

Trade-offs: the work is bounded by RT broker count, runs only on graceful shutdown (not a steady-state hot path), and is best-effort (failures are logged, never thrown, and capped by the shutdown timeout).
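A minimal sketch of the aggregate-future-plus-bounded-await pattern, assuming plain `CompletableFuture` semantics. The class and method names here are hypothetical and do not mirror the actual waitForSyncOffsetCmd implementation; the sketch only shows why a stuck per-broker flush cannot hang shutdown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ShutdownFlushSketch {
  /** Combines per-broker flush futures into one aggregate future. */
  static CompletableFuture<Void> aggregate(List<CompletableFuture<Void>> perBroker) {
    return CompletableFuture.allOf(perBroker.toArray(new CompletableFuture<?>[0]));
  }

  /**
   * Waits at most timeoutMs for the flush. Returns true only on normal completion;
   * timeouts and failures are logged and reported as false, never thrown.
   */
  static boolean awaitBestEffort(CompletableFuture<Void> flush, long timeoutMs) {
    try {
      flush.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      flush.cancel(true); // give up so shutdown can proceed
      System.err.println("flush timed out after " + timeoutMs + " ms");
      return false;
    } catch (ExecutionException | CancellationException e) {
      System.err.println("flush failed: " + e);
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }

  public static void main(String[] args) {
    List<CompletableFuture<Void>> futures = new ArrayList<>();
    futures.add(CompletableFuture.completedFuture(null));
    futures.add(new CompletableFuture<>()); // never completes: simulates a stuck produce
    System.out.println(awaitBestEffort(aggregate(futures), 100));
  }
}
```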

Code changes

Added new code behind a config. Reuses the existing isServerIngestionCheckpointDuringGracefulShutdownEnabled (no new config introduced).
Introduced new log lines. One best-effort error log on per-broker produce failure (shutdown path, not rate-limit sensitive).

Concurrency-Specific Checks

Code has no race conditions or thread safety issues. The VT DIV snapshot is taken inside the single-threaded drainer; ordering relies on the drainer's FIFO guarantee.
Proper synchronization mechanisms are used where needed (drainer command queue + completion future).
No blocking calls inside critical sections that could lead to deadlocks; the shutdown await is bounded by the sync-offset timeout.
Verified thread-safe collections are used.
Validated proper exception handling — interrupts complete the future exceptionally (rather than throwing) so the shutdown await never hangs.

How was this PR tested?

New unit tests added (leader per-broker produce, follower VT-only sync, EARLIEST-skip, follower EARLIEST-LCVP guard, shutdown wiring, timeout-does-not-hang, drainer SYNC_GLOBAL_RT_DIV routing).
Modified or extended existing tests (sendGlobalRtDivMessage updated to the position-based signature; steady-state values unchanged).
Verified backward compatibility — steady-state behavior is unchanged; the position-based refactor passes the same values it did before.

Test runs (clients/da-vinci-client): LeaderFollowerStoreIngestionTaskTest (149), StoreBufferServiceTest (22), and the SITWith* subclasses covering StoreIngestionTask (322 each) — all green, 0 failures.

Does this PR introduce any user-facing or breaking changes?

No.

Global RT DIV persists RT/VT DIV state through byte-threshold-triggered consumer/leader paths and deliberately disables the drainer SYNC_OFFSET sync during graceful shutdown. As a result, the in-memory RT/VT DIV deltas accumulated since the last threshold-triggered sync were never flushed on a graceful stop, causing a post-restart bootstrap slowdown. The non-Global-RT-DIV (OffsetRecord) path already closes this gap via the drainer SYNC_OFFSET command.

Add an on-demand forceGlobalRtDivSync(pcs) invoked from executeShutdownRunnable, gated by the existing isServerIngestionCheckpointDuringGracefulShutdownEnabled config:

- Leader: produce a GlobalRtDivState per RT source broker (skipping brokers whose LCRP is EARLIEST) via a position-based sendGlobalRtDivMessage variant; the produce callback chains the VT DIV + LCVP sync, covering both halves.
- Follower / no-RT-progress: enqueue a waitable SYNC_GLOBAL_RT_DIV drainer command that snapshots VT DIV in the drainer thread and syncs it to the OffsetRecord, guarded against persisting EARLIEST.

forceGlobalRtDivSync returns an aggregate future; the shutdown await reuses waitForSyncOffsetCmd's timeout/cancel semantics so shutdown never hangs on a produce/await failure. Runs inside shutdownPartitionConsumptionStates() (after consumerBatchUnsubscribeAllTopics, before closeVeniceWriters), so the VeniceWriter is alive and no new RT records are arriving.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR addresses a graceful-shutdown gap in Global RT DIV state persistence by adding a best-effort, on-demand flush path that runs during shutdown to persist accumulated RT/VT DIV deltas that would otherwise be lost until the next byte-threshold-triggered sync (causing slower post-restart bootstrap).

Changes:

Updated graceful shutdown checkpointing to invoke a Global-RT-DIV-specific flush (forceGlobalRtDivSync) when Global RT DIV is enabled, while preserving the existing SYNC_OFFSET drainer path for non-Global-RT-DIV.
Added a new waitable drainer command (SYNC_GLOBAL_RT_DIV) that snapshots VT DIV in the drainer thread and persists it via the OffsetRecord sync path.
Refactored sendGlobalRtDivMessage to accept an explicit position + topic-partition and return the LeaderProducerCallback so shutdown code can wait on persistence completion; added unit tests covering leader/follower shutdown sync behaviors and EARLIEST guards.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTask.java	Routes graceful shutdown to either `forceGlobalRtDivSync` (Global RT DIV) or SYNC_OFFSET (non-Global), and adds drainer-thread VT snapshot sync (`syncGlobalRtDivFromSnapshot`).
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java	Implements the shutdown-time Global RT DIV flush (leader per-broker produce + follower VT snapshot fallback) and refactors `sendGlobalRtDivMessage` to a position-based signature returning the producer callback.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/StoreBufferService.java	Adds a new waitable `SYNC_GLOBAL_RT_DIV` command type and plumbing to enqueue/execute it.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/AbstractStoreBufferService.java	Extends the abstract API with `execSyncGlobalRtDivCommandAsync`.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/SeparatedStoreBufferService.java	Delegates the new `execSyncGlobalRtDivCommandAsync` to the appropriate underlying buffer service.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderProducerCallback.java	Exposes `LeaderProducedRecordContext` via a getter so shutdown code can wait on persistence.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTaskTest.java	Adds shutdown-path unit tests specific to Global RT DIV sync invocation and timeout-bounded waiting behavior.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/StoreBufferServiceTest.java	Adds unit test ensuring `SYNC_GLOBAL_RT_DIV` routes to `syncGlobalRtDivFromSnapshot` and remains non-hanging when PCS is null.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTaskTest.java	Adds unit tests for leader per-broker forced sync, EARLIEST-skip behavior, follower VT-only forced sync, and EARLIEST LCVP guard in snapshot sync.

…nd type

Replace the SYNC_GLOBAL_RT_DIV CommandType with a dedicated drainer node, mirroring the existing SyncVtDivNode, so the follower / no-RT-progress shutdown path reads as "enqueue a waitable VT DIV sync node" rather than overloading the CommandQueueNode "command" abstraction with a second, behaviorally-different meaning.

- Add SyncGlobalRtDivNode whose execute() calls syncGlobalRtDivFromSnapshot, wired into the drainer loop alongside SyncVtDivNode.
- Extract a WaitableQueueNode base holding the LockAssistedCompletableFuture and the lock-guarded executeGuarded(), shared by CommandQueueNode and SyncGlobalRtDivNode, preserving the "once executing, cannot be cancelled" guarantee waitForSyncOffsetCmd relies on (no duplication).
- Revert CommandType to { SYNC_OFFSET } and processCommand / execSyncOffsetCommandAsync to their original single-purpose forms.
- Rename the enqueue method execSyncGlobalRtDivCommandAsync -> execSyncGlobalRtDivAsync (it no longer routes through CommandQueueNode).
- Consolidate the null-PCS guard into syncGlobalRtDivFromSnapshot, dropping the redundant pre-check in processCommand.

Tests: testExecSyncGlobalRtDivAsync now also asserts the future completes exceptionally (shutdown never hangs) when the sync throws; the syncGlobalRtDivFromSnapshot test gains a null-PCS case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
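The "once executing, cannot be cancelled" guarantee can be illustrated with a simplified, hypothetical node. This is not the real WaitableQueueNode or LockAssistedCompletableFuture; it only assumes that execution and cancellation are serialized through a shared lock, and every name below is invented for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.ReentrantLock;

abstract class WaitableNodeSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final CompletableFuture<Void> done = new CompletableFuture<>();

  /** The node's actual work, run on the drainer thread. */
  protected abstract void execute() throws Exception;

  /** Called by the drainer: runs the work unless the node was cancelled first. */
  final void executeGuarded() {
    lock.lock();
    try {
      if (done.isDone()) {
        return; // cancelled before execution started
      }
      try {
        execute();
        done.complete(null);
      } catch (Exception e) {
        done.completeExceptionally(e); // waiters fail instead of hanging
      }
    } finally {
      lock.unlock();
    }
  }

  /** Called by a waiter that timed out; blocks while the node is mid-execution. */
  final boolean cancelIfNotStarted() {
    lock.lock();
    try {
      return done.cancel(false); // false if the node already completed
    } finally {
      lock.unlock();
    }
  }

  final CompletableFuture<Void> future() {
    return done;
  }
}

final class CountingNode extends WaitableNodeSketch {
  int runs = 0;

  @Override
  protected void execute() {
    runs++;
  }

  public static void main(String[] args) {
    CountingNode cancelled = new CountingNode();
    cancelled.cancelIfNotStarted();
    cancelled.executeGuarded();
    CountingNode executed = new CountingNode();
    executed.executeGuarded();
    System.out.println(cancelled.runs + " " + executed.runs + " " + executed.cancelIfNotStarted());
  }
}
```

Because the canceller has to take the same lock, it either wins before the work starts or observes a completed future afterwards; it never interrupts a sync halfway.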

lluwm · 2026-05-30T07:43:53Z

+          // shutdownPartitionConsumptionStates() (after consumerBatchUnsubscribeAllTopics, before closeVeniceWriters),
+          // so the VeniceWriter is alive and no new RT records are arriving. Both paths await with the shutdown
+          // sync-offset timeout, reusing waitForSyncOffsetCmd's timeout/cancel semantics so shutdown never hangs.
+          CompletableFuture<Void> syncFuture = isGlobalRtDivEnabled()


Not a question, but just a thought to confirm. Since we have called consumerBatchUnsubscribeAllTopics earlier in this workflow, I would assume that all consumer threads cannot poll and process any data at this stage. In other words, all consumer DIV states are fixed and unlikely to be changed. Thus, there should be no races between the consumer thread and the shutdown thread here?

🤖 Correct. executeShutdownRunnable runs inside shutdownPartitionConsumptionStates(), which is after consumerBatchUnsubscribeAllTopics() — so no consumer thread is polling or processing for these partitions, the consumer DIV state is frozen, and there's no race with the shutdown thread (or the drainer-side snapshot). That window is exactly what this relies on, and it's also why the steady-state drainer SYNC_OFFSET path is intentionally disabled for Global RT DIV — it would otherwise interfere while consumers are still live.

lluwm · 2026-05-30T07:50:46Z

+    PartitionTracker vtDivSnapshot =
+        getDataIntegrityValidator().cloneVtProducerStates(partition, true, pcs.getLatestMessageTimeInMs());


Why do we take a consumer div snapshot in the drainer thread? It doesn't seem wrong, but it's a bit different from the regular VT div pattern.

🤖 At shutdown there's no live consumer record driving the sync, so unlike the steady-state path — where the consumer thread clones the snapshot and hands it to a fire-and-forget SyncVtDivNode gated on lastQueuedRecordPersistedFuture — here we clone inside the drainer thread. Because the drainer is single-threaded and processes a partition's records FIFO, by the time this node runs every previously-queued record for the partition has already been persisted, so the snapshot is consistent without needing that future dependency. The actual write still goes through the same updateAndSyncOffsetFromSnapshot the regular path uses.
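The FIFO argument in this reply can be sketched with a single-threaded executor standing in for the drainer. This is a toy model under that assumption, not Venice code; the class, its map of producer sequence numbers, and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class DrainerSnapshotSketch {
  // Touched only by the drainer thread, so no extra synchronization is needed.
  private final Map<String, Integer> persisted = new HashMap<>();
  // A single-threaded executor runs tasks in submission (FIFO) order.
  private final ExecutorService drainer = Executors.newSingleThreadExecutor();

  /** Queues a record; the drainer "persists" its sequence number. */
  void enqueueRecord(String producer, int sequenceNumber) {
    drainer.execute(() -> persisted.put(producer, sequenceNumber));
  }

  /** Queues a snapshot node; it runs only after every earlier record was processed. */
  CompletableFuture<Map<String, Integer>> enqueueSnapshot() {
    return CompletableFuture.supplyAsync(() -> new HashMap<>(persisted), drainer);
  }

  void close() {
    drainer.shutdown();
  }

  /** Enqueues 100 records, then a snapshot, and returns what the snapshot saw. */
  static Map<String, Integer> demo() {
    DrainerSnapshotSketch sketch = new DrainerSnapshotSketch();
    try {
      for (int seq = 1; seq <= 100; seq++) {
        sketch.enqueueRecord("producer-1", seq);
      }
      return sketch.enqueueSnapshot().join();
    } finally {
      sketch.close();
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints {producer-1=100}
  }
}
```

The snapshot needs no dependency on a "last record persisted" future here because queue order alone guarantees it runs after the records ahead of it.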

…hutdown sync

The leader graceful-shutdown path produced its GlobalRtDivState to the local VT (whose produce-completion callback already enqueues a fire-and-forget SyncVtDivNode to sync the VT DIV + LCVP), and then ALSO enqueued a second, waitable SyncGlobalRtDivNode purely to obtain a future to await. That second sync re-cloned and re-wrote the same OffsetRecord state, which is redundant work on every broker at shutdown.

Make SyncVtDivNode extend WaitableQueueNode so its drainer-side execution is awaitable, and have the leader await the produce-completion node directly:

- StoreBufferService.execSyncOffsetFromSnapshotAsync now returns the node's completion future (steady-state callers ignore it).
- sendVtDivSnapshotOnCompletion returns a relay future that completes when the drainer-side VT DIV sync runs; it also fails the relay if the produce/persist fails, since the produce-completion callback only fires on success (so the shutdown await never hangs).
- sendGlobalRtDivMessage returns that future; forceGlobalRtDivSync's leader branch awaits it instead of chaining a second SyncGlobalRtDivNode. The follower / no-RT-progress branch still uses the waitable SyncGlobalRtDivNode.

The sync stays on the single-threaded drainer: syncOffset (storageEngine.sync + storageMetadataService.put) must remain serialized with record processing, so it cannot move to the SIT thread (same reason the non-Global SYNC_OFFSET path enqueues into the drainer).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
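The relay-future idea from this commit can be sketched as follows. The names are hypothetical and the wiring is an assumption about the general shape rather than the actual sendVtDivSnapshotOnCompletion code: the relay fails as soon as the produce/persist future fails, and otherwise completes only once the drainer-side sync has run.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

final class RelayFutureSketch {
  /**
   * Returns a future that completes after vtDivSync has run on the drainer,
   * or fails immediately if the produce/persist future failed.
   */
  static CompletableFuture<Void> relay(
      CompletableFuture<Void> persisted, Runnable vtDivSync, Executor drainer) {
    CompletableFuture<Void> relay = new CompletableFuture<>();
    persisted.whenComplete((ignored, error) -> {
      if (error != null) {
        relay.completeExceptionally(error); // fail fast: the sync will never be enqueued
        return;
      }
      drainer.execute(() -> {
        try {
          vtDivSync.run();
          relay.complete(null);
        } catch (RuntimeException e) {
          relay.completeExceptionally(e);
        }
      });
    });
    return relay;
  }

  public static void main(String[] args) {
    AtomicInteger syncs = new AtomicInteger();
    Executor sameThread = Runnable::run; // stand-in for the drainer in this demo

    CompletableFuture<Void> success = relay(
        CompletableFuture.completedFuture(null), syncs::incrementAndGet, sameThread);
    CompletableFuture<Void> failedProduce = new CompletableFuture<>();
    failedProduce.completeExceptionally(new RuntimeException("produce failed"));
    CompletableFuture<Void> failure = relay(failedProduce, syncs::incrementAndGet, sameThread);

    System.out.println(success.isDone() + " " + failure.isCompletedExceptionally() + " " + syncs.get());
  }
}
```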

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

persistedToDBFuture on produce error

Two robustness fixes from the Copilot review on the graceful-shutdown Global RT DIV flush:

- StoreIngestionTask.updateAndSyncOffsetFromSnapshot: guard against a null PCS. The PCS can be removed between a SyncVtDivNode being enqueued and executed (shutdown/unsubscribe); skip cleanly rather than NPE so the waitable node completes normally and the shutdown await stays deterministic. Mirrors the existing guard in syncGlobalRtDivFromSnapshot.
- LeaderProducerCallback.onCompletion: on produce failure, complete persistedToDBFuture exceptionally. Previously it was left uncompleted, so waiters blocked: leader topic-switch's getLastLeaderPersistFuture().get() hangs indefinitely, and the shutdown Global RT DIV relay (sendVtDivSnapshotOnCompletion, which keys its fail-fast off this future) would hang until the shutdown timeout. This is what actually makes the relay fail fast on produce failure.

Adds unit tests for both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
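The second fix comes down to the difference between a future that is never completed and one that is completed exceptionally. A minimal sketch, using a hypothetical callback class rather than the real LeaderProducerCallback:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ProduceFailureSketch {
  private final CompletableFuture<Void> persistedToDb = new CompletableFuture<>();

  /** Hypothetical stand-in for a producer callback; error is null on success. */
  void onCompletion(Exception error) {
    if (error != null) {
      // Without this call the future stays incomplete and every waiter blocks.
      persistedToDb.completeExceptionally(error);
    }
    // On success the drainer would complete the future after persisting (not modeled here).
  }

  /** Bounded wait that reports how the future resolved. */
  String await(long timeoutMs) {
    try {
      persistedToDb.get(timeoutMs, TimeUnit.MILLISECONDS);
      return "persisted";
    } catch (ExecutionException e) {
      return "failed";
    } catch (TimeoutException e) {
      return "timed out";
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return "interrupted";
    }
  }

  public static void main(String[] args) {
    ProduceFailureSketch callback = new ProduceFailureSketch();
    callback.onCompletion(new RuntimeException("broker unavailable"));
    System.out.println(callback.await(50)); // prints "failed"
  }
}
```

With the exceptional completion in place, a bounded waiter sees the failure immediately instead of consuming its whole timeout, and an unbounded waiter no longer hangs.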

Copilot AI review requested due to automatic review settings May 29, 2026 00:15

Copilot started reviewing on behalf of KaiSernLim May 29, 2026 00:15 View session

Copilot AI reviewed May 29, 2026

KaiSernLim self-assigned this May 29, 2026

lluwm reviewed May 30, 2026

Copilot AI review requested due to automatic review settings June 1, 2026 10:44

Copilot started reviewing on behalf of KaiSernLim June 1, 2026 10:45 View session

Copilot AI reviewed Jun 1, 2026

KaiSernLim wants to merge 4 commits into linkedin:main from KaiSernLim:global-rt-div-shutdown-sync

3 participants