misyel (Contributor) commented Jan 23, 2026

Problem Statement

Heartbeats are sent every 60 seconds, so heartbeat delay does not always reflect the replication lag the system is actually operating at. This PR introduces record-level delay tracking alongside heartbeat delay.

Solution

This PR introduces record-level timestamp tracking for Venice's ingestion pipeline, providing more granular visibility into data freshness and ingestion delays. The implementation enhances the existing heartbeat monitoring system to track timestamps for individual data records in addition to periodic heartbeat control messages.
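To make the gap concrete, here is a minimal sketch of the two lag signals (not Venice code; the class, method name, and numbers are purely illustrative):

```java
// Illustrative only: contrasts heartbeat-based lag with record-based lag.
final class LagExample {
    // Lag is "now" minus the source (producer-side) timestamp of the last
    // heartbeat or record this replica processed.
    static long lagMs(long sourceTimestampMs, long nowMs) {
        return nowMs - sourceTimestampMs;
    }
}
```

If the last heartbeat was consumed 90s ago but a record produced 2s ago was just processed, the heartbeat signal reports roughly 90s of lag while the true replication lag is closer to 2s; the record-level signal captures the latter.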

Key Features

  • Record-Level Timestamp Tracking: Track timestamps for every individual data record processed during ingestion
  • OpenTelemetry Integration: Emit per-record OTel metrics for real-time monitoring of ingestion delays
  • Configurable Feature Flags: Two new config keys to control the feature:
    SERVER_RECORD_LEVEL_TIMESTAMP_ENABLED: Enable record-level timestamp tracking
    SERVER_PER_RECORD_OTEL_METRICS_ENABLED: Enable per-record OTel metrics emission
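A plausible shape for the gating these two flags could control; the hook class, its fields, and the early-exit structure below are assumptions for illustration, not the actual Venice code:

```java
// Hypothetical sketch of how the two flags could gate the per-record path.
final class RecordTimestampHook {
    private final boolean recordLevelTimestampEnabled;
    private final boolean perRecordOtelMetricsEnabled;
    long lastDelayMs = -1; // last tracked delay; -1 means nothing tracked yet
    int otelEmissions = 0; // stand-in counter for OTel metric emissions

    RecordTimestampHook(boolean timestampEnabled, boolean otelEnabled) {
        this.recordLevelTimestampEnabled = timestampEnabled;
        this.perRecordOtelMetricsEnabled = otelEnabled;
    }

    void onRecord(long producerTimestampMs, long nowMs) {
        if (!recordLevelTimestampEnabled) {
            return; // early exit: the disabled path does no per-record work
        }
        lastDelayMs = nowMs - producerTimestampMs;
        if (perRecordOtelMetricsEnabled) {
            otelEmissions++; // a real implementation would record an OTel histogram here
        }
    }
}
```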

Implementation Details

  • Data Model Enhancement: Renamed HeartbeatTimeStampEntry → IngestionTimestampEntry to track both heartbeat and record timestamps
  • Monitoring Service Updates: Extended HeartbeatMonitoringService with new methods for record-level timestamp tracking
  • OTel Metrics: Added RecordOtelStats class for OpenTelemetry metrics with dimensions for region, version role, replica type, and replica state
  • Ingestion Pipeline Integration: Added hooks in LeaderFollowerStoreIngestionTask to record timestamps during record processing
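As a mental model, the renamed entry can be pictured as a small value class. Apart from the consumedFromUpstream field, which is quoted from the diff later in this thread, every name in this sketch is an assumption:

```java
// Sketch only: the real IngestionTimestampEntry lives in Venice and differs.
final class IngestionTimestampEntrySketch {
    final long timestampMs;             // producer timestamp of the heartbeat or record
    final boolean consumedFromUpstream; // true if consumed; false if default-initialized

    IngestionTimestampEntrySketch(long timestampMs, boolean consumedFromUpstream) {
        this.timestampMs = timestampMs;
        this.consumedFromUpstream = consumedFromUpstream;
    }

    long delayMs(long nowMs) {
        return nowMs - timestampMs;
    }
}
```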

Suggested Review Order
Configuration and Data Model (Start here for context)

  • ConfigKeys.java: New configuration flags
  • IngestionTimestampEntry.java: Enhanced data model for both heartbeat and record timestamps

Core Implementation

  • HeartbeatMonitoringService.java: Core implementation of record-level timestamp tracking
  • RecordOtelStats.java: New class for OpenTelemetry metrics
  • HeartbeatVersionedStats.java: Integration with metrics system

Integration with Ingestion Pipeline

  • LeaderFollowerStoreIngestionTask.java: Hook for recording timestamps during processing
  • StoreIngestionTask.java: Early exit checks and optimizations

Tests

  • HeartbeatMonitoringServiceTest.java: Tests for record-level timestamp tracking
  • RecordOtelStatsTest.java: Tests for OTel metrics
  • HeartbeatVersionedStatsTest.java: Tests for metrics integration

Configuration Integration

  • VeniceServerConfig.java: Exposing new configuration options

Code changes

  • Added new code behind a config. If so, list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.

  • New integration tests added.

  • Modified or extended existing tests.

  • Verified backward compatibility (if applicable).

  • Unit tests for both heartbeat and record-level timestamp tracking

  • Tests for OTel metrics emission

  • Tests for different configuration combinations

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@ZacAttack
Copy link
Contributor

Why is this important? This seems like an awfully big expense for dubious observability gain.

@misyel
Copy link
Contributor Author

misyel commented Feb 2, 2026

Why is this important? This seems like an awfully big expense for dubious observability gain.

Hi Zac - we want to measure the true end-to-end replication latency and enforce an SLA for it. Heartbeats are only emitted every 60s, so the heartbeat metric may not be an accurate representation if we do not process a heartbeat for a few minutes but are actually continuing to process records between the heartbeats.

@misyel misyel marked this pull request as ready for review February 3, 2026 22:10
@misyel misyel changed the title [WIP][server] Introduce record level delay with heartbeat delay [server] Introduce record level delay with heartbeat delay Feb 3, 2026
/**
* Whether this timestamp entry was consumed from input or if the system initialized it as a default entry
*/
public final boolean consumedFromUpstream;
Contributor:
This seems unnecessary to me. I know this is existing code but why can't we use sentinel values for timestamp to achieve the same goal?

Contributor Author:
I agree, it is possible to achieve the same goal by using sentinel values for the timestamp, but we would have to update existing code to account for the sentinel value, and I feel that affects readability in the long term. Since we also made a change to reuse the same IngestionTimestampEntry, the extra space it uses shouldn't be too bad.
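The trade-off under discussion can be sketched as two hypothetical variants (neither is the actual Venice code, and the sentinel value is an assumption):

```java
// Option A (the discussed alternative): a sentinel timestamp encodes
// "default entry"; it saves a field, but every reader must know and
// preserve the sentinel convention.
final class SentinelEntry {
    static final long DEFAULT_TS = -1L; // assumed sentinel value
    final long timestampMs;
    SentinelEntry(long timestampMs) { this.timestampMs = timestampMs; }
    boolean consumedFromUpstream() { return timestampMs != DEFAULT_TS; }
}

// Option B (what the PR keeps): an explicit flag costs a boolean per entry
// but states the intent directly at every use site.
final class FlaggedEntry {
    final long timestampMs;
    final boolean consumedFromUpstream;
    FlaggedEntry(long timestampMs, boolean consumedFromUpstream) {
        this.timestampMs = timestampMs;
        this.consumedFromUpstream = consumedFromUpstream;
    }
}
```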

@sushantmane (Contributor) left a comment:
Thanks a ton, @misyel, for the changes. Your changes LGTM, but the current state of the HeartbeatMonitoringService class is not great. I've left some comments to improve it. Let me know what you think!

leaderProducedRecordContext,
currentTimeMs);
// Record regular record timestamp for heartbeat monitoring if enabled
if (recordLevelTimestampEnabled) {
Contributor:
Just throwing it out there: should we fold this into if (recordLevelMetricEnabled.get()) {?

* is acceptable. The cached keys are then reused on the per-record hot path to avoid repeated allocation
* and hash computation.
*/
private void refreshCachedHeartbeatKeys(PartitionConsumptionState pcs) {
Contributor:
Why is this required? We are not embedding role of the replica in HB key, right?

Contributor Author:
Removed

result.put(replicaId + "-" + region.getKey(), replicaHeartbeatInfo);
}
}
for (Map.Entry<HeartbeatKey, IngestionTimestampEntry> entry: heartbeatTimestampMap.entrySet()) {
Contributor:
This can be optimized IMO. We can get all the PCSs for a given version topic from the SIT and then only fetch the entries for the keys inside those PCS entries.

* should be able to tell us all the lag information.
*/
for (Map.Entry<String, HeartbeatTimeStampEntry> entry: replicaTimestampMap.entrySet()) {
for (Map.Entry<HeartbeatKey, IngestionTimestampEntry> entry: getLeaderHeartbeatTimeStamps().entrySet()) {
Contributor:
Why do we need to iterate through all of the keys? We already have a handle to partitionConsumptionState. We can just iterate through a subset of keys (3).

Contributor:
Could you please check other places where we iterate through the whole map? We should not do that. Almost all lookups in this class should happen with the PCS as an arg.
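The suggestion amounts to replacing full-map scans with a handful of direct lookups derived from the replica's PartitionConsumptionState. A schematic sketch with simplified stand-in types (none of these names are the real Venice classes):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class LagLookup {
    // Simplified stand-in for HeartbeatKey: version topic + partition + region.
    record Key(String versionTopic, int partition, String region) {}

    final Map<Key, Long> timestamps = new ConcurrentHashMap<>();

    // Anti-pattern flagged in review: scans every entry in the service.
    long maxLagScanAll(long nowMs) {
        long max = 0;
        for (Map.Entry<Key, Long> e : timestamps.entrySet()) {
            max = Math.max(max, nowMs - e.getValue());
        }
        return max;
    }

    // Suggested shape: build the few keys for this replica from its PCS
    // and look them up directly, O(regions) instead of O(all entries).
    long maxLagForReplica(String versionTopic, int partition, List<String> regions, long nowMs) {
        long max = 0;
        for (String region : regions) {
            Long ts = timestamps.get(new Key(versionTopic, partition, region));
            if (ts != null) {
                max = Math.max(max, nowMs - ts);
            }
        }
        return max;
    }
}
```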
