@m-nagarajan
Contributor

@m-nagarajan m-nagarajan commented Feb 11, 2026

Problem Statement

This is part of an ongoing effort to add OpenTelemetry (OTel) metrics to the Venice server's ingestion stats. The base class AbstractVeniceAggVersionedStats lacked hooks for subclasses to participate in version lifecycle events (initialization, updates, store deletion), and ServerMetricEntity did not yet define the metric entities needed for ingestion OTel metrics.

Prior PRs in this series:

Solution

This PR makes four sets of changes:

1. ServerMetricEntity definitions:

  • Add 26 new metric entity definitions for ingestion OTel metrics covering: records/bytes consumed and produced, leader/follower latencies, DCR events, batch processing, push timeout, idle time, write compute failures, and RT region consumption
  • Comprehensive test validation in ServerMetricEntityTest

2. Version lifecycle refactoring in AbstractVeniceAggVersionedStats:

  • Extract applyVersionInfo() as shared core logic for both initialization and updates, eliminating the duplication in the old addStore + updateStatsVersionInfo pattern
  • Change addStore(String) to addStore(Store) so that version info is created and initialized atomically inside computeIfAbsent, eliminating a race window where other threads could observe partially-initialized stats
  • Add onVersionInfoUpdated() hook called after both initialization and version changes, so subclasses can react to version transitions
  • updateTotalStats is intentionally kept outside applyVersionInfo — calling it inside computeIfAbsent would cause a recursive ConcurrentHashMap deadlock
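
The pattern described in item 2 can be sketched roughly as follows. This is a simplified illustration, not the PR's code: StoreSketch, VersionedStatsSketch, AggVersionedStatsSketch, hookCalls, and totalStoreCount are hypothetical stand-ins, and invoking the hook after computeIfAbsent returns is a choice made for this sketch, since the description does not spell out the exact call site.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical, simplified stand-in for Venice's Store; only what the sketch needs.
final class StoreSketch {
  final String name;
  final int currentVersion;
  final int futureVersion;

  StoreSketch(String name, int currentVersion, int futureVersion) {
    this.name = name;
    this.currentVersion = currentVersion;
    this.futureVersion = futureVersion;
  }
}

// Hypothetical per-store stats holder.
final class VersionedStatsSketch {
  volatile int currentVersion;
  volatile int futureVersion;
}

class AggVersionedStatsSketch {
  private final Map<String, VersionedStatsSketch> aggStats = new ConcurrentHashMap<>();
  // Records hook invocations so the behavior is observable in this sketch.
  final List<String> hookCalls = new CopyOnWriteArrayList<>();
  volatile int totalStoreCount;

  // Shared core logic used by both first-time initialization and later updates.
  private void applyVersionInfo(VersionedStatsSketch stats, StoreSketch store) {
    stats.currentVersion = store.currentVersion;
    stats.futureVersion = store.futureVersion;
  }

  // Hook that subclasses may override to react to version transitions.
  protected void onVersionInfoUpdated(StoreSketch store) {
    hookCalls.add(store.name + ":" + store.currentVersion + "/" + store.futureVersion);
  }

  // Stand-in for aggregate work that touches the map again; kept out of the lambda.
  private void updateTotalStats() {
    totalStoreCount = aggStats.size();
  }

  VersionedStatsSketch addStore(StoreSketch store) {
    boolean[] created = {false};
    VersionedStatsSketch stats = aggStats.computeIfAbsent(store.name, name -> {
      // The entry is published only after its version info has been applied.
      VersionedStatsSketch fresh = new VersionedStatsSketch();
      applyVersionInfo(fresh, store);
      created[0] = true;
      return fresh;
    });
    if (created[0]) {
      onVersionInfoUpdated(store);
      updateTotalStats();
    }
    return stats;
  }

  void updateStore(StoreSketch store) {
    VersionedStatsSketch stats = aggStats.get(store.name);
    if (stats == null) {
      addStore(store);
      return;
    }
    applyVersionInfo(stats, store);
    onVersionInfoUpdated(store);
    updateTotalStats();
  }
}
```

In this sketch a second addStore call for the same store name returns the existing entry untouched, and only updateStore changes its version info.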

3. Heartbeat OTel stats lifecycle:

  • HeartbeatVersionedStats now overrides onVersionInfoUpdated() to keep its HeartbeatOtelStats version cache in sync
  • HeartbeatVersionedStats now overrides handleStoreDeleted() to close HeartbeatOtelStats and clean up OTel resources
  • Add close() method to HeartbeatOtelStats for resource cleanup
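
A minimal sketch of how a subclass might use the two hooks in item 3, assuming hypothetical class names and hook signatures (OtelStatsSketch, VersionLifecycleBase, HeartbeatStatsSketch); the real HeartbeatVersionedStats and HeartbeatOtelStats APIs may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a per-store OTel stats object with a version cache.
final class OtelStatsSketch implements AutoCloseable {
  private volatile int currentVersion = -1;
  private volatile int futureVersion = -1;
  private volatile boolean closed;

  void updateVersionInfo(int current, int future) {
    this.currentVersion = current;
    this.futureVersion = future;
  }

  int getCurrentVersion() {
    return currentVersion;
  }

  int getFutureVersion() {
    return futureVersion;
  }

  boolean isClosed() {
    return closed;
  }

  @Override
  public void close() {
    // A real implementation would release OTel instruments or callbacks here.
    closed = true;
  }
}

// Hypothetical base class exposing the lifecycle hooks as no-ops.
abstract class VersionLifecycleBase {
  protected void onVersionInfoUpdated(String storeName, int current, int future) {
  }

  protected void handleStoreDeleted(String storeName) {
  }
}

class HeartbeatStatsSketch extends VersionLifecycleBase {
  private final Map<String, OtelStatsSketch> otelStats = new ConcurrentHashMap<>();

  OtelStatsSketch getOrCreate(String storeName) {
    return otelStats.computeIfAbsent(storeName, name -> new OtelStatsSketch());
  }

  @Override
  protected void onVersionInfoUpdated(String storeName, int current, int future) {
    // Keep the cached version info in sync with the latest store metadata.
    OtelStatsSketch stats = otelStats.get(storeName);
    if (stats != null) {
      stats.updateVersionInfo(current, future);
    }
  }

  @Override
  protected void handleStoreDeleted(String storeName) {
    // Remove first, then close, so no new caller can obtain a closed instance.
    OtelStatsSketch removed = otelStats.remove(storeName);
    if (removed != null) {
      removed.close();
    }
  }
}
```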

4. Thread safety fix in VeniceVersionedStats:

  • Add synchronized to getStats(int) — the backing Int2ObjectOpenHashMap is not thread-safe, and concurrent removeVersion/addVersion calls could corrupt the map during an unsynchronized read. This was a pre-existing issue fixed opportunistically.
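
The idea behind item 4 can be illustrated with a small sketch. It assumes a plain java.util.HashMap as a stand-in for Int2ObjectOpenHashMap (both are unsafe under concurrent mutation), and VersionedStatsMapSketch is a hypothetical class, not VeniceVersionedStats itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder of per-version stats backed by a non-thread-safe map.
class VersionedStatsMapSketch<S> {
  // HashMap stands in for Int2ObjectOpenHashMap in this sketch.
  private final Map<Integer, S> versionedStats = new HashMap<>();

  synchronized void addVersion(int version, S stats) {
    versionedStats.put(version, stats);
  }

  synchronized void removeVersion(int version) {
    versionedStats.remove(version);
  }

  // Reads take the same monitor as writes, so a lookup never runs while
  // another thread is resizing or otherwise restructuring the map.
  synchronized S getStats(int version) {
    return versionedStats.get(version);
  }
}
```

Guarding only the writers would not be enough here: an unsynchronized read has no happens-before relationship with a concurrent put or remove, so it can observe the map in an intermediate state.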

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

New tests:

  • HeartbeatVersionedStatsTest — tests for onVersionInfoUpdated hook, handleStoreDeleted cleanup, and OTel stats lifecycle
  • ServerMetricEntityTest — validates all 26 new metric entities plus existing ones for correct names, types, units, and dimensions
  • AggVersionedStorageEngineStatsTest — updated for addStore(Store) signature change

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

definitions for OTel ingestion metrics

Enhance AbstractVeniceAggVersionedStats with thread-safe initialization via
initializeVersionInfo() and onVersionInfoUpdated() hook for subclass version
lifecycle management. Add close() to HeartbeatOtelStats and integrate with
HeartbeatVersionedStats.handleStoreDeleted(). Add 26 new ServerMetricEntity
definitions for ingestion OTel metrics covering records/bytes consumed/produced,
latency, DCR, batch processing, and RT region consumption.
@m-nagarajan m-nagarajan changed the title from "[da-vinci][server] Add version lifecycle hooks and ServerMetricEntity definitions for OTel ingestion metrics" to "[da-vinci][server] Add ServerMetricEntity definitions for OTel ingestion metrics and version lifecycle hooks" Feb 11, 2026
Contributor Author

@m-nagarajan m-nagarajan left a comment

Code Review Summary

Reviewed by 4 specialized agents: code-reviewer, silent-failure-hunter, test-analyzer, comment-analyzer.

3 Critical | 4 Important | 4 Suggestions found across 7 files.

See inline comments below for details. Overall, the computeIfAbsent atomicity improvement is a genuine win, but calling virtual methods inside computeIfAbsent introduces latent deadlock risk and exception-safety concerns that should be addressed before merge.

Also noted: PR description section 4 claims a synchronized fix on VeniceVersionedStats.getStats(int), but VeniceVersionedStats.java is not in the diff. The Int2ObjectOpenHashMap thread-safety issue described is real. Either include the change or update the PR description.
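
The exception-safety concern about overridable methods inside computeIfAbsent can be made concrete with a small sketch. All names here (HookedRegistrySketch, ThrowingHookSketch, addStoreHookInside, addStoreHookOutside) are hypothetical, and the two methods only contrast possible call sites; they are not taken from the diff.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HookedRegistrySketch {
  private final Map<String, Integer> versions = new ConcurrentHashMap<>();

  // Overridable hook; subclasses decide what it does, including throwing.
  protected void onVersionInfoUpdated(String storeName, int version) {
  }

  // Variant A: the hook runs inside the mapping function. If it throws,
  // the exception propagates and no mapping is recorded for the key.
  void addStoreHookInside(String storeName, int version) {
    versions.computeIfAbsent(storeName, name -> {
      onVersionInfoUpdated(name, version);
      return version;
    });
  }

  // Variant B: the mapping function only builds the value; the hook runs
  // after the entry is published, so a failing hook leaves the entry in place.
  void addStoreHookOutside(String storeName, int version) {
    boolean[] created = {false};
    versions.computeIfAbsent(storeName, name -> {
      created[0] = true;
      return version;
    });
    if (created[0]) {
      onVersionInfoUpdated(storeName, version);
    }
  }

  boolean contains(String storeName) {
    return versions.containsKey(storeName);
  }
}

// Subclass whose hook always fails, to exercise both variants.
class ThrowingHookSketch extends HookedRegistrySketch {
  @Override
  protected void onVersionInfoUpdated(String storeName, int version) {
    throw new IllegalStateException("hook failed for " + storeName);
  }
}
```

With variant A a failing hook silently prevents the store from being registered at all, and any hook that touches the same map risks a recursive update. Variant B avoids both, at the cost of a brief window in which the entry is visible before the hook has run.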
