-
Notifications
You must be signed in to change notification settings - Fork 114
[da-vinci][server] Add ServerMetricEntity definitions for OTel ingestion metrics and version lifecycle hooks #2476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[da-vinci][server] Add ServerMetricEntity definitions for OTel ingestion metrics and version lifecycle hooks #2476
Conversation
definitions for OTel ingestion metrics Enhance AbstractVeniceAggVersionedStats with thread-safe initialization via initializeVersionInfo() and onVersionInfoUpdated() hook for subclass version lifecycle management. Add close() to HeartbeatOtelStats and integrate with HeartbeatVersionedStats.handleStoreDeleted(). Add 26 new ServerMetricEntity definitions for ingestion OTel metrics covering records/bytes consumed/produced, latency, DCR, batch processing, and RT region consumption.
m-nagarajan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review Summary
Reviewed by 4 specialized agents: code-reviewer, silent-failure-hunter, test-analyzer, comment-analyzer.
3 Critical | 4 Important | 4 Suggestions found across 7 files.
See inline comments below for details. Overall, the computeIfAbsent atomicity improvement is a genuine win, but calling virtual methods inside computeIfAbsent introduces latent deadlock risk and exception-safety concerns that should be addressed before merge.
Also noted: PR description section 4 claims a synchronized fix on VeniceVersionedStats.getStats(int), but VeniceVersionedStats.java is not in the diff. The Int2ObjectOpenHashMap thread-safety issue described is real. Either include the change or update the PR description.
Problem Statement
This is part of an ongoing effort to add OpenTelemetry (OTel) metrics to the Venice server's ingestion stats. The base class
AbstractVeniceAggVersionedStatslacked hooks for subclasses to participate in version lifecycle events (initialization, updates, store deletion), andServerMetricEntitydid not yet define the metric entities needed for ingestion OTel metrics.Prior PRs in this series:
Solution
This PR makes two sets of changes:
1. ServerMetricEntity definitions:
consumption
ServerMetricEntityTest2. Version lifecycle refactoring in
AbstractVeniceAggVersionedStats:applyVersionInfo()as shared core logic for both initialization and updates, eliminating duplication between the oldaddStore+updateStatsVersionInfopatternaddStore(String)→addStore(Store)to atomically create and initialize version info insidecomputeIfAbsent, eliminating a race window where other threads could observe partially-initialized statsonVersionInfoUpdated()hook called after both initialization and version changes, so subclasses can react to version transitionsupdateTotalStatsis intentionally kept outsideapplyVersionInfo— calling it insidecomputeIfAbsentwould cause a recursiveConcurrentHashMapdeadlock3. Heartbeat OTel stats lifecycle:
HeartbeatVersionedStatsnow overridesonVersionInfoUpdated()to keep itsHeartbeatOtelStatsversion cache in syncHeartbeatVersionedStatsnow overrideshandleStoreDeleted()to closeHeartbeatOtelStatsand clean up OTel resourcesclose()method toHeartbeatOtelStatsfor resource cleanup4. Thread safety fix in
VeniceVersionedStats:synchronizedtogetStats(int)— the backingInt2ObjectOpenHashMapis not thread-safe, and concurrentremoveVersion/addVersioncalls could corrupt the map during an unsynchronized read. This was a pre-existing issue fixed opportunistically.Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
New tests:
HeartbeatVersionedStatsTest— tests foronVersionInfoUpdatedhook,handleStoreDeletedcleanup, and OTel stats lifecycleServerMetricEntityTest— validates all 26 new metric entities plus existing ones for correct names, types, units, and dimensionsAggVersionedStorageEngineStatsTest— updated foraddStore(Store)signature changeDoes this PR introduce any user-facing or breaking changes?