[router][common] Multiple fixes in Opentelemetry #1483

Merged
merged 11 commits into from
Feb 11, 2025


@m-nagarajan m-nagarajan commented Jan 30, 2025

Summary

  1. Added the config otel.venice.metrics.export.interval.in.seconds to control the OpenTelemetry (OTel) metrics export interval. The default is 60 seconds, which matches the current behavior without this config.
  2. Updated MetricEntityState to maintain a 1:1 relationship between OTel instruments and Tehuti sensors, rather than a 1:n relationship, to eliminate unnecessary lookups in the hot path.
  3. Stopped emitting OTel metrics for the total store in the router; the aggregation will be done on the receiving side. This helps when creating pre-aggregates in the metrics processing systems, which no longer need to filter with storeName != total.
  4. Modified venice.response.status_code_category to use success/fail instead of healthy/unhealthy/tardy/throttled/bad_request, to keep it standard. Tardy/throttled/bad_request can be inferred from the response status.
  5. Removed the incoming_call_count OTel metric and reverted it to the Tehuti-only metric request, since it did not cover all the incoming cases either and was making things more confusing.
  6. Renamed the existing OTel metric call_key_count to key_count and converted it into a histogram. This metric now measures key counts on the response handling side, including success/fail details and response codes, similar to call_time, and provides a distribution of key counts.
  7. Fixed a bug where the exponential histogram view was configured for only one metric.
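
As a rough illustration of item 2, the sketch below pairs exactly one OTel-side recorder with exactly one Tehuti-side recorder, so recording a value needs no map lookup. It is not the actual MetricEntityState code: the class name PairedMetricState and the use of DoubleConsumer as a stand-in for both the OTel instrument and the Tehuti sensor are assumptions made for this example.

```java
import java.util.function.DoubleConsumer;

/**
 * Hypothetical stand-in for a metric state object that holds exactly one
 * OTel instrument and exactly one Tehuti sensor (a 1:1 pairing).
 * Both recorders are modeled as DoubleConsumer for illustration only.
 */
final class PairedMetricState {
  private final DoubleConsumer otelInstrument;
  private final DoubleConsumer tehutiSensor;

  PairedMetricState(DoubleConsumer otelInstrument, DoubleConsumer tehutiSensor) {
    this.otelInstrument = otelInstrument;
    this.tehutiSensor = tehutiSensor;
  }

  /** Hot path: two direct calls, no lookup by sensor name. */
  void record(double value) {
    otelInstrument.accept(value);
    tehutiSensor.accept(value);
  }

  public static void main(String[] args) {
    double[] otelSum = {0};
    double[] tehutiSum = {0};
    PairedMetricState state =
        new PairedMetricState(v -> otelSum[0] += v, v -> tehutiSum[0] += v);
    state.record(10);
    state.record(1);
    System.out.println(otelSum[0] + " " + tehutiSum[0]); // prints "11.0 11.0"
  }
}
```

With a 1:n layout, each record call would first have to pick the right sensor out of a collection; holding a single reference per side removes that step.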

How was this PR tested?

GH CI. The log below, from integration tests, shows all the new changes:

2025-02-10 16:24:23 - [] INFO [VeniceOpenTelemetryMetricsRepository] [PeriodicMetricReader-1] Logging OpenTelemetry metrics for debug purpose: [ImmutableMetricData{resource=Resource{schemaUrl=null, attributes={}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=venice.router, version=null, schemaUrl=null, attributes={}}, name=venice.router.retry_delay, description=Retry delay time, unit=MILLISECOND, type=HISTOGRAM, data=ImmutableHistogramData{aggregationTemporality=DELTA, points=[ImmutableHistogramPointData{getStartEpochNanos=1739233403028360000, getEpochNanos=1739233463032171000, getAttributes={venice.cluster.name="venice-cluster_1ed021f416bf_e2ee76ba", venice.request.method="single_get", venice.store.name="store_1ed46d18c968_6db4c37d"}, getSum=11.0, getCount=1, hasMin=true, getMin=11.0, hasMax=true, getMax=11.0, getBoundaries=[], getCounts=[1], getExemplars=[]}]}}, ImmutableMetricData{resource=Resource{schemaUrl=null, attributes={}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=venice.router, version=null, schemaUrl=null, attributes={}}, name=venice.router.key_count, description=Count of keys during response handling along with response codes, unit=NUMBER, type=EXPONENTIAL_HISTOGRAM, data=ImmutableExponentialHistogramData{aggregationTemporality=DELTA, points=[ImmutableExponentialHistogramPointData{getStartEpochNanos=1739233403028360000, getEpochNanos=1739233463032171000, getAttributes={http.response.status_code="200", http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_1ed021f416bf_e2ee76ba", venice.request.method="multi_get_streaming", venice.response.status_code_category="success", venice.store.name="store_1ed46d18c968_6db4c37d"}, getScale=3, getSum=100.0, getCount=10, getZeroCount=0, hasMin=true, getMin=10.0, hasMax=true, getMax=10.0, getPositiveBuckets=DoubleExponentialHistogramBuckets{scale: 3, offset: 26, counts: {26=10} }, getNegativeBuckets=EmptyExponentialHistogramBuckets{scale=3, offset=0, bucketCounts=[], 
totalCount=0}, getExemplars=[]}, ImmutableExponentialHistogramPointData{getStartEpochNanos=1739233403028360000, getEpochNanos=1739233463032171000, getAttributes={http.response.status_code="200", http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_1ed021f416bf_e2ee76ba", venice.request.method="single_get", venice.response.status_code_category="success", venice.store.name="store_1ed46d18c968_6db4c37d"}, getScale=3, getSum=100.0, getCount=100, getZeroCount=0, hasMin=true, getMin=1.0, hasMax=true, getMax=1.0, getPositiveBuckets=DoubleExponentialHistogramBuckets{scale: 3, offset: -1, counts: {-1=100} }, getNegativeBuckets=EmptyExponentialHistogramBuckets{scale=3, offset=0, bucketCounts=[], totalCount=0}, getExemplars=[]}]}}, ImmutableMetricData{resource=Resource{schemaUrl=null, attributes={}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=venice.router, version=null, schemaUrl=null, attributes={}}, name=venice.router.aborted_retry_count, description=Count of aborted retry requests, unit=NUMBER, type=LONG_SUM, data=ImmutableSumData{points=[ImmutableLongPointData{startEpochNanos=1739233403028360000, epochNanos=1739233463032171000, attributes={venice.cluster.name="venice-cluster_1ed021f416bf_e2ee76ba", venice.request.method="single_get", venice.request.retry_abort_reason="no_available_replica", venice.store.name="store_1ed46d18c968_6db4c37d"}, value=1, exemplars=[]}], monotonic=true, aggregationTemporality=DELTA}}, ImmutableMetricData{resource=Resource{schemaUrl=null, attributes={}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=venice.router, version=null, schemaUrl=null, attributes={}}, name=venice.router.call_count, description=Count of all requests during response handling along with response codes, unit=NUMBER, type=LONG_SUM, data=ImmutableSumData{points=[ImmutableLongPointData{startEpochNanos=1739233403028360000, epochNanos=1739233463032171000, attributes={http.response.status_code="200", 
http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_1ed021f416bf_e2ee76ba", venice.request.method="multi_get_streaming", venice.response.status_code_category="success", venice.store.name="store_1ed46d18c968_6db4c37d"}, value=10, exemplars=[]}, ImmutableLongPointData{startEpochNanos=1739233403028360000, epochNanos=1739233463032171000, attributes={http.response.status_code="200", http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_1ed021f416bf_e2ee76ba", venice.request.method="single_get", venice.response.status_code_category="success", venice.store.name="store_1ed46d18c968_6db4c37d"}, value=100, exemplars=[]}], monotonic=true, aggregationTemporality=DELTA}}, ImmutableMetricData{resource=Resource{schemaUrl=null, attributes={}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=venice.router, version=null, schemaUrl=null, attributes={}}, name=venice.router.call_time, description=Latency based on all responses, unit=MILLISECOND, type=EXPONENTIAL_HISTOGRAM, data=ImmutableExponentialHistogramData{aggregationTemporality=DELTA, points=[ImmutableExponentialHistogramPointData{getStartEpochNanos=1739233403028360000, getEpochNanos=1739233463032171000, getAttributes={http.response.status_code="200", http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_1ed021f416bf_e2ee76ba", venice.request.method="multi_get_streaming", venice.response.status_code_category="success", venice.store.name="store_1ed46d18c968_6db4c37d"}, getScale=3, getSum=39.09754, getCount=10, getZeroCount=0, hasMin=true, getMin=1.842334, hasMax=true, getMax=16.385833, getPositiveBuckets=DoubleExponentialHistogramBuckets{scale: 3, offset: 7, counts: {7=3,8=2,9=2,10=1,11=0,12=0,13=0,14=0,15=0,16=0,17=0,18=0,19=0,20=1,21=0,22=0,23=0,24=0,25=0,26=0,27=0,28=0,29=0,30=0,31=0,32=1} }, getNegativeBuckets=EmptyExponentialHistogramBuckets{scale=3, offset=0, bucketCounts=[], totalCount=0}, getExemplars=[]}, 
ImmutableExponentialHistogramPointData{getStartEpochNanos=1739233403028360000, getEpochNanos=1739233463032171000, getAttributes={http.response.status_code="200", http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_1ed021f416bf_e2ee76ba", venice.request.method="single_get", venice.response.status_code_category="success", venice.store.name="store_1ed46d18c968_6db4c37d"}, getScale=3, getSum=214.82287399999996, getCount=100, getZeroCount=0, hasMin=true, getMin=1.039375, hasMax=true, getMax=56.24875, getPositiveBuckets=DoubleExponentialHistogramBuckets{scale: 3, offset: 0, counts: {0=2,1=9,2=12,3=12,4=15,5=22,6=13,7=8,8=3,9=0,10=0,11=1,12=0,13=0,14=0,15=1,16=0,17=0,18=0,19=0,20=1,21=0,22=0,23=0,24=0,25=0,26=0,27=0,28=0,29=0,30=0,31=0,32=0,33=0,34=0,35=0,36=0,37=0,38=0,39=0,40=0,41=0,42=0,43=0,44=0,45=0,46=1} }, getNegativeBuckets=EmptyExponentialHistogramBuckets{scale=3, offset=0, bucketCounts=[], totalCount=0}, getExemplars=[]}]}}]
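
The exponential histogram points in the log above can be sanity-checked by hand. The sketch below uses the logarithm-based index mapping described in the OpenTelemetry metrics data model for positive scales, where bucket i covers (base^i, base^(i+1)] and base = 2^(2^-scale). The class and method names are made up for this example and are not part of Venice or the OTel SDK, and the floating-point computation can be off by one for values that sit exactly on a bucket boundary.

```java
/**
 * Illustrative helper (hypothetical name) that maps a positive value to its
 * exponential histogram bucket index at a given positive scale.
 */
final class ExpHistogramBucketIndex {
  static int bucketIndex(double value, int scale) {
    if (!(value > 0)) {
      throw new IllegalArgumentException("value must be positive: " + value);
    }
    // scaleFactor = 2^scale / ln(2), so log(value) * scaleFactor = log2(value) * 2^scale
    double scaleFactor = Math.scalb(1.0, scale) / Math.log(2);
    return (int) Math.ceil(Math.log(value) * scaleFactor) - 1;
  }

  public static void main(String[] args) {
    // key_count, multi_get_streaming: every observation is 10 keys
    System.out.println(bucketIndex(10.0, 3)); // prints 26
    // key_count, single_get: every observation is 1 key
    System.out.println(bucketIndex(1.0, 3)); // prints -1
  }
}
```

These results line up with the log: the multi_get_streaming key_count point reports counts {26=10} and the single_get point reports counts {-1=100}, both at scale 3.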

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.
  1. New config otel.venice.metrics.export.interval.in.seconds, with a default value of 60 seconds.
  2. The venice.response.status_code_category dimension will emit the values success/fail.
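
A minimal sketch of how the new config might be read, assuming it arrives through a plain java.util.Properties object. The config key and the 60-second default come from this PR; the class name ExportIntervalConfig, the method name, and the rejection of non-positive values are assumptions for illustration and may differ from what Venice actually does.

```java
import java.util.Properties;

/** Hypothetical helper that resolves the OTel metrics export interval. */
final class ExportIntervalConfig {
  static final String KEY = "otel.venice.metrics.export.interval.in.seconds";
  static final long DEFAULT_SECONDS = 60;

  static long exportIntervalSeconds(Properties props) {
    String raw = props.getProperty(KEY);
    if (raw == null) {
      return DEFAULT_SECONDS; // config absent: keep the existing 60-second behavior
    }
    long seconds = Long.parseLong(raw.trim());
    if (seconds <= 0) {
      // Assumption: a non-positive interval is treated as a configuration error
      throw new IllegalArgumentException(KEY + " must be positive, got " + seconds);
    }
    return seconds;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(exportIntervalSeconds(props)); // prints 60
    props.setProperty(KEY, "15");
    System.out.println(exportIntervalSeconds(props)); // prints 15
  }
}
```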

… default

2. Change venice.response.status_code_category to hold success/fail instead of healthy/unhealthy/tardy/throttled/bad_request
3. Change the existing Tehuti metrics key_num and bad_request_key_num to record a value of 1 for single gets as well, to keep things uniform
4. Introduce the OTel metric incoming_key_count, which will measure data similar to key_num and bad_request_key_num on the request handling path
5. Change the OTel metric call_key_count to key_count, which will now measure key counts on the response handling side, with success/fail details as well as response codes
ZacAttack previously approved these changes Feb 6, 2025
@ZacAttack ZacAttack left a comment

These changes overall look good, I'm giving a provisional ship it pending some of the things I've asked. I think overall I have some latent concern about the enum tweak and the potential new overhead for single key lookup, but aside from that I don't have any major objections. Thanks!

@m-nagarajan m-nagarajan left a comment

Thanks @ZacAttack for the review. Replied to your comments.

lluwm previously approved these changes Feb 8, 2025
@lluwm lluwm left a comment

Looks good and it makes sense to me!

@m-nagarajan m-nagarajan merged commit 10b6a20 into linkedin:main Feb 11, 2025
58 checks passed
@m-nagarajan
Thanks @ZacAttack and @lluwm for the review
