Commit 06947df

nparaddi-walmart (Nagappa Paraddi) and Nagappa Paraddi authored
add support for publishing percentile time series for the histogram m… (apache#1689)
* add support for publishing percentile time series for the histogram metrics cql-requests, cql-messages and throttling delay.

Motivation: Histogram metrics generate too many time series and overload the Prometheus servers. If an application has 500 VMs and 1000 Cassandra nodes, the histogram metrics generate 100*500*1000 = 50,000,000 time series every 30 seconds. This is far too many. If we instead generate a 95th-percentile time series for every Cassandra node, then we only have 1*500 = 500 metrics, and on the application side we can ignore the _bucket time series. This way there will be far fewer metrics.

Modifications: add configurable pre-defined percentiles to Micrometer Timer.Builder.publishPercentiles. This change is applied to cql-requests, cql-messages and throttling delay.

Result: Based on the configuration, we will see additional quantile time series for the cql-requests, cql-messages and throttling delay histogram metrics.

* using helper method as suggested in review
* fixes as per review comments
* add configuration option which switches aggregable histogram generation on/off for all metric flavors [default=on]
* updating java doc
* rename method to publishPercentilesIfDefined
* rename method

---------

Co-authored-by: Nagappa Paraddi <[email protected]>
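The cardinality arithmetic in the motivation can be checked with a small standalone sketch. This is not driver code: the class and method names are hypothetical, and reading the factor 100 as "bucket series per histogram" is an assumption. Note that the 500 figure corresponds to one series per VM (a session-level metric); a percentile tagged per node would also scale with the node count.

```java
// Back-of-the-envelope time-series cardinality, using the figures from the commit message.
final class SeriesCardinality {

  // Series count = series per metric instance * client VMs * Cassandra nodes.
  static long series(long perInstance, long vms, long nodes) {
    return Math.multiplyExact(Math.multiplyExact(perInstance, vms), nodes);
  }

  public static void main(String[] args) {
    long buckets = 100; // assumption: roughly 100 _bucket series per histogram
    long vms = 500;
    long nodes = 1000;

    // Node-level histogram: every VM publishes every bucket for every node.
    System.out.println("histogram buckets:        " + series(buckets, vms, nodes));
    // One published percentile on a session-level metric: one series per VM.
    System.out.println("p95, session-level:       " + series(1, vms, 1));
    // One published percentile on a node-level metric: one series per VM per node.
    System.out.println("p95, node-level:          " + series(1, vms, nodes));
  }
}
```

Even in the node-level case, a single percentile series is two orders of magnitude smaller than the bucket series it replaces.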
1 parent 1849812 commit 06947df

10 files changed

+311
-21
lines changed

core/src/main/java/com/datastax/dse/driver/api/core/config/DseDriverOption.java

+28
@@ -288,6 +288,34 @@ public enum DseDriverOption implements DriverOption {
    * <p>Value-type: {@link java.time.Duration Duration}
    */
   METRICS_NODE_GRAPH_MESSAGES_SLO("advanced.metrics.node.graph-messages.slo"),
+  /**
+   * Optional list of percentiles to publish for graph-requests metric. Produces an additional time
+   * series for each requested percentile. This percentile is computed locally, and so can't be
+   * aggregated with percentiles computed across other dimensions (e.g. in a different instance).
+   *
+   * <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
+   */
+  METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES(
+      "advanced.metrics.session.graph-requests.publish-percentiles"),
+  /**
+   * Optional list of percentiles to publish for node graph-messages metric. Produces an additional
+   * time series for each requested percentile. This percentile is computed locally, and so can't be
+   * aggregated with percentiles computed across other dimensions (e.g. in a different instance).
+   *
+   * <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
+   */
+  METRICS_NODE_GRAPH_MESSAGES_PUBLISH_PERCENTILES(
+      "advanced.metrics.node.graph-messages.publish-percentiles"),
+  /**
+   * Optional list of percentiles to publish for continuous paging requests metric. Produces an
+   * additional time series for each requested percentile. This percentile is computed locally, and
+   * so can't be aggregated with percentiles computed across other dimensions (e.g. in a different
+   * instance).
+   *
+   * <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
+   */
+  CONTINUOUS_PAGING_METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES(
+      "advanced.metrics.session.continuous-cql-requests.publish-percentiles"),
   ;

   private final String path;

core/src/main/java/com/datastax/oss/driver/api/core/config/DefaultDriverOption.java

+35
@@ -939,6 +939,41 @@ public enum DefaultDriverOption implements DriverOption {
    * <p>Value-type: List of {@link String}
    */
   METADATA_SCHEMA_CHANGE_LISTENER_CLASSES("advanced.schema-change-listener.classes"),
+  /**
+   * Optional list of percentiles to publish for cql-requests metric. Produces an additional time
+   * series for each requested percentile. This percentile is computed locally, and so can't be
+   * aggregated with percentiles computed across other dimensions (e.g. in a different instance).
+   *
+   * <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
+   */
+  METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES(
+      "advanced.metrics.session.cql-requests.publish-percentiles"),
+  /**
+   * Optional list of percentiles to publish for node cql-messages metric. Produces an additional
+   * time series for each requested percentile. This percentile is computed locally, and so can't be
+   * aggregated with percentiles computed across other dimensions (e.g. in a different instance).
+   *
+   * <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
+   */
+  METRICS_NODE_CQL_MESSAGES_PUBLISH_PERCENTILES(
+      "advanced.metrics.node.cql-messages.publish-percentiles"),
+  /**
+   * Optional list of percentiles to publish for throttling delay metric.Produces an additional time
+   * series for each requested percentile. This percentile is computed locally, and so can't be
+   * aggregated with percentiles computed across other dimensions (e.g. in a different instance).
+   *
+   * <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
+   */
+  METRICS_SESSION_THROTTLING_PUBLISH_PERCENTILES(
+      "advanced.metrics.session.throttling.delay.publish-percentiles"),
+  /**
+   * Adds histogram buckets used to generate aggregable percentile approximations in monitoring
+   * systems that have query facilities to do so (e.g. Prometheus histogram_quantile, Atlas
+   * percentiles).
+   *
+   * <p>Value-type: boolean
+   */
+  METRICS_GENERATE_AGGREGABLE_HISTOGRAMS("advanced.metrics.histograms.generate-aggregable"),
   ;

   private final String path;
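The Javadoc above notes that a published percentile is computed locally and cannot be aggregated across instances. A minimal standalone sketch of why: the names are hypothetical, the latency samples are made up, and a simple nearest-rank percentile stands in for whatever estimator Micrometer uses internally.

```java
import java.util.Arrays;

final class LocalPercentiles {

  // Nearest-rank percentile over a non-empty sample set:
  // the smallest sample with at least pct% of all samples at or below it.
  static long percentile(long[] samples, int pct) {
    long[] sorted = samples.clone();
    Arrays.sort(sorted);
    int rank = (pct * sorted.length + 99) / 100; // integer ceil(pct * n / 100)
    return sorted[Math.max(rank, 1) - 1];
  }

  static long[] repeat(long value, int count) {
    long[] out = new long[count];
    Arrays.fill(out, value);
    return out;
  }

  static long[] concat(long[] a, long[] b) {
    long[] out = Arrays.copyOf(a, a.length + b.length);
    System.arraycopy(b, 0, out, a.length, b.length);
    return out;
  }

  public static void main(String[] args) {
    // Instance A: 100 fast requests. Instance B: 90 fast requests and 10 slow ones (in ms).
    long[] instanceA = repeat(10, 100);
    long[] instanceB = concat(repeat(10, 90), repeat(1000, 10));

    long p95A = percentile(instanceA, 95);
    long p95B = percentile(instanceB, 95);
    long averaged = (p95A + p95B) / 2;
    long merged = percentile(concat(instanceA, instanceB), 95);

    System.out.println("p95 A = " + p95A + ", p95 B = " + p95B);
    System.out.println("average of local p95s = " + averaged);
    System.out.println("p95 over merged samples = " + merged);
  }
}
```

Averaging the two local values gives a number far from the percentile of the merged samples, which is why the bucket-based histogram remains the option to use when cross-instance aggregation is needed.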

core/src/main/java/com/datastax/oss/driver/api/core/config/OptionsMap.java

+1
@@ -378,6 +378,7 @@ protected static void fillWithDriverDefaults(OptionsMap map) {
     map.put(TypedDriverOption.COALESCER_INTERVAL, Duration.of(10, ChronoUnit.MICROS));
     map.put(TypedDriverOption.LOAD_BALANCING_DC_FAILOVER_MAX_NODES_PER_REMOTE_DC, 0);
     map.put(TypedDriverOption.LOAD_BALANCING_DC_FAILOVER_ALLOW_FOR_LOCAL_CONSISTENCY_LEVELS, false);
+    map.put(TypedDriverOption.METRICS_GENERATE_AGGREGABLE_HISTOGRAMS, true);
   }

   @Immutable

core/src/main/java/com/datastax/oss/driver/api/core/config/TypedDriverOption.java

+45
@@ -388,6 +388,10 @@ public String toString() {
   /** The consistency level to use for trace queries. */
   public static final TypedDriverOption<String> REQUEST_TRACE_CONSISTENCY =
       new TypedDriverOption<>(DefaultDriverOption.REQUEST_TRACE_CONSISTENCY, GenericType.STRING);
+  /** Whether or not to publish aggregable histogram for metrics */
+  public static final TypedDriverOption<Boolean> METRICS_GENERATE_AGGREGABLE_HISTOGRAMS =
+      new TypedDriverOption<>(
+          DefaultDriverOption.METRICS_GENERATE_AGGREGABLE_HISTOGRAMS, GenericType.BOOLEAN);
   /** List of enabled session-level metrics. */
   public static final TypedDriverOption<List<String>> METRICS_SESSION_ENABLED =
       new TypedDriverOption<>(
@@ -409,6 +413,12 @@ public String toString() {
       new TypedDriverOption<>(
           DefaultDriverOption.METRICS_SESSION_CQL_REQUESTS_SLO,
           GenericType.listOf(GenericType.DURATION));
+  /** Optional pre-defined percentile of cql requests to publish, as a list of percentiles . */
+  public static final TypedDriverOption<List<Double>>
+      METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES =
+          new TypedDriverOption<>(
+              DefaultDriverOption.METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES,
+              GenericType.listOf(GenericType.DOUBLE));
   /**
    * The number of significant decimal digits to which internal structures will maintain for
    * requests.
@@ -433,6 +443,12 @@ public String toString() {
       new TypedDriverOption<>(
           DefaultDriverOption.METRICS_SESSION_THROTTLING_SLO,
           GenericType.listOf(GenericType.DURATION));
+  /** Optional pre-defined percentile of throttling delay to publish, as a list of percentiles . */
+  public static final TypedDriverOption<List<Double>>
+      METRICS_SESSION_THROTTLING_PUBLISH_PERCENTILES =
+          new TypedDriverOption<>(
+              DefaultDriverOption.METRICS_SESSION_THROTTLING_PUBLISH_PERCENTILES,
+              GenericType.listOf(GenericType.DOUBLE));
   /**
    * The number of significant decimal digits to which internal structures will maintain for
    * throttling.
@@ -457,6 +473,12 @@ public String toString() {
       new TypedDriverOption<>(
           DefaultDriverOption.METRICS_NODE_CQL_MESSAGES_SLO,
           GenericType.listOf(GenericType.DURATION));
+  /** Optional pre-defined percentile of node cql messages to publish, as a list of percentiles . */
+  public static final TypedDriverOption<List<Double>>
+      METRICS_NODE_CQL_MESSAGES_PUBLISH_PERCENTILES =
+          new TypedDriverOption<>(
+              DefaultDriverOption.METRICS_NODE_CQL_MESSAGES_PUBLISH_PERCENTILES,
+              GenericType.listOf(GenericType.DOUBLE));
   /**
    * The number of significant decimal digits to which internal structures will maintain for
    * requests.
@@ -700,6 +722,15 @@ public String toString() {
       new TypedDriverOption<>(
           DseDriverOption.CONTINUOUS_PAGING_METRICS_SESSION_CQL_REQUESTS_SLO,
           GenericType.listOf(GenericType.DURATION));
+  /**
+   * Optional pre-defined percentile of continuous paging cql requests to publish, as a list of
+   * percentiles .
+   */
+  public static final TypedDriverOption<List<Double>>
+      CONTINUOUS_PAGING_METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES =
+          new TypedDriverOption<>(
+              DseDriverOption.CONTINUOUS_PAGING_METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES,
+              GenericType.listOf(GenericType.DOUBLE));
   /**
    * The number of significant decimal digits to which internal structures will maintain for
    * continuous requests.
@@ -774,6 +805,12 @@ public String toString() {
       new TypedDriverOption<>(
           DseDriverOption.METRICS_SESSION_GRAPH_REQUESTS_SLO,
           GenericType.listOf(GenericType.DURATION));
+  /** Optional pre-defined percentile of graph requests to publish, as a list of percentiles . */
+  public static final TypedDriverOption<List<Double>>
+      METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES =
+          new TypedDriverOption<>(
+              DseDriverOption.METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES,
+              GenericType.listOf(GenericType.DOUBLE));
   /**
    * The number of significant decimal digits to which internal structures will maintain for graph
    * requests.
@@ -798,6 +835,14 @@ public String toString() {
       new TypedDriverOption<>(
           DseDriverOption.METRICS_NODE_GRAPH_MESSAGES_SLO,
           GenericType.listOf(GenericType.DURATION));
+  /**
+   * Optional pre-defined percentile of node graph requests to publish, as a list of percentiles .
+   */
+  public static final TypedDriverOption<List<Double>>
+      METRICS_NODE_GRAPH_MESSAGES_PUBLISH_PERCENTILES =
+          new TypedDriverOption<>(
+              DseDriverOption.METRICS_NODE_GRAPH_MESSAGES_PUBLISH_PERCENTILES,
+              GenericType.listOf(GenericType.DOUBLE));
   /**
    * The number of significant decimal digits to which internal structures will maintain for graph
    * requests.

core/src/main/resources/reference.conf

+22-3
@@ -1434,6 +1434,16 @@ datastax-java-driver {
       // prefix = "cassandra"
     }

+    histograms {
+      # Adds histogram buckets used to generate aggregable percentile approximations in monitoring
+      # systems that have query facilities to do so (e.g. Prometheus histogram_quantile, Atlas percentiles).
+      #
+      # Required: no
+      # Modifiable at runtime: no
+      # Overridable in a profile: no
+      generate-aggregable = true
+    }
+
     # The session-level metrics (all disabled by default).
     #
     # Required: yes
@@ -1526,7 +1536,7 @@ datastax-java-driver {
       # Modifiable at runtime: no
       # Overridable in a profile: no
       cql-requests {
-
+
         # The largest latency that we expect to record.
         #
         # This should be slightly higher than request.timeout (in theory, readings can't be higher
@@ -1569,15 +1579,19 @@ datastax-java-driver {
         # time).
         # Valid for: Dropwizard.
         refresh-interval = 5 minutes
-
+
         # An optional list of latencies to track as part of the application's service-level
         # objectives (SLOs).
         #
         # If defined, the histogram is guaranteed to contain these boundaries alongside other
         # buckets used to generate aggregable percentile approximations.
         # Valid for: Micrometer.
         // slo = [ 100 milliseconds, 500 milliseconds, 1 second ]
-
+
+        # An optional list of percentiles to be published by Micrometer. Produces an additional time series for each requested percentile.
+        # This percentile is computed locally, and so can't be aggregated with percentiles computed across other dimensions (e.g. in a different instance)
+        # Valid for: Micrometer.
+        // publish-percentiles = [ 0.75, 0.95, 0.99 ]
       }

       # Required: if the 'throttling.delay' metric is enabled, and Dropwizard or Micrometer is used.
@@ -1589,6 +1603,7 @@ datastax-java-driver {
         significant-digits = 3
         refresh-interval = 5 minutes
         // slo = [ 100 milliseconds, 500 milliseconds, 1 second ]
+        // publish-percentiles = [ 0.75, 0.95, 0.99 ]
       }

       # Required: if the 'continuous-cql-requests' metric is enabled, and Dropwizard or Micrometer
@@ -1601,6 +1616,7 @@ datastax-java-driver {
         significant-digits = 3
         refresh-interval = 5 minutes
         // slo = [ 100 milliseconds, 500 milliseconds, 1 second ]
+        // publish-percentiles = [ 0.75, 0.95, 0.99 ]
       }

       # Required: if the 'graph-requests' metric is enabled, and Dropwizard or Micrometer is used.
@@ -1612,6 +1628,7 @@ datastax-java-driver {
         significant-digits = 3
         refresh-interval = 5 minutes
         // slo = [ 100 milliseconds, 500 milliseconds, 1 second ]
+        // publish-percentiles = [ 0.75, 0.95, 0.99 ]
       }
     }
     # The node-level metrics (all disabled by default).
@@ -1776,6 +1793,7 @@ datastax-java-driver {
         significant-digits = 3
         refresh-interval = 5 minutes
         // slo = [ 100 milliseconds, 500 milliseconds, 1 second ]
+        // publish-percentiles = [ 0.75, 0.95, 0.99 ]
       }

       # See graph-requests in the `session` section
@@ -1789,6 +1807,7 @@ datastax-java-driver {
         significant-digits = 3
         refresh-interval = 5 minutes
         // slo = [ 100 milliseconds, 500 milliseconds, 1 second ]
+        // publish-percentiles = [ 0.75, 0.95, 0.99 ]
       }

       # The time after which the node level metrics will be evicted.

metrics/micrometer/src/main/java/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerMetricUpdater.java

+24-2
@@ -15,7 +15,9 @@
  */
 package com.datastax.oss.driver.internal.metrics.micrometer;

+import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
 import com.datastax.oss.driver.api.core.config.DriverExecutionProfile;
+import com.datastax.oss.driver.api.core.config.DriverOption;
 import com.datastax.oss.driver.internal.core.context.InternalDriverContext;
 import com.datastax.oss.driver.internal.core.metrics.AbstractMetricUpdater;
 import com.datastax.oss.driver.internal.core.metrics.MetricId;
@@ -27,6 +29,7 @@
 import io.micrometer.core.instrument.MeterRegistry;
 import io.micrometer.core.instrument.Tag;
 import io.micrometer.core.instrument.Timer;
+import java.util.List;
 import java.util.Set;
 import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.ConcurrentMap;
@@ -151,12 +154,31 @@ protected Timer getOrCreateTimerFor(MetricT metric) {
   }

   protected Timer.Builder configureTimer(Timer.Builder builder, MetricT metric, MetricId id) {
-    return builder.publishPercentileHistogram();
+    DriverExecutionProfile profile = context.getConfig().getDefaultProfile();
+    if (profile.getBoolean(DefaultDriverOption.METRICS_GENERATE_AGGREGABLE_HISTOGRAMS)) {
+      builder.publishPercentileHistogram();
+    }
+    return builder;
   }

   @SuppressWarnings("unused")
   protected DistributionSummary.Builder configureDistributionSummary(
       DistributionSummary.Builder builder, MetricT metric, MetricId id) {
-    return builder.publishPercentileHistogram();
+    DriverExecutionProfile profile = context.getConfig().getDefaultProfile();
+    if (profile.getBoolean(DefaultDriverOption.METRICS_GENERATE_AGGREGABLE_HISTOGRAMS)) {
+      builder.publishPercentileHistogram();
+    }
+    return builder;
+  }
+
+  static double[] toDoubleArray(List<Double> doubleList) {
+    return doubleList.stream().mapToDouble(Double::doubleValue).toArray();
+  }
+
+  static void configurePercentilesPublishIfDefined(
+      Timer.Builder builder, DriverExecutionProfile profile, DriverOption driverOption) {
+    if (profile.isDefined(driverOption)) {
+      builder.publishPercentiles(toDoubleArray(profile.getDoubleList(driverOption)));
+    }
   }
 }
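The two helpers added in this file can be exercised in isolation with a dependency-free sketch. Everything here other than the two helper names is an assumption made for illustration: `RecordingBuilder` is a hypothetical stand-in for Micrometer's `Timer.Builder`, and a plain `Map` keyed by option path stands in for `DriverExecutionProfile`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class PublishPercentilesSketch {

  // Hypothetical stand-in for Timer.Builder; it only records what was requested.
  static final class RecordingBuilder {
    double[] percentiles; // stays null unless publishPercentiles is called

    RecordingBuilder publishPercentiles(double... values) {
      this.percentiles = values;
      return this;
    }
  }

  // Same conversion as the helper in the diff: unbox a List<Double> into a double[].
  // A null element in the list would fail with a NullPointerException while unboxing.
  static double[] toDoubleArray(List<Double> doubleList) {
    return doubleList.stream().mapToDouble(Double::doubleValue).toArray();
  }

  // Touch the builder only when the option is present in the (stand-in) profile.
  static void configurePercentilesPublishIfDefined(
      RecordingBuilder builder, Map<String, List<Double>> profile, String optionPath) {
    if (profile.containsKey(optionPath)) {
      builder.publishPercentiles(toDoubleArray(profile.get(optionPath)));
    }
  }

  public static void main(String[] args) {
    String path = "advanced.metrics.session.cql-requests.publish-percentiles";
    Map<String, List<Double>> profile = Map.of(path, List.of(0.75, 0.95, 0.99));

    RecordingBuilder configured = new RecordingBuilder();
    configurePercentilesPublishIfDefined(configured, profile, path);
    System.out.println("defined:   " + Arrays.toString(configured.percentiles));

    RecordingBuilder untouched = new RecordingBuilder();
    configurePercentilesPublishIfDefined(untouched, Map.of(), path);
    System.out.println("undefined: " + Arrays.toString(untouched.percentiles));
  }
}
```

Because the option is commented out in `reference.conf`, the "undefined" path is the default: no percentile series are published unless the application opts in.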

metrics/micrometer/src/main/java/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerNodeMetricUpdater.java

+10-5
@@ -96,9 +96,9 @@ protected void cancelMetricsExpirationTimeout() {
   @Override
   protected Timer.Builder configureTimer(Timer.Builder builder, NodeMetric metric, MetricId id) {
     DriverExecutionProfile profile = context.getConfig().getDefaultProfile();
+    super.configureTimer(builder, metric, id);
     if (metric == DefaultNodeMetric.CQL_MESSAGES) {
-      return builder
-          .publishPercentileHistogram()
+      builder
           .minimumExpectedValue(
               profile.getDuration(DefaultDriverOption.METRICS_NODE_CQL_MESSAGES_LOWEST))
           .maximumExpectedValue(
@@ -111,9 +111,11 @@ protected Timer.Builder configureTimer(Timer.Builder builder, NodeMetric metric,
                   : null)
           .percentilePrecision(
               profile.getInt(DefaultDriverOption.METRICS_NODE_CQL_MESSAGES_DIGITS));
+
+      configurePercentilesPublishIfDefined(
+          builder, profile, DefaultDriverOption.METRICS_NODE_CQL_MESSAGES_PUBLISH_PERCENTILES);
     } else if (metric == DseNodeMetric.GRAPH_MESSAGES) {
-      return builder
-          .publishPercentileHistogram()
+      builder
           .minimumExpectedValue(
               profile.getDuration(DseDriverOption.METRICS_NODE_GRAPH_MESSAGES_LOWEST))
           .maximumExpectedValue(
@@ -125,7 +127,10 @@ protected Timer.Builder configureTimer(Timer.Builder builder, NodeMetric metric,
                   .toArray(new Duration[0])
                   : null)
           .percentilePrecision(profile.getInt(DseDriverOption.METRICS_NODE_GRAPH_MESSAGES_DIGITS));
+
+      configurePercentilesPublishIfDefined(
+          builder, profile, DseDriverOption.METRICS_NODE_GRAPH_MESSAGES_PUBLISH_PERCENTILES);
     }
-    return super.configureTimer(builder, metric, id);
+    return builder;
   }
 }
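The control-flow change in this file is easier to see in a reduced form. Before the change, each metric-specific branch returned early with its own unconditional `publishPercentileHistogram()` call; afterwards, the shared, flag-controlled setup in the superclass runs first and the branches only add metric-specific settings. The sketch below models that ordering with hypothetical stand-in types (`LoggingBuilder`, `BaseUpdater`, `NodeUpdater`) and is not the driver's actual class hierarchy.

```java
import java.util.ArrayList;
import java.util.List;

final class ConfigureOrderSketch {

  // Hypothetical stand-in for a timer builder; it logs each configuration call by name.
  static final class LoggingBuilder {
    final List<String> calls = new ArrayList<>();

    LoggingBuilder call(String name) {
      calls.add(name);
      return this;
    }
  }

  static class BaseUpdater {
    final boolean generateAggregable; // models advanced.metrics.histograms.generate-aggregable

    BaseUpdater(boolean generateAggregable) {
      this.generateAggregable = generateAggregable;
    }

    LoggingBuilder configureTimer(LoggingBuilder builder, String metric) {
      if (generateAggregable) {
        builder.call("publishPercentileHistogram");
      }
      return builder;
    }
  }

  static final class NodeUpdater extends BaseUpdater {
    NodeUpdater(boolean generateAggregable) {
      super(generateAggregable);
    }

    @Override
    LoggingBuilder configureTimer(LoggingBuilder builder, String metric) {
      super.configureTimer(builder, metric); // shared, flag-controlled setup runs first
      if (metric.equals("cql-messages")) {
        builder.call("minimumExpectedValue").call("maximumExpectedValue");
      }
      return builder; // single exit, so no branch can bypass the flag
    }
  }

  public static void main(String[] args) {
    System.out.println(new NodeUpdater(true).configureTimer(new LoggingBuilder(), "cql-messages").calls);
    System.out.println(new NodeUpdater(false).configureTimer(new LoggingBuilder(), "cql-messages").calls);
  }
}
```

With the flag off, the metric-specific bounds are still applied but no bucket histogram is requested, which is the behavior the new option is meant to allow.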
