schedulerlatency: Increase the go scheduler latency metric time coverage #158474
+40
−11
Before this commit, the Go scheduler latency metric was published once every 10 seconds but was based on only 2.5 seconds' worth of data. That left a 75% blind spot in the metric (7.5 of every 10 seconds were not covered). This matters especially for short-lived overload, which could go undetected by this metric.
This commit builds on the current interval at which we measure the scheduler latency (100ms), and keeps adding these 100ms measurements into a histogram that gets published (and cleared) every 10s.
The figure below compares the old metric (Before) and the new metric (After) on 2 clusters while running the following command:
while true; do timeout 3.5 roachprod run $CLUSTER:4 -- './cockroach workload run kv --concurrency=256 --read-percent=95 --duration=120m {pgurl:1}'; sleep 57.5; done
In the Before figure, many of the resulting spikes are missed, whereas they are visible in the new metric.
Release note: None
Fixes: #158475