Add Pressure Stall Information Metrics #3649
Conversation
Signed-off-by: Daniel Dao <[email protected]>
This adds 2 new sets of metrics:
- `psi_total`: read total number of seconds a resource is under pressure
- `psi_avg`: read ratio of time a resource is under pressure over a sliding time window.

For more details about these definitions, see:
- https://www.kernel.org/doc/html/latest/accounting/psi.html
- https://facebookmicrosites.github.io/psi/docs/overview

Signed-off-by: Daniel Dao <[email protected]>
This adds support for reading PSI metrics via Prometheus. We expose the following for `psi_total`:

```
container_cpu_psi_total_seconds
container_memory_psi_total_seconds
container_io_psi_total_seconds
```

And for `psi_avg`:

```
container_cpu_psi_avg10_ratio
container_cpu_psi_avg60_ratio
container_cpu_psi_avg300_ratio
container_memory_psi_avg10_ratio
container_memory_psi_avg60_ratio
container_memory_psi_avg300_ratio
container_io_psi_avg10_ratio
container_io_psi_avg60_ratio
container_io_psi_avg300_ratio
```

Signed-off-by: Daniel Dao <[email protected]>
issues: google#3052, google#3083, kubernetes/enhancements#4205

This change adds metrics for pressure stall information, which indicate why some or all tasks of a cgroupv2 have waited due to resource congestion (cpu, memory, io). The change exposes this information by including the _PSIStats_ of each controller in its stats, i.e. _CPUStats.PSI_, _MemoryStats.PSI_ and _DiskStats.PSI_. The information is additionally exposed as Prometheus metrics. The metrics follow the naming outlined by the prometheus/node-exporter, where stalled corresponds to `full` and waiting to `some`.

```
container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total
```

Signed-off-by: Felix Ehrenpfort <[email protected]>
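For context, a minimal sketch of how such PSI data could be laid out in Go, mirroring the kernel's `{cpu,memory,io}.pressure` file format (avg10/avg60/avg300 plus a cumulative total); the field names here are illustrative and may differ from the actual structs:

```go
// Illustrative only: field names assumed, mirroring the kernel PSI format.
type PSIData struct {
	Avg10  float64 // share of time stalled over the last 10s
	Avg60  float64 // share of time stalled over the last 60s
	Avg300 float64 // share of time stalled over the last 300s
	Total  uint64  // cumulative stall time in microseconds
}

// PSIStats holds the kernel's "some" line (at least one task stalled,
// exported here as *_waiting_*) and "full" line (all tasks stalled,
// exported as *_stalled_*).
type PSIStats struct {
	Some PSIData
	Full PSIData
}
```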
8b41ec5 to 103b4be
Looking great so far, the metric names and other conventions look fine.
Signed-off-by: Felix Ehrenpfort <[email protected]>
@SuperQ Thanks for the quick review. I've added the improvements.
Signed-off-by: Felix Ehrenpfort <[email protected]>
@SuperQ I'm going to take a look at the CPU PSI metrics again today. It seems that the CPU PSI full metric can be non-zero. I've stumbled upon this while reading kubernetes/enhancements#5062
@SuperQ I'm going to re-add the CPU full metric, as it's actively being reported by the kernel for cgroups.
Thank you for your work (and investigation) on this, @xinau!
Ah, nevermind. I believe these can be derived.
@rexagod Yup. With Prometheus we can derive arbitrary averages as they're just
Signed-off-by: Felix Ehrenpfort <[email protected]>
@dims Could you please approve the pending workflow here, or ping someone who could? The patch builds on top of the original PR while additionally following the community guidelines, and looks good to go in.
this is waiting for google maintainers like @cwangVT - i help take care of dependencies mostly (not features!). Also the Github hooks are broken, so the prow based ci jobs don't really work :(
// asNanosecondsToSeconds converts nanoseconds into a float64 representing seconds.
func asNanosecondsToSeconds(v uint64) float64 {
	return float64(v) / float64(time.Second)
}
the old way reads as slightly easier to understand to me. What motivates this change?
As mentioned in the review comments, it's confusing when you compare asMicrosecondsToSeconds(). The base unit is time.Nanosecond, so you would have to use time.Millisecond to convert microseconds to seconds. Simply using the float factor makes both functions consistent to read.
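To make the comparison concrete, here is a small sketch of the two styles being discussed. The helper names and exact implementations are assumptions for illustration, not necessarily the code in this PR:

```go
package main

import (
	"fmt"
	"time"
)

// Style using time package constants: numerically correct, but converting
// microseconds to seconds means dividing by time.Millisecond (1e6 ns),
// which is easy to misread.
func asMicrosecondsToSecondsConst(v uint64) float64 {
	return float64(v) / float64(time.Millisecond)
}

// Style using plain float factors: both helpers read the same way.
func asMicrosecondsToSecondsFactor(v uint64) float64 {
	return float64(v) / 1e6
}

func asNanosecondsToSecondsFactor(v uint64) float64 {
	return float64(v) / 1e9
}

func main() {
	// 1,500,000 µs and 1,500,000,000 ns are both 1.5 s.
	fmt.Println(asMicrosecondsToSecondsConst(1500000))    // 1.5
	fmt.Println(asMicrosecondsToSecondsFactor(1500000))   // 1.5
	fmt.Println(asNanosecondsToSecondsFactor(1500000000)) // 1.5
}
```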
if includedMetrics.Has(container.PressureMetrics) {
	c.containerMetrics = append(c.containerMetrics, []containerMetric{
		{
			name: "container_pressure_cpu_stalled_seconds_total",
I'm trying to reason about which metrics make sense to emit. It's interesting we emit the totals but don't emit the 10/60/300 averages. I almost wonder if we should add the structure pieces first and discuss the actual metrics we emit after.
No, we only need the totals, as the other metrics are derived values. With the totals the end user can derive arbitrary intervals, for example rate(container_pressure_cpu_stalled_seconds_total[60s]).
fair, I am not sure that prom query would give exactly the same value, but it's true that you could reconstruct those intervals from the total.
Yea, it's never going to be exactly the same because it depends on the exact timestamps and values involved.
But it will be within the same tolerance over time.
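To illustrate the point, a hedged sketch (not cadvisor code; the helper name is made up): the pressure ratio over any window can be approximated from two samples of the cumulative counter, which is the same idea as `rate(...)` in PromQL, modulo sample alignment and the kernel's exponential averaging.

```go
// pressureRatio approximates an avgN-style value from two samples of a
// cumulative stall counter such as container_pressure_cpu_stalled_seconds_total.
// prevTotal and currTotal are in seconds, taken windowSeconds apart.
// e.g. pressureRatio(12.0, 12.6, 60) == 0.01, i.e. ~1% of the last minute stalled.
func pressureRatio(prevTotal, currTotal, windowSeconds float64) float64 {
	return (currTotal - prevTotal) / windowSeconds
}
```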
@@ -1746,6 +1751,54 @@ func NewPrometheusCollector(i infoProvider, f ContainerLabelsFunc, includedMetri
	})
}

if includedMetrics.Has(container.PressureMetrics) {
I can almost see these as also being nested under cpu/memory/disk metrics. I am not sure if there is precedent for this, but maybe add both a PressureMetrics check for included metrics as well as a check for the respective other metric?

if includedMetrics.Has(container.PressureMetrics) && includedMetrics.Has(container.CPUMetrics) {
    report CPU pressure metrics
}
@haircommander I couldn't find any precedent for this. I'll add a check for each PSI resource (cpu, memory, io). I can't find a strong argument for or against adding such a check.
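A rough, self-contained sketch of what such gating could look like; the types and metric-set names below are hypothetical stand-ins for illustration, not cadvisor's actual constants:

```go
package main

import "fmt"

// Hypothetical stand-ins for the metric-set plumbing, just to make the
// gating idea concrete; the real types and constant names may differ.
type MetricKind string

type MetricSet map[MetricKind]struct{}

func (m MetricSet) Has(k MetricKind) bool { _, ok := m[k]; return ok }

const (
	PressureMetrics MetricKind = "pressure"
	CPUMetrics      MetricKind = "cpu"
	MemoryMetrics   MetricKind = "memory"
	DiskIOMetrics   MetricKind = "diskIO"
)

func main() {
	included := MetricSet{PressureMetrics: {}, CPUMetrics: {}}

	// Gate each PSI metric group on PressureMetrics plus the matching
	// resource metric set.
	if included.Has(PressureMetrics) {
		if included.Has(CPUMetrics) {
			fmt.Println("register container_pressure_cpu_* metrics")
		}
		if included.Has(MemoryMetrics) {
			fmt.Println("register container_pressure_memory_* metrics")
		}
		if included.Has(DiskIOMetrics) {
			fmt.Println("register container_pressure_io_* metrics")
		}
	}
}
```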
I don't see why having pressure metrics needs to depend on other metrics. Each metric dataset is independent of the others.
I'd prefer not to nest them under other metrics (e.g. container_cpu_) as this adds confusion to end-users used to the reporting scheme of the node-exporter (e.g. node_pressure_cpu_).
@SuperQ as PSI metrics are reported as part of the cpu/memory/io controllers, it might make sense not to report them when a user actively decides against getting metrics for one of these controllers.
It's a bit strange in the case of io, as it's reported for block io, not only disk.
As pointed out before, I'm undecided about this; it might also make sense to defer this decision to a later point in time when a real use-case arises. This then raises the question of which decision provides more backwards compatibility.
we can always emit differently in the future, I think this makes sense for now.
I can't remember if I'm powerful enough to do this, but this LGTM, none of my notes need addressing :)
issues: #3052, #3083, kubernetes/enhancements#4205
This change adds metrics for pressure stall information, which indicate
why some or all tasks of a cgroupv2 have waited due to resource
congestion (cpu, memory, io). The change exposes this information by
including the PSIStats of each controller in its stats, i.e.
CPUStats.PSI, MemoryStats.PSI and DiskStats.PSI.
The information is additionally exposed as Prometheus metrics. The
metrics follow the naming outlined by the prometheus/node-exporter,
where stalled corresponds to full and waiting to some.
This change is a rebase of the work done in #3083 and resolves the review comments there.