
Add Pressure Stall Information Metrics #3649

Merged
merged 7 commits into google:master on Feb 20, 2025

Conversation

@xinau (Contributor) commented Jan 26, 2025

issues: #3052, #3083, kubernetes/enhancements#4205

This change adds metrics for pressure stall information (PSI), which indicate
that some or all tasks of a cgroupv2 have waited due to resource congestion
(cpu, memory, io). The change exposes this information by including the
PSIStats of each controller in its stats, i.e. CPUStats.PSI, MemoryStats.PSI
and DiskStats.PSI.

The information is additionally exposed as Prometheus metrics. The metric
names follow the convention used by the prometheus/node-exporter, where
"stalled" corresponds to the PSI "full" state and "waiting" to the "some" state.

container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total

This change is a rebase of the work done in #3083, resolving the review comments there.
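
For orientation, the sketch below shows the data shapes this implies. The some/full split and the avg10/avg60/avg300/total fields come from the cgroupv2 PSI interface; the exact Go field names are an assumption for illustration and may differ from the final cadvisor types.

```
// PSIData mirrors one "some"/"full" line of a cgroupv2 pressure file.
// Field names are an assumption for illustration, not the exact cadvisor definitions.
type PSIData struct {
	Avg10  float64 // pressure average over the trailing 10s window (the kernel reports a percentage)
	Avg60  float64 // same, over the trailing 60s window
	Avg300 float64 // same, over the trailing 300s window
	Total  uint64  // cumulative stall time, in microseconds
}

// PSIStats groups the two PSI series of a resource: Some ("waiting": at least one
// task stalled) and Full ("stalled": all non-idle tasks stalled).
type PSIStats struct {
	Some PSIData
	Full PSIData
}
```

With a shape like this, CPUStats.PSI, MemoryStats.PSI and DiskStats.PSI each carry one PSIStats value, and the two Total fields feed the *_waiting_seconds_total and *_stalled_seconds_total counters listed above.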

This adds 2 new sets of metrics:
- `psi_total`: the total number of seconds a resource is under pressure
- `psi_avg`: the ratio of time a resource is under pressure over a
  sliding time window.

For more details about these definitions, see:
- https://www.kernel.org/doc/html/latest/accounting/psi.html
- https://facebookmicrosites.github.io/psi/docs/overview

Signed-off-by: Daniel Dao <[email protected]>
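
For context, the kernel exposes these values through the cpu.pressure, memory.pressure and io.pressure files of each cgroup, with one line for "some" and one for "full". A minimal, hypothetical parser (the path and function names are illustrative, not code from this PR) might look like:

```
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parsePressure reads a cgroupv2 pressure file, e.g. /sys/fs/cgroup/<path>/cpu.pressure,
// whose lines look like:
//   some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
//   full avg10=0.00 avg60=0.00 avg300=0.00 total=678
// and returns the key=value pairs keyed by "some"/"full".
func parsePressure(path string) (map[string]map[string]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	out := map[string]map[string]string{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		kv := map[string]string{}
		for _, field := range fields[1:] {
			if k, v, ok := strings.Cut(field, "="); ok {
				kv[k] = v
			}
		}
		out[fields[0]] = kv
	}
	return out, sc.Err()
}

func main() {
	stats, err := parsePressure("/sys/fs/cgroup/cpu.pressure")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// "total" is the cumulative stall time in microseconds; the avg fields are percentages.
	fmt.Println("some total:", stats["some"]["total"], "full total:", stats["full"]["total"])
}
```

Roughly speaking, the total= value is what backs `psi_total` (after converting microseconds to seconds), while the avg fields back `psi_avg`.
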
This adds support for reading PSI metrics via Prometheus. We expose the
following for `psi_total`:

```
container_cpu_psi_total_seconds
container_memory_psi_total_seconds
container_io_psi_total_seconds
```

And for `psi_avg`:

```
container_cpu_psi_avg10_ratio
container_cpu_psi_avg60_ratio
container_cpu_psi_avg300_ratio

container_memory_psi_avg10_ratio
container_memory_psi_avg60_ratio
container_memory_psi_avg300_ratio

container_io_psi_avg10_ratio
container_io_psi_avg60_ratio
container_io_psi_avg300_ratio
```

Signed-off-by: Daniel Dao <[email protected]>
@xinau (Contributor, Author) commented Jan 26, 2025

@rexagod, @SuperQ Could you please give this a review and advise me on how to get this change merged?

issues: google#3052, google#3083, kubernetes/enhancements#4205

This change adds metrics for pressure stall information (PSI), which indicate
that some or all tasks of a cgroupv2 have waited due to resource congestion
(cpu, memory, io). The change exposes this information by including the
_PSIStats_ of each controller in its stats, i.e. _CPUStats.PSI_, _MemoryStats.PSI_
and _DiskStats.PSI_.

The information is additionally exposed as Prometheus metrics. The metric
names follow the convention used by the prometheus/node-exporter, where
"stalled" corresponds to the PSI "full" state and "waiting" to the "some" state.

```
container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total
```

Signed-off-by: Felix Ehrenpfort <[email protected]>
@xinau force-pushed the xinau/add-psi-metrics branch from 8b41ec5 to 103b4be on January 26, 2025 at 17:30
@SuperQ (Contributor) commented Jan 26, 2025

Looking great so far; the metric names and other conventions look fine.

@xinau (Contributor, Author) commented Jan 26, 2025

@SuperQ Thanks for the quick review. I've added the improvements.

@xinau requested a review from SuperQ on January 26, 2025 at 20:34
@xinau (Contributor, Author) commented Jan 27, 2025

@SuperQ I'm going to take a look at the CPU PSI metrics again today. It seems that the CPU PSI full metric can be non-zero. I stumbled upon this while reading kubernetes/enhancements#5062.

@xinau (Contributor, Author) commented Jan 27, 2025

@SuperQ I'm going to re-add the CPU full metric, as it's actively being reported by the kernel for cgroups.

> Naturally, the FULL state doesn't exist for the CPU resource at the
> system level, but exist at the cgroup level, means all non-idle tasks
> in a cgroup are delayed on the CPU resource which used by others outside
> of the cgroup or throttled by the cgroup cpu.max configuration.

See:
- https://lore.kernel.org/all/[email protected]/
- https://lore.kernel.org/all/[email protected]/

@rexagod commented Jan 27, 2025

Thank you for your work (and investigation) on this, @xinau!

Not sure, but after a quick look I can see we dropped `container_%s_psi_avg%s_ratio` here. Was this intentional?

Ah, nevermind. I believe these can be derived.

@SuperQ (Contributor) commented Jan 27, 2025

@rexagod Yup. With Prometheus we can derive arbitrary averages, as they're just `rate(container_..._total[Xm])`.

Signed-off-by: Felix Ehrenpfort <[email protected]>
@xinau (Contributor, Author) commented Jan 27, 2025

@rexagod, @SuperQ all good from my side now.

@rexagod left a comment

@dims Could you please approve the pending workflow here, or ping someone who could? The patch builds on top of the original PR while additionally following the community guidelines, and looks good to go in.

@pacoxu (Contributor) commented Feb 14, 2025

> @dims Could you please approve the pending workflow here, or ping someone who could? The patch builds on top of the original PR while additionally following the community guidelines, and looks good to go in.

Kindly ping @dims.

@dims (Collaborator) commented Feb 14, 2025

This is waiting for Google maintainers like @cwangVT; I mostly help take care of dependencies (not features!). Also, the GitHub hooks are broken, so the Prow-based CI jobs don't really work :(

```
// asNanosecondsToSeconds converts nanoseconds into a float64 representing seconds.
func asNanosecondsToSeconds(v uint64) float64 {
	return float64(v) / float64(time.Second)
}
```
Contributor:

the old way reads as slightly easier to understand to me. What motivates this change?

Contributor:

As mentioned in the review comments, it's confusing when you compare `asMicrosecondsToSeconds()`. The base unit is `time.Nanosecond`, so you would have to use `time.Millisecond` to convert microseconds to seconds.

By simply using the float factor, both functions read consistently.
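
To illustrate the consistency point, here is a small side-by-side sketch of the two styles; the helper names and the microsecond variant's body are assumptions for illustration, not the exact code in this PR:

```
package main

import (
	"fmt"
	"time"
)

// Style A: time.Duration constants. Because the base unit of a Duration is the
// nanosecond, the microsecond helper has to divide by time.Millisecond (1e6 ns),
// which is numerically right but confusing to read next to the nanosecond helper.
func nsToSecondsA(v uint64) float64 { return float64(v) / float64(time.Second) }
func usToSecondsA(v uint64) float64 { return float64(v) / float64(time.Millisecond) }

// Style B: plain float factors. Both helpers read the same way.
func nsToSecondsB(v uint64) float64 { return float64(v) / 1e9 }
func usToSecondsB(v uint64) float64 { return float64(v) / 1e6 }

func main() {
	fmt.Println(nsToSecondsA(1_500_000_000), usToSecondsA(1_500_000)) // 1.5 1.5
	fmt.Println(nsToSecondsB(1_500_000_000), usToSecondsB(1_500_000)) // 1.5 1.5
}
```

Both styles produce the same numbers; the difference is purely in how the conversion factor reads.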

```
if includedMetrics.Has(container.PressureMetrics) {
	c.containerMetrics = append(c.containerMetrics, []containerMetric{
		{
			name: "container_pressure_cpu_stalled_seconds_total",
```
Contributor:

I'm trying to reason about which metrics make sense to emit. It's interesting that we emit the totals but don't emit the 10/60/300 averages... I almost wonder if we should add the structure pieces first and discuss the actual metrics we emit after.

@SuperQ (Contributor) commented Feb 18, 2025

No, we only need the totals, as the other metrics are derived values. With the totals, the end user can derive arbitrary intervals, for example `rate(container_pressure_cpu_stalled_seconds_total[60s])`.

Contributor:

Fair; I am not sure that the Prom query would get exactly the same values, but it's true that you could reconstruct those intervals from the total.

Contributor:

Yeah, it's never going to be exactly the same, because it depends on the exact timestamps and values involved.

But it will be within the same tolerance over time.
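
To make this concrete, here is a small sketch with made-up numbers showing how a pressure ratio over a window falls out of two samples of the cumulative total, which is what the rate() query above computes:

```
package main

import "fmt"

func main() {
	// Two scrapes of container_pressure_cpu_stalled_seconds_total, 60 seconds apart.
	t0, t1 := 12.4, 13.0 // cumulative stalled seconds at each scrape
	window := 60.0       // seconds between the two samples

	// rate() over the window yields stalled-seconds per second. This corresponds to
	// the kernel's avg60 value (which the kernel reports as a percentage), up to
	// the sampling tolerance discussed above.
	ratio := (t1 - t0) / window
	fmt.Printf("stalled ratio over the window: %.3f\n", ratio) // prints 0.010
}
```

With only the totals exported, any of the avg windows can be reconstructed this way at query time.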

```
@@ -1746,6 +1751,54 @@ func NewPrometheusCollector(i infoProvider, f ContainerLabelsFunc, includedMetrics
	})
}

if includedMetrics.Has(container.PressureMetrics) {
```
Contributor:

I can almost see these as also being nested under the cpu/memory/disk metrics. I am not sure if there is precedent for this, but maybe require both PressureMetrics in the included metrics and also check the respective other metric?

```
if includedMetrics.Has(container.PressureMetrics) && includedMetrics.Has(container.CPUMetrics) {
	// report CPU pressure metrics
}
```

Contributor (Author):

@haircommander I couldn't find any precedent for this. I'll add a check for each PSI resource (cpu, memory, io). I can't find a strong argument for or against adding such a check.

Contributor:

I don't see why having pressure metrics needs to depend on other metrics. Each metric dataset is independent of the others.

Contributor (Author):

I'd prefer not to nest them under other metrics (e.g. `container_cpu_`), as this would confuse end users who are used to the reporting scheme of the node-exporter (e.g. `node_pressure_cpu_`).

Contributor (Author):

@SuperQ Since PSI metrics are reported as part of the cpu/memory/io controllers, it might make sense not to report them when a user actively decides against getting metrics for one of these controllers.

It's a bit strange in the case of io, as PSI is reported for block I/O, not only disk.

As pointed out before, I'm undecided about this; it might also make sense to defer the decision until a real use case arises. That then raises the question of which choice provides more backwards compatibility.

Contributor:

We can always emit differently in the future; I think this makes sense for now.

@haircommander (Contributor):

I can't remember if I'm powerful enough to do this
/ok-to-test

this LGTM, none of my notes need addressing :)

@cwangVT cwangVT self-requested a review February 19, 2025 17:51
@cwangVT cwangVT merged commit 5bd422f into google:master Feb 20, 2025
7 checks passed