
Add Pressure Stall Information Metrics #3649

Merged
merged 7 commits into google:master on Feb 20, 2025

Conversation

@xinau (Contributor) commented Jan 26, 2025

issues: #3052, #3083, kubernetes/enhancements#4205

This change adds metrics for pressure stall information (PSI), which indicate
that some or all tasks of a cgroupv2 have waited due to resource congestion
(cpu, memory, io). The change exposes this information by including the
PSIStats of each controller in its stats, i.e. CPUStats.PSI, MemoryStats.PSI
and DiskStats.PSI.

The information is additionally exposed as Prometheus metrics. The metric
names follow the convention used by the prometheus/node-exporter, where
"stalled" corresponds to the PSI "full" state and "waiting" to the "some" state.

container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total

This change is a rebase of the work done in #3083, resolving the review comments there.
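
For orientation, the sketch below shows the data shapes this implies. The some/full split and the avg10/avg60/avg300/total fields come from the cgroupv2 PSI interface; the exact Go field names are an assumption for illustration and may differ from the final cadvisor types.

```
// PSIData mirrors one "some"/"full" line of a cgroupv2 pressure file.
// Field names are an assumption for illustration, not the exact cadvisor definitions.
type PSIData struct {
	Avg10  float64 // pressure average over the trailing 10s window (the kernel reports a percentage)
	Avg60  float64 // same, over the trailing 60s window
	Avg300 float64 // same, over the trailing 300s window
	Total  uint64  // cumulative stall time, in microseconds
}

// PSIStats groups the two PSI series of a resource: Some ("waiting": at least one
// task stalled) and Full ("stalled": all non-idle tasks stalled).
type PSIStats struct {
	Some PSIData
	Full PSIData
}
```

With a shape like this, CPUStats.PSI, MemoryStats.PSI and DiskStats.PSI each carry one PSIStats value, and the two Total fields feed the *_waiting_seconds_total and *_stalled_seconds_total counters listed above.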

This adds 2 new sets of metrics:
- `psi_total`: the total number of seconds a resource is under pressure
- `psi_avg`: the ratio of time a resource is under pressure over a
  sliding time window.

For more details about these definitions, see:
- https://www.kernel.org/doc/html/latest/accounting/psi.html
- https://facebookmicrosites.github.io/psi/docs/overview

Signed-off-by: Daniel Dao <[email protected]>
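
For context, the kernel exposes these values through the cpu.pressure, memory.pressure and io.pressure files of each cgroup, with one line for "some" and one for "full". A minimal, hypothetical parser (the path and function names are illustrative, not code from this PR) might look like:

```
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parsePressure reads a cgroupv2 pressure file, e.g. /sys/fs/cgroup/<path>/cpu.pressure,
// whose lines look like:
//   some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
//   full avg10=0.00 avg60=0.00 avg300=0.00 total=678
// and returns the key=value pairs keyed by "some"/"full".
func parsePressure(path string) (map[string]map[string]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	out := map[string]map[string]string{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		kv := map[string]string{}
		for _, field := range fields[1:] {
			if k, v, ok := strings.Cut(field, "="); ok {
				kv[k] = v
			}
		}
		out[fields[0]] = kv
	}
	return out, sc.Err()
}

func main() {
	stats, err := parsePressure("/sys/fs/cgroup/cpu.pressure")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// "total" is the cumulative stall time in microseconds; the avg fields are percentages.
	fmt.Println("some total:", stats["some"]["total"], "full total:", stats["full"]["total"])
}
```

Roughly speaking, the total= value is what backs `psi_total` (after converting microseconds to seconds), while the avg fields back `psi_avg`.
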
This adds support for reading PSI metrics via Prometheus. We expose the
following for `psi_total`:

```
container_cpu_psi_total_seconds
container_memory_psi_total_seconds
container_io_psi_total_seconds
```

And for `psi_avg`:

```
container_cpu_psi_avg10_ratio
container_cpu_psi_avg60_ratio
container_cpu_psi_avg300_ratio

container_memory_psi_avg10_ratio
container_memory_psi_avg60_ratio
container_memory_psi_avg300_ratio

container_io_psi_avg10_ratio
container_io_psi_avg60_ratio
container_io_psi_avg300_ratio
```

Signed-off-by: Daniel Dao <[email protected]>
@xinau (Contributor, Author) commented Jan 26, 2025

@rexagod, @SuperQ Could you please give this a review and advise me on how to get this change merged?

issues: google#3052, google#3083, kubernetes/enhancements#4205

This change adds metrics for pressure stall information (PSI), which indicate
that some or all tasks of a cgroupv2 have waited due to resource congestion
(cpu, memory, io). The change exposes this information by including the
_PSIStats_ of each controller in its stats, i.e. _CPUStats.PSI_, _MemoryStats.PSI_
and _DiskStats.PSI_.

The information is additionally exposed as Prometheus metrics. The metric
names follow the convention used by the prometheus/node-exporter, where
"stalled" corresponds to the PSI "full" state and "waiting" to the "some" state.

```
container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total
```

Signed-off-by: Felix Ehrenpfort <[email protected]>
@xinau force-pushed the xinau/add-psi-metrics branch from 8b41ec5 to 103b4be on January 26, 2025 at 17:30
@SuperQ (Contributor) commented Jan 26, 2025

Looking great so far; the metric names and other conventions look fine.

@xinau (Contributor, Author) commented Jan 26, 2025

@SuperQ Thanks for the quick review. I've added the improvements.

@xinau requested a review from SuperQ on January 26, 2025 at 20:34
@xinau (Contributor, Author) commented Jan 27, 2025

@SuperQ I'm going to take a look at the CPU PSI metrics again today. It seems that the CPU PSI full metric can be non-zero. I stumbled upon this while reading kubernetes/enhancements#5062.

@xinau (Contributor, Author) commented Jan 27, 2025

@SuperQ I'm going to re-add the CPU full metric, as it's actively being reported by the kernel for cgroups.

> Naturally, the FULL state doesn't exist for the CPU resource at the
> system level, but exist at the cgroup level, means all non-idle tasks
> in a cgroup are delayed on the CPU resource which used by others outside
> of the cgroup or throttled by the cgroup cpu.max configuration.

See:
- https://lore.kernel.org/all/[email protected]/
- https://lore.kernel.org/all/[email protected]/

@rexagod commented Jan 27, 2025

Thank you for your work (and investigation) on this, @xinau!

Not sure, but after a quick look I can see we dropped `container_%s_psi_avg%s_ratio` here. Was this intentional?

Ah, nevermind. I believe these can be derived.

@SuperQ (Contributor) commented Jan 27, 2025

@rexagod Yup. With Prometheus we can derive arbitrary averages, as they're just `rate(container_..._total[Xm])`.

Signed-off-by: Felix Ehrenpfort <[email protected]>
@xinau (Contributor, Author) commented Jan 27, 2025

@rexagod, @SuperQ all good from my side now.

@rexagod left a comment

@dims Could you please approve the pending workflow here, or ping someone who could? The patch builds on top of the original PR while additionally following the community guidelines, and looks good to go in.

@pacoxu (Contributor) commented Feb 14, 2025

> @dims Could you please approve the pending workflow here, or ping someone who could? The patch builds on top of the original PR while additionally following the community guidelines, and looks good to go in.

Kindly ping @dims.

@dims (Collaborator) commented Feb 14, 2025

This is waiting for Google maintainers like @cwangVT; I mostly help take care of dependencies (not features!). Also, the GitHub hooks are broken, so the Prow-based CI jobs don't really work :(

```
// asNanosecondsToSeconds converts nanoseconds into a float64 representing seconds.
func asNanosecondsToSeconds(v uint64) float64 {
	return float64(v) / float64(time.Second)
}
```
Contributor:

the old way reads as slightly easier to understand to me. What motivates this change?

Contributor:

As mentioned in the review comments, it's confusing when you compare `asMicrosecondsToSeconds()`. The base unit is `time.Nanosecond`, so you would have to use `time.Millisecond` to convert microseconds to seconds.

By simply using the float factor, both functions read consistently.
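
To illustrate the consistency point, here is a small side-by-side sketch of the two styles; the helper names and the microsecond variant's body are assumptions for illustration, not the exact code in this PR:

```
package main

import (
	"fmt"
	"time"
)

// Style A: time.Duration constants. Because the base unit of a Duration is the
// nanosecond, the microsecond helper has to divide by time.Millisecond (1e6 ns),
// which is numerically right but confusing to read next to the nanosecond helper.
func nsToSecondsA(v uint64) float64 { return float64(v) / float64(time.Second) }
func usToSecondsA(v uint64) float64 { return float64(v) / float64(time.Millisecond) }

// Style B: plain float factors. Both helpers read the same way.
func nsToSecondsB(v uint64) float64 { return float64(v) / 1e9 }
func usToSecondsB(v uint64) float64 { return float64(v) / 1e6 }

func main() {
	fmt.Println(nsToSecondsA(1_500_000_000), usToSecondsA(1_500_000)) // 1.5 1.5
	fmt.Println(nsToSecondsB(1_500_000_000), usToSecondsB(1_500_000)) // 1.5 1.5
}
```

Both styles produce the same numbers; the difference is purely in how the conversion factor reads.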

```
if includedMetrics.Has(container.PressureMetrics) {
	c.containerMetrics = append(c.containerMetrics, []containerMetric{
		{
			name: "container_pressure_cpu_stalled_seconds_total",
```
Contributor:

I'm trying to reason about which metrics make sense to emit. It's interesting that we emit the totals but don't emit the 10/60/300 averages... I almost wonder if we should add the structure pieces first and discuss the actual metrics we emit after.

@SuperQ (Contributor) commented Feb 18, 2025

No, we only need the totals, as the other metrics are derived values. With the totals, the end user can derive arbitrary intervals, for example `rate(container_pressure_cpu_stalled_seconds_total[60s])`.

Contributor:

Fair; I am not sure that the Prom query would get exactly the same values, but it's true that you could reconstruct those intervals from the total.

Contributor:

Yeah, it's never going to be exactly the same, because it depends on the exact timestamps and values involved.

But it will be within the same tolerance over time.
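
To make this concrete, here is a small sketch with made-up numbers showing how a pressure ratio over a window falls out of two samples of the cumulative total, which is what the rate() query above computes:

```
package main

import "fmt"

func main() {
	// Two scrapes of container_pressure_cpu_stalled_seconds_total, 60 seconds apart.
	t0, t1 := 12.4, 13.0 // cumulative stalled seconds at each scrape
	window := 60.0       // seconds between the two samples

	// rate() over the window yields stalled-seconds per second. This corresponds to
	// the kernel's avg60 value (which the kernel reports as a percentage), up to
	// the sampling tolerance discussed above.
	ratio := (t1 - t0) / window
	fmt.Printf("stalled ratio over the window: %.3f\n", ratio) // prints 0.010
}
```

With only the totals exported, any of the avg windows can be reconstructed this way at query time.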

```
@@ -1746,6 +1751,54 @@ func NewPrometheusCollector(i infoProvider, f ContainerLabelsFunc, includedMetrics
	})
}

if includedMetrics.Has(container.PressureMetrics) {
```
Contributor:

I can almost see these as also being nested under the cpu/memory/disk metrics. I am not sure if there is precedent for this, but maybe require both PressureMetrics in the included metrics and also check the respective other metric?

```
if includedMetrics.Has(container.PressureMetrics) && includedMetrics.Has(container.CPUMetrics) {
	// report CPU pressure metrics
}
```

Contributor (Author):

@haircommander I couldn't find any precedent for this. I'll add a check for each PSI resource (cpu, memory, io). I can't find a strong argument for or against adding such a check.

Contributor:

I don't see why having pressure metrics needs to depend on other metrics. Each metric dataset is independent of the others.

Contributor (Author):

I'd prefer not to nest them under other metrics (e.g. `container_cpu_`), as this would confuse end users who are used to the reporting scheme of the node-exporter (e.g. `node_pressure_cpu_`).

Contributor (Author):

@SuperQ Since PSI metrics are reported as part of the cpu/memory/io controllers, it might make sense not to report them when a user actively decides against getting metrics for one of these controllers.

It's a bit strange in the case of io, as PSI is reported for block I/O, not only disk.

As pointed out before, I'm undecided about this; it might also make sense to defer the decision until a real use case arises. That then raises the question of which choice provides more backwards compatibility.

Contributor:

We can always emit differently in the future; I think this makes sense for now.

@haircommander (Contributor):

I can't remember if I'm powerful enough to do this
/ok-to-test

this LGTM, none of my notes need addressing :)

@cwangVT cwangVT self-requested a review February 19, 2025 17:51
@cwangVT cwangVT merged commit 5bd422f into google:master Feb 20, 2025
7 checks passed