Add Pressure Stall Information Metrics #3649
Changes from all commits: b621e78, d3fefb9, 6b23ac7, 103b4be, 94a027c, e238b08, 20e5af2
@@ -33,9 +33,14 @@ import (
 // asFloat64 converts a uint64 into a float64.
 func asFloat64(v uint64) float64 { return float64(v) }

+// asMicrosecondsToSeconds converts microseconds into a float64 representing seconds.
+func asMicrosecondsToSeconds(v uint64) float64 {
+	return float64(v) / 1e6
+}
+
 // asNanosecondsToSeconds converts nanoseconds into a float64 representing seconds.
 func asNanosecondsToSeconds(v uint64) float64 {
-	return float64(v) / float64(time.Second)
+	return float64(v) / 1e9
 }

 // fsValues is a helper method for assembling per-filesystem stats.
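As context for the helpers above, a small, self-contained sketch (not part of the diff) of how the two conversions behave on sample values. The helper copies and the sample numbers are illustrative only; the microsecond case reflects the fact that the kernel reports PSI "total" fields in microseconds, which is why the new helper divides by 1e6.

```go
package main

import "fmt"

// Standalone copies of the helpers from the diff above, for illustration only.
func asMicrosecondsToSeconds(v uint64) float64 { return float64(v) / 1e6 }
func asNanosecondsToSeconds(v uint64) float64  { return float64(v) / 1e9 }

func main() {
	// A PSI total of 1,500,000 us corresponds to 1.5 s of stalled time.
	fmt.Println(asMicrosecondsToSeconds(1500000)) // 1.5
	// A value of 1,500,000,000 ns also corresponds to 1.5 s.
	fmt.Println(asNanosecondsToSeconds(1500000000)) // 1.5
}
```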
@@ -1746,6 +1751,54 @@ func NewPrometheusCollector(i infoProvider, f ContainerLabelsFunc, includedMetri
 		})
 	}

+	if includedMetrics.Has(container.PressureMetrics) {
Review thread on this line:

Comment: I can almost see these as also being nested under cpu/memory/disk metrics. I am not sure if there is precedent for this, but maybe add both a PressureMetrics for included metrics, as well as check the respective other metric?

Reply: @haircommander I couldn't find any precedent for this. I'll add a check for each PSI resource (cpu, memory, io). I can't find a strong argument for or against adding such a check.

Reply: I don't see why having pressure metrics needs to depend on other metrics. Each metric dataset is independent of the others.

Reply: I'd prefer not to nest them under other metrics (e.g. …

Reply: @SuperQ as PSI metrics are reported as part of the cpu/memory/io controllers, it might make sense not to report them when a user actively decides against getting metrics for one of those controllers. It's a bit strange in the case of io, as it's reported for block io, not only disk. As pointed out before, I'm undecided about this; it might also make sense to defer this decision until a real use case arises. That then raises the question of which choice provides more backwards compatibility.

Reply: We can always emit differently in the future; I think this makes sense for now.
+		c.containerMetrics = append(c.containerMetrics, []containerMetric{
+			{
+				name:      "container_pressure_cpu_stalled_seconds_total",
Review thread on this line (marked as resolved by xinau):

Comment: I'm trying to reason about which metrics make sense to emit. It's interesting that we emit the totals but don't emit the 10/60/300 averages. I almost wonder if we should add the structural pieces first and discuss the actual metrics we emit after.

Reply: No, we only need the totals, as the other metrics are derived values. With the totals the end user can derive arbitrary intervals. For example …

Reply: Fair, I am not sure that Prom query would get exactly the same result, but it's true that you could reconstruct those intervals from the total.

Reply: Yeah, it's never going to be exactly the same, because it depends on exactly the timestamps and values involved. But it will be within the same tolerance over time.

(A sketch of deriving an interval average from these totals follows the diff below.)
+				help:      "Total time duration no tasks in the container could make progress due to CPU congestion.",
+				valueType: prometheus.CounterValue,
+				getValues: func(s *info.ContainerStats) metricValues {
+					return metricValues{{value: asMicrosecondsToSeconds(s.Cpu.PSI.Full.Total), timestamp: s.Timestamp}}
+				},
+			}, {
+				name:      "container_pressure_cpu_waiting_seconds_total",
+				help:      "Total time duration tasks in the container have waited due to CPU congestion.",
+				valueType: prometheus.CounterValue,
+				getValues: func(s *info.ContainerStats) metricValues {
+					return metricValues{{value: asMicrosecondsToSeconds(s.Cpu.PSI.Some.Total), timestamp: s.Timestamp}}
+				},
+			}, {
+				name:      "container_pressure_memory_stalled_seconds_total",
+				help:      "Total time duration no tasks in the container could make progress due to memory congestion.",
+				valueType: prometheus.CounterValue,
+				getValues: func(s *info.ContainerStats) metricValues {
+					return metricValues{{value: asMicrosecondsToSeconds(s.Memory.PSI.Full.Total), timestamp: s.Timestamp}}
+				},
+			}, {
+				name:      "container_pressure_memory_waiting_seconds_total",
+				help:      "Total time duration tasks in the container have waited due to memory congestion.",
+				valueType: prometheus.CounterValue,
+				getValues: func(s *info.ContainerStats) metricValues {
+					return metricValues{{value: asMicrosecondsToSeconds(s.Memory.PSI.Some.Total), timestamp: s.Timestamp}}
+				},
+			}, {
+				name:      "container_pressure_io_stalled_seconds_total",
+				help:      "Total time duration no tasks in the container could make progress due to IO congestion.",
+				valueType: prometheus.CounterValue,
+				getValues: func(s *info.ContainerStats) metricValues {
+					return metricValues{{value: asMicrosecondsToSeconds(s.DiskIo.PSI.Full.Total), timestamp: s.Timestamp}}
+				},
+			}, {
+				name:      "container_pressure_io_waiting_seconds_total",
+				help:      "Total time duration tasks in the container have waited due to IO congestion.",
+				valueType: prometheus.CounterValue,
+				getValues: func(s *info.ContainerStats) metricValues {
+					return metricValues{{value: asMicrosecondsToSeconds(s.DiskIo.PSI.Some.Total), timestamp: s.Timestamp}}
+				},
+			},
+		}...)
+	}
+
 	return c
 }
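To illustrate the point made in the resolved review thread above, here is a small, self-contained sketch (not from the PR; the timestamps and counter values are invented). It computes the fraction of wall-clock time spent stalled over a window from two scrapes of a cumulative *_seconds_total counter, which is conceptually what a Prometheus rate() over these metrics yields and what the kernel's avg10/avg60/avg300 fields approximate.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Two hypothetical scrapes of container_pressure_cpu_waiting_seconds_total.
	t0, v0 := time.Unix(1000, 0), 12.5 // cumulative stalled seconds at the first scrape
	t1, v1 := time.Unix(1060, 0), 15.5 // cumulative stalled seconds 60s later

	window := t1.Sub(t0).Seconds()
	// The increase of the counter divided by the elapsed wall-clock time gives
	// the average pressure over the window (0..1), analogous to PSI's avg60.
	avg := (v1 - v0) / window
	fmt.Printf("average CPU pressure over %.0fs: %.0f%%\n", window, avg*100) // 5%
}
```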
Review thread on the asNanosecondsToSeconds change:

Comment: The old way reads as slightly easier to understand to me. What motivates this change?

Reply: As mentioned in the review comments, it's confusing when you compare it with asMicrosecondsToSeconds(). The base unit is time.Nanosecond, so you would have to use time.Millisecond to convert microseconds to seconds. By simply using the float factor, both functions read consistently.
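For the reasoning above, a short, self-contained check (not from the PR) of the numeric values involved: time.Duration counts nanoseconds, so float64(time.Second) is 1e9, and converting a microsecond count to seconds via duration constants would require dividing by float64(time.Millisecond), which is what makes the duration-based spelling read confusingly next to asMicrosecondsToSeconds.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// time.Duration counts nanoseconds, so the constants are plain numeric factors.
	fmt.Println(float64(time.Second) == 1e9)      // true: divisor for nanoseconds -> seconds
	fmt.Println(float64(time.Millisecond) == 1e6) // true: divisor for microseconds -> seconds

	// The spelling the diff removes and the one it adds are equivalent.
	var nanos uint64 = 1500000000
	fmt.Println(float64(nanos)/float64(time.Second) == float64(nanos)/1e9) // true
}
```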