-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Use a non-global metrics registry in Teleport #50913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -54,6 +54,7 @@ import ( | |
"github.com/gravitational/roundtrip" | ||
"github.com/gravitational/trace" | ||
"github.com/jonboulle/clockwork" | ||
"github.com/prometheus/client_golang/prometheus" | ||
"github.com/prometheus/client_golang/prometheus/promhttp" | ||
"github.com/quic-go/quic-go" | ||
"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc" | ||
|
@@ -657,6 +658,15 @@ type TeleportProcess struct { | |
// resolver is used to identify the reverse tunnel address when connecting via | ||
// the proxy. | ||
resolver reversetunnelclient.Resolver | ||
|
||
// metricRegistry is the prometheus metric registry for the process. | ||
// Every teleport service that wants to register metrics should use this | ||
// instead of the global prometheus.DefaultRegisterer to avoid registration | ||
// conflicts. | ||
// | ||
// Both the metricsRegistry and the default global registry are gathered by | ||
// Telepeort's metric service. | ||
metricsRegistry *prometheus.Registry | ||
} | ||
|
||
// processIndex is an internal process index | ||
|
@@ -1179,6 +1189,7 @@ func NewTeleport(cfg *servicecfg.Config) (*TeleportProcess, error) { | |
logger: cfg.Logger, | ||
cloudLabels: cloudLabels, | ||
TracingProvider: tracing.NoopProvider(), | ||
metricsRegistry: cfg.MetricsRegistry, | ||
} | ||
|
||
process.registerExpectedServices(cfg) | ||
|
@@ -3405,11 +3416,46 @@ func (process *TeleportProcess) initUploaderService() error { | |
return nil | ||
} | ||
|
||
// promHTTPLogAdapter adapts a slog.Logger into a promhttp.Logger. | ||
type promHTTPLogAdapter struct { | ||
ctx context.Context | ||
*slog.Logger | ||
} | ||
|
||
// Println implements the promhttp.Logger interface. | ||
func (l promHTTPLogAdapter) Println(v ...interface{}) { | ||
//nolint:sloglint // msg cannot be constant | ||
l.ErrorContext(l.ctx, fmt.Sprint(v...)) | ||
} | ||
|
||
// initMetricsService starts the metrics service currently serving metrics for | ||
// prometheus consumption | ||
func (process *TeleportProcess) initMetricsService() error { | ||
mux := http.NewServeMux() | ||
mux.Handle("/metrics", promhttp.Handler()) | ||
|
||
// We gather metrics both from the in-process registry (preferred metrics registration method) | ||
// and the global registry (used by some Teleport services and many dependencies). | ||
gatherers := prometheus.Gatherers{ | ||
process.metricsRegistry, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If conflicting metrics are registered I assume they'll be dropped, but unaffected metrics will keep working. Do you know if that's correct? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Currently, registration conflicts in the global registry can cause:
Adding a local registry will not change the failure modes in case of conflict in the same registry. However, we are adding a new failure mode: metrics conflicting between the local and global registry. In this case, As we start using the local registry more, we might create such hard to detect conflicts. The situation is not strictly worse than today (we already have some racy metric registration with silent failure going on 😬). To ensure no conflict happen we can prefix new metrics by wrapping the registry when passing it to the service. I think we would benefit from metrics guideline, setting the teleport component in the metric subsystem would reduce the probability of conflict. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks, very informative. |
||
prometheus.DefaultGatherer, | ||
} | ||
|
||
metricsHandler := promhttp.InstrumentMetricHandler( | ||
process.metricsRegistry, promhttp.HandlerFor(gatherers, promhttp.HandlerOpts{ | ||
// Errors can happen if metrics are registered with identical names in both the local and the global registry. | ||
// In this case, we log the error but continue collecting metrics. The first collected metric will win | ||
// (the one from the local metrics registry takes precedence). | ||
// As we move more things to the local registry, especially in other tools like tbot, we will have less | ||
// conflicts in tests. | ||
Comment on lines
+3448
to
+3449
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it advantageous to move all of our current global metrics to the local registry? If so what kind of migration strategy should we have to eliminate global metrics? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think so because we will:
I've not thought this yet, but by supporting both we can take our time with the transition. I'd like to get a few new metrics using the local registry before chosing a recommended pattern. Once we know how we want metrics to be declared and collected, we can write a short metrics RFD and start passing the local registry to the different services. Migrating might not be trivial because we are heavily relying on package-scoped metrics and global registries. We will need to:
I think tbot is a very good starting point because of its limited scope, the conflicts caused by embeddedtbot, and the conflicts it causes in integration tests. |
||
ErrorHandling: promhttp.ContinueOnError, | ||
ErrorLog: promHTTPLogAdapter{ | ||
ctx: process.ExitContext(), | ||
Logger: process.logger.With(teleport.ComponentKey, teleport.ComponentMetrics), | ||
}, | ||
}), | ||
) | ||
|
||
mux.Handle("/metrics", metricsHandler) | ||
|
||
logger := process.logger.With(teleport.ComponentKey, teleport.Component(teleport.ComponentMetrics, process.id)) | ||
|
||
|
Uh oh!
There was an error while loading. Please reload this page.