Remove status metrics for deleted resources #10597
Conversation
/kick-ci
Visit the preview URL for this PR (updated for commit 9e9877a): https://gloo-edge--pr10597-rolds-resource-metri-eq3r3xf3.web.app (expires Thu, 20 Feb 2025 21:37:11 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 77c2b86e287749579b7ff9cadb81e099042ef677
/kick |
I don't quite understand the rationale for clearing all statuses rather than doing something a bit more targeted, and I'm concerned it could cause some flicker in metrics.
If an extra example can be added to prove otherwise, then I'm on board.
Co-authored-by: Nathan Fudenberg <[email protected]>
The APIs provided by the metrics packages just don't provide the level of access needed to filter/remove specific label combinations from the metrics/views. I wish it weren't that way. The same issue exists in the package we would like to move to, but it won't require a sleep, which reduces the data race we are both concerned about.
…lo-io/gloo into rolds/resource_metrics_deletion
docs/content/reference/values.txt
Outdated
@@ -515,6 +515,7 @@
|gloo.headerSecretRefNsMatchesUs|bool||Set to true to require that secrets sent in headers via headerSecretRefs come from the same namespace as the destination upstream. Default: false|
|gloo.podDisruptionBudget.minAvailable|string||Corresponds directly with the _minAvailable_ field in the [PodDisruptionBudgetSpec](https://kubernetes.io/docs/reference/kubernetes-api/policy-resources/pod-disruption-budget-v1/#PodDisruptionBudgetSpec). This value is mutually exclusive with _maxUnavailable_.|
|gloo.podDisruptionBudget.maxUnavailable|string||Corresponds directly with the _maxUnavailable_ field in the [PodDisruptionBudgetSpec](https://kubernetes.io/docs/reference/kubernetes-api/policy-resources/pod-disruption-budget-v1/#PodDisruptionBudgetSpec). This value is mutually exclusive with _minAvailable_.|
|gloo.clearStatusMetrics|bool||Set to true to clear status metrics for deleted resources. Default is false.|
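For reference, opting in via a Helm values file might look like the following sketch. This assumes the key is `gloo.clearStatusMetrics` as the row above documents it (elsewhere in the thread it appears as `gloo.deployment.clearStatusMetrics`, so check the final merged chart):

```yaml
# values.yaml fragment (sketch): enable clearing of status metrics
# for deleted resources. Defaults to false (opt-in).
gloo:
  clearStatusMetrics: true
```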
nit: I think we also want to call out the danger here, just like the changelog does.
wdyt?
Will do.
Done.
settings:
  kubeResourceOverride:
    spec:
      observabilityOptions:
        configStatusMetricLabels:
          Gateway.v1.gateway.solo.io:
            labelToPath:
              name: '{.metadata.name}'
              namespace: '{.metadata.namespace}'
          Upstream.v1.gloo.solo.io:
            labelToPath:
              name: '{.metadata.name}'
              namespace: '{.metadata.namespace}'
          VirtualService.v1.gateway.solo.io:
            labelToPath:
              name: '{.metadata.name}'
              namespace: '{.metadata.namespace}'
Why is this added?
The tests are run twice, once with the clearing and once without. This change ensures the non-clearing case is writing resource status metrics (it doesn't by default). tl;dr: the tests need it.
	"go.opencensus.io/tag"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/util/jsonpath"
)

const (
	ClearStatusMetricsEnvVar = "GLOO_CLEAR_STATUS_METRICS"
Do we want this to be prefixed with GLOO_ given upcoming naming changes, or do we think that's in fact a boon?
I'm currently leaning towards this being the right naming, but figured I might as well persist the thought here so others can chime in.
I feel it can go in like this, and when the time comes we can rename all env vars prefixed with GLOO_ to a different one (by then we should have consensus on a prefix).
Is there an alternative? It looks like the convention is to prefix env vars with GLOO_. I would assume the naming change would result in all the env vars needing their prefix changed.
s.Assertions.NoError(err, "can delete nginx server upstream and VS")

// Vary based on whether .Values.gloo.deployment.clearStatusMetrics is set to true
if s.TestInstallation.Metadata.ValuesManifestFile == e2e.ManifestPath("clear-status-metrics.yaml") {
Could you point out where this is run with different manifest files?
Done.
On my list of things to do is adjusting the values validation done in the testutils to return the merged Helm values loaded into the structs found in ./install/helm/gloo/generate/values.go. That would allow conditions, like the one above, to vary on the specific field we are driving Helm with.
I'm not comfortable with checking the manifest file, since the OSS tests run in enterprise, which could lead to incorrect results. Would it be better to check the env var of the deployment? cc @nfuden
How about I make a PR with the addition to the testutils that I mentioned? After it's merged, I'll update this PR to use it.
The end result will be the test checking the field at gloo/install/helm/gloo/generate/values.go, line 316 in b0ce83b:
ClearStatusMetrics *bool `json:"clearStatusMetrics,omitempty" desc:"Set to true to clear status metrics for deleted resources. This may cause metric scraping to infrequently not see status metrics. Default is false."`
Per our huddle, I've updated the test to check the Gloo deployment for the env var and use that to determine which assertions to make.
I had to make one more change to address a flaky test: because the "view" is shared, it's hard to know whether the metrics should exist at the start of the test.
Description
The status metrics for resources are left around after the resource is deleted. If the resource was invalid, the metric would continue to report invalid, causing false alarms that required restarting the controller to clear.
Sleeping for 1 second was required to avoid a race condition that resulted in the new metrics getting caught up in the clearing/unregistering. I don't like adding the sleep, but the only other option is upgrading the metrics package we are using to a newer, supported one (which doesn't have the race condition; that package uses a mutex).
When we upgrade to the new package, we can remove this. The E2E test should have long-term value. To address concerns about the clearing causing the metrics to flicker, it's been made opt-in.
Checklist: