Remove status metrics for deleted resources #10597

Merged
merged 31 commits into main from rolds/resource_metrics_deletion on Feb 13, 2025

Conversation

@ryanrolds ryanrolds commented Feb 4, 2025

Description

The status metrics for resources are left around after a resource is deleted. If the resource was invalid, the metric would continue to report it as invalid, causing false alarms that required restarting the controller to clear.

Sleeping for 1 second was required to avoid a race condition that resulted in the new metrics getting caught up in the clearing/unregistering. I don't like adding the sleep, but the only other option is upgrading the metrics package we are using to a newer, supported one (which doesn't have the race condition; the other package uses a mutex).

When we upgrade to the new package we can remove the sleep. The E2E test should have long-term value. To address concerns about the clearing causing the metrics to flicker, it's been made opt-in.
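A minimal sketch of the approach described above, assuming the status metrics are recorded through go.opencensus.io views (the function name and wiring here are illustrative, not the exact code in this PR):

package metrics

import (
	"time"

	"go.opencensus.io/stats/view"
)

// clearStatusViews (illustrative name) unregisters and re-registers the given
// views. Unregistering drops every accumulated row, including label
// combinations that belong to resources that no longer exist.
func clearStatusViews(views ...*view.View) error {
	view.Unregister(views...)

	// Work around the race described above: metrics recorded immediately
	// after Unregister can get caught up in the asynchronous clearing, so
	// wait before registering the views again.
	time.Sleep(1 * time.Second)

	return view.Register(views...)
}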

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

@solo-changelog-bot

Issues linked to changelog:
kgateway-dev#6938

@ryanrolds ryanrolds changed the title Clear resource metrics to remove metrics for deleted resources Remove status metrics for deleted resources Feb 5, 2025
@ryanrolds ryanrolds marked this pull request as ready for review February 5, 2025 17:22
@ryanrolds
Author

/kick-ci

@ryanrolds
Author

/kick-ci


github-actions bot commented Feb 5, 2025

Visit the preview URL for this PR (updated for commit 9e9877a):

https://gloo-edge--pr10597-rolds-resource-metri-eq3r3xf3.web.app

(expires Thu, 20 Feb 2025 21:37:11 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 77c2b86e287749579b7ff9cadb81e099042ef677

@ryanrolds
Author

/kick-ci

@ryanrolds
Author

/kick

Collaborator

@nfuden nfuden left a comment

I don't quite understand the rationale for clearing all statuses rather than doing something a bit more targeted, and I'm concerned it could cause some flicker in metrics.

If an extra example can be added to prove otherwise, then I am on board.

@ryanrolds
Author

> I don't quite understand the rationale for clearing all statuses rather than doing something a bit more targeted, and I'm concerned it could cause some flicker in metrics.

The APIs provided by the metrics packages just don't provide the level of access needed to filter/remove specific label combinations from the metrics/views. I wish it wasn't that way. The same issue exists in the package we would like to move to, but it won't require a sleep, reducing the data race we are both concerned about.
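To illustrate the limitation, the closest primitives go.opencensus.io/stats/view exposes operate on whole views; the view name below is illustrative:

// There is no per-row removal in go.opencensus.io/stats/view, so stale label
// combinations can only be dropped by unregistering the whole view and then
// registering it again.
if v := view.Find("validation_state"); v != nil { // illustrative view name
	view.Unregister(v)
	_ = view.Register(v)
}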

@ryanrolds
Author

/kick-ci

@ryanrolds ryanrolds requested a review from a team as a code owner February 13, 2025 00:16
@ryanrolds ryanrolds requested a review from nfuden February 13, 2025 00:18
@@ -515,6 +515,7 @@
|gloo.headerSecretRefNsMatchesUs|bool||Set to true to require that secrets sent in headers via headerSecretRefs come from the same namespace as the destination upstream. Default: false|
|gloo.podDisruptionBudget.minAvailable|string||Corresponds directly with the _minAvailable_ field in the [PodDisruptionBudgetSpec](https://kubernetes.io/docs/reference/kubernetes-api/policy-resources/pod-disruption-budget-v1/#PodDisruptionBudgetSpec). This value is mutually exclusive with _maxUnavailable_.|
|gloo.podDisruptionBudget.maxUnavailable|string||Corresponds directly with the _maxUnavailable_ field in the [PodDisruptionBudgetSpec](https://kubernetes.io/docs/reference/kubernetes-api/policy-resources/pod-disruption-budget-v1/#PodDisruptionBudgetSpec). This value is mutually exclusive with _minAvailable_.|
|gloo.clearStatusMetrics|bool||Set to true to clear status metrics for deleted resources. Default is false.|
Collaborator

nit: I think we want to also call out the danger in this too just like the changelog does.
wdyt?

Author

Will do.

Author

Done.
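For reference, opting in through a user-supplied Helm values file would look roughly like the snippet below, assuming the gloo.clearStatusMetrics path from the docs row above:

gloo:
  clearStatusMetrics: true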

Comment on lines +6 to +22
settings:
  kubeResourceOverride:
    spec:
      observabilityOptions:
        configStatusMetricLabels:
          Gateway.v1.gateway.solo.io:
            labelToPath:
              name: '{.metadata.name}'
              namespace: '{.metadata.namespace}'
          Upstream.v1.gloo.solo.io:
            labelToPath:
              name: '{.metadata.name}'
              namespace: '{.metadata.namespace}'
          VirtualService.v1.gateway.solo.io:
            labelToPath:
              name: '{.metadata.name}'
              namespace: '{.metadata.namespace}'
Collaborator

Why is this added?

Author

The tests are run twice, once with the clearing and once without. This change ensures that the non-clearing case is writing resource status metrics (it doesn't by default). tl;dr: the tests need it.

"go.opencensus.io/tag"
"k8s.io/apimachinery/pkg/runtime/schema"
"k8s.io/client-go/util/jsonpath"
)

const (
ClearStatusMetricsEnvVar = "GLOO_CLEAR_STATUS_METRICS"
Collaborator

Do we want this to be prefixed with GLOO given upcoming naming changes, or do we think that's in fact a boon?

I'm currently leaning towards this being the right naming, but figured I might as well persist the thought here so others can chime in.


I feel it can go in like this, and when the time comes we can rename all env vars prefixed by GLOO_ to a different one (by then we should have consensus on a prefix).

Author

Is there an alternative? It looks like the convention is to prefix env vars with GLOO_. I would assume the naming change would result in all the env vars needing their prefix changed.
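For illustration, a minimal sketch of how the env var might be consumed on the controller side; the constant comes from the snippet above, while the helper name and parsing are assumptions:

import (
	"os"
	"strconv"
)

// clearStatusMetricsEnabled (illustrative name) reports whether the opt-in
// env var set by the Helm chart is truthy ("true", "1", etc.).
func clearStatusMetricsEnabled() bool {
	enabled, err := strconv.ParseBool(os.Getenv(ClearStatusMetricsEnvVar))
	return err == nil && enabled
}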

"go.opencensus.io/tag"
"k8s.io/apimachinery/pkg/runtime/schema"
"k8s.io/client-go/util/jsonpath"
)

const (
ClearStatusMetricsEnvVar = "GLOO_CLEAR_STATUS_METRICS"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it can go in like this and when the time comes we can rename all env vars prefixed by GLOO_ to a different one (by then we should have consensus on a prefix)

s.Assertions.NoError(err, "can delete nginx server upstream and VS")

// Vary based on if .Values.gloo.deployment.clearStatusMetrics is set to true
if s.TestInstallation.Metadata.ValuesManifestFile == e2e.ManifestPath("clear-status-metrics.yaml") {


Could you point out where this is run with different manifest files?

Author

Done.

On my list of things to do is adjusting the values validation done in the testutils to return the merged Helm values loaded into the structs found in ./install/helm/gloo/generate/values.go. That would allow conditions like the one above to vary on the specific field we are driving Helm with.


I'm not comfortable with checking the manifest file, since the OSS tests run in enterprise, which could lead to incorrect results. Would it be better to check the env var of the deployment? cc @nfuden

Author

How about I make a PR with the addition to the testutils that I mentioned? After it's merged, I'll update this PR to use it.

Author

@ryanrolds ryanrolds Feb 13, 2025

The end result will be the test checking the field at

ClearStatusMetrics *bool `json:"clearStatusMetrics,omitempty" desc:"Set to true to clear status metrics for deleted resources. This may cause metric scraping to infrequently not see status metrics. Default is false."`

Author

Per our huddle, I've updated the test to check the Gloo deployment for the env var and use that to determine which assertions to make.
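Roughly, that deployment check could look like the client-go sketch below; the deployment name "gloo", the exact value comparison, and the helper shape are assumptions rather than the test framework's actual helper:

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// hasClearStatusMetricsEnv (illustrative name) reports whether any container
// in the named deployment sets GLOO_CLEAR_STATUS_METRICS to "true".
func hasClearStatusMetricsEnv(ctx context.Context, client kubernetes.Interface, namespace, name string) (bool, error) {
	dep, err := client.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, c := range dep.Spec.Template.Spec.Containers {
		for _, env := range c.Env {
			if env.Name == "GLOO_CLEAR_STATUS_METRICS" && env.Value == "true" {
				return true, nil
			}
		}
	}
	return false, nil
}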

@ryanrolds
Author

I had to make one more change to address a flaky test. Because the "view" is shared, it's hard to know whether the metrics should exist or not at the start of the test.

@soloio-bulldozer soloio-bulldozer bot merged commit 0719073 into main Feb 13, 2025
20 checks passed
@soloio-bulldozer soloio-bulldozer bot deleted the rolds/resource_metrics_deletion branch February 13, 2025 23:23