Skip to content

Fix Route Registry Metrics to Report Correct Result #468

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
hoffmaen opened this issue Apr 28, 2025 · 6 comments · May be fixed by #478
Open

Fix Route Registry Metrics to Report Correct Result #468

hoffmaen opened this issue Apr 28, 2025 · 6 comments · May be fixed by #478
Assignees

Comments

@hoffmaen
Copy link
Contributor

hoffmaen commented Apr 28, 2025

Proposed Change

We see an issue with the Route Registry code.

Register

Currently, Register() calls register(), which calls Pool.Put()

Put() updates the pool if required, and returns the PoolPutResult to register():

  • UPDATED, whenever the endpoint has previously been registered in the pool (even if unmodified)
    • Exception: UNMODIFIED, if the message is a duplicate (based on ModificationTag)
  • ADDED, if the endpoint doesn't exist in the pool yet

register() creates the pool if required, sets the time of update and forwards the PoolPutResult to Register(). Register() receives and logs the PoolPutResult, and ingests a generic route registry metric.

Unregister

On the other hand, we have a similar functionality with Unregister(), which calls unregister(), which calls pool.Remove().

However, the structure is quite different here:
Remove() returns a boolean value reflecting whether the route has been removed from the pool, or not. unregister() however does some logging based on the result of Remove(), and removes the pool if required. Unregister() finally unconditionally ingests a generic route unregistry metric.

Issues

There are some problems with the current code:

  • We have a very generic metric for registering and unregistering endpoints - it does not reveal the result of the (un)registration (endpoint added, modified, unchanged, etc). However, tests have shown that the cost for each case differs, and that it is beneficial for operations to have more detailed metrics.
  • pool.Put() always returns status UPDATED when the route exists in the pool. Hence, the metric doesn't reflect whether a route has received an update, or if not (in case its attributes haven't changed).
  • Unregister() always ingests an UnregistryMessage, independently of the result of pool.Remove(). Hence, the metric doesn't reflect whether a route has been removed or not.
  • We should introduce an additional metric that reflects whether a whole route has been created or removed upon Register() / Unregister().
  • The structure of the Register() and Unregister() functions and its related private functions should be refactored, e.g.
    • Code / function structure should be the same
    • Logging should be done in the public function functionality
    • Improve code structure in general, make the long if-clauses in registry.go less complicated

Acceptance criteria

  • Have better metrics that reflect the actual result of registering and unregistering an endpoint, based on the PoolPutResult and a PoolRemoveResult.
    • Avoid breaking changes by using tags for the actual result in CaptureRegistryMessage and CaptureUnregistryMessage.
  • Fix the wrong UPDATED result in case of no-op registry update.
  • Refactor the code to mitigate the issues described above.

Related links

  • Draft Pull-Request Improve gorouter Route Registry Message Metrics gorouter#456: This was created before being able to use metric tags (introduced with support for Prometheus based metrics). This Pull-Request already addresses the issues described here, but needs a final refactoring to have no breaking change due to metrics renaming.
@ameowlia
Copy link
Member

In general I approve making these changes, with a few additions...

  • The new metrics, should be behind a config flag and should be default off for backwards compatibility. I suggest some sort of config that is similar to the way we did extra access log values.
  • The old metrics, should be put behind a config flag that is default on for backwards compatibility. This could be included in the same config you make for the new metrics (if we plan on never deprecating and removing these metrics), or a different one (if we plan on deprecating and removing them eventually, and we should write that in the description).

❓ Do you think that the old metrics should be deprecated and removed eventually?

@maxmoehl
Copy link
Member

maxmoehl commented Apr 29, 2025

The value is reported as a tag on the metric and only exists when using the prometheus metrics we only recently merged (it's ignored in the envelope logic). The part where a long-standing value would change is the log line:

switch endpointAdded {
case route.ADDED:
if r.logger.Enabled(context.Background(), slog.LevelInfo) {
r.logger.Info("endpoint-registered", buildSlogAttrs(uri, endpoint)...)
}
case route.UPDATED:
if r.logger.Enabled(context.Background(), slog.LevelDebug) {
r.logger.Debug("endpoint-registered", buildSlogAttrs(uri, endpoint)...)
}
default:
if r.logger.Enabled(context.Background(), slog.LevelDebug) {
r.logger.Debug("endpoint-not-registered", buildSlogAttrs(uri, endpoint)...)
}
}

So new / old metrics where we could have a feature flag are not a thing. And if we want a compatibility flag for the log statements we would also need to retain the old convoluted if-statements and couldn't clean that up (or at least not as much).

@ameowlia ameowlia transferred this issue from cloudfoundry/gorouter Apr 30, 2025
@ameowlia
Copy link
Member

ameowlia commented May 5, 2025

@maxmoehl ❓ Can you explain more about what would change with the log lines? Is it when it logs? Or the main message that it logs? or the data included in the log message?

@maxmoehl
Copy link
Member

maxmoehl commented May 5, 2025

I've taken a closer look at the code and a possible change would be this commit.

I now better understand the intention of the old logic, it is something like this:

  1. If an endpoint is equal to the one gorouter already knows, the updated field of the endpoint is bumped to ensure it isn't pruned if it's a non-TLS route. This is what's meant by UPDATED.
  2. If the message we receive has been superseded by a previous message because they were received out of order, UNMODIFIED indicates that nothing changed.
  3. For an endpoint we already have but where something changed, we post UPDATED.
  4. For new endpoints we record ADDED

My main issue is with UPDATED, it's not really updated, rather refreshed. I would like to distinguish between "we just refreshed the timer" and "something actually changed". So the proposal in my commit would be (not yet fully implemented):

  1. Tag: UDPATED -> REFRESHED, Log: endpoint-registered -> endpoint-refreshed
  2. Tag: UNMODIFIED, Log: endpoint-not-registered
  3. Tag: UPDATED, Log: endpoint-registered
  4. Tag: ADDED, Log: endpoint-registered

I would propose to make UPDATED and ADDED info-level logs and REFRESHED a debug log. That reduces the noise by the constant route updates a lot but keeps the relevant changes. We could also align the logs a bit closer to the tags but I know that this really is breaking (whereas the tags are probably not yet used anyway?).

The other log fields should not be affected by this.

@ameowlia
Copy link
Member

ameowlia commented May 5, 2025

Gotcha. Thanks for that explanation @maxmoehl. I think that is a good idea, I won't consider this a breaking change. No need to add branching logic to do the old logs.

@maxmoehl maxmoehl self-assigned this May 5, 2025
@maxmoehl maxmoehl moved this from Inbox to Waiting for Changes | Open for Contribution in Application Runtime Platform Working Group May 5, 2025
maxmoehl added a commit to sap-contributions/routing-release that referenced this issue May 6, 2025
maxmoehl added a commit to sap-contributions/routing-release that referenced this issue May 6, 2025
maxmoehl added a commit to sap-contributions/routing-release that referenced this issue May 7, 2025
maxmoehl added a commit to sap-contributions/routing-release that referenced this issue May 7, 2025
@maxmoehl
Copy link
Member

maxmoehl commented May 7, 2025

PR is ready: #478

@maxmoehl maxmoehl linked a pull request May 7, 2025 that will close this issue
1 task
maxmoehl added a commit to sap-contributions/routing-release that referenced this issue May 12, 2025
maxmoehl added a commit to sap-contributions/routing-release that referenced this issue May 12, 2025
maxmoehl added a commit to sap-contributions/routing-release that referenced this issue May 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Waiting for Changes | Open for Contribution
Development

Successfully merging a pull request may close this issue.

3 participants