PP <-> Bifrost lag observability #2755

Open · wants to merge 2 commits into base: main

Conversation

@muhamadazmy (Contributor) commented Feb 19, 2025

PP <-> Bifrost lag observability

This PR improves only the PP <-> Bifrost observability.
We already collect metrics for both the PP and Bifrost, but I think
these extra metrics can be useful (a minimal recording sketch for the
first two is included at the end of this description):

  • Utilization of the PP request queue. Once it is full,
    the PP starts dropping requests as 'Busy', so it's
    important to measure how utilized this queue is.

  • Commit-to-read latency of records. Measures the
    latency between the moment a record is committed and
    the moment it is read by the PP.

  • LSN lag (applied LSN to the log tail). This is also shown in `restatectl partition list`.

Fixes #2756


Stack created with Sapling. Best reviewed with ReviewStack.
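
For illustration, here is a minimal sketch of how the first two metrics might be recorded with the `metrics` crate. The constant names and string values match this PR's diff; the hook functions, their parameters, and how the commit timestamp reaches the reader are assumptions, not the PR's actual code:

```rust
use metrics::{gauge, histogram};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

pub const PARTITION_RPC_QUEUE_UTILIZATION_PERCENT: &str =
    "restate.partition.rpc_queue.utilization.percent";
pub const PARTITION_RECORD_WRITE_TO_READ_LATENCY_SECONDS: &str =
    "restate.partition.record_write_to_read_latency.seconds";

/// Hypothetical hook: report how full the PP request queue currently is (0.0..=100.0).
pub fn on_rpc_queue_len_changed(len: usize, capacity: usize) {
    let utilization = len as f64 / capacity as f64 * 100.0;
    gauge!(PARTITION_RPC_QUEUE_UTILIZATION_PERCENT).set(utilization);
}

/// Hypothetical hook: called when the PP reads a record. `committed_at_millis` is
/// assumed to be a wall-clock timestamp carried on the record when it was committed.
pub fn on_record_read(committed_at_millis: u64) {
    let committed_at = UNIX_EPOCH + Duration::from_millis(committed_at_millis);
    if let Ok(latency) = SystemTime::now().duration_since(committed_at) {
        // With metrics >= 0.22 the macro returns a handle that we record into.
        histogram!(PARTITION_RECORD_WRITE_TO_READ_LATENCY_SECONDS)
            .record(latency.as_secs_f64());
    }
}
```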

@muhamadazmy muhamadazmy changed the title from "WIP: Improve PP <-> Bifrost observability" to "PP <-> Bifrost lag observability" on Feb 19, 2025
@muhamadazmy muhamadazmy force-pushed the pr2755 branch 2 times, most recently from 66a5519 to 2b51579 on February 19, 2025 16:36
@muhamadazmy (Contributor, Author) commented Feb 19, 2025

@pcholakov As advised, this only tackles a tiny slice of the observability story card. We already collect a lot of metrics for both the PP and Bifrost; this only adds two metrics, as described in the PR.

This is still under development, so there is a chance I will add more metrics regarding the lag, for example how far behind the PP is in LSN terms. Still looking into how to figure this out.

EDIT: I think it's not possible to compute the lag in LSN terms without doing a call to find_tail, which is an intensive operation that I will definitely avoid.

@pcholakov (Contributor) left a comment

Hey, thanks so much for adding these, @muhamadazmy! The two dimensions you've picked are super valuable and would be a great addition to our observability.

I'm sorry to be a pedant about naming and definitions but I think this is one area where we need to be ultra crisp. The folks working with these in the future won't have the benefit of our familiarity with the internals, so it's important we convey as much meaning as clearly as possible.

I'm not set on any of my suggestions, and ofc happy to chat sync if it would speed up your process 🙏

@@ -9,6 +9,7 @@
// by the Apache License, Version 2.0.

use assert2::let_assert;
use restate_types::time::MillisSinceEpoch;
Contributor:

Nit: bump to the next group of imports

// In this case, someone already proposed this command,
// let's just replace the reciprocal and fail the old one to avoid keeping it dangling
let old_reciprocal = o.remove();
let old_reciprocal = o.insert(RpcReciprocal::new(request_created_at, reciprocal));
Contributor:

Nice!

@@ -367,6 +368,18 @@ where
);
}

// Telemetry setup
let partition_id_str: &'static str = Box::leak(Box::new(self.partition_id.to_string()));
Contributor:

Given that this task may be restarted multiple times over the lifetime of a Restate server, how is this not a real memory leak? I have seen this in a few other places where a 'static label is required and, IIRC, Valgrind flags them as leaks. Granted, it's a small absolute amount, but it may build up if a server housing many partitions encounters many trim gaps over the process lifetime. Am I missing something? If this is a real problem, could you please open an issue to track it?

Contributor (Author):

Disclaimer: this is only moved from below :) But I was thinking exactly the same when I first saw it (a while back).

When I investigated back then, it seemed there was no way to stop a PP (you could only move it from leader to follower), so it wasn't an issue at the time. But I don't think that's the case anymore, hence this is definitely a bug.

Contributor:

This was written before we allowed PPs to be restarted; the right fix for this is to allow TC to accept `impl Into<Cow<'static, str>>` instead.
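
A minimal sketch of the suggested shape, using a hypothetical spawn helper rather than the real TaskCenter API:

```rust
use std::borrow::Cow;

// Hypothetical helper illustrating the suggestion; not the actual TaskCenter signature.
fn spawn_named(name: impl Into<Cow<'static, str>>) {
    let name: Cow<'static, str> = name.into();
    // The Cow either borrows a &'static str (no allocation) or owns a String that is
    // freed when dropped, so no Box::leak is needed for per-partition task names.
    println!("spawning task: {name}");
}

fn main() {
    spawn_named("periodic-tail-checker"); // borrowed &'static str
    spawn_named(format!("pp-{}", 7));     // owned String, no leak
}
```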

Contributor:

For metrics, I believe we can use an Arc so that we don't allocate the string every time; the Arc will need to be created outside the loop, though.

describe_histogram!(
PARTITION_RECORD_WRITE_TO_READ_LATENCY_SECONDS,
Unit::Seconds,
"Duration between the time a record was first created until it was read for processing by the partition processor"
Contributor:

"Created" is ambiguous here for someone just looking at the description in a metrics system. Also, the constant name is potentially out of sync - WRITE_TO_READ implies committed-to-picked-up to me?

describe_histogram!(
PARTITION_REQUEST_TO_SUBMITTED_DURATION,
Unit::Seconds,
"Duration between the time an ingress request was first created until it was submitted"
Contributor:

First created, or first received by the ingress server?

);

describe_histogram!(
PARTITION_REQUEST_TO_OUTPUT_DURATION,
Contributor:

Request to output is ambiguous here; could be misinterpreted to mean "request by the external client". Is it actually measured since received by ingress? Similarly, what is "produced an output", exactly? Output was committed to the log?

Contributor (Author):

This took me a while to wrap my head around, but this is how it works.
An invocation can be created in one of two modes:

  • Submitted: ingress only waits until the request is "submitted", which seems to mean that the invocation was "read" by the PP.

  • Output: ingress blocks until the invocation is actually executed and the user handler has returned an output; the output is then sent back to ingress (to be forwarded to the caller).

Because these two modes take two different paths, I had to measure the delay for each mode separately: request_to_output is the delay of an invocation from the moment ingress created the request until the output was produced, while request_to_submitted is the delay from the moment ingress created the request until it was "submitted". A sketch of this split follows below.
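
For illustration, a sketch of measuring the two paths separately. Only the constant identifiers come from this PR's diff; the mode enum, the hook, and the metric string values are assumptions, and these two metrics were ultimately dropped later in this review:

```rust
use metrics::histogram;
use std::time::Instant;

// Constant identifiers follow this PR's diff; the string values here are placeholders.
pub const PARTITION_REQUEST_TO_SUBMITTED_DURATION: &str =
    "restate.partition.request_to_submitted.duration.seconds";
pub const PARTITION_REQUEST_TO_OUTPUT_DURATION: &str =
    "restate.partition.request_to_output.duration.seconds";

/// Hypothetical stand-in for the two invocation modes described above.
pub enum InvocationMode {
    Submitted,
    Output,
}

/// Hypothetical hook: called once the response for the invocation is handed back to ingress.
pub fn record_invocation_latency(mode: InvocationMode, request_created_at: Instant) {
    let metric = match mode {
        // The request only had to be "read" (submitted) by the partition processor.
        InvocationMode::Submitted => PARTITION_REQUEST_TO_SUBMITTED_DURATION,
        // The user handler ran to completion and produced an output.
        InvocationMode::Output => PARTITION_REQUEST_TO_OUTPUT_DURATION,
    };
    histogram!(metric).record(request_created_at.elapsed().as_secs_f64());
}
```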

pub const PARTITION_RPC_QUEUE_UTILIZATION_PERCENT: &str =
"restate.partition.rpc_queue.utilization.percent";
pub const PARTITION_RECORD_WRITE_TO_READ_LATENCY_SECONDS: &str =
"restate.partition.record_write_to_read_latency.seconds";
Contributor:

For some reason, write-to-read doesn't sit quite right - maybe committed or accepted? But what would an operator with low context on the internals of Restate think of this name if they saw it in the metrics list at 3am?

Suggested change
"restate.partition.record_write_to_read_latency.seconds";
"restate.partition.record_committed_to_read_latency.seconds";

@muhamadazmy muhamadazmy force-pushed the pr2755 branch 2 times, most recently from 8854a68 to ce37f39 on February 21, 2025 16:40
@muhamadazmy muhamadazmy marked this pull request as ready for review February 21, 2025 16:44
@muhamadazmy (Contributor, Author) commented:

Thank you @pcholakov for your helpful input, as usual. I really think the two metrics that measure delays from an ingress invocation until submission or return are not very helpful, especially since ingress also measures the latency of a request, so I decided to drop them. Instead we have the committed-to-read latency for each record, plus the LSN lag (applied LSN to tail), which both give good insight into the health of the PP.

@pcholakov (Contributor) left a comment

Hey @muhamadazmy thanks for considering my suggestions :-)

I still have some reservations about the lsn-lag metric as implemented; I think it should come with some major disclaimers that make it clear to operators that it can vary wildly depending on how (in)frequently each worker is able to refresh the target tail for the logs it follows - maybe explicitly recommend the record_committed_to_read_latency metric for monitoring purposes.

Nothing blocking but I feel a bit uneasy about the LSN lag metric, at least as it is currently implemented.

@@ -37,7 +37,7 @@ impl PeriodicTailChecker {
"Started a background periodic tail checker for this loglet",
);
// Optimization. Don't run the check if the tail/seal has been updated recently.
// Unfortunately this requires a litte bit more setup in the TailOffsetWatch so we don't do
// Unfortunately this requires a little bit more setup in the TailOffsetWatch so we don't do
Contributor:

Every little bit helps!

@@ -151,6 +151,7 @@ pub async fn list_partitions(
"PERSISTED-LSN",
"SKIPPED-RECORDS",
"ARCHIVED-LSN",
"LSN-LAG",
Contributor:

It somewhat bothers me using "lag" for something that isn't time but I don't have a much better suggestion. Lag is also conveniently short!! 😆

Comment on lines +526 to +528
let Some(tail_lsn) = tail else {
return;
};
Contributor:

Do we need to handle the case when we might receive a None update? When might that happen - and won't we be posting lies if we're doing a saturating sub from an out-of-date target tail? I'd almost rather reset the tail to none, and post +INF or some other sentinel value to the gauge to represent "unknown".

(The scenario I'm trying to avoid is this: as an operator, I set up an alarm that flags whenever the delta goes above a certain threshold. Some partition processor is isolated and doesn't get target tail updates any more, but continues posting a 0 delta metric, which looks healthy to the monitoring system.)
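
A minimal sketch of the sentinel idea, assuming a hypothetical reporter function and a placeholder metric string (only the constant identifier comes from this PR):

```rust
use metrics::gauge;

// Constant identifier from this PR; the string value here is a placeholder.
pub const PARTITION_LAST_APPLIED_LSN_LAG: &str = "restate.partition.last_applied_lsn.lag";

/// Hypothetical reporter: `target_tail` is None when no fresh tail update is known.
/// Assumes the tail LSN points at the next record to be written.
pub fn report_lsn_lag(last_applied_lsn: u64, target_tail: Option<u64>) {
    let lag = match target_tail {
        Some(tail) => tail.saturating_sub(last_applied_lsn + 1) as f64,
        // Unknown tail: post a sentinel instead of a stale 0 that would look healthy.
        None => f64::INFINITY,
    };
    gauge!(PARTITION_LAST_APPLIED_LSN_LAG).set(lag);
}
```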

@@ -241,6 +244,9 @@ impl PartitionProcessorManager {
let mut latest_snapshot_check_interval = tokio::time::interval(Duration::from_secs(5));
latest_snapshot_check_interval.set_missed_tick_behavior(MissedTickBehavior::Skip);

let mut update_target_tail_lsns = tokio::time::interval(Duration::from_secs(1));
Contributor:

Just curious, I'm assuming this is reasonably cheap to do at the given rate?

I imagine we could improve the metric fluctuations (and avoid the need to poll) in the future with in-band tail information sent along by the log servers with the record stream.

describe_gauge!(
PARTITION_LAST_APPLIED_LSN_LAG,
Unit::Count,
"Number of records between last applied lsn and the log tail"
Contributor:

I feel like this should come with a fairly big disclaimer about the sampling error given the current implementation; not sure if this is the best place to do it.

Summary:

FindTailAttr allows find_tail users to adjust durability and accuracy. Currently, two modes are supported:

- Approximate: a fast check that usually returns immediately with the last known tail.
- Durable (default): ensures a reliable tail find.

Loglet implementations can choose to always run a Durable find_tail.
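
A self-contained sketch of the two modes as described above (illustrative only; the actual Bifrost types and the find_tail signature may differ):

```rust
/// Illustrative stand-in for the attribute described in the summary above; the real
/// type lives in the Restate codebase and may differ.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub enum FindTailAttr {
    /// Fast check that usually returns immediately with the last known tail.
    Approximate,
    /// Ensures a reliable tail find.
    #[default]
    Durable,
}

/// A periodic lag metric can tolerate staleness, so it would opt into the cheap mode,
/// while correctness-critical callers keep the default.
pub fn mode_for_lag_metric() -> FindTailAttr {
    FindTailAttr::Approximate
}
```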
Successfully merging this pull request may close these issues:

  • Observability for PP <-> Bifrost lag (#2756)