PP <-> Bifrost lag observability #2755
base: main
Conversation
Force-pushed from 66a5519 to 2b51579
@pcholakov As advised, this only tackles a tiny slice of the observability story card. We already collect a lot of metrics for both PP and Bifrost; this only adds 2 metrics as described in the PR. This is still under development, so there is a chance I will add more metrics regarding the lag, for example how far behind the PP is in LSN terms. I'm still looking into how to figure this out.
Hey, thanks so much for adding these, @muhamadazmy! The two dimensions you've picked are super valuable and would be a great addition to our observability.
I'm sorry to be a pedant about naming and definitions but I think this is one area where we need to be ultra crisp. The folks working with these in the future won't have the benefit of our familiarity with the internals, so it's important we convey as much meaning as clearly as possible.
I'm not set on any of my suggestions, and ofc happy to chat sync if it would speed up your process 🙏
@@ -9,6 +9,7 @@
// by the Apache License, Version 2.0.

use assert2::let_assert;
use restate_types::time::MillisSinceEpoch;
Nit: bump to the next group of imports
// In this case, someone already proposed this command,
// let's just replace the reciprocal and fail the old one to avoid keeping it dangling
-let old_reciprocal = o.remove();
+let old_reciprocal = o.insert(RpcReciprocal::new(request_created_at, reciprocal));
Nice!
crates/worker/src/partition/mod.rs
Outdated
@@ -367,6 +368,18 @@ where
    );
}

// Telemetry setup
let partition_id_str: &'static str = Box::leak(Box::new(self.partition_id.to_string()));
Given that this task may be restarted multiple times over the lifetime of a Restate server, how is this not a real memory leak? I have seen this in a few other places where a 'static label is required and IIRC, Valgrind flags them as leaks. Granted a small absolute amount, but it may build up if a server housing many partitions encounters many trim gaps over the process lifetime. Am I missing something? If this is a real problem, could you please open an issue to track it?
Disclaimer: this is only moved from below :), but I was thinking exactly the same when I first saw it (a while back).
When I investigated back then, it seemed there was no way you could stop a PP (only move it from leader to follower), so at the time it wasn't an issue. But I don't think that is the case anymore, hence this is definitely a bug.
This was written before we allowed PPs to be restarted; the right fix for this is to allow TC to accept impl Into<Cow<'static, str>> instead.
For metrics, I believe we can use an Arc so that we don't allocate the string every time; the Arc will need to be created outside the loop though.
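A minimal sketch of that idea (assumed names, not the actual PR code): create the label once when the processor starts and let it drop with the processor, instead of Box::leak-ing a fresh &'static str on every restart. This assumes the metrics 0.22+ macro API and that an owned String is accepted as a label value; whether Arc<str> is accepted directly (which would avoid the clone below) is also an assumption.

use std::sync::Arc;

struct PartitionTelemetry {
    // Created once, outside the apply loop; dropped together with the
    // processor, so nothing is leaked across restarts.
    partition_label: Arc<str>,
}

impl PartitionTelemetry {
    fn new(partition_id: u64) -> Self {
        Self {
            partition_label: Arc::from(partition_id.to_string()),
        }
    }

    fn record_lsn_lag(&self, lag: u64) {
        // If the metrics crate accepts Arc<str> label values, the to_string()
        // clone here could be dropped entirely.
        metrics::gauge!(
            "restate.partition.last_applied_lsn_lag", // illustrative metric name
            "partition" => self.partition_label.to_string()
        )
        .set(lag as f64);
    }
}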
describe_histogram!(
    PARTITION_RECORD_WRITE_TO_READ_LATENCY_SECONDS,
    Unit::Seconds,
    "Duration between the time a record was first created until it was read for processing by the partition processor"
"Created" is ambiguous here - for someone just looking at the description in a metric system. Also the constant name is potentially out of sync - WRITE_TO_READ
implies committed to picked up to me?
describe_histogram!(
    PARTITION_REQUEST_TO_SUBMITTED_DURATION,
    Unit::Seconds,
    "Duration between the time an ingress request was first created until it was submitted"
First created, or first received by the ingress server?
);

describe_histogram!(
    PARTITION_REQUEST_TO_OUTPUT_DURATION,
Request to output is ambiguous here; could be misinterpreted to mean "request by the external client". Is it actually measured since received by ingress? Similarly, what is "produced an output", exactly? Output was committed to the log?
This took me a while to wrap my head around, but this is how it works.
When creating an invocation, it can be in one of 2 modes:
- Submitted: ingress only waits until the request is "submitted", which seems to mean that the invocation was "read" by the PP.
- Output: ingress will block until the invocation is actually executed and the user handler has returned an output; the output is then sent back to ingress (to be forwarded to the caller).
Because these 2 modes of invocation take 2 different paths, I had to measure the delay for each mode separately. So request_to_output is the delay of an invocation from the moment ingress created the request until the output was produced, and request_to_submitted is the delay from the moment ingress created the request until it was "submitted".
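A rough illustration of those two paths (all names here are made up, and these two histograms were later dropped from the PR in favor of the per-record commit-to-read latency):

use std::time::SystemTime;

enum CompletionMode {
    // Ingress only waits until the invocation has been "submitted", i.e. read by the PP.
    Submitted,
    // Ingress blocks until the handler returned an output that is sent back to it.
    Output,
}

fn record_ingress_delay(mode: CompletionMode, request_created_at: SystemTime) {
    // Delay since ingress created the request; falls back to zero on clock skew.
    let delay = SystemTime::now()
        .duration_since(request_created_at)
        .unwrap_or_default();

    let histogram_name = match mode {
        CompletionMode::Submitted => "restate.partition.request_to_submitted.seconds",
        CompletionMode::Output => "restate.partition.request_to_output.seconds",
    };
    metrics::histogram!(histogram_name).record(delay.as_secs_f64());
}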
pub const PARTITION_RPC_QUEUE_UTILIZATION_PERCENT: &str =
    "restate.partition.rpc_queue.utilization.percent";
pub const PARTITION_RECORD_WRITE_TO_READ_LATENCY_SECONDS: &str =
    "restate.partition.record_write_to_read_latency.seconds";
For some reason, write-to-read doesn't sit quite right - maybe committed or accepted? But what would an operator with low context on the internals of Restate think of this name if they saw it in the metrics list at 3am?
"restate.partition.record_write_to_read_latency.seconds"; | |
"restate.partition.record_committed_to_read_latency.seconds"; |
Force-pushed from 8854a68 to ce37f39
Thank you @pcholakov for your helpful input as usual. I really think the 2 metrics that measure the delay of an ingress invocation until submission or return are not very helpful, especially since ingress also measures the latency of a request, so I decided to drop them. Instead we have the commit-to-read latency for each record.
Hey @muhamadazmy thanks for considering my suggestions :-)
I still have some reservations about the lsn-lag metric as implemented; I think it should come with some major disclaimers that make it clear to operators that it can vary wildly depending on how (in)frequently each worker is able to refresh the target tail for the logs it follows - maybe explicitly recommend the record_committed_to_read_latency metric for monitoring purposes.
Nothing blocking but I feel a bit uneasy about the LSN lag metric, at least as it is currently implemented.
@@ -37,7 +37,7 @@ impl PeriodicTailChecker {
    "Started a background periodic tail checker for this loglet",
);
// Optimization. Don't run the check if the tail/seal has been updated recently.
-// Unfortunately this requires a litte bit more setup in the TailOffsetWatch so we don't do
+// Unfortunately this requires a little bit more setup in the TailOffsetWatch so we don't do
Every little bit helps!
@@ -151,6 +151,7 @@ pub async fn list_partitions(
    "PERSISTED-LSN",
    "SKIPPED-RECORDS",
    "ARCHIVED-LSN",
    "LSN-LAG",
It somewhat bothers me using "lag" for something that isn't time but I don't have a much better suggestion. Lag is also conveniently short!! 😆
let Some(tail_lsn) = tail else {
    return;
};
Do we need to handle the case when we might receive a None update? When might that happen - and won't we be posting lies if we're doing a saturating sub from an out-of-date target tail? I'd almost rather reset the tail to none, and post +INF or some other sentinel value to the gauge to represent "unknown".
(The scenario I'm trying to avoid is this: as an operator, I set up an alarm that flags whenever the delta goes above a certain threshold. Some partition processor is isolated and doesn't get target tail updates any more, but continues posting a 0 delta metric, which looks healthy to the monitoring system.)
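A sketch of that suggestion, with assumed names: publish a sentinel instead of a possibly misleading 0 when the target tail is unknown or stale, so a threshold alarm still fires for an isolated partition processor.

fn publish_lsn_lag(last_applied_lsn: u64, target_tail_lsn: Option<u64>) {
    let lag = match target_tail_lsn {
        // The exact off-by-one depends on whether the tail LSN is exclusive;
        // a plain saturating difference is assumed here for illustration.
        Some(tail) => tail.saturating_sub(last_applied_lsn) as f64,
        // Unknown tail: report +Inf rather than pretending the lag is zero.
        None => f64::INFINITY,
    };
    metrics::gauge!("restate.partition.last_applied_lsn_lag").set(lag); // illustrative name
}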
@@ -241,6 +244,9 @@ impl PartitionProcessorManager {
    let mut latest_snapshot_check_interval = tokio::time::interval(Duration::from_secs(5));
    latest_snapshot_check_interval.set_missed_tick_behavior(MissedTickBehavior::Skip);

    let mut update_target_tail_lsns = tokio::time::interval(Duration::from_secs(1));
Just curious, I'm assuming this is reasonably cheap to do at the given rate?
I imagine we could improve the metric fluctuations (and avoid the need to poll) in the future with in-band tail information sent along by the log servers with the record stream.
describe_gauge!(
    PARTITION_LAST_APPLIED_LSN_LAG,
    Unit::Count,
    "Number of records between last applied lsn and the log tail"
I feel like this should come with a fairly big disclaimer about the sampling error given the current implementation; not sure if this is the best place to do it.
Force-pushed from 6a98d73 to 29399e9
Summary: FindTailAttr allows find_tail users to adjust durability and accuracy. Currently, two modes are supported:
- Approximate: a fast check that usually returns immediately with the last known tail.
- Durable (default): ensures a reliable tail find.
Loglet implementations can choose to always run a Durable find_tail.
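A rough sketch of the shape such an attribute could take, with variant names and semantics taken from the summary above; the actual definition in the codebase may differ.

#[derive(Debug, Clone, Copy, Default)]
pub enum FindTailAttr {
    /// Fast check that usually returns immediately with the last known tail.
    Approximate,
    /// Ensures a reliable tail find; loglet implementations may always use this.
    #[default]
    Durable,
}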
PP <-> Bifrost lag observability
This PR improves only the PP <-> Bifrost observability.
We already collect metrics for both PP and Bifrost, but I think these extra metrics can be useful:
- Utilization of the PP requests queue. Once full, the PP will start dropping requests as 'Busy', and I thought it's important to measure how utilized this queue is (a rough sketch of the idea follows below).
- Commit-to-read latency of records. Measures the latency between the moment a record is committed until it's read by the PP.
- LSN lag (applied LSN to the log tail). This is also shown in restatectl partition list.
Fixes #2756
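For the queue utilization metric, a hypothetical illustration of what is being measured (the real queue type and accessors in the PR may differ): the gauge reports how full the bounded RPC queue is as a percentage.

fn publish_rpc_queue_utilization(queue_len: usize, queue_capacity: usize) {
    let utilization_percent = if queue_capacity == 0 {
        0.0
    } else {
        (queue_len as f64 / queue_capacity as f64) * 100.0
    };
    metrics::gauge!("restate.partition.rpc_queue.utilization.percent").set(utilization_percent);
}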
Stack created with Sapling. Best reviewed with ReviewStack.