PP <-> Bifrost lag observability #2755

Open · wants to merge 2 commits into base: main

Conversation

@muhamadazmy (Contributor) commented Feb 19, 2025

PP <-> Bifrost lag observability

This PR improves only the PP <-> Bifrost observability.
We already collect metrics for both the PP and Bifrost, but I think
these extra metrics can be useful (a minimal recording sketch for the
first two is included at the end of this description):

  • Utilization of the PP request queue. Once it is full,
    the PP starts dropping requests as 'Busy', so it's
    important to measure how utilized this queue is.

  • Commit-to-read latency of records. Measures the
    latency between the moment a record is committed and
    the moment it is read by the PP.

  • LSN lag (applied LSN to the log tail). This is also shown in `restatectl partition list`.

Fixes #2756


Stack created with Sapling. Best reviewed with ReviewStack.
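
For illustration, here is a minimal sketch of how the first two metrics might be recorded with the `metrics` crate. The constant names and string values match this PR's diff; the hook functions, their parameters, and how the commit timestamp reaches the reader are assumptions, not the PR's actual code:

```rust
use metrics::{gauge, histogram};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

pub const PARTITION_RPC_QUEUE_UTILIZATION_PERCENT: &str =
    "restate.partition.rpc_queue.utilization.percent";
pub const PARTITION_RECORD_WRITE_TO_READ_LATENCY_SECONDS: &str =
    "restate.partition.record_write_to_read_latency.seconds";

/// Hypothetical hook: report how full the PP request queue currently is (0.0..=100.0).
pub fn on_rpc_queue_len_changed(len: usize, capacity: usize) {
    let utilization = len as f64 / capacity as f64 * 100.0;
    gauge!(PARTITION_RPC_QUEUE_UTILIZATION_PERCENT).set(utilization);
}

/// Hypothetical hook: called when the PP reads a record. `committed_at_millis` is
/// assumed to be a wall-clock timestamp carried on the record when it was committed.
pub fn on_record_read(committed_at_millis: u64) {
    let committed_at = UNIX_EPOCH + Duration::from_millis(committed_at_millis);
    if let Ok(latency) = SystemTime::now().duration_since(committed_at) {
        // With metrics >= 0.22 the macro returns a handle that we record into.
        histogram!(PARTITION_RECORD_WRITE_TO_READ_LATENCY_SECONDS)
            .record(latency.as_secs_f64());
    }
}
```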

@muhamadazmy muhamadazmy changed the title from "WIP: Improve PP <-> Bifrost observability" to "PP <-> Bifrost lag observability" on Feb 19, 2025
@muhamadazmy muhamadazmy force-pushed the pr2755 branch 2 times, most recently from 66a5519 to 2b51579 on February 19, 2025 16:36
@muhamadazmy (Contributor, Author) commented Feb 19, 2025

@pcholakov As advised, this only tackles a tiny slice of the observability story card. We already collect a lot of metrics for both the PP and Bifrost; this only adds two metrics, as described in the PR.

This is still under development, so there is a chance I will add more metrics regarding the lag, for example how far behind the PP is in LSN terms. Still looking into how to figure this out.

EDIT: I think it's not possible to compute the lag in LSN terms without doing a call to find_tail, which is an intensive operation that I will definitely avoid.

@pcholakov (Contributor) left a comment

Hey, thanks so much for adding these, @muhamadazmy! The two dimensions you've picked are super valuable and would be a great addition to our observability.

I'm sorry to be a pedant about naming and definitions but I think this is one area where we need to be ultra crisp. The folks working with these in the future won't have the benefit of our familiarity with the internals, so it's important we convey as much meaning as clearly as possible.

I'm not set on any of my suggestions, and ofc happy to chat sync if it would speed up your process 🙏

@@ -9,6 +9,7 @@
// by the Apache License, Version 2.0.

use assert2::let_assert;
use restate_types::time::MillisSinceEpoch;
Contributor:

Nit: bump to the next group of imports

// In this case, someone already proposed this command,
// let's just replace the reciprocal and fail the old one to avoid keeping it dangling
let old_reciprocal = o.remove();
let old_reciprocal = o.insert(RpcReciprocal::new(request_created_at, reciprocal));
Contributor:

Nice!

@@ -367,6 +368,18 @@ where
);
}

// Telemetry setup
let partition_id_str: &'static str = Box::leak(Box::new(self.partition_id.to_string()));
Contributor:

Given that this task may be restarted multiple times over the lifetime of a Restate server, how is this not a real memory leak? I have seen this in a few other places where a 'static label is required and, IIRC, Valgrind flags them as leaks. Granted, it's a small absolute amount, but it may build up if a server housing many partitions encounters many trim gaps over the process lifetime. Am I missing something? If this is a real problem, could you please open an issue to track it?

Contributor (Author):

Disclaimer: this is only moved from below :) But I was thinking exactly the same when I first saw it (a while back).

When I investigated back then, it seemed there was no way to stop a PP (you could only move it from leader to follower), so it wasn't an issue at the time. But I don't think that's the case anymore, hence this is definitely a bug.

Contributor:

This was written before we allowed PPs to be restarted; the right fix for this is to allow TC to accept `impl Into<Cow<'static, str>>` instead.
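
A minimal sketch of the suggested shape, using a hypothetical spawn helper rather than the real TaskCenter API:

```rust
use std::borrow::Cow;

// Hypothetical helper illustrating the suggestion; not the actual TaskCenter signature.
fn spawn_named(name: impl Into<Cow<'static, str>>) {
    let name: Cow<'static, str> = name.into();
    // The Cow either borrows a &'static str (no allocation) or owns a String that is
    // freed when dropped, so no Box::leak is needed for per-partition task names.
    println!("spawning task: {name}");
}

fn main() {
    spawn_named("periodic-tail-checker"); // borrowed &'static str
    spawn_named(format!("pp-{}", 7));     // owned String, no leak
}
```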

Contributor:

For metrics, I believe we can use an Arc so that we don't allocate the string every time; the Arc will need to be created outside the loop, though.

describe_histogram!(
PARTITION_RECORD_WRITE_TO_READ_LATENCY_SECONDS,
Unit::Seconds,
"Duration between the time a record was first created until it was read for processing by the partition processor"
Contributor:

"Created" is ambiguous here for someone just looking at the description in a metrics system. Also, the constant name is potentially out of sync - WRITE_TO_READ implies committed-to-picked-up to me?

describe_histogram!(
PARTITION_REQUEST_TO_SUBMITTED_DURATION,
Unit::Seconds,
"Duration between the time an ingress request was first created until it was submitted"
Contributor:

First created, or first received by the ingress server?

);

describe_histogram!(
PARTITION_REQUEST_TO_OUTPUT_DURATION,
Contributor:

Request to output is ambiguous here; could be misinterpreted to mean "request by the external client". Is it actually measured since received by ingress? Similarly, what is "produced an output", exactly? Output was committed to the log?

Contributor (Author):

This took me a while to wrap my head around, but this is how it works.
An invocation can be created in one of two modes:

  • Submitted: ingress only waits until the request is "submitted", which seems to mean that the invocation was "read" by the PP.

  • Output: ingress blocks until the invocation is actually executed and the user handler has returned an output; the output is then sent back to ingress (to be forwarded to the caller).

Because these two modes take two different paths, I had to measure the delay for each mode separately: request_to_output is the delay of an invocation from the moment ingress created the request until the output was produced, while request_to_submitted is the delay from the moment ingress created the request until it was "submitted". A sketch of this split follows below.
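
For illustration, a sketch of measuring the two paths separately. Only the constant identifiers come from this PR's diff; the mode enum, the hook, and the metric string values are assumptions, and these two metrics were ultimately dropped later in this review:

```rust
use metrics::histogram;
use std::time::Instant;

// Constant identifiers follow this PR's diff; the string values here are placeholders.
pub const PARTITION_REQUEST_TO_SUBMITTED_DURATION: &str =
    "restate.partition.request_to_submitted.duration.seconds";
pub const PARTITION_REQUEST_TO_OUTPUT_DURATION: &str =
    "restate.partition.request_to_output.duration.seconds";

/// Hypothetical stand-in for the two invocation modes described above.
pub enum InvocationMode {
    Submitted,
    Output,
}

/// Hypothetical hook: called once the response for the invocation is handed back to ingress.
pub fn record_invocation_latency(mode: InvocationMode, request_created_at: Instant) {
    let metric = match mode {
        // The request only had to be "read" (submitted) by the partition processor.
        InvocationMode::Submitted => PARTITION_REQUEST_TO_SUBMITTED_DURATION,
        // The user handler ran to completion and produced an output.
        InvocationMode::Output => PARTITION_REQUEST_TO_OUTPUT_DURATION,
    };
    histogram!(metric).record(request_created_at.elapsed().as_secs_f64());
}
```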

pub const PARTITION_RPC_QUEUE_UTILIZATION_PERCENT: &str =
"restate.partition.rpc_queue.utilization.percent";
pub const PARTITION_RECORD_WRITE_TO_READ_LATENCY_SECONDS: &str =
"restate.partition.record_write_to_read_latency.seconds";
Contributor:

For some reason, write-to-read doesn't sit quite right - maybe committed or accepted? But what would an operator with low context on the internals of Restate think of this name if they saw it in the metrics list at 3am?

Suggested change
"restate.partition.record_write_to_read_latency.seconds";
"restate.partition.record_committed_to_read_latency.seconds";

@muhamadazmy muhamadazmy force-pushed the pr2755 branch 2 times, most recently from 8854a68 to ce37f39 on February 21, 2025 16:40
@muhamadazmy muhamadazmy marked this pull request as ready for review February 21, 2025 16:44
@muhamadazmy (Contributor, Author) commented:

Thank you @pcholakov for your helpful input, as usual. I really think the two metrics that measure delays from an ingress invocation until submission or return are not very helpful, especially since ingress also measures the latency of a request, so I decided to drop them. Instead we have the committed-to-read latency for each record, plus the LSN lag (applied LSN to tail), which both give good insight into the health of the PP.

@pcholakov (Contributor) left a comment

Hey @muhamadazmy thanks for considering my suggestions :-)

I still have some reservations about the lsn-lag metric as implemented; I think it should come with some major disclaimers that make it clear to operators that it can vary wildly depending on how (in)frequently each worker is able to refresh the target tail for the logs it follows - maybe explicitly recommend the record_committed_to_read_latency metric for monitoring purposes.

Nothing blocking but I feel a bit uneasy about the LSN lag metric, at least as it is currently implemented.

@@ -37,7 +37,7 @@ impl PeriodicTailChecker {
"Started a background periodic tail checker for this loglet",
);
// Optimization. Don't run the check if the tail/seal has been updated recently.
// Unfortunately this requires a litte bit more setup in the TailOffsetWatch so we don't do
// Unfortunately this requires a little bit more setup in the TailOffsetWatch so we don't do
Contributor:

Every little bit helps!

@@ -151,6 +151,7 @@ pub async fn list_partitions(
"PERSISTED-LSN",
"SKIPPED-RECORDS",
"ARCHIVED-LSN",
"LSN-LAG",
Contributor:

It somewhat bothers me using "lag" for something that isn't time but I don't have a much better suggestion. Lag is also conveniently short!! 😆

Comment on lines +526 to +528
let Some(tail_lsn) = tail else {
return;
};
Contributor:

Do we need to handle the case when we might receive a None update? When might that happen - and won't we be posting lies if we're doing a saturating sub from an out-of-date target tail? I'd almost rather reset the tail to none, and post +INF or some other sentinel value to the gauge to represent "unknown".

(The scenario I'm trying to avoid is this: as an operator, I set up an alarm that flags whenever the delta goes above a certain threshold. Some partition processor is isolated and doesn't get target tail updates any more, but continues posting a 0 delta metric, which looks healthy to the monitoring system.)
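
A minimal sketch of the sentinel idea, assuming a hypothetical reporter function and a placeholder metric string (only the constant identifier comes from this PR):

```rust
use metrics::gauge;

// Constant identifier from this PR; the string value here is a placeholder.
pub const PARTITION_LAST_APPLIED_LSN_LAG: &str = "restate.partition.last_applied_lsn.lag";

/// Hypothetical reporter: `target_tail` is None when no fresh tail update is known.
/// Assumes the tail LSN points at the next record to be written.
pub fn report_lsn_lag(last_applied_lsn: u64, target_tail: Option<u64>) {
    let lag = match target_tail {
        Some(tail) => tail.saturating_sub(last_applied_lsn + 1) as f64,
        // Unknown tail: post a sentinel instead of a stale 0 that would look healthy.
        None => f64::INFINITY,
    };
    gauge!(PARTITION_LAST_APPLIED_LSN_LAG).set(lag);
}
```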

@@ -241,6 +244,9 @@ impl PartitionProcessorManager {
let mut latest_snapshot_check_interval = tokio::time::interval(Duration::from_secs(5));
latest_snapshot_check_interval.set_missed_tick_behavior(MissedTickBehavior::Skip);

let mut update_target_tail_lsns = tokio::time::interval(Duration::from_secs(1));
Contributor:

Just curious, I'm assuming this is reasonably cheap to do at the given rate?

I imagine we could improve the metric fluctuations (and avoid the need to poll) in the future with in-band tail information sent along by the log servers with the record stream.

describe_gauge!(
PARTITION_LAST_APPLIED_LSN_LAG,
Unit::Count,
"Number of records between last applied lsn and the log tail"
Contributor:

I feel like this should come with a fairly big disclaimer about the sampling error given the current implementation; not sure if this is the best place to do it.

Summary:

FindTailAttr allows find_tail users to adjust durability and accuracy. Currently, two modes are supported:

- Approximate: a fast check that usually returns immediately with the last known tail.
- Durable (default): ensures a reliable tail find.

Loglet implementations can choose to always run a Durable find_tail.
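
A self-contained sketch of the two modes as described above (illustrative only; the actual Bifrost types and the find_tail signature may differ):

```rust
/// Illustrative stand-in for the attribute described in the summary above; the real
/// type lives in the Restate codebase and may differ.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub enum FindTailAttr {
    /// Fast check that usually returns immediately with the last known tail.
    Approximate,
    /// Ensures a reliable tail find.
    #[default]
    Durable,
}

/// A periodic lag metric can tolerate staleness, so it would opt into the cheap mode,
/// while correctness-critical callers keep the default.
pub fn mode_for_lag_metric() -> FindTailAttr {
    FindTailAttr::Approximate
}
```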
Successfully merging this pull request may close these issues:

  • Observability for PP <-> Bifrost lag (#2756)