Enhance threads stacks collection and sampling frequency on Linux #191

Open
wants to merge 4 commits into base: master

Conversation

viglia
Contributor

@viglia viglia commented Dec 15, 2022

This PR tries to address inconsistencies in the collection of samples across time and threads (somewhat related to #177).

Currently, to collect a sample (stack), we rely on the SIGPROF signal, which is a standard signal. This has two limitations:

  1. for standard signals, the set of pending signals is only a mask: if multiple instances of a standard signal are generated while that signal is blocked, only one instance is marked as pending (and the signal is delivered just once when it is unblocked);
  2. the SIGPROF signal is directed to the process and will be handled by one of its threads, so we only collect the stack of whichever thread happens to be "handling" the signal at that moment.

The changes in this PR try to address both of the above-mentioned limitations by making use of real-time signals and per-thread delivery.

Real-Time Signal

An important distinction here is that real-time signals are queued. If multiple instances of a real-time signal are sent to a process, the signal is delivered multiple times (so we don't waste a potential "sampling round").

Per-Thread Delivery

Instead of having a timer generate a SIGPROF signal that is handled by an arbitrary thread, we propose having an external thread which, at a given frequency, collects the tids (kernel thread ids) of the running threads and sends the real-time signal to each one of them.
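The tid-collection step can be sketched by listing /proc/self/task, which is where Linux exposes one entry per thread of the current process (the helper name current_tids is illustrative, not this PR's actual API):

```rust
use std::fs;

// Enumerate the kernel thread ids (tids) of the current process by
// listing /proc/self/task -- the information the external sampler
// thread needs before signalling each thread individually.
fn current_tids() -> Vec<i32> {
    fs::read_dir("/proc/self/task")
        .expect("requires Linux procfs")
        .filter_map(|entry| {
            entry.ok()?.file_name().into_string().ok()?.parse().ok()
        })
        .collect()
}

fn main() {
    let tids = current_tids();
    // On Linux, the main thread's tid equals the process id.
    assert!(tids.contains(&(std::process::id() as i32)));
    println!("found {} thread(s)", tids.len());
}
```

The sampler thread would then deliver the real-time signal to each tid (e.g. via the tgkill syscall) rather than letting the kernel pick an arbitrary handling thread.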

The thread signalling the tids was calling sleep after signalling each individual tid.

This was an unintended bug/mistake.

The correct behaviour is to send all the signals to the different tids as soon as possible and then sleep.
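The fix can be sketched as follows (sampling_round and send_signal are illustrative names; the real delivery uses a real-time signal per tid):

```rust
use std::time::Duration;

// Sketch of the corrected sampler round: signal every tid back to
// back, then sleep exactly once. The bug was sleeping after each
// tid, which stretched one round to tids.len() * interval.
fn sampling_round(tids: &[i32], send_signal: &mut dyn FnMut(i32)) -> Duration {
    for &tid in tids {
        send_signal(tid); // deliver to all threads as soon as possible
    }
    Duration::from_millis(10) // a single 10ms pause per round (~100 Hz)
}

fn main() {
    let mut signalled = Vec::new();
    let pause = sampling_round(&[11, 22, 33], &mut |tid| signalled.push(tid));
    assert_eq!(signalled, vec![11, 22, 33]);
    // one sleep per round, regardless of how many tids there are
    assert_eq!(pause, Duration::from_millis(10));
    println!("signalled {} tids, then slept once", signalled.len());
}
```

The caller sleeps for the returned duration once per round, so the sampling interval no longer depends on the number of threads.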
@viglia viglia requested a review from Swatinem January 4, 2023 15:23
@YangKeao YangKeao requested review from YangKeao and removed request for Swatinem January 5, 2023 05:28
@YangKeao
Member

YangKeao commented Jan 9, 2023

For the normal situation (e.g. frequency 99 with a fast enough CPU), this patch doesn't differ much from the original one. Then I tried this PR in a simpler situation (the prime_number example) with a higher frequency (999), and it got worse:

What I have done

https://github.com/tikv/pprof-rs/blob/master/src/profiler.rs#L334-L336

There is already a sample time in the signal handler. I appended this sample time to /tmp/stats:

diff --git a/src/profiler.rs b/src/profiler.rs
index c78f0dc..c826310 100644
--- a/src/profiler.rs
+++ b/src/profiler.rs
@@ -1,6 +1,7 @@
 // Copyright 2019 TiKV Project Authors. Licensed under Apache-2.0.
 
 use std::convert::TryInto;
+use std::io::Write;
 use std::os::raw::c_int;
 use std::time::SystemTime;
 
@@ -350,6 +351,8 @@ extern "C" fn perf_signal_handler(
                     false
                 }
             });
+            let mut stats = std::fs::OpenOptions::new().write(true).read(true).create(true).append(true).open("/tmp/stats").unwrap();
+            write!(stats, "{:?}\n", sample_timestamp).unwrap();
 
             let current_thread = unsafe { libc::pthread_self() };
             let mut name = [0; MAX_THREAD_NAME];

Then, process /tmp/stats with the following code (% 1000 because JS doesn't have high-precision integers; why didn't I use Python/Julia 🤦):

const fs = require("fs")

const stats = fs.readFileSync("/tmp/stats", { encoding: "utf8" })
    .split("\n")
    .filter(item => item.split(" ")[3] != undefined)
    .map(item => [item.split(" ")[3].slice(0, -1), item.split(" ")[5]])
    .map(item => [parseInt(item[0]), parseInt(item[1])])
    .map(item => item[0] % 1000 * 1000000000 + item[1])
for (let i = 0; i < stats.length; i++) {
    console.log(`${i} ${stats[i]}`)
}

Run node main.js > ./stats.dat, then plot the output with gnuplot:

plot "./stats.dat"

Result

For master pprof-rs, the graph is "smooth" everywhere:

[Screenshot from 2023-01-09 22-53-16]

For pprof-rs with this PR, the line "jumps" in some places:

(zoomed)

[Screenshot from 2023-01-09 22-48-45]

I didn't analyze it with LSM or other statistical methods, as the difference can be seen simply by eye.

Others

I have other thoughts about this idea. As you have described, the main benefit brought by the real-time signal is that no signal is lost. Does that mean some stack traces are duplicated (because the signal handler runs back to back, they'll all capture the same stack)? Duplication and missing samples are both errors, and I don't think duplication is better 🤔.

After all, this PR is a really good exploration of the interaction between real-time signals and profiling. I'll also study more and do some experiments with https://github.com/thoughtspot/threadstacks to see where the problem actually is. Thanks 🍻 .

@viglia
Contributor Author

viglia commented Jan 9, 2023

Hi @YangKeao ,

Many thanks for taking a look at this and for benchmarking this.

When you say that for the same frequency you didn't see much better results, and that for a higher frequency it actually got worse, do you mean in terms of samples collected across time and threads, or are you referring to the plot shown above, which plots only the timestamps at which samples were collected?

There is already a sample time in the signal handler. I append this sample time into the /tmp/stats:

I'm not sure that using sample_timestamp is a "robust" method. According to the Rust docs, this is not a monotonic measurement:

Distinct from the [Instant](https://doc.rust-lang.org/std/time/struct.Instant.html) type, this time measurement is not monotonic. This means that you can save a file to the file system, then save another file to the file system, and the second file has a SystemTime measurement earlier than the first. In other words, an operation that happens after another operation in real time may have an earlier SystemTime!
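A quick illustration of the difference, in contrast: Instant is guaranteed monotonic, while two SystemTime readings can legitimately compare out of order if the wall clock is adjusted in between.

```rust
use std::time::{Instant, SystemTime};

fn main() {
    // Instant is monotonic: a later reading never compares earlier.
    let a = Instant::now();
    let b = Instant::now();
    assert!(b >= a);

    // SystemTime offers no such guarantee; elapsed() returns Err if
    // the wall clock moved backwards between the two readings.
    let t = SystemTime::now();
    match t.elapsed() {
        Ok(d) => println!("elapsed {:?}", d),
        Err(_) => println!("clock went backwards"),
    }
}
```

So for measuring sampling intervals, an Instant-based timestamp would be the more robust choice.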

As you have described, the main benefit brought by the real-time signal is that no signal is lost. Does it mean some stacktrace is duplicated

In principle it shouldn't, since we fire the signal at a specific tid with a given frequency. Should a thread not be able to handle it in a timely fashion, though, the signal would be queued, and if the signal handler is then executed immediately and repeatedly for the same tid, 2 samples might be collected at a shorter interval than usual. In that case we would have 2 samples at a more fine-grained interval. If you consider 2 samples belonging to the same interval (say, 10ms) as duplicates even though they were collected at different instants, then yes, this could lead to that.

I'm available if you want to discuss it further.

Thanks again for the valuable feedback!

@viglia
Contributor Author

viglia commented Jan 10, 2023

@YangKeao can I ask a couple more questions to contextualize the tests/experiments?

  1. How many threads did the code you were profiling launch?
  2. Could you share the hardware specs of the environment it was running on?

Thanks
