Enhance threads stacks collection and sampling frequency on Linux #191

Open
wants to merge 4 commits into base: master

Conversation

viglia
Contributor

@viglia viglia commented Dec 15, 2022

This PR tries to address inconsistencies in the collection of samples across time and threads (somewhat related to #177).

Currently, to collect a sample (stack), we rely on the SIGPROF signal, which is a standard signal. This has two limitations:

  1. for standard signals, the set of pending signals is only a mask: if multiple instances of a standard signal are generated while that signal is blocked, only one instance is marked as pending (and the signal is delivered just once when it is unblocked);
  2. the SIGPROF signal is directed to the process and will be handled by one of its threads, so we only collect the stack of whichever thread happens to be "handling" the signal at that moment.

The changes in this PR try to address both of the above-mentioned limitations by making use of real-time signals and per-thread delivery.

Real-Time Signal

An important distinction here is that real-time signals are queued. If multiple instances of a real-time signal are sent to a process, the signal is delivered multiple times (so we don't waste a potential "sampling round").

Per-Thread Delivery

Instead of having a timer generate a SIGPROF signal that is handled by an arbitrary thread, we propose having an external thread which, at a given frequency, collects the tids (kernel thread ids) of the running threads and sends the real-time signal to each one of them.
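The tid-collection step can be sketched by listing /proc/self/task, which is where Linux exposes one entry per thread of the current process (the helper name current_tids is illustrative, not this PR's actual API):

```rust
use std::fs;

// Enumerate the kernel thread ids (tids) of the current process by
// listing /proc/self/task -- the information the external sampler
// thread needs before signalling each thread individually.
fn current_tids() -> Vec<i32> {
    fs::read_dir("/proc/self/task")
        .expect("requires Linux procfs")
        .filter_map(|entry| {
            entry.ok()?.file_name().into_string().ok()?.parse().ok()
        })
        .collect()
}

fn main() {
    let tids = current_tids();
    // On Linux, the main thread's tid equals the process id.
    assert!(tids.contains(&(std::process::id() as i32)));
    println!("found {} thread(s)", tids.len());
}
```

The sampler thread would then deliver the real-time signal to each tid (e.g. via the tgkill syscall) rather than letting the kernel pick an arbitrary handling thread.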

The thread signalling the tids was calling sleep after signalling each individual tid.

This was an unintended bug/mistake.

The correct behaviour is to send all the signals to the different tids as soon as possible and then sleep.
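The fix can be sketched as follows (sampling_round and send_signal are illustrative names; the real delivery uses a real-time signal per tid):

```rust
use std::time::Duration;

// Sketch of the corrected sampler round: signal every tid back to
// back, then sleep exactly once. The bug was sleeping after each
// tid, which stretched one round to tids.len() * interval.
fn sampling_round(tids: &[i32], send_signal: &mut dyn FnMut(i32)) -> Duration {
    for &tid in tids {
        send_signal(tid); // deliver to all threads as soon as possible
    }
    Duration::from_millis(10) // a single 10ms pause per round (~100 Hz)
}

fn main() {
    let mut signalled = Vec::new();
    let pause = sampling_round(&[11, 22, 33], &mut |tid| signalled.push(tid));
    assert_eq!(signalled, vec![11, 22, 33]);
    // one sleep per round, regardless of how many tids there are
    assert_eq!(pause, Duration::from_millis(10));
    println!("signalled {} tids, then slept once", signalled.len());
}
```

The caller sleeps for the returned duration once per round, so the sampling interval no longer depends on the number of threads.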
@viglia viglia requested a review from Swatinem January 4, 2023 15:23
@YangKeao YangKeao requested review from YangKeao and removed request for Swatinem January 5, 2023 05:28
@YangKeao
Member

YangKeao commented Jan 9, 2023

For the normal situation (e.g. frequency 99 with a fast enough CPU), this patch doesn't differ much from the original one. Then I tried this PR in a simpler situation (the prime_number example) with a higher frequency (999), and it got worse:

What I have done

https://github.com/tikv/pprof-rs/blob/master/src/profiler.rs#L334-L336

There is already a sample time in the signal handler. I appended this sample time to /tmp/stats:

diff --git a/src/profiler.rs b/src/profiler.rs
index c78f0dc..c826310 100644
--- a/src/profiler.rs
+++ b/src/profiler.rs
@@ -1,6 +1,7 @@
 // Copyright 2019 TiKV Project Authors. Licensed under Apache-2.0.
 
 use std::convert::TryInto;
+use std::io::Write;
 use std::os::raw::c_int;
 use std::time::SystemTime;
 
@@ -350,6 +351,8 @@ extern "C" fn perf_signal_handler(
                     false
                 }
             });
+            let mut stats = std::fs::OpenOptions::new().write(true).read(true).create(true).append(true).open("/tmp/stats").unwrap();
+            write!(stats, "{:?}\n", sample_timestamp).unwrap();
 
             let current_thread = unsafe { libc::pthread_self() };
             let mut name = [0; MAX_THREAD_NAME];

Then, process /tmp/stats with the following code (% 1000 because JS doesn't have high-precision integers; why didn't I use Python/Julia 🤦):

const fs = require("fs")

const stats = fs.readFileSync("/tmp/stats", { encoding: "utf8" })
    .split("\n")
    .filter(item => item.split(" ")[3] != undefined)
    .map(item => [item.split(" ")[3].slice(0, -1), item.split(" ")[5]])
    .map(item => [parseInt(item[0]), parseInt(item[1])])
    .map(item => item[0] % 1000 * 1000000000 + item[1])
for (let i = 0; i < stats.length; i++) {
    console.log(`${i} ${stats[i]}`)
}

Run node main.js > ./stats.dat, then plot the output with gnuplot:

plot "./stats.dat"

Result

For master pprof-rs, the graph is "smooth" everywhere:

[Screenshot from 2023-01-09 22-53-16]

For pprof-rs with this PR, the line "jumps" in some places:

(zoomed)

[Screenshot from 2023-01-09 22-48-45]

I didn't analyze it with LSM or other statistical methods, as the difference can be seen simply by eye.

Others

I have other thoughts about this idea. As you have described, the main benefit brought by the real-time signal is that no signal is lost. Does that mean some stack traces are duplicated (because the signal handler runs back to back, they'll all capture the same stack)? Duplication and missing samples are both errors, and I don't think duplication is better 🤔.

After all, this PR is a really good exploration of the interaction between real-time signals and profiling. I'll also study more and do some experiments with https://github.com/thoughtspot/threadstacks to see where the problem actually is. Thanks 🍻 .

@viglia
Contributor Author

viglia commented Jan 9, 2023

Hi @YangKeao ,

Many thanks for taking a look at this and for benchmarking this.

When you say that for the same frequency you didn't see much better results, and that for a higher frequency it actually got worse, do you mean in terms of samples collected across time and threads, or are you referring to the plot shown above, which plots only the timestamps at which samples were collected?

There is already a sample time in the signal handler. I append this sample time into the /tmp/stats:

I'm not sure that using sample_timestamp is a "robust" method. According to the Rust docs, this is not a monotonic measurement:

Distinct from the [Instant](https://doc.rust-lang.org/std/time/struct.Instant.html) type, this time measurement is not monotonic. This means that you can save a file to the file system, then save another file to the file system, and the second file has a SystemTime measurement earlier than the first. In other words, an operation that happens after another operation in real time may have an earlier SystemTime!
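A quick illustration of the difference, in contrast: Instant is guaranteed monotonic, while two SystemTime readings can legitimately compare out of order if the wall clock is adjusted in between.

```rust
use std::time::{Instant, SystemTime};

fn main() {
    // Instant is monotonic: a later reading never compares earlier.
    let a = Instant::now();
    let b = Instant::now();
    assert!(b >= a);

    // SystemTime offers no such guarantee; elapsed() returns Err if
    // the wall clock moved backwards between the two readings.
    let t = SystemTime::now();
    match t.elapsed() {
        Ok(d) => println!("elapsed {:?}", d),
        Err(_) => println!("clock went backwards"),
    }
}
```

So for measuring sampling intervals, an Instant-based timestamp would be the more robust choice.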

As you have described, the main benefit brought by the real-time signal is that no signal is lost. Does it mean some stacktrace is duplicated

In principle it shouldn't, since we fire the signal at a specific tid with a given frequency. Should a thread not be able to handle it in a timely fashion, though, the signal would be queued, and if the signal handler is then executed immediately and repeatedly for the same tid, 2 samples might be collected at a shorter interval than usual. In that case we would have 2 samples at a more fine-grained interval. If you consider 2 samples belonging to the same interval (say, 10ms) as duplicates even though they were collected at different instants, then yes, this could lead to that.

I'm available if you want to discuss it further.

Thanks again for the valuable feedback!

@viglia
Contributor Author

viglia commented Jan 10, 2023

@YangKeao can I ask a couple more questions to contextualize the tests/experiments?

  1. How many threads did the code you were profiling launch?
  2. Could you share the hardware specs of the environment it was running on?

Thanks
