-
Notifications
You must be signed in to change notification settings - Fork 0
C POSIX threads
The following example will introduce APEX using C POSIX threads.
This example will use the following programs:
To begin, we'll run just the simple pthread_c
executable, assuming that the CMake build directory is called build
:
kehuck1@uan01:~/src/apex-tutorial> ./build/bin/pthread_c
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
As you can see, there isn't a lot going on, just a main function that spawns 4 threads, and each of those 4 treads will spawn 4 more threads, and detach from them. That means the parent thread considers these tasks "fire and forget" tasks, and will not wait for them to finish or expect a return value.
APEX includes a program called apex_exec
, which is a convenience bash script for preloading the APEX libraries and dependencies, as well as setting any necessary environment variables to control APEX. To see all the options possible with apex_exec
, run it without arguments or with the --apex:help
flag:
kehuck1@uan01:~/src/apex-tutorial> apex_exec
Usage:
apex_exec <APEX options> executable <executable options>
where APEX options are zero or more of:
--apex:help show this usage message
--apex:debug run with APEX in debugger
--apex:verbose enable verbose list of APEX environment variables
--apex:screen enable screen text output (on by default)
--apex:quiet disable screen text output
--apex:csv enable csv text output
--apex:tau enable tau profile output
--apex:taskgraph enable taskgraph output
(graphviz required for post-processing)
--apex:tasktree enable tasktree output
(graphviz required for post-processing)
--apex:throttle throttle short-lived timers to reduce overhead (default: off)
--apex:throttle_calls <value> minimum number of calls before throttling (default: 1000)
--apex:throttle_per <value> minimum timer duration in microseconds (default: 10)
--apex:otf2 enable OTF2 trace output
--apex:otf2path <value> specify location of OTF2 archive
(default: ./OTF2_archive)
--apex:otf2name <value> specify name of OTF2 file (default: APEX)
--apex:gtrace enable Google Trace Events output (deprecated)
--apex:pftrace enable Perfetto Trace output
--apex:scatter enable scatterplot output
(python required for post-processing)
--apex:openacc enable OpenACC support
--apex:kokkos enable Kokkos support
--apex:kokkos_tuning enable Kokkos runtime autotuning support
--apex:kokkos_fence enable Kokkos fences for async kernels
--apex:raja enable RAJA support
--apex:pthread enable pthread wrapper support
--apex:gpu_memory enable CPU memory wrapper support
--apex:cpu_memory enable GPU memory wrapper support
--apex:untied enable tasks to migrate cores/OS threads
during execution (not compatible with trace output)
--apex:cuda enable CUDA/CUPTI measurement (default: off)
--apex:cuda_counters enable CUDA/CUPTI counter support (default: off)
--apex:cuda_driver enable CUDA driver API callbacks (default: off)
--apex:cuda_details enable per-kernel statistics where available (default: off)
--apex:hip enable HIP/ROCTracer measurement (default: off)
--apex:hip_metrics enable HIP/ROCProfiler metric support (default: off)
--apex:hip_counters enable HIP/ROCTracer counter support (default: off)
--apex:hip_driver enable HIP/ROCTracer KSA driver API callbacks (default: off)
--apex:hip_details enable per-kernel statistics where available (default: off)
--apex:monitor_gpu enable GPU monitoring services (CUDA NVML, ROCm SMI)
--apex:level0 enable OneAPI Level0 measurement (default: off)
--apex:cpuinfo enable sampling of /proc/cpuinfo (Linux only)
--apex:meminfo enable sampling of /proc/meminfo (Linux only)
--apex:net enable sampling of /proc/net/dev (Linux only)
--apex:status enable sampling of /proc/self/status (Linux only)
--apex:io enable sampling of /proc/self/io (Linux only)
--apex:period <value> specify frequency of OS/HW sampling
--apex:ompt enable OpenMP profiling (requires runtime support)
--apex:ompt_simple only enable OpenMP Tools required events
--apex:ompt_details enable all OpenMP Tools events
--apex:source resolve function, file and line info for address lookups with binutils
(default: function only)
--apex:preload <lib> extra libraries to load with LD_PRELOAD _before_ APEX libraries
(LD_PRELOAD value is added _after_ APEX libraries)
--apex:postprocess run post-process scripts (graphviz, python) on output data after exit
As you can see, there are quite a lot of flags/options, and not all of them are relevant to our example or our platform. But we will explore a few of them in this example. The APEX flags are all prefixed with apex:
to distinguish them from application flags, and can be added anywhere on the command line after apex_exec
- they will be consumed by the script and not passed to the executable.
Let's see what happens when we run with apex_exec
:
kehuck1@uan01:~/src/apex-tutorial> which apex_exec
/users/kehuck1/install/apex/bin/apex_exec
kehuck1@uan01:~/src/apex-tutorial> apex_exec ./build/bin/pthread_c
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
Start Date/Time: 25/02/2023 21:29:39
Elapsed time: 0.10214 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.10214 seconds
Available CPU time on all ranks: 0.10214 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
1 Minute Load average : 1 4.73 4.73
status:Threads : 1 2.00 2.00
status:VmData kB : 1 4.16e+05 4.16e+05
status:VmExe kB : 1 8.00 8.00
status:VmHWM kB : 1 4.24e+04 4.24e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 1.81e+05 1.81e+05
status:VmPTE kB : 1 484.00 484.00
status:VmPeak kB : 1 8.56e+05 8.56e+05
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 4.24e+04 4.24e+04
status:VmSize kB : 1 7.90e+05 7.90e+05
status:VmStk kB : 1 160.00 160.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 2.00 2.00
status:voluntary_ctxt_switches : 1 51.00 51.00
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 0.10 0.10
int apex_preload_main(int, char **, char **) : 1 0.10 0.10
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 1
We can see from the example output that we know that APEX has been engaged, due to the APEX header that prints APEX configuration information as a sanity check. Then the regular program output is generated, followed by a performance summary text output. The Counter
section of the output shows third-person hardware/operating system measurements, and the CPU Timers
section shows two timers that don't offer a lot of detail. APEX includes a wrapper for the __libc_start_main()
function that captures the time spent in the whole program, and there is an APEX MAIN
timer that captures time spent between APEX initialization and finalization (not included in the "Total timers" count).
Let's see what happens when we add the --apex:pthread
flag (as mentioned above, for convenience we can add the flags anywhere after apex_exec
- we can call apex_exec --apex:pthread ./build/bin/pthread_c
or apex_exec ./build/bin/pthread_c --apex:pthread
):
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread ./build/bin/pthread_c
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
Start Date/Time: 25/02/2023 21:47:58
Elapsed time: 0.10189 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 21
Available CPU time on rank 0: 2.13969 seconds
Available CPU time on all ranks: 2.13969 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
1 Minute Load average : 1 1.43 1.43
status:Threads : 1 2.00 2.00
status:VmData kB : 1 4.16e+05 4.16e+05
status:VmExe kB : 1 8.00 8.00
status:VmHWM kB : 1 4.28e+04 4.28e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 1.81e+05 1.81e+05
status:VmPTE kB : 1 476.00 476.00
status:VmPeak kB : 1 8.56e+05 8.56e+05
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 4.28e+04 4.28e+04
status:VmSize kB : 1 7.90e+05 7.90e+05
status:VmStk kB : 1 160.00 160.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 1.00 1.00
status:voluntary_ctxt_switches : 1 10.00 10.00
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 0.10 0.10
OS Thread: APEX pthread wrapper : 21 0.03 0.71
foo : 16 0.03 0.40
someThread : 4 0.05 0.20
Main Thread: APEX Pthread Wrapper : 1 0.10 0.10
int apex_preload_main(int, char **, char **) : 1 0.10 0.10
pthread_join : 4 0.02 0.07
pthread_create : 21 0.00 0.00
pthread_detach : 17 0.00 0.00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 85
We can see much more detail now - the APEX library has wrappers for some pthread_*
functions, allowing it to capture more information. In this case. we now see timers for pthread_create
, as well as a timer around the top-level functions called by the spawned threads, someThread
and foo
. APEX also includes some top-level timers for each thread encountered in the system, both the Main Thread
and the OS Thread
s. For completeness/convenience, we also see wrappers around pthread_join
and pthread_detach
, providing additional insight.
Let's now get more information from APEX, by requesting detailed screen output:
kehuck1@uan01:~/src/apex-tutorial> export APEX_SCREEN_OUTPUT_DETAIL=1
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread ./build/bin/pthread_c
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
Start Date/Time: 25/02/2023 21:53:56
Elapsed time: 0.102441 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 21
Available CPU time on rank 0: 2.15125 seconds
Available CPU time on all ranks: 2.15125 seconds
Counter : #samp | minimum | mean | maximum | stddev
---------------------------------------------------------------------------------------------------------
1 Minute Load average : 1 1.15 1.15 1.15 0.00
status:Threads : 1 2.00 2.00 2.00 0.00
status:VmData kB : 1 4.16e+05 4.16e+05 4.16e+05 0.00
status:VmExe kB : 1 8.00 8.00 8.00 0.00
status:VmHWM kB : 1 4.34e+04 4.34e+04 4.34e+04 0.00
status:VmLck kB : 1 0.00 0.00 0.00 0.00
status:VmLib kB : 1 1.81e+05 1.81e+05 1.81e+05 0.00
status:VmPTE kB : 1 492.00 492.00 492.00 0.00
status:VmPeak kB : 1 8.56e+05 8.56e+05 8.56e+05 0.00
status:VmPin kB : 1 0.00 0.00 0.00 0.00
status:VmRSS kB : 1 4.34e+04 4.34e+04 4.34e+04 0.00
status:VmSize kB : 1 7.90e+05 7.90e+05 7.90e+05 0.00
status:VmStk kB : 1 160.00 160.00 160.00 0.00
status:VmSwap kB : 1 0.00 0.00 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 1.00 1.00 1.00 0.00
status:voluntary_ctxt_switches : 1 5.00 5.00 5.00 0.00
---------------------------------------------------------------------------------------------------------
CPU Timers : #calls| #yields| #thread| mean | total | tot/thr | %total | %wall
--------------------------------------------------------------------------------------------------------------------------------
APEX MAIN : 1 0 1 0.10 0.10 0.10 100.0 100.0
OS Thread: APEX pthread wrapper : 21 0 20 0.03 0.71 0.04 33.0 34.6
foo : 16 0 15 0.03 0.40 0.03 18.7 26.2
someThread : 4 0 3 0.05 0.20 0.07 9.5 66.3
Main Thread: APEX Pthread Wrapper : 1 0 1 0.10 0.10 0.10 4.8 100.0
int apex_preload_main(int, char **, char **) : 1 0 1 0.10 0.10 0.10 4.8 99.9
pthread_join : 4 0 1 0.02 0.07 0.07 3.3 69.5
pthread_create : 21 0 5 0.00 0.00 0.00 0.2 0.8
pthread_detach : 17 0 4 0.00 0.00 0.00 0.0 0.0
--------------------------------------------------------------------------------------------------------------------------------
APEX Idle : 0.56 25.8
--------------------------------------------------------------------------------------------------------------------------------
Total timers : 85
We see much more detailed information. For real applications we will likely want to save this data for later analysis, so APEX provides a few output formats to choose from:
- CSV output with
--apex:csv
- TAU profile output with
--apex:tau
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread --apex:csv ./build/bin/pthread_c
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 3 working!
someThread : Thread 2 working!
...
Writing: .//apex_profiles.csv
Using the CSV output option, we can then post-process the output data with the apex-summary.py
Python script. This python script has a few commonly installed module dependencies, such as pandas
and numpy
.
kehuck1@uan01:~/src/apex-tutorial> apex-summary.py --help
usage: apex-summary.py [-h] [--filename FILENAME] [--counters] [--timers] [--tau] [--other]
[--limit N] [--agg A] [--sort C]
Post-process APEX flat profiles.
optional arguments:
-h, --help show this help message and exit
--filename FILENAME The filename to parse (default: ./apex_profiles.csv)
--counters Print the counter data (default: false)
--timers Print the timer data (default: false)
--tau Convert to TAU profiles (default: false)
--other Aggregate all other timers and show value (default: false)
--limit N Limit timers to top N timers (default: 30)
--agg A Aggregation operation for timers and counters (default: mean)
--sort C Column to sort timers (default: tot/thr)
Sample output from apex-summary.py
:
kehuck1@uan01:~/src/apex-tutorial> apex-summary.py --counters --timers
----------------------------------------------------------------------------------------------------
APEX Counters aggregated by mean
----------------------------------------------------------------------------------------------------
samples minimum mean maximum stddev
Counter
1 Minute Load average 1.00 3.00 3.00 3.00 0.00
status:Threads 1.00 2.00 2.00 2.00 0.00
status:VmData kB 1.00 415540.00 415540.00 415540.00 0.00
status:VmExe kB 1.00 8.00 8.00 8.00 0.00
status:VmHWM kB 1.00 42860.00 42860.00 42860.00 0.00
status:VmLck kB 1.00 0.00 0.00 0.00 0.00
status:VmLib kB 1.00 181316.00 181316.00 181316.00 0.00
status:VmPTE kB 1.00 484.00 484.00 484.00 0.00
status:VmPeak kB 1.00 855608.00 855608.00 855608.00 0.00
status:VmPin kB 1.00 0.00 0.00 0.00 0.00
status:VmRSS kB 1.00 42860.00 42860.00 42860.00 0.00
status:VmSize kB 1.00 790072.00 790072.00 790072.00 0.00
status:VmStk kB 1.00 160.00 160.00 160.00 0.00
status:VmSwap kB 1.00 0.00 0.00 0.00 0.00
status:nonvoluntary_ctxt_switches 1.00 1.00 1.00 1.00 0.00
status:voluntary_ctxt_switches 1.00 5.00 5.00 5.00 0.00
----------------------------------------------------------------------------------------------------
Top 30 APEX Timers sorted by tot/thr aggregated by mean
----------------------------------------------------------------------------------------------------
total calls tot/call ... tot/thr %total %wall
Timer ...
APEX MAIN 0.10 1.00 0.10 ... 0.10 100.00 100.00
Main Thread: APEX Pthread Wrapper 0.10 1.00 0.10 ... 0.10 99.99 99.99
int apex_preload_main(int, char **, char **) 0.10 1.00 0.10 ... 0.10 99.90 99.90
execute_native_thread_routine 0.10 1.00 0.10 ... 0.10 99.87 99.87
pthread_join 0.07 4.00 0.02 ... 0.07 69.47 69.47
someThread 0.20 4.00 0.05 ... 0.07 199.22 66.41
OS Thread: APEX pthread wrapper 0.71 21.00 0.03 ... 0.04 693.13 34.66
foo 0.40 16.00 0.03 ... 0.03 393.73 26.25
pthread_create 0.00 21.00 0.00 ... 0.00 3.30 0.66
pthread_detach 0.00 17.00 0.00 ... 0.00 0.01 0.00
[10 rows x 8 columns]
APEX will also capture the dependency between tasks in the sample application. To capture and visualize the task dependency tree, we'll add the --apex:tasktree
option and use the apex-treesummary.py
analysis script to visualize the resulting CSV file.
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread --apex:tasktree ./build/bin/pthread_c
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 3 working!
someThread : Thread 2 working!
...
Writing: .//apex_tasktree.csv
To see the available options with apex-treesummary.py
, use the --help
flag:
kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --help
usage: apex-treesummary.py [-h] [--filename FILENAME] [--tau] [--dot] [--ascii] [--tlimit N]
[--dlimit d] [--qlimit Q] [--rlimit R] [--keep K] [--drop D] [--agg A]
[--sort C]
Post-process APEX flat profiles.
optional arguments:
-h, --help show this help message and exit
--filename FILENAME The filename to parse (default: ./apex_tasktree.csv)
--tau Convert to TAU profiles (default: false)
--dot Generate DOT file for graphviz (default: false)
--ascii Output ASCII tree output (default: true)
--tlimit N Limit timers to those with value over N (default: 0)
--dlimit d Limit tree to depth of d (default: none)
--qlimit Q Limit timers to those in the Q quantile (default: None)
--rlimit R Limit data to those in the first R ranks (default: all ranks)
--keep K Keep only subtree starting at K (default: "APEX MAIN")
--drop D Drop subtree starting at D (default: drop nothing)
--agg A Aggregation operation for timers and counters (default: mean)Accepted methods:
count Returns count for each group
size Returns size for each group
sum Returns total sum for each group
mean Returns mean for each group. Same as average()
average Returns average for each group. Same as mean()
std Returns standard deviation for each group
var Return var for each group
sem Standard error of the mean of groups
min Returns minimum value for each group
max Returns maximum value for each group
first Returns first value for each group
last Returns last value for each group
nth Returns nth value for each group
--sort C Column to sort timers (default: tot/thr)
To render an ASCII tree, we will do the following:
kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --ascii
Reading tasktree...
Read 15 rows
Found 0 ranks, with max graph node index of 14 and depth of 8
building common tree...
Rank 0 ...
1-> 0.102 - 100.000% [1] {min=0.102, max=0.102, mean=0.102, threads=1} APEX MAIN
1 |-> 0.102 - 99.980% [1] {min=0.102, max=0.102, mean=0.102, threads=1} Main Thread: APEX Pthread Wrapper
1 | |-> 0.102 - 99.884% [1] {min=0.102, max=0.102, mean=0.102, threads=1} int apex_preload_main(int, char **, char **)
1 | | |-> 0.071 - 69.414% [4] {min=0.071, max=0.071, mean=0.018, threads=1} pthread_join
1 | | |-> 0.001 - 0.810% [4] {min=0.001, max=0.001, mean=0.000, threads=1} pthread_create
1 | | | |-> 0.203 - 199.004% [4] {min=0.203, max=0.203, mean=0.051, threads=4} OS Thread: APEX pthread wrapper
1 | | | | |-> 0.203 - 198.968% [4] {min=0.203, max=0.203, mean=0.051, threads=4} someThread
1 | | | | | |-> 0.002 - 1.937% [16] {min=0.002, max=0.002, mean=0.000, threads=4} pthread_create
1 | | | | | | |-> 0.402 - 394.376% [16] {min=0.402, max=0.402, mean=0.025, threads=16} OS Thread: APEX pthread wrapper
1 | | | | | | | |-> 0.402 - 394.114% [16] {min=0.402, max=0.402, mean=0.025, threads=16} foo
1 | | | | | |-> 0.000 - 0.008% [16] {min=0.000, max=0.000, mean=0.000, threads=4} pthread_detach
1 |-> 0.000 - 0.138% [1] {min=0.000, max=0.000, mean=0.000, threads=1} pthread_create
1 | |-> 0.102 - 100.000% [1] {min=0.102, max=0.102, mean=0.102, threads=1} OS Thread: APEX pthread wrapper
1 | | |-> 0.102 - 99.964% [1] {min=0.102, max=0.102, mean=0.102, threads=1} execute_native_thread_routine
1 |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} pthread_detach
16 total graph nodes
Task tree also written to tasktree.txt.
To generate a DOT file that can be rendered with Graphviz dot
:
kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --dot
Reading tasktree...
Read 15 rows
Found 0 ranks, with max graph node index of 14 and depth of 8
building common tree...
Rank 0 ...
Computing new stats...
Building dot file
done.
We then render the dot file (requires a graphviz installation):
dot -Tpdf -O tasktree.dot
APEX provides a couple output formats for tracing data:
- OTF2
- Google Trace Events JSON (Perfetto)
- Perfetto native output (experimental in APEX)
Let's see what we get when we trace the example:
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread --apex:gtrace ./build/bin/pthread_c
We should see a trace_events.0.json.gz
file written to the current working directory. To visualize this trace, open it in Perfetto, which is a web-based trace visualization tool that can be accessed at https://ui.perfetto.dev/
in your web browser (Google Chrome recommended).
Open the URL, and load the trace from your local computer:
After opening the trace you should see all threads from the execution, and when selecting on a thread you should see the flow event tracing it back to the function call that created it:
While APEX can extract a lot of information about the program, there are several blind spots. For example, the doWork()
function is not measured. APEX provides an API for instrumentation see Reference Manual, C API. In the meantime, let's look at the source code for the apex_pthread_c
program to see what has been added:
Running this example without apex_exec
will provide some useful information, since APEX is linked into the executable:
kehuck1@uan01:~/src/apex-tutorial> ./build/bin/apex_pthread_c
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
Start Date/Time: 25/02/2023 22:58:56
Elapsed time: 0.101764 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 21
Available CPU time on rank 0: 2.13703 seconds
Available CPU time on all ranks: 2.13703 seconds
Counter : #samp | minimum | mean | maximum | stddev
---------------------------------------------------------------------------------------------------------
1 Minute Load average : 1 2.33 2.33 2.33 0.00
status:Threads : 1 2.00 2.00 2.00 0.00
status:VmData kB : 1 4.16e+05 4.16e+05 4.16e+05 0.00
status:VmExe kB : 1 8.00 8.00 8.00 0.00
status:VmHWM kB : 1 4.29e+04 4.29e+04 4.29e+04 0.00
status:VmLck kB : 1 0.00 0.00 0.00 0.00
status:VmLib kB : 1 1.81e+05 1.81e+05 1.81e+05 0.00
status:VmPTE kB : 1 488.00 488.00 488.00 0.00
status:VmPeak kB : 1 8.56e+05 8.56e+05 8.56e+05 0.00
status:VmPin kB : 1 0.00 0.00 0.00 0.00
status:VmRSS kB : 1 4.29e+04 4.29e+04 4.29e+04 0.00
status:VmSize kB : 1 7.90e+05 7.90e+05 7.90e+05 0.00
status:VmStk kB : 1 160.00 160.00 160.00 0.00
status:VmSwap kB : 1 0.00 0.00 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 1.00 1.00 1.00 0.00
status:voluntary_ctxt_switches : 1 44.00 44.00 44.00 0.00
---------------------------------------------------------------------------------------------------------
CPU Timers : #calls| #yields| #thread| mean | total | tot/thr | %total | %wall
--------------------------------------------------------------------------------------------------------------------------------
APEX MAIN : 1 0 1 0.10 0.10 0.10 100.0 100.0
doWork : 27 0 21 0.02 0.63 0.03 29.6 29.6
foo : 16 0 15 0.03 0.40 0.03 18.8 26.3
someThread : 4 0 3 0.05 0.20 0.07 9.5 66.6
int apex_preload_main(int, char **, char **) : 1 0 1 0.10 0.10 0.10 4.8 99.9
main : 1 0 1 0.10 0.10 0.10 4.8 99.9
--------------------------------------------------------------------------------------------------------------------------------
APEX Idle : 0.70 32.6
--------------------------------------------------------------------------------------------------------------------------------
Total timers : 49
However, including --apex:pthread
will provide more useful data, and will allow us to capture the task dependency graph:
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread --apex:tasktree ./build/bin/apex_pthread_c
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
Start Date/Time: 25/02/2023 23:00:32
Elapsed time: 0.102636 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 21
Available CPU time on rank 0: 2.15537 seconds
Available CPU time on all ranks: 2.15537 seconds
Counter : #samp | minimum | mean | maximum | stddev
---------------------------------------------------------------------------------------------------------
1 Minute Load average : 1 5.30 5.30 5.30 0.00
status:Threads : 1 2.00 2.00 2.00 0.00
status:VmData kB : 1 4.16e+05 4.16e+05 4.16e+05 0.00
status:VmExe kB : 1 8.00 8.00 8.00 0.00
status:VmHWM kB : 1 4.28e+04 4.28e+04 4.28e+04 0.00
status:VmLck kB : 1 0.00 0.00 0.00 0.00
status:VmLib kB : 1 1.81e+05 1.81e+05 1.81e+05 0.00
status:VmPTE kB : 1 484.00 484.00 484.00 0.00
status:VmPeak kB : 1 8.56e+05 8.56e+05 8.56e+05 0.00
status:VmPin kB : 1 0.00 0.00 0.00 0.00
status:VmRSS kB : 1 4.28e+04 4.28e+04 4.28e+04 0.00
status:VmSize kB : 1 7.90e+05 7.90e+05 7.90e+05 0.00
status:VmStk kB : 1 160.00 160.00 160.00 0.00
status:VmSwap kB : 1 0.00 0.00 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 2.00 2.00 2.00 0.00
status:voluntary_ctxt_switches : 1 6.00 6.00 6.00 0.00
---------------------------------------------------------------------------------------------------------
CPU Timers : #calls| #yields| #thread| mean | total | tot/thr | %total | %wall
--------------------------------------------------------------------------------------------------------------------------------
APEX MAIN : 1 0 1 0.10 0.10 0.10 100.0 100.0
OS Thread: APEX pthread wrapper : 21 0 20 0.03 0.71 0.04 32.9 34.6
doWork : 27 0 21 0.02 0.63 0.03 29.4 29.4
foo : 16 0 15 0.03 0.40 0.03 18.7 26.2
someThread : 4 0 3 0.05 0.20 0.07 9.4 66.0
execute_native_thread_routine : 1 0 1 0.10 0.10 0.10 4.8 100.0
Main Thread: APEX Pthread Wrapper : 1 0 1 0.10 0.10 0.10 4.8 100.0
int apex_preload_main(int, char **, char **) : 1 0 1 0.10 0.10 0.10 4.8 99.9
main : 1 0 1 0.10 0.10 0.10 4.8 99.9
pthread_join : 4 0 1 0.02 0.07 0.07 3.3 69.0
pthread_create : 21 0 5 0.00 0.00 0.00 0.2 0.7
pthread_detach : 17 0 4 0.00 0.00 0.00 0.0 0.0
--------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------
Total timers : 114
Writing: .//apex_tasktree.csv
kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --ascii --dot
Reading tasktree...
Read 21 rows
Found 0 ranks, with max graph node index of 20 and depth of 12
building common tree...
Rank 0 ...
1-> 0.103 - 100.000% [1] {min=0.103, max=0.103, mean=0.103, threads=1} APEX MAIN
1 |-> 0.103 - 99.979% [1] {min=0.103, max=0.103, mean=0.103, threads=1} Main Thread: APEX Pthread Wrapper
1 | |-> 0.103 - 99.898% [1] {min=0.103, max=0.103, mean=0.103, threads=1} int apex_preload_main(int, char **, char **)
1 | | |-> 0.102 - 99.857% [1] {min=0.102, max=0.102, mean=0.102, threads=1} main
1 | | | |-> 0.071 - 69.025% [4] {min=0.071, max=0.071, mean=0.018, threads=1} pthread_join
1 | | | |-> 0.030 - 29.419% [3] {min=0.030, max=0.030, mean=0.010, threads=1} doWork
1 | | | |-> 0.001 - 1.347% [4] {min=0.001, max=0.001, mean=0.000, threads=1} pthread_create
1 | | | | |-> 0.203 - 198.111% [4] {min=0.203, max=0.203, mean=0.051, threads=4} OS Thread: APEX pthread wrapper
1 | | | | | |-> 0.203 - 198.066% [4] {min=0.203, max=0.203, mean=0.051, threads=4} someThread
1 | | | | | | |-> 0.203 - 198.014% [4] {min=0.203, max=0.203, mean=0.051, threads=4} someThread
1 | | | | | | | |-> 0.201 - 195.655% [8] {min=0.201, max=0.201, mean=0.025, threads=4} doWork
1 | | | | | | | |-> 0.002 - 1.995% [16] {min=0.002, max=0.002, mean=0.000, threads=4} pthread_create
1 | | | | | | | | |-> 0.403 - 393.128% [16] {min=0.403, max=0.403, mean=0.025, threads=16} OS Thread: APEX pthread wrapper
1 | | | | | | | | | |-> 0.403 - 392.815% [16] {min=0.403, max=0.403, mean=0.025, threads=16} foo
1 | | | | | | | | | | |-> 0.403 - 392.311% [16] {min=0.403, max=0.403, mean=0.025, threads=16} foo
1 | | | | | | | | | | | |-> 0.402 - 391.693% [16] {min=0.402, max=0.402, mean=0.025, threads=16} doWork
1 | | | | | | | |-> 0.000 - 0.008% [16] {min=0.000, max=0.000, mean=0.000, threads=4} pthread_detach
1 |-> 0.000 - 0.142% [1] {min=0.000, max=0.000, mean=0.000, threads=1} pthread_create
1 | |-> 0.103 - 100.040% [1] {min=0.103, max=0.103, mean=0.103, threads=1} OS Thread: APEX pthread wrapper
1 | | |-> 0.103 - 99.983% [1] {min=0.103, max=0.103, mean=0.103, threads=1} execute_native_thread_routine
1 |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} pthread_detach
22 total graph nodes
Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
An observant user will notice that the someFunction
and foo
functions were captured twice - once by the instrumentation and once by the pthread_create
wrapper.
A similar example using compiler-based instrumentation (i.e. -finstrument-functions
) is provided with the pthread_inst_c
executable. It is the same source code as pthread.c, however it was compiled with compiler instrumentation flags (where supported, see CMakeLists.txt). Running this example without apex_exec
should give the same output as the pthread_c
example. However, running with apex_exec
will provide implementations of the __cyg_profile_func_enter
and __cyg_profile_func_exit
symbols, and APEX will time all instrumented functions:
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread --apex:tasktree ./build/bin/pthread_inst_c
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
Start Date/Time: 25/02/2023 23:20:47
Elapsed time: 0.101867 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 21
Available CPU time on rank 0: 2.13921 seconds
Available CPU time on all ranks: 2.13921 seconds
Counter : #samp | minimum | mean | maximum | stddev
---------------------------------------------------------------------------------------------------------
1 Minute Load average : 1 1.52 1.52 1.52 0.00
status:Threads : 1 2.00 2.00 2.00 0.00
status:VmData kB : 1 4.16e+05 4.16e+05 4.16e+05 0.00
status:VmExe kB : 1 8.00 8.00 8.00 0.00
status:VmHWM kB : 1 4.25e+04 4.25e+04 4.25e+04 0.00
status:VmLck kB : 1 0.00 0.00 0.00 0.00
status:VmLib kB : 1 1.81e+05 1.81e+05 1.81e+05 0.00
status:VmPTE kB : 1 492.00 492.00 492.00 0.00
status:VmPeak kB : 1 8.56e+05 8.56e+05 8.56e+05 0.00
status:VmPin kB : 1 0.00 0.00 0.00 0.00
status:VmRSS kB : 1 4.25e+04 4.25e+04 4.25e+04 0.00
status:VmSize kB : 1 7.90e+05 7.90e+05 7.90e+05 0.00
status:VmStk kB : 1 160.00 160.00 160.00 0.00
status:VmSwap kB : 1 0.00 0.00 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 1.00 1.00 1.00 0.00
status:voluntary_ctxt_switches : 1 7.00 7.00 7.00 0.00
---------------------------------------------------------------------------------------------------------
CPU Timers : #calls| #yields| #thread| mean | total | tot/thr | %total | %wall
--------------------------------------------------------------------------------------------------------------------------------
APEX MAIN : 1 0 1 0.10 0.10 0.10 100.0 100.0
foo : 32 0 16 0.03 0.80 0.05 37.6 49.4
OS Thread: APEX pthread wrapper : 21 0 20 0.03 0.71 0.04 33.1 34.8
doWork : 27 0 21 0.02 0.63 0.03 29.6 29.6
someThread : 8 0 4 0.05 0.41 0.10 19.0 100.0
Main Thread: APEX Pthread Wrapper : 1 0 1 0.10 0.10 0.10 4.8 100.0
int apex_preload_main(int, char **, char **) : 1 0 1 0.10 0.10 0.10 4.8 99.9
main : 1 0 1 0.10 0.10 0.10 4.8 99.9
pthread_join : 4 0 1 0.02 0.07 0.07 3.3 69.7
pthread_create : 21 0 5 0.00 0.00 0.00 0.1 0.6
pthread_detach : 17 0 4 0.00 0.00 0.00 0.0 0.0
--------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------
Total timers : 133
Writing: .//apex_tasktree.csv
APEX tutorial, © Copyright 2023, University of Oregon. For more information on APEX, see https://github.com/UO-OACISS/apex