Skip to content

C POSIX threads

Kevin Huck edited this page Feb 26, 2023 · 20 revisions

The following example will introduce APEX using C POSIX threads.

Source Code

This example will use the following programs:

Running the simple example

To begin, we'll run just the simple pthread_c executable, assuming that the CMake build directory is called build:

kehuck1@uan01:~/src/apex-tutorial> ./build/bin/pthread_c
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!

As you can see, there isn't a lot going on, just a main function that spawns 4 threads, and each of those 4 treads will spawn 4 more threads, and detach from them. That means the parent thread considers these tasks "fire and forget" tasks, and will not wait for them to finish or expect a return value.

Running the simple example with APEX

APEX includes a program called apex_exec, which is a convenience bash script for preloading the APEX libraries and dependencies, as well as setting any necessary environment variables to control APEX. To see all the options possible with apex_exec, run it without arguments or with the --apex:help flag:

kehuck1@uan01:~/src/apex-tutorial> apex_exec

Usage:
apex_exec <APEX options> executable <executable options>

where APEX options are zero or more of:
    --apex:help                   show this usage message
    --apex:debug                  run with APEX in debugger
    --apex:verbose                enable verbose list of APEX environment variables
    --apex:screen                 enable screen text output (on by default)
    --apex:quiet                  disable screen text output
    --apex:csv                    enable csv text output
    --apex:tau                    enable tau profile output
    --apex:taskgraph              enable taskgraph output
                                  (graphviz required for post-processing)
    --apex:tasktree               enable tasktree output
                                  (graphviz required for post-processing)
    --apex:throttle               throttle short-lived timers to reduce overhead (default: off)
    --apex:throttle_calls <value> minimum number of calls before throttling (default: 1000)
    --apex:throttle_per <value>   minimum timer duration in microseconds (default: 10)
    --apex:otf2                   enable OTF2 trace output
    --apex:otf2path <value>       specify location of OTF2 archive
                                  (default: ./OTF2_archive)
    --apex:otf2name <value>       specify name of OTF2 file (default: APEX)
    --apex:gtrace                 enable Google Trace Events output (deprecated)
    --apex:pftrace                enable Perfetto Trace output
    --apex:scatter                enable scatterplot output
                                  (python required for post-processing)
    --apex:openacc                enable OpenACC support
    --apex:kokkos                 enable Kokkos support
    --apex:kokkos_tuning          enable Kokkos runtime autotuning support
    --apex:kokkos_fence           enable Kokkos fences for async kernels
    --apex:raja                   enable RAJA support
    --apex:pthread                enable pthread wrapper support
    --apex:gpu_memory             enable CPU memory wrapper support
    --apex:cpu_memory             enable GPU memory wrapper support
    --apex:untied                 enable tasks to migrate cores/OS threads
                                  during execution (not compatible with trace output)
    --apex:cuda                   enable CUDA/CUPTI measurement (default: off)
    --apex:cuda_counters          enable CUDA/CUPTI counter support (default: off)
    --apex:cuda_driver            enable CUDA driver API callbacks (default: off)
    --apex:cuda_details           enable per-kernel statistics where available (default: off)
    --apex:hip                    enable HIP/ROCTracer measurement (default: off)
    --apex:hip_metrics            enable HIP/ROCProfiler metric support (default: off)
    --apex:hip_counters           enable HIP/ROCTracer counter support (default: off)
    --apex:hip_driver             enable HIP/ROCTracer KSA driver API callbacks (default: off)
    --apex:hip_details            enable per-kernel statistics where available (default: off)
    --apex:monitor_gpu            enable GPU monitoring services (CUDA NVML, ROCm SMI)
    --apex:level0                 enable OneAPI Level0 measurement (default: off)
    --apex:cpuinfo                enable sampling of /proc/cpuinfo (Linux only)
    --apex:meminfo                enable sampling of /proc/meminfo (Linux only)
    --apex:net                    enable sampling of /proc/net/dev (Linux only)
    --apex:status                 enable sampling of /proc/self/status (Linux only)
    --apex:io                     enable sampling of /proc/self/io (Linux only)
    --apex:period <value>         specify frequency of OS/HW sampling
    --apex:ompt                   enable OpenMP profiling (requires runtime support)
    --apex:ompt_simple            only enable OpenMP Tools required events
    --apex:ompt_details           enable all OpenMP Tools events
    --apex:source                 resolve function, file and line info for address lookups with binutils
                                  (default: function only)
    --apex:preload <lib>          extra libraries to load with LD_PRELOAD _before_ APEX libraries
                                  (LD_PRELOAD value is added _after_ APEX libraries)
    --apex:postprocess            run post-process scripts (graphviz, python) on output data after exit

As you can see, there are quite a lot of flags/options, and not all of them are relevant to our example or our platform. But we will explore a few of them in this example. The APEX flags are all prefixed with apex: to distinguish them from application flags, and can be added anywhere on the command line after apex_exec - they will be consumed by the script and not passed to the executable.

Let's see what happens when we run with apex_exec:

kehuck1@uan01:~/src/apex-tutorial> which apex_exec
/users/kehuck1/install/apex/bin/apex_exec
kehuck1@uan01:~/src/apex-tutorial> apex_exec ./build/bin/pthread_c
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!

Start Date/Time: 25/02/2023 21:29:39
Elapsed time: 0.10214 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.10214 seconds
Available CPU time on all ranks: 0.10214 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      1     4.73     4.73
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 4.16e+05 4.16e+05
                                     status:VmExe kB :      1     8.00     8.00
                                     status:VmHWM kB :      1 4.24e+04 4.24e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 1.81e+05 1.81e+05
                                     status:VmPTE kB :      1   484.00   484.00
                                    status:VmPeak kB :      1 8.56e+05 8.56e+05
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 4.24e+04 4.24e+04
                                    status:VmSize kB :      1 7.90e+05 7.90e+05
                                     status:VmStk kB :      1   160.00   160.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     2.00     2.00
                      status:voluntary_ctxt_switches :      1    51.00    51.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.10     0.10
        int apex_preload_main(int, char **, char **) :      1     0.10     0.10
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 1

We can see from the example output that we know that APEX has been engaged, due to the APEX header that prints APEX configuration information as a sanity check. Then the regular program output is generated, followed by a performance summary text output. The Counter section of the output shows third-person hardware/operating system measurements, and the CPU Timers section shows two timers that don't offer a lot of detail. APEX includes a wrapper for the __libc_start_main() function that captures the time spent in the whole program, and there is an APEX MAIN timer that captures time spent between APEX initialization and finalization (not included in the "Total timers" count).

Running the simple example with APEX and POSIX thread support

Let's see what happens when we add the --apex:pthread flag (as mentioned above, for convenience we can add the flags anywhere after apex_exec - we can call apex_exec --apex:pthread ./build/bin/pthread_c or apex_exec ./build/bin/pthread_c --apex:pthread):

kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread ./build/bin/pthread_c
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!

Start Date/Time: 25/02/2023 21:47:58
Elapsed time: 0.10189 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 21
Available CPU time on rank 0: 2.13969 seconds
Available CPU time on all ranks: 2.13969 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      1     1.43     1.43
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 4.16e+05 4.16e+05
                                     status:VmExe kB :      1     8.00     8.00
                                     status:VmHWM kB :      1 4.28e+04 4.28e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 1.81e+05 1.81e+05
                                     status:VmPTE kB :      1   476.00   476.00
                                    status:VmPeak kB :      1 8.56e+05 8.56e+05
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 4.28e+04 4.28e+04
                                    status:VmSize kB :      1 7.90e+05 7.90e+05
                                     status:VmStk kB :      1   160.00   160.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     1.00     1.00
                      status:voluntary_ctxt_switches :      1    10.00    10.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.10     0.10
                     OS Thread: APEX pthread wrapper :     21     0.03     0.71
                                                 foo :     16     0.03     0.40
                                          someThread :      4     0.05     0.20
                   Main Thread: APEX Pthread Wrapper :      1     0.10     0.10
        int apex_preload_main(int, char **, char **) :      1     0.10     0.10
                                        pthread_join :      4     0.02     0.07
                                      pthread_create :     21     0.00     0.00
                                      pthread_detach :     17     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 85

We can see much more detail now - the APEX library has wrappers for some pthread_* functions, allowing it to capture more information. In this case. we now see timers for pthread_create, as well as a timer around the top-level functions called by the spawned threads, someThread and foo. APEX also includes some top-level timers for each thread encountered in the system, both the Main Thread and the OS Threads. For completeness/convenience, we also see wrappers around pthread_join and pthread_detach, providing additional insight.

Let's now get more information from APEX, by requesting detailed screen output:

kehuck1@uan01:~/src/apex-tutorial> export APEX_SCREEN_OUTPUT_DETAIL=1
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread ./build/bin/pthread_c
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!

Start Date/Time: 25/02/2023 21:53:56
Elapsed time: 0.102441 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 21
Available CPU time on rank 0: 2.15125 seconds
Available CPU time on all ranks: 2.15125 seconds

Counter                                              :  #samp | minimum |    mean  |  maximum |  stddev
---------------------------------------------------------------------------------------------------------
                               1 Minute Load average :      1       1.15       1.15       1.15       0.00
                                      status:Threads :      1       2.00       2.00       2.00       0.00
                                    status:VmData kB :      1   4.16e+05   4.16e+05   4.16e+05       0.00
                                     status:VmExe kB :      1       8.00       8.00       8.00       0.00
                                     status:VmHWM kB :      1   4.34e+04   4.34e+04   4.34e+04       0.00
                                     status:VmLck kB :      1       0.00       0.00       0.00       0.00
                                     status:VmLib kB :      1   1.81e+05   1.81e+05   1.81e+05       0.00
                                     status:VmPTE kB :      1     492.00     492.00     492.00       0.00
                                    status:VmPeak kB :      1   8.56e+05   8.56e+05   8.56e+05       0.00
                                     status:VmPin kB :      1       0.00       0.00       0.00       0.00
                                     status:VmRSS kB :      1   4.34e+04   4.34e+04   4.34e+04       0.00
                                    status:VmSize kB :      1   7.90e+05   7.90e+05   7.90e+05       0.00
                                     status:VmStk kB :      1     160.00     160.00     160.00       0.00
                                    status:VmSwap kB :      1       0.00       0.00       0.00       0.00
                   status:nonvoluntary_ctxt_switches :      1       1.00       1.00       1.00       0.00
                      status:voluntary_ctxt_switches :      1       5.00       5.00       5.00       0.00
---------------------------------------------------------------------------------------------------------

CPU Timers                                           : #calls| #yields| #thread|     mean |    total |  tot/thr | %total | %wall
--------------------------------------------------------------------------------------------------------------------------------
                                           APEX MAIN :      1        0        1       0.10       0.10       0.10   100.0   100.0
                     OS Thread: APEX pthread wrapper :     21        0       20       0.03       0.71       0.04    33.0    34.6
                                                 foo :     16        0       15       0.03       0.40       0.03    18.7    26.2
                                          someThread :      4        0        3       0.05       0.20       0.07     9.5    66.3
                   Main Thread: APEX Pthread Wrapper :      1        0        1       0.10       0.10       0.10     4.8   100.0
        int apex_preload_main(int, char **, char **) :      1        0        1       0.10       0.10       0.10     4.8    99.9
                                        pthread_join :      4        0        1       0.02       0.07       0.07     3.3    69.5
                                      pthread_create :     21        0        5       0.00       0.00       0.00     0.2     0.8
                                      pthread_detach :     17        0        4       0.00       0.00       0.00     0.0     0.0
--------------------------------------------------------------------------------------------------------------------------------

                                           APEX Idle :                           0.56    25.8

--------------------------------------------------------------------------------------------------------------------------------
                                        Total timers : 85

Running the simple example with APEX output files

We see much more detailed information. For real applications we will likely want to save this data for later analysis, so APEX provides a few output formats to choose from:

  • CSV output with --apex:csv
  • TAU profile output with --apex:tau
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread --apex:csv ./build/bin/pthread_c
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 3 working!
someThread : Thread 2 working!
...
Writing: .//apex_profiles.csv

Using apex-summary.py

Using the CSV output option, we can then post-process the output data with the apex-summary.py Python script. This python script has a few commonly installed module dependencies, such as pandas and numpy.

kehuck1@uan01:~/src/apex-tutorial> apex-summary.py --help
usage: apex-summary.py [-h] [--filename FILENAME] [--counters] [--timers] [--tau] [--other]
                       [--limit N] [--agg A] [--sort C]

Post-process APEX flat profiles.

optional arguments:
  -h, --help           show this help message and exit
  --filename FILENAME  The filename to parse (default: ./apex_profiles.csv)
  --counters           Print the counter data (default: false)
  --timers             Print the timer data (default: false)
  --tau                Convert to TAU profiles (default: false)
  --other              Aggregate all other timers and show value (default: false)
  --limit N            Limit timers to top N timers (default: 30)
  --agg A              Aggregation operation for timers and counters (default: mean)
  --sort C             Column to sort timers (default: tot/thr)

Sample output from apex-summary.py:

kehuck1@uan01:~/src/apex-tutorial> apex-summary.py --counters --timers

----------------------------------------------------------------------------------------------------
APEX Counters aggregated by mean
----------------------------------------------------------------------------------------------------
                                   samples   minimum      mean   maximum  stddev
Counter
1 Minute Load average                 1.00      3.00      3.00      3.00    0.00
status:Threads                        1.00      2.00      2.00      2.00    0.00
status:VmData kB                      1.00 415540.00 415540.00 415540.00    0.00
status:VmExe kB                       1.00      8.00      8.00      8.00    0.00
status:VmHWM kB                       1.00  42860.00  42860.00  42860.00    0.00
status:VmLck kB                       1.00      0.00      0.00      0.00    0.00
status:VmLib kB                       1.00 181316.00 181316.00 181316.00    0.00
status:VmPTE kB                       1.00    484.00    484.00    484.00    0.00
status:VmPeak kB                      1.00 855608.00 855608.00 855608.00    0.00
status:VmPin kB                       1.00      0.00      0.00      0.00    0.00
status:VmRSS kB                       1.00  42860.00  42860.00  42860.00    0.00
status:VmSize kB                      1.00 790072.00 790072.00 790072.00    0.00
status:VmStk kB                       1.00    160.00    160.00    160.00    0.00
status:VmSwap kB                      1.00      0.00      0.00      0.00    0.00
status:nonvoluntary_ctxt_switches     1.00      1.00      1.00      1.00    0.00
status:voluntary_ctxt_switches        1.00      5.00      5.00      5.00    0.00

----------------------------------------------------------------------------------------------------
Top 30 APEX Timers sorted by tot/thr aggregated by mean
----------------------------------------------------------------------------------------------------
                                              total  calls  tot/call  ...  tot/thr  %total  %wall
Timer                                                                 ...
APEX MAIN                                      0.10   1.00      0.10  ...     0.10  100.00 100.00
Main Thread: APEX Pthread Wrapper              0.10   1.00      0.10  ...     0.10   99.99  99.99
int apex_preload_main(int, char **, char **)   0.10   1.00      0.10  ...     0.10   99.90  99.90
execute_native_thread_routine                  0.10   1.00      0.10  ...     0.10   99.87  99.87
pthread_join                                   0.07   4.00      0.02  ...     0.07   69.47  69.47
someThread                                     0.20   4.00      0.05  ...     0.07  199.22  66.41
OS Thread: APEX pthread wrapper                0.71  21.00      0.03  ...     0.04  693.13  34.66
foo                                            0.40  16.00      0.03  ...     0.03  393.73  26.25
pthread_create                                 0.00  21.00      0.00  ...     0.00    3.30   0.66
pthread_detach                                 0.00  17.00      0.00  ...     0.00    0.01   0.00

[10 rows x 8 columns]

APEX and task dependency trees

APEX will also capture the dependency between tasks in the sample application. To capture and visualize the task dependency tree, we'll add the --apex:tasktree option and use the apex-treesummary.py analysis script to visualize the resulting CSV file.

kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread --apex:tasktree ./build/bin/pthread_c
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 3 working!
someThread : Thread 2 working!
...
Writing: .//apex_tasktree.csv

Using apex-treesummary.py

To see the available options with apex-treesummary.py, use the --help flag:

kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --help
usage: apex-treesummary.py [-h] [--filename FILENAME] [--tau] [--dot] [--ascii] [--tlimit N]
                           [--dlimit d] [--qlimit Q] [--rlimit R] [--keep K] [--drop D] [--agg A]
                           [--sort C]

Post-process APEX flat profiles.

optional arguments:
  -h, --help           show this help message and exit
  --filename FILENAME  The filename to parse (default: ./apex_tasktree.csv)
  --tau                Convert to TAU profiles (default: false)
  --dot                Generate DOT file for graphviz (default: false)
  --ascii              Output ASCII tree output (default: true)
  --tlimit N           Limit timers to those with value over N (default: 0)
  --dlimit d           Limit tree to depth of d (default: none)
  --qlimit Q           Limit timers to those in the Q quantile (default: None)
  --rlimit R           Limit data to those in the first R ranks (default: all ranks)
  --keep K             Keep only subtree starting at K (default: "APEX MAIN")
  --drop D             Drop subtree starting at D (default: drop nothing)
  --agg A              Aggregation operation for timers and counters (default: mean)Accepted methods:
                         count   Returns count for each group
                         size    Returns size for each group
                         sum     Returns total sum for each group
                         mean    Returns mean for each group. Same as average()
                         average Returns average for each group. Same as mean()
                         std     Returns standard deviation for each group
                         var     Return var for each group
                         sem     Standard error of the mean of groups
                         min     Returns minimum value for each group
                         max     Returns maximum value for each group
                         first   Returns first value for each group
                         last    Returns last value for each group
                         nth     Returns nth value for each group
  --sort C             Column to sort timers (default: tot/thr)

To render an ASCII tree, we will do the following:

kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --ascii
Reading tasktree...
Read 15 rows
Found 0 ranks, with max graph node index of 14 and depth of 8
building common tree...
Rank 0 ...
1-> 0.102 - 100.000% [1] {min=0.102, max=0.102, mean=0.102, threads=1} APEX MAIN
1 |-> 0.102 - 99.980% [1] {min=0.102, max=0.102, mean=0.102, threads=1} Main Thread: APEX Pthread Wrapper
1 | |-> 0.102 - 99.884% [1] {min=0.102, max=0.102, mean=0.102, threads=1} int apex_preload_main(int, char **, char **)
1 | | |-> 0.071 - 69.414% [4] {min=0.071, max=0.071, mean=0.018, threads=1} pthread_join
1 | | |-> 0.001 - 0.810% [4] {min=0.001, max=0.001, mean=0.000, threads=1} pthread_create
1 | | | |-> 0.203 - 199.004% [4] {min=0.203, max=0.203, mean=0.051, threads=4} OS Thread: APEX pthread wrapper
1 | | | | |-> 0.203 - 198.968% [4] {min=0.203, max=0.203, mean=0.051, threads=4} someThread
1 | | | | | |-> 0.002 - 1.937% [16] {min=0.002, max=0.002, mean=0.000, threads=4} pthread_create
1 | | | | | | |-> 0.402 - 394.376% [16] {min=0.402, max=0.402, mean=0.025, threads=16} OS Thread: APEX pthread wrapper
1 | | | | | | | |-> 0.402 - 394.114% [16] {min=0.402, max=0.402, mean=0.025, threads=16} foo
1 | | | | | |-> 0.000 - 0.008% [16] {min=0.000, max=0.000, mean=0.000, threads=4} pthread_detach
1 |-> 0.000 - 0.138% [1] {min=0.000, max=0.000, mean=0.000, threads=1} pthread_create
1 | |-> 0.102 - 100.000% [1] {min=0.102, max=0.102, mean=0.102, threads=1} OS Thread: APEX pthread wrapper
1 | | |-> 0.102 - 99.964% [1] {min=0.102, max=0.102, mean=0.102, threads=1} execute_native_thread_routine
1 |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} pthread_detach
16 total graph nodes

Task tree also written to tasktree.txt.

To generate a DOT file that can be rendered with Graphviz dot:

kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --dot
Reading tasktree...
Read 15 rows
Found 0 ranks, with max graph node index of 14 and depth of 8
building common tree...
Rank 0 ...
Computing new stats...
Building dot file
done.

We then render the dot file (requires a graphviz installation):

dot -Tpdf -O tasktree.dot

DOT task graph of pthread_c example

APEX and tracing

APEX provides a couple output formats for tracing data:

  • OTF2
  • Google Trace Events JSON (Perfetto)
  • Perfetto native output (experimental in APEX)

Let's see what we get when we trace the example:

kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread --apex:gtrace ./build/bin/pthread_c

We should see a trace_events.0.json.gz file written to the current working directory. To visualize this trace, open it in Perfetto, which is a web-based trace visualization tool that can be accessed at https://ui.perfetto.dev/ in your web browser (Google Chrome recommended).

Open the URL, and load the trace from your local computer: Perfetto main window After opening the trace you should see all threads from the execution, and when selecting on a thread you should see the flow event tracing it back to the function call that created it: Perfetto main window

pthread_c example with APEX instrumentation

While APEX can extract a lot of information about the program, there are several blind spots. For example, the doWork() function is not measured. APEX provides an API for instrumentation see Reference Manual, C API. In the meantime, let's look at the source code for the apex_pthread_c program to see what has been added:

apex_pthread.c

Running this example without apex_exec will provide some useful information, since APEX is linked into the executable:

kehuck1@uan01:~/src/apex-tutorial> ./build/bin/apex_pthread_c
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!

Start Date/Time: 25/02/2023 22:58:56
Elapsed time: 0.101764 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 21
Available CPU time on rank 0: 2.13703 seconds
Available CPU time on all ranks: 2.13703 seconds

Counter                                              :  #samp | minimum |    mean  |  maximum |  stddev
---------------------------------------------------------------------------------------------------------
                               1 Minute Load average :      1       2.33       2.33       2.33       0.00
                                      status:Threads :      1       2.00       2.00       2.00       0.00
                                    status:VmData kB :      1   4.16e+05   4.16e+05   4.16e+05       0.00
                                     status:VmExe kB :      1       8.00       8.00       8.00       0.00
                                     status:VmHWM kB :      1   4.29e+04   4.29e+04   4.29e+04       0.00
                                     status:VmLck kB :      1       0.00       0.00       0.00       0.00
                                     status:VmLib kB :      1   1.81e+05   1.81e+05   1.81e+05       0.00
                                     status:VmPTE kB :      1     488.00     488.00     488.00       0.00
                                    status:VmPeak kB :      1   8.56e+05   8.56e+05   8.56e+05       0.00
                                     status:VmPin kB :      1       0.00       0.00       0.00       0.00
                                     status:VmRSS kB :      1   4.29e+04   4.29e+04   4.29e+04       0.00
                                    status:VmSize kB :      1   7.90e+05   7.90e+05   7.90e+05       0.00
                                     status:VmStk kB :      1     160.00     160.00     160.00       0.00
                                    status:VmSwap kB :      1       0.00       0.00       0.00       0.00
                   status:nonvoluntary_ctxt_switches :      1       1.00       1.00       1.00       0.00
                      status:voluntary_ctxt_switches :      1      44.00      44.00      44.00       0.00
---------------------------------------------------------------------------------------------------------

CPU Timers                                           : #calls| #yields| #thread|     mean |    total |  tot/thr | %total | %wall
--------------------------------------------------------------------------------------------------------------------------------
                                           APEX MAIN :      1        0        1       0.10       0.10       0.10   100.0   100.0
                                              doWork :     27        0       21       0.02       0.63       0.03    29.6    29.6
                                                 foo :     16        0       15       0.03       0.40       0.03    18.8    26.3
                                          someThread :      4        0        3       0.05       0.20       0.07     9.5    66.6
        int apex_preload_main(int, char **, char **) :      1        0        1       0.10       0.10       0.10     4.8    99.9
                                                main :      1        0        1       0.10       0.10       0.10     4.8    99.9
--------------------------------------------------------------------------------------------------------------------------------

                                           APEX Idle :                           0.70    32.6

--------------------------------------------------------------------------------------------------------------------------------
                                        Total timers : 49

However, including --apex:pthread will provide more useful data, and will allow us to capture the task dependency graph:

kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread --apex:tasktree ./build/bin/apex_pthread_c
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!

Start Date/Time: 25/02/2023 23:00:32
Elapsed time: 0.102636 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 21
Available CPU time on rank 0: 2.15537 seconds
Available CPU time on all ranks: 2.15537 seconds

Counter                                              :  #samp | minimum |    mean  |  maximum |  stddev
---------------------------------------------------------------------------------------------------------
                               1 Minute Load average :      1       5.30       5.30       5.30       0.00
                                      status:Threads :      1       2.00       2.00       2.00       0.00
                                    status:VmData kB :      1   4.16e+05   4.16e+05   4.16e+05       0.00
                                     status:VmExe kB :      1       8.00       8.00       8.00       0.00
                                     status:VmHWM kB :      1   4.28e+04   4.28e+04   4.28e+04       0.00
                                     status:VmLck kB :      1       0.00       0.00       0.00       0.00
                                     status:VmLib kB :      1   1.81e+05   1.81e+05   1.81e+05       0.00
                                     status:VmPTE kB :      1     484.00     484.00     484.00       0.00
                                    status:VmPeak kB :      1   8.56e+05   8.56e+05   8.56e+05       0.00
                                     status:VmPin kB :      1       0.00       0.00       0.00       0.00
                                     status:VmRSS kB :      1   4.28e+04   4.28e+04   4.28e+04       0.00
                                    status:VmSize kB :      1   7.90e+05   7.90e+05   7.90e+05       0.00
                                     status:VmStk kB :      1     160.00     160.00     160.00       0.00
                                    status:VmSwap kB :      1       0.00       0.00       0.00       0.00
                   status:nonvoluntary_ctxt_switches :      1       2.00       2.00       2.00       0.00
                      status:voluntary_ctxt_switches :      1       6.00       6.00       6.00       0.00
---------------------------------------------------------------------------------------------------------

CPU Timers                                           : #calls| #yields| #thread|     mean |    total |  tot/thr | %total | %wall
--------------------------------------------------------------------------------------------------------------------------------
                                           APEX MAIN :      1        0        1       0.10       0.10       0.10   100.0   100.0
                     OS Thread: APEX pthread wrapper :     21        0       20       0.03       0.71       0.04    32.9    34.6
                                              doWork :     27        0       21       0.02       0.63       0.03    29.4    29.4
                                                 foo :     16        0       15       0.03       0.40       0.03    18.7    26.2
                                          someThread :      4        0        3       0.05       0.20       0.07     9.4    66.0
                       execute_native_thread_routine :      1        0        1       0.10       0.10       0.10     4.8   100.0
                   Main Thread: APEX Pthread Wrapper :      1        0        1       0.10       0.10       0.10     4.8   100.0
        int apex_preload_main(int, char **, char **) :      1        0        1       0.10       0.10       0.10     4.8    99.9
                                                main :      1        0        1       0.10       0.10       0.10     4.8    99.9
                                        pthread_join :      4        0        1       0.02       0.07       0.07     3.3    69.0
                                      pthread_create :     21        0        5       0.00       0.00       0.00     0.2     0.7
                                      pthread_detach :     17        0        4       0.00       0.00       0.00     0.0     0.0
--------------------------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------------------------------------------------------
                                        Total timers : 114
Writing: .//apex_tasktree.csv

kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --ascii --dot
Reading tasktree...
Read 21 rows
Found 0 ranks, with max graph node index of 20 and depth of 12
building common tree...
Rank 0 ...
1-> 0.103 - 100.000% [1] {min=0.103, max=0.103, mean=0.103, threads=1} APEX MAIN
1 |-> 0.103 - 99.979% [1] {min=0.103, max=0.103, mean=0.103, threads=1} Main Thread: APEX Pthread Wrapper
1 | |-> 0.103 - 99.898% [1] {min=0.103, max=0.103, mean=0.103, threads=1} int apex_preload_main(int, char **, char **)
1 | | |-> 0.102 - 99.857% [1] {min=0.102, max=0.102, mean=0.102, threads=1} main
1 | | | |-> 0.071 - 69.025% [4] {min=0.071, max=0.071, mean=0.018, threads=1} pthread_join
1 | | | |-> 0.030 - 29.419% [3] {min=0.030, max=0.030, mean=0.010, threads=1} doWork
1 | | | |-> 0.001 - 1.347% [4] {min=0.001, max=0.001, mean=0.000, threads=1} pthread_create
1 | | | | |-> 0.203 - 198.111% [4] {min=0.203, max=0.203, mean=0.051, threads=4} OS Thread: APEX pthread wrapper
1 | | | | | |-> 0.203 - 198.066% [4] {min=0.203, max=0.203, mean=0.051, threads=4} someThread
1 | | | | | | |-> 0.203 - 198.014% [4] {min=0.203, max=0.203, mean=0.051, threads=4} someThread
1 | | | | | | | |-> 0.201 - 195.655% [8] {min=0.201, max=0.201, mean=0.025, threads=4} doWork
1 | | | | | | | |-> 0.002 - 1.995% [16] {min=0.002, max=0.002, mean=0.000, threads=4} pthread_create
1 | | | | | | | | |-> 0.403 - 393.128% [16] {min=0.403, max=0.403, mean=0.025, threads=16} OS Thread: APEX pthread wrapper
1 | | | | | | | | | |-> 0.403 - 392.815% [16] {min=0.403, max=0.403, mean=0.025, threads=16} foo
1 | | | | | | | | | | |-> 0.403 - 392.311% [16] {min=0.403, max=0.403, mean=0.025, threads=16} foo
1 | | | | | | | | | | | |-> 0.402 - 391.693% [16] {min=0.402, max=0.402, mean=0.025, threads=16} doWork
1 | | | | | | | |-> 0.000 - 0.008% [16] {min=0.000, max=0.000, mean=0.000, threads=4} pthread_detach
1 |-> 0.000 - 0.142% [1] {min=0.000, max=0.000, mean=0.000, threads=1} pthread_create
1 | |-> 0.103 - 100.040% [1] {min=0.103, max=0.103, mean=0.103, threads=1} OS Thread: APEX pthread wrapper
1 | | |-> 0.103 - 99.983% [1] {min=0.103, max=0.103, mean=0.103, threads=1} execute_native_thread_routine
1 |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} pthread_detach
22 total graph nodes

Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.

DOT task graph of apex_pthread_c example

An observant user will notice that the someFunction and foo functions were captured twice - once by the instrumentation and once by the pthread_create wrapper.

pthread_c example with compiler instrumentation

A similar example using compiler-based instrumentation (i.e. -finstrument-functions) is provided with the pthread_inst_c executable. It is the same source code as pthread.c, however it was compiled with compiler instrumentation flags (where supported, see CMakeLists.txt). Running this example without apex_exec should give the same output as the pthread_c example. However, running with apex_exec will provide implementations of the __cyg_profile_func_enter and __cyg_profile_func_exit symbols, and APEX will time all instrumented functions:

kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:pthread --apex:tasktree ./build/bin/pthread_inst_c
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
someThread : Thread 0 working!
someThread : Thread 1 working!
someThread : Thread 2 working!
someThread : Thread 3 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 0 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 1 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 2 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!
foo : Thread 3 working!

Start Date/Time: 25/02/2023 23:20:47
Elapsed time: 0.101867 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 21
Available CPU time on rank 0: 2.13921 seconds
Available CPU time on all ranks: 2.13921 seconds

Counter                                              :  #samp | minimum |    mean  |  maximum |  stddev
---------------------------------------------------------------------------------------------------------
                               1 Minute Load average :      1       1.52       1.52       1.52       0.00
                                      status:Threads :      1       2.00       2.00       2.00       0.00
                                    status:VmData kB :      1   4.16e+05   4.16e+05   4.16e+05       0.00
                                     status:VmExe kB :      1       8.00       8.00       8.00       0.00
                                     status:VmHWM kB :      1   4.25e+04   4.25e+04   4.25e+04       0.00
                                     status:VmLck kB :      1       0.00       0.00       0.00       0.00
                                     status:VmLib kB :      1   1.81e+05   1.81e+05   1.81e+05       0.00
                                     status:VmPTE kB :      1     492.00     492.00     492.00       0.00
                                    status:VmPeak kB :      1   8.56e+05   8.56e+05   8.56e+05       0.00
                                     status:VmPin kB :      1       0.00       0.00       0.00       0.00
                                     status:VmRSS kB :      1   4.25e+04   4.25e+04   4.25e+04       0.00
                                    status:VmSize kB :      1   7.90e+05   7.90e+05   7.90e+05       0.00
                                     status:VmStk kB :      1     160.00     160.00     160.00       0.00
                                    status:VmSwap kB :      1       0.00       0.00       0.00       0.00
                   status:nonvoluntary_ctxt_switches :      1       1.00       1.00       1.00       0.00
                      status:voluntary_ctxt_switches :      1       7.00       7.00       7.00       0.00
---------------------------------------------------------------------------------------------------------

CPU Timers                                           : #calls| #yields| #thread|     mean |    total |  tot/thr | %total | %wall
--------------------------------------------------------------------------------------------------------------------------------
                                           APEX MAIN :      1        0        1       0.10       0.10       0.10   100.0   100.0
                                                 foo :     32        0       16       0.03       0.80       0.05    37.6    49.4
                     OS Thread: APEX pthread wrapper :     21        0       20       0.03       0.71       0.04    33.1    34.8
                                              doWork :     27        0       21       0.02       0.63       0.03    29.6    29.6
                                          someThread :      8        0        4       0.05       0.41       0.10    19.0   100.0
                   Main Thread: APEX Pthread Wrapper :      1        0        1       0.10       0.10       0.10     4.8   100.0
        int apex_preload_main(int, char **, char **) :      1        0        1       0.10       0.10       0.10     4.8    99.9
                                                main :      1        0        1       0.10       0.10       0.10     4.8    99.9
                                        pthread_join :      4        0        1       0.02       0.07       0.07     3.3    69.7
                                      pthread_create :     21        0        5       0.00       0.00       0.00     0.1     0.6
                                      pthread_detach :     17        0        4       0.00       0.00       0.00     0.0     0.0
--------------------------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------------------------------------------------------
                                        Total timers : 133
Writing: .//apex_tasktree.csv