MPI and OpenMP Example
This example introduces APEX with an application that uses both MPI and OpenMP.
The application is the (now long outdated) LULESH example from LLNL.
Running the example is straightforward: run your equivalent of mpirun (srun, jsrun, mpiexec, etc.) followed by the executable name:
[khuck@gilgamesh apex-tutorial]$ mpirun -np 8 ./build/bin/lulesh_MPI_OpenMP_2.0
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Running problem size 30^3 per domain until completion
Num processors: 8
Num threads: 4
Total number of elements: 216000
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
Run completed:
Problem size = 30
MPI tasks = 8
Iteration count = 2031
Final Origin Energy = 7.130703e+05
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 1.236913e-10
TotalAbsDiff = 4.731078e-09
MaxRelDiff = 3.858431e-14
Elapsed time = 28.96 (s)
Grind time (us/z/c) = 0.52806123 (per dom) (0.066007654 overall)
FOM = 15149.758 (z/s)
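Because the example is also used for tuning, it has APEX linked in, so the APEX header is displayed even when the program is not launched with apex_exec.
To measure the MPI and OpenMP activity, launch the executable with apex_exec. In the run below, --apex:ompt enables the OpenMP tools (OMPT) interface and --apex:tasktree writes the task tree to apex_tasktree.csv. For specifics on OpenMP support, please see the OpenMP Examples page.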
[khuck@gilgamesh apex-tutorial]$ mpirun -np 8 apex_exec --apex:ompt --apex:tasktree ./build/bin/lulesh_MPI_OpenMP_2.0
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Running problem size 30^3 per domain until completion
Num processors: 8
Num threads: 4
Total number of elements: 216000
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
Run completed:
Problem size = 30
MPI tasks = 8
Iteration count = 2031
Final Origin Energy = 7.130703e+05
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 1.236913e-10
TotalAbsDiff = 4.731078e-09
MaxRelDiff = 3.858431e-14
Elapsed time = 26.34 (s)
Grind time (us/z/c) = 0.48032695 (per dom) (0.060040869 overall)
FOM = 16655.322 (z/s)
Start Date/Time: 26/02/2023 13:36:51
Elapsed time: 26.5126 seconds
Total processes detected: 8
HW Threads detected on rank 0: 96
Worker Threads observed on rank 0: 7
Available CPU time on rank 0: 185.588 seconds
Available CPU time on all ranks: 1484.71 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
1 Minute Load average : 216 15.08 19.00
Bytes : MPI_Allreduce recvbuf : 16240 64.00 64.00
Bytes : MPI_Allreduce sendbuf : 16240 8.00 8.00
Bytes : MPI_Irecv : 219404 1.54e+04 4.61e+04
Bytes : MPI_Isend : 219404 1.54e+04 4.61e+04
Bytes : MPI_Reduce recvbuf : 8 64.00 64.00
Bytes : MPI_Reduce sendbuf : 8 8.00 8.00
CPU Guest % : 208 0.00 0.00
CPU I/O Wait % : 208 0.57 1.12
CPU IRQ % : 208 0.26 0.29
CPU Idle % : 208 61.92 67.68
CPU Nice % : 208 0.24 1.02
CPU Steal % : 208 0.00 0.00
CPU System % : 208 4.82 5.49
CPU User % : 208 32.10 38.07
CPU soft IRQ % : 208 0.09 0.13
DRAM Energy : 208 2.92 4.00
Package-0 Energy : 208 145.59 154.00
status:Threads : 216 8.55 10.00
status:VmData kB : 216 5.06e+05 5.22e+05
status:VmExe kB : 216 136.00 136.00
status:VmHWM kB : 216 7.90e+04 8.22e+04
status:VmLck kB : 216 0.00 0.00
status:VmLib kB : 216 1.43e+05 1.43e+05
status:VmPTE kB : 216 611.39 664.00
status:VmPeak kB : 216 1.32e+06 1.46e+06
status:VmPin kB : 216 0.00 0.00
status:VmRSS kB : 216 7.59e+04 8.15e+04
status:VmSize kB : 216 1.28e+06 1.40e+06
status:VmStk kB : 216 140.00 140.00
status:VmSwap kB : 216 0.00 0.00
status:nonvoluntary_ctxt_switches : 216 63.74 92.00
status:voluntary_ctxt_switches : 216 2.58e+04 1.08e+05
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 26.51 26.51
int apex_preload_main(int, char **, char **) : 8 26.45 211.63
EvalEOSForElems(Domain&, double*, int, int*, int) : 178728 0.00 89.70
OpenMP Work Loop: .omp_outlined..34:0x20e316 : 64992 0.00 85.67
OpenMP Work Loop: .omp_outlined.:0x2060e7 : 64992 0.00 47.10
OpenMP Work Loop: .omp_outlined..33:0x20ce5a : 64992 0.00 33.45
OpenMP Work Loop: .omp_outlined..31:0x20b962 : 64992 0.00 33.32
OpenMP Parallel Region: main:0x207dee : 16248 0.00 22.41
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 19.36
OpenMP Work Loop: .omp_outlined..42:0x2105ab : 64992 0.00 18.36
OpenMP Work Loop: .omp_outlined..18:0x2097b3 : 1.95e+06 0.00 15.96
OpenMP Parallel Region: main:0x20824d : 16248 0.00 12.47
OpenMP Work Loop: .omp_outlined..43:0x211273 : 714912 0.00 11.05
OpenMP Work Loop: .omp_outlined..23:0x20a870 : 2.27e+06 0.00 10.67
OpenMP Work Loop: .omp_outlined..26:0x20b0ec : 6.82e+06 0.00 10.30
OpenMP Work Loop: .omp_outlined..21:0x20a3aa : 2.27e+06 0.00 9.55
OpenMP Work Loop: .omp_outlined..25:0x20adc6 : 6.82e+06 0.00 9.45
OpenMP Parallel Region: main:0x207bd6 : 16248 0.00 9.25
OpenMP Parallel Region: main:0x207a77 : 16248 0.00 9.12
int MPI_Waitall(int, MPI_Request *, MPI_Status *) : 48752 0.00 8.42
OpenMP Work Loop: .omp_outlined..35:0x20fa83 : 64992 0.00 7.28
OpenMP Work Loop: .omp_outlined..18:0x20994f : 1.95e+06 0.00 6.96
OpenMP Work Loop: .omp_outlined..32:0x20cc43 : 64992 0.00 6.84
OpenMP Work Loop: .omp_outlined..24:0x20ab83 : 2.27e+06 0.00 6.35
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 6.27
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 6.24
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 6.20
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 5.18
OpenMP Parallel Region: main:0x20836c : 16248 0.00 4.85
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 4.58
OpenMP Parallel Region: main:0x2084a2 : 178728 0.00 4.55
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 4.38
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 4.38
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 4.37
OpenMP Work Loop: .omp_outlined..20:0x20a046 : 2.27e+06 0.00 4.28
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 4.25
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 4.12
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680 0.00 4.08
OpenMP Work Loop: .omp_outlined..19:0x209e96 : 714912 0.00 2.82
OpenMP Work Loop: .omp_outlined..48:0x21214b : 714912 0.00 2.74
OpenMP Work Loop: .omp_outlined..30:0x20b7c0 : 64992 0.00 2.56
OpenMP Work Loop: .omp_outlined..22:0x20a66c : 2.27e+06 0.00 2.45
OpenMP Parallel Region: main:0x207e28 : 16248 0.00 2.25
OpenMP Parallel Region: main:0x207ab4 : 16248 0.00 2.16
OpenMP Work Loop: .omp_outlined..36:0x20fcac : 64992 0.00 2.09
OpenMP Work Loop: .omp_outlined..40:0x210430 : 64992 0.00 2.08
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 178728 0.00 2.02
OpenMP Work Loop: .omp_outlined..18:0x209cbb : 1.95e+06 0.00 2.01
OpenMP Parallel Region: main:0x208831 : 178728 0.00 1.98
OpenMP Work Loop: .omp_outlined..38:0x210163 : 64992 0.00 1.95
OpenMP Work Loop: .omp_outlined..27:0x20b262 : 714912 0.00 1.69
OpenMP Work Loop: .omp_outlined..39:0x2102ef : 64992 0.00 1.63
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 178728 0.00 1.59
OpenMP Parallel Region: main:0x2089cf : 178728 0.00 1.45
OpenMP Work Loop: .omp_outlined..18:0x209ac7 : 1.95e+06 0.00 1.28
OpenMP Work Loop: .omp_outlined..18:0x209bc2 : 1.95e+06 0.00 1.26
OpenMP Parallel Region: main:0x20797b : 16248 0.00 0.84
OpenMP Work Loop: .omp_outlined..28:0x20b430 : 64992 0.00 0.82
OpenMP Parallel Region: main:0x208275 : 16248 0.00 0.74
OpenMP Parallel Region: main:0x2080c6 : 16248 0.00 0.73
OpenMP Parallel Region: main:0x208046 : 16248 0.00 0.71
OpenMP Work Loop: .omp_outlined..49:0x212365 : 714912 0.00 0.64
OpenMP Parallel Region: main:0x2080fd : 16248 0.00 0.64
int MPI_Wait(MPI_Request *, MPI_Status *) : 219404 0.00 0.63
OpenMP Work Loop: .omp_outlined..46:0x211cad : 64992 0.00 0.56
OpenMP Parallel Region: main:0x20861d : 16248 0.00 0.54
OpenMP Parallel Region: main:0x2078cc : 16248 0.00 0.36
int MPI_Isend(const void *, int, MPI_Datatype, int,… : 219404 0.00 0.29
int MPI_Allreduce(const void *, void *, int, MPI_Da… : 16240 0.00 0.28
OpenMP Work Loop: .omp_outlined..29:0x20b6d5 : 64992 0.00 0.27
OpenMP Work Loop: .omp_outlined..47:0x211e90 : 64992 0.00 0.23
OpenMP Parallel Region: main:0x207b15 : 16248 0.00 0.22
OpenMP Work Loop: .omp_outlined..46:0x211b0c : 64992 0.00 0.19
OpenMP Parallel Region: main:0x208077 : 16248 0.00 0.18
MPI Collective Sync : 16256 0.00 0.18
OpenMP Parallel Region: main:0x20871c : 16248 0.00 0.18
OpenMP Work Loop: .omp_outlined..46:0x211bf0 : 64992 0.00 0.16
int MPI_Irecv(void *, int, MPI_Datatype, int, int, … : 219404 0.00 0.10
OpenMP Work Loop: .omp_outlined..37:0x210037 : 32496 0.00 0.03
OpenMP Work Loop: .omp_outlined..37:0x20fe04 : 32496 0.00 0.03
OpenMP Work Loop: .omp_outlined..37:0x20ff17 : 32496 0.00 0.02
int MPI_Reduce(const void *, void *, int, MPI_Datat… : 8 0.00 0.00
int MPI_Barrier(MPI_Comm) : 8 0.00 0.00
: 1 0.00 0.00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 48510953
Writing: .//apex_tasktree.csv...done.
The task tree written to apex_tasktree.csv can then be summarized with apex-treesummary.py:
[khuck@gilgamesh apex-tutorial]$ apex-treesummary.py --ascii --dot
Reading tasktree...
Read 700 rows
Found 7 ranks, with max graph node index of 88 and depth of 4
building common tree...
Rank 7 ...
8-> 26.513 - 100.000% [1] {min=26.512, max=26.513, mean=26.513, threads=8} APEX MAIN
8 |-> 26.453 - 99.776% [1] {min=26.401, max=26.506, mean=26.453, threads=8} int apex_preload_main(int, char **, char **)
8 | |-> 11.212 - 42.291% [22341] {min=10.118, max=13.009, mean=0.001, threads=8} EvalEOSForElems(Domain&, double*, int, int*, int)
8 | | |-> 2.420 - 9.126% [71085] {min=1.925, max=3.316, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x208ec6
8 | | | |-> 1.995 - 7.525% [243372] {min=1.153, max=3.530, mean=0.000, threads=48} OpenMP Work Loop: .omp_outlined..18:0x2097b3
8 | | | |-> 0.870 - 3.280% [243372] {min=0.474, max=1.561, mean=0.000, threads=48} OpenMP Work Loop: .omp_outlined..18:0x20994f
8 | | | |-> 0.252 - 0.950% [243372] {min=0.164, max=0.398, mean=0.000, threads=48} OpenMP Work Loop: .omp_outlined..18:0x209cbb
8 | | | |-> 0.160 - 0.605% [243372] {min=0.113, max=0.244, mean=0.000, threads=48} OpenMP Work Loop: .omp_outlined..18:0x209ac7
8 | | | |-> 0.158 - 0.596% [243372] {min=0.111, max=0.244, mean=0.000, threads=48} OpenMP Work Loop: .omp_outlined..18:0x209bc2
8 | | |-> 0.783 - 2.954% [71085] {min=0.628, max=0.990, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209327
8 | | | |-> 1.334 - 5.030% [284340] {min=0.760, max=2.167, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..23:0x20a870
8 | | |-> 0.781 - 2.944% [71085] {min=0.688, max=0.843, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209042
8 | | | |-> 0.535 - 2.018% [284340] {min=0.357, max=0.830, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..20:0x20a046
8 | | |-> 0.775 - 2.925% [71085] {min=0.634, max=0.935, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209182
8 | | | |-> 1.194 - 4.504% [284340] {min=0.735, max=1.892, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..21:0x20a3aa
8 | | |-> 0.648 - 2.444% [71085] {min=0.565, max=0.765, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209465
8 | | | |-> 0.794 - 2.996% [284340] {min=0.508, max=1.140, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..24:0x20ab83
8 | | |-> 0.572 - 2.159% [71085] {min=0.539, max=0.603, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2090af
8 | | | |-> 0.451 - 1.700% [284340] {min=0.329, max=0.629, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..25:0x20adc6
8 | | |-> 0.547 - 2.064% [71085] {min=0.489, max=0.620, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2090fe
8 | | | |-> 0.438 - 1.650% [284340] {min=0.300, max=0.655, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..26:0x20b0ec
8 | | |-> 0.547 - 2.064% [71085] {min=0.519, max=0.586, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209244
8 | | | |-> 0.385 - 1.451% [284340] {min=0.262, max=0.547, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..25:0x20adc6
8 | | |-> 0.546 - 2.060% [71085] {min=0.500, max=0.593, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x20928b
8 | | | |-> 0.444 - 1.676% [284340] {min=0.300, max=0.652, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..26:0x20b0ec
8 | | |-> 0.531 - 2.004% [71085] {min=0.476, max=0.571, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2093ad
8 | | | |-> 0.346 - 1.307% [284340] {min=0.236, max=0.489, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..25:0x20adc6
8 | | |-> 0.516 - 1.945% [71085] {min=0.463, max=0.587, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2093f1
8 | | | |-> 0.406 - 1.532% [284340] {min=0.267, max=0.631, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..26:0x20b0ec
8 | | |-> 0.510 - 1.923% [71085] {min=0.448, max=0.543, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2091be
8 | | | |-> 0.306 - 1.153% [284340] {min=0.206, max=0.433, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..22:0x20a66c
8 | | |-> 0.252 - 0.952% [22341] {min=0.224, max=0.270, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2094e4
8 | | | |-> 0.353 - 1.330% [89364] {min=0.312, max=0.377, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..19:0x209e96
8 | | |-> 0.199 - 0.751% [22341] {min=0.174, max=0.225, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209590
8 | | | |-> 0.211 - 0.795% [89364] {min=0.184, max=0.271, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..27:0x20b262
8 | |-> 2.802 - 10.567% [2031] {min=2.709, max=2.902, mean=0.001, threads=8} OpenMP Parallel Region: main:0x207dee
8 | | |-> 10.709 - 40.392% [8124] {min=10.421, max=11.032, mean=0.001, threads=32} OpenMP Work Loop: .omp_outlined..34:0x20e316
8 | |-> 1.558 - 5.878% [2031] {min=1.537, max=1.596, mean=0.001, threads=8} OpenMP Parallel Region: main:0x20824d
8 | | |-> 5.887 - 22.204% [8124] {min=5.790, max=5.985, mean=0.001, threads=32} OpenMP Work Loop: .omp_outlined.:0x2060e7
8 | |-> 1.156 - 4.360% [2031] {min=1.067, max=1.273, mean=0.001, threads=8} OpenMP Parallel Region: main:0x207bd6
8 | | |-> 4.181 - 15.771% [8124] {min=3.852, max=4.820, mean=0.001, threads=32} OpenMP Work Loop: .omp_outlined..33:0x20ce5a
8 | |-> 1.140 - 4.301% [2031] {min=1.038, max=1.435, mean=0.001, threads=8} OpenMP Parallel Region: main:0x207a77
8 | | |-> 4.165 - 15.708% [8124] {min=4.005, max=4.571, mean=0.001, threads=32} OpenMP Work Loop: .omp_outlined..31:0x20b962
8 | |-> 1.053 - 3.972% [6094] {min=0.779, max=1.294, mean=0.000, threads=8} int MPI_Waitall(int, MPI_Request *, MPI_Status *)
8 | |-> 0.606 - 2.288% [2031] {min=0.595, max=0.616, mean=0.000, threads=8} OpenMP Parallel Region: main:0x20836c
8 | | |-> 2.295 - 8.656% [8124] {min=2.279, max=2.310, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..42:0x2105ab
8 | |-> 0.569 - 2.146% [22341] {min=0.510, max=0.598, mean=0.000, threads=8} OpenMP Parallel Region: main:0x2084a2
8 | | |-> 1.382 - 5.211% [89364] {min=1.285, max=1.463, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..43:0x211273
8 | |-> 0.281 - 1.059% [2031] {min=0.225, max=0.307, mean=0.000, threads=8} OpenMP Parallel Region: main:0x207e28
8 | | |-> 0.910 - 3.432% [8124] {min=0.769, max=1.015, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..35:0x20fa83
8 | |-> 0.270 - 1.019% [2031] {min=0.204, max=0.314, mean=0.000, threads=8} OpenMP Parallel Region: main:0x207ab4
8 | | |-> 0.855 - 3.226% [8124] {min=0.653, max=0.972, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..32:0x20cc43
8 | |-> 0.248 - 0.935% [22341] {min=0.223, max=0.268, mean=0.000, threads=8} OpenMP Parallel Region: main:0x208831
8 | | |-> 0.342 - 1.291% [89364] {min=0.321, max=0.369, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..48:0x21214b
8 | |-> 0.181 - 0.683% [22341] {min=0.163, max=0.194, mean=0.000, threads=8} OpenMP Parallel Region: main:0x2089cf
8 | | |-> 0.080 - 0.301% [89364] {min=0.068, max=0.106, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..49:0x212365
8 | |-> 0.105 - 0.395% [2031] {min=0.097, max=0.112, mean=0.000, threads=8} OpenMP Parallel Region: main:0x20797b
8 | | |-> 0.320 - 1.208% [8124] {min=0.282, max=0.358, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..30:0x20b7c0
8 | |-> 0.092 - 0.348% [2031] {min=0.087, max=0.099, mean=0.000, threads=8} OpenMP Parallel Region: main:0x208275
8 | | |-> 0.260 - 0.980% [8124] {min=0.235, max=0.280, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..40:0x210430
8 | |-> 0.091 - 0.345% [2031] {min=0.067, max=0.119, mean=0.000, threads=8} OpenMP Parallel Region: main:0x2080c6
8 | | |-> 0.244 - 0.919% [8124] {min=0.180, max=0.335, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..38:0x210163
8 | |-> 0.089 - 0.334% [2031] {min=0.085, max=0.093, mean=0.000, threads=8} OpenMP Parallel Region: main:0x208046
8 | | |-> 0.262 - 0.987% [8124] {min=0.257, max=0.269, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..36:0x20fcac
8 | |-> 0.079 - 0.300% [2031] {min=0.057, max=0.109, mean=0.000, threads=8} OpenMP Parallel Region: main:0x2080fd
8 | | |-> 0.204 - 0.769% [8124] {min=0.147, max=0.292, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..39:0x2102ef
8 | |-> 0.078 - 0.295% [27425] {min=0.022, max=0.199, mean=0.000, threads=8} int MPI_Wait(MPI_Request *, MPI_Status *)
8 | |-> 0.067 - 0.253% [2031] {min=0.062, max=0.071, mean=0.000, threads=8} OpenMP Parallel Region: main:0x20861d
8 | | |-> 0.070 - 0.263% [8124] {min=0.069, max=0.071, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..46:0x211cad
8 | | |-> 0.024 - 0.089% [8124] {min=0.023, max=0.024, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..46:0x211b0c
8 | | |-> 0.020 - 0.077% [8124] {min=0.020, max=0.021, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..46:0x211bf0
8 | |-> 0.046 - 0.172% [2031] {min=0.041, max=0.051, mean=0.000, threads=8} OpenMP Parallel Region: main:0x2078cc
8 | | |-> 0.103 - 0.389% [8124] {min=0.090, max=0.120, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..28:0x20b430
8 | |-> 0.037 - 0.138% [27425] {min=0.027, max=0.045, mean=0.000, threads=8} int MPI_Isend(const void *, int, MPI_Datatype, int, int, MPI_Comm, MPI_Request *)
8 | |-> 0.035 - 0.133% [2030] {min=0.030, max=0.040, mean=0.000, threads=8} int MPI_Allreduce(const void *, void *, int, MPI_Datatype, MPI_Op, MPI_Comm)
8 | | |-> 0.022 - 0.083% [2030] {min=0.016, max=0.027, mean=0.000, threads=8} MPI Collective Sync
8 | |-> 0.027 - 0.101% [2031] {min=0.023, max=0.033, mean=0.000, threads=8} OpenMP Parallel Region: main:0x207b15
8 | | |-> 0.033 - 0.125% [8124] {min=0.030, max=0.037, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..29:0x20b6d5
8 | |-> 0.022 - 0.084% [2031] {min=0.005, max=0.036, mean=0.000, threads=8} OpenMP Parallel Region: main:0x208077
4 | | |-> 0.007 - 0.027% [8124] {min=0.006, max=0.008, mean=0.000, threads=16} OpenMP Work Loop: .omp_outlined..37:0x210037
4 | | |-> 0.007 - 0.026% [8124] {min=0.006, max=0.008, mean=0.000, threads=16} OpenMP Work Loop: .omp_outlined..37:0x20fe04
4 | | |-> 0.006 - 0.022% [8124] {min=0.006, max=0.006, mean=0.000, threads=16} OpenMP Work Loop: .omp_outlined..37:0x20ff17
8 | |-> 0.022 - 0.083% [2031] {min=0.021, max=0.023, mean=0.000, threads=8} OpenMP Parallel Region: main:0x20871c
8 | | |-> 0.029 - 0.108% [8124] {min=0.026, max=0.032, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..47:0x211e90
8 | |-> 0.012 - 0.046% [27425] {min=0.010, max=0.015, mean=0.000, threads=8} int MPI_Irecv(void *, int, MPI_Datatype, int, int, MPI_Comm, MPI_Request *)
8 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=8} int MPI_Reduce(const void *, void *, int, MPI_Datatype, MPI_Op, int, MPI_Comm)
8 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=8} MPI Collective Sync
8 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=8} int MPI_Barrier(MPI_Comm)
8 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=8} MPI Collective Sync
90 total graph nodes
Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
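The tree lines printed above follow a regular pattern: a leading integer, one `|` per nesting level, the time, the percentage, the call count in brackets, a statistics block in braces, and the timer name. The Python sketch below parses lines of that shape and sums the time spent in MPI calls. It is only an illustration based on the layout shown here. The names `NODE_RE` and `parse_node` are hypothetical and not part of APEX, the meaning of the leading integer is not labeled in the output and is kept as-is, and the exact spacing in tasktree.txt is assumed to match the printed lines.

```python
import re

# Pattern for one tree line, derived only from the layout printed above.
NODE_RE = re.compile(
    r"^\s*(?P<leading>\d+)\s*(?P<bars>(?:\|\s*)*)->\s*"
    r"(?P<time>[\d.]+)\s*-\s*(?P<pct>[\d.]+)%\s*"
    r"\[(?P<calls>\d+)\]\s*"
    r"\{min=(?P<min>[\d.]+), max=(?P<max>[\d.]+), "
    r"mean=(?P<mean>[\d.]+), threads=(?P<threads>\d+)\}\s*"
    r"(?P<name>.*)$"
)

def parse_node(line):
    """Return a dict for one tree line, or None if the line is not a tree node."""
    m = NODE_RE.match(line)
    if m is None:
        return None
    return {
        "leading": int(m["leading"]),     # unlabeled leading integer, kept as-is
        "depth": m["bars"].count("|"),    # one '|' per nesting level
        "time": float(m["time"]),
        "percent": float(m["pct"]),
        "calls": int(m["calls"]),
        "threads": int(m["threads"]),
        "name": m["name"].strip(),
    }

# A few lines copied from the output above, plus one non-tree line.
sample = """\
8-> 26.513 - 100.000% [1] {min=26.512, max=26.513, mean=26.513, threads=8} APEX MAIN
8 | |-> 1.053 - 3.972% [6094] {min=0.779, max=1.294, mean=0.000, threads=8} int MPI_Waitall(int, MPI_Request *, MPI_Status *)
8 | |-> 0.078 - 0.295% [27425] {min=0.022, max=0.199, mean=0.000, threads=8} int MPI_Wait(MPI_Request *, MPI_Status *)
Computing new stats...
"""

# Keep only the lines that parse as tree nodes, then total the MPI timers.
nodes = [n for n in map(parse_node, sample.splitlines()) if n is not None]
mpi_time = sum(n["time"] for n in nodes if "MPI_" in n["name"])
print(f"{len(nodes)} nodes parsed, {mpi_time:.3f} s in MPI calls")
```

To process a whole run, the same function could be applied to each line of tasktree.txt instead of the `sample` string.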
As shown in the output, APEX wraps and measures the MPI communication routines, capturing the time spent in each call as well as the number of bytes transferred. In the DOT figure, boxes are shaded by either time (blue) or bytes transmitted (red).
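The counter table above lists a sample count and a mean for each byte counter, not a sum, so a rough total can be estimated as #samp times mean. The Python sketch below does this for three rows copied from that table. The means are rounded in the report, so the results are estimates only, and the helper name `approx_total_bytes` is illustrative and not part of APEX.

```python
# Rows copied from the APEX counter table above: name -> (#samp, mean bytes).
counters = {
    "MPI_Isend": (219404, 1.54e4),
    "MPI_Irecv": (219404, 1.54e4),
    "MPI_Allreduce sendbuf": (16240, 8.0),
}

def approx_total_bytes(samples, mean):
    """Estimate total bytes as sample count times mean size per sample."""
    return samples * mean

for name, (samples, mean) in counters.items():
    total = approx_total_bytes(samples, mean)
    print(f"{name}: about {total:.3e} bytes over {samples} samples")
```

For this run the point-to-point traffic (MPI_Isend and MPI_Irecv) dominates by several orders of magnitude, which agrees with the small, fixed 8-byte MPI_Allreduce send buffers shown in the table.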
APEX tutorial, © Copyright 2023, University of Oregon. For more information on APEX, see https://github.com/UO-OACISS/apex