Skip to content

Commit ab7577c

Browse files
committed
Add profiling with perf chapter OCaml manual
This covers the use of Linux perf to profile OCaml native code including the overlap with frame pointers and DWARF CFI information
1 parent 137dd26 commit ab7577c

File tree

6 files changed

+207
-5
lines changed

6 files changed

+207
-5
lines changed

Diff for: Changes

+3
Original file line numberDiff line numberDiff line change
@@ -182,6 +182,9 @@ Working version
182182
- #13694: Fix name for caml_hash_variant in the C interface.
183183
(Michael Hendricks)
184184

185+
- #13751: Document support for profiling with Linux perf and frame pointers.
186+
(Tim McGilchrist, review by ???)
187+
185188
### Compiler user-interface and warnings:
186189

187190
- #13428: support dump=[source | parsetree | lambda | ... | cmm | ...]

Diff for: manual/README.md

+4
Original file line numberDiff line numberDiff line change
@@ -85,6 +85,8 @@ chapters (or sometimes sections) are mapped to a distinct `.etex` file:
8585
- Optimisation with Flambda: `flambda.etex`
8686
- Fuzzing with afl-fuzz: `afl-fuzz.etex`
8787
- Runtime tracing with Runtime_events: `runtime_tracing.etex`
88+
- The “Tail Modulo Constructor” program transformation: `tail-mod-cons.etex`
89+
- Runtime detection of data races with ThreadSanitizer: `tsan.etex`
8890

8991
Note that ocamlc,ocamlopt and the toplevel options overlap a lot.
9092
Consequently, these options are described together in the file
@@ -104,6 +106,8 @@ of `unified-options.etex` contains the relevant information.
104106
- The runtime_events library: `libruntime_events.etex`
105107
- The dynlink library: dynamic loading and linking of object files:
106108
`libdynlink.etex`
109+
- Recently removed or moved libraries (Graphics, Bigarray, Num, LablTk):
110+
`old.etex`
107111

108112
Latex extensions
109113
----------------

Diff for: manual/src/allfiles.etex

+1
Original file line numberDiff line numberDiff line change
@@ -70,6 +70,7 @@ and as a
7070
\input{ocamldoc.tex}
7171
\input{debugger.tex}
7272
\input{profil.tex}
73+
\input{profile-perf.tex}
7374
\input{intf-c.tex}
7475
\input{flambda.tex}
7576
\input{afl-fuzz.tex}

Diff for: manual/src/cmds/Makefile

+2-1
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,8 @@ TRANSF = $(OCAMLRUN) $(TOOLS)/transf
1212
FILES = comp.tex top.tex runtime.tex native.tex lexyacc.tex intf-c.tex \
1313
ocamldep.tex profil.tex debugger.tex ocamldoc.tex \
1414
warnings-help.tex flambda.tex tail-mod-cons.tex \
15-
afl-fuzz.tex runtime-tracing.tex unified-options.tex tsan.tex
15+
afl-fuzz.tex runtime-tracing.tex unified-options.tex tsan.tex \
16+
profile-perf.tex
1617

1718
etex-files: $(FILES)
1819
all: $(FILES)

Diff for: manual/src/cmds/profil.etex

+2-4
Original file line numberDiff line numberDiff line change
@@ -140,7 +140,5 @@ Display a short usage summary and exit.
140140

141141
Profiling with "ocamlprof" only records execution counts, not the actual
142142
time spent within each function. There is currently no way to perform
143-
time profiling on bytecode programs generated by "ocamlc". For time
144-
profiling of native code, users are recommended to use standard tools
145-
such as perf (on Linux), Instruments (on macOS) and DTrace. Profiling
146-
with "gprof" is no longer supported.
143+
time profiling on bytecode programs generated by "ocamlc". For time profiling
144+
of native code (see chapter~\ref{c:profiler-perf}).

Diff for: manual/src/cmds/profile-perf.etex

+195
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,195 @@
1+
\chapter{Profiling (perf)} \label{c:profiler-perf}
2+
%HEVEA\cutname{profiler-perf.html}
3+
4+
This chapter describes how to use \texttt{perf} to profile OCaml programs.
5+
6+
Linux Performance Events (\texttt{perf (1)}) is a suite of tools for performance observability. The main features covered here are \texttt{perf-record(1)} for recording events, and \texttt{perf-report(1)} for printing and visualising recorded events. \texttt{perf} has many additional profiling and visualising features, for more comprehensive documentation, see (\texttt{perf(1)}, \href{https://perfwiki.github.io/main/}{\texttt{perf} wiki} or \href{https://www.brendangregg.com/perf.html}{Brendan Gregg's Blog}).
7+
8+
\section{s:ocamlperf-call-graph}{Background}
9+
10+
CPU profiling is typically performed by sampling the CPU call graph at frequent intervals to gather statistics on the code paths that are consuming CPU resources. To profile OCaml code, \texttt{perf} needs to understand the call graph of OCaml. \texttt{perf} supports multiple options for recording call graphs:
11+
\begin{itemize}
12+
\item Frame Pointers, which is the default.
13+
\item DWARF's Call Frame Information (CFI).
14+
\item Hardware Last Branch Record (LBR).
15+
\end{itemize}
16+
17+
Frame pointer based call graphs use a convention where a register called the frame pointer holds the address for the beginning of the stack frame, and the previous values for the frame pointer are stored at a known offset from the current frame pointer. Using this information the call graph for the current function can be calculated purely based on the current stack, a process called unwinding. On x86_64 the register use for storing the frame pointer is \$rbp, while ARM64 uses the register x29. OCaml 5 introduced non-contiguous stacks as part of the implementation of effects, see \href{https://dl.acm.org/doi/10.1145/3453483.3454039}{Retrofitting effect handlers onto OCaml} (Section 5.5), which work better with the copying nature of perf and frame pointers.
18+
19+
DWARF based call graphs use the DWARF CFI information to perform unwinding. However this interacts poorly with the copying nature of perf, often leading to truncated call graphs where not enough of the stack has been copied by \texttt{perf}. It also produces larger trace files that are more costly to capture and process. Finally it requires including CFI debugging information in your program resulting in larger binaries.
20+
21+
Hardware Last Branch Record (LBR) uses a processor provided method to record call graphs. This has the limitations of restricted availability (only on certain Intel CPUs) and a limited stack depth (16 on Haswell and 32 since Skylake).
22+
23+
Of these options, frame pointers are recommended for profiling OCaml code. It has the following advantages:
24+
\begin{itemize}
25+
\item Unwinding is faster to calculate.
26+
\item Tracing data produced is smaller.
27+
\item Frame pointers provide more complete call graphs, particularly when used with a Linux distribution that supports them.
28+
\item Frame pointers work better with perf's copying nature and OCaml 5's non-contiguous stacks.
29+
\end{itemize}
30+
31+
\section{s:ocamlperf-compiling}{Compiling for Profiling}
32+
33+
The process for compiling an executable for CPU profiling depends on the OCaml version. For OCaml versions 4.14 and earlier, either frame pointers or DWARF can be used, while for OCaml 5.0 and later, enabling frame pointers is recommended.
34+
35+
To enable frame pointers, configure the compiler with \texttt{--enable-frame-pointers}. Alternatively, you can install an opam switch with frame pointers enabled, as follows:
36+
37+
\begin{verbatim}
38+
opam switch create <YOUR-SWITCH-NAME-HERE> ocaml-option-fp
39+
\end{verbatim}
40+
41+
Frame pointer support for OCaml is available on x86_64 architecture for Linux starting with OCaml 4.12 and on macOS from OCaml 5.3. ARM64 architecture is supported on Linux and macOS from OCaml 5.4, while other Tier-1 architectures (POWER, RISC-V, and s390x) are currently unsupported.
42+
43+
\section{s:ocamlperf-profiling}{Profiling an Execution}
44+
45+
The basic \texttt{perf} command for profiling is:
46+
\begin{verbatim}
47+
perf record -F 99 --call-graph fp <YOUR_EXECUTABLE>
48+
\end{verbatim}
49+
50+
The \texttt{-F 99} option sets \texttt{perf} to sample at 99Hz, reducing excessive data generation during longer runs and minimising overlap with other periodic activities. The \texttt{--call-graph fp} instructs \texttt{perf} to use frame pointers to get the call graph, followed by the OCaml executable you want to profile. This command creates a \texttt{perf.data} file in the current directory. Alternatively use \texttt{--output} to choose a more descriptive filename.
51+
52+
The \texttt{perf record} command works by copying a segment of the call stack at each sample and recording this data into a \texttt{perf.data} file. These samples can then be processed after recording using \texttt{perf report} to reconstruct the profiled program’s call stack at every sample.
53+
54+
\texttt{perf} uses the symbols present in an OCaml executable, so it helps to understand OCaml's name mangling scheme to map names to OCaml source locations. OCaml 5.4 onwards uses a name mangling scheme of \texttt{caml<MODULE_NAME>\$<FUNCTION_NAME>_<NNN>}, where \texttt{NNN} is a randomly generated number. In OCaml 5.3 the separator is a dot \texttt{caml<MODULE_NAME>.<FUNCTION_NAME>_<NNN>}, and before OCaml 5.1 the separator is double underscores. All formats are supported by \texttt{perf}.
55+
56+
Consider the following program:
57+
58+
\begin{caml_example*}{verbatim}
59+
(* fib.ml *)
60+
module Compute = struct
61+
let rec fib n =
62+
if n = 0 then 0
63+
else if n = 1 then 1
64+
else fib (n-1) + fib (n-2)
65+
end
66+
67+
let main () =
68+
let r = Compute.fib 20 in
69+
Printf.printf "fib(20) = %d" r
70+
71+
let _ = main ()
72+
\end{caml_example*}
73+
74+
This program produces the names \texttt{camlFib\$main_274} for the \texttt{main} function and \texttt{camlFib\$fib_271} for the \texttt{fib} function in the \texttt{Compute} module.
75+
76+
\section{s:ocamlperf-printing}{Printing profiling information}
77+
78+
The \texttt{perf report} command summarises data in the \texttt{perf.data} file.
79+
The basic \texttt{perf report} command is:
80+
81+
\begin{verbatim}
82+
perf report -f --no-children -i perf.data
83+
\end{verbatim}
84+
85+
This command provides an interactive interface where you can navigate through the accumulated call graphs and select functions and threads for detailed information. Alternatively \texttt{--stdio} will output similar data using a text based report writing to stdout. Note that if stack traces appear broken, it may be due to software not having frame pointer support enabled.
86+
87+
Consider the following program which calculates the Tak function.
88+
\begin{caml_example*}{verbatim}
89+
let (x,y,z) =
90+
try
91+
let x = int_of_string Sys.argv.(1) in
92+
let y = int_of_string Sys.argv.(2) in
93+
let z = int_of_string Sys.argv.(3) in
94+
(x,y,z)
95+
with _ -> (18,12,6)
96+
97+
let rec tak x y z =
98+
if x > y then
99+
tak (tak (x-1) y z) (tak (y-1) z x) (tak (z-1) x y)
100+
else z
101+
102+
let main () =
103+
let r = tak x y z in
104+
Printf.printf "tak %d %d %d = %d\n" x y z r
105+
106+
let _ = main ()
107+
\end{caml_example*}
108+
109+
The \texttt{perf} report for this might resemble the following.
110+
\begin{verbatim}
111+
Samples: 809 of event 'cycles', Event count (approx.): 24701952617
112+
Overhead Command Shared Object Symbol
113+
- 100.00% tak-fp.exe tak-fp.exe [.] Tak\$tak_402
114+
Tak\$tak_402
115+
Tak\$tak_402
116+
+ Tak\$tak_402
117+
0.00% tak-fp.exe [kernel.kallsyms] [k] 0xffffb9a5ff79d854
118+
0.00% perf-exec [kernel.kallsyms] [k] 0xffffb9a5ff719c34
119+
\end{verbatim}
120+
121+
For a more interesting program implementing a Sudoku solver using Norvig's Backtracking algorithm, the output might resemble.
122+
\begin{verbatim}
123+
Samples: 7K of event 'cycles:P', Event count (approx.): 141874642467
124+
Overhead Command Shared Object Symbol
125+
- 35.90% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig._pow_747
126+
- 35.29% Sudoku_norvig$_pow_747
127+
+ 17.70% Sudoku_norvig$eliminate_value_760
128+
+ 15.74% Sudoku_norvig$_pow_747
129+
+ 1.03% Sudoku_norvig$fun_1435
130+
0.61% Sudoku_norvig$eliminate_value_760
131+
- 12.99% sudoku_norvig.e sudoku_norvig.exe [.] BatList$findnext_1543
132+
- 12.00% BatList$findnext_1543
133+
- 11.71% BatList$find_all_1539
134+
+ 8.07% Sudoku_norvig$fun_1421
135+
+ 2.73% Sudoku_norvig$eliminate_948
136+
+ 0.90% Sudoku_norvig$assign_949
137+
- 10.63% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig$eliminate_value_760
138+
+ 8.29% Sudoku_norvig$eliminate_value_760
139+
+ 1.93% Sudoku_norvig$_pow_747
140+
+ 9.50% sudoku_norvig.e [kernel.kallsyms] [k] asm_exc_nmi
141+
+ 7.24% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig$fun_1435
142+
+ 5.06% sudoku_norvig.e sudoku_norvig.exe [.] caml_modify
143+
+ 3.10% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig$fun_1360
144+
+ 2.11% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig$eliminate_948
145+
+ 2.00% sudoku_norvig.e sudoku_norvig.exe [.] Stdlib.List$iter_366
146+
+ 1.54% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig$fun_1421
147+
+ 1.47% sudoku_norvig.e sudoku_norvig.exe [.] caml_apply2
148+
+ 1.42% sudoku_norvig.e sudoku_norvig.exe [.] BatList$find_all_1539
149+
+ 1.20% sudoku_norvig.e sudoku_norvig.exe [.] Sudoku_norvig$fun_1366
150+
+ 1.09% sudoku_norvig.e sudoku_norvig.exe [.] Stdlib.List$fold_left_380
151+
\end{verbatim}
152+
Note this implementation clearly has some optimsation possible in `_pow_747` a recursive
153+
Power of 2 function and parts of the OCaml runtime `caml_modify` are present.
154+
155+
Profiling data can also be visualised as Flame Graphs, which highlight the most frequent code paths in stack traces. The original scripts scripts \texttt{stackcollapse-perf.pl} and \texttt{flamegraph.pl} can be found at Brendan Gregg's \href{https://www.brendangregg.com/flamegraphs.html}{Flame Graphs} web page and work as follows:
156+
157+
\begin{verbatim}
158+
git clone https://github.com/brendangregg/FlameGraph
159+
cd FlameGraph
160+
161+
# Collect the results into perf.data
162+
perf record -F 99 --call-graph fp <YOUR_EXECUTABLE>
163+
perf script -i perf.data | ./stackcollapse-perf.pl out.folded
164+
flamegraph.pl out.folded > flamegraph.svg ## Create the FlameGraph svg
165+
\end{verbatim}
166+
167+
Alternatively \href{https://github.com/jonhoo/inferno}{inferno} is a Rust port of the Flame Graphs tools which works in a similar way and is faster to process large perf files.
168+
169+
\begin{verbatim}
170+
cargo install inferno
171+
172+
# Collect the results into perf.data
173+
perf script -i perf.data | inferno-collapse-perf > stacks.folded
174+
cat stacks.folded | inferno-flamegraph > flamegraph.svg
175+
\end{verbatim}
176+
177+
Some \texttt{perf} tools (e.g., \texttt{perf-report(1)} and \texttt{perf-annotate(1)}) use DWARF debugging information to associate symbols with source code locations, if you need these features, the program and it's dependencies need to be compiled with \texttt{-g} to include debugging information in the executable.
178+
179+
Captured profile data can also be processed using \texttt{perf script} in various ways or with online tools like \href{https://www.speedscope.app}{speedscope.app} and \href{https://profiler.firefox.com/}{profiler.firefox.com}, or any other tool that supports \texttt{perf}-formatted data.
180+
181+
\section{s:ocamlperf-conclusion}{Conclusion}
182+
183+
For CPU profiling of native code, standard tools such as \texttt{perf}, eBPF, DTrace, or Instruments (on macOS) are recommended. Compiling with frame pointers enabled is often necessary for these tools to work effectively. Profiling with \texttt{gprof} is no longer supported.
184+
185+
Enabling frame pointers can impact performance on certain architectures (up to 10\% performance cost on x86_64 has been measured). Users of this feature are encouraged to benchmark their own applications to assess this impact.
186+
187+
\section{s:ocamlperf-glossary}{Glossary}
188+
189+
The following terminology is used in this chapter of the manual.
190+
191+
\begin{itemize}
192+
\item[{\bf Call graph}] The chain of function calls that have lead to the current function (also referred to as a call stack).
193+
\item[{\bf Unwinding}] The process of restoring the program's state to how it was before some function(s) were called and possibly giving a profiler or debugger much more information. (also called stack unwinding).
194+
\item[{\bf Stack frame}] Refers to the portion of the stack allocated to a single function call. (also called an activation frame, activation record or simply frame).
195+
\end{itemize}

0 commit comments

Comments
 (0)