You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This chapter describes how to use \texttt{perf} to profile OCaml programs.
5
+
6
+
Linux Performance Events (\texttt{perf (1)}) is a suite of tools for performance observability. The main features covered here are \texttt{perf-record(1)} for recording events, and \texttt{perf-report(1)} for printing and visualising recorded events. \texttt{perf} has many additional profiling and visualising features, for more comprehensive documentation, see (\texttt{perf(1)}, \href{https://perfwiki.github.io/main/}{\texttt{perf} wiki} or \href{https://www.brendangregg.com/perf.html}{Brendan Gregg's Blog}).
7
+
8
+
\section{s:ocamlperf-call-graph}{Background}
9
+
10
+
CPU profiling is typically performed by sampling the CPU call graph at frequent intervals to gather statistics on the code paths that are consuming CPU resources. To profile OCaml code, \texttt{perf} needs to understand the call graph of OCaml. \texttt{perf} supports multiple options for recording call graphs:
11
+
\begin{itemize}
12
+
\item Frame Pointers, which is the default.
13
+
\item DWARF's Call Frame Information (CFI).
14
+
\item Hardware Last Branch Record (LBR).
15
+
\end{itemize}
16
+
17
+
Frame pointer based call graphs use a convention where a register called the frame pointer holds the address for the beginning of the stack frame, and the previous values for the frame pointer are stored at a known offset from the current frame pointer. Using this information the call graph for the current function can be calculated purely based on the current stack, a process called unwinding. On x86_64 the register use for storing the frame pointer is \$rbp, while ARM64 uses the register x29. OCaml 5 introduced non-contiguous stacks as part of the implementation of effects, see \href{https://dl.acm.org/doi/10.1145/3453483.3454039}{Retrofitting effect handlers onto OCaml} (Section 5.5), which work better with the copying nature of perf and frame pointers.
18
+
19
+
DWARF based call graphs use the DWARF CFI information to perform unwinding. However this interacts poorly with the copying nature of perf, often leading to truncated call graphs where not enough of the stack has been copied by \texttt{perf}. It also produces larger trace files that are more costly to capture and process. Finally it requires including CFI debugging information in your program resulting in larger binaries.
20
+
21
+
Hardware Last Branch Record (LBR) uses a processor provided method to record call graphs. This has the limitations of restricted availability (only on certain Intel CPUs) and a limited stack depth (16 on Haswell and 32 since Skylake).
22
+
23
+
Of these options, frame pointers are recommended for profiling OCaml code. It has the following advantages:
24
+
\begin{itemize}
25
+
\item Unwinding is faster to calculate.
26
+
\item Tracing data produced is smaller.
27
+
\item Frame pointers provide more complete call graphs, particularly when used with a Linux distribution that supports them.
28
+
\item Frame pointers work better with perf's copying nature and OCaml 5's non-contiguous stacks.
29
+
\end{itemize}
30
+
31
+
\section{s:ocamlperf-compiling}{Compiling for Profiling}
32
+
33
+
The process for compiling an executable for CPU profiling depends on the OCaml version. For OCaml versions 4.14 and earlier, either frame pointers or DWARF can be used, while for OCaml 5.0 and later, enabling frame pointers is recommended.
34
+
35
+
To enable frame pointers, configure the compiler with \texttt{--enable-frame-pointers}. Alternatively, you can install an opam switch with frame pointers enabled, as follows:
Frame pointer support for OCaml is available on x86_64 architecture for Linux starting with OCaml 4.12 and on macOS from OCaml 5.3. ARM64 architecture is supported on Linux and macOS from OCaml 5.4, while other Tier-1 architectures (POWER, RISC-V, and s390x) are currently unsupported.
42
+
43
+
\section{s:ocamlperf-profiling}{Profiling an Execution}
44
+
45
+
The basic \texttt{perf} command for profiling is:
46
+
\begin{verbatim}
47
+
perf record -F 99 --call-graph fp <YOUR_EXECUTABLE>
48
+
\end{verbatim}
49
+
50
+
The \texttt{-F 99} option sets \texttt{perf} to sample at 99Hz, reducing excessive data generation during longer runs and minimising overlap with other periodic activities. The \texttt{--call-graph fp} instructs \texttt{perf} to use frame pointers to get the call graph, followed by the OCaml executable you want to profile. This command creates a \texttt{perf.data} file in the current directory. Alternatively use \texttt{--output} to choose a more descriptive filename.
51
+
52
+
The \texttt{perf record} command works by copying a segment of the call stack at each sample and recording this data into a \texttt{perf.data} file. These samples can then be processed after recording using \texttt{perf report} to reconstruct the profiled program’s call stack at every sample.
53
+
54
+
\texttt{perf} uses the symbols present in an OCaml executable, so it helps to understand OCaml's name mangling scheme to map names to OCaml source locations. OCaml 5.4 onwards uses a name mangling scheme of \texttt{caml<MODULE_NAME>\$<FUNCTION_NAME>_<NNN>}, where \texttt{NNN} is a randomly generated number. In OCaml 5.3 the separator is a dot \texttt{caml<MODULE_NAME>.<FUNCTION_NAME>_<NNN>}, and before OCaml 5.1 the separator is double underscores. All formats are supported by \texttt{perf}.
55
+
56
+
Consider the following program:
57
+
58
+
\begin{caml_example*}{verbatim}
59
+
(* fib.ml *)
60
+
module Compute = struct
61
+
let rec fib n =
62
+
if n = 0 then 0
63
+
else if n = 1 then 1
64
+
else fib (n-1) + fib (n-2)
65
+
end
66
+
67
+
let main () =
68
+
let r = Compute.fib 20 in
69
+
Printf.printf "fib(20) = %d" r
70
+
71
+
let _ = main ()
72
+
\end{caml_example*}
73
+
74
+
This program produces the names \texttt{camlFib\$main_274} for the \texttt{main} function and \texttt{camlFib\$fib_271} for the \texttt{fib} function in the \texttt{Compute} module.
The \texttt{perf report} command summarises data in the \texttt{perf.data} file.
79
+
The basic \texttt{perf report} command is:
80
+
81
+
\begin{verbatim}
82
+
perf report -f --no-children -i perf.data
83
+
\end{verbatim}
84
+
85
+
This command provides an interactive interface where you can navigate through the accumulated call graphs and select functions and threads for detailed information. Alternatively \texttt{--stdio} will output similar data using a text based report writing to stdout. Note that if stack traces appear broken, it may be due to software not having frame pointer support enabled.
86
+
87
+
Consider the following program which calculates the Tak function.
88
+
\begin{caml_example*}{verbatim}
89
+
let (x,y,z) =
90
+
try
91
+
let x = int_of_string Sys.argv.(1) in
92
+
let y = int_of_string Sys.argv.(2) in
93
+
let z = int_of_string Sys.argv.(3) in
94
+
(x,y,z)
95
+
with _ -> (18,12,6)
96
+
97
+
let rec tak x y z =
98
+
if x > y then
99
+
tak (tak (x-1) y z) (tak (y-1) z x) (tak (z-1) x y)
100
+
else z
101
+
102
+
let main () =
103
+
let r = tak x y z in
104
+
Printf.printf "tak %d %d %d = %d\n" x y z r
105
+
106
+
let _ = main ()
107
+
\end{caml_example*}
108
+
109
+
The \texttt{perf} report for this might resemble the following.
110
+
\begin{verbatim}
111
+
Samples: 809 of event 'cycles', Event count (approx.): 24701952617
Note this implementation clearly has some optimsation possible in `_pow_747` a recursive
153
+
Power of 2 function and parts of the OCaml runtime `caml_modify` are present.
154
+
155
+
Profiling data can also be visualised as Flame Graphs, which highlight the most frequent code paths in stack traces. The original scripts scripts \texttt{stackcollapse-perf.pl} and \texttt{flamegraph.pl} can be found at Brendan Gregg's \href{https://www.brendangregg.com/flamegraphs.html}{Flame Graphs} web page and work as follows:
flamegraph.pl out.folded > flamegraph.svg ## Create the FlameGraph svg
165
+
\end{verbatim}
166
+
167
+
Alternatively \href{https://github.com/jonhoo/inferno}{inferno} is a Rust port of the Flame Graphs tools which works in a similar way and is faster to process large perf files.
Some \texttt{perf} tools (e.g., \texttt{perf-report(1)} and \texttt{perf-annotate(1)}) use DWARF debugging information to associate symbols with source code locations, if you need these features, the program and it's dependencies need to be compiled with \texttt{-g} to include debugging information in the executable.
178
+
179
+
Captured profile data can also be processed using \texttt{perf script} in various ways or with online tools like \href{https://www.speedscope.app}{speedscope.app} and \href{https://profiler.firefox.com/}{profiler.firefox.com}, or any other tool that supports \texttt{perf}-formatted data.
180
+
181
+
\section{s:ocamlperf-conclusion}{Conclusion}
182
+
183
+
For CPU profiling of native code, standard tools such as \texttt{perf}, eBPF, DTrace, or Instruments (on macOS) are recommended. Compiling with frame pointers enabled is often necessary for these tools to work effectively. Profiling with \texttt{gprof} is no longer supported.
184
+
185
+
Enabling frame pointers can impact performance on certain architectures (up to 10\% performance cost on x86_64 has been measured). Users of this feature are encouraged to benchmark their own applications to assess this impact.
186
+
187
+
\section{s:ocamlperf-glossary}{Glossary}
188
+
189
+
The following terminology is used in this chapter of the manual.
190
+
191
+
\begin{itemize}
192
+
\item[{\bf Call graph}] The chain of function calls that have lead to the current function (also referred to as a call stack).
193
+
\item[{\bf Unwinding}] The process of restoring the program's state to how it was before some function(s) were called and possibly giving a profiler or debugger much more information. (also called stack unwinding).
194
+
\item[{\bf Stack frame}] Refers to the portion of the stack allocated to a single function call. (also called an activation frame, activation record or simply frame).
0 commit comments