Commit 9c5c8dd

CP-52709: use timeslices shorter than 50ms (#6177)
# Changing the default OCaml thread switch timeslice from 50ms

The default OCaml 4.x timeslice for switching between threads is 50ms: if there is more than one active OCaml thread, each one is allowed to run for up to 50ms, and then (at various safepoints) it can switch to another running thread. When the runtime lock is released (and C code or syscalls run), another OCaml thread is immediately allowed to run, if there is one.

However, 50ms is too long, and it inserts large latencies into the handling of API calls.

On the other hand, if a timeslice is too short then we waste CPU time:

* overhead of the Thread.yield system call, and the cost of switching threads at the OS level
* potentially higher L1/L2 cache misses if we switch on the same CPU between multiple OCaml threads
* potentially losing branch predictor history
* potentially higher L3 cache misses (but on a hypervisor with VMs running, L3 will be mostly taken up by VMs anyway, so we can only rely on L1/L2 staying with us)

A microbenchmark has shown that timeslices as small as 0.5ms might strike an optimal balance between latency and overhead: values lower than that lose performance due to increased overhead, and values higher than that lose performance due to increased latency:

![auto_p](https://github.com/user-attachments/assets/3751291b-8f64-4d70-9a65-9c3fdb053955)
![auto_pr](https://github.com/user-attachments/assets/3b710484-87ba-488a-9507-7916c85aab20)

(the microbenchmark measures the number of CPU cycles spent simulating an API call with various working set sizes and timeslice settings)

This is all hardware dependent though, and a future PR will introduce an autotune service that measures the yield overhead and the L1/L2 cache refill overhead, and calculates an optimal timeslice for that particular hardware/Xen/kernel combination (and while we're at it, we can also tweak the minor heap size to match ~half of the CPU L2 cache).

# Timeslice change mechanism

Initially I used `Unix.set_itimer` with virtual timers, to switch a thread only when it has been actively using the CPU for too long. However, that relies on delivering a signal to the process, and XAPI is very bad at handling signals.

In fact XAPI is not allowed to receive any signals, because it doesn't handle EINTR well (a typical problem that sometimes affects C programs too). Although this is a well understood problem (described in the [OCaml Unix book](https://ocaml.github.io/ocamlunix/ocamlunix.html#sec88)), and some areas of XAPI make an effort to handle it, others just assert that they never receive one. Fixing that would require changes in all of XAPI (and its dependencies).

So instead I don't use signals at all, but rely on Statmemprof to trigger a hook that is executed "periodically", based not purely on time but on allocation activity (i.e. at places where the GC could run). The hook checks the time elapsed since it last yielded, and if that exceeds the configured interval it calls Thread.yield. Yield is smart enough to be a no-op if there aren't any other runnable OCaml threads.

Yield isn't always beneficial for reducing latencies though: if we are holding locks, then yielding just increases latency for everyone who waits for that lock. So a mechanism is introduced to notify the periodic function when a highly contended lock is held (e.g. the XAPI DB lock), and the yield is skipped in that case.

# Plotting code

This PR only includes a very simplified version of the microbenchmark; a separate PR will introduce the full cache plotting code (which is useful for development/troubleshooting purposes but won't be needed at runtime).

# Default timeslice value

Set to 5ms for now, just a bit above 4ms = 1/HZ in our Dom0 kernel. The autotuner from a future PR can change this to a more appropriate value (the autotuner needs more validation on a wider range of hardware).

# Results

The cache measurements need to be repeated on a wider variety of hardware, but the timeslice changes here have already proven useful in reducing XAPI DB lock hold times (together with other optimizations).
2 parents b418d69 + 93f85be commit 9c5c8dd

11 files changed, +330 -1 lines changed

ocaml/libs/timeslice/dune

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
(library
  (name xapi_timeslice)
  (package xapi-idl)
  (libraries threads.posix mtime mtime.clock.os xapi-log)
)

ocaml/libs/timeslice/recommended.ml

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
(*
 * Copyright (C) Cloud Software Group
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published
 * by the Free Software Foundation; version 2.1 only. with the special
 * exception on linking described in file LICENSE.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Lesser General Public License for more details.
 *)

module D = Debug.Make (struct let name = "timeslice_recommended" end)

let yield_stop = Atomic.make false

let yield_worker () =
  while not (Atomic.get yield_stop) do
    Thread.yield ()
  done

let yield_overhead () =
  (* Thread.yield only has an effect if another thread exists,
     so create one that yields back immediately *)
  D.debug "Measuring Thread.yield overhead" ;
  Atomic.set yield_stop false ;
  let t = Thread.create yield_worker () in
  let measured = Simple_measure.measure Thread.yield in
  D.debug "Thread.yield overhead: %.6fs <= %.6fs <= %.6fs" measured.low
    measured.median measured.high ;
  D.debug "Waiting for worker thread to stop" ;
  Atomic.set yield_stop true ;
  Thread.join t ;
  measured.median

let measure ?(max_overhead_percentage = 1.0) () =
  let overhead = yield_overhead () in
  let interval = overhead /. (max_overhead_percentage /. 100.) in
  D.debug "Recommended timeslice interval = %.4fs" interval ;
  (* Avoid too high or too low intervals:
     do not go below 1ms (our HZ is 250, and max is 1000, the kernel would round up anyway)
     do not go above 50ms (the current default in OCaml 4.14)
  *)
  let interval = interval |> Float.max 0.001 |> Float.min 0.050 in
  D.debug "Final recommended timeslice interval = %.4fs" interval ;
  interval
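
The interval arithmetic in `measure` can be tried in isolation. The following is a minimal sketch, not part of the commit: `recommended_interval` is a hypothetical name, and the yield overhead is passed in as a plain number of seconds instead of being measured with threads.

```ocaml
(* Sketch of the clamping arithmetic only. [recommended_interval] is a
   hypothetical helper; [overhead] is an assumed Thread.yield cost in seconds. *)
let recommended_interval ?(max_overhead_percentage = 1.0) overhead =
  (* smallest interval for which overhead / interval stays within budget *)
  let interval = overhead /. (max_overhead_percentage /. 100.) in
  (* same bounds as the commit: never below 1ms, never above 50ms *)
  interval |> Float.max 0.001 |> Float.min 0.050

let () =
  List.iter
    (fun overhead ->
      Printf.printf "overhead %.6fs -> interval %.4fs\n" overhead
        (recommended_interval overhead))
    [1e-6; 20e-6; 1e-3]
```

With the default 1% budget, an assumed overhead of 20µs gives an interval of about 2ms, while 1µs and 1ms fall outside the bounds and are clamped to 1ms and 50ms respectively.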

ocaml/libs/timeslice/recommended.mli

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
(*
 * Copyright (C) Cloud Software Group
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published
 * by the Free Software Foundation; version 2.1 only. with the special
 * exception on linking described in file LICENSE.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Lesser General Public License for more details.
 *)

val measure : ?max_overhead_percentage:float -> unit -> float
(** [measure ?max_overhead_percentage ()] returns the recommended timeslice for the current system.

    The returned value should be used in a call to {!val:Timeslice.set}.

    @param max_overhead_percentage default 1%
    @returns [interval] such that [overhead / interval <= max_overhead_percentage / 100]
 *)

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
(*
 * Copyright (C) Cloud Software Group
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published
 * by the Free Software Foundation; version 2.1 only. with the special
 * exception on linking described in file LICENSE.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Lesser General Public License for more details.
 *)

(** 95% confidence interval, and median value *)
type t = {low: float; median: float; high: float}

let span_to_s s = Mtime.Span.to_float_ns s *. 1e-9

let ci95 measurements =
  let n = Array.length measurements in
  Array.sort Float.compare measurements ;
  let median = measurements.(n / 2) in
  (* "Performance Evaluation of Computer and Communication Systems", Table A. 1 *)
  let n = float n in
  let d = 0.98 *. sqrt n in
  let lo = (n /. 2.) -. d |> Float.to_int
  and hi = (n /. 2.) +. 1. +. d |> Float.ceil |> Float.to_int in
  {low= measurements.(lo - 1); median; high= measurements.(hi - 1)}

let measure ?(n = 1001) ?(inner = 10) f =
  if n <= 70 then (* some of the formulas below are not valid for smaller [n] *)
    invalid_arg (Printf.sprintf "n must be more than 70: %d" n) ;
  (* warmup *)
  Sys.opaque_identity (f ()) ;

  let measure_inner _ =
    let m = Mtime_clock.counter () in
    for _ = 1 to inner do
      (* opaque_identity prevents the call from being optimized away *)
      Sys.opaque_identity (f ())
    done ;
    let elapsed = Mtime_clock.count m in
    span_to_s elapsed /. float inner
  in
  let measurements = Array.init n measure_inner in
  ci95 measurements

let measure_min ?(n = 1001) f arg =
  (* warmup *)
  Sys.opaque_identity (f arg) ;
  let measure_one _ =
    let m = Mtime_clock.counter () in
    Sys.opaque_identity (f arg) ;
    let elapsed = Mtime_clock.count m in
    span_to_s elapsed
  in
  Seq.ints 0
  |> Seq.take n
  |> Seq.map measure_one
  |> Seq.fold_left Float.min Float.max_float
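
To see which order statistics `ci95` selects, the rank arithmetic can be pulled out on its own. This is a sketch under the assumption that only the indices are of interest; `ci95_ranks` is a hypothetical helper that does not exist in the commit, and the ranks it returns are 1-based (the code above subtracts 1 before indexing the sorted array).

```ocaml
(* Same rank arithmetic as [ci95], without the measurements.
   [ci95_ranks] is a hypothetical helper introduced for illustration. *)
let ci95_ranks n =
  let nf = float_of_int n in
  let d = 0.98 *. sqrt nf in
  let lo = (nf /. 2.) -. d |> Float.to_int
  and hi = (nf /. 2.) +. 1. +. d |> Float.ceil |> Float.to_int in
  (lo, hi)

let () =
  List.iter
    (fun n ->
      let lo, hi = ci95_ranks n in
      Printf.printf "n=%d: low is rank %d, high is rank %d\n" n lo hi)
    [5; 71; 1001]
```

For the default `n = 1001` this picks ranks 469 and 533 around the median at index 500. For a very small sample such as `n = 5` the lower rank becomes 0, which would index the array at -1, so some lower bound on `n` is needed regardless of the statistical validity of the formula.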
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
(*
 * Copyright (C) Cloud Software Group
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published
 * by the Free Software Foundation; version 2.1 only. with the special
 * exception on linking described in file LICENSE.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Lesser General Public License for more details.
 *)

(** Measure the speed of an operation in a very simple and robust way.
    More detailed measurements can be done using [Bechamel].
*)

(** 95% confidence interval, and median value *)
type t = {low: float; median: float; high: float}

val measure : ?n:int -> ?inner:int -> (unit -> unit) -> t
(** [measure ?n ?inner f] measures [n] times the duration of [inner] iterations of [f ()].

    Returns the median of the inner measurements, and a 95% confidence interval.
    The median is used, because it makes no assumptions about the distribution of the samples,
    i.e. it doesn't require a normal (Gaussian) distribution.

    The inner measurements use a simple average, because we only know the duration of [inner] iterations,
    not the duration of each individual call to [f ()].
    The purpose of the [inner] iterations is to reduce measurement overhead.

    @param n iteration count for the outer loop, must be more than [70].
    @param inner iteration count for the inner loop
    @param f function to measure

    @raises Invalid_argument if [n <= 70]
*)

val measure_min : ?n:int -> ('a -> unit) -> 'a -> float
(** [measure_min ?n f arg] is the minimum amount of time that [f arg] takes.

    This should be used when we try to measure the maximum speed of some operation (e.g. cached memory accesses),
    while ignoring latencies/hiccups introduced by other processes on the system.

    It shouldn't be used for measuring the overhead of an operation, because the hiccups may be part of that overhead.
*)
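
The difference between the two summaries documented above can be illustrated on a fixed set of numbers. The sketch below uses made-up timings (an assumption, not measurements from the PR) with one disturbed run, and compares the minimum, the median and the mean using only the standard library.

```ocaml
(* Hypothetical timings in seconds: four quiet runs and one run that was
   disturbed by another process. These values are invented for illustration. *)
let samples = [|1.0e-6; 1.1e-6; 0.9e-6; 5.0e-6; 1.0e-6|]

(* what [measure_min] reports: the fastest run observed *)
let minimum a = Array.fold_left Float.min Float.max_float a

(* what [measure] reports as [median]: the middle element of the sorted runs *)
let median a =
  let sorted = Array.copy a in
  Array.sort Float.compare sorted ;
  sorted.(Array.length sorted / 2)

(* shown only for contrast: the mean is pulled up by the disturbed run *)
let mean a = Array.fold_left ( +. ) 0. a /. float_of_int (Array.length a)

let () =
  Printf.printf "min=%.2e median=%.2e mean=%.2e\n" (minimum samples)
    (median samples) (mean samples)
```

The minimum discards the disturbed run entirely, which suits "best case" measurements such as cache access speed. The median still reflects typical behaviour without being dominated by the outlier, which is why it is the better fit when the disturbance is part of the cost being measured.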

ocaml/libs/timeslice/timeslice.ml

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
(*
 * Copyright (C) Cloud Software Group
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published
 * by the Free Software Foundation; version 2.1 only. with the special
 * exception on linking described in file LICENSE.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Lesser General Public License for more details.
 *)

(* avoid allocating an extra option every time *)
let invalid_holder = -1

let last_lock_holder = Atomic.make invalid_holder

let me () = Thread.self () |> Thread.id

let lock_acquired () =
  (* these need to be very low overhead, so just keep track of the last lock holder,
     i.e. track only one high-priority lock at a time
  *)
  Atomic.set last_lock_holder (me ())

let lock_released () = Atomic.set last_lock_holder invalid_holder

let[@inline always] am_i_holding_locks () =
  let last = Atomic.get last_lock_holder in
  last <> invalid_holder && last = me ()

let yield_interval = Atomic.make Mtime.Span.zero

(* TODO: use bechamel.monotonic-clock instead, which has lower overhead,
   but not in the right place in xs-opam yet
*)
let last_yield = Atomic.make (Mtime_clock.counter ())

let failures = Atomic.make 0

let periodic_hook (_ : Gc.Memprof.allocation) =
  let () =
    try
      if not (am_i_holding_locks ()) then
        let elapsed = Mtime_clock.count (Atomic.get last_yield) in
        if Mtime.Span.compare elapsed (Atomic.get yield_interval) > 0 then (
          let now = Mtime_clock.counter () in
          Atomic.set last_yield now ; Thread.yield ()
        )
    with _ ->
      (* It is not safe to raise exceptions here, it'd require changing all code to be safe to asynchronous interrupts/exceptions,
         see https://guillaume.munch.name/software/ocaml/memprof-limits/index.html#isolation
         Because this is just a performance optimization, we fall back to safe behaviour: do nothing, and just keep track that we failed
      *)
      Atomic.incr failures
  in
  None

let periodic =
  Gc.Memprof.
    {null_tracker with alloc_minor= periodic_hook; alloc_major= periodic_hook}

let set ?(sampling_rate = 1e-4) interval =
  Atomic.set yield_interval
    (Mtime.Span.of_float_ns @@ (interval *. 1e9) |> Option.get) ;
  Gc.Memprof.start ~sampling_rate ~callstack_size:0 periodic

let clear () =
  Gc.Memprof.stop () ;
  Atomic.set yield_interval Mtime.Span.zero
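
Because the hook only runs when Memprof samples an allocation, the effective timeslice depends on when those samples happen. The following is a simplified simulation of the decision made in `periodic_hook`, under stated assumptions: `simulate_yields` is a hypothetical function, timestamps are plain integers in milliseconds rather than `Mtime` values, and each event carries a flag saying whether a tracked lock was held at that moment.

```ocaml
(* Simulation of the yield decision only. [simulate_yields] is hypothetical:
   [events] is a list of (time of hook invocation in ms, tracked lock held?),
   and the result is the list of times at which a yield would be requested. *)
let simulate_yields ~interval_ms events =
  let _, yields =
    List.fold_left
      (fun (last_yield, yields) (now, holding_lock) ->
        (* strictly greater than the interval, as in [periodic_hook] *)
        if (not holding_lock) && now - last_yield > interval_ms then
          (now, now :: yields)
        else
          (last_yield, yields))
      (0, []) events
  in
  List.rev yields

let () =
  let events =
    [(1, false); (3, false); (6, false); (7, false); (12, true); (13, false); (30, false)]
  in
  simulate_yields ~interval_ms:5 events
  |> List.iter (Printf.printf "yield at %dms\n")
```

In this run the hook fires at 12ms with more than 5ms elapsed, but the yield is skipped because the lock is held, and it happens at the next sample instead. The gap between 13ms and 30ms shows the other limitation: if the program does not allocate, no hook runs and the thread keeps the CPU for longer than the configured interval.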

ocaml/libs/timeslice/timeslice.mli

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
(*
 * Copyright (C) Cloud Software Group
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published
 * by the Free Software Foundation; version 2.1 only. with the special
 * exception on linking described in file LICENSE.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Lesser General Public License for more details.
 *)

val set : ?sampling_rate:float -> float -> unit
(** [set ?sampling_rate interval] calls [Thread.yield ()] at most once every [interval] seconds.

    The implementation of [Thread.yield] guarantees since OCaml 4.09 that we'll switch to a different OCaml thread,
    if one exists that is not blocked (i.e. it doesn't rely on [sched_yield] which may run the same thread again,
    but uses pthread mutexes and condition variables to ensure the current thread isn't immediately runnable).

    The setting is global for the entire process, and currently uses [Gc.Memprof] to ensure that a hook function is called periodically,
    although it depends on the allocation rate of the program whether it gets called at all.

    Another alternative would be to use {!val:Unix.set_itimer}, but XAPI doesn't cope with [EINTR] in a lot of places,
    and POSIX interval timers rely on signals to notify of elapsed time.

    We could also have a dedicated thread that sleeps for a certain amount of time, but if it is an OCaml thread,
    we'd have no guarantees it'd get scheduled often enough (and it couldn't interrupt other threads anyway,
    by the time you'd be running the handler you already gave up running something else).

    It may be desirable to avoid yielding if we are currently holding a lock, see {!val:lock_acquired}, and {!val:lock_released}
    to notify this module when that happens.
*)

val clear : unit -> unit
(** [clear ()] undoes the changes made by [set].
    This is useful for testing multiple timeslices in the same program. *)

val lock_acquired : unit -> unit
(** [lock_acquired ()] notifies about lock acquisition. *)

val lock_released : unit -> unit
(** [lock_released ()] notifies about lock release. *)
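
One way a caller might pair the two notifications around a critical section is sketched below. This is an assumption about usage, not code from the commit: `with_tracked_lock` is a hypothetical helper, and the lock operations and notifications are passed in as callbacks (standing in for e.g. `Mutex.lock`, `Mutex.unlock`, `Timeslice.lock_acquired` and `Timeslice.lock_released`) so that the sketch runs with the standard library alone.

```ocaml
(* [with_tracked_lock] is a hypothetical helper. The callbacks are stand-ins
   for a real lock and for the notification functions of this module. *)
let with_tracked_lock ~lock ~unlock ~lock_acquired ~lock_released f =
  lock () ;
  lock_acquired () ;
  (* [Fun.protect] runs the cleanup even if [f] raises, so the
     "holding a lock" marker cannot be left set by accident *)
  Fun.protect
    ~finally:(fun () ->
      (* clear the marker while still holding the lock, so that it cannot
         overwrite the marker of the next thread that acquires it *)
      lock_released () ;
      unlock ())
    f

let () =
  let log = ref [] in
  let record name () = log := name :: !log in
  let result =
    with_tracked_lock ~lock:(record "lock") ~unlock:(record "unlock")
      ~lock_acquired:(record "lock_acquired")
      ~lock_released:(record "lock_released")
      (fun () -> record "critical section" () ; 42)
  in
  List.iter print_endline (List.rev !log) ;
  Printf.printf "result = %d\n" result
```

The ordering in the cleanup is a design choice of this sketch: since the module tracks a single global holder, clearing it before unlocking avoids erasing a marker that another thread may have set right after taking the lock.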

ocaml/tests/common/dune

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@
     xapi-stdext-date
     xapi-stdext-threads.scheduler
     xapi-stdext-unix
+    xapi_timeslice
   )
 )

ocaml/tests/common/suite_init.ml

Lines changed: 3 additions & 1 deletion
@@ -11,4 +11,6 @@ let harness_init () =
     Filename.concat Test_common.working_area "xapi-inventory" ;
   Xcp_client.use_switch := false ;
   Pool_role.set_pool_role_for_test () ;
-  Message_forwarding.register_callback_fns ()
+  Message_forwarding.register_callback_fns () ;
+  (* for unit tests use a fixed value *)
+  Xapi_timeslice.Timeslice.set 0.004

ocaml/xapi-idl/lib/dune

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@
     unix
     uri
     uuidm
+    xapi_timeslice
     xapi-backtrace
     xapi-consts
     xapi-log
