Skip to content

[fix](kt-kernel): eliminate clock_gettime hot-path spin in worker_pool#2022

Open
hermannklie wants to merge 2 commits into
kvcache-ai:mainfrom
Bardo-AI:fix/worker-pool-busy-spin
Open

[fix](kt-kernel): eliminate clock_gettime hot-path spin in worker_pool#2022
hermannklie wants to merge 2 commits into
kvcache-ai:mainfrom
Bardo-AI:fix/worker-pool-busy-spin

Conversation

@hermannklie

@hermannklie hermannklie commented May 22, 2026

Copy link
Copy Markdown

Update (2026-05-25, after review) — supersedes the throughput claims below

Thanks to @yyj6666667 and @usrlocalben. Two corrections to this description:

1. The decode-throughput claim (+8–12 %, ~19 → ~20.5 tok/s) is retracted.
It was an artifact of a pathological local state on our box: our profile sits
at ~78 % idle-worker WAITING-spin vs ~12 % compute — i.e. our workers are
starved most of the time for reasons unrelated to this loop. On a healthily-fed
pool there is nothing for the patch to recover, which matches @usrlocalben's
null result on the same EPYC 9654 class. As @yyj6666667 notes, the
clock_gettime cost is paid by WAITING (idle) threads, not the WORKING compute
path, so removing it does not return cycles to decode. The patch's honest
effect is lower idle CPU / power and SMT-sibling relief — not throughput.

Please read every tok/s figure below as retracted.

2. The patch now also covers NumaJobDistributor::worker_thread (identical
_mm_pause() + sample-the-clock-every-100-iterations treatment), and the spin
counter is a plain local instead of static thread_local in both routines
(commit 9d6e02f). The "Applied fix" diff below still shows only the original
InNumaPool change.


kt-kernel: clock_gettime hot-path spin dominates MoE-inference CPU profile (78% [vdso])

Status: Verified locally; ready to submit upstream.
Target: kvcache-ai/ktransformers, kt-kernel/ subproject.
Affected file: kt-kernel/cpu_backend/worker_pool.cpp.
Last upstream touch of the file: commit bdf4bb7 ("Fix worker pool idle CPU usage", #1902).
Discovered while building from source at commit: 35fc6ca (origin/main, fetched 2026-05-22).
Type: Performance bug fix (CPU-waste pattern, no behavioural change to wake-up latency).
Patch file: kt-kernel_worker_pool_busy_spin.patch (sibling to this doc).


Summary

During active MoE inference, the textgen/ktransformers server spends
78.22 % of its CPU cycles in __vdso_clock_gettime — versus
12.77 % in the actual MoE compute kernel
(_kt_kernel_ext_avx512_bf16.so). The cost is not paid in idle but
during active decodes.

Root cause is the WAITING-state spin loop in
InNumaPool::worker_thread: it calls
std::chrono::high_resolution_clock::now() (= clock_gettime via
VDSO) on every iteration of the 50 ms pre-wait spin, with no
_mm_pause() and no thinning.

Existing PRs #1899 and #1902 fixed the idle case (worker reaches
cv.wait after 50 ms and goes to sleep). The hot-path — what
happens during those 50 ms while task bursts are arriving — was
not addressed.

Two minimal changes (_mm_pause() + 1-in-100 clock sampling) reduce
[vdso] CPU share from 78.22 % to 1.37 % (57× less), without
changing the 50 ms wake-up latency.


Steps to reproduce

Component Value
Hardware AMD EPYC 9654 (96C/192T, Zen4, AVX-512+BF16)
OS Debian 13 trixie / LMDE 7, Kernel 6.12
GPU NVIDIA RTX 3090 (24 GB) — irrelevant for CPU profile
kt-kernel upstream/main (HEAD 35fc6ca, file unchanged since bdf4bb7)
Model Qwen3-30B-A3B (48 layers, 128 experts/layer, 8 active/token, Q8_0 GGUF)
ktransformers backend balance_serve, kt_kernel _kt_kernel_ext_avx512_bf16 variant, method=LLAMAFILE
Threading cpuinfer_threads=96, threadpool_count=1 (NPS1, 1 NUMA node)
Decode rate (measured) ~19 tok/s (pre-fix) / ~20.5 tok/s (post-fix)

perf invocation:

PID=$(pgrep -f "server.py" | head -1)
perf record -F 999 -p $PID --call-graph=dwarf -o perf.data -- sleep 60
# During the 60 s: issue a query in the UI that runs at least 30 s of decode.
perf report --no-children -i perf.data --stdio --sort dso
perf report --no-children -i perf.data --stdio --sort symbol

Observation

Top shared objects (sort = DSO):

   Overhead  Shared Object
   --------  ---------------------------------------------------
    78.22%  [vdso]
    12.77%  _kt_kernel_ext_avx512_bf16.cpython-313-x86_64-linux-gnu.so
     3.24%  libstdc++.so.6.0.34
     3.04%  libc.so.6
     1.02%  libasync_store.so
     0.74%  [kernel.kallsyms]

Top symbols (sort = Symbol):

   Overhead  Symbol
   --------  ---------------------------------------------------
    78.22%  __vdso_clock_gettime
     2.89%  clock_gettime
     2.72%  std::chrono::_V2::system_clock::now()
     0.51%  clock_gettime@plt

Call-stack trace:

78.22%  [vdso]
        __vdso_clock_gettime
        clock_gettime
        std::chrono::_V2::system_clock::now()
        execute_native_thread_routine     ← std::thread::_State_impl

Root cause — kt-kernel/cpu_backend/worker_pool.cpp:212-236

void InNumaPool::worker_thread(int thread_id, int numa_id) {
  if (numa_id >= 0) {
    set_memory_to_numa(numa_id);
  }
  auto start = std::chrono::high_resolution_clock::now();
  WorkerPool::thread_local_id = thread_id;
  while (true) {
    ThreadStatus status = thread_state_[thread_id].status.load(std::memory_order_acquire);
    if (status == ThreadStatus::WORKING) {
      process_tasks(thread_id);
      start = std::chrono::high_resolution_clock::now();
    } else if (status == ThreadStatus::WAITING) {
      auto now = std::chrono::high_resolution_clock::now();      // ← every loop iteration
      auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(now - start).count();
      if (duration > 50) {                                       // ← 50 ms busy spin
        std::unique_lock<std::mutex> lock(thread_state_[thread_id].mutex);
        thread_state_[thread_id].cv.wait(lock, [&] {
          return thread_state_[thread_id].status.load(std::memory_order_acquire) != ThreadStatus::WAITING;
        });
      }
    } else if (status == ThreadStatus::EXIT) {
      return;
    }
  }
}

Pathology: hybrid wait pattern. The worker spins for 50 ms
before transitioning to cv.wait. During those 50 ms,
high_resolution_clock::now() (= clock_gettime syscall via VDSO)
is called in a tight loop, with no _mm_pause(), no
std::this_thread::yield(), no CPU hint of any kind
.

Real situation during active MoE inference:

  • Each layer forward submits nth × activated_expert = 24 × 8 = 192
    tasks (Gate+Up), followed by 24 tasks (Down).
  • Between task bursts, workers are in WAITING state.
  • 96 worker threads × busy-spin = CPU saturated with
    clock_gettime calls instead of compute
    .

Universality — every MoE backend is affected

InNumaPool::worker_thread is the only worker-thread implementation
in the kt-kernel repo:

grep -rn "InNumaPool::worker_thread" kt-kernel/
  cpu_backend/worker_pool.cpp:37   (constructor)
  cpu_backend/worker_pool.cpp:57   (constructor with numa_id)
  cpu_backend/worker_pool.cpp:212  (implementation)

It is driven by every MoE backend family through
pool->do_work_stealing_job(...):

Backend family Paths
LLAMAFILE (GGUF) operators/llamafile/moe.hpp, mla.hpp
AMX (Intel-only) operators/amx/{bf16,fp8,fp4,awq,k2,sft,moe_base,awq-moe,fp8-perchannel}-moe.hpp (9 files)
AVX2/AVX-VNNI operators/avx2/{bf16,fp8,gptq_int4,gptq_int4_avxvnni,rawint4,rawint4_avxvnni}-moe.hpp (6 files)
Generic operators/moe_kernel/moe.hpp, operators/moe-sft-tp.hpp

In other words: every code path that uses kt-kernel MoE — every
ktransformers model, every hardware class, every quantization
format — is subject to this CPU waste.


Relationship to existing fixes (#1899 / #1902)

PR #1899 ("TaskQueue worker thread 100% CPU spin when idle") and
PR #1902 ("worker pool idle CPU usage") are present in the current
source as the cv.wait at line 228 — that is, the idle case is
solved
(workers really go to sleep after 50 ms). The hot-path
pathology
(the 50 ms of busy-spin before the cv.wait, with one
clock_gettime per iteration) is not addressed by those PRs.

This change is fully complementary to them — same 50 ms timeout,
same cv.wait exit, only the inside of the spin is changed.


Applied fix

Two minimal, non-invasive changes; the 50 ms spin timeout is preserved
(= unchanged wake-up latency for the next task burst).

A) _mm_pause() — SSE2 hardware hint emitted once per spin
iteration. It tells the CPU "I am just spinning": the HT partner
gets pipeline resources, power consumption drops, no syscall.

C) clock_gettime only every 100th iteration, via a
thread_local int spin_counter → ~99 % fewer VDSO calls.

Diff against kt-kernel/cpu_backend/worker_pool.cpp:

 #include <chrono>
 #include <cstdio>
+#include <immintrin.h>  // _mm_pause()
 #include <stdexcept>
...
     } else if (status == ThreadStatus::WAITING) {
+      // Mitigate clock_gettime hot-path in the WAITING spin: the prior
+      // loop called high_resolution_clock::now() every iteration, which
+      // dominated CPU profiles (78% [vdso] on a 96-thread MoE-inference
+      // run) and starved HT-partner pipelines. _mm_pause() yields
+      // pipeline resources and saves power; sampling the clock once
+      // per 100 iterations keeps the 50ms wake-up budget unchanged
+      // while removing ~99% of the syscalls.
+      _mm_pause();
+      static thread_local int spin_counter = 0;
+      if (++spin_counter < 100) {
+        continue;
+      }
+      spin_counter = 0;
       auto now = std::chrono::high_resolution_clock::now();
       auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(now - start).count();
       if (duration > 50) {
         std::unique_lock<std::mutex> lock(thread_state_[thread_id].mutex);
         thread_state_[thread_id].cv.wait(lock, [&] {
           return thread_state_[thread_id].status.load(std::memory_order_acquire) != ThreadStatus::WAITING;
         });
       }
     } else if (status == ThreadStatus::EXIT) {

Why this combination:

Before fix After fix
Spin iterations / 50 ms / thread ~714 k ~714 k (pause is cheap)
clock_gettime calls / 50 ms / thread ~714 k ~7.14 k (1/100)
Spin CPU load high low (pause + 100× fewer syscalls)
Wake-up latency after task burst ≤ 50 ms ≤ 50 ms (unchanged)
HT-partner throughput throttled free

Alternatives that were rejected:

  • Reduce the spin timeout (50 ms → 5 ms): increases wake-up
    latency for the next task burst by 10×. With short pauses between
    layer forwards this adds cv.wait overhead.
  • Remove the spin phase entirely, go straight to cv.wait:
    maximum CPU savings, but with high-frequency task bursts the
    per-wake-up cv.wait latency (~10–100 µs) would dominate.

The chosen A+C combination neutralizes the clock_gettime cost
without any latency regression.


Expected impact — and measured data

Baseline (pre-fix, 2026-05-22 01:30, Qwen3-30B-A3B Q8_0 LLAMAFILE):

Metric Pre-fix
[vdso] (clock_gettime) 78.22 %
_kt_kernel_ext_avx512_bf16 (MoE compute) 12.77 %
Pure decode rate (prefill-adjusted) ~18.9 tok/s

After fix (same hardware, same model, same sampling config,
measured 2026-05-22 03:23):

Metric Post-fix Change
[vdso] (clock_gettime) 1.37 % 57× less
_kt_kernel_ext_avx512_bf16 (MoE compute) 93.24 % 7.3× more
Pure decode rate (prefill-adjusted) ~20.5 tok/s (Q4: 20.77, Q5: 20.55, Q6: 20.42) +8–12 %

Interpretation of the gap (57× CPU saving, but only ~10 % decode
speedup):

The clock_gettime cost was concentrated on idle worker threads
between task bursts
, not on the critical path of decode. The
WORKING thread does real MoE compute with full pipeline utilization;
its throughput is not directly blocked by other threads spinning
idly. The patch therefore mainly sanitises:

  • CPU waste on the 95 idle worker threads (on a 96-thread system),
  • HT-partner pipelines, which now have free compute resources,
  • Power / heat / power-cap headroom — significant on
    96-core servers,
  • perf profile readability for future diagnosis — the hot path
    is finally cleanly dominated by the MoE compute kernel.

This is not a decode-performance silver bullet — the headline
decode-rate change is modest. But the fix removes a massive systemic
CPU-waste pattern that hits every kt-kernel user on every supported
backend.

All other kt-kernel backends (AMX, AVX2/AVX-VNNI, generic) gain
structurally identical benefit — the patched
InNumaPool::worker_thread loop is backend-agnostic (see the
Universality section).


Reviewer checklist

  • Apply the patch on a clean checkout of upstream/main with
    git apply kt-kernel_worker_pool_busy_spin.patch, build with
    CPUINFER_BUILD_ALL_VARIANTS=1 (or any single-variant build).
  • Run perf record -F 999 -p $PID --call-graph=dwarf -- sleep 60
    during an active decode and confirm that [vdso] drops out of
    the top of perf report --sort dso.
  • Confirm that wake-up latency from a cv.wait is unchanged
    (e.g. measure first-token latency after a long idle period).
  • Confirm _mm_pause() availability: the existing kt-kernel build
    already requires SSE2 (it is part of the x86-64 baseline ABI) —
    no new build flag is needed; <immintrin.h> is already pulled in
    transitively elsewhere in the source.

Raw data

The full perf.data (~45 GB) is not publicly attached due to size
and because the trace captures private model output. Any reviewer
can reproduce the trace locally — instructions are in the
Steps to reproduce section.

In `InNumaPool::worker_thread`, the WAITING branch called
`std::chrono::high_resolution_clock::now()` every spin iteration before
falling through to `cv.wait` at the 50 ms timeout. On a 96-thread
MoE-inference workload (Qwen3-30B-A3B Q8_0, LLAMAFILE backend), this
caused `__vdso_clock_gettime` to dominate the CPU profile at 78.22 %
while the actual MoE compute kernel held only 12.77 %.

Apply two minimal changes that keep the 50 ms wake-up budget intact:

  A) Insert `_mm_pause()` once per spin iteration — SSE2 hardware
     hint that frees HT-partner pipeline resources and reduces
     power draw, with no syscall.

  C) Read the clock only every 100th iteration via a `thread_local`
     counter, reducing VDSO calls by ~99 % without changing the
     50 ms timeout semantics.

Verified on AMD EPYC 9654 (96C/192T, AVX-512+BF16), Linux 6.12,
kt-kernel @ upstream/main (last touch of this file: bdf4bb7,
"Fix worker pool idle CPU usage").

Measured impact (same model, same sampler config, same hardware):

  | Metric                                | Before  | After   |
  | ------------------------------------- | ------- | ------- |
  | [vdso] clock_gettime CPU share        | 78.22 % |  1.37 % |
  | _kt_kernel_ext_avx512_bf16 CPU share  | 12.77 % | 93.24 % |
  | Decode tok/s (Q4/Q5/Q6 mean)          |   ~19   |  ~20.5  |

The fix is backend-agnostic — `InNumaPool::worker_thread` is the only
worker-thread implementation in kt-kernel and is driven by every MoE
backend family (LLAMAFILE, AMX, AVX2/AVX-VNNI, generic).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes the WAITING spin loop in InNumaPool::worker_thread by incorporating _mm_pause() and throttling clock sampling to once every 100 iterations, which significantly reduces clock_gettime overhead. The reviewer recommends applying this optimization consistently to NumaJobDistributor::worker_thread and suggests using a local variable instead of thread_local for the spin counter to minimize access overhead.

Comment on lines +232 to +237
_mm_pause();
static thread_local int spin_counter = 0;
if (++spin_counter < 100) {
continue;
}
spin_counter = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The addition of _mm_pause() and the clock sampling logic significantly improves the performance of InNumaPool::worker_thread by reducing VDSO overhead. However, a structurally identical busy-spin exists in NumaJobDistributor::worker_thread (lines 413-421), which also calls high_resolution_clock::now() every iteration without any throttling or hardware hints. To ensure the performance gains are realized across all worker types in the pool, this optimization should be applied consistently to both implementations. Additionally, spin_counter could be a simple local variable declared outside the while loop to avoid the (minor) overhead of thread_local storage access, as the thread routine only executes once.

@usrlocalben

Copy link
Copy Markdown

Despite all the slop, I read "+8-12 %" and I'll try it out. I also happen to have the same CPU. (2x 9B14, OEM equiv)

I observe no change at all in decode perf w/K2 INT4.

I probed around different concurrency levels to see if it was more efficient, i.e. same perf with fewer cores - Not so. I see the same characteristic wrt. threads vs. decode as before. (I see max decode at ~112 threads. I usually back off to 96 to trade very marginal perf for +8 idle cores)

@yyj6666667

yyj6666667 commented May 24, 2026

Copy link
Copy Markdown
Collaborator
  1. The +8–12% decode claim is not supported by the mechanism. The
    clock_gettime cost is paid by threads in the WAITING branch
    (idle workers in the 50 ms pre-sleep spin); compute threads are in
    WORKING and never enter this path. Removing the VDSO calls lowers
    idle CPU / power and helps the SMT sibling via _mm_pause(), but
    does not return cycles to the compute path. @usrlocalben on the
    same EPYC 9654 reports no decode delta; our internal runs agree.
    Please drop the throughput number or back it with a reproducible
    benchmark.

  2. NumaJobDistributor::worker_thread (worker_pool.cpp:385) has
    the identical WAITING-spin pattern and must be patched in the
    same PR, otherwise the hot spot just relocates.

  3. Nit: static thread_local here adds a TLS access per iteration
    for no reason — a plain local int above the while (true) is
    equivalent.

At this time, the PR is not mergeable as-is.

…drop TLS counter

Follow-up addressing PR review (@yyj6666667, @gemini-code-assist):

- Apply the identical _mm_pause() + clock-sampling-every-100-iterations
  treatment to NumaJobDistributor::worker_thread. Its WAITING branch had
  the same per-iteration high_resolution_clock::now() spin; without this
  the idle-spin hot-spot just relocates there.
- Replace `static thread_local int spin_counter` with a plain local above
  the loop in both worker routines. Each runs once per thread, so the TLS
  storage and its per-iteration %fs access are unnecessary.

No change to the WORKING/compute path: the status load runs every loop
iteration, so work is still detected without delay; only the elapsed-time
check is throttled.
@hermannklie

Copy link
Copy Markdown
Author

Thanks both — @yyj6666667 and @usrlocalben. These are fair and they've changed the PR (commit 9d6e02f).

Throughput claim: retracted. The +8–12 % was measured on our box in a pathological state — our profile sits at ~78 % idle-worker WAITING-spin vs ~12 % compute, i.e. our workers are starved most of the time for reasons unrelated to this loop. On a healthily-fed pool there is nothing for this patch to recover, which is exactly @usrlocalben's null result on the same EPYC 9654 class. The mechanism is as you describe: the clock_gettime cost is paid by WAITING (idle) threads, not the WORKING compute path, so removing it doesn't return cycles to decode. I've struck the tok/s figures from the description; the PR now claims only what the mechanism supports — lower idle CPU/power and SMT-sibling relief.

@usrlocalben — your efficiency probe (same throughput, fewer cores) showing no change is consistent with this, not contradictory: the patch touches only the WAITING branch, so it can't shift the threads-vs-decode scaling curve, which is a property of the WORKING path. The one regime where it does anything measurable is precisely your "back off to 96 for +8 idle cores" case. Modest, but it costs every user a little idle power even when it's invisible in tok/s — our extreme state just made it findable.

Both review points applied (9d6e02f):

  • NumaJobDistributor::worker_thread now carries the identical _mm_pause() + sample-every-100 treatment, so the hot-spot can't just relocate.
  • spin_counter is a plain local above the loop in both functions — no static thread_local, no per-iteration %fs access.

I kept a fixed sample interval rather than an adaptive backoff on purpose: here the spin is only the bounded 50 ms pre-sleep window before cv.wait, and the status load runs every iteration regardless, so a growing interval would only delay the transition into the efficient cv.wait sleep — the opposite of the goal.

One disclosure, in fairness to the note about AI text: we're a very small team and draft these upstream patches with AI assistance for resourcing reasons. We review and verify everything ourselves on real hardware — this change was built and smoke-tested on the EPYC 9654 box above. Happy to reshape the patch if you'd prefer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants