[fix](kt-kernel): eliminate clock_gettime hot-path spin in worker_pool by hermannklie · Pull Request #2022 · kvcache-ai/ktransformers

hermannklie · 2026-05-22T10:10:11Z

Update (2026-05-25, after review) — supersedes the throughput claims below

Thanks to @yyj6666667 and @usrlocalben. Two corrections to this description:

1. The decode-throughput claim (+8–12 %, ~19 → ~20.5 tok/s) is retracted.
It was an artifact of a pathological local state on our box: our profile sits
at ~78 % idle-worker WAITING-spin vs ~12 % compute — i.e. our workers are
starved most of the time for reasons unrelated to this loop. On a healthily-fed
pool there is nothing for the patch to recover, which matches @usrlocalben's
null result on the same EPYC 9654 class. As @yyj6666667 notes, the
clock_gettime cost is paid by WAITING (idle) threads, not the WORKING compute
path, so removing it does not return cycles to decode. The patch's honest
effect is lower idle CPU / power and SMT-sibling relief — not throughput.
Please read every tok/s figure below as retracted.

2. The patch now also covers NumaJobDistributor::worker_thread (identical
_mm_pause() + sample-the-clock-every-100-iterations treatment), and the spin
counter is a plain local instead of static thread_local in both routines
(commit 9d6e02f). The "Applied fix" diff below still shows only the original
InNumaPool change.

kt-kernel: `clock_gettime` hot-path spin dominates MoE-inference CPU profile (78% [vdso])

Status: Verified locally; ready to submit upstream.
Target: kvcache-ai/ktransformers, kt-kernel/ subproject.
Affected file: kt-kernel/cpu_backend/worker_pool.cpp.
Last upstream touch of the file: commit bdf4bb7 ("Fix worker pool idle CPU usage", #1902).
Discovered while building from source at commit: 35fc6ca (origin/main, fetched 2026-05-22).
Type: Performance bug fix (CPU-waste pattern, no behavioural change to wake-up latency).
Patch file: kt-kernel_worker_pool_busy_spin.patch (sibling to this doc).

Summary

During active MoE inference, the textgen/ktransformers server spends
78.22 % of its CPU cycles in __vdso_clock_gettime — versus
12.77 % in the actual MoE compute kernel
(_kt_kernel_ext_avx512_bf16.so). The cost is not paid in idle but
during active decodes.

Root cause is the WAITING-state spin loop in
InNumaPool::worker_thread: it calls
std::chrono::high_resolution_clock::now() (= clock_gettime via
VDSO) on every iteration of the 50 ms pre-wait spin, with no
_mm_pause() and no thinning.

Existing PRs #1899 and #1902 fixed the idle case (worker reaches
cv.wait after 50 ms and goes to sleep). The hot-path — what
happens during those 50 ms while task bursts are arriving — was
not addressed.

Two minimal changes (_mm_pause() + 1-in-100 clock sampling) reduce
[vdso] CPU share from 78.22 % to 1.37 % (57× less), without
changing the 50 ms wake-up latency.

Steps to reproduce

Component	Value
Hardware	AMD EPYC 9654 (96C/192T, Zen4, AVX-512+BF16)
OS	Debian 13 trixie / LMDE 7, Kernel 6.12
GPU	NVIDIA RTX 3090 (24 GB) — irrelevant for CPU profile
kt-kernel	upstream/main (HEAD `35fc6ca`, file unchanged since `bdf4bb7`)
Model	Qwen3-30B-A3B (48 layers, 128 experts/layer, 8 active/token, Q8_0 GGUF)
ktransformers backend	balance_serve, kt_kernel `_kt_kernel_ext_avx512_bf16` variant, method=LLAMAFILE
Threading	`cpuinfer_threads=96`, `threadpool_count=1` (NPS1, 1 NUMA node)
Decode rate (measured)	~19 tok/s (pre-fix) / ~20.5 tok/s (post-fix)

perf invocation:

PID=$(pgrep -f "server.py" | head -1)
perf record -F 999 -p $PID --call-graph=dwarf -o perf.data -- sleep 60
# During the 60 s: issue a query in the UI that runs at least 30 s of decode.
perf report --no-children -i perf.data --stdio --sort dso
perf report --no-children -i perf.data --stdio --sort symbol

Observation

Top shared objects (sort = DSO):

   Overhead  Shared Object
   --------  ---------------------------------------------------
    78.22%  [vdso]
    12.77%  _kt_kernel_ext_avx512_bf16.cpython-313-x86_64-linux-gnu.so
     3.24%  libstdc++.so.6.0.34
     3.04%  libc.so.6
     1.02%  libasync_store.so
     0.74%  [kernel.kallsyms]

Top symbols (sort = Symbol):

   Overhead  Symbol
   --------  ---------------------------------------------------
    78.22%  __vdso_clock_gettime
     2.89%  clock_gettime
     2.72%  std::chrono::_V2::system_clock::now()
     0.51%  clock_gettime@plt

Call-stack trace:

78.22%  [vdso]
        __vdso_clock_gettime
        clock_gettime
        std::chrono::_V2::system_clock::now()
        execute_native_thread_routine     ← std::thread::_State_impl

Root cause — `kt-kernel/cpu_backend/worker_pool.cpp:212-236`

void InNumaPool::worker_thread(int thread_id, int numa_id) {
  if (numa_id >= 0) {
    set_memory_to_numa(numa_id);
  }
  auto start = std::chrono::high_resolution_clock::now();
  WorkerPool::thread_local_id = thread_id;
  while (true) {
    ThreadStatus status = thread_state_[thread_id].status.load(std::memory_order_acquire);
    if (status == ThreadStatus::WORKING) {
      process_tasks(thread_id);
      start = std::chrono::high_resolution_clock::now();
    } else if (status == ThreadStatus::WAITING) {
      auto now = std::chrono::high_resolution_clock::now();      // ← every loop iteration
      auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(now - start).count();
      if (duration > 50) {                                       // ← 50 ms busy spin
        std::unique_lock<std::mutex> lock(thread_state_[thread_id].mutex);
        thread_state_[thread_id].cv.wait(lock, [&] {
          return thread_state_[thread_id].status.load(std::memory_order_acquire) != ThreadStatus::WAITING;
        });
      }
    } else if (status == ThreadStatus::EXIT) {
      return;
    }
  }
}

Pathology: hybrid wait pattern. The worker spins for 50 ms
before transitioning to cv.wait. During those 50 ms,
high_resolution_clock::now() (= clock_gettime syscall via VDSO)
is called in a tight loop, with no _mm_pause(), no
std::this_thread::yield(), no CPU hint of any kind.

Real situation during active MoE inference:

Each layer forward submits nth × activated_expert = 24 × 8 = 192
tasks (Gate+Up), followed by 24 tasks (Down).
Between task bursts, workers are in WAITING state.
96 worker threads × busy-spin = CPU saturated with
clock_gettime calls instead of compute.

Universality — every MoE backend is affected

InNumaPool::worker_thread is the only worker-thread implementation
in the kt-kernel repo:

grep -rn "InNumaPool::worker_thread" kt-kernel/
  cpu_backend/worker_pool.cpp:37   (constructor)
  cpu_backend/worker_pool.cpp:57   (constructor with numa_id)
  cpu_backend/worker_pool.cpp:212  (implementation)

It is driven by every MoE backend family through
pool->do_work_stealing_job(...):

Backend family	Paths
LLAMAFILE (GGUF)	`operators/llamafile/moe.hpp`, `mla.hpp`
AMX (Intel-only)	`operators/amx/{bf16,fp8,fp4,awq,k2,sft,moe_base,awq-moe,fp8-perchannel}-moe.hpp` (9 files)
AVX2/AVX-VNNI	`operators/avx2/{bf16,fp8,gptq_int4,gptq_int4_avxvnni,rawint4,rawint4_avxvnni}-moe.hpp` (6 files)
Generic	`operators/moe_kernel/moe.hpp`, `operators/moe-sft-tp.hpp`

In other words: every code path that uses kt-kernel MoE — every
ktransformers model, every hardware class, every quantization
format — is subject to this CPU waste.

Relationship to existing fixes (#1899 / #1902)

PR #1899 ("TaskQueue worker thread 100% CPU spin when idle") and
PR #1902 ("worker pool idle CPU usage") are present in the current
source as the cv.wait at line 228 — that is, the idle case is
solved (workers really go to sleep after 50 ms). The hot-path
pathology (the 50 ms of busy-spin before the cv.wait, with one
clock_gettime per iteration) is not addressed by those PRs.

This change is fully complementary to them — same 50 ms timeout,
same cv.wait exit, only the inside of the spin is changed.

Applied fix

Two minimal, non-invasive changes; the 50 ms spin timeout is preserved
(= unchanged wake-up latency for the next task burst).

A) _mm_pause() — SSE2 hardware hint emitted once per spin
iteration. It tells the CPU "I am just spinning": the HT partner
gets pipeline resources, power consumption drops, no syscall.

C) clock_gettime only every 100th iteration, via a
thread_local int spin_counter → ~99 % fewer VDSO calls.

Diff against kt-kernel/cpu_backend/worker_pool.cpp:

 #include <chrono>
 #include <cstdio>
+#include <immintrin.h>  // _mm_pause()
 #include <stdexcept>
...
     } else if (status == ThreadStatus::WAITING) {
+      // Mitigate clock_gettime hot-path in the WAITING spin: the prior
+      // loop called high_resolution_clock::now() every iteration, which
+      // dominated CPU profiles (78% [vdso] on a 96-thread MoE-inference
+      // run) and starved HT-partner pipelines. _mm_pause() yields
+      // pipeline resources and saves power; sampling the clock once
+      // per 100 iterations keeps the 50ms wake-up budget unchanged
+      // while removing ~99% of the syscalls.
+      _mm_pause();
+      static thread_local int spin_counter = 0;
+      if (++spin_counter < 100) {
+        continue;
+      }
+      spin_counter = 0;
       auto now = std::chrono::high_resolution_clock::now();
       auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(now - start).count();
       if (duration > 50) {
         std::unique_lock<std::mutex> lock(thread_state_[thread_id].mutex);
         thread_state_[thread_id].cv.wait(lock, [&] {
           return thread_state_[thread_id].status.load(std::memory_order_acquire) != ThreadStatus::WAITING;
         });
       }
     } else if (status == ThreadStatus::EXIT) {

Why this combination:

	Before fix	After fix
Spin iterations / 50 ms / thread	~714 k	~714 k (pause is cheap)
`clock_gettime` calls / 50 ms / thread	~714 k	~7.14 k (1/100)
Spin CPU load	high	low (pause + 100× fewer syscalls)
Wake-up latency after task burst	≤ 50 ms	≤ 50 ms (unchanged)
HT-partner throughput	throttled	free

Alternatives that were rejected:

Reduce the spin timeout (50 ms → 5 ms): increases wake-up
latency for the next task burst by 10×. With short pauses between
layer forwards this adds cv.wait overhead.
Remove the spin phase entirely, go straight to cv.wait:
maximum CPU savings, but with high-frequency task bursts the
per-wake-up cv.wait latency (~10–100 µs) would dominate.

The chosen A+C combination neutralizes the clock_gettime cost
without any latency regression.

Expected impact — and measured data

Baseline (pre-fix, 2026-05-22 01:30, Qwen3-30B-A3B Q8_0 LLAMAFILE):

Metric	Pre-fix
`[vdso]` (clock_gettime)	78.22 %
`_kt_kernel_ext_avx512_bf16` (MoE compute)	12.77 %
Pure decode rate (prefill-adjusted)	~18.9 tok/s

After fix (same hardware, same model, same sampling config,
measured 2026-05-22 03:23):

Metric	Post-fix	Change
`[vdso]` (clock_gettime)	1.37 %	57× less
`_kt_kernel_ext_avx512_bf16` (MoE compute)	93.24 %	7.3× more
Pure decode rate (prefill-adjusted)	~20.5 tok/s (Q4: 20.77, Q5: 20.55, Q6: 20.42)	+8–12 %

Interpretation of the gap (57× CPU saving, but only ~10 % decode
speedup):

The clock_gettime cost was concentrated on idle worker threads
between task bursts, not on the critical path of decode. The
WORKING thread does real MoE compute with full pipeline utilization;
its throughput is not directly blocked by other threads spinning
idly. The patch therefore mainly sanitises:

CPU waste on the 95 idle worker threads (on a 96-thread system),
HT-partner pipelines, which now have free compute resources,
Power / heat / power-cap headroom — significant on
96-core servers,
perf profile readability for future diagnosis — the hot path
is finally cleanly dominated by the MoE compute kernel.

This is not a decode-performance silver bullet — the headline
decode-rate change is modest. But the fix removes a massive systemic
CPU-waste pattern that hits every kt-kernel user on every supported
backend.

All other kt-kernel backends (AMX, AVX2/AVX-VNNI, generic) gain
structurally identical benefit — the patched
InNumaPool::worker_thread loop is backend-agnostic (see the
Universality section).

Reviewer checklist

Apply the patch on a clean checkout of upstream/main with
git apply kt-kernel_worker_pool_busy_spin.patch, build with
CPUINFER_BUILD_ALL_VARIANTS=1 (or any single-variant build).
Run perf record -F 999 -p $PID --call-graph=dwarf -- sleep 60
during an active decode and confirm that [vdso] drops out of
the top of perf report --sort dso.
Confirm that wake-up latency from a cv.wait is unchanged
(e.g. measure first-token latency after a long idle period).
Confirm _mm_pause() availability: the existing kt-kernel build
already requires SSE2 (it is part of the x86-64 baseline ABI) —
no new build flag is needed; <immintrin.h> is already pulled in
transitively elsewhere in the source.

Raw data

The full perf.data (~45 GB) is not publicly attached due to size
and because the trace captures private model output. Any reviewer
can reproduce the trace locally — instructions are in the
Steps to reproduce section.

In `InNumaPool::worker_thread`, the WAITING branch called `std::chrono::high_resolution_clock::now()` every spin iteration before falling through to `cv.wait` at the 50 ms timeout. On a 96-thread MoE-inference workload (Qwen3-30B-A3B Q8_0, LLAMAFILE backend), this caused `__vdso_clock_gettime` to dominate the CPU profile at 78.22 % while the actual MoE compute kernel held only 12.77 %. Apply two minimal changes that keep the 50 ms wake-up budget intact: A) Insert `_mm_pause()` once per spin iteration — SSE2 hardware hint that frees HT-partner pipeline resources and reduces power draw, with no syscall. C) Read the clock only every 100th iteration via a `thread_local` counter, reducing VDSO calls by ~99 % without changing the 50 ms timeout semantics. Verified on AMD EPYC 9654 (96C/192T, AVX-512+BF16), Linux 6.12, kt-kernel @ upstream/main (last touch of this file: bdf4bb7, "Fix worker pool idle CPU usage"). Measured impact (same model, same sampler config, same hardware): | Metric | Before | After | | ------------------------------------- | ------- | ------- | | [vdso] clock_gettime CPU share | 78.22 % | 1.37 % | | _kt_kernel_ext_avx512_bf16 CPU share | 12.77 % | 93.24 % | | Decode tok/s (Q4/Q5/Q6 mean) | ~19 | ~20.5 | The fix is backend-agnostic — `InNumaPool::worker_thread` is the only worker-thread implementation in kt-kernel and is driven by every MoE backend family (LLAMAFILE, AMX, AVX2/AVX-VNNI, generic).

gemini-code-assist

Code Review

This pull request optimizes the WAITING spin loop in InNumaPool::worker_thread by incorporating _mm_pause() and throttling clock sampling to once every 100 iterations, which significantly reduces clock_gettime overhead. The reviewer recommends applying this optimization consistently to NumaJobDistributor::worker_thread and suggests using a local variable instead of thread_local for the spin counter to minimize access overhead.

gemini-code-assist · 2026-05-22T10:15:29Z

+      _mm_pause();
+      static thread_local int spin_counter = 0;
+      if (++spin_counter < 100) {
+        continue;
+      }
+      spin_counter = 0;


The addition of _mm_pause() and the clock sampling logic significantly improves the performance of InNumaPool::worker_thread by reducing VDSO overhead. However, a structurally identical busy-spin exists in NumaJobDistributor::worker_thread (lines 413-421), which also calls high_resolution_clock::now() every iteration without any throttling or hardware hints. To ensure the performance gains are realized across all worker types in the pool, this optimization should be applied consistently to both implementations. Additionally, spin_counter could be a simple local variable declared outside the while loop to avoid the (minor) overhead of thread_local storage access, as the thread routine only executes once.

usrlocalben · 2026-05-23T00:30:19Z

Despite all the slop, I read "+8-12 %" and I'll try it out. I also happen to have the same CPU. (2x 9B14, OEM equiv)

I observe no change at all in decode perf w/K2 INT4.

I probed around different concurrency levels to see if it was more efficient, i.e. same perf with fewer cores - Not so. I see the same characteristic wrt. threads vs. decode as before. (I see max decode at ~112 threads. I usually back off to 96 to trade very marginal perf for +8 idle cores)

yyj6666667 · 2026-05-24T10:49:34Z

The +8–12% decode claim is not supported by the mechanism. The
clock_gettime cost is paid by threads in the WAITING branch
(idle workers in the 50 ms pre-sleep spin); compute threads are in
WORKING and never enter this path. Removing the VDSO calls lowers
idle CPU / power and helps the SMT sibling via _mm_pause(), but
does not return cycles to the compute path. @usrlocalben on the
same EPYC 9654 reports no decode delta; our internal runs agree.
Please drop the throughput number or back it with a reproducible
benchmark.
NumaJobDistributor::worker_thread (worker_pool.cpp:385) has
the identical WAITING-spin pattern and must be patched in the
same PR, otherwise the hot spot just relocates.
Nit: static thread_local here adds a TLS access per iteration
for no reason — a plain local int above the while (true) is
equivalent.

At this time, the PR is not mergeable as-is.

@yyj6666667

…drop TLS counter Follow-up addressing PR review (@yyj6666667, @gemini-code-assist): - Apply the identical _mm_pause() + clock-sampling-every-100-iterations treatment to NumaJobDistributor::worker_thread. Its WAITING branch had the same per-iteration high_resolution_clock::now() spin; without this the idle-spin hot-spot just relocates there. - Replace `static thread_local int spin_counter` with a plain local above the loop in both worker routines. Each runs once per thread, so the TLS storage and its per-iteration %fs access are unnecessary. No change to the WORKING/compute path: the status load runs every loop iteration, so work is still detected without delay; only the elapsed-time check is throttled.

hermannklie · 2026-05-25T18:36:51Z

Thanks both — @yyj6666667 and @usrlocalben. These are fair and they've changed the PR (commit 9d6e02f).

Throughput claim: retracted. The +8–12 % was measured on our box in a pathological state — our profile sits at ~78 % idle-worker WAITING-spin vs ~12 % compute, i.e. our workers are starved most of the time for reasons unrelated to this loop. On a healthily-fed pool there is nothing for this patch to recover, which is exactly @usrlocalben's null result on the same EPYC 9654 class. The mechanism is as you describe: the clock_gettime cost is paid by WAITING (idle) threads, not the WORKING compute path, so removing it doesn't return cycles to decode. I've struck the tok/s figures from the description; the PR now claims only what the mechanism supports — lower idle CPU/power and SMT-sibling relief.

@usrlocalben — your efficiency probe (same throughput, fewer cores) showing no change is consistent with this, not contradictory: the patch touches only the WAITING branch, so it can't shift the threads-vs-decode scaling curve, which is a property of the WORKING path. The one regime where it does anything measurable is precisely your "back off to 96 for +8 idle cores" case. Modest, but it costs every user a little idle power even when it's invisible in tok/s — our extreme state just made it findable.

Both review points applied (9d6e02f):

NumaJobDistributor::worker_thread now carries the identical _mm_pause() + sample-every-100 treatment, so the hot-spot can't just relocate.
spin_counter is a plain local above the loop in both functions — no static thread_local, no per-iteration %fs access.

I kept a fixed sample interval rather than an adaptive backoff on purpose: here the spin is only the bounded 50 ms pre-sleep window before cv.wait, and the status load runs every iteration regardless, so a growing interval would only delay the transition into the efficient cv.wait sleep — the opposite of the goal.

One disclosure, in fairness to the note about AI text: we're a very small team and draft these upstream patches with AI assistance for resourcing reasons. We review and verify everything ourselves on real hardware — this change was built and smoke-tested on the EPYC 9654 box above. Happy to reshape the patch if you'd prefer.

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

hermannklie mentioned this pull request May 25, 2026

[fix](kt-kernel): enable -mavx512vl for AVX512_VBMI/BF16 multi-variant builds #2021

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix](kt-kernel): eliminate clock_gettime hot-path spin in worker_pool#2022

[fix](kt-kernel): eliminate clock_gettime hot-path spin in worker_pool#2022
hermannklie wants to merge 2 commits into
kvcache-ai:mainfrom
Bardo-AI:fix/worker-pool-busy-spin

hermannklie commented May 22, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

usrlocalben commented May 23, 2026

Uh oh!

yyj6666667 commented May 24, 2026 •

edited

Loading

Uh oh!

hermannklie commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

hermannklie commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Update (2026-05-25, after review) — supersedes the throughput claims below

kt-kernel: clock_gettime hot-path spin dominates MoE-inference CPU profile (78% [vdso])

Summary

Steps to reproduce

Observation

Root cause — kt-kernel/cpu_backend/worker_pool.cpp:212-236

Universality — every MoE backend is affected

Relationship to existing fixes (#1899 / #1902)

Applied fix

Expected impact — and measured data

Reviewer checklist

Raw data

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

usrlocalben commented May 23, 2026

Uh oh!

yyj6666667 commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hermannklie commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hermannklie commented May 22, 2026 •

edited

Loading

kt-kernel: `clock_gettime` hot-path spin dominates MoE-inference CPU profile (78% [vdso])

Root cause — `kt-kernel/cpu_backend/worker_pool.cpp:212-236`

yyj6666667 commented May 24, 2026 •

edited

Loading