[fix](kt-kernel): eliminate clock_gettime hot-path spin in worker_pool#2022
[fix](kt-kernel): eliminate clock_gettime hot-path spin in worker_pool#2022hermannklie wants to merge 2 commits into
Conversation
In `InNumaPool::worker_thread`, the WAITING branch called
`std::chrono::high_resolution_clock::now()` every spin iteration before
falling through to `cv.wait` at the 50 ms timeout. On a 96-thread
MoE-inference workload (Qwen3-30B-A3B Q8_0, LLAMAFILE backend), this
caused `__vdso_clock_gettime` to dominate the CPU profile at 78.22 %
while the actual MoE compute kernel held only 12.77 %.
Apply two minimal changes that keep the 50 ms wake-up budget intact:
A) Insert `_mm_pause()` once per spin iteration — SSE2 hardware
hint that frees HT-partner pipeline resources and reduces
power draw, with no syscall.
C) Read the clock only every 100th iteration via a `thread_local`
counter, reducing VDSO calls by ~99 % without changing the
50 ms timeout semantics.
Verified on AMD EPYC 9654 (96C/192T, AVX-512+BF16), Linux 6.12,
kt-kernel @ upstream/main (last touch of this file: bdf4bb7,
"Fix worker pool idle CPU usage").
Measured impact (same model, same sampler config, same hardware):
| Metric | Before | After |
| ------------------------------------- | ------- | ------- |
| [vdso] clock_gettime CPU share | 78.22 % | 1.37 % |
| _kt_kernel_ext_avx512_bf16 CPU share | 12.77 % | 93.24 % |
| Decode tok/s (Q4/Q5/Q6 mean) | ~19 | ~20.5 |
The fix is backend-agnostic — `InNumaPool::worker_thread` is the only
worker-thread implementation in kt-kernel and is driven by every MoE
backend family (LLAMAFILE, AMX, AVX2/AVX-VNNI, generic).
There was a problem hiding this comment.
Code Review
This pull request optimizes the WAITING spin loop in InNumaPool::worker_thread by incorporating _mm_pause() and throttling clock sampling to once every 100 iterations, which significantly reduces clock_gettime overhead. The reviewer recommends applying this optimization consistently to NumaJobDistributor::worker_thread and suggests using a local variable instead of thread_local for the spin counter to minimize access overhead.
| _mm_pause(); | ||
| static thread_local int spin_counter = 0; | ||
| if (++spin_counter < 100) { | ||
| continue; | ||
| } | ||
| spin_counter = 0; |
There was a problem hiding this comment.
The addition of _mm_pause() and the clock sampling logic significantly improves the performance of InNumaPool::worker_thread by reducing VDSO overhead. However, a structurally identical busy-spin exists in NumaJobDistributor::worker_thread (lines 413-421), which also calls high_resolution_clock::now() every iteration without any throttling or hardware hints. To ensure the performance gains are realized across all worker types in the pool, this optimization should be applied consistently to both implementations. Additionally, spin_counter could be a simple local variable declared outside the while loop to avoid the (minor) overhead of thread_local storage access, as the thread routine only executes once.
|
Despite all the slop, I read "+8-12 %" and I'll try it out. I also happen to have the same CPU. (2x 9B14, OEM equiv) I observe no change at all in decode perf w/K2 INT4. I probed around different concurrency levels to see if it was more efficient, i.e. same perf with fewer cores - Not so. I see the same characteristic wrt. threads vs. decode as before. (I see max decode at ~112 threads. I usually back off to 96 to trade very marginal perf for +8 idle cores) |
At this time, the PR is not mergeable as-is. |
…drop TLS counter Follow-up addressing PR review (@yyj6666667, @gemini-code-assist): - Apply the identical _mm_pause() + clock-sampling-every-100-iterations treatment to NumaJobDistributor::worker_thread. Its WAITING branch had the same per-iteration high_resolution_clock::now() spin; without this the idle-spin hot-spot just relocates there. - Replace `static thread_local int spin_counter` with a plain local above the loop in both worker routines. Each runs once per thread, so the TLS storage and its per-iteration %fs access are unnecessary. No change to the WORKING/compute path: the status load runs every loop iteration, so work is still detected without delay; only the elapsed-time check is throttled.
|
Thanks both — @yyj6666667 and @usrlocalben. These are fair and they've changed the PR (commit 9d6e02f). Throughput claim: retracted. The @usrlocalben — your efficiency probe (same throughput, fewer cores) showing no change is consistent with this, not contradictory: the patch touches only the WAITING branch, so it can't shift the threads-vs-decode scaling curve, which is a property of the WORKING path. The one regime where it does anything measurable is precisely your "back off to 96 for +8 idle cores" case. Modest, but it costs every user a little idle power even when it's invisible in tok/s — our extreme state just made it findable. Both review points applied (9d6e02f):
I kept a fixed sample interval rather than an adaptive backoff on purpose: here the spin is only the bounded 50 ms pre-sleep window before One disclosure, in fairness to the note about AI text: we're a very small team and draft these upstream patches with AI assistance for resourcing reasons. We review and verify everything ourselves on real hardware — this change was built and smoke-tested on the EPYC 9654 box above. Happy to reshape the patch if you'd prefer. |
kt-kernel:
clock_gettimehot-path spin dominates MoE-inference CPU profile (78% [vdso])Status: Verified locally; ready to submit upstream.
Target: kvcache-ai/ktransformers,
kt-kernel/subproject.Affected file:
kt-kernel/cpu_backend/worker_pool.cpp.Last upstream touch of the file: commit
bdf4bb7("Fix worker pool idle CPU usage", #1902).Discovered while building from source at commit:
35fc6ca(origin/main, fetched 2026-05-22).Type: Performance bug fix (CPU-waste pattern, no behavioural change to wake-up latency).
Patch file:
kt-kernel_worker_pool_busy_spin.patch(sibling to this doc).Summary
During active MoE inference, the textgen/ktransformers server spends
78.22 % of its CPU cycles in
__vdso_clock_gettime— versus12.77 % in the actual MoE compute kernel
(
_kt_kernel_ext_avx512_bf16.so). The cost is not paid in idle butduring active decodes.
Root cause is the WAITING-state spin loop in
InNumaPool::worker_thread: it callsstd::chrono::high_resolution_clock::now()(=clock_gettimeviaVDSO) on every iteration of the 50 ms pre-wait spin, with no
_mm_pause()and no thinning.Existing PRs #1899 and #1902 fixed the idle case (worker reaches
cv.waitafter 50 ms and goes to sleep). The hot-path — whathappens during those 50 ms while task bursts are arriving — was
not addressed.
Two minimal changes (
_mm_pause()+ 1-in-100 clock sampling) reduce[vdso]CPU share from 78.22 % to 1.37 % (57× less), withoutchanging the 50 ms wake-up latency.
Steps to reproduce
35fc6ca, file unchanged sincebdf4bb7)_kt_kernel_ext_avx512_bf16variant, method=LLAMAFILEcpuinfer_threads=96,threadpool_count=1(NPS1, 1 NUMA node)perfinvocation:Observation
Top shared objects (sort = DSO):
Top symbols (sort = Symbol):
Call-stack trace:
Root cause —
kt-kernel/cpu_backend/worker_pool.cpp:212-236Pathology: hybrid wait pattern. The worker spins for 50 ms
before transitioning to
cv.wait. During those 50 ms,high_resolution_clock::now()(=clock_gettimesyscall via VDSO)is called in a tight loop, with no
_mm_pause(), nostd::this_thread::yield(), no CPU hint of any kind.Real situation during active MoE inference:
nth × activated_expert = 24 × 8 = 192tasks (Gate+Up), followed by 24 tasks (Down).
WAITINGstate.clock_gettimecalls instead of compute.Universality — every MoE backend is affected
InNumaPool::worker_threadis the only worker-thread implementationin the kt-kernel repo:
It is driven by every MoE backend family through
pool->do_work_stealing_job(...):operators/llamafile/moe.hpp,mla.hppoperators/amx/{bf16,fp8,fp4,awq,k2,sft,moe_base,awq-moe,fp8-perchannel}-moe.hpp(9 files)operators/avx2/{bf16,fp8,gptq_int4,gptq_int4_avxvnni,rawint4,rawint4_avxvnni}-moe.hpp(6 files)operators/moe_kernel/moe.hpp,operators/moe-sft-tp.hppIn other words: every code path that uses kt-kernel MoE — every
ktransformers model, every hardware class, every quantization
format — is subject to this CPU waste.
Relationship to existing fixes (#1899 / #1902)
PR #1899 ("TaskQueue worker thread 100% CPU spin when idle") and
PR #1902 ("worker pool idle CPU usage") are present in the current
source as the
cv.waitat line 228 — that is, the idle case issolved (workers really go to sleep after 50 ms). The hot-path
pathology (the 50 ms of busy-spin before the
cv.wait, with oneclock_gettimeper iteration) is not addressed by those PRs.This change is fully complementary to them — same 50 ms timeout,
same
cv.waitexit, only the inside of the spin is changed.Applied fix
Two minimal, non-invasive changes; the 50 ms spin timeout is preserved
(= unchanged wake-up latency for the next task burst).
A)
_mm_pause()— SSE2 hardware hint emitted once per spiniteration. It tells the CPU "I am just spinning": the HT partner
gets pipeline resources, power consumption drops, no syscall.
C)
clock_gettimeonly every 100th iteration, via athread_local int spin_counter→ ~99 % fewer VDSO calls.Diff against
kt-kernel/cpu_backend/worker_pool.cpp:Why this combination:
clock_gettimecalls / 50 ms / threadAlternatives that were rejected:
latency for the next task burst by 10×. With short pauses between
layer forwards this adds
cv.waitoverhead.cv.wait:maximum CPU savings, but with high-frequency task bursts the
per-wake-up
cv.waitlatency (~10–100 µs) would dominate.The chosen A+C combination neutralizes the
clock_gettimecostwithout any latency regression.
Expected impact — and measured data
Baseline (pre-fix, 2026-05-22 01:30, Qwen3-30B-A3B Q8_0 LLAMAFILE):
[vdso](clock_gettime)_kt_kernel_ext_avx512_bf16(MoE compute)After fix (same hardware, same model, same sampling config,
measured 2026-05-22 03:23):
[vdso](clock_gettime)_kt_kernel_ext_avx512_bf16(MoE compute)Interpretation of the gap (57× CPU saving, but only ~10 % decode
speedup):
The
clock_gettimecost was concentrated on idle worker threadsbetween task bursts, not on the critical path of decode. The
WORKING thread does real MoE compute with full pipeline utilization;
its throughput is not directly blocked by other threads spinning
idly. The patch therefore mainly sanitises:
96-core servers,
perfprofile readability for future diagnosis — the hot pathis finally cleanly dominated by the MoE compute kernel.
This is not a decode-performance silver bullet — the headline
decode-rate change is modest. But the fix removes a massive systemic
CPU-waste pattern that hits every kt-kernel user on every supported
backend.
All other kt-kernel backends (AMX, AVX2/AVX-VNNI, generic) gain
structurally identical benefit — the patched
InNumaPool::worker_threadloop is backend-agnostic (see theUniversality section).
Reviewer checklist
upstream/mainwithgit apply kt-kernel_worker_pool_busy_spin.patch, build withCPUINFER_BUILD_ALL_VARIANTS=1(or any single-variant build).perf record -F 999 -p $PID --call-graph=dwarf -- sleep 60during an active decode and confirm that
[vdso]drops out ofthe top of
perf report --sort dso.cv.waitis unchanged(e.g. measure first-token latency after a long idle period).
_mm_pause()availability: the existing kt-kernel buildalready requires SSE2 (it is part of the x86-64 baseline ABI) —
no new build flag is needed;
<immintrin.h>is already pulled intransitively elsewhere in the source.
Raw data
The full
perf.data(~45 GB) is not publicly attached due to sizeand because the trace captures private model output. Any reviewer
can reproduce the trace locally — instructions are in the
Steps to reproduce section.