Description
Compiler::build() in csrc/jit/compiler.hpp does not synchronize across processes. When multiple processes target the same kernel (same signature hash, same cache directory), they can race on writing and reading kernel.cu and kernel.cubin, potentially causing:
CUDA driver error: 301 (CUDA_ERROR_FILE_NOT_FOUND, file not found)
NVCC compilation failed: cc1plus: fatal error: .../kernel.cu: No such file or directory
This was observed downstream in vllm-project/vllm#39057, where a user running DeepSeek-V3.2 with -dp 8 --enable-expert-parallel (8x H200, cache on a network filesystem) hit both errors during startup. The lack of locking in build() is a plausible explanation and seems worth fixing regardless.
cc @jxdn
Initial code analysis
Multiple processes with the same kernel signature will compute the same dir_path:
const auto dir_path = cache_dir_path / "cache" /
fmt::format("kernel.{}.{}", name, get_hex_digest(kernel_signature));
With no locking, I'm guessing the following can happen concurrently:
- Both processes call make_dirs(dir_path) and enter compile()
- Both call put(code_path, code), which atomically renames a temp file to dir_path/kernel.cu, so one process's write replaces the other's
- For NVCCCompiler, NVCC is invoked as an external subprocess that reads kernel.cu from disk (line 214). If another process replaces or is mid-replace of that file, NVCC can fail with "file not found"
- Similarly, both processes race on rename(tmp_cubin, kernel.cubin)
NVRTCCompiler compiles from an in-memory string so it avoids the kernel.cu read race, but still shares the cubin output path.
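To make the shared-path problem concrete, here is a minimal sketch of the write-then-rename pattern described above. `put_atomic` and `read_file` are hypothetical names for illustration, not DeepGEMM's actual put(), and the temp-file naming is an assumption. The rename protects a reader from seeing a half-written file, but every process still publishes to the same target, so it does not coordinate the write, compile, and rename steps between processes.

```cpp
#include <unistd.h>  // getpid

#include <cassert>
#include <filesystem>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Hypothetical stand-in for put(): write to a per-process temp file next to
// the target, then rename it over the target.
inline void put_atomic(const fs::path& target, const std::string& data) {
    assert(target.has_filename());
    fs::path tmp = target;
    tmp += ".tmp." + std::to_string(::getpid());  // assumed naming scheme
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
        out.close();  // flush before publishing
        if (out.fail())
            throw std::runtime_error("failed to write " + tmp.string());
    }
    // Replaces any existing target. Every process targets the same path,
    // so the last rename wins.
    fs::rename(tmp, target);
}

// Helper for inspecting the result.
inline std::string read_file(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    std::ostringstream ss;
    ss << in.rdbuf();
    return ss.str();
}
```

Calling `put_atomic(dir / "kernel.cu", code)` from two processes leaves whichever content was renamed last. Nothing here tells a process that another one is already compiling from that file.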
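What isn't verified
- ~~I haven't reproduced this in a controlled test; the downstream report is the only data point~~ I was able to verify this last night
- The cache directory being on a shared/network filesystem (/hpfs/...) could be a contributing factor (e.g. NFS caching behavior), independent of the race
- A corrupted or stale cache from a prior run could also produce similar errors
Accounted for in my test / setup in Fix JIT cache race condition with multi-process compilation #302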
Environment (from the downstream report)
- DeepGEMM at commit 477618c (pinned by vLLM)
- vLLM 0.19.1rc1.dev44
- 8x NVIDIA H200, CUDA 12.8, RHEL 9.4
- Cache directory on a shared/network filesystem (/hpfs/...)
Suggested fix
Add a per-kernel file lock (flock()) around the compilation step in build() with a double-checked locking pattern:
// Fast path — already compiled in this process
if (const auto& runtime = kernel_runtime_cache->get(dir_path); runtime != nullptr)
return runtime;
// Acquire cross-process lock
FileLock lock(lock_path); // RAII wrapper around flock(LOCK_EX)
// Re-check after acquiring lock — another process may have compiled it
if (const auto& runtime = kernel_runtime_cache->get(dir_path); runtime != nullptr)
return runtime;
// Only now compile...
This would have zero overhead during steady-state inference (the in-process cache returns before reaching the lock) and only serializes compilation of the same kernel across processes — different kernels still compile in parallel.
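A minimal sketch of what such a lock could look like, assuming a POSIX system. `FileLock`, `build_once`, and the `.lock` file name are hypothetical and not taken from the DeepGEMM source. The re-check here looks at the on-disk kernel.cubin rather than the in-process cache, on the assumption that the cubin is only ever published by an atomic rename. Whether flock() is honored across nodes depends on the filesystem, so that would need checking for a network mount like the one in the downstream report.

```cpp
#include <fcntl.h>     // open
#include <sys/file.h>  // flock
#include <unistd.h>    // close

#include <cassert>
#include <cerrno>
#include <filesystem>
#include <functional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical RAII wrapper: holds an exclusive flock() for its lifetime.
class FileLock {
public:
    explicit FileLock(const std::string& path)
        // O_CLOEXEC keeps the descriptor out of child processes (e.g. a compiler subprocess).
        : fd_(::open(path.c_str(), O_CREAT | O_RDWR | O_CLOEXEC, 0644)) {
        if (fd_ < 0)
            throw std::system_error(errno, std::generic_category(), "open " + path);
        // Block until the lock is free; retry if a signal interrupts the wait.
        while (::flock(fd_, LOCK_EX) != 0) {
            if (errno == EINTR) continue;
            const int err = errno;
            ::close(fd_);
            throw std::system_error(err, std::generic_category(), "flock " + path);
        }
    }
    ~FileLock() {
        assert(fd_ >= 0);
        ::flock(fd_, LOCK_UN);
        ::close(fd_);
    }
    FileLock(const FileLock&) = delete;
    FileLock& operator=(const FileLock&) = delete;

private:
    int fd_;
};

// Hypothetical helper: run `compile` at most once per kernel directory.
// `compile` must leave dir/kernel.cubin in place. Returns true if this call compiled.
inline bool build_once(const fs::path& dir,
                       const std::function<void(const fs::path&)>& compile) {
    const fs::path cubin = dir / "kernel.cubin";
    if (fs::exists(cubin)) return false;      // fast path: no lock taken
    fs::create_directories(dir);
    FileLock lock((dir / ".lock").string());  // waits while another process compiles
    if (fs::exists(cubin)) return false;      // re-check: another process finished first
    compile(dir);
    return true;
}
```

Because the lock file lives inside the per-kernel directory, processes compiling different kernels never wait on each other. The lock is released when `lock` goes out of scope, including when `compile` throws.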
Will try to reproduce and create a fix.