[Bug] JIT cache lacks cross-process synchronization, can cause failures under multi-process parallelism #301

@Gregory-Pereira

Description

Compiler::build() in csrc/jit/compiler.hpp does not synchronize across processes. When multiple processes target the same kernel (same signature hash, same cache directory), they can race on writing and reading kernel.cu and kernel.cubin, potentially causing:

  • CUDA driver error: 301 (CUDA_ERROR_FILE_NOT_FOUND, file not found)
  • NVCC compilation failed: cc1plus: fatal error: .../kernel.cu: No such file or directory

This was observed downstream in vllm-project/vllm#39057, where a user running DeepSeek-V3.2 with -dp 8 --enable-expert-parallel (8x H200, cache on a network filesystem) hit both errors during startup. The lack of locking in build() is a plausible explanation and seems worth fixing regardless.

cc @jxdn

Initial code analysis

Multiple processes with the same kernel signature will compute the same dir_path:

const auto dir_path = cache_dir_path / "cache" /
    fmt::format("kernel.{}.{}", name, get_hex_digest(kernel_signature));

With no locking, I suspect the following can happen concurrently:

  1. Both processes call make_dirs(dir_path) and enter compile()
  2. Both call put(code_path, code) which atomically renames a temp file to dir_path/kernel.cu — one process's write replaces the other's
  3. For NVCCCompiler, NVCC is invoked as an external subprocess that reads kernel.cu from disk (line 214). If another process has just replaced that file, or is in the middle of replacing it, NVCC can fail with "file not found"
  4. Similarly, both processes race on rename(tmp_cubin, kernel.cubin)

NVRTCCompiler compiles from an in-memory string so it avoids the kernel.cu read race, but still shares the cubin output path.
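To make step 2 concrete, here is a minimal sketch of the write-to-temp-then-rename pattern. It is not DeepGEMM's actual put(); the names put_atomically and read_all and the ".tmp.<pid>" suffix are hypothetical. It shows that the rename protects readers from a half-written file, but it does nothing to stop two writers from publishing to the same path one after the other: the last rename wins.

```cpp
#include <unistd.h>  // getpid

#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the cache's put(): write to a per-process temp
// file in the target directory, then rename() it over the final path.
void put_atomically(const std::filesystem::path& final_path,
                    const std::string& content) {
    namespace fs = std::filesystem;
    // Assumed naming scheme: the PID keeps concurrent writers' temp files apart.
    const fs::path tmp_path =
        final_path.parent_path() /
        (final_path.filename().string() + ".tmp." + std::to_string(::getpid()));
    {
        std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
        if (!out) throw std::runtime_error("cannot open " + tmp_path.string());
        out << content;
        out.flush();
        if (!out) throw std::runtime_error("write failed: " + tmp_path.string());
    }
    // On POSIX, rename() replaces an existing destination file in one step.
    fs::rename(tmp_path, final_path);
}

// Helper for inspecting the result: read a whole file into a string.
std::string read_all(const std::filesystem::path& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

On a local POSIX filesystem the destination path stays present throughout the rename. Whether that also holds on the network filesystem from the downstream report is not something this sketch establishes.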

What isn't verified

~~I haven't reproduced this in a controlled test; the downstream report is the only data point.~~ Update: I was able to verify this last night.

Environment (from the downstream report)

  • DeepGEMM at commit 477618c (pinned by vLLM)
  • vLLM 0.19.1rc1.dev44
  • 8x NVIDIA H200, CUDA 12.8, RHEL 9.4
  • Cache directory on a shared/network filesystem (/hpfs/...)

Suggested fix

Add a per-kernel file lock (flock()) around the compilation step in build() with a double-checked locking pattern:

// Fast path — already compiled in this process
if (const auto& runtime = kernel_runtime_cache->get(dir_path); runtime != nullptr)
    return runtime;

// Acquire cross-process lock
FileLock lock(lock_path);  // RAII wrapper around flock(LOCK_EX)

// Re-check after acquiring lock — another process may have compiled it
if (const auto& runtime = kernel_runtime_cache->get(dir_path); runtime != nullptr)
    return runtime;

// Only now compile...

This would have zero overhead during steady-state inference (the in-process cache returns before reaching the lock) and only serializes compilation of the same kernel across processes — different kernels still compile in parallel.
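The FileLock used above is not defined in this issue. A minimal sketch of what such an RAII wrapper could look like on Linux follows. The class layout and the is_locked_elsewhere probe are hypothetical, and where lock_path should live relative to dir_path is left open.

```cpp
#include <fcntl.h>     // open, O_CREAT, O_RDWR, O_CLOEXEC
#include <sys/file.h>  // flock, LOCK_EX, LOCK_NB, LOCK_UN
#include <unistd.h>    // close

#include <cerrno>
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical RAII wrapper: holds an exclusive flock() for its lifetime.
class FileLock {
public:
    explicit FileLock(const std::filesystem::path& lock_path) {
        fd_ = ::open(lock_path.c_str(), O_CREAT | O_RDWR | O_CLOEXEC, 0644);
        if (fd_ < 0)
            throw std::system_error(errno, std::generic_category(),
                                    "open " + lock_path.string());
        // Block until no other open file description holds the lock.
        int rc;
        do {
            rc = ::flock(fd_, LOCK_EX);
        } while (rc != 0 && errno == EINTR);  // retry if interrupted by a signal
        if (rc != 0) {
            const int err = errno;
            ::close(fd_);
            throw std::system_error(err, std::generic_category(),
                                    "flock " + lock_path.string());
        }
    }

    ~FileLock() {
        // close() alone would release the lock; the unlock is explicit for clarity.
        ::flock(fd_, LOCK_UN);
        ::close(fd_);
    }

    FileLock(const FileLock&) = delete;
    FileLock& operator=(const FileLock&) = delete;

private:
    int fd_ = -1;
};

// Hypothetical probe, mainly for tests: true if another holder has the lock.
bool is_locked_elsewhere(const std::filesystem::path& lock_path) {
    const int fd = ::open(lock_path.c_str(), O_CREAT | O_RDWR | O_CLOEXEC, 0644);
    if (fd < 0)
        throw std::system_error(errno, std::generic_category(),
                                "open " + lock_path.string());
    const bool busy =
        ::flock(fd, LOCK_EX | LOCK_NB) != 0 && errno == EWOULDBLOCK;
    ::close(fd);  // also drops the lock if the probe happened to acquire it
    return busy;
}
```

Two caveats. First, the re-check after acquiring the lock only helps if the cache lookup notices artifacts that another process has already written to dir_path, not just entries in this process's memory; that is assumed here. Second, flock() is advisory, and whether it is honored across nodes depends on the filesystem and mount options, so it would need to be checked on the network filesystem from the downstream report.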

Will try to reproduce and create a fix.
