[Bug] JIT cache lacks cross-process synchronization, can cause failures under multi-process parallelism

## Description

`Compiler::build()` in `csrc/jit/compiler.hpp` does not synchronize across processes. When multiple processes target the same kernel (same signature hash, same cache directory), they can race on writing and reading `kernel.cu` and `kernel.cubin`, potentially causing:

- `CUDA driver error: 301 (CUDA_ERROR_FILE_NOT_FOUND, file not found)`
- `NVCC compilation failed: cc1plus: fatal error: .../kernel.cu: No such file or directory`

This was observed downstream in vllm-project/vllm#39057, where a user running DeepSeek-V3.2 with `-dp 8 --enable-expert-parallel` (8x H200, cache on a network filesystem) hit both errors during startup. The lack of locking in `build()` is a plausible explanation and seems worth fixing regardless.

cc @jxdn

## Initial code analysis

Multiple processes with the same kernel signature will compute the same `dir_path`:

```cpp
const auto dir_path = cache_dir_path / "cache" /
    fmt::format("kernel.{}.{}", name, get_hex_digest(kernel_signature));
```

With no locking, I'm guessing the following can happen concurrently:

1. Both processes call `make_dirs(dir_path)` and enter `compile()`
2. Both call `put(code_path, code)` which atomically renames a temp file to `dir_path/kernel.cu` — one process's write replaces the other's
3. For `NVCCCompiler`, NVCC is invoked as an external subprocess that reads `kernel.cu` from disk (line 214). If another process replaces or is mid-replace of that file, NVCC can fail with "file not found"
4. Similarly, both processes race on `rename(tmp_cubin, kernel.cubin)`

`NVRTCCompiler` compiles from an in-memory string so it avoids the `kernel.cu` read race, but still shares the cubin output path.

## What isn't verified

- ~~I haven't reproduced this in a controlled test; the downstream report is the only data point~~ I was able to verify this last night
- The network filesystem (`/hpfs/...`) could be a contributing factor (e.g. NFS caching behavior), independent of the race
- ~~A corrupted or stale cache from a prior run could also produce similar errors~~ Accounted for in my test / setup in #302

## Environment (from the downstream report)

- DeepGEMM at commit 477618c (pinned by vLLM)
- vLLM 0.19.1rc1.dev44
- 8x NVIDIA H200, CUDA 12.8, RHEL 9.4
- Cache directory on a shared/network filesystem (`/hpfs/...`)

## Suggested fix

Add a per-kernel file lock (`flock()`) around the compilation step in `build()` with a double-checked locking pattern:

```cpp
// Fast path — already compiled in this process
if (const auto& runtime = kernel_runtime_cache->get(dir_path); runtime != nullptr)
    return runtime;

// Acquire cross-process lock
FileLock lock(lock_path);  // RAII wrapper around flock(LOCK_EX)

// Re-check after acquiring lock — another process may have compiled it
if (const auto& runtime = kernel_runtime_cache->get(dir_path); runtime != nullptr)
    return runtime;

// Only now compile...
```

This would have zero overhead during steady-state inference (the in-process cache returns before reaching the lock) and only serializes compilation of the *same* kernel across processes — different kernels still compile in parallel.
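For reference, here is a minimal sketch of what the `FileLock` RAII wrapper used above could look like. It is not existing DeepGEMM code: the class body, the error handling, and the choice to leave the lock file on disk are all assumptions, and where `lock_path` should live (for example, a file next to `dir_path`) is still open. It uses only POSIX `open()`, `flock()`, and `close()`.

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

#include <cerrno>
#include <string>
#include <system_error>

// Hypothetical RAII wrapper around flock(LOCK_EX); a sketch, not DeepGEMM's API.
class FileLock {
public:
    explicit FileLock(const std::string& path)
        : fd_(::open(path.c_str(), O_RDWR | O_CREAT | O_CLOEXEC, 0644)) {
        if (fd_ < 0)
            throw std::system_error(errno, std::generic_category(), "open " + path);
        // Block until the exclusive lock is granted; retry if a signal interrupts the wait.
        while (::flock(fd_, LOCK_EX) != 0) {
            if (errno != EINTR) {
                const int err = errno;
                ::close(fd_);
                throw std::system_error(err, std::generic_category(), "flock " + path);
            }
        }
    }

    ~FileLock() {
        // Closing the descriptor would also drop the lock; unlock explicitly for clarity.
        ::flock(fd_, LOCK_UN);
        ::close(fd_);
    }

    // The lock is tied to this descriptor, so copying is disabled.
    FileLock(const FileLock&) = delete;
    FileLock& operator=(const FileLock&) = delete;

private:
    int fd_;
};
```

The lock file is deliberately never deleted: unlinking it while another process is waiting on the old inode would let two processes hold "the" lock at once. One caveat for the downstream setup: `flock()` behavior on network filesystems depends on the client and server (Linux NFS clients emulate it with `fcntl()` byte-range locks), so this would need testing on the filesystem from the report.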

Will try to reproduce and create a fix.
