Problem Description
When running GLM-4.5-Air 106B (an MoE model) across three A100 32GB SXM2 GPUs, the model consistently produces garbled (nonsensical) output. The issue only occurs when the third GPU is connected via an M.2 NVMe to PCIe adapter. If the model is distributed only across the two GPUs connected to the native motherboard PCIe slots, the output is normal and correct.
Expected behavior: The model should generate coherent text across all three GPUs, regardless of their connection path.
Actual behavior: Consistent garbled output when the M.2-connected GPU is involved.
Environment and Setup
- Motherboard: Hua Nan Gold X99 F8d Plus
- GPUs: 3x NVIDIA A100 SXM2 (32GB VRAM each)
- GPU Configuration:
- GPU 0 & 1: Connected to native PCIe slots on the motherboard.
- GPU 2: Connected via an M.2 NVMe to PCIe adapter (using an external GPU dock with an independent PSU).
- OS: Windows 10
- Software: Tabby API, utilizing ExLlamaV3.
- Model: GLM-4.5-Air 106B (a large Mixture-of-Experts model).
All three GPUs are correctly recognized by the system and nvidia-smi. The issue is specific to the model inference output when the third GPU participates in the computation.
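To rule out link degradation on the adapter-attached card, nvidia-smi can report each GPU's current versus maximum PCIe generation and lane width (a minimal sketch; the query fields are standard `nvidia-smi --query-gpu` properties):

```python
import shutil
import subprocess

# PCIe generation and lane width, both current and maximum.
# If the M.2 adapter drops GPU 2 to a narrower or slower link,
# the "current" columns will differ from the "max" columns.
PCIE_FIELDS = [
    "index", "name",
    "pcie.link.gen.current", "pcie.link.gen.max",
    "pcie.link.width.current", "pcie.link.width.max",
]

def build_query_cmd(fields=PCIE_FIELDS):
    """Construct the nvidia-smi command that queries PCIe link status."""
    return ["nvidia-smi",
            "--query-gpu=" + ",".join(fields),
            "--format=csv"]

def report_pcie_links():
    """Print per-GPU PCIe link info, or a notice if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
        return None
    result = subprocess.run(build_query_cmd(), capture_output=True, text=True)
    return result.stdout

if __name__ == "__main__":
    out = report_pcie_links()
    if out:
        print(out)
```

`nvidia-smi topo -m` is also useful here, since it shows whether GPU 2 routes through the chipset (PHB/NODE) rather than directly to the CPU's PCIe root.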
Hypothesis and Analysis
I suspect the problem stems from data synchronization issues between the GPUs due to heterogeneous PCIe latencies and bandwidth.
- PCIe Topology Heterogeneity: The two GPUs in the native PCIe slots have a direct, high-bandwidth, low-latency path to the CPU/RAM. In contrast, the GPU connected via the M.2 adapter communicates over a different path (likely a PCIe x4 link via the chipset), which may introduce higher and variable latency.
- Impact on Model Parallelism: When ExLlamaV3 distributes a model (especially a large MoE model like GLM-4.5) across multiple GPUs, it requires tight synchronization between devices during forward passes (e.g., for All-Reduce operations). If one GPU (the M.2-connected one) consistently responds slower than the others, it can break this synchronization.
- Result: Subsequent computations are then based on incomplete or out-of-sync intermediate data from the different GPUs, producing the garbled final output. This is akin to a latency-induced data race in a distributed system.
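To illustrate the hypothesis (purely a toy sketch, not ExLlamaV3's actual execution model), here is a simulation in which one of three virtual devices serves a stale intermediate activation once synchronization is missed; the final output diverges from the correctly synchronized run:

```python
import random

def layer(x, w):
    # Toy "layer": elementwise affine transform plus a ReLU-like clamp.
    return [max(0.0, w * v + 0.1) for v in x]

def run(x, weights, stale_device=None):
    """Run layers round-robin across 3 virtual devices. If stale_device
    is set, that device returns its previously cached output instead of
    the freshly computed one, modeling a missed synchronization."""
    cache = {}
    for i, w in enumerate(weights):
        dev = i % 3
        y = layer(x, w)
        if dev == stale_device and dev in cache:
            y = cache[dev]  # out-of-sync: stale intermediate activation
        cache[dev] = y
        x = y
    return x

random.seed(0)
x0 = [random.uniform(-1, 1) for _ in range(4)]
ws = [random.uniform(0.5, 1.5) for _ in range(6)]

good = run(x0, ws)                   # all devices synchronized
bad = run(x0, ws, stale_device=2)    # device 2 falls behind once
print("outputs diverge:", good != bad)
```

A single stale activation late in the stack is enough to change the entire output vector, which is consistent with small synchronization failures producing fully garbled generations rather than mildly degraded ones.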
Request for Guidance and Features
Could you please provide insight into this issue?
- Parameter Adjustment: Are there any existing parameters or synchronization settings in ExLlamaV3 that could be adjusted to tolerate higher inter-GPU latency?
- Future Development: Would it be possible to consider developing a mechanism within ExLlamaV3 to better handle multi-GPU setups with heterogeneous PCIe topologies? For example, more robust synchronization primitives or configurable timeouts for inter-GPU communication.
- General Advice: Any other suggestions or best practices for using ExLlamaV3 in such mixed-connectivity multi-GPU environments would be greatly appreciated.
Additional Information
- This setup works perfectly with two GPUs on native PCIe slots.
- The problem is reproducible and consistently leads to garbled text.
- GLM-4.5-Air is an MoE model (106B total parameters, 12B activated), which might be particularly sensitive to such synchronization issues due to its architecture.
- If I load a smaller model that fits on a single GPU, all three GPUs work normally.
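In the meantime, one way to narrow down the faulty pairing is to bisect over 2-GPU subsets with `CUDA_VISIBLE_DEVICES` (a sketch; the server launch command in the comment is only a placeholder, not TabbyAPI's actual CLI):

```python
import itertools

GPUS = ["0", "1", "2"]  # GPU 2 is the M.2-adapter-attached card

def visible_device_sets(gpus=GPUS):
    """All 2-GPU combinations to test in turn. If only the pairs that
    include GPU 2 produce garbled output, the M.2 link is implicated
    rather than the 3-way split itself."""
    return [",".join(pair) for pair in itertools.combinations(gpus, 2)]

for devs in visible_device_sets():
    # Launch the server once per setting, e.g.:
    #   CUDA_VISIBLE_DEVICES=0,2 <server launch command>
    print(f"CUDA_VISIBLE_DEVICES={devs}")
```

If the 0,2 and 1,2 pairs both fail while 0,1 works, that would localize the corruption to the adapter path independently of the tensor-split configuration.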
Thank you for your time and for developing this excellent inference engine. I look forward to your thoughts.