GPU Underutilization and CPU Bottleneck with exllamav3 #97

@DanielusG

Description

I’ve tested multiple models (e.g., Devstral, Qwen Coder 30B, Qwen3 4B, Qwen3 14B) using ExLlamaV3 on my hardware, and I consistently encounter the same issue:

  • GPU underutilization: The GPU usage peaks at only 80%, even during inference.
  • Power consumption: The GPU draws 80–100W (out of a 160W power limit).
  • CPU bottleneck: During inference, one CPU core consistently reaches 100% utilization, suggesting a potential bottleneck on the CPU side.

My hardware configuration:

  • CPU: AMD Ryzen 9 5900X
  • GPU: NVIDIA RTX 5060 Ti 16GB
  • RAM: 32GB DDR4
  • OS: Arch Linux with CUDA 13 installed

Typical use: Cline

I suspect the CPU might be the limiting factor here. Could there be a specific setting, driver issue, or resource allocation problem on my end causing this behavior?
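To make the suspected single-core bottleneck easier to confirm, here is a minimal monitoring sketch that samples per-core CPU load from `/proc/stat` alongside GPU utilization and power from the stock `nvidia-smi` CLI. This is only an illustrative diagnostic, assuming Linux and a working NVIDIA driver; the function names are my own.

```python
# Minimal sketch: sample per-core CPU busy fraction (from /proc/stat on
# Linux) and GPU utilization/power (via nvidia-smi) over a short window,
# to check whether one core is pinned while the GPU sits at ~80%.
import subprocess
import time


def read_core_busy():
    """Return a list of (busy_jiffies, total_jiffies) tuples, one per core."""
    cores = []
    with open("/proc/stat") as f:
        for line in f:
            # Per-core lines look like "cpu0 ...", "cpu1 ..."; the
            # aggregate "cpu " line is skipped.
            if line.startswith("cpu") and line[3].isdigit():
                vals = [int(v) for v in line.split()[1:]]
                idle = vals[3] + vals[4]          # idle + iowait jiffies
                cores.append((sum(vals) - idle, sum(vals)))
    return cores


def core_utilization(before, after):
    """Per-core utilization in [0, 1] between two read_core_busy() snapshots."""
    out = []
    for (b0, t0), (b1, t1) in zip(before, after):
        dt = t1 - t0
        out.append((b1 - b0) / dt if dt else 0.0)
    return out


def gpu_sample():
    """Query GPU utilization (%) and power draw (W) from nvidia-smi."""
    csv = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    util, power = csv.strip().split(", ")
    return float(util), float(power)


def sample_once(window_s=1.0):
    """Take one combined sample over a window and print a summary line."""
    before = read_core_busy()
    time.sleep(window_s)
    after = read_core_busy()
    utils = core_utilization(before, after)
    gpu_util, gpu_power = gpu_sample()
    print(f"GPU {gpu_util:.0f}% @ {gpu_power:.0f} W, "
          f"busiest core {max(utils):.0%}")
```

Running `sample_once()` in a loop during generation should, if the report above is accurate, show one core near 100% while GPU utilization hovers around 80%.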

Translated with Qwen3 14b
