GPU Underutilization and CPU Bottleneck with exllamav3 #97

@DanielusG

Description

I’ve tested multiple models (e.g., Devstral, Qwen Coder 30B, Qwen3 4B, Qwen3 14B) using ExLlamaV3 on my hardware, and I consistently encounter the same issue:

  • GPU underutilization: The GPU usage peaks at only 80%, even during inference.
  • Power consumption: The GPU draws 80–100W (out of a 160W power limit).
  • CPU bottleneck: During inference, one CPU core consistently reaches 100% utilization, suggesting a potential bottleneck on the CPU side.

My hardware configuration:

  • CPU: AMD Ryzen 9 5900X
  • GPU: NVIDIA RTX 5060 Ti 16GB
  • RAM: 32GB DDR4
  • OS: Arch Linux with CUDA 13 installed

Typical use: Cline

I suspect the CPU might be the limiting factor here. Could there be a specific setting, driver issue, or resource allocation problem on my end causing this behavior?
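To make the suspected single-core bottleneck easier to confirm, here is a minimal monitoring sketch that samples per-core CPU load from `/proc/stat` alongside GPU utilization and power from the stock `nvidia-smi` CLI. This is only an illustrative diagnostic, assuming Linux and a working NVIDIA driver; the function names are my own.

```python
# Minimal sketch: sample per-core CPU busy fraction (from /proc/stat on
# Linux) and GPU utilization/power (via nvidia-smi) over a short window,
# to check whether one core is pinned while the GPU sits at ~80%.
import subprocess
import time


def read_core_busy():
    """Return a list of (busy_jiffies, total_jiffies) tuples, one per core."""
    cores = []
    with open("/proc/stat") as f:
        for line in f:
            # Per-core lines look like "cpu0 ...", "cpu1 ..."; the
            # aggregate "cpu " line is skipped.
            if line.startswith("cpu") and line[3].isdigit():
                vals = [int(v) for v in line.split()[1:]]
                idle = vals[3] + vals[4]          # idle + iowait jiffies
                cores.append((sum(vals) - idle, sum(vals)))
    return cores


def core_utilization(before, after):
    """Per-core utilization in [0, 1] between two read_core_busy() snapshots."""
    out = []
    for (b0, t0), (b1, t1) in zip(before, after):
        dt = t1 - t0
        out.append((b1 - b0) / dt if dt else 0.0)
    return out


def gpu_sample():
    """Query GPU utilization (%) and power draw (W) from nvidia-smi."""
    csv = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    util, power = csv.strip().split(", ")
    return float(util), float(power)


def sample_once(window_s=1.0):
    """Take one combined sample over a window and print a summary line."""
    before = read_core_busy()
    time.sleep(window_s)
    after = read_core_busy()
    utils = core_utilization(before, after)
    gpu_util, gpu_power = gpu_sample()
    print(f"GPU {gpu_util:.0f}% @ {gpu_power:.0f} W, "
          f"busiest core {max(utils):.0%}")
```

Running `sample_once()` in a loop during generation should, if the report above is accurate, show one core near 100% while GPU utilization hovers around 80%.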

Translated with Qwen3 14b
