This version adds experimental tool-call support to `dllama-api`, currently only for non-streaming requests (`"stream": false`). Tested with `qwen3_8b_q40`.
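Assuming `dllama-api` keeps its OpenAI-compatible chat-completions endpoint, a non-streaming tool-call request might look like the sketch below. The tool name, schema, and message content are illustrative, not part of the release:

```python
import json

# Illustrative payload for a non-streaming tool-call request to dllama-api.
# The "tools" array follows the OpenAI-style function-calling format;
# the tool name and schema here are hypothetical examples.
payload = {
    "model": "qwen3_8b_q40",
    "stream": False,  # tool calls are currently supported only with streaming off
    "messages": [
        {"role": "user", "content": "What is the weather in Berlin?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Return the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

print(json.dumps(payload, indent=2))
```

Such a payload would be POSTed to the API's chat-completions route; with `"stream": false` the tool-call result arrives in a single JSON response rather than as server-sent events.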
This version improves the reliability of `dllama-api`. The API now runs as a persistent service designed for continuous operation: if any worker crashes, the API automatically attempts to reconnect to the failed node and reinitialize the cluster, with the goal of restoring service within moments of any node failure.
This version adds support for Qwen3 MoE models on CPU. Vulkan support will be added in a future release.
The performance of MoE models is quite impressive: `Qwen3-30B-A3B-Q40` achieves 13.04 tok/s during prediction on 4× Raspberry Pi 5 (8GB). Check details here.
This version fixes a precision issue in multiplication for Qwen models on NVIDIA GPUs. Additionally, it includes several Vulkan shader improvements that increase inference speed.
This version introduces a small optimization for Vulkan that reduces the number of bytes required to synchronize between the CPU and GPU during prediction.
Tested on NVIDIA GeForce RTX 3060 12GB (with `--steps 128`):