Releases: b4rtaz/distributed-llama

0.16.5

02 Feb 12:06
a003363

This version adds experimental tool-call support to dllama-api; it currently works only for non-streaming requests (stream: false). Tested with qwen3_8b_q40.
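A request exercising the new tool-call path might look like the sketch below. The endpoint path, port, and exact schema are assumptions based on the OpenAI-style chat completions format (and the `get_weather` tool is invented for illustration); check the project docs for the real specifics.

```python
import json

# Sketch of a non-streaming tool-call request.
# The release notes say tool calls currently require stream: false.
payload = {
    "model": "qwen3_8b_q40",  # the model the release notes say was tested
    "stream": False,          # tool calls work only without streaming for now
    "messages": [
        {"role": "user", "content": "What is the weather in Berlin?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, not part of the project
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

body = json.dumps(payload)
# Send with e.g. requests.post("http://localhost:9990/v1/chat/completions", data=body)
# (host, port, and path are placeholders for your own setup).
```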

0.16.4

17 Jan 23:46
e819dc6

Added the --host argument (e.g. --host 0.0.0.0) to allow overriding the default binding host #231.

0.16.3

26 Oct 11:05
96c661e

This version improves the reliability of dllama-api. The API now runs as a persistent service designed for continuous operation. If any worker crashes, the API automatically attempts to reconnect to the failed node and reinitialize the cluster, so that the API returns to service within moments of any node failure.
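The recovery behavior described above amounts to a retry-then-reinitialize loop. The following is only an illustrative sketch of that pattern; the function names and retry policy are invented for this example, not taken from the actual dllama-api code.

```python
import time

def recover(connect, reinit, retry_delay=0.0):
    """Retry connecting to a failed worker, then reinitialize the cluster.

    `connect` and `reinit` are stand-ins for the real networking calls;
    the loop structure, not the names, is the point.
    """
    while True:
        try:
            connect()   # reconnect to the failed node
            reinit()    # reinitialize cluster state once the node is back
            return True
        except ConnectionError:
            time.sleep(retry_delay)  # back off before the next attempt

# Tiny simulation: the first two connection attempts fail, the third succeeds.
attempts = {"n": 0}

def fake_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("worker down")

recovered = recover(fake_connect, reinit=lambda: None)
print(recovered, attempts["n"])  # True 3
```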

0.16.2

20 Sep 13:58

Fixed Vulkan support on Raspberry Pi 5. Distributed Llama now runs with Vulkan, though it is slower than CPU-only execution #259.

0.16.1

16 Sep 21:07
649649f

This version adds support for Qwen3 MoE models on Vulkan.

0.16.0

05 Sep 17:18
5f5adaf

This version adds support for Qwen3 MoE models on CPU. Vulkan support will be added in a future release.

The performance of MoE models is quite impressive: Qwen3-30B-A3B-Q40 achieves 13.04 tok/s during prediction on 4× Raspberry Pi 5 (8GB). Check details here.

0.15.4

20 Aug 17:14
b9ec995

This version brings another speedup in Vulkan inference.

Prediction (--steps 128)

RTX 3090 24GB, AMD EPYC 7313 16-Core Processor #252

| Model | Tokens/s (0.15.1) | Tokens/s (0.15.2) | Tokens/s (0.15.3) | Tokens/s (this version) |
| --- | --- | --- | --- | --- |
| llama3_1_8b_instruct_q40 | 24.80 | 24.80 | 33.32 | 45.33 🚀 |

0.15.3

17 Aug 10:24
8909825

This version fixes a precision issue in multiplication for Qwen models on NVIDIA GPUs. Additionally, it includes several Vulkan shader improvements that increase inference speed.

Prediction

CPU: Xeon® E5-2650 v4, Mainboard: Z10PG-D24 Series, GPU: NVIDIA GeForce RTX 3060 12GB #249

| Model | Tokens/s (0.15.0) | Tokens/s (0.15.2) | Tokens/s (0.15.3) |
| --- | --- | --- | --- |
| qwen3_8b_q40 | 12.9 | 13.65 | 16.86 |

0.15.2

13 Aug 23:03
eda0684

This version brings another small improvement for Vulkan.

Tested on NVIDIA GeForce RTX 3060 12GB (prediction, with --steps 128) #247:

| Model | Tokens/s (previous version) | Tokens/s (0.15.2) |
| --- | --- | --- |
| llama3_1_8b_instruct_q40 | 14.87 | 16.01 |

0.15.1

12 Aug 21:29
01305c9

This version introduces a small optimization for Vulkan that reduces the number of bytes required to synchronize between the CPU and GPU during prediction.

Tested on NVIDIA GeForce RTX 3060 12GB (with --steps 128):

| Model | Tokens/s (previous version) | Tokens/s (0.15.1) |
| --- | --- | --- |
| llama3_1_8b_instruct_q40 | 13.68 | 14.83 |
| qwen3_0.6b_q40 | 44.41 | 61.98 |