This version adds experimental tool-call support to `dllama-api`, currently only for non-streaming requests (`"stream": false`). Tested with `qwen3_8b_q40`.
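Assuming `dllama-api` keeps its OpenAI-compatible chat-completions endpoint, a non-streaming tool-call request might look like the sketch below. The tool name, schema, and message content are illustrative, not part of the release:

```python
import json

# Illustrative payload for a non-streaming tool-call request to dllama-api.
# The "tools" array follows the OpenAI-style function-calling format;
# the tool name and schema here are hypothetical examples.
payload = {
    "model": "qwen3_8b_q40",
    "stream": False,  # tool calls are currently supported only with streaming off
    "messages": [
        {"role": "user", "content": "What is the weather in Berlin?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Return the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

print(json.dumps(payload, indent=2))
```

Such a payload would be POSTed to the API's chat-completions route; with `"stream": false` the tool-call result arrives in a single JSON response rather than as server-sent events.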
This version improves the reliability of `dllama-api`. The API now runs as a persistent service designed for continuous operation: if any worker crashes, the API automatically attempts to reconnect to the failed node and reinitialize the cluster, with the goal of restoring service within moments of any node failure.
This version adds support for Qwen3 MoE models on CPU. Vulkan support will be added in a future release.
The performance of MoE models is quite impressive: `Qwen3-30B-A3B-Q40` achieves 13.04 tok/s during prediction on 4× Raspberry Pi 5 (8GB). Check details here.
This version fixes a precision issue in multiplication for Qwen models on NVIDIA GPUs. Additionally, it includes several Vulkan shader improvements that increase inference speed.
This version introduces a small optimization for Vulkan that reduces the number of bytes required to synchronize between the CPU and GPU during prediction.
Tested on NVIDIA GeForce RTX 3060 12GB (with `--steps 128`):