Inference speed exl2 vs gguf - are my results typical? #471
Closed
LlamaEnjoyer
started this conversation in General
Replies: 1 comment
-
Got my answers in a reddit thread. Closing.
-
Hi folks!
I've been toying around with LLMs for the past few weeks, and it's become my new hobby :) I started out with LM Studio, but recently I installed ExUI to see for myself whether exl2 is really that awesome. Putting aside the hurdle-hopping it took to get it up and running on my Windows PC, I decided to run a quick speed test using the Llama 3 8B Instruct Q8_0 quants in both LM Studio and ExUI.
I tried to match the parameters between the two to keep it fair and unbiased: flash attention on, context set to 8192, FP16 cache in ExUI, no speculative decoding, and the GGUF fully offloaded to the GPU.
I used the following prompt:
"List the first 30 elements of the periodic table, stating their atomic masses in brackets. Do it as a numbered list."
LM Studio reported ~56 t/s while ExUI reported ~64 t/s, which makes exl2 roughly 14% faster than gguf in this specific test.
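The percentage is just the relative difference between the two reported rates; a quick sanity check:

```python
# Relative speedup computed from the t/s figures the two UIs reported.
gguf_tps = 56.0  # LM Studio (Q8_0 GGUF)
exl2_tps = 64.0  # ExUI (exl2)

speedup = (exl2_tps - gguf_tps) / gguf_tps
print(f"exl2 is {speedup:.1%} faster than gguf")  # -> exl2 is 14.3% faster than gguf
```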
Is this about in line with what should be expected?
My specs:
i7-14700K, 64 GB DDR4-4300 RAM, RTX 4070 Ti Super (16 GB VRAM), Windows 11 Pro.
Thanks!