Replies: 4 comments
-
Thanks for sharing, we have continuous performance benchmarking in place. Feel free to contribute. Could you please share your scripts and dataset? Have you also tested vLLM?
-
Thanks for the feedback. I'm in the process of testing all permutations 15 times instead of 3, which leads to more reliable numbers, but the general trends are there. Very impressive. I had thought that ctranslate2 was the fastest. Unfortunately, I'm not familiar with vLLM and don't have the time to educate myself (as a non-programmer by trade) on a new backend. It took me a while to get llama_cpp going, for example... if you or anyone have any starter scripts, that'd help. I'm new to software testing and am using my own personal benchmarks since every repository has different ones, but I'm controlling for the same settings as best I can... I'll share my scripts once I update the graph with 15 tests each. I was looking forward to also testing Vulkan (and other llama_cpp backends), but unfortunately it seems to be borked as of a certain commit, with plans to fix it in the near future.
-
Updated, see my first post.
-
Does the Python client have the same performance as a C++ client for llama.cpp?
-
Here's my initial testing. Feel free to contact me if you want the actual test scripts, as I'm hesitant to paste them in their entirety here! EDITED to include numbers from running 15 tests of all models now:
Testing Procedure:
torch==2.2.0 and nvidia-ml-py==12.535.133
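For reference, nvidia-ml-py handles the GPU-side measurements, roughly like this (a minimal sketch, assuming a single GPU at index 0):

```python
# Minimal sketch of reading VRAM usage via nvidia-ml-py (imported as pynvml).
# Assumes one GPU at index 0; adjust the index on multi-GPU machines.
import pynvml

def vram_used_mib(device_index: int = 0) -> float:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return info.used / 1024**2  # bytes -> MiB
    finally:
        pynvml.nvmlShutdown()
```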
Here are the relevant portions of the scripts used to test, with private information and redundant code omitted (but noted) where appropriate:
BitsAndBytes
Then there is a separate class for each model tested, and here is one example:
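In rough outline, each class wraps a 4-bit load like this (a sketch only; the class name, model_id, and generation settings are placeholders, not my exact values):

```python
# Sketch of a per-model bitsandbytes test class (placeholders, not the exact script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

class MistralTest:  # hypothetical example; one class like this exists per model
    model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model

    def load(self):
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_id,
            quantization_config=bnb_config,
            device_map="auto",
        )

    def generate(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        out = self.model.generate(**inputs, max_new_tokens=512, do_sample=False)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)
```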
The test scripts for Ctranslate2 and llama_cpp are each self-contained in one script, but the bitsandbytes testing took two scripts. Here is the script that calls the class script above; obviously, you'd comment/uncomment the models you want to test. NOTE: this test is geared towards a RAG application - hence the long "user message" - because that is what my personal repository is all about:
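(Again a rough sketch; the module name, class list, timing loop, and the long user message are placeholders.)

```python
# Sketch of the driver script for the bitsandbytes tests (placeholders throughout).
import time

from bnb_models import MistralTest  # hypothetical module holding the per-model classes

# Comment/uncomment the models you want to test.
MODELS = [
    MistralTest,
    # LlamaTest,
    # PhiTest,
]

USER_MESSAGE = "..."  # the long RAG-style user message goes here

def run(test_cls, repeats: int = 15):
    test = test_cls()
    test.load()
    for _ in range(repeats):
        start = time.perf_counter()
        test.generate(USER_MESSAGE)
        print(f"{test_cls.__name__}: {time.perf_counter() - start:.2f}s")

for cls in MODELS:
    run(cls)
```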
Ctranslate2
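The Ctranslate2 path looks roughly like this (a sketch; the model path, tokenizer id, compute type, and max_length are placeholders):

```python
# Sketch of the Ctranslate2 test (placeholders, not the exact script).
import ctranslate2
import transformers

generator = ctranslate2.Generator("path/to/ct2-model", device="cuda", compute_type="int8")
tokenizer = transformers.AutoTokenizer.from_pretrained("original/hf-model-id")  # placeholder

prompt = "..."  # the same long RAG-style user message
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# sampling_topk=1 is the greedy setting used to mirror do_sample=False elsewhere.
results = generator.generate_batch([tokens], max_length=512, sampling_topk=1)
print(tokenizer.decode(results[0].sequences_ids[0]))
```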
Llama_cpp
This script only processed one model at a time and was a royal pain in the ass to run manually multiple times... I ran into problems making it batch-process and give reliable outputs for some reason, so I was forced to do it this way. NOTE: I had to change the structure of the "prompt" to remove all newlines to get the model to respond properly, just FYI:
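(Rough shape below; model_path, n_ctx, and the flattened prompt are placeholders.)

```python
# Sketch of the llama_cpp test (placeholders, not the exact script).
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/model.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,                  # offload all layers to the GPU
    n_ctx=4096,
)

# Prompt flattened onto a single line (newlines removed, as noted above).
prompt = "USER: ... ASSISTANT:"

out = llm(prompt, max_tokens=512, temperature=0.0)  # temperature 0.0 -> greedy
print(out["choices"][0]["text"])
```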
In conclusion, this is a hobby of mine and I'm not a programmer by trade. However, I've tried to control for as many constants as possible despite the varying APIs between libraries - e.g. one backend uses "do_sample=false" while I couldn't find anything identical in llama_cpp (see the mapping below)...
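For what it's worth, this is the mapping I treat as "greedy everywhere" (my assumption; worth double-checking against each library's docs):

```python
# Decoding settings treated as equivalent greedy decoding across backends
# (my assumption; verify against each library's documentation).
GREEDY = {
    "transformers": {"do_sample": False},  # HF generate()
    "ctranslate2": {"sampling_topk": 1},   # top-1 sampling == greedy
    "llama_cpp": {"temperature": 0.0},     # temperature 0 == greedy
}
```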
Feedback and constructive criticism are always welcome, as my goal is accurate testing, not feeding my ego about which backend is best. Thanks!