ctransformers: another attempt #3313
Conversation
I have built ctransformers wheels for various CUDA versions and linked the CUDA 11.7 one here:
@jllllll Thanks, updated to use your build.
It looks like unloading a CtransformersModel does not release the GPU memory it occupies.
I have some basic questions about ctransformers:
Falcon models can be converted and quantized using the conversion script and binaries provided by ggllm.cpp. The primary use case for ctransformers is access to non-LLaMa models. Its llama.cpp bindings don't provide anything that llama-cpp-python doesn't already, as far as I can tell. Starcoder, Falcon and MPT ggml models are what most people want out of ctransformers. There are also Replit, Dolly and the GPT models, but I haven't seen much interest in them lately. It is also worth mentioning that work is being done to unify all of the various ggml implementations into a single file format:
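For reference, loading one of those models through the ctransformers Python API looks roughly like this (the path and layer count are placeholders):

```python
from ctransformers import AutoModelForCausalLM

# Placeholder path; any Falcon GGML quantization works the same way.
llm = AutoModelForCausalLM.from_pretrained(
    "models/falcon-7b-instruct.ggccv1.q4_0.bin",
    model_type="falcon",  # tells ctransformers which ggml architecture to use
    gpu_layers=50,        # number of layers to offload to the GPU (0 = CPU only)
)
print(llm("Write one sentence about GGML.", max_new_tokens=64))
```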
@lppllppl920
It seems a destructor already exists. I'm not familiar enough with the ctransformers C/C++ ggml backend to tell if there's a memory leak somewhere.
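If the leak were on the Python side, an explicit unload would usually amount to dropping the reference and forcing a collection; a minimal sketch, assuming the wrapper keeps the ctransformers LLM object in `self.model`:

```python
import gc

class CtransformersModel:
    def __init__(self, model):
        self.model = model  # the underlying ctransformers LLM object

    def unload(self):
        # Drop the only Python reference and run finalizers promptly.
        # If VRAM is still held after this, the leak is in the C/C++
        # ggml backend rather than in this wrapper.
        self.model = None
        gc.collect()
```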
Force-pushed from 5ceac52 to e954b84.
I checked the PR and it doesn't mention WizardLM in the models list; I thought ctransformers supported WizardLM...
WizardLM is llama/llama2.
2 more questions:

- I assume this PR does the same thing as "Add starcoder and starchat ggml support" #2364 and "Add support for starcoder ggml and similar (GPT-2 GGML?)" #2892, so if it gets merged, both can be closed. Is that correct?
- The README needs to be updated to briefly mention ctransformers support, and it would be nice to have a short entry under `docs/` mentioning what it is and giving a couple of examples of models that it can load.
modules/text_generation.py (outdated)

```diff
@@ -89,8 +97,12 @@ def _generate_reply(question, state, stopping_strings=None, is_chat=False):
         yield reply

+encode_llama_prompts = ['LlamaCppModel', 'RWKVModel', 'CtransformersModel']
```
I don't like these variables floating here and think that they should be removed.
I hate probing by class names, but I have put them back for now. I'll try to refactor them next time.
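For context, the probe under discussion amounts to comparing class names, roughly like this (the helper function is hypothetical; the list mirrors the one in the diff):

```python
# A sketch of the class-name probe discussed above.
encode_llama_prompts = ['LlamaCppModel', 'RWKVModel', 'CtransformersModel']

def needs_llama_style_prompt_encoding(model) -> bool:
    # Comparing class names avoids importing every optional backend
    # (llama-cpp-python, rwkv, ctransformers) just to run isinstance checks.
    return model.__class__.__name__ in encode_llama_prompts
```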
Generalized ctransformers based on: oobabooga#2892. Credits to randoentity.
@oobabooga Yes, this was based on #2892 from @randoentity.
Looks good now. It's very barebones but it works with falcon-7b: `python server.py --model models/falcon-7b-instruct.ggccv1.q4_0.bin --loader ctransformers --model_type "falcon" --n-gpu-layers 10000`. I have added @randoentity as a coauthor. Before the merge I also made several little adjustments. Thank you for your work on this PR.
Awesome, thanks.
@oobabooga thanks for mentioning me, but most of the work was by @s-kostyaev :) I just did a rebase and a minor refactor.
Thanks for the work here @cal066! I missed your PR but did see GPU offloading was added upstream last week and wanted to bring it here, but ${life}. I'm awestruck by how much momentum this project has. 🚀
Oops, it looks like I messed up then; I should have credited @s-kostyaev as well. Sorry! I have added one more parameter for ctransformers here: 28c8df3. I find it weird that some of these parameters appear twice.
@oobabooga I added those according to the def parameters I found in the ctransformers Python code. The double seed probably isn't needed: it tries to take the seed from the generate call if possible and then falls back to the config option. I have created #3543 to resolve this.
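The intended precedence, sketched (the helper and its arguments are hypothetical):

```python
def resolve_seed(generate_kwargs, config):
    # Prefer a seed passed to the generate call; otherwise fall back
    # to the config-level option, as described above.
    seed = generate_kwargs.get('seed')
    return seed if seed is not None else config.seed
```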
@cal066 thoughts on changing
@lppllppl920 I just tried TheBloke/airoboros-mpt-30b-gpt4-1p4-GGML; it's not quite reliable. I can make it select ctransformers, but the 'None' model_type still needs to be manually set to 'mpt'. 'None' is equivalent to auto-detect, but requires a config.json in the model directory, something like this:
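A minimal sketch of that config.json, assuming `model_type` is the only field ctransformers needs for auto-detection:

```json
{
  "model_type": "mpt"
}
```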
Generalized ctransformers based on #2892. Credits to randoentity.
RFC since it moves some control structures around to allow extensions to overload them more easily.
Select Model Type "None" to allow ctransformers to guess the model type from a config.json file in the model directory.
You may need to install ctransformers manually at this time; note that the latest binaries use CUDA 12.x instead of 11.x:
env CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers
Edit: now using the ctransformers build from @jllllll.