Will llama.cpp be able to use Phi-2? #4437
Here is a related issue: #3146. It is a different model architecture, so it would require some work, but it is definitely doable. |
also interested in this -- weights are on huggingface from the official microsoft account: https://huggingface.co/microsoft/phi-2/tree/main |
Any help needed? |
@jadore801120 I think there is help needed. Back when phi-1.5 was released there was already a lot of demand for support, and yet it didn't happen (see #3146). I would be down to work on it, but I don't think I can do it alone. |
Unless I'm missing something, it looks relatively straightforward to support. The one thing I didn't quite get from the Python code is the use of cross-attention in some cases instead of the standard self-attention. It would be helpful if somebody could summarize the purpose of the cross-attention here. |
@ggerganov Personally I have found this implementation more intuitive. I think the cross-attention is just another way of treating cached keys and values (i.e. nothing special, still decoder-only). |
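(Editor's aside, not part of the original comment: a minimal sketch of what that "cross-attention" amounts to during generation, assuming a single head and per-token decoding. The query comes only from the current token, while the keys/values are the cache plus the current token, i.e. plain causal self-attention with a KV cache. Function and tensor names are illustrative.)

```python
import math
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, k_cache, v_cache):
    # q_new, k_new, v_new: (1, d) for the token being generated
    # k_cache, v_cache:    (t_past, d) from previously processed tokens
    k = torch.cat([k_cache, k_new], dim=0)          # (t_past + 1, d)
    v = torch.cat([v_cache, v_new], dim=0)
    scores = (q_new @ k.T) / math.sqrt(q_new.shape[-1])
    return F.softmax(scores, dim=-1) @ v            # (1, d); no future tokens to mask
```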
vLLM has this model implemented; in their discussion they mentioned this: "I believe the 'cross-attention' used in Phi-1.5 is not true cross-attention; it is just used for the current token to attend to the past KV-cache during autoregressive generation. From what I can see, the Phi-1.5 architecture is basically the same as GPT-NeoX with attention/FFN in parallel, except:
- In each transformer block the pre-FFN and pre-attention layernorms are the same (GPT-NeoX doesn't share these params, as noted in their paper).
- GPT-J has separate q, k, v projections, while Phi-1.5 has a fused W_qkv.
Start with the GPT-NeoX model, with the right configuration to match Phi-1.5 (including use_parallel_residual)." |
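(Editor's aside: a minimal PyTorch sketch of the block structure described above -- one shared pre-norm feeding both branches, a fused W_qkv, and attention and MLP added to the residual in parallel. It deliberately omits Phi's rotary embeddings and biases; all names and dimensions are illustrative, not taken from the Phi code.)

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelBlock(nn.Module):
    """Illustrative Phi-1.5/Phi-2-style block (not the actual implementation):
    one shared LayerNorm feeds both the attention path (fused W_qkv) and the
    MLP path, and both outputs are added to the residual in parallel."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.ln = nn.LayerNorm(d_model)              # shared pre-attention / pre-FFN norm
        self.wqkv = nn.Linear(d_model, 3 * d_model)  # fused qkv projection ("W_qkv")
        self.out_proj = nn.Linear(d_model, d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h = self.ln(x)                               # one norm for both branches
        q, k, v = self.wqkv(h).chunk(3, dim=-1)
        q, k, v = [z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v)]
        mask = torch.triu(torch.full((t, t), float("-inf"), device=x.device), diagonal=1)
        att = F.softmax((q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim) + mask, dim=-1) @ v
        att = self.out_proj(att.transpose(1, 2).reshape(b, t, d))
        return x + att + self.mlp(h)                 # parallel residual (use_parallel_residual)
```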
FWIW, Phi-2 in GGUF format is also supported by Candle. |
I wrote an implementation based on the MLX code (https://github.com/ml-explore/mlx-examples/tree/main/phi2). Had to make some breaking changes to llama.cpp to support the last layer of Phi-2, but it works! |
would love to test it 👍
|
@mrgraycode is it possible then to generate a GGUF from it already?
… hi @mrgraycode, I think you forgot to load biases, phi models have both weights + biases
|
Yes, you can download the weights from https://huggingface.co/microsoft/phi-2 and then use convert-hf-to-gguf.py to generate gguf. |
python convert-hf-to-gguf.py c:/phi-2 --outfile phi-2.gguf --outtype f16
Loading model: phi-2
Traceback (most recent call last):
  File "c:\llama.cpp-Phi2\convert-hf-to-gguf.py", line 1052, in <module>
    model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian)
  File "c:\llama.cpp-Phi2\convert-hf-to-gguf.py", line 48, in __init__
    self.model_arch = self._get_model_architecture()
  File "c:\llama.cpp-Phi2\convert-hf-to-gguf.py", line 229, in _get_model_architecture
    raise NotImplementedError(f'Architecture "{arch}" not supported!')
NotImplementedError: Architecture "MixFormerSequentialForCausalLM" not supported!
|
It seems like you are using some other checkpoint of phi-2, can you try with the huggingface one I shared? |
I used that one. Is it the right one? Thanks for your help. |
I believe I got the same error as @FiveTechSoft when I tried (will double check tonight or tomorrow) |
That is weird, because for me config.json shows the model architecture as |
@FiveTechSoft, @lostmygithubaccount just update hf + transformers libraries |
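(Editor's aside: a quick, hedged way to check which architecture string your local checkpoint actually reports -- the "MixFormerSequentialForCausalLM" error above usually points at a stale download or old libraries. transformers' AutoConfig is standard; the paths are just examples.)

```python
from transformers import AutoConfig

# Point this at the Hub id or at the local folder passed to convert-hf-to-gguf.py
# (e.g. "c:/phi-2"); a stale copy can still report the legacy
# "MixFormerSequentialForCausalLM" architecture that the convert script rejects.
cfg = AutoConfig.from_pretrained("microsoft/phi-2", trust_remote_code=True)
print(cfg.architectures)
```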
@FiveTechSoft For now, you can download already-quantized Microsoft Phi-2 GGUF files from radames/phi-2-quantized, or even make your own in Candle by running:
But I'm not sure if it is compatible with llama.cpp. |
Below is the video I created showing how to run phi-v2 on my Mac M1 (8 GB); you can still follow it to run on Linux or Windows as well. Summing up, the video above shows running phi-v2 using the huggingface/candle repo on GitHub, with a quantised GGUF model. Command 1: Instruction Format for phi-v2
|
Can you please show how to use another model? For example, I downloaded an 8-bit quantized model and I want to use it instead of the default 4-bit. Also, how can I use the model in chat mode, similar to the llama.cpp -ins command? Thank you. |
@alexcardo If you are using Candle Phi example, just replace |
Actually, TheBloke shared a version in GGUF format: https://huggingface.co/TheBloke/phi-2-GGUF |
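(Editor's aside: once Phi-2 support has landed in llama.cpp, a GGUF like TheBloke's can also be driven from Python via the llama-cpp-python bindings. A minimal sketch under those assumptions; the filename and prompt format are illustrative.)

```python
from llama_cpp import Llama  # assumes a llama-cpp-python build new enough to load Phi-2 GGUFs

# "phi-2.Q4_K_M.gguf" is an example filename; use whichever quantization you
# downloaded from https://huggingface.co/TheBloke/phi-2-GGUF
llm = Llama(model_path="phi-2.Q4_K_M.gguf", n_ctx=2048)
out = llm("Instruct: Explain what a KV cache is in one sentence.\nOutput:", max_tokens=64)
print(out["choices"][0]["text"])
```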
https://www.youtube.com/watch?v=ds9bQJFromU |
I can get TheBloke's version of Phi-2 GGUF running on CPU, but GPU (cuBLAS) doesn't work. Anyone else experiencing this? I pulled this one: It just outputs GGGGGG repeatedly. If I run llama.cpp on just the CPU it works fine. |
@SpaceCowboy850 there is currently an issue with phi2 with |
Okay, thanks! I'm getting by with the CPU version right now, but I realized I should probably surface this issue at some level in case it was unknown. Thanks for the tip! |
Surely we will have to wait for a GGUF version, but in the meantime I am just curious about it.
Thanks