
Will llama.cpp be able to use Phi-2? #4437


Closed
FiveTechSoft opened this issue Dec 13, 2023 · 27 comments · Fixed by #4490
Labels: enhancement (New feature or request), good first issue (Good for newcomers), model (Model specific)

Comments

@FiveTechSoft

Surely we have to wait for a GGUF version, but in the meantime I'm just curious about it.

Thanks!

FiveTechSoft added the enhancement label on Dec 13, 2023
@bachittle
Contributor

Here is a related issue: #3146

It is a different model architecture, so it would require some work, but it's definitely doable.

@lostmygithubaccount

Also interested in this -- the weights are on Hugging Face from the official Microsoft account: https://huggingface.co/microsoft/phi-2/tree/main

@jadore801120

Any help needed?

@axelpey

axelpey commented Dec 14, 2023

@jadore801120 I think there is help needed. Back when Phi-1.5 was released there was already a lot of demand for support, and yet it didn't happen (see #3146).

I'd be down to work on it, but I don't think I can do it alone.

@ggerganov
Member

Unless I'm missing something, it looks relatively straightforward to support.

The one thing I didn't quite get from the Python code is the use of cross-attention in some cases instead of the standard self-attention. It would be helpful if somebody could summarize the purpose of the cross-attention here.

ggerganov added the good first issue and model labels on Dec 14, 2023
@mory91

mory91 commented Dec 14, 2023

@ggerganov Personally I have found this implementation more intuitive. I think the cross-attention is just another way of treating cached keys and values (i.e. nothing special, it is still decoder-only).
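
To make that concrete, here is a minimal, hedged sketch in plain PyTorch (not the actual Phi modelling code; shapes and names are illustrative): during generation the query comes only from the new token, while the keys and values are the cached ones plus the new token's, so the "cross-attention" path is just ordinary decoder-side attention over the KV cache.

import torch
import torch.nn.functional as F

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    # Append the new token's key/value to the running cache (decoder-only, autoregressive).
    k = torch.cat([k_cache, k_new], dim=1)   # (batch, past_len + 1, head_dim)
    v = torch.cat([v_cache, v_new], dim=1)
    # The "cross-attention" is just the new query attending over the full cache;
    # no causal mask is needed because only past and current positions are present.
    scores = q_new @ k.transpose(-1, -2) / k.shape[-1] ** 0.5
    out = F.softmax(scores, dim=-1) @ v
    return out, k, v                          # attention output plus the updated cache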

@dnlwbb

dnlwbb commented Dec 14, 2023

vLLM has this model implemented; in their discussion they mentioned this:

"I believe the "cross-attention" used in Phi-1.5 is not true cross-attention, it's just used for current token to attend to past KV-cache during autoregressive generation. From what I can see, the Phi-1.5 architecture is basically the same as a GPT-NeoX with attention/FFN in parallel, except:

In each transformer block the pre-FFN and pre-Attention layernorms are the same (GPT-NeoX doesn't share these params, as noted in their paper).
Phi-1.5 has a bias on the output linear layer, and GPT-NeoX does not.
Alternatively, Phi-1.5 has the same architecture as GPT-J, except:

GPT-J has separate q, k, v projections and Phi-1.5 has W_qkv
GPT-J does not have bias on the q, k, v layers, and Phi-1.5 does have bias on W_qkv
It shouldn't be too crazy to adapt these, given that vLLM already supports GPT-NeoX and GPT-J. The MixFormer modelling file on Hugging Face is unnecessarily complicated relative to how small the architecture changes they made are relative to GPT-NeoX/GPT-J. To me it seems like the easiest way would be:

Start with GPT-NeoX model, with the right configuration to match Phi-1.5 (including use_parallel_residual)
When loading weights in from Phi-1.5, just copy the same layernorm weight and bias from Phi into both input_layernorm and post_attn_layernorm in GPT-NeoX
Modify the output linear layer to have a bias so that the bias can be loaded in as well."
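
For illustration only, here is a rough sketch of the weight-copying idea from the quote above, written as plain tensor remapping; the key names are assumptions, not the real Phi or GPT-NeoX state-dict keys, so treat this as a sketch of the approach rather than working conversion code.

import torch

def remap_phi_block(phi: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    # Map one Phi transformer block onto a GPT-NeoX-style layout, following the
    # recipe quoted above. All key names here are illustrative guesses.
    neox = {}
    # Phi has a single pre-norm per block; GPT-NeoX expects two, so duplicate it.
    neox["input_layernorm.weight"] = phi["ln.weight"]
    neox["input_layernorm.bias"] = phi["ln.bias"]
    neox["post_attention_layernorm.weight"] = phi["ln.weight"].clone()
    neox["post_attention_layernorm.bias"] = phi["ln.bias"].clone()
    # Phi fuses q, k, v into a single projection, and that projection has a bias.
    neox["attention.query_key_value.weight"] = phi["mixer.Wqkv.weight"]
    neox["attention.query_key_value.bias"] = phi["mixer.Wqkv.bias"]
    # The attention output projection also carries a bias in Phi.
    neox["attention.dense.weight"] = phi["mixer.out_proj.weight"]
    neox["attention.dense.bias"] = phi["mixer.out_proj.bias"]
    # The final output linear layer (outside this per-block mapping) likewise needs
    # a bias slot so Phi's output bias can be loaded.
    return neox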

@niutech

niutech commented Dec 15, 2023

FWIW, Phi-2 in GGUF format is also supported by Candle.

@ebeyabraham
Contributor

ebeyabraham commented Dec 15, 2023

I wrote an implementation based on the MLX code.
ebeyabraham@12cc80c#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR6239

I had to make some breaking changes to llama.cpp to support the last layer of Phi-2, but it works!

@FiveTechSoft
Author

FiveTechSoft commented Dec 15, 2023 via email

@FiveTechSoft
Author

FiveTechSoft commented Dec 15, 2023 via email

@ebeyabraham
Contributor

Yes, you can download the weights from https://huggingface.co/microsoft/phi-2 and then use convert-hf-to-gguf.py to generate gguf.
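
For reference, one way that conversion step could be invoked from Python (a sketch only; the exact flags are assumptions, so check the script's --help for your llama.cpp checkout):

import subprocess

# Assumes ./phi-2 is a local clone of https://huggingface.co/microsoft/phi-2 and that
# this is run from inside a llama.cpp checkout; flags may differ between versions.
subprocess.run(
    [
        "python", "convert-hf-to-gguf.py", "./phi-2",
        "--outfile", "phi-2-f16.gguf",
        "--outtype", "f16",
    ],
    check=True,
)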

@FiveTechSoft
Author

FiveTechSoft commented Dec 15, 2023 via email

@ebeyabraham
Contributor

It seems like you are using some other checkpoint of Phi-2; can you try the Hugging Face one I shared?

@FiveTechSoft
Author

you can download the weights from https://huggingface.co/microsoft/phi-2

I used that one. Is it the right one? Thanks for your help.

@lostmygithubaccount

I believe I got the same error as @FiveTechSoft when I tried (I will double-check tonight or tomorrow).

@ebeyabraham
Contributor

ebeyabraham commented Dec 15, 2023

That is weird, because for me config.json shows the model architecture as PhiForCausalLM and not MixFormerSequentialForCausalLM.

[screenshot of config.json]

@salykova

@FiveTechSoft, @lostmygithubaccount just update hf + transformers libraries

@niutech

niutech commented Dec 16, 2023

@FiveTechSoft For now, you can download already-quantized Microsoft Phi-2 GGUF files from radames/phi-2-quantized, or even make your own in Candle by running:

git clone https://github.com/huggingface/candle && cd candle && cargo build
cargo run --example tensor-tools -- quantize model-0000*.safetensors --quantization q40 --out-file phi-2-q4k.gguf

But I'm not sure if it is compatible with llama.cpp.

@Pawandeep-prog

Pawandeep-prog commented Dec 16, 2023

Below is a video I created showing how to run phi-v2 on my Mac M1 (8GB). You can still follow it to run on Linux or Windows as well.
Run phi-v2 Mac

Summing up, the video above shows running phi-v2 using the huggingface/candle repo on GitHub, with a quantised GGUF model.
On a Mac M1 (8GB) it generated 7 tokens/sec.

Command 1: curl https://sh.rustup.rs -sSf | sh ## Installing cargo
Command 2: git clone https://github.com/huggingface/candle.git ## cloning candle
Command 3: cargo build --example phi --release --features metal ## compiling
Command 4: ./phi --prompt "hello, how are you? Assistant: " --model 2 --quantized ## running

Instruction Format for phi-v2

Instruct: {prompt}
Output:
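
As a tiny illustration of that format, taking the template above literally (the example question is made up):

# Build a Phi-2 style prompt: "Instruct: <prompt>" followed by "Output:".
question = "Give me three uses for a GGUF file."
prompt = f"Instruct: {question}\nOutput:"
print(prompt)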

@alexcardo

(quoting @Pawandeep-prog's comment above)

Can you please show how to use another model? For example, I downloaded an 8-bit quantized model and I want to use it instead of the default 4-bit one. Also, how can I use the model in chat mode, similar to llama.cpp's -ins command?

Thank you.

@niutech

niutech commented Dec 17, 2023

@alexcardo If you are using the Candle Phi example, just replace model-v2-q4k.gguf with model-v2-q80.gguf in main.rs:271 (both models are hosted on Hugging Face).
But let's get back on track: this issue is about Phi-2 support in llama.cpp, not Candle or MLX.

@wladimiravila

Actually, TheBloke shares a version in GGUF format: https://huggingface.co/TheBloke/phi-2-GGUF

@DivyatejaDadi

https://www.youtube.com/watch?v=ds9bQJFromU
Check whether the video above helps anyone.

@SpaceCowboy850

I can get TheBloke's version of the Phi-2 GGUF running on CPU, but GPU (cuBLAS) doesn't work. Anyone else experiencing this? I pulled this one:

phi-2.Q4_K_M.gguf

It just output GGGGGG repeatedly.

If I run llama.cpp on just the CPU, it works fine.

@slaren
Member

slaren commented Jan 19, 2024

@SpaceCowboy850 there is currently an issue with Phi-2 when offload_kqv is disabled. Enabling it should fix it. It is enabled by default in llama.cpp, but not in some versions of llama-cpp-python.
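
For anyone hitting this through the Python bindings, a hedged example of passing that setting via llama-cpp-python (the offload_kqv parameter exists in recent releases but older versions may not expose it, and the model path here is just a placeholder):

from llama_cpp import Llama

# offload_kqv=True keeps the KV cache operations on the GPU, which avoids the
# repeated-"G" output reported above; newer llama-cpp-python versions default to True.
llm = Llama(
    model_path="phi-2.Q4_K_M.gguf",  # placeholder path to the quantized file
    n_gpu_layers=-1,                 # offload all layers to the GPU
    offload_kqv=True,
)
out = llm("Instruct: Why is the sky blue?\nOutput:", max_tokens=64)
print(out["choices"][0]["text"])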

@SpaceCowboy850

Okay, thanks! I'm getting by with the CPU version right now, but I realized I should probably surface this issue at some level in case it was unknown. Thanks for the tip!
