Replies: 5 comments 2 replies
-
Can confirm this is happening on my RTX 2060 laptop as well. @JohannesGaessler please take a look, it's outputting complete nonsense now. My settings:
./llama-server -m "Qwen 3\Qwen3-30B-A3B-UD-Q4_K_XL.gguf" -c 32768 -ngl 99 -fa --host 127.0.0.1 --port 5001 -t 6 -ctk q8_0 -ctv q8_0 -ub 2048 -ot ".ffn_.*_exps.=CPU" --jinja
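For anyone skimming the thread, here is the same command broken out per flag. The flag values are unchanged from the command above; the annotations are my own reading of the current llama-server options, so treat them as a best-effort summary rather than official documentation.

```sh
# Same invocation as above; the comments are best-effort notes, not official docs.
# -c 32768                  32k context window
# -ngl 99                   offload (up to) 99 layers to the GPU
# -fa                       enable flash attention (removing this flag is what makes the output sane again)
# -t 6                      six CPU threads
# -ctk/-ctv q8_0            quantize the K and V cache to q8_0
# -ub 2048                  physical (micro-)batch size
# -ot ".ffn_.*_exps.=CPU"   keep the MoE expert FFN tensors in CPU memory
# --jinja                   use the model's Jinja chat template
./llama-server -m "Qwen 3\Qwen3-30B-A3B-UD-Q4_K_XL.gguf" -c 32768 -ngl 99 -fa --host 127.0.0.1 --port 5001 -t 6 -ctk q8_0 -ctv q8_0 -ub 2048 -ot ".ffn_.*_exps.=CPU" --jinja
```

One caveat if you test without -fa: if I recall correctly, a quantized V cache requires the flash-attention path in llama.cpp, so you may also need to drop -ctv q8_0 (or set it back to f16) when comparing.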
-
My settings as well if it's useful: title llama-server
-
Johannes is a machine, already fixed it with this PR: #13415
-
Damn what a beast, thank you!
-
@JohannesGaessler the latest build b5335 is still failing on a 4070 (Qwen3 8B).
Flash attention on: (output not preserved)
Flash attention off: (output not preserved)
-
Hey, this change introduced flash attention for Deepseek on Ampere (RTX 3000 and above). It somehow broke flash attention on my RTX 2070 for Qwen3/Gemma GGUFs (I haven't tested other models, but I assume it's across the board).
With flash attention turned on, the models now output gibberish. Everything works fine without it.
Can you guys see what could've caused this break?
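For anyone who wants to double-check their own build, a quick sanity test is to run the same short prompt with and without -fa and compare the output. A minimal sketch, assuming llama-cli is built alongside llama-server; the model path and prompt are placeholders, not taken from this thread:

```sh
# Hypothetical A/B check: same prompt with and without flash attention.
# MODEL and PROMPT are placeholders; substitute your own GGUF and prompt.
MODEL="./models/Qwen3-8B-Q4_K_M.gguf"
PROMPT="Write one sentence about llamas."

# With flash attention (the path producing the gibberish reported above):
./llama-cli -m "$MODEL" -ngl 99 -fa -p "$PROMPT" -n 64

# Without flash attention (reported to work fine):
./llama-cli -m "$MODEL" -ngl 99 -p "$PROMPT" -n 64
```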