## Summary
Request to expose llama.cpp's `--reasoning-budget` flag in `ramalama serve` to properly control reasoning/thinking behavior in models like DeepSeek-R1.
## Background
- llama.cpp added the `--reasoning-budget` flag (PR #13771) to address issues where reasoning models continue generating thinking tokens even when disabled
- The flag supports `-1` (unrestricted, default) and `0` (disable thinking completely); a usage sketch follows this list
- This flag is more effective than the older `--thinking` flag or the `enable_thinking: false` API parameter
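For reference, a minimal sketch of how the upstream flag is passed when invoking llama-server directly; the model path here is illustrative:

```shell
# Unrestricted thinking (the upstream default)
llama-server -m /models/deepseek-r1.gguf --port 8080 --reasoning-budget -1

# Disable thinking completely
llama-server -m /models/deepseek-r1.gguf --port 8080 --reasoning-budget 0
```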
## Current Situation
- Ramalama 0.13.0 currently exposes a `--thinking THINKING` flag
- The underlying llama-server in the container does support `--reasoning-budget` (verified with `llama-server --help`; see the check after this list)
- However, `--thinking 0` does not effectively prevent DeepSeek-R1 from generating reasoning tokens
- Result: users cannot disable thinking even when explicitly requested, wasting inference time
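The verification can be reproduced against the serving container with something like the following; the container name is a placeholder, use whatever `podman ps` reports:

```shell
# Confirm the containerized llama-server advertises the flag
podman exec <ramalama-container> llama-server --help | grep -- --reasoning-budget
```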
## Test Case
```shell
# Current behavior with --thinking 0
$ ramalama serve --port 8080 --thinking 0 ollama://library/deepseek-r1:latest
# Query: "What is 2+2?"
# Result: still generates 200+ reasoning_content chunks before answering
```

The logs show hundreds of `reasoning_content` chunks being emitted despite `--thinking 0`.
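The query can also be reproduced directly against llama-server's OpenAI-compatible endpoint; the `model` value in the payload is illustrative, since the server answers with whatever model it has loaded:

```shell
# Count streamed reasoning_content chunks for a trivial prompt
curl -sN http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-r1","messages":[{"role":"user","content":"What is 2+2?"}],"stream":true}' \
  | grep -c reasoning_content
```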
## Proposed Solution
Add a `--reasoning-budget` flag to `ramalama serve` that passes through to llama-server:

```shell
ramalama serve --port 8080 --reasoning-budget 0 ollama://library/deepseek-r1:latest
```

Alternative: update the existing `--thinking` flag to internally use `--reasoning-budget` instead of the legacy parameter.
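Under either option, the effective command inside the container would look roughly like this; the host, port, and model path are illustrative, not the exact arguments ramalama generates:

```shell
# Sketch of the pass-through: what ramalama would hand to llama-server
llama-server --host 0.0.0.0 --port 8080 -m /path/to/deepseek-r1.gguf --reasoning-budget 0
```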
## Benefits
- Users can properly control reasoning model behavior
- Aligns with upstream llama.cpp best practices
- Fixes known limitation with DeepSeek-R1 and similar reasoning models
- Improves inference efficiency when thinking is not desired
## References
## Environment
- Ramalama: 0.13.0-1.fc42
- Fedora: 42
- llama-server version in container: b52edd2