RFE: Add --reasoning-budget flag to control thinking in reasoning models #2137

@csoriano2718

Description

Summary

Request to expose llama.cpp's --reasoning-budget flag in ramalama serve to properly control reasoning/thinking behavior in models like DeepSeek-R1.

Background

  • llama.cpp added the --reasoning-budget flag (PR #13771) to address issues where reasoning models continue generating thinking tokens even when disabled
  • The flag accepts two values: -1 (unrestricted thinking, the default) and 0 (disable thinking completely)
  • This flag is more effective than the older --thinking flag or enable_thinking: false API parameter
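The semantics of the two supported values can be sketched as follows. This is a simplified illustration of the flag's documented behavior, not llama.cpp's actual implementation; the function name is hypothetical:

```python
def thinking_allowed(reasoning_budget: int) -> bool:
    """Illustrative semantics of llama.cpp's --reasoning-budget flag.

    -1: unrestricted -- the model may emit thinking tokens freely (default).
     0: disabled -- no thinking tokens should be emitted at all.
    """
    if reasoning_budget == -1:
        return True   # unrestricted thinking
    if reasoning_budget == 0:
        return False  # thinking disabled completely
    # Only -1 and 0 are documented for this flag.
    raise ValueError(f"unsupported reasoning budget: {reasoning_budget}")

print(thinking_allowed(-1))  # True
print(thinking_allowed(0))   # False
```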

Current Situation

  • Ramalama 0.13.0 currently exposes a --thinking THINKING flag
  • The underlying llama-server in the container does support --reasoning-budget (verified with llama-server --help)
  • However, --thinking 0 does not effectively prevent DeepSeek-R1 from generating reasoning tokens
  • Result: Users cannot disable thinking even when explicitly requested, wasting inference time

Test Case

# Current behavior with --thinking 0
$ ramalama serve --port 8080 --thinking 0 ollama://library/deepseek-r1:latest
# Query: "What is 2+2?"
# Result: Still generates 200+ reasoning_content chunks before answering

Server logs show hundreds of reasoning_content chunks being emitted despite --thinking 0.
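The reported behavior can be quantified by counting reasoning_content deltas in the streamed response. The helper below parses OpenAI-style SSE chat-completion chunks; it is a sketch for reproducing the measurement, with field names following llama-server's streaming output as described in the linked issues:

```python
import json

def count_reasoning_chunks(sse_lines):
    """Count streamed chunks whose delta carries reasoning_content.

    Expects lines of an OpenAI-compatible SSE stream, e.g.
    'data: {"choices":[{"delta":{"reasoning_content":"..."}}]}'.
    """
    count = 0
    for line in sse_lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        for choice in chunk.get("choices", []):
            if choice.get("delta", {}).get("reasoning_content"):
                count += 1
    return count

# Synthetic excerpt of a stream like the one observed for "What is 2+2?"
stream = [
    'data: {"choices":[{"delta":{"reasoning_content":"Let me think"}}]}',
    'data: {"choices":[{"delta":{"reasoning_content":" about 2+2."}}]}',
    'data: {"choices":[{"delta":{"content":"4"}}]}',
    "data: [DONE]",
]
print(count_reasoning_chunks(stream))  # 2
```

With --thinking 0 actually honored, this count should be zero for the whole stream.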

Proposed Solution

Add a --reasoning-budget flag to ramalama serve that passes through to llama-server:

ramalama serve --port 8080 --reasoning-budget 0 ollama://library/deepseek-r1:latest

Alternative: Update the existing --thinking flag to internally use --reasoning-budget instead of the legacy parameter.
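For the alternative, the mapping from the boolean-style --thinking flag onto --reasoning-budget could be as simple as the sketch below. This is hypothetical illustration code, not ramalama's actual implementation; the function name and precedence rule are assumptions:

```python
def llama_server_args(thinking: bool = True, reasoning_budget=None):
    """Build the llama-server arguments for reasoning control.

    An explicit --reasoning-budget value, if given, takes precedence;
    otherwise the --thinking boolean is mapped onto the equivalent
    budget (-1 = unrestricted, 0 = disabled).
    """
    if reasoning_budget is None:
        reasoning_budget = -1 if thinking else 0
    return ["--reasoning-budget", str(reasoning_budget)]

print(llama_server_args(thinking=False))       # ['--reasoning-budget', '0']
print(llama_server_args(reasoning_budget=-1))  # ['--reasoning-budget', '-1']
```

Either way, the container's llama-server receives --reasoning-budget directly, so --thinking 0 would actually suppress reasoning tokens.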

Benefits

  • Users can properly control reasoning model behavior
  • Aligns with upstream llama.cpp best practices
  • Fixes known limitation with DeepSeek-R1 and similar reasoning models
  • Improves inference efficiency when thinking is not desired

References

  • llama.cpp issues: #13160, #13189, #15401
  • llama.cpp PR: #13771
  • llama.cpp commit: e121edc

Environment

  • Ramalama: 0.13.0-1.fc42
  • Fedora: 42
  • llama-server version in container: b52edd2
