## Summary
Request to expose llama.cpp's `--reasoning-budget` flag in `ramalama serve` to properly control reasoning/thinking behavior in models like DeepSeek-R1.
## Background
- llama.cpp added the `--reasoning-budget` flag (PR #13771) to address issues where reasoning models continue generating thinking tokens even when disabled
- The flag supports `-1` (unrestricted, default) and `0` (disable thinking completely); a usage sketch follows this list
- This flag is more effective than the older `--thinking` flag or the `enable_thinking: false` API parameter
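For reference, a minimal sketch of how the upstream flag is passed when invoking llama-server directly; the model path here is illustrative:

```shell
# Unrestricted thinking (the upstream default)
llama-server -m /models/deepseek-r1.gguf --port 8080 --reasoning-budget -1

# Disable thinking completely
llama-server -m /models/deepseek-r1.gguf --port 8080 --reasoning-budget 0
```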
## Current Situation
- Ramalama 0.13.0 currently exposes a `--thinking THINKING` flag
- The underlying llama-server in the container does support `--reasoning-budget` (verified with `llama-server --help`; see the check after this list)
- However, `--thinking 0` does not effectively prevent DeepSeek-R1 from generating reasoning tokens
- Result: users cannot disable thinking even when explicitly requested, wasting inference time
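The verification can be reproduced against the serving container with something like the following; the container name is a placeholder, use whatever `podman ps` reports:

```shell
# Confirm the containerized llama-server advertises the flag
podman exec <ramalama-container> llama-server --help | grep -- --reasoning-budget
```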
## Test Case
```shell
# Current behavior with --thinking 0
$ ramalama serve --port 8080 --thinking 0 ollama://library/deepseek-r1:latest
# Query: "What is 2+2?"
# Result: still generates 200+ reasoning_content chunks before answering
```

The logs show hundreds of `reasoning_content` chunks being emitted despite `--thinking 0`.
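The query can also be reproduced directly against llama-server's OpenAI-compatible endpoint; the `model` value in the payload is illustrative, since the server answers with whatever model it has loaded:

```shell
# Count streamed reasoning_content chunks for a trivial prompt
curl -sN http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-r1","messages":[{"role":"user","content":"What is 2+2?"}],"stream":true}' \
  | grep -c reasoning_content
```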
## Proposed Solution
Add a `--reasoning-budget` flag to `ramalama serve` that passes through to llama-server:

```shell
ramalama serve --port 8080 --reasoning-budget 0 ollama://library/deepseek-r1:latest
```

Alternative: update the existing `--thinking` flag to internally use `--reasoning-budget` instead of the legacy parameter.
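Under either option, the effective command inside the container would look roughly like this; the host, port, and model path are illustrative, not the exact arguments ramalama generates:

```shell
# Sketch of the pass-through: what ramalama would hand to llama-server
llama-server --host 0.0.0.0 --port 8080 -m /path/to/deepseek-r1.gguf --reasoning-budget 0
```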
## Benefits
- Users can properly control reasoning model behavior
- Aligns with upstream llama.cpp best practices
- Fixes known limitation with DeepSeek-R1 and similar reasoning models
- Improves inference efficiency when thinking is not desired
## References
## Environment
- Ramalama: 0.13.0-1.fc42
- Fedora: 42
- llama-server version in container: b52edd2