Qwen/QwQ-32B is a great model, but it tends to think for a long time. OpenAI models have a `reasoning_effort` parameter and Anthropic models have a `thinking.budget_tokens` option, but open-source LLMs have no built-in way to control how long, or for how many tokens, the model thinks. Until now! This proxy adds a `max_thinking_chars` parameter that lets you configure the maximum number of characters the LLM is allowed to spend thinking.
Clone the repo, run `yarn install`, and then `yarn start`.
Open LM Studio, configure Qwen/QwQ-32B to use the modified chat template (linked below), and start the Dev Server.
```bash
curl http://localhost:3000/proxy/localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwq-32b@4bit",
    "messages": [
      { "role": "user", "content": "What day is it today?" }
    ],
    "temperature": 0.5,
    "stream": true,
    "max_thinking_chars": 200
  }'
```
The proxy only cares about the `max_thinking_chars` parameter. If it is defined, the proxy starts counting characters after `<think>` until it sees a `</think>`. If the model keeps thinking past the maximum number of characters, the proxy aborts the request and creates a new one, using the existing `<think>...</think>` output as a prefilled assistant message.
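As a rough sketch of that loop in TypeScript (not the proxy's actual source; the upstream URL, the model name, and the closing `\n</think>\n` suffix are illustrative assumptions), the core logic looks something like this:

```typescript
// Minimal sketch: stream from an OpenAI-compatible server, count the
// characters inside <think>...</think>, and abort + build a prefilled
// retry once the budget is exceeded. Requires Node 18+ (global fetch).
type Msg = { role: "system" | "user" | "assistant"; content: string };

const UPSTREAM = "http://localhost:1234/v1/chat/completions"; // assumed

async function capThinking(
  messages: Msg[],
  maxThinkingChars: number
): Promise<Msg[] | null> {
  const controller = new AbortController();
  const res = await fetch(UPSTREAM, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "qwq-32b@4bit", messages, stream: true }),
    signal: controller.signal,
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let thinking = "";
  let inThink = false;

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // A real implementation parses the SSE "data:" lines and reads
    // choices[0].delta.content, and handles tags split across chunk
    // boundaries; the raw text stands in for that here.
    const chunk = decoder.decode(value, { stream: true });
    if (chunk.includes("<think>")) inThink = true;
    if (chunk.includes("</think>")) return null; // finished within budget
    if (inThink) thinking += chunk;

    if (thinking.length > maxThinkingChars) {
      controller.abort(); // halt the runaway generation mid-stream
      // Close the thought and hand it back as a prefilled assistant
      // message; the follow-up request continues from here.
      return [
        ...messages,
        { role: "assistant", content: `<think>\n${thinking}\n</think>\n` },
      ];
    }
  }
  return null;
}
```

When `capThinking` returns a message list, the proxy would issue the same request again with those messages; thanks to the modified chat template described below, the model treats the closed-off `<think>` block as its own and moves on to the final answer.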
In order for this to work, you need to update the LLM chat template. The default template does not allow prefilled messages because it wraps every message in `'<|im_start|>' + message.role + '\n'` and `'<|im_end|>' + '\n'`, and appends `'<|im_start|>assistant\n<think>\n'` after all the messages.
This template (https://gist.github.com/tomasmcm/6fd3397eb44e3fbef4cf876451098a92) checks whether the last message is an assistant message and lets the LLM continue the generation as if it were the same message.
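To illustrate the difference, here is a TypeScript mock-up of the template logic (the real template in the gist is Jinja; only the branching idea is shown):

```typescript
// Illustrative mock-up, not the actual gist: serialize messages in the
// ChatML style, but leave a trailing prefilled assistant message open
// so generation continues inside it.
type Msg = { role: string; content: string };

function buildPrompt(messages: Msg[]): string {
  let prompt = "";
  messages.forEach((m, i) => {
    const isLast = i === messages.length - 1;
    if (isLast && m.role === "assistant") {
      // Prefilled assistant message: open the turn but do not
      // terminate it with <|im_end|>.
      prompt += `<|im_start|>assistant\n${m.content}`;
    } else {
      prompt += `<|im_start|>${m.role}\n${m.content}<|im_end|>\n`;
    }
  });
  // Only start a fresh assistant turn when one wasn't prefilled.
  const last = messages[messages.length - 1];
  if (!last || last.role !== "assistant") {
    prompt += "<|im_start|>assistant\n<think>\n";
  }
  return prompt;
}
```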
This method only works with streaming, as it allows halting the generation in the middle. It should also work with DeepSeek R1 and its distilled versions (it might just need some adjustments to the `<think>` tags and newlines).
Currently this only works with LLM servers that allow prefilling assistant messages. Some providers like Groq support this with some models (Qwen/QwQ-32B does not seem to be supported yet). When using local servers, you can configure the chat template to allow prefilling.
Logo was generated with prithivMLmods/Ton618-Only-Stickers-Flux-LoRA using the prompt "Only Sticker, An robot queen with questions logo with the words 'Don't Think Too Much' written on it".