Customize LLM Parameters at Runtime

The RAG server exposes an OpenAI-compatible API that developers can use to customize LLM parameters at runtime. For full details, see APIs for RAG Server.

Use the /generate endpoint of the RAG server to generate responses to prompts.

To configure the behavior of the LLM dynamically at runtime, include or change the following parameters in the request body when you try out the generate API from the notebook.

| Parameter | Description | Type | Valid Values | Default | Optional? |
| --- | --- | --- | --- | --- | --- |
| max_tokens | The maximum number of tokens to generate during inference. This limits the length of the generated text. | Integer | | 1024 | Yes |
| stop | A list of strings to use as stop tokens in the text generation. The returned text does not include the stop tokens. | Array | | [] | Yes |
| temperature | Adjusts the randomness of token selection. Higher values increase randomness and creativity; lower values promote deterministic and conservative output. | Number | 0.1 - 1.0 | 0.2 | Yes |
| top_p | A threshold that selects from the most probable tokens until the cumulative probability exceeds p. | Number | 0.1 - 1.0 | 0.7 | Yes |
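The `temperature` and `top_p` parameters both shape token selection. As a rough illustration of how `top_p` narrows the candidate set, the sketch below uses a toy Python function with made-up probabilities; it is not code from the RAG server, only a demonstration of the cumulative-probability rule described in the table.

```python
# Toy illustration of top_p (nucleus) selection. The probabilities are made up;
# the RAG server's LLM backend performs this selection internally.

def nucleus_candidates(token_probs, top_p):
    """Return the most probable tokens whose cumulative probability reaches top_p."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    candidates, cumulative = [], 0.0
    for token, prob in ranked:
        candidates.append(token)
        cumulative += prob
        if cumulative >= top_p:
            break
    return candidates

toy_distribution = {"fast": 0.45, "modern": 0.25, "typed": 0.15, "async": 0.10, "small": 0.05}
print(nucleus_candidates(toy_distribution, top_p=0.8))  # ['fast', 'modern', 'typed']
```

With `top_p=0.8`, only the most probable tokens that together account for 80% of the probability mass remain candidates; lowering the threshold makes the output more conservative.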

Example payload for customization

The example request body below sets the following values:

- `max_tokens=150`: limits the response length to 150 tokens
- `stop=["\n"]`: generation stops at the newline character
- `temperature=0.3`: moderate randomness
- `top_p=0.8`: considers tokens with cumulative probability up to 0.8
```json
{
   "messages": [
      {
         "role": "user",
         "content": "Explain the key features of FastAPI."
      }
   ],
   "max_tokens": 150,
   "stop": ["\n"],
   "temperature": 0.3,
   "top_p": 0.8,
   "use_knowledge_base": true
}
```
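As a rough sketch, you could send this payload to the RAG server with Python's `requests` library. The base URL and route below are placeholders, not values from this document; confirm the address and exact path for your deployment in APIs for RAG Server.

```python
# Hypothetical example: post the payload above to the RAG server's generate
# endpoint. The base URL is a placeholder; substitute your deployment's address.
import requests

payload = {
    "messages": [
        {"role": "user", "content": "Explain the key features of FastAPI."}
    ],
    "max_tokens": 150,           # limit the response to 150 tokens
    "stop": ["\n"],              # stop generation at the first newline
    "temperature": 0.3,          # moderate randomness
    "top_p": 0.8,                # nucleus sampling threshold
    "use_knowledge_base": True,  # retrieve context from the knowledge base
}

response = requests.post(
    "http://localhost:8081/generate",  # placeholder host, port, and path
    json=payload,
    timeout=120,
)
response.raise_for_status()
print(response.text)
```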