docs: Output streaming (NVIDIA#976)
Documents NVIDIA#966

Signed-off-by: Mike McKiernan <[email protected]>
mikemckiernan committed Feb 7, 2025
1 parent a8f7114 commit 81da538
Showing 5 changed files with 150 additions and 117 deletions.
76 changes: 65 additions & 11 deletions docs/getting-started/5-output-rails/README.md

Activating the `self check output` rail is similar to the `self check input` rail:

1. Activate the `self check output` rail in `config.yml`.
2. Add a `self_check_output` prompt in `prompts.yml`.

### Activate the Rail

To activate the rail, include the `self check output` flow name in the output rails section of the `config.yml` file:

```yaml
output:
  flows:
    - self check output
```
For reference, update the full `rails` section in `config.yml` to look like the following:

```yaml
rails:
  input:
    flows:
      - self check input
  output:
    flows:
      - self check output
```

### Add a Prompt

The self-check output rail needs a prompt to perform the check.

Summary: 3 LLM call(s) took 1.89 seconds and used 504 tokens.

```python
print(info.llm_calls[2].prompt)
```

```text
Your task is to check if the bot message below complies with the company policy.
Company policy for the bot:
...
```

As we can see, the LLM generated a message containing the word "idiot"; however, the output rail blocked it before it reached the user.

The following figure depicts the process:

```{image} ../../_static/puml/output_rails_fig_1.png
```

## Streaming Output

By default, the output from the rail is synchronous.
You can enable streaming to provide asynchronous responses and reduce the time to the first response.

1. Modify the `rails` field in the `config.yml` file and add the `streaming` field to enable streaming:

```{code-block} yaml
:emphasize-lines: 9-11,13

rails:
  input:
    flows:
      - self check input
  output:
    flows:
      - self check output
    streaming:
      chunk_size: 200
      context_size: 50

streaming: True
```

1. Call the `stream_async` method and handle the chunked response:

```python
from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("./config")

rails = LLMRails(config)

messages = [{"role": "user", "content": "How many days of vacation does a 10-year employee receive?"}]

async for chunk in rails.stream_async(messages=messages):
print(f"CHUNK: {chunk}")
```

*Partial Output*

```output
CHUNK: According
CHUNK: to
CHUNK: the
CHUNK: employee
CHUNK: handbook,
...
```

For reference information about the related `config.yml` file fields,
refer to [](../../user-guides/configuration-guide.md#output-rails).
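The chunks arrive as plain text unless a rail blocks the stream, in which case the toolkit emits an abort string (described in the configuration guide). A minimal consumer might accumulate the chunks as follows; `fake_stream` and `collect_response` are illustrative helpers, not part of the toolkit API, and the assumption that each chunk carries its own leading whitespace is a simplification:

```python
import asyncio
import json

async def fake_stream(chunks):
    # Stand-in for rails.stream_async(messages=...); yields text chunks.
    for chunk in chunks:
        yield chunk

async def collect_response(stream):
    """Accumulate streamed chunks; stop early if an output rail aborts the stream."""
    parts = []
    async for chunk in stream:
        if chunk.startswith('{"event": "ABORT"'):
            reason = json.loads(chunk)["data"]["reason"]
            return f"[blocked: {reason}]"
        parts.append(chunk)
    return "".join(parts)

text = asyncio.run(collect_response(fake_stream(
    ["According", " to", " the", " employee", " handbook,"])))
print(text)  # According to the employee handbook,
```

In a real application, the same loop would consume `rails.stream_async(messages=messages)` directly.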

## Custom Output Rail

Build a custom output rail that checks the bot response against a list of proprietary words that must not appear in the output.

1. Create a `config/actions.py` file with the following content, which defines an action:

```python
from typing import Optional

from nemoguardrails.actions import action


@action(is_system_action=True)
async def check_blocked_terms(context: Optional[dict] = None):
    bot_response = context.get("bot_message")

    # A quick hard-coded list of proprietary terms; these could also be loaded from a file.
    proprietary_terms = ["proprietary", "proprietary1", "proprietary2"]

    for term in proprietary_terms:
        if term in bot_response.lower():
            return True

    return False
```
102 changes: 0 additions & 102 deletions docs/index.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/project.json
@@ -1 +1 @@
{ "name": "nemo-guardrails-toolkit", "version": "0.12.0" }
83 changes: 82 additions & 1 deletion docs/user-guides/configuration-guide.md

You can use any LLM provider that is supported by LangChain, e.g., `ai21`, `aleph_alpha`, `anthropic`, `anyscale`, `azure`, `cohere`, `huggingface_endpoint`, `huggingface_hub`, `openai`, `self_hosted`, `self_hosted_hugging_face`. Check out the LangChain official documentation for the full list.

In addition to the above LangChain providers, connecting to [Nvidia NIMs](https://docs.nvidia.com/nim/index.html) is supported using the engine `nvidia_ai_endpoints` or, synonymously, `nim`, for both Nvidia-hosted NIMs (accessible through an Nvidia AI Enterprise license) and for locally downloaded, self-hosted NIM containers.

```{note}
To use any of the providers, you must install additional packages; when you first try to use a configuration with a new provider, you typically receive an error from LangChain that instructs which packages you should install.
NIMs can be self-hosted, using downloadable containers, or Nvidia-hosted and accessed through an Nvidia AI Enterprise license.
NeMo Guardrails supports connecting to NIMs as follows:

##### Self-hosted NIMs

To connect to self-hosted NIMs, set the engine to `nim`. Also make sure the model name matches one of the model names the hosted NIM supports (you can get a list of supported models using a GET request to the `v1/models` endpoint).

```yaml
models:
  - type: main
    engine: nim
    model: <MODEL_NAME>
    parameters:
      base_url: <NIM_ENDPOINT_URL>
```

Output rails process a bot message. The message to be processed is available in the `$bot_message` context variable.

You can deactivate output rails temporarily for the next bot message, by setting the `$skip_output_rails` context variable to `True`.

#### Streaming Output Configuration

By default, the response from an output rail is synchronous.
You can enable streaming to begin receiving responses from the output rail sooner.

You must set the top-level `streaming: True` field in your `config.yml` file.

For each output rail, add the `streaming` field and configuration parameters.

```yaml
rails:
  output:
    flows:
      - rail name
    streaming:
      chunk_size: 200
      context_size: 50
      stream_first: True

streaming: True
```
When streaming is enabled, the toolkit applies output rails to chunks of tokens.
If a rail blocks a chunk of tokens, the toolkit returns a string in the following format:
```output
{"event": "ABORT", "data": {"reason": "Blocked by <rail-name> rails."}}
```
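A client consuming the stream can check each chunk for this sentinel before rendering it. The helper below is an illustrative sketch, not part of the toolkit:

```python
import json

def parse_abort(chunk: str):
    """Return the abort reason if the chunk is the ABORT sentinel, else None."""
    if not chunk.lstrip().startswith("{"):
        return None  # ordinary text chunk
    try:
        payload = json.loads(chunk)
    except json.JSONDecodeError:
        return None  # text that merely looks JSON-like
    if payload.get("event") == "ABORT":
        return payload.get("data", {}).get("reason")
    return None

print(parse_abort('{"event": "ABORT", "data": {"reason": "Blocked by self check output rails."}}'))
# Blocked by self check output rails.
print(parse_abort("According to the employee handbook,"))  # None
```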

The following table describes the subfields for the `streaming` field:

```{list-table}
:header-rows: 1
* - Field
- Description
- Default Value
* - streaming.chunk_size
- Specifies the number of tokens for each chunk.
The toolkit applies output guardrails on each chunk of tokens.
Larger values provide more meaningful information for the rail to assess,
but can add latency while the toolkit accumulates tokens for a full chunk.
The added latency is especially noticeable if you specify `stream_first: False`.
- `200`
* - streaming.context_size
- Specifies the number of tokens to keep from the previous chunk to provide context and continuity in processing.
Larger values provide continuity across chunks with minimal impact on latency.
Small values might fail to detect cross-chunk violations.
Specifying approximately 25% of `chunk_size` provides a good compromise.
- `50`
* - streaming.stream_first
- When set to `False`, the toolkit applies the output rails to the chunks before streaming them to the client.
If you set this field to `False`, you can avoid streaming chunks of blocked content.
By default, the toolkit streams the chunks as soon as possible and before applying output rails to them.
- `True`
```

The following table shows how the input length, chunk size, and context size determine the number of rails invocations.

```{csv-table}
:header: Input Length, Chunk Size, Context Size, Rails Invocations
512,256,64,3
600,256,64,3
256,256,64,1
1024,256,64,5
1024,256,32,5
1024,128,32,11
512,128,32,5
```
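The counts above are consistent with a simple pattern: the first invocation covers one full chunk, and each subsequent invocation advances by `chunk_size - context_size` new tokens. The helper below is inferred from the table, not taken from the toolkit source:

```python
import math

def rails_invocations(input_length: int, chunk_size: int, context_size: int) -> int:
    """Estimate how many times output rails run for a streamed response."""
    if input_length <= chunk_size:
        return 1  # everything fits in the first chunk
    stride = chunk_size - context_size  # new tokens consumed per subsequent chunk
    return 1 + math.ceil((input_length - chunk_size) / stride)

print(rails_invocations(512, 256, 64))   # 3
print(rails_invocations(1024, 128, 32))  # 11
```

Reducing `chunk_size` or increasing `context_size` shrinks the stride, so the rails run more often for the same input length.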

Refer to [](../getting-started/5-output-rails/README.md#streaming-output) for a code sample.

### Retrieval Rails

Retrieval rails process the retrieved chunks, i.e., the `$relevant_chunks` variable.
4 changes: 2 additions & 2 deletions docs/versions1.json
@@ -1,7 +1,7 @@
[
{
"preferred": true,
"version": "0.12.0",
"url": "../0.12.0"
}
]
