From 81da53814729a00a2fbe2a441e781d8267ba5626 Mon Sep 17 00:00:00 2001
From: Mike McKiernan
Date: Fri, 7 Feb 2025 08:44:31 -0500
Subject: [PATCH] docs: Output streaming (#976)

Documents https://github.com/NVIDIA/NeMo-Guardrails/pull/966

Signed-off-by: Mike McKiernan
---
 docs/getting-started/5-output-rails/README.md |  76 +++++++++++--
 docs/index.rst                                | 102 ------------------
 docs/project.json                             |   2 +-
 docs/user-guides/configuration-guide.md       |  83 +++++++++++++-
 docs/versions1.json                           |   4 +-
 5 files changed, 150 insertions(+), 117 deletions(-)
 delete mode 100644 docs/index.rst

diff --git a/docs/getting-started/5-output-rails/README.md b/docs/getting-started/5-output-rails/README.md
index 36a2025a3..c8f0be042 100644
--- a/docs/getting-started/5-output-rails/README.md
+++ b/docs/getting-started/5-output-rails/README.md
@@ -30,12 +30,12 @@ NeMo Guardrails comes with a built-in [output self-checking rail](../../user-gui
 
 Activating the `self check output` rail is similar to the `self check input` rail:
 
-1. Activate the `self check output` rail in *config.yml*.
-2. Add a `self_check_output` prompt in *prompts.yml*.
+1. Activate the `self check output` rail in `config.yml`.
+2. Add a `self_check_output` prompt in `prompts.yml`.
 
-### Activate the rail
+### Activate the Rail
 
-To activate the rail, include the `self check output` flow name in the output rails section of the *config.yml* file:
+To activate the rail, include the `self check output` flow name in the output rails section of the `config.yml` file:
 
 ```yaml
@@ -43,9 +43,10 @@ output:
   flows:
     - self check output
 ```
 
-For reference, the full `rails` section in `config.yml` should look like the following:
+For reference, update the full `rails` section in `config.yml` to look like the following:
 
 ```yaml
+rails:
   input:
     flows:
       - self check input
@@ -66,7 +67,7 @@ define subflow self check output
     stop
 ```
 
-### Add a prompt
+### Add a Prompt
 
 The self-check output rail needs a prompt to perform the check.
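+
+The prompt lives in `prompts.yml`. The following entry is a minimal sketch: the opening
+instruction matches the prompt text shown later in this guide, and the policy bullets are
+illustrative placeholders that you should adapt to your own policy:
+
+```yaml
+prompts:
+  - task: self_check_output
+    content: |-
+      Your task is to check if the bot message below complies with the company policy.
+
+      Company policy for the bot:
+      - messages should not contain any explicit content
+      - messages should not contain abusive language or offensive content
+      - messages should not contain any harmful content
+
+      Bot message: "{{ bot_response }}"
+
+      Question: Should the message be blocked (Yes or No)?
+      Answer:
+```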
@@ -130,7 +131,7 @@ Summary: 3 LLM call(s) took 1.89 seconds and used 504 tokens.
 print(info.llm_calls[2].prompt)
 ```
 
-```
+```text
 Your task is to check if the bot message below complies with the company policy.
 
 Company policy for the bot:
@@ -160,15 +161,68 @@ As we can see, the LLM did generate the message containing the word "idiot", how
 
 The following figure depicts the process:
 
-<div align="center">
-<img src="../../_static/puml/output_rails_fig_1.png"/>
-</div>
+```{image} ../../_static/puml/output_rails_fig_1.png
+```
+
+## Streaming Output
+
+By default, the output from the rail is synchronous.
+You can enable streaming to provide asynchronous responses and reduce the time to the first response.
+
+1. Modify the `rails` field in the `config.yml` file and add the `streaming` field to enable streaming:
+
+   ```{code-block} yaml
+   :emphasize-lines: 9-11,13
+
+   rails:
+     input:
+       flows:
+         - self check input
+
+     output:
+       flows:
+         - self check output
+       streaming:
+         chunk_size: 200
+         context_size: 50
+
+   streaming: True
+   ```
+
+1. Call the `stream_async` method and handle the chunked response (a sketch for detecting blocked chunks follows these steps):
+
+   ```python
+   from nemoguardrails import RailsConfig, LLMRails
+
+   config = RailsConfig.from_path("./config")
+
+   rails = LLMRails(config)
+
+   messages = [{"role": "user", "content": "How many days of vacation does a 10-year employee receive?"}]
+
+   async for chunk in rails.stream_async(messages=messages):
+       print(f"CHUNK: {chunk}")
+   ```
+
+   *Partial Output*
+
+   ```output
+   CHUNK: According
+   CHUNK: to
+   CHUNK: the
+   CHUNK: employee
+   CHUNK: handbook,
+   ...
+   ```
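+
+When an output rail blocks a chunk, `stream_async` yields the JSON-formatted `ABORT`
+string that is described in the configuration guide. The following minimal sketch shows
+one possible way to detect it; the prefix check and the error handling are illustrative
+assumptions, not a toolkit API:
+
+```python
+import json
+
+async for chunk in rails.stream_async(messages=messages):
+    # A blocked chunk arrives as a single JSON string, for example:
+    # {"event": "ABORT", "data": {"reason": "Blocked by rails."}}
+    if chunk.startswith('{"event": "ABORT"'):
+        details = json.loads(chunk)
+        print(f"Blocked: {details['data']['reason']}")
+        break
+    print(f"CHUNK: {chunk}")
+```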
+
+For reference information about the related `config.yml` file fields,
+refer to [](../../user-guides/configuration-guide.md#output-rails).
 
 ## Custom Output Rail
 
 Build a custom output rail with a list of proprietary words that we want to make sure do not appear in the output.
 
-1. Create a *config/actions.py* file with the following content, which defines an action:
+1. Create a `config/actions.py` file with the following content, which defines an action:
 
    ```python
    from typing import Optional
diff --git a/docs/index.rst b/docs/index.rst
deleted file mode 100644
index 0cdb28b6c..000000000
--- a/docs/index.rst
+++ /dev/null
@@ -1,102 +0,0 @@
-NVIDIA NeMo Guardrails
-====================================================
-
-.. toctree::
-   :caption: NVIDIA NeMo Guardrails
-   :name: NVIDIA NeMo Guardrails
-   :maxdepth: 1
-
-   introduction.md
-   documentation.md
-   getting-started/installation-guide
-
-.. toctree::
-   :caption: Getting Started
-   :name: Getting Started
-   :maxdepth: 2
-
-   getting-started/1-hello-world/README
-   getting-started/2-core-colang-concepts/README
-   getting-started/3-demo-use-case/README
-   getting-started/4-input-rails/README
-   getting-started/5-output-rails/README
-   getting-started/6-topical-rails/README
-   getting-started/7-rag/README
-
-.. toctree::
-   :caption: Colang 2.0
-   :name: Colang 2.0
-   :maxdepth: 2
-
-   colang-2/overview
-   colang-2/whats-changed
-   colang-2/getting-started/index
-   colang-2/language-reference/index
-
-.. toctree::
-   :caption: User Guides
-   :name: User Guides
-   :maxdepth: 2
-
-   user-guides/configuration-guide
-   user-guides/guardrails-library
-   user-guides/guardrails-process
-   user-guides/colang-language-syntax-guide
-   user-guides/llm-support
-   user-guides/python-api
-   user-guides/cli
-   user-guides/server-guide
-   user-guides/langchain/index
-   user-guides/detailed-logging/index
-   user-guides/jailbreak-detection-heuristics/index
-   user-guides/llm/index
-   user-guides/multi-config-api/index
-   user-guides/migration-guide
-
-.. toctree::
-   :caption: Security
-   :name: Security
-   :maxdepth: 2
-
-   security/guidelines
-   security/red-teaming
-
-.. toctree::
-   :caption: Evaluation
-   :name: Evaluation
-   :maxdepth: 2
-
-   evaluation/README
-   evaluation/llm-vulnerability-scanning
-
-.. toctree::
-   :caption: Advanced User Guides
-   :name: Advanced User Guides
-   :maxdepth: 2
-
-   user-guides/advanced/generation-options
-   user-guides/advanced/prompt-customization
-   user-guides/advanced/embedding-search-providers
-   user-guides/advanced/using-docker
-   user-guides/advanced/streaming
-   user-guides/advanced/align-score-deployment
-   user-guides/advanced/extract-user-provided-values
-   user-guides/advanced/bot-message-instructions
-   user-guides/advanced/event-based-api
-   user-guides/advanced/llama-guard-deployment
-   user-guides/advanced/nested-async-loop
-   user-guides/advanced/vertexai-setup
-   user-guides/advanced/nemoguard-contentsafety-deployment
-   user-guides/advanced/nemoguard-topiccontrol-deployment
-   user-guides/advanced/jailbreak-detection-heuristics-deployment
-   user-guides/advanced/safeguarding-ai-virtual-assistant-blueprint
-
-.. toctree::
-   :caption: Other
-   :name: Other
-   :maxdepth: 2
-
-   architecture/index
-   glossary
-   faqs
-   changes
diff --git a/docs/project.json b/docs/project.json
index caf937f91..6f93bccec 100644
--- a/docs/project.json
+++ b/docs/project.json
@@ -1 +1 @@
-{ "name": "nemo-guardrails-toolkit", "version": "0.11.1" }
+{ "name": "nemo-guardrails-toolkit", "version": "0.12.0" }
diff --git a/docs/user-guides/configuration-guide.md b/docs/user-guides/configuration-guide.md
index 8481854be..b2c409654 100644
--- a/docs/user-guides/configuration-guide.md
+++ b/docs/user-guides/configuration-guide.md
@@ -84,7 +84,7 @@ The meaning of the attributes is as follows:
 
 You can use any LLM provider that is supported by LangChain, e.g., `ai21`, `aleph_alpha`, `anthropic`, `anyscale`, `azure`, `cohere`, `huggingface_endpoint`, `huggingface_hub`, `openai`, `self_hosted`, `self_hosted_hugging_face`. Check out the LangChain official documentation for the full list.
 
-In addition to the above LangChain providers, connecting to [Nvidia NIMs](https://docs.nvidia.com/nim/index.html) is supported using the engine `nvidia_ai_endpoints` or synonymously `nim`, for both Nvidia hosted NIMs (accessible through an Nvidia AI Enterprise license) and for locally downloaded and self-hosted NIM containers.
+In addition to the above LangChain providers, connecting to [NVIDIA NIMs](https://docs.nvidia.com/nim/index.html) is supported using the engine `nvidia_ai_endpoints` or synonymously `nim`, for both NVIDIA hosted NIMs (accessible through an NVIDIA AI Enterprise license) and for locally downloaded and self-hosted NIM containers.
 
 ```{note}
 To use any of the providers, you must install additional packages; when you first try to use a configuration with a new provider, you typically receive an error from LangChain that instructs which packages you should install.
 ```
 
@@ -104,6 +104,7 @@ NIMs can be self hosted, using downloadable containers, or Nvidia hosted and acc
 NeMo Guardrails supports connecting to NIMs as follows:
 
 ##### Self-hosted NIMs
+
 To connect to self-hosted NIMs, set the engine to `nim`. Also make sure the model name matches one of the model names the hosted NIM supports (you can get a list of supported models using a GET request to the `v1/models` endpoint).
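+
+For example, the following snippet lists the supported models; the host and port are
+assumptions, so substitute the address that your NIM deployment exposes:
+
+```python
+import json
+from urllib.request import urlopen
+
+# Assumption: the NIM serves its OpenAI-compatible API at localhost:8000.
+with urlopen("http://localhost:8000/v1/models") as response:
+    print(json.dumps(json.load(response), indent=2))
+```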
 
 ```yaml
@@ -663,6 +664,86 @@ Output rails process a bot message. The message to be processed is available in
 
 You can deactivate output rails temporarily for the next bot message, by setting the `$skip_output_rails` context variable to `True`.
 
+#### Streaming Output Configuration
+
+By default, the response from an output rail is synchronous.
+You can enable streaming to begin receiving responses from the output rail sooner.
+
+You must set the top-level `streaming: True` field in your `config.yml` file.
+
+For each output rail, add the `streaming` field and configuration parameters.
+
+```yaml
+rails:
+  output:
+    flows:
+      - rail name
+    streaming:
+      chunk_size: 200
+      context_size: 50
+      stream_first: True
+
+streaming: True
+```
+
+When streaming is enabled, the toolkit applies output rails to chunks of tokens.
+If a rail blocks a chunk of tokens, the toolkit returns a string in the following format:
+
+```output
+{"event": "ABORT", "data": {"reason": "Blocked by rails."}}
+```
+
+The following table describes the subfields for the `streaming` field:
+
+```{list-table}
+:header-rows: 1
+
+* - Field
+  - Description
+  - Default Value
+
+* - streaming.chunk_size
+  - Specifies the number of tokens for each chunk.
+    The toolkit applies output guardrails on each chunk of tokens.
+
+    Larger values provide more meaningful information for the rail to assess,
+    but can add latency while accumulating tokens for a full chunk.
+    The added latency is especially noticeable if you specify `stream_first: False`.
+  - `200`
+
+* - streaming.context_size
+  - Specifies the number of tokens to keep from the previous chunk to provide context and continuity in processing.
+
+    Larger values provide continuity across chunks with minimal impact on latency.
+    Small values might fail to detect cross-chunk violations.
+    Specifying approximately 25% of `chunk_size` provides a good compromise.
+  - `50`
+
+* - streaming.stream_first
+  - When set to `False`, the toolkit applies the output rails to the chunks before streaming them to the client.
+    Setting this field to `False` avoids streaming chunks of blocked content.
+
+    By default, the toolkit streams the chunks as soon as possible and before applying output rails to them.
+  - `True`
+```
+
+The following table shows how the input length, chunk size, and context size interact to determine the number of rails invocations.
+After the first chunk, each invocation consumes `chunk_size - context_size` new tokens,
+so for inputs longer than one chunk the invocation count is `1 + ceil((input_length - chunk_size) / (chunk_size - context_size))`.
+
+```{csv-table}
+:header: Input Length, Chunk Size, Context Size, Rails Invocations
+
+512,256,64,3
+600,256,64,3
+256,256,64,1
+1024,256,64,5
+1024,256,32,5
+1024,128,32,11
+512,128,32,5
+```
+
+Refer to [](../getting-started/5-output-rails/README.md#streaming-output) for a code sample.
+
 ### Retrieval Rails
 
 Retrieval rails process the retrieved chunks, i.e., the `$relevant_chunks` variable.
diff --git a/docs/versions1.json b/docs/versions1.json
index 348caf8f4..c2e197536 100644
--- a/docs/versions1.json
+++ b/docs/versions1.json
@@ -1,7 +1,7 @@
 [
   {
     "preferred": true,
-    "version": "0.11.1",
-    "url": "../0.11.1"
+    "version": "0.12.0",
+    "url": "../0.12.0"
   }
 ]