From 81da53814729a00a2fbe2a441e781d8267ba5626 Mon Sep 17 00:00:00 2001
From: Mike McKiernan
Date: Fri, 7 Feb 2025 08:44:31 -0500
Subject: [PATCH] docs: Output streaming (#976)

Documents https://github.com/NVIDIA/NeMo-Guardrails/pull/966

Signed-off-by: Mike McKiernan
---
 docs/getting-started/5-output-rails/README.md |  76 +++++++++++--
 docs/index.rst                                | 102 ------------------
 docs/project.json                             |   2 +-
 docs/user-guides/configuration-guide.md       |  83 +++++++++++++-
 docs/versions1.json                           |   4 +-
 5 files changed, 150 insertions(+), 117 deletions(-)
 delete mode 100644 docs/index.rst

diff --git a/docs/getting-started/5-output-rails/README.md b/docs/getting-started/5-output-rails/README.md
index 36a2025a3..c8f0be042 100644
--- a/docs/getting-started/5-output-rails/README.md
+++ b/docs/getting-started/5-output-rails/README.md
@@ -30,12 +30,12 @@ NeMo Guardrails comes with a built-in [output self-checking rail](../../user-gui
 
 Activating the `self check output` rail is similar to the `self check input` rail:
 
-1. Activate the `self check output` rail in *config.yml*.
-2. Add a `self_check_output` prompt in *prompts.yml*.
+1. Activate the `self check output` rail in `config.yml`.
+2. Add a `self_check_output` prompt in `prompts.yml`.
 
-### Activate the rail
+### Activate the Rail
 
-To activate the rail, include the `self check output` flow name in the output rails section of the *config.yml* file:
+To activate the rail, include the `self check output` flow name in the output rails section of the `config.yml` file:
 
 ```yaml
@@ -43,9 +43,10 @@ output:
   flows:
     - self check output
 ```
 
-For reference, the full `rails` section in `config.yml` should look like the following:
+For reference, update the full `rails` section in `config.yml` to look like the following:
 
 ```yaml
+rails:
   input:
     flows:
       - self check input
@@ -66,7 +67,7 @@ define subflow self check output
     stop
 ```
 
-### Add a prompt
+### Add a Prompt
 
 The self-check output rail needs a prompt to perform the check.
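+
+The prompt lives in `prompts.yml`. The following entry is a minimal sketch: the opening
+instruction matches the prompt text shown later in this guide, and the policy bullets are
+illustrative placeholders that you should adapt to your own policy:
+
+```yaml
+prompts:
+  - task: self_check_output
+    content: |-
+      Your task is to check if the bot message below complies with the company policy.
+
+      Company policy for the bot:
+      - messages should not contain any explicit content
+      - messages should not contain abusive language or offensive content
+      - messages should not contain any harmful content
+
+      Bot message: "{{ bot_response }}"
+
+      Question: Should the message be blocked (Yes or No)?
+      Answer:
+```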
@@ -130,7 +131,7 @@ Summary: 3 LLM call(s) took 1.89 seconds and used 504 tokens.
 print(info.llm_calls[2].prompt)
 ```
 
-```
+```text
 Your task is to check if the bot message below complies with the company policy.
 
 Company policy for the bot:
@@ -160,15 +161,68 @@ As we can see, the LLM did generate the message containing the word "idiot", how
 
 The following figure depicts the process:
 
-<div align="center">
-<img src="../../_static/puml/output_rails_fig_1.png"/>
-</div>
+```{image} ../../_static/puml/output_rails_fig_1.png
+```
+
+## Streaming Output
+
+By default, the output from the rail is synchronous.
+You can enable streaming to provide asynchronous responses and reduce the time to the first response.
+
+1. Modify the `rails` field in the `config.yml` file and add the `streaming` field to enable streaming:
+
+   ```{code-block} yaml
+   :emphasize-lines: 9-11,13
+
+   rails:
+     input:
+       flows:
+         - self check input
+
+     output:
+       flows:
+         - self check output
+       streaming:
+         chunk_size: 200
+         context_size: 50
+
+   streaming: True
+   ```
+
+1. Call the `stream_async` method and handle the chunked response (a sketch for detecting blocked chunks follows these steps):
+
+   ```python
+   from nemoguardrails import RailsConfig, LLMRails
+
+   config = RailsConfig.from_path("./config")
+
+   rails = LLMRails(config)
+
+   messages = [{"role": "user", "content": "How many days of vacation does a 10-year employee receive?"}]
+
+   async for chunk in rails.stream_async(messages=messages):
+       print(f"CHUNK: {chunk}")
+   ```
+
+   *Partial Output*
+
+   ```output
+   CHUNK: According
+   CHUNK: to
+   CHUNK: the
+   CHUNK: employee
+   CHUNK: handbook,
+   ...
+   ```
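+
+When an output rail blocks a chunk, `stream_async` yields the JSON-formatted `ABORT`
+string that is described in the configuration guide. The following minimal sketch shows
+one possible way to detect it; the prefix check and the error handling are illustrative
+assumptions, not a toolkit API:
+
+```python
+import json
+
+async for chunk in rails.stream_async(messages=messages):
+    # A blocked chunk arrives as a single JSON string, for example:
+    # {"event": "ABORT", "data": {"reason": "Blocked by rails."}}
+    if chunk.startswith('{"event": "ABORT"'):
+        details = json.loads(chunk)
+        print(f"Blocked: {details['data']['reason']}")
+        break
+    print(f"CHUNK: {chunk}")
+```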
+
+For reference information about the related `config.yml` file fields,
+refer to [](../../user-guides/configuration-guide.md#output-rails).
 
 ## Custom Output Rail
 
 Build a custom output rail with a list of proprietary words that we want to make sure do not appear in the output.
 
-1. Create a *config/actions.py* file with the following content, which defines an action:
+1. Create a `config/actions.py` file with the following content, which defines an action:
 
    ```python
    from typing import Optional
diff --git a/docs/index.rst b/docs/index.rst
deleted file mode 100644
index 0cdb28b6c..000000000
--- a/docs/index.rst
+++ /dev/null
@@ -1,102 +0,0 @@
-NVIDIA NeMo Guardrails
-====================================================
-
-.. toctree::
-   :caption: NVIDIA NeMo Guardrails
-   :name: NVIDIA NeMo Guardrails
-   :maxdepth: 1
-
-   introduction.md
-   documentation.md
-   getting-started/installation-guide
-
-.. toctree::
-   :caption: Getting Started
-   :name: Getting Started
-   :maxdepth: 2
-
-   getting-started/1-hello-world/README
-   getting-started/2-core-colang-concepts/README
-   getting-started/3-demo-use-case/README
-   getting-started/4-input-rails/README
-   getting-started/5-output-rails/README
-   getting-started/6-topical-rails/README
-   getting-started/7-rag/README
-
-.. toctree::
-   :caption: Colang 2.0
-   :name: Colang 2.0
-   :maxdepth: 2
-
-   colang-2/overview
-   colang-2/whats-changed
-   colang-2/getting-started/index
-   colang-2/language-reference/index
-
-.. toctree::
-   :caption: User Guides
-   :name: User Guides
-   :maxdepth: 2
-
-   user-guides/configuration-guide
-   user-guides/guardrails-library
-   user-guides/guardrails-process
-   user-guides/colang-language-syntax-guide
-   user-guides/llm-support
-   user-guides/python-api
-   user-guides/cli
-   user-guides/server-guide
-   user-guides/langchain/index
-   user-guides/detailed-logging/index
-   user-guides/jailbreak-detection-heuristics/index
-   user-guides/llm/index
-   user-guides/multi-config-api/index
-   user-guides/migration-guide
-
-.. toctree::
-   :caption: Security
-   :name: Security
-   :maxdepth: 2
-
-   security/guidelines
-   security/red-teaming
-
-.. toctree::
-   :caption: Evaluation
-   :name: Evaluation
-   :maxdepth: 2
-
-   evaluation/README
-   evaluation/llm-vulnerability-scanning
-
-.. toctree::
-   :caption: Advanced User Guides
-   :name: Advanced User Guides
-   :maxdepth: 2
-
-   user-guides/advanced/generation-options
-   user-guides/advanced/prompt-customization
-   user-guides/advanced/embedding-search-providers
-   user-guides/advanced/using-docker
-   user-guides/advanced/streaming
-   user-guides/advanced/align-score-deployment
-   user-guides/advanced/extract-user-provided-values
-   user-guides/advanced/bot-message-instructions
-   user-guides/advanced/event-based-api
-   user-guides/advanced/llama-guard-deployment
-   user-guides/advanced/nested-async-loop
-   user-guides/advanced/vertexai-setup
-   user-guides/advanced/nemoguard-contentsafety-deployment
-   user-guides/advanced/nemoguard-topiccontrol-deployment
-   user-guides/advanced/jailbreak-detection-heuristics-deployment
-   user-guides/advanced/safeguarding-ai-virtual-assistant-blueprint
-
-.. toctree::
-   :caption: Other
-   :name: Other
-   :maxdepth: 2
-
-   architecture/index
-   glossary
-   faqs
-   changes
diff --git a/docs/project.json b/docs/project.json
index caf937f91..6f93bccec 100644
--- a/docs/project.json
+++ b/docs/project.json
@@ -1 +1 @@
-{ "name": "nemo-guardrails-toolkit", "version": "0.11.1" }
+{ "name": "nemo-guardrails-toolkit", "version": "0.12.0" }
diff --git a/docs/user-guides/configuration-guide.md b/docs/user-guides/configuration-guide.md
index 8481854be..b2c409654 100644
--- a/docs/user-guides/configuration-guide.md
+++ b/docs/user-guides/configuration-guide.md
@@ -84,7 +84,7 @@ The meaning of the attributes is as follows:
 
 You can use any LLM provider that is supported by LangChain, e.g., `ai21`, `aleph_alpha`, `anthropic`, `anyscale`, `azure`, `cohere`, `huggingface_endpoint`, `huggingface_hub`, `openai`, `self_hosted`, `self_hosted_hugging_face`. Check out the LangChain official documentation for the full list.
 
-In addition to the above LangChain providers, connecting to [Nvidia NIMs](https://docs.nvidia.com/nim/index.html) is supported using the engine `nvidia_ai_endpoints` or synonymously `nim`, for both Nvidia hosted NIMs (accessible through an Nvidia AI Enterprise license) and for locally downloaded and self-hosted NIM containers.
+In addition to the above LangChain providers, connecting to [NVIDIA NIMs](https://docs.nvidia.com/nim/index.html) is supported using the engine `nvidia_ai_endpoints` or synonymously `nim`, for both NVIDIA hosted NIMs (accessible through an NVIDIA AI Enterprise license) and for locally downloaded and self-hosted NIM containers.
 
 ```{note}
 To use any of the providers, you must install additional packages; when you first try to use a configuration with a new provider, you typically receive an error from LangChain that instructs which packages you should install.
 ```
 
@@ -104,6 +104,7 @@ NIMs can be self hosted, using downloadable containers, or Nvidia hosted and acc
 NeMo Guardrails supports connecting to NIMs as follows:
 
 ##### Self-hosted NIMs
+
 To connect to self-hosted NIMs, set the engine to `nim`. Also make sure the model name matches one of the model names the hosted NIM supports (you can get a list of supported models using a GET request to the `v1/models` endpoint).
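+
+For example, the following snippet lists the supported models; the host and port are
+assumptions, so substitute the address that your NIM deployment exposes:
+
+```python
+import json
+from urllib.request import urlopen
+
+# Assumption: the NIM serves its OpenAI-compatible API at localhost:8000.
+with urlopen("http://localhost:8000/v1/models") as response:
+    print(json.dumps(json.load(response), indent=2))
+```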
 
 ```yaml
@@ -663,6 +664,86 @@ Output rails process a bot message. The message to be processed is available in
 
 You can deactivate output rails temporarily for the next bot message, by setting the `$skip_output_rails` context variable to `True`.
 
+#### Streaming Output Configuration
+
+By default, the response from an output rail is synchronous.
+You can enable streaming to begin receiving responses from the output rail sooner.
+
+You must set the top-level `streaming: True` field in your `config.yml` file.
+
+For each output rail, add the `streaming` field and configuration parameters.
+
+```yaml
+rails:
+  output:
+    flows:
+      - rail name
+    streaming:
+      chunk_size: 200
+      context_size: 50
+      stream_first: True
+
+streaming: True
+```
+
+When streaming is enabled, the toolkit applies output rails to chunks of tokens.
+If a rail blocks a chunk of tokens, the toolkit returns a string in the following format:
+
+```output
+{"event": "ABORT", "data": {"reason": "Blocked by rails."}}
+```
+
+The following table describes the subfields for the `streaming` field:
+
+```{list-table}
+:header-rows: 1
+
+* - Field
+  - Description
+  - Default Value
+
+* - streaming.chunk_size
+  - Specifies the number of tokens for each chunk.
+    The toolkit applies output guardrails on each chunk of tokens.
+
+    Larger values provide more meaningful information for the rail to assess,
+    but can add latency while accumulating tokens for a full chunk.
+    The added latency is especially noticeable if you specify `stream_first: False`.
+  - `200`
+
+* - streaming.context_size
+  - Specifies the number of tokens to keep from the previous chunk to provide context and continuity in processing.
+
+    Larger values provide continuity across chunks with minimal impact on latency.
+    Small values might fail to detect cross-chunk violations.
+    Specifying approximately 25% of `chunk_size` provides a good compromise.
+  - `50`
+
+* - streaming.stream_first
+  - When set to `False`, the toolkit applies the output rails to the chunks before streaming them to the client.
+    Setting this field to `False` avoids streaming chunks of blocked content.
+
+    By default, the toolkit streams the chunks as soon as possible and before applying output rails to them.
+  - `True`
+```
+
+The following table shows how the input length, chunk size, and context size interact to determine the number of rails invocations.
+After the first chunk, each invocation consumes `chunk_size - context_size` new tokens,
+so for inputs longer than one chunk the invocation count is `1 + ceil((input_length - chunk_size) / (chunk_size - context_size))`.
+
+```{csv-table}
+:header: Input Length, Chunk Size, Context Size, Rails Invocations
+
+512,256,64,3
+600,256,64,3
+256,256,64,1
+1024,256,64,5
+1024,256,32,5
+1024,128,32,11
+512,128,32,5
+```
+
+Refer to [](../getting-started/5-output-rails/README.md#streaming-output) for a code sample.
+
 ### Retrieval Rails
 
 Retrieval rails process the retrieved chunks, i.e., the `$relevant_chunks` variable.
diff --git a/docs/versions1.json b/docs/versions1.json
index 348caf8f4..c2e197536 100644
--- a/docs/versions1.json
+++ b/docs/versions1.json
@@ -1,7 +1,7 @@
 [
   {
     "preferred": true,
-    "version": "0.11.1",
-    "url": "../0.11.1"
+    "version": "0.12.0",
+    "url": "../0.12.0"
   }
 ]