
Commit 81da538

docs: Output streaming (NVIDIA#976)

Documents NVIDIA#966

Signed-off-by: Mike McKiernan <[email protected]>

1 parent a8f7114 commit 81da538

5 files changed: +150 −117 lines changed


docs/getting-started/5-output-rails/README.md

Lines changed: 65 additions & 11 deletions
@@ -30,22 +30,23 @@ NeMo Guardrails comes with a built-in [output self-checking rail](../../user-gui

Activating the `self check output` rail is similar to the `self check input` rail:

-1. Activate the `self check output` rail in *config.yml*.
-2. Add a `self_check_output` prompt in *prompts.yml*.
+1. Activate the `self check output` rail in `config.yml`.
+2. Add a `self_check_output` prompt in `prompts.yml`.

-### Activate the rail
+### Activate the Rail

-To activate the rail, include the `self check output` flow name in the output rails section of the *config.yml* file:
+To activate the rail, include the `self check output` flow name in the output rails section of the `config.yml` file:

```yaml
output:
  flows:
    - self check output
```

-For reference, the full `rails` section in `config.yml` should look like the following:
+For reference, update the full `rails` section in `config.yml` to look like the following:

```yaml
+rails:
  input:
    flows:
      - self check input
@@ -66,7 +67,7 @@ define subflow self check output
    stop
```

-### Add a prompt
+### Add a Prompt

The self-check output rail needs a prompt to perform the check.

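For orientation, a `prompts.yml` entry for this task can look like the following sketch. The `task: self_check_output` key and the `{{ bot_response }}` template variable follow the toolkit's prompt conventions; the policy bullets are illustrative, and the rendered prompt appears in the next hunk.

```yaml
prompts:
  - task: self_check_output
    content: |-
      Your task is to check if the bot message below complies with the company policy.

      Company policy for the bot:
      - should not contain explicit content
      - should not use abusive language

      Bot message: "{{ bot_response }}"

      Question: Should the message be blocked (Yes or No)?
      Answer:
```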
@@ -130,7 +131,7 @@ Summary: 3 LLM call(s) took 1.89 seconds and used 504 tokens.
print(info.llm_calls[2].prompt)
```

-```
+```text
Your task is to check if the bot message below complies with the company policy.

Company policy for the bot:
@@ -160,15 +161,68 @@ As we can see, the LLM did generate the message containing the word "idiot", how

The following figure depicts the process:

-<div align="center">
-  <img src="../../_static/puml/output_rails_fig_1.png" width="815">
-</div>
+```{image} ../../_static/puml/output_rails_fig_1.png
+```
+
+## Streaming Output
+
+By default, the output from the rail is synchronous.
+You can enable streaming to provide asynchronous responses and reduce the time to the first response.
+
+1. Modify the `rails` field in the `config.yml` file and add the `streaming` field to enable streaming:
+
+   ```{code-block} yaml
+   :emphasize-lines: 9-11,13
+
+   rails:
+     input:
+       flows:
+         - self check input
+
+     output:
+       flows:
+         - self check output
+       streaming:
+         chunk_size: 200
+         context_size: 50
+
+   streaming: True
+   ```
+
+1. Call the `stream_async` method and handle the chunked response:
+
+   ```python
+   from nemoguardrails import RailsConfig, LLMRails
+
+   config = RailsConfig.from_path("./config")
+
+   rails = LLMRails(config)
+
+   messages = [{"role": "user", "content": "How many days of vacation does a 10-year employee receive?"}]
+
+   async for chunk in rails.stream_async(messages=messages):
+       print(f"CHUNK: {chunk}")
+   ```
+
+   *Partial Output*
+
+   ```output
+   CHUNK: According
+   CHUNK: to
+   CHUNK: the
+   CHUNK: employee
+   CHUNK: handbook,
+   ...
+   ```
+
+For reference information about the related `config.yml` file fields,
+refer to [](../../user-guides/configuration-guide.md#output-rails).
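When an output rail blocks a chunk, the stream yields an `ABORT` error payload instead of text (the format is documented in the configuration-guide changes below). The following is a minimal consumer sketch, assuming the payload arrives as a plain JSON string and using a prefix check as an illustrative detection heuristic:

```python
import asyncio
import json

from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

messages = [{"role": "user", "content": "How many days of vacation does a 10-year employee receive?"}]


async def consume_stream():
    async for chunk in rails.stream_async(messages=messages):
        # A blocked chunk arrives as a JSON string with an "ABORT" event.
        if chunk.startswith('{"event": "ABORT"'):
            reason = json.loads(chunk)["data"]["reason"]
            print(f"\nStream stopped: {reason}")
            break
        print(chunk, end="", flush=True)


asyncio.run(consume_stream())
```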

## Custom Output Rail

Build a custom output rail with a list of proprietary words that we want to make sure do not appear in the output.

-1. Create a *config/actions.py* file with the following content, which defines an action:
+1. Create a `config/actions.py` file with the following content, which defines an action:

```python
from typing import Optional
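# The diff view truncates the file at this point. As a rough, hypothetical
# sketch of such an action (the decorator, context key, and term list below
# are illustrative assumptions, not necessarily the committed file content):
from nemoguardrails.actions import action


@action(is_system_action=True)
async def check_blocked_terms(context: Optional[dict] = None):
    bot_response = context.get("bot_message")

    # An illustrative list of proprietary terms that must not appear.
    proprietary_terms = ["proprietary", "confidential"]

    for term in proprietary_terms:
        if term in bot_response.lower():
            return True

    return False
```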
docs/index.rst

Lines changed: 0 additions & 102 deletions
This file was deleted.

docs/project.json

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-{ "name": "nemo-guardrails-toolkit", "version": "0.11.1" }
+{ "name": "nemo-guardrails-toolkit", "version": "0.12.0" }

docs/user-guides/configuration-guide.md

Lines changed: 82 additions & 1 deletion
@@ -84,7 +84,7 @@ The meaning of the attributes is as follows:

You can use any LLM provider that is supported by LangChain, e.g., `ai21`, `aleph_alpha`, `anthropic`, `anyscale`, `azure`, `cohere`, `huggingface_endpoint`, `huggingface_hub`, `openai`, `self_hosted`, `self_hosted_hugging_face`. Check out the LangChain official documentation for the full list.

-In addition to the above LangChain providers, connecting to [Nvidia NIMs](https://docs.nvidia.com/nim/index.html) is supported using the engine `nvidia_ai_endpoints` or synonymously `nim`, for both Nvidia hosted NIMs (accessible through an Nvidia AI Enterprise license) and for locally downloaded and self-hosted NIM containers.
+In addition to the above LangChain providers, connecting to [Nvidia NIMs](https://docs.nvidia.com/nim/index.html) is supported using the engine `nvidia_ai_endpoints` or synonymously `nim`, for both Nvidia hosted NIMs (accessible through an Nvidia AI Enterprise license) and for locally downloaded and self-hosted NIM containers.

```{note}
To use any of the providers, you must install additional packages; when you first try to use a configuration with a new provider, you typically receive an error from LangChain that instructs which packages you should install.
@@ -104,6 +104,7 @@ NIMs can be self hosted, using downloadable containers, or Nvidia hosted and acc
NeMo Guardrails supports connecting to NIMs as follows:

##### Self-hosted NIMs
+
To connect to self-hosted NIMs, set the engine to `nim`. Also make sure the model name matches one of the model names the hosted NIM supports (you can get a list of supported models using a GET request to v1/models endpoint).

```yaml
@@ -663,6 +664,86 @@ Output rails process a bot message. The message to be processed is available in

You can deactivate output rails temporarily for the next bot message, by setting the `$skip_output_rails` context variable to `True`.

+#### Streaming Output Configuration
+
+By default, the response from an output rail is synchronous.
+You can enable streaming to begin receiving responses from the output rail sooner.
+
+You must set the top-level `streaming: True` field in your `config.yml` file.
+
+For each output rail, add the `streaming` field and configuration parameters.
+
+```yaml
+rails:
+  output:
+    flows:
+      - rail name
+    streaming:
+      chunk_size: 200
+      context_size: 50
+      stream_first: True
+
+streaming: True
+```
+
+When streaming is enabled, the toolkit applies output rails to chunks of tokens.
+If a rail blocks a chunk of tokens, the toolkit returns a string in the following format:
+
+```output
+{"event": "ABORT", "data": {"reason": "Blocked by <rail-name> rails."}}
+```
+
+The following table describes the subfields for the `streaming` field:
+
+```{list-table}
+:header-rows: 1
+
+* - Field
+  - Description
+  - Default Value
+
+* - streaming.chunk_size
+  - Specifies the number of tokens for each chunk.
+    The toolkit applies output guardrails on each chunk of tokens.
+
+    Larger values provide more meaningful information for the rail to assess,
+    but can add latency while accumulating tokens for a full chunk.
+    The added latency is especially noticeable if you specify `stream_first: False`.
+  - `200`
+
+* - streaming.context_size
+  - Specifies the number of tokens to keep from the previous chunk to provide context and continuity in processing.
+
+    Larger values provide continuity across chunks with minimal impact on latency.
+    Small values might fail to detect cross-chunk violations.
+    Specifying approximately 25% of `chunk_size` provides a good compromise.
+  - `50`
+
+* - streaming.stream_first
+  - When set to `False`, the toolkit applies the output rails to each chunk before streaming it to the client, so blocked content is not streamed.
+
+    By default, the toolkit streams the chunks as soon as possible and before applying output rails to them.
+  - `True`
+```
+
+The following table shows how the input length, chunk size, and context size determine the number of rails invocations.
+
+```{csv-table}
+:header: Input Length, Chunk Size, Context Size, Rails Invocations
+
+512,256,64,3
+600,256,64,3
+256,256,64,1
+1024,256,64,5
+1024,256,32,5
+1024,128,32,11
+512,128,32,5
+```
+
+Refer to [](../getting-started/5-output-rails/README.md#streaming-output) for a code sample.
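The invocation counts in the table follow from a sliding window over the token stream: the first invocation consumes a full chunk, and each subsequent invocation advances by `chunk_size - context_size` tokens. A small sketch that reproduces the rows (an illustrative helper, not part of the toolkit):

```python
import math


def rails_invocations(input_length: int, chunk_size: int, context_size: int) -> int:
    """Estimate how many times an output rail runs for a streamed response."""
    if input_length <= chunk_size:
        return 1
    # Each later invocation advances by the non-overlapping part of a chunk.
    stride = chunk_size - context_size
    return 1 + math.ceil((input_length - chunk_size) / stride)


# Reproduces the table rows, for example:
assert rails_invocations(512, 256, 64) == 3
assert rails_invocations(1024, 128, 32) == 11
```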
+

### Retrieval Rails

Retrieval rails process the retrieved chunks, i.e., the `$relevant_chunks` variable.

docs/versions1.json

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
[
  {
    "preferred": true,
-    "version": "0.11.1",
-    "url": "../0.11.1"
+    "version": "0.12.0",
+    "url": "../0.12.0"
  }
]
