docs: Output streaming (NVIDIA#976)
Documents NVIDIA#966

Signed-off-by: Mike McKiernan <[email protected]>
mikemckiernan committed Feb 7, 2025
1 parent a8f7114 commit 81da538
Showing 5 changed files with 150 additions and 117 deletions.
76 changes: 65 additions & 11 deletions docs/getting-started/5-output-rails/README.md

Activating the `self check output` rail is similar to the `self check input` rail:

1. Activate the `self check output` rail in `config.yml`.
2. Add a `self_check_output` prompt in `prompts.yml`.

### Activate the Rail

To activate the rail, include the `self check output` flow name in the output rails section of the `config.yml` file:

```yaml
output:
  flows:
    - self check output
```
For reference, update the full `rails` section in `config.yml` to look like the following:

```yaml
rails:
  input:
    flows:
      - self check input
  output:
    flows:
      - self check output
```

### Add a Prompt

The self-check output rail needs a prompt to perform the check.

Summary: 3 LLM call(s) took 1.89 seconds and used 504 tokens.

```python
print(info.llm_calls[2].prompt)
```

```text
Your task is to check if the bot message below complies with the company policy.
Company policy for the bot:
...
```

As we can see, the LLM generated a message containing the word "idiot"; however, the output rail blocked it before it reached the user.

The following figure depicts the process:

```{image} ../../_static/puml/output_rails_fig_1.png
```

## Streaming Output

By default, the output from the rail is synchronous.
You can enable streaming to provide asynchronous responses and reduce the time to the first response.

1. Modify the `rails` field in the `config.yml` file and add the `streaming` field to enable streaming:

```{code-block} yaml
:emphasize-lines: 9-11,13

rails:
  input:
    flows:
      - self check input
  output:
    flows:
      - self check output
    streaming:
      chunk_size: 200
      context_size: 50

streaming: True
```

1. Call the `stream_async` method and handle the chunked response:

```python
from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("./config")

rails = LLMRails(config)

messages = [{"role": "user", "content": "How many days of vacation does a 10-year employee receive?"}]

async for chunk in rails.stream_async(messages=messages):
print(f"CHUNK: {chunk}")
```

*Partial Output*

```output
CHUNK: According
CHUNK: to
CHUNK: the
CHUNK: employee
CHUNK: handbook,
...
```

For reference information about the related `config.yml` file fields,
refer to [](../../user-guides/configuration-guide.md#output-rails).
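The chunks arrive as plain text unless a rail blocks the stream, in which case the toolkit emits an abort string (described in the configuration guide). A minimal consumer might accumulate the chunks as follows; `fake_stream` and `collect_response` are illustrative helpers, not part of the toolkit API, and the assumption that each chunk carries its own leading whitespace is a simplification:

```python
import asyncio
import json

async def fake_stream(chunks):
    # Stand-in for rails.stream_async(messages=...); yields text chunks.
    for chunk in chunks:
        yield chunk

async def collect_response(stream):
    """Accumulate streamed chunks; stop early if an output rail aborts the stream."""
    parts = []
    async for chunk in stream:
        if chunk.startswith('{"event": "ABORT"'):
            reason = json.loads(chunk)["data"]["reason"]
            return f"[blocked: {reason}]"
        parts.append(chunk)
    return "".join(parts)

text = asyncio.run(collect_response(fake_stream(
    ["According", " to", " the", " employee", " handbook,"])))
print(text)  # According to the employee handbook,
```

In a real application, the same loop would consume `rails.stream_async(messages=messages)` directly.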

## Custom Output Rail

Build a custom output rail that checks the bot response against a list of proprietary words that must not appear in the output.

1. Create a `config/actions.py` file with the following content, which defines an action:

```python
from typing import Optional

from nemoguardrails.actions import action


@action(is_system_action=True)
async def check_blocked_terms(context: Optional[dict] = None):
    bot_response = context.get("bot_message")

    # A quick hard-coded list of proprietary terms; these could also be loaded from a file.
    proprietary_terms = ["proprietary", "proprietary1", "proprietary2"]

    for term in proprietary_terms:
        if term in bot_response.lower():
            return True

    return False
```
102 changes: 0 additions & 102 deletions docs/index.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/project.json
@@ -1 +1 @@
{ "name": "nemo-guardrails-toolkit", "version": "0.12.0" }
83 changes: 82 additions & 1 deletion docs/user-guides/configuration-guide.md

You can use any LLM provider that is supported by LangChain, e.g., `ai21`, `aleph_alpha`, `anthropic`, `anyscale`, `azure`, `cohere`, `huggingface_endpoint`, `huggingface_hub`, `openai`, `self_hosted`, `self_hosted_hugging_face`. Check out the LangChain official documentation for the full list.

In addition to the above LangChain providers, connecting to [Nvidia NIMs](https://docs.nvidia.com/nim/index.html) is supported using the engine `nvidia_ai_endpoints` or, synonymously, `nim`, for both Nvidia-hosted NIMs (accessible through an Nvidia AI Enterprise license) and for locally downloaded, self-hosted NIM containers.

```{note}
To use any of the providers, you must install additional packages; when you first try to use a configuration with a new provider, you typically receive an error from LangChain that instructs which packages you should install.
NIMs can be self-hosted, using downloadable containers, or Nvidia-hosted and accessed through an Nvidia AI Enterprise license.
NeMo Guardrails supports connecting to NIMs as follows:

##### Self-hosted NIMs

To connect to self-hosted NIMs, set the engine to `nim`. Also make sure the model name matches one of the model names the hosted NIM supports (you can get a list of supported models using a GET request to the `v1/models` endpoint).

```yaml
models:
  - type: main
    engine: nim
    model: <MODEL_NAME>
    parameters:
      base_url: <NIM_ENDPOINT_URL>
```

Output rails process a bot message. The message to be processed is available in the `$bot_message` context variable.

You can deactivate output rails temporarily for the next bot message, by setting the `$skip_output_rails` context variable to `True`.

#### Streaming Output Configuration

By default, the response from an output rail is synchronous.
You can enable streaming to begin receiving responses from the output rail sooner.

You must set the top-level `streaming: True` field in your `config.yml` file.

For each output rail, add the `streaming` field and configuration parameters.

```yaml
rails:
  output:
    flows:
      - rail name
    streaming:
      chunk_size: 200
      context_size: 50
      stream_first: True

streaming: True
```
When streaming is enabled, the toolkit applies output rails to chunks of tokens.
If a rail blocks a chunk of tokens, the toolkit returns a string in the following format:
```output
{"event": "ABORT", "data": {"reason": "Blocked by <rail-name> rails."}}
```
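A client consuming the stream can check each chunk for this sentinel before rendering it. The helper below is an illustrative sketch, not part of the toolkit:

```python
import json

def parse_abort(chunk: str):
    """Return the abort reason if the chunk is the ABORT sentinel, else None."""
    if not chunk.lstrip().startswith("{"):
        return None  # ordinary text chunk
    try:
        payload = json.loads(chunk)
    except json.JSONDecodeError:
        return None  # text that merely looks JSON-like
    if payload.get("event") == "ABORT":
        return payload.get("data", {}).get("reason")
    return None

print(parse_abort('{"event": "ABORT", "data": {"reason": "Blocked by self check output rails."}}'))
# Blocked by self check output rails.
print(parse_abort("According to the employee handbook,"))  # None
```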

The following table describes the subfields for the `streaming` field:

```{list-table}
:header-rows: 1
* - Field
- Description
- Default Value
* - streaming.chunk_size
- Specifies the number of tokens for each chunk.
The toolkit applies output guardrails on each chunk of tokens.
Larger values provide more meaningful information for the rail to assess,
but can add latency while the toolkit accumulates tokens for a full chunk.
The added latency is especially noticeable if you specify `stream_first: False`.
- `200`
* - streaming.context_size
- Specifies the number of tokens to keep from the previous chunk to provide context and continuity in processing.
Larger values provide continuity across chunks with minimal impact on latency.
Small values might fail to detect cross-chunk violations.
Specifying approximately 25% of `chunk_size` provides a good compromise.
- `50`
* - streaming.stream_first
- When set to `False`, the toolkit applies the output rails to the chunks before streaming them to the client.
If you set this field to `False`, you can avoid streaming chunks of blocked content.
By default, the toolkit streams the chunks as soon as possible and before applying output rails to them.
- `True`
```

The following table shows how the input length, chunk size, and context size determine the number of rails invocations.

```{csv-table}
:header: Input Length, Chunk Size, Context Size, Rails Invocations
512,256,64,3
600,256,64,3
256,256,64,1
1024,256,64,5
1024,256,32,5
1024,128,32,11
512,128,32,5
```
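The counts above are consistent with a simple pattern: the first invocation covers one full chunk, and each subsequent invocation advances by `chunk_size - context_size` new tokens. The helper below is inferred from the table, not taken from the toolkit source:

```python
import math

def rails_invocations(input_length: int, chunk_size: int, context_size: int) -> int:
    """Estimate how many times output rails run for a streamed response."""
    if input_length <= chunk_size:
        return 1  # everything fits in the first chunk
    stride = chunk_size - context_size  # new tokens consumed per subsequent chunk
    return 1 + math.ceil((input_length - chunk_size) / stride)

print(rails_invocations(512, 256, 64))   # 3
print(rails_invocations(1024, 128, 32))  # 11
```

Reducing `chunk_size` or increasing `context_size` shrinks the stride, so the rails run more often for the same input length.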

Refer to [](../getting-started/5-output-rails/README.md#streaming-output) for a code sample.

### Retrieval Rails

Retrieval rails process the retrieved chunks, i.e., the `$relevant_chunks` variable.
4 changes: 2 additions & 2 deletions docs/versions1.json
@@ -1,7 +1,7 @@
[
{
"preferred": true,
"version": "0.12.0",
"url": "../0.12.0"
}
]
