docs/user-guides/configuration-guide.md (82 additions, 1 deletion)

@@ -84,7 +84,7 @@ The meaning of the attributes is as follows:
You can use any LLM provider that is supported by LangChain, e.g., `ai21`, `aleph_alpha`, `anthropic`, `anyscale`, `azure`, `cohere`, `huggingface_endpoint`, `huggingface_hub`, `openai`, `self_hosted`, `self_hosted_hugging_face`. Check out the LangChain official documentation for the full list.
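
For example, a model entry in `config.yml` that selects the `openai` engine might look like the following sketch (the model name is illustrative; use any model your provider offers):

```yaml
models:
  - type: main              # the main LLM used by the guardrails configuration
    engine: openai          # any supported LangChain provider
    model: gpt-3.5-turbo-instruct   # illustrative model name
```
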
In addition to the above LangChain providers, connecting to [Nvidia NIMs](https://docs.nvidia.com/nim/index.html) is supported using the engine `nvidia_ai_endpoints` or, synonymously, `nim`, for both Nvidia-hosted NIMs (accessible through an Nvidia AI Enterprise license) and for locally downloaded and self-hosted NIM containers.
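
For an Nvidia-hosted NIM, a minimal sketch of a model entry (the model name is illustrative, and this assumes an `NVIDIA_API_KEY` environment variable is set for authentication):

```yaml
models:
  - type: main
    engine: nvidia_ai_endpoints   # or, synonymously: nim
    model: meta/llama-3.1-70b-instruct   # illustrative model name
```
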
```{note}
To use any of the providers, you must install additional packages; when you first try to use a configuration with a new provider, you typically receive an error from LangChain that instructs which packages you should install.
```
@@ -104,6 +104,7 @@ NIMs can be self hosted, using downloadable containers, or Nvidia hosted and acc
NeMo Guardrails supports connecting to NIMs as follows:
##### Self-hosted NIMs

To connect to self-hosted NIMs, set the engine to `nim`. Also, make sure the model name matches one of the model names the hosted NIM supports (you can get a list of supported models with a GET request to the `v1/models` endpoint).

```yaml
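# Illustrative sketch: the model name and base_url below are assumptions;
# substitute the values your own NIM deployment reports (for example, from a
# GET request to its v1/models endpoint).
models:
  - type: main
    engine: nim
    model: meta/llama-3.1-8b-instruct
    parameters:
      base_url: http://localhost:8000/v1
```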
@@ -663,6 +664,86 @@ Output rails process a bot message. The message to be processed is available in

You can deactivate output rails temporarily for the next bot message by setting the `$skip_output_rails` context variable to `True`.

#### Streaming Output Configuration

By default, the response from an output rail is synchronous.
You can enable streaming to begin receiving responses from the output rail sooner.

You must set the top-level `streaming: True` field in your `config.yml` file.

For each output rail, add the `streaming` field and configuration parameters.

```yaml
rails:
  output:
    - rail name
      streaming:
        chunk_size: 200
        context_size: 50
        stream_first: True

streaming: True
```

When streaming is enabled, the toolkit applies output rails to chunks of tokens.
If a rail blocks a chunk of tokens, the toolkit returns a string in the following format:

```output
{"event": "ABORT", "data": {"reason": "Blocked by <rail-name> rails."}}
```
The following table describes the subfields for the `streaming` field:

```{list-table}
:header-rows: 1

* - Field
  - Description
  - Default Value

* - streaming.chunk_size
  - Specifies the number of tokens for each chunk.
    The toolkit applies output guardrails on each chunk of tokens.

    Larger values provide more meaningful information for the rail to assess,
    but can add latency while accumulating tokens for a full chunk.
    The risk of higher latency is especially true if you specify `stream_first: False`.
  - `200`

* - streaming.context_size
  - Specifies the number of tokens to keep from the previous chunk to provide context and continuity in processing.

    Larger values provide continuity across chunks with minimal impact on latency.
    Small values might fail to detect cross-chunk violations.
    Specifying approximately 25% of `chunk_size` provides a good compromise.
  - `50`

* - streaming.stream_first
  - When set to `False`, the toolkit applies the output rails to the chunks before streaming them to the client.
    If you set this field to `False`, you can avoid streaming chunks of blocked content.

    By default, the toolkit streams the chunks as soon as possible and before applying output rails to them.
  - `True`
```

The following table shows how the number of tokens, chunk size, and context size interact to determine the number of rail invocations. For example, with `chunk_size: 200` and `context_size: 50`, each chunk after the first covers 150 new tokens, so a 500-token response triggers three rail invocations.