-
Notifications
You must be signed in to change notification settings - Fork 737
docs(library): add ATR-inspired threat detection example #1869
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
60fa176
1b1245f
813f41c
29beb3b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,84 @@ | ||
| # ATR-inspired threat detection example | ||
|
|
||
| This example shows how to use the built-in `regex_detection` input rail | ||
| with a small set of patterns inspired by Agent Threat Rules, an open | ||
| detection standard for AI agent threats published under the MIT license: | ||
|
|
||
| https://github.com/Agent-Threat-Rule/agent-threat-rules | ||
|
|
||
| ## What it covers | ||
|
|
||
| The patterns in `config/config.yml` map to common attack categories that | ||
| ATR ships rules for: | ||
|
|
||
| - ATR-PI-001 instruction override ("ignore previous instructions") | ||
| - ATR-PI-002 system prompt exfiltration ("reveal your system prompt") | ||
| - ATR-PI-003 role-play jailbreak ("act as DAN") | ||
| - ATR-PI-004 base64-wrapped payload hint | ||
| - ATR-MCP-001 MCP tool override markers | ||
| - ATR-SSRF-001 `file://` scheme reference | ||
|
|
||
| Each entry is illustrative. The full ruleset and YAML schema live in the | ||
| ATR repository; this example exists so a NeMo Guardrails user can see the | ||
| shape of an agent-specific input rail without needing an external service. | ||
|
|
||
| ## Running the example | ||
|
|
||
| From the project root: | ||
|
|
||
| ```bash | ||
| nemoguardrails chat --config=examples/configs/atr_threat_detection/config | ||
| ``` | ||
|
|
||
| A user message such as "Ignore all previous instructions" will trigger the | ||
| `regex check input` flow and the bot will respond with the library default | ||
| refusal message defined in `nemoguardrails/library/regex/flows.v1.co` | ||
| (`"I'm sorry, I can't respond to that."`). Benign messages are forwarded | ||
| to the configured main model. | ||
|
|
||
| The `config.yml` lists `openai`/`gpt-4o-mini` as the main model so that | ||
| chat runs end-to-end. Replace with your preferred provider; the input | ||
| rail blocks threats before the model is invoked, so the model only sees | ||
| benign inputs. | ||
|
|
||
| ## Extending | ||
|
|
||
| To run against the live ATR YAML ruleset, parse the rule files at startup | ||
| and append the `detection.regex_patterns` field of each rule to the | ||
| `patterns` list under `regex_detection.input`. | ||
|
|
||
| To surface a custom signal (rather than only refusing), add a custom | ||
| flow that calls `detect_regex_pattern` directly. Follow the library's | ||
| established `if $config.enable_rails_exceptions` pattern (see | ||
| `examples/configs/guardrails_only/input/config.co`) so the flow emits | ||
| **either** the exception event **or** the bot utterance, not both — in | ||
| Colang 1.0 the rails event loop short-circuits on the exception and | ||
| drops the bot utterance from the response if both fire in the same | ||
| flow. | ||
|
|
||
| ```colang | ||
| define bot refuse atr_threat | ||
| "I'm sorry, that request was blocked by an ATR input safety rule." | ||
|
|
||
| define flow atr report match | ||
| $result = execute detect_regex_pattern(source="input", text=$user_message) | ||
| if $result["is_match"] | ||
| if $config.enable_rails_exceptions | ||
| create event AtrRuleMatchedRailException(message="ATR input rail blocked") | ||
| else | ||
| bot refuse atr_threat | ||
| stop | ||
| ``` | ||
|
Comment on lines
+59
to
+71
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
When both The established pattern (e.g. Prompt To Fix With AIThis is a comment left during a code review.
Path: examples/configs/atr_threat_detection/README.md
Line: 53-64
Comment:
**Extending snippet silently drops the custom refusal message**
When both `create event AtrRuleMatchedRailException` and `bot refuse atr_threat` fire in the same flow, the Colang 1.0 runtime's event loop (see `llmrails.py` lines 946–954) appends the bot script to `responses` but also sets `exception = event`. The subsequent branch at line 986 (`if exception: new_message = {"role": "exception", ...}`) short-circuits and never uses `responses`, so the caller receives an exception object instead of the custom refusal text.
The established pattern (e.g. `guardrails_only/input/config.co`) guards on `$config.enable_rails_exceptions` and uses an `if/else` to emit **either** the exception event **or** the bot utterance — not both unconditionally.
How can I resolve this? If you propose a fix, please make it concise. |
||
|
|
||
| Then wire `atr report match` instead of `regex check input` under | ||
| `rails.input.flows`. The custom flow uses a non-conflicting bot utterance | ||
| (`bot refuse atr_threat`) so it does not collide with the library | ||
| default, and emits a `AtrRuleMatchedRailException` event when | ||
| `enable_rails_exceptions` is set so downstream observers (audit logging, | ||
| metrics) can subscribe to it. | ||
|
|
||
| If you also want to capture the matched rule list for audit, assign | ||
| `$matched_rules = $result["detections"]` before the if/else and pass it | ||
| through your own action call or to the event message — keep the | ||
| exception/utterance branches single-action to preserve the canonical | ||
| event-loop semantics. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| # This example wires the built-in regex_detection rail to a small set of | ||
| # ATR-inspired threat patterns covering common AI agent attack categories. | ||
| # The full open detection set lives in Agent Threat Rules (MIT-licensed): | ||
| # https://github.com/Agent-Threat-Rule/agent-threat-rules | ||
| # | ||
| # A main model is configured so `nemoguardrails chat` runs end-to-end against | ||
| # this example. Replace the engine/model with your preferred provider; the | ||
| # input rail blocks threats before the model is invoked, so the model is only | ||
| # called for benign user messages. | ||
| models: | ||
| - type: main | ||
| engine: openai | ||
| model: gpt-4o-mini | ||
|
|
||
| rails: | ||
| config: | ||
| regex_detection: | ||
| input: | ||
| case_insensitive: true | ||
| patterns: | ||
| # ATR-PI-001 instruction override | ||
| - "\\b(ignore|disregard|forget)\\s+(all\\s+)?(previous|prior|above)\\s+(instructions?|prompts?|rules?)" | ||
| # ATR-PI-002 system prompt exfiltration | ||
| - "(reveal|print|repeat|show)\\s+(your\\s+)?(system\\s+prompt|initial\\s+instructions)" | ||
| # ATR-PI-003 role-play jailbreak | ||
| - "\\b(you\\s+are\\s+now|act\\s+as|pretend\\s+to\\s+be)\\s+(DAN|developer\\s+mode|jailbroken|an?\\s+unrestricted)" | ||
| # ATR-PI-004 base64-wrapped payload hint | ||
| - "(decode|run|execute)\\s+(this\\s+)?base64[:\\s]+[A-Za-z0-9+/=]{40,}" | ||
| # ATR-MCP-001 mcp tool override | ||
| - "<\\s*(tool_override|mcp_override|new_tool_definition)\\s*>" | ||
| # ATR-SSRF-001 file:// scheme reference | ||
| - "file://[^\\s\"'<>]+" | ||
| input: | ||
| flows: | ||
| - regex check input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prompt To Fix With AI