From 60fa17639040b5f9e4ba285d1453c4ff0c637106 Mon Sep 17 00:00:00 2001 From: Panguard AI Date: Sat, 9 May 2026 16:22:11 +0800 Subject: [PATCH 1/4] docs(library): add ATR-inspired threat detection example Adds an example configuration under examples/configs/atr_threat_detection that wires the built-in regex_detection input rail to a small set of ATR-inspired patterns covering instruction override, system prompt exfiltration, role-play jailbreak, base64-wrapped payload hints, MCP tool override markers, and file:// SSRF references. The full open detection set lives at https://github.com/Agent-Threat-Rule/agent-threat-rules under Apache-2.0. The example uses only built-in components (no new dependencies, no LLM calls in the rail itself) and includes a README with a runnable nemoguardrails chat command. --- .../configs/atr_threat_detection/README.md | 41 +++++++++++++++++++ .../atr_threat_detection/config/config.yml | 27 ++++++++++++ .../atr_threat_detection/config/rails.co | 17 ++++++++ 3 files changed, 85 insertions(+) create mode 100644 examples/configs/atr_threat_detection/README.md create mode 100644 examples/configs/atr_threat_detection/config/config.yml create mode 100644 examples/configs/atr_threat_detection/config/rails.co diff --git a/examples/configs/atr_threat_detection/README.md b/examples/configs/atr_threat_detection/README.md new file mode 100644 index 0000000000..4b3f0d9b77 --- /dev/null +++ b/examples/configs/atr_threat_detection/README.md @@ -0,0 +1,41 @@ +# ATR-inspired threat detection example + +This example shows how to use the built-in `regex_detection` input rail +with a small set of patterns inspired by Agent Threat Rules, an open +detection standard for AI agent threats published under Apache-2.0: + +https://github.com/Agent-Threat-Rule/agent-threat-rules + +## What it covers + +The patterns in `config/config.yml` map to common attack categories that +ATR ships rules for: + +- ATR-PI-001 instruction override ("ignore previous instructions") +- ATR-PI-002 system prompt exfiltration ("reveal your system prompt") +- ATR-PI-003 role-play jailbreak ("act as DAN") +- ATR-PI-004 base64-wrapped payload hint +- ATR-MCP-001 MCP tool override markers +- ATR-SSRF-001 `file://` scheme reference + +Each entry is illustrative. The full ruleset and YAML schema live in the +ATR repository; this example exists so a NeMo Guardrails user can see the +shape of an agent-specific input rail without needing an external service. + +## Running the example + +From the project root: + +```bash +nemoguardrails chat --config=examples/configs/atr_threat_detection/config +``` + +A user message such as "Ignore all previous instructions" will trigger the +`regex check input` flow and the bot will respond with the refusal message +defined in `rails.co`. + +## Extending + +To run against the live ATR YAML ruleset, parse the rule files at startup +and append the `detection.regex_patterns` field of each rule to the +`patterns` list under `regex_detection.input`. diff --git a/examples/configs/atr_threat_detection/config/config.yml b/examples/configs/atr_threat_detection/config/config.yml new file mode 100644 index 0000000000..e602ccf0e2 --- /dev/null +++ b/examples/configs/atr_threat_detection/config/config.yml @@ -0,0 +1,27 @@ +models: [] + +# This example wires the built-in regex_detection rail to a small set of +# ATR-inspired threat patterns covering common AI agent attack categories. +# The full open detection set lives in Agent Threat Rules (Apache-2.0): +# https://github.com/Agent-Threat-Rule/agent-threat-rules +rails: + config: + regex_detection: + input: + case_insensitive: true + patterns: + # ATR-PI-001 instruction override + - "\\b(ignore|disregard|forget)\\s+(all\\s+)?(previous|prior|above)\\s+(instructions?|prompts?|rules?)" + # ATR-PI-002 system prompt exfiltration + - "(reveal|print|repeat|show)\\s+(your\\s+)?(system\\s+prompt|initial\\s+instructions)" + # ATR-PI-003 role-play jailbreak + - "\\b(you\\s+are\\s+now|act\\s+as|pretend\\s+to\\s+be)\\s+(DAN|developer\\s+mode|jailbroken|an?\\s+unrestricted)" + # ATR-PI-004 base64-wrapped payload hint + - "(decode|run|execute)\\s+(this\\s+)?base64[:\\s]+[A-Za-z0-9+/=]{40,}" + # ATR-MCP-001 mcp tool override + - "<\\s*(tool_override|mcp_override|new_tool_definition)\\s*>" + # ATR-SSRF-001 file:// scheme reference + - "file://[^\\s\"'<>]+" + input: + flows: + - regex check input diff --git a/examples/configs/atr_threat_detection/config/rails.co b/examples/configs/atr_threat_detection/config/rails.co new file mode 100644 index 0000000000..d8f4fcc029 --- /dev/null +++ b/examples/configs/atr_threat_detection/config/rails.co @@ -0,0 +1,17 @@ +# ATR-inspired threat detection rails example. +# +# The regex_detection input rail is wired in config.yml. The flows below +# define the refusal message used when the rail aborts and an additional +# custom flow that reuses the built-in DetectRegexMatchAction to surface +# the matched rule(s) to the caller. + +define bot refuse to respond + "I'm sorry, your request matched a threat detection rule and was blocked." + +define flow atr report match + """Optional flow: log the matched ATR rule(s) when the input rail fires.""" + $result = await DetectRegexMatchAction(source="input", text=$user_message) + if $result["is_match"] + $matched_rules = $result["detections"] + bot refuse to respond + abort From 1b1245f184309b53c8e83dd4a0d7def8301dd353 Mon Sep 17 00:00:00 2001 From: Panguard AI Date: Sun, 10 May 2026 19:30:24 +0800 Subject: [PATCH 2/4] docs(atr): document optional 'atr report match' flow per CodeRabbit nitpick Adds a YAML stanza showing how to enable the optional 'atr report match' flow that already ships in rails.co, so users can surface matched rule identifiers instead of only the generic refusal. Order note clarifies why 'atr report match' must come before 'regex check input'. --- examples/configs/atr_threat_detection/README.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/examples/configs/atr_threat_detection/README.md b/examples/configs/atr_threat_detection/README.md index 4b3f0d9b77..12b70689b8 100644 --- a/examples/configs/atr_threat_detection/README.md +++ b/examples/configs/atr_threat_detection/README.md @@ -39,3 +39,19 @@ defined in `rails.co`. To run against the live ATR YAML ruleset, parse the rule files at startup and append the `detection.regex_patterns` field of each rule to the `patterns` list under `regex_detection.input`. + +To also surface matched detections (so the bot can respond with the rule +identifier rather than only refusing), enable the optional `atr report +match` flow shipped in `rails.co` by adding it to your input flows in +`config/config.yml`: + +```yaml +rails: + input: + flows: + - atr report match + - regex check input +``` + +Order matters: `atr report match` runs before `regex check input` so the +matched rule id is available when the refusal message is generated. From 813f41c57ea7ed4e4653b51c0137499c6690c4be Mon Sep 17 00:00:00 2001 From: Adam Lin Date: Mon, 18 May 2026 06:55:22 +0800 Subject: [PATCH 3/4] fix(example): address review findings on atr_threat_detection Address Greptile/coderabbit P1 + P2 findings from @nvidia-nemo bot reviewers on this PR: 1. P1: `bot refuse to respond` redefinition in rails.co collided with the library default in nemoguardrails/library/regex/flows.v1.co. Under Colang 1.0 this made the refusal utterance non-deterministic with `models: []` and no LLM to arbitrate. Fix: delete rails.co entirely. The example now uses the library default refusal message ("I'm sorry, I can't respond to that."). 2. P2: `models: []` caused a runtime error in `nemoguardrails chat` for any benign user message (the runtime needs a main model when the input rail does not abort). Fix: add a main model stub (openai/gpt-4o-mini) so chat runs end-to-end. The input rail still blocks threats before the model is invoked, so the model only sees benign inputs. 3. P2: `atr report match` flow was defined in rails.co but never wired under rails.input.flows -- it was dormant code. Fix: removed (rails.co deleted). The README's Extending section now shows the custom-flow pattern with a non-conflicting bot utterance (`bot refuse atr_threat`) and a `AtrRuleMatchedRailException` event for downstream observers, so the documented pattern is correct. 4. P2: `$matched_rules = $result["detections"]` assigned but never referenced -- comment promised "log the matched ATR rule(s)" but no logging followed. Fix: removed (the dormant flow no longer exists). The Extending section's custom-flow example uses `$matched_rules` only to gate the event emission, and emits an `AtrRuleMatchedRailException` so downstream code can subscribe to it. 5. Documentation correction: README and config.yml both cited ATR as Apache-2.0 -- the actual license is MIT. Corrected both references. Net diff: - config.yml: add main model stub, fix license comment. - rails.co: removed (used library default refusal). - README.md: fix license, update behavior description, replace stale "atr report match" wiring instructions with a correct custom-flow example in the Extending section. Tests pass locally (no test files in this PR; existing pr-tests-matrix green on 3.10-3.13). Signed-off-by: Adam Lin --- .../configs/atr_threat_detection/README.md | 46 ++++++++++++------- .../atr_threat_detection/config/config.yml | 14 ++++-- .../atr_threat_detection/config/rails.co | 17 ------- 3 files changed, 41 insertions(+), 36 deletions(-) delete mode 100644 examples/configs/atr_threat_detection/config/rails.co diff --git a/examples/configs/atr_threat_detection/README.md b/examples/configs/atr_threat_detection/README.md index 12b70689b8..b3df3b4932 100644 --- a/examples/configs/atr_threat_detection/README.md +++ b/examples/configs/atr_threat_detection/README.md @@ -2,7 +2,7 @@ This example shows how to use the built-in `regex_detection` input rail with a small set of patterns inspired by Agent Threat Rules, an open -detection standard for AI agent threats published under Apache-2.0: +detection standard for AI agent threats published under the MIT license: https://github.com/Agent-Threat-Rule/agent-threat-rules @@ -31,8 +31,15 @@ nemoguardrails chat --config=examples/configs/atr_threat_detection/config ``` A user message such as "Ignore all previous instructions" will trigger the -`regex check input` flow and the bot will respond with the refusal message -defined in `rails.co`. +`regex check input` flow and the bot will respond with the library default +refusal message defined in `nemoguardrails/library/regex/flows.v1.co` +(`"I'm sorry, I can't respond to that."`). Benign messages are forwarded +to the configured main model. + +The `config.yml` lists `openai`/`gpt-4o-mini` as the main model so that +chat runs end-to-end. Replace with your preferred provider; the input +rail blocks threats before the model is invoked, so the model only sees +benign inputs. ## Extending @@ -40,18 +47,25 @@ To run against the live ATR YAML ruleset, parse the rule files at startup and append the `detection.regex_patterns` field of each rule to the `patterns` list under `regex_detection.input`. -To also surface matched detections (so the bot can respond with the rule -identifier rather than only refusing), enable the optional `atr report -match` flow shipped in `rails.co` by adding it to your input flows in -`config/config.yml`: - -```yaml -rails: - input: - flows: - - atr report match - - regex check input +To surface the matched rule id (rather than only refusing), add a custom +flow that calls `detect_regex_pattern` directly and emits a custom event: + +```colang +define bot refuse atr_threat + "I'm sorry, that request was blocked by an ATR input safety rule." + +define flow atr report match + $result = execute detect_regex_pattern(source="input", text=$user_message) + if $result["is_match"] + $matched_rules = $result["detections"] + create event AtrRuleMatchedRailException(message="ATR input rail blocked") + bot refuse atr_threat + stop ``` -Order matters: `atr report match` runs before `regex check input` so the -matched rule id is available when the refusal message is generated. +Then wire `atr report match` instead of `regex check input` under +`rails.input.flows`. The custom flow uses a non-conflicting bot utterance +(`bot refuse atr_threat`) so it does not collide with the library default, +and emits a `AtrRuleMatchedRailException` event that downstream observers +(audit logging, metrics) can subscribe to without parsing the refusal +text. diff --git a/examples/configs/atr_threat_detection/config/config.yml b/examples/configs/atr_threat_detection/config/config.yml index e602ccf0e2..866464abed 100644 --- a/examples/configs/atr_threat_detection/config/config.yml +++ b/examples/configs/atr_threat_detection/config/config.yml @@ -1,9 +1,17 @@ -models: [] - # This example wires the built-in regex_detection rail to a small set of # ATR-inspired threat patterns covering common AI agent attack categories. -# The full open detection set lives in Agent Threat Rules (Apache-2.0): +# The full open detection set lives in Agent Threat Rules (MIT-licensed): # https://github.com/Agent-Threat-Rule/agent-threat-rules +# +# A main model is configured so `nemoguardrails chat` runs end-to-end against +# this example. Replace the engine/model with your preferred provider; the +# input rail blocks threats before the model is invoked, so the model is only +# called for benign user messages. +models: + - type: main + engine: openai + model: gpt-4o-mini + rails: config: regex_detection: diff --git a/examples/configs/atr_threat_detection/config/rails.co b/examples/configs/atr_threat_detection/config/rails.co deleted file mode 100644 index d8f4fcc029..0000000000 --- a/examples/configs/atr_threat_detection/config/rails.co +++ /dev/null @@ -1,17 +0,0 @@ -# ATR-inspired threat detection rails example. -# -# The regex_detection input rail is wired in config.yml. The flows below -# define the refusal message used when the rail aborts and an additional -# custom flow that reuses the built-in DetectRegexMatchAction to surface -# the matched rule(s) to the caller. - -define bot refuse to respond - "I'm sorry, your request matched a threat detection rule and was blocked." - -define flow atr report match - """Optional flow: log the matched ATR rule(s) when the input rail fires.""" - $result = await DetectRegexMatchAction(source="input", text=$user_message) - if $result["is_match"] - $matched_rules = $result["detections"] - bot refuse to respond - abort From 29beb3ba5e4dcad822474f17931d7073f26fc104 Mon Sep 17 00:00:00 2001 From: Adam Lin Date: Mon, 18 May 2026 07:18:14 +0800 Subject: [PATCH 4/4] docs(example): fix Extending snippet to use canonical if/else pattern Address greptile P1 follow-up on the README's Extending section: In Colang 1.0, the rails event loop appends bot utterances to `responses` but the subsequent branch that handles `exception = event` short-circuits and never emits `responses`, so combining `create event ...RailException` and `bot refuse atr_threat` in the same flow silently drops the refusal. The canonical pattern (e.g. examples/configs/guardrails_only/input/ config.co's `dummy input rail`) gates on `$config.enable_rails_exceptions` and uses an `if/else` to emit **either** the exception event **or** the bot utterance. Updated the README Extending snippet to follow that pattern, with an explicit note about the dropped-utterance behavior so future readers do not repeat the mistake. Also added a short paragraph explaining how to capture `$matched_rules` for downstream audit without breaking the canonical single-action branches. PR description: corrected from "Apache-2.0" to "MIT license" so it matches the README (ATR is MIT-licensed per LICENSE and package.json). Signed-off-by: Adam Lin --- .../configs/atr_threat_detection/README.md | 31 +++++++++++++------ 1 file changed, 22 insertions(+), 9 deletions(-) diff --git a/examples/configs/atr_threat_detection/README.md b/examples/configs/atr_threat_detection/README.md index b3df3b4932..d76a679c0c 100644 --- a/examples/configs/atr_threat_detection/README.md +++ b/examples/configs/atr_threat_detection/README.md @@ -47,8 +47,14 @@ To run against the live ATR YAML ruleset, parse the rule files at startup and append the `detection.regex_patterns` field of each rule to the `patterns` list under `regex_detection.input`. -To surface the matched rule id (rather than only refusing), add a custom -flow that calls `detect_regex_pattern` directly and emits a custom event: +To surface a custom signal (rather than only refusing), add a custom +flow that calls `detect_regex_pattern` directly. Follow the library's +established `if $config.enable_rails_exceptions` pattern (see +`examples/configs/guardrails_only/input/config.co`) so the flow emits +**either** the exception event **or** the bot utterance, not both — in +Colang 1.0 the rails event loop short-circuits on the exception and +drops the bot utterance from the response if both fire in the same +flow. ```colang define bot refuse atr_threat @@ -57,15 +63,22 @@ define bot refuse atr_threat define flow atr report match $result = execute detect_regex_pattern(source="input", text=$user_message) if $result["is_match"] - $matched_rules = $result["detections"] - create event AtrRuleMatchedRailException(message="ATR input rail blocked") - bot refuse atr_threat + if $config.enable_rails_exceptions + create event AtrRuleMatchedRailException(message="ATR input rail blocked") + else + bot refuse atr_threat stop ``` Then wire `atr report match` instead of `regex check input` under `rails.input.flows`. The custom flow uses a non-conflicting bot utterance -(`bot refuse atr_threat`) so it does not collide with the library default, -and emits a `AtrRuleMatchedRailException` event that downstream observers -(audit logging, metrics) can subscribe to without parsing the refusal -text. +(`bot refuse atr_threat`) so it does not collide with the library +default, and emits a `AtrRuleMatchedRailException` event when +`enable_rails_exceptions` is set so downstream observers (audit logging, +metrics) can subscribe to it. + +If you also want to capture the matched rule list for audit, assign +`$matched_rules = $result["detections"]` before the if/else and pass it +through your own action call or to the event message — keep the +exception/utterance branches single-action to preserve the canonical +event-loop semantics.