diff --git a/examples/configs/atr_threat_detection/README.md b/examples/configs/atr_threat_detection/README.md
new file mode 100644
index 0000000000..d76a679c0c
--- /dev/null
+++ b/examples/configs/atr_threat_detection/README.md
@@ -0,0 +1,84 @@
+# ATR-inspired threat detection example
+
+This example shows how to use the built-in `regex_detection` input rail
+with a small set of patterns inspired by Agent Threat Rules, an open
+detection standard for AI agent threats published under the MIT license:
+
+https://github.com/Agent-Threat-Rule/agent-threat-rules
+
+## What it covers
+
+The patterns in `config/config.yml` map to common attack categories that
+ATR ships rules for:
+
+- ATR-PI-001 instruction override ("ignore previous instructions")
+- ATR-PI-002 system prompt exfiltration ("reveal your system prompt")
+- ATR-PI-003 role-play jailbreak ("act as DAN")
+- ATR-PI-004 base64-wrapped payload hint
+- ATR-MCP-001 MCP tool override markers
+- ATR-SSRF-001 `file://` scheme reference
+
+Each entry is illustrative. The full ruleset and YAML schema live in the
+ATR repository; this example exists so a NeMo Guardrails user can see the
+shape of an agent-specific input rail without needing an external service.
+
+## Running the example
+
+From the project root:
+
+```bash
+nemoguardrails chat --config=examples/configs/atr_threat_detection/config
+```
+
+A user message such as "Ignore all previous instructions" will trigger the
+`regex check input` flow and the bot will respond with the library default
+refusal message defined in `nemoguardrails/library/regex/flows.v1.co`
+(`"I'm sorry, I can't respond to that."`). Benign messages are forwarded
+to the configured main model.
+
+The `config.yml` lists `openai`/`gpt-4o-mini` as the main model so that
+chat runs end-to-end. Replace with your preferred provider; the input
+rail blocks threats before the model is invoked, so the model only sees
+benign inputs.
+
+## Extending
+
+To run against the live ATR YAML ruleset, parse the rule files at startup
+and append the `detection.regex_patterns` field of each rule to the
+`patterns` list under `regex_detection.input`.
+
+To surface a custom signal (rather than only refusing), add a custom
+flow that calls `detect_regex_pattern` directly. Follow the library's
+established `if $config.enable_rails_exceptions` pattern (see
+`examples/configs/guardrails_only/input/config.co`) so the flow emits
+**either** the exception event **or** the bot utterance, not both. In
+Colang 1.0 the rails event loop short-circuits on the exception and
+drops the bot utterance from the response if both fire in the same
+flow.
+
+```colang
+define bot refuse atr_threat
+  "I'm sorry, that request was blocked by an ATR input safety rule."
+
+define flow atr report match
+  $result = execute detect_regex_pattern(source="input", text=$user_message)
+  if $result["is_match"]
+    if $config.enable_rails_exceptions
+      create event AtrRuleMatchedRailException(message="ATR input rail blocked")
+    else
+      bot refuse atr_threat
+    stop
+```
+
+Then wire `atr report match` instead of `regex check input` under
+`rails.input.flows`. The custom flow uses a non-conflicting bot utterance
+(`bot refuse atr_threat`) so it does not collide with the library
+default, and emits an `AtrRuleMatchedRailException` event when
+`enable_rails_exceptions` is set so downstream observers (audit logging,
+metrics) can subscribe to it.
+
+If you also want to capture the matched rule list for audit, assign
+`$matched_rules = $result["detections"]` before the if/else and pass it
+through your own action call or to the event message. Keep the
+exception/utterance branches single-action to preserve the canonical
+event-loop semantics.
diff --git a/examples/configs/atr_threat_detection/config/config.yml b/examples/configs/atr_threat_detection/config/config.yml
new file mode 100644
index 0000000000..866464abed
--- /dev/null
+++ b/examples/configs/atr_threat_detection/config/config.yml
@@ -0,0 +1,35 @@
+# This example wires the built-in regex_detection rail to a small set of
+# ATR-inspired threat patterns covering common AI agent attack categories.
+# The full open detection set lives in Agent Threat Rules (MIT-licensed):
+# https://github.com/Agent-Threat-Rule/agent-threat-rules
+#
+# A main model is configured so `nemoguardrails chat` runs end-to-end against
+# this example. Replace the engine/model with your preferred provider; the
+# input rail blocks threats before the model is invoked, so the model is only
+# called for benign user messages.
+models:
+  - type: main
+    engine: openai
+    model: gpt-4o-mini
+
+rails:
+  config:
+    regex_detection:
+      input:
+        case_insensitive: true
+        patterns:
+          # ATR-PI-001 instruction override
+          - "\\b(ignore|disregard|forget)\\s+(all\\s+)?(previous|prior|above)\\s+(instructions?|prompts?|rules?)"
+          # ATR-PI-002 system prompt exfiltration
+          - "(reveal|print|repeat|show)\\s+(your\\s+)?(system\\s+prompt|initial\\s+instructions)"
+          # ATR-PI-003 role-play jailbreak
+          - "\\b(you\\s+are\\s+now|act\\s+as|pretend\\s+to\\s+be)\\s+(DAN|developer\\s+mode|jailbroken|an?\\s+unrestricted)"
+          # ATR-PI-004 base64-wrapped payload hint
+          - "(decode|run|execute)\\s+(this\\s+)?base64[:\\s]+[A-Za-z0-9+/=]{40,}"
+          # ATR-MCP-001 mcp tool override
+          - "<\\s*(tool_override|mcp_override|new_tool_definition)\\s*>"
+          # ATR-SSRF-001 file:// scheme reference
+          - "file://[^\\s\"'<>]+"
+  input:
+    flows:
+      - regex check input
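
The README's "Extending" section describes appending each rule's `detection.regex_patterns` to the `patterns` list under `regex_detection.input`. A minimal Python sketch of that merge step follows. It assumes the rule files have already been parsed into plain dicts (for example by a YAML parser) and that the configuration is held as a dict mirroring `config.yml`. The helper names `collect_patterns` and `extend_regex_rail`, and the sample rule IDs, are hypothetical and not part of NeMo Guardrails or ATR.

```python
import copy
import re
from typing import Any, Iterable


def collect_patterns(rules: Iterable[dict[str, Any]]) -> list[str]:
    """Return the regex patterns found in parsed rule documents.

    Assumes each rule is a dict with an optional ``detection.regex_patterns``
    list, the field named in the README. Patterns that fail to compile are
    skipped, and duplicates are dropped while keeping first-seen order.
    """
    patterns: list[str] = []
    seen: set[str] = set()
    for rule in rules:
        detection = rule.get("detection") or {}
        for pattern in detection.get("regex_patterns") or []:
            if pattern in seen:
                continue
            try:
                re.compile(pattern)
            except re.error:
                continue  # skip patterns Python's re module cannot compile
            seen.add(pattern)
            patterns.append(pattern)
    return patterns


def extend_regex_rail(
    config: dict[str, Any], rules: Iterable[dict[str, Any]]
) -> dict[str, Any]:
    """Return a copy of ``config`` with rule patterns appended to the input rail."""
    merged = copy.deepcopy(config)  # leave the caller's config untouched
    rail = (
        merged.setdefault("rails", {})
        .setdefault("config", {})
        .setdefault("regex_detection", {})
        .setdefault("input", {})
    )
    existing = rail.setdefault("patterns", [])
    for pattern in collect_patterns(rules):
        if pattern not in existing:
            existing.append(pattern)
    return merged


# Hypothetical data standing in for the parsed config.yml and ATR rule files.
base_config = {
    "rails": {"config": {"regex_detection": {"input": {
        "case_insensitive": True,
        "patterns": ["file://[^\\s\"'<>]+"],
    }}}}
}
sample_rules = [
    {"id": "EXAMPLE-001",
     "detection": {"regex_patterns": [r"\bact\s+as\s+DAN\b", "("]}},
    {"id": "EXAMPLE-002",
     "detection": {"regex_patterns": ["file://[^\\s\"'<>]+"]}},
    {"id": "EXAMPLE-003"},  # a rule without a detection block is ignored
]
extended = extend_regex_rail(base_config, sample_rules)
print(extended["rails"]["config"]["regex_detection"]["input"]["patterns"])
```

Validating each pattern with `re.compile` before merging is a design choice of this sketch: a rule written for another regex dialect would otherwise only fail once the rail runs. How the merged dict is handed to the rails configuration loader is left out here, since it depends on how the application loads its config.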