Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
84 changes: 84 additions & 0 deletions examples/configs/atr_threat_detection/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
# ATR-inspired threat detection example

This example shows how to use the built-in `regex_detection` input rail
with a small set of patterns inspired by Agent Threat Rules, an open
detection standard for AI agent threats published under the MIT license:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 The README states the ATR project is published under the MIT license, but the PR description says it is Apache-2.0. A user relying on this file to assess license compatibility for their project will get incorrect information.

Suggested change
detection standard for AI agent threats published under the MIT license:
detection standard for AI agent threats published under the Apache-2.0 license:
Prompt To Fix With AI
This is a comment left during a code review.
Path: examples/configs/atr_threat_detection/README.md
Line: 5

Comment:
The README states the ATR project is published under the MIT license, but the PR description says it is Apache-2.0. A user relying on this file to assess license compatibility for their project will get incorrect information.

```suggestion
detection standard for AI agent threats published under the Apache-2.0 license:
```

How can I resolve this? If you propose a fix, please make it concise.


https://github.com/Agent-Threat-Rule/agent-threat-rules

## What it covers

The patterns in `config/config.yml` map to common attack categories that
ATR ships rules for:

- ATR-PI-001 instruction override ("ignore previous instructions")
- ATR-PI-002 system prompt exfiltration ("reveal your system prompt")
- ATR-PI-003 role-play jailbreak ("act as DAN")
- ATR-PI-004 base64-wrapped payload hint
- ATR-MCP-001 MCP tool override markers
- ATR-SSRF-001 `file://` scheme reference

Each entry is illustrative. The full ruleset and YAML schema live in the
ATR repository; this example exists so a NeMo Guardrails user can see the
shape of an agent-specific input rail without needing an external service.

## Running the example

From the project root:

```bash
nemoguardrails chat --config=examples/configs/atr_threat_detection/config
```

A user message such as "Ignore all previous instructions" will trigger the
`regex check input` flow and the bot will respond with the library default
refusal message defined in `nemoguardrails/library/regex/flows.v1.co`
(`"I'm sorry, I can't respond to that."`). Benign messages are forwarded
to the configured main model.

The `config.yml` lists `openai`/`gpt-4o-mini` as the main model so that
chat runs end-to-end. Replace with your preferred provider; the input
rail blocks threats before the model is invoked, so the model only sees
benign inputs.

## Extending

To run against the live ATR YAML ruleset, parse the rule files at startup
and append the `detection.regex_patterns` field of each rule to the
`patterns` list under `regex_detection.input`.

To surface a custom signal (rather than only refusing), add a custom
flow that calls `detect_regex_pattern` directly. Follow the library's
established `if $config.enable_rails_exceptions` pattern (see
`examples/configs/guardrails_only/input/config.co`) so the flow emits
**either** the exception event **or** the bot utterance, not both — in
Colang 1.0 the rails event loop short-circuits on the exception and
drops the bot utterance from the response if both fire in the same
flow.

```colang
define bot refuse atr_threat
"I'm sorry, that request was blocked by an ATR input safety rule."

define flow atr report match
$result = execute detect_regex_pattern(source="input", text=$user_message)
if $result["is_match"]
if $config.enable_rails_exceptions
create event AtrRuleMatchedRailException(message="ATR input rail blocked")
else
bot refuse atr_threat
stop
```
Comment on lines +59 to +71

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Extending snippet silently drops the custom refusal message

When both create event AtrRuleMatchedRailException and bot refuse atr_threat fire in the same flow, the Colang 1.0 runtime's event loop (see llmrails.py lines 946–954) appends the bot script to responses but also sets exception = event. The subsequent branch at line 986 (if exception: new_message = {"role": "exception", ...}) short-circuits and never uses responses, so the caller receives an exception object instead of the custom refusal text.

The established pattern (e.g. guardrails_only/input/config.co) guards on $config.enable_rails_exceptions and uses an if/else to emit either the exception event or the bot utterance — not both unconditionally.

Prompt To Fix With AI
This is a comment left during a code review.
Path: examples/configs/atr_threat_detection/README.md
Line: 53-64

Comment:
**Extending snippet silently drops the custom refusal message**

When both `create event AtrRuleMatchedRailException` and `bot refuse atr_threat` fire in the same flow, the Colang 1.0 runtime's event loop (see `llmrails.py` lines 946–954) appends the bot script to `responses` but also sets `exception = event`. The subsequent branch at line 986 (`if exception: new_message = {"role": "exception", ...}`) short-circuits and never uses `responses`, so the caller receives an exception object instead of the custom refusal text.

The established pattern (e.g. `guardrails_only/input/config.co`) guards on `$config.enable_rails_exceptions` and uses an `if/else` to emit **either** the exception event **or** the bot utterance — not both unconditionally.

How can I resolve this? If you propose a fix, please make it concise.


Then wire `atr report match` instead of `regex check input` under
`rails.input.flows`. The custom flow uses a non-conflicting bot utterance
(`bot refuse atr_threat`) so it does not collide with the library
default, and emits a `AtrRuleMatchedRailException` event when
`enable_rails_exceptions` is set so downstream observers (audit logging,
metrics) can subscribe to it.

If you also want to capture the matched rule list for audit, assign
`$matched_rules = $result["detections"]` before the if/else and pass it
through your own action call or to the event message — keep the
exception/utterance branches single-action to preserve the canonical
event-loop semantics.
35 changes: 35 additions & 0 deletions examples/configs/atr_threat_detection/config/config.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# This example wires the built-in regex_detection rail to a small set of
# ATR-inspired threat patterns covering common AI agent attack categories.
# The full open detection set lives in Agent Threat Rules (MIT-licensed):
# https://github.com/Agent-Threat-Rule/agent-threat-rules
#
# A main model is configured so `nemoguardrails chat` runs end-to-end against
# this example. Replace the engine/model with your preferred provider; the
# input rail blocks threats before the model is invoked, so the model is only
# called for benign user messages.
models:
- type: main
engine: openai
model: gpt-4o-mini

rails:
config:
regex_detection:
input:
case_insensitive: true
patterns:
# ATR-PI-001 instruction override
- "\\b(ignore|disregard|forget)\\s+(all\\s+)?(previous|prior|above)\\s+(instructions?|prompts?|rules?)"
# ATR-PI-002 system prompt exfiltration
- "(reveal|print|repeat|show)\\s+(your\\s+)?(system\\s+prompt|initial\\s+instructions)"
# ATR-PI-003 role-play jailbreak
- "\\b(you\\s+are\\s+now|act\\s+as|pretend\\s+to\\s+be)\\s+(DAN|developer\\s+mode|jailbroken|an?\\s+unrestricted)"
# ATR-PI-004 base64-wrapped payload hint
- "(decode|run|execute)\\s+(this\\s+)?base64[:\\s]+[A-Za-z0-9+/=]{40,}"
# ATR-MCP-001 mcp tool override
- "<\\s*(tool_override|mcp_override|new_tool_definition)\\s*>"
# ATR-SSRF-001 file:// scheme reference
- "file://[^\\s\"'<>]+"
input:
flows:
- regex check input
Loading