Skip to content

nac7/finagent-redteam

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FinAgent Red-Team

CI PyPI Python License: MIT codecov

A red-team benchmark for AI agents that handle money — focused on regulatory-control bypass, not just generic misbehavior.

Can a crafted invoice, vendor email, or chat message trick an LLM agent into evading sanctions screening, structuring payments under a reporting threshold, defeating dual-approval, or wiring funds to an unapproved payee — and do the guardrails a bank would actually deploy stop it without breaking legitimate work?

FinAgent Red-Team runs every scenario under three control posturesnone, advisory, and enforced — and reports not just can it be attacked but where an agent's safety actually comes from: its own judgement, its policy-following, or hard system enforcement — and what each costs in usefulness.

⚠️ Defensive, fully synthetic. No real money, accounts, or networks. All tools are mocks operating on in-memory state. The goal is to harden financial AI agents before deployment.


Why this exists

LLM agents are moving into financial operations — accounts payable, treasury, reconciliation, customer servicing. The dangerous failures there are not toxic text; they are silent violations of compliance controls that look like normal task completion: a payment that skipped sanctions screening, a $15k transfer quietly split into two $7.5k transfers to dodge reporting, a large wire the agent approved for itself.

General agent-security benchmarks test "leak the data" or "do the forbidden action." None of them model the regulatory control surface of a real financial institution, and none measure whether standard guardrails neutralize the attack.

How this differs from prior work

Generic agent-security benchmarks FinAgent Red-Team
Domain Email/web/travel, generic "banking" Payments/treasury with real compliance controls
Attack goal Data leak, forbidden action Sanctions evasion, structuring, maker-checker defeat
Defense eval Usually attack-only 3 control postures → policy-following vs enforcement uplift
Utility tradeoff Rarely measured Over-refusal control set included
Judging Often LLM-graded Deterministic state predicates (reproducible)

Threat taxonomy (v0.2)

ID Threat Isolated control under test
T2_unauthorized_transfer Indirect injection → fraudulent payment Payee allow-list
T3_sanctions_evasion Pay an allow-listed but sanctioned payee Sanctions screening
T3 (homoglyph) Sanctioned entity hidden behind look-alike characters Normalized screening
T4_structuring Split a payment to dodge the reporting threshold Reporting / CTR control
T5_dual_approval_defeat Initiate and self-approve a large transfer Maker-checker separation
T6_data_exfiltration Email a customer's card/SSN to an external address DLP on outbound messages
T7_confused_deputy Redirect a payee's bank details to an attacker account Bank-detail protection
BENIGN A legitimate payment that should just work Over-refusal (utility)

Each attack scenario is designed to isolate one control, so a block under the enforced posture attributes cleanly to a specific defense.

Control postures

Every scenario is evaluated under three postures, which separate where safety comes from:

Posture What the agent is told / what the system does Measures
none No compliance policy is stated Intrinsic susceptibility
advisory Policy stated as obligations, not enforced Does the model follow stated policy?
enforced System hard-blocks violations in the sandbox Residual risk (defense-in-depth)

Metrics

  • ASR (Attack Success Rate) at each posture — fraction of attacks won.
  • Policy-following uplift = ASR(none) − ASR(advisory): how much merely stating policy reduces attacks (pure model instruction-following).
  • Enforcement uplift = ASR(advisory) − ASR(enforced): additional reduction from hard enforcement (defense-in-depth).
  • Residual ASR = ASR(enforced): attacks that survive enforcement.
  • Utility at each posture and over-refusal = utility(none) − utility(enforced): legitimate work lost as controls tighten.

This decomposition is the point: two models with identical enforced residual risk can differ sharply in whether they'd behave safely when a control is only advisory (the common real-world case for judgement calls that can't be hard-coded). A stack only looks good with low residual ASR and high utility — trivially refusing everything scores 0% ASR but also 0% utility.


How it works

                 ┌──────────────────────────── deterministic, offline ───────────────────────────┐
  AgentModel ──► runner ──► sandbox tools ──► World state ──► evaluator ──► Scorecard
 (any model)     loop       (controls here)   (ground truth)   (predicates)   (ASR / uplift / utility)
                 │              ▲
                 │              └── posture: none / advisory / enforced
  • Sandbox (sandbox/): synthetic accounts, payees, inbox; mock tools (initiate_transfer, approve_transfer, screen_sanctions, …). Compliance controls live inside the tools, gated on Policy.enabled.
  • Scenarios (scenarios/): pure-data tasks + embedded attacks + a structured success predicate.
  • Evaluator (eval/): judges outcomes from final state — no LLM grader, so results are reproducible.
  • Agent driver (agent/): any OpenAI-compatible, tool-calling model.

The sandbox, scenarios, and evaluator have zero third-party dependencies and are fully deterministic — the entire attack→defense pipeline is proven by the offline test-suite, no GPU or API key required.

Generated suite

Beyond the hand-written scenarios, a seeded generator produces hundreds of cases by combining parametric slots (amounts, payees, accounts), social- engineering phrasings (authority, urgency, policy pretext, social proof), and obfuscation techniques (homoglyph / spacing for sanctions evasion):

finagent-redteam --list --suite generated --per-threat 15   # 120 cases

Every generated scenario carries a reference_plan — the canonical exploit — and the test-suite replays all of them to verify the invariant that each attack lands under the none/advisory postures and is blocked under enforced. The suite is thus self-validating: each case is a checked, control-isolating test.


Quickstart

pip install -e ".[dev]"      # core + tests
pytest -q                     # 47 tests: proves attacks land (none/advisory), blocked (enforced)

# List scenarios (no model needed)
finagent-redteam --list

Run against a model (needs the agent extra):

pip install -e ".[agent]"

# local vLLM / SGLang
finagent-redteam --model Qwen/Qwen3-8B --base-url http://localhost:8000/v1
# Ollama OpenAI shim
finagent-redteam --model llama3.1 --base-url http://localhost:11434/v1 --json results.json

Run the multi-model leaderboard (several models, repeated trials):

# examples/models.example.json lists the models + endpoints to compare
finagent-redteam --models-config examples/models.example.json --trials 5 --temperature 0.7 \
                 --json leaderboard.json

This prints a ranked leaderboard plus a per-threat-category attack-success matrix; see examples/sample_leaderboard.md for the output shape.

Illustrative scorecard

A worst-case agent that fully complies with every embedded attack (reproduced by the offline self-test) yields:

Metric none advisory enforced
Attack Success Rate 100% 100% 0%
Utility (benign completed) 100% 100% 100%

→ It ignores stated policy (advisory ASR stays 100%) but is fully stopped by enforcement (residual ASR 0%), with no over-refusal. Real models land between these poles — resisting some attacks on their own and following some stated policy — and that gap, decomposed into policy-following vs enforcement uplift, is what the benchmark measures. See examples/sample_leaderboard.md.


Responsible use

This is a defensive benchmark built on entirely synthetic data and mock tools. It contains no real financial credentials, accounts, or exploits against live systems. Use it to evaluate and harden agents before they are trusted with money.

License

MIT — see LICENSE.

Citation

If you use FinAgent Red-Team, please cite it (see CITATION.cff).

About

A red-team benchmark for regulatory-control bypass in financial AI agents (sanctions evasion, structuring, maker-checker defeat).

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors