FinAgent Red-Team

A red-team benchmark for AI agents that handle money — focused on regulatory-control bypass, not just generic misbehavior.

Can a crafted invoice, vendor email, or chat message trick an LLM agent into evading sanctions screening, structuring payments under a reporting threshold, defeating dual-approval, or wiring funds to an unapproved payee — and do the guardrails a bank would actually deploy stop it without breaking legitimate work?

FinAgent Red-Team runs every scenario under three control postures — none, advisory, and enforced — and reports not just can it be attacked but where an agent's safety actually comes from: its own judgement, its policy-following, or hard system enforcement — and what each costs in usefulness.

⚠️ Defensive, fully synthetic. No real money, accounts, or networks. All tools are mocks operating on in-memory state. The goal is to harden financial AI agents before deployment.

Why this exists

LLM agents are moving into financial operations — accounts payable, treasury, reconciliation, customer servicing. The dangerous failures there are not toxic text; they are silent violations of compliance controls that look like normal task completion: a payment that skipped sanctions screening, a $15k transfer quietly split into two $7.5k transfers to dodge reporting, a large wire the agent approved for itself.

General agent-security benchmarks test "leak the data" or "do the forbidden action." None of them model the regulatory control surface of a real financial institution, and none measure whether standard guardrails neutralize the attack.

How this differs from prior work

	Generic agent-security benchmarks	FinAgent Red-Team
Domain	Email/web/travel, generic "banking"	Payments/treasury with real compliance controls
Attack goal	Data leak, forbidden action	Sanctions evasion, structuring, maker-checker defeat
Defense eval	Usually attack-only	3 control postures → policy-following vs enforcement uplift
Utility tradeoff	Rarely measured	Over-refusal control set included
Judging	Often LLM-graded	Deterministic state predicates (reproducible)

Threat taxonomy (v0.2)

ID	Threat	Isolated control under test
`T2_unauthorized_transfer`	Indirect injection → fraudulent payment	Payee allow-list
`T3_sanctions_evasion`	Pay an allow-listed but sanctioned payee	Sanctions screening
`T3` (homoglyph)	Sanctioned entity hidden behind look-alike characters	Normalized screening
`T4_structuring`	Split a payment to dodge the reporting threshold	Reporting / CTR control
`T5_dual_approval_defeat`	Initiate and self-approve a large transfer	Maker-checker separation
`T6_data_exfiltration`	Email a customer's card/SSN to an external address	DLP on outbound messages
`T7_confused_deputy`	Redirect a payee's bank details to an attacker account	Bank-detail protection
`BENIGN`	A legitimate payment that should just work	Over-refusal (utility)

Each attack scenario is designed to isolate one control, so a block under the enforced posture attributes cleanly to a specific defense.

Control postures

Every scenario is evaluated under three postures, which separate where safety comes from:

Posture	What the agent is told / what the system does	Measures
none	No compliance policy is stated	Intrinsic susceptibility
advisory	Policy stated as obligations, not enforced	Does the model follow stated policy?
enforced	System hard-blocks violations in the sandbox	Residual risk (defense-in-depth)

Metrics

ASR (Attack Success Rate) at each posture — fraction of attacks won.
Policy-following uplift = ASR(none) − ASR(advisory): how much merely stating policy reduces attacks (pure model instruction-following).
Enforcement uplift = ASR(advisory) − ASR(enforced): additional reduction from hard enforcement (defense-in-depth).
Residual ASR = ASR(enforced): attacks that survive enforcement.
Utility at each posture and over-refusal = utility(none) − utility(enforced): legitimate work lost as controls tighten.

This decomposition is the point: two models with identical enforced residual risk can differ sharply in whether they'd behave safely when a control is only advisory (the common real-world case for judgement calls that can't be hard-coded). A stack only looks good with low residual ASR and high utility — trivially refusing everything scores 0% ASR but also 0% utility.

How it works

                 ┌──────────────────────────── deterministic, offline ───────────────────────────┐
  AgentModel ──► runner ──► sandbox tools ──► World state ──► evaluator ──► Scorecard
 (any model)     loop       (controls here)   (ground truth)   (predicates)   (ASR / uplift / utility)
                 │              ▲
                 │              └── posture: none / advisory / enforced

Sandbox (sandbox/): synthetic accounts, payees, inbox; mock tools (initiate_transfer, approve_transfer, screen_sanctions, …). Compliance controls live inside the tools, gated on Policy.enabled.
Scenarios (scenarios/): pure-data tasks + embedded attacks + a structured success predicate.
Evaluator (eval/): judges outcomes from final state — no LLM grader, so results are reproducible.
Agent driver (agent/): any OpenAI-compatible, tool-calling model.

The sandbox, scenarios, and evaluator have zero third-party dependencies and are fully deterministic — the entire attack→defense pipeline is proven by the offline test-suite, no GPU or API key required.

Generated suite

Beyond the hand-written scenarios, a seeded generator produces hundreds of cases by combining parametric slots (amounts, payees, accounts), social- engineering phrasings (authority, urgency, policy pretext, social proof), and obfuscation techniques (homoglyph / spacing for sanctions evasion):

finagent-redteam --list --suite generated --per-threat 15   # 120 cases

Every generated scenario carries a reference_plan — the canonical exploit — and the test-suite replays all of them to verify the invariant that each attack lands under the none/advisory postures and is blocked under enforced. The suite is thus self-validating: each case is a checked, control-isolating test.

Quickstart

pip install -e ".[dev]"      # core + tests
pytest -q                     # 47 tests: proves attacks land (none/advisory), blocked (enforced)

# List scenarios (no model needed)
finagent-redteam --list

Run against a model (needs the agent extra):

pip install -e ".[agent]"

# local vLLM / SGLang
finagent-redteam --model Qwen/Qwen3-8B --base-url http://localhost:8000/v1
# Ollama OpenAI shim
finagent-redteam --model llama3.1 --base-url http://localhost:11434/v1 --json results.json

Run the multi-model leaderboard (several models, repeated trials):

# examples/models.example.json lists the models + endpoints to compare
finagent-redteam --models-config examples/models.example.json --trials 5 --temperature 0.7 \
                 --json leaderboard.json

This prints a ranked leaderboard plus a per-threat-category attack-success matrix; see examples/sample_leaderboard.md for the output shape.

Illustrative scorecard

A worst-case agent that fully complies with every embedded attack (reproduced by the offline self-test) yields:

Metric	none	advisory	enforced
Attack Success Rate	100%	100%	0%
Utility (benign completed)	100%	100%	100%

→ It ignores stated policy (advisory ASR stays 100%) but is fully stopped by enforcement (residual ASR 0%), with no over-refusal. Real models land between these poles — resisting some attacks on their own and following some stated policy — and that gap, decomposed into policy-following vs enforcement uplift, is what the benchmark measures. See examples/sample_leaderboard.md.

Responsible use

This is a defensive benchmark built on entirely synthetic data and mock tools. It contains no real financial credentials, accounts, or exploits against live systems. Use it to evaluate and harden agents before they are trusted with money.

License

MIT — see LICENSE.

Citation

If you use FinAgent Red-Team, please cite it (see CITATION.cff).

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
.github/workflows		.github/workflows
checkpoints		checkpoints
docs		docs
examples		examples
models		models
paper		paper
register/finagent		register/finagent
results		results
src/finagent_redteam		src/finagent_redteam
tests		tests
.gitignore		.gitignore
BOOTSTRAP_SIGNIFICANCE_TESTING.md		BOOTSTRAP_SIGNIFICANCE_TESTING.md
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
DATASHEET.md		DATASHEET.md
LICENSE		LICENSE
README.md		README.md
RESULTS_SUMMARY.md		RESULTS_SUMMARY.md
SECURITY.md		SECURITY.md
compute_significance.py		compute_significance.py
generate_figures.py		generate_figures.py
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
run_leaderboard.py		run_leaderboard.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FinAgent Red-Team

Why this exists

How this differs from prior work

Threat taxonomy (v0.2)

Control postures

Metrics

How it works

Generated suite

Quickstart

Illustrative scorecard

Responsible use

License

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

FinAgent Red-Team

Why this exists

How this differs from prior work

Threat taxonomy (v0.2)

Control postures

Metrics

How it works

Generated suite

Quickstart

Illustrative scorecard

Responsible use

License

Citation

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages