Comprehensive benchmark suite for the AG-UI (Agent-User Interaction) protocol across multiple AI agent frameworks. This is a research guide documenting protocol compliance, performance, and capabilities.
AG-UI is an open, lightweight, event-based protocol for agent-user interaction created by CopilotKit. It enables framework-agnostic communication between AI agents and user interfaces through Server-Sent Events (SSE).
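Over SSE, each `data:` line carries one JSON-encoded protocol event. A minimal sketch (not the official AG-UI client SDK) of decoding such a stream, with illustrative payload fields:

```python
import json

def parse_sse(lines):
    """Yield decoded AG-UI events from raw SSE 'data:' lines."""
    for line in lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Simulated stream; real agents serve this over HTTP as text/event-stream.
stream = [
    'data: {"type": "RUN_STARTED", "runId": "run-1"}',
    'data: {"type": "TEXT_MESSAGE_CONTENT", "delta": "Hello"}',
    'data: {"type": "RUN_FINISHED", "runId": "run-1"}',
]

events = list(parse_sse(stream))
print([e["type"] for e in events])
```

The event `type` names come from the AG-UI spec; everything else here (field names, IDs) is an assumption for illustration.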
This benchmark provides rigorous testing and documentation of:
- Protocol Compliance - Which frameworks support which AG-UI events
- Framework Capabilities - What features each framework implements
- Performance Characteristics - Response times, throughput, tool calling
- HITL Implementation - How human-in-the-loop workflows work via the protocol
- AGUI Spec Reference - Complete specification of all 26 AG-UI protocol events
- Framework Capabilities - Deep analysis of what each framework supports
- HITL Validation Results - Testing human-in-the-loop implementation across frameworks
- Event Coverage Matrix - Which frameworks emit which events
- Framework Comparison - Performance and feature comparison
- Event Type Analysis - Event adoption statistics
- Benchmark Summary - Overall statistics and rankings
- Report Generation - How to run benchmarks and auto-generate reports
This benchmark covers 26 agent implementations across multiple frameworks:
Multi-Model Frameworks (Anthropic, OpenAI, Google):
- Agno, LangGraph, PydanticAI, LlamaIndex, Vercel AI SDK
Single-Model Frameworks:
- CrewAI (Anthropic), AG2 (OpenAI), Google ADK (Google)
Raw LLM APIs (baseline):
- Anthropic, OpenAI, Google Gemini
For detailed framework analysis, see Framework Capabilities.
- Python 3.11+
- Node.js 18+ (for TypeScript agents)
- uv package manager
- API keys for OpenAI, Anthropic, and Google
# Clone the repository
git clone https://github.com/namastexlabs/agui-benchmark.git
cd agui-benchmark
# Install Python dependencies
uv sync
# Install TypeScript dependencies
cd ts-agents && npm install --legacy-peer-deps && cd ..
# Create .env file with your API keys
cat > .env << EOF
ANTHROPIC_API_KEY=your-anthropic-key
OPENAI_API_KEY=your-openai-key
GEMINI_API_KEY=your-gemini-key
EOF

# Start all agents
./start_all.sh
# Wait for agents to initialize
sleep 10
# Run the benchmark
uv run python test_agents.py
# Stop all agents
./stop_all.sh

- Agents Tested: 26 implementations across 9 frameworks
- Tests Run: 702 total (27 per agent × 26 agents)
- Success Rate: 97.7% (686/702 passed)
- Models Tested: Claude, OpenAI GPT, Google Gemini, Cerebras Llama
| Rank | Framework | Model | Time |
|---|---|---|---|
| 1 | Agno | Cerebras | 262ms |
| 2 | LlamaIndex | Claude | 1,728ms |
| 3 | PydanticAI | Claude | 1,746ms |
| 4 | LangGraph | Claude | 2,296ms |
| 5 | LlamaIndex | Gemini | 2,768ms |
Key Insights:
- Cerebras Llama 3.3-70b is dramatically faster than all others (262ms vs 1.7s+)
- PydanticAI offers best balance: fast (1.7s with Claude) + reliable (100% success)
- LangGraph has reliability issues (89% success rate on some models)
- Raw API wrappers are slower than framework abstractions (20-24s vs 1-8s)
See Framework Comparison Matrix for complete rankings and metrics.
We validated that human-in-the-loop workflows can be fully implemented using the AG-UI protocol's existing TOOL_CALL_* events, without requiring special HITL-specific events. See HITL Validation Results.
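A sketch of that pattern: the agent emits an ordinary tool-call sequence for an approval "tool", pauses, and the UI returns the human's decision as the tool result. The tool name `request_approval` and payload shapes are hypothetical; only the event type names come from the protocol.

```python
import json

def approval_request_events(tool_call_id, question):
    # Agent side: emit a tool call the UI renders as an approval prompt,
    # then pause the run until a result event arrives.
    return [
        {"type": "TOOL_CALL_START", "toolCallId": tool_call_id,
         "toolCallName": "request_approval"},
        {"type": "TOOL_CALL_ARGS", "toolCallId": tool_call_id,
         "delta": json.dumps({"question": question})},
        {"type": "TOOL_CALL_END", "toolCallId": tool_call_id},
    ]

def approval_result_event(tool_call_id, approved):
    # UI side: send the human's decision back as an ordinary tool result.
    return {"type": "TOOL_CALL_RESULT", "toolCallId": tool_call_id,
            "content": "approved" if approved else "rejected"}

seq = approval_request_events("tc-1", "Delete report.csv?")
result = approval_result_event("tc-1", approved=True)
```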
The AG-UI specification defines 26 events across 5 categories:
- Lifecycle: RUN_STARTED, RUN_FINISHED, RUN_ERROR
- Text Messages: TEXT_MESSAGE_START/CONTENT/END, THINKING_START/END, THINKING_TEXT_MESSAGE_*
- Tool Calls: TOOL_CALL_START/ARGS/END/RESULT
- State: STATE_SNAPSHOT, STATE_DELTA, MESSAGES_SNAPSHOT, ACTIVITY_SNAPSHOT, ACTIVITY_DELTA
- Custom: STEP_STARTED, STEP_FINISHED, RAW, CUSTOM
See AGUI Spec Reference for complete details.
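One compliance check the lifecycle events enable: a well-formed run starts with RUN_STARTED and ends with RUN_FINISHED or RUN_ERROR. The sketch below is a simplified reading of those rules, not the spec's normative text.

```python
def lifecycle_ok(event_types):
    """Check the minimal lifecycle framing of a run's event sequence."""
    if not event_types:
        return False
    return (event_types[0] == "RUN_STARTED"
            and event_types[-1] in ("RUN_FINISHED", "RUN_ERROR"))

print(lifecycle_ok(["RUN_STARTED", "TEXT_MESSAGE_START",
                    "TEXT_MESSAGE_CONTENT", "TEXT_MESSAGE_END",
                    "RUN_FINISHED"]))  # → True
print(lifecycle_ok(["TEXT_MESSAGE_CONTENT"]))  # → False
```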
| Framework | Tests | Success | Median Time | Throughput | Tool Calls |
|---|---|---|---|---|---|
| agno-cerebras | 27 | 100% | 284ms | N/A | N/A |
| pydantic-anthropic | 27 | 100% | 1,771ms | 13.4k c/s | 20 |
| agno-anthropic | 27 | 100% | 2,388ms | 11.3k c/s | 19 |
| pydantic-gemini | 27 | 100% | 2,738ms | 54k c/s | 16 |
| vercel-anthropic | 27 | 100% | 2,748ms | 14.3k c/s | 13 |
| llamaindex-anthropic | 27 | 89% | 1,637ms | N/A | N/A |
| langgraph-anthropic | 27 | 93% | 2,295ms | 6.6k c/s | N/A |
| openai-raw | 27 | 100% | 20,279ms | 3.4k c/s | 12 |
See full comparison: Framework Comparison Matrix
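The summary percentages above can be derived by aggregating per-agent results. The dict shape (`passed`/`total`) is an assumed schema for illustration, not the actual JSON that test_agents.py writes.

```python
def success_rate(results):
    """Overall pass percentage across all agents, one decimal place."""
    passed = sum(r["passed"] for r in results.values())
    total = sum(r["total"] for r in results.values())
    return round(100 * passed / total, 1)

results = {
    "agno-cerebras": {"passed": 27, "total": 27},
    "llamaindex-anthropic": {"passed": 24, "total": 27},
}
print(success_rate(results))  # → 94.4
```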
test_agents.py (Benchmark Runner)
  │
  ├── Starts 26 agent implementations on various ports
  │
  ├── Runs 27 test scenarios per agent:
  │     • Simple prompt (no tools)
  │     • Tool calling (6 tools available)
  │     • Streaming performance
  │     • Error handling
  │     • State management
  │
  └── Collects timing, tool calls, response metrics
        │
        ▼
Saves JSON results → generate_reports.py
  │
  └── Auto-generates 4 markdown reports in docs/reports/
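The final pipeline step renders saved results as markdown. A sketch under assumed names: the schema and table layout here are illustrative, not the actual behavior of generate_reports.py.

```python
def render_summary(results):
    """Render per-agent results as a markdown table string."""
    lines = ["| Agent | Success |", "|---|---|"]
    for agent, r in sorted(results.items()):
        lines.append(f"| {agent} | {100 * r['passed'] // r['total']}% |")
    return "\n".join(lines) + "\n"

report = render_summary({"agno-cerebras": {"passed": 27, "total": 27}})
print(report)
```

In the real pipeline the rendered string would be written into docs/reports/.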
- AG-UI Protocol - Official specification
- CopilotKit - AG-UI creators
- Framework repositories:
MIT License - see LICENSE for details.