Comprehensive benchmark suite for the AG-UI (Agent-User Interaction) protocol across multiple AI agent frameworks. This is a research guide documenting protocol compliance, performance, and capabilities.
AG-UI is an open, lightweight, event-based protocol for agent-user interaction created by CopilotKit. It enables framework-agnostic communication between AI agents and user interfaces through Server-Sent Events (SSE).
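Over SSE, each `data:` line carries one JSON-encoded protocol event. A minimal sketch (not the official AG-UI client SDK) of decoding such a stream, with illustrative payload fields:

```python
import json

def parse_sse(lines):
    """Yield decoded AG-UI events from raw SSE 'data:' lines."""
    for line in lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Simulated stream; real agents serve this over HTTP as text/event-stream.
stream = [
    'data: {"type": "RUN_STARTED", "runId": "run-1"}',
    'data: {"type": "TEXT_MESSAGE_CONTENT", "delta": "Hello"}',
    'data: {"type": "RUN_FINISHED", "runId": "run-1"}',
]

events = list(parse_sse(stream))
print([e["type"] for e in events])
```

The event `type` names come from the AG-UI spec; everything else here (field names, IDs) is an assumption for illustration.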
This benchmark provides rigorous testing and documentation of:
- Protocol Compliance - Which frameworks support which AG-UI events
- Framework Capabilities - What features each framework implements
- Performance Characteristics - Response times, throughput, tool calling
- HITL Implementation - How human-in-the-loop workflows work via the protocol
- AGUI Spec Reference - Complete specification of all 26 AG-UI protocol events
- Framework Capabilities - Deep analysis of what each framework supports
- HITL Validation Results - Testing human-in-the-loop implementation across frameworks
- Event Coverage Matrix - Which frameworks emit which events
- Framework Comparison - Performance and feature comparison
- Event Type Analysis - Event adoption statistics
- Benchmark Summary - Overall statistics and rankings
- Report Generation - How to run benchmarks and auto-generate reports
This benchmark covers 26 agent implementations across multiple frameworks:
Multi-Model Frameworks (Anthropic, OpenAI, Google):
- Agno, LangGraph, PydanticAI, LlamaIndex, Vercel AI SDK
Single-Model Frameworks:
- CrewAI (Anthropic), AG2 (OpenAI), Google ADK (Google)
Raw LLM APIs (baseline):
- Anthropic, OpenAI, Google Gemini
For detailed framework analysis, see Framework Capabilities.
- Python 3.11+
- Node.js 18+ (for TypeScript agents)
- uv package manager
- API keys for OpenAI, Anthropic, and Google
# Clone the repository
git clone https://github.com/namastexlabs/agui-benchmark.git
cd agui-benchmark
# Install Python dependencies
uv sync
# Install TypeScript dependencies
cd ts-agents && npm install --legacy-peer-deps && cd ..
# Create .env file with your API keys
cat > .env << EOF
ANTHROPIC_API_KEY=your-anthropic-key
OPENAI_API_KEY=your-openai-key
GEMINI_API_KEY=your-gemini-key
EOF

# Start all agents
./start_all.sh
# Wait for agents to initialize
sleep 10
# Run the benchmark
uv run python test_agents.py
# Stop all agents
./stop_all.sh

- Agents Tested: 26 implementations across 9 frameworks
- Tests Run: 702 total (27 per agent × 26 agents)
- Success Rate: 97.7% (686/702 passed)
- Models Tested: Claude, OpenAI GPT, Google Gemini, Cerebras Llama
| Rank | Framework | Model | Time |
|---|---|---|---|
| 1 | Agno | Cerebras | 262ms |
| 2 | LlamaIndex | Claude | 1,728ms |
| 3 | PydanticAI | Claude | 1,746ms |
| 4 | LangGraph | Claude | 2,296ms |
| 5 | LlamaIndex | Gemini | 2,768ms |
Key Insights:
- Cerebras Llama 3.3-70b is dramatically faster than all others (262ms vs 1.7s+)
- PydanticAI offers best balance: fast (1.7s with Claude) + reliable (100% success)
- LangGraph has reliability issues (89% success rate on some models)
- Raw API wrappers are slower than framework abstractions (20-24s vs 1-8s)
See Framework Comparison Matrix for complete rankings and metrics.
We validated that human-in-the-loop workflows can be fully implemented using the AG-UI protocol's existing TOOL_CALL_* events, without requiring special HITL-specific events. See HITL Validation Results.
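A sketch of that pattern: the agent emits an ordinary tool-call sequence for an approval "tool", pauses, and the UI returns the human's decision as the tool result. The tool name `request_approval` and payload shapes are hypothetical; only the event type names come from the protocol.

```python
import json

def approval_request_events(tool_call_id, question):
    # Agent side: emit a tool call the UI renders as an approval prompt,
    # then pause the run until a result event arrives.
    return [
        {"type": "TOOL_CALL_START", "toolCallId": tool_call_id,
         "toolCallName": "request_approval"},
        {"type": "TOOL_CALL_ARGS", "toolCallId": tool_call_id,
         "delta": json.dumps({"question": question})},
        {"type": "TOOL_CALL_END", "toolCallId": tool_call_id},
    ]

def approval_result_event(tool_call_id, approved):
    # UI side: send the human's decision back as an ordinary tool result.
    return {"type": "TOOL_CALL_RESULT", "toolCallId": tool_call_id,
            "content": "approved" if approved else "rejected"}

seq = approval_request_events("tc-1", "Delete report.csv?")
result = approval_result_event("tc-1", approved=True)
```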
The AG-UI specification defines 26 events across 5 categories:
- Lifecycle: RUN_STARTED, RUN_FINISHED, RUN_ERROR
- Text Messages: TEXT_MESSAGE_START/CONTENT/END, THINKING_START/END, THINKING_TEXT_MESSAGE_*
- Tool Calls: TOOL_CALL_START/ARGS/END/RESULT
- State: STATE_SNAPSHOT, STATE_DELTA, MESSAGES_SNAPSHOT, ACTIVITY_SNAPSHOT, ACTIVITY_DELTA
- Custom: STEP_STARTED, STEP_FINISHED, RAW, CUSTOM
See AGUI Spec Reference for complete details.
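One compliance check the lifecycle events enable: a well-formed run starts with RUN_STARTED and ends with RUN_FINISHED or RUN_ERROR. The sketch below is a simplified reading of those rules, not the spec's normative text.

```python
def lifecycle_ok(event_types):
    """Check the minimal lifecycle framing of a run's event sequence."""
    if not event_types:
        return False
    return (event_types[0] == "RUN_STARTED"
            and event_types[-1] in ("RUN_FINISHED", "RUN_ERROR"))

print(lifecycle_ok(["RUN_STARTED", "TEXT_MESSAGE_START",
                    "TEXT_MESSAGE_CONTENT", "TEXT_MESSAGE_END",
                    "RUN_FINISHED"]))  # → True
print(lifecycle_ok(["TEXT_MESSAGE_CONTENT"]))  # → False
```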
| Framework | Tests | Success | Median Time | Throughput | Tool Calls |
|---|---|---|---|---|---|
| agno-cerebras | 27 | 100% | 284ms | N/A | N/A |
| pydantic-anthropic | 27 | 100% | 1,771ms | 13.4k c/s | 20 |
| agno-anthropic | 27 | 100% | 2,388ms | 11.3k c/s | 19 |
| pydantic-gemini | 27 | 100% | 2,738ms | 54k c/s | 16 |
| vercel-anthropic | 27 | 100% | 2,748ms | 14.3k c/s | 13 |
| llamaindex-anthropic | 27 | 89% | 1,637ms | N/A | N/A |
| langgraph-anthropic | 27 | 93% | 2,295ms | 6.6k c/s | N/A |
| openai-raw | 27 | 100% | 20,279ms | 3.4k c/s | 12 |
See full comparison: Framework Comparison Matrix
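The summary percentages above can be derived by aggregating per-agent results. The dict shape (`passed`/`total`) is an assumed schema for illustration, not the actual JSON that test_agents.py writes.

```python
def success_rate(results):
    """Overall pass percentage across all agents, one decimal place."""
    passed = sum(r["passed"] for r in results.values())
    total = sum(r["total"] for r in results.values())
    return round(100 * passed / total, 1)

results = {
    "agno-cerebras": {"passed": 27, "total": 27},
    "llamaindex-anthropic": {"passed": 24, "total": 27},
}
print(success_rate(results))  # → 94.4
```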
test_agents.py (Benchmark Runner)
  │
  ├── Starts 26 agent implementations on various ports
  │
  ├── Runs 27 test scenarios per agent:
  │     • Simple prompt (no tools)
  │     • Tool calling (6 tools available)
  │     • Streaming performance
  │     • Error handling
  │     • State management
  │
  └── Collects timing, tool calls, response metrics
        │
        ▼
Saves JSON results → generate_reports.py
  │
  └── Auto-generates 4 markdown reports in docs/reports/
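The final pipeline step renders saved results as markdown. A sketch under assumed names: the schema and table layout here are illustrative, not the actual behavior of generate_reports.py.

```python
def render_summary(results):
    """Render per-agent results as a markdown table string."""
    lines = ["| Agent | Success |", "|---|---|"]
    for agent, r in sorted(results.items()):
        lines.append(f"| {agent} | {100 * r['passed'] // r['total']}% |")
    return "\n".join(lines) + "\n"

report = render_summary({"agno-cerebras": {"passed": 27, "total": 27}})
print(report)
```

In the real pipeline the rendered string would be written into docs/reports/.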
- AG-UI Protocol - Official specification
- CopilotKit - AG-UI creators
- Framework repositories:
MIT License - see LICENSE for details.