This document describes the automated benchmarking system for measuring cicada MCP tool call frequency when using Claude Code.
The benchmark script (tests/benchmark/benchmark_mcp_tool_calls.py) automates testing of how frequently Claude Code invokes cicada MCP tools when processing various prompts. This is useful for:
- Understanding tool usage patterns
- Optimizing tool descriptions for better adoption
- Measuring the impact of prompt engineering on tool usage
- Tracking improvements in AI agent behavior over time
The benchmark script:
- Displays Tool Descriptions - Shows all cicada MCP server tool descriptions at the start
- Runs Claude Code in Headless Mode - Uses the -p flag with the Haiku model for fast, non-thinking execution
- Parses JSON Output - Extracts tool call events from the streaming JSON output
- Counts Tool Calls - Tracks which cicada tools were invoked and how often
- Generates Statistics - Provides aggregate statistics across multiple test runs
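To make that flow concrete, here is a minimal sketch of running one prompt and tallying tool calls. It uses only the CLI flags documented later in this document; the real benchmark script is more thorough (it also inspects nested content blocks and falls back to text matching):

import json
import subprocess
import time
from collections import Counter

def benchmark_prompt(prompt: str) -> tuple[float, Counter]:
    """Run one prompt through Claude Code headlessly and tally tool_use events."""
    cmd = [
        "claude", "-p", prompt,
        "--model", "claude-haiku-4-5-20251001",
        "--output-format", "stream-json",
    ]
    start = time.time()
    result = subprocess.run(cmd, capture_output=True, text=True)
    duration = time.time() - start

    tool_calls: Counter = Counter()
    for line in result.stdout.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines in the stream
        if isinstance(event, dict) and event.get("type") == "tool_use":
            tool_calls[event.get("name", "unknown")] += 1
    return duration, tool_calls

duration, calls = benchmark_prompt("What functions are available in the Cicada.Formatter module?")
print(f"{sum(calls.values())} tool calls in {duration:.2f}s: {dict(calls)}")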
- Claude Code installed and accessible in PATH (the claude command)
- Cicada MCP server configured in .mcp.json in the test repository
- Python 3.10+ with access to the cicada package
- uv package manager (optional but recommended)
# Ensure cicada is installed
uv tool install git+https://github.com/wende/cicada.git
# Install development dependencies
uv sync
# Or using pip
pip install -e ".[dev]"

List available test suites:
python tests/benchmark/benchmark_mcp_tool_calls.py --list-suites

This will display all available test suites with their descriptions and test counts.
Test a specific prompt:
python tests/benchmark/benchmark_mcp_tool_calls.py --prompt "Show me where the load_index function is called"

Run multiple predefined test cases:
# Run built-in default test suite
python tests/benchmark/benchmark_mcp_tool_calls.py --test-suite
# Run all test suites from JSON file
python tests/benchmark/benchmark_mcp_tool_calls.py --test-suite --load-tests tests/benchmark/benchmark_test_prompts.jsonSee all available test suites:
python tests/benchmark/benchmark_mcp_tool_calls.py --list-suites --load-tests tests/benchmark/benchmark_test_prompts.jsonOutput:
Available test suites:
basic_searches: Basic module and function searches (3 tests)
usage_analysis: Function and module usage tracking (3 tests)
git_attribution: Git history and PR attribution queries (4 tests)
complex_multi_tool: Complex queries requiring multiple tool invocations (4 tests)
semantic_search: Keyword-based semantic searches (3 tests)
code_quality: Code quality and analysis queries (3 tests)
realistic_scenarios: Real-world development scenarios (5 tests)
stress_tests: High complexity queries to stress test tool usage (3 tests)
Run a specific test suite from the JSON file:
# Run only git attribution tests
python tests/benchmark/benchmark_mcp_tool_calls.py --test-suite git_attribution --load-tests tests/benchmark/benchmark_test_prompts.json
# Run realistic scenarios
python tests/benchmark/benchmark_mcp_tool_calls.py --test-suite realistic_scenarios --load-tests tests/benchmark/benchmark_test_prompts.json

Test against a different repository:
python tests/benchmark/benchmark_mcp_tool_calls.py --repo-path /path/to/elixir/project --test-suite

The script produces detailed output including:
================================================================================
CICADA MCP SERVER TOOL DESCRIPTIONS
================================================================================
Tool: search_module
Description: PREFERRED for Elixir: View a module's complete API - functions with arity, signatures, docs, typespecs, and line numbers...
--------------------------------------------------------------------------------
Tool: search_function
Description: PREFERRED for Elixir: Find function definitions and call sites across the codebase...
--------------------------------------------------------------------------------
================================================================================
TEST: Simple Module Search
================================================================================
Running: claude -p "What functions are available..." --model...
Prompt: What functions are available in the Cicada.Formatter module?...
Duration: 8.42s
Total MCP Tool Calls: 3
Tool Call Breakdown:
- search_module: 2
- search_function: 1
================================================================================
BENCHMARK SUMMARY
================================================================================
Total Tests: 5
Total Time: 42.15s
Average Time per Test: 8.43s
Total MCP Tool Calls: 18
Average Tool Calls per Test: 3.60
Tool Usage Across All Tests:
- search_module: 7 (38.9%)
- search_function: 5 (27.8%)
- get_commit_history: 3 (16.7%)
- find_pr_for_line: 2 (11.1%)
- search_module_usage: 1 (5.6%)
Individual Test Results:
1. Simple Module Search: 3 calls in 8.42s
2. Function Usage Search: 4 calls in 9.12s
3. Complex Multi-Tool Query: 6 calls in 11.23s
4. Git Attribution Query: 2 calls in 6.88s
5. Code Analysis: 3 calls in 6.50s
The script uses these Claude Code options:
- -p (prompt flag): Enables headless/non-interactive mode
- --model claude-haiku-4-5-20251001: Uses Haiku 4.5 for fast execution
- --output-format stream-json: Outputs streaming JSON for parsing
- No thinking keywords: Omits "think" keywords to minimize the thinking budget
claude -p "Show me the search_module function" \
--model claude-haiku-4-5-20251001 \
--output-format stream-json

The script detects MCP tool calls through multiple methods:
Parses streaming JSON events for tool invocation records:
{
"type": "tool_use",
"name": "search_module",
"input": {...}
}

Extracts tool calls from content blocks:
{
"content": [
{
"type": "tool_use",
"name": "search_function",
...
}
]
}

Searches raw output for tool name mentions as a fallback when JSON parsing is incomplete.
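A minimal sketch of this combined detection logic, assuming the event shapes shown above (the count_tool_calls helper and the tool-name list are illustrative, not taken from the script):

import json
import re

CICADA_TOOLS = [
    "search_module", "search_function", "search_module_usage",
    "get_commit_history", "find_pr_for_line",
]  # illustrative subset of tool names

def count_tool_calls(raw_output: str) -> dict[str, int]:
    """Count tool invocations in stream-json output, with a plain-text fallback."""
    counts: dict[str, int] = {}
    for line in raw_output.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        # Method 1: top-level tool_use events
        blocks = [event] if event.get("type") == "tool_use" else []
        # Method 2: tool_use blocks nested inside a content list
        blocks += [
            b for b in (event.get("content") or [])
            if isinstance(b, dict) and b.get("type") == "tool_use"
        ]
        for block in blocks:
            name = block.get("name", "unknown")
            counts[name] = counts.get(name, 0) + 1
    if not counts:
        # Method 3: fall back to scanning the raw text for known tool names
        for tool in CICADA_TOOLS:
            hits = len(re.findall(rf"\b{re.escape(tool)}\b", raw_output))
            if hits:
                counts[tool] = hits
    return counts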
Edit tests/benchmark/benchmark_test_prompts.json to add or modify test cases:
{
"test_suites": {
"my_custom_suite": {
"description": "My custom test suite",
"tests": [
{
"name": "My Custom Test",
"prompt": "Your test prompt here",
"expected_tools": ["search_module"]
}
]
}
}
}

Then run:
python tests/benchmark/benchmark_mcp_tool_calls.py --test-suite my_custom_suite --load-tests tests/benchmark/benchmark_test_prompts.json

Edit benchmark_mcp_tool_calls.py to modify the built-in test cases:
test_cases = [
{
"name": "My Custom Test",
"prompt": "Your test prompt here",
},
# Add more test cases...
]
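For reference, a minimal sketch of how such a suite could be loaded programmatically, assuming the JSON schema shown above (the load_suite helper is illustrative and not part of the script):

import json
from pathlib import Path

def load_suite(json_path: str, suite_name: str) -> list[dict]:
    """Return the test cases for one suite from a benchmark_test_prompts.json-style file."""
    data = json.loads(Path(json_path).read_text())
    return data["test_suites"][suite_name]["tests"]

for test in load_suite("tests/benchmark/benchmark_test_prompts.json", "my_custom_suite"):
    print(f"{test['name']}: {test['prompt']}")

The benchmark can also be run in CI, for example with a GitHub Actions workflow: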
name: MCP Tool Usage Benchmark

on:
  pull_request:
  schedule:
    - cron: '0 0 * * 0'  # Weekly

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        run: curl -LsSf https://astral.sh/uv/install.sh | sh

      - name: Install cicada
        run: uv tool install git+https://github.com/wende/cicada.git

      - name: Setup test environment
        run: |
          cd test_fixture_repo
          cicada

      - name: Run benchmark
        run: |
          cd test_fixture_repo
          python ../tests/benchmark/benchmark_mcp_tool_calls.py --test-suite

      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: benchmark-results
          path: results.json

When Claude Code makes many cicada tool calls, it indicates:
- Tool descriptions are effective
- AI is choosing specialized tools over generic searches
- Better code understanding with less token usage
Low tool usage might indicate:
- Tool descriptions need improvement
- Prompt doesn't align with tool capabilities
- AI is using alternative approaches (grep, file reads, etc.)
Look for:
- Progressive refinement: Starting with broad searches (search_module) then narrowing (search_function)
- Context gathering: Using git tools (find_pr_for_line, get_commit_history) for historical context
- Dependency analysis: Using search_module_usage before refactoring
If the claude command is not found, install Claude Code:
# Visit https://docs.claude.com/en/docs/claude-code
npm install -g @anthropic-ai/claude-code

If no tool calls are detected, this could mean:
- JSON parsing needs adjustment for new output format
- Claude Code isn't using MCP tools (check .mcp.json configuration)
- The prompt doesn't trigger tool usage
Enable debug logging:
claude -p "your prompt" --mcp-debug --output-format stream-jsonEnsure cicada is set up in the test repository:
cd your_test_repo
cicada

Potential improvements:
- Token usage tracking: Measure input/output tokens per tool call
- Latency analysis: Track tool invocation latency
- Success rate monitoring: Detect failed tool calls
- Comparison mode: Compare different AI models (Haiku vs Sonnet)
- Regression detection: Alert on significant changes in tool usage patterns
- Export formats: JSON, CSV, HTML report generation
MIT License - See LICENSE file for details