Summary
BrowseComp (BC-EN/BC-ZH) evaluations are significantly slower than other benchmarks. A single BC-ZH task averages 48.6 minutes (median 37.3 min), with tail tasks reaching 265 minutes (4.4 hours). BC-EN with 1266 tasks takes 60-70 hours per run.
Profiling shows LLM inference is the dominant bottleneck (~80%+ of total task time), followed by tool execution overhead. Both sides need optimization to achieve the target of 2x+ overall speedup.
1. LLM Inference Optimization (Biggest Impact) 🔴
LLM inference accounts for the vast majority of evaluation time. A BC task runs ~300-400 turns, each requiring a full LLM call with growing context (up to tens of thousands of tokens). Conservative estimate: 20-60 min per task on inference alone, often much more in practice.
1.1 Prefix Caching
Current state: All evaluation tasks share the same system prompt (several thousand tokens of agent instructions), yet every request recomputes the full KV cache from scratch.
Optimization: Enable prefix caching (RadixTree) on the sglang server so the system prompt KV cache is computed once and reused across all concurrent requests.
Additional benefit: Within a single task's multi-turn conversation, each turn's context is a prefix of the next. With cache-aware routing (routing same-task requests to the same worker), KV cache from previous turns can be reused.
Expected impact: Significant reduction in prefill compute, especially for later turns with long context. Estimated 30-50% inference time reduction.
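Cache-aware routing can be approximated client-side by pinning each task to one worker. A minimal sketch under assumptions (the worker URLs and the `route_request` helper are hypothetical, not part of the repo):

```python
import hashlib

# Hypothetical worker pool for illustration; a real deployment would read
# these URLs from the serving config.
WORKERS = ["http://worker-0:1234", "http://worker-1:1234", "http://worker-2:1234"]

def route_request(task_id: str) -> str:
    """Pin every request of a task to the same sglang worker.

    Each turn's context is a prefix of the next, so landing on the same
    worker lets the radix cache reuse KV cache built on earlier turns.
    """
    digest = hashlib.sha256(task_id.encode()).digest()
    return WORKERS[int.from_bytes(digest[:8], "big") % len(WORKERS)]
```

Hashing the task ID (rather than round-robin) keeps routing stateless and deterministic across retries.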
1.2 Chunked Prefill with Pipeline Parallelism
Current state: Basic sglang launch with --tp 8, no chunked prefill.
Optimization: Enable --chunked-prefill-size 4096 --enable-dynamic-chunking to pipeline long prefills. SGLang benchmarks show up to 3.3x prefill throughput and 67.9% TTFT reduction for long contexts.
Why this matters for evaluation: BC tasks accumulate long conversation histories (10K-60K+ tokens). Chunked prefill directly reduces the time spent on these long-context prefill operations.
1.3 Prefill-Decode Disaggregation (Multi-Node)
Current state: Single-node TP8 serving both prefill and decode.
Optimization: When multiple nodes are available, separate prefill-heavy and decode-heavy workloads into dedicated workers. Evaluation workloads are prefill-dominant (long contexts, relatively short generations), making this particularly beneficial.
Expected impact: Better GPU utilization and throughput when scaling beyond single node.
1.4 Current vs Optimized sglang Config
Current (basic):

```shell
python3 -m sglang.launch_server \
  --model-path <path> --tp 8 --host 0.0.0.0 --port 1234 \
  --trust-remote-code --enable-metrics --mem-fraction-static 0.9
```

Optimized (recommended):

```shell
python3 -m sglang.launch_server \
  --model-path <path> --tp 8 --host 0.0.0.0 --port 1234 \
  --trust-remote-code --enable-metrics --mem-fraction-static 0.9 \
  --chunked-prefill-size 4096 --enable-dynamic-chunking
```

Our internal infra framework already supports these advanced sglang features (prefix caching, chunked prefill pipelining, PD disaggregation, cache-aware routing). Integrating them into the evaluation serving setup is the highest-leverage optimization.
2. Evaluation Pipeline Optimization
2.1 MCP Tool Server Parallel Initialization ✅ Fixed
Location: libs/miroflow-tools/src/miroflow_tools/manager.py → get_all_tool_definitions()
Each task initialized its 3 MCP tool servers sequentially; under high concurrency this took 234s on average (max 945s).
Fix: asyncio.gather() to parallelize. Measured 13.8x speedup (234s → 17s avg).
Status: ✅ Implemented in PR #139.
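The pattern behind the fix is straightforward: start all server handshakes concurrently and await them together. A simplified sketch with stand-in delays (the real handshake lives in `get_all_tool_definitions()`; the names and delays below are illustrative):

```python
import asyncio

async def init_server(name: str, delay: float) -> str:
    # Stand-in for one MCP server startup + stdio handshake.
    await asyncio.sleep(delay)
    return name

async def init_all(servers: list[tuple[str, float]]) -> list[str]:
    # Sequential awaits cost the *sum* of the delays;
    # asyncio.gather() costs only the *max*, preserving result order.
    return await asyncio.gather(*(init_server(n, d) for n, d in servers))
```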
2.2 MCP Server Connection Not Reused Across Tool Calls 🔴
Location: libs/miroflow-tools/src/miroflow_tools/manager.py → execute_tool_call()
Every single tool call spawns a new MCP server subprocess, performs stdio handshake, executes, then destroys:
```python
# Called ~400 times per BC task!
async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()  # Handshake every time
        tool_result = await session.call_tool(tool_name, arguments)
# Process destroyed here
```

Note: playwright already does connection reuse correctly (manager.py:247-252).
Proposed fix: Keep MCP server sessions alive for the lifetime of a task.
Expected impact: Eliminate ~400 process spawns per task. Estimated 2-5 min saved per task.
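One way to implement this is a per-task session pool that enters each server's context manager once and exits only at task teardown. The `PersistentSessionPool` class below is a hypothetical sketch, not the repo's API; in `execute_tool_call()` the factory would wrap `stdio_client` + `ClientSession` + `initialize()`:

```python
import asyncio
from contextlib import AsyncExitStack

class PersistentSessionPool:
    """Keep one live session per MCP server for the lifetime of a task,
    instead of spawning a fresh subprocess on every tool call."""

    def __init__(self, session_factory):
        # session_factory(server_name) -> async context manager yielding
        # an initialized session (e.g. stdio_client + ClientSession).
        self._factory = session_factory
        self._stack = AsyncExitStack()
        self._sessions = {}

    async def get(self, server_name: str):
        if server_name not in self._sessions:
            # Enter the context once; it stays open until close().
            self._sessions[server_name] = await self._stack.enter_async_context(
                self._factory(server_name)
            )
        return self._sessions[server_name]

    async def close(self):
        # Tear down all server processes at end of task.
        await self._stack.aclose()
        self._sessions.clear()
```

`AsyncExitStack` guarantees that every entered context is closed in reverse order even if one teardown raises, which matters when several subprocesses are alive.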
2.3 httpx Connection Pooling ✅ Fixed
Location: libs/miroflow-tools/src/miroflow_tools/dev_mcp_servers/search_and_scrape_webpage.py
Each search call created a new TCP connection. Now reuses shared httpx.AsyncClient.
Status: ✅ Implemented in PR #139.
2.4 max_turns Reduced from 400 to 300 ✅ Fixed
The algorithm team confirmed that tasks not solved within 300 turns essentially never succeed in turns 300-400. The new config mirothinker_1.7_keep5_max300.yaml saves ~25% of the compute wasted on tail tasks.
Status: ✅ Already on main.
2.5 Concurrency Overloading (No Backpressure)
With NUM_RUNS=2 and MAX_CONCURRENT=60, up to 120 processes run simultaneously at peak, causing E2B sandbox init to spike from 33s to 631s under contention.
Proposed fix: Shared semaphore across runs, or adaptive concurrency.
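A minimal sketch of the shared-semaphore approach (the `run_with_backpressure` helper is hypothetical; the real fix would pass one semaphore into both runs' task launchers):

```python
import asyncio

async def run_with_backpressure(task_fns, max_concurrent: int):
    """Bound total in-flight tasks across ALL runs with one semaphore,
    so NUM_RUNS x MAX_CONCURRENT never stacks into 120 live processes."""
    sem = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0  # tracked here only to make the bound observable

    async def guarded(fn):
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            try:
                return await fn()
            finally:
                active -= 1

    results = await asyncio.gather(*(guarded(fn) for fn in task_fns))
    return results, peak
```

Adaptive concurrency would go one step further and shrink the semaphore's effective limit when E2B init latency rises.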
Won't Fix (confirmed with algorithm team)
- `scrape_and_extract_info` optimization: Jina + LLM extraction are both necessary and cannot be shortened without accuracy loss.
- LLM retry parameters (base_wait=30s, max_retries=10): required for reliability under high load.
Benchmark-Specific Impact
| Benchmark | Tasks | Avg Task Time | Main Bottleneck |
|---|---|---|---|
| BC-EN | 1266 | ~43 min | LLM inference (long context) + scrape calls |
| BC-ZH | 289 | ~49 min | LLM inference + high turn count |
| HLE | 500 | ~15 min | E2B sandbox latency |
| GAIA | 103 | ~20 min | Mixed tools |
Progress Tracker
| Priority | Optimization | Expected Impact | Status |
|---|---|---|---|
| P0 | LLM inference: prefix caching + chunked prefill | 30-50% inference speedup | 🔴 To do |
| P0 | Parallel tool server init | 13.8x init speedup (234s → 17s) | ✅ PR #139 |
| P0 | MCP server connection reuse | Save 2-5 min/task | 🔴 To do |
| P1 | httpx connection pooling | Reduce TCP overhead | ✅ PR #139 |
| P1 | max_turns 400 → 300 | ~25% less wasted compute | ✅ On main |
| P2 | Concurrency backpressure | Reduce E2B init spike | ⬚ To do |
Environment
- Agent config: `mirothinker_1.7_keep5_max300` (previously v1.5 max400)
- Typical setup: 30B model, 8×GPU sglang, MAX_CONCURRENT=60, NUM_RUNS=2
- Profiled on: BC-ZH/BC-EN completed evaluations (xxg and lxx checkpoints)