English | 简体中文
Harness-Bench is a real-workspace benchmark for evaluating agent / claw-style frameworks under executable task conditions. Instead of grading only final text answers, it measures whether an agent can operate inside a sandboxed workspace, produce the required artifacts, follow task constraints, and leave enough traces for process and usage analysis.
The benchmark currently supports multiple adapters, task hooks, oracle-based grading, process grading, and usage accounting. It is designed for side-by-side evaluation of agent harnesses rather than for a single model or a single product.
Harness-Bench provides:
- Real workspace execution in per-task sandboxes.
- Pluggable adapters for multiple agent frameworks.
- Task-local fixtures, prompts, hooks, and oracle graders.
- Process grading in addition to outcome grading.
- Usage tracking through a benchmark-managed proxy and session logs.
- A unified CLI for single-task and full-suite runs.
In practice, this means you can ask different agent frameworks to solve the exact same task, under the exact same workspace layout, and compare:
- Whether they completed the task correctly.
- Whether they used tools coherently.
- Whether they respected task constraints and safety boundaries.
- How much model usage or cost they incurred.
The repository currently contains 28 tasks spanning a broad set of agent capabilities, including:
- File operations
- Shell execution
- Browser / local HTTP interaction
- Meeting summarization and email triage
- Session memory and multi-round workflows
- Vision and image-related tasks
- Git / PR workflows
- Office document processing
- Code debugging and repair
- Multi-document synthesis
- Planning and task decomposition
- Heartbeat / long-running monitoring
- Security and prompt-injection defense
- Provider failover and routing analysis
- Incident analysis and runbook synthesis
The current registry supports the following adapters:
- `demo`
- `openclaw`
- `picoclaw`
- `nanobot`
- `nanoclaw`
- `nullclaw`
- `moltis`
- `zeroclaw`
- `hermes_agent`
- `generic_cli`
Example model entries live in `config/models.example.yaml`.
```
Harness-Bench/
├── config/              # App config and model config examples
├── grading/             # Shared grading prompts / helpers
├── scripts/             # Wrapper scripts for selected frameworks
├── src/clawbench_v2/    # CLI, runner, adapters, config loading, grading pipeline
└── tasks/               # Task definitions, prompts, fixtures, hooks, oracles
```
Typical task folders contain:
- `task.yaml`
- `prompt.txt` or `prompt_files`
- `fixtures/`
- `oracle_grade.py`
- optional `hooks.py`
- optional rubric-related files
- Python 3.10+
- `PyYAML>=6.0`
- Framework-specific CLIs or wrappers for the adapters you want to run
Install the Python package locally:
```shell
cd Harness-Bench
python3 -m pip install -e .
```

If you prefer not to install the package, you can still run with `PYTHONPATH=src`.
Harness-Bench uses two top-level config files:
- `config/app.yaml`
- `config/models.example.yaml`
`config/app.yaml` defines project-level paths and defaults, such as:
- `tasks_dir`
- `results_dir`
- `work_root`
- `default_timeout_sec`
Important note: the example app config in this repo is already customized for a local environment. You should review and adjust results_dir and work_root before large benchmark runs.
You can override the app config path with:
```shell
export CLAWBENCHV2_APP_CONFIG=/absolute/path/to/app.yaml
```

`config/models.example.yaml` defines runnable model / framework entries. Each model entry typically includes:
- `adapter`
- `command`
- `user_config`
- `session_prefix`
- `timeout_sec`
- adapter-specific extra fields
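An illustrative model entry might look like the fragment below. The field names come from the list above, but every value here is a placeholder, and the exact top-level schema should be checked against `config/models.example.yaml` in your checkout:

```yaml
openclaw-local:
  adapter: openclaw
  command: scripts/run_openclaw.sh    # hypothetical wrapper script path
  user_config: config/openclaw.json
  session_prefix: oc-bench
  timeout_sec: 900
```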
You can point the benchmark to another model config file with:
```shell
export CLAWBENCHV2_MODELS_CONFIG=/absolute/path/to/models.yaml
```

Some adapters expect local framework config files such as:
- `config/openclaw.json`
- `config/picoclaw.json`
- `config/nullclaw.json`
- `config/zeroclaw.toml`
These files are often private, machine-specific, or API-key-bearing, so they are not guaranteed to be committed in a shared repo. Create them locally as needed for your framework.
List all tasks:
```shell
PYTHONPATH=src python3 -m clawbench_v2.cli tasks
```

Run a single demo task:
```shell
PYTHONPATH=src python3 -m clawbench_v2.cli run-task \
  --task 01-file \
  --model demo-local \
  --mode demo
```

Run a single live task with one of your configured frameworks:
```shell
PYTHONPATH=src python3 -m clawbench_v2.cli run-task \
  --task 01-file \
  --model openclaw-local \
  --mode live
```

Run a full suite:
```shell
PYTHONPATH=src python3 -m clawbench_v2.cli run-suite \
  --model openclaw-local \
  --mode live
```

Resume a suite from a specific task ID:
```shell
PYTHONPATH=src python3 -m clawbench_v2.cli run-suite \
  --model moltis-local \
  --mode live \
  --from-task 07-session-memory
```

Delete the sandbox after a run:
```shell
PYTHONPATH=src python3 -m clawbench_v2.cli run-task \
  --task 01-file \
  --model demo-local \
  --mode demo \
  --delete-sandbox
```

The main CLI entrypoint is `src/clawbench_v2/cli.py`.
Available commands:
- `tasks`
- `run-task`
- `run-suite`
Main arguments:
- `--task`
- `--model`
- `--mode`
- `--delete-sandbox`
- `--from-task` for suite resume
Both run-task and run-suite print progress lines and elapsed time in seconds. Output JSON includes elapsed_sec.
The main runtime logic lives in `src/clawbench_v2/runner.py`.
For each run, the benchmark:
- Creates a fresh sandbox under the configured `work_root`.
- Creates a real task workspace inside that sandbox.
- Copies task fixtures into the workspace.
- Renders prompts using workspace and runtime variables.
- Runs optional task hooks.
- Invokes the selected adapter.
- Runs the oracle grader on the workspace outputs.
- Extracts usage information from the proxy and/or framework session logs.
- Runs a process rubric when configured.
- Writes a result JSON to the configured results directory.
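The steps above can be pictured as a minimal Python flow. This is an illustrative sketch, not the actual runner API: the function name, sandbox naming, prompt variable, and result fields are assumptions standing in for what `src/clawbench_v2/runner.py` really does.

```python
import json
import shutil
from pathlib import Path

def run_task(task_dir: Path, work_root: Path, results_dir: Path,
             model_id: str, adapter) -> dict:
    """Illustrative per-run flow: sandbox, fixtures, prompt, adapter, result."""
    # Fresh sandbox with a real workspace inside it.
    sandbox = work_root / f"sandbox-{task_dir.name}"
    workspace = sandbox / "workspace"
    workspace.mkdir(parents=True, exist_ok=True)

    # Copy task fixtures into the workspace.
    fixtures = task_dir / "fixtures"
    if fixtures.is_dir():
        shutil.copytree(fixtures, workspace, dirs_exist_ok=True)

    # Render the prompt with a workspace variable and snapshot it.
    prompt = (task_dir / "prompt.txt").read_text().replace("{workspace}", str(workspace))
    (sandbox / "prompt-round1.txt").write_text(prompt)

    # Invoke the adapter (here just a callable standing in for a framework CLI).
    adapter(prompt, workspace)

    # Write a result JSON under <results_dir>/<model_id>/.
    result = {"task_id": task_dir.name, "model_id": model_id,
              "workspace": str(workspace)}
    out = results_dir / model_id / f"{task_dir.name}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(result, indent=2))
    return result
```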
The actual output location depends on `config/app.yaml`, especially:
- `results_dir`
- `work_root`
Typical layout after a run:
```
<work_root>/
└── oc-bench-v2-.../          # sandbox for one task run
    ├── workspace/            # real task workspace
    ├── usage-proxy/          # proxy logs and captured responses
    └── prompt-round1.txt     # rendered prompt snapshot

<results_dir>/
└── <model_id>/
    └── <task_id>.json
```
Each task result JSON typically contains:
- `task_id`
- `model_id`
- `mode`
- `sandbox`
- `workspace`
- `session_id`
- `usage_summary`
- `oracle_result`
- `process_result`
- `combined_result`
- `scoring`
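Because results land in per-model folders of per-task JSON files, a small script can aggregate them. The sketch below assumes the numeric combined score lives at `scoring.combined_score`, which may not match the actual result layout; adjust the key path for your results.

```python
import json
from pathlib import Path

def summarize_model(results_dir: Path, model_id: str) -> dict:
    """Average the combined score over every task result JSON for one model.

    Assumes each <task_id>.json holds a numeric score under
    result["scoring"]["combined_score"] (an assumed key path).
    """
    scores = {}
    for path in sorted((results_dir / model_id).glob("*.json")):
        result = json.loads(path.read_text())
        score = result.get("scoring", {}).get("combined_score")
        if score is not None:
            scores[result.get("task_id", path.stem)] = score
    mean = sum(scores.values()) / len(scores) if scores else 0.0
    return {"model_id": model_id, "tasks": len(scores),
            "mean_combined_score": mean}
```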
The benchmark is intentionally designed to preserve both outcome-level and process-level evidence.
Harness-Bench can combine multiple perspectives:
- Oracle outcome score
- Process score from rubric-based trace evaluation
- Combined score
In the current setup:
- Outcome grading usually comes from each task's `oracle_grade.py`.
- Process grading is derived from proxy traces and rubric logic.
- `combined_score` is typically built from outcome and process signals.
Some multimodal or rubric-primary tasks may emphasize rubric-based grading more heavily than pure artifact checks.
When possible, the benchmark captures usage data through a managed usage proxy or framework session logs. Depending on the adapter and trace availability, usage_summary may include:
- `input_tokens`
- `output_tokens`
- `cache_read_tokens`
- `cache_write_tokens`
- `total_tokens`
- provider or model metadata
This makes the benchmark useful not only for capability comparison, but also for cost and efficiency analysis.
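Since adapters may report only a subset of the token fields, a defensive sum is useful when `total_tokens` is absent. Treat the field set (and whether cache tokens belong in a provider's total) as an assumption:

```python
def total_tokens(usage_summary: dict) -> int:
    """Sum the token component fields, treating missing or null ones as zero.

    Which components a provider counts toward its own total varies, so this
    is a fallback estimate, not an authoritative figure.
    """
    fields = ("input_tokens", "output_tokens",
              "cache_read_tokens", "cache_write_tokens")
    return sum(int(usage_summary.get(field) or 0) for field in fields)
```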
The adapter layer lives in `src/clawbench_v2/adapters`.
To add a new framework:
- Implement a new adapter class in `src/clawbench_v2/adapters/`.
- Export it in `src/clawbench_v2/adapters/__init__.py`.
- Register it in `src/clawbench_v2/registry.py`.
- Add a model entry to `config/models.example.yaml`.
- Provide any wrapper scripts or local config files required by that framework.
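The real adapter base-class interface is defined in `src/clawbench_v2/adapters`, so the sketch below is only a hedged illustration of the shape such a class might take: the class name, constructor fields (mirroring the `command` and `timeout_sec` model-entry keys), and the `run` signature are all assumptions.

```python
import subprocess
from pathlib import Path

class GenericCliAdapter:
    """Illustrative adapter that shells out to a framework CLI.

    Hypothetical interface: the actual base class and method signatures
    live in src/clawbench_v2/adapters and may differ.
    """

    def __init__(self, command: list[str], timeout_sec: int = 600):
        self.command = command
        self.timeout_sec = timeout_sec

    def run(self, prompt: str, workspace: Path) -> dict:
        # Feed the rendered prompt on stdin and run inside the task workspace.
        proc = subprocess.run(
            self.command, input=prompt, text=True,
            cwd=workspace, capture_output=True, timeout=self.timeout_sec,
        )
        return {"exit_code": proc.returncode, "stdout": proc.stdout}
```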
To add a new benchmark task:
- Create a new folder under `tasks/`.
- Add `task.yaml`.
- Add prompt file(s).
- Add fixtures if the task needs input files.
- Implement `oracle_grade.py`.
- Optionally implement `hooks.py`.
- Optionally add rubric files if process grading needs task-specific logic.
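The entry-point contract the runner expects from `oracle_grade.py` is not documented here, so the sketch below assumes a `grade(workspace)` function returning a score dict; the function name, the `summary.txt` artifact, and the return shape are all hypothetical and should be matched to the runner's actual contract.

```python
from pathlib import Path

def grade(workspace: Path) -> dict:
    """Illustrative oracle for a file task: check that the agent produced
    the required artifact with non-empty content.

    Assumed interface; match whatever contract the runner actually calls.
    """
    target = workspace / "summary.txt"   # hypothetical required artifact
    if not target.is_file():
        return {"score": 0.0, "reason": "summary.txt missing"}
    if not target.read_text().strip():
        return {"score": 0.0, "reason": "summary.txt empty"}
    return {"score": 1.0, "reason": "artifact present and non-empty"}
```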
Good tasks in this benchmark generally:
- Operate on a real workspace
- Require concrete artifacts
- Are auditable by code
- Distinguish between frameworks in meaningful ways
- Reflect realistic agent workflows
Harness-Bench is especially useful when you want to compare agent frameworks under realistic operating conditions rather than isolated prompting conditions.
Compared with purely answer-based benchmarks, it gives you a better view of:
- Tool-use reliability
- Workspace discipline
- Long-horizon execution behavior
- Safety behavior under adversarial or constrained tasks
- Engineering usefulness, not just reasoning quality
This repository contains local-environment assumptions in a few places:
- Some model entries point to local config paths.
- `config/app.yaml` may already be tuned for a specific local output directory.
- Some framework CLIs must already be installed on the machine.
Before sharing or open-sourcing the repo broadly, it is a good idea to review:
- local absolute paths
- machine-specific config files
- private API-bearing config files
- output directories under `data/`
Add your preferred license and release policy here before external publication.