feat: add Verilog/SystemVerilog language support with VerilogEval and RT-LLM Benchmark #137
Tyronita wants to merge 20 commits into
… CVDP benchmarks
Add first-class Verilog support to ShinkaEvolve's evolutionary code optimization framework, targeting NVIDIA's CVDP benchmark (302 problems, 34% best SOTA) as an unsaturated evolution target.
Core:
- Language definitions (aliases, extensions, fences, EVOLVE-BLOCK markers)
- iverilog -t null -g2012 syntax validation in async_apply.py
- requires_docker pytest marker + CI exclusion
Examples:
- examples/verilog_eval/: local iverilog-based eval with continuous scoring
- examples/cvdp/: Docker/CocoTB eval with pytest output parsing, RTL volume injection, template substitution, and parallel-safe project names
Tests:
- test_edit_verilog.py: diff/full patch and language helper tests
- test_verilog_eval_ci.py: iverilog-based evaluator tests (CI-friendly)
- test_cvdp_evaluator.py: CVDP evaluator unit tests (no Docker needed)
Docs:
- Benchmark survey (6 benchmarks), leaderboards, compute estimates
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
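As a rough illustration of the syntax-validation step mentioned in this commit, the sketch below shells out to `iverilog -t null -g2012`, which parses and elaborates the source without generating code. The helper names `iverilog_syntax_cmd` and `check_verilog_syntax` are hypothetical and are not taken from `async_apply.py`; returning a failure when `iverilog` is missing is also an assumption.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def iverilog_syntax_cmd(source_path: str) -> list[str]:
    """Build a syntax-only iverilog command.

    -t null  : run the front end but emit no output file
    -g2012   : accept SystemVerilog (IEEE 1800-2012) syntax
    """
    return ["iverilog", "-t", "null", "-g2012", source_path]


def check_verilog_syntax(code: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Return (ok, message) for a candidate source string.

    Hypothetical helper for illustration only; the PR's real check may differ.
    """
    if shutil.which("iverilog") is None:
        # Assumption: treat a missing tool as a failed check with a clear message.
        return False, "iverilog not found on PATH"
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.sv"
        src.write_text(code, encoding="utf-8")
        proc = subprocess.run(
            iverilog_syntax_cmd(str(src)),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    return proc.returncode == 0, proc.stderr.strip()
```

Calling `check_verilog_syntax("module m; endmodule")` would give a quick accept/reject signal before a patch is handed to the heavier evaluator.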
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add run_all.py for VerilogEval (156 problems) and CVDP (302 problems) with parallel ProcessPoolExecutor, per-problem seed generation, and CLI for workers/generations/model selection
- Add download_dataset.py to fetch VerilogEval v2 from HuggingFace
- Update evaluate.py with JSONL problem resolution (backward compatible with external dataset directory)
- Fix Azure endpoint construction: remove /openai/v1/ suffix that was doubled by the AzureOpenAI SDK
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
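A minimal sketch of the batch-runner idea follows: each problem runs in its own child process with its own environment variables. It is not the PR's `run_all.py`. It swaps in `ThreadPoolExecutor` for the `ProcessPoolExecutor` named above, since each worker only waits on a subprocess, and the environment variable names (`PROBLEM_ID`, `NUM_GENERATIONS`) and the stand-in child command are assumptions.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_problem(problem_id: str, generations: int) -> tuple[str, int, str]:
    """Run one problem in an isolated child process with its own env vars."""
    # Hypothetical variable names; a real runner would define its own contract.
    env = dict(os.environ, PROBLEM_ID=problem_id, NUM_GENERATIONS=str(generations))
    # Stand-in child: a real runner would launch the per-problem evolution script.
    child = "import os; print(os.environ['PROBLEM_ID'], os.environ['NUM_GENERATIONS'])"
    proc = subprocess.run(
        [sys.executable, "-c", child],
        env=env,
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # never crash on undecodable tool output
        timeout=60,
    )
    return problem_id, proc.returncode, proc.stdout.strip()


def run_all(problem_ids: list[str], workers: int = 4, generations: int = 30):
    """Fan problems out over a small pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pid: run_problem(pid, generations), problem_ids))
```

Because state travels only through the child's environment, one problem's configuration cannot leak into another run.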
…_all.py
results_dir is a param on EvolutionConfig, not ShinkaEvolveRunner. Also resolve paths at script generation time rather than embedding os.path.abspath() calls that reference the wrong machine's paths.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set llm_models explicitly in run_evo.py for both benchmarks (azure-gpt-4-1-mini default, avoids falling back to non-Azure models)
- Update run_all.py defaults to multi-model mix: gpt-4.1-mini + gpt-5-codex + deepseek-v4-flash
- Update VerilogEval README with JSONL download workflow
- Avoid reasoning models (o4-mini) in defaults since they don't support the temperature parameter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CVDP dataset is distributed as JSONL files directly (not zipped). Updated to download from nvidia/cvdp-benchmark-dataset v1.1.0.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CVDP problems use 25 different service names (direct, 07-new-tb, 1-complete-rtl, etc.). The evaluator was hardcoded to `direct`, causing all non-direct problems to fail with a "no such service" error. It now runs `docker compose config --services` to detect the actual service name at runtime.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
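The runtime detection described in this commit could look roughly like the sketch below. `docker compose config --services` prints one service name per line; the function names and the policy of preferring `direct` when present (otherwise taking the first listed service) are assumptions, not the evaluator's confirmed behaviour.

```python
import subprocess


def parse_services(config_output: str) -> list[str]:
    """Split the output of `docker compose config --services` into names."""
    return [line.strip() for line in config_output.splitlines() if line.strip()]


def pick_service(services: list[str], preferred: str = "direct") -> str:
    """Choose a service: the preferred one if present, else the first listed.

    This selection policy is an assumption made for the sketch.
    """
    if not services:
        raise ValueError("compose file defines no services")
    return preferred if preferred in services else services[0]


def detect_service(compose_dir: str, preferred: str = "direct") -> str:
    """Ask Docker Compose which services the problem's compose file defines."""
    proc = subprocess.run(
        ["docker", "compose", "config", "--services"],
        cwd=compose_dir,
        capture_output=True,
        text=True,
        check=True,
        timeout=60,
    )
    return pick_service(parse_services(proc.stdout), preferred)
```

Keeping the parsing and selection separate from the Docker call means both can be unit-tested without Docker installed.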
Each CVDP problem has a unique module name and port interface. The previous approach used a static LFSR initial.sv for all 302 problems, causing the LLM to generate wrong modules that fail compilation. Now generates per-problem seeds with correct module names and passes the full problem specification to the LLM via task_sys_msg so it knows what to implement.
Also fix default models: gpt-5-codex -> gpt-5-4-mini (temperature support).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Thanks for the work here and for your interest in extending ShinkaEvolve to Verilog/SystemVerilog. The direction is useful, and the evaluator examples are a good starting point. I found a few issues that should be addressed before merge:
One question: were both examples run and validated end-to-end, including at least one full
I did run the targeted tests with:
That passed locally: 9 passed, 4 skipped. Mypy also passed for the CI command. The current blocker is ruff plus the CVDP path/compose trust issue. |
…runners)
Squashed review fixes for PR SakanaAI#137:
- Remove unused imports flagged by ruff (json/re in test_cvdp_evaluator, sys in test_verilog_eval_ci) and an f-string-without-placeholder in download_dataset.py.
- Harden the CVDP evaluator trust boundary: _safe_workspace_path() now rejects absolute paths and ".." traversal for every harness.files key and the VERILOG_SOURCES-derived RTL path, with unit tests.
- Stop tracking VERIFICATION.md and the planning/survey docs; they remain on disk but are git-ignored via the repo .gitignore.
- Decode batch-runner subprocess output as UTF-8 (errors="replace") so the Windows cp1252 locale codec no longer crashes the reader thread.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
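To make the trust-boundary fix concrete, here is one way a path check of this kind might be written. It is a sketch of the idea only: the function below is not the PR's `_safe_workspace_path()`, and its exact rules (rejecting Windows drive letters, re-checking containment after `resolve()`) are assumptions.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def safe_workspace_path(workspace: Path, relative: str) -> Path:
    """Resolve an untrusted relative path, refusing anything outside workspace."""
    posix = PurePosixPath(relative.replace("\\", "/"))
    win = PureWindowsPath(relative)
    # Reject POSIX absolute paths, Windows drive letters, and rooted paths.
    if posix.is_absolute() or win.drive or win.root:
        raise ValueError(f"absolute path not allowed: {relative!r}")
    # Reject any ".." component, even if it would stay inside the workspace.
    if ".." in posix.parts:
        raise ValueError(f"path traversal not allowed: {relative!r}")
    root = workspace.resolve()
    target = (root / Path(*posix.parts)).resolve()
    # Defence in depth: symlinks could still escape, so re-check containment.
    if not target.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {relative!r}")
    return target
```

A caller would run every dataset-supplied file key through this check before writing anything to disk, so a malicious or malformed dataset entry cannot write outside the per-problem directory.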
Adds examples/rtllm: evolve Verilog RTL for area+delay under a FIXED
functional spec (RTLLM v2.0 designs), a "p_speedup" task for hardware.
Evaluator (evaluate.py) gates each candidate with:
1. iverilog + RTLLM's own testbench (open mirror of Synopsys VCS)
2. yosys SAT formal combinational equivalence vs the reference
(open analog of Formality; closes the testbench-overfitting
reward-hacking hole -- a design that passes the finite testbench
but is not equivalent scores 0)
3. yosys -> ABC AIG: area = cell count, delay = logic depth
(license-free, deterministic proxy for Design Compiler area/WNS)
Fitness = 100 * sqrt(area_ref/area_cand * depth_ref/depth_cand), so the
RTLLM human reference scores exactly 100 and beating it scores > 100.
Candidates are compared to the reference on the IDENTICAL yosys flow to
neutralize tool-dependence (Synopsys tools are unavailable on CPU).
Tooling: iverilog (native) + yosys (native or hdlc/yosys docker). The
JSONL and seeds are generated locally from an RTLLM clone via
extract_dataset.py and are git-ignored (we do not redistribute RTLLM's
files). run_evo.py / run_all.py drive single / parallel evolution and
force this fork onto sys.path + load the repo .env (dual-checkout safe).
Removes the CVDP example and its test per project direction.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
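The gated two-axis fitness described in this commit message can be sketched as follows. The formula is the one stated above; the `Ppa` dataclass, the boolean gate arguments, and the zero score for non-positive metrics are illustrative assumptions rather than the evaluator's actual interface.

```python
import math
from dataclasses import dataclass


@dataclass
class Ppa:
    area: float   # cell count from the yosys/ABC flow
    depth: float  # logic depth, used as the delay proxy


def fitness(passes_testbench: bool, is_equivalent: bool, ref: Ppa, cand: Ppa) -> float:
    """Score a candidate against the reference measured on the same flow.

    A candidate that fails either gate scores 0, which is what closes the
    testbench-overfitting hole: passing simulation alone is not enough.
    """
    if not (passes_testbench and is_equivalent):
        return 0.0
    if cand.area <= 0 or cand.depth <= 0:
        return 0.0  # assumption: degenerate synthesis results are rejected
    return 100.0 * math.sqrt((ref.area / cand.area) * (ref.depth / cand.depth))
```

With this shape the reference scores exactly 100 against itself, and halving the area at equal depth gives about 141.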
|
Thanks again for putting this together. We would love to support Verilog/SystemVerilog as first-class ShinkaEvolve languages. The core language support pieces look directionally useful. That said, we do not want to overload the main repository with a large amount of benchmark-specific logic. The VerilogEval/CVDP runners, dataset downloaders, full-sweep orchestration, planning docs, and benchmark harness code make this PR much larger than the language-support change itself. Could you clean up the PR so it focuses on the minimal reusable Verilog/SystemVerilog support in ShinkaEvolve? Ideally that would include:
The larger VerilogEval/CVDP benchmark logic may be better kept in a separate example repository or follow-up contribution once the core language support lands. |
…ADME; drop verilog_eval
- evaluate.py now scores power via OpenSTA alongside area/depth (Yosys), matching the 3-axis geomean fitness; correctness held by equivalence to the reference
- README rewritten for the single-design example (tools, formula, setup, run)
- remove the verilog_eval benchmark example + its CI test (kept out per review; benchmark sweep logic to live in a separate repo)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…one example
- revert shinka/llm/client.py (Azure endpoint change, unrelated to Verilog)
- revert .github/workflows/ci.yml + pyproject.toml (requires_docker marker was only used by the removed CVDP test)
- fix the root README examples table: the deleted VerilogEval/CVDP rows now point to the single RTLLM PPA example
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bundle one RTLLM design (adder_8bit) as initial.sv + example.jsonl (RTLLM v2.0, MIT, attributed) so `python run_evo.py` runs out of the box, with no external clone needed for the example.
- run_evo.py: match the sibling-example template (argparse; drop the sys.path/dotenv hacks, since shinka loads .env at import, and the overnight-driver env block).
- extract_dataset.py: decouple from the dropped verify_all.py; emit problems/rtllm.jsonl for the full set.
- README: out-of-the-box run + one-line PDK fetch (proprietary Nangate45, not vendored) + extract-for-full-set command + RTLLM MIT attribution.
- shinka.yaml: single clean example config.
- remove run_all.py (inline-script sweep harness).
- revert root .gitignore (stop referencing dropped CVDP/VerilogEval docs).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- default RTLLM_PROBLEM_FILE -> example.jsonl (the bundled design), so a bare `python evaluate.py --program_path initial.sv` is reproducible out of the box.
- collapse the scoring env-knobs (RTLLM_SCORE_MEAN / RTLLM_SCORE_AXES + the dual ppa2/ppa3 scalars) to the one documented formula: 100 x geomean(area, logic-depth, power). Same combined_score, fewer knobs.
- drop the unused parameter on _ref_cache_path().
- trim the module docstring: drop the cross-benchmark RTL-OPT / Synopsys Formality/VCS framing and fix the fitness formula to the real 3-axis geomean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
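For reference, the single documented formula, 100 x geomean(area, logic-depth, power), might be computed as in the sketch below. It assumes each axis enters as a reference-to-candidate ratio (so smaller candidate values score higher), by analogy with the earlier two-axis formula; the dictionary keys and function names are hypothetical.

```python
import math
from typing import Iterable


def geomean(values: Iterable[float]) -> float:
    """Geometric mean via logs, which avoids overflow on long products."""
    vals = list(values)
    if not vals or any(v <= 0 for v in vals):
        raise ValueError("geomean needs at least one value, all positive")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


def combined_score(ref: dict, cand: dict, axes=("area", "depth", "power")) -> float:
    """100 x geomean of per-axis ref/cand ratios (assumed orientation)."""
    return 100.0 * geomean(ref[axis] / cand[axis] for axis in axes)
```

A geometric mean keeps the axes on equal footing: a 2x gain on one axis and a 2x loss on another cancel exactly, whatever the units of each metric.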
Our branch's table had diverged and re-added the maintainer's go_collatz_steps row; the only intended README change is the new RTLLM PPA example row.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…line
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Good Morning Tokyo! @RobertTLange
For the 45 of the 50 problems listed in RTLLM that can be synthesized with open-source EDA tools, we have 50 generations of results each:
(WIP) results will be published here:
https://github.com/Tyronita/RTLLM-ShinkaEvolve-results/
I noticed some reward hacking due to EVOLVE-BLOCK placement, so I'm re-running the affected reference designs.
HF Dataset Published for traces:
https://huggingface.co/datasets/EvanOLeary/rtllm-shinka-evolve
(Thank you for the CUDA-Engineer traces. I post-trained a small LM and came 2nd in an NVIDIA hackathon using Sakana AI's datasets: https://www.linkedin.com/feed/update/urn:li:activity:7467348291291291648/)
The rest of the PR should be good to review :)
Best,
Evan
RTLLM pins the module name and all I/O signals (name AND width) as part of the spec. Previously the EVOLVE-BLOCK wrapped the whole module, so a candidate could change the port interface (e.g. narrow a RAM address bus) and still pass a finite testbench, which is not a valid optimisation. Now the module header / port declaration is emitted OUTSIDE the editable region; only the implementation evolves. The prompt states the interface is fixed. The bundled adder_8bit seed is regenerated this way (still scores 100).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
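The placement fix can be pictured with a small sketch that keeps the module header outside the editable region. It assumes an ANSI-style header that ends at the first `);` and contains no comments, and it assumes the marker spellings `// EVOLVE-BLOCK-START` and `// EVOLVE-BLOCK-END`. The function `wrap_seed` here is illustrative and is not the PR's `_wrap_seed`.

```python
START = "// EVOLVE-BLOCK-START"  # assumed marker spelling
END = "// EVOLVE-BLOCK-END"      # assumed marker spelling


def wrap_seed(source: str) -> str:
    """Wrap only the module body in evolve markers; header stays frozen.

    Assumes a comment-free ANSI port list terminated by the first ');'.
    """
    header_end = source.index(");") + 2       # end of the port declaration
    tail_start = source.rindex("endmodule")   # start of the closing keyword
    header = source[:header_end]
    body = source[header_end:tail_start].strip("\n")
    tail = source[tail_start:]
    return f"{header}\n{START}\n{body}\n{END}\n{tail}"
```

Applied to a toy module, the port list ends up above the start marker and only the `assign` or `always` logic sits between the markers, so an edit cannot narrow or rename a port.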
…er comments
_wrap_seed now masks comments before locating the module header (so a ';' or ')' inside a port comment no longer fools it) and also freezes the input/output/inout declarations of non-ANSI modules. All 45 RTLLM seeds now freeze the full interface (name + width) and still score 100 on the evaluator.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
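One way to implement the comment masking described here is to blank out comment characters while preserving string length, so that offsets found in the masked text remain valid in the original. The sketch below is an assumption about the approach, not the PR's `_wrap_seed` code, and it does not handle `//` inside string literals.

```python
import re

# Line comments and block comments; DOTALL lets block comments span lines.
_COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def mask_comments(source: str) -> str:
    """Replace every comment character with a space, keeping newlines.

    The result has the same length as the input, so indices carry over.
    """
    def blank(match):
        return re.sub(r"[^\n]", " ", match.group(0))

    return _COMMENT.sub(blank, source)


def header_end(source: str) -> int:
    """Index just past the ');' that closes the port list, ignoring comments."""
    return mask_comments(source).index(");") + 2
```

Because the mask is length-preserving, the header can be sliced from the original text (comments intact) using an index computed on the masked copy.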
|
Hey @RobertTLange, the PR's trimmed to the minimal Verilog/SystemVerilog support + one self-contained example, and it's green locally (ruff / mypy / pytest). The full PPA results live in a gist so they don't clutter the PR:
→ Results: https://gist.github.com/Tyronita/7e4c6959fc8609fa0d07f5663ebdfeb8
30-second summary
27 of 45 in-scope RTLLM designs beat the human reference (PPA =
Sidenote: the results and all their tooling are kept entirely out of this PR; it's just the language support + the one example. |
- evaluate.py: remove the unused equiv_induct equivalence path (kept one bounded miter), drop the dead native/no-timeout branches in _yosys_argv, hard-code the bounded-check params (no hidden env knobs), and tighten the verification, scoring, and feedback comments (-42 lines).
- run_evo.py: drop the anti-anchor paragraph that duplicated task_sys_msg.
- README: replace the stale RTLLM_SCORE_AXES row with RTLLM_POWER (the real knob); drop the specific result claim and the design-count aside.
- shinka.yaml: drop the benchmark-result comment.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Adds first-class Verilog/SystemVerilog language support to ShinkaEvolve, with two hardware design benchmarks:
- `iverilog` for fast local compilation + simulation, continuous scoring via mismatch ratio
Both benchmarks support single-problem quick-start (`run_evo.py`) and full-sweep parallel execution (`run_all.py` with ProcessPoolExecutor). The UCB bandit dynamically allocates across a configurable model mix.
Core changes (`shinka/`)
- `shinka/utils/languages.py`: add `"verilog"` language with `.sv` extension and `// EVOLVE-BLOCK` markers
- `shinka/edit/async_apply.py`: map Verilog to the same comment-style edit as C/Rust
- `shinka/llm/client.py`: fix Azure endpoint construction (was doubling `/openai/` path)
New examples
- `examples/verilog_eval/`: `iverilog -g2012` + `vvp`
- `examples/cvdp/`
Each example includes:
- `evaluate.py`: ShinkaEvolve evaluator contract (compile → simulate → score)
- `run_evo.py`: single-problem quick-start
- `run_all.py`: parallel batch runner for the full benchmark
- `download_dataset.py`: fetch dataset from HuggingFace
- `initial.sv`: default seed module
- `README.md`: setup instructions and problem taxonomy
Architecture decisions
- VerilogEval uses `(1 - mismatches/total_samples) * 100`, CVDP uses `(passed_tests/total_tests) * 100`. Both give the LLM gradient signal even on partial solutions.
- `run_all.py` spawns each problem as a subprocess with its own env vars, avoiding state leakage between evolution runs.
- `azure-gpt-4-1-mini + azure-gpt-5-codex + azure-deepseek-v4-flash` for the UCB bandit. Reasoning models (o4-mini) excluded from defaults since they don't support the `temperature` parameter.
Benchmark Results (in progress)
Running 30-generation sweeps on an Azure A100 VM (24 vCPU, 216GB RAM):
Early results (first ~10 VerilogEval problems): 100% solve rate on easy problems (zero, wire, not gate, popcount3), partial scores on medium problems.
Full results will be published to a separate results repository once the sweep completes.
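For readers comparing scores across the two benchmarks, the continuous-scoring formulas from the architecture notes can be written out as below. The two formulas come from the description; the function names, the zero score for empty inputs, and the regex-based reading of a pytest summary line are assumptions made for this sketch.

```python
import re


def verilog_eval_score(mismatches: int, total_samples: int) -> float:
    """(1 - mismatches/total_samples) * 100; 0 if nothing was simulated."""
    if total_samples <= 0:
        return 0.0
    return (1 - mismatches / total_samples) * 100


def cvdp_score(passed_tests: int, total_tests: int) -> float:
    """(passed_tests/total_tests) * 100; 0 if no tests ran."""
    if total_tests <= 0:
        return 0.0
    return passed_tests / total_tests * 100


def parse_pytest_summary(output: str) -> tuple[int, int]:
    """Extract (passed, total) from a pytest summary such as
    '=== 3 passed, 1 failed in 2.10s ==='. Hypothetical parser."""
    counts = {}
    patterns = (
        ("passed", r"(\d+) passed"),
        ("failed", r"(\d+) failed"),
        ("errors", r"(\d+) errors?"),
    )
    for key, pattern in patterns:
        match = re.search(pattern, output)
        counts[key] = int(match.group(1)) if match else 0
    return counts["passed"], sum(counts.values())
```

Both scores land on the same 0 to 100 scale, so a partially correct candidate still receives a graded signal instead of a flat failure.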
Test plan
- `pytest tests/test_edit_verilog.py`: Verilog marker detection and edit application
🤖 Generated with Claude Code