feat: add Verilog/SystemVerilog language support with VerilogEval and RT-LLM Benchmark by Tyronita · Pull Request #137 · SakanaAI/ShinkaEvolve

Tyronita · 2026-05-26T00:41:16Z

Summary

Adds first-class Verilog/SystemVerilog language support to ShinkaEvolve, with two hardware design benchmarks:

VerilogEval (156 HDLBits problems) — uses iverilog for fast local compilation + simulation, continuous scoring via mismatch ratio
CVDP (302 NVIDIA problems) — uses Docker/CocoTB for pytest-based evaluation, continuous scoring via pass ratio

Both benchmarks support single-problem quick-start (run_evo.py) and full-sweep parallel execution (run_all.py with ProcessPoolExecutor). The UCB bandit dynamically allocates across a configurable model mix.
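The per-problem isolation used by the full-sweep runner can be sketched as below. This is a minimal illustration, not the code in run_all.py: the helper name `run_problem` and the `PROBLEM_ID` variable are hypothetical, and the sketch only shows one child process receiving its own copy of the environment.

```python
import os
import subprocess
import sys


def run_problem(problem_id: str, argv: list[str], timeout_s: float = 600.0) -> tuple[int, str]:
    """Run one problem in a child process with a private copy of the environment."""
    env = os.environ.copy()         # copy, so the parent's environment is never mutated
    env["PROBLEM_ID"] = problem_id  # hypothetical variable name, for illustration only
    proc = subprocess.run(
        argv,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # interleave stderr into the captured log
        encoding="utf-8",
        errors="replace",           # never crash on undecodable bytes
        timeout=timeout_s,
        check=False,                # a failing problem should not abort the sweep
    )
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    child = [sys.executable, "-c", "import os; print(os.environ['PROBLEM_ID'])"]
    print(run_problem("zero", child))
```

A pool (the PR names ProcessPoolExecutor) could then map a function like this over problem IDs; because each child gets its own environment dictionary, settings for one problem cannot leak into another.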

Core changes (`shinka/`)

shinka/utils/languages.py — add "verilog" language with .sv extension and // EVOLVE-BLOCK markers
shinka/edit/async_apply.py — map Verilog to the same comment-style edit as C/Rust
shinka/llm/client.py — fix Azure endpoint construction (was doubling /openai/ path)
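As a rough illustration of the marker convention, the sketch below locates a `//`-commented editable region in a `.sv` source. The exact marker spelling (`EVOLVE-BLOCK-START` / `EVOLVE-BLOCK-END`) and the helper `extract_evolve_blocks` are assumptions made for this example; they are not taken from shinka/utils/languages.py.

```python
import re

# Assumed marker spelling: the PR only says Verilog uses "//"-style
# EVOLVE-BLOCK markers, so the START/END suffixes here are illustrative.
START = "// EVOLVE-BLOCK-START"
END = "// EVOLVE-BLOCK-END"

SEED = """\
module top_module(input a, input b, output out);
// EVOLVE-BLOCK-START
    assign out = a & b;
// EVOLVE-BLOCK-END
endmodule
"""


def extract_evolve_blocks(source: str) -> list[str]:
    """Return the text between each START/END marker pair (hypothetical helper)."""
    pattern = re.compile(
        re.escape(START) + r"[^\n]*\n(.*?)^[ \t]*" + re.escape(END),
        re.DOTALL | re.MULTILINE,
    )
    return [match.group(1) for match in pattern.finditer(source)]


if __name__ == "__main__":
    for block in extract_evolve_blocks(SEED):
        print(block, end="")
```

Only the text inside the markers would be offered to the LLM for editing; the module header and `endmodule` stay outside the region in this seed.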

New examples

Directory	Problems	Evaluator	Requirements
`examples/verilog_eval/`	156 (HuggingFace download)	`iverilog -g2012` + `vvp`	iverilog v11+
`examples/cvdp/`	302 (HuggingFace download)	Docker + CocoTB pytest	Docker

Each example includes:

evaluate.py — ShinkaEvolve evaluator contract (compile → simulate → score)
run_evo.py — single-problem quick-start
run_all.py — parallel batch runner for the full benchmark
download_dataset.py — fetch dataset from HuggingFace
initial.sv — default seed module
README.md — setup instructions and problem taxonomy

Architecture decisions

Continuous scoring: VerilogEval uses (1 - mismatches/total_samples) * 100, CVDP uses (passed_tests/total_tests) * 100. Both give the LLM gradient signal even on partial solutions.
Per-problem isolation: run_all.py spawns each problem as a subprocess with its own env vars, avoiding state leakage between evolution runs.
Self-contained JSONL: Both benchmarks embed testbenches and references directly in the JSONL, eliminating external repo dependencies.
Model mix: Defaults to azure-gpt-4-1-mini + azure-gpt-5-codex + azure-deepseek-v4-flash for the UCB bandit. Reasoning models (o4-mini) excluded from defaults since they don't support the temperature parameter.
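The two continuous-scoring formulas above can be written out as a short sketch. The function names are hypothetical, and returning 0 when there are no samples or tests is an assumption of this example rather than documented evaluator behavior.

```python
def verilog_eval_score(mismatches: int, total_samples: int) -> float:
    """(1 - mismatches/total_samples) * 100, as stated for VerilogEval."""
    if total_samples <= 0:
        return 0.0  # assumption: an empty simulation earns no credit
    ratio = min(max(mismatches / total_samples, 0.0), 1.0)  # clamp to [0, 1]
    return (1.0 - ratio) * 100.0


def cvdp_score(passed_tests: int, total_tests: int) -> float:
    """(passed_tests/total_tests) * 100, as stated for CVDP."""
    if total_tests <= 0:
        return 0.0  # assumption: no collected tests earns no credit
    return passed_tests / total_tests * 100.0


if __name__ == "__main__":
    print(verilog_eval_score(25, 100))  # a quarter of samples mismatch
    print(cvdp_score(3, 4))             # three of four tests pass
```

Because both scores change smoothly as a candidate fixes more samples or tests, a partially correct design is still ranked above one that fails everywhere.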

Benchmark Results (in progress)

Running 30-generation sweeps on an Azure A100 VM (24 vCPU, 216GB RAM):

VerilogEval: 156 problems × 30 gens × 3 models, 4 parallel workers
CVDP: 302 problems × 30 gens × 3 models, 2 parallel workers

Early results (first ~10 VerilogEval problems): 100% solve rate on easy problems (zero, wire, not gate, popcount3), partial scores on medium problems.

Full results will be published to a separate results repository once the sweep completes.

Test plan

pytest tests/test_edit_verilog.py — Verilog marker detection and edit application
End-to-end VerilogEval: download → seed generation → evaluation → evolution (tested locally and on VM)
End-to-end CVDP: Docker build → evaluation → evolution (tested on VM)
Multi-model bandit selection working (gpt-4.1-mini, gpt-5-codex, deepseek-v4-flash)
Full benchmark sweeps running on Azure A100 VM (156 + 302 problems)
Collect final benchmark statistics after sweeps complete

🤖 Generated with Claude Code

… CVDP benchmarks

Add first-class Verilog support to ShinkaEvolve's evolutionary code optimization framework, targeting NVIDIA's CVDP benchmark (302 problems, 34% best SOTA) as an unsaturated evolution target.

Core:
- Language definitions (aliases, extensions, fences, EVOLVE-BLOCK markers)
- iverilog -t null -g2012 syntax validation in async_apply.py
- requires_docker pytest marker + CI exclusion

Examples:
- examples/verilog_eval/: local iverilog-based eval with continuous scoring
- examples/cvdp/: Docker/CocoTB eval with pytest output parsing, RTL volume injection, template substitution, and parallel-safe project names

Tests:
- test_edit_verilog.py: diff/full patch and language helper tests
- test_verilog_eval_ci.py: iverilog-based evaluator tests (CI-friendly)
- test_cvdp_evaluator.py: CVDP evaluator unit tests (no Docker needed)

Docs:
- Benchmark survey (6 benchmarks), leaderboards, compute estimates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add run_all.py for VerilogEval (156 problems) and CVDP (302 problems) with parallel ProcessPoolExecutor, per-problem seed generation, and CLI for workers/generations/model selection
- Add download_dataset.py to fetch VerilogEval v2 from HuggingFace
- Update evaluate.py with JSONL problem resolution (backward compatible with external dataset directory)
- Fix Azure endpoint construction: remove /openai/v1/ suffix that was doubled by the AzureOpenAI SDK

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…_all.py results_dir is a param on EvolutionConfig, not ShinkaEvolveRunner. Also resolve paths at script generation time rather than embedding os.path.abspath() calls that reference the wrong machine's paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Set llm_models explicitly in run_evo.py for both benchmarks (azure-gpt-4-1-mini default, avoids falling back to non-Azure models)
- Update run_all.py defaults to multi-model mix: gpt-4.1-mini + gpt-5-codex + deepseek-v4-flash
- Update VerilogEval README with JSONL download workflow
- Avoid reasoning models (o4-mini) in defaults since they don't support the temperature parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The CVDP dataset is distributed as JSONL files directly (not zipped). Updated to download from nvidia/cvdp-benchmark-dataset v1.1.0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CVDP problems use 25 different service names (direct, 07-new-tb, 1-complete-rtl, etc). The evaluator was hardcoded to direct, causing all non-direct problems to fail with no such service. Now runs docker compose config --services to detect the actual service name at runtime. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
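A minimal sketch of the runtime service detection described in this commit follows. `docker compose config --services` prints one service name per line; the helper names and the selection policy (prefer `direct` when present, otherwise take the first listed service) are assumptions for illustration and may differ from the evaluator.

```python
import subprocess


def parse_services(stdout: str) -> list[str]:
    """Split `docker compose config --services` output into service names."""
    return [line.strip() for line in stdout.splitlines() if line.strip()]


def pick_service(services: list[str], preferred: str = "direct") -> str:
    """Assumed policy: use `preferred` if defined, else the first service."""
    if not services:
        raise RuntimeError("compose file defines no services")
    return preferred if preferred in services else services[0]


def detect_service(compose_dir: str) -> str:
    """Ask Docker Compose which services the harness in `compose_dir` defines."""
    result = subprocess.run(
        ["docker", "compose", "config", "--services"],
        cwd=compose_dir,
        capture_output=True,
        text=True,
        check=True,  # surface compose-file errors instead of guessing a name
    )
    return pick_service(parse_services(result.stdout))
```

Keeping the parsing and selection separate from the Docker call means they can be unit-tested without Docker installed.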

Each CVDP problem has a unique module name and port interface. The previous approach used a static LFSR initial.sv for all 302 problems, causing the LLM to generate wrong modules that fail compilation. Now generates per-problem seeds with correct module names and passes the full problem specification to the LLM via task_sys_msg so it knows what to implement. Also fix default models: gpt-5-codex -> gpt-5-4-mini (temperature support). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

RobertTLange · 2026-06-09T09:13:12Z

Thanks for the work here and for your interest in extending ShinkaEvolve to Verilog/SystemVerilog. The direction is useful, and the evaluator examples are a good starting point.

I found a few issues that should be addressed before merge:

CI currently fails on ruff
- tests/test_cvdp_evaluator.py imports unused json and re.
- tests/test_verilog_eval_ci.py imports unused sys.
- This reproduces locally with:
  uv run python -m ruff check tests --exclude tests/file.py
The CVDP evaluator trusts paths from the JSONL harness too much
- examples/cvdp/evaluate.py writes problem["harness"]["files"] paths directly under the temp workspace, but absolute paths or .. entries could escape that workspace.
- _extract_rtl_path() also trusts VERILOG_SOURCES; if that path is absolute or traverses upward, the candidate RTL can be written outside the temp dir.
- Since the evaluator then runs the provided docker-compose.yml, this is a meaningful trust boundary. Please normalize/resolve all harness and RTL paths and require them to stay inside the temp workspace, or clearly restrict/document this evaluator as trusted official-dataset-only.
There are unwanted docs artifacts in this PR
- VERIFICATION.md looks like a temporary PR verification note and should not be committed.
- The added planning/survey docs also look broader than the implementation change and likely should be removed unless maintainers explicitly want them:
  docs/cvdp_integration_plan.md, docs/verilog_benchmarks_survey.md, docs/verilog_eval_benchmark.md.

One question: were both examples run and validated end-to-end, including at least one full examples/verilog_eval evolution/evaluation path and one full examples/cvdp Docker/CocoTB evaluation with the required CVDP image?

I did run the targeted tests with:
uv run python -m pytest -q tests/test_edit_verilog.py tests/test_cvdp_evaluator.py tests/test_verilog_eval_ci.py

That passed locally: 9 passed, 4 skipped. Mypy also passed for the CI command. The current blocker is ruff plus the CVDP path/compose trust issue.

…runners)

Squashed review fixes for PR SakanaAI#137:
- Remove unused imports flagged by ruff (json/re in test_cvdp_evaluator, sys in test_verilog_eval_ci) and an f-string-without-placeholder in download_dataset.py.
- Harden the CVDP evaluator trust boundary: _safe_workspace_path() now rejects absolute paths and ".." traversal for every harness.files key and the VERILOG_SOURCES-derived RTL path, with unit tests.
- Stop tracking VERIFICATION.md and the planning/survey docs; they remain on disk but are git-ignored via the repo .gitignore.
- Decode batch-runner subprocess output as UTF-8 (errors="replace") so the Windows cp1252 locale codec no longer crashes the reader thread.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
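The path check described in this commit could look roughly like the sketch below. It is not the PR's `_safe_workspace_path()`; the name `safe_workspace_path_sketch` and the choice to also reject empty paths are assumptions. It rejects absolute paths and `..` components up front, then confirms the resolved target still sits under the workspace.

```python
from pathlib import Path, PurePosixPath


def safe_workspace_path_sketch(workspace: Path, relative: str) -> Path:
    """Map an untrusted relative path to a location inside `workspace`, or raise."""
    rel = PurePosixPath(relative.replace("\\", "/"))
    if rel.is_absolute() or ".." in rel.parts or not rel.parts:
        raise ValueError(f"unsafe harness path: {relative!r}")
    root = workspace.resolve()
    target = root.joinpath(*rel.parts).resolve()
    # Second line of defence: symlinks or drive-letter paths could still
    # point outside the workspace after resolution (needs Python 3.9+).
    if not target.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {relative!r}")
    return target
```

Applying one function to every harness file key and to the RTL path derived from VERILOG_SOURCES keeps the trust boundary in a single place.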

Adds examples/rtllm: evolve Verilog RTL for area+delay under a FIXED functional spec (RTLLM v2.0 designs), a "p_speedup" task for hardware.

Evaluator (evaluate.py) gates each candidate with:
1. iverilog + RTLLM's own testbench (open mirror of Synopsys VCS)
2. yosys SAT formal combinational equivalence vs the reference (open analog of Formality; closes the testbench-overfitting reward-hacking hole -- a design that passes the finite testbench but is not equivalent scores 0)
3. yosys -> ABC AIG: area = cell count, delay = logic depth (license-free, deterministic proxy for Design Compiler area/WNS)

Fitness = 100 * sqrt(area_ref/area_cand * depth_ref/depth_cand), so the RTLLM human reference scores exactly 100 and beating it scores > 100. Candidates are compared to the reference on the IDENTICAL yosys flow to neutralize tool-dependence (Synopsys tools are unavailable on CPU).

Tooling: iverilog (native) + yosys (native or hdlc/yosys docker). The JSONL and seeds are generated locally from a RTLLM clone via extract_dataset.py and are git-ignored (we do not redistribute RTLLM's files). run_evo.py / run_all.py drive single / parallel evolution and force this fork onto sys.path + load the repo .env (dual-checkout safe).

Removes the CVDP example and its test per project direction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RobertTLange · 2026-06-16T08:25:49Z

Thanks again for putting this together.

We would love to support Verilog/SystemVerilog as first-class ShinkaEvolve languages. The core language support pieces look directionally useful.

That said, we do not want to overload the main repository with a large amount of benchmark-specific logic. The VerilogEval/CVDP runners, dataset downloaders, full-sweep orchestration, planning docs, and benchmark harness code make this PR much larger than the language-support change itself.

Could you clean up the PR so it focuses on the minimal reusable Verilog/SystemVerilog support in ShinkaEvolve? Ideally that would include:

core language/marker support
focused edit/apply tests
a small, self-contained example or smoke test if needed
removal of benchmark sweep infrastructure, temporary verification/planning docs, and unrelated changes

The larger VerilogEval/CVDP benchmark logic may be better kept in a separate example repository or follow-up contribution once the core language support lands.

…ADME; drop verilog_eval

- evaluate.py now scores power via OpenSTA alongside area/depth (Yosys), matching the 3-axis geomean fitness; correctness held by equivalence to the reference
- README rewritten for the single-design example (tools, formula, setup, run)
- remove the verilog_eval benchmark example + its CI test (kept out per review; benchmark sweep logic to live in a separate repo)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…one example

- revert shinka/llm/client.py (Azure endpoint change, unrelated to Verilog)
- revert .github/workflows/ci.yml + pyproject.toml (requires_docker marker was only used by the removed CVDP test)
- fix the root README examples table: the deleted VerilogEval/CVDP rows now point to the single RTLLM PPA example

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Bundle one RTLLM design (adder_8bit) as initial.sv + example.jsonl (RTLLM v2.0, MIT, attributed) so `python run_evo.py` runs out of the box, with no external clone needed for the example.

- run_evo.py: match the sibling-example template (argparse; drop the sys.path/dotenv hacks, since shinka loads .env at import, and the overnight-driver env block).
- extract_dataset.py: decouple from the dropped verify_all.py; emit problems/rtllm.jsonl for the full set.
- README: out-of-the-box run + one-line PDK fetch (proprietary Nangate45, not vendored) + extract-for-full-set command + RTLLM MIT attribution.
- shinka.yaml: single clean example config.
- remove run_all.py (inline-script sweep harness).
- revert root .gitignore (stop referencing dropped CVDP/VerilogEval docs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- default RTLLM_PROBLEM_FILE -> example.jsonl (the bundled design), so a bare `python evaluate.py --program_path initial.sv` is reproducible out of the box.
- collapse the scoring env-knobs (RTLLM_SCORE_MEAN / RTLLM_SCORE_AXES + the dual ppa2/ppa3 scalars) to the one documented formula: 100 x geomean(area, logic-depth, power). Same combined_score, fewer knobs.
- drop the unused parameter on _ref_cache_path().
- trim the module docstring: drop the cross-benchmark RTL-OPT / Synopsys Formality/VCS framing and fix the fitness formula to the real 3-axis geomean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
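The documented formula, 100 x geomean(area, logic-depth, power), can be sketched as follows. The function name `ppa_score` and the dictionary layout are hypothetical, and the ratio direction (reference over candidate per axis, so smaller candidates score higher) is inferred from the earlier two-axis formula 100 * sqrt(area_ref/area_cand * depth_ref/depth_cand). The correctness gates are not modelled here.

```python
import math


def ppa_score(reference: dict[str, float], candidate: dict[str, float]) -> float:
    """100 x geometric mean of per-axis reference/candidate ratios."""
    axes = sorted(reference)
    if not axes or set(axes) != set(candidate):
        raise ValueError("reference and candidate must share the same axes")
    log_sum = 0.0
    for axis in axes:
        ref, cand = reference[axis], candidate[axis]
        if ref <= 0 or cand <= 0:
            raise ValueError(f"axis {axis!r} must be positive")
        log_sum += math.log(ref / cand)  # summing logs avoids overflow in the product
    return 100.0 * math.exp(log_sum / len(axes))


if __name__ == "__main__":
    ref = {"area": 120.0, "depth": 10.0, "power": 4.0}
    print(ppa_score(ref, ref))                       # reference against itself
    print(ppa_score(ref, {**ref, "area": 60.0}))     # candidate with half the area
```

With only the area and depth axes, this reduces to the square-root form quoted earlier in the thread, so the reference design scores 100 under either variant.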

Our branch's table had diverged and re-added the maintainer's go_collatz_steps row; the only intended README change is the new RTLLM PPA example row. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…line Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Tyronita

Good Morning Tokyo! @RobertTLange

For the 45 of 50 problems listed in RT-LLM that can be synthesized with open-source EDA tools, we have 50 generations of results each:

(WIP) results will be published here:
https://github.com/Tyronita/RTLLM-ShinkaEvolve-results/

I noticed some reward hacking due to EVOLVE-BLOCK placement, so I'm re-running the affected reference designs.

HF Dataset Published for traces:
https://huggingface.co/datasets/EvanOLeary/rtllm-shinka-evolve

(Thank you for the CUDA-Engineer traces. I post-trained a small LM and came 2nd in an NVIDIA hackathon using Sakana AI's datasets: https://www.linkedin.com/feed/update/urn:li:activity:7467348291291291648/)

The rest of the PR should be good to review :)

Best,

Evan

RTLLM pins the module name and all I/O signals (name AND width) as part of the spec. Previously the EVOLVE-BLOCK wrapped the whole module, so a candidate could change the port interface (e.g. narrow a RAM address bus) and still pass a finite testbench — not a valid optimisation. Now the module header / port declaration is emitted OUTSIDE the editable region; only the implementation evolves. The prompt states the interface is fixed. The bundled adder_8bit seed is regenerated this way (still scores 100). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…er comments _wrap_seed now masks comments before locating the module header (so a ';' or ')' inside a port comment no longer fools it) and also freezes the input/output/inout declarations of non-ANSI modules. All 45 RTLLM seeds now freeze the full interface (name + width) and still score 100 on the evaluator. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
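The comment-masking step can be illustrated with the sketch below. It is not the PR's `_wrap_seed`; the helper names are hypothetical, string literals containing `//` are not handled, and the freezing of non-ANSI port declarations is omitted. Comments are overwritten with spaces so that character offsets in the masked text still line up with the original source.

```python
import re

# Line comments and block comments; DOTALL lets /* ... */ span lines.
_COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def mask_comments(src: str) -> str:
    """Blank out comment text while preserving newlines and string length."""
    return _COMMENT.sub(lambda m: re.sub(r"[^\n]", " ", m.group(0)), src)


def header_end(src: str) -> int:
    """Index just past the ';' that closes the first module header."""
    masked = mask_comments(src)
    start = re.search(r"\bmodule\b", masked)
    if start is None:
        raise ValueError("no module declaration found")
    semi = masked.find(";", start.start())
    if semi < 0:
        raise ValueError("module header is not terminated")
    return semi + 1
```

Everything up to `header_end(src)` would stay outside the editable region, so a `;` or `)` inside a port comment cannot cut the frozen header short.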

Tyronita · 2026-06-17T11:12:55Z

Hey @RobertTLange — the PR's trimmed to the minimal Verilog/SystemVerilog support + one self-contained example, and it's green locally (ruff / mypy / pytest). The full PPA results live in a gist so they don't clutter the PR:

→ Results: https://gist.github.com/Tyronita/7e4c6959fc8609fa0d07f5663ebdfeb8

30-second summary

27 of 45 in-scope RTLLM designs beat the human reference (PPA = 100 · geomean(area, depth, power); 100 = the reference). Correctness is held by a Yosys SAT equivalence gate + an interface freeze + a code-review pass — which caught and rejected 4 testbench-overfit "wins". The gist has per-design code diffs, a worked example, a 45-design graph collage, and a HuggingFace dataset of every candidate.

Sidenote: the results and all their tooling are kept entirely out of this PR — it's just the language support + the one example.

- evaluate.py: remove the unused equiv_induct equivalence path (kept one bounded miter), drop the dead native/no-timeout branches in _yosys_argv, hard-code the bounded-check params (no hidden env knobs), and tighten the verification, scoring, and feedback comments (-42 lines).
- run_evo.py: drop the anti-anchor paragraph that duplicated task_sys_msg.
- README: replace the stale RTLLM_SCORE_AXES row with RTLLM_POWER (the real knob); drop the specific result claim and the design-count aside.
- shinka.yaml: drop the benchmark-result comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Tyronita and others added 9 commits May 25, 2026 23:00

docs: add Verilog and Go examples to main README table

6f51302

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: update CVDP download_dataset.py to use correct HuggingFace URL

6b5eb3d

The CVDP dataset is distributed as JSONL files directly (not zipped). Updated to download from nvidia/cvdp-benchmark-dataset v1.1.0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: add .gitignore for generated datasets and results

50edf25

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Tyronita and others added 2 commits June 14, 2026 18:31

Tyronita and others added 6 commits June 16, 2026 23:17

docs: drop stray Go Collatz row from examples table

cddcf62

Our branch's table had diverged and re-added the maintainer's go_collatz_steps row; the only intended README change is the new RTLLM PPA example row. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix: repair examples-table row that merged when dropping the Collatz …

1e774d4

…line Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Tyronita and others added 2 commits June 17, 2026 02:33

Tyronita changed the title from ~~feat: add Verilog/SystemVerilog language support with VerilogEval and CVDP benchmarks~~ to feat: add Verilog/SystemVerilog language support with VerilogEval and RT-LLM Benchmark on Jun 21, 2026

feat: add Verilog/SystemVerilog language support with VerilogEval and RT-LLM Benchmark #137
Tyronita wants to merge 20 commits into SakanaAI:main from Tyronita:feat/verilog-support