
feat: add Verilog/SystemVerilog language support with VerilogEval and RT-LLM Benchmark #137

Open
Tyronita wants to merge 20 commits into SakanaAI:main from Tyronita:feat/verilog-support

@Tyronita


Summary

Adds first-class Verilog/SystemVerilog language support to ShinkaEvolve, with two hardware design benchmarks:

  • VerilogEval (156 HDLBits problems) — uses iverilog for fast local compilation + simulation, continuous scoring via mismatch ratio
  • CVDP (302 NVIDIA problems) — uses Docker/CocoTB for pytest-based evaluation, continuous scoring via pass ratio

Both benchmarks support single-problem quick-start (run_evo.py) and full-sweep parallel execution (run_all.py with ProcessPoolExecutor). The UCB bandit dynamically allocates across a configurable model mix.
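
For readers unfamiliar with UCB allocation, the following is a minimal sketch of textbook UCB1 arm selection over a model mix. It is not ShinkaEvolve's actual bandit code: the function name `ucb1_select`, the exploration constant `c`, and the placeholder model names are all assumptions made for illustration.

```python
import math


def ucb1_select(counts, mean_rewards, c=1.0):
    """Return the index of the arm (model) with the highest UCB1 score.

    counts[i]       -- how many times arm i has been sampled so far
    mean_rewards[i] -- running mean reward observed for arm i
    c               -- exploration weight (hypothetical knob)
    """
    # Sample every arm once before trusting its statistics.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    scores = [
        mean + c * math.sqrt(2.0 * math.log(total) / n)
        for mean, n in zip(mean_rewards, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Placeholder names, not the PR's model identifiers.
models = ["model-a", "model-b", "model-c"]
# model-c has only been tried once, so its exploration bonus dominates.
choice = models[ucb1_select([10, 10, 1], [0.5, 0.6, 0.4])]
```

The exploration bonus shrinks as an arm is sampled more often, so a model that keeps producing low scores is gradually sampled less without ever being excluded outright.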

Core changes (shinka/)

  • shinka/utils/languages.py — add "verilog" language with .sv extension and // EVOLVE-BLOCK markers
  • shinka/edit/async_apply.py — map Verilog to the same comment-style edit as C/Rust
  • shinka/llm/client.py — fix Azure endpoint construction (was doubling /openai/ path)
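
As an illustration of what `//`-style EVOLVE-BLOCK markers mean for a Verilog source, here is a rough sketch that locates the editable line ranges. The exact marker spellings (`EVOLVE-BLOCK-START` / `EVOLVE-BLOCK-END`) and the helper name `editable_regions` are assumptions; the real detection lives in `shinka/utils/languages.py` and the edit code, and may differ.

```python
import re

# Assumed marker spellings; only the "//" comment prefix is stated in the PR.
START = re.compile(r"^\s*//\s*EVOLVE-BLOCK-START\s*$")
END = re.compile(r"^\s*//\s*EVOLVE-BLOCK-END\s*$")


def editable_regions(source: str) -> list[tuple[int, int]]:
    """Return 1-indexed (first_line, last_line) ranges between marker pairs."""
    regions = []
    start = None
    for lineno, line in enumerate(source.splitlines(), start=1):
        if START.match(line):
            if start is not None:
                raise ValueError(f"nested start marker at line {lineno}")
            start = lineno
        elif END.match(line):
            if start is None:
                raise ValueError(f"end marker without start at line {lineno}")
            regions.append((start + 1, lineno - 1))
            start = None
    if start is not None:
        raise ValueError("unterminated EVOLVE-BLOCK")
    return regions


SAMPLE = """module top(input a, input b, output y);
// EVOLVE-BLOCK-START
  assign y = a & b;
// EVOLVE-BLOCK-END
endmodule
"""

regions = editable_regions(SAMPLE)
```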

New examples

| Directory | Problems | Evaluator | Requirements |
|---|---|---|---|
| examples/verilog_eval/ | 156 (HuggingFace download) | iverilog -g2012 + vvp | iverilog v11+ |
| examples/cvdp/ | 302 (HuggingFace download) | Docker + CocoTB pytest | Docker |

Each example includes:

  • evaluate.py — ShinkaEvolve evaluator contract (compile → simulate → score)
  • run_evo.py — single-problem quick-start
  • run_all.py — parallel batch runner for the full benchmark
  • download_dataset.py — fetch dataset from HuggingFace
  • initial.sv — default seed module
  • README.md — setup instructions and problem taxonomy

Architecture decisions

  • Continuous scoring: VerilogEval uses (1 - mismatches/total_samples) * 100, CVDP uses (passed_tests/total_tests) * 100. Both give the LLM gradient signal even on partial solutions.
  • Per-problem isolation: run_all.py spawns each problem as a subprocess with its own env vars, avoiding state leakage between evolution runs.
  • Self-contained JSONL: Both benchmarks embed testbenches and references directly in the JSONL, eliminating external repo dependencies.
  • Model mix: Defaults to azure-gpt-4-1-mini + azure-gpt-5-codex + azure-deepseek-v4-flash for the UCB bandit. Reasoning models (o4-mini) excluded from defaults since they don't support the temperature parameter.
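
The two continuous scoring formulas above can be written out as a short sketch. The function names and the zero-denominator guard (returning 0.0 when no samples or tests ran) are assumptions added for illustration, not necessarily what the evaluators do.

```python
def verilog_eval_score(mismatches: int, total_samples: int) -> float:
    """(1 - mismatches/total_samples) * 100, as stated in the PR description."""
    if total_samples <= 0:
        return 0.0  # assumed fallback: nothing simulated, no credit
    return (1 - mismatches / total_samples) * 100


def cvdp_score(passed_tests: int, total_tests: int) -> float:
    """(passed_tests/total_tests) * 100, as stated in the PR description."""
    if total_tests <= 0:
        return 0.0  # assumed fallback: no tests collected, no credit
    return (passed_tests / total_tests) * 100


# A candidate that mismatches on 25 of 100 samples still earns partial credit.
partial = verilog_eval_score(25, 100)
```

Because both scores vary smoothly between 0 and 100, a candidate that fixes one more sample or test is ranked above its parent even when it is not yet fully correct.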

Benchmark Results (in progress)

Running 30-generation sweeps on an Azure A100 VM (24 vCPU, 216GB RAM):

  • VerilogEval: 156 problems × 30 gens × 3 models, 4 parallel workers
  • CVDP: 302 problems × 30 gens × 3 models, 2 parallel workers

Early results (first ~10 VerilogEval problems): 100% solve rate on easy problems (zero, wire, not gate, popcount3), partial scores on medium problems.

Full results will be published to a separate results repository once the sweep completes.

Test plan

  • pytest tests/test_edit_verilog.py — Verilog marker detection and edit application
  • End-to-end VerilogEval: download → seed generation → evaluation → evolution (tested locally and on VM)
  • End-to-end CVDP: Docker build → evaluation → evolution (tested on VM)
  • Multi-model bandit selection working (gpt-4.1-mini, gpt-5-codex, deepseek-v4-flash)
  • Full benchmark sweeps running on Azure A100 VM (156 + 302 problems)
  • Collect final benchmark statistics after sweeps complete

🤖 Generated with Claude Code

Tyronita and others added 9 commits May 25, 2026 23:00
… CVDP benchmarks

Add first-class Verilog support to ShinkaEvolve's evolutionary code optimization
framework, targeting NVIDIA's CVDP benchmark (302 problems, 34% best SOTA) as an
unsaturated evolution target.

Core:
- Language definitions (aliases, extensions, fences, EVOLVE-BLOCK markers)
- iverilog -t null -g2012 syntax validation in async_apply.py
- requires_docker pytest marker + CI exclusion
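
A syntax-only check with `iverilog -t null -g2012` might look like the sketch below. The helper names, the temporary-file handling, and the choice to return `None` when `iverilog` is not installed are assumptions for illustration; the actual hook in `async_apply.py` may be structured differently.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def iverilog_check_cmd(path) -> list[str]:
    """Build the syntax-check command; '-t null' skips code generation."""
    return ["iverilog", "-t", "null", "-g2012", str(path)]


def syntax_ok(source: str, timeout: float = 30.0):
    """Return True/False for the syntax check, or None if iverilog is missing."""
    if shutil.which("iverilog") is None:
        return None  # assumed behavior: skip validation rather than fail
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.sv"
        path.write_text(source, encoding="utf-8")
        proc = subprocess.run(
            iverilog_check_cmd(path),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return proc.returncode == 0


result = syntax_ok("module m; endmodule\n")
```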

Examples:
- examples/verilog_eval/ — local iverilog-based eval with continuous scoring
- examples/cvdp/ — Docker/CocoTB eval with pytest output parsing, RTL volume
  injection, template substitution, and parallel-safe project names

Tests:
- test_edit_verilog.py — diff/full patch and language helper tests
- test_verilog_eval_ci.py — iverilog-based evaluator tests (CI-friendly)
- test_cvdp_evaluator.py — CVDP evaluator unit tests (no Docker needed)

Docs:
- Benchmark survey (6 benchmarks), leaderboards, compute estimates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add run_all.py for VerilogEval (156 problems) and CVDP (302 problems)
  with parallel ProcessPoolExecutor, per-problem seed generation, and
  CLI for workers/generations/model selection
- Add download_dataset.py to fetch VerilogEval v2 from HuggingFace
- Update evaluate.py with JSONL problem resolution (backward compatible
  with external dataset directory)
- Fix Azure endpoint construction: remove /openai/v1/ suffix that was
  doubled by the AzureOpenAI SDK

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_all.py

results_dir is a param on EvolutionConfig, not ShinkaEvolveRunner.
Also resolve paths at script generation time rather than embedding
os.path.abspath() calls that reference the wrong machine's paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set llm_models explicitly in run_evo.py for both benchmarks
  (azure-gpt-4-1-mini default, avoids falling back to non-Azure models)
- Update run_all.py defaults to multi-model mix:
  gpt-4.1-mini + gpt-5-codex + deepseek-v4-flash
- Update VerilogEval README with JSONL download workflow
- Avoid reasoning models (o4-mini) in defaults since they
  don't support the temperature parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CVDP dataset is distributed as JSONL files directly (not zipped).
Updated to download from nvidia/cvdp-benchmark-dataset v1.1.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CVDP problems use 25 different service names (direct, 07-new-tb,
1-complete-rtl, etc). The evaluator was hardcoded to direct,
causing all non-direct problems to fail with no such service.

Now runs docker compose config --services to detect the actual
service name at runtime.
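
A rough sketch of the runtime detection described above. `docker compose config --services` prints one service name per line; the helper names and the policy of taking the first listed service are assumptions, and the evaluator may choose differently when a compose file defines several services.

```python
import subprocess


def parse_services(output: str) -> list[str]:
    """Split the command output into non-empty service names."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def detect_service(compose_dir: str, timeout: float = 60.0) -> str:
    """Ask Docker Compose which services the harness defines."""
    proc = subprocess.run(
        ["docker", "compose", "config", "--services"],
        cwd=compose_dir,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    services = parse_services(proc.stdout)
    if not services:
        raise RuntimeError("compose file defines no services")
    return services[0]  # assumed policy: use the first listed service


names = parse_services("direct\n\n07-new-tb\n")
```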

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each CVDP problem has a unique module name and port interface. The
previous approach used a static LFSR initial.sv for all 302 problems,
causing the LLM to generate wrong modules that fail compilation.

Now generates per-problem seeds with correct module names and passes
the full problem specification to the LLM via task_sys_msg so it
knows what to implement.

Also fix default models: gpt-5-codex -> gpt-5-4-mini (temperature support).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@RobertTLange

Collaborator

Thanks for the work here and for your interest in extending ShinkaEvolve to Verilog/SystemVerilog. The direction is useful, and the evaluator examples are a good starting point.

I found a few issues that should be addressed before merge:

  1. CI currently fails on ruff

    • tests/test_cvdp_evaluator.py imports unused json and re.
    • tests/test_verilog_eval_ci.py imports unused sys.
    • This reproduces locally with:
      uv run python -m ruff check tests --exclude tests/file.py
  2. The CVDP evaluator trusts paths from the JSONL harness too much

    • examples/cvdp/evaluate.py writes problem["harness"]["files"] paths directly under the temp workspace, but absolute paths or .. entries could escape that workspace.
    • _extract_rtl_path() also trusts VERILOG_SOURCES; if that path is absolute or traverses upward, the candidate RTL can be written outside the temp dir.
    • Since the evaluator then runs the provided docker-compose.yml, this is a meaningful trust boundary. Please normalize/resolve all harness and RTL paths and require them to stay inside the temp workspace, or clearly restrict/document this evaluator as trusted official-dataset-only.
  3. There are unwanted docs artifacts in this PR

    • VERIFICATION.md looks like a temporary PR verification note and should not be committed.
    • The added planning/survey docs also look broader than the implementation change and likely should be removed unless maintainers explicitly want them:
      docs/cvdp_integration_plan.md, docs/verilog_benchmarks_survey.md, docs/verilog_eval_benchmark.md.
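
The containment check requested in item 2 could be sketched roughly as follows. This is an illustration of the general technique (reject absolute and `..` paths, then confirm the resolved target stays under the workspace root), not the evaluator's code; the function name and error messages are hypothetical.

```python
import tempfile
from pathlib import Path, PurePosixPath


def safe_workspace_path(workspace: Path, relative: str) -> Path:
    """Resolve `relative` under `workspace`, refusing anything that escapes it."""
    rel = PurePosixPath(relative.replace("\\", "/"))
    if not rel.parts or rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"unsafe harness path: {relative!r}")
    root = workspace.resolve()
    target = root.joinpath(*rel.parts).resolve()
    # Second line of defense (symlinks, drive letters). Needs Python 3.9+.
    if not target.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {relative!r}")
    return target


with tempfile.TemporaryDirectory() as tmp:
    ok = safe_workspace_path(Path(tmp), "rtl/top.sv")
    try:
        safe_workspace_path(Path(tmp), "../outside.sv")
        escaped = True
    except ValueError:
        escaped = False
```

Checking the resolved path in addition to the raw parts matters because a harness entry can contain no `..` at all and still leave the workspace through a symlink.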

One question: were both examples run and validated end-to-end, including at least one full examples/verilog_eval evolution/evaluation path and one full examples/cvdp Docker/CocoTB evaluation with the required CVDP image?

I did run the targeted tests with:
uv run python -m pytest -q tests/test_edit_verilog.py tests/test_cvdp_evaluator.py tests/test_verilog_eval_ci.py

That passed locally: 9 passed, 4 skipped. Mypy also passed for the CI command. The current blocker is ruff plus the CVDP path/compose trust issue.

Tyronita and others added 2 commits June 14, 2026 18:31
…runners)

Squashed review fixes for PR SakanaAI#137:
- Remove unused imports flagged by ruff (json/re in test_cvdp_evaluator,
  sys in test_verilog_eval_ci) and an f-string-without-placeholder in
  download_dataset.py.
- Harden the CVDP evaluator trust boundary: _safe_workspace_path() now
  rejects absolute paths and ".." traversal for every harness.files key
  and the VERILOG_SOURCES-derived RTL path, with unit tests.
- Stop tracking VERIFICATION.md and the planning/survey docs; they remain
  on disk but are git-ignored via the repo .gitignore.
- Decode batch-runner subprocess output as UTF-8 (errors="replace") so the
  Windows cp1252 locale codec no longer crashes the reader thread.
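
To illustrate the last fix: decoding child output explicitly as UTF-8 with `errors="replace"` avoids depending on the platform locale codec. The sketch below uses `subprocess.run` for brevity; the batch runner itself apparently uses a reader thread, and the helper name here is an assumption.

```python
import subprocess
import sys


def run_and_capture(cmd: list[str]) -> str:
    """Run a command and return its combined output, decoded leniently."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        encoding="utf-8",   # do not fall back to the locale codec (e.g. cp1252)
        errors="replace",   # undecodable bytes become U+FFFD instead of raising
    )
    return proc.stdout


# The child writes a byte that is invalid UTF-8; decoding still succeeds.
child = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'ok \\xff\\n')"]
output = run_and_capture(child)
```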

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds examples/rtllm: evolve Verilog RTL for area+delay under a FIXED
functional spec (RTLLM v2.0 designs), a "p_speedup" task for hardware.

Evaluator (evaluate.py) gates each candidate with:
  1. iverilog + RTLLM's own testbench  (open mirror of Synopsys VCS)
  2. yosys SAT formal combinational equivalence vs the reference
     (open analog of Formality; closes the testbench-overfitting
     reward-hacking hole -- a design that passes the finite testbench
     but is not equivalent scores 0)
  3. yosys -> ABC AIG: area = cell count, delay = logic depth
     (license-free, deterministic proxy for Design Compiler area/WNS)

Fitness = 100 * sqrt(area_ref/area_cand * depth_ref/depth_cand), so the
RTLLM human reference scores exactly 100 and beating it scores > 100.
Candidates are compared to the reference on the IDENTICAL yosys flow to
neutralize tool-dependence (Synopsys tools are unavailable on CPU).
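
The two-axis fitness stated in this commit message can be sketched as below. The function name, the `gates_passed` flag, and the rejection of non-positive metrics are assumptions for illustration; only the formula and the "failed gate scores 0" rule come from the text above.

```python
import math


def ppa_fitness(area_ref, depth_ref, area_cand, depth_cand, gates_passed=True):
    """100 * sqrt(area_ref/area_cand * depth_ref/depth_cand), or 0 if gated out."""
    if not gates_passed:
        return 0.0  # testbench or equivalence failure scores 0
    if min(area_ref, depth_ref, area_cand, depth_cand) <= 0:
        raise ValueError("area and depth must be positive")
    return 100.0 * math.sqrt((area_ref / area_cand) * (depth_ref / depth_cand))


# The reference compared against itself scores exactly 100.
baseline = ppa_fitness(120, 8, 120, 8)
# A quarter of the cells at the same logic depth doubles the score.
improved = ppa_fitness(100, 8, 25, 8)
```

Using a geometric mean of ratios means a candidate cannot score well by trading a large regression on one axis for a small gain on the other.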

Tooling: iverilog (native) + yosys (native or hdlc/yosys docker). The
JSONL and seeds are generated locally from a RTLLM clone via
extract_dataset.py and are git-ignored (we do not redistribute RTLLM's
files). run_evo.py / run_all.py drive single / parallel evolution and
force this fork onto sys.path + load the repo .env (dual-checkout safe).

Removes the CVDP example and its test per project direction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@RobertTLange

Collaborator

Thanks again for putting this together.

We would love to support Verilog/SystemVerilog as first-class ShinkaEvolve languages. The core language support pieces look directionally useful.

That said, we do not want to overload the main repository with a large amount of benchmark-specific logic. The VerilogEval/CVDP runners, dataset downloaders, full-sweep orchestration, planning docs, and benchmark harness code make this PR much larger than the language-support change itself.

Could you clean up the PR so it focuses on the minimal reusable Verilog/SystemVerilog support in ShinkaEvolve? Ideally that would include:

  • core language/marker support
  • focused edit/apply tests
  • a small, self-contained example or smoke test if needed
  • removal of benchmark sweep infrastructure, temporary verification/planning docs, and unrelated changes

The larger VerilogEval/CVDP benchmark logic may be better kept in a separate example repository or follow-up contribution once the core language support lands.

Tyronita and others added 6 commits June 16, 2026 23:17
…ADME; drop verilog_eval

- evaluate.py now scores power via OpenSTA alongside area/depth (Yosys), matching the
  3-axis geomean fitness; correctness held by equivalence to the reference
- README rewritten for the single-design example (tools, formula, setup, run)
- remove the verilog_eval benchmark example + its CI test (kept out per review;
  benchmark sweep logic to live in a separate repo)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…one example

- revert shinka/llm/client.py (Azure endpoint change — unrelated to Verilog)
- revert .github/workflows/ci.yml + pyproject.toml (requires_docker marker was only
  used by the removed CVDP test)
- fix the root README examples table: the deleted VerilogEval/CVDP rows now point to
  the single RTLLM PPA example

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bundle one RTLLM design (adder_8bit) as initial.sv + example.jsonl (RTLLM
v2.0, MIT, attributed) so `python run_evo.py` runs out of the box — no
external clone needed for the example.

- run_evo.py: match the sibling-example template (argparse; drop the
  sys.path/dotenv hacks — shinka loads .env at import — and the
  overnight-driver env block).
- extract_dataset.py: decouple from the dropped verify_all.py; emit
  problems/rtllm.jsonl for the full set.
- README: out-of-the-box run + one-line PDK fetch (proprietary Nangate45,
  not vendored) + extract-for-full-set command + RTLLM MIT attribution.
- shinka.yaml: single clean example config.
- remove run_all.py (inline-script sweep harness).
- revert root .gitignore (stop referencing dropped CVDP/VerilogEval docs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- default RTLLM_PROBLEM_FILE -> example.jsonl (the bundled design), so a bare
  `python evaluate.py --program_path initial.sv` is reproducible out of the box.
- collapse the scoring env-knobs (RTLLM_SCORE_MEAN / RTLLM_SCORE_AXES + the dual
  ppa2/ppa3 scalars) to the one documented formula:
  100 x geomean(area, logic-depth, power). Same combined_score, fewer knobs.
- drop the unused parameter on _ref_cache_path().
- trim the module docstring: drop the cross-benchmark RTL-OPT / Synopsys
  Formality/VCS framing and fix the fitness formula to the real 3-axis geomean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Our branch's table had diverged and re-added the maintainer's go_collatz_steps
row; the only intended README change is the new RTLLM PPA example row.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…line

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@Tyronita Tyronita left a comment

Author


Good Morning Tokyo! @RobertTLange

For the 45 of the 50 problems listed in RTLLM that can be synthesized with open-source EDA tools, we have 50 generations of results each.

(WIP) results will be published here:
https://github.com/Tyronita/RTLLM-ShinkaEvolve-results/

I noticed some reward hacking due to EVOLVE-BLOCK placement, so I'm re-running the affected reference designs.

HF Dataset Published for traces:
https://huggingface.co/datasets/EvanOLeary/rtllm-shinka-evolve

(thank you for the Cuda-Engineer traces - I post-trained a small LM and came 2nd in an NVIDIA hackathon using Sakana AI's datasets - https://www.linkedin.com/feed/update/urn:li:activity:7467348291291291648/)

The rest of the PR should be good to review :)

Best,

Evan

Tyronita and others added 2 commits June 17, 2026 02:33
RTLLM pins the module name and all I/O signals (name AND width) as part of the
spec. Previously the EVOLVE-BLOCK wrapped the whole module, so a candidate could
change the port interface (e.g. narrow a RAM address bus) and still pass a finite
testbench — not a valid optimisation. Now the module header / port declaration is
emitted OUTSIDE the editable region; only the implementation evolves. The prompt
states the interface is fixed. The bundled adder_8bit seed is regenerated this way
(still scores 100).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er comments

_wrap_seed now masks comments before locating the module header (so a ';' or ')'
inside a port comment no longer fools it) and also freezes the input/output/inout
declarations of non-ANSI modules. All 45 RTLLM seeds now freeze the full interface
(name + width) and still score 100 on the evaluator.
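
The comment-masking step can be illustrated with a small sketch: comments are overwritten with spaces so character offsets stay valid, and only then is the header-closing `;` searched for. This is not the actual `_wrap_seed` code; the helper names are hypothetical, string literals containing `//` are not handled, and the freezing of non-ANSI port declarations is not shown.

```python
import re

# Line comments and block comments in Verilog/SystemVerilog.
_COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def mask_comments(src: str) -> str:
    """Blank out comment text but keep newlines, so offsets are preserved."""
    return _COMMENT.sub(lambda m: re.sub(r"[^\n]", " ", m.group(0)), src)


def header_end(src: str) -> int:
    """Offset just past the ';' that closes the first module header."""
    masked = mask_comments(src)
    m = re.search(r"\bmodule\b", masked)
    if m is None:
        raise ValueError("no module declaration found")
    semi = masked.find(";", m.end())
    if semi == -1:
        raise ValueError("module header is not terminated")
    return semi + 1


SAMPLE = (
    "module adder (\n"
    "  input  [7:0] a,  // operand a; unsigned)\n"
    "  input  [7:0] b,\n"
    "  output [8:0] sum\n"
    ");\n"
    "  assign sum = a + b;\n"
    "endmodule\n"
)

# Everything up to and including ');' is the frozen interface.
header = SAMPLE[: header_end(SAMPLE)]
```

Without masking, the search would stop at the `;` inside the port comment and cut the header off in the middle of the port list.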

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Tyronita

Tyronita commented Jun 17, 2026

Author

Hey @RobertTLange — the PR's trimmed to the minimal Verilog/SystemVerilog support + one self-contained example, and it's green locally (ruff / mypy / pytest). The full PPA results live in a gist so they don't clutter the PR:

→ Results: https://gist.github.com/Tyronita/7e4c6959fc8609fa0d07f5663ebdfeb8

30-second summary

27 of 45 in-scope RTLLM designs beat the human reference (PPA = 100 · geomean(area, depth, power); 100 = the reference). Correctness is held by a Yosys SAT equivalence gate + an interface freeze + a code-review pass — which caught and rejected 4 testbench-overfit "wins". The gist has per-design code diffs, a worked example, a 45-design graph collage, and a HuggingFace dataset of every candidate.

Sidenote: the results and all their tooling are kept entirely out of this PR — it's just the language support + the one example.

@Tyronita changed the title from "feat: add Verilog/SystemVerilog language support with VerilogEval and CVDP benchmarks" to "feat: add Verilog/SystemVerilog language support with VerilogEval and RT-LLM Benchmark" on Jun 21, 2026
- evaluate.py: remove the unused equiv_induct equivalence path (kept one bounded
  miter), drop the dead native/no-timeout branches in _yosys_argv, hard-code the
  bounded-check params (no hidden env knobs), and tighten the verification,
  scoring, and feedback comments (-42 lines).
- run_evo.py: drop the anti-anchor paragraph that duplicated task_sys_msg.
- README: replace the stale RTLLM_SCORE_AXES row with RTLLM_POWER (the real knob);
  drop the specific result claim and the design-count aside.
- shinka.yaml: drop the benchmark-result comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2 participants