Parallelize opt_clean pass #5664

Open
rocallahan wants to merge 25 commits into YosysHQ:main from rocallahan:parallel-opt-clean

@rocallahan rocallahan commented Feb 2, 2026

This is the next step after parallelizing opt_merge.

This PR depends on #5621, #5629 and #5631. Once those are merged, I will clean up this PR. I'm putting it here now in case people want to see what I'm planning and so I can get CI clean.

opt_clean is much more complex than opt_merge and the changes here are correspondingly greater. In particular I need to introduce several more parallel abstractions. I have done my best to preserve the original code structure. I've fuzzed millions of testcases to detect any differences in results between this and the original pass, and as far as I know I've fixed them all.

The scalability on large flattened circuits is not quite as impressive as it was for opt_merge but it's still pretty good. For a circuit with 3.5M cells with opt_merge already applied, running opt_clean once removes about 10% of the wires. Then I run opt_clean again to measure the scalability when there is nothing to remove (a common case). The results:
[Figure: opt_clean scalability]
So with 40 cores we get a 6x speedup in the dirty case and a 9x speedup in the clean case, both relative to this branch on 1 core. But on 1 core the dirty case is already 1.6x faster than current yosys main (a68fee1) and the clean case is 2.1x faster, so the clean case is actually >20x faster than current yosys main. (The dirty case doesn't parallelize as well because modifying RTLIL has to happen on the main thread and there are a lot of wires to remove in this case. This could be improved but it might not be worth the extra complexity; removing this much of the design is probably rare.)
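
One rough way to read these figures is to ask what serial fraction they would imply. The sketch below assumes the simple Amdahl's law model (a fixed serial fraction, with the rest scaling perfectly), which is an assumption of this example and not something measured in the PR. The function names are hypothetical.

```cpp
#include <cassert>

// Amdahl's law: speedup(N) = 1 / (s + (1 - s) / N), where s is the serial fraction.
inline double amdahl_speedup(double serial_fraction, double cores) {
	assert(cores >= 1.0);
	return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}

// The same formula solved for s, given an observed speedup on `cores` cores.
inline double implied_serial_fraction(double speedup, double cores) {
	assert(cores > 1.0 && speedup > 0.0);
	return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores);
}
```

Under that model, 6x on 40 cores corresponds to a serial fraction of roughly 15%, and 9x to roughly 9%, which is consistent with the dirty case spending more time in main-thread RTLIL modification.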

For smaller circuits there is a penalty for the extra complexity, but I've done my best to mitigate that and offset it with optimizations. For the jpeg synth testcase (read_verilog -sv -I~/OpenROAD-flow-scripts/flow/designs/src/jpeg/include ~/OpenROAD-flow-scripts/flow/designs/src/jpeg/*.v; synth), this is very slightly faster than current yosys main with or without multicore, on my system:

main YOSYS_MAX_THREADS=1:
Benchmark 1: ./yosys -p "read_verilog -sv -I/usr/local/google/home/rocallahan/OpenROAD-flow-scripts/flow/designs/src/jpeg/include ~/OpenROAD-flow-scripts/flow/designs/src/jpeg/*.v; synth"
  Time (mean ± σ):     17.094 s ±  0.707 s    [User: 16.256 s, System: 0.841 s]
  Range (min … max):   16.333 s … 17.816 s    10 runs
 
main:
Benchmark 1: ./yosys -p "read_verilog -sv -I/usr/local/google/home/rocallahan/OpenROAD-flow-scripts/flow/designs/src/jpeg/include ~/OpenROAD-flow-scripts/flow/designs/src/jpeg/*.v; synth"
  Time (mean ± σ):     16.888 s ±  0.421 s    [User: 16.038 s, System: 0.858 s]
  Range (min … max):   16.508 s … 17.952 s    10 runs

parallel-opt-clean YOSYS_MAX_THREADS=1:
Benchmark 1: ./yosys -p "read_verilog -sv -I/usr/local/google/home/rocallahan/OpenROAD-flow-scripts/flow/designs/src/jpeg/include ~/OpenROAD-flow-scripts/flow/designs/src/jpeg/*.v; synth"
  Time (mean ± σ):     16.924 s ±  0.495 s    [User: 15.606 s, System: 1.320 s]
  Range (min … max):   16.563 s … 18.245 s    10 runs
 
parallel-opt-clean:
Benchmark 1: ./yosys -p "read_verilog -sv -I/usr/local/google/home/rocallahan/OpenROAD-flow-scripts/flow/designs/src/jpeg/include ~/OpenROAD-flow-scripts/flow/designs/src/jpeg/*.v; synth"
  Time (mean ± σ):     16.778 s ±  0.150 s    [User: 15.634 s, System: 1.179 s]
  Range (min … max):   16.602 s … 17.101 s    10 runs

This is a big PR so let me know what I can do to make it easier to swallow!

@rocallahan
Contributor Author

I should add some unit tests for the types in threading.h.

@widlarizer widlarizer self-requested a review February 2, 2026 23:07
@widlarizer widlarizer self-assigned this Feb 2, 2026
@rocallahan
Contributor Author

The dependent CLs have been merged. There are two failures in CI. One issue is that this version of MSVC++ seems to be unable to instantiate std::unordered_set with move-only elements :-(. The other issue is a crash running tests/svtypes/typedef_package.sv with Verific. I'm not sure how this CL would affect that test specifically and there's no diagnostic information in the CI log.
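
For context, here is a minimal sketch of the kind of construct involved: a `std::unordered_set` whose element type can be moved but not copied. This is not the code from the PR; `MoveOnlyId`, `MoveOnlyIdHash` and `make_ids` are hypothetical names, and the sketch does not attempt to reproduce the specific MSVC++ failure.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <type_traits>
#include <unordered_set>
#include <utility>

// Hypothetical move-only element type (not a type from the PR).
struct MoveOnlyId {
	int value;
	explicit MoveOnlyId(int v) : value(v) {}
	MoveOnlyId(MoveOnlyId &&) = default;
	MoveOnlyId &operator=(MoveOnlyId &&) = default;
	MoveOnlyId(const MoveOnlyId &) = delete;
	MoveOnlyId &operator=(const MoveOnlyId &) = delete;
	bool operator==(const MoveOnlyId &other) const { return value == other.value; }
};
static_assert(!std::is_copy_constructible<MoveOnlyId>::value, "element must be move-only");

struct MoveOnlyIdHash {
	std::size_t operator()(const MoveOnlyId &id) const { return std::hash<int>()(id.value); }
};

// Builds a set using only operations that never copy an element.
inline std::unordered_set<MoveOnlyId, MoveOnlyIdHash> make_ids() {
	std::unordered_set<MoveOnlyId, MoveOnlyIdHash> ids;
	auto first = ids.emplace(1);       // constructed in place
	assert(first.second);              // newly inserted
	ids.insert(MoveOnlyId(2));         // rvalue insert moves the element
	auto duplicate = ids.emplace(1);
	assert(!duplicate.second);         // equal element already present
	return ids;                        // the container itself is moved, not copied
}
```

The sketch sticks to `emplace`, rvalue `insert`, and moving the whole container, since none of those require the element type to be copyable.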

@rocallahan rocallahan force-pushed the parallel-opt-clean branch 2 times, most recently from 83bfac8 to 29a0ca6 Compare February 5, 2026 19:24
@rocallahan
Contributor Author

> The other issue is a crash running tests/svtypes/typedef_package.sv with Verific. I'm not sure how this CL would affect that test specifically and there's no diagnostic information in the CI log.

This was a bug in my code related to init wire attributes which apparently Verific generates but Yosys does not. Fuzzing didn't catch it because the fuzzing grammar didn't generate init attributes. I've updated the PR to extend the grammar with init attributes and verified that fuzzing now catches this case. I've also added an RTLIL testcase for the bug that will catch it when tests are run without Verific.

I'm still trying to find a workaround for the MSVC++ issue.

@rocallahan
Contributor Author

The Verific build is failing because libgmock-dev isn't installed on that system. I don't know how to fix that.

@rocallahan rocallahan force-pushed the parallel-opt-clean branch 2 times, most recently from 01b8f9c to b438afc Compare February 8, 2026 22:52
@rocallahan rocallahan marked this pull request as ready for review February 8, 2026 23:43
@rocallahan
Contributor Author

CI is clean except that the system that runs Verific tests needs gmock installed.

@widlarizer
Collaborator

@mmicko The test-verific job is running in an unmanaged custom environment, right? Please try adding gmock; it should fix the unit tests here, since they use convenient matchers like UnorderedElementsAre.

@mmicko
Member

mmicko commented Feb 12, 2026

@widlarizer Updated the CI Docker image; all green now.

Collaborator

@widlarizer widlarizer left a comment

I've been working on this review for a couple of days, so here's an incomplete first pass. I've also put up a PR into this PR's branch with comments that make the data flow clearer.

```cpp
thread_state.next_batch.emplace_back(std::move(work));
if (GetSize(thread_state.next_batch) < batch_size)
	return;
bool was_empty;
```
Collaborator

Probably needs explicit initialization

Contributor Author

Does it? We initialize it unconditionally three lines later.
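
To make the exchange easier to follow, here is a minimal sketch of the pattern being discussed: a flag declared without an initializer, assigned unconditionally inside a locked scope, and read only afterwards. Everything beyond the four quoted lines is assumed for illustration; `SharedQueue`, `ThreadState`, `push_work` and `wakeups` are hypothetical and not the PR's actual code.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-ins; names and layout are illustrative only.
struct SharedQueue {
	std::mutex mutex;
	std::vector<std::vector<int>> batches;
	std::atomic<int> wakeups{0}; // stand-in for notifying a waiting consumer
};

struct ThreadState {
	std::vector<int> next_batch; // work buffered locally by one thread
};

inline void push_work(SharedQueue &queue, ThreadState &thread_state, int work, std::size_t batch_size) {
	assert(batch_size > 0);
	thread_state.next_batch.emplace_back(work);
	if (thread_state.next_batch.size() < batch_size)
		return; // keep buffering; the shared queue is not touched
	bool was_empty; // no initializer: assigned on every path before it is read
	{
		std::lock_guard<std::mutex> lock(queue.mutex);
		was_empty = queue.batches.empty();
		queue.batches.emplace_back(std::move(thread_state.next_batch));
	}
	thread_state.next_batch.clear(); // put the moved-from vector in a known state
	if (was_empty)
		++queue.wakeups; // a consumer only needs waking if the queue was empty
}
```

Declaring the flag outside the locked scope lets the notification happen after the lock is released. Initializing it at the declaration would be harmless, but in this shape it is not required for correctness.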


```cpp
template <typename V>
struct DefaultCollisionHandler {
	void operator()(typename V::Accumulated &, typename V::Accumulated &) const {}
```
Collaborator

I would prefer for this to error out. It's not meeting the spec of "used to reduce two V::Accumulated values into a single value."

Contributor Author

@rocallahan rocallahan Feb 17, 2026

Arguably it is :-). The default behavior is just "pick one of the two values" (and we pick the 'current' value because that's free). I'll add a comment to DefaultCollisionHandler and you can tell me if it's satisfying :-).
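
The two readings of the collision handler can be illustrated with a small sketch. Only the `DefaultCollisionHandler` signature comes from the quoted diff. `CountValue`, `SumCollisionHandler` and `insert_or_reduce` are hypothetical, and the real container in the PR may invoke its handler differently.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>

// Same shape as the quoted default: reduce (current, incoming) by keeping `current`.
template <typename V>
struct DefaultCollisionHandler {
	void operator()(typename V::Accumulated &, typename V::Accumulated &) const {}
};

// Hypothetical value descriptor and an alternative handler that merges both values.
struct CountValue {
	using Accumulated = int;
};
struct SumCollisionHandler {
	void operator()(int &current, int &incoming) const { current += incoming; }
};

// Hypothetical helper: insert a value, or reduce it into the existing one on collision.
template <typename V, typename Handler = DefaultCollisionHandler<V>>
void insert_or_reduce(std::unordered_map<int, typename V::Accumulated> &table, int key,
		typename V::Accumulated incoming, Handler handler = Handler()) {
	auto result = table.emplace(key, incoming);
	if (!result.second)
		handler(result.first->second, incoming); // key already present: reduce
	assert(table.count(key) == 1);
}
```

With the default handler a second insert for the same key leaves the stored value untouched, which is the "pick one of the two values" behavior. A summing handler would instead fold the incoming value into the stored one.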

@rocallahan
Contributor Author

> I've also put up a PR into this PR branch with comments that make the data flow clearer

Would you prefer me to merge these changes into my commits (losing attribution) or keep your commit separate in my stack of commits?

We've already talked about adding this as an alternative to `log_id()`, and we'll
need it later in this PR.
`log_error()` causes an exit so we don't have to try too hard here. The main
thing is to ensure that we normally are able to exit without causing a stack
overflow due to recursive asserts about not being in a `Multithreaded` context.
This causes problems when compiling with fuzzing instrumentation enabled.
…dIndex`

We'll use these later in this PR.
We'll use this later in the PR.
We'll use this later in the PR.
We'll use this later in the PR.
We'll use this later in the PR.
We'll use this later in the PR.
We'll use this later in the PR.
We will want to query `keep_cache` from parallel threads. If we compute
the results on-demand, that means we need synchronization for cache
access in those queries, which adds complexity and overhead. Instead, prefill
the cache with the status of all relevant modules. Note that this doesn't
actually do more work --- we always consult `keep_cache` for all cells of
all selected modules, so scanning all those cells and determining the kept
status of all dependency modules is always required.

Later in this PR we're going to parallelize `scan_module` itself, and that's also
much easier to do when no other parallel threads are running.
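
The prefill idea described above can be sketched as two phases: fill the cache on one thread, then let worker threads perform read-only lookups with no locking. This is a simplified illustration, not the Yosys `keep_cache` implementation; `KeepCache`, `prefill`, `query` and `count_kept_parallel` are hypothetical names, and module names are plain strings here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-module "keep" cache.
struct KeepCache {
	std::unordered_map<std::string, bool> kept;

	// Phase 1 (single-threaded): record the status of every module that will be queried.
	void prefill(const std::vector<std::pair<std::string, bool>> &modules) {
		for (const auto &module : modules)
			kept.emplace(module.first, module.second);
	}

	// Phase 2 (any thread): read-only lookup; safe without locks while nobody writes.
	bool query(const std::string &module) const { return kept.at(module); }
};

// Counts cells whose module type is marked "keep", splitting the cells across threads.
inline int count_kept_parallel(const KeepCache &cache, const std::vector<std::string> &cell_types,
		std::size_t num_threads) {
	assert(num_threads > 0);
	std::atomic<int> total{0};
	std::vector<std::thread> workers;
	for (std::size_t t = 0; t < num_threads; t++) {
		workers.emplace_back([&cache, &cell_types, &total, t, num_threads] {
			int local = 0;
			for (std::size_t i = t; i < cell_types.size(); i += num_threads)
				if (cache.query(cell_types[i]))
					local++;
			total += local; // one atomic update per thread
		});
	}
	for (auto &worker : workers)
		worker.join();
	return total.load();
}
```

The trade-off is that every entry must exist before the parallel phase starts. An on-demand cache would instead need a mutex or similar around each miss.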
Turns out this is not strictly necessary for this PR but it's
still a good thing to do and makes it clearer that the stats
are not modified in a possibly racy way.
…`, and parallelize `remove_temporary_cells`