Optimizing Feedforward PROC Network for Throughput #2919

Girjoaba · 2025-08-21T09:18:41Z

Description

We provide the DSLX code for a neural network that performs physics jet-tagging classification, along with the scripts used to measure its latency and throughput.
We look at the performance of a 4-layer dense neural network (NN). Each layer consists of a matrix-vector multiplication (MVM) followed by a ReLU or argmax activation function. Each layer's weight matrix is defined as a constant.
The code is attached as an archive.
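
For readers without the archive, the layer structure described above can be sketched in plain Python as a reference model. This is only an illustration: the function names and the tiny 2x2 integer weights are made up for the example and are not taken from proc_jet_tagging_dense.x, and it assumes ReLU on the first three layers and argmax on the last.

```python
from typing import List

Matrix = List[List[int]]
Vector = List[int]


def mvm(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector multiplication: one dot product per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(0, e) for e in v]


def argmax(v: Vector) -> int:
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(v)), key=lambda i: v[i])


def forward(layers: List[Matrix], x: Vector) -> int:
    """Apply MVM + ReLU for every layer except the last, then MVM + argmax."""
    for weights in layers[:-1]:
        x = relu(mvm(weights, x))
    return argmax(mvm(layers[-1], x))


# Hypothetical constant weights, standing in for the real layer matrices.
TOY_LAYERS: List[Matrix] = [
    [[1, -1], [-1, 1]],
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
    [[1, 0], [0, 1]],
]

if __name__ == "__main__":
    print(forward(TOY_LAYERS, [3, 1]))
    print(forward(TOY_LAYERS, [1, 3]))
```

A model like this is mainly useful as a golden reference to compare against the class indices produced by the simulation.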

Implementation

We build it as a feedforward PROC network and provide 3 different implementations:

pub proc jet_tagging_1layer(): all 4 MVMs are defined in one PROC, and a surrounding PROC communicates with the "network"
pub proc jet_tagging_2layers(): the 4 MVMs are split between 2 PROCs, again with the surrounding PROC
pub proc jet_tagging_4layers(): each of the 4 MVMs is placed in its own PROC, again with the surrounding PROC

Rationale

While an implementation that does not use PROCs achieves an initiation interval (II) of 1, we need to experiment with PROCs for two main reasons:

The flexibility of PROCs will help considerably when we write more complex NN architectures, so we should first be able to obtain optimal throughput in the simplest case.
The codegen pipeline (from DSLX to HDL) completes faster for large designs that are split using PROCs.

Scripts

dslx_codegen.sh proc_jet_tagging_dense <jet_tagging_4layers | jet_tagging_2layers | jet_tagging_1layer>

This script generates the top.sv HDL directly from the proc_jet_tagging_dense.x file. Choose one of the 3 implementations. Modify the BAZEL_BIN_PATH variable to point to your xls/bazel-bin.
The last command of the script must also be modified to experiment with various I/O constraints.

make

There is a Makefile that runs a Verilator simulation written in sim_main.cpp; Verilator must be installed. To investigate the throughput (II), look at the cycle interval between the class 0 and class 2 outputs, which should alternate at a consistent interval. The latency is given by the cycle at which the first class output is produced.
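
The measurement described above can be sketched as a small Python helper. It assumes you have already extracted, from the simulation log, the list of cycle numbers at which a valid class output appeared; the log format of sim_main.cpp is not shown here, so the parsing step is left out and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def latency_and_gaps(output_cycles: List[int]) -> Tuple[int, List[int]]:
    """Return (latency, gaps) for a sorted list of output cycle numbers.

    latency: cycle at which the first output is produced.
    gaps:    cycle distance between each pair of consecutive outputs.
    """
    if not output_cycles:
        raise ValueError("no outputs observed")
    gaps = [b - a for a, b in zip(output_cycles, output_cycles[1:])]
    return output_cycles[0], gaps


def observed_ii(output_cycles: List[int]) -> Optional[int]:
    """Return the II if every gap is identical, otherwise None."""
    _, gaps = latency_and_gaps(output_cycles)
    if gaps and all(g == gaps[0] for g in gaps):
        return gaps[0]
    return None


if __name__ == "__main__":
    # Made-up cycle stamps: first output at cycle 20, then one every 15 cycles.
    cycles = [20, 35, 50, 65]
    print(latency_and_gaps(cycles))
    print(observed_ii(cycles))
```

Returning None for irregular gaps is a deliberate choice: an uneven output interval usually means the design is stalling intermittently, which is worth looking at separately from the steady-state II.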

Issues

4 procs implementation (II=15): We consider this too high, since the equivalent implementation without PROCs achieves II=1.
2 procs implementation (II=9): As above.
1 proc implementation (II=4): We believe that fitting too many large MVMs into one PROC results in undesired behavior: although the unit test passes, the oscillation in the simulation output disappears and a single class is always shown as the output.

Discussion Points

How would a DSLX designer approach this problem? Are constraints or codegen flags missing, or is the design flawed at an algorithmic level?

Code:
proc_jet_tagging_dense.zip

ericastor (Maintainer) · 2025-08-21T13:22:56Z

At a glance, I agree that I don't immediately see why any of your designs should fall short of full throughput, since all of your procs are feed-forward and stateless... the only obstacle to good performance should be your use of handshake exchanges (send->recv edges currently force a pipeline stage boundary), but your designs look proper enough that that should only add latency, not reduce throughput.
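
The latency-versus-throughput point can be illustrated with a toy Python model. This is not how XLS schedules procs; it only models an idealized feed-forward chain in which every stage boundary is a single register and every stage accepts a new value each cycle. The function name is made up for the example.

```python
from typing import List, Optional


def simulate_pipeline(num_stages: int, num_inputs: int) -> List[int]:
    """Return the cycles at which outputs leave an idealized pipeline.

    One new input is offered per cycle; each stage is one register deep.
    """
    regs: List[Optional[int]] = [None] * num_stages
    out_cycles: List[int] = []
    next_input = 0
    cycle = 0
    while len(out_cycles) < num_inputs:
        # Whatever sits in the last register is visible at the output now.
        if regs[-1] is not None:
            out_cycles.append(cycle)
        # Clock edge: shift everything one stage forward, load a new input.
        incoming = next_input if next_input < num_inputs else None
        regs = [incoming] + regs[:-1]
        next_input += 1
        cycle += 1
    return out_cycles


if __name__ == "__main__":
    for stages in (1, 2, 4):
        print(stages, simulate_pipeline(stages, 3))
```

In this model, adding stages moves the first output later by one cycle per stage, while consecutive outputs stay one cycle apart. If a real design shows II growing with the number of PROCs, something other than the extra stage boundaries is likely involved.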

I'm particularly suspicious that your 1-layer implementation, which uses a single stateless proc and does not provide the --worst_case_throughput flag, is testing as having II>1; XLS should refuse to generate procs with II>1 unless you specify that flag.

Have you double-checked that the resulting circuit's timing actually meets your simulation's clock cycle? I note that you've generated all of your procs using the "unit" delay model (which is not intended for actual use), and given it an extremely high target delay. I wonder if OpenSTA, or some other static timing analysis tool, would show you what the problem really is. (If your circuit isn't meeting timing, that would explain quite a lot of odd behavior.) More generally, I'd suggest that it's essential to use a real delay model - ideally one tuned against your own synthesis process, but that's currently more difficult than would be ideal unless you're using Yosys for your synthesis. (If you are using Yosys, you can probably just follow the instructions here to generate a new model specific to your target: #1634 (comment))

If the timing is the problem - while you're working on fixing that part of your flow, you might try evaluating your throughput using XLS's eval_proc_main, running it against codegen's "Block IR" output, rather than using a full Verilator simulation. This will simulate your circuit in a cycle-accurate way (rather than time-accurate), under the assumption that the clock is set slow enough that the circuit will meet timing.
