Optimizing Feedforward PROC Network for Throughput #2919
Replies: 1 comment
At a glance, I agree that I don't immediately see why any of your designs should fall short of full throughput, since all of your procs are feed-forward and stateless. The only obstacle to good performance should be your use of handshake exchanges (send->recv edges currently force a pipeline stage boundary), but your designs look proper enough that that should only add latency, not reduce throughput. I'm particularly suspicious that your 1-layer implementation, which uses a single stateless proc and does not provide the
Have you double-checked that the resulting circuit's timing actually meets your simulation's clock cycle? I note that you've generated all of your procs using the "unit" delay model (which is not intended for actual use), and given it an extremely high target delay. I wonder if OpenSTA, or some other static timing analysis tool, would show you what the problem really is. (If your circuit isn't meeting timing, that would explain quite a lot of odd behavior.)
More generally, I'd suggest that it's essential to use a real delay model, ideally one tuned against your own synthesis process, but that's currently more difficult than would be ideal unless you're using Yosys for your synthesis. (If you are using Yosys, you can probably just follow the instructions here to generate a new model specific to your target: #1634 (comment))
If the timing is the problem: while you're working on fixing that part of your flow, you might try evaluating your throughput using XLS's
Description
We provide the DSLX code of a neural network that performs physics jet-tagging classification, along with the scripts for measuring latency and throughput.
We look at the performance of a 4-layer dense neural network (NN). Each layer consists of a matrix-vector multiplication (MVM) followed by a ReLU or Argmax activation function. Each layer's weight matrix is defined as a constant.
The code is attached as an archive.
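For readers without the archive, the layer structure described above can be sketched as a small software reference model. This is not the attached DSLX code: the Python below is only an illustration, it is cut down to two layers for brevity, and the weights, vector sizes, function names, and the use of plain integers (rather than whatever fixed-point format the design uses) are all assumptions.

```python
# Illustrative reference model of a dense network: each hidden layer is an
# MVM followed by ReLU, and the final layer is an MVM followed by argmax.
# All weights and sizes below are hypothetical, not taken from the design.

def mvm(weights, x):
    """Matrix-vector multiplication: one dot product per matrix row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(v):
    """Clamp negative elements to zero."""
    return [max(0, e) for e in v]

def argmax(v):
    """Index of the largest element (the first one wins on ties)."""
    return max(range(len(v)), key=lambda i: v[i])

def dense_network(layers, x):
    """Apply ReLU layers for all but the last matrix, then argmax."""
    *hidden, last = layers
    for weights in hidden:
        x = relu(mvm(weights, x))
    return argmax(mvm(last, x))

# Hypothetical constants: 3 inputs -> 2 hidden values -> 3 classes.
W1 = [[1, -2, 0],
      [0, 1, 1]]
W2 = [[1, 0],
      [0, 1],
      [-1, -1]]

if __name__ == "__main__":
    print(dense_network([W1, W2], [3, 1, 2]))
```

A model like this can serve as a golden reference when checking the classes printed by the hardware simulation, provided its arithmetic is adjusted to match the bit widths of the real design.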
Implementation
We build it as a feedforward PROC network and provide 3 different implementations:
Rationale
While an implementation that does not use PROCs achieves an initiation interval (II) of 1, we need to experiment with PROCs for two main reasons:
Scripts
dslx_codegen.sh proc_jet_tagging_dense <jet_tagging_4layers | jet_tagging_2layers | jet_tagging_1layer>
This script generates the top.sv HDL directly from the proc_jet_tagging_dense.x file. Choose one of the 3 implementations. Modify the BAZEL_BIN_PATH variable to point to your xls/bazel-bin. The last command of the script must also be modified to experiment with various I/O constraints.
make
There is a Makefile that calls a Verilator simulation written in sim_main.cpp. Verilator must be installed. To investigate the throughput (II), look at the cycle interval between class 0 and class 2 outputs. They should oscillate consistently. The latency is given by the cycle at which the first class is generated.
Issues
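Relating to the measurement procedure in the Scripts section: the latency and II readout can be automated once the cycle numbers of the valid outputs are collected. The sketch below is an assumption about how one might post-process such a log; the function names and the sample cycle numbers are hypothetical, and it is not part of sim_main.cpp.

```python
# Post-processing sketch: derive latency and initiation interval (II) from
# the cycles at which the simulation produced a valid output.

def latency_and_intervals(output_cycles):
    """Return (first output cycle, gaps between consecutive outputs)."""
    if not output_cycles:
        raise ValueError("no outputs recorded")
    latency = output_cycles[0]
    intervals = [b - a for a, b in zip(output_cycles, output_cycles[1:])]
    return latency, intervals

def steady_ii(intervals):
    """Return the II if every gap is identical, otherwise None."""
    if intervals and len(set(intervals)) == 1:
        return intervals[0]
    return None

if __name__ == "__main__":
    cycles = [12, 13, 14, 15]  # hypothetical log, not measured data
    latency, intervals = latency_and_intervals(cycles)
    print(latency, intervals, steady_ii(intervals))
```

If steady_ii returns None, the outputs are not arriving at a fixed rate, which is the kind of irregularity worth comparing across the three implementations.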
Discussion Points
How would a DSLX designer approach this problem? Are there missing constraints or codegen flags, or is the design flawed at an algorithmic level?
Code:
proc_jet_tagging_dense.zip