[aes] Improve GHASH masking to reduce SCA leakage #18
21cc5a3 to 74f03b7
If I understand the workflow correctly, the tool flags some transient leakages, which are then patched by adding blankers etc. Is there a formalism that codifies such additional countermeasures on top of the masking methodology, so that they can already be included during the first implementation phase?
end
end
end

GHASH_MASKED_SETTLE: begin
Does the settle state add one more cycle to the computation, or is it offset by forwarding the result of the last addition directly to the output?
Yes, it's a bubble cycle. The design really does nothing in this cycle. It's required to prevent a new register value (written at the active clock edge when entering this state) from being combined with a previous intermediate result that may still be present on some downstream wires.
// Note: Once the multiplication finishes, Share 0 of the state depends on Share 0 of the
// hash subkey. Thus, we don't forward it to the second multiplier as this may lead to
// undesirable SCA leakage inside the multiplier.
// When doing the first block only, we have to start computing another correction term
// using the second multiplier in the next clock cycle, i.e., S1 * H1.
By deferring the computation of the correction term of the first block to the next clock cycle, does this also shift all other correction term multiplications for the following blocks?
No, this correction term is used just once throughout an entire message. Instead of computing it and storing it in a separate 128-bit register (expensive!), we compute it and use it directly. All other blocks use the two correction terms that we compute once at the beginning and then store in registers. Since these terms are used once per block, it pays off to store them in registers.
Theoretically, we could also recompute them for every block and make the multiplier faster (this would probably be more efficient from an area perspective), but that's not nice from an SCA viewpoint because you compute the same operation on the same inputs many times.
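For readers without the block diagram at hand, here is a minimal Python sketch of why cross terms such as S1 * H1 show up at all when both the GHASH state and the hash subkey are split into two Boolean shares. It only illustrates the generic algebra (multiplication in GF(2^128) distributes over XOR) and is not the exact scheduling or correction-term definition used in this PR. The helper `gf128_mul` is hypothetical and uses a plain (non-reflected) bit order, so it ignores GHASH's bit ordering convention.

```python
import secrets

# Reduction polynomial x^128 + x^7 + x^2 + x + 1 (the GCM field polynomial).
POLY = (1 << 128) | 0x87


def gf128_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^128) (hypothetical helper, plain bit order)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1              # multiply a by x
        if a >> 128:
            a ^= POLY        # reduce modulo the field polynomial
    return result


# Split state S and hash subkey H into two Boolean shares each.
s0, s1 = secrets.randbits(128), secrets.randbits(128)
h0, h1 = secrets.randbits(128), secrets.randbits(128)
s, h = s0 ^ s1, h0 ^ h1

# The unmasked product equals the XOR of all four share-wise products.
cross_terms = (gf128_mul(s0, h0) ^ gf128_mul(s0, h1)
               ^ gf128_mul(s1, h0) ^ gf128_mul(s1, h1))
assert gf128_mul(s, h) == cross_terms
```

Because the identity holds for any sharing, terms that depend only on the subkey shares and the initial masks can be computed once and reused, which matches the trade-off between register area and repeated computation discussed above.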
Thanks for your review @andrea-caforio. Your understanding is correct. To a limited extent, there are rules that can be followed whenever logic is time-multiplexed between masking shares (such as the second multiplier here, or parts of the ALU in a CPU). For example: don't process two shares back to back, and use one-hot muxes. We do have some guidance for OTBN programmers, for example. But I fear that in the end, it's always pretty much tailored to the implementation at hand. Also, adding this stuff and doing it right takes a lot of time. For a first implementation, one typically needs something quickly that one can then start optimizing.
This commit adds a wrapper module suitable for formal or simulation-based masking verification using Alma or PROLEAD, respectively. Also, it adds the required setup files to kick off the formal masking verification using Alma. Signed-off-by: Pirmin Vogel <[email protected]>
It turns out that the B input is scanned instead of the A input. For SCA hardening, it's better to connect the hash subkey to the unscanned input, i.e., input A. Signed-off-by: Pirmin Vogel <[email protected]>
This output is high during the second-to-last clock cycle, i.e., in the cycle before the output becomes valid or one cycle before ack_o asserts. Signed-off-by: Pirmin Vogel <[email protected]>
Forwarding the unscanned Operand A (typically used for the secret) in case of SCA hardened designs is not ideal whereas forwarding a deterministic value is less ideal when focusing on FI hardening. This commit adds a parameter to choose what to forward before the result is ready. Signed-off-by: Pirmin Vogel <[email protected]>
This commit implements a series of improvements for the GHASH masking scheme to reduce the SCA leakage. With these changes, the implementation successfully passes formal masking verification using Alma in transient mode, i.e., when glitches are considered. Prior to this commit, the implementation would pass masking verification in stable mode only.

The following improvements have been made:

- The result of the final addition of Share 1 of S and the unmasked GHASH state is no longer stored into the GHASH state register but directly forwarded to the output, and the state input to this addition is blanked. The input multiplexer (ghash_in_mux) loses one input. (The ghash_state_mux for the unmasked implementation gains one input.)
- The two 3-input multiplexers selecting the operands for the addition with the GHASH state (add_in_mux) are replaced by one-hot multiplexers with registered control signals.
- The Operand B inputs of both GF multipliers are now blanked. The 3-input multiplexer selecting Operand B of the second GF multiplier is replaced by a one-hot multiplexer with registered control signal. In addition, the last input slice of Operand B for this multiplier is registered. This allows switching the multiplexer during the last clock cycle of the multiplication to avoid some undesirable transient leakage occurring upon saving the result of the multiplication into the GHASH state register (and this new value propagating through the multiplexer into the multiplier again).
- The GF multipliers are configured to output zero instead of Operand A (the hash subkey) while busy.
- The state input for the addition required for the generation of the correction term for Share 0 is blanked.
- Between adding the correction terms to the GHASH state for the last time and unmasking the GHASH state, a bubble cycle is added to allow signals to fully settle, thereby avoiding undesirable transient effects when unmasking the uncorrected state shares.

The overall area impact of these changes is low (+0.16 kGE in Yosys + nangate45). Signed-off-by: Pirmin Vogel <[email protected]>
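As a purely functional illustration of the blanking and one-hot multiplexing mentioned in the commit message above, the following Python sketch models a blanker as an AND with an enable and a one-hot multiplexer as an AND-OR structure in which each input is gated by its own select bit. The names `blank` and `onehot_mux` are hypothetical, and allowing an all-zero select to blank the output is an assumption of this sketch. Python cannot model glitches, so this shows only the steady-state logic, not the transient behavior that Alma checks.

```python
from typing import Sequence

WIDTH = 128
MASK = (1 << WIDTH) - 1


def blank(value: int, enable: bool) -> int:
    """Model of a blanker: pass the value when enabled, else drive all zeros."""
    return value & (MASK if enable else 0)


def onehot_mux(inputs: Sequence[int], sel_onehot: int) -> int:
    """AND-OR multiplexer: every input is gated by its own select bit.

    Assumption of this sketch: an all-zero select is legal and blanks the output.
    """
    if sel_onehot >> len(inputs):
        raise ValueError("select has bits beyond the number of inputs")
    if bin(sel_onehot).count("1") > 1:
        raise ValueError("select must be one-hot or all-zero")
    out = 0
    for i, value in enumerate(inputs):
        out |= blank(value, bool((sel_onehot >> i) & 1))
    return out


operands = [0xA, 0xB, 0xC]
assert onehot_mux(operands, 0b010) == 0xB   # second input selected
assert onehot_mux(operands, 0b000) == 0     # nothing selected: output blanked
```

With a binary-encoded select, a momentary glitch on the select lines can briefly route an unintended input to the output. Gating each input with its own registered select bit is the structure that the rule of thumb "use one-hot muxes" refers to.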
74f03b7 to 9b2ee66
This PR contains several commits and I recommend reviewing them one by one. The most substantial one is the last one. I recommend reviewing this with the masked block diagram at hand.
With these changes in place, the design passes formal masking verification in Alma considering also transient effects.