Use native ulong shifts when custom shift is not needed #50

Hyodar · 2025-05-09T15:22:56Z

In some cases, we are using our custom ulong Lsh and Rsh implementations instead of the native shift operators. The custom versions are necessary when n == 64, since C# masks the shift count for a ulong to its low six bits (n & 63), so a << 64 == a, whereas we want a << 64 == 0.
However, it seems like in some of these cases in UInt256 we can, in fact, consider some restrictions on the values of n and replace the custom with the native shift operations, avoiding some extra steps.
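
To make the difference concrete, here is a minimal sketch of the two behaviors. `CustomLsh` and `CustomRsh` are hypothetical stand-ins written for this illustration and are not the library's actual Lsh/Rsh bodies; they only reproduce the "shift by 64 gives 0" contract described above.

```csharp
static class ShiftSemanticsSketch
{
    // Native shift: for ulong, C# uses only the low six bits of n (n & 63),
    // so a shift by 64 behaves like a shift by 0.
    public static ulong NativeLsh(ulong a, int n) => a << n;

    // Hypothetical custom shifts (assumption, not the library code):
    // every bit is shifted out once n reaches 64.
    public static ulong CustomLsh(ulong a, int n) => n >= 64 ? 0UL : a << n;
    public static ulong CustomRsh(ulong a, int n) => n >= 64 ? 0UL : a >> n;
}
```

For any n in [0..63] the native and custom versions agree, which is why the native operator can be substituted wherever that range is guaranteed.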

In Udivrem, shift is LeadingZeroes(FirstNonZero(d.u[0], d.u[1], d.u[2], d.u[3])), so it could only be 64 if all four words were zero. In that case d would be zero, which means a division by zero. shift can still be zero, though, which makes 64 - shift equal to 64. So shifting by shift can use the native operator, while shifting by 64 - shift still requires the custom shift.
Div64 is a similar situation: shift < 64, since d can only have 64 leading zeroes if it is zero, but shift can still be zero. Again, shifting by shift can be native, while shifting by 64 - shift still requires the custom shift.
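
The reasoning for Udivrem and Div64 can be sketched on a two-word divisor. This is an illustration only: `Normalize` and `CustomRsh` are hypothetical names, the real methods work on more words, and the sketch assumes the high word is nonzero.

```csharp
using System.Diagnostics;
using System.Numerics;

static class NormalizeSketch
{
    // Hypothetical custom shift (assumption): yields 0 when n == 64.
    static ulong CustomRsh(ulong a, int n) => n >= 64 ? 0UL : a >> n;

    // Shifts (hi, lo) left so that the top bit of hi is set. Requires hi != 0.
    public static (ulong lo, ulong hi) Normalize(ulong lo, ulong hi)
    {
        Debug.Assert(hi != 0);
        int shift = BitOperations.LeadingZeroCount(hi); // in [0..63] because hi != 0

        // shift < 64, so the native operator is safe for "<< shift".
        ulong newLo = lo << shift;

        // 64 - shift is in [1..64]; when shift == 0 it is exactly 64,
        // where a native ">>" would return lo unchanged instead of 0.
        ulong newHi = (hi << shift) | CustomRsh(lo, 64 - shift);
        return (newLo, newHi);
    }
}
```

With shift == 0, a native `lo >> 64` would OR the whole low word into the high word, which is the case the custom shift guards against.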
In Lsh(UInt256,...) and Rsh(UInt256,...), we know that, in the bit shift section at the end, n cannot be 64, as we have already done all the whole-word shifts we could do before reaching it. We also know it can't be zero, as we return early when the shift consists only of whole-word shifts (0 to 4 words). So, since n is strictly in [1..63], and therefore 64 - n is as well, we can replace all the shifts there with native ones.
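
A minimal sketch of that final bit-shift step, assuming the whole-word shifts have already been applied and n is in [1..63]. `LshBits` is a hypothetical helper on four loose words (u0 least significant), not the actual UInt256 code.

```csharp
using System.Diagnostics;

static class BitShiftSketch
{
    // Shifts a 256-bit value left by n bits, for 1 <= n <= 63.
    public static (ulong u0, ulong u1, ulong u2, ulong u3) LshBits(
        ulong u0, ulong u1, ulong u2, ulong u3, int n)
    {
        Debug.Assert(n > 0 && n < 64);
        int r = 64 - n; // also in [1..63], so native ">>" is safe too

        return (
            u0 << n,
            (u1 << n) | (u0 >> r), // carry the top n bits of u0 into u1
            (u2 << n) | (u1 >> r),
            (u3 << n) | (u2 >> r)
        );
    }
}
```

Both n and 64 - n stay inside the range where the native and custom shifts agree, so no custom shift is needed in this section.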

With a very simple benchmark this seems to have had a significant effect on Lsh performance, but it needs more thorough testing.
Whether these benefits justify the additional complexity is also an important point.

Copilot

Pull Request Overview

This PR refactors shift operations in the UInt256 implementation so that native ulong shifts are used when the custom shift behavior is not required, improving performance and simplifying the code.

Introduces aggressively inlined helpers NativeLsh and NativeRsh.
Replaces many calls to the custom Lsh/Rsh with native shift operations in Udivrem, Div64, Lsh, and Rsh.
Adds an explicit DivideByZeroException branch in Udivrem to handle invalid divisor cases.

Hyodar · 2025-05-09T17:16:47Z

Comparison with the package benchmarks:

Current `Lsh` Benchmarks

Method	EnvironmentVariables	A	D	Mean	Error	StdDev	Ratio	RatioSD	Allocated	Alloc Ratio
LeftShift_UInt256	Empty	(619(...)658) [156]	(1559(...)5546) [24]	2.343 ns	0.0092 ns	0.0077 ns	1.00	0.00	-	NA
LeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(1559(...)5546) [24]	6.420 ns	0.0443 ns	0.0414 ns	2.74	0.02	-	NA

LeftShift_UInt256	Empty	(619(...)658) [156]	(1649(...)6166) [24]	2.364 ns	0.0286 ns	0.0268 ns	1.00	0.02	-	NA
LeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(1649(...)6166) [24]	6.404 ns	0.0424 ns	0.0376 ns	2.71	0.03	-	NA

LeftShift_UInt256	Empty	(619(...)658) [156]	(1755(...)2844) [24]	2.365 ns	0.0241 ns	0.0226 ns	1.00	0.01	-	NA
LeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(1755(...)2844) [24]	6.394 ns	0.0235 ns	0.0196 ns	2.70	0.03	-	NA

LeftShift_UInt256	Empty	(115(...)935) [160]	(1559(...)5546) [24]	2.330 ns	0.0089 ns	0.0070 ns	1.00	0.00	-	NA
LeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(1559(...)5546) [24]	6.641 ns	0.0225 ns	0.0210 ns	2.85	0.01	-	NA

LeftShift_UInt256	Empty	(115(...)935) [160]	(1649(...)6166) [24]	2.339 ns	0.0088 ns	0.0082 ns	1.00	0.00	-	NA
LeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(1649(...)6166) [24]	6.404 ns	0.0367 ns	0.0343 ns	2.74	0.02	-	NA

LeftShift_UInt256	Empty	(115(...)935) [160]	(1755(...)2844) [24]	2.359 ns	0.0158 ns	0.0148 ns	1.00	0.01	-	NA
LeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(1755(...)2844) [24]	6.391 ns	0.0383 ns	0.0358 ns	2.71	0.02	-	NA

Previous `Lsh` Benchmarks

Method	EnvironmentVariables	A	D	Mean	Error	StdDev	Ratio	RatioSD	Allocated	Alloc Ratio
OriginalLeftShift_UInt256	Empty	(619(...)658) [156]	(1559(...)5546) [24]	2.611 ns	0.0352 ns	0.0329 ns	1.00	0.02	-	NA
OriginalLeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(1559(...)5546) [24]	6.394 ns	0.0303 ns	0.0269 ns	2.45	0.03	-	NA

OriginalLeftShift_UInt256	Empty	(619(...)658) [156]	(1649(...)6166) [24]	2.606 ns	0.0253 ns	0.0237 ns	1.00	0.01	-	NA
OriginalLeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(1649(...)6166) [24]	6.375 ns	0.0114 ns	0.0095 ns	2.45	0.02	-	NA

OriginalLeftShift_UInt256	Empty	(619(...)658) [156]	(1755(...)2844) [24]	2.598 ns	0.0123 ns	0.0115 ns	1.00	0.01	-	NA
OriginalLeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(1755(...)2844) [24]	6.644 ns	0.0367 ns	0.0306 ns	2.56	0.02	-	NA

OriginalLeftShift_UInt256	Empty	(115(...)935) [160]	(1559(...)5546) [24]	2.605 ns	0.0141 ns	0.0132 ns	1.00	0.01	-	NA
OriginalLeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(1559(...)5546) [24]	6.402 ns	0.0277 ns	0.0260 ns	2.46	0.02	-	NA

OriginalLeftShift_UInt256	Empty	(115(...)935) [160]	(1649(...)6166) [24]	2.602 ns	0.0136 ns	0.0120 ns	1.00	0.01	-	NA
OriginalLeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(1649(...)6166) [24]	6.380 ns	0.0149 ns	0.0139 ns	2.45	0.01	-	NA

OriginalLeftShift_UInt256	Empty	(115(...)935) [160]	(1755(...)2844) [24]	2.611 ns	0.0171 ns	0.0160 ns	1.00	0.01	-	NA
OriginalLeftShift_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(1755(...)2844) [24]	6.401 ns	0.0148 ns	0.0138 ns	2.45	0.02	-	NA

Scooletz · 2025-05-12T08:07:53Z

I'm running benchmarks for the shift ATM. Will provide comparison and then proceed with the review.

benaadams · 2025-05-12T08:11:07Z

src/Nethermind.Int256/UInt256.cs

            }
+            else
+            {
+                throw new DivideByZeroException();


Is this behaviour change?

Looks like in Evm we check for 0 before calling divide, so should be ok (behaviour)

Should this be a separate method, non inlinable?

Done in f08fba3

I used the existing method for throwing the exception - just followed the same as ThrowOverflowException and am now passing the message as argument

Scooletz · 2025-05-12T08:40:11Z

On my machine there's some boost for the AddMod_UInt256 case

Master: 45.702 ns ± 0.3279
Optimized: 44.297 ns ± 0.2110
Speedup: 3.1 % faster

master

Method	EnvironmentVariables	C	A	B	Mean	Error	StdDev	Ratio	RatioSD	Allocated	Alloc Ratio
AddMod_UInt256	Empty	(619(...)658) [156]	(619(...)658) [156]	(619(...)658) [156]	45.702 ns	0.3279 ns	0.3067 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(619(...)658) [156]	(619(...)658) [156]	82.629 ns	0.8514 ns	0.7964 ns	1.81	0.02	-	NA

AddMod_UInt256	Empty	(619(...)658) [156]	(619(...)658) [156]	(115(...)935) [160]	58.073 ns	0.7638 ns	0.7144 ns	1.00	0.02	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(619(...)658) [156]	(115(...)935) [160]	97.602 ns	0.5284 ns	0.4684 ns	1.68	0.02	-	NA

AddMod_UInt256	Empty	(619(...)658) [156]	(115(...)935) [160]	(619(...)658) [156]	57.227 ns	0.2468 ns	0.2308 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(115(...)935) [160]	(619(...)658) [156]	97.666 ns	0.7717 ns	0.6841 ns	1.71	0.01	-	NA

AddMod_UInt256	Empty	(619(...)658) [156]	(115(...)935) [160]	(115(...)935) [160]	57.473 ns	0.4880 ns	0.4565 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(115(...)935) [160]	(115(...)935) [160]	97.672 ns	0.6279 ns	0.5873 ns	1.70	0.02	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(619(...)658) [156]	(619(...)658) [156]	6.315 ns	0.0693 ns	0.0614 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(619(...)658) [156]	(619(...)658) [156]	19.551 ns	0.1521 ns	0.1348 ns	3.10	0.04	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(619(...)658) [156]	(115(...)935) [160]	55.849 ns	0.2608 ns	0.2312 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(619(...)658) [156]	(115(...)935) [160]	97.441 ns	0.8027 ns	0.7115 ns	1.74	0.01	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(115(...)935) [160]	(619(...)658) [156]	56.123 ns	0.4660 ns	0.4359 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(115(...)935) [160]	(619(...)658) [156]	98.184 ns	0.7642 ns	0.7148 ns	1.75	0.02	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(115(...)935) [160]	(115(...)935) [160]	56.126 ns	0.3348 ns	0.2614 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(115(...)935) [160]	(115(...)935) [160]	98.513 ns	0.4437 ns	0.3705 ns	1.76	0.01	-	NA

shift optimized

Method	EnvironmentVariables	C	A	B	Mean	Error	StdDev	Ratio	RatioSD	Allocated	Alloc Ratio
AddMod_UInt256	Empty	(619(...)658) [156]	(619(...)658) [156]	(619(...)658) [156]	44.297 ns	0.2110 ns	0.1871 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(619(...)658) [156]	(619(...)658) [156]	77.610 ns	0.7247 ns	0.6779 ns	1.75	0.02	-	NA

AddMod_UInt256	Empty	(619(...)658) [156]	(619(...)658) [156]	(115(...)935) [160]	52.196 ns	0.4131 ns	0.3864 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(619(...)658) [156]	(115(...)935) [160]	94.194 ns	0.8050 ns	0.7530 ns	1.80	0.02	-	NA

AddMod_UInt256	Empty	(619(...)658) [156]	(115(...)935) [160]	(619(...)658) [156]	52.454 ns	0.7119 ns	0.6659 ns	1.00	0.02	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(115(...)935) [160]	(619(...)658) [156]	94.541 ns	0.7663 ns	0.6399 ns	1.80	0.02	-	NA

AddMod_UInt256	Empty	(619(...)658) [156]	(115(...)935) [160]	(115(...)935) [160]	52.315 ns	0.4775 ns	0.4466 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(115(...)935) [160]	(115(...)935) [160]	95.140 ns	0.5192 ns	0.4602 ns	1.82	0.02	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(619(...)658) [156]	(619(...)658) [156]	5.502 ns	0.0379 ns	0.0336 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(619(...)658) [156]	(619(...)658) [156]	19.468 ns	0.1376 ns	0.1287 ns	3.54	0.03	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(619(...)658) [156]	(115(...)935) [160]	52.386 ns	0.3741 ns	0.3499 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(619(...)658) [156]	(115(...)935) [160]	96.146 ns	0.5248 ns	0.4653 ns	1.84	0.01	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(115(...)935) [160]	(619(...)658) [156]	52.290 ns	0.4854 ns	0.4540 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(115(...)935) [160]	(619(...)658) [156]	94.458 ns	0.6489 ns	0.6070 ns	1.81	0.02	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(115(...)935) [160]	(115(...)935) [160]	52.928 ns	0.5702 ns	0.5334 ns	1.00	0.01	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(115(...)935) [160]	(115(...)935) [160]	96.249 ns	0.9551 ns	0.8934 ns	1.82	0.02	-	NA

Scooletz

Add asserts; don't throw in the method body. I added benchmarks in a comment: for AddMod_UInt256 it is ~3% faster

Scooletz · 2025-05-12T08:46:57Z

src/Nethermind.Int256/UInt256.cs

+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        internal static ulong NativeLsh(ulong a, int n)
+        {
+            return a << n;


Could we Debug.Assert the requirement under which this method can be called (n < 64)?

Sure! Done in 2def59d

Scooletz · 2025-05-12T08:47:11Z

src/Nethermind.Int256/UInt256.cs

+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        internal static ulong NativeRsh(ulong a, int n)
+        {
+            return a >> n;


Could we Debug.Assert the requirement when this method can be called?

Sure! Done in 2def59d
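
For reference, one way the requested asserts could look. This is a sketch based on the helper signatures quoted above; the exact condition and message used in 2def59d are not shown in this conversation, so the assert bodies here are assumptions.

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;

static class NativeShiftSketch
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static ulong NativeLsh(ulong a, int n)
    {
        // Assumed precondition: the native shift only matches the custom one for 0 <= n < 64.
        Debug.Assert(n >= 0 && n < 64, "native shift requires 0 <= n < 64");
        return a << n;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static ulong NativeRsh(ulong a, int n)
    {
        Debug.Assert(n >= 0 && n < 64, "native shift requires 0 <= n < 64");
        return a >> n;
    }
}
```

Debug.Assert calls are compiled out of release builds, so the check documents the precondition without adding work to the optimized path.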

Scooletz · 2025-05-12T08:47:30Z

src/Nethermind.Int256/UInt256.cs

            }
+            else
+            {
+                throw new DivideByZeroException();


Should this be a separate method, non inlinable?

benaadams · 2025-05-14T08:13:09Z

src/Nethermind.Int256/UInt256.cs


        [DoesNotReturn]
-        private static void ThrowDivideByZeroException() => throw new DivideByZeroException("y == 0");
+        private static void ThrowDivideByZeroException(string message) => throw new DivideByZeroException(message);


I'd scrap the message; will add unneeded string load code; also is obvious what is zero

Agreed, just didn't want to change the existing approach. Done in 9740891
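
The throw-helper pattern discussed in this thread can be sketched as follows. `Div` is a hypothetical caller added for illustration; only the shape of `ThrowDivideByZeroException` (marked [DoesNotReturn], no message argument) follows what the thread describes.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;

static class ThrowHelperSketch
{
    // Hypothetical caller: the hot path contains only a compare and a call.
    public static ulong Div(ulong x, ulong y)
    {
        if (y == 0)
        {
            ThrowDivideByZeroException();
        }
        return x / y;
    }

    // The throw lives in its own method; without a message there is no string to load.
    [DoesNotReturn]
    private static void ThrowDivideByZeroException() => throw new DivideByZeroException();
}
```

Keeping the `throw` out of the arithmetic method keeps the exception construction code out of the caller's body.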

Hyodar added 2 commits May 9, 2025 11:55

Use native shifts when possible

d7429af

Fix custom shift not being used in relevant places

14738e0

benaadams requested a review from Copilot May 9, 2025 15:32

Copilot AI reviewed May 9, 2025

Hyodar marked this pull request as ready for review May 9, 2025 17:18

LukaszRozmej requested review from Scooletz and benaadams May 10, 2025 09:17

benaadams reviewed May 12, 2025

Scooletz reviewed May 12, 2025

Hyodar added 2 commits May 12, 2025 20:45

Throw division by zero through separate method

f08fba3

Add debug shift amount asserts to native shift methods

2def59d

LukaszRozmej approved these changes May 14, 2025

benaadams reviewed May 14, 2025

benaadams approved these changes May 14, 2025

Hyodar and others added 4 commits May 14, 2025 11:56

Remove message from ThrowDivideByZeroException

9740891

Merge branch 'main' into shift-optimization

3af9cb0

Fix merge issue

cdcfab9

Specify behavior on Udivrem if d is zero

8e4c0a9

benaadams approved these changes May 15, 2025

benaadams merged commit 719f07f into NethermindEth:main May 15, 2025
4 checks passed
