Contributor

@Hyodar Hyodar commented May 9, 2025

In some cases, we use our custom ulong Lsh and Rsh implementations instead of the native operators. This is necessary when n == 64: for a 64-bit operand, C# uses only the low six bits of the shift count (n & 63), so a << 64 == a, whereas we want a << 64 == 0.
However, in some of these cases in UInt256 we can bound the value of n and replace the custom shift with the native one, avoiding some extra steps.
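
To make the difference concrete, here is a minimal sketch of the two behaviours. The class name `ShiftSemanticsSketch` and the bodies of `Lsh` and `Rsh` are assumptions written for illustration; they are not copied from the UInt256 source and assume a non-negative n.

```csharp
using System;

public static class ShiftSemanticsSketch
{
    // Hypothetical custom shifts: return 0 once the count reaches 64,
    // instead of letting the count wrap around. Assumes n >= 0.
    public static ulong Lsh(ulong a, int n) => n >= 64 ? 0UL : a << n;

    public static ulong Rsh(ulong a, int n) => n >= 64 ? 0UL : a >> n;

    // Native shift: for a ulong operand, C# only uses the low six bits
    // of the count, so a count of 64 behaves like a count of 0.
    public static ulong NativeLsh(ulong a, int n) => a << n;
}
```

With a = 0xFF and n = 64, `NativeLsh` returns a unchanged while `Lsh` returns 0. The extra comparison in the custom version is the cost the PR tries to avoid where n is known to be below 64.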

  1. In Udivrem, shift is LeadingZeroes(FirstNonZero(d.u[0], d.u[1], d.u[2], d.u[3])). It could only be 64 if all of the words were zero, but then d would be zero, which is a division by zero. shift can still be zero, though. So shifting by shift can be native, while shifting by 64 - shift still requires the custom shift (64 - shift is 64 when shift is 0).
  2. Div64 is a similar situation. shift < 64, since d can only have 64 leading zeroes if it is zero, but shift can still be zero. Again, shifting by shift can be native, while shifting by 64 - shift still requires the custom shift.
  3. In Lsh(UInt256,...) and Rsh(UInt256,...), we know that in the bit-shift section at the end n cannot be 64, as all possible whole-word shifts have already been done. It cannot be zero either, as we return early when the shift is exactly 0-4 whole words. Since n is strictly in [1..63], so is 64 - n, and all the shifts there can be native.
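
The reasoning in point 3 can be sketched as follows. This is an illustration under assumptions, not the UInt256 code: the class name `Shift256Sketch` is hypothetical, the value is held in a four-element array (index 0 least significant) rather than a struct, and n is assumed to be non-negative.

```csharp
using System;
using System.Diagnostics;

public static class Shift256Sketch
{
    // Shifts a 256-bit value, stored as four 64-bit words with x[0] least
    // significant, left by n bits. Returns a new array (a real
    // implementation would avoid the allocation).
    public static ulong[] Lsh(ulong[] x, int n)
    {
        var r = new ulong[4];
        if (n >= 256) return r;              // everything shifted out

        int words = n / 64;                  // whole-word part of the shift
        int bits = n % 64;                   // remaining bit shift, 0..63

        // Whole-word moves first.
        for (int i = 3; i >= words; i--) r[i] = x[i - words];

        // Pure word shift: return early, so bits == 0 never reaches the code below.
        if (bits == 0) return r;

        // Here bits is in [1..63], hence 64 - bits is in [1..63] as well,
        // and the native operators give the intended result.
        Debug.Assert(bits > 0 && bits < 64);
        for (int i = 3; i > words; i--)
        {
            r[i] = (r[i] << bits) | (r[i - 1] >> (64 - bits));
        }
        r[words] <<= bits;
        return r;
    }
}
```

For example, shifting the value 1 left by 65 bits moves it up one word and then one bit, giving a second word of 2 and zeros elsewhere.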

In a very simple benchmark this appears to improve Lsh performance noticeably, but it needs better testing.
Whether these benefits justify the additional complexity is also an important point.

@benaadams benaadams requested a review from Copilot May 9, 2025 15:32
Copilot AI left a comment

Pull Request Overview

This PR refactors shift operations in the UInt256 implementation so that native ulong shifts are used when the custom shift behavior is not required, improving performance and simplifying the code.

  • Introduces aggressively inlined helpers NativeLsh and NativeRsh.
  • Replaces many calls to the custom Lsh/Rsh with native shift operations in Udivrem, Div64, Lsh, and Rsh.
  • Adds an explicit DivideByZeroException branch in Udivrem to handle invalid divisor cases.
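
As an illustration of the Udivrem and Div64 cases summarized above, the sketch below normalizes a single-word divisor and a two-word dividend. Everything here is an assumption made for the example: the class name `NormalizeSketch`, the method `Normalize`, and the local `Rsh` helper are hypothetical and only mirror the reasoning in the PR description.

```csharp
using System;
using System.Numerics;

public static class NormalizeSketch
{
    // Hypothetical custom right shift that yields 0 for counts of 64 or more.
    private static ulong Rsh(ulong a, int n) => n >= 64 ? 0UL : a >> n;

    // Shifts the divisor d left until its top bit is set and applies the
    // same shift to the dividend (hi:lo). The top bits of hi are dropped
    // here; a full implementation keeps them in an extra word.
    public static (ulong dn, ulong hn, ulong ln) Normalize(ulong d, ulong hi, ulong lo)
    {
        if (d == 0) throw new DivideByZeroException();

        // d != 0, so the count is in [0..63] and never 64.
        int shift = BitOperations.LeadingZeroCount(d);

        ulong dn = d << shift;                            // native: shift < 64
        ulong hn = (hi << shift) | Rsh(lo, 64 - shift);   // custom: 64 - shift is 64 when shift == 0
        ulong ln = lo << shift;                           // native: shift < 64
        return (dn, hn, ln);
    }
}
```

When the divisor already has its top bit set, shift is 0 and a native `lo >> 64` would leave lo unchanged and corrupt hn, which is why that one shift stays custom.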

@Hyodar
Contributor Author

Hyodar commented May 9, 2025

Comparison with the package benchmarks:

Current `Lsh` Benchmarks

| Method | EnvironmentVariables | A | D | Mean | Error | StdDev | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| LeftShift_UInt256 | Empty | (619(...)658) [156] | (1559(...)5546) [24] | 2.343 ns | 0.0092 ns | 0.0077 ns | 1.00 | 0.00 | - | NA |
| LeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (1559(...)5546) [24] | 6.420 ns | 0.0443 ns | 0.0414 ns | 2.74 | 0.02 | - | NA |
| LeftShift_UInt256 | Empty | (619(...)658) [156] | (1649(...)6166) [24] | 2.364 ns | 0.0286 ns | 0.0268 ns | 1.00 | 0.02 | - | NA |
| LeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (1649(...)6166) [24] | 6.404 ns | 0.0424 ns | 0.0376 ns | 2.71 | 0.03 | - | NA |
| LeftShift_UInt256 | Empty | (619(...)658) [156] | (1755(...)2844) [24] | 2.365 ns | 0.0241 ns | 0.0226 ns | 1.00 | 0.01 | - | NA |
| LeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (1755(...)2844) [24] | 6.394 ns | 0.0235 ns | 0.0196 ns | 2.70 | 0.03 | - | NA |
| LeftShift_UInt256 | Empty | (115(...)935) [160] | (1559(...)5546) [24] | 2.330 ns | 0.0089 ns | 0.0070 ns | 1.00 | 0.00 | - | NA |
| LeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (1559(...)5546) [24] | 6.641 ns | 0.0225 ns | 0.0210 ns | 2.85 | 0.01 | - | NA |
| LeftShift_UInt256 | Empty | (115(...)935) [160] | (1649(...)6166) [24] | 2.339 ns | 0.0088 ns | 0.0082 ns | 1.00 | 0.00 | - | NA |
| LeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (1649(...)6166) [24] | 6.404 ns | 0.0367 ns | 0.0343 ns | 2.74 | 0.02 | - | NA |
| LeftShift_UInt256 | Empty | (115(...)935) [160] | (1755(...)2844) [24] | 2.359 ns | 0.0158 ns | 0.0148 ns | 1.00 | 0.01 | - | NA |
| LeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (1755(...)2844) [24] | 6.391 ns | 0.0383 ns | 0.0358 ns | 2.71 | 0.02 | - | NA |

Previous `Lsh` Benchmarks

| Method | EnvironmentVariables | A | D | Mean | Error | StdDev | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| OriginalLeftShift_UInt256 | Empty | (619(...)658) [156] | (1559(...)5546) [24] | 2.611 ns | 0.0352 ns | 0.0329 ns | 1.00 | 0.02 | - | NA |
| OriginalLeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (1559(...)5546) [24] | 6.394 ns | 0.0303 ns | 0.0269 ns | 2.45 | 0.03 | - | NA |
| OriginalLeftShift_UInt256 | Empty | (619(...)658) [156] | (1649(...)6166) [24] | 2.606 ns | 0.0253 ns | 0.0237 ns | 1.00 | 0.01 | - | NA |
| OriginalLeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (1649(...)6166) [24] | 6.375 ns | 0.0114 ns | 0.0095 ns | 2.45 | 0.02 | - | NA |
| OriginalLeftShift_UInt256 | Empty | (619(...)658) [156] | (1755(...)2844) [24] | 2.598 ns | 0.0123 ns | 0.0115 ns | 1.00 | 0.01 | - | NA |
| OriginalLeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (1755(...)2844) [24] | 6.644 ns | 0.0367 ns | 0.0306 ns | 2.56 | 0.02 | - | NA |
| OriginalLeftShift_UInt256 | Empty | (115(...)935) [160] | (1559(...)5546) [24] | 2.605 ns | 0.0141 ns | 0.0132 ns | 1.00 | 0.01 | - | NA |
| OriginalLeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (1559(...)5546) [24] | 6.402 ns | 0.0277 ns | 0.0260 ns | 2.46 | 0.02 | - | NA |
| OriginalLeftShift_UInt256 | Empty | (115(...)935) [160] | (1649(...)6166) [24] | 2.602 ns | 0.0136 ns | 0.0120 ns | 1.00 | 0.01 | - | NA |
| OriginalLeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (1649(...)6166) [24] | 6.380 ns | 0.0149 ns | 0.0139 ns | 2.45 | 0.01 | - | NA |
| OriginalLeftShift_UInt256 | Empty | (115(...)935) [160] | (1755(...)2844) [24] | 2.611 ns | 0.0171 ns | 0.0160 ns | 1.00 | 0.01 | - | NA |
| OriginalLeftShift_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (1755(...)2844) [24] | 6.401 ns | 0.0148 ns | 0.0138 ns | 2.45 | 0.02 | - | NA |

@Hyodar Hyodar marked this pull request as ready for review May 9, 2025 17:18
@Scooletz
Contributor

I'm running benchmarks for the shift ATM. Will provide comparison and then proceed with the review.

}
else
{
    throw new DivideByZeroException();
Member

Is this a behaviour change?

Member

It looks like in Evm we check for 0 before calling divide, so the behaviour should be OK.

Contributor

Should this be a separate, non-inlinable method?

Contributor Author

Done in f08fba3

Contributor Author

I used the existing method for throwing the exception, following the same pattern as ThrowOverflowException, and am now passing the message as an argument.

@Scooletz
Contributor

Scooletz commented May 12, 2025

On my machine there is a small boost for the AddMod_UInt256 case (first row of each table below):

Master: 45.702 ns ± 0.3279
Optimized: 44.297 ns ± 0.2110
Speedup: 3.1 % faster

master

| Method | EnvironmentVariables | C | A | B | Mean | Error | StdDev | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AddMod_UInt256 | Empty | (619(...)658) [156] | (619(...)658) [156] | (619(...)658) [156] | 45.702 ns | 0.3279 ns | 0.3067 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (619(...)658) [156] | (619(...)658) [156] | 82.629 ns | 0.8514 ns | 0.7964 ns | 1.81 | 0.02 | - | NA |
| AddMod_UInt256 | Empty | (619(...)658) [156] | (619(...)658) [156] | (115(...)935) [160] | 58.073 ns | 0.7638 ns | 0.7144 ns | 1.00 | 0.02 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (619(...)658) [156] | (115(...)935) [160] | 97.602 ns | 0.5284 ns | 0.4684 ns | 1.68 | 0.02 | - | NA |
| AddMod_UInt256 | Empty | (619(...)658) [156] | (115(...)935) [160] | (619(...)658) [156] | 57.227 ns | 0.2468 ns | 0.2308 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (115(...)935) [160] | (619(...)658) [156] | 97.666 ns | 0.7717 ns | 0.6841 ns | 1.71 | 0.01 | - | NA |
| AddMod_UInt256 | Empty | (619(...)658) [156] | (115(...)935) [160] | (115(...)935) [160] | 57.473 ns | 0.4880 ns | 0.4565 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (115(...)935) [160] | (115(...)935) [160] | 97.672 ns | 0.6279 ns | 0.5873 ns | 1.70 | 0.02 | - | NA |
| AddMod_UInt256 | Empty | (115(...)935) [160] | (619(...)658) [156] | (619(...)658) [156] | 6.315 ns | 0.0693 ns | 0.0614 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (619(...)658) [156] | (619(...)658) [156] | 19.551 ns | 0.1521 ns | 0.1348 ns | 3.10 | 0.04 | - | NA |
| AddMod_UInt256 | Empty | (115(...)935) [160] | (619(...)658) [156] | (115(...)935) [160] | 55.849 ns | 0.2608 ns | 0.2312 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (619(...)658) [156] | (115(...)935) [160] | 97.441 ns | 0.8027 ns | 0.7115 ns | 1.74 | 0.01 | - | NA |
| AddMod_UInt256 | Empty | (115(...)935) [160] | (115(...)935) [160] | (619(...)658) [156] | 56.123 ns | 0.4660 ns | 0.4359 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (115(...)935) [160] | (619(...)658) [156] | 98.184 ns | 0.7642 ns | 0.7148 ns | 1.75 | 0.02 | - | NA |
| AddMod_UInt256 | Empty | (115(...)935) [160] | (115(...)935) [160] | (115(...)935) [160] | 56.126 ns | 0.3348 ns | 0.2614 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (115(...)935) [160] | (115(...)935) [160] | 98.513 ns | 0.4437 ns | 0.3705 ns | 1.76 | 0.01 | - | NA |

shift optimized

| Method | EnvironmentVariables | C | A | B | Mean | Error | StdDev | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AddMod_UInt256 | Empty | (619(...)658) [156] | (619(...)658) [156] | (619(...)658) [156] | 44.297 ns | 0.2110 ns | 0.1871 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (619(...)658) [156] | (619(...)658) [156] | 77.610 ns | 0.7247 ns | 0.6779 ns | 1.75 | 0.02 | - | NA |
| AddMod_UInt256 | Empty | (619(...)658) [156] | (619(...)658) [156] | (115(...)935) [160] | 52.196 ns | 0.4131 ns | 0.3864 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (619(...)658) [156] | (115(...)935) [160] | 94.194 ns | 0.8050 ns | 0.7530 ns | 1.80 | 0.02 | - | NA |
| AddMod_UInt256 | Empty | (619(...)658) [156] | (115(...)935) [160] | (619(...)658) [156] | 52.454 ns | 0.7119 ns | 0.6659 ns | 1.00 | 0.02 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (115(...)935) [160] | (619(...)658) [156] | 94.541 ns | 0.7663 ns | 0.6399 ns | 1.80 | 0.02 | - | NA |
| AddMod_UInt256 | Empty | (619(...)658) [156] | (115(...)935) [160] | (115(...)935) [160] | 52.315 ns | 0.4775 ns | 0.4466 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (115(...)935) [160] | (115(...)935) [160] | 95.140 ns | 0.5192 ns | 0.4602 ns | 1.82 | 0.02 | - | NA |
| AddMod_UInt256 | Empty | (115(...)935) [160] | (619(...)658) [156] | (619(...)658) [156] | 5.502 ns | 0.0379 ns | 0.0336 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (619(...)658) [156] | (619(...)658) [156] | 19.468 ns | 0.1376 ns | 0.1287 ns | 3.54 | 0.03 | - | NA |
| AddMod_UInt256 | Empty | (115(...)935) [160] | (619(...)658) [156] | (115(...)935) [160] | 52.386 ns | 0.3741 ns | 0.3499 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (619(...)658) [156] | (115(...)935) [160] | 96.146 ns | 0.5248 ns | 0.4653 ns | 1.84 | 0.01 | - | NA |
| AddMod_UInt256 | Empty | (115(...)935) [160] | (115(...)935) [160] | (619(...)658) [156] | 52.290 ns | 0.4854 ns | 0.4540 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (115(...)935) [160] | (619(...)658) [156] | 94.458 ns | 0.6489 ns | 0.6070 ns | 1.81 | 0.02 | - | NA |
| AddMod_UInt256 | Empty | (115(...)935) [160] | (115(...)935) [160] | (115(...)935) [160] | 52.928 ns | 0.5702 ns | 0.5334 ns | 1.00 | 0.01 | - | NA |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (115(...)935) [160] | (115(...)935) [160] | 96.249 ns | 0.9551 ns | 0.8934 ns | 1.82 | 0.02 | - | NA |

Contributor

@Scooletz Scooletz left a comment

Add asserts, don't throw in the body. I added benchmarks in a comment; for AddMod_UInt256 it is ~3% faster.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static ulong NativeLsh(ulong a, int n)
{
    return a << n;
Contributor

Could we Debug.Assert the requirement under which this method can be called (n < 64)?

Contributor Author

Sure! Done in 2def59d

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static ulong NativeRsh(ulong a, int n)
{
    return a >> n;
Contributor

Could we Debug.Assert the requirement under which this method can be called?

Contributor Author

Sure! Done in 2def59d
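
For readers following along, the requested assertions might look like the sketch below. The exact condition and message used in 2def59d are not shown in this conversation, so the range check `n >= 0 && n < 64` and the class name `NativeShiftSketch` are assumptions (the helpers are `internal` in the PR; they are `public` here only to keep the example self-contained).

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class NativeShiftSketch
{
    // Native left shift. Only valid when the caller can prove n < 64,
    // because the operator uses just the low six bits of n.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ulong NativeLsh(ulong a, int n)
    {
        // Checked in Debug builds only; compiled out of Release builds,
        // so the hot path stays a single shift instruction.
        Debug.Assert(n >= 0 && n < 64, "NativeLsh requires 0 <= n < 64");
        return a << n;
    }

    // Native right shift with the same precondition.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ulong NativeRsh(ulong a, int n)
    {
        Debug.Assert(n >= 0 && n < 64, "NativeRsh requires 0 <= n < 64");
        return a >> n;
    }
}
```

Using Debug.Assert rather than a runtime check documents the precondition and catches misuse in tests without adding a branch to Release builds.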


[DoesNotReturn]
private static void ThrowDivideByZeroException() => throw new DivideByZeroException("y == 0");
private static void ThrowDivideByZeroException(string message) => throw new DivideByZeroException(message);
Member

I'd scrap the message; it adds unneeded string-load code, and it is obvious what is zero.

Contributor Author

Agreed, just didn't want to change the existing approach. Done in 9740891
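
The outcome of this thread, a message-free throw helper kept out of the hot method, can be sketched as below. This is an illustration only: the class `ThrowHelperSketch` and the `Div` caller are hypothetical, and the explicit `NoInlining` attribute is an assumption added for clarity (the diff above only shows `[DoesNotReturn]`).

```csharp
using System;
using System.Diagnostics.CodeAnalysis;
using System.Runtime.CompilerServices;

public static class ThrowHelperSketch
{
    // Hypothetical hot-path method: the throw lives in a separate helper,
    // so this body stays small and remains a good inlining candidate.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ulong Div(ulong x, ulong y)
    {
        if (y == 0) ThrowDivideByZeroException();
        return x / y;
    }

    // No message argument: nothing extra to load at the call site, and the
    // exception type already says what went wrong.
    [DoesNotReturn]
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void ThrowDivideByZeroException() => throw new DivideByZeroException();
}
```

`[DoesNotReturn]` tells the compiler's flow analysis that execution never continues past the helper call.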

@benaadams benaadams merged commit 719f07f into NethermindEth:main May 15, 2025
4 checks passed
4 participants