perf(p/uint256): optimize allocation in multiplication, decimal parsing, and string ops #1004

Conversation
The existing `Uint` type wraps an array in a struct, so every access goes through a field (e.g., `z.arr[0]`). While the Go compiler optimizes this indirection away, Gno has no separate optimization pass, so every copy of the struct moves at least the 32-byte array plus field metadata. We therefore changed the type definition to reduce these operations.
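A minimal sketch of the type change in Go (the names `UintOld` and the exact field layout are assumptions for illustration, not the actual Gno definitions):

```go
package main

import "fmt"

// Before: the array is wrapped in a struct, so every access goes through
// a field (z.arr[i]) and every copy moves the struct shell along with the
// 32-byte payload.
type UintOld struct {
	arr [4]uint64
}

// After: a flat array type. Copies move exactly 32 bytes, and element
// access needs no field indirection (z[i] instead of z.arr[i]).
type Uint [4]uint64

func main() {
	old := UintOld{arr: [4]uint64{1, 0, 0, 0}}
	flat := Uint{1, 0, 0, 0}
	fmt.Println(old.arr[0], flat[0]) // prints "1 1"
}
```

Existing call sites change only mechanically (`z.arr[i]` becomes `z[i]`); the external API of the package is unaffected.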
Unit tests were configured with the `runtime` package to collect metrics for each function. `runtime.GC` is called so that results from previous function calls do not affect subsequent executions, and function execution is controlled to avoid variable reuse. Test targets were selected from profiling results: among the functions detected in the uint256 package, those with the highest average flat memory usage or CPU cycles were prioritized. Due to the current implementation of Gno's `runtime` package, only allocated memory can be checked (through the `MemStats` function), so no additional instrumentation data is available; memory usage, however, can still be verified sufficiently.
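The measurement loop described above can be sketched as follows. This uses Go's standard `runtime` package as a stand-in, since Gno exposes only a subset of this API; `measureAlloc` and `sink` are illustrative names, not part of the actual test file:

```go
package main

import (
	"fmt"
	"runtime"
)

// sink is a package-level variable that forces allocations in the
// measured closure to escape to the heap.
var sink []byte

// measureAlloc approximates the methodology described above: force a GC
// so earlier calls do not pollute the numbers, then diff TotalAlloc
// across N iterations of fn.
func measureAlloc(iters int, fn func()) uint64 {
	runtime.GC() // isolate this measurement from prior allocations
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	for i := 0; i < iters; i++ {
		fn()
	}
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}

func main() {
	allocated := measureAlloc(200, func() {
		sink = make([]byte, 64) // stand-in for the function under test
	})
	fmt.Println(allocated > 0)
}
```

`TotalAlloc` is cumulative bytes allocated, so the before/after difference captures only what the measured loop allocated.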
Changed the implementation of `fromDecimal` from repeatedly creating `base := NewUint(num)`, `mulResult := new(Uint)`, and `addResult := new(Uint)` inside the loop to reusing temporary variables. This eliminates heap allocations by using stack variables instead. After the fix, metric collection tests show that the decimal-related functions save approximately 380 KB of additional memory over 200 iterations. The improved figures are shown in the table below.

| Function | Original (B) | Final (B) | Total Improvement (B) | Improvement Rate |
|----------|------------|------------|------------|-------|
| SetFromDecimal | 55,080,184 | 54,531,384 | -548,800 | -1.0% |
| fromDecimal | 54,075,384 | 53,526,584 | -548,800 | -1.0% |
| MustFromDecimal | 55,542,584 | 54,904,184 | -638,400 | -1.1% |
| FromDecimal | 55,667,384 | 55,028,984 | -638,400 | -1.1% |
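The reuse pattern can be illustrated with a simplified parser. This is a hedged sketch, not the actual `fromDecimal`: it handles only values that fit in one 64-bit word, and `setFromDecimalSketch` is a hypothetical name, but it shows the key idea of accumulating into reused stack locals instead of calling `new(Uint)` per step:

```go
package main

import (
	"fmt"
	"strconv"
)

type Uint [4]uint64

// setFromDecimalSketch parses a decimal string into z using a single
// reused stack accumulator rather than allocating base/mulResult/
// addResult with new(Uint) on every loop iteration.
func setFromDecimalSketch(z *Uint, s string) error {
	var acc uint64 // reused accumulator, lives on the stack
	for i := 0; i < len(s); i++ {
		d := s[i]
		if d < '0' || d > '9' {
			return strconv.ErrSyntax
		}
		acc = acc*10 + uint64(d-'0')
	}
	*z = Uint{acc, 0, 0, 0} // value assignment, no heap allocation
	return nil
}

func main() {
	var z Uint
	if err := setFromDecimalSketch(&z, "12345"); err != nil {
		panic(err)
	}
	fmt.Println(z[0]) // prints "12345"
}
```

The real implementation processes digits in larger chunks and multiplies the running 256-bit value by power-of-ten multipliers, but the allocation-avoidance pattern is the same.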
Modified the implementation of the `Dec` function to reuse temporary variables, similar to the approach applied to `fromDecimal`.
After the fix, the `Dec` function saves approximately 1 MB of memory over 200 iterations. Since the `ToString` function internally calls `Dec`, the same improvement is observed there.
| Function | Before Optimization (B) | After Optimization (B) | Improvement (B) | Improvement Rate |
|----------|------------|------------|------------|-------|
| Dec | 56,978,184 | 55,972,384 | -1,005,800 | -1.8% |
| ToString | 57,306,184 | 56,300,384 | -1,005,800 | -1.8% |
Key changes are as follows:
- []byte("000...") → [98]byte array
- make([]byte, 0, 19) → [19]byte array
- Removed new(Uint) and use value copy instead
Because copying 32 bytes is less expensive than keeping the memory resident on the heap, frequently used values are now defined as values and copied when needed.
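The fixed-buffer pattern behind these changes can be sketched with a small formatter. This is illustrative only (`decSketch` is a hypothetical name, and a `[20]byte` scratch array is used here because a `uint64` has at most 20 decimal digits; the PR's `[98]byte` and `[19]byte` sizes correspond to full 256-bit values and 19-digit chunks):

```go
package main

import "fmt"

// decSketch formats v into a fixed-size stack array, writing digits
// backwards from the end. No heap allocation happens for the scratch
// space; only the final string conversion allocates.
func decSketch(v uint64) string {
	var buf [20]byte // stack array instead of make([]byte, ...)
	i := len(buf)
	for {
		i--
		buf[i] = byte('0' + v%10)
		v /= 10
		if v == 0 {
			break
		}
	}
	return string(buf[i:])
}

func main() {
	fmt.Println(decSketch(12345)) // prints "12345"
}
```

Because the array is a local value, each call gets its own scratch space with no allocator involvement, and the 20-byte copy cost is trivial compared to a heap allocation.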
Restructured `umulStep` to add `z` and `carry` to the `bits.Mul64` result in a single operation. By using the pattern `lo, carry := bits.Add64(lo, z, carry)`, the number of `bits.Add64` calls was reduced by half. The same pattern was also applied to the `umulHop` function.

## Memory Profiling Results

### Overall Memory Usage

| Version | Total Memory | Reduction |
|---|---|---|
| Before optimization | 404,934,090 B | - |
| After this commit | 363,401,569 B | -10.3% |

### math/bits.Add64 Comparison

| Metric | Before | After | Change |
|---|---|---|---|
| Memory | 81,993,472 B | 47,627,008 B | -41.9% |
| Calls | 116,468 | 67,652 | -41.9% |
| Rank | #1 | #2 | - |

### Analysis

The optimization reduced `bits.Add64` calls by 41.9%, resulting in a 34.4 MB memory reduction. This single change accounts for nearly all of the 10.3% total memory improvement. The `bits.Add64` function is no longer the top memory consumer, having been surpassed by `strconv.ParseUint`.
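A hedged sketch of the restructured step (not the actual Gno code): note that `bits.Add64`'s third argument must be 0 or 1, so this variant folds each full-word addend with one `Add64` call plus a plain add into `hi`, cutting the `Add64` count from four to two:

```go
package main

import (
	"fmt"
	"math/bits"
)

// umulStepSketch computes hi:lo = x*y + z + carry. Since
// x*y + z + carry <= (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1, the result
// always fits in 128 bits and the plain additions into hi cannot
// overflow.
func umulStepSketch(z, x, y, carry uint64) (hi, lo uint64) {
	hi, lo = bits.Mul64(x, y)
	var c uint64
	lo, c = bits.Add64(lo, z, 0) // fold z; carry bit goes into hi
	hi += c
	lo, c = bits.Add64(lo, carry, 0) // fold carry word the same way
	hi += c
	return hi, lo
}

func main() {
	hi, lo := umulStepSketch(5, 2, 3, 7)
	fmt.Println(hi, lo) // 2*3 + 5 + 7 = 18, prints "0 18"
}
```

The same folding applies to `umulHop`, which computes `hi:lo = z + x*y` with one addend instead of two.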
Optimize `umul` by detecting the highest non-zero word in each operand and using a length-aware nested-loop multiply.

Changes:
- Add a `highestNonZeroWord` helper to short-circuit zero inputs and trim the iteration range
- Skip unnecessary partial-product work when operands have leading zero words
- Replace internal accumulation with localized `bits.Mul64` + `bits.Add64` chaining for precise carry propagation

The existing `umulStep`/`umulHop` functions continue to handle carries correctly and remain in use elsewhere. This redesign preserves all edge cases (zero operands, partial-word products, full 256-bit operands) while reducing operations for small values.

## Memory Profiling Results

### Overall Memory Usage

| Version | Total Memory | vs Previous | vs Baseline |
|---|---|---|---|
| Baseline | 404,934,090 B | - | - |
| After Add64 opt | 363,401,569 B | -10.3% | -10.3% |
| After this commit | 300,436,304 B | -17.3% | -25.8% |

### Key Function Comparison (Before → After this commit)

| Function | Memory Before | Memory After | Change |
|---|---|---|---|
| math/bits.Add64 | 47,627,008 B | 19,338,880 B | -59.4% |
| math/bits.Mul64 | 39,673,856 B | 10,871,808 B | -72.6% |
| umul | 5,239,472 B | 9,997,896 B | +90.8% |

| Function | Calls Before | Calls After | Change |
|---|---|---|---|
| math/bits.Add64 | 67,652 | 27,470 | -59.4% |
| math/bits.Mul64 | 38,744 | 10,617 | -72.6% |
| umul | 8,036 | 16,872 | +110.0% |

### Analysis

The length-aware multiplication reduces `bits.Mul64` calls by 72.6% and `bits.Add64` calls by 59.4% by skipping unnecessary 256-bit operations when operands have leading zero words. The new `highestNonZeroWord` helper adds 11.8 MB of overhead but enables 62.9 MB of savings in arithmetic primitives, yielding a net reduction of ~63 MB (17.3%) from the previous optimization pass. The `umulStep` and `umulHop` functions no longer appear in the top 15 memory consumers, indicating the optimization successfully bypasses full-width multiplication for smaller operands.
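The length-aware multiply can be sketched as below. The helper name `highestNonZeroWord` comes from the PR description; its exact signature and the loop structure here are assumptions for illustration:

```go
package main

import (
	"fmt"
	"math/bits"
)

type Uint [4]uint64

// highestNonZeroWordSketch returns the number of significant 64-bit
// words in x (0 for a zero value), so the multiply loops can skip
// leading zero words entirely.
func highestNonZeroWordSketch(x *Uint) int {
	for i := 3; i >= 0; i-- {
		if x[i] != 0 {
			return i + 1
		}
	}
	return 0
}

// mulLenAwareSketch multiplies only the significant words of x and y,
// accumulating partial products into a 512-bit (8-word) result with
// localized Mul64 + Add64 carry chaining.
func mulLenAwareSketch(x, y *Uint) [8]uint64 {
	var res [8]uint64
	nx, ny := highestNonZeroWordSketch(x), highestNonZeroWordSketch(y)
	for i := 0; i < nx; i++ {
		var carry uint64
		for j := 0; j < ny; j++ {
			hi, lo := bits.Mul64(x[i], y[j])
			var c uint64
			lo, c = bits.Add64(lo, res[i+j], 0) // fold existing word
			hi += c
			lo, c = bits.Add64(lo, carry, 0) // fold running carry
			hi += c
			res[i+j] = lo
			carry = hi
		}
		res[i+ny] += carry // top word of this row is untouched so far
	}
	return res
}

func main() {
	x, y := Uint{3, 0, 0, 0}, Uint{7, 0, 0, 0}
	fmt.Println(mulLenAwareSketch(&x, &y)[0]) // prints "21"
}
```

For single-word operands (`nx == ny == 1`) this degenerates to one `bits.Mul64` and two `bits.Add64` calls, which is where the call-count reductions in the tables above come from.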
@notJoon Can you address this 1.

@dongwon8247 Fixed the exposure of global pointers. The

@notJoon can you add this case and see if it passes?

@dongwon8247 Testing confirmed that it runs without issues on the modified
Description
This PR optimizes the `uint256` package by restructuring the `Uint` type and improving memory allocation patterns. The changes reduce allocator pressure and gas usage for contracts using these utilities.

The original `Uint` implementation used a struct-wrapped array with pointer-based helpers, leading to unnecessary heap allocations and indirections in hot paths like decimal parsing and string conversion. These allocations directly translate to increased gas costs when running within contracts.

Since the `int256` package uses the `uint256` package internally, performance improvements are also reflected in `int256` operations.

Changes
- **Uint Storage**: Flattened `Uint` from a struct-wrapped array to a flat `[4]uint64` type; reworked globals (`multipliers`, `Zero`, `One`, arithmetic helpers) to use stack-allocated temporaries instead of heap pointers
- **fromDecimal Stack Reuse**: Replaced `new(Uint)` allocations with stack-scoped temporaries; converted the `multipliers` table to value entries to avoid pointer chasing
- **Dec/ToString Buffer Reuse**

Multiplication Path Optimization
The multiplication routines (`umul`, `umulHop`, `umulStep`) were identified as significant contributors to allocation overhead. The following optimizations were applied:

**Background**: `umul` implements full 256×256→512-bit multiplication, calling `umulHop` and `umulStep` for 64-bit word accumulation. Nearly all high-bit operations (`Mul`, `MulOverflow`, `MulMod`, `fullmath.gno`) depend on `umul` results, so changes required careful consideration of side effects and edge cases.

**Bottlenecks Identified**:
- `bits.Add64` calls: `umulStep` separately adds `z` and `carry` in two stages, recalculating the carry at each stage. Four `bits.Add64` calls cause memory access and register pressure.
- `umulHop` and `umulStep` are called frequently; without inlining, call costs are significant. The Gno compiler does not inline as aggressively as Go.

**Optimizations Applied**:
- `umulStep`/`umulHop` consolidation: Merged the two functions into a single internal helper, reducing call count and `Add64` invocations. Restructured `umulStep` to add `z` and `carry` to the `bits.Mul64` result in one pass (e.g., `lo, carry := bits.Add64(lo, z, carry)`), reducing `Add64` calls from 4 to 2.
- Length check at `umul` entry to detect when the upper 64/128 bits are zero, reducing to 64×64 or 128×128 routines. When `topX + topY <= 1`, `bits.Mul64` suffices and `MulOverflow` is always false.

**Verification Points**: All carry propagation paths (especially upper-word accumulation) and `MulOverflow`'s overflow detection (`p[4:]` must be zero) were regression tested. Edge cases for zero inputs and single-word inputs ensure `res[4:]` values match expected behavior.

Benchmark Results
All measurements use 200 iterations per function, capturing allocations via `runtime.MemStats`. The target functions were initially selected based on profiling results showing significant average CPU cycle consumption or memory allocation.

Note: These figures were measured through the `runtime_metrics_test.gno` test for each package.

uint256
Decimal Parsing Functions: `fromDecimal`, `SetFromDecimal`, `MustFromDecimal`, `FromDecimal`

String Conversion Functions: `Dec`, `ToString`

Multiplication Functions: `umul`, `MulOverflow`, `umulStep`, `umulHop`

Other Functions: `udivrem`, `SetUint64`, `NewUint`

Note: `NewUint` shows a small regression due to the updated initialization pattern. All other helpers are flat or improved.

int256

`FromDecimal`, `SetString`, `MustFromDecimal`, `MaxInt256`, `MinInt256`, `ToString`

common/tick_math

Benchmarks were run on the `common/tick_math` functions, which heavily utilize `uint256` arithmetic operations: `GetSqrtRatioAtTick`, `GetTickAtSqrtRatio`