
Conversation

@notJoon (Member) commented Nov 25, 2025

Description

This PR optimizes the uint256 package by restructuring the Uint type and improving memory allocation patterns. The changes reduce allocator pressure and gas usage for contracts using these utilities.

The original Uint implementation used a struct-wrapped array with pointer-based helpers, leading to unnecessary heap allocations and indirections in hot paths like decimal parsing and string conversion. These allocations directly translate to increased gas costs when running within contracts.

Since the int256 package uses the uint256 package internally, performance improvements are also reflected in int256 operations.

Changes

  1. Flattened Uint Storage

    • Changed Uint from struct-wrapped array to a flat [4]uint64 type
    • Updated helpers (multipliers, Zero, One, arithmetic helpers) to use stack-allocated temporaries instead of heap pointers
  2. fromDecimal Stack Reuse

    • Replaced repeated new(Uint) allocations with stack-scoped temporaries
    • Converted the multipliers table to value entries to avoid pointer chasing
      • After this change, a consistent per-call allocation reduction is achieved for all decimal parsing entry points
  3. DecToString Buffer Reuse

    • Replaced heap-backed buffers with fixed arrays
    • Tightened ASCII append logic
  4. Multiplication Path Optimization

    The multiplication routines (umul, umulHop, umulStep) were identified as significant contributors to allocation overhead. The following optimizations were applied:

    Background: umul implements full 256×256→512-bit multiplication, calling umulHop and umulStep for 64-bit word accumulation. Nearly all high-bit operations (Mul, MulOverflow, MulMod, fullmath.gno) depend on umul results, so changes required careful consideration of side effects and edge cases.

    Bottlenecks Identified:

    • Redundant bits.Add64 calls: umulStep adds z and carry in two separate stages, recomputing the carry at each stage; the resulting four bits.Add64 calls per step add memory traffic and register pressure.
    • Unnecessary path execution: The full 4×4 loop runs even when the upper words of the multiplication inputs are zero.
    • Function call overhead: umulHop and umulStep are called frequently, and without inlining the call costs are significant; the Gno compiler does not inline as aggressively as Go.

    Optimizations Applied:

    • umulStep/umulHop Consolidation: Merged the two functions into a single internal helper, reducing call count and Add64 invocations. Restructured umulStep to add z and carry to bits.Mul64 result in one pass (e.g., lo, carry := bits.Add64(lo, z, carry)), reducing Add64 calls from 4 to 2.
    • Early Termination for Zero Upper Words: Added checks at umul entry to detect when upper 64/128 bits are zero, reducing to 64×64 or 128×128 routines. When topX + topY <= 1, bits.Mul64 suffices and MulOverflow is always false.

    Verification Points: All carry propagation paths (especially upper-word accumulation) and MulOverflow's overflow detection (p[4:] must be zero) were regression tested. Edge cases for zero inputs and single-word inputs ensure res[4:] values match expected behavior.
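The early-termination idea from item 4 can be illustrated with a minimal sketch. Names such as `topWord` and `mulFastPath` are illustrative stand-ins (the commit's helper is called `highestNonZeroWord`), and only the single-word fast path is shown; the real code also handles 128-bit operands before falling back to the full loop.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Uint mirrors the flattened representation from the PR: four
// little-endian 64-bit words, least significant first.
type Uint [4]uint64

// topWord returns the index of the highest non-zero word, or -1 if x
// is zero. (Hypothetical name for the commit's highestNonZeroWord.)
func topWord(x *Uint) int {
	for i := 3; i >= 0; i-- {
		if x[i] != 0 {
			return i
		}
	}
	return -1
}

// mulFastPath sketches the cheapest early exit: when both operands fit
// in a single word, one bits.Mul64 yields the full product and overflow
// beyond 256 bits is impossible.
func mulFastPath(x, y *Uint) (res Uint, taken bool) {
	if topWord(x) <= 0 && topWord(y) <= 0 {
		res[1], res[0] = bits.Mul64(x[0], y[0])
		return res, true
	}
	return res, false // the full 4x4 loop would run here
}

func main() {
	x := Uint{0xFFFFFFFFFFFFFFFF, 0, 0, 0}
	y := Uint{2, 0, 0, 0}
	res, taken := mulFastPath(&x, &y)
	fmt.Println(taken, res[0], res[1]) // true 18446744073709551614 1
}
```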

Benchmark Results

All measurements use 200 iterations per function, capturing allocations via runtime.MemStats. The target functions were initially selected based on profiling results showing significant average CPU cycle consumption or memory allocation.

Note: These figures were measured through the runtime_metrics_test.gno test for each package.

uint256

Decimal Parsing Functions

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| fromDecimal | 54,075,384 | 33,720,184 | −20,355,200 | −37.6% |
| SetFromDecimal | 55,080,184 | 34,732,984 | −20,347,200 | −36.9% |
| MustFromDecimal | 55,542,584 | 35,105,784 | −20,436,800 | −36.8% |
| FromDecimal | 55,667,384 | 35,230,584 | −20,436,800 | −36.7% |

String Conversion Functions

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| Dec | 59,176,584 | 55,972,384 | −3,204,200 | −5.4% |
| ToString | 59,504,584 | 56,300,384 | −3,204,200 | −5.4% |

Multiplication Functions

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| umul | 13,160,784 | 4,555,984 | −8,604,800 | −65.4% |
| MulOverflow | 14,199,184 | 5,504,784 | −8,694,400 | −61.2% |
| umulStep | 1,272,784 | 991,184 | −281,600 | −22.1% |
| umulHop | 991,184 | 850,384 | −140,800 | −14.2% |

Other Functions

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| udivrem | 8,760,784 | 8,373,584 | −387,200 | −4.4% |
| SetUint64 | 599,184 | 599,184 | 0 | 0% |
| NewUint | 803,984 | 855,184 | +51,200 | +6.4% |

Note: NewUint shows a small regression due to the updated initialization pattern. All other helpers are flat or improved.

int256

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| FromDecimal | 57,912,184 | 37,475,384 | −20,436,800 | −35.3% |
| SetString | 57,731,384 | 37,294,584 | −20,436,800 | −35.4% |
| MustFromDecimal | 58,028,984 | 37,592,184 | −20,436,800 | −35.2% |
| MaxInt256 | 108,601,184 | 68,850,784 | −39,750,400 | −36.6% |
| MinInt256 | 108,621,384 | 68,870,984 | −39,750,400 | −36.6% |
| ToString | 58,844,984 | 55,640,784 | −3,204,200 | −5.4% |

common/tick_math

Benchmarks were also run on the common/tick_math functions, which make heavy use of uint256 arithmetic operations.

GetSqrtRatioAtTick

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| MinTick | 161,757,584 | 72,578,384 | −89,179,200 | −55.1% |
| MaxTick | 181,371,984 | 91,805,584 | −89,566,400 | −49.4% |
| ZeroTick | 17,381,584 | 15,957,584 | −1,424,000 | −8.2% |
| NegativeTick | 147,435,984 | 66,861,584 | −80,574,400 | −54.7% |
| PositiveTick | 167,050,384 | 86,088,784 | −80,961,600 | −48.5% |
| SmallNegative | 61,506,384 | 32,560,784 | −28,945,600 | −47.1% |
| SmallPositive | 81,019,984 | 51,687,184 | −29,332,800 | −36.2% |

GetTickAtSqrtRatio

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| MinSqrtPrice | 691,342,984 | 417,920,584 | −273,422,400 | −39.5% |
| MaxSqrtPrice | 683,298,784 | 412,148,384 | −271,150,400 | −39.7% |
| MidSqrtPrice | 379,112,784 | 229,474,384 | −149,638,400 | −39.5% |
| LowSqrtPrice | 660,205,384 | 421,501,384 | −238,704,000 | −36.2% |
| HighSqrtPrice | 618,535,584 | 382,028,384 | −236,507,200 | −38.2% |
| VeryLowPrice | 727,130,184 | 445,581,384 | −281,548,800 | −38.7% |
| VeryHighPrice | 666,407,584 | 392,388,384 | −274,019,200 | −41.1% |

The existing `Uint` type has an array wrapped in a struct, requiring field access every time (e.g., `z.arr[0]`). While the Go compiler optimizes this, Gno has no separate optimization process, so every time the struct is copied, at least a 32-byte array plus field metadata must be moved together. Therefore, we changed the type definition to reduce these operations.
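A minimal before/after sketch of the type change (simplified, illustrative names rather than the actual source):

```go
package main

import "fmt"

// Before (simplified): the array sat behind a struct field, so every
// word access went through z.arr[i], and copying a value dragged the
// struct wrapper along with the 32-byte array.
type uintOld struct{ arr [4]uint64 }

// After: Uint is the array itself. z[i] indexes directly and a copy is
// a plain 32-byte move with no field indirection.
type Uint [4]uint64

func main() {
	old := uintOld{arr: [4]uint64{1, 0, 0, 0}}
	flat := Uint{1, 0, 0, 0}
	// Same stored value; the flat form just drops one level of access.
	fmt.Println(old.arr[0] == flat[0]) // true
}
```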

Unit tests were configured using the runtime package to collect metrics for each function. runtime.GC is called between runs so that results from previous calls do not affect subsequent measurements, and function execution is controlled to avoid variable reuse.

Test target functions were selected based on profiling results. Among the detected functions in the uint256 package, those with the highest memory usage or CPU cycles in average flat values were prioritized.

Due to the current implementation of Gno's `runtime` package, there is a limitation where only allocated memory can be checked through the MemStats function, so additional instrumentation information cannot be obtained. However, memory usage can still be sufficiently verified.
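The measurement loop described above can be sketched in Go as follows. Names are illustrative (the PR's harness is `runtime_metrics_test.gno`), and Gno's runtime exposes a narrower MemStats surface than Go's; the `sink` variable is an assumption added here so the sample allocation is not optimized onto the stack.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps allocations observable: assigning to a package-level
// variable forces the buffer onto the heap.
var sink []byte

// measureAllocs forces a GC so earlier work cannot leak into the
// sample, then diffs TotalAlloc across n iterations of fn.
func measureAllocs(n int, fn func()) uint64 {
	runtime.GC()
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	for i := 0; i < n; i++ {
		fn()
	}
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}

func main() {
	allocated := measureAllocs(200, func() { sink = make([]byte, 64) })
	fmt.Println(allocated >= 200*64) // at least 64 B per iteration
}
```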

Inside fromDecimal, the implementation no longer creates base := NewUint(num), mulResult := new(Uint), and addResult := new(Uint) on every pass; it reuses stack-scoped temporary variables instead, eliminating those heap allocations.
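A toy version of that reuse pattern, assuming a simplified `mul10add` helper (the real fromDecimal consumes digits in larger power-of-ten chunks via the multipliers table, and rejects overflow rather than silently truncating):

```go
package main

import (
	"fmt"
	"math/bits"
)

type Uint [4]uint64

// mul10add multiplies z by 10 and adds digit d in place, propagating
// the carry across all four words. Overflow past 256 bits is dropped
// in this sketch.
func (z *Uint) mul10add(d uint64) {
	carry := d
	for i := 0; i < 4; i++ {
		hi, lo := bits.Mul64(z[i], 10)
		lo, c := bits.Add64(lo, carry, 0)
		z[i] = lo
		carry = hi + c
	}
}

// parseDecimal is a toy fromDecimal: one stack-scoped Uint reused for
// every digit, with no per-digit heap allocation.
func parseDecimal(s string) Uint {
	var z Uint // stack temporary, reused across the whole parse
	for _, ch := range s {
		z.mul10add(uint64(ch - '0'))
	}
	return z
}

func main() {
	z := parseDecimal("18446744073709551616") // 2^64
	fmt.Println(z[0], z[1])                   // 0 1
}
```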
After the fix, metric collection tests show that decimal-related functions can save approximately 380KB of additional memory over 200 iterations.
The improved figures can be seen in the table below.

| Function | Original | Final | Total Improvement | Improvement Rate |
|----------|----------|-------|-------------------|------------------|
| SetFromDecimal | 55,080,184 | 54,531,384 | -548,800 | -1.0% |
| fromDecimal | 54,075,384 | 53,526,584 | -548,800 | -1.0% |
| MustFromDecimal | 55,542,584 | 54,904,184 | -638,400 | -1.1% |
| FromDecimal | 55,667,384 | 55,028,984 | -638,400 | -1.1% |

Modified the implementation of the `Dec` function to reuse temporary variables, similar to the approach applied to `fromDecimal`.

After the fix, the `Dec` function saves 1MB of memory over 200 iterations. Additionally, since the `ToString` function internally calls the `Dec` function, the same improvement can be observed.

| Function | Before Optimization | After Optimization  | Improvement | Improvement Rate |
|----------|------------|------------|------------|-------|
| Dec      | 56,978,184 | 55,972,384 | -1,005,800 | -1.8% |
| ToString | 57,306,184 | 56,300,384 | -1,005,800 | -1.8% |

Key changes are as follows:
- Replaced []byte("000...") with a fixed [98]byte array
- Replaced make([]byte, 0, 19) with a fixed [19]byte array
- Removed new(Uint) in favor of value copies

Since copying 32 bytes is cheaper than keeping the value resident on the heap, frequently used values are now defined as values and copied when needed.
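The buffer change can be sketched as follows. For brevity this formats a uint64 rather than a full 256-bit value (a 256-bit number needs up to 78 decimal digits, hence the larger scratch array in the real code); the returned string still allocates once, but the growing heap-backed slice is gone.

```go
package main

import "fmt"

// decString formats v using a fixed on-stack scratch array instead of
// make([]byte, ...), appending ASCII digits from the end of the buffer.
func decString(v uint64) string {
	var buf [20]byte // max uint64 is 20 digits; no heap-backed slice
	i := len(buf)
	for {
		i--
		buf[i] = byte('0' + v%10)
		v /= 10
		if v == 0 {
			break
		}
	}
	return string(buf[i:]) // single allocation for the result
}

func main() {
	fmt.Println(decString(0), decString(123456789)) // 0 123456789
}
```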
@notJoon notJoon marked this pull request as ready for review November 25, 2025 07:28
@notJoon notJoon changed the title fix(uint256): Change type definition of uint256.Uint type. perf(p/uint256): reduce allocations via flat type and buffer reuse Nov 25, 2025
Restructured `umulStep` to add `z` and `carry` to the `bits.Mul64` result
in a single operation. By using the pattern `lo, carry := bits.Add64(lo, z, carry)`,
the number of `bits.Add64` calls was reduced by half. The same pattern was also
applied to the `umulHop` function.
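The halved-Add64 step can be sketched as below. Note one detail: in standard Go, `bits.Add64`'s carry input must be 0 or 1, so a full-word `carry` cannot be passed as the carry-in; this sketch instead folds each carry bit into `hi` with a plain add, which cannot overflow because x*y + z + carry always fits in 128 bits. The Gno code may differ in detail.

```go
package main

import (
	"fmt"
	"math/bits"
)

// umulStep computes hi:lo = x*y + z + carry with two bits.Add64 calls
// instead of four. Since (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1, the sum
// always fits in 128 bits, so hi += c can never wrap.
func umulStep(z, x, y, carry uint64) (hi, lo uint64) {
	hi, lo = bits.Mul64(x, y)
	lo, c := bits.Add64(lo, carry, 0)
	hi += c // safe: total < 2^128
	lo, c = bits.Add64(lo, z, 0)
	hi += c
	return hi, lo
}

func main() {
	// Extreme case: max*max + max + max = 2^128 - 1.
	m := ^uint64(0)
	hi, lo := umulStep(m, m, m, m)
	fmt.Printf("%#x %#x\n", hi, lo) // 0xffffffffffffffff 0xffffffffffffffff
}
```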

## Memory Profiling Results

### Overall Memory Usage

| Version | Total Memory | Reduction |
|---------|--------------|-----------|
| Before optimization | 404,934,090 B | - |
| After this commit | 363,401,569 B | -10.3% |

### math/bits.Add64 Comparison

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Memory | 81,993,472 B | 47,627,008 B | -41.9% |
| Calls | 116,468 | 67,652 | -41.9% |
| Rank | #1 | #2 | - |

### Analysis

The optimization reduced `bits.Add64` calls by 41.9%, resulting in a 34.4 MB
memory reduction. This single change accounts for nearly all of the 10.3%
total memory improvement. The `bits.Add64` function is no longer the top
memory consumer, having been surpassed by `strconv.ParseUint`.
Optimize `umul` by detecting the
highest non-zero word in each operand and using a length-aware
nested-loop multiply.

Changes:
- Add `highestNonZeroWord` helper to short-circuit zero inputs and
  trim iteration range
- Skip unnecessary partial-product work when operands have leading
  zero words
- Replace internal accumulation with localized `bits.Mul64 + bits.Add64`
  chaining for precise carry propagation

The existing `umulStep`/`umulHop` functions continue to handle carries
correctly and remain in use elsewhere. This redesign preserves all
edge cases (zero operands, partial-word products, full 256-bit
operands) while reducing operations for small values.
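Under those constraints, a length-aware schoolbook multiply might look like the sketch below. This is illustrative: the real `umul` writes into a fixed result parameter and its helper signatures may differ, and `highestNonZeroWord` here returns a word count rather than an index.

```go
package main

import (
	"fmt"
	"math/bits"
)

type Uint [4]uint64

// highestNonZeroWord returns the operand's length in 64-bit words
// (one past the highest non-zero word), or 0 for a zero value.
func highestNonZeroWord(x *Uint) int {
	for i := 3; i >= 0; i-- {
		if x[i] != 0 {
			return i + 1
		}
	}
	return 0
}

// umul multiplies x and y into a 512-bit result, iterating only over
// populated words and chaining bits.Mul64/bits.Add64 for carries.
func umul(x, y *Uint) [8]uint64 {
	var res [8]uint64
	lx, ly := highestNonZeroWord(x), highestNonZeroWord(y)
	for i := 0; i < lx; i++ {
		xi := x[i]
		if xi == 0 {
			continue // sparse middle words contribute nothing
		}
		var carry uint64
		for j := 0; j < ly; j++ {
			hi, lo := bits.Mul64(xi, y[j])
			lo, c := bits.Add64(lo, carry, 0)
			hi += c // safe: xi*y[j] + carry + res[i+j] < 2^128
			lo, c = bits.Add64(lo, res[i+j], 0)
			hi += c
			res[i+j] = lo
			carry = hi
		}
		res[i+ly] = carry // res[i+ly] is untouched by earlier rows
	}
	return res
}

func main() {
	m := Uint{^uint64(0), 0, 0, 0}
	res := umul(&m, &m) // (2^64-1)^2 = 2^128 - 2^65 + 1
	fmt.Println(res[0], res[1]) // 1 18446744073709551614
}
```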

## Memory Profiling Results

### Overall Memory Usage

| Version | Total Memory | vs Previous | vs Baseline |
|---------|--------------|-------------|-------------|
| Baseline | 404,934,090 B | - | - |
| After Add64 opt | 363,401,569 B | -10.3% | -10.3% |
| After this commit | 300,436,304 B | -17.3% | -25.8% |

### Key Function Comparison (Before → After this commit)

| Function | Memory Before | Memory After | Change |
|----------|---------------|--------------|--------|
| math/bits.Add64 | 47,627,008 B | 19,338,880 B | -59.4% |
| math/bits.Mul64 | 39,673,856 B | 10,871,808 B | -72.6% |
| umul | 5,239,472 B | 9,997,896 B | +90.8% |

| Function | Calls Before | Calls After | Change |
|----------|--------------|-------------|--------|
| math/bits.Add64 | 67,652 | 27,470 | -59.4% |
| math/bits.Mul64 | 38,744 | 10,617 | -72.6% |
| umul | 8,036 | 16,872 | +110.0% |
### Analysis

The length-aware multiplication reduces `bits.Mul64` calls by 72.6% and
`bits.Add64` calls by 59.4% by skipping unnecessary 256-bit operations
when operands have leading zero words. The new `highestNonZeroWord`
helper adds 11.8 MB overhead but enables 62.9 MB savings in arithmetic
primitives, yielding a net reduction of ~63 MB (17.3%) from the previous
optimization pass.

The `umulStep` and `umulHop` functions no longer appear in the top 15
memory consumers, indicating the optimization successfully bypasses
full-width multiplication for smaller operands.
@notJoon notJoon self-assigned this Nov 26, 2025
@notJoon notJoon added refactoring performance enhacing the performance labels Nov 26, 2025
@notJoon notJoon changed the title perf(p/uint256): reduce allocations via flat type and buffer reuse perf(p/uint256): optimize allocation in multiplication, decimal parsing, and string ops Nov 26, 2025
@dongwon8247 (Member) commented Nov 26, 2025

@notJoon Can you address this

1. gt() returns global pointer

File: fullmath.gno:96-101

Returning &one or &zero allows callers to mutate global state, which can corrupt calculations across the entire package.

```go
// Problem
return &one
return &zero

// Fix
return One()
return Zero()
```

2. MulDivRoundingUp() uses global pointer

File: fullmath.gno:82

Passing &one directly exposes the global variable to potential modification.

```go
// Problem
return new(Uint).Add(result, &one)

// Fix
return new(Uint).Add(result, One())
```

3. DivRoundingUp() uses global pointer

File: fullmath.gno:93-94

Same issue - &zero exposes global state.

```go
// Problem
z := new(Uint).Add(div, gt(mod, &zero))

// Fix
z := new(Uint).Add(div, gt(mod, Zero()))
```

4. umul edge cases need testing

File: arithmetic.gno

The new optimized umul implementation skips zero words for performance. Edge cases should be explicitly tested to ensure correctness.

```go
func TestUmulEdgeCases(t *testing.T) {
	cases := []struct{ x, y string }{
		{MAX_UINT256, MAX_UINT256},
		{"18446744073709551615", "18446744073709551615"},
		{MAX_UINT256, "1"},
	}
	for _, tc := range cases {
		x, y := MustFromDecimal(tc.x), MustFromDecimal(tc.y)
		if umul(x, y) != umulBaseline(x, y) {
			t.Errorf("mismatch: x=%s, y=%s", tc.x, tc.y)
		}
	}
}
```

Please add benchmarks using the txtars we made for perf testing, comparing against the raw values we already have, so we know exactly how much this improves things. We might as well create a META issue to track all benchmarks.

@notJoon (Member, Author) commented Nov 26, 2025

@dongwon8247 Fixed the exposure of global pointers. The umul tests were not added as the same cases are already covered by TestUmulMatchesBaseline.


@notJoon notJoon merged commit 9dbd892 into main Nov 27, 2025
86 of 87 checks passed
@notJoon notJoon deleted the optimize-uint256 branch November 27, 2025 09:22
@dongwon8247 (Member) commented
@notJoon can you add this case and see if it passes?

```go
func TestUmulSparsePatterns(t *testing.T) {
    cases := []struct {
        x Uint
        y Uint
    }{
        // lenX=1, lenY=1
        {Uint{0xFFFFFFFFFFFFFFFF, 0, 0, 0}, Uint{0xFFFFFFFFFFFFFFFF, 0, 0, 0}},
        // lenX=4, lenY=4 but lower words zero
        {Uint{0, 0, 0, 0xFFFFFFFFFFFFFFFF}, Uint{0, 0, 0, 0xFFFFFFFFFFFFFFFF}},
        // lenX=1, lenY=4
        {Uint{0xFFFFFFFFFFFFFFFF, 0, 0, 0}, Uint{1, 2, 3, 4}},
        // zero middle words (xi == 0 continue)
        {Uint{0xFFFFFFFFFFFFFFFF, 0, 0xFFFFFFFFFFFFFFFF, 0}, Uint{1, 1, 1, 1}},
    }

    for i, tc := range cases {
        got := umul(&tc.x, &tc.y)
        want := umulBaseline(&tc.x, &tc.y)
        if got != want {
            t.Fatalf("sparse test %d failed", i)
        }
    }
}
```

@notJoon (Member, Author) commented Nov 27, 2025

@dongwon8247 Testing confirmed that it runs without issues on the modified uint256 package. I've reverted the uint256 PR for now as you requested (#1012), but if there are no problems after the second review, I'll add the tests I wrote.

[Screenshot: passing test run, 2025-11-27]
