
Conversation

@notJoon (Member) commented Nov 25, 2025

Description

This PR optimizes the uint256 package by restructuring the Uint type and improving memory allocation patterns. The changes reduce allocator pressure and gas usage for contracts using these utilities.

The original Uint implementation used a struct-wrapped array with pointer-based helpers, leading to unnecessary heap allocations and indirections in hot paths like decimal parsing and string conversion. These allocations directly translate to increased gas costs when running within contracts.

Since the int256 package uses the uint256 package internally, performance improvements are also reflected in int256 operations.

Changes

  1. Flattened Uint Storage

    • Changed Uint from struct-wrapped array to a flat [4]uint64 type
    • Updated helpers (multipliers, Zero, One, arithmetic helpers) to use stack-allocated temporaries instead of heap pointers
  2. fromDecimal Stack Reuse

    • Replaced repeated new(Uint) allocations with stack-scoped temporaries
    • Converted the multipliers table to value entries to avoid pointer chasing
      • After this change, a consistent per-call allocation reduction is achieved for all decimal parsing entry points
  3. DecToString Buffer Reuse

    • Replaced heap-backed buffers with fixed arrays
    • Tightened ASCII append logic
  4. Multiplication Path Optimization

    The multiplication routines (umul, umulHop, umulStep) were identified as significant contributors to allocation overhead. The following optimizations were applied:

    Background: umul implements full 256×256→512-bit multiplication, calling umulHop and umulStep for 64-bit word accumulation. Nearly all high-bit operations (Mul, MulOverflow, MulMod, fullmath.gno) depend on umul results, so changes required careful consideration of side effects and edge cases.

    Bottlenecks Identified:

    • Redundant bits.Add64 calls: umulStep adds z and carry in two separate stages, recomputing the carry at each stage; the resulting four bits.Add64 calls per step add memory traffic and register pressure.
    • Unnecessary path execution: The full 4×4 loop runs even when the upper words of the multiplication inputs are zero.
    • Function call overhead: umulHop and umulStep are called frequently, and without inlining the call costs are significant; the Gno compiler does not inline as aggressively as Go.

    Optimizations Applied:

    • umulStep/umulHop Consolidation: Merged the two functions into a single internal helper, reducing call count and Add64 invocations. Restructured umulStep to add z and carry to bits.Mul64 result in one pass (e.g., lo, carry := bits.Add64(lo, z, carry)), reducing Add64 calls from 4 to 2.
    • Early Termination for Zero Upper Words: Added checks at umul entry to detect when upper 64/128 bits are zero, reducing to 64×64 or 128×128 routines. When topX + topY <= 1, bits.Mul64 suffices and MulOverflow is always false.

    Verification Points: All carry propagation paths (especially upper-word accumulation) and MulOverflow's overflow detection (p[4:] must be zero) were regression tested. Edge cases for zero inputs and single-word inputs ensure res[4:] values match expected behavior.
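The early-termination idea from item 4 can be illustrated with a minimal sketch. Names such as `topWord` and `mulFastPath` are illustrative stand-ins (the commit's helper is called `highestNonZeroWord`), and only the single-word fast path is shown; the real code also handles 128-bit operands before falling back to the full loop.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Uint mirrors the flattened representation from the PR: four
// little-endian 64-bit words, least significant first.
type Uint [4]uint64

// topWord returns the index of the highest non-zero word, or -1 if x
// is zero. (Hypothetical name for the commit's highestNonZeroWord.)
func topWord(x *Uint) int {
	for i := 3; i >= 0; i-- {
		if x[i] != 0 {
			return i
		}
	}
	return -1
}

// mulFastPath sketches the cheapest early exit: when both operands fit
// in a single word, one bits.Mul64 yields the full product and overflow
// beyond 256 bits is impossible.
func mulFastPath(x, y *Uint) (res Uint, taken bool) {
	if topWord(x) <= 0 && topWord(y) <= 0 {
		res[1], res[0] = bits.Mul64(x[0], y[0])
		return res, true
	}
	return res, false // the full 4x4 loop would run here
}

func main() {
	x := Uint{0xFFFFFFFFFFFFFFFF, 0, 0, 0}
	y := Uint{2, 0, 0, 0}
	res, taken := mulFastPath(&x, &y)
	fmt.Println(taken, res[0], res[1]) // true 18446744073709551614 1
}
```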

Benchmark Results

All measurements use 200 iterations per function, capturing allocations via runtime.MemStats. The target functions were initially selected based on profiling results showing significant average CPU cycle consumption or memory allocation.

Note: These figures were measured through the runtime_metrics_test.gno test for each package.

uint256

Decimal Parsing Functions

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| fromDecimal | 54,075,384 | 33,720,184 | −20,355,200 | −37.6% |
| SetFromDecimal | 55,080,184 | 34,732,984 | −20,347,200 | −36.9% |
| MustFromDecimal | 55,542,584 | 35,105,784 | −20,436,800 | −36.8% |
| FromDecimal | 55,667,384 | 35,230,584 | −20,436,800 | −36.7% |

String Conversion Functions

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| Dec | 59,176,584 | 55,972,384 | −3,204,200 | −5.4% |
| ToString | 59,504,584 | 56,300,384 | −3,204,200 | −5.4% |

Multiplication Functions

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| umul | 13,160,784 | 4,555,984 | −8,604,800 | −65.4% |
| MulOverflow | 14,199,184 | 5,504,784 | −8,694,400 | −61.2% |
| umulStep | 1,272,784 | 991,184 | −281,600 | −22.1% |
| umulHop | 991,184 | 850,384 | −140,800 | −14.2% |

Other Functions

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| udivrem | 8,760,784 | 8,373,584 | −387,200 | −4.4% |
| SetUint64 | 599,184 | 599,184 | 0 | 0% |
| NewUint | 803,984 | 855,184 | +51,200 | +6.4% |

Note: NewUint shows a small regression due to the updated initialization pattern. All other helpers are flat or improved.

int256

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| FromDecimal | 57,912,184 | 37,475,384 | −20,436,800 | −35.3% |
| SetString | 57,731,384 | 37,294,584 | −20,436,800 | −35.4% |
| MustFromDecimal | 58,028,984 | 37,592,184 | −20,436,800 | −35.2% |
| MaxInt256 | 108,601,184 | 68,850,784 | −39,750,400 | −36.6% |
| MinInt256 | 108,621,384 | 68,870,984 | −39,750,400 | −36.6% |
| ToString | 58,844,984 | 55,640,784 | −3,204,200 | −5.4% |

common/tick_math

Benchmarks were also run on the common/tick_math functions, which make heavy use of uint256 arithmetic operations.

GetSqrtRatioAtTick

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| MinTick | 161,757,584 | 72,578,384 | −89,179,200 | −55.1% |
| MaxTick | 181,371,984 | 91,805,584 | −89,566,400 | −49.4% |
| ZeroTick | 17,381,584 | 15,957,584 | −1,424,000 | −8.2% |
| NegativeTick | 147,435,984 | 66,861,584 | −80,574,400 | −54.7% |
| PositiveTick | 167,050,384 | 86,088,784 | −80,961,600 | −48.5% |
| SmallNegative | 61,506,384 | 32,560,784 | −28,945,600 | −47.1% |
| SmallPositive | 81,019,984 | 51,687,184 | −29,332,800 | −36.2% |

GetTickAtSqrtRatio

| Function | Before | After | Δ Bytes | Improvement |
|----------|--------|-------|---------|-------------|
| MinSqrtPrice | 691,342,984 | 417,920,584 | −273,422,400 | −39.5% |
| MaxSqrtPrice | 683,298,784 | 412,148,384 | −271,150,400 | −39.7% |
| MidSqrtPrice | 379,112,784 | 229,474,384 | −149,638,400 | −39.5% |
| LowSqrtPrice | 660,205,384 | 421,501,384 | −238,704,000 | −36.2% |
| HighSqrtPrice | 618,535,584 | 382,028,384 | −236,507,200 | −38.2% |
| VeryLowPrice | 727,130,184 | 445,581,384 | −281,548,800 | −38.7% |
| VeryHighPrice | 666,407,584 | 392,388,384 | −274,019,200 | −41.1% |

The existing `Uint` type has an array wrapped in a struct, requiring field access every time (e.g., `z.arr[0]`). While the Go compiler optimizes this, Gno has no separate optimization process, so every time the struct is copied, at least a 32-byte array plus field metadata must be moved together. Therefore, we changed the type definition to reduce these operations.
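A minimal before/after sketch of the type change (simplified, illustrative names rather than the actual source):

```go
package main

import "fmt"

// Before (simplified): the array sat behind a struct field, so every
// word access went through z.arr[i], and copying a value dragged the
// struct wrapper along with the 32-byte array.
type uintOld struct{ arr [4]uint64 }

// After: Uint is the array itself. z[i] indexes directly and a copy is
// a plain 32-byte move with no field indirection.
type Uint [4]uint64

func main() {
	old := uintOld{arr: [4]uint64{1, 0, 0, 0}}
	flat := Uint{1, 0, 0, 0}
	// Same stored value; the flat form just drops one level of access.
	fmt.Println(old.arr[0] == flat[0]) // true
}
```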

Unit tests were configured using the runtime package to collect metrics for each function. runtime.GC is called between runs so that results from previous calls do not affect subsequent measurements, and function execution is controlled to avoid variable reuse.

Test target functions were selected based on profiling results. Among the detected functions in the uint256 package, those with the highest memory usage or CPU cycles in average flat values were prioritized.

Due to the current implementation of Gno's `runtime` package, there is a limitation where only allocated memory can be checked through the MemStats function, so additional instrumentation information cannot be obtained. However, memory usage can still be sufficiently verified.
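The measurement loop described above can be sketched in Go as follows. Names are illustrative (the PR's harness is `runtime_metrics_test.gno`), and Gno's runtime exposes a narrower MemStats surface than Go's; the `sink` variable is an assumption added here so the sample allocation is not optimized onto the stack.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps allocations observable: assigning to a package-level
// variable forces the buffer onto the heap.
var sink []byte

// measureAllocs forces a GC so earlier work cannot leak into the
// sample, then diffs TotalAlloc across n iterations of fn.
func measureAllocs(n int, fn func()) uint64 {
	runtime.GC()
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	for i := 0; i < n; i++ {
		fn()
	}
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}

func main() {
	allocated := measureAllocs(200, func() { sink = make([]byte, 64) })
	fmt.Println(allocated >= 200*64) // at least 64 B per iteration
}
```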

Inside fromDecimal, the implementation no longer creates base := NewUint(num), mulResult := new(Uint), and addResult := new(Uint) on every pass; it reuses stack-scoped temporary variables instead, eliminating those heap allocations.
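A toy version of that reuse pattern, assuming a simplified `mul10add` helper (the real fromDecimal consumes digits in larger power-of-ten chunks via the multipliers table, and rejects overflow rather than silently truncating):

```go
package main

import (
	"fmt"
	"math/bits"
)

type Uint [4]uint64

// mul10add multiplies z by 10 and adds digit d in place, propagating
// the carry across all four words. Overflow past 256 bits is dropped
// in this sketch.
func (z *Uint) mul10add(d uint64) {
	carry := d
	for i := 0; i < 4; i++ {
		hi, lo := bits.Mul64(z[i], 10)
		lo, c := bits.Add64(lo, carry, 0)
		z[i] = lo
		carry = hi + c
	}
}

// parseDecimal is a toy fromDecimal: one stack-scoped Uint reused for
// every digit, with no per-digit heap allocation.
func parseDecimal(s string) Uint {
	var z Uint // stack temporary, reused across the whole parse
	for _, ch := range s {
		z.mul10add(uint64(ch - '0'))
	}
	return z
}

func main() {
	z := parseDecimal("18446744073709551616") // 2^64
	fmt.Println(z[0], z[1])                   // 0 1
}
```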
After the fix, metric collection tests show that decimal-related functions can save approximately 380KB of additional memory over 200 iterations.
The improved figures can be seen in the table below.

| Function | Original | Final | Total Improvement | Improvement Rate |
|----------|----------|-------|-------------------|------------------|
| SetFromDecimal | 55,080,184 | 54,531,384 | -548,800 | -1.0% |
| fromDecimal | 54,075,384 | 53,526,584 | -548,800 | -1.0% |
| MustFromDecimal | 55,542,584 | 54,904,184 | -638,400 | -1.1% |
| FromDecimal | 55,667,384 | 55,028,984 | -638,400 | -1.1% |

Modified the implementation of the `Dec` function to reuse temporary variables, similar to the approach applied to `fromDecimal`.

After the fix, the `Dec` function saves 1MB of memory over 200 iterations. Additionally, since the `ToString` function internally calls the `Dec` function, the same improvement can be observed.

| Function | Before Optimization | After Optimization  | Improvement | Improvement Rate |
|----------|------------|------------|------------|-------|
| Dec      | 56,978,184 | 55,972,384 | -1,005,800 | -1.8% |
| ToString | 57,306,184 | 56,300,384 | -1,005,800 | -1.8% |

Key changes are as follows:
- Replaced []byte("000...") with a fixed [98]byte array
- Replaced make([]byte, 0, 19) with a fixed [19]byte array
- Removed new(Uint) in favor of value copies

Since copying 32 bytes is cheaper than keeping the value resident on the heap, frequently used values are now defined as values and copied when needed.
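The buffer change can be sketched as follows. For brevity this formats a uint64 rather than a full 256-bit value (a 256-bit number needs up to 78 decimal digits, hence the larger scratch array in the real code); the returned string still allocates once, but the growing heap-backed slice is gone.

```go
package main

import "fmt"

// decString formats v using a fixed on-stack scratch array instead of
// make([]byte, ...), appending ASCII digits from the end of the buffer.
func decString(v uint64) string {
	var buf [20]byte // max uint64 is 20 digits; no heap-backed slice
	i := len(buf)
	for {
		i--
		buf[i] = byte('0' + v%10)
		v /= 10
		if v == 0 {
			break
		}
	}
	return string(buf[i:]) // single allocation for the result
}

func main() {
	fmt.Println(decString(0), decString(123456789)) // 0 123456789
}
```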
@notJoon notJoon marked this pull request as ready for review November 25, 2025 07:28
@notJoon notJoon changed the title fix(uint256): Change type definition of uint256.Uint type. perf(p/uint256): reduce allocations via flat type and buffer reuse Nov 25, 2025
Restructured `umulStep` to add `z` and `carry` to the `bits.Mul64` result
in a single operation. By using the pattern `lo, carry := bits.Add64(lo, z, carry)`,
the number of `bits.Add64` calls was reduced by half. The same pattern was also
applied to the `umulHop` function.
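The halved-Add64 step can be sketched as below. Note one detail: in standard Go, `bits.Add64`'s carry input must be 0 or 1, so a full-word `carry` cannot be passed as the carry-in; this sketch instead folds each carry bit into `hi` with a plain add, which cannot overflow because x*y + z + carry always fits in 128 bits. The Gno code may differ in detail.

```go
package main

import (
	"fmt"
	"math/bits"
)

// umulStep computes hi:lo = x*y + z + carry with two bits.Add64 calls
// instead of four. Since (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1, the sum
// always fits in 128 bits, so hi += c can never wrap.
func umulStep(z, x, y, carry uint64) (hi, lo uint64) {
	hi, lo = bits.Mul64(x, y)
	lo, c := bits.Add64(lo, carry, 0)
	hi += c // safe: total < 2^128
	lo, c = bits.Add64(lo, z, 0)
	hi += c
	return hi, lo
}

func main() {
	// Extreme case: max*max + max + max = 2^128 - 1.
	m := ^uint64(0)
	hi, lo := umulStep(m, m, m, m)
	fmt.Printf("%#x %#x\n", hi, lo) // 0xffffffffffffffff 0xffffffffffffffff
}
```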

## Memory Profiling Results

### Overall Memory Usage

| Version | Total Memory | Reduction |
|---------|--------------|-----------|
| Before optimization | 404,934,090 B | - |
| After this commit | 363,401,569 B | -10.3% |

### math/bits.Add64 Comparison

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Memory | 81,993,472 B | 47,627,008 B | -41.9% |
| Calls | 116,468 | 67,652 | -41.9% |
| Rank | #1 | #2 | - |

### Analysis

The optimization reduced `bits.Add64` calls by 41.9%, resulting in a 34.4 MB
memory reduction. This single change accounts for nearly all of the 10.3%
total memory improvement. The `bits.Add64` function is no longer the top
memory consumer, having been surpassed by `strconv.ParseUint`.
Optimize `umul` by detecting the
highest non-zero word in each operand and using a length-aware
nested-loop multiply.

Changes:
- Add `highestNonZeroWord` helper to short-circuit zero inputs and
  trim iteration range
- Skip unnecessary partial-product work when operands have leading
  zero words
- Replace internal accumulation with localized `bits.Mul64 + bits.Add64`
  chaining for precise carry propagation

The existing `umulStep`/`umulHop` functions continue to handle carries
correctly and remain in use elsewhere. This redesign preserves all
edge cases (zero operands, partial-word products, full 256-bit
operands) while reducing operations for small values.
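Under those constraints, a length-aware schoolbook multiply might look like the sketch below. This is illustrative: the real `umul` writes into a fixed result parameter and its helper signatures may differ, and `highestNonZeroWord` here returns a word count rather than an index.

```go
package main

import (
	"fmt"
	"math/bits"
)

type Uint [4]uint64

// highestNonZeroWord returns the operand's length in 64-bit words
// (one past the highest non-zero word), or 0 for a zero value.
func highestNonZeroWord(x *Uint) int {
	for i := 3; i >= 0; i-- {
		if x[i] != 0 {
			return i + 1
		}
	}
	return 0
}

// umul multiplies x and y into a 512-bit result, iterating only over
// populated words and chaining bits.Mul64/bits.Add64 for carries.
func umul(x, y *Uint) [8]uint64 {
	var res [8]uint64
	lx, ly := highestNonZeroWord(x), highestNonZeroWord(y)
	for i := 0; i < lx; i++ {
		xi := x[i]
		if xi == 0 {
			continue // sparse middle words contribute nothing
		}
		var carry uint64
		for j := 0; j < ly; j++ {
			hi, lo := bits.Mul64(xi, y[j])
			lo, c := bits.Add64(lo, carry, 0)
			hi += c // safe: xi*y[j] + carry + res[i+j] < 2^128
			lo, c = bits.Add64(lo, res[i+j], 0)
			hi += c
			res[i+j] = lo
			carry = hi
		}
		res[i+ly] = carry // res[i+ly] is untouched by earlier rows
	}
	return res
}

func main() {
	m := Uint{^uint64(0), 0, 0, 0}
	res := umul(&m, &m) // (2^64-1)^2 = 2^128 - 2^65 + 1
	fmt.Println(res[0], res[1]) // 1 18446744073709551614
}
```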

## Memory Profiling Results

### Overall Memory Usage

| Version | Total Memory | vs Previous | vs Baseline |
|---------|--------------|-------------|-------------|
| Baseline | 404,934,090 B | - | - |
| After Add64 opt | 363,401,569 B | -10.3% | -10.3% |
| After this commit | 300,436,304 B | -17.3% | -25.8% |

### Key Function Comparison (Before → After this commit)

| Function | Memory Before | Memory After | Change |
|----------|---------------|--------------|--------|
| math/bits.Add64 | 47,627,008 B | 19,338,880 B | -59.4% |
| math/bits.Mul64 | 39,673,856 B | 10,871,808 B | -72.6% |
| umul | 5,239,472 B | 9,997,896 B | +90.8% |

| Function | Calls Before | Calls After | Change |
|----------|--------------|-------------|--------|
| math/bits.Add64 | 67,652 | 27,470 | -59.4% |
| math/bits.Mul64 | 38,744 | 10,617 | -72.6% |
| umul | 8,036 | 16,872 | +110.0% |
### Analysis

The length-aware multiplication reduces `bits.Mul64` calls by 72.6% and
`bits.Add64` calls by 59.4% by skipping unnecessary 256-bit operations
when operands have leading zero words. The new `highestNonZeroWord`
helper adds 11.8 MB overhead but enables 62.9 MB savings in arithmetic
primitives, yielding a net reduction of ~63 MB (17.3%) from the previous
optimization pass.

The `umulStep` and `umulHop` functions no longer appear in the top 15
memory consumers, indicating the optimization successfully bypasses
full-width multiplication for smaller operands.
@notJoon notJoon self-assigned this Nov 26, 2025
@notJoon notJoon added refactoring performance enhacing the performance labels Nov 26, 2025
@notJoon notJoon changed the title perf(p/uint256): reduce allocations via flat type and buffer reuse perf(p/uint256): optimize allocation in multiplication, decimal parsing, and string ops Nov 26, 2025
@dongwon8247 (Member) commented Nov 26, 2025

@notJoon Can you address this

1. gt() returns global pointer

File: fullmath.gno:96-101

Returning &one or &zero allows callers to mutate global state, which can corrupt calculations across the entire package.

```go
// Problem
return &one
return &zero

// Fix
return One()
return Zero()
```

2. MulDivRoundingUp() uses global pointer

File: fullmath.gno:82

Passing &one directly exposes the global variable to potential modification.

```go
// Problem
return new(Uint).Add(result, &one)

// Fix
return new(Uint).Add(result, One())
```

3. DivRoundingUp() uses global pointer

File: fullmath.gno:93-94

Same issue - &zero exposes global state.

```go
// Problem
z := new(Uint).Add(div, gt(mod, &zero))

// Fix
z := new(Uint).Add(div, gt(mod, Zero()))
```

4. umul edge cases need testing

File: arithmetic.gno

The new optimized umul implementation skips zero words for performance. Edge cases should be explicitly tested to ensure correctness.

```go
func TestUmulEdgeCases(t *testing.T) {
	cases := []struct{ x, y string }{
		{MAX_UINT256, MAX_UINT256},
		{"18446744073709551615", "18446744073709551615"},
		{MAX_UINT256, "1"},
	}
	for _, tc := range cases {
		x, y := MustFromDecimal(tc.x), MustFromDecimal(tc.y)
		if umul(x, y) != umulBaseline(x, y) {
			t.Errorf("mismatch: x=%s, y=%s", tc.x, tc.y)
		}
	}
}
```

Please add benchmarks using the txtars we made for perf testing, comparing against the raw values we already have, so we know exactly how much this improves things. We might as well create a META issue to track all benchmarks.

@notJoon (Member, Author) commented Nov 26, 2025

@dongwon8247 Fixed the exposure of global pointers. The umul tests were not added as the same cases are already covered by TestUmulMatchesBaseline.


@notJoon notJoon merged commit 9dbd892 into main Nov 27, 2025
86 of 87 checks passed
@notJoon notJoon deleted the optimize-uint256 branch November 27, 2025 09:22
@dongwon8247 (Member) commented
@notJoon can you add this case and see if it passes?

```go
func TestUmulSparsePatterns(t *testing.T) {
    cases := []struct {
        x Uint
        y Uint
    }{
        // lenX=1, lenY=1
        {Uint{0xFFFFFFFFFFFFFFFF, 0, 0, 0}, Uint{0xFFFFFFFFFFFFFFFF, 0, 0, 0}},
        // lenX=4, lenY=4 but lower words zero
        {Uint{0, 0, 0, 0xFFFFFFFFFFFFFFFF}, Uint{0, 0, 0, 0xFFFFFFFFFFFFFFFF}},
        // lenX=1, lenY=4
        {Uint{0xFFFFFFFFFFFFFFFF, 0, 0, 0}, Uint{1, 2, 3, 4}},
        // zero middle words (xi == 0 continue)
        {Uint{0xFFFFFFFFFFFFFFFF, 0, 0xFFFFFFFFFFFFFFFF, 0}, Uint{1, 1, 1, 1}},
    }

    for i, tc := range cases {
        got := umul(&tc.x, &tc.y)
        want := umulBaseline(&tc.x, &tc.y)
        if got != want {
            t.Fatalf("sparse test %d failed", i)
        }
    }
}
```

@notJoon (Member, Author) commented Nov 27, 2025

@dongwon8247 Testing confirmed that it runs without issues on the modified uint256 package. I've reverted the uint256 PR for now as you requested (#1012), but if there are no problems after the second review, I'll add the tests I wrote.

[Screenshot: passing test run, 2025-11-27]
