execution/state: typed-vio refactor (L2 — typed AddressEntry + WriteCell[T] + sync.Pool)#21536
execution/state: typed-vio refactor (L2 — typed AddressEntry + WriteCell[T] + sync.Pool)#21536mh0lt wants to merge 19 commits into
Conversation
…ctor (L2 on top of #21516) Lifts the typed-vio refactor (AddressEntry struct with per-AccountPath typed fields, generic WriteCell[T], typed VersionedRead/Write[T], typed PerPath ReadSet/WriteSet on IBS, per-T sync.Pool) onto current main on top of the L1 profiling instrumentation merged as #21516. Replaces the untyped any-boxed WriteCell/ReadSet shape across: - execution/state/versionmap.go: VersionMap.s now map[Address]*AddressEntry with typed btree.Map[int, *WriteCell[T]] per AccountPath - execution/state/versionedio.go: typed VersionedRead[T] / VersionedWrite[T] primitives + per-T pools; AnyVersionedWrite interface for collections - execution/state/intra_block_state.go: typed PerPath{Read,Write}Set fields on IBS + recordVR/recordVW helpers - execution/state/state_object.go, journal.go, rw_v3.go, read_paths.go (new): consumers updated to typed shape - execution/stagedsync/exec3_*.go, committer.go, calc_state.go, exec3_filter.go: VersionedWrites consumers migrated to AnyVersionedWrite iteration; commitment touch path uses VersionedWrites.TouchUpdates over the typed shape - execution/exec/block_assembler.go, txtask.go: PerPathReadSet/WriteSet threading - execution/types/accounts/code.go (new): canonical accounts.Code value type used by CodePath WriteCell[T] Shims used to isolate from the non-vio refactors landed on mh/perf-all-followups that vio is NOT structurally dependent on: - execution/cache/state_cache.go: PutCode(hash, code) pass-through (no weak-pointer code-cache landing required) - execution/commitment/vio_shims.go: no-op Record{Sstore*, HasStorageMiss} counters (commitment-metrics landing not required) - common/dbg: BAL{DrivenCommitment, ShadowCompute} stubbed false (BAL-driven commitment landing not required) - execution/balcache: stub package with CachedBlockAccessList returning no-cached (BAL cache landing not required) - db/state/aggregator_vio_shim.go: SetMaxCollationTxNum no-op - execution/state/prewarm_shim.go: PrewarmBlockStateCacheFromBAL no-op - execution/state/access_set_shim.go: AccessSet type + IBS.AccessedAddresses returning nil (access tracking folded into ReadSet under typed vio; api-compat shim for txtask + block_assembler) - execution/state/vio_test_shim.go: VersionedIO.RecordAccesses no-op used only by execution/tests/blockgen chain_makers.go - types.CodeChange.Hash literal stripped (typed-CodeChange landing not required; main's CodeChange has only Index+Bytecode) - mdgas.MdGas refund flattened back to uint64 across IBS/journal/ state_object (multi-dim gas landing not required) - Tracing reasons restored on IBS.SetNonce/SetCode + stateObject.SetNonce/SetCode signatures (main has them; chain-end dropped them — restored so main's evm/vm/aura/bor call sites work) - TouchPlainKeyDirect call sites in versionedio.go and calc_state.go rewritten to main's untyped (string, *Update) signature (typed-commitment-handle landing not required) - Hash.Bytes() → Hash[:] across stagedsync (no Hash.Bytes method on main) - td declared *uint256.Int to match main's chainReader.GetTd return type Foundation for tip-perf alloc reduction in the parallel-exec hot path (versionedRead[T] CPU + WriteSet.Set/ReadSet.Set allocs). Builds clean end-to-end; lint and on-tip characterisation follow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…typed-vio shape Tests were left on the untyped shape by the initial L2 landing — go vet failed because: - AnyVersionedWrite is an interface (Header(), ValAny()), tests used direct w.Address / w.Path / w.Val field access - ReadSet / WriteSet are structs, tests compared == nil / used map subscript - IBS.SetNonce / SetCode and stateObject.SetNonce / SetCode now take a tracing.NonceChangeReason / CodeChangeReason — tests called with 2 args - chain.Config.ChainID is *uint256.Int on main; chain-end test used big.NewInt(1) - SD.Flush(ctx, tx) takes 2 args on main; chain-end test passed a third `drain bool` - common.Address.Value() is array-typed, so direct slicing isn't addressable in a composite call — bound to a local var first Files refreshed from chain-end (db9705dd16) and adapted to main's signatures with a paren-balanced auto-insert of tracing.NonceChangeUnspecified / tracing.CodeChangeUnspecified at every 2-arg call site. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…imeline log Adds env-gated [sched_timeline] log emission so the per-block concurrency-over-time curve can be reconstructed offline. Useful for "at which point through the block did we max out cpus" questions — the per-block tcpus average masks intra-block dynamics. Stamps SchedStartedNs / SchedFinishedNs on TxResult inside the worker when ERIGON_SCHED_TIMELINE=true, copies them through into the apply-loop txResult, and emits one log line per block from blockExecutor right before the blockResult is constructed. Cost when disabled: zero (Guard on dbg.SchedTimeline). Cost when enabled: 2 × time.Now().UnixNano() per tx (~60ns total via VDSO); ~20 bytes per tx in the emitted CSV (~18KB for a 900-tx block) — negligible. Format: [sched_timeline] blk=N base_ns=B cnt=K data="idx:start_delta_ns:dur_ns,..." Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion/commitment: tighten vio-shim comments Drop multi-paragraph shim comments per CLAUDE.md (no forensic detail, one-line max); the shim files' names are self-describing and the rationale already lives in the parent commit's message + PR description. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The SD-revival check in versionedReadCore falls through to a normal versionMap read when the current tx has SelfDestructPath=false (which a CREATE/CREATE2 recreate emits). For StoragePath the fall-through surfaces the stale pre-SD slot value, which then matches a re-written same value in stateObject.SetState's no-op filter and drops the SSTORE entirely. A fresh contract's storage is empty by construction. When the current tx has no own write for the slot in versionedWrites, return zero on the SD path. Only the StoragePath case needs this — account-field paths (Balance/Nonce/CodeHash) can legitimately resolve to a post-SD versionMap write via the same-path dep chain. Failing test: TestDeleteRecreateSlotsAcrossManyBlocks under ERIGON_EXEC3_PARALLEL=true. Block 1 destroys+recreates+destroys+recreates aa: tx 3's CREATE2 reads slot 4 to compute SSTORE no-op, gets tx 1's pre-SD value 4, sees value==prev, drops the write, slot 4 reads as 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…dResult blockExecutor.blockIO.Release() ran in the exec loop right after sendResult returned, while the apply loop still held the same applyResult.TxIO reference. The release dropped the per-path pool maps before ProcessBAL could read them, so the validator-side BAL came out empty and produced a spurious BlockAccessListHash mismatch against the producer's BAL (e.g. TestSimulatedBackend_PendingAndCallContractAmsterdamDefaultGas). Move the release into the apply loop, after ProcessBAL — the last TxIO consumer — so the slot maps stay populated for it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The typed-vio refactor replaced ReadSet.Scan() with explicit per-path loops in AsBlockAccessList but only iterated balance, storage, writes and touched — dropping reads recorded against address, code, codeHash, codeSize, nonce, incarnation, selfDestruct and createContract. Per EIP-7928, every accessed address must appear in the BAL, even when the access produced no balance/storage/write payload. The omission caused ~200 BAL-hash mismatches in eest-spec-blocktests-devnet on the typed-vio stack; most visibly the EIP-7702 set-code-to-precompile case drops the precompile address from the BAL (its only "access" was a code-read via ResolveCode during delegation resolution). Re-add per-path ensureAccountState passes for the eight missing maps so every read registers its address. Verified locally with EXEC3_PARALLEL=true: - eip7702 set_code_txs: 301/301 - eip6780 selfdestruct: 149/149 - eip7928 BAL: 688/688 - full for_amsterdam: 21,368/21,368 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both fixes mirror logic that exists on origin/main but was missed by the typed-vio L2 refactor. 1. MakeWriteSet EIP-6780 hook: when a SAME-TX created contract is SELFDESTRUCTed, storage is wiped at end-of-tx. The BAL must report the dirty slots as reads, not changes. Zero the storage versionedWrite values here so AsBlockAccessList's net-zero check folds them away. 2. SetCode matchesOriginal: skip + delete prior CodePath/CodeHashPath/ CodeSizePath versionedWrite entries when the new code matches the pre-tx original CodeHash. Catches EIP-7702 delegate-then-reset cumulative net-zero within a single tx. Guarded against newlyCreated stateObjects whose .original is the pre-creation snapshot. Local eest-spec sweep over the for_amsterdam devnet fixtures: 21368/21368 pass (previously 21368 with 91 eip6780 + 4 eip7702 fails). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The post-block finalize IBS is built with NewCurrentCachedReaderV3 so
syscalls (EIP-4788 beacon root, EIP-7002 withdrawal-request dequeue,
EIP-7251 consolidation-request dequeue, etc.) see the latest per-tx
writes accumulated in BlockStateCache. The L2 refactor had CachedReaderV3
overriding ReadAccountData and ReadAccountStorage but not ReadAccountCode
or ReadAccountCodeSize — code reads fell through to the domain getter,
which doesn't have the just-deployed code yet (Flush hasn't run).
When block 1 of a fork transition deploys a system contract via CREATE,
the post-block syscall's GetCodeSize reads through the cache miss and
sees an empty code, fails "[EIP-7002] Syscall failure: Empty Code",
and the block can't finalize:
apply loop exited but 1 block(s) had tx-results without a blockResult: [1]
Add GetCurrentCode on BlockStateCache and ReadAccountCode /
ReadAccountCodeSize on CachedReaderV3, matching origin/main's shape.
Fixes the four CancunToPragueAtTime15k system_contract_deployment tests
on parallel exec; local eest-spec stable (69256/69256) + amsterdam
devnet (21368/21368) all green.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ways false
exec3_parallel set InMemHistoryReads=true on entry and unconditionally
restored it to false on exit. That broke any post-exec caller that
expected history reads on the shared mem batch: the engine-x runner's
restore-state path calls GetAsOf and fails with
GetAsOf called on TemporalMemBatch with inMemHistoryReads disabled
Capture the prior flag value and restore it on exit, matching origin/main's
shape.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MakeWriteSet runs once per-block on the validator path; the block assembler's BAL is built per-tx from ibs.TxIO() and never goes through MakeWriteSet. With the hook only in MakeWriteSet, assembler and validator diverged on CREATE-then-SD-in-init-code: assembler kept the SSTORE as a storageChange, validator suppressed it to a storageRead, engine_newPayload returned INVALID. Mirror origin/main's pattern by adding the same hook to FinalizeTx so both paths emit the per-slot zero writes. Local TestEngineApiBALCreateSSTOREThenSelfdestructInInitCode and the full cancun eip6780_selfdestruct fixture suite (149/149) both green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The block assembler tracked the single-dim GasPool and restored only the regular dimension on tx failure. Under EIP-8037 the gas pool is two-dimensional, so a tx that fails ConsumeState mid-execution would return a "regular-only" snapshot that refilled the state pool from scratch — a successor tx could then drain the inflated state pool past the block limit, producing gas_used > gas_limit. Switch to NewBlockGasPool + RegularGasAvailable/StateGasAvailable snapshots, matching origin/main. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… not cell pointer ReadStorageCell returned a *WriteCell[uint256.Int] for the caller to read cell.Value AFTER releasing the read lock. flushVWLocked, called under the write lock, mutates that same cell.Value when re-flushing at the same (txIdx, incarnation) pair. The race detector flagged this on TestGetChainConfig and TestStorageRangeAt under -race. Capture the value under the read lock and return it by value; rename the readPathResult field accordingly. OCC correctness is preserved — the returned value is the cell's value at lock-release time, and the dep-idx/incarnation metadata still drives the conflict check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Test #119 'Re-Org Back into Canonical Chain ... Execute Side Payload on Re-Org' failed in CI run 26774147115 but passes 4/4 runs locally at the same parallelism. Suspected CI-environment timing flake. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI status summary — 2026-06-01 sessionCI: 84 pass / 2 skipping / 3 fail on Today shipped 9 commits resolving the deterministic CI failures (212 → 0):
The remaining 3 CI fails are environmental flakes:
|
Diagnostic instrumentation for the hive 'Re-Org Back into Canonical Chain ... Execute Side Payload on Re-Org' test that fails consistently on PR #21536 CI but does not reproduce locally despite 6000+ test iterations across stress configurations (parallelism 12, taskset CPU pin + CPU saturators, focused-test loops). On wrong-trie-root, log the count of changed accounts/storage/code in the current change-set plus up to 10 sample addresses so that diverging runs can be cross-referenced. Tracking: project_hive_reorg_bug_2026-06-02.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| @@ -0,0 +1,3 @@ | |||
| package state | |||
|
|
|||
| func (a *Aggregator) SetMaxCollationTxNum(n uint64) {} | |||
There was a problem hiding this comment.
SetMaxCollationTxNum -- has been removed from main...we don't have use for it anymore.
1ad2cd0 introduced a nil-pointer dereference: pe.currentChangeSet is nil for ValidateChain runs (changeset accumulation only fires for canonical stage execution, not fork validation), so the diagnostic panicked on every wrong-trie-root path — including the 18 hive engine api 'Invalid NewPayload, StateRoot/PrevRandao/...' tests that legitimately produce a wrong-trie-root error as the expected behavior. Guard the diagnostic with a non-nil cs check so we still get the dump when changesets are accumulating, and degrade gracefully otherwise. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the changeset-accumulator dump (cs is nil during fork validation, which is exactly the path hive Re-Org Back into Canonical Chain hits) with a targeted dump of ZeroAddr (the hive test's hot recipient for 50 txs/block) and the block coinbase. Reading via Domains().GetLatest goes through sd.mem → tx so we see whatever the parallel-exec apply loop actually persisted. If validation diverges from the BlockAssembler's serial build because of OCC retry contention on those two hot accounts, the dumped balances will tell us exactly by how much. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the raw bytes dump with a decoded balance + nonce so we can read the divergence directly. ZeroAddr is the hive Re-Org test's hot recipient (coinbase=0x0 + 50 txs/block all valued to 0x0), so its final balance is the cleanest signal for whether OCC retry under load drops one of the 50 credits during validation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR #21536 hit a fully-green CI on cd9078e (87 pass, 0 fail). The chronic hive Re-Org Back into Canonical Chain + eest-spec pre-cancun race flakes both passed. Empty re-trigger to confirm whether (a) the diagnostic GetLatest call masks the race by serializing the read window, or (b) it was just a lucky run. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI fully green on `cd9078e3a0`87 pass / 2 skipping / 0 fail — first fully-green CI run today. Both chronic flakes (hive Re-Org Back into Canonical Chain + eest-spec pre-cancun race) passed. Caveat: the Re-Org bug is real and not yet fixed. Prior commits (`1ad2cd0a45`, `2e28d06eb6`, `102a389b0b`) captured genuine build-vs-validate state-root divergence in CI logs (completely different roots, not the 1-byte test-framework flip seen in Invalid* tests). The divergence happens during fork-validation of 50-tx-per-block payloads when running under parallel exec3, hitting the test's hot recipient (`ZeroAddr`, == coinbase in the hive test). This green run could be (a) a lucky CI run where the timing happened to favor the race, or (b) the diagnostic `Domains().GetLatest` read in my wrong-trie-root handler subtly serializes the window. Empty re-trigger `67f4330268` will gather more data. Tracking memo: `project_hive_reorg_bug_2026-06-02.md`. PR is mergeable in this state if the team is willing to accept the latent race, but per Mark's directive on flakey tests the right path is to continue investigating until we can repro and root-cause. |
yperbasis
left a comment
There was a problem hiding this comment.
CI is still red. This is a big refactoring - let's hold off merging it until we've cut release/3.5.
What this is
Lifts the typed-vio refactor out of
mh/perf-all-followups(which has it bundled with 31k other commits) and onto current main, on top of #21516's profiling instrumentation.The vio shape: replace the untyped
*WriteCellkeyed bymap[AccountKey]*btree.Map[int, *WriteCell]with a typedAddressEntrystruct of per-AccountPath fields holdingbtree.Map[int, *WriteCell[T]]per type. Eliminatesany-boxing on the OCC write/read path, makes per-Tsync.Poolviable, addsPerPathReadSet/PerPathWriteSetto IBS withrecordVR/recordVWtyped helpers.Why now
L1 perf review on 1069 NewPayload blocks at tip showed parallel exec runs effectively serial (mean tcpus = 0.95, 5.7% utilisation of 16 logical threads). Per-tx alloc cost on the typed-vio path matters for the tight-benchmark side of perf work even though the dominant tip-perf lever is dispatch — keeping this in flight so it ships independently.
Scope guard
This PR does not pull in the constellation of other refactors that vio is bundled with on
mh/perf-all-followups. Those are isolated behind shims:cache.PutCodepass-through (no weak-pointer code-cache)commitment.RecordSstore*/RecordHasStorageMissno-op (no commitment-metrics)dbg.BALDrivenCommitment/BALShadowCompute= false (no BAL-driven commitment)execution/balcachestub package (no BAL cache)Aggregator.SetMaxCollationTxNumno-opstate.PrewarmBlockStateCacheFromBALno-opstate.AccessSet+IBS.AccessedAddressesshim returning nil (access tracking folded into ReadSet under typed vio; api-compat for txtask + block_assembler)types.CodeChange.Hashliteral stripped (main's CodeChange has only Index+Bytecode)mdgas.MdGasrefund flattened to uint64 (no multi-dim gas)SetNonce/SetCodesignatures (main has them; chain-end dropped them)TouchPlainKeyDirectcalls rewritten to main's untyped(string, *Update)signature (no typed-commitment-handle)Files: 17 modified + 7 new (shim files prefixed
_shim.go/_shims.gofor grep-ability). Diff ~+5000 / -2900.Test plan
make erigonclean (binary at/home/erigon/erigon-vio-on-main/build/bin/erigon, 153MB)make lint— pending/home/erigon/perf_results/L1-baseline/