
resharding: cleanup (#585)
staffik authored Jan 30, 2025
1 parent 74301d0 commit 58bb90d
Showing 1 changed file with 76 additions and 2 deletions: neps/nep-0568.md
@@ -112,6 +112,22 @@ Cold storage uses the same mapping strategy to manage shard state during resharding

This approach minimizes complexity while maintaining consistency across hot and cold storage.

#### State cleanup

Since [Stateless Validation][NEP-509], tracking all shards is no longer required. Currently, shard cleanup (e.g. when a node stops tracking one shard and starts tracking another) is not implemented. With resharding, we also want to clean up the parent shard once we stop tracking all of its descendants. We propose a shard cleanup mechanism that will also handle post-resharding cleanup.

When garbage collection removes the last block of an epoch from the canonical chain, we determine which shards were tracked during that epoch by examining the shards present in `TrieChanges` at that block. Similarly, we collect information on shards tracked in subsequent epochs, up to the present one. A shard State is removed only if:
* It was tracked in the old epoch (for which the last block has just been garbage collected).

* It was not tracked in later epochs, is not currently tracked, and will not be tracked in the next epoch.

To ensure compatibility with resharding, instead of checking tracked shards directly, we analyze the `ShardUId` prefixes they use. A parent shard's state is retained as long as it remains referenced in `DBCol::StateShardUIdMapping` by any descendant shard. Once all descendant shards are no longer tracked, we clean up the parent shard's state (along with its descendants) and remove all mappings to the parent from `DBCol::StateShardUIdMapping`.
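
For illustration, here is a minimal sketch of the prefix resolution this check relies on. The helper name matches the `get_shard_uid_mapping` call used in the pseudocode later in this section, but the exact signature and the `get_ser` lookup are assumptions, not the final implementation:

```rust
// Sketch only: resolve the ShardUId prefix under which a shard's State is stored.
// Falls back to the shard's own ShardUId when no mapping entry exists.
fn get_shard_uid_mapping(store: &Store, child_shard_uid: ShardUId) -> ShardUId {
    store
        .get_ser::<ShardUId>(DBCol::StateShardUIdMapping, &child_shard_uid.to_bytes())
        .ok()
        .flatten()
        // No entry in DBCol::StateShardUIdMapping: the shard uses its own prefix.
        .unwrap_or(child_shard_uid)
}
```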

#### Negative refcounts

Some trie keys, such as `TrieKey::DelayedReceipt`, are shared among child shards, but their corresponding State is not duplicated. The `DBCol::State` column uses reference counting, meaning that some data is counted only once, even if referenced by multiple child shards. As a result, removing the data can sometimes lead to negative refcounts.
To address this, we have modified the RocksDB `refcount_merge` behavior so that negative refcounts are clamped to zero. However, this is suboptimal, as it can lead to some State being leaked. Specifically, if two operations decrement the refcount for the same key, the RocksDB compaction process may merge them before they are applied to the stored value; the clamped result then drops both decrements, so the key would never be removed from disk until state sync occurs.
This is a temporary solution, and we should follow up on it later.
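
As a rough illustration (a simplified sketch, not nearcore's actual merge operator; the helper names and the explicit delta representation are assumptions), clamping an intermediate merge of two decrements can drop them both:

```rust
// Combine refcount deltas for one key, clamping the sum at zero so a negative
// refcount is never written to disk. The downside: an "excess" decrement is lost.
fn merge_refcount_deltas(deltas: &[i64]) -> i64 {
    let total: i64 = deltas.iter().sum();
    total.max(0)
}

// Example of the leak: two decrements for the same key are merged by compaction
// before they reach the on-disk value.
fn example_leak() {
    let on_disk_refcount: i64 = 2;
    // Intermediate merge of the two decrements: (-1) + (-1) = -2, clamped to 0.
    let merged_delta = merge_refcount_deltas(&[-1, -1]);
    // Applying the clamped delta leaves the refcount at 2, so the key is never deleted.
    assert_eq!(on_disk_refcount + merged_delta, 2);
}
```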

### Stateless Validation

Since only a fraction of nodes track the split shard, it is necessary to prove the transition from the state root of the parent shard to the new state roots for the child shards to other validators. Without this proof, chunk producers for the split shard could collude and provide invalid state roots, potentially compromising the protocol, such as by minting tokens out of thin air.
@@ -521,9 +537,66 @@ fn set_shard_uid_mapping(&mut self, child_shard_uid: ShardUId, parent_shard_uid:
}
```

When a node stops tracking all descendants of a shard, garbage collection will eventually remove the last block of the last epoch in which any descendant was still tracked. The descendants will then appear in the result of:

```rust
fn get_potential_shards_for_cleanup(..., last_block_of_gced_epoch) -> Result<Vec<ShardUId>> {
    let mut tracked_shards = vec![];
    for shard_uid in shard_layout.shard_uids() {
        // A shard was tracked in the gc-ed epoch iff TrieChanges were stored for its last block.
        if chain_store_update
            .store()
            .exists(DBCol::TrieChanges, &get_block_shard_uid(&last_block_of_gced_epoch, &shard_uid))?
        {
            tracked_shards.push(shard_uid);
        }
    }
    Ok(tracked_shards)
}
```

Then `gc_state()` is called; it maps each descendant `ShardUId` to the parent `ShardUId`, making the parent shard a candidate for cleanup. If the parent `ShardUId` has not been used as a database key prefix since `gced_epoch`, we can safely remove the state stored under this prefix (covering the parent and all descendants), along with the associated entries in `DBCol::StateShardUIdMapping`.

```rust
fn gc_state(potential_shards_for_cleanup, gced_epoch, shard_tracker, store_update) {
    // Map each candidate shard to the ShardUId prefix its State is actually stored under.
    let mut potential_shards_to_cleanup: HashSet<ShardUId> = potential_shards_for_cleanup
        .iter()
        .map(|shard_uid| get_shard_uid_mapping(&store, *shard_uid))
        .collect();

    // Keep any prefix that was still used by a tracked shard in a later, already applied epoch.
    for epoch in gced_epoch + 1..current_epoch {
        let shard_layout = get_shard_layout(epoch);
        let last_block_of_epoch = get_last_block_of_epoch(epoch);
        for shard_uid in shard_layout.shard_uids() {
            // TrieChanges for the last block of the epoch indicate the shard was tracked then.
            if !store
                .exists(DBCol::TrieChanges, &get_block_shard_uid(last_block_of_epoch, &shard_uid))?
            {
                continue;
            }
            let mapped_shard_uid = get_shard_uid_mapping(&store, shard_uid);
            potential_shards_to_cleanup.remove(&mapped_shard_uid);
        }
    }

    // Keep prefixes used by shards tracked in the current or the next epoch.
    for shard_uid in shard_tracker.get_shards_tracks_this_or_next_epoch() {
        let mapped_shard_uid = get_shard_uid_mapping(&store, shard_uid);
        potential_shards_to_cleanup.remove(&mapped_shard_uid);
    }
    let shards_to_cleanup = potential_shards_to_cleanup;

    // Drop every mapping entry that points at a prefix scheduled for cleanup.
    for kv in store.iter_ser::<ShardUId>(DBCol::StateShardUIdMapping) {
        let (child_shard_uid, parent_shard_uid) = kv?;
        if shards_to_cleanup.contains(&parent_shard_uid) {
            store_update.delete(DBCol::StateShardUIdMapping, &child_shard_uid);
        }
    }
    // Finally, remove all State stored under the cleaned-up prefixes (parent and descendants).
    for shard_uid_prefix in shards_to_cleanup {
        store_update.delete_shard_uid_prefixed_state(shard_uid_prefix);
    }
}
```

For archival nodes, mappings are retained permanently to ensure access to the historical state of all shards.

This implementation ensures efficient and scalable shard state transitions, allowing child shards to use ancestor data without creating redundant entries.

### State Sync

@@ -631,3 +704,4 @@ Copyright and related rights waived via [CC0](https://creativecommons.org/public

[NEP-040]: https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md
[NEP-508]: https://github.com/near/NEPs/blob/master/neps/nep-0508.md
[NEP-509]: https://github.com/near/NEPs/blob/master/neps/nep-0509.md
