
Conversation

pawanjay176 (Member)

Issue Addressed

#8218

Proposed Changes

// TODO

Comment on lines +123 to +124
/// This is derived from CLI flags and not persisted.
#[serde(skip)]
Member:

Most of these settings are derived from CLI flags? I don't think we need the `serde(skip)` here; if anything, I think we might need it so we can write CLI tests in `lighthouse/tests/beacon_node.rs`.

Member:

(the persistence of this config using serde is only for tests)
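
For context, a minimal sketch of how `#[serde(skip)]` behaves on a round-trip (the field names here are illustrative, not the actual Lighthouse config): a skipped field is omitted when the config is serialized and falls back to its `Default` when it is read back, which determines whether the test-only persistence above can observe it.

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Default, Debug, PartialEq)]
struct Config {
    custody_group_count: u64,
    /// Derived from CLI flags; with `skip`, omitted on write and reset to
    /// `Default::default()` on read.
    #[serde(skip)]
    semi_supernode: bool,
}

fn main() {
    let cfg = Config { custody_group_count: 64, semi_supernode: true };
    let json = serde_json::to_string(&cfg).unwrap();
    // The skipped field never appears in the serialized output...
    assert!(!json.contains("semi_supernode"));
    // ...so a round-trip does not preserve it.
    let restored: Config = serde_json::from_str(&json).unwrap();
    assert!(!restored.semi_supernode);
}
```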

CustodyContext {
node_custody_type: NodeCustodyType,
) -> Result<Self, String> {
if ssz_context.persisted_is_supernode && node_custody_type != NodeCustodyType::Supernode {
Member:

So we might want a DB schema change to store the NodeCustodyType? Then we could enforce that the node can't change "down", i.e. supernode to semi-supernode/full, or semi-supernode to full?

Does anything break if we don't block the transition from semi-supernode to full? Because this check already covers both supernode cases.
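
As a hedged sketch of that enforcement (the enum name comes from the PR; the ordering and error handling are illustrative), deriving `Ord` in tier order would make the "can't change down" rule a one-line comparison when restoring from disk:

```rust
/// Variants declared lowest to highest tier, so the derived ordering
/// reflects how much the node custodies.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum NodeCustodyType {
    Fullnode,
    SemiSupernode,
    Supernode,
}

/// Illustrative check: reject any transition to a lower custody tier.
fn check_custody_transition(
    persisted: NodeCustodyType,
    current: NodeCustodyType,
) -> Result<(), String> {
    if current < persisted {
        return Err(format!(
            "cannot downgrade node custody type from {persisted:?} to {current:?}"
        ));
    }
    Ok(())
}
```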

pawanjay176 (Member Author):

Sorry, I initially added this check thinking we wanted to prevent changes to the NodeCustodyType. But it is technically possible to support; we just need to set the correct value of earliest_available_slot.

One use case: all the nodes on sepolia that had to run a supernode to reconstruct all the blobs could downgrade to a semi-supernode.
We initially wanted to avoid switching modes, but it's quite easy to handle in the RPC layer by just setting the right value of earliest_available_slot, so I don't see any strong reason not to allow this.
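
A hedged sketch of the mechanism being described (names are illustrative, not the actual Lighthouse code): when custody expands, the newly custodied columns were never stored for older slots, so `earliest_available_slot` has to advance to the restart point; when custody shrinks, everything already on disk still covers the smaller custody set, so the persisted value can stand.

```rust
type Slot = u64;

/// Illustrative: the slot from which the node can honestly serve its full
/// custody set after a custody-group-count (cgc) change.
fn earliest_available_slot_after_change(
    cgc_persisted: u64,
    cgc_current: u64,
    persisted_earliest: Slot,
    restart_slot: Slot,
) -> Slot {
    if cgc_current > cgc_persisted {
        // Custody grew: the extra columns only exist from the restart onwards.
        restart_slot
    } else {
        // Custody shrank or is unchanged (e.g. supernode -> semi-supernode):
        // the data we already hold satisfies the new custody set.
        persisted_earliest
    }
}
```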

half of the data columns (enough for reconstruction), enabling efficient \
data availability with lower bandwidth and storage requirements compared to \
a supernode, while still supporting full blob reconstruction.")
.display_order(0)
Member:

Looks good, just need to update the CLI help in the Lighthouse book.
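
For a rough illustration of the numbers behind the help text above (assuming the Fulu parameters of 128 columns and 128 custody groups; the helper function is hypothetical): half the custody groups yields half the columns, which meets the 50% threshold needed for reconstruction.

```rust
const NUMBER_OF_COLUMNS: u64 = 128;
const NUMBER_OF_CUSTODY_GROUPS: u64 = 128;

/// Hypothetical helper: custody group count for a semi-supernode.
fn semi_supernode_cgc() -> u64 {
    NUMBER_OF_CUSTODY_GROUPS / 2
}

fn main() {
    let columns_per_group = NUMBER_OF_COLUMNS / NUMBER_OF_CUSTODY_GROUPS;
    let columns = semi_supernode_cgc() * columns_per_group; // 64 of 128 columns
    // Reconstruction requires at least 50% of the extended columns.
    assert!(columns * 2 >= NUMBER_OF_COLUMNS);
}
```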

- new_custody_group_count: updated_cgc,
- sampling_count: self.num_of_custody_groups_to_sample(effective_epoch, spec),
+ new_custody_group_count: updated_cgc as u64,
+ sampling_count: self.sampling_count_at_epoch(None, spec) as u64,
jimmygchen (Member) commented on Oct 20, 2025:

I noticed `num_of_custody_groups_to_sample` and `num_of_data_columns_to_sample` are now both mapped to `sampling_count_at_epoch`. I'm sure this is intentional, but I want to make sure the reason for the previous separation is clear: it was mainly to support subnet decoupling (which isn't my favourite feature), and unifying these into a single function is likely going to break that.
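
To make the concern concrete, a hedged sketch (signatures are illustrative, not Lighthouse's actual API): one function counts custody *groups*, the other counts *columns*. The conversion is trivially 1:1 under today's parameters, but subnet decoupling relies on being able to vary the column-level count independently, which a single unified function cannot express.

```rust
const NUMBER_OF_COLUMNS: u64 = 128;
const NUMBER_OF_CUSTODY_GROUPS: u64 = 128;

/// Illustrative only: the number of custody *groups* a node samples.
fn num_of_custody_groups_to_sample(cgc: u64, samples_per_slot: u64) -> u64 {
    cgc.max(samples_per_slot)
}

/// Illustrative only: the number of *columns* actually sampled. Collapsing
/// this into the group-level function loses the degree of freedom that
/// subnet decoupling needs.
fn num_of_data_columns_to_sample(groups_to_sample: u64) -> u64 {
    groups_to_sample * (NUMBER_OF_COLUMNS / NUMBER_OF_CUSTODY_GROUPS)
}
```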

See:

validator_custody_at_head: ssz_v24.validator_custody_at_head,
persisted_is_supernode: ssz_v24.persisted_is_supernode,
// TODO(pawan): fix
persisted_node_custody_count: if ssz_v24.persisted_is_supernode {
Member:

Note that we'll need a new schema migration file for this change
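
A hedged sketch of what that migration might look like (the struct layouts beyond the fields visible in the diff are hypothetical): v24 only recorded a supernode boolean, so the upgrade has to choose a custody-group count to stand in for it.

```rust
// Hypothetical on-disk layouts; only the fields shown in the diff above
// are taken from the PR.
struct CustodyContextSszV24 {
    validator_custody_at_head: u64,
    persisted_is_supernode: bool,
}

struct CustodyContextSszV25 {
    validator_custody_at_head: u64,
    persisted_node_custody_count: u64,
}

struct ChainSpec {
    number_of_custody_groups: u64,
    custody_requirement: u64,
}

/// Sketch of a v24 -> v25 upgrade: map the old boolean onto a count.
fn upgrade_custody_context(
    v24: CustodyContextSszV24,
    spec: &ChainSpec,
) -> CustodyContextSszV25 {
    CustodyContextSszV25 {
        validator_custody_at_head: v24.validator_custody_at_head,
        persisted_node_custody_count: if v24.persisted_is_supernode {
            // A supernode custodies every group.
            spec.number_of_custody_groups
        } else {
            // Fall back to the protocol minimum; the real migration may
            // need to account for validator custody as well.
            spec.custody_requirement
        },
    }
}
```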

let update_earliest_available_slot = if cgc_current > cgc_persisted {
    true
} else {
    false
Member:

Does this assume we have previously completed column backfill?

jimmygchen (Member) commented on Oct 20, 2025:

I've done a quick pass, and I feel like this is quite a risky change before the release: there's a bit of refactoring, and it touches quite a lot of code related to custody/sampling that we may want to review and test thoroughly. I can continue the review tomorrow.

Is it possible to have `--semi-supernode` with minimal refactoring and changes? I think the restart handling could be done in a separate PR, since we don't handle that between supernode and fullnode anyway, and it's something we may want to think about how to handle properly, e.g. prompt the user to resync before proceeding, prune extra columns, etc.

pawanjay176 (Member Author):

> I've done a quick pass, and I feel like this is quite a risky change before the release: there's a bit of refactoring

I agree. It's a lot more change than I anticipated initially. I don't think we should rush it for the release either.

jimmygchen added the v8.1.0 Post-Fulu release label on Oct 21, 2025
jimmygchen mentioned this pull request on Oct 21, 2025
mergify bot pushed a commit that referenced this pull request on Oct 22, 2025:
Addresses #8218

A simplified version of #8241 for the initial release.

I've tried to minimise the logic change in this PR; introducing the `NodeCustodyType` enum still results in quite a bit of diff, but the actual logic change in `CustodyContext` is quite small.

The main changes are in the `CustodyContext` struct:
* ~~combining `validator_custody_count` and `current_is_supernode` fields into a single `custody_group_count_at_head` field. We persist the cgc of the initial CLI values into the `custody_group_count_at_head` field and only allow for increases (same behaviour as before).~~
* I noticed the above approach caused a backward compatibility issue; I've [made a fix](15569bc) and changed the approach slightly (which was actually what I had originally in mind):
* when initialising, only override the `validator_custody_count` value if either flag `--supernode` or `--semi-supernode` is used; otherwise leave it at the existing default of `0`. Most other logic remains unchanged (see the sketch after this list).
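
A hedged sketch of that initialisation rule (the enum matches the PR's naming; the helper and the semi-supernode count are illustrative):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeCustodyType {
    Fullnode,
    SemiSupernode,
    Supernode,
}

/// Returns an override for `validator_custody_count`, or `None` to keep the
/// persisted value (default `0`) untouched.
fn validator_custody_override(
    node_type: NodeCustodyType,
    number_of_custody_groups: u64,
) -> Option<u64> {
    match node_type {
        NodeCustodyType::Supernode => Some(number_of_custody_groups),
        NodeCustodyType::SemiSupernode => Some(number_of_custody_groups / 2),
        // No flag given: leave validator custody to grow as before.
        NodeCustodyType::Fullnode => None,
    }
}
```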

All existing validator custody unit tests are still passing, and I've added additional tests covering semi-supernode and restoring `CustodyContext` from disk.

Note: I've added a `WARN` if the user attempts to switch to `--semi-supernode` or `--supernode`. This currently has no effect, but once @eserilev's column backfill is merged, we should be able to support this quite easily.

Things to test
- [x] cgc in metadata / enr
- [x] cgc in metrics
- [x] subscribed subnets
- [x] getBlobs endpoint


Co-Authored-By: Jimmy Chen <[email protected]>