[NEP-584]: Cross shard bandwidth scheduler #584
Conversation
neps/nep-0584.md
Outdated
Additionally, the value that is the closest to `max_receipt_size` is set to to `max_receipt_size`:
two 'to'?
neps/nep-0584.md
Outdated
The values are calculate using a linear interpolation between `base_bandwidth` and
'calculated'?
neps/nep-0584.md
Outdated
```rust
BandwidthSchedulerParams {
    base_bandwidth: 100000,
    max_shard_bandwidth: 4500000,
    max_receipt_size: 4194304,
    max_allowance: 4500000,
}
```
nit: clarify how these values are related to the fact that we are practicing the exercise with '4 shards'
Matej's comment about the NEP can be found here: 0ff27d1#commitcomment-151308824
As the moderator, I want to kickstart the review process for this NEP, as the change is part of the upcoming release. @jancionear, please comment once you believe this proposal is ready for SME review. @near/wg-protocol, could you help assign SMEs who can review the proposal? From an engineering perspective, we believe @shreyan-gupta and @wacban are good candidates. Thank you.
I think the NEP is ready for review. I addressed the issues found in Matej's review.
As a working group member, I nominate @shreyan-gupta and @wacban as SME reviewers.
Overall great design! I've left a couple of comments.
There is already a rudimentary solution in place, added together with stateless validation in [NEP-509](https://github.com/near/NEPs/blob/master/neps/nep-0509.md) to limit witness size.
Is this the solution we have implemented in congestion control, or does it pre-date that? I recall we do something similar in congestion control, where we round-robin which shard we are allowed to send more than the usual limit to.
It was added after congestion control. The allowed shard from congestion control has to be the same as the shard that is allowed to send more receipts to make sure that there are no liveness issues. If they were different we could have a situation where congestion control allows one shard to send receipts, but it's not the shard that can send large receipts and the large receipts could get stuck.
The idea of "allowed shard" from original congestion control was extended to also mean the shard that can send more bytes of receipts.
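For intuition, here is a minimal sketch of how a height-based round-robin "allowed shard" could work. The function name and the exact formula are illustrative assumptions made for this sketch; the real selection logic lives in nearcore's congestion control code.

```rust
/// Illustrative only: rotate which sender shard gets the extra allowance at each height,
/// so every shard eventually gets a turn and large receipts cannot get stuck forever.
fn allowed_sender_shard(block_height: u64, receiver_shard: u64, num_shards: u64) -> u64 {
    // Offsetting by the receiver shard spreads the grants, so different receivers
    // allow different senders at the same height (an assumption of this sketch).
    (block_height + receiver_shard) % num_shards
}

fn main() {
    let num_shards = 4;
    for height in 100..103u64 {
        let allowed: Vec<u64> = (0..num_shards)
            .map(|receiver| allowed_sender_shard(height, receiver, num_shards))
            .collect();
        println!("height {height}: allowed sender per receiver shard = {allowed:?}");
    }
}
```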
Let's take a look at an example. Let's say that the predefined list of values that can be requested is:
This setup works well when we are assuming 4 MB size limit. If in the future we would like to increase the size limit to some other number, how would we change the predefined list?
When changing the limits we would have to take another look at `max_shard_bandwidth`, `max_single_grant` and `base_bandwidth`, see if they still make sense, and adjust as necessary.
If we wanted to lower the receipt size limit to 2MB, we could either keep `max_single_grant` as is or make it smaller to increase the `base_bandwidth`. There's no golden rule, it depends on each case.
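For a concrete feel of that trade-off, here is a rough back-of-the-envelope sketch using the `base_bandwidth` formula quoted later in this thread; the 2MB `max_single_grant` value is a hypothetical, not a proposed change.

```rust
fn main() {
    // base_bandwidth = (max_shard_bandwidth - max_single_grant) / (num_shards - 1),
    // the formula given further down in this thread.
    let max_shard_bandwidth: u64 = 4_500_000;
    let num_shards: u64 = 6;

    for max_single_grant in [4 * 1024 * 1024u64, 2 * 1024 * 1024u64] {
        let base_bandwidth = (max_shard_bandwidth - max_single_grant) / (num_shards - 1);
        println!("max_single_grant = {max_single_grant} -> base_bandwidth = {base_bandwidth}");
    }
    // With 6 shards this prints roughly 61_139 for a 4MB grant and 480_569 for a 2MB grant,
    // so halving the largest possible grant frees a lot of budget for base bandwidth.
}
```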
neps/nep-0584.md
Outdated
```rust
max_shard_bandwidth = 4_500_000;
max_single_grant = 4194304
```
nit: 4_194_304 for readability
of shards is low. There are some tests which have a low number of shards, and having a lower base bandwidth allows us to fully test the bandwidth scheduler in those tests.
Is the 100 KB limit only introduced for these tests or do they potentially hold some importance in mainnet as well?
Currently this only matters in tests; with the current parameters and 6 shards the base bandwidth is ~60kB, and it'll only get smaller as the number of shards increases.
Making `max_shard_bandwidth` larger or `max_single_grant` smaller in the future could make the base bandwidth larger than 100kB, in which case we'll have to reevaluate all the parameters.
Yeah, I guess it shouldn't matter too much given the number of shards can not decrease
neps/nep-0584.md
Outdated
`max_single_grant`, like this:

```rust
values[-1] = base_bandwidth
```
nit: Might help here to just add a quick comment saying `value[-1]` is the theoretical index -1. This means if `BandwidthRequestValues` are empty, we would be requesting `base_bandwidth`.
I initially mistook this for the python indexing where `value[-1]` means the last element of the array.
> I initially mistook this for the python indexing where value[-1] means the last element of the array.
Damn python strikes again x.x
Added a comment to make it clearer
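To tie the interpolation, the `values[-1] == base_bandwidth` convention and the `max_receipt_size` snapping together, here is a rough sketch of how the predefined request values could be computed. The count of 40 values, the choice of upper endpoint `max_value` for the interpolation (the quoted excerpt cuts off before naming it), and the integer rounding are assumptions of this sketch, not the exact nearcore code.

```rust
/// Sketch: build the predefined list of requestable values.
/// The virtual `values[-1]` is `base_bandwidth` (what a shard gets without setting any bit),
/// and the value closest to `max_receipt_size` is snapped to exactly `max_receipt_size`
/// so that a full-sized receipt can always be requested.
fn request_values(base_bandwidth: u64, max_value: u64, max_receipt_size: u64) -> Vec<u64> {
    const NUM_VALUES: u64 = 40; // one value per bit of the request bitmap (assumed here)
    let mut values: Vec<u64> = (1..=NUM_VALUES)
        .map(|i| base_bandwidth + (max_value - base_bandwidth) * i / NUM_VALUES)
        .collect();
    if let Some(closest) = values.iter_mut().min_by_key(|v| (**v).abs_diff(max_receipt_size)) {
        *closest = max_receipt_size;
    }
    values
}

fn main() {
    let values = request_values(100_000, 4_500_000, 4_194_304);
    println!("first = {}, last = {}", values[0], values[values.len() - 1]);
}
```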
```rust
let mut sanity_check_bytes = Vec::new();
sanity_check_bytes.extend_from_slice(scheduler_state.sanity_check_hash.as_ref());
sanity_check_bytes.extend_from_slice(CryptoHash::hash_borsh(&all_shards).as_ref());
```
Is `all_shards` just the list of all shards in the current shard layout? Sounds like the `sanity_check_bytes` just basically confirms the hash as per the block_height? Shouldn't we try to include the link_allowances into the hash as well?
Yeah, ideally we should hash the whole `BandwidthSchedulerState` in the hash, but that could potentially be large (~60kB?), which could affect performance. I didn't include the large field just to be safe. But maybe it wouldn't be that bad? 🤔
Yeah, maybe later we can revisit what is being stored in `sanity_check_bytes`. The current info may not be too useful to us in case there are some inconsistencies related to the allowance calculation across shards/nodes.
Perhaps there may be a cleverer way of getting a digest for the state.
tens of kilobytes of data, which could take a bit of cpu time, so it's not done. The sanity check still checks that all shards ran the algorithm the same number of times and with the same shards.

A new trie column is introduced to keep the scheduler state:
nit: It'll be nice to explicitly define the new columns introduced with the groups as well in the section for groups, i.e. `BUFFERED_RECEIPT_GROUPS_QUEUE_DATA` and `BUFFERED_RECEIPT_GROUPS_QUEUE_ITEM`.
because of the gas limit enforced by congestion control. This is not ideal, in the future we might consider merging these two algorithms into one better algorithm, but it is good enough for now.
It's great that this point is mentioned here. For now it seems like Congestion Control and Bandwidth Scheduler act independently and both their restrictions are placed on the outgoing receipts. While Congestion Control currently deals with gas limits, Bandwidth Scheduler as of the current implementation is limited to the receipt size.
It definitely makes sense to try to merge the efforts from both these designs to provide a consistent view: a single way to manage outgoing receipts. Congestion Control can be extended to Gas limits as well.
neps/nep-0584.md
Outdated
scheduler will work quicker than that.

The current version of the scheduler should work fine up to 50-100 shards, after that we'll probably need to some modifications. A quick solution would be to randomly choose half of the shards at every
nit: remove the `to` here
neps/nep-0584.md
Outdated
- https://github.com/near/nearcore/pull/12728
- https://github.com/near/nearcore/pull/12747

TODO - am I supposed to copy the code here? I think that a proper "minimal reference implementation"
I guess in this case it should be fine to keep what we have here. Maybe including the title of the PR would help a lot.
> Maybe including the title of the PR would help a lot.
Great idea, will add
I'm half way there, so far so good! I'll try to finish by eod tomorrow.
note to self - pick up at BandwidthScheduler
neps/nep-0584.md
Outdated
heard NEAR DA was moving to a design that doesn't require a lot of cross-shard bandwidth.
- High latency and bad scalability. A big receipt has to wait for up to `num_shards` heights before it can be sent. This is much higher than it could be, with bandwidth scheduler a receipt never has to wait more than one height (assuming that aren't shards aren't sending much). Even worse is that
> (assuming that aren't shards aren't sending much)
words are not wording right ;)
It's important to keep the size of `BandwidthRequest` small because bandwidth requests are included in the chunk header, and the chunk header shouldn't be too large.
Just for my information, what would it sum up to for 10/ 50/ 100 shards?
Is there any value in wrapping the bitmap in an Option or does that not affect the number of serialized bytes?
Did you check if the serialized size is what you expect? I could imagine borsh rounding things up to the nearest 32bits for each field, in this case it may be worth to customize the serialization.
> Just for my information, what would it sum up to for 10/ 50/ 100 shards?

A single bandwidth request takes 6 bytes, with 100 shards the worst case would be 600 bytes per chunk, or 60kB per block. But the worst case is unlikely, usually a shard doesn't do bandwidth requests to all other shards.

> Is there any value in wrapping the bitmap in an Option or does that not affect the number of serialized bytes?

A bandwidth request with a zeroed-out bitmap wouldn't be included in the list of the shard's bandwidth requests, so there's no point in using an `Option`. The bitmap is always nonzero.

> Did you check if the serialized size is what you expect? I could imagine borsh rounding things up to the nearest 32bits for each field, in this case it may be worth to customize the serialization.

I hope there's no rounding, that would be terrible; borsh is supposed to be a one-to-one mapping between structs and serialized data.
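To make the "6 bytes per request" figure above concrete, here is an illustrative layout: one byte for the target shard plus a 40-bit (5-byte) bitmap with one bit per predefined value. The field names are assumptions of this sketch (the real struct lives in nearcore), and it assumes the `borsh` crate (1.x, with its `derive` feature) for serialization.

```rust
use borsh::{BorshDeserialize, BorshSerialize};

#[derive(BorshSerialize, BorshDeserialize)]
struct BandwidthRequestBitmap {
    data: [u8; 5], // 40 bits, one per entry in the predefined values list
}

#[derive(BorshSerialize, BorshDeserialize)]
struct BandwidthRequest {
    to_shard: u8,                             // the shard the bandwidth is requested towards
    requested_values: BandwidthRequestBitmap, // which predefined values are being requested
}

fn main() {
    let request = BandwidthRequest {
        to_shard: 3,
        requested_values: BandwidthRequestBitmap { data: [0b0000_0011, 0, 0, 0, 0] },
    };
    // Borsh writes integers and fixed-size arrays with no padding, so this is 1 + 5 = 6 bytes.
    let bytes = borsh::to_vec(&request).expect("serialization should not fail");
    assert_eq!(bytes.len(), 6);
}
```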
`base_bandwidth` of receipts, it can just send them out immediately. Actual bandwidth grants based on bandwidth request happen after granting the base bandwidth.

On current mainnet (with 6 shards) the base bandwidth is 61_139 (61kB)
Can you briefly explain if this number is constant or dependent on the number of shards? If the latter, at which point will it become too low to allow request-less traffic for some percentage of chunks?
Added a short note about the relation with the number of shards.

```
base_bandwidth = (max_shard_bandwidth - max_single_grant) / (num_shards - 1)
base_bandwidth = (4500000 - 4*1024*1024) / (num_shards - 1)

For 6 shards:   (4500000 - 4*1024*1024)/5  = 61139
For 50 shards:  (4500000 - 4*1024*1024)/49 = 6238
For 100 shards: (4500000 - 4*1024*1024)/99 = 3087
```

So with 100 shards a shard will be able to send at most 3kB of data to another shard without making a request. This isn't a lot, but it should be enough to send a few receipts.

An additional factor is that as the number of shards increases, the amount of receipts sent on each link should become lower, so a smaller base bandwidth should become less of an issue; it becomes smaller at the same rate as the number of receipts per link becomes smaller.

For larger numbers of shards we could revisit the parameters, for example decreasing `max_single_grant` and `max_receipt_size` to 2MB would greatly increase the budget for base bandwidth, same with increasing `max_shard_bandwidth`.
```rust
/// The maximum amount of data that a shard can send or receive at a single height.
pub max_shard_bandwidth: Bandwidth,
```
Just out of curiosity is there any fundamental reason to have the "max send" equal to "max receive"? I'm not suggesting to split it into two, just wondering.
No particular reason, it was the easiest to do and there was no need for something more complicated. I guess everything that is sent has to be received, so the amount of data should be similar, assuming equal load.
It's important to note that `size_upper_bound` is less than the difference between two consecutive values in `BandwidthRequestValues`. Thanks to this the requests are just as good as they would be if they were generated directly using individual receipt sizes.
It seems like there is some upper bound for how many trie reads are necessary to compute the requests. It can be guaranteed by early return once we exceed the max_shard_bandwidth and some minimum on the size of a single group. The latter isn't strictly enforced but I believe some emergent reasonable minimum still exists. Did I get it right?
That's right, `BandwidthRequest::make_from_receipt_sizes` reads the groups until it reaches a point where this much bandwidth can't be requested with the predefined values, and then it stops. This means that generating a bandwidth request will do at most `max_single_grant / group_size` trie reads, which is about 42 trie reads per outgoing buffer.
LGTM, thanks!
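For a rough sanity check of the "about 42 trie reads" bound discussed above, here is the arithmetic; the ~100 kB receipt-group size is an assumption back-derived from that figure, not a number quoted in this thread.

```rust
fn main() {
    let max_single_grant: u64 = 4_194_304; // 4 MiB
    let approx_group_size: u64 = 100_000;  // assumed receipt-group size, roughly 100 kB
    // Request generation stops once it has covered the largest possible grant,
    // so the number of group reads is bounded by this ratio.
    let max_trie_reads = max_single_grant / approx_group_size;
    println!("upper bound on trie reads per outgoing buffer: ~{max_trie_reads}"); // ~41
}
```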
neps/nep-0584.md
Outdated
[This technical section is required for Protocol proposals but optional for other categories. A draft implementation should demonstrate a minimal implementation that assists in understanding or implementing this proposal. Explain the design in sufficient detail that:

- Its interaction with other features is clear.
- Where possible, include a Minimum Viable Interface subsection expressing the required behavior and types in a target programming language. (ie. traits and structs for rust, interfaces and classes for javascript, function signatures and structs for c, etc.)
- It is reasonably clear how the feature would be implemented.
- Corner cases are dissected by example.
- For protocol changes: A link to a draft PR on nearcore that shows how it can be integrated in the current code. It should at least solve the key technical challenges.

The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.]
I think you can remove that
@near/wg-protocol Can you please fully read this NEP and comment in the thread if you are leaning towards approving or rejecting it? Please make sure to include your rationale and any feedback that you have for the author.
As a subject matter expert, I approve this NEP.
As a working group member I lean towards approving this proposal. It is detailed, well-written and solves an important problem; thanks for all your work on it! My only comment is that it would be nice to present some Byzantine analysis of the bandwidth scheduling sub-protocol. What happens if someone maliciously modifies their node to send malformed or otherwise incorrect bandwidth messages? It doesn't look to me at first glance like there is a serious issue here, but I still wanted to raise it for the designers to take a look at.
As a working group member I lean towards approving this proposal.
As a subject matter expert, I am leaning towards approving this NEP. Thanks Jan for working on this problem. The proposal is very well written. It's a great step towards solving a problem that we anticipate hitting quite quickly as our blockchain usage and number of shards grow. While there are specifics we can improve, revisit, and perhaps redesign over time, they are of minor importance as of the current status of the NEAR blockchain; examples include revisiting the limit calculations. The bandwidth scheduler is designed in a way that it is easily extendible and can be moulded in any direction we may see fit based on future requirements.
Bandwidth scheduler is run during chunk application and its results (the produced bandwidth requests) are stored in the chunk header, along with other things produced when applying a chunk. Chunk validators apply the chunk and verify that the produced data matches the data in the chunk header; they will not endorse a chunk if it doesn't match. The data from the previous chunk header (like previous bandwidth requests) can be trusted because the previous chunk headers were endorsed by chunk validators at the previous height. The logic is pretty much identical to … I forgot to describe that in the NEP. I can add a small section, but I'm not sure if it's ok to modify the NEP at this stage?
If it is something implemented but not described, I believe it can be added, taking into consideration the working group members' recommendation.
NEP Status (Updated by NEP Moderators)
Status: Voting
SME reviews:
Protocol Work Group voting indications (❔ | 👍 | 👎 ):