[NEP-584]: Cross shard bandwidth scheduler #584
Conversation
neps/nep-0584.md
Outdated
Additionally, the value that is the closest to `max_receipt_size` is set to to `max_receipt_size`:
two 'to'?
neps/nep-0584.md
Outdated
The values are calculate using a linear interpolation between `base_bandwidth` and
'calculated'?
neps/nep-0584.md
Outdated
```rust
BandwidthSchedulerParams {
    base_bandwidth: 100000,
    max_shard_bandwidth: 4500000,
    max_receipt_size: 4194304,
    max_allowance: 4500000,
}
```
nit: clarify how these values are related to the fact that we are practicing the exercise with '4 shards'
Matej's comment about the NEP can be found here: 0ff27d1#commitcomment-151308824
As the moderator, I want to kickstart the review process for this NEP, as the change is part of the upcoming release. @jancionear, please comment once you believe this proposal is ready for SME review. @near/wg-protocol, could you help assign SMEs who can review the proposal? From an engineering perspective, we believe @shreyan-gupta and @wacban are good candidates. Thank you.
I think the NEP is ready for review. I addressed the issues found in Matej's review.
As a working group member, I nominate @shreyan-gupta and @wacban as SME reviewers.
Overall great design! I've left a couple of comments.
There is already a rudimentary solution in place, added together with stateless validation in [NEP-509](https://github.com/near/NEPs/blob/master/neps/nep-0509.md) to limit witness size.
Is this the solution we have implemented in congestion control, or does it pre-date that? I recall we do something similar in congestion control, where we round-robin which shard we are allowed to send more than the usual limit to.
It was added after congestion control. The allowed shard from congestion control has to be the same as the shard that is allowed to send more receipts to make sure that there are no liveness issues. If they were different we could have a situation where congestion control allows one shard to send receipts, but it's not the shard that can send large receipts and the large receipts could get stuck.
The idea of "allowed shard" from original congestion control was extended to also mean the shard that can send more bytes of receipts.
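For intuition, here is a minimal sketch of how a height-based round-robin "allowed shard" could work. The function name and the exact formula are illustrative assumptions made for this sketch; the real selection logic lives in nearcore's congestion control code.

```rust
/// Illustrative only: rotate which sender shard gets the extra allowance at each height,
/// so every shard eventually gets a turn and large receipts cannot get stuck forever.
fn allowed_sender_shard(block_height: u64, receiver_shard: u64, num_shards: u64) -> u64 {
    // Offsetting by the receiver shard spreads the grants, so different receivers
    // allow different senders at the same height (an assumption of this sketch).
    (block_height + receiver_shard) % num_shards
}

fn main() {
    let num_shards = 4;
    for height in 100..103u64 {
        let allowed: Vec<u64> = (0..num_shards)
            .map(|receiver| allowed_sender_shard(height, receiver, num_shards))
            .collect();
        println!("height {height}: allowed sender per receiver shard = {allowed:?}");
    }
}
```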
Let's take a look at an example. Let's say that the predefined list of values that can be requested is:
This setup works well when we are assuming 4 MB size limit. If in the future we would like to increase the size limit to some other number, how would we change the predefined list?
When changing the limits we would have to take another look at `max_shard_bandwidth`, `max_single_grant` and `base_bandwidth`, see if they still make sense, and adjust as necessary.
If we wanted to lower the receipt size limit to 2MB, we could either keep `max_single_grant` as is or make it smaller to increase the `base_bandwidth`. There's no golden rule, it depends on each case.
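For a concrete feel of that trade-off, here is a rough back-of-the-envelope sketch using the `base_bandwidth` formula quoted later in this thread; the 2MB `max_single_grant` value is a hypothetical, not a proposed change.

```rust
fn main() {
    // base_bandwidth = (max_shard_bandwidth - max_single_grant) / (num_shards - 1),
    // the formula given further down in this thread.
    let max_shard_bandwidth: u64 = 4_500_000;
    let num_shards: u64 = 6;

    for max_single_grant in [4 * 1024 * 1024u64, 2 * 1024 * 1024u64] {
        let base_bandwidth = (max_shard_bandwidth - max_single_grant) / (num_shards - 1);
        println!("max_single_grant = {max_single_grant} -> base_bandwidth = {base_bandwidth}");
    }
    // With 6 shards this prints roughly 61_139 for a 4MB grant and 480_569 for a 2MB grant,
    // so halving the largest possible grant frees a lot of budget for base bandwidth.
}
```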
neps/nep-0584.md
Outdated
```rust
max_shard_bandwidth = 4_500_000;
max_single_grant = 4194304
```
nit: 4_194_304 for readability
of shards is low. There are some tests which have a low number of shards, and having a lower base bandwidth allows us to fully test the bandwidth scheduler in those tests.
Is the 100 KB limit only introduced for these tests or do they potentially hold some importance in mainnet as well?
Currently this only matters in tests; with the current parameters and 6 shards the base bandwidth is ~60kB, and it'll only get smaller as the number of shards increases.
Making `max_shard_bandwidth` larger or `max_single_grant` smaller in the future could make the base bandwidth larger than 100kB, in which case we'll have to reevaluate all the parameters.
Yeah, I guess it shouldn't matter too much given the number of shards can not decrease
neps/nep-0584.md
Outdated
`max_single_grant`, like this:

```rust
values[-1] = base_bandwidth
```
nit: Might help here to just add a quick comment saying `value[-1]` is the theoretical index -1. This means if `BandwidthRequestValues` are empty, we would be requesting `base_bandwidth`.
I initially mistook this for the python indexing where `value[-1]` means the last element of the array.
> I initially mistook this for the python indexing where value[-1] means the last element of the array.
Damn python strikes again x.x
Added a comment to make it clearer
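To tie the interpolation, the `values[-1] == base_bandwidth` convention and the `max_receipt_size` snapping together, here is a rough sketch of how the predefined request values could be computed. The count of 40 values, the choice of upper endpoint `max_value` for the interpolation (the quoted excerpt cuts off before naming it), and the integer rounding are assumptions of this sketch, not the exact nearcore code.

```rust
/// Sketch: build the predefined list of requestable values.
/// The virtual `values[-1]` is `base_bandwidth` (what a shard gets without setting any bit),
/// and the value closest to `max_receipt_size` is snapped to exactly `max_receipt_size`
/// so that a full-sized receipt can always be requested.
fn request_values(base_bandwidth: u64, max_value: u64, max_receipt_size: u64) -> Vec<u64> {
    const NUM_VALUES: u64 = 40; // one value per bit of the request bitmap (assumed here)
    let mut values: Vec<u64> = (1..=NUM_VALUES)
        .map(|i| base_bandwidth + (max_value - base_bandwidth) * i / NUM_VALUES)
        .collect();
    if let Some(closest) = values.iter_mut().min_by_key(|v| (**v).abs_diff(max_receipt_size)) {
        *closest = max_receipt_size;
    }
    values
}

fn main() {
    let values = request_values(100_000, 4_500_000, 4_194_304);
    println!("first = {}, last = {}", values[0], values[values.len() - 1]);
}
```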
```rust
let mut sanity_check_bytes = Vec::new();
sanity_check_bytes.extend_from_slice(scheduler_state.sanity_check_hash.as_ref());
sanity_check_bytes.extend_from_slice(CryptoHash::hash_borsh(&all_shards).as_ref());
```
Is `all_shards` just the list of all shards in the current shard layout? Sounds like the `sanity_check_bytes` just basically confirms the hash as per the block_height? Shouldn't we try to include the link_allowances into the hash as well?
Yeah, ideally we should hash the whole `BandwidthSchedulerState` in the hash, but that could potentially be large (~60kB?), which could affect performance. I didn't include the large field just to be safe. But maybe it wouldn't be that bad? 🤔
Yeah, maybe later we can revisit what is being stored in `sanity_check_bytes`. The current info may not be too useful to us in case there are some inconsistencies related to the allowance calculation across shards/nodes.
Perhaps there may be a cleverer way of getting a digest for the state.
tens of kilobytes of data, which could take a bit of cpu time, so it's not done. The sanity check still checks that all shards ran the algorithm the same number of times and with the same shards.

A new trie column is introduced to keep the scheduler state:
nit: It'll be nice to explicitly define the new columns introduced with the groups as well in the section for groups, i.e. `BUFFERED_RECEIPT_GROUPS_QUEUE_DATA` and `BUFFERED_RECEIPT_GROUPS_QUEUE_ITEM`.
because of the gas limit enforced by congestion control. This is not ideal, in the future we might consider merging these two algorithms into one better algorithm, but it is good enough for now.
It's great that this point is mentioned here. For now it seems like Congestion Control and Bandwidth Scheduler act independently and both their restrictions are placed on the outgoing receipts. While Congestion Control currently deals with gas limits, Bandwidth Scheduler as of the current implementation is limited to the receipt size.
It definitely makes sense to try to merge the efforts from both these designs to provide a consistent view: a single way to manage outgoing receipts. Congestion Control can be extended to Gas limits as well.
neps/nep-0584.md
Outdated
scheduler will work quicker than that.

The current version of the scheduler should work fine up to 50-100 shards, after that we'll probably need to some modifications. A quick solution would be to randomly choose half of the shards at every
nit: remove the `to` here
neps/nep-0584.md
Outdated
- https://github.com/near/nearcore/pull/12728
- https://github.com/near/nearcore/pull/12747

TODO - am I supposed to copy the code here? I think that a proper "minimal reference implementation"
I guess in this case it should be fine to keep what we have here. Maybe including the title of the PR would help a lot.
> Maybe including the title of the PR would help a lot.
Great idea, will add
I'm half way there, so far so good! I'll try to finish by eod tomorrow.
note to self - pick up at BandwidthScheduler
neps/nep-0584.md
Outdated
heard NEAR DA was moving to a design that doesn't require a lot of cross-shard bandwidth.
- High latency and bad scalability. A big receipt has to wait for up to `num_shards` heights before it can be sent. This is much higher than it could be, with bandwidth scheduler a receipt never has to wait more than one height (assuming that aren't shards aren't sending much). Even worse is that
> (assuming that aren't shards aren't sending much)
words are not wording right ;)
It's important to keep the size of `BandwidthRequest` small because bandwidth requests are included in the chunk header, and the chunk header shouldn't be too large.
Just for my information, what would it sum up to for 10/ 50/ 100 shards?
Is there any value in wrapping the bitmap in an Option or does that not affect the number of serialized bytes?
Did you check if the serialized size is what you expect? I could imagine borsh rounding things up to the nearest 32bits for each field, in this case it may be worth to customize the serialization.
> Just for my information, what would it sum up to for 10/ 50/ 100 shards?

A single bandwidth request takes 6 bytes, with 100 shards the worst case would be 600 bytes per chunk, or 60kB per block. But the worst case is unlikely, usually a shard doesn't do bandwidth requests to all other shards.

> Is there any value in wrapping the bitmap in an Option or does that not affect the number of serialized bytes?

A bandwidth request with a zeroed-out bitmap wouldn't be included in the list of the shard's bandwidth requests, so there's no point in using an `Option`. The bitmap is always nonzero.

> Did you check if the serialized size is what you expect? I could imagine borsh rounding things up to the nearest 32bits for each field, in this case it may be worth to customize the serialization.

I hope there's no rounding, that would be terrible; borsh is supposed to be a one-to-one mapping between structs and serialized data.
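To make the "6 bytes per request" figure above concrete, here is an illustrative layout: one byte for the target shard plus a 40-bit (5-byte) bitmap with one bit per predefined value. The field names are assumptions of this sketch (the real struct lives in nearcore), and it assumes the `borsh` crate (1.x, with its `derive` feature) for serialization.

```rust
use borsh::{BorshDeserialize, BorshSerialize};

#[derive(BorshSerialize, BorshDeserialize)]
struct BandwidthRequestBitmap {
    data: [u8; 5], // 40 bits, one per entry in the predefined values list
}

#[derive(BorshSerialize, BorshDeserialize)]
struct BandwidthRequest {
    to_shard: u8,                             // the shard the bandwidth is requested towards
    requested_values: BandwidthRequestBitmap, // which predefined values are being requested
}

fn main() {
    let request = BandwidthRequest {
        to_shard: 3,
        requested_values: BandwidthRequestBitmap { data: [0b0000_0011, 0, 0, 0, 0] },
    };
    // Borsh writes integers and fixed-size arrays with no padding, so this is 1 + 5 = 6 bytes.
    let bytes = borsh::to_vec(&request).expect("serialization should not fail");
    assert_eq!(bytes.len(), 6);
}
```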
`base_bandwidth` of receipts, it can just send them out immediately. Actual bandwidth grants based on bandwidth request happen after granting the base bandwidth.

On current mainnet (with 6 shards) the base bandwidth is 61_139 (61kB)
Can you briefly explain if this number is constant or dependent on the number of shards? If the latter, at which point will it become too low to allow request-less traffic for some percentage of chunks?
Added a short note about the relation with the number of shards.

```
base_bandwidth = (max_shard_bandwidth - max_single_grant) / (num_shards - 1)
base_bandwidth = (4500000 - 4*1024*1024) / (num_shards - 1)

For 6 shards:   (4500000 - 4*1024*1024)/5  = 61139
For 50 shards:  (4500000 - 4*1024*1024)/49 = 6238
For 100 shards: (4500000 - 4*1024*1024)/99 = 3087
```

So with 100 shards a shard will be able to send at most 3kB of data to another shard without making a request. This isn't a lot, but it should be enough to send a few receipts.

An additional factor is that as the number of shards increases, the amount of receipts sent on each link should become lower, so a smaller base bandwidth should become less of an issue; it becomes smaller at the same rate as the number of receipts per link becomes smaller.

For larger numbers of shards we could revisit the parameters, for example decreasing `max_single_grant` and `max_receipt_size` to 2MB would greatly increase the budget for base bandwidth, same with increasing `max_shard_bandwidth`.
```rust
/// The maximum amount of data that a shard can send or receive at a single height.
pub max_shard_bandwidth: Bandwidth,
```
Just out of curiosity is there any fundamental reason to have the "max send" equal to "max receive"? I'm not suggesting to split it into two, just wondering.
No particular reason, it was the easiest to do and there was no need for something more complicated. I guess everything that is sent has to be received, so the amount of data should be similar, assuming equal load.
It's important to note that `size_upper_bound` is less than the difference between two consecutive values in `BandwidthRequestValues`. Thanks to this the requests are just as good as they would be if they were generated directly using individual receipt sizes.
It seems like there is some upper bound for how many trie reads are necessary to compute the requests. It can be guaranteed by early return once we exceed the max_shard_bandwidth and some minimum on the size of a single group. The latter isn't strictly enforced but I believe some emergent reasonable minimum still exists. Did I get it right?
That's right, `BandwidthRequest::make_from_receipt_sizes` reads the groups until it reaches a point where this much bandwidth can't be requested with the predefined values, and then it stops. This means that generating a bandwidth request will do at most `max_single_grant / group_size` trie reads, which is about 42 trie reads per outgoing buffer.
LGTM, thanks!
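For a rough sanity check of the "about 42 trie reads" bound discussed above, here is the arithmetic; the ~100 kB receipt-group size is an assumption back-derived from that figure, not a number quoted in this thread.

```rust
fn main() {
    let max_single_grant: u64 = 4_194_304; // 4 MiB
    let approx_group_size: u64 = 100_000;  // assumed receipt-group size, roughly 100 kB
    // Request generation stops once it has covered the largest possible grant,
    // so the number of group reads is bounded by this ratio.
    let max_trie_reads = max_single_grant / approx_group_size;
    println!("upper bound on trie reads per outgoing buffer: ~{max_trie_reads}"); // ~41
}
```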
neps/nep-0584.md
Outdated
[This technical section is required for Protocol proposals but optional for other categories. A draft implementation should demonstrate a minimal implementation that assists in understanding or implementing this proposal. Explain the design in sufficient detail that:

- Its interaction with other features is clear.
- Where possible, include a Minimum Viable Interface subsection expressing the required behavior and types in a target programming language. (ie. traits and structs for rust, interfaces and classes for javascript, function signatures and structs for c, etc.)
- It is reasonably clear how the feature would be implemented.
- Corner cases are dissected by example.
- For protocol changes: A link to a draft PR on nearcore that shows how it can be integrated in the current code. It should at least solve the key technical challenges.

The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.]
I think you can remove that
@near/wg-protocol Can you please fully read this NEP and comment in the thread if you are leaning towards approving or rejecting it? Please make sure to include your rationale and any feedback that you have for the author.
As a subject matter expert, I approve this NEP.
As a working group member I lean towards approving this proposal. It is detailed, well-written and solves an important problem; thanks for all your work on it! My only comment is that it would be nice to present some Byzantine analysis of the bandwidth scheduling sub-protocol. What happens if someone maliciously modifies their node to send malformed or otherwise incorrect bandwidth messages? It doesn't look to me at first glance like there is a serious issue here, but I still wanted to raise it for the designers to take a look at.
As a working group member I lean towards approving this proposal.
As a subject matter expert, I am leaning towards approving this NEP. Thanks Jan for working on this problem. The proposal is very well written. It's a great step towards solving a problem that we anticipate hitting quite quickly as our blockchain usage and number of shards grow. While there are specifics we can improve, revisit, and perhaps redesign over time, they are of minor importance as of the current status of the NEAR blockchain; examples include revisiting the limit calculations. The bandwidth scheduler is designed in a way that it is easily extendible and can be moulded in any direction we may see fit based on future requirements.
Bandwidth scheduler is run during chunk application and its results (the produced bandwidth requests) are stored in the chunk header, along with other things produced when applying a chunk. Chunk validators apply the chunk and verify that the produced data matches the data in the chunk header; they will not endorse a chunk if it doesn't match. The data from the previous chunk header (like previous bandwidth requests) can be trusted because the previous chunk headers were endorsed by chunk validators at the previous height. The logic is pretty much identical to … I forgot to describe that in the NEP. I can add a small section, but I'm not sure if it's ok to modify the NEP at this stage?
If it is something implemented but not described, I believe it can be added, taking into consideration the working group members' recommendation.
NEP Status (Updated by NEP Moderators)
Status: Voting
SME reviews:
Protocol Work Group voting indications (❔ | 👍 | 👎 ):