Lack of wire format validation #4879

Open · 4 tasks

alexpyattaev opened this issue Feb 8, 2025 · 2 comments
alexpyattaev commented Feb 8, 2025

Problem

  • Both gossip and turbine lack validation of wire format compatibility.
  • It is possible to subtly break compatibility, e.g. through version bumps of certain crates, and not notice it
  • frozen_abi does not completely address this since it does not cover all bytes in the packet
  • Having the ability to simply generate a bunch of packets opens up opportunities to better stress-test network-facing code (such as sigverify)

Some more context

Proposed Solution

  1. isolate the logic of validating & deserializing packets from the business logic into smaller subcrates, like gossip-wire-protocol or similar
  2. make it possible to build gossip-wire-serializer and gossip-wire-deserializer tools that take JSON and spit out pcap (and vice versa)
  3. store a bunch of valid packets captured from the ether on MNB (a few GB would do) and make sure these pass a round-trip (pcap -> json -> pcap) without any issues (a sketch of the round-trip core follows this list)
  4. if some packets do not pass, find out why and either fix the bugs or remove them from the curated set if they are indeed invalid
  5. store the JSON and pcap versions of the packets for checking future agave releases (and any other validator implementation)
  6. rinse and repeat for turbine/repair/TPU
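
A minimal sketch of the round-trip core behind steps 2 and 3, assuming the wire encoding is bincode and the message types derive the serde traits; the `Protocol` enum and the function names below are hypothetical stand-ins, not the actual agave types:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical stand-in for the real gossip wire type; the crate split
// from step 1 would export the real one from gossip-wire-protocol.
#[derive(Debug, PartialEq, Serialize, Deserialize)]
enum Protocol {
    Ping(u64),
    Pong(u64),
}

/// Deserializer direction of step 2: raw packet bytes -> JSON.
fn packet_to_json(bytes: &[u8]) -> anyhow::Result<String> {
    let msg: Protocol = bincode::deserialize(bytes)?;
    Ok(serde_json::to_string_pretty(&msg)?)
}

/// Serializer direction of step 2: JSON -> raw packet bytes.
fn json_to_packet(json: &str) -> anyhow::Result<Vec<u8>> {
    let msg: Protocol = serde_json::from_str(json)?;
    Ok(bincode::serialize(&msg)?)
}

/// Per-packet check from step 3: bytes -> JSON -> bytes must be lossless.
fn check_round_trip(bytes: &[u8]) -> anyhow::Result<()> {
    let rebuilt = json_to_packet(&packet_to_json(bytes)?)?;
    anyhow::ensure!(rebuilt == bytes, "wire format round-trip mismatch");
    Ok(())
}
```

The pcap handling from step 3 would wrap this: extract each UDP payload from the capture, run it through the check, and record any packet index that fails for step 4.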

Progress:

  • Gossip
  • Turbine
  • Repair
  • TPU

behzadnouri commented Feb 12, 2025

Discussed this on zoom/slack.
For the very first step of the task we need to create a repository of binary payloads captured across different software versions, protocols, packet-types, etc. Something like below:

v2.1.11/
├── gossip/
│   ├── crds_value/
│   │   ├── crds_value_00.bin
│   │   ├── crds_value_01.bin
│   │   ├── ...
│   ├── pull_request/
│   │   ├── pull_request_00.bin
│   │   ├── pull_request_01.bin
│   │   ├── ...
├── turbine/
│   ├── shred_00.bin
│   ├── shred_01.bin
│   ├── ...
...

These binary payloads are better:

  • captured from actual clusters (mainnet, testnet, gce clusters, etc) to be more representative of real distribution of packet payloads.
    • In particular, randomly generated payloads (e.g. CrdsValue::new_rand, etc.) will not (and are not intended to) provide representative packets.
  • Ideally we should try to capture payloads with maximum diversity (entropy?!) across each other (a selection sketch follows this list).
    • e.g. if we have already captured 100 shreds, the next captured shred should be one that is most different (in terms of bits and bytes or payload size) from the 100 already captured.
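
One rough way to realize this diversity heuristic is sketched below; the distance metric (Hamming distance over the common prefix plus a penalty for differing payload sizes) and the function names are my own choices, not an agreed-upon design:

```rust
/// Byte-level distance between two payloads: Hamming distance over the
/// common prefix plus a penalty for the difference in payload size, so
/// both content and length contribute to "how different" two packets are.
fn distance(a: &[u8], b: &[u8]) -> usize {
    let common = a.len().min(b.len());
    let hamming: usize = a[..common]
        .iter()
        .zip(&b[..common])
        .map(|(x, y)| (x ^ y).count_ones() as usize)
        .sum();
    hamming + 8 * a.len().abs_diff(b.len())
}

/// Greedy selection: keep a newly captured payload only if it is
/// sufficiently far from everything already in the corpus.
fn maybe_keep(corpus: &mut Vec<Vec<u8>>, candidate: &[u8], min_distance: usize) {
    if corpus.iter().all(|kept| distance(kept, candidate) >= min_distance) {
        corpus.push(candidate.to_vec());
    }
}
```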

Once we have such a repository, we can design what would be the best way to add tests to check that we are not breaking wire format. This could possibly be things like:

  • A nightly CI job, which downloads this repository and runs bytes -> deserialize -> serialize -> bytes round-trip checks for each payload, meaning that it verifies serialize(deserialize(bytes)) will return the exact same bytes.
    • This is crucial, in particular, in the context of "signatures". Signatures are often defined over the serialized "bytes". If we do not get the exact same bytes back from serialize(deserialize(bytes)), the signatures won't verify. (A sketch of this check follows this list.)
  • Maybe some unit-tests inside the modules can utilize this repository by fetching some of the payloads, doing deserialization and performing additional sanity checks.
    • In particular, doing these additional sanity checks in unit tests within modules avoids the need to expose implementation details or in-memory representations of types outside of the module, which would break encapsulation.
    • In-memory representations and implementation details are always subject to change, and different client implementations may handle them differently. So that is out of scope for wire-format validation.
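
As a concrete shape for the nightly round-trip check above, here is a sketch that walks one packet-type directory of the corpus and enforces serialize(deserialize(bytes)) == bytes; the directory layout follows the tree above, while the generic helper and the bincode encoding are assumptions on my part:

```rust
use serde::{de::DeserializeOwned, Serialize};
use std::{fs, path::Path};

/// The invariant enforced for every captured payload: deserializing and
/// re-serializing must reproduce the exact bytes, otherwise signatures
/// defined over the serialized bytes would no longer verify.
fn assert_round_trip<T: Serialize + DeserializeOwned>(bytes: &[u8]) {
    let value: T = bincode::deserialize(bytes).expect("payload must deserialize");
    let reserialized = bincode::serialize(&value).expect("value must reserialize");
    assert_eq!(reserialized, bytes, "wire format round-trip mismatch");
}

/// Walk one packet-type directory (e.g. v2.1.11/gossip/crds_value/) and
/// check every *.bin payload in it against type T.
fn check_corpus_dir<T: Serialize + DeserializeOwned>(dir: &Path) {
    for entry in fs::read_dir(dir).expect("corpus directory must exist") {
        let path = entry.expect("readable directory entry").path();
        if path.extension().and_then(|e| e.to_str()) == Some("bin") {
            let bytes = fs::read(&path).expect("readable payload file");
            assert_round_trip::<T>(&bytes);
        }
    }
}
```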

In terms of implementation

  • We need to coordinate with devops folks about where best to store this repository of binary payloads.
  • For now it is better to do the implementation on an off-master branch. That would allow us to iterate faster and would not add unnecessary noise or overhead to the prod code.
    • Once the code is more or less stable we can revisit where best to store it.
    • Ideally this would be a separate "anza-xyz/wire_spec" repository outside the monorepo (i.e. anza-xyz/agave). That would allow other non-agave clients to utilize the same tools, and minimize coupling with the agave client implementation.
  • It is important not to degrade or compromise prod code quality for testing purposes only.
    • e.g. exposing implementation details, private fields or functions, etc. breaks encapsulation and is best avoided. Anything not pub is best tested using unit tests within modules/crates.
    • Adding code noise or overhead to prod code paths is also best avoided.
  • The domain space for any non-trivial struct is effectively unbounded, e.g. there is a practically infinite number of valid (or invalid) gossip CrdsValues. Aiming for ~100% coverage is neither practical nor worth the effort, so it is out of scope for this task.

alexpyattaev (Author) commented

Thanks Behzad!
The current implementation is in
https://github.com/alexpyattaev/agave/tree/wire_format_gossip/wire_format
Some comments on the ideas:

need to create a repository of binary payloads captured across different software versions, protocols, packet-types, etc.

The current implementation generally follows this idea; I have not yet implemented the breakdown by software version (this is coming up). To improve capture speed I am currently grabbing all packets from the raw socket (not sorting by version), but sorting will definitely be possible once I make the gossip connection work correctly.

v2.1.11/
├── gossip/
│   ├── push.pcap
│   ├── pull_request.pcap
...
├── turbine/
│   ├── merkle.pcap
│   ├── repair.pcap
...

Storing packets in pcap has the advantage that we do not need to handle a million individual files. Each packet in a capture has an index, so if there are issues one can easily skip to the one that is needed.
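
For illustration, a sketch of how a consumer could skip to packet N in one of these captures; this assumes the Rust pcap crate (Capture::from_file / next_packet) and is not necessarily how the wire_format tooling in the linked branch reads the files:

```rust
use pcap::Capture;

/// Return the raw bytes of the n-th packet (0-based) in a capture file,
/// e.g. to reproduce a single failing round-trip without replaying the
/// whole dataset. The returned data still includes whatever link/IP/UDP
/// headers were written by the capture tool.
fn nth_packet(path: &str, n: usize) -> Option<Vec<u8>> {
    let mut capture = Capture::from_file(path).ok()?;
    let mut index = 0;
    while let Ok(packet) = capture.next_packet() {
        if index == n {
            return Some(packet.data.to_vec());
        }
        index += 1;
    }
    None
}
```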

The dataset for gossip already exists; it is about 1.5 GB with roughly 500-1500 packets of each type. There is logic to get good coverage of the different packets carrying Crds and Vec.

These binary payloads are better:

* captured from actual clusters (mainnet, testnet, gce clusters, etc) to be _more representative of real distribution of packet payloads_.

Captures are running on MNB (at least for now) as it has the widest variety of versions.

* Ideally we should try to capture payloads with maximum diversity (entropy?!) across each other.
  
  * e.g. if we have already captured 100 shreds, the next captured shred should be one that is most different (in terms of bits and bytes or payload size) from the 100 already captured.

The code currently sorts packets by category and ignores kinds of packets it already has. I'll later extend this logic to also select by the version number of the sending validator (so we can collect interesting packets from the dinosaurs on MNB).

Once we have such a repository, we can design what would be the best way to add tests to check that we are not breaking wire format. This could possibly be things like:

* A nightly CI job, which downloads this repository and runs `bytes -> deserialize -> serialize -> bytes` round-trip checks for each payload, meaning that it verifies `serialize(deserialize(bytes))` will return the exact same `bytes`.

That code is already done for gossip in the branch linked above. I am in discussion with @yihau about how to put this into CI.

* Maybe some _unit-tests inside the modules_ can utilize this repository by fetching some of the payloads, doing deserialization and performing additional sanity checks.
  * In particular, doing these additional sanity checks in unit tests within modules avoids the need to expose implementation details or in-memory representations of types outside of the module, which would break encapsulation.
  * _In-memory representations and implementation details are always subject to change, and different client implementations may handle them differently. So that is out of scope for wire-format validation._

Unit tests could definitely run through the packets, but that would require them to have access to the datasets (which are quite bulky). I do not think it would work well unless we only trigger these tests manually and cache the datasets on devboxes.
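
One possible shape for such manually triggered tests, assuming a locally cached dataset whose location is passed through a hypothetical WIRE_FORMAT_DATASET environment variable; the round-trip helper referenced in the comment is only the kind of function the wire-format crate would provide, not an existing API:

```rust
#[cfg(test)]
mod wire_format_dataset_tests {
    use std::{env, fs, path::PathBuf};

    /// Only runs when explicitly requested, e.g.
    /// `WIRE_FORMAT_DATASET=/data/wire_spec cargo test -- --ignored`,
    /// so regular `cargo test` runs never need the bulky dataset.
    #[test]
    #[ignore = "requires a locally cached wire-format dataset"]
    fn gossip_payloads_present_and_nonempty() {
        // WIRE_FORMAT_DATASET is a hypothetical variable name, not an
        // agreed-upon convention.
        let dataset: PathBuf = env::var("WIRE_FORMAT_DATASET")
            .expect("set WIRE_FORMAT_DATASET to the cached dataset path")
            .into();
        let mut checked = 0;
        for entry in fs::read_dir(dataset.join("gossip")).expect("dataset present") {
            let path = entry.expect("readable directory entry").path();
            if !path.is_file() {
                continue;
            }
            let bytes = fs::read(&path).expect("readable payload");
            assert!(!bytes.is_empty(), "empty payload in {}", path.display());
            // A real test would run the round-trip check here, e.g.
            // wire_format::check_round_trip(&bytes) (hypothetical API).
            checked += 1;
        }
        assert!(checked > 0, "no payloads found in the cached dataset");
    }
}
```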

In terms of implementation

* We need to coordinate with devops folks about where best to store this repository of binary payloads.

* For now it is better to do the implementation on an off-master branch. That would allow us to iterate faster and would not add unnecessary noise or overhead to the prod code.

This is already happening.

  * Once the code is more or less stable we can revisit where best to store it.
  * Ideally this would be a separate "anza-xyz/wire_spec" repository outside the monorepo (i.e. `anza-xyz/agave`). _That would allow other non-agave clients to utilize the same tools, and minimize coupling with the agave client implementation_.

A git repo (even with LFS) is probably not the best fit for this due to the sheer size. We are exploring options.

* It is important not to degrade or compromise prod code quality for testing purposes only.
  * e.g. exposing implementation details, private fields or functions, etc. breaks encapsulation and is best avoided. Anything not `pub` is best tested using unit tests within modules/crates.
  * Adding code noise or overhead to prod code paths is also best avoided.

The current implementation does patch some things to be pub in order to work at all, though not too many. 99% of the logic lives in a separate crate called wire-format. All of this pub-patching could, in principle, be gated behind a feature flag if we eventually want this merged into master.
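
For reference, a minimal sketch of how such a feature flag could gate the extra visibility; the feature name and module name are placeholders, not what the branch actually does:

```rust
// In the gossip crate's lib.rs: the wire-parsing internals stay
// crate-private in normal builds...
#[cfg(not(feature = "wire-format-testing"))]
pub(crate) mod protocol;

// ...and become public only when the (hypothetical) testing feature is
// enabled, so the external wire-format crate can reach them without a
// permanent loosening of encapsulation.
#[cfg(feature = "wire-format-testing")]
pub mod protocol;
```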
