feat: implement secure node join flow #924

Merged 1 commit on Feb 14, 2025

Conversation


@Unix4ever Unix4ever commented Feb 11, 2025

Fixes: #840

This PR substantially changes the Talos machine join flow:

  • A newly joined machine is first put into a limbo state in which Omni creates a temporary WireGuard connection to it.
  • The controller picks it up and tries to write a unique machine token to the newly joined machine. In the meantime, it also resolves UUID conflicts automatically and writes a UUID override to the META partition.
  • The machine re-joins Omni, now with the unique token.
  • The unique token is saved in the siderolink.Link resource, and any subsequent join checks that the siderolink.Link has a matching unique token.
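
The flow above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `JoinResult`, `linkTokens` and `Join` are hypothetical names, the map stands in for the `siderolink.Link` resources, and the sketch assumes that the first token presented by a machine without a saved token is the one that was written to it.

```go
package main

import "crypto/subtle"

// JoinResult is a hypothetical outcome of a join attempt (not an Omni type).
type JoinResult int

const (
	JoinLimbo    JoinResult = iota // no token yet: temporary connection only
	JoinAccepted                   // token matches the saved one, or was just saved
	JoinRejected                   // token is missing or differs from the saved one
)

// linkTokens is a simplified stand-in for the unique tokens saved in
// siderolink.Link resources, keyed by machine ID.
type linkTokens map[string]string

// Join decides what to do with a machine that presents the given token.
func (l linkTokens) Join(machineID, token string) JoinResult {
	saved, linked := l[machineID]

	switch {
	case !linked && token == "":
		// First contact: the machine waits in limbo until a token is written to it.
		return JoinLimbo
	case !linked:
		// Re-join with the freshly written token: save it for later checks.
		l[machineID] = token
		return JoinAccepted
	case subtle.ConstantTimeCompare([]byte(saved), []byte(token)) == 1:
		// Subsequent join with the matching token.
		return JoinAccepted
	default:
		return JoinRejected
	}
}
```

In this sketch a machine that already has a saved token is rejected when it presents a different or empty token, which mirrors the "matching unique token" check described above.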

The Siderolink manager was refactored because it was a large, monolithic, poorly testable chunk of code. It is now split into:

  • LinkStatus controller, which creates and removes WireGuard peers.
  • PendingMachineStatus controller, which ensures that all joined machines have unique node tokens.
  • Provision handler, which implements the gRPC server and now holds all logic related to machine acceptance.
  • PeersPool, which is used by the LinkStatus controllers. It deduplicates peer creation and reuses peers when possible.
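
One way to picture the deduplication done by a pool like PeersPool is reference counting per public key. The sketch below is an assumption about the general technique, not the API introduced by this PR: `Peer`, `NewPeersPool`, `Acquire` and `Release` are hypothetical names.

```go
package main

import "sync"

// Peer is a hypothetical stand-in for a WireGuard peer entry.
type Peer struct {
	PublicKey string
	refs      int // number of callers currently using this peer
}

// PeersPool hands out one shared Peer per public key.
type PeersPool struct {
	mu    sync.Mutex
	peers map[string]*Peer
}

// NewPeersPool returns an empty pool.
func NewPeersPool() *PeersPool {
	return &PeersPool{peers: map[string]*Peer{}}
}

// Acquire returns the existing peer for publicKey, creating it only if
// no caller holds one yet.
func (p *PeersPool) Acquire(publicKey string) *Peer {
	p.mu.Lock()
	defer p.mu.Unlock()

	peer, ok := p.peers[publicKey]
	if !ok {
		peer = &Peer{PublicKey: publicKey}
		p.peers[publicKey] = peer
	}

	peer.refs++

	return peer
}

// Release drops one reference and reports whether the peer was removed
// from the pool because nobody uses it anymore.
func (p *PeersPool) Release(publicKey string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()

	peer, ok := p.peers[publicKey]
	if !ok {
		return false
	}

	peer.refs--
	if peer.refs > 0 {
		return false
	}

	delete(p.peers, publicKey)

	return true
}
```

With this shape, two controllers that need the same peer share one entry, and the peer is only torn down when the last user releases it.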

Additionally, the siderolink loghandler was updated to reject logger connections from machines which do not have corresponding log buffers.
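
A minimal sketch of that check, assuming log buffers are tracked per machine ID. `logBuffers`, `HandleLine` and `ErrNoLogBuffer` are hypothetical names used only for illustration.

```go
package main

import "errors"

// ErrNoLogBuffer is returned for machines that have no log buffer.
var ErrNoLogBuffer = errors.New("no log buffer for machine")

// logBuffers is a simplified stand-in for per-machine log storage.
type logBuffers map[string][]string

// HandleLine stores a log line only if the machine already has a buffer.
func (b logBuffers) HandleLine(machineID, line string) error {
	if _, ok := b[machineID]; !ok {
		// Unknown machine: refuse the connection instead of creating a buffer.
		return ErrNoLogBuffer
	}

	b[machineID] = append(b[machineID], line)

	return nil
}
```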

Nodes which do not support the secure flow are still able to join by default.
The secure join flow can be forced by setting the --disable-legacy-join-tokens flag.
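
As a rough illustration of how such a switch could gate legacy joins, the sketch below parses the flag with Go's standard `flag` package. Only the flag name comes from this PR. `legacyJoinAllowed` and the flag set name are hypothetical, and Omni's real flag wiring may differ.

```go
package main

import "flag"

// legacyJoinAllowed reports whether machines without secure join support
// may still join, based on the given command-line arguments.
func legacyJoinAllowed(args []string) (bool, error) {
	fs := flag.NewFlagSet("omni-sketch", flag.ContinueOnError)
	disable := fs.Bool(
		"disable-legacy-join-tokens",
		false, // legacy joins stay allowed unless the flag is set
		"force the secure join flow for all machines",
	)

	if err := fs.Parse(args); err != nil {
		return false, err
	}

	return !*disable, nil
}
```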

@Unix4ever Unix4ever added the integration/e2e Triggers all e2e tests for Omni label Feb 11, 2025
@Unix4ever Unix4ever force-pushed the implement-secure-join-tokens branch 2 times, most recently from 0f385a4 to 75734bf Compare February 12, 2025 08:57
message LinkStatusSpec {
  string node_subnet = 1;
  string node_public_key = 2;
  string virtual_addrport = 3;

Member

Why virtual?

Member Author

That's for WireGuard over gRPC. We keep it there to track when it is updated.

Comment on lines +174 to +189
func getClient(
	ctx context.Context,
	r controller.Reader,
	pendingMachine *siderolink.PendingMachine,
) (*client.Client, error) {

Member

We have more or less the same code in many places, I think. Should we consider moving it to a central place?

Member Author

I've extracted it and reused it in the place I copy-pasted it from. The other places are slightly different.

@Unix4ever Unix4ever force-pushed the implement-secure-join-tokens branch 4 times, most recently from 0a4f343 to f72cfa6 Compare February 12, 2025 18:16
@Unix4ever Unix4ever force-pushed the implement-secure-join-tokens branch from f72cfa6 to 45ba1e5 Compare February 14, 2025 15:48
Fixes: siderolabs#840

This PR substantially changes the Talos machine join flow:

- A newly joined machine is first put into a limbo state in which Omni
  creates a temporary WireGuard connection to it.
- The controller picks it up and tries to write a unique machine token
  to the newly joined machine. In the meantime, it also resolves UUID
  conflicts automatically and writes a UUID override to the META
  partition.
- The machine re-joins Omni, now with the unique token.
- The unique token is saved in the `siderolink.Link` resource, and any
  subsequent join checks that the `siderolink.Link` has a matching
  unique token.

The Siderolink manager was refactored because it was a large,
monolithic, poorly testable chunk of code. It is now split into:
- LinkStatus controller, which creates and removes WireGuard peers.
- PendingMachineStatus controller, which ensures that all joined
  machines have unique node tokens.
- Provision handler, which implements the gRPC server and now holds all
  logic related to machine acceptance.
- PeersPool, which is used by the LinkStatus controllers. It
  deduplicates peer creation and reuses peers when possible.

Additionally, the siderolink loghandler was updated to reject logger
connections from machines which do not have corresponding log buffers.

Nodes which do not support the secure flow are still able to join by
default.
The secure join flow can be forced by setting the
`--disable-legacy-join-tokens` flag.

Signed-off-by: Artem Chernyshev <[email protected]>
@Unix4ever Unix4ever force-pushed the implement-secure-join-tokens branch from 45ba1e5 to 9bb85f8 Compare February 14, 2025 16:13
@Unix4ever
Member Author

/m

@talos-bot talos-bot merged commit 9bb85f8 into siderolabs:main Feb 14, 2025
22 checks passed
Labels
integration/e2e Triggers all e2e tests for Omni
Development

Successfully merging this pull request may close these issues.

Put the machine into the limbo state when it join Omni for the first time
3 participants