Add support for clickhouse keeper inventory collection #6810

Merged
andrewjstone merged 5 commits into main from clickhouse-inventory-collection on Oct 11, 2024

Conversation

andrewjstone
Contributor

I made a small change to the types stored in inventory so that we don't collect OmicronZoneUuids for each keeper. This allows collection based on DNS names alone. The zone IDs were never actually used for anything anyway.

Note that this won't collect anything right now, because we don't yet have any DNS names for clickhouse-admin-keepers.

@andrewjstone
Contributor Author

Fixes #6578

```diff
@@ -74,7 +77,7 @@ impl<'a> Collector<'a> {

     /// Collect inventory from all MGS instances
     async fn collect_all_mgs(&mut self) {
-        let clients = self.mgs_clients.clone();
+        let clients = std::mem::take(&mut self.mgs_clients);
```
Contributor Author

I removed the Arc because it seemed unnecessary, then had to make this change because of borrow-checker rules. It's a cheap memcpy of a few pointers, though.
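For illustration, here is a minimal sketch of the borrow-check issue (hypothetical types, synchronous for brevity, not the real Collector): holding a reference to `self.mgs_clients` while calling `&mut self` methods doesn't compile, but moving the Vec out with `std::mem::take` does.

```rust
// Minimal sketch, not the real omicron Collector: `String` stands in for the
// MGS client type and the method is synchronous for brevity.
#[derive(Default)]
struct Collector {
    mgs_clients: Vec<String>,
    results: Vec<String>,
}

impl Collector {
    fn collect_all_mgs(&mut self) {
        // `let clients = &self.mgs_clients;` would not compile here: the loop
        // body needs `&mut self` while `clients` still borrows `self`
        // immutably. Taking the Vec out of `self` is a cheap move of a few
        // pointers (ptr/len/capacity).
        let clients = std::mem::take(&mut self.mgs_clients);
        for client in &clients {
            self.collect_one_mgs(client);
        }
    }

    fn collect_one_mgs(&mut self, client: &str) {
        self.results.push(format!("collected from {client}"));
    }
}

fn main() {
    let mut collector = Collector {
        mgs_clients: vec!["mgs0".into(), "mgs1".into()],
        ..Default::default()
    };
    collector.collect_all_mgs();
    println!("{:?}", collector.results);
}
```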

Contributor

nit: hmm, in this case, rather than being stored as a member on the struct could the list of clients be passed in as an argument, or maybe a child struct be made which has collect_all_mgs as a self method? That would indicate in the type system that all the clients have been consumed.

Comment on lines 69 to +71:

```rust
self.collect_all_mgs().await;
self.collect_all_sled_agents().await;
self.collect_all_keepers().await;
```
Contributor

Should these be run concurrently with tokio::join!?

Contributor Author

Probably, yes. Looking at the inventory collection, all of it is 100% serial. I think it all should be made concurrent. I opened an issue for this.
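To illustrate the tokio::join! idea, a rough standalone sketch (hypothetical free functions, not the omicron methods, which take &mut self and would need to return their results or borrow disjoint state before they could be joined like this):

```rust
// Sketch only: three independent collection steps awaited concurrently.
// The function bodies are stand-ins for real network calls.

async fn collect_all_mgs() -> Vec<String> {
    vec!["mgs data".to_string()]
}

async fn collect_all_sled_agents() -> Vec<String> {
    vec!["sled-agent data".to_string()]
}

async fn collect_all_keepers() -> Vec<String> {
    vec!["keeper membership".to_string()]
}

#[tokio::main]
async fn main() {
    // All three futures make progress concurrently; join! waits for all of them.
    let (mgs, sleds, keepers) = tokio::join!(
        collect_all_mgs(),
        collect_all_sled_agents(),
        collect_all_keepers()
    );
    println!("{mgs:?} {sleds:?} {keepers:?}");
}
```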

Contributor Author

There is actually a comment regarding keeping things serial: https://github.com/oxidecomputer/omicron/blob/main/nexus/inventory/src/collector.rs#L58-L63

I think instead we should probably do it in parallel and limit the concurrency with buffer_unordered or something.

Contributor

Yeah, limiting concurrency is a bit tricky. (You probably want all of the methods to produce iterators over futures, then operate on them via buffer-unordered. Or use an async semaphore I guess.)

I think where it would benefit us is if more than one of the futures times out -- in that case (assuming the timeout is 60s) we'd spend ~60s rather than 120s+.
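A rough sketch of that approach (hypothetical targets and fetch function, not the omicron collector; assumes the futures and tokio crates): build a stream of futures and keep at most N of them in flight with buffer_unordered.

```rust
// Sketch only: limit the number of in-flight requests with buffer_unordered.
use futures::stream::{self, StreamExt};

async fn fetch(target: &str) -> Result<String, String> {
    // Stand-in for an HTTP call to one MGS / sled-agent / keeper-admin endpoint.
    Ok(format!("inventory from {target}"))
}

#[tokio::main]
async fn main() {
    let targets = vec!["mgs0", "sled0", "sled1", "keeper0", "keeper1"];

    // Turn each target into a future, then poll at most 4 of them at a time.
    let results: Vec<_> = stream::iter(targets)
        .map(|t| fetch(t))
        .buffer_unordered(4)
        .collect()
        .await;

    println!("{results:?}");
}
```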

Contributor Author

> Yeah, limiting concurrency is a bit tricky. (You probably want all of the methods to produce iterators over futures, then operate on them via buffer-unordered. Or use an async semaphore I guess.)

I actually think we should do something like this. It's more complicated, but we can't be blocking inventory collections for multiple timeouts. We could easily exceed the time it takes between inventory collections.

Contributor Author

FWIW, we have the same problem with blueprint execution for subqueries to systems that may not be online. For clickhouse-admin interactions they are all in parallel, but nothing else is.

```diff
@@ -124,7 +123,7 @@ pub struct Collection {
     /// cluster as returned from each available keeper via `clickhouse-admin` in
     /// the `ClickhouseKeeper` zone
```
Contributor

Everything else here is a map -- might be worth adding a short explanation for why this is keyed by the membership itself rather than something else (like the zone ID that it previously was).

@andrewjstone
Contributor Author, Oct 10, 2024

We don't actually know the zone ID when we look up the address via DNS, so using that as a key was problematic. However, we could destructure the returned value and key it by KeeperId, but then we'd end up putting it back together again so we can store it as a value in the blueprint. I think that's pretty tedious and doesn't buy us anything. And if there were ever a duplicate for some reason with slightly different values, we'd end up overwriting earlier versions with later versions, which could hinder debugging. I guess that latter point leans somewhat towards this being a Vec instead.

I could change how we access the clickhouse-admin service: rather than looking it up in DNS, do a DB query instead and then map by zone. But then we'd end up simply discarding the zone ID downstream, since there's no use for it.

I'm happy to add a small comment, but I'm really unsure what to put there. I can put what I said here, but that seems somewhat unsatisfying and I'm unsure of the value.
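For context, roughly the shape of the change being discussed (field and type names are approximated from this thread, not copied from the code): the membership used to be keyed by the zone ID that reported it, and now the membership value itself acts as the key.

```rust
// Approximate sketch of the inventory field under discussion; names are
// illustrative and the stand-in ID types are not the real omicron types.
#![allow(dead_code)]
use std::collections::{BTreeMap, BTreeSet};

type OmicronZoneUuid = u128; // stand-in for the real zone UUID type
type KeeperId = u64; // stand-in for the real keeper ID type

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct ClickhouseKeeperClusterMembership {
    queried_keeper: KeeperId,
    leader_committed_log_index: u64,
    raft_config: BTreeSet<KeeperId>,
}

// Before: keyed by the zone that was queried, which isn't known when the
// keeper-admin address comes from DNS.
struct CollectionBefore {
    clickhouse_keeper_cluster_membership:
        BTreeMap<OmicronZoneUuid, ClickhouseKeeperClusterMembership>,
}

// After: the membership itself is the key, so responses that differ even
// slightly are all retained rather than overwriting one another.
struct CollectionAfter {
    clickhouse_keeper_cluster_membership: BTreeSet<ClickhouseKeeperClusterMembership>,
}

fn main() {}
```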

Contributor

I think what you said here is very valuable! Either the full comment, or a shortened form with a link to this comment would be great.

Contributor

Could the clickhouse admin server return the zone ID from the endpoint we hit? For cockroach-admin we have sled-agent pass in the zone ID, so it can return it from (one of) the endpoints for similar reasons.

Contributor Author

Is it actually useful to do that though? The mapping from keeper id to zone id already exists in the blueprints.

Contributor Author

I added a long comment detailing how this is used and how the reconfigurator should prevent any duplicates. We could cache and use the zone IDs for keys, but I'm not sure that provides much in terms of safety over what is already there.

@andrewjstone
Contributor Author

> nit: hmm, in this case, rather than being stored as a member on the struct could the list of clients be passed in as an argument, or maybe a child struct be made which has collect_all_mgs as a self method? That would indicate in the type system that all the clients have been consumed.

This is a good question, especially since the keeper_admin_clients follow the same pattern. Presumably other clients passed in will as well. I don't think a child struct will work, because collect_one_mgs and collect_one_keeper rely on the parent self and so we'd still get a borrowck error.

The other alternative I came up with was to put these Vecs in an Option and pull them out of that, then pass them by argument. That's my usual go-to in this case. I'm happy to make that change if you prefer.

@andrewjstone
Contributor Author

I used Options in 0af62e3
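As a minimal sketch of that pattern (illustrative types only, not the contents of that commit): store the client list in an Option and take it out exactly once, so "already consumed" becomes an explicit state the type system tracks.

```rust
// Sketch of the Option-based approach (illustrative names only): the client
// list is stored as Option<Vec<_>>, taken once, then passed by argument.
struct Collector {
    mgs_clients: Option<Vec<String>>,
    results: Vec<String>,
}

impl Collector {
    fn collect_all_mgs(&mut self) {
        // `.take()` moves the Vec out and leaves None behind, so a second
        // call can't silently iterate an empty list.
        let Some(clients) = self.mgs_clients.take() else {
            return; // already collected
        };
        for client in &clients {
            self.collect_one_mgs(client);
        }
    }

    fn collect_one_mgs(&mut self, client: &str) {
        self.results.push(format!("collected from {client}"));
    }
}

fn main() {
    let mut collector = Collector {
        mgs_clients: Some(vec!["mgs0".into(), "mgs1".into()]),
        results: Vec::new(),
    };
    collector.collect_all_mgs();
    println!("{:?}", collector.results);
}
```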

@sunshowers
Contributor

Thanks! Just a few minor comments.

@andrewjstone merged commit e627935 into main on Oct 11, 2024
17 checks passed
@andrewjstone deleted the clickhouse-inventory-collection branch on October 11, 2024 at 14:31