Add support for clickhouse keeper inventory collection #6810
Conversation
I made a small change to the types stored in inventory so that we no longer collect OmicronZoneUuids for each keeper. This allows collection based on DNS names alone; the zone IDs were never actually used for anything anyway. Note that this won't collect anything right now because we won't have any DNS names for clickhouse-admin-keepers.

Fixes #6578
nexus/inventory/src/collector.rs (outdated)

```diff
@@ -74,7 +77,7 @@ impl<'a> Collector<'a> {

     /// Collect inventory from all MGS instances
     async fn collect_all_mgs(&mut self) {
-        let clients = self.mgs_clients.clone();
+        let clients = std::mem::take(&mut self.mgs_clients);
```
I removed the Arc, because it seemed unnecessary, then had to make this change because of borrow check rules. It's a cheap memcopy of a few pointers though.
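For context, a minimal sketch of the pattern (illustrative types, not the actual collector code): `std::mem::take` swaps the field with its `Default` value, so the loop can consume the clients while the method still holds `&mut self`.

```rust
#[derive(Default)]
struct Collector {
    mgs_clients: Vec<String>, // stand-in for the real MGS client type
}

impl Collector {
    async fn collect_all_mgs(&mut self) {
        // Move the clients out, leaving an empty Vec behind. Only the Vec's
        // pointer/length/capacity are copied, and moving the clients out ends
        // the borrow conflict between iterating them and calling `&mut self`
        // methods inside the loop.
        let clients = std::mem::take(&mut self.mgs_clients);
        for client in &clients {
            self.collect_one_mgs(client).await;
        }
    }

    async fn collect_one_mgs(&mut self, _client: &str) {
        // ... query one MGS instance and record results into `self` ...
    }
}
```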
nit: hmm, in this case, rather than being stored as a member on the struct, could the list of clients be passed in as an argument, or maybe a child struct be made which has `collect_all_mgs` as a `self` method? That would indicate in the type system that all the clients have been consumed.
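A rough sketch of that suggestion (names are illustrative, not the actual collector API): consuming `self` makes it a type error to reuse the clients after collection has run.

```rust
struct MgsClient; // stand-in for the real gateway client type

struct MgsCollector {
    clients: Vec<MgsClient>,
}

impl MgsCollector {
    // Taking `self` by value expresses in the signature that the clients
    // are consumed by this call.
    async fn collect_all_mgs(self) {
        for _client in self.clients {
            // ... query one MGS instance ...
        }
    }
}
```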
```rust
        self.collect_all_mgs().await;
        self.collect_all_sled_agents().await;
        self.collect_all_keepers().await;
```
Should these be run concurrently with `tokio::join!`?
Probably, yes. Looking at the inventory collection, all of it is 100% serial. I think it all should be made concurrent. I opened an issue for this.
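For illustration, a minimal sketch of the `tokio::join!` shape. These are hypothetical free functions; the real steps are `&mut self` methods, so they'd need some restructuring before they could be joined like this.

```rust
// Hypothetical standalone collection steps.
async fn collect_all_mgs() { /* ... */ }
async fn collect_all_sled_agents() { /* ... */ }
async fn collect_all_keepers() { /* ... */ }

#[tokio::main]
async fn main() {
    // Run all three futures concurrently on the current task and wait for
    // every one of them to complete.
    tokio::join!(
        collect_all_mgs(),
        collect_all_sled_agents(),
        collect_all_keepers(),
    );
}
```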
There is actually a comment regarding keeping things serial: https://github.com/oxidecomputer/omicron/blob/main/nexus/inventory/src/collector.rs#L58-L63

I think instead we should probably do it in parallel and limit the concurrency with `buffer_unordered` or something.
Yeah, limiting concurrency is a bit tricky. (You probably want all of the methods to produce iterators over futures, then operate on them via `buffer_unordered`. Or use an async semaphore, I guess.)

I think where it would benefit us is when more than one of the futures times out -- in that case (assuming the timeout is 60s) we'd spend ~60s rather than 120s+.
> Yeah, limiting concurrency is a bit tricky. (You probably want all of the methods to produce iterators over futures, then operate on them via `buffer_unordered`. Or use an async semaphore, I guess.)

I actually think we should do something like this. It's more complicated, but we can't be blocking inventory collections for multiple timeouts. We could easily exceed the time it takes between inventory collections.
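As a rough illustration of the `buffer_unordered` shape (the `sled_urls`/`collect_one_sled` names are hypothetical, not the actual collector code), assuming the `futures` and `tokio` crates:

```rust
use futures::stream::{self, StreamExt};

// Hypothetical per-target collection step.
async fn collect_one_sled(url: String) -> Result<String, String> {
    // ... query the sled agent at `url`, with a per-request timeout ...
    Ok(url)
}

#[tokio::main]
async fn main() {
    let sled_urls: Vec<String> = vec![
        "http://sled-a".into(),
        "http://sled-b".into(),
        "http://sled-c".into(),
    ];

    // Turn each target into a future and run at most 8 of them at a time.
    // Slow or timed-out targets overlap instead of adding up serially.
    let results: Vec<_> = stream::iter(sled_urls)
        .map(collect_one_sled)
        .buffer_unordered(8)
        .collect()
        .await;

    for result in results {
        // ... record each result into the in-progress collection ...
        let _ = result;
    }
}
```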
FWIW, we have the same problem with blueprint execution for subqueries to systems that may not be online. For `clickhouse-admin` interactions they are all in parallel, but nothing else is.
```diff
@@ -124,7 +123,7 @@ pub struct Collection {
     /// cluster as returned from each available keeper via `clickhouse-admin` in
     /// the `ClickhouseKeeper` zone
```
Everything else here is a map -- might be worth adding a short explanation for why this is keyed by the membership itself rather than something else (like the zone ID that it previously was).
We don't actually know the zone id when we look up the address via DNS, so using that as a key was problematic. However, we could destructure the returned value and map via `KeeperId`. But then we end up putting it back together again so we can store it as a value in the blueprint. I think this is pretty tedious and doesn't buy us anything. If there was ever a duplicate for some reason with slightly different values, we'd end up overwriting early versions with later versions, which could hinder debugging. I guess this latter comment leans somewhat towards this being a `Vec` instead then.

I could change how we access the `clickhouse-admin` service rather than looking it up in DNS, and do a DB query instead and then map by zone. But then we end up simply discarding the zone ID downstream since there's no use for it.

I'm happy to add a small comment, but I'm really unsure what to put there. I can put what I said here, but that seems somewhat unsatisfying and I'm unsure of the value.
I think what you said here is very valuable! Either the full comment, or a shortened form with a link to this comment would be great.
Could the clickhouse admin server return the zone ID from the endpoint we hit? For `cockroach-admin` we have sled-agent pass in the zone ID, so it can return it from (one of) the endpoints for similar reasons.
Is it actually useful to do that though? The mapping from keeper id to zone id already exists in the blueprints.
I added a long comment detailing how this is used and how the reconfigurator should prevent any duplicates. We could cache and use the zone IDs for keys, but I'm not sure that provides much in terms of safety over what is already there.
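For illustration only, with simplified stand-ins for the real inventory types: keying by the membership value itself keeps two differing responses side by side, whereas keying by `KeeperId` would silently overwrite one of them.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Simplified, hypothetical stand-ins for the real inventory types.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct KeeperId(u64);

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct ClickhouseKeeperClusterMembership {
    queried_keeper: KeeperId,
    raft_config: BTreeSet<KeeperId>,
}

fn main() {
    let first = ClickhouseKeeperClusterMembership {
        queried_keeper: KeeperId(1),
        raft_config: [KeeperId(1), KeeperId(2)].into_iter().collect(),
    };
    // The same keeper reporting a (surprisingly) different raft config.
    let second = ClickhouseKeeperClusterMembership {
        queried_keeper: KeeperId(1),
        raft_config: [KeeperId(1), KeeperId(2), KeeperId(3)].into_iter().collect(),
    };

    // Keyed by KeeperId: the second response overwrites the first, which
    // would hide the discrepancy when debugging.
    let mut by_id = BTreeMap::new();
    by_id.insert(first.queried_keeper.clone(), first.clone());
    by_id.insert(second.queried_keeper.clone(), second.clone());
    assert_eq!(by_id.len(), 1);

    // Keyed by the membership itself: both responses are preserved.
    let mut by_membership = BTreeSet::new();
    by_membership.insert(first);
    by_membership.insert(second);
    assert_eq!(by_membership.len(), 2);
}
```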
This is a good question. The other alternative I came up with was to put these Vecs in an `Option` and pull them out of that, then pass by argument. That's my usual go-to in this case. I'm happy to make that change if you prefer.
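A minimal sketch of that `Option` pattern (illustrative types, not the actual collector code): `Option::take` moves the clients out and leaves `None` behind, after which they can be passed to the helper by argument.

```rust
struct Collector {
    mgs_clients: Option<Vec<String>>, // stand-in for the real client type
}

impl Collector {
    async fn collect_all(&mut self) {
        // Pull the clients out of the Option (leaving `None` behind) and
        // pass them to the helper as an argument.
        if let Some(clients) = self.mgs_clients.take() {
            self.collect_all_mgs(clients).await;
        }
    }

    async fn collect_all_mgs(&mut self, clients: Vec<String>) {
        for _client in clients {
            // ... query one MGS instance and record results into `self` ...
        }
    }
}
```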
Thanks! Just a few minor comments.