
Nova and Placement specific information gathering #24

Merged

Conversation

@gibizer gibizer (Contributor) commented Nov 26, 2023

@gibizer gibizer requested review from SeanMooney and Akrog November 26, 2023 15:38
@fmount fmount self-requested a review November 26, 2023 19:47
Comment on lines 93 to 104
run_bg /usr/bin/oc -n openstack exec -t nova-cell0-conductor-0 -- nova-manage placement audit --verbose '>' "$NOVA_PATH"/nova-manage_placement_audit
run_bg /usr/bin/oc -n openstack exec -t nova-cell0-conductor-0 -- nova-manage placement heal_allocations --dry-run --verbose '>' "$NOVA_PATH"/nova-manage_placement_heal_allocations
Contributor Author

I slept on it, and these might be too expensive operations for a large cloud, as both examine every instance.
@SeanMooney what do you think?

Contributor

How many seconds do those operations usually take?


The heal_allocations and audit commands scale with the number of allocations (instances + migrations), so on a large cloud this could take minutes.

I would not include them in the nova section personally, at least not by default.

Can we parametrise this? i.e. only include it if it's really needed.
I think I'm leaning more towards saying that if we need this info then we should ask them to run those commands separately, since in general I don't think we will use that output to debug a customer issue often.


By the way, I know these are nova commands, but this feels more like it should be in placement. In any case I would not consider the instance allocations to be an attribute of the nova status; in general that's an attribute of the workloads.

If we are considering "workload"-type data in the status, there are a number of other things we could consider, but it would require API access:

flavor list with extra specs per flavour.
aggregate list, aggregate metadata show and aggregate host list.

Cinder is including the volume type list, so that is basically the same as our flavors;
neutron is basically dumping all the workload objects (ports, networks, etc.).

So I think including the aggregate info is in scope. That would allow us to diagnose most scheduling issues (if we also have image properties and the placement summaries you are collecting below).

Contributor

> The heal_allocations and audit commands scale with the number of allocations (instances + migrations), so on a large cloud this could take minutes.
>
> I would not include them in the nova section personally, at least not by default.
>
> Can we parametrise this? i.e. only include it if it's really needed. I think I'm leaning more towards saying that if we need this info then we should ask them to run those commands separately, since in general I don't think we will use that output to debug a customer issue often.

If you both think this can take a while and it won't help in most cases, we can definitely do it only upon request.

The way we are currently parametrizing things is using environment variables, so that could be used to add a flag to gather the extra stuff, and then in our docs we could define in which cases it should be gathered.
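
(Purely as an illustration, such an env-var gate could look something like this in the collection script; the GATHER_NOVA_PLACEMENT_AUDIT variable name is hypothetical, not an agreed interface:)

if [ "${GATHER_NOVA_PLACEMENT_AUDIT:-false}" = "true" ]; then
    # hypothetical opt-in flag: only run the expensive audit when explicitly requested
    run_bg /usr/bin/oc -n openstack exec -t nova-cell0-conductor-0 -- nova-manage placement audit --verbose '>' "${NOVA_PATH}/nova-manage_placement_audit"
fi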

Contributor Author

I will remove it as we agree it takes too much time to do in each must-gather. The parameterization can be done separately. But if we need to ask for a specific thing to be added as a parameter, that is a very similar effort to actually sending the command to run to gather the extra output. So I will not add parameterization now.

Contributor Author

@SeanMooney we have API access, so we can list flavors. But I'm not sure that is useful in general. Also, that can be slow, e.g. try listing flavors in PSI :)

Contributor Author

removed audit and heal, added aggregate list --long

# NOTE(gibi): this gives us a very simple resource capacity view of the
# cluster. It intentionally uses a 1 MB RAM query to get one candidate
# from each compute
run_bg ${BASH_ALIASES[os]} allocation candidate list --resource MEMORY_MB=1 --fit-width '>' "$PLACEMENT_PATH"/allocation_candidate_list
Contributor Author

On the other hand, this should not be expensive, as it runs a query very similar to any basic VM scheduling scenario.


This will not include any ironic nodes, but yes, it should be pretty quick.
However, this could also get pretty large.

It will include all the provider summaries

"provider_summaries": {
"be99627d-e848-44ef-8341-683e2e557c58": {
"resources": {},
"traits": [
"COMPUTE_VOLUME_MULTI_ATTACH"
],
"parent_provider_uuid": null,
"root_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58"
},
"9a9c6b0f-e8d1-4d16-b053-a2bfe8a76757": {
"resources": {
"VCPU": {
"capacity": 4,
"used": 0
},
"MEMORY_MB": {
"capacity": 2048,
"used": 0
}
},
"traits": [
"HW_NUMA_ROOT",
"CUSTOM_FOO"
],
"parent_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58",
"root_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58"
},
"ba415f98-1960-4488-b2ed-4518b77eaa60": {
"resources": {},
"traits": [
"CUSTOM_VNIC_TYPE_DIRECT"
],
"parent_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58",
"root_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58"
},
"92e971c9-777a-48bf-a181-a2ca1105c015": {
"resources": {
"NET_BW_EGR_KILOBIT_PER_SEC": {
"capacity": 10000,
"used": 0
}
},
"traits": [
"CUSTOM_PHYSNET1"
],
"parent_provider_uuid": "ba415f98-1960-4488-b2ed-4518b77eaa60",
"root_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58"
},
"cefbdf54-05a8-4db4-ad2b-d6729e5a4de8": {
"resources": {
"NET_BW_EGR_KILOBIT_PER_SEC": {
"capacity": 20000,
"used": 0
}
},
"traits": [
"CUSTOM_PHYSNET2"
],
"parent_provider_uuid": "ba415f98-1960-4488-b2ed-4518b77eaa60",
"root_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58"
}

for each host.

That is useful, just verbose.
That said, it's text data, so it should be highly compressible, and it likely won't increase the size of the result by much after it's compressed. On a large cloud this might still be in the tens of MBs.


What I would add is a list of the resource classes and traits.

We can infer that from the provider summaries, I guess, but I think capturing them separately as well would be nice.
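
(As a sketch along the lines of the existing run_bg calls; the output file names are only illustrative:)

run_bg ${BASH_ALIASES[os]} resource class list '>' "$PLACEMENT_PATH"/resource_class_list
run_bg ${BASH_ALIASES[os]} trait list '>' "$PLACEMENT_PATH"/trait_list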

Contributor Author

Exactly, the provider summary is the point of the call, as it gives us the resource view of the cluster.

Contributor Author

Do we have a good way to catch the ironic computes? Or are those always using custom resource classes?

Contributor Author

I've added rc and trait list. I think we cannot easily catch ironic computes. :/

@SeanMooney SeanMooney Nov 27, 2023

We can't, no; since they always use a custom resource class we don't have a generic way.
Having said that, we do plan to add some cloud-wide capacity/usage enhancements to placement in the future, so we may be able to support something like the owner trait to do that.

The provider summaries are basically what the capacity API should return,
so if we supported

openstack provider summary show --trait OWNER:COMPUTE
openstack provider summary show --aggregate [<aggregate uuid>|all]

then I think we could address this in the future with that work.

Contributor Author

Yep that will be a nice future improvement here.

@Akrog Akrog (Contributor) left a comment

This PR is missing the triggering of the Guru Meditation Reports in the collection-scripts/gather_trigger_gmr file.


@SeanMooney

General question: I assume db dumps are out of scope of must-gather, or at least of its default behaviour; i.e. if a db dump is required, that will be done via a separate procedure.

@fmount fmount (Contributor) commented Nov 27, 2023

> General question: I assume db dumps are out of scope of must-gather, or at least of its default behaviour; i.e. if a db dump is required, that will be done via a separate procedure.

We have a Jira to optionally enable db dumps, so it will be possible in the tool (AFAIK it's optionally available in sos reports as well today), but not the default behavior.

@gibizer gibizer (Contributor Author) commented Nov 27, 2023

> This PR is missing the triggering of the Guru Meditation Reports in the collection-scripts/gather_trigger_gmr file.

I will do a separate PR for GMR, as I remember nova-api was reluctant to provide it.
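
(For context, oslo.reports GMRs are normally triggered by sending SIGUSR2 to the service process; a rough, untested sketch for the conductor, assuming the service runs as PID 1 in the container and kill is available there:)

# assumes the conductor registers the default oslo.reports SIGUSR2 handler
/usr/bin/oc -n openstack exec -t nova-cell0-conductor-0 -- kill -USR2 1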

@gibizer gibizer requested a review from Akrog November 27, 2023 17:06
@SeanMooney SeanMooney left a comment

Let's see if we need to increase the verbosity and collect other output in a follow-up, and proceed with this as a baseline for now.

Capturing the flavors may or may not help us in general, so if we want to treat that like a db dump or GMR and gate it behind an env var, or defer it to another PR, I think that is fine.

Thanks gibi for adding the basic collection.

get_nova_status() {
local NOVA_PATH="$BASE_COLLECTION_PATH/ctlplane/nova"
mkdir -p "$NOVA_PATH"
run_bg ${BASH_ALIASES[os]} compute service list '>' "$NOVA_PATH"/service_list


nit: I know you're following the existing pattern, but ideally

"$NOVA_PATH"/service_list would be "${NOVA_PATH}/service_list"

run_bg /usr/bin/oc -n openstack exec -t nova-cell0-conductor-0 -- nova-manage cell_v2 list_cells '>' "$NOVA_PATH"/cell_list
run_bg /usr/bin/oc -n openstack exec -t nova-cell0-conductor-0 -- nova-manage cell_v2 list_hosts '>' "$NOVA_PATH"/host_list
run_bg ${BASH_ALIASES[os]} aggregate list --long '>' "$NOVA_PATH"/aggregate_list
}


I still think it would be nice to include a flavour detail list output.
Other than that I think this is a good start, so I'm OK with deferring that to a separate PR.
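
(If it gets added later, one possible shape, with the flags and output file name only a suggestion:)

run_bg ${BASH_ALIASES[os]} flavor list --long --all '>' "$NOVA_PATH"/flavor_list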

@fmount fmount (Contributor) left a comment

/lgtm


@openshift-merge-bot openshift-merge-bot bot merged commit ea4c956 into openstack-k8s-operators:main Nov 28, 2023
2 checks passed