Nova and Placement specific information gathering #24
Conversation
run_bg /usr/bin/oc -n openstack exec -t nova-cell0-conductor-0 -- nova-manage placement audit --verbose '>' "$NOVA_PATH"/nova-manage_placement_audit
run_bg /usr/bin/oc -n openstack exec -t nova-cell0-conductor-0 -- nova-manage placement heal_allocations --dry-run --verbose '>' "$NOVA_PATH"/nova-manage_placement_heal_allocations
I slept on it, and these might be too expensive operations for a large cloud, as both examine every instance.
@SeanMooney what do you think?
How many seconds do those operations usually take?
The heal_allocations and audit commands scale with the number of allocations (instances + migrations), so on a large cloud this could take minutes.
I would not include them in the nova section personally, at least not by default.
Can we parameterize this? i.e. only include it if it's really needed.
I think I'm leaning more towards saying that if we need this info we should ask them to run those commands separately, since in general I don't think we will use that output to debug a customer issue often.
By the way, I know these are nova commands, but this feels more like it should be in placement. In any case, I would not consider the instance allocations to be an attribute of the nova status in general; that's an attribute of the workloads.
If we are considering "workload" type data in the status, there are a number of other things we could consider, but it would require API access:
flavor list with extra specs per flavour.
aggregate list, aggregate metadata show and aggregate host list.
cinder is including the volume type list, so that is basically the same as our flavors.
neutron is basically dumping all the workload objects (ports, networks, etc.).
So I think including the aggregate info is in scope (see the sketch below). That would allow us to diagnose most scheduling issues (if we also have image properties and the placement summaries you are collecting below).
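For what it's worth, a rough sketch of what collecting the per-aggregate details (metadata plus host list) could look like, reusing the run_bg/os-alias pattern from this PR; the output file names and the ID iteration are illustrative only, not part of the change:
for agg in $(${BASH_ALIASES[os]} aggregate list -f value -c ID); do
    # "aggregate show" output includes both the aggregate metadata and its host list
    run_bg ${BASH_ALIASES[os]} aggregate show "$agg" '>' "$NOVA_PATH"/aggregate_show_"$agg"
done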
If you both think this can take a while and it won't help in most cases we can definitely do it only upon request.
The way we are currently parameterizing things is with environment variables, so that could be used as a flag to gather extra stuff, and then in our docs we could define in which cases it should be gathered.
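One possible shape for such a flag, assuming the env-var based parameterization mentioned above (the variable name OSP_MUST_GATHER_HEAVY is purely illustrative, not something defined in this repo):
if [ "${OSP_MUST_GATHER_HEAVY:-false}" = "true" ]; then
    # only run the expensive placement audit when explicitly requested
    run_bg /usr/bin/oc -n openstack exec -t nova-cell0-conductor-0 -- nova-manage placement audit --verbose '>' "$NOVA_PATH"/nova-manage_placement_audit
fi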
I will remove it as we agree it takes too much time to do in each must-gather. The parameterization can be done separately. But if we need to ask for a specific thing to be added as a parameter, that is very similar effort to actually sending the command to run to gather the extra output. So I will not add parameterization now.
@SeanMooney we have API access, so we can list flavors. But I'm not sure that is useful in general. Also that can be slow e.g. try to list flavors in PSI :)
removed audit and heal, added aggregate list --long
# NOTE(gibi): this gives us a very simple resource capacity view of the
# cluster. It intentionally uses a 1 MB RAM query to get one candidate
# from each compute
run_bg ${BASH_ALIASES[os]} allocation candidate list --resource MEMORY_MB=1 --fit-width '>' "$PLACEMENT_PATH"/allocation_candidate_list
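For reference, the underlying Placement API request this maps to is roughly the following; this is only a sketch, and $PLACEMENT_ENDPOINT plus the token handling are placeholders rather than anything used by the script:
TOKEN=$(openstack token issue -f value -c id)
# allocation candidates were added in placement microversion 1.10
curl -s -H "X-Auth-Token: $TOKEN" \
     -H "OpenStack-API-Version: placement 1.10" \
     "$PLACEMENT_ENDPOINT/allocation_candidates?resources=MEMORY_MB:1"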
On the other hand, this should not be expensive, as it runs a query very similar to any basic VM scheduling scenario.
This will not include any ironic nodes, but yes, it should be pretty quick.
However, this could also get pretty large.
It will include all the provider summaries
"provider_summaries": {
"be99627d-e848-44ef-8341-683e2e557c58": {
"resources": {},
"traits": [
"COMPUTE_VOLUME_MULTI_ATTACH"
],
"parent_provider_uuid": null,
"root_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58"
},
"9a9c6b0f-e8d1-4d16-b053-a2bfe8a76757": {
"resources": {
"VCPU": {
"capacity": 4,
"used": 0
},
"MEMORY_MB": {
"capacity": 2048,
"used": 0
}
},
"traits": [
"HW_NUMA_ROOT",
"CUSTOM_FOO"
],
"parent_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58",
"root_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58"
},
"ba415f98-1960-4488-b2ed-4518b77eaa60": {
"resources": {},
"traits": [
"CUSTOM_VNIC_TYPE_DIRECT"
],
"parent_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58",
"root_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58"
},
"92e971c9-777a-48bf-a181-a2ca1105c015": {
"resources": {
"NET_BW_EGR_KILOBIT_PER_SEC": {
"capacity": 10000,
"used": 0
}
},
"traits": [
"CUSTOM_PHYSNET1"
],
"parent_provider_uuid": "ba415f98-1960-4488-b2ed-4518b77eaa60",
"root_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58"
},
"cefbdf54-05a8-4db4-ad2b-d6729e5a4de8": {
"resources": {
"NET_BW_EGR_KILOBIT_PER_SEC": {
"capacity": 20000,
"used": 0
}
},
"traits": [
"CUSTOM_PHYSNET2"
],
"parent_provider_uuid": "ba415f98-1960-4488-b2ed-4518b77eaa60",
"root_provider_uuid": "be99627d-e848-44ef-8341-683e2e557c58"
}
for each host.
That is useful, just verbose.
That said, it's text data, so it should be highly compressible and likely won't increase the size of the result by much after compression. On a large cloud this might still be in the tens of MBs.
What I would add is a list of the resource classes and traits.
We can infer that from the provider summaries, I guess, but I think capturing them separately as well would be nice.
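Something along these lines would follow the existing run_bg pattern (the output file names are illustrative):
run_bg ${BASH_ALIASES[os]} resource class list '>' "$PLACEMENT_PATH"/resource_class_list
run_bg ${BASH_ALIASES[os]} trait list '>' "$PLACEMENT_PATH"/trait_list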
Exactly, the provider summary is the point of the call, as it gives us the resource view of the cluster.
Do we have a good way to catch the ironic computes? Or are those always using custom resource classes?
I've added rc and trait list. I think we cannot easily catch ironic computes. :/
We can't, no; since they always use a custom resource class, we don't have a generic way.
Having said that, we do plan to add some cloud-wide capacity/usage enhancements to placement in the future, so we may be able to support something like the owner trait to do that.
The provider summaries are basically what the capacity API should return,
so if we supported
openstack provider summary show --trait OWNER:COMPUTE
openstack provider summary show --aggregate [<aggregate uuid>|all]
then I think we could address this in the future with that work.
Yep that will be a nice future improvement here.
80cacdc to a833baf
This PR is missing the triggering of the Guru Meditation Reports in the collection-scripts/gather_trigger_gmr file.
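For context, a minimal sketch of what a gather_trigger_gmr entry for the nova services could look like; the pod label selector and the PID-1 assumption are illustrative rather than anything defined in this PR, and the signal relies on oslo.reports dumping a GMR to the service log on SIGUSR2 by default:
for pod in $(/usr/bin/oc -n openstack get pods -l service=nova -o name); do
    # assumes the nova service runs as PID 1 inside the container
    run_bg /usr/bin/oc -n openstack exec -t "$pod" -- kill -USR2 1
done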
General question: I assume db dumps are out of scope of must-gather, or at least of its default behavior, i.e. if a db dump is required that will be done via a separate procedure.
We have a jira to optionally enable db dumps, so it will be something possible in the tool (AFAIK it's optionally available in sos reports as well today), but not the default behavior.
I will do a separate PR for GMR, as I remember nova-api was reluctant to provide it.
a833baf to c860acc
Let's see if we need to increase the verbosity and collect other output in a follow-up, and proceed with this as a baseline for now.
Capturing the flavors may or may not help us in general, so if we want to treat that like a db dump or GMR and gate it behind an env var, or defer it to another PR, I think that is fine.
Thanks gibi for adding the basic collection.
get_nova_status() {
    local NOVA_PATH="$BASE_COLLECTION_PATH/ctlplane/nova"
    mkdir -p "$NOVA_PATH"
    run_bg ${BASH_ALIASES[os]} compute service list '>' "$NOVA_PATH"/service_list
nit: I know you're following the existing pattern, but ideally
"$NOVA_PATH"/service_list would be "${NOVA_PATH}/service_list"
    run_bg /usr/bin/oc -n openstack exec -t nova-cell0-conductor-0 -- nova-manage cell_v2 list_cells '>' "$NOVA_PATH"/cell_list
    run_bg /usr/bin/oc -n openstack exec -t nova-cell0-conductor-0 -- nova-manage cell_v2 list_hosts '>' "$NOVA_PATH"/host_list
    run_bg ${BASH_ALIASES[os]} aggregate list --long '>' "$NOVA_PATH"/aggregate_list
}
I still think it would be nice to include a flavour detail list output.
Other than that, I think this is a good start, so I'm OK with deferring that to a separate PR.
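A flavor detail listing along these lines would match the existing collection style; the exact flags and output file name are illustrative, and an additional flag may be needed to also include non-public flavors:
run_bg ${BASH_ALIASES[os]} flavor list --long '>' "$NOVA_PATH"/flavor_list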
/lgtm
/lgtm
ea4c956 into openstack-k8s-operators:main
Implements: https://issues.redhat.com/browse/OSPRH-730