OTA-1177: Gather OSUS data #416
Conversation
Collect data from OSUS operator if installed in the cluster.
@oarribas: This pull request references OTA-1177 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
```shell
OSUS_OPERATOR_NAME="update-service-operator"

get_log_collection_args

HAS_OSUS=$(oc get csv -A --no-headers -o custom-columns=NS:.metadata.namespace,OPERATOR:.metadata.name --ignore-not-found=true | awk '/'${OSUS_OPERATOR_NAME}'/ {print $1}')
```
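A hedged sketch of how this detection could drive the conditional gather (the `osus_namespaces` helper, loop, and `--dest-dir` value are illustrative assumptions, not the PR's exact code):

```shell
# Sketch: gather OSUS data only when the operator CSV is present.
OSUS_OPERATOR_NAME="update-service-operator"

# Pure-shell filter: print the namespace column for CSV rows whose
# operator name matches OSUS_OPERATOR_NAME (same awk idea as the diff).
osus_namespaces() {
  awk -v op="${OSUS_OPERATOR_NAME}" '$2 ~ op {print $1}'
}

# Only query the cluster when an oc client is available.
if command -v oc >/dev/null 2>&1; then
  HAS_OSUS=$(oc get csv -A --no-headers \
    -o custom-columns=NS:.metadata.namespace,OPERATOR:.metadata.name \
    --ignore-not-found=true | osus_namespaces)
  for ns in ${HAS_OSUS}; do
    # Inspect each namespace where the OSUS operator CSV was found.
    oc adm inspect "ns/${ns}" --dest-dir=must-gather
  done
fi
```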
This seems like a hard direction for an OCP-central must-gather to support. I thought the OLM-installed operator pattern was for each operator to ship a separate image with the must-gather-for-them logic (as described in the KCS you'd linked from OTA-1177, and also in these docs), so the central must-gather maintainers didn't have to be bothered reviewing gather logic for all the many, many, OLM-installed operators that could possibly be present?
@wking, while I agree with your comment for most of the operators that can be installed in OpenShift, I think creating a full image for this operator is excessive.
Other operators provide extra capabilities to the cluster and usually involve several different CRDs and even several namespaces. This operator supports one of the core capabilities of OpenShift, the upgrade (in this case, for disconnected clusters), and many support cases already include a must-gather proactively. The info for this operator (when installed) shouldn't significantly increase the size of the must-gather, and collecting it here avoids asking for a separate must-gather.
@wking , any thoughts on the above?
@oarribas / @wking the only thing this collects is the details from a given namespace (or namespaces). I think this is fine (given it's related to updates), but @soltysh or @ingvagabund should have the final say on whether this is provided by this image or an independent image.
@oarribas how big are the extra manifests? Can you share an example of running both commands?
@ingvagabund, checking data from some cases, it depends a lot on the volume of the logs from the pods. The largest inspect of the namespace I have seen is 30MB uncompressed, and between 2-3MB compressed. The size of the updateservice resource is a few KB.
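The numbers quoted above (raw vs. compressed inspect size) can be reproduced with a small helper; this is a hypothetical sketch, not part of the PR, and the `compress_inspect` name and paths are made up for illustration:

```shell
# Hypothetical helper: report the raw and gzip-compressed size (in KiB)
# of an inspect output directory, to back size estimates like the above.
compress_inspect() {
  dir=$1
  # Archive the directory next to itself as <dir>.tar.gz.
  tar czf "${dir}.tar.gz" -C "$(dirname "${dir}")" "$(basename "${dir}")"
  raw_kib=$(du -sk "${dir}" | awk '{print $1}')
  gz_kib=$(du -sk "${dir}.tar.gz" | awk '{print $1}')
  echo "raw=${raw_kib}KiB compressed=${gz_kib}KiB"
}
```

Typical use after an inspect run would be something like `oc adm inspect ns/openshift-update-service --dest-dir=inspect-osus` followed by `compress_inspect inspect-osus`. Pod logs compress well, which is why the 30MB worst case shrinks to 2-3MB.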
@ingvagabund , @soltysh , any thoughts based on the above?
For some time we have been merging extra collections with the promise of collecting extra "small" data that helps avoid asking customers to run yet another must-gather image, which is more relevant in disconnected environments. Yet we don't track how much these extra collections increase the overall must-gather size on average. Before going further I'd like to see when all the extra collections are collected, e.g. what the estimated worst-case bump in the collected data is. A new section under https://github.com/openshift/must-gather/blob/master/README.md will do. E.g.

Extra collections

| script location | short description | condition | estimated size |
|---|---|---|---|
| /usr/bin/gather_aro | Gather ARO Cluster Data | ns/openshift-azure-operator or ns/openshift-azure-logging present | ?? |
| /usr/bin/gather_vsphere | Gather vSphere resources | vSphere CSI driver is installed | ?? |
| ... | ... | ... | ... |
@ingvagabund, the estimated size for this one (for the OSUS operator), when compressed, is around 3MB, and usually less, based on several inspects of the openshift-update-service namespace I have reviewed. And it's only collected if OSUS is installed.
Regarding the estimated "worst" case above, collections like ARO and vSphere cannot both apply to the same cluster, so if the conditions are implemented correctly they will never be collected at the same time.
/approve
@kasturinarra just out of curiosity, do we have any test cases/jobs monitoring how much an average must-gather grows over time across various installations? @oarribas are there any statistics about must-gather size? E.g. a matrix of which operators are installed -> how much data can be collected. Or, what are the variable parts that can significantly increase the size? @sferich888 is it possible to make a matrix of all flavors of an OCP cluster, including layered products, to see how complex a must-gather collection can be? I am quite blind here and would like to extend my perspective so we can make better decisions when reviewing these kinds of additions.
@ingvagabund, checking in OTA-1177
@ingvagabund by my count (before we add on layered products), the matrix you're looking at has 87k+ combinations in it.

However, when it comes to must-gather and testing, we are building a tool that works for the majority of our user base; I think the more important thing to consider is that only about 9k of those combinations (or 10% of that matrix) matter. The biggest issues I have seen are related to operating at specific sizes and scales, i.e. with our deployment patterns (combinations). We see the biggest challenges when must-gather can't find a host to run on (single-node clusters, or clusters with schedulable control planes that are loaded with work), when it has to crowd out a workload to start (people really don't like this, but it's necessary), or when we try to operate at large scales (500+ nodes, with workloads).

The biggest issues we see are with time to collect data and with how much data we collect. Note: we don't automatically compress archives (there is an RFE for this that hasn't been actioned yet), so we probably shouldn't make collection estimates based on compression. The size of our archive is an issue for most customers because, in a lot of situations, they have to move the data from one system to another just so they can upload it to Red Hat; that is 2+ data transfers for many customers (mostly customers in disconnected or restricted network environments). Paired with the time to collect a must-gather (20+ min in some situations), we could have a customer collecting and transferring data for up to 30 to 40 minutes (based on some estimates).
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ingvagabund, oarribas, sferich888. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@oarribas: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@sferich888
[ART PR BUILD NOTIFIER] Distgit: ose-must-gather
/cherry-pick release-4.17

@oarribas: new pull request created: #443

/cherry-pick release-4.16

@oarribas: new pull request created: #455

/cherry-pick release-4.15

@oarribas: #416 failed to apply on top of branch "release-4.15":