
rhel9-ppc64le VMs offline due to NVMe failure at OSUOSL #3998

Closed
richardlau opened this issue Jan 14, 2025 · 9 comments · Fixed by #3999

Comments

@richardlau
Member

Got this email from OSUOSL:

One of our nodes which uses local storage with NVMe has had a failure and all the VMs on that node are offline. Due to the way we had storage configured on this node for performance, we're unlikely to save any of the ephemeral disks. However if you had any volumes attached, those will still be intact since they are on our Ceph cluster.

I'm going to be working on getting the NVMe replaced, in the meantime, if you need us to rebuild any of your VMs, please let us know.

This affects our rhel9-ppc64le VMs, which I had set up with NVMe storage as they had resulted in marginally faster builds:

@mhdawson
Member

Removed rhel9-ppc64le from these jobs:

@richardlau said he'd take a look tomorrow.

@mhdawson
Member

There are a couple of other jobs that use rhel9-ppc64le, but they haven't run for 5-6 days (they run more infrequently), so I've not disabled them yet.

I think we'll get the machine back tomorrow, if not we can exclude those as well:

@richardlau
Member Author

No updates on the NVMe replacement. I'm going to create new (non-NVMe) VMs.

@richardlau
Member Author

I've created a new test-osuosl-rhel9-ppc64_le-4 and replaced test-osuosl-rhel9-ppc64_le-3 (Ansible inventory update in #3999) with non-NVMe RHEL 9 VMs. That will give us two machines for now.

My plan is to wait to see what the outlook is for getting the NVMe replaced before deciding what to do about test-osuosl-rhel9-ppc64_le-1 and test-osuosl-rhel9-ppc64_le-2.

@richardlau
Member Author

@mhdawson
Copy link
Member

@richardlau thanks for the quick work.

@richardlau
Member Author

The system is back online and you should be able to recreate the VMs now. I unfortunately needed to remove all current VMs to get the hypervisor back in a sane state so using "rebuild" will not work right now.

So

are gone. I'll recreate them and update #3999 with the new IP addresses (I'm assuming we'll get new IP addresses with new VMs).

@richardlau
Member Author

Recreated

as p9.nvme.large (Power 9, 4 VCPUs, 8 GB RAM, 80 GB disk).

For now, I'll keep the other two machines as non-NVMe as they are working. We can revisit later if we want to convert those to NVMe (would involve removing the existing VM and reprovisioning).

@richardlau
Member Author

> which I had set up with NVMe storage as they had resulted in marginally faster builds

For posterity (and to put some data behind the "marginally faster builds" statement):

CI runs:

Without NVMe took 2 hours and 2 hours 9 mins (both without a populated ccache).

With NVMe took 1 hour 51 mins and 1 hour 45 mins (both without a populated ccache).
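Back-of-the-envelope arithmetic on the four runs quoted above (this calculation is mine, not part of the original thread) suggests the NVMe-backed builds were roughly 13% faster on average:

```python
# Build times from the CI runs above, converted to minutes.
# Both pairs were run without a populated ccache.
without_nvme = [120, 129]  # 2 h and 2 h 9 min
with_nvme = [111, 105]     # 1 h 51 min and 1 h 45 min

avg_without = sum(without_nvme) / len(without_nvme)  # 124.5 min
avg_with = sum(with_nvme) / len(with_nvme)           # 108.0 min

speedup_pct = (avg_without - avg_with) / avg_without * 100
print(f"NVMe builds were ~{speedup_pct:.1f}% faster on average")  # ~13.3%
```

So "marginally faster" here works out to saving a bit under a quarter of an hour per cold-cache build.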
