Intel SGX Device Plugin returns error "permission denied" for OpenShift 4.13 #113
Comments
Can you also upload /var/log/audit/audit.log?
Hi @tsadowsk! Thanks for submitting the issue and sharing all the details. Currently, our operator officially supports
@hershpa I sent /var/log/audit/audit.log directly to you as it might contain some sensitive information.
Looks like the SELinux policies in container-selinux changed between 4.12 and 4.13. We will investigate.
@mregmi Since this is a regression, can we create a RH ticket, since these policies were missed in the OCP 4.13 integration? The policies were already part of the upstream container-selinux project and were backported to OCP 4.12.
@tsadowsk, which OCP 4.13 z-stream version are you using?
@hershpa Below are more details about the version:
Please let me know if you need more info.
Thanks @tsadowsk
Can you try OCP
Thanks @hershpa!
@uMartinXu @hershpa
Below are yaml files which I used for Intel Device Plugin installation:
Could you please help? Please let me know if you need more info.
@tsadowsk thanks for the update. I am testing it on our end; let me see if I observe what you saw above.
FYI, @eadamsintel got everything deployed OK on 4.13.11.
I got it to work by turning off SELinux, which is not really a proper solution.
Maybe this issue might be a generalized case of the RHEL-3128 / Bug 2180456 bug? The similarities:
The differences:
Please note that at least one of Red Hat's own operators identified the fix for the mentioned Bug 2180456 as the long-term solution and applied short-term SELinux policy workarounds until that bug is fixed. The "{ connectto }" denial was also present there, if you take a look at the comments.
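For anyone comparing their own audit.log against the "{ connectto }" denial mentioned above, here is a minimal sketch of pulling AVC denials out of audit log lines. The sample record is hypothetical (it is not taken from the reporter's log), and the function name and the SELinux types in the sample are assumptions made for illustration only.

```python
import re

# Matches the permission set of an AVC denial, e.g. "{ connectto }".
_PERMS = re.compile(r"avc:\s+denied\s+\{([^}]*)\}")
# Matches key=value fields; values may be double-quoted.
_FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_avc_denials(lines):
    """Return one dict per AVC denial found in the given audit log lines."""
    denials = []
    for line in lines:
        match = _PERMS.search(line)
        if match is None:
            continue  # not an AVC denial record
        fields = {key: value.strip('"') for key, value in _FIELD.findall(line)}
        denials.append({
            "perms": match.group(1).split(),
            "comm": fields.get("comm"),
            "scontext": fields.get("scontext"),
            "tcontext": fields.get("tcontext"),
            "tclass": fields.get("tclass"),
        })
    return denials


# Hypothetical sample lines, written only to exercise the parser.
SAMPLE_AUDIT = [
    'type=AVC msg=audit(1690000000.123:456): avc:  denied  { connectto } '
    'for  pid=1234 comm="intel_sgx_devic" '
    'path="/var/lib/kubelet/device-plugins/kubelet.sock" '
    'scontext=system_u:system_r:container_device_plugin_t:s0 '
    'tcontext=system_u:system_r:unconfined_service_t:s0 '
    'tclass=unix_stream_socket permissive=0',
    'type=SYSCALL msg=audit(1690000000.123:456): arch=c000003e syscall=42 success=no',
]

if __name__ == "__main__":
    for denial in parse_avc_denials(SAMPLE_AUDIT):
        print(denial)
```

On a real node the input would be the lines of /var/log/audit/audit.log; comparing scontext and tcontext across denials shows which labels are involved.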
It looks like kubelet is running with an incorrect label, which is causing the SELinux access denial for the plugins. The Bugzilla above touches on this issue but does not seem to provide a solution or fix. Will investigate further why it's happening and check with Red Hat too.
sh-5.1# ps -AZ | grep unconfined
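A rough sketch of the same check done programmatically: it takes the text output of `ps -AZ` and returns the SELinux label of a named process. The sample output is hypothetical and only mirrors the usual `LABEL PID TTY TIME CMD` column layout; the helper names are assumptions, and the sketch does not assert which label kubelet is supposed to have.

```python
def find_process_labels(ps_output, command):
    """Return the SELinux labels of processes whose CMD column equals `command`."""
    labels = []
    for line in ps_output.splitlines():
        # Columns: LABEL PID TTY TIME CMD (CMD is kept as the remainder).
        parts = line.split(None, 4)
        if len(parts) == 5 and parts[4].strip() == command:
            labels.append(parts[0])
    return labels


def selinux_type(label):
    """Return the type field (third colon-separated part) of an SELinux label."""
    parts = label.split(":")
    return parts[2] if len(parts) >= 3 else None


# Hypothetical `ps -AZ` output, for illustration only.
SAMPLE_PS = """\
LABEL                                        PID TTY          TIME CMD
system_u:system_r:init_t:s0                    1 ?        00:00:05 systemd
system_u:system_r:unconfined_service_t:s0   2345 ?        00:10:12 kubelet
"""

if __name__ == "__main__":
    for label in find_process_labels(SAMPLE_PS, "kubelet"):
        print(label, "->", selinux_type(label))
```

In practice the input could come from `subprocess.run(["ps", "-AZ"], capture_output=True, text=True).stdout` on the node.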
Same issue for all 3 device plugins (SGX, GPU, QAT). We need to work with RH to resolve this regression/bug.
Actively being looked at by Red Hat: https://issues.redhat.com/browse/OCPBUGS-20022
Thanks @mregmi!
@tsadowsk Which Intel Device Plugin Operator version are you using, and where did you get the images from? Did you build it from IDPO upstream?
@mregmi We are using Intel Device Plugin 0.26.1 from the alpha channel provided by Operator Hub. We haven't built it on our own, so this is the default Intel Device Plugin Operator without customizations.
We are waiting for a container SELinux patch to show up in an OCP 4.13.z and 4.14.z release.
@hershpa I checked the related issue, https://issues.redhat.com/browse/OCPBUGS-20022, and it looks like the blocker for this ticket, the podman change containers/container-selinux#277, has been merged. I noticed this in the Red Hat ticket.
Waiting on Red Hat for visibility into which 4.13.z and 4.14.z releases will carry the patch.
Given that there's no LZ for this fix in upstream OCP, what's the suggested workaround for 4.13.11 (latest supported z per README)?
Still waiting on the fix to propagate to OCP 4.13 and 4.14 (https://issues.redhat.com/browse/OCPBUGS-20022).
Issue root cause: Kubelet is running with the wrong label in OCP 4.13 and higher.
Workaround: Since the kubelet is running with the wrong label in OCP 4.13 and beyond, we need to run SELinux in permissive mode as a workaround. To do this, run the following command on all the nodes.
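The command from the original comment is not preserved in this copy of the thread, and the sketch below does not try to reproduce it. It only shows one way to confirm which SELinux mode a node is currently in, by reading the `enforce` file that selinuxfs exposes at /sys/fs/selinux/enforce ("1" for enforcing, "0" for permissive). The function names are assumptions made for illustration.

```python
from pathlib import Path


def mode_from_enforce_value(text):
    """Map the content of the selinuxfs `enforce` file to a mode name."""
    return "enforcing" if text.strip() == "1" else "permissive"


def selinux_mode(enforce_path="/sys/fs/selinux/enforce"):
    """Return 'enforcing', 'permissive', or 'disabled' for the current node.

    Assumption: a missing file is treated as SELinux being disabled or
    selinuxfs not being mounted.
    """
    path = Path(enforce_path)
    if not path.exists():
        return "disabled"
    return mode_from_enforce_value(path.read_text())


if __name__ == "__main__":
    print(selinux_mode())
```

Running this on each node before and after applying the workaround gives a quick sanity check that the mode actually changed.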
This is fixed in 4.14.10.
Just a note here; even with SELinux set to "permissive" pods accessing the GPU through This probably was resolved in this thread (intel/intel-device-plugins-for-kubernetes#1377), but the script that is used to check which device is usable (https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/render-device.sh) will always have the first device writable, so it's mostly useless for two GPUs on a single node.
/cc @tkatila for visibility
Signed-off-by: vbedida79 <[email protected]>
device_plugins: Remove workaround for fixed issue #113
This issue is fixed in the 1.2.0 release: https://github.com/intel/intel-technology-enabling-for-openshift/releases/tag/v1.2.0
Summary
During installation of the Intel SGX Device Plugin, an error occurs that reports a lack of access permissions for the kubelet.sock socket from the intel-sgx-plugin pod. This error happens in OpenShift 4.13 and was not present in OpenShift 4.12.
Detail
During installation of the Intel SGX Device Plugin, the following error occurs:
As a workaround, I added privileged access rights for the DaemonSet/Pod by using the command line below:
After replacing:
with:
the plugin started working. Most likely, such privilege escalation is not needed and access can be limited to only the necessary privileges.
Resolving this issue would be very helpful/beneficial because 4.13 is a current version of OpenShift, and this plugin works without any issues in OpenShift 4.12, which is a previous version.
Also, it would be great to make sure that such issue does not occur for the upcoming OpenShift version 4.14. Many thanks in advance!
Update as of Dec 14, 2023, from @mregmi's latest comment:
Still waiting on the fix to propagate to OCP 4.13 and 4.14 (https://issues.redhat.com/browse/OCPBUGS-20022).
Workaround:
Since the kubelet is running with the wrong label on OCP 4.13 and beyond, we need to run SELinux in permissive mode as a workaround. To do this, please run the following command on all the nodes.
Example output: