Fix GPU with cgroups devices #429

Open
wants to merge 1 commit into
base: develop

wpoely86
Contributor

If you build torque with both GPU support and cgroups, it doesn't work: the devices are not whitelisted correctly in the cgroups.

The core of the problem is the difference between `initialize_hwloc_topology` and `cg_initialize_hwloc_topology`. The code assumes that calling just one of them is sufficient, which is not true: `read_all_devices` is only called from `cg_initialize_hwloc_topology`. I've merged the two functions into one, and the differences between them are now handled by macros. I'm not sure why this function was duplicated in the first place; copy-pasting code like that only invites problems, as this issue shows.
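
To illustrate the shape of the change, here is a minimal sketch and not the actual patch. The macro `SKETCH_WITH_CGROUPS` and the helper `load_topology` are hypothetical stand-ins, the real hwloc calls are stubbed out, and gating `read_all_devices` on a single cgroups macro is an assumption about how the merged function is arranged.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical build switch standing in for the real configure-time macro. */
#define SKETCH_WITH_CGROUPS 1

static bool topology_loaded = false;
static int device_scan_count = 0;

/* Stub for the hwloc topology setup that both original functions shared. */
static int load_topology(void)
{
  topology_loaded = true;
  return 0;
}

/* Stub for read_all_devices: it needs a loaded topology to enumerate devices. */
static int read_all_devices(void)
{
  assert(topology_loaded);
  device_scan_count++;
  return 0;
}

/* Single merged entry point: the shared setup always runs, and the
 * device scan is compiled in only when cgroups support is enabled. */
int initialize_hwloc_topology(void)
{
  if (load_topology() != 0)
    return -1;

#if SKETCH_WITH_CGROUPS
  if (read_all_devices() != 0)
    return -1;
#endif

  return 0;
}
```

With only one function left, a caller can no longer pick the variant that skips the device scan, which is the failure mode described above.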

Please backport this to 6.1.1.1.

Both functions differ only in a call to `read_all_devices`, and it was
assumed that it was sufficient to call only one of them. This is not
the case (because `read_all_devices` is required with GPUs and
cgroups).

The difference between the two functions is now handled by macros.
@wpoely86
Contributor Author

@acvizi It would be nice if you could also merge this one. Without it we cannot get torque & GPUs to work.
