Fix GPU with cgroups devices #429

wpoely86 · 2017-07-21T08:11:50Z

If you build torque with GPU support and cgroups, it doesn't work: the whitelisting of devices in cgroups is not done correctly.
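For context: in the cgroups v1 `devices` controller, a whitelist entry written to a cgroup's `devices.allow` file has the form `TYPE MAJOR:MINOR ACCESS`, for example `c 195:0 rwm` for `/dev/nvidia0`. The following C sketch is not taken from Torque. It shows how such an entry can be derived from a device node using `stat(2)` and the `major()`/`minor()` macros. `format_allow_entry` is a hypothetical helper name.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Hypothetical helper: build a cgroups v1 devices.allow line
 * ("c MAJOR:MINOR rwm" or "b MAJOR:MINOR rwm") for the device node at
 * path. Returns 0 on success, -1 if the path cannot be stat'ed, is not
 * a device node, or the line does not fit in buf. */
int format_allow_entry(const char *path, char *buf, size_t len)
{
  struct stat st;
  char type;
  int n;

  if (stat(path, &st) != 0)
    return -1;

  if (S_ISCHR(st.st_mode))
    type = 'c';          /* character device, e.g. /dev/nvidia0 */
  else if (S_ISBLK(st.st_mode))
    type = 'b';          /* block device */
  else
    return -1;           /* not a device node */

  n = snprintf(buf, len, "%c %u:%u rwm", type,
               (unsigned)major(st.st_rdev), (unsigned)minor(st.st_rdev));
  return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A daemon would write one such line into the job cgroup's `devices.allow` file for each device the job is allowed to use.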

The core of the problem is the difference between `initialize_hwloc_topology` and `cg_initialize_hwloc_topology`. The code assumes that calling just one of them is sufficient, which is not true: `read_all_devices` is only called from `cg_initialize_hwloc_topology`. I've merged the two functions into one, and the differences between them are now handled by macros. I'm not sure why this function was duplicated in the first place; copy-pasting code like that only leads to problems, as this issue shows.

Please backport this to 6.1.1.1.

Both functions differ only in a call to `read_all_devices`, and it was assumed that calling only one of them was sufficient. That is not the case, because `read_all_devices` is required when using GPUs with cgroups. The difference between the two functions is now handled by macros.
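A minimal C sketch of the merged approach described above, under stated assumptions. `initialize_hwloc_topology`, `cg_initialize_hwloc_topology` and `read_all_devices` are the names used in this PR, but the bodies here are stubs. `load_hwloc_topology`, `devices_read` and the `ENABLE_CGROUPS` macro are hypothetical placeholders, not Torque's actual code or configure macro.

```c
/* Hypothetical stand-in for Torque's configure-time cgroups switch;
 * the real macro name in the Torque sources may differ.
 * Enable it with -DENABLE_CGROUPS at compile time. */

/* Set by the stub below so callers can see whether devices were read. */
int devices_read = 0;

/* Stub for read_all_devices(): the real function enumerates devices
 * (such as GPUs) so they can be whitelisted in the job's cgroup. */
int read_all_devices(void)
{
  devices_read = 1;
  return 0;
}

/* Hypothetical stub for the hwloc setup shared by both original functions. */
static int load_hwloc_topology(void)
{
  return 0;
}

/* One initializer instead of two copy-pasted variants: the only
 * difference between them is compiled in or out with a macro. */
int initialize_hwloc_topology(void)
{
  if (load_hwloc_topology() != 0)
    return -1;

#ifdef ENABLE_CGROUPS
  /* Previously only cg_initialize_hwloc_topology() did this, so a build
   * that called the other variant never populated the device list. */
  if (read_all_devices() != 0)
    return -1;
#endif

  return 0;
}
```

Built with `-DENABLE_CGROUPS`, the single function also reads the device list; built without it, it behaves like the old non-cgroups variant. Keeping one definition means a future change cannot land in one copy and be missed in the other.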

wpoely86 · 2018-03-16T08:00:02Z

@acvizi It would be nice if you could also merge this one. Without it we cannot get torque & GPUs to work.

wpoely86 mentioned this pull request Aug 27, 2017

How to configure torque with GPU? #431

Closed
