TorchX support for local NVMe drives not available (with PVs/PVCs for local NVMe drives) - feature required for expected performance with multi-node training #201
Comments
Thanks @dfeddema! For direct use of the TorchX CLI, I have tested formatting multiple mounts with the argument. @MichaelClifford, can you help us understand the correct syntax for multiple mounts through the CodeFlare SDK DDPJobDefinition?
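For reference, a minimal sketch of how TorchX itself appears to parse mount options, based on my reading of torchx.specs.api.parse_mounts (treat the import path and exact behavior as assumptions): it takes a flat list of key=value strings and starts a new mount at every type= entry, so multiple mounts are simply concatenated onto one list.

# Sketch, not a confirmed answer: torchx parses mounts from a flat list of
# "key=value" strings and begins a new mount at each "type=" entry.
from torchx.specs.api import parse_mounts

flat_mounts = [
    "type=volume", "src=dianes-amazing-pvc0", "dst=/init",
    "type=volume", "src=dianes-amazing-pvc1", "dst=/init",
    "type=volume", "src=dianes-amazing-pvc2", "dst=/init",
]
print(parse_mounts(flat_mounts))  # expected: three VolumeMount entries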
@Sara-KS I also tested the syntax you show above:
# Import pieces from codeflare-sdk
arg_list = [ ... ]
jobdef = DDPJobDefinition( ... )
@dfeddema That formatting is for direct use of the TorchX CLI, and it should include multiple mounts in the generated YAML. I recommend running it as a dry run and sharing the output here. @MichaelClifford, is the following the appropriate change needed to switch to dry-run mode with the CodeFlare SDK?
@Sara-KS yes.
There is a parameter for this, and the syntax should be similar to how we handle `script_args`. So something like:
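A sketch of what that script_args-style flat mounts list might look like on the DDPJobDefinition; this is an assumption about the intended syntax rather than a confirmed SDK example, reusing the names and values from the issue below:

# Hypothetical sketch (unconfirmed): pass mounts as one flat list of strings,
# analogous to script_args, instead of a list of lists. dst is given without
# embedded quotes here, which is also an assumption.
from codeflare_sdk.job.jobs import DDPJobDefinition

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=[
        "type=volume", "src=dianes-amazing-pvc0", "dst=/init",
        "type=volume", "src=dianes-amazing-pvc1", "dst=/init",
        "type=volume", "src=dianes-amazing-pvc2", "dst=/init",
    ],
)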
Never mind, sorry @dfeddema, I just fully read your last comment and see that you still got errors with that approach.
Is there some way to get the YAML for this? I want to see how codeflare_sdk is specifying the local volumes when I specify them with this syntax in the jobdef.
Since you are not using a Ray cluster for this, I think you need to do the following to see the dry_run output.
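A minimal sketch of what that no-cluster dry run might look like; the _dry_run_no_cluster() helper name and the shape of the returned object are assumptions based on the codeflare-sdk versions I have looked at, so verify against the installed SDK:

# Hedged sketch: dry-run a DDPJobDefinition without a Ray cluster and print the
# generated scheduler request. _dry_run_no_cluster() and .request are assumed
# to exist in the installed codeflare-sdk / torchx versions.
from codeflare_sdk.job.jobs import DDPJobDefinition

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
)

dry_run_info = jobdef._dry_run_no_cluster()
print(dry_run_info.request)  # the rendered resource that would be submitted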
@MichaelClifford I tried your example above and it didn't produce any output. Maybe I need to import a module that generates this dry run? I see:
@MichaelClifford @Sara-KS I tried the mkfs from the node, and it works if you specify /dev/nvme4n1. This works: three pods are created and distributed model training runs as expected.
# Import pieces from codeflare-sdk
arg_list = [ ... ]
jobdef = DDPJobDefinition( ... )
The type=device approach produced output showing that no filesystem is mounted on /dev/nvme4n1 (the local NVMe drive we specified, where /init should have been mounted).
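For what it's worth, a small illustration of why that is expected with type=device, assuming TorchX's DeviceMount semantics (field names here are an assumption): a device mount passes the raw block device into the container and does not create or mount a filesystem on it.

# Hedged illustration: "type=device" maps to a raw device mount, not a
# filesystem mount, so nothing appears at /init. Field names are assumptions.
from torchx.specs.api import DeviceMount

device_mount = DeviceMount(
    src_path="/dev/nvme4n1",   # host block device
    dst_path="/dev/nvme4n1",   # exposed unchanged inside the container
    permissions="rwm",
)
print(device_mount)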
PVCs, specified in this way:
If you generate an appwrapper that repeats this section for each of the PVCs (e.g. dianes-amazing-pvc0, dianes-amazing-pvc1, dianes-amazing-pvc2), I think we would have what we need for this multi-node training run with local NVMe drives. Note: all of the pods need to mount /init on each of the local NVMe drives, because the code is identical on each node (same copy of ResNet50).
I manually created appwrapper YAML which allows me to have the same training code on each node, accessing the training data in /init, which is mounted on a local NVMe drive. This solution works but is not ideal, because I am required to copy the data into /init once the job is running; I can't pre-stage it on the NVMe drives before each run.
I'm running multi-node training of ResNet50 with TorchX, codeflare-sdk, and MCAD on OCP 4.12.
I have a 3-node OCP 4.12 cluster; each node has one NVIDIA GPU.
Each of the 3 worker nodes has one local 2.9TB NVMe drive and an associated PV and PVC.
[root@e23-h21-740xd ResNet]# oc get pv | grep 2980Gi
local-pv-289604ff 2980Gi RWO Delete Bound default/dianes-amazing-pvc2 local-sc 18h
local-pv-8006a340 2980Gi RWO Delete Bound default/dianes-amazing-pvc1 local-sc 4d1h
local-pv-86bac87f 2980Gi RWO Delete Bound default/dianes-amazing-pvc0 local-sc 21h
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
dianes-amazing-pvc0 Bound local-pv-86bac87f 2980Gi RWO local-sc 20h
dianes-amazing-pvc1 Bound local-pv-8006a340 2980Gi RWO local-sc 20h
dianes-amazing-pvc2 Bound local-pv-289604ff 2980Gi RWO local-sc 18h
When I run the following Python script with "python3 python-multi-node-pvc.py":
[ ResNet]# cat python-multi-node-pvc.py
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition
arg_list = [
"--train-dir=/init/tiny-imagenet-200/train",
"--val-dir=/init/tiny-imagenet-200/val",
"--log-dir=/init/tiny-imagenet-200",
"--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]
jobdef = DDPJobDefinition(
name="resnet50",
script="pytorch/pytorch_imagenet_resnet50.py",
script_args=arg_list,
scheduler_args={"namespace": "default"},
j="3x1",
gpu=1,
cpu=4,
memMB=24000,
image="quay.io/dfeddema/horovod",
mounts=[['type=volume','src=dianes-amazing-pvc0','dst="/init"'],['type=volume','src=dianes-amazing-pvc1','dst="/init"'],['type=volume','src=dianes-amazing-pvc2','dst="/init"']]
)
job = jobdef.submit()
I get the following error:
AttributeError: 'list' object has no attribute 'partition'
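For context, a hedged repro of where this error plausibly originates, assuming the mounts value reaches TorchX's parse_mounts, which calls .partition("=") on each element; a nested list of lists fails in exactly this way:

# Hedged repro sketch: parse_mounts expects strings, so a list nested inside
# the mounts list raises AttributeError on .partition("=").
from torchx.specs.api import parse_mounts

nested_mounts = [["type=volume", "src=dianes-amazing-pvc0", 'dst="/init"']]
try:
    parse_mounts(nested_mounts)
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'partition'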
If I specify only one of the local NVMe drives, as shown below:
[ ResNet]# cat test_resnet3_single_PVC.py
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition
arg_list = [
"--train-dir=/init/tiny-imagenet-200/train",
"--val-dir=/init/tiny-imagenet-200/val",
"--log-dir=/init/tiny-imagenet-200",
"--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]
jobdef = DDPJobDefinition(
name="resnet50",
script="pytorch/pytorch_imagenet_resnet50.py",
script_args=arg_list,
scheduler_args={"namespace": "default"},
j="3x1",
gpu=1,
cpu=4,
memMB=24000,
image="quay.io/dfeddema/horovod",
mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"']
)
job = jobdef.submit()
Then one pod starts up successfully and is assigned its local volume. The other two pods are not scheduled because
"0/3 nodes are available: 1 Insufficient nvidia.com/gpu, 2 nodes(s) had volume node affinity conflict. preemption 0/3 are available... etc etc"
So you can see that as expected the 2nd and 3rd pods could not be scheduled because "2 nodes(s) had volume node affinity conflict".
I need a way to specify the local NVMe drive (and associated PVC) for each of the nodes in my cluster. The training data resides on these local NVMe drives.