
Torchx support for local NVME drives not available (with PV/PVCs for local NVME drives) - feature required for expected performance with multi-node training #201

Open
dfeddema opened this issue Jul 7, 2023 · 12 comments

Comments


dfeddema commented Jul 7, 2023

I'm running multi-node training of ResNet50 with TorchX, the CodeFlare SDK, and MCAD on OCP 4.12.

I have a 3-node OCP 4.12 cluster; each node has one NVIDIA GPU.
Each of the 3 worker nodes has one local 2.9 TB NVMe drive with an associated PV and PVC.

[root@e23-h21-740xd ResNet]# oc get pv | grep 2980Gi
local-pv-289604ff 2980Gi RWO Delete Bound default/dianes-amazing-pvc2 local-sc 18h
local-pv-8006a340 2980Gi RWO Delete Bound default/dianes-amazing-pvc1 local-sc 4d1h
local-pv-86bac87f 2980Gi RWO Delete Bound default/dianes-amazing-pvc0 local-sc 21h

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
dianes-amazing-pvc0 Bound local-pv-86bac87f 2980Gi RWO local-sc 20h
dianes-amazing-pvc1 Bound local-pv-8006a340 2980Gi RWO local-sc 20h
dianes-amazing-pvc2 Bound local-pv-289604ff 2980Gi RWO local-sc 18h

When I run the following Python script ("python3 python-multi-node-pvc.py"):

[ ResNet]# cat python-multi-node-pvc.py
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition

arg_list = [
    "--train-dir=/init/tiny-imagenet-200/train",
    "--val-dir=/init/tiny-imagenet-200/val",
    "--log-dir=/init/tiny-imagenet-200",
    "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=[['type=volume','src=dianes-amazing-pvc0','dst="/init"'],['type=volume','src=dianes-amazing-pvc1','dst="/init"'],['type=volume','src=dianes-amazing-pvc2','dst="/init"']]
)
job = jobdef.submit()

I get the following error:
AttributeError: 'list' object has no attribute 'partition'
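
For context, that traceback suggests each element of mounts is expected to be a plain string that the parser splits with str.partition("="); a nested list has no such method. A minimal sketch of my guess at the failure mode (my own illustration, not the actual TorchX code):

# Each mounts element is presumably expected to be a "key=value" string like 'type=volume';
# calling .partition on a nested list reproduces the same AttributeError.
opt = ['type=volume', 'src=dianes-amazing-pvc0', 'dst="/init"']
opt.partition("=")  # raises AttributeError: 'list' object has no attribute 'partition'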


If I specify only one of the local NVME drives as shown below:

[ ResNet]# cat test_resnet3_single_PVC.py
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition

arg_list = [
    "--train-dir=/init/tiny-imagenet-200/train",
    "--val-dir=/init/tiny-imagenet-200/val",
    "--log-dir=/init/tiny-imagenet-200",
    "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"']
)
job = jobdef.submit()

Then one pod starts up successfully and is assigned its local volume. The other two pods are not scheduled because
"0/3 nodes are available: 1 Insufficient nvidia.com/gpu, 2 nodes(s) had volume node affinity conflict. preemption 0/3 are available... etc etc"

So, as expected, the 2nd and 3rd pods could not be scheduled because "2 nodes(s) had volume node affinity conflict".
I need a way to specify the local NVMe drive (and associated PVC) for each of the nodes in my cluster. The training data resides on these local NVMe drives.

dfeddema changed the title from "Torchx support for local NVME drives not available (with PV/PVCs for local drives) - feature required for expected performance with multi-node training" to "Torchx support for local NVME drives not available (with PV/PVCs for local NVME drives) - feature required for expected performance with multi-node training" on Jul 7, 2023

Sara-KS commented Jul 7, 2023

Thanks @dfeddema!

For the direct use of the TorchX CLI, I have tested formatting multiple mounts with the argument --mount type=volume,src=foo-pvc,dst="/foo",type=volume,src=bar-pvc,dst="/bar".

@MichaelClifford can you help us understand the correct syntax for multiple mounts through the CodeFlare SDK DDPJobDefinition?


dfeddema commented Jul 10, 2023

@Sara-KS I also tested the syntax you show above,
--mount type=volume,src=foo-pvc,dst="/data",type=volume,src=bar-pvc,dst="/data", and it did not work for me.
Here's what I used, which is similar to your example, and it failed (note: if I remove the quotes in the mount args I get errors):

# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition

arg_list = [
    "--train-dir=/init/tiny-imagenet-200/train",
    "--val-dir=/init/tiny-imagenet-200/val",
    "--log-dir=/init/tiny-imagenet-200",
    "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"','type=volume','src=dianes-amazing-pvc1','dst="/init"','type=volume','src=dianes-amazing-pvc2','dst="/init"']
)
job = jobdef.submit()


Sara-KS commented Jul 10, 2023

@dfeddema That formatting is for direct use of the TorchX CLI, and it should include multiple mounts in the generated YAML. I recommend running it as a dry run and sharing the output here. @MichaelClifford Is the following the appropriate change needed to switch to dry-run mode with the CodeFlare SDK?

jobdef = DDPJobDefinition(

to

jobdef = DDPJobDefinition._dry_run(

MichaelClifford (Collaborator) commented

@Sara-KS yes,
DDPJobDefinition._dry_run(cluster) will generate the dry_run output.

MichaelClifford (Collaborator) commented

There is a parameter in DDPJobDefinition() that allows you to define mounts.

See https://github.com/project-codeflare/codeflare-sdk/blob/baec8585b2bd918becd030951bf43e3504d43ada/src/codeflare_sdk/job/jobs.py#L62C11-L62C11

The syntax should be similar to how we handle `script_args`. So something like:

mounts = ['type=volume',
          'src=foo-pvc',
          'dst="/foo"',
          'type=volume',
          'src=bar-pvc',
          'dst="/bar"']
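
For reference, here is a rough sketch (my own illustration, not TorchX's actual code) of how a flat option list like this would presumably get grouped into individual mounts, splitting on each 'type=' entry:

# Hypothetical grouping logic, shown only to illustrate the expected flat-list form.
opts = ['type=volume', 'src=foo-pvc', 'dst=/foo',
        'type=volume', 'src=bar-pvc', 'dst=/bar']
mounts, current = [], None
for opt in opts:
    key, _, val = opt.partition("=")   # each element must be a plain "key=value" string
    if key == "type":                  # a new "type=" entry starts a new mount
        current = {}
        mounts.append(current)
    current[key] = val
# mounts == [{'type': 'volume', 'src': 'foo-pvc', 'dst': '/foo'},
#            {'type': 'volume', 'src': 'bar-pvc', 'dst': '/bar'}]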

MichaelClifford (Collaborator) commented

Never mind, sorry @dfeddema, I just fully read your last comment and see that you still got errors with that approach.


dfeddema commented Jul 10, 2023

Is there some way to get the YAML for this? I want to see how codeflare_sdk is specifying the local volumes when I specify them like this in the jobdef:
mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"','type=volume','src=dianes-amazing-pvc1','dst="/init"','type=volume','src=dianes-amazing-pvc2','dst="/init"']

This syntax, DDPJobDefinition._dry_run(cluster), doesn't help me because I'm not using a Ray cluster. When I try to use dry run as shown below, I get the error "TypeError: _dry_run() got an unexpected keyword argument 'name'":

# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition

arg_list = [
    "--train-dir=/init/tiny-imagenet-200/train",
    "--val-dir=/init/tiny-imagenet-200/val",
    "--log-dir=/init/tiny-imagenet-200",
    "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]

jobdef = DDPJobDefinition._dry_run(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"','type=volume','src=dianes-amazing-pvc1','dst="/init"','type=volume','src=dianes-amazing-pvc2','dst="/init"']
)
job = jobdef.submit()

MichaelClifford (Collaborator) commented

Since you are not using a Ray cluster for this, I think you need to do the following to see the dry_run output.

jobdef = DDPJobDefinition(name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"','type=volume','src=dianes-amazing-pvc1','dst="/init"','type=volume','src=dianes-amazing-pvc2','dst="/init"']
)

jobdef._dry_run_no_cluster()

dfeddema (Author) commented

@MichaelClifford I tried your example above and it didn't produce any output. Maybe I need to import a module that generates this dry run? I see
from torchx.specs import AppDryRunInfo, AppDef, but that's not the right module. Is there something similar I'm missing? Is there a DryRun module I need to import?
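
Or does the call just return something that has to be printed? A sketch of what I could try next, assuming (unverified) that _dry_run_no_cluster() returns a torchx AppDryRunInfo object rather than printing anything itself:

info = jobdef._dry_run_no_cluster()
print(type(info))  # check what the call actually returns
print(info)        # if it is an AppDryRunInfo, printing it should show the formatted scheduler request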


dfeddema commented Jul 12, 2023

@MichaelClifford @Sara-KS
Thanks Sarah for your suggestion to use type=device. A fully qualified path for the device works, i.e.
mounts=['type=device','src=nvme4n1','dst="/init", "perm=rw"']
works with a slight tweak (src=/dev/nvme4n1). At first, when I used 'src=nvme4n1', I was getting this error:
Error: failed to mkdir nvme4n1: mkdir nvme4n1: operation not permitted

I also tried mkfs from the node, and it works if you specify /dev/nvme4n1:
oc debug node/e27-h13-r750
mkfs.xfs -f /dev/nvme4n1

This works. Three pods are created and distributed model training runs as expected:

# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition

arg_list = [
    "--train-dir=/init/tiny-imagenet-200/train",
    "--val-dir=/init/tiny-imagenet-200/val",
    "--log-dir=/init/tiny-imagenet-200",
    "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=['type=device','src=/dev/nvme4n1','dst="/init", "perm=rw"']
)
job = jobdef.submit()

@dfeddema
Copy link
Author

dfeddema commented Jul 24, 2023

The type=device approach, mounts=['type=device','src=/dev/nvme4n1','dst="/init", "perm=rw"'], appeared to work but did not actually solve the problem.
There was a RUN mkdir /init in my container image (Dockerfile) that created the directory, and the training data was copied into it by the COPY tiny-imagenet-200 /init/tiny-imagenet-200 command that followed. Hence the test appeared to work, but mount showed that /init was mounted on devtmpfs.

root@resnet50-d005br4gsdvjw-2:/horovod/examples# mount | grep init
devtmpfs on /"/init", "perm=rw" type devtmpfs (rw,nosuid,seclabel,size=131726212k,nr_inodes=32931553,mode=755)

The lsblk output below shows that nothing is mounted from /dev/nvme4n1 (the local NVMe drive we specified, where /init should have been mounted):

# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda       8:0    0 447.1G  0 disk 
|-sda1    8:1    0     1M  0 part 
|-sda2    8:2    0   127M  0 part 
|-sda3    8:3    0   384M  0 part 
`-sda4    8:4    0 446.6G  0 part /dev/termination-log
sr0      11:0    1   1.1G  0 rom  
nvme2n1 259:0    0   1.5T  0 disk 
nvme3n1 259:1    0   1.5T  0 disk 
nvme4n1 259:2    0   2.9T  0 disk 
nvme5n1 259:3    0   1.5T  0 disk 
nvme1n1 259:4    0   1.5T  0 disk 
nvme0n1 259:5    0   1.5T  0 disk

PVCs specified in this way, mounts=['type=volume','src=dianes-amazing-pvc0','dst=/init', 'type=volume','src=dianes-amazing-pvc1','dst=/init', 'type=volume','src=dianes-amazing-pvc2','dst=/init'], provide a more general solution to the problem than the mounts=['type=device','src=/dev/nvme4n1','dst="/init"'] solution.

If you generate an AppWrapper that repeats the following section for each of the PVCs (e.g. dianes-amazing-pvc0, dianes-amazing-pvc1, dianes-amazing-pvc2):

         volumes:
          - emptyDir:
              medium: Memory
            name: dshm
          - name: mount-0
            persistentVolumeClaim:
              claimName: dianes-amazing-pvc0


              volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /init
                name: mount-0
                readOnly: false

then I think we would have what we need for this multi-node training run with local NVMe drives.

Note: all of the pods need to mount /init from their node's local NVMe drive, because the code is identical on each node (same copy of ResNet50).
The current syntax requires that each mounted filesystem have a different destination path, e.g. /init0, /init1, /init2 (see the sketch below).
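
To illustrate that constraint, here is a sketch of how the flat mounts list could be built with one PVC per entry and a distinct (hypothetical) destination per mount; note that this still doesn't give every pod the same /init path, which is what the training script expects:

# Sketch only: build a flat TorchX-style mounts list with one PVC per entry
# and distinct hypothetical destination paths /init0, /init1, /init2.
pvcs = ["dianes-amazing-pvc0", "dianes-amazing-pvc1", "dianes-amazing-pvc2"]
mounts = []
for i, pvc in enumerate(pvcs):
    mounts += ["type=volume", f"src={pvc}", f"dst=/init{i}"]
# mounts == ['type=volume', 'src=dianes-amazing-pvc0', 'dst=/init0',
#            'type=volume', 'src=dianes-amazing-pvc1', 'dst=/init1',
#            'type=volume', 'src=dianes-amazing-pvc2', 'dst=/init2']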


dfeddema commented Jul 25, 2023

I manually created AppWrapper YAML which allows the same training code on each node to access the training data in /init, mounted on a local NVMe drive. This solution works but is not ideal, because I have to copy the data into /init once the job is running; I can't pre-stage it to the NVMe drives before each run.
