Fail in param benchmark #60

Open

avildema opened this issue Jan 18, 2022 · 0 comments
environment:

ucx: master
ucc: master
cuda: cuda11.2
gcc: gcc-9.2.0
pytorch: nightly 
ompi: v5.0.0rc2
hosts: 2

cmd:
$OMPI_HOME/bin/mpirun -np 2 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 --map-by node --bind-to core -x UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES="" -x UCX_TLS=rc -x UCX_RNDV_THRESH=0 -x UCC_TLS=ucp $EXE --device cuda --backend ucc --c 1 --b 1K --e 1M --collective reduce_scatter

where $EXE is:
comms/pt/comms.py --device cuda --backend ucc --c 1 --b 1K --e 1M --collective reduce_scatter

note: applies to all collectives (the output is the same for each)
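
For reference, the failing call is the small dist.all_reduce that complete_accel_ops() issues on a CUDA tensor over the "ucc" process group. Below is a minimal standalone sketch of that path (not part of the original report and untested in this exact environment); the environment-variable names assume an Open MPI launch, and the torch_ucc import assumes the torch-ucc plugin registers the "ucc" backend.

```python
# Hypothetical reproducer sketch; launch with e.g. mpirun -np 2 python repro.py.
import os

import torch
import torch.distributed as dist
import torch_ucc  # noqa: F401  -- registers the "ucc" backend with torch.distributed

rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Pin this process to its local GPU before any CUDA work happens.
torch.cuda.set_device(local_rank)

dist.init_process_group(backend="ucc", rank=rank, world_size=world_size)

# Same pattern the benchmark uses in complete_accel_ops(): a tiny all_reduce
# on a CUDA tensor, effectively acting as a flush/barrier on the device.
temp = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(temp)
torch.cuda.synchronize()
print(f"rank {rank}/{world_size}: all_reduce completed, value={temp.item()}")

dist.destroy_process_group()
```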

output:

MPI environment: {'world_size': 2, 'local_size': 1, 'global_rank': 0, 'local_rank': 0} 
	 backend: ucc nw-stack: pytorch-dist mode: comms args.b: 1024 args.e: 1048576 args.f: 2 args.z: 1 args.master_ip: 127.0.0.1
[Rank   0] host swx-dgx01.swx.labs.mlnx, device: cuda:0, local_rank: 0 world_size: 2, master_ip: 127.0.0.1
[Rank   0] allSizes: [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576] local_rank: 0 element_size: 4
	 collective=reduce_scatter, src_ranks=[], dst_ranks=[]
[Rank   1] host swx-dgx02.swx.labs.mlnx, device: cuda:0, local_rank: 0 world_size: 2, master_ip: 127.0.0.1

	COMMS-RES       size (B)  nElementsPerRank   Latency(us):p50         p75         p95         Min         Max    AlgBW(GB/s) BusBW(GB/s)
Traceback (most recent call last):
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1221, in <module>
    main()
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1217, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1170, in runBench
    backendObj.benchmark_comms()
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1099, in benchTime
    comm_fn_pair=collectiveFunc_pair,
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 243, in runColl
    self.backendFuncs.complete_accel_ops(self.collectiveArgs, initOp=True)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 406, in complete_accel_ops    dist.all_reduce(temp)
  File "/hpc/local/oss/python/torch/distributed/distributed_c10d.py", line 1312, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: [src/torch_ucc.cpp:543] [Rank 0][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 0
[E torch_ucc.cpp:1262] [Rank 1][ProcessGroupUCC-0][INIT][ERROR] ucc communicator was initialized with different cuda device,multi device is not supported
Traceback (most recent call last):
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1221, in <module>
    main()
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1217, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1170, in runBench
    backendObj.benchmark_comms()
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms    self.commsParams.benchTime(index, self.commsParams, self)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1099, in benchTime
    comm_fn_pair=collectiveFunc_pair,
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 243, in runColl
    self.backendFuncs.complete_accel_ops(self.collectiveArgs, initOp=True)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 406, in complete_accel_ops
    dist.all_reduce(temp)
  File "/hpc/local/oss/python/torch/distributed/distributed_c10d.py", line 1312, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: Operation is not supported
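
Both ranks fail on that first internal all_reduce, before the requested reduce_scatter ever runs: rank 0 reports that ProcessGroupUCC could not initialize the CUDA collective ("Operation is not supported"), and rank 1 reports that the ucc communicator was initialized with a different CUDA device (multi-device is not supported). As a hypothetical triage step (not from the original report), a per-rank check like the one below can confirm that each process stays pinned to the cuda:0 device shown in the banner; the environment-variable name assumes an Open MPI launch.

```python
# Hypothetical per-rank device check; assumes the "ucc" process group is
# already initialized (e.g. drop it in right before complete_accel_ops()).
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
print(
    f"rank {dist.get_rank()}: current_device=cuda:{torch.cuda.current_device()}, "
    f"expected=cuda:{local_rank}"
)
```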