environment:
cmd:
$OMPI_HOME/bin/mpirun -np 2 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 --map-by node --bind-to core -x UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES="" -x UCX_TLS=rc -x UCX_RNDV_THRESH=0 -x UCC_TLS=ucp $EXE --device cuda --backend ucc --c 1 --b 1K --e 1M --collective reduce_scatter
comms/pt/comms.py --device cuda --backend ucc --c 1 --b 1K --e 1M --collective reduce_scatter
note: applicable to all collectives (the output is the same)
output:
MPI environment: {'world_size': 2, 'local_size': 1, 'global_rank': 0, 'local_rank': 0}
backend: ucc nw-stack: pytorch-dist mode: comms args.b: 1024 args.e: 1048576 args.f: 2 args.z: 1 args.master_ip: 127.0.0.1
[Rank 0] host swx-dgx01.swx.labs.mlnx, device: cuda:0, local_rank: 0 world_size: 2, master_ip: 127.0.0.1
[Rank 0] allSizes: [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576] local_rank: 0 element_size: 4
collective=reduce_scatter, src_ranks=[], dst_ranks=[]
[Rank 1] host swx-dgx02.swx.labs.mlnx, device: cuda:0, local_rank: 0 world_size: 2, master_ip: 127.0.0.1
COMMS-RES size (B) nElementsPerRank Latency(us):p50 p75 p95 Min Max AlgBW(GB/s) BusBW(GB/s)
Traceback (most recent call last):
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1221, in <module>
main()
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1217, in main
collBenchObj.runBench(comms_world_info, commsParams)
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1170, in runBench
backendObj.benchmark_comms()
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms
self.commsParams.benchTime(index, self.commsParams, self)
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1099, in benchTime
comm_fn_pair=collectiveFunc_pair,
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 243, in runColl
self.backendFuncs.complete_accel_ops(self.collectiveArgs, initOp=True)
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 406, in complete_accel_ops dist.all_reduce(temp)
File "/hpc/local/oss/python/torch/distributed/distributed_c10d.py", line 1312, in all_reduce
work = default_pg.allreduce([tensor], opts)
RuntimeError: [src/torch_ucc.cpp:543] [Rank 0][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 0
[E torch_ucc.cpp:1262] [Rank 1][ProcessGroupUCC-0][INIT][ERROR] ucc communicator was initialized with different cuda device,multi device is not supported
Traceback (most recent call last):
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1221, in <module>
main()
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1217, in main
collBenchObj.runBench(comms_world_info, commsParams)
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1170, in runBench
backendObj.benchmark_comms()
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms self.commsParams.benchTime(index, self.commsParams, self)
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1099, in benchTime
comm_fn_pair=collectiveFunc_pair,
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 243, in runColl
self.backendFuncs.complete_accel_ops(self.collectiveArgs, initOp=True)
File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 406, in complete_accel_ops
dist.all_reduce(temp)
File "/hpc/local/oss/python/torch/distributed/distributed_c10d.py", line 1312, in all_reduce
work = default_pg.allreduce([tensor], opts)
RuntimeError: Operation is not supported
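For reference, the failing path boils down to a plain all_reduce on a CUDA tensor through the ucc process group. Below is a minimal standalone repro sketch, not part of the original report: it assumes the torch_ucc plugin is importable, that the Open MPI environment variables are provided by the same mpirun launch as above, and it reuses the 127.0.0.1 master address seen in the log (adjust for a real multi-node run).

```python
# minimal_ucc_allreduce.py -- hypothetical repro sketch, not from the original report.
# Launch with the same mpirun command as above, substituting this script for $EXE.
import os

import torch
import torch.distributed as dist
import torch_ucc  # noqa: F401 -- importing the plugin registers the "ucc" backend

rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

# Pin this process to a single GPU before creating the process group; the
# Rank 1 error above suggests ProcessGroupUCC rejects collectives issued on a
# CUDA device other than the one the communicator was initialized with.
torch.cuda.set_device(local_rank)

# 127.0.0.1:29500 mirrors the master_ip printed in the log; assumed port.
dist.init_process_group(
    backend="ucc",
    init_method="tcp://127.0.0.1:29500",
    rank=rank,
    world_size=world_size,
)

# Same call path that fails in complete_accel_ops(): an all_reduce on a CUDA tensor.
t = torch.ones(1024, device=f"cuda:{local_rank}")
dist.all_reduce(t)
torch.cuda.synchronize()
print(f"[Rank {rank}] all_reduce finished, t[0] = {t[0].item()}")

dist.destroy_process_group()
```

If this sketch fails with the same "Operation is not supported" error, the problem is in the ucc process group itself rather than in the PARAM benchmark harness.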