Fail in param benchmark #60

Open

avildema opened this issue Jan 18, 2022 · 0 comments
environment:

ucx: master
ucc: master
cuda: cuda11.2
gcc: gcc-9.2.0
pytorch: nightly 
ompi: v5.0.0rc2
hosts: 2

cmd:
$OMPI_HOME/bin/mpirun -np 2 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 --map-by node --bind-to core -x UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES="" -x UCX_TLS=rc -x UCX_RNDV_THRESH=0 -x UCC_TLS=ucp $EXE --device cuda --backend ucc --c 1 --b 1K --e 1M --collective reduce_scatter

where $EXE is:
comms/pt/comms.py --device cuda --backend ucc --c 1 --b 1K --e 1M --collective reduce_scatter

note: applies to all collectives (the output is the same for each)
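
For reference, the failing call is the small dist.all_reduce that complete_accel_ops() issues on a CUDA tensor over the "ucc" process group. Below is a minimal standalone sketch of that path (not part of the original report and untested in this exact environment); the environment-variable names assume an Open MPI launch, and the torch_ucc import assumes the torch-ucc plugin registers the "ucc" backend.

```python
# Hypothetical reproducer sketch; launch with e.g. mpirun -np 2 python repro.py.
import os

import torch
import torch.distributed as dist
import torch_ucc  # noqa: F401  -- registers the "ucc" backend with torch.distributed

rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Pin this process to its local GPU before any CUDA work happens.
torch.cuda.set_device(local_rank)

dist.init_process_group(backend="ucc", rank=rank, world_size=world_size)

# Same pattern the benchmark uses in complete_accel_ops(): a tiny all_reduce
# on a CUDA tensor, effectively acting as a flush/barrier on the device.
temp = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(temp)
torch.cuda.synchronize()
print(f"rank {rank}/{world_size}: all_reduce completed, value={temp.item()}")

dist.destroy_process_group()
```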

output:

MPI environment: {'world_size': 2, 'local_size': 1, 'global_rank': 0, 'local_rank': 0} 
	 backend: ucc nw-stack: pytorch-dist mode: comms args.b: 1024 args.e: 1048576 args.f: 2 args.z: 1 args.master_ip: 127.0.0.1
[Rank   0] host swx-dgx01.swx.labs.mlnx, device: cuda:0, local_rank: 0 world_size: 2, master_ip: 127.0.0.1
[Rank   0] allSizes: [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576] local_rank: 0 element_size: 4
	 collective=reduce_scatter, src_ranks=[], dst_ranks=[]
[Rank   1] host swx-dgx02.swx.labs.mlnx, device: cuda:0, local_rank: 0 world_size: 2, master_ip: 127.0.0.1

	COMMS-RES       size (B)  nElementsPerRank   Latency(us):p50         p75         p95         Min         Max    AlgBW(GB/s) BusBW(GB/s)
Traceback (most recent call last):
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1221, in <module>
    main()
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1217, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1170, in runBench
    backendObj.benchmark_comms()
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1099, in benchTime
    comm_fn_pair=collectiveFunc_pair,
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 243, in runColl
    self.backendFuncs.complete_accel_ops(self.collectiveArgs, initOp=True)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 406, in complete_accel_ops    dist.all_reduce(temp)
  File "/hpc/local/oss/python/torch/distributed/distributed_c10d.py", line 1312, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: [src/torch_ucc.cpp:543] [Rank 0][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 0
[E torch_ucc.cpp:1262] [Rank 1][ProcessGroupUCC-0][INIT][ERROR] ucc communicator was initialized with different cuda device,multi device is not supported
Traceback (most recent call last):
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1221, in <module>
    main()
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1217, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1170, in runBench
    backendObj.benchmark_comms()
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms    self.commsParams.benchTime(index, self.commsParams, self)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 1099, in benchTime
    comm_fn_pair=collectiveFunc_pair,
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/comms.py", line 243, in runColl
    self.backendFuncs.complete_accel_ops(self.collectiveArgs, initOp=True)
  File "/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220114_102333_33221_86747_swx-dgx01.swx.labs.mlnx/installs/92ic/tests/param_repo/param/train/comms/pt/pytorch_dist_backend.py", line 406, in complete_accel_ops
    dist.all_reduce(temp)
  File "/hpc/local/oss/python/torch/distributed/distributed_c10d.py", line 1312, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: Operation is not supported
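
Both ranks fail on that first internal all_reduce, before the requested reduce_scatter ever runs: rank 0 reports that ProcessGroupUCC could not initialize the CUDA collective ("Operation is not supported"), and rank 1 reports that the ucc communicator was initialized with a different CUDA device (multi-device is not supported). As a hypothetical triage step (not from the original report), a per-rank check like the one below can confirm that each process stays pinned to the cuda:0 device shown in the banner; the environment-variable name assumes an Open MPI launch.

```python
# Hypothetical per-rank device check; assumes the "ucc" process group is
# already initialized (e.g. drop it in right before complete_accel_ops()).
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
print(
    f"rank {dist.get_rank()}: current_device=cuda:{torch.cuda.current_device()}, "
    f"expected=cuda:{local_rank}"
)
```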