The PW GPU code with K-point parallelism does not work well #3425

@denghuilu

Describe the bug

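When the PW code runs on a GPU with K-point parallelism, the calculation can crash. In the run below (2 MPI processes, 8 k-points), the job aborts at the first SCF iteration with a cuBLAS assertion, CUBLAS_STATUS_INVALID_VALUE, raised at line 855 of source/module_hsolver/kernels/cuda/math_kernel_op.cu.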

[denghui@LuDh-4090:pw_Si2]$ /usr/bin/mpirun -n 2 /home/denghui/abacus-develop/build/abacus 
hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 2,OpenMP thread number: 2,Total thread number: 4,Local thread limit: 56
                                                                                     
                              ABACUS v3.5.0

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: e135c71cb (Sat Jan 13 11:32:38 2024 +0000)

 Sun Jan 14 11:21:56 2024
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : GPU / NVIDIA GeForce RTX 4090
 UNIFORM GRID DIM        : 36 * 36 * 36
 UNIFORM GRID DIM(BIG)   : 36 * 36 * 36
 DONE(1.49074    SEC) : SETUP UNITCELL
 DONE(1.54515    SEC) : SYMMETRY
 DONE(1.7073     SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  
 1       8               2           
 ---------------------------------------------------------
 Use plane wave basis
 ---------------------------------------------------------
 ELEMENT NATOM       XC          
 Si      2           
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(1.71664    SEC) : INIT PLANEWAVE
 MEMORY FOR PSI (MB)  : 1.78162
 DONE(1.72678    SEC) : LOCAL POTENTIAL
 DONE(1.75068    SEC) : NON-LOCAL POTENTIAL
 DONE(1.7729     SEC) : INIT BASIS
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic
 DONE(1.80318    SEC) : INIT SCF
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
cuBLAS Assert: CUBLAS_STATUS_INVALID_VALUE /home/denghui/abacus-develop/source/module_hsolver/kernels/cuda/math_kernel_op.cu 855
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[53215,1],1]
  Exit code:    7
--------------------------------------------------------------------------
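
For anyone triaging the log above: cuBLAS returns CUBLAS_STATUS_INVALID_VALUE when a call receives an argument that violates its documented preconditions, for example a negative dimension or a leading dimension smaller than the matrix it describes. The log does not say which cuBLAS routine sits at line 855, so the sketch below only assumes a gemm-like call and mirrors the argument rules documented for cublas<t>gemm. The helper name `check_gemm_args` is hypothetical and is not part of ABACUS or cuBLAS.

```cpp
#include <algorithm>
#include <string>

// Hypothetical debugging helper (not an ABACUS or cuBLAS function).
// Checks the size arguments of a column-major gemm, C = op(A) * op(B),
// against the preconditions documented for cublas<t>gemm.
// Returns an empty string when the arguments are acceptable, otherwise
// a short description of the first violated rule.
std::string check_gemm_args(bool trans_a, bool trans_b,
                            int m, int n, int k,
                            int lda, int ldb, int ldc)
{
    if (m < 0 || n < 0 || k < 0) {
        return "m, n, k must be non-negative";
    }
    // Number of rows of A and B as stored in memory.
    const int rows_a = trans_a ? k : m;
    const int rows_b = trans_b ? n : k;
    if (lda < std::max(1, rows_a)) {
        return "lda too small";
    }
    if (ldb < std::max(1, rows_b)) {
        return "ldb too small";
    }
    if (ldc < std::max(1, m)) {
        return "ldc too small";
    }
    return "";
}
```

Calling such a check just before the failing cuBLAS call on every MPI rank would show whether the failing process (rank 1 in the log) passes sizes that differ from those on rank 0. Whether that is the actual cause here is not established by the report.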

Expected behavior

The SCF calculation should run to completion without a cuBLAS error when K-point parallelism is used on GPU.

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).

Labels

Bugs (bugs that are only solvable with sufficient knowledge of DFT)
