[Bug] Severe Data Race in fft_cpu.cpp OMP parallel FFT causing H-matrix corruption #7339

@ZhouXY-PKU

Describe the bug

Description

When ABACUS is compiled with KML and OpenMP support and run with OMP_NUM_THREADS > 1, a severe data race occurs during FFT execution (source/source_base/module_fft/fft_cpu.cpp). The race corrupts the real-space charge density, which leads to an incorrect H-matrix and SCF divergence.

Diagnostic Evidence (TSan output)

WARNING: ThreadSanitizer: data race (pid=3407330)
  Read of size 4 at 0xffffd38283ac by thread T1:
    #0 ModuleBase::FFT_CPU<double>::fftxybac(std::complex<double>*, std::complex<double>*) const [clone ._omp_fn.0] source/source_base/module_fft/fft_cpu.cpp:382 (abacus+0x64da48)
    #1 <null> <null> (libgomp.so.1+0x1b808)

  Previous write of size 8 at 0xffffd38283a8 by main thread:
    [failed to restore the stack]

  Location is heap block of size 320 at 0xffffd3828280 allocated by main thread:
    #0 operator new(unsigned long) <null> (libtsan.so.2+0x93da0)
    #1 std::unique_ptr<ModuleBase::FFT_CPU<double>, std::default_delete<ModuleBase::FFT_CPU<double> > > make_unique<ModuleBase::FFT_CPU<double>, int&>(int&) source/source_base/module_fft/fft_bundle.cpp:20 (abacus+0x64bafc)
    #2 ModuleBase::FFT_Bundle::initfft(int, int, int, int, int, int, int, int, bool, bool, bool) source/source_base/module_fft/fft_bundle.cpp:91 (abacus+0x64bafc)
    #3 ModulePW::PW_Basis::setuptransform() source/source_basis/module_pw/pw_basis.cpp:66 (abacus+0x6344e0)
    #4 pw::setup_pwrho(UnitCell&, bool, bool&, ModulePW::PW_Basis*&, ModulePW::PW_Basis*&, ModulePW::PW_Basis_Big*&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Input_para const&) source/source_pw/module_pwdft/setup_pwrho.cpp:85 (abacus+0xa51fb0)
    #5 ModuleESolver::ESolver_FP::before_all_runners(UnitCell&, Input_para const&) source/source_esolver/esolver_fp.cpp:48 (abacus+0xb3efb0)
    #6 ModuleESolver::ESolver_KS<double, base_device::DEVICE_CPU>::before_all_runners(UnitCell&, Input_para const&) source/source_esolver/esolver_ks.cpp:50 (abacus+0xb3cc00)
    #7 ModuleESolver::ESolver_KS_LCAO<double, double>::before_all_runners(UnitCell&, Input_para const&) source/source_esolver/esolver_ks_lcao.cpp:50 (abacus+0xb695b0)
    #8 Driver::driver_run() source/source_main/driver_run.cpp:67 (abacus+0x92491c)
    #9 Driver::atomic_world() source/source_main/driver.cpp:177 (abacus+0x923e14)
    #10 Driver::init() source/source_main/driver.cpp:33 (abacus+0x9240f0)
    #11 main source/source_main/main.cpp:43 (abacus+0x449274)

  Thread T1 (tid=3407345, running) created by main thread at:
    #0 pthread_create <null> (libtsan.so.2+0x6a980)
    #1 <null> <null> (libgomp.so.1+0x1bc60)

SUMMARY: ThreadSanitizer: data race source/source_base/module_fft/fft_cpu.cpp:382 in ModuleBase::FFT_CPU<double>::fftxybac(std::complex<double>*, std::complex<double>*) const [clone ._omp_fn.0]                                         
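
To make the addresses in the report easier to read, the arithmetic can be spelled out. The sketch below only uses the numbers printed by TSan above; the helper names (offset_in_block, ranges_overlap) are made up for illustration. The report does not show which member of the object lives at these offsets, so this locates the race but does not identify the variable.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the TSan report above.
const std::uint64_t kBlockBase = 0xffffd3828280ULL;  // start of the 320-byte heap block
const std::uint64_t kBlockSize = 320;
const std::uint64_t kReadAddr  = 0xffffd38283acULL;  // 4-byte read by thread T1
const std::uint64_t kWriteAddr = 0xffffd38283a8ULL;  // 8-byte write by the main thread

// Byte offset of addr from the start of the heap block.
std::uint64_t offset_in_block(std::uint64_t addr) {
    assert(addr >= kBlockBase && addr < kBlockBase + kBlockSize);
    return addr - kBlockBase;
}

// True if [a, a + a_len) and [b, b + b_len) share at least one byte.
bool ranges_overlap(std::uint64_t a, std::uint64_t a_len,
                    std::uint64_t b, std::uint64_t b_len) {
    return a < b + b_len && b < a + a_len;
}
```

The read lands at offset 300 and the write at offset 296, so the 8-byte write covers the 4 bytes that T1 reads. Both accesses fall inside the 320-byte block that the report attributes to make_unique<ModuleBase::FFT_CPU<double>>.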

Workaround

Commenting out the #pragma omp parallel for directives in fft_cpu.cpp resolves the issue completely, though at the cost of FFT parallel performance.
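
A possibly less invasive variant of this workaround is to keep the directive and gate it with the OpenMP if clause, so the serial and threaded paths can be compared without rebuilding. This is a minimal sketch, not the ABACUS code: scale_planes and the use_threads switch are hypothetical stand-ins for a loop over FFT planes.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a loop over FFT planes; this is not the ABACUS code.
// use_threads is an assumed runtime switch (for example, set from an input flag).
void scale_planes(std::vector<std::complex<double>>& data,
                  int nplane, int plane_size, double factor, bool use_threads) {
    assert(data.size() == static_cast<std::size_t>(nplane) * plane_size);
    std::complex<double>* base = data.data();
    // With use_threads == false the loop is executed by a single thread.
#pragma omp parallel for if(use_threads)
    for (int ip = 0; ip < nplane; ++ip) {
        std::complex<double>* plane = base + static_cast<std::size_t>(ip) * plane_size;
        for (int i = 0; i < plane_size; ++i) {
            plane[i] *= factor;  // each iteration touches only its own plane
        }
    }
}
```

Calling the function once with use_threads set to true and once with false on identical input should give identical output. A mismatch in the real code would point at the loop body rather than at the callers.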

Suggested Fix

  1. Primary Fix: Insert a proper #pragma omp barrier in ModulePW (or ensure equivalent synchronization) before invoking any MPI send/receive operations that access the FFT buffers.
  2. Secondary Fix: Review the scope of the OMP parallel regions in fft_cpu.cpp. Ensure that the FFT execution and the subsequent data consumption/communication are strictly sequenced, preventing any overlap between OMP writes and MPI reads.
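
The sequencing described in these two points can be sketched as follows, with two caveats about OpenMP semantics. An explicit #pragma omp barrier only synchronizes the threads of an enclosing parallel region, so it has no effect when placed in serial code. A #pragma omp parallel for already ends with an implicit barrier. The function name, sizes and fill pattern below are hypothetical, and the single-threaded loop stands in for an MPI send or any other consumer of the buffer.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical two-phase routine; names and the fill pattern are illustrative only.
std::complex<double> fill_then_consume(int nx, int ny, int nplane) {
    assert(nx > 0 && ny > 0 && nplane > 0);
    // Snapshot sizes into a const local before the parallel region so worker
    // threads only read a value that no thread modifies afterwards.
    const int plane_size = nx * ny;
    std::vector<std::complex<double>> buf(static_cast<std::size_t>(nplane) * plane_size);
    std::complex<double>* base = buf.data();

    // Phase 1: threads write disjoint planes.
#pragma omp parallel for
    for (int ip = 0; ip < nplane; ++ip) {
        for (int i = 0; i < plane_size; ++i) {
            base[static_cast<std::size_t>(ip) * plane_size + i] =
                std::complex<double>(ip + 1.0, 0.0);
        }
    }
    // The implicit barrier at the end of the parallel for means every write
    // above has completed before execution reaches this point.

    // Phase 2: single-threaded consumer, standing in for an MPI send or a copy.
    std::complex<double> sum(0.0, 0.0);
    for (const auto& v : buf) {
        sum += v;
    }
    return sum;
}
```

If the real code follows this shape and TSan still reports a race, the shared state is more likely something read inside the loop body (for example a member written elsewhere) than the output buffer itself.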
