Update JEDI hashes (20250203) #1475

Draft
wants to merge 25 commits into
base: develop

Conversation

RussTreadon-NOAA
Contributor

@RussTreadon-NOAA RussTreadon-NOAA commented Feb 2, 2025

Description

Update select JEDI hashes on 20250203

Companion PRs

none

Issues

Resolves #1474

Automated CI tests to run in Global Workflow

  • atm_jjob
  • C96C48_ufs_hybatmDA
  • C96C48_hybatmaerosnowDA
  • C48mx500_3DVarAOWCDA
  • C48mx500_hybAOWCDA
  • C96C48_hybatmDA

@RussTreadon-NOAA RussTreadon-NOAA self-assigned this Feb 2, 2025
@RussTreadon-NOAA
Contributor Author

This PR will be marked Ready for review once all GDASApp ctests have been run and pass on Hera, Hercules, and Orion and g-w CI passes on WCOSS2.

@RussTreadon-NOAA RussTreadon-NOAA added the hera-GW-RT (Queue for automated testing with global-workflow on Hera), orion-GW-RT (Queue for automated testing with global-workflow on Orion), and hercules-GW-RT (Queue for automated testing with global-workflow on Hercules) labels Feb 2, 2025
@emcbot emcbot added the hercules-GW-RT-Running (Automated testing with global-workflow running on Hercules), orion-GW-RT-Running (Automated testing with global-workflow running on Orion), and hera-GW-RT-Running (Automated testing with global-workflow running on Hera) labels and removed the hercules-GW-RT (Queue for automated testing with global-workflow on Hercules), orion-GW-RT (Queue for automated testing with global-workflow on Orion), and hera-GW-RT (Queue for automated testing with global-workflow on Hera) labels Feb 2, 2025
@emcbot

emcbot commented Feb 2, 2025

Automated GW-GDASApp Testing Results:
Machine: hercules

Start: Sun Feb  2 14:48:35 CST 2025 on hercules-login-1.hpc.msstate.edu
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Sun Feb  2 15:24:43 CST 2025
---------------------------------------------------
Tests: ctest -j12 -R gdasapp
Tests:                                 *SUCCESS*
Tests: Completed at Sun Feb  2 16:24:09 CST 2025
Tests: 100% tests passed, 0 tests failed out of 135

@emcbot emcbot added the hercules-GW-RT-Passed (Automated testing with global-workflow successful on Hercules) label and removed the hercules-GW-RT-Running (Automated testing with global-workflow running on Hercules) label Feb 2, 2025
@emcbot

emcbot commented Feb 2, 2025

Automated GW-GDASApp Testing Results:
Machine: hera

Start: Sun Feb  2 20:56:24 UTC 2025 on hfe09
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Sun Feb  2 21:42:53 UTC 2025
---------------------------------------------------
Tests: ctest -j12 -R gdasapp
Tests:                                 *SUCCESS*
Tests: Completed at Sun Feb  2 22:43:09 UTC 2025
Tests: 100% tests passed, 0 tests failed out of 135

@emcbot emcbot added the hera-GW-RT-Passed (Automated testing with global-workflow successful on Hera) label and removed the hera-GW-RT-Running (Automated testing with global-workflow running on Hera) label Feb 2, 2025
@emcbot

emcbot commented Feb 2, 2025

Automated GW-GDASApp Testing Results:
Machine: orion

Start: Sun Feb  2 02:51:15 PM CST 2025 on orion-login-1.hpc.msstate.edu
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Sun Feb  2 03:55:49 PM CST 2025
---------------------------------------------------
Tests: ctest -j12 -R gdasapp
Tests:                                 *SUCCESS*
Tests: Completed at Sun Feb  2 05:38:19 PM CST 2025
Tests: 100% tests passed, 0 tests failed out of 135

@emcbot emcbot added the orion-GW-RT-Passed (Automated testing with global-workflow successful on Orion) label and removed the orion-GW-RT-Running (Automated testing with global-workflow running on Orion) label Feb 2, 2025
@RussTreadon-NOAA
Contributor Author

WCOSS2 g-w CI

Clone g-w develop at 380946c on Cactus. Update sorc/gdas.cd to feature/stable-nightly at e6eafb0.

All g-w components successfully build except GDASApp. The GDASApp build fails with

[100%] Linking CXX executable ../../bin/gdas_soca_error_covariance_toolbox.x
cd /lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/stable-nightly/sorc/gdas.cd/build/gdas/mains && /apps/ops/test/spack-stack-1.6.0-nco/envs/nco-intel-19.1.3.304/install/intel/19.1.3.304/cmake-3.23.1-chpcsen/bin/cmake -E remove /lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/stable-nightly/sorc/gdas.cd/build/bin/gdas_soca_error_covariance_toolbox.x
cd /lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/stable-nightly/sorc/gdas.cd/build/gdas/mains && /apps/ops/test/\
spack-stack-1.6.0-nco/envs/nco-intel-19.1.3.304/install/intel/19.1.3.304/cmake-3.23.1-chpcsen/bin/cmake -E cmake_link_script CMakeFiles/gdas_soca_error_covariance_toolbox.x.dir/link.txt --verbose=NO
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: ../../lib/libsaber.so: undefined reference to `atlas::grid::detail::partitioner::TransPartitioner::TransPartitioner()'
make[2]: *** [gdas/mains/CMakeFiles/gdas_soca_error_covariance_toolbox.x.dir/build.make:174: bin/gdas_soca_error_covariance_toolbox.x] Error 1
make[2]: Leaving directory '/lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/stable-nightly/sorc/gdas.cd/build'
make[1]: *** [CMakeFiles/Makefile2:28951: gdas/mains/CMakeFiles/gdas_soca_error_covariance_toolbox.x.dir/all] Error 2
make[1]: Leaving directory '/lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/stable-nightly/sorc/gdas.cd/build'
make: *** [Makefile:166: all] Error 2
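
When a build log is long, it can help to pull out just the symbols the linker could not resolve. The following is a minimal sketch, not part of GDASApp or global-workflow; the function name `undefined_refs` and the file name `build.log` are hypothetical.

```shell
#!/bin/sh
# Print each distinct symbol named in an "undefined reference to" linker
# message found in a build log read from standard input.
# Assumption: the symbol is wrapped in one quote character on each side and
# is the last thing on the line, as in the ld output above.
undefined_refs() {
    # The "." before and after the capture group stands in for the quote
    # characters that ld prints around the symbol name.
    sed -n 's/.*undefined reference to .\(.*\).$/\1/p' | sort -u
}

# Example (hypothetical log file name):
#   undefined_refs < build.log
```

Applied to the log above, this should reduce the failure to the single TransPartitioner constructor symbol, which makes it easier to search saber and atlas for where that symbol is used and defined.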

Notice that Hera, Hercules, and Orion use more recent versions of atlas than WCOSS2:

./hera.intel.lua:load("atlas/0.35.1")
./orion.intel.lua:load("atlas/0.35.1")
./wcoss2.intel.lua:load("atlas/0.35.0")
./hercules.gnu.lua:load("atlas/0.36.0")
./hercules.intel.lua:load("atlas/0.36.0")
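
A listing like the one above can be produced with a short helper. This is a sketch under the assumption that the modulefiles are Lua files containing lines of the form load("atlas/<version>"); the function name `atlas_versions` and the directory argument are hypothetical.

```shell
#!/bin/sh
# List the atlas version that each Lua modulefile in a directory loads.
# Assumption: modulefiles contain lines such as load("atlas/0.35.1").
atlas_versions() {
    dir=${1:-.}
    # -H prints the file name, -o prints only the matched text.
    grep -H -o 'atlas/[0-9][0-9.]*' "$dir"/*.lua
}

# Example (run from the directory holding the modulefiles):
#   atlas_versions .
```

Each output line has the form `<file>:atlas/<version>`, so version differences between machines stand out at a glance.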

Might a more recent version of atlas resolve the "undefined reference to atlas::grid::detail::partitioner::TransPartitioner::TransPartitioner()" message in libsaber.so?

Manually load GDASApp wcoss2.intel modulefile on Cactus. module spider atlas returns

---------------------------------------------------------------------------------------------------------------------------------
  atlas:
---------------------------------------------------------------------------------------------------------------------------------
     Versions:
        atlas/0.33.0
        atlas/0.35.0 

Unfortunately, atlas/0.35.0 is the most recent version of atlas available in the current installation of spack-stack/1.6.0 on WCOSS2.

Note that saber PR #1001 contains references to TransPartitioner.h in WriteFields.cc.

As a test, roll the working copy of sorc/saber back to hash d51284c6, the commit prior to saber PR #1001. feature/stable-nightly successfully builds on Cactus using this saber hash.

We need to reach out to the library team to request installation of a newer version of atlas on WCOSS2. This PR cannot move forward until we can build and run feature/stable-nightly on WCOSS2.

Attention: @DavidNew-NOAA , @danholdaway , @CoryMartin-NOAA , @guillaumevernieres

@RussTreadon-NOAA
Contributor Author

WCOSS2 g-w CI
As a test, revert sorc/saber to d51284c6, build GDASApp, and run g-w CI on Cactus. All configurations successfully run to completion.

/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/C48_ATM_stable-nightly
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202103231200        Done    Feb 03 2025 10:05:18    Feb 03 2025 11:15:25
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/C48mx500_3DVarAOWCDA_stable-nightly
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202103241800        Done    Feb 03 2025 10:05:22    Feb 03 2025 10:20:23
202103250000      Active    Feb 03 2025 10:05:22             -          
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/C48mx500_hybAOWCDA_stable-nightly
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202103241800        Done    Feb 03 2025 10:05:24    Feb 03 2025 10:20:30
202103250000        Done    Feb 03 2025 10:05:24    Feb 03 2025 11:25:21
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/C48_S2SWA_gefs_stable-nightly
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202103231200        Done    Feb 03 2025 10:05:39    Feb 03 2025 12:01:05
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/C48_S2SW_stable-nightly
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202103231200        Done    Feb 03 2025 10:05:26    Feb 03 2025 11:30:46
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/C96_atm3DVar_stable-nightly
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201800        Done    Feb 03 2025 10:05:29    Feb 03 2025 10:20:40
202112210000        Done    Feb 03 2025 10:05:29    Feb 03 2025 12:30:37
202112210600        Done    Feb 03 2025 10:05:29    Feb 03 2025 12:15:35
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/C96C48_hybatmaerosnowDA_stable-nightly
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201200        Done    Feb 03 2025 10:05:31    Feb 03 2025 10:25:41
202112201800        Done    Feb 03 2025 10:05:31    Feb 03 2025 12:30:41
202112210000        Done    Feb 03 2025 10:05:31    Feb 03 2025 12:20:45
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/C96C48_hybatmDA_stable-nightly
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201800        Done    Feb 03 2025 10:05:33    Feb 03 2025 10:20:49
202112210000        Done    Feb 03 2025 10:05:33    Feb 03 2025 12:05:39
202112210600        Done    Feb 03 2025 10:05:33    Feb 03 2025 12:10:44
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/C96C48_ufs_hybatmDA_stable-nightly
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202402231800        Done    Feb 03 2025 10:05:36    Feb 03 2025 10:20:53
202402240000        Done    Feb 03 2025 10:05:36    Feb 03 2025 12:53:23
202402240600        Done    Feb 03 2025 10:05:36    Feb 03 2025 12:52:37
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/C96_S2SWA_gefs_replay_ics_stable-nightly
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202011010000        Done    Feb 03 2025 10:05:43    Feb 03 2025 10:56:14
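
With this many experiments, a small filter can confirm the state of every cycle. The sketch below assumes a table laid out like the listing above, with a 12-digit cycle stamp in the first column and the state in the second; the function name `unfinished_cycles` and the file name `status.txt` are hypothetical.

```shell
#!/bin/sh
# Read a cycle status table on standard input and print every cycle whose
# STATE column is not "Done". Rows are recognised by a 12-digit cycle stamp
# (YYYYMMDDHHMM) in the first column; paths and header lines are skipped.
unfinished_cycles() {
    awk 'length($1) == 12 && $1 ~ /^[0-9]+$/ && $2 != "Done" { print $1, $2 }'
}

# Example (hypothetical file holding the listing):
#   unfinished_cycles < status.txt
```

Empty output means every listed cycle reported Done; any line printed names a cycle that is still in another state.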

@RussTreadon-NOAA RussTreadon-NOAA added the DO NOT MERGE (PR is not ready to be merged yet) label Feb 3, 2025
@RussTreadon-NOAA
Contributor Author

While g-w based GDASApp ctests pass on Hera, Hercules, and Orion, we cannot build feature/stable-nightly at e6eafb0 on WCOSS2. Starting with saber @ b85ece5 we need at least atlas/0.35.1. Currently the WCOSS2 spack-stack installation only has atlas versions up to 0.35.0.

Mark this PR DO NOT MERGE until the WCOSS2 spack-stack is updated.

@RussTreadon-NOAA
Contributor Author

Update

The issue on WCOSS2 is not the version of atlas. The WCOSS2 atlas is built with -DENABLE_ECTRANS:BOOL=OFF. atlas is built on other machines with -DENABLE_ECTRANS:BOOL=ON. The library team built atlas on Acorn with -DENABLE_ECTRANS:BOOL=ON. GDASApp builds on Acorn. While this is good, @danholdaway and @shlyaeva point out that the error may reside in saber and not in how atlas is built on WCOSS2.

@shlyaeva
Collaborator

I see the check for atlas_TRANS_FOUND in saber: https://github.com/JCSDA-internal/saber/blob/develop/CMakeLists.txt#L86. It would be worth checking the associated print in the log. Do you mind sharing the build log?

@RussTreadon-NOAA
Contributor Author

Run build of feature/stable-nightly at d1a30f9 on Cactus. The build fails as before:

cd /lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/stable-nightly/sorc/gdas.cd/build/gdas/mains && /apps/ops/test/spack-stack-1.6.0-nco/envs/nco-intel-19.1.3.304/install/intel/19.1.3.304/cmake-3.23.1-chpcsen/bin/cmake -E cmake_link_script CMakeFiles/gdas_soca_error_covariance_toolbox.x.dir/link.txt --verbose=NO
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: ../../lib/libsaber.so: undefined reference to `atlas::grid::detail::partitioner::TransPartitioner::TransPartitioner()'
make[2]: *** [gdas/mains/CMakeFiles/gdas_soca_error_covariance_toolbox.x.dir/build.make:174: bin/gdas_soca_error_covariance_toolbox.x] Error 1
make[2]: Leaving directory '/lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/stable-nightly/sorc/gdas.cd/build'
make[1]: *** [CMakeFiles/Makefile2:28951: gdas/mains/CMakeFiles/gdas_soca_error_covariance_toolbox.x.dir/all] Error 2
make[1]: Leaving directory '/lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/stable-nightly/sorc/gdas.cd/build'
make: *** [Makefile:166: all] Error 2

feature/stable-nightly at d1a30f9 successfully built and ran GDASApp ctests on Hera.

@shlyaeva
Collaborator

I see transpartitioner used in spectralb, but also in the WriteFields saber block. spectralb is turned off if trans is not found: https://github.com/JCSDA-internal/saber/blob/2a2878f511fc0ca360e681a271df5a09bb343161/src/saber/spectralb/CMakeLists.txt#L6, but writefields is not: https://github.com/JCSDA-internal/saber/blob/develop/src/saber/generic/CMakeLists.txt. Do you mind trying to add the writefields block in CMakeLists only when atlas_trans is found?

@danholdaway
Contributor

Thanks @shlyaeva, that looks to be the culprit to me too. WriteFields needs to be fenced off. Not sure what impact that would have though. I'll try testing that.

@danholdaway
Contributor

Draft PR with branch to test https://github.com/JCSDA-internal/saber/pull/1012

@RussTreadon-NOAA
Contributor Author

@shlyaeva, @danholdaway

Anna's suggestion was added to sorc/saber/src/saber/generic/CMakeLists.txt as

@@ -35,10 +35,14 @@ VertLoc.h
 # Interpolated vertical localization
 VertLocInterp.cc
 VertLocInterp.h
+)
 
 # Write selected fields to a netCDF file
-WriteFields.h
-WriteFields.cc
+if( atlas_TRANS_FOUND OR atlas_ECTRANS_FOUND )
+    list(APPEND spectralb_src_files_list
+    WriteFields.h
+    WriteFields.cc
 )
+endif()
 
 set( generic_src_files ${generic_src_files_list} PARENT_SCOPE )

The GDASApp build was run again on Cactus. The build ran to completion without any errors.

atlas/0.35.0 in Cactus spack-stack/1.6.0 was built with

whatis("Configure options : 
-DENABLE_OMP:BOOL=ON 
-DENABLE_FCKIT:BOOL=ON 
-DENABLE_EIGEN:BOOL=ON 
-DENABLE_FFTW:BOOL=OFF 
-DPYTHON_EXECUTABLE:FILEPATH=/apps/ops/test/spack-stack-1.6.0-nco/envs/nco-intel-19.1.3.304/install/intel/19.1.3.304/python-3.10.13-qzt4y2i/bin/python3.10 
-DENABLE_ECTRANS:BOOL=OFF 
-DENABLE_TESSELATION:BOOL=ON")

This differs from the spack-stack/1.6.0 builds on Hera, Hercules, and Orion. On these machines we have -DENABLE_ECTRANS:BOOL=ON -DENABLE_TESSELATION:BOOL=ON.
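
To compare machines quickly, the ECTRANS setting can be extracted from the module description text. This is a sketch only: the function name `ectrans_setting` is hypothetical, and it assumes the modulefile records its configure options in a whatis string as the one quoted above does.

```shell
#!/bin/sh
# Read module description text on standard input and print the value (ON or
# OFF) of the first -DENABLE_ECTRANS:BOOL=... configure option it contains.
ectrans_setting() {
    # "--" stops option parsing so the leading "-" in the pattern is safe.
    grep -o -- '-DENABLE_ECTRANS:BOOL=[A-Z]*' | head -n 1 | cut -d= -f2
}

# Example, assuming the module system writes "module show" output to stderr:
#   module show atlas 2>&1 | ectrans_setting
```

Running the same pipeline on each machine gives a one-word answer per platform, which is easier to compare than the full list of configure options.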

@danholdaway , you made the same change in saber PR #1012.

@RussTreadon-NOAA
Contributor Author

Rerun the above test on Cactus by cloning bugfix/fence_of_write_block from saber #1012 into sorc/saber in a working copy of feature/stable-nightly at d1a30f9. As expected, the GDASApp build successfully ran to completion.

@danholdaway
Contributor

@RussTreadon-NOAA the saber fix is merged so you should be able to proceed now.

@RussTreadon-NOAA
Contributor Author

Thank you @danholdaway for the quick fix to saber.

@RussTreadon-NOAA RussTreadon-NOAA removed the hera-GW-RT-Passed (Automated testing with global-workflow successful on Hera), orion-GW-RT-Passed (Automated testing with global-workflow successful on Orion), DO NOT MERGE (PR is not ready to be merged yet), and hercules-GW-RT-Passed (Automated testing with global-workflow successful on Hercules) labels Feb 20, 2025
@RussTreadon-NOAA
Contributor Author

Before marking this PR as Ready for review, manually start stable_driver.sh on Hera to ensure everything works as it should.

@RussTreadon-NOAA
Contributor Author

The sorc/soca hash in feature/stable-nightly needs to be updated to 7675ed1 before this PR can move forward. This update will be done by manually kicking off the stable-nightly script on Hera.

@RussTreadon-NOAA
Contributor Author

The Hera stable-nightly script successfully updated JEDI hashes in feature/stable-nightly and ran GDASApp ctests with all tests passing:

Test project /scratch1/NCEPDEV/da/Russ.Treadon/CI/GDASApp/stable/20250221/global-workflow/sorc/gdas.cd/build
        Start 2012: test_gdasapp_C96C48_ufs_hybatmDA
        Start 1993: test_gdasapp_C96C48_hybatmDA
        Start 2037: test_gdasapp_C96C48_hybatmaerosnowDA
        Start 2062: test_gdasapp_C48mx500_3DVarAOWCDA
        Start 2072: test_gdasapp_C48mx500_hybAOWCDA
        Start 2087: test_gdasapp_setup_atm_jjob_cycled_exp
        Start 1604: test_gdasapp_util_prepdata
        Start 1602: test_gdasapp_util_coding_norms
        Start 1603: test_gdasapp_util_ioda_example
        Start 1607: test_gdasapp_util_rtofstmp
        Start 1608: test_gdasapp_util_rtofssal
        Start 1986: test_gdasapp_check_python_norms
  1/137 Test #1603: test_gdasapp_util_ioda_example ...........................................   Passed    0.93 sec
        Start 1987: test_gdasapp_check_yaml_keys
  2/137 Test #1987: test_gdasapp_check_yaml_keys .............................................   Passed    0.18 sec

...

134/137 Test #2034: test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_ecmn_202402240000 ..............   Passed   62.48 sec
135/137 Test #2035: test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_esfc_202402240000 ..............   Passed  136.94 sec
        Start 2036: test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_fcst_202402240000
136/137 Test #2028: test_gdasapp_C96C48_ufs_hybatmDA_gdas_fcst_202402240000 ..................   Passed  436.69 sec
137/137 Test #2036: test_gdasapp_C96C48_ufs_hybatmDA_enkfgdas_fcst_202402240000 ..............   Passed  1198.70 sec

100% tests passed, 0 tests failed out of 137

Label Time Summary:
gdas-utils    =  14.73 sec*proc (15 tests)
gdasapp       = 31068.84 sec*proc (104 tests)
script        = 31083.57 sec*proc (119 tests)

Total Test time (real) = 6557.69 sec
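
For scripted gating on a result like the one above, the failure count can be read from the ctest summary line. A minimal sketch follows; the function name `ctest_failed` and the file name `ctest.log` are hypothetical and not part of the stable-nightly script.

```shell
#!/bin/sh
# Read ctest output on standard input and print the number of failed tests,
# taken from the summary line "N% tests passed, F tests failed out of T".
# If several summary lines are present, the last one wins.
ctest_failed() {
    sed -n 's/.*tests passed, \([0-9][0-9]*\) tests failed out of [0-9][0-9]*.*/\1/p' | tail -n 1
}

# Example (hypothetical log file name):
#   [ "$(ctest_failed < ctest.log)" = "0" ] && echo "all ctests passed"
```

An empty result means no summary line was found, which is worth treating as a failure rather than a pass.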

@RussTreadon-NOAA
Contributor Author

Install g-w develop at c902645 with feature/stable-nightly at 523b97b. With the change from saber PR #1012 now in feature/stable-nightly, GDASApp successfully builds on Cactus. Given this success, g-w CI has been initiated on Cactus.

Execution of the stable-nightly script is equivalent to applying the Hera-GW-RT label to this PR. Since the stable-nightly script just succeeded, the Hera-GW-RT label will not be applied to this PR ... at present.

MSU Orion and Hercules are experiencing sluggish disk performance. This caused one of the ctests to fail on Hercules in GDASApp PR #1494. As such, the Orion and Hercules GW-RT labels will not be applied to this PR at present.

Successfully merging this pull request may close these issues.

Update JEDI hashes (20250203)
5 participants