Add Hercules build capability to GDASApp #774

Merged
9 commits merged on Jan 11, 2024

RussTreadon-NOAA
Contributor

MSU Hercules is available for use. This PR adds hercules.lua to GDASApp modulefiles/GDAS.

Fixes #773

@RussTreadon-NOAA self-assigned this on Nov 28, 2023
@RussTreadon-NOAA marked this pull request as ready for review on November 28, 2023, 17:20
Contributor

@CoryMartin-NOAA CoryMartin-NOAA left a comment

Looks good, thanks @RussTreadon-NOAA !

@DavidHuber-NOAA
Collaborator

Just a heads up that GSI-based cycling support in the global workflow has been ported to Hercules, with the exception of the GSI-monitor.

@RussTreadon-NOAA
Contributor Author

Install RussTreadon-NOAA:feature/hercules at 23453d7 on Hercules. Run GDASApp ctests. All 32 tests pass.

(gdasapp) hercules-login-4:/work2/noaa/da/rtreadon/git/GDASApp/hercules/build$ ctest -R test_gdasapp
Test project /work2/noaa/da/rtreadon/git/GDASApp/hercules/build
      Start 1324: test_gdasapp_util_coding_norms
 1/32 Test #1324: test_gdasapp_util_coding_norms .............   Passed    0.98 sec
      Start 1325: test_gdasapp_util_ioda_example
 2/32 Test #1325: test_gdasapp_util_ioda_example .............   Passed    4.55 sec
      Start 1326: test_gdasapp_util_prepdata
 3/32 Test #1326: test_gdasapp_util_prepdata .................   Passed    3.27 sec
      Start 1327: test_gdasapp_util_rads2ioda
 4/32 Test #1327: test_gdasapp_util_rads2ioda ................   Passed    0.28 sec
      Start 1328: test_gdasapp_util_ghrsst2ioda
 5/32 Test #1328: test_gdasapp_util_ghrsst2ioda ..............   Passed    0.07 sec
      Start 1329: test_gdasapp_util_smap2ioda
 6/32 Test #1329: test_gdasapp_util_smap2ioda ................   Passed    0.07 sec
      Start 1330: test_gdasapp_util_smos2ioda
 7/32 Test #1330: test_gdasapp_util_smos2ioda ................   Passed    0.18 sec
      Start 1331: test_gdasapp_util_viirsaod2ioda
 8/32 Test #1331: test_gdasapp_util_viirsaod2ioda ............   Passed    0.07 sec
      Start 1332: test_gdasapp_util_icecamsr2ioda
 9/32 Test #1332: test_gdasapp_util_icecamsr2ioda ............   Passed    0.08 sec
      Start 1664: test_gdasapp_check_python_norms
10/32 Test #1664: test_gdasapp_check_python_norms ............   Passed    1.33 sec
      Start 1665: test_gdasapp_check_yaml_keys
11/32 Test #1665: test_gdasapp_check_yaml_keys ...............   Passed    0.52 sec
      Start 1666: test_gdasapp_jedi_increment_to_fv3
12/32 Test #1666: test_gdasapp_jedi_increment_to_fv3 .........   Passed    8.36 sec
      Start 1667: test_gdasapp_convert_ewok_yaml
13/32 Test #1667: test_gdasapp_convert_ewok_yaml .............   Passed    1.64 sec
      Start 1668: test_gdasapp_convert_bufr_temp_dbuoy
14/32 Test #1668: test_gdasapp_convert_bufr_temp_dbuoy .......   Passed    0.99 sec
      Start 1669: test_gdasapp_convert_bufr_salt_dbuoy
15/32 Test #1669: test_gdasapp_convert_bufr_salt_dbuoy .......   Passed    0.92 sec
      Start 1670: test_gdasapp_convert_bufr_temp_mbuoyb
16/32 Test #1670: test_gdasapp_convert_bufr_temp_mbuoyb ......   Passed    8.08 sec
      Start 1671: test_gdasapp_convert_bufr_salt_mbuoyb
17/32 Test #1671: test_gdasapp_convert_bufr_salt_mbuoyb ......   Passed   12.62 sec
      Start 1672: test_gdasapp_convert_bufr_tesacprof
18/32 Test #1672: test_gdasapp_convert_bufr_tesacprof ........   Passed    1.82 sec
      Start 1673: test_gdasapp_convert_bufr_trkobprof
19/32 Test #1673: test_gdasapp_convert_bufr_trkobprof ........   Passed    0.76 sec
      Start 1674: test_gdasapp_convert_bufr_sfcships
20/32 Test #1674: test_gdasapp_convert_bufr_sfcships .........   Passed    0.22 sec
      Start 1675: test_gdasapp_convert_bufr_sfcshipsu
21/32 Test #1675: test_gdasapp_convert_bufr_sfcshipsu ........   Passed    0.15 sec
      Start 1676: test_gdasapp_soca_obsdb
22/32 Test #1676: test_gdasapp_soca_obsdb ....................   Passed   12.99 sec
      Start 1677: test_gdasapp_soca_nsst_increment_to_mom6
23/32 Test #1677: test_gdasapp_soca_nsst_increment_to_mom6 ...   Passed   42.32 sec
      Start 1678: test_gdasapp_land_create_ens
24/32 Test #1678: test_gdasapp_land_create_ens ...............   Passed    1.82 sec
      Start 1679: test_gdasapp_land_imsproc
25/32 Test #1679: test_gdasapp_land_imsproc ..................   Passed    8.21 sec
      Start 1680: test_gdasapp_land_apply_jediincr
26/32 Test #1680: test_gdasapp_land_apply_jediincr ...........   Passed   22.63 sec
      Start 1681: test_gdasapp_land_letkfoi_snowda
27/32 Test #1681: test_gdasapp_land_letkfoi_snowda ...........   Passed   74.13 sec
      Start 1682: test_gdasapp_convert_bufr_adpsfc_snow
28/32 Test #1682: test_gdasapp_convert_bufr_adpsfc_snow ......   Passed    2.96 sec
      Start 1683: test_gdasapp_convert_bufr_adpsfc
29/32 Test #1683: test_gdasapp_convert_bufr_adpsfc ...........   Passed    5.56 sec
      Start 1684: test_gdasapp_convert_gsi_satbias
30/32 Test #1684: test_gdasapp_convert_gsi_satbias ...........   Passed   17.84 sec
      Start 1685: test_gdasapp_store_gsi_satbias
31/32 Test #1685: test_gdasapp_store_gsi_satbias .............   Passed    0.49 sec
      Start 1686: test_gdasapp_aero_gen_3dvar_yaml
32/32 Test #1686: test_gdasapp_aero_gen_3dvar_yaml ...........   Passed    0.25 sec

100% tests passed, 0 tests failed out of 32

Label Time Summary:
gdas-utils    =   9.54 sec*proc (9 tests)
script        =   9.54 sec*proc (9 tests)

Total Test time (real) = 236.47 sec

Install RussTreadon-NOAA:feature/hercules at 23453d7 inside working copy of g-w develop @ b056b53. Run GDASApp ctests. Several soca and atm_jjob tests fail.

The atm_jjob tests failed because the driver scripts in test/atm/global-workflow did not include Hercules as a valid $machine. After adding Hercules as a valid $machine and specifying the target file on the soft links in the atm_jjob driver scripts, the atm_jjob tests pass. Eleven soca tests still fail, as the ctest output below shows:

(gdasapp) hercules-login-4:/work2/noaa/da/rtreadon/git/global-workflow/hercules/sorc/gdas.cd/build$ ctest -R test_gdasapp
Test project /work2/noaa/da/rtreadon/git/global-workflow/hercules/sorc/gdas.cd/build
      Start 1324: test_gdasapp_util_coding_norms
 1/55 Test #1324: test_gdasapp_util_coding_norms ........................   Passed    0.79 sec
      Start 1325: test_gdasapp_util_ioda_example
 2/55 Test #1325: test_gdasapp_util_ioda_example ........................   Passed    6.93 sec
      Start 1326: test_gdasapp_util_prepdata
 3/55 Test #1326: test_gdasapp_util_prepdata ............................   Passed    0.50 sec
      Start 1327: test_gdasapp_util_rads2ioda
 4/55 Test #1327: test_gdasapp_util_rads2ioda ...........................   Passed    0.47 sec
      Start 1328: test_gdasapp_util_ghrsst2ioda
 5/55 Test #1328: test_gdasapp_util_ghrsst2ioda .........................   Passed    0.09 sec
      Start 1329: test_gdasapp_util_smap2ioda
 6/55 Test #1329: test_gdasapp_util_smap2ioda ...........................   Passed    0.09 sec
      Start 1330: test_gdasapp_util_smos2ioda
 7/55 Test #1330: test_gdasapp_util_smos2ioda ...........................   Passed    0.09 sec
      Start 1331: test_gdasapp_util_viirsaod2ioda
 8/55 Test #1331: test_gdasapp_util_viirsaod2ioda .......................   Passed    0.09 sec
      Start 1332: test_gdasapp_util_icecamsr2ioda
 9/55 Test #1332: test_gdasapp_util_icecamsr2ioda .......................   Passed    0.08 sec
      Start 1664: test_gdasapp_check_python_norms
10/55 Test #1664: test_gdasapp_check_python_norms .......................   Passed    1.15 sec
      Start 1665: test_gdasapp_check_yaml_keys
11/55 Test #1665: test_gdasapp_check_yaml_keys ..........................   Passed    0.06 sec
      Start 1666: test_gdasapp_jedi_increment_to_fv3
12/55 Test #1666: test_gdasapp_jedi_increment_to_fv3 ....................   Passed    0.31 sec
      Start 1667: test_gdasapp_convert_ewok_yaml
13/55 Test #1667: test_gdasapp_convert_ewok_yaml ........................   Passed    0.17 sec
      Start 1668: test_gdasapp_setup_cycled_exp
14/55 Test #1668: test_gdasapp_setup_cycled_exp .........................   Passed    0.36 sec
      Start 1669: test_gdasapp_convert_bufr_temp_dbuoy
15/55 Test #1669: test_gdasapp_convert_bufr_temp_dbuoy ..................   Passed    0.60 sec
      Start 1670: test_gdasapp_convert_bufr_salt_dbuoy
16/55 Test #1670: test_gdasapp_convert_bufr_salt_dbuoy ..................   Passed    0.19 sec
      Start 1671: test_gdasapp_convert_bufr_temp_mbuoyb
17/55 Test #1671: test_gdasapp_convert_bufr_temp_mbuoyb .................   Passed    0.20 sec
      Start 1672: test_gdasapp_convert_bufr_salt_mbuoyb
18/55 Test #1672: test_gdasapp_convert_bufr_salt_mbuoyb .................   Passed    0.20 sec
      Start 1673: test_gdasapp_convert_bufr_tesacprof
19/55 Test #1673: test_gdasapp_convert_bufr_tesacprof ...................   Passed    0.20 sec
      Start 1674: test_gdasapp_convert_bufr_trkobprof
20/55 Test #1674: test_gdasapp_convert_bufr_trkobprof ...................   Passed    0.19 sec
      Start 1675: test_gdasapp_convert_bufr_sfcships
21/55 Test #1675: test_gdasapp_convert_bufr_sfcships ....................   Passed    0.20 sec
      Start 1676: test_gdasapp_convert_bufr_sfcshipsu
22/55 Test #1676: test_gdasapp_convert_bufr_sfcshipsu ...................   Passed    0.19 sec
      Start 1677: test_gdasapp_soca_obsdb
23/55 Test #1677: test_gdasapp_soca_obsdb ...............................   Passed    0.88 sec
      Start 1678: test_gdasapp_soca_nsst_increment_to_mom6
24/55 Test #1678: test_gdasapp_soca_nsst_increment_to_mom6 ..............   Passed    5.31 sec
      Start 1679: test_gdasapp_soca_prep
25/55 Test #1679: test_gdasapp_soca_prep ................................   Passed    4.09 sec
      Start 1680: test_gdasapp_soca_concatioda
26/55 Test #1680: test_gdasapp_soca_concatioda ..........................   Passed    1.01 sec
      Start 1681: test_gdasapp_soca_run_clean
27/55 Test #1681: test_gdasapp_soca_run_clean ...........................   Passed    0.10 sec
      Start 1682: test_gdasapp_soca_setup_obsproc
28/55 Test #1682: test_gdasapp_soca_setup_obsproc .......................   Passed    0.33 sec
      Start 1683: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP
29/55 Test #1683: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP ....***Failed    0.58 sec
      Start 1684: test_gdasapp_soca_JGLOBAL_PREP_OCEAN_OBS
30/55 Test #1684: test_gdasapp_soca_JGLOBAL_PREP_OCEAN_OBS ..............***Failed    0.12 sec
      Start 1685: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT
31/55 Test #1685: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT ....***Failed    0.12 sec
      Start 1686: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN
32/55 Test #1686: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN .....***Failed    0.12 sec
      Start 1687: test_gdasapp_soca_copy_scratch
33/55 Test #1687: test_gdasapp_soca_copy_scratch ........................***Failed    0.05 sec
      Start 1688: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT
34/55 Test #1688: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT ...***Failed    0.12 sec
      Start 1689: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST
35/55 Test #1689: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST ....***Failed    0.12 sec
      Start 1690: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY
36/55 Test #1690: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY ....***Failed    0.12 sec
      Start 1691: test_gdasapp_soca_socahybridweights
37/55 Test #1691: test_gdasapp_soca_socahybridweights ...................***Failed    0.10 sec
      Start 1692: test_gdasapp_soca_incr_handler
38/55 Test #1692: test_gdasapp_soca_incr_handler ........................***Failed    0.10 sec
      Start 1693: test_gdasapp_soca_ens_handler
39/55 Test #1693: test_gdasapp_soca_ens_handler .........................***Failed    0.10 sec
      Start 1694: test_gdasapp_land_create_ens
40/55 Test #1694: test_gdasapp_land_create_ens ..........................   Passed    0.64 sec
      Start 1695: test_gdasapp_land_imsproc
41/55 Test #1695: test_gdasapp_land_imsproc .............................   Passed    2.05 sec
      Start 1696: test_gdasapp_land_apply_jediincr
42/55 Test #1696: test_gdasapp_land_apply_jediincr ......................   Passed    3.60 sec
      Start 1697: test_gdasapp_land_letkfoi_snowda
43/55 Test #1697: test_gdasapp_land_letkfoi_snowda ......................   Passed   27.79 sec
      Start 1698: test_gdasapp_convert_bufr_adpsfc_snow
44/55 Test #1698: test_gdasapp_convert_bufr_adpsfc_snow .................   Passed    2.28 sec
      Start 1699: test_gdasapp_convert_bufr_adpsfc
45/55 Test #1699: test_gdasapp_convert_bufr_adpsfc ......................   Passed    3.07 sec
      Start 1700: test_gdasapp_convert_gsi_satbias
46/55 Test #1700: test_gdasapp_convert_gsi_satbias ......................   Passed    3.38 sec
      Start 1701: test_gdasapp_store_gsi_satbias
47/55 Test #1701: test_gdasapp_store_gsi_satbias ........................   Passed    0.40 sec
      Start 1702: test_gdasapp_setup_atm_cycled_exp
48/55 Test #1702: test_gdasapp_setup_atm_cycled_exp .....................   Passed    0.55 sec
      Start 1703: test_gdasapp_atm_jjob_var_init
49/55 Test #1703: test_gdasapp_atm_jjob_var_init ........................   Passed   12.62 sec
      Start 1704: test_gdasapp_atm_jjob_var_run
50/55 Test #1704: test_gdasapp_atm_jjob_var_run .........................   Passed   79.17 sec
      Start 1705: test_gdasapp_atm_jjob_var_final
51/55 Test #1705: test_gdasapp_atm_jjob_var_final .......................   Passed   10.04 sec
      Start 1706: test_gdasapp_atm_jjob_ens_init
52/55 Test #1706: test_gdasapp_atm_jjob_ens_init ........................   Passed   11.51 sec
      Start 1707: test_gdasapp_atm_jjob_ens_run
53/55 Test #1707: test_gdasapp_atm_jjob_ens_run .........................   Passed  253.19 sec
      Start 1708: test_gdasapp_atm_jjob_ens_final
54/55 Test #1708: test_gdasapp_atm_jjob_ens_final .......................   Passed   14.23 sec
      Start 1709: test_gdasapp_aero_gen_3dvar_yaml
55/55 Test #1709: test_gdasapp_aero_gen_3dvar_yaml ......................   Passed    0.21 sec

80% tests passed, 11 tests failed out of 55

Label Time Summary:
gdas-utils    =   9.13 sec*proc (9 tests)
script        =   9.13 sec*proc (9 tests)

Total Test time (real) = 451.50 sec

The following tests FAILED:
        1683 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP (Failed)
        1684 - test_gdasapp_soca_JGLOBAL_PREP_OCEAN_OBS (Failed)
        1685 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT (Failed)
        1686 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN (Failed)
        1687 - test_gdasapp_soca_copy_scratch (Failed)
        1688 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT (Failed)
        1689 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST (Failed)
        1690 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY (Failed)
        1691 - test_gdasapp_soca_socahybridweights (Failed)
        1692 - test_gdasapp_soca_incr_handler (Failed)
        1693 - test_gdasapp_soca_ens_handler (Failed)
Errors while running CTest

The test_gdasapp_soca failures appear to be similar to the $machine problem with the test_gdasapp_atm_jjob tests. test/soca/gw/CMakeLists.txt sets MACHINE based on whether a directory exists:

# Identify machine
set(MACHINE "container")
IF (IS_DIRECTORY /work2/noaa/da)
  set(MACHINE "orion")
  set(PARTITION "orion")
ENDIF()
IF (IS_DIRECTORY /scratch2/NCEPDEV/)
  set(MACHINE "hera")
  set(PARTITION "hera")
ENDIF()

Hercules and Orion share the same filesets. /work2/noaa/da is a valid directory on Hercules and Orion. We need a different way to distinguish between Hercules and Orion.

ush/soca/run_jjobs.py does not include hercules in its set of valid machines:

machines = {"container", "hera", "orion"}

While it is necessary to add hercules to the machine list, this alone is not sufficient. The yaml file processed by run_jjobs.py contains machine: orion. This traces back to test/soca/CMakeLists.txt.

Tagging @guillaumevernieres for awareness.

@DavidHuber-NOAA
Collaborator

@RussTreadon-NOAA The directory I suggest using to differentiate the two is /apps/other, which only exists on Hercules. You can start by checking for /work2/noaa/da to ensure you are testing just Hercules and Orion and limit the possibility of misidentifying another machine added in the future. This is how I did it in the global-workflow.
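For illustration only, the check order described above could be sketched in Python as follows. This is not code from GDASApp or global-workflow: the function name detect_machine and the injectable is_dir argument are hypothetical, and only the directory paths and machine names come from this thread.

```python
import os
from typing import Callable


def detect_machine(is_dir: Callable[[str], bool] = os.path.isdir) -> str:
    """Guess the platform from directories known to exist on it.

    is_dir is injectable so the logic can be exercised off-platform.
    """
    if is_dir("/work2/noaa/da"):
        # /work2/noaa/da is visible from both Orion and Hercules;
        # /apps/other is reported to exist only on Hercules.
        return "hercules" if is_dir("/apps/other") else "orion"
    if is_dir("/scratch2/NCEPDEV"):
        return "hera"
    # Fall back to the same default used in test/soca/gw/CMakeLists.txt.
    return "container"
```

Checking /work2/noaa/da first keeps the /apps/other test scoped to the Orion/Hercules pair, so a future machine that happens to have /apps/other is not misidentified as Hercules.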

@RussTreadon-NOAA
Contributor Author

Thank you @DavidHuber-NOAA for sharing your experience. test/soca/gw/CMakeLists.txt was updated following your example. GDASApp was rebuilt and ctests rerun. The number of soca failures was reduced to

        1685 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT (Failed)
        1686 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN (Failed)
        1688 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT (Failed)
        1689 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST (Failed)
        1690 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY (Failed)
        1692 - test_gdasapp_soca_incr_handler (Failed)

I see that g-w jobs/JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT sources the wrong config files.

-source "${HOMEgfs}/ush/jjob_header.sh" -e "ocnanalrun" -c "base ocnanal ocnanalrun"
+source "${HOMEgfs}/ush/jjob_header.sh" -e "ocnanalbmat" -c "base ocnanal ocnanalbmat"

After the above correction test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT passed.

Unfortunately, test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN failed with

+ exgdas_global_marine_analysis_run.sh[44]: srun -l --export=ALL -n 16 --cpus-per-task=2 /work2/noaa/da/rtreadon/git/global-workflow/hercules/sorc/gdas.cd/../../sorc/gdas.cd/build/bin/soca_var.x var.yaml
srun: error: Unable to create step for job 373792: More processors requested than permitted
+ exgdas_global_marine_analysis_run.sh[45]: export err=1

I see that g-w parm/config/gfs/config.resources specifies 2 threads. I changed this to one

     export npe_ocnanalrun=${npes}
-    export nth_ocnanalrun=2
+    export nth_ocnanalrun=1
     export is_exclusive=True

With this change test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN successfully ran to completion but I do not understand why 2 threads didn't work. 16 tasks with 2 threads per task fits on a Hercules node.

The remaining failed test_gdasapp_soca jobs successfully completed after getting test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN to pass.

The only exception is test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY. This job still failed. A check of the log file shows this job fails with

/work2/noaa/da/python/opt/core/miniconda3/4.6.14/envs/eva/lib/python3.9/site-packages/cartopy/io/__init__.py:241: DownloadWarning: Downloading: https://naturalearth.s3.amazonaws.com/110m_physical/ne_110m_coastline.zip
  warnings.warn(f'Downloading: {url}', DownloadWarning)
slurmstepd: error: *** JOB 373844 ON hercules-01-19 CANCELLED AT 2024-01-09T15:14:36 DUE TO TIME LIMIT ***

The GDASApp changes needed to get atm and soca ctests to run were committed to RussTreadon-NOAA:feature/hercules at 8e67f9b.

The following g-w changes have also been made

        modified:   jobs/JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT
        modified:   parm/config/gfs/config.resources
        modified:   sorc/build_all.sh

The change to sorc/build_all.sh allows GDASApp to be built on Hercules:

 if [[ "${_build_ufsda}" == "YES" ]]; then
-   if [[ "${MACHINE_ID}" != "orion" && "${MACHINE_ID}" != "hera" ]]; then
+   if [[ "${MACHINE_ID}" != "orion" && "${MACHINE_ID}" != "hera" && "${MACHINE_ID}" != "hercules" ]]; then
       echo "NOTE: The GDAS App is not supported on ${MACHINE_ID}.  Disabling build."
    else

The above g-w changes remain local to my working copy of g-w on hercules. Look in /work2/noaa/da/rtreadon/git/global-workflow/hercules

Running RussTreadon-NOAA:feature/hercules at 8e67f9b with the above local g-w changes yields the following ctest results

(gdasapp) hercules-login-4:/work2/noaa/da/rtreadon/git/global-workflow/hercules/sorc/gdas.cd/build$ ctest -R test_gdasapp
Test project /work2/noaa/da/rtreadon/git/global-workflow/hercules/sorc/gdas.cd/build
      Start 1324: test_gdasapp_util_coding_norms
 1/55 Test #1324: test_gdasapp_util_coding_norms ........................   Passed    0.78 sec
      Start 1325: test_gdasapp_util_ioda_example
 2/55 Test #1325: test_gdasapp_util_ioda_example ........................   Passed    0.14 sec
      Start 1326: test_gdasapp_util_prepdata
 3/55 Test #1326: test_gdasapp_util_prepdata ............................   Passed    0.90 sec
      Start 1327: test_gdasapp_util_rads2ioda
 4/55 Test #1327: test_gdasapp_util_rads2ioda ...........................   Passed    0.10 sec
      Start 1328: test_gdasapp_util_ghrsst2ioda
 5/55 Test #1328: test_gdasapp_util_ghrsst2ioda .........................   Passed    0.48 sec
      Start 1329: test_gdasapp_util_smap2ioda
 6/55 Test #1329: test_gdasapp_util_smap2ioda ...........................   Passed    0.11 sec
      Start 1330: test_gdasapp_util_smos2ioda
 7/55 Test #1330: test_gdasapp_util_smos2ioda ...........................   Passed    0.09 sec
      Start 1331: test_gdasapp_util_viirsaod2ioda
 8/55 Test #1331: test_gdasapp_util_viirsaod2ioda .......................   Passed    0.24 sec
      Start 1332: test_gdasapp_util_icecamsr2ioda
 9/55 Test #1332: test_gdasapp_util_icecamsr2ioda .......................   Passed    0.24 sec
      Start 1664: test_gdasapp_check_python_norms
10/55 Test #1664: test_gdasapp_check_python_norms .......................   Passed    1.08 sec
      Start 1665: test_gdasapp_check_yaml_keys
11/55 Test #1665: test_gdasapp_check_yaml_keys ..........................   Passed    0.06 sec
      Start 1666: test_gdasapp_jedi_increment_to_fv3
12/55 Test #1666: test_gdasapp_jedi_increment_to_fv3 ....................   Passed    0.28 sec
      Start 1667: test_gdasapp_convert_ewok_yaml
13/55 Test #1667: test_gdasapp_convert_ewok_yaml ........................   Passed    0.17 sec
      Start 1668: test_gdasapp_setup_cycled_exp
14/55 Test #1668: test_gdasapp_setup_cycled_exp .........................   Passed    0.44 sec
      Start 1669: test_gdasapp_convert_bufr_temp_dbuoy
15/55 Test #1669: test_gdasapp_convert_bufr_temp_dbuoy ..................   Passed    3.01 sec
      Start 1670: test_gdasapp_convert_bufr_salt_dbuoy
16/55 Test #1670: test_gdasapp_convert_bufr_salt_dbuoy ..................   Passed    0.18 sec
      Start 1671: test_gdasapp_convert_bufr_temp_mbuoyb
17/55 Test #1671: test_gdasapp_convert_bufr_temp_mbuoyb .................   Passed    0.18 sec
      Start 1672: test_gdasapp_convert_bufr_salt_mbuoyb
18/55 Test #1672: test_gdasapp_convert_bufr_salt_mbuoyb .................   Passed    0.18 sec
      Start 1673: test_gdasapp_convert_bufr_tesacprof
19/55 Test #1673: test_gdasapp_convert_bufr_tesacprof ...................   Passed    0.19 sec
      Start 1674: test_gdasapp_convert_bufr_trkobprof
20/55 Test #1674: test_gdasapp_convert_bufr_trkobprof ...................   Passed    0.18 sec
      Start 1675: test_gdasapp_convert_bufr_sfcships
21/55 Test #1675: test_gdasapp_convert_bufr_sfcships ....................   Passed    0.34 sec
      Start 1676: test_gdasapp_convert_bufr_sfcshipsu
22/55 Test #1676: test_gdasapp_convert_bufr_sfcshipsu ...................   Passed    0.18 sec
      Start 1677: test_gdasapp_soca_obsdb
23/55 Test #1677: test_gdasapp_soca_obsdb ...............................   Passed    0.78 sec
      Start 1678: test_gdasapp_soca_nsst_increment_to_mom6
24/55 Test #1678: test_gdasapp_soca_nsst_increment_to_mom6 ..............   Passed    5.02 sec
      Start 1679: test_gdasapp_soca_prep
25/55 Test #1679: test_gdasapp_soca_prep ................................   Passed    6.71 sec
      Start 1680: test_gdasapp_soca_concatioda
26/55 Test #1680: test_gdasapp_soca_concatioda ..........................   Passed    1.52 sec
      Start 1681: test_gdasapp_soca_run_clean
27/55 Test #1681: test_gdasapp_soca_run_clean ...........................   Passed    0.17 sec
      Start 1682: test_gdasapp_soca_setup_obsproc
28/55 Test #1682: test_gdasapp_soca_setup_obsproc .......................   Passed    0.38 sec
      Start 1683: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP
29/55 Test #1683: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP ....   Passed  173.83 sec
      Start 1684: test_gdasapp_soca_JGLOBAL_PREP_OCEAN_OBS
30/55 Test #1684: test_gdasapp_soca_JGLOBAL_PREP_OCEAN_OBS ..............   Passed   42.15 sec
      Start 1685: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT
31/55 Test #1685: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT ....   Passed   42.15 sec
      Start 1686: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN
32/55 Test #1686: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN .....   Passed   42.16 sec
      Start 1687: test_gdasapp_soca_copy_scratch
33/55 Test #1687: test_gdasapp_soca_copy_scratch ........................   Passed    6.42 sec
      Start 1688: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT
34/55 Test #1688: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT ...   Passed   42.15 sec
      Start 1689: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST
35/55 Test #1689: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST ....   Passed   42.15 sec
      Start 1690: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY
36/55 Test #1690: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY ....***Failed  650.29 sec
      Start 1691: test_gdasapp_soca_socahybridweights
37/55 Test #1691: test_gdasapp_soca_socahybridweights ...................   Passed   10.13 sec
      Start 1692: test_gdasapp_soca_incr_handler
38/55 Test #1692: test_gdasapp_soca_incr_handler ........................   Passed    2.11 sec
      Start 1693: test_gdasapp_soca_ens_handler
39/55 Test #1693: test_gdasapp_soca_ens_handler .........................   Passed   10.12 sec
      Start 1694: test_gdasapp_land_create_ens
40/55 Test #1694: test_gdasapp_land_create_ens ..........................   Passed    1.78 sec
      Start 1695: test_gdasapp_land_imsproc
41/55 Test #1695: test_gdasapp_land_imsproc .............................   Passed    2.36 sec
      Start 1696: test_gdasapp_land_apply_jediincr
42/55 Test #1696: test_gdasapp_land_apply_jediincr ......................   Passed    4.90 sec
      Start 1697: test_gdasapp_land_letkfoi_snowda
43/55 Test #1697: test_gdasapp_land_letkfoi_snowda ......................   Passed   13.60 sec
      Start 1698: test_gdasapp_convert_bufr_adpsfc_snow
44/55 Test #1698: test_gdasapp_convert_bufr_adpsfc_snow .................   Passed    2.32 sec
      Start 1699: test_gdasapp_convert_bufr_adpsfc
45/55 Test #1699: test_gdasapp_convert_bufr_adpsfc ......................   Passed    3.03 sec
      Start 1700: test_gdasapp_convert_gsi_satbias
46/55 Test #1700: test_gdasapp_convert_gsi_satbias ......................   Passed    6.03 sec
      Start 1701: test_gdasapp_store_gsi_satbias
47/55 Test #1701: test_gdasapp_store_gsi_satbias ........................   Passed    0.41 sec
      Start 1702: test_gdasapp_setup_atm_cycled_exp
48/55 Test #1702: test_gdasapp_setup_atm_cycled_exp .....................   Passed    0.86 sec
      Start 1703: test_gdasapp_atm_jjob_var_init
49/55 Test #1703: test_gdasapp_atm_jjob_var_init ........................   Passed   44.35 sec
      Start 1704: test_gdasapp_atm_jjob_var_run
50/55 Test #1704: test_gdasapp_atm_jjob_var_run .........................   Passed   74.13 sec
      Start 1705: test_gdasapp_atm_jjob_var_final
51/55 Test #1705: test_gdasapp_atm_jjob_var_final .......................   Passed   42.12 sec
      Start 1706: test_gdasapp_atm_jjob_ens_init
52/55 Test #1706: test_gdasapp_atm_jjob_ens_init ........................   Passed   44.16 sec
      Start 1707: test_gdasapp_atm_jjob_ens_run
53/55 Test #1707: test_gdasapp_atm_jjob_ens_run .........................   Passed  330.15 sec
      Start 1708: test_gdasapp_atm_jjob_ens_final
54/55 Test #1708: test_gdasapp_atm_jjob_ens_final .......................   Passed   42.14 sec
      Start 1709: test_gdasapp_aero_gen_3dvar_yaml
55/55 Test #1709: test_gdasapp_aero_gen_3dvar_yaml ......................   Passed    0.22 sec

98% tests passed, 1 tests failed out of 55

Label Time Summary:
gdas-utils    =   3.08 sec*proc (9 tests)
script        =   3.08 sec*proc (9 tests)

Total Test time (real) = 1700.68 sec

The following tests FAILED:
        1690 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY (Failed)
Errors while running CTest

@RussTreadon-NOAA
Contributor Author

@guillaumevernieres, @AndrewEichmann-NOAA, @CoryMartin-NOAA , and @DavidHuber-NOAA : this PR is ready for merger into develop apart from three items:

  1. test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY fails because the cartopy download of ne_110m_coastline.zip fails (hangs). Should we download and install this file somewhere on Hercules?

  2. Execution of soca_var.x in test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN fails when running srun -l --export=ALL -n 16 --cpus-per-task=2, with the error message "More processors requested than permitted".
    config.resources specifies two threads for ocnanalrun. Changing this to one thread allows test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN to run successfully, but why did srun complain about 2 threads? 16 tasks with 2 threads per task should fit on a Hercules node.

  3. To fully enable the GDASApp build from g-w, we need to update two, possibly three, g-w files:

  • jobs/JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT - source correct config files
  • parm/config/gfs/config.resources - change ocnanalrun to 1 thread
  • sorc/build_all.sh - add hercules as valid platform for GDASApp build

An additional change would be to update the hash for the gdas.cd submodule. A g-w issue and PR are needed to get these changes into g-w develop.

@RussTreadon-NOAA
Contributor Author

RussTreadon-NOAA commented Jan 10, 2024

test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN

The srun error is due to a mismatch between how the job is configured via #SBATCH lines in the submitted run script and the srun options used to execute soca_var.x.

Script ush/soca/run_jjobs.py creates and submits the local file run_jjobs.sh to run JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN. The #SBATCH preamble in run_jjobs.sh is:

#!/usr/bin/env bash
# Running on hercules
#SBATCH --account=da-cpu
#SBATCH --qos=batch
#SBATCH --output=JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN.out
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --partition=hercules
#SBATCH --time=00:10:00

whereas the srun command executed by exgdas_global_marine_analysis_run.sh is srun -l --export=ALL -n 16 --cpus-per-task=2. The line #SBATCH --cpus-per-task=2 is missing from run_jjobs.sh.

Manually add the line #SBATCH --cpus-per-task=2 to run_jjobs.sh and resubmit. The script successfully ran soca_var.x via srun -l --export=ALL -n 16 --cpus-per-task=2.
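The mismatch can be made concrete with a little arithmetic. The Python sketch below is only an illustration, not Slurm's actual accounting: step_fits is a hypothetical helper, and it assumes the job allocation holds exactly ntasks × cpus-per-task CPUs (Slurm defaults to one CPU per task when --cpus-per-task is not given) rather than the whole node.

```python
def step_fits(alloc_ntasks: int, alloc_cpus_per_task: int,
              step_ntasks: int, step_cpus_per_task: int) -> bool:
    """Return True if an srun step's CPU request fits in the allocation.

    Simplified model: CPUs = tasks * cpus-per-task on both sides.
    """
    allocated = alloc_ntasks * alloc_cpus_per_task
    requested = step_ntasks * step_cpus_per_task
    return requested <= allocated


# Preamble without --cpus-per-task: 16 * 1 = 16 CPUs allocated,
# but the step asks for 16 * 2 = 32, so it does not fit.
without_fix = step_fits(16, 1, 16, 2)

# Preamble with "#SBATCH --cpus-per-task=2": 32 CPUs allocated, 32 requested.
with_fix = step_fits(16, 2, 16, 2)
```

Under this model the node size never enters the picture, which would explain why the step was rejected even though 32 CPUs fit on a Hercules node.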

It is not yet clear to me how to modify run_jjobs.py so that it inserts the correct #SBATCH --cpus-per-task line for each job for which it creates and submits run_jjobs.sh. Most jobs that run_jjobs.py launches run with one thread, but config.resources specifies two threads for JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN.
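One possible shape for such a change, sketched under assumptions: the helper sbatch_preamble and its threads argument are hypothetical and do not reflect how run_jjobs.py is actually organized. The idea is that the caller passes the per-job thread count (for example, the nth_ocnanalrun value from config.resources) and the preamble emits --cpus-per-task only when more than one thread is requested.

```python
def sbatch_preamble(job: str, ntasks: int, threads: int = 1,
                    account: str = "da-cpu", qos: str = "batch",
                    partition: str = "hercules",
                    walltime: str = "00:10:00") -> str:
    """Build an #SBATCH preamble like the one shown above for run_jjobs.sh."""
    lines = [
        "#!/usr/bin/env bash",
        f"#SBATCH --account={account}",
        f"#SBATCH --qos={qos}",
        f"#SBATCH --output={job}.out",
        "#SBATCH --nodes=1",
        f"#SBATCH --ntasks={ntasks}",
    ]
    if threads > 1:
        # Match the --cpus-per-task value that the ex-script passes to srun.
        lines.append(f"#SBATCH --cpus-per-task={threads}")
    lines += [
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={walltime}",
    ]
    return "\n".join(lines) + "\n"


run_preamble = sbatch_preamble("JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN", 16, threads=2)
```

Single-threaded jobs keep the preamble they have today, so only jobs whose config requests extra threads would see a change.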

@CoryMartin-NOAA
Contributor

The download warning must mean that the home directories are not shared between Orion and Hercules. If you copy the shapefiles from ~/.local/cartopy on Orion to the same path on Hercules (or run cartopy once in a login shell), the test should pass.

@RussTreadon-NOAA
Contributor Author

Thank you @CoryMartin-NOAA . I rsync'd shapefiles from my Orion cartopy directory to Hercules. Does every user need their own snapshot of the shapefiles? Can we have the shapefiles installed in a single publicly shared location?

I reran test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY. The job got past the previous cartopy download failure. @guillaumevernieres and @AndrewEichmann-NOAA , the VRFY job now fails with the message

2024-01-10 14:52:08:ERROR:Error occurred when attempting to load: /work2/noaa/da/rtreadon/git/global-workflow/hercules/sorc/gd\
as.cd/../../sorc/gdas.cd/build/test/soca/gw/testrun/testjjobs/ROTDIRS/gdas_test/gdas.20180415/12//analysis/ocean/yaml/var_orig\
inal.yaml, error: while scanning for the next token
found character '%' that cannot start any token
  in "/work2/noaa/da/rtreadon/git/global-workflow/hercules/sorc/gdas.cd/../../sorc/gdas.cd/build/test/soca/gw/testrun/testjjob\
s/ROTDIRS/gdas_test/gdas.20180415/12//analysis/ocean/yaml/var_original.yaml", line 135, column 20
Traceback (most recent call last):
  File "/work2/noaa/da/rtreadon/git/global-workflow/hercules/sorc/gdas.cd/../../sorc/gdas.cd/scripts/exgdas_global_marine_anal\
ysis_vrfy.py", line 219, in <module>
    gen_eva_obs_yaml.gen_eva_obs_yaml(varyaml, marinetemplate, 'preevayamls')
  File "/work2/noaa/da/rtreadon/git/global-workflow/hercules/sorc/gdas.cd/ush/eva/gen_eva_obs_yaml.py", line 20, in gen_eva_obs_yaml
    if 'cost function' in jedi_yaml_dict:
UnboundLocalError: local variable 'jedi_yaml_dict' referenced before assignment

Line 135 of var_original.yaml is the pattern line below

            - hsnon
          pattern: %mem%
          nmembers: 4

How do we fix this?
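The UnboundLocalError at the end of the traceback looks like a secondary symptom. The log is consistent with a loader that reports the YAML parse error and then carries on, leaving jedi_yaml_dict unassigned. The Python sketch below reproduces that pattern with stand-in names: `ParseError`, `parse`, `load_swallowing_errors`, and `load_failing_fast` are hypothetical and are not the gen_eva_obs_yaml.py code.

```python
class ParseError(Exception):
    """Stand-in for a YAML scanner error (hypothetical)."""


def parse(text):
    # Toy parser: reject any value that starts with '%', accept the rest.
    result = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if value.startswith("%"):
            raise ParseError(f"found character '%' that cannot start any token: {line!r}")
        result[key.strip()] = value.strip("'")
    return result


def load_swallowing_errors(text):
    # Problematic pattern: the error is logged but execution continues,
    # so `config` is never bound when parsing fails.
    try:
        config = parse(text)
    except ParseError as err:
        print(f"ERROR: {err}")
    return "cost function" in config  # UnboundLocalError on parse failure


def load_failing_fast(text):
    # Safer pattern: re-raise so the caller sees the real cause.
    try:
        config = parse(text)
    except ParseError as err:
        raise RuntimeError(f"could not load YAML: {err}") from err
    return "cost function" in config
```

With the second form, the job would stop at the parse error itself rather than at a later, less informative exception.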

@CoryMartin-NOAA
Contributor

I think there's an env var one can set, @ShastriPaturi did this on Hera I think for role.jedipara.
See a familiar face here: SciTools/cartopy#1827

@RussTreadon-NOAA
Contributor Author

I think there's an env var one can set, @ShastriPaturi did this on Hera I think for role.jedipara. See a familiar face here: SciTools/cartopy#1827

Thanks, @CoryMartin-NOAA . Two questions:

  1. Are cartopy shapefiles installed in a central location on Orion and Hercules? If so, where?
  2. @ShastriPaturi , where did you set CARTOPY_DATA_DIR in the Hera role.jedipara environment?

@CoryMartin-NOAA
Contributor

@RussTreadon-NOAA I've copied mine to /work2/noaa/da/cmartin/GDASApp/fix/cartopy (from ~/.local/share/cartopy) but I'm not sure exactly which subdirectory the env var should be set to in order to find these properly.
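One way to narrow down which directory the setting should point at is to look for the subtree that cartopy downloads into. The Python sketch below assumes, without having verified it against the installed cartopy version, that Natural Earth files live under `<data_dir>/shapefiles/natural_earth/`. The helper name `find_cartopy_data_dir` is hypothetical.

```python
import os


def find_cartopy_data_dir(root):
    """Return the directory under `root` that contains shapefiles/natural_earth.

    Assumption: cartopy stores Natural Earth data in
    <data_dir>/shapefiles/natural_earth/<category>/. Returns None when no
    directory at or below `root` matches that layout.
    """
    for dirpath, _dirnames, _filenames in os.walk(root):
        if os.path.isdir(os.path.join(dirpath, "shapefiles", "natural_earth")):
            return dirpath
    return None
```

Calling `find_cartopy_data_dir("/work2/noaa/da/cmartin/GDASApp/fix/cartopy")` would return a candidate value for the data directory, or None if the copy does not have that layout.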

@RussTreadon-NOAA
Contributor Author

@CoryMartin-NOAA suggested adding quotes around %mem%. I did so in the local copy of var_original.yaml so that it reads

            - hsnon
          pattern: '%mem%'
          nmembers: 4

The rerun of test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY passed

1/1 Test #1690: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY ...   Passed  362.35 sec

The following tests passed:
        test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 362.70 sec

As a test, I removed the quotes and reran the VRFY test. Once again the test failed with the same error message as above.

1/1 Test #1690: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY ...***Failed  330.51 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) = 330.88 sec

The pattern section of var_original.yaml is from parm/soca/berror/saber_blocks.yaml. Interestingly, quotes are present in saber_blocks.yaml.

        state variables: [tocn, socn, ssh, uocn, vocn, cicen, hicen, hsnon]
      pattern: '%mem%'
      nmembers: ${ENS_SIZE}

The quotes must be getting stripped in the process of combining the various source yamls into var_original.yaml.

@CoryMartin-NOAA and @guillaumevernieres , is there a way to construct var_original.yaml such that the quotes in saber_blocks.yaml are retained?

@RussTreadon-NOAA
Contributor Author

I am fine with merging this PR into develop.

Most soca ctest failures can be resolved by updating global-workflow. Doing so requires a separate issue and PR in g-w. How best to resolve the yaml quote issue in the soca VRFY ctest can be addressed in a future GDASApp issue.

@CoryMartin-NOAA
Contributor

I'm perplexed by this '%mem%' issue since that seems like it has been there for 8 months. Why is it only an issue on Hercules?

@RussTreadon-NOAA
Contributor Author

I'm perplexed by this '%mem%' issue since that seems like it has been there for 8 months. Why is it only an issue on Hercules?

Out of curiosity I installed the current head of GDASApp develop at 4846a7f on Orion inside the current head of g-w develop at 4cb5802. test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY passed on Orion.

      Start 1689: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY
34/52 Test #1689: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY ....   Passed  654.63 sec

A check of var_original.yaml on Orion shows that quotes are retained!

            - hsnon
          pattern: '%mem%'
          nmembers: '4'

This prompted me to rerun all test_gdasapp tests on Hercules. This time the Hercules var_original.yaml retains quotes and the VRFY job passes.

      Start 1690: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY
36/55 Test #1690: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY ....   Passed  330.59 sec

The following section of scripts/exgdas_global_marine_analysis_run.sh removes quotes when creating var.yaml.

function clean_yaml()
{
    mv $1 tmp_yaml;
    sed -e "s/'//g" tmp_yaml > $1
}

################################################################################
# run 3DVAR FGAT
cp var.yaml var_original.yaml
clean_yaml var.yaml

I cannot explain the previous failure on Hercules apart from user (i.e., me) error.
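For reference, the effect of clean_yaml on the pattern line can be reproduced outside the job. The Python sketch below mirrors the sed expression `s/'//g` and flags mapping values that begin with `%`, a character YAML reserves as an indicator and which therefore cannot start an unquoted scalar. The function names are illustrative only and are not part of GDASApp.

```python
def strip_single_quotes(text):
    """Python equivalent of sed -e "s/'//g": drop every single quote."""
    return text.replace("'", "")


def unquoted_percent_lines(text):
    """Return 1-based line numbers whose mapping value starts with '%'."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        _, sep, value = line.partition(":")
        if sep and value.strip().startswith("%"):
            flagged.append(number)
    return flagged


original = "- hsnon\npattern: '%mem%'\nnmembers: '4'\n"
cleaned = strip_single_quotes(original)

print(unquoted_percent_lines(original))
print(unquoted_percent_lines(cleaned))
```

The quoted text yields no flagged lines, while the cleaned copy flags the pattern line. This matches the order of operations in the script excerpt: var_original.yaml is copied before clean_yaml runs and so keeps its quotes, whereas var.yaml does not.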

@CoryMartin-NOAA CoryMartin-NOAA merged commit cc4c940 into NOAA-EMC:develop Jan 11, 2024
2 checks passed
@RussTreadon-NOAA RussTreadon-NOAA deleted the feature/hercules branch January 17, 2024 14:11