WOMBATlite tracers are not deterministic with MOM6 symmetric memory #289

Open
dougiesquire opened this issue Feb 27, 2025 · 7 comments

@dougiesquire
Collaborator

dougiesquire commented Feb 27, 2025

Historical repro checks on the dev-1deg_jra55do_ryf_wombatlite branch fail because WOMBATlite tracers are not bitwise reproducible - e.g. see ACCESS-NRI/access-om3-configs#199 (comment). I think this will take some digging to resolve, so I'm opening this issue to keep track of things.

@dougiesquire dougiesquire self-assigned this Feb 27, 2025
@MartinDix

What about WOMBATlite in ESM1.6? CABLE isn't reproducible in some versions though, so we would need to be careful about what was tested.

@dougiesquire
Collaborator Author

To try to determine whether the issue is in the generic tracer shared code or specifically in the WOMBATlite module, I set up a 1 deg ACCESS-OM3 + BLING config in this branch: 289-1deg_jra55do_ryf_bling.

This configuration also fails the historical repro check for the BLING tracers:

diff --git a/control/om3_1deg_jra55do_ryf_bling-test_bit_repro_historical/testing/checksum/historical-3hr-checksum.json b/checksum/historical-3hr-checksum.json
index c673c39..78dd7c8 100644
--- a/control/om3_1deg_jra55do_ryf_bling-test_bit_repro_historical/testing/checksum/historical-3hr-checksum.json
+++ b/checksum/historical-3hr-checksum.json
@@ -44,25 +44,25 @@
       "B64AD181B58D3F4B"
     ],
     "alk": [
-      "1BE0A1F18199262"
+      "1F4C091BFFC81F3E"
     ],
     "ave_ssh": [
       "40E5A21D6F0D1F87"
     ],
     "biomass_p": [
-      "E6C8E6FA73781EE1"
+      "321423192F3052EA"
     ],
     "cased": [
-      "385E1D16040802FD"
+      "1916F3449A3B4B1A"
     ],
     "chl": [
-      "3CFCDA0157D051EF"
+      "4DC58E47DE015488"
     ],
     "co3_ion": [
-      "5E249562C7AC5725"
+      "92A52D91592F8461"
     ],
     "dic": [
-      "4BB5268ACCFD468E"
+      "F1C1D4A3E01616FB"
     ],
     "diffu": [
       "749F723BA63A013C"
@@ -71,10 +71,10 @@
       "CD7614674E679DA8"
     ],
     "dop": [
-      "3EA7A97240B9A2EB"
+      "6893C9AE6145B4D2"
     ],
     "fed": [
-      "57675B5F0ED845D5"
+      "1CC7FE18C07E404B"
     ],
     "frazil": [
       "5FE9C9F905239FC4"
@@ -86,19 +86,19 @@
       "A815BF54039C2D82"
     ],
     "htotal": [
-      "D0E323115D157D09"
+      "FBD8FE65155581DE"
     ],
     "irr_mem": [
-      "61642F0F4BCBE233"
+      "5F2163824369219"
     ],
     "o2": [
-      "4596FA109C8FF43E"
+      "36CBB6A2BEF40FCA"
     ],
     "p_surf_EOS": [
       "161DADB852A006F1"
     ],
     "po4": [
-      "5A45C4D691BF13D2"
+      "3911BB7A6860093D"
     ],
     "sfc": [
       "68FCD5E9DE45D1DF"

So this check didn't really narrow things down at all. The issue could be coming from:

  • the generic tracer shared code
  • our modifications to the MOM6 NUOPC cap to allow generic tracers
  • both the BLING and WOMBATlite modules
  • any combination of the above
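
When comparing checksum files like the ones in the diff above, it can help to list only the fields whose checksums differ. The following is a minimal sketch, not part of model-config-tests: the helper names `flatten` and `differing_fields` are hypothetical, and the `"output"` nesting in the example data is an assumption about the file layout. The three checksum entries are copied from the diff.

```python
import json


def flatten(node, prefix=""):
    """Flatten nested dicts into {dotted_key: value}; non-dict values are leaves."""
    if isinstance(node, dict):
        flat = {}
        for key, value in node.items():
            flat.update(flatten(value, f"{prefix}{key}."))
        return flat
    return {prefix[:-1]: node}


def differing_fields(reference, candidate):
    """Return sorted keys whose values differ or that exist in only one input."""
    ref, cand = flatten(reference), flatten(candidate)
    return sorted(k for k in ref.keys() | cand.keys() if ref.get(k) != cand.get(k))


# Small in-memory example; the "output" wrapper is an assumed layout.
reference = json.loads(
    '{"output": {"alk": ["1BE0A1F18199262"],'
    ' "ave_ssh": ["40E5A21D6F0D1F87"],'
    ' "o2": ["4596FA109C8FF43E"]}}'
)
candidate = json.loads(
    '{"output": {"alk": ["1F4C091BFFC81F3E"],'
    ' "ave_ssh": ["40E5A21D6F0D1F87"],'
    ' "o2": ["36CBB6A2BEF40FCA"]}}'
)

changed = differing_fields(reference, candidate)
print(changed)
```

In the diff above, only the biogeochemical tracer fields change, while physical fields such as `ave_ssh`, `frazil` and `p_surf_EOS` keep their checksums.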

@dougiesquire
Collaborator Author

dougiesquire commented Feb 28, 2025

What about WOMBATlite in ESM1.6?

@MartinDix WOMBATlite tracers in ESM1.6 are deterministic. I checked this dev-preindustrial+concentrations configuration using model-config-tests from this PR. After first updating the reference checksums because they were out of date:

$ model-config-tests -m checksum
================================================= test session starts ==================================================
platform linux -- Python 3.10.0, pytest-8.3.4, pluggy-1.5.0
rootdir: /g/data/tm70/ds0092/model/config/model-config-tests
configfile: pyproject.toml
plugins: cov-6.0.0
collected 56 items / 55 deselected / 1 selected

../model-config-tests/src/model_config_tests/test_bit_reproducibility.py .                                       [100%]

===================================== 1 passed, 55 deselected in 517.51s (0:08:37) =====================================

So it's something about our use of WOMBATlite with MOM6 that causes the issue. Our changes to the MOM6 NUOPC cap seem the probable cause.

ADDED: ACCESS-ESM1.6 isn't restart reproducible (2x1month vs 1x2month runs differ), but that is not only due to WOMBATlite. Note that running this check with model-config-tests requires some modifications; see ACCESS-NRI/model-config-tests#123.

@dougiesquire dougiesquire changed the title WOMBATlite tracers are not reproducible WOMBATlite tracers are not deterministic Mar 2, 2025
@dougiesquire
Collaborator Author

WOMBATlite tracers in OM2 are deterministic; however, they are not restart reproducible. I checked this 1deg_jra55_ryf_wombatlite configuration. After first updating the reference checksums because they were out of date:

$ model-config-tests -m 'checksum or checksum_slow'
======================================================== test session starts ========================================================
platform linux -- Python 3.10.0, pytest-8.3.4, pluggy-1.5.0
rootdir: /g/data/tm70/ds0092/model/config/model-config-tests
configfile: pyproject.toml
plugins: cov-6.0.0
collected 56 items / 53 deselected / 3 selected

../model-config-tests/src/model_config_tests/test_bit_reproducibility.py ..F                                                  [100%]

============================================================= FAILURES ==============================================================
_____________________________________________ TestBitReproducibility.test_restart_repro _____________________________________________
.
.
.
../model-config-tests/src/model_config_tests/test_bit_reproducibility.py:203: AssertionError
------------------------------------------------------- Captured stdout call --------------------------------------------------------
['/scratch/tm70/ds0092/tmp/test-model-repro/control/om2_1deg_jra55_ryf_wombatlite_repro-test_restart_repro_2x1day/1deg_jra55_ryf.o136171583']
['/scratch/tm70/ds0092/tmp/test-model-repro/control/om2_1deg_jra55_ryf_wombatlite_repro-test_restart_repro_2x1day/1deg_jra55_ryf.o136171815']
['/scratch/tm70/ds0092/tmp/test-model-repro/control/om2_1deg_jra55_ryf_wombatlite_repro-test_restart_repro_2day/1deg_jra55_ryf.o136171848']
Unequal checksum: fe: 5073509194411859757
Unequal checksum: alk: 6040048339886718124
Unequal checksum: dicr: 827074460521828265
Unequal checksum: dicp: -5832384118945528393
Unequal checksum: dic: -8846842365167654943
Unequal checksum: caco3: -1046332211322085776
Unequal checksum: detfe: -2427454795934160686
Unequal checksum: det: -5806657737020406331
Unequal checksum: zoofe: 3321917421957310585
Unequal checksum: zoo: 8017834511148714162
Unequal checksum: o2: 8467935364592519065
Unequal checksum: phyfe: 6572384186769650546
Unequal checksum: pchl: -4180477745958961613
Unequal checksum: phy: -3240759858822795484
Unequal checksum: no3: 7016381418954613286
Unequal checksum: caco3bury: 8200280247876020679
Unequal checksum: detbury: 1574276993404708052
Unequal checksum: caco3_sediment: -6727768601044577267
Unequal checksum: detfe_sediment: 3294342256399119565
Unequal checksum: det_sediment: 6110075955233090468
====================================================== short test summary info ======================================================
FAILED ../model-config-tests/src/model_config_tests/test_bit_reproducibility.py::TestBitReproducibility::test_restart_repro - assert False
====================================== 1 failed, 2 passed, 53 deselected in 801.40s (0:13:21) =======================================
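
The difference between "deterministic" and "restart reproducible" in the results above can be illustrated with a toy model. This is a hedged sketch only: it is not WOMBATlite or MOM6 code, the functions `checksum` and `run` are hypothetical, and the un-saved `memory` variable is just one assumed way a model can be deterministic without being restart reproducible.

```python
import hashlib
import struct


def checksum(state):
    """Hash the exact bit pattern of a list of floats (hypothetical helper)."""
    packed = struct.pack(f"{len(state)}d", *state)
    return hashlib.sha256(packed).hexdigest()[:16]


def run(state, nsteps, memory=0.0):
    """Toy model: advance `state` by `nsteps` steps.

    `memory` is internal state that is NOT returned, standing in for any
    field that is missing from a restart file.
    """
    state = list(state)
    for _ in range(nsteps):
        memory = 0.5 * memory + 0.1
        state = [x + 0.01 * memory for x in state]
    return state


initial = [1.0, 2.0, 3.0]

# Determinism: two identical runs from the same initial state.
deterministic = checksum(run(initial, 2)) == checksum(run(initial, 2))

# Restart reproducibility: one 2-step run vs. two 1-step runs, where the
# second segment restarts from the saved state only (memory is reset).
one_segment = run(initial, 2)
two_segments = run(run(initial, 1), 1)
restart_reproducible = checksum(one_segment) == checksum(two_segments)

print(f"deterministic: {deterministic}")
print(f"restart reproducible: {restart_reproducible}")
```

The toy passes the determinism check but fails the restart check, which is the same pattern as the two passes and one failure in the test session above.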

@MartinDix

ADDED: ACCESS-ESM1.6 isn't restart reproducible (2x1month vs 1x2month runs differ) but that is not due to WOMBATlite (note, to run this check with model-config-tests requires some modifications - see ACCESS-NRI/model-config-tests#123)

The ESM1.5/6 atmosphere is not restart reproducible because of CABLE.

@dougiesquire
Collaborator Author

Note, the most recent release of ACCESS-OM3 (2025.01.0) does not compile ESMF with -fp-model precise. However, even with ESMF compiled with -fp-model precise, WOMBATlite tracers are non-deterministic.
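
As background on why a flag like -fp-model precise is relevant to bitwise reproducibility: floating-point addition is not associative, so any optimisation that reorders a sum can change the last bits of the result. The snippet below is a generic Python illustration of that fact, not code from ESMF, MOM6 or WOMBATlite.

```python
# The same three numbers summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The results differ in the last bit, so a bitwise checksum of them differs too.
print(left_to_right == right_to_left)
print(left_to_right.hex(), right_to_left.hex())
```

Since the tracers remain non-deterministic with the flag applied, reordering of this kind inside ESMF does not appear to be the whole explanation here.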

@dougiesquire dougiesquire moved this to In Progress in ACCESS-OM3 025 Mar 7, 2025
@dougiesquire
Collaborator Author

This is yet another issue caused by using mom_symmetric (which also broke historical repro and restart repro in ACCESS-OM3). We're going to revert to non-symmetric memory in ACCESS-OM3 while we investigate, but I'll leave this issue open as a reminder to check this in the future.

@dougiesquire dougiesquire changed the title WOMBATlite tracers are not deterministic WOMBATlite tracers are not deterministic with MOM6 symmetric memory Mar 7, 2025