Debugging non-reproducibility of ACCESS-NRI 0.3.1 -> 0.4.0 #173

Closed
wants to merge 9 commits into from

dougiesquire
Collaborator

DO NOT MERGE

Companion PR to ACCESS-NRI/ACCESS-OM3#47 to debug and document where we lost reproducibility when we updated from 0.3.1 to 0.4.0.

@dougiesquire
Collaborator Author

!test repro

✅ The Bitwise Reproducibility Check Succeeded ✅

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit dcdacd5), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/dcdacd53427156984c278888aafaea89928c9cea, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37157098873.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13305855537/artifacts/2584899472.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum
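For readers unfamiliar with what this check does, here is a minimal sketch of a bitwise comparison between two checksum files. The JSON layout (an object mapping a field name to a list of checksum strings) and the function name `diff_checksums` are assumptions made for illustration; they are not the actual file format or code used by the repro CI.

```python
import json
from pathlib import Path


def diff_checksums(new_path, reference_path):
    """Return the sorted names of fields whose checksums differ.

    Assumes (hypothetically) that each file is a JSON object mapping a
    field name to a list of checksum strings.
    """
    new = json.loads(Path(new_path).read_text())
    reference = json.loads(Path(reference_path).read_text())
    # A field counts as different if it is missing on either side
    # or if any of its checksums changed.
    return sorted(
        name
        for name in new.keys() | reference.keys()
        if new.get(name) != reference.get(name)
    )
```

Under these assumptions, an empty list corresponds to a passing check, and any returned field name points at where answers changed.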

@dougiesquire
Collaborator Author

!test repro

✅ The Bitwise Reproducibility Check Succeeded ✅

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 9f772ef), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/9f772ef6e376c76ea4214da5a2fe836bf0c9827a, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37187627970.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13315120832/artifacts/2588122329.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

!test repro

✅ The Bitwise Reproducibility Check Succeeded ✅

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 586757c), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/586757cc0e443d0388f5a20516a285f10d22e992, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37198094002.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13318356254/artifacts/2589278831.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

!test repro

❌ The Bitwise Reproducibility Check Failed ❌

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 618844f), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/618844fb50d6953f16eb09f2c0859c9787fe22bc, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37204948671.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13320693848/artifacts/2589972299.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

!test repro

❌ The Bitwise Reproducibility Check Failed ❌

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 4afb68c), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/4afb68ca888c020e4bf490b92b1828a7a5c5ee84, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37205673809.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13320987417/artifacts/2590046482.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

…versions used in 0.4.0

This also includes updating ESMF from 8.5.0 to 8.7.0. Unfortunately, the new versions of CMEPS/CDEPS require the updated ESMF, and the old versions don't work with it. This makes it very difficult to test the ESMF update on its own.
@dougiesquire
Collaborator Author

!test repro

❌ The Bitwise Reproducibility Check Failed ❌

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit e075dad), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/e075dad51205bef7b5a7886f0cef5ea206d4bc1e, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37219985531.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13326066832/artifacts/2591456025.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

Summarising what this PR shows:

MOM

  • We can preserve answers across the MOM update with the right MOM parameter settings (see 9f772ef and repro test)

CESM-share

CICE

  • We do not preserve answers across the CICE6 update, even without using the new MOM supergrid functionality (see 618844f and repro test)
  • Using the new MOM supergrid functionality causes ACCESS-OM3 to crash within three hours due to velocity truncations in MOM (see 4afb68c and repro test)

CMEPS/CDEPS

  • Updating CMEPS and CDEPS requires also updating ESMF. We do not preserve answers across this update (see e075dad and repro test)

I'll open a separate issue/PR with suggestions for how to set the MOM6 parameters for the update to 0.4.0.

I'm a little worried that using the new MOM supergrid functionality in CICE causes MOM velocity truncations. Should we dig into this a little before we commit to using it in ACCESS-OM3? @anton-seaice, @chrisb13?
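The summary above amounts to testing one component update at a time and recording whether the repro test still passes. A minimal sketch of that bookkeeping, using the commits and outcomes reported in this PR; the data structure and the helper name `answer_changing` are illustrative assumptions and not part of the repro tooling.

```python
# (component update, commit, repro test passed?) as reported in this PR.
RESULTS = [
    ("MOM update with adjusted parameters", "9f772ef", True),
    ("CICE6 update, old grid", "618844f", False),
    ("CICE6 update, MOM supergrid", "4afb68c", False),
    ("CMEPS/CDEPS + ESMF 8.7.0", "e075dad", False),
]


def answer_changing(results):
    """Return the labels of updates whose repro test failed."""
    return [label for label, _commit, passed in results if not passed]


print(answer_changing(RESULTS))
```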

@aekiss
Contributor

aekiss commented Feb 16, 2025

Thanks for exploring this and laying it out so clearly.

Can we conclude that there's a problem with the supergrid implementation in CICE (grid_format = "mom_nc") that somehow produces fluxes that crash MOM?

Is it expected that the CICE and ESMF updates break reproducibility?

@dougiesquire
Collaborator Author

dougiesquire commented Feb 17, 2025

Can we conclude that there's a problem with the supergrid implementation in CICE (grid_format = "mom_nc") that somehow produces fluxes that crash MOM?

Possibly... I think it's unclear at this stage, but we are going to revert to the old grid while we investigate.

Is it expected that the CICE and ESMF updates break reproducibility?

I'll defer to @anton-seaice re CICE.

Regarding the ESMF update from 8.5.0 to 8.7.0, I'd say no. Looking at the changelog, there's only one reported bfb change between these two releases (in 8.6.0), and that should only be observed when not using strict floating point compiler options. We use -fp-model precise, so I think we should see bfb reproducibility. I guess that suggests that it's the changes to CMEPS/CDEPS that change answers. Unfortunately it's hard to test this explicitly since the 0.3.1 versions of CMEPS/CDEPS don't compile with ESMF 8.7.0, and the 0.4.0 versions don't compile with 8.5.0... I'll try building 0.3.1 with ESMF 8.6.0 and see if that tells us anything.
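As a reminder of why strict floating point options matter for bfb reproducibility: floating point addition is not associative, so a compiler that is allowed to regroup a sum can change the low-order bits of the result. A small illustration follows; Python is used only for convenience and is not part of the model build.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # grouped left to right
right = a + (b + c)  # same terms, different grouping

print(left == right)            # False
print(left.hex(), right.hex())  # the hex forms show the bit-level difference
```

The two results are numerically very close, but any difference at all is enough to change a checksum.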

@anton-seaice
Contributor

I'll defer to @anton-seaice re CICE.

It's not immediately clear why the answers should have changed.

ACCESS-NRI/CICE@12dd204...e68e05b

There are some updates in there which are not bit-for-bit, but none look like they should impact our configurations.

@anton-seaice
Contributor

I'm a little worried that using the new MOM supergrid functionality in CICE causes MOM velocity truncations. Should we dig into this a little before we commit to using it in ACCESS-OM3? @anton-seaice, @chrisb13?

Yes, let's drop the commit for now and make a new issue to investigate it.

@dougiesquire
Collaborator Author

!test repro

✅ The Bitwise Reproducibility Check Succeeded ✅

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 47ac794), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/47ac794bd038d514f4f10ed4f1ce551dde230fee, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37311687707.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13361350132/artifacts/2600532870.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

dougiesquire commented Feb 17, 2025

I'll try building 0.3.1 with ESMF 8.6.0 and see if that tells us anything.

This passed the repro test (see 47ac794 and repro test), which suggests that the answer changes arising from the CMEPS/CDEPS/ESMF updates in ACCESS-OM3 0.4.0 come from CMEPS/CDEPS rather than ESMF: this test shows that ESMF 8.5.0 to 8.6.0 preserves answers, and the ESMF changelog reports full bfb reproducibility between ESMF 8.6.0 and 8.7.0.

CMEPS changes: ESCOMP/CMEPS@ffb5737...959e9a0
CDEPS changes: ESCOMP/CDEPS@3c70fc8...8197f05

@dougiesquire
Collaborator Author

!test repro

✅ The Bitwise Reproducibility Check Succeeded ✅

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 2f52959), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/2f529596b6c6b3691f72dcc7198d77a8970c7683, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37680474084.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13487396308/artifacts/2638085455.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

!test repro

❌ The Bitwise Reproducibility Check Failed ❌

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 4c6aef7), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/4c6aef7ced9ca29edc35cf398cc152e5363f97d9, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37681364167.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13487779720/artifacts/2638168448.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

Closing as I think we've learnt what we wanted to about the loss of historical repro when we updated from 0.3.1 to 0.4.0.

@dougiesquire dougiesquire deleted the 266-025deg_jra55do_ryf branch March 12, 2025 09:42