Saving restart hangs forever on long 1deg runs #681

Open
jamesavery opened this issue Dec 12, 2024 · 7 comments

@jamesavery

On several 100+ year 1-degree runs (after a week of calculation), the calculation completes, starts writing the restart.h5 file, writes a 96-byte header, and then stalls forever (seemingly due to a deadlock). This has happened on multiple machines with different setups.

Attaching gdb to the running process shows hundreds of threads, most stuck in __futex_abstimed_wait_common64 (which never seems to time out; waiting for multiple days yields no progress). A few threads are stuck in a syscall, and a single thread is in epoll_wait (thread 408 below).

* 1    Thread 0x71fad4876300 (LWP 1750111) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x71f718a006c0 (LWP 1750847) "veros-run"      0x000071fad4698d61 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x71f7189ffb10, op=137, expected=0, futex_word=0x71f05c000ba0) at ./nptl/futex-internal.c:57
  3    Thread 0x71f71be006c0 (LWP 1750621) "veros-run"      0x000071fad4698d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x8882f58) at ./nptl/futex-internal.c:57
...
  132  Thread 0x71f8234006c0 (LWP 1750411) "veros-run"      0x000071fad4698d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x3ae0658) at ./nptl/futex-internal.c:57
  133  Thread 0x71f823e006c0 (LWP 1750410) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  134  Thread 0x71f82cc006c0 (LWP 1750409) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  135  Thread 0x71f82d6006c0 (LWP 1750408) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  136  Thread 0x71f82e0006c0 (LWP 1750407) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  137  Thread 0x71f82ea006c0 (LWP 1750406) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  138  Thread 0x71f82f4006c0 (LWP 1750405) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  139  Thread 0x71f82fe006c0 (LWP 1750404) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  140  Thread 0x71f838a006c0 (LWP 1750403) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  141  Thread 0x71f851a006c0 (LWP 1750402) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  142  Thread 0x71f8510006c0 (LWP 1750401) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  143  Thread 0x71f847e006c0 (LWP 1750400) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  144  Thread 0x71f8474006c0 (LWP 1750399) "veros-run"      syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  145  Thread 0x71f8394006c0 (LWP 1750398) "veros-run"      0x000071fad4698d61 in __futex_abstimed_wait_common64 (private=29176, cancel=true, abstime=0x71f8393ffe50, op=393, expected=0, futex_word=0x393bff0) at ./nptl/futex-internal.c:57
  146  Thread 0x71f839e006c0 (LWP 1750397) "veros-run"      0x000071fad4698d61 in __futex_abstimed_wait_common64 (private=29178, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x39445e0) at ./nptl/futex-internal.c:57
  147  Thread 0x71f83a8006c0 (LWP 1750396) "veros-run"      0x000071fad4698d61 in __futex_abstimed_wait_common64 (private=29178, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x3961650) at ./nptl/futex-internal.c:57
  148  Thread 0x71f83b2006c0 (LWP 1750395) "veros-run"      0x000071fad4698d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x396aa70) at ./nptl/futex-internal.c:57
  149  Thread 0x71f844c006c0 (LWP 1750394) "cuda-EvtHandlr" 0x000071fad471b4cd in __GI___poll (fds=0x71f5ec000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
  150  Thread 0x71f8456006c0 (LWP 1750393) "cuda-EvtHandlr" 0x000071fad471b4cd in __GI___poll (fds=0x71f5f0000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
  151  Thread 0x71f8460006c0 (LWP 1750392) "cuda-EvtHandlr" 0x000071fad471b4cd in __GI___poll (fds=0x71f5f4000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
  152  Thread 0x71f846a006c0 (LWP 1750391) "cuda-EvtHandlr" 0x000071fad471b4cd in __GI___poll (fds=0x71f5fc000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
  153  Thread 0x71f8524006c0 (LWP 1750386) "cuda-EvtHandlr" 0x000071fad471b4cd in __GI___poll (fds=0x3949420, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  154  Thread 0x71f853e006c0 (LWP 1750385) "veros-run"      0x000071fad4698d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x3917258) at ./nptl/futex-internal.c:57
...
  407  Thread 0x71fab82006c0 (LWP 1750130) "veros-run"      0x000071fad4698d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x71fabb1cade4 <thread_status+100>) at ./nptl/futex-internal.c:57
  408  Thread 0x71fad28006c0 (LWP 1750128) "veros-run"      0x000071fad472a042 in epoll_wait (epfd=11, events=0x1980490, maxevents=32, timeout=524517) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  409  Thread 0x71fad36006c0 (LWP 1750127) "veros-run"      0x000071fad471b4cd in __GI___poll (fds=0x71facc000b70, nfds=1, timeout=3600000) at ../sysdeps/unix/sysv/linux/poll.c:29
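
The gdb listing above shows only C-level frames, so it does not say which Python call each thread is sitting in. One way to get that view is Python's standard `faulthandler` module, which can dump the Python traceback of every thread when a signal arrives, even while the interpreter is blocked in native code. The following is a minimal sketch, not part of Veros: the helper name `install_stack_dump` and the default log path are made up for illustration, and it assumes a Unix system where `SIGUSR1` is not used for anything else.

```python
import faulthandler
import os
import signal
import tempfile


def install_stack_dump(signum=signal.SIGUSR1, path=None):
    """Dump the Python traceback of every thread to `path` when `signum` arrives.

    Returns the open file object. Keep a reference to it for the lifetime of
    the process, because faulthandler writes to its file descriptor directly.
    """
    if path is None:
        # Hypothetical default location; any writable path works.
        path = os.path.join(tempfile.gettempdir(), f"stacks-{os.getpid()}.log")
    log = open(path, "a")
    # faulthandler.register is only available on Unix-like systems.
    faulthandler.register(signum, file=log, all_threads=True, chain=False)
    return log
```

Calling `install_stack_dump()` once near the top of the setup script and later sending `kill -USR1 <pid>` to the hung process should append one traceback per Python thread to the log file. Threads created purely by native libraries have no Python frames and will not appear.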
@jamesavery
Author

jamesavery commented Dec 12, 2024

The setup can be found here: https://sid.erda.dk/sharelink/h1q6PQ4sMZ

It was run with veros-run --backend jax --device gpu --float-type float32 experiment.py, using veros version 1.5.1 with JAX_ENABLE_X64=1, python3.12, and jax/jaxlib version 0.4.34 with CUDA 12.2, on Ubuntu 24.04.1 LTS.

@dionhaefner
Collaborator

Is this something that is specific to long running setups? What happens if you write restart files periodically instead (say every few hours)?

@dionhaefner
Collaborator

Also, this is using a single device right? (No MPI involved)

@jamesavery
Author

jamesavery commented Dec 12, 2024

> Also, this is using a single device right? (No MPI involved)

Correct, no MPI involved.

> Is this something that is specific to long running setups? What happens if you write restart files periodically instead (say every few hours)?

Running that experiment now (restart write was switched off until calculation end due to disk quota limitations on LUMI).

I'll update once there is a result on the periodic write test.

@dionhaefner
Collaborator

Alright. In the absence of a reproducer this will be difficult to debug. It could be fixable by explicitly copying more data to the CPU before handing off to other libraries with C extensions (like h5netcdf), but it may also be a JAX bug or a race condition somewhere that we don't have a handle on.
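
To make the "copy to CPU first" idea concrete, here is a rough sketch of what such a hand-off could look like. It is not the Veros restart code: the names `to_host` and `snapshot_state`, and the assumption that the model state is available as a plain `{name: array}` mapping, are purely illustrative. JAX is imported optionally so the snippet also runs on a NumPy-only install.

```python
import numpy as np

try:  # JAX is optional here so the sketch also runs without it
    import jax
except ImportError:
    jax = None


def to_host(value):
    """Return an independent, C-contiguous NumPy copy of `value` in host memory."""
    if jax is not None and isinstance(value, jax.Array):
        # device_get waits for the device computation to finish and
        # transfers the result to host memory.
        value = jax.device_get(value)
    # Force a fresh copy so the writer never sees a buffer owned by JAX.
    return np.array(value, copy=True, order="C")


def snapshot_state(state):
    """Copy every array in a hypothetical {name: array} mapping to the host."""
    return {name: to_host(value) for name, value in state.items()}
```

The resulting dictionary holds only plain NumPy arrays, so whatever library writes the file afterwards does not touch device buffers. Whether this actually avoids the hang is untested; it only removes one possible interaction.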

A pragmatic solution could be to use veros resubmit, also adding a callback script that periodically cleans up old restart files.
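
A cleanup callback of the kind suggested above could be as small as the following sketch. The function name `prune_restarts`, the `*.restart.h5` file pattern, and the choice to keep the two newest files are assumptions for illustration; adjust them to the actual restart file names and disk quota.

```python
from pathlib import Path


def prune_restarts(directory, pattern="*.restart.h5", keep=2):
    """Delete all but the `keep` most recently modified files matching `pattern`.

    Returns the list of deleted paths.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1, or the latest restart is lost")
    files = sorted(
        Path(directory).glob(pattern),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    removed = []
    for old in files[keep:]:
        old.unlink()
        removed.append(old)
    return removed
```

Keeping at least two files is a deliberately cautious default: if the newest restart turns out to be truncated (as with the 96-byte file described in this issue), the previous one is still available.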

@jamesavery
Author

jamesavery commented Dec 23, 2024

Just finished running two 200-year experiments that write out restart.h5 every 50 years. All intermediate restart writes succeed, but the run hangs on the final restart write after the simulation completes. It seems reproducible, since it happened in every case I tried.

Breaking the long job into short jobs of 50 years using veros-resubmit does function as a workaround.

@dionhaefner
Collaborator

Oh so 50 year restarts work? That's really weird... Unfortunately seems almost impossible to debug :/
