Saving restart hangs forever on long 1deg runs #681
The setup can be found here: https://sid.erda.dk/sharelink/h1q6PQ4sMZ
Is this specific to long-running setups? What happens if you write restart files periodically instead (say, every few hours)?
Also, this is using a single device, right? (No MPI involved.)
Correct, no MPI involved.
Running that experiment now (restart writes were switched off until the end of the calculation due to disk quota limitations on LUMI). I'll update once there is a result from the periodic-write test.
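For concreteness, a minimal sketch of the periodic-write configuration being tested, assuming Veros' `restart_frequency` setting (in seconds) controls intermediate restart output; the setup class name is illustrative:

```python
from veros import VerosSetup, veros_routine


class LongRestartTestSetup(VerosSetup):  # illustrative name, not the actual 1deg setup
    @veros_routine
    def set_parameter(self, state):
        settings = state.settings
        # ... grid, time step, and physics configuration as in the 1deg setup ...

        # Write an intermediate restart file every 50 model years, so a hang at
        # the final write costs at most 50 years of simulation.
        # (assumes a 360-day model year, as in the global 1deg setup)
        settings.restart_frequency = 50 * 360 * 86400
```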
Alright. In the absence of a reproducer this will be difficult to debug. It could be fixable by explicitly copying more data to CPU before handing it off to other libraries with C extensions. A pragmatic solution could be to use veros-resubmit.
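To illustrate the copy-to-CPU idea, a minimal sketch (not Veros' actual restart writer): force device arrays onto the host as plain NumPy buffers before any h5py call, so the C extension never touches GPU-backed memory. `to_host` and `write_restart` are hypothetical helpers.

```python
import h5py
import numpy as np

try:
    import jax
except ImportError:
    jax = None


def to_host(arr):
    """Return a plain NumPy array, materializing device memory if needed."""
    if jax is not None and isinstance(arr, jax.Array):
        arr = jax.device_get(arr)  # blocks until the device buffer is ready
    return np.ascontiguousarray(arr)


def write_restart(path, variables):
    """Write a dict of arrays to an HDF5 restart file, host-side only."""
    with h5py.File(path, "w") as f:
        for name, value in variables.items():
            f.create_dataset(name, data=to_host(value))
```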
Just finished running two 200-year experiments that write out restart.h5 every 50 years. All intermediate restarts succeed, but the run still hangs on the restart write after completing the simulation. It seems reproducible, since it happens in every case I tried. Breaking the long job into short jobs of 50 years using veros-resubmit does work as a workaround.
Oh, so 50-year restarts work? That's really weird... Unfortunately this seems almost impossible to debug :/
On several 100+ year 1-degree runs (after a week of computation), the calculation completes, starts writing the restart.h5 file, writes a 96-byte header, and then stalls forever (seemingly due to a deadlock). This has happened on multiple machines with different setups.
Attaching gdb to the running process shows hundreds of threads, most stuck in `__futex_abstimed_wait_common64` (which never seems to time out, as waiting for multiple days does not yield progress). A few threads are stuck in a syscall, and a single thread is in `epoll_wait` (thread 408 below).
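As a Python-level complement to the gdb thread dump, a small sketch using the standard-library `faulthandler` module: adding this at the top of the setup script makes the interpreter print every Python thread's traceback when the process receives SIGUSR1 (`kill -USR1 <pid>`), which shows which Python frame the hung write sits in.

```python
import faulthandler
import signal

# Dump all Python thread tracebacks to stderr on SIGUSR1; the process keeps running.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Optional watchdog: also dump tracebacks every hour (repeat=True), so a hang
# is visible in the logs even when nobody is attached to the process.
faulthandler.dump_traceback_later(3600, repeat=True)
```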