Description
Hi @dbeltrankyl @kinow, I'm having an issue with Autosubmit. I looked into it briefly with @dbeltrankyl offline, but it doesn't seem to be any of the usual mistakes.
When running an experiment on HPC2020 (ECMWF) with ecaccess as the connection mechanism, Autosubmit stalls without any clear error. It usually transfers some of the files (templates and additional files) and then stops making progress. The stall probably happens during the ecaccess commands: Autosubmit might be receiving output from ecaccess that it does not handle properly, so the run hangs without dying (both the autosubmit process and the ecaccess process are still alive on the machine). I have not been able to reproduce the issue by running the ecaccess commands manually. It might also be caused by too many simultaneous connections, in which case using mput instead of put could help, but I have not confirmed this.
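One way to tell a hung transfer apart from a slow one is to run the same command under a hard timeout, with stdin closed and both output streams captured. The sketch below is only an illustration of that idea, not how Autosubmit invokes ecaccess: `run_with_timeout` is a hypothetical helper, and the demo uses a sleeping Python child as a stand-in for the transfer command.

```python
import subprocess
import sys


def _as_text(data):
    """Normalise captured output, which may be None or bytes after a timeout."""
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, stdout, stderr).

    status is "ok", "exit <code>", or "timeout" if cmd did not finish in time.
    """
    try:
        done = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,  # the child can never block on interactive input
            capture_output=True,       # keep whatever the command printed
            text=True,
            timeout=timeout_s,         # the child is killed once this expires
        )
    except subprocess.TimeoutExpired as exc:
        return "timeout", _as_text(exc.stdout), _as_text(exc.stderr)
    status = "ok" if done.returncode == 0 else f"exit {done.returncode}"
    return status, done.stdout, done.stderr


if __name__ == "__main__":
    # Stand-in for a stalled transfer: a child that sleeps longer than the timeout.
    stalled = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(stalled, timeout_s=1))
```

To use it against the real case, `cmd` would be replaced by the exact ecaccess command line that Autosubmit runs (it appears truncated as `ecaccess-file-p` in the ps output below), with a timeout well above the normal transfer time. A "timeout" status with non-empty captured output would support the idea that ecaccess prints something and then waits.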
- Short log:
9 of 9 jobs remaining (09:06)
No jobs to check for platform ECMWF-HPC2020
File /esarchive/autosubmit/t0lr/proj/auto-ecearth4/templates/synchronize.tmpl.sh exists
Calculating possible ready jobs for ECMWF-HPC2020
Section:SYNCHRONIZE can submit 20 jobs at this time
[WARNING] Script t0lr_SYNCHRONIZE has some empty variables. An empty value has substituted these variables
- ASLOG last lines:
9 of 9 jobs remaining (09:06)
2025-10-21 09:06:55,746 Sleep: 10
2025-10-21 09:06:55,746 Number of retrials: 0
2025-10-21 09:06:55,747 Checking jobs for platform=ECMWF-HPC2020
2025-10-21 09:06:55,748 No jobs to check for platform ECMWF-HPC2020
2025-10-21 09:06:55,750 Updating FAILED jobs
2025-10-21 09:06:55,818 Updating WAITING jobs
2025-10-21 09:06:55,819 Update finished
2025-10-21 09:06:55,826 Saving JobList: /esarchive/autosubmit/t0lr/pkl/job_list_t0lr.pkl.tmp
2025-10-21 09:06:55,854 JobList saved in /esarchive/autosubmit/t0lr/pkl/job_list_t0lr.pkl
2025-10-21 09:06:55,857 File /esarchive/autosubmit/t0lr/proj/auto-ecearth4/templates/synchronize.tmpl.sh exists
2025-10-21 09:06:55,859 Number of jobs available: 20
2025-10-21 09:06:55,860 Number of jobs ready: 1
2025-10-21 09:06:55,861 Jobs ready for ECMWF-HPC2020: 1
2025-10-21 09:06:55,862 Calculating possible ready jobs for ECMWF-HPC2020
2025-10-21 09:06:55,878 Section:SYNCHRONIZE can submit 20 jobs at this time
2025-10-21 09:06:55,883 Saving JobList: /esarchive/autosubmit/t0lr/pkl/job_list_t0lr.pkl.tmp
2025-10-21 09:06:55,904 JobList saved in /esarchive/autosubmit/t0lr/pkl/job_list_t0lr.pkl
2025-10-21 09:06:55,906
Jobs ready for ECMWF-HPC2020: 1
[WARNING] 2025-10-21 09:06:55,935 Script t0lr_SYNCHRONIZE has some empty variables. An empty value has substituted these variables
2025-10-21 09:06:55,991 Creating Scripts
2025-10-21 09:06:56,172 Sending Files
- ps output:
[eferre1@bscesautosubmit03 ~]$ nohup autosubmit run t0lr >> ~/mn5_logs.txt &
[3] 330680
[eferre1@bscesautosubmit03 ~]$ ps
330680 pts/0 00:00:02 autosubmit
330688 pts/0 00:00:00 python3.12
330689 pts/0 00:00:02 autosubmit log
330697 pts/0 00:00:00 ecaccess-file-p
330797 pts/0 00:00:00 ps
- Autosubmit version: 4.1.15-conda
- Machine/VM/environment name (if applicable): bscesautosubmithub03 and 04
- Experiment ID (if applicable): t0lr
- Experiment tasks & log path (if applicable):
- /esarchive/autosubmit/t0lr/tmp/ASLOGS/20251021_114029_run.log (and others in this path)
- pre-SYNCHRONIZE (the template and additional files are partially transferred to the HPC, but the job is never submitted)
Reproducible Example
In theory, running a copy of experiment t0lr, with the templates replaced by dummy scripts that only sleep, should already reproduce the behaviour, since the stall happens before the first job is even submitted.
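The "too many simultaneous connections" hypothesis from the description can also be probed outside Autosubmit by launching several copies of one command at the same time and recording which ones finish. This is a minimal sketch under assumptions: `probe_concurrency` is a hypothetical helper, the demo command is a trivial Python child rather than an ecaccess call, and nothing here reflects how many transfers Autosubmit actually opens in parallel.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def probe_concurrency(cmd, n_parallel, timeout_s):
    """Start n_parallel copies of cmd together; return [(status, seconds), ...]."""

    def run_one(_index):
        start = time.monotonic()
        try:
            returncode = subprocess.run(
                cmd,
                stdin=subprocess.DEVNULL,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,  # a copy that hangs is killed and reported
            ).returncode
            status = "ok" if returncode == 0 else f"exit {returncode}"
        except subprocess.TimeoutExpired:
            status = "timeout"
        return status, time.monotonic() - start

    # One worker thread per copy, so all copies are in flight at the same time.
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        return list(pool.map(run_one, range(n_parallel)))


if __name__ == "__main__":
    # Placeholder command; swap in the real transfer command to test the hypothesis.
    demo_cmd = [sys.executable, "-c", "pass"]
    for status, seconds in probe_concurrency(demo_cmd, n_parallel=4, timeout_s=30):
        print(f"{status} after {seconds:.2f}s")
```

If single transfers always succeed but some copies time out once `n_parallel` grows, that would point towards a connection limit rather than an output-handling problem.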
Expected Behaviour
The files should be transferred completely, and the job should then be submitted to the Slurm scheduler on the HPC.