Description
This happened in a testing suite for DestinE. The experiment was running fine, but apparently there was some network glitch and when Autosubmit tried to submit jobs, paramiko raised an exception, which caused a critical error.
Perhaps we can improve this by retrying on paramiko/network errors like this one. That would help operators and developers when the network becomes unstable due to traffic or maintenance.
From chat with @dbeltrankyl on the DestinE workflow repo (weekly 28):
Is there a way to tell Autosubmit to try to send files N times with X interval? i.e. try to send it now and if it fails, sleep X seconds, then keep trying?
It should do something similar to that.
It raises either an AutosubmitError or an AutosubmitCritical. A critical is, by definition, an unrecoverable error.
If it raises an error, like in the log, it should test the connections to all platforms and try again later (once all platforms are working again).
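The "try N times with X interval" behaviour asked about above could look roughly like the following. This is only a minimal sketch and not Autosubmit's actual code: `retry`, `RecoverableError`, and the `attempts`/`interval` parameters are hypothetical names introduced for illustration, and `RecoverableError` merely stands in for a recoverable error such as AutosubmitError.

```python
import time


class RecoverableError(Exception):
    """Hypothetical stand-in for a recoverable error such as AutosubmitError."""


def retry(action, attempts=3, interval=5.0, sleep=time.sleep):
    """Call action() up to `attempts` times, sleeping `interval` seconds between tries.

    Only network-like failures are retried; any other exception propagates
    immediately. The last failure is re-raised once the attempts are used up.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except (RecoverableError, ConnectionError) as exc:
            # ConnectionResetError (errno 104 in the log) is a ConnectionError.
            last_error = exc
            if attempt < attempts:
                sleep(interval)
    raise last_error
```

A caller would wrap the file transfer, for example `retry(lambda: send(script), attempts=5, interval=30)`, where `send` is whatever function performs the transfer.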
Can you show me the complete log in the submission?
I think it's this one you want, right:
Calculating possible ready jobs for local
Calculating possible ready jobs for LUMI-LOGIN
Section:DN can submit 1 jobs at this time
ERROR:paramiko.transport:Socket exception: Connection reset by peer (104)
Traceback (most recent call last):
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 1040, in send_command
stdin, stdout, stderr = self.exec_command(command, x11=x11)
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 969, in exec_command
chan = self.transport.open_session()
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/paramiko-3.4.0-py3.9.egg/paramiko/transport.py", line 959, in open_session
return self.open_channel(
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/paramiko-3.4.0-py3.9.egg/paramiko/transport.py", line 1090, in open_channel
raise e
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/paramiko-3.4.0-py3.9.egg/paramiko/transport.py", line 2159, in run
ptype, m = self.packetizer.read_message()
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/paramiko-3.4.0-py3.9.egg/paramiko/packet.py", line 463, in read_message
header = self.read_all(self.__block_size_in, check_rekey=True)
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/paramiko-3.4.0-py3.9.egg/paramiko/packet.py", line 308, in read_all
x = self.__socket.recv(n)
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 1417, in check_remote_log_dir
if self.send_command(self.get_mkdir_cmd()):
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 1131, in send_command
raise AutosubmitError(str(e),6016)
log.log.AutosubmitError:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/job/job_packages.py", line 203, in submit
self._send_files()
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/job/job_packages.py", line 240, in _send_files
self.platform.send_file(self._job_scripts[job.name])
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 378, in send_file
self.check_remote_log_dir()
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 1424, in check_remote_log_dir
raise AutosubmitError("Couldn't send the file {0} to HPC {1}".format(
log.log.AutosubmitError:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/platform.py", line 308, in submit_ready_jobs
package.submit(as_conf, job_list.parameters, inspect, hold=hold)
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/job/job_packages.py", line 209, in submit
raise AutosubmitCritical("Error while submitting jobs: {0}".format(e), 7013)
log.log.AutosubmitCritical:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/autosubmit.py", line 2229, in run_experiment
Autosubmit.submit_ready_jobs(as_conf, job_list, platforms_to_test, packages_persistence, hold=False)
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/autosubmit.py", line 2516, in submit_ready_jobs
save_1, failed_packages, error_message, valid_packages_to_submit, any_job_submitted = platform.submit_ready_jobs(as_conf,
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/platform.py", line 346, in submit_ready_jobs
raise AutosubmitCritical(e.message, e.code, e.trace)
log.log.AutosubmitCritical:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/appl/AS/4.1.9-final-beta/lib/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/EGG-INFO/scripts/autosubmit", line 59, in main
return_value = Autosubmit.parse_args()
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/autosubmit.py", line 707, in parse_args
return Autosubmit.run_experiment(args.expid, args.notransitive,args.start_time,args.start_after, args.run_only_members, args.profile)
File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/autosubmit.py", line 2340, in run_experiment
raise AutosubmitCritical(e.message, e.code, e.trace)
log.log.AutosubmitCritical:
[CRITICAL] Error while submitting jobs: [eCode=7013]
More info at https://autosubmit.readthedocs.io/en/master/troubleshooting/error-codes.html
(END)
This was logged right after other jobs succeeded, and I think the connection to submit new jobs was not successful, but it didn't try to submit again... I had a look at job_packages but I couldn't find a place with retrials for job submission if a network error occurs...
I think it's this one you want, right:
Yes, thanks
This was logged right after other jobs succeeded, and I think the connection to submit new jobs was not successful, but it didn't try to submit again... I had a look at job_packages but I couldn't find a place with retrials for job submission if a network error occurs...
The problem is that this error is not handled as a "network error". I don't recall having seen this error before.
When there is an error (as opposed to a critical), which is not the case here, it triggers the code around line 2263 of autosubmit.py (inside def run_experiment()).
Among other things, there is a call to Autosubmit.restore_platforms(platforms_to_test, mail_notify=mail_notify, as_conf=as_conf, expid=expid).
Once the connection is restored, autosubmit run returns to the main loop and continues the experiment from where it crashed.
However, if it is an AutosubmitCritical, Autosubmit stops.
So what is missing is handling for this error, so that the recovery routine can be triggered.
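One way to picture the missing handling: before escalating to a critical, inspect the exception chain (the "During handling of the above exception..." links in the log) and check whether a connection failure sits at its root. The sketch below only illustrates that idea. `SketchError`, `SketchCritical`, `caused_by_network`, and `submit_with_classification` are hypothetical names, not part of Autosubmit, and the two exception classes merely stand in for AutosubmitError and AutosubmitCritical.

```python
class SketchError(Exception):
    """Hypothetical stand-in for the recoverable AutosubmitError."""


class SketchCritical(Exception):
    """Hypothetical stand-in for the unrecoverable AutosubmitCritical."""


def caused_by_network(exc):
    """Return True if exc, or anything in its cause/context chain, is a network failure."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, (ConnectionError, TimeoutError)):
            return True
        seen.add(id(exc))
        # Explicit chaining ("raise ... from ...") sets __cause__; an exception
        # raised while handling another one sets __context__.
        exc = exc.__cause__ or exc.__context__
    return False


def submit_with_classification(submit):
    """Run submit(); re-raise network-rooted failures as recoverable errors."""
    try:
        return submit()
    except Exception as exc:
        if caused_by_network(exc):
            # Recoverable: the caller can restore the platforms and try again.
            raise SketchError("network failure during submission") from exc
        raise SketchCritical("error while submitting jobs") from exc
```

With a classification like this, the ConnectionResetError from the log would surface as a recoverable error, which is the kind that leads to the restore_platforms path described above.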
The submission process is a bit delicate, though. If there is a failure while checking the status of the logs, there is no issue: that command also has retries, and when the retries are exhausted it raises the Autosubmit error that triggers a new connection.
But the submission could "fail" and still leave something submitted (which has happened before), so we have to be careful with "ghost" jobs.
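A possible guard against such "ghost" jobs is to ask the scheduler whether the job already exists before resubmitting after an ambiguous failure. The following is a sketch under assumptions, not how Autosubmit does it: `submit_once` is a hypothetical helper, and `submit` and `find_job_id` are hypothetical callables supplied by the caller (the latter would query the remote scheduler by job name and return None when nothing is found).

```python
def submit_once(job_name, submit, find_job_id):
    """Submit job_name, avoiding a duplicate if a failed attempt actually went through.

    submit(job_name) returns the scheduler job id.
    find_job_id(job_name) returns an existing job id, or None if there is none.
    """
    try:
        return submit(job_name)
    except ConnectionError:
        # The connection dropped mid-submission, so the outcome is unknown.
        # Look the job up by name; if the lookup also fails, that exception
        # propagates and the caller can retry later.
        existing = find_job_id(job_name)
        if existing is not None:
            return existing  # the job was submitted after all: adopt it
        raise  # nothing was submitted: safe for the caller to retry
```

The design choice here is to treat a connection error during submission as "unknown" rather than "failed", and to resolve the ambiguity before any retry.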