[enhancement] retry job submission on network errors #1369

@kinow

This happened in a testing suite for DestinE. The experiment was running fine, but apparently there was some network glitch and when Autosubmit tried to submit jobs, paramiko raised an exception, which caused a critical error.

Perhaps we can improve this by retrying on paramiko/network errors like this one. That would help operators/devs when the network becomes unstable due to traffic or maintenance.
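
As a rough illustration of the idea, here is a minimal retry sketch. It is not Autosubmit code: `retry_on_network_error`, `TRANSIENT_ERRORS`, and the default values are hypothetical names and numbers chosen for the example, and the set of exceptions treated as transient is an assumption (`paramiko.SSHException` might also belong there). It is best suited to idempotent operations such as sending a file.

```python
import time

# Exceptions treated as transient network failures in this sketch (an assumption).
# ConnectionResetError, the error seen in the log, is a subclass of ConnectionError.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def retry_on_network_error(func, retries=3, interval=5.0, sleep=time.sleep):
    """Call func(); on a transient error, wait `interval` seconds and try again.

    `retries` is the number of extra attempts after the first one.
    The last error is re-raised if every attempt fails.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except TRANSIENT_ERRORS:
            if attempt == retries:
                raise  # out of attempts: let the caller decide what to do
            sleep(interval)
```

A call site could then look like `retry_on_network_error(lambda: platform.send_file(script), retries=5, interval=30)`, where `platform.send_file` is the method that appears in the traceback and the numbers are placeholders.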


From chat with @dbeltrankyl on the DestinE workflow repo (weekly 28):

@dbeltrankyl

Is there a way to tell Autosubmit to try to send files N times with X interval? i.e. try to send it now and if it fails, sleep X seconds, then keep trying?

It should do something similar to that.

It raises an AutosubmitError or an AutosubmitCritical. A critical just means that the error is unrecoverable.

If it raises an error, like in the log, it should test the connections to all platforms and try again later (once all platforms are working again).

Can you show me the complete log in the submission?


@kinow

> Can you show me the complete log in the submission?

I think it's this one you want, right:

Calculating possible ready jobs for local
Calculating possible ready jobs for LUMI-LOGIN
Section:DN can submit 1 jobs at this time
ERROR:paramiko.transport:Socket exception: Connection reset by peer (104)
Traceback (most recent call last):
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 1040, in send_command
    stdin, stdout, stderr = self.exec_command(command, x11=x11)
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 969, in exec_command
    chan = self.transport.open_session()
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/paramiko-3.4.0-py3.9.egg/paramiko/transport.py", line 959, in open_session
    return self.open_channel(
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/paramiko-3.4.0-py3.9.egg/paramiko/transport.py", line 1090, in open_channel
    raise e
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/paramiko-3.4.0-py3.9.egg/paramiko/transport.py", line 2159, in run
    ptype, m = self.packetizer.read_message()
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/paramiko-3.4.0-py3.9.egg/paramiko/packet.py", line 463, in read_message
    header = self.read_all(self.__block_size_in, check_rekey=True)
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/paramiko-3.4.0-py3.9.egg/paramiko/packet.py", line 308, in read_all
    x = self.__socket.recv(n)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 1417, in check_remote_log_dir
    if self.send_command(self.get_mkdir_cmd()):
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 1131, in send_command
    raise AutosubmitError(str(e),6016)
log.log.AutosubmitError:  

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/job/job_packages.py", line 203, in submit
    self._send_files()
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/job/job_packages.py", line 240, in _send_files
    self.platform.send_file(self._job_scripts[job.name])
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 378, in send_file
    self.check_remote_log_dir()
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/paramiko_platform.py", line 1424, in check_remote_log_dir
    raise AutosubmitError("Couldn't send the file {0} to HPC {1}".format(
log.log.AutosubmitError:  

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/platform.py", line 308, in submit_ready_jobs
    package.submit(as_conf, job_list.parameters, inspect, hold=hold)
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/job/job_packages.py", line 209, in submit
    raise AutosubmitCritical("Error while submitting jobs: {0}".format(e), 7013)
log.log.AutosubmitCritical:  

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/autosubmit.py", line 2229, in run_experiment
    Autosubmit.submit_ready_jobs(as_conf, job_list, platforms_to_test, packages_persistence, hold=False)
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/autosubmit.py", line 2516, in submit_ready_jobs
    save_1, failed_packages, error_message, valid_packages_to_submit, any_job_submitted = platform.submit_ready_jobs(as_conf,
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/platforms/platform.py", line 346, in submit_ready_jobs
    raise AutosubmitCritical(e.message, e.code, e.trace)
log.log.AutosubmitCritical:  

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/appl/AS/4.1.9-final-beta/lib/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/EGG-INFO/scripts/autosubmit", line 59, in main
    return_value = Autosubmit.parse_args()
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/autosubmit.py", line 707, in parse_args
    return Autosubmit.run_experiment(args.expid, args.notransitive,args.start_time,args.start_after, args.run_only_members, args.profile)
  File "/appl/AS/4.1.9-final-beta/lib64/python3.9/site-packages/autosubmit-4.1.9-py3.9.egg/autosubmit/autosubmit.py", line 2340, in run_experiment
    raise AutosubmitCritical(e.message, e.code, e.trace)
log.log.AutosubmitCritical:  

[CRITICAL] Error while submitting jobs:   [eCode=7013]
More info at https://autosubmit.readthedocs.io/en/master/troubleshooting/error-codes.html

This was logged right after other jobs succeeded. I think the connection used to submit new jobs failed, but Autosubmit didn't try to submit again... I had a look at job_packages but couldn't find any place that retries job submission when a network error occurs...


@dbeltrankyl

> I think it's this one you want, right:

Yes, thanks

> This was logged right after other jobs succeeded, and I think the connection to submit new jobs was not successful, but it didn't try to submit again... I had a look at job_packages but I couldn't find a place with retrials for job submission if a network error occurs...

The problem is that this error is not handled as a "network error". I don't recall seeing this error before.

When an AutosubmitError is raised, which is not the case here, it triggers the code at autosubmit.py line 2263 (inside def run_experiment()).

Among other things, there is a call to Autosubmit.restore_platforms(platforms_to_test, mail_notify=mail_notify, as_conf=as_conf, expid=expid).

Once the connection is restored, autosubmit run will return to the main loop and continue the experiment from where it crashed.

However, if it is an AutosubmitCritical, Autosubmit will stop.

So what is missing is handling for this error, so that the recovery routine can be triggered.
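
In the log above, the AutosubmitError raised in send_command is wrapped into an AutosubmitCritical by job_packages.submit, which is why the run stops. A minimal sketch of the alternative, assuming simplified stand-ins rather than the real classes: `RecoverableError` plays the role of AutosubmitError, `FatalError` the role of AutosubmitCritical, and `submit_package`, `run_iteration`, and `restore_platforms` are hypothetical callables for illustration only.

```python
class RecoverableError(Exception):
    """Stand-in for AutosubmitError: the run loop may reconnect and continue."""


class FatalError(Exception):
    """Stand-in for AutosubmitCritical: the run stops."""


def submit_package(submit):
    """Run `submit`, keeping network failures recoverable."""
    try:
        submit()
    except RecoverableError:
        raise  # already recoverable: do not wrap it as fatal
    except ConnectionError as e:
        # Network glitch (e.g. ConnectionResetError): mark it as recoverable.
        raise RecoverableError(f"network error during submission: {e}") from e
    except Exception as e:
        # Anything else is still treated as unrecoverable in this sketch.
        raise FatalError(f"Error while submitting jobs: {e}") from e


def run_iteration(submit, restore_platforms):
    """One pass of a simplified main loop. Returns True if submission succeeded."""
    try:
        submit_package(submit)
        return True
    except RecoverableError:
        restore_platforms()  # e.g. wait until every platform reconnects
        return False  # the caller loops again and resumes from here
```

Which exceptions count as recoverable is the real design decision here; the sketch only shows that the classification has to happen before the error reaches the code that stops the run.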

The submission process is a bit delicate, though. If a failure occurs while checking the status of the logs, there is no issue: that command already has retries, and once the retries are exhausted it raises the AutosubmitError so that a new connection is made.

But a submission can "fail" and still leave something submitted (this has happened before), so we have to be careful with "ghost" jobs.
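
One possible way to guard against ghost jobs is to ask the scheduler whether the job already exists before resubmitting. The sketch below assumes two hypothetical callables that are not part of Autosubmit: `submit(job_name)` returns a job id, and `list_queued()` returns a `{job_name: job_id}` mapping of jobs currently known to the remote scheduler. On a Slurm platform the lookup might be built on `squeue --name`, but that mapping is an assumption.

```python
def submit_once(job_name, submit, list_queued):
    """Submit `job_name`, resubmitting after a network error only if needed."""
    try:
        return submit(job_name)
    except ConnectionError:
        # The connection dropped mid-submission, so the job may or may not
        # exist remotely. Ask the scheduler before submitting again.
        queued = list_queued()  # may itself raise if the network is still down
        if job_name in queued:
            return queued[job_name]  # adopt the job that was already submitted
        return submit(job_name)
```

This relies on job names being unique within the queue; if they are not, the lookup needs a stronger key.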


Labels: bug, discussion


Status: In Progress
