Description
Courtesy of Tim Sackton:
I ran into another issue with progressiveCactus: a job that is issued with too little memory to finish gets stuck in a loop of being restarted, failing, and being restarted again, while its retryCount (apparently) never goes down. So basically we end up in a loop where one job is constantly failing.
Am I misunderstanding something? It seems like retryCount should decrement each time the job fails, but the log here (https://gist.github.com/tsackton/03b1605c4e29762376f2) shows that the failing job is reissued several times, and after the second and third failures the retry count is still at 5.
Is this a bug, or an error in my code or understanding? It could easily be the latter.
Regardless of how retries are supposed to work, I was able to get past that error by doubling the memory a job requests each time it fails (see here: https://github.com/harvardinformatics/jobTree/blob/master/src/master.py#L79). Ideally I'd also be able to increase the amount of time a job requests, since running out of time would be the other reason for consistent failures, but I don't see how to do that, or whether it is even possible.
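
To make the behaviour I expected concrete, here is a minimal standalone sketch of the retry logic: decrement the retry count on every failure, and double the memory request up to a cap. This is not jobTree's actual code or API. The `Job` class, the `prepare_retry` function, and the `max_memory` parameter are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a scheduled job (not a jobTree class)."""
    command: str
    memory: int        # memory requested, in bytes
    retries_left: int  # how many more times the job may be reissued


def prepare_retry(job: Job, max_memory: int) -> bool:
    """Update a failed job in place and report whether to reissue it.

    Returns False once the retry budget is used up, so a job that keeps
    failing cannot be restarted forever.
    """
    if job.retries_left <= 0:
        return False
    # Decrement on every failure, so the count really does go down.
    job.retries_left -= 1
    # Double the memory request, but never ask for more than the cap.
    job.memory = min(job.memory * 2, max_memory)
    return True
```

With this shape, a job that starts with `retries_left=5` is reissued at most five times, and each reissue asks for twice the memory of the previous attempt until `max_memory` is reached. The same pattern would apply to a time limit if the scheduler exposed one.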