job failures not correctly reducing retry count #34

@benedictpaten

Description

Courtesy of Tim Sackton:

I ran into another issue with progressiveCactus: a job issued with too little memory to finish gets restarted, fails, and gets restarted again, without the retryCount (apparently) going down at all. So basically we are stuck in a loop where one job is constantly failing.

Am I misunderstanding something? It seems like retryCount should decrement each time the job fails, but you can see from the log here (https://gist.github.com/tsackton/03b1605c4e29762376f2) that the failing job is reissued several times, yet after the second and third failures the retry count is still at 5.
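
For what it's worth, here is a minimal, self-contained sketch of the behaviour I expected; the names (`Job`, `remaining_retry_count`, `on_failure`) are illustrative, not jobTree's actual API:

```python
# Toy model of the bookkeeping I *expected*: the counter drops on every
# failure until the job is abandoned. Illustrative names, not jobTree's code.

class Job:
    def __init__(self, retry_count=5):
        self.remaining_retry_count = retry_count

def on_failure(job):
    """Decrement once per failure; reissue only while retries remain."""
    job.remaining_retry_count -= 1
    return job.remaining_retry_count > 0  # True -> reissue, False -> give up

job = Job(retry_count=5)
attempt = 0
while True:
    attempt += 1
    if on_failure(job):
        print(f"attempt {attempt} failed; reissuing with "
              f"{job.remaining_retry_count} retries left")
    else:
        print(f"attempt {attempt} failed; retries exhausted, giving up")
        break
```

Under this model the counter would read 4, 3, 2, 1 across successive failures, rather than staying at 5 as in the gist.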

Is this a bug or an error in my code/understanding? It could easily be the latter...

Regardless of whether this is how retries are supposed to work, I was able to get past the error by doubling the memory the retry gets each time a job fails (see here: https://github.com/harvardinformatics/jobTree/blob/master/src/master.py#L79; a sketch of the idea is below). Ideally I'd also be able to increase the amount of time a job requests, since that is the other likely reason for consistent failures, but I don't see how to do that, or even whether it is possible.
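
For reference, a minimal sketch of that workaround, assuming a job record with `memory` and `remaining_retry_count` fields and an arbitrary memory cap; these names are illustrative, and the real change is at the master.py link above:

```python
from dataclasses import dataclass

@dataclass
class JobSpec:
    memory: int                  # bytes requested from the batch system
    remaining_retry_count: int

MAX_MEMORY = 256 * 1024 ** 3     # assumed per-cluster cap (256 GiB)

def reissue_with_more_memory(job: JobSpec) -> bool:
    """On failure: decrement retries and double the memory request.

    Returns True if the job should be reissued, False if it should be
    marked as completely failed.
    """
    job.remaining_retry_count -= 1
    if job.remaining_retry_count <= 0:
        return False             # retries exhausted
    job.memory = min(job.memory * 2, MAX_MEMORY)  # double memory per retry
    return True

# Example: a job that starts at 8 GiB climbs to 16, 32, 64, 128 GiB
# across its retries before giving up.
job = JobSpec(memory=8 * 1024 ** 3, remaining_retry_count=5)
while reissue_with_more_memory(job):
    print(f"reissuing with {job.memory / 1024 ** 3:.0f} GiB, "
          f"{job.remaining_retry_count} retries left")
```

Doubling keeps the number of wasted retries logarithmic in the memory a job actually needs, at the cost of possibly over-allocating on the final attempt.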
