LSF job status unknown should probably be treated as HOLD not EXIT #5666

Open
mp15 opened this issue Jan 14, 2025 · 3 comments · May be fixed by #5756

Comments


mp15 commented Jan 14, 2025

The UNKNOWN status in LSF does not mean the job has actually died, only that the LSF daemons have lost contact with each other. A job may continue running in an unknown state for a long time and keep writing output via shared disks. It may also recover and terminate with exit code 0. I would suggest treating it as QueueStatus.HOLD rather than QueueStatus.ERROR.
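
For illustration, a minimal standalone Groovy sketch of the kind of mapping change being suggested (this is not Nextflow's actual LsfExecutor code; the status codes are LSF's bjobs STAT values, and the enum below is only a stand-in for Nextflow's QueueStatus):

// Standalone sketch, not Nextflow source: decode bjobs STAT values into
// abstract queue states, treating UNKWN as HOLD instead of ERROR.
enum QueueStatus { PENDING, RUNNING, HOLD, ERROR, DONE, UNKNOWN }

def decodeStatus = [
    'PEND' : QueueStatus.PENDING,
    'RUN'  : QueueStatus.RUNNING,
    'PSUSP': QueueStatus.HOLD,
    'USUSP': QueueStatus.HOLD,
    'SSUSP': QueueStatus.HOLD,
    'DONE' : QueueStatus.DONE,
    'EXIT' : QueueStatus.ERROR,
    'UNKWN': QueueStatus.HOLD,   // proposed: previously treated like EXIT (ERROR)
    'ZOMBI': QueueStatus.ERROR,
]

assert decodeStatus['UNKWN'] == QueueStatus.HOLD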


muffato commented Jan 28, 2025

To chime in on @mp15's comment: I ran into exactly the same problem.

Excerpt from the LSF log:

Fri Jan 24 12:21:47: Starting (Pid 2599435);
Fri Jan 24 12:25:08: Running with execution home <(...)>
Sat Jan 25 03:25:48: Unknown; unable to reach the execution host;
Sat Jan 25 03:45:54: Running;
Sat Jan 25 03:51:07: Unknown; unable to reach the execution host;
Sat Jan 25 03:51:13: Running;
Sat Jan 25 03:56:26: Unknown; unable to reach the execution host;
Sat Jan 25 03:56:47: Running;
Sat Jan 25 04:27:19: Done successfully. The CPU time used is 1189382.2 seconds;
Sat Jan 25 04:27:19: Post job process done successfully;

LSF had a hiccup but the job completed. All the output files are there, including .exitcode and .command.trace. Both .command.out and .command.log are there as well, and the latter contains the job summary stats:

Successfully completed.

Resource usage summary:

    CPU time :                                   1189248.38 sec.
    Max Memory :                                 136074 MB
    Average Memory :                             15940.29 MB
    Total Requested Memory :                     221184.00 MB
    Delta Memory :                               85110.00 MB
    Max Swap :                                   -
    Max Processes :                              130
    Max Threads :                                1408
    Run time :                                   57839 sec.
    Turnaround time :                            58249 sec.

However, the "Unknown; unable to reach the execution host" bit made the job status turn to UNKWN for a little while, and Nextflow then thought the job was having an error:

Jan-25 03:31:01.817 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 313223; id: 20; name: SANGERTOL_VARIANTCALLING:VARIANTCALLING:DEEPVARIANT_CALLER:DEEPVARIANT (qqMarAlbu1_unpublished.qqMarAlbu1.1.curated_primary.no_mt.unscrubbed.renamed_onlychrom.CHR01.1); status: RUNNING; exit: -; error: -; workDir: (...)/work/0c/cea94f1db6be1b8c3cebb921c5c51a started: 1737721230100; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 274976
Current queue status:
>   job: 264491: RUNNING
>   job: 313223: ERROR

Content of workDir: (...)/work/0c/cea94f1db6be1b8c3cebb921c5c51a
null
Jan-25 03:31:01.817 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 313223; id: 20; name: SANGERTOL_VARIANTCALLING:VARIANTCALLING:DEEPVARIANT_CALLER:DEEPVARIANT (qqMarAlbu1_unpublished.qqMarAlbu1.1.curated_primary.no_mt.unscrubbed.renamed_onlychrom.CHR01.1); status: COMPLETED; exit: -; error: -; workDir: (...)/work/0c/cea94f1db6be1b8c3cebb921c5c51a started: 1737721230100; exited: -; ]
Jan-25 03:31:01.848 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=SANGERTOL_VARIANTCALLING:VARIANTCALLING:DEEPVARIANT_CALLER:DEEPVARIANT (qqMarAlbu1_unpublished.qqMarAlbu1.1.curated_primary.no_mt.unscrubbed.renamed_onlychrom.CHR01.1); work-dir=(...)/work/0c/cea94f1db6be1b8c3cebb921c5c51a
  error [nextflow.exception.ProcessFailedException]: Process `SANGERTOL_VARIANTCALLING:VARIANTCALLING:DEEPVARIANT_CALLER:DEEPVARIANT (qqMarAlbu1_unpublished.qqMarAlbu1.1.curated_primary.no_mt.unscrubbed.renamed_onlychrom.CHR01.1)` terminated for an unknown reason -- Likely it has been terminated by the external system
Jan-25 03:31:02.193 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'SANGERTOL_VARIANTCALLING:VARIANTCALLING:DEEPVARIANT_CALLER:DEEPVARIANT (qqMarAlbu1_unpublished.qqMarAlbu1.1.curated_primary.no_mt.unscrubbed.renamed_onlychrom.CHR01.1)'

If UNKWN could be interpreted as something other than an error, Nextflow would have survived this, waited for the job to complete, and carried on with the rest of the pipeline.
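
A rough standalone Groovy sketch of the failure path suggested by that debug log (illustrative only; the real GridTaskHandler logic is more involved, and only the 270000 ms timeout and 274976 ms delta come from the log above):

// Sketch, not the actual GridTaskHandler code: once the grace period for
// reading the .exitcode file expires, the queue status decides whether the
// task is declared dead.
boolean shouldFailTask(boolean exitFileExists, long waitedMillis,
                       long graceMillis, String queueStatus) {
    if( exitFileExists )
        return false                 // the job wrote its exit code; all good
    if( waitedMillis < graceMillis )
        return false                 // keep polling within the grace period
    return queueStatus == 'ERROR'    // past the grace period, trust the queue
}

// With UNKWN decoded to ERROR the task fails (as in the log above);
// decoded to HOLD, Nextflow would have kept waiting.
assert  shouldFailTask(false, 274_976, 270_000, 'ERROR')
assert !shouldFailTask(false, 274_976, 270_000, 'HOLD')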

Best,
Matthieu

bentsherman linked a pull request on Feb 6, 2025 that will close this issue: #5756
bentsherman (Member) commented Feb 6, 2025

@mp15 @muffato I have created a PR based on your suggestion: #5756

Can either (or both) of you test this fix on your cluster? Instructions for building/testing locally are here

Actually, my main concern is not whether the PR works -- it's pretty straightforward -- but whether your suggestion is always appropriate. Can we be confident that the UNKWN status is always a temporary state? Because if it isn't, Nextflow will wait forever.
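
For what it's worth, one way to picture a middle ground (a purely hypothetical Groovy sketch, not taken from PR #5756; none of these names exist in Nextflow): map UNKWN to HOLD, but stop waiting after a configurable deadline so a host that never comes back still fails the task eventually.

// Hypothetical sketch only: bound how long a job may stay unreachable
// before it is treated as dead.
String effectiveStatus(String decoded, long unknownForMillis, long maxUnknownMillis) {
    if( decoded == 'HOLD' && unknownForMillis > maxUnknownMillis )
        return 'ERROR'               // unreachable for too long: give up
    return decoded
}

// e.g. allow one hour for the scheduler to regain contact with the host
assert effectiveStatus('HOLD',    60_000, 3_600_000) == 'HOLD'
assert effectiveStatus('HOLD', 7_200_000, 3_600_000) == 'ERROR'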


muffato commented Feb 6, 2025

Here is what I could find in the LSF documentation about the status itself:

UNKWN
mbatchd has lost contact with the sbatchd on the host on which the job runs.

and then in the section about the time summary of a job:

UNKWN
The total unknown time of the job (job status becomes unknown if the sbatchd daemon on the execution host is temporarily unreachable).

I think LSF will change the job status to UNKWN when it loses contact with the node, and the job will remain UNKWN until contact is re-established, either naturally or forcibly (e.g. a sysadmin rebooting the node). But that's more within @mp15's and his team's expertise.
