Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix (5794): Handle Azure error when using Fusion #5806

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

adamrtalbot
Copy link
Collaborator

@adamrtalbot adamrtalbot commented Feb 21, 2025

The error strategy was ignored when using Fusion on Azure, this PR fixes it by handling the error after reading the exitCode

I don't know if this is the right way to do it...but it works?

the error strategy was ignored when using Fusion on Azure, this PR fixes it by handling the failure with a new method

Signed-off-by: adamrtalbot <[email protected]>
Copy link

netlify bot commented Feb 21, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit d292860
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67c1a39c382e300008e7edba

@adamrtalbot adamrtalbot changed the title fix(5794): Handle Azure error when using Fusion fix (5794): Handle Azure error when using Fusion Feb 21, 2025
@adamrtalbot adamrtalbot linked an issue Feb 21, 2025 that may be closed by this pull request
@bentsherman
Copy link
Member

I wonder why it was using ProcessUnrecoverableException in the first place. I would think it should just be ProcessFailedException

Also, why would this issue only happen with Fusion?

@bentsherman
Copy link
Member

Here's how aws batch handles it:

boolean checkIfCompleted() {
assert jobId
if( !isRunning() )
return false
final job = describeJob(jobId)
final done = job?.status in ['SUCCEEDED', 'FAILED']
if( done ) {
// take the exit code of the container, if 0 (successful) or missing
// take the exit code from the `.exitcode` file create by nextflow
// the rationale of this is that, in case of error, the exit code return
// by the batch API is more reliable.
task.exitStatus = job.container.exitCode ?: readExitFile()
// finalize the task
task.stdout = outputFile
if( job?.status == 'FAILED' || task.exitStatus==Integer.MAX_VALUE ) {
final reason = errReason(job)
// retry all CannotPullContainer errors apart when it does not exist or cannot be accessed
final unrecoverable = reason.contains('CannotPullContainer') && reason.contains('unauthorized')
task.error = unrecoverable ? new ProcessUnrecoverableException(reason) : new ProcessException(reason)
task.stderr = executor.getJobOutputStream(jobId) ?: errorFile
}
else {
task.stderr = errorFile
}
status = TaskStatus.COMPLETED
return true
}
return false
}

@adamrtalbot
Copy link
Collaborator Author

Looks like an oversight on this PR: #2099, not a big deal.

We could refactor it to be closer to the AWS one?

@pditommaso
Copy link
Member

Doesn't look an oversight, was made on purpose by #2099

@adamrtalbot
Copy link
Collaborator Author

Either way - do we want to refactor at all or is this adequate?

@adamrtalbot
Copy link
Collaborator Author

This will also use the native Azure Batch SDK first then use the exitcode:

    @Override
    boolean checkIfCompleted() {
        assert taskKey
        if( !isRunning() )
            return false
        final done = taskState0(taskKey)==BatchTaskState.COMPLETED
        if( done ) {
            // finalize the task
            final info = batchService.getTask(taskKey).executionInfo
            task.exitStatus = info?.exitCode ?: readExitFile()
            task.stdout = outputFile
            task.stderr = errorFile
            status = TaskStatus.COMPLETED
            if (info.result == BatchTaskExecutionResult.FAILURE) {
                if (task.exitStatus != 0) {
                    // If the exit status is not 0, throw a process failed exception and Nextflow will handle it with errorStrategy
                    task.error = new ProcessFailedException("Task failed with exit code ${task.exitStatus}:\n${info.failureInfo.message}".toString())
                } else {
                    // Else use the existing error handling
                    task.error = new ProcessUnrecoverableException(info.failureInfo.message)
                }
            }
            deleteTask(taskKey, task)
            return true
        }
        return false
    }

@alberto-miranda
Copy link
Contributor

I looked a bit into this to ensure fusion wasn't misbehaving and fusion is correctly reporting the exit code of the process to Azure Batch, which is in turn communicating it to the AzBatchHandler. The problem seems to be in the treatment that AzBatchHandler does of the BatchTaskExecutionResult in checkIfCompleted():

final info = batchService.getTask(taskKey).executionInfo
if (info.result == BatchTaskExecutionResult.FAILURE)
task.error = new ProcessUnrecoverableException(info.failureInfo.message)

I added some extra logging statements to double check info.result with a minimal test and it turns out that despite the exit status being correctly propagated, without fusion Azure Batch reports success whereas with fusion it reports failure:

user@host:~ $ cat proc.nf
workflow {
main:
    EXITCODE_TEST()
}

process EXITCODE_TEST {
    """
    exit 42
    """
}

# nextflow-dev is just nextflow's `master` with some extra logs
user@host:~ $ nextflow-dev run -w az://fusion-develop/scratch/amiranda/bugs/exitcode
[...]
# logs without fusion
Feb-24 17:58:24.028 [Task monitor] DEBUG n.c.azure.batch.AzBatchTaskHandler - [AZURE BATCH] Task EXITCODE_TEST completed with exit status: 42 -- result=success
[...]
# logs with fusion
Feb-24 17:55:51.122 [Task monitor] DEBUG n.c.azure.batch.AzBatchTaskHandler - [AZURE BATCH] Task EXITCODE_TEST completed with exit status: 42 -- result=failure

This causes two differences (at least) between the fusion and non-fusion executions:

  1. task.error is always null in the non-fusion execution, which trips up the evaluation below in TaskProcessor:
    // -- verify task exit status
    if( task.error )
    throw new ProcessFailedException("Process `${safeTaskName(task)}` failed", task.error)
  2. As shown above, the fusion execution always throws a new ProcessUnrecoverableException() which is never thrown in the non-fusion execution flow.

@adamrtalbot's fix circumvents this divergence by making the error """not-unrecoverable""", but IMHO we should:

  1. Investigate why Azure Batch is reporting success for a task with a non-zero exit code.
  2. Ensure both execution flows are equivalent.

@adamrtalbot
Copy link
Collaborator Author

@bentsherman I refactored this to be closer to the AWS version, but didn't push because I'm not sure it's necessary:

@Override
boolean checkIfCompleted() {
    assert taskKey
    if( !isRunning() )
        return false
    final done = taskState0(taskKey)==BatchTaskState.COMPLETED
    if( done ) {
        // finalize the task
        final info = batchService.getTask(taskKey).executionInfo
        task.exitStatus = info?.exitCode ?: readExitFile()
        task.stdout = outputFile
        task.stderr = errorFile
        status = TaskStatus.COMPLETED
        if (info.result == BatchTaskExecutionResult.FAILURE || task.exitStatus==Integer.MAX_VALUE) {
            final String reason = info?.failureInfo?.message ?: "Unknown failure"
            if ( task.exitStatus && task.exitStatus != 0 ){
                // If the exit status is not 0, throw a process failed exception and Nextflow will handle it with errorStrategy
                task.error = new ProcessFailedException("Task failed with exit code ${task.exitStatus}:\n${reason}")
            } else {
                // Else use the existing error handling
                task.error = new ProcessUnrecoverableException(reason)
            }
        }
        deleteTask(taskKey, task)
        return true
    }
    return false
}

What do you think?

@jorgee
Copy link
Contributor

jorgee commented Feb 28, 2025

Following what @alberto-miranda suggested, the reason why Azure Batch Tasks are different when enabling fusion is because of the cmd command set in this part of the code

  • For fusion: /usr/bin/fusion bash /fusion/az/xxxx/.command.run

  • For non-fusion: sh -c 'bash .command.run 2>&1 | tee .command.log'

The | tee .command.log is preventing Azure Batch to get the real exit code. I have reproduced what is happening with a small script that is returning an exit code 42.

$ sh -c 'bash test.sh'
$ echo $?
42
$ sh -c 'bash test.sh | tee -a .command.log'
$ echo $?
0

So, without fusion the Azure Batch Task exit code is always 0 without fusion, but the failure is detected later in the code checking the Nextflow task exit code.

To make both executions equivalent, I think we should make the non-fusion cmd to get the real exist code with something like:

bash -o pipefail -c 'bash .command.run 2>&1 | tee .command.log'

I think @adamrtalbot solutions are fine to avoid exit code failures are recoverable but I do not have clear is why the rest of failures are Unrecoverable by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue when running Azure batch + Fusion
5 participants