
Issue when running Azure batch + Fusion #5794

Open
joaodemeirel opened this issue Feb 14, 2025 · 2 comments · May be fixed by #5806

Comments

@joaodemeirel

Bug report

When running Azure Batch with Fusion, task failures that should be ignored are not being ignored. See the linked Slack thread for context.

Expected behavior and actual behavior

The workflow should ignore the task failure and continue. However, with this configuration, a single task failure is not ignored and all tasks are terminated.
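For reference, this behaviour is controlled by Nextflow's errorStrategy process directive; a minimal sketch of the settings involved:

```groovy
process {
    // 'ignore' drops the failed task and lets the workflow continue;
    // 'retry' re-submits the failed task up to maxRetries times
    errorStrategy = 'retry'
    maxRetries    = 3
    // alternatively:
    // errorStrategy = 'ignore'
}
```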

Steps to reproduce the problem

Run nf-canary with --run TEST_IGNORED_FAIL and errorStrategy = 'retry' in the config. The full Nextflow config used is included in the next section (only the Azure Batch pool settings need to be changed):

Nextflow config

params {
   skip = ''
   gpu = false
   run = 'TEST_IGNORED_FAIL'
   outdir = null
   remoteFile = null
}

process {
   container = 'docker.io/library/ubuntu:23.10'
   errorStrategy = 'retry'
   when = { 
        ( params.run ? params.run.split(',').any{ "NF_CANARY:${it.toUpperCase()}".contains(task.process) } : true ) && 
        (!params.skip.split(',').any{ "NF_CANARY:${it.toUpperCase()}".contains(task.process) } ) 
    }
   scratch = false
   executor = 'azurebatch'
   queue = '{CHANGE HERE}'
}

timeline {
   enabled = true
   file = 'timeline-UF0v4fqsb1SUI.html'
}

wave {
   enabled = true
   endpoint = 'https://wave.seqera.io'
}

fusion {
   enabled = true
}

azure {
   storage {
      accountName = '{CHANGE HERE}'
   }
   batch {
      location = 'eastus'
      accountName = '{CHANGE HERE}'
      copyToolInstallMode = 'node'
      autoPoolMode = false
      allowPoolCreation = false
      pools {
         '{CHANGE HERE}' {
            vmType = 'standard_e16d_v5'
            vmCount = 0
         }
      }
   }
}

workDir = '{CHANGE HERE}'
runName = 'insane_euler'

tower {
   enabled = true
   endpoint = 'https://api.cloud.seqera.io'
}

cloudcache {
   enabled = true
   path = '{CHANGE HERE}'
}

@adamrtalbot
Collaborator

I can recreate this issue. Completely befuddled, I turned to an LLM, which said:

Looking at the codebase, I can identify the issue. When Fusion is enabled with Azure Batch, task failures are not being properly handled because the task error information is not being correctly propagated.
The key issue appears in AzBatchTaskHandler.groovy:

    final done = taskState0(taskKey)==BatchTaskState.COMPLETED
    if( done ) {
        // finalize the task
        task.exitStatus = readExitFile()
        task.stdout = outputFile
        task.stderr = errorFile
        status = TaskStatus.COMPLETED
        final info = batchService.getTask(taskKey).executionInfo
        if (info.result == BatchTaskExecutionResult.FAILURE)
            task.error = new ProcessUnrecoverableException(info.failureInfo.message)
        deleteTask(taskKey, task)

When a task completes, the handler checks the task status and sets the exit status. However, when Fusion is enabled, the error handling path is different because Fusion tasks run in privileged containers with different error reporting mechanisms.
The issue stems from the fact that the Azure Batch executor doesn't properly handle the error status from Fusion tasks. When a Fusion task fails, the error information should be propagated through the task handler to allow the errorStrategy to work, but this isn't happening.
This is in contrast to other executors like AWS Batch which properly handle Fusion errors:

    protected BashWrapperBuilder createTaskWrapper() {
        return fusionEnabled()
                ? fusionLauncher()
                : new AwsBatchScriptLauncher(task.toTaskBean(), getAwsOptions())
    }

To fix this, the Azure Batch implementation needs to be updated to properly handle error propagation when Fusion is enabled, similar to how AWS Batch handles it.
Until this is fixed, I would recommend either:
- Disabling Fusion when using Azure Batch if you need error strategy functionality
- Using a different executor if you need both Fusion and error strategy support
The root cause appears to be a gap in the Azure Batch implementation rather than an intentional limitation.
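For illustration only, here is a sketch of the kind of change the analysis above suggests (hypothetical, not the actual patch in #5806): let the exit status written by the task wrapper drive errorStrategy, and only raise an unrecoverable error when no exit status was recorded at all.

```groovy
// Hypothetical sketch, not the actual fix. The Integer.MAX_VALUE
// sentinel for "no exit status recorded" is an assumption here.
if( done ) {
    task.exitStatus = readExitFile()   // non-zero exit -> normal, retriable failure
    task.stdout = outputFile
    task.stderr = errorFile
    status = TaskStatus.COMPLETED
    final info = batchService.getTask(taskKey).executionInfo
    // Only treat the failure as unrecoverable when the wrapper never
    // wrote an exit file, so errorStrategy can handle ordinary failures
    if( info.result == BatchTaskExecutionResult.FAILURE && task.exitStatus == Integer.MAX_VALUE )
        task.error = new ProcessUnrecoverableException(info.failureInfo.message)
    deleteTask(taskKey, task)
}
```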

@adamrtalbot adamrtalbot linked a pull request Feb 21, 2025 that will close this issue
@adamrtalbot
Collaborator

I made a valiant attempt to fix it in #5806, which seems to work. Here's my config:

azure {
    activeDirectory {
        servicePrincipalId     = "$AZURE_DIRECTORY_TENANT_ID"
        servicePrincipalSecret = "$AZURE_SERVICE_PRINCIPAL_SECRET"
        tenantId               = "$AZURE_APPLICATION_TENANT_ID" 
    }
    storage {
        accountName  = "$AZURE_STORAGE_ACCOUNT_NAME"
    }
    batch {
        location     = "$AZURE_BATCH_ACCOUNT_REGION"
        accountName  = "$AZURE_BATCH_ACCOUNT_NAME"
        copyToolInstallMode = 'node'
        deletePoolsOnCompletion = true
        autoPoolMode = true
        allowPoolCreation = true
        pools {
            auto {
                autoScale = true
                vmCount = 1
                maxVmCount = 4
            }
        }
   }
}
process {
    executor = "azurebatch"
    machineType = "standard_e*d_v5"
    scratch = true
    withName: '.*' {
        errorStrategy = "retry"
    }
}

wave {
   enabled = true
}

fusion {
   enabled = true
}
