
Issue when running Azure batch + Fusion #5794

Open
joaodemeirel opened this issue Feb 14, 2025 · 2 comments · May be fixed by #5806

Comments

@joaodemeirel

Bug report

When running Azure Batch with Fusion, task failures that should be ignored are not being ignored. See the linked Slack thread for context.

Expected behavior and actual behavior

The workflow should ignore the task failure and continue. However, with this configuration, a single task failure is not ignored and all tasks are terminated.
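For reference, this behaviour is controlled by Nextflow's errorStrategy process directive; a minimal sketch of the settings involved:

```groovy
process {
    // 'ignore' drops the failed task and lets the workflow continue;
    // 'retry' re-submits the failed task up to maxRetries times
    errorStrategy = 'retry'
    maxRetries    = 3
    // alternatively:
    // errorStrategy = 'ignore'
}
```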

Steps to reproduce the problem

Run nf-canary with --run TEST_IGNORED_FAIL and errorStrategy = 'retry' in the config. The full Nextflow config used is included in the next section (only the Azure Batch pool settings need to be changed):

Nextflow config

params {
   skip = ''
   gpu = false
   run = 'TEST_IGNORED_FAIL'
   outdir = null
   remoteFile = null
}

process {
   container = 'docker.io/library/ubuntu:23.10'
   errorStrategy = 'retry'
   when = { 
        ( params.run ? params.run.split(',').any{ "NF_CANARY:${it.toUpperCase()}".contains(task.process) } : true ) && 
        (!params.skip.split(',').any{ "NF_CANARY:${it.toUpperCase()}".contains(task.process) } ) 
    }
   scratch = false
   executor = 'azurebatch'
   queue = '{CHANGE HERE}'
}

timeline {
   enabled = true
   file = 'timeline-UF0v4fqsb1SUI.html'
}

wave {
   enabled = true
   endpoint = 'https://wave.seqera.io'
}

fusion {
   enabled = true
}

azure {
   storage {
      accountName = '{CHANGE HERE}'
   }
   batch {
      location = 'eastus'
      accountName = '{CHANGE HERE}'
      copyToolInstallMode = 'node'
      autoPoolMode = false
      allowPoolCreation = false
      pools {
         '{CHANGE HERE}' {
            vmType = 'standard_e16d_v5'
            vmCount = 0
         }
      }
   }
}

workDir = '{CHANGE HERE}'
runName = 'insane_euler'

tower {
   enabled = true
   endpoint = 'https://api.cloud.seqera.io'
}

cloudcache {
   enabled = true
   path = '{CHANGE HERE}'
}

@adamrtalbot
Collaborator

I can recreate this issue. Completely befuddled, I turned to an LLM, which said:

Looking at the codebase, I can identify the issue. When Fusion is enabled with Azure Batch, task failures are not being properly handled because the task error information is not being correctly propagated.
The key issue appears in AzBatchTaskHandler.groovy:

    final done = taskState0(taskKey)==BatchTaskState.COMPLETED
    if( done ) {
        // finalize the task
        task.exitStatus = readExitFile()
        task.stdout = outputFile
        task.stderr = errorFile
        status = TaskStatus.COMPLETED
        final info = batchService.getTask(taskKey).executionInfo
        if (info.result == BatchTaskExecutionResult.FAILURE)
            task.error = new ProcessUnrecoverableException(info.failureInfo.message)
        deleteTask(taskKey, task)

When a task completes, the handler checks the task status and sets the exit status. However, when Fusion is enabled, the error handling path is different because Fusion tasks run in privileged containers with different error reporting mechanisms.
The issue stems from the fact that the Azure Batch executor doesn't properly handle the error status from Fusion tasks. When a Fusion task fails, the error information should be propagated through the task handler to allow the errorStrategy to work, but this isn't happening.
This is in contrast to other executors like AWS Batch which properly handle Fusion errors:

    protected BashWrapperBuilder createTaskWrapper() {
        return fusionEnabled()
                ? fusionLauncher()
                : new AwsBatchScriptLauncher(task.toTaskBean(), getAwsOptions())
    }

To fix this, the Azure Batch implementation needs to be updated to properly handle error propagation when Fusion is enabled, similar to how AWS Batch handles it.
Until this is fixed, I would recommend either:
- Disabling Fusion when using Azure Batch if you need error strategy functionality
- Using a different executor if you need both Fusion and error strategy support
The root cause appears to be a gap in the Azure Batch implementation rather than an intentional limitation.
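For illustration only, here is a sketch of the kind of change the analysis above suggests (hypothetical, not the actual patch in #5806): let the exit status written by the task wrapper drive errorStrategy, and only raise an unrecoverable error when no exit status was recorded at all.

```groovy
// Hypothetical sketch, not the actual fix. The Integer.MAX_VALUE
// sentinel for "no exit status recorded" is an assumption here.
if( done ) {
    task.exitStatus = readExitFile()   // non-zero exit -> normal, retriable failure
    task.stdout = outputFile
    task.stderr = errorFile
    status = TaskStatus.COMPLETED
    final info = batchService.getTask(taskKey).executionInfo
    // Only treat the failure as unrecoverable when the wrapper never
    // wrote an exit file, so errorStrategy can handle ordinary failures
    if( info.result == BatchTaskExecutionResult.FAILURE && task.exitStatus == Integer.MAX_VALUE )
        task.error = new ProcessUnrecoverableException(info.failureInfo.message)
    deleteTask(taskKey, task)
}
```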

@adamrtalbot adamrtalbot linked a pull request Feb 21, 2025 that will close this issue
@adamrtalbot
Collaborator

I made a valiant attempt to fix it in #5806, which seems to work. Here's my config:

azure {
    activeDirectory {
        servicePrincipalId     = "$AZURE_DIRECTORY_TENANT_ID"
        servicePrincipalSecret = "$AZURE_SERVICE_PRINCIPAL_SECRET"
        tenantId               = "$AZURE_APPLICATION_TENANT_ID" 
    }
    storage {
        accountName  = "$AZURE_STORAGE_ACCOUNT_NAME"
    }
    batch {
        location     = "$AZURE_BATCH_ACCOUNT_REGION"
        accountName  = "$AZURE_BATCH_ACCOUNT_NAME"
        copyToolInstallMode = 'node'
        deletePoolsOnCompletion = true
        autoPoolMode = true
        allowPoolCreation = true
        pools {
            auto {
                autoScale = true
                vmCount = 1
                maxVmCount = 4
            }
        }
   }
}
process {
    executor = "azurebatch"
    machineType = "standard_e*d_v5"
    scratch = true
    withName: '.*' {
        errorStrategy = "retry"
    }
}

wave {
   enabled = true
}

fusion {
   enabled = true
}
