Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SLURM SBATCH time setting wrong #5775

Open
gevro opened this issue Feb 10, 2025 · 10 comments
Open

SLURM SBATCH time setting wrong #5775

gevro opened this issue Feb 10, 2025 · 10 comments

Comments

@gevro
Copy link

gevro commented Feb 10, 2025

Hi, For nextflow v24.10.3 all SLURM scripts generated by nextflow are being assigned time = 0:

#SBATCH -t 00:00:00

Even though the processes are defined with specific times in this format:
time '10h'

There is no configuration I can think of that could cause something like this.

General config looks like this:

params {
    max_memory                 = 3000.GB
    max_cpus                   = 96
    max_time                   = 7d
}

process {
    resourceLimits = [
        memory: 3000.GB,
        cpus: 96,
        time: 7d
    ]
    executor       = 'slurm'
    clusterOptions = '--export=NONE'
    maxRetries     = 3
    errorStrategy  = { task.attempt <= 3 ? 'retry' : 'finish' }
    cache          = 'lenient'
}

Any idea how this could be occurring?

@gevro
Copy link
Author

gevro commented Feb 10, 2025

Note this bug is occurring with v24.10.3 but not with v23.04.1. So there is a bug in v24.10.3.

@gevro
Copy link
Author

gevro commented Feb 10, 2025

Just to add more info in case relevant to fixing the bug: this bug occurs for a repo on github in which I put an empty nextflow.config file as required by nextflow.

@gevro
Copy link
Author

gevro commented Feb 10, 2025

I just tested various versions. The bug started in 24.04.0. Looking at git blame, the last time SLURM task execution changed substantially was at 24.04.0 for array submissions: #3892

So my strong suspicion is this change introduced this bug. I'm surprised others have not reported it yet.

@bentsherman
Copy link
Member

Actually I think it was introduced by the resourceLimits directive (#2911) which was also added in 24.04. I'm also surprised it hasn't been reported before. But maybe it only occurs in specific conditions.

The time setting for a task is computed here:

private Duration getDuration0(String key) {
def value = get(key)
if( !value )
return null
if( value instanceof Duration )
return (Duration)value
if( value instanceof Number )
return new Duration(value as long)
try {
new Duration(value.toString().trim())
}
catch( Exception e ) {
throw new AbortOperationException("Not a valid `$key` value in process definition: $value")
}
}
private Duration getTime0() {
return getDuration0('time')
}
Duration getTime() {
final val = getTime0()
final lim = getResourceLimit('time') as Duration
return val && lim && val > lim ? lim : val
}

And the time setting is added to the SLURM job here:

if( task.config.getTime() ) {
result << '-t' << task.config.getTime().format('HH:mm:ss')
}

So getTime() must be returning a non-null Duration that is equal to 0. Also, I would expect this bug to affect all grid executors because they all have similar logic

@bentsherman
Copy link
Member

bentsherman commented Feb 13, 2025

Oh, I see the problem now:

    resourceLimits = [
        time: 7d
    ]

7d is equivalent to 7.0, which becomes 7 milliseconds as a time value. The SLURM executor receives this limit instead of the requested 10 hours, and it rounds down to 0 seconds. You should change it to 7.d or '7d'

@gevro
Copy link
Author

gevro commented Feb 13, 2025

Thanks! But not sure I follow why 7d isn't a legitimate notation?

7h for example works in process time directives. Is this behavior documented?

@gevro
Copy link
Author

gevro commented Feb 13, 2025

(The intention is a limit of 7 days)

@bentsherman
Copy link
Member

It is a notation for numeric literals that Groovy inherits from Java (which copied it from C/C++). It allows you to be very precise about the numeric type that you want to use.

For example, these are all valid ways to denote the number 7:

7  // 32-bit integer (default)
7L // 64-bit integer ("long")
7f // 32-bit floating-point ("float")
7d // 64-bit floating-point ("double")

I come from a more low-level programming background so I'm used to this, but I suspect most Nextflow users don't know and don't care about how a number is being stored in memory. I had thought it wouldn't hurt anything, but considering how close it is syntactically to a duration like 7.d, maybe we should remove it in the strict syntax.

7h is not a valid literal and I would expect Nextflow to report a syntax error. Only 7.h or '7h' should work.

@gevro
Copy link
Author

gevro commented Feb 13, 2025

Does '7d' denote 7 days with quotes?

@bentsherman
Copy link
Member

Yes, it will work if you have quotes, because then it's a string

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants