
Conversation

@kinow
Member

@kinow kinow commented Jul 21, 2025

Closes #2463

Check List

  • I have read CONTRIBUTING.md.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to pyproject.toml.
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included in CHANGELOG.md if this is a change that can affect users.
  • Documentation updated.
  • If this is a bug fix, PR should include a link to the issue (e.g. Closes #1234).

@kinow kinow added this to the 4.1.16 milestone Jul 21, 2025
@kinow kinow self-assigned this Jul 21, 2025
@kinow
Member Author

kinow commented Jul 23, 2025

When you create the workflow without wrappers, with just two jobs, you get two jobs in job_list created in update_genealogy, both with the same ID.

[screenshots]

When you run it, it again calls the same update_genealogy, getting two jobs in the job_list with the same ID, but only one job is executed at a time (they depend on each other).


When the jobs are wrapped, they have different IDs in update_genealogy. When you run it, job_list will have two jobs, different IDs too.

However, in get_in_queue_grouped_id, when it calls self.get_in_queue(platform), it will return two jobs, with the exact same ID.

[screenshot]
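
To make that concrete, here is a minimal sketch of the grouping described above; the names and the function are illustrative only, not Autosubmit's actual get_in_queue_grouped_id. Grouping queued jobs by their remote ID collapses the two wrapped jobs, which share a single Slurm ID, into one entry holding a list of jobs.

# Illustrative sketch only, not the real get_in_queue_grouped_id implementation.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class QueuedJob:
    name: str
    remote_id: int

def group_in_queue_by_id(in_queue):
    """Group queued jobs by their remote (Slurm) ID."""
    grouped = defaultdict(list)
    for job in in_queue:
        grouped[job.remote_id].append(job)
    # Groups of size one stay a single job; larger groups stay as a list of jobs.
    return [jobs[0] if len(jobs) == 1 else jobs for jobs in grouped.values()]

queue = [QueuedJob("a000_SIM_1", 1234567), QueuedJob("a000_SIM_2", 1234567)]
print(group_in_queue_by_id(queue))  # a single entry: a list with the two wrapped jobs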

At that point, when the function goes to check_wrappers, we still don't reproduce the bug from @youcheng-csc, because the code enters the first part of the if statement and not the second, where the bug happened.

[screenshot]

That's why we normally don't see this bug happening.
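
For reference, a tiny hypothetical sketch of the failure shape (illustrative names and logic, not the real check_wrappers): when the queue lookup hands back the wrapped jobs grouped into a list and the code falls into the branch that assumes a single job, accessing .status raises exactly the error reported in the issue.

# Hypothetical sketch of the failure shape; names and logic are illustrative,
# not Autosubmit's actual check_wrappers code.
class Job:
    def __init__(self, name, status):
        self.name = name
        self.status = status

def check_wrappers(job_package_map, in_queue):
    for item in in_queue:
        if job_package_map:
            # First part of the if: wrapper bookkeeping is available, so grouped
            # items are handled as a package. Every run so far took this path.
            print("handled as a wrapper package")
        else:
            # Second part: written as if `item` were always a single Job. When the
            # lookup returned the wrapped jobs grouped into a list, this raises:
            # AttributeError: 'list' object has no attribute 'status'
            print("job status:", item.status)

wrapped = [Job("a000_SIM_1", 4), Job("a000_SIM_2", 5)]
try:
    check_wrappers(job_package_map={}, in_queue=[wrapped])
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'status'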


When you run the same experiment again after it failed, without creating it first, in update_genealogy you now get two identical IDs instead of two different ones 💫 My guess is that it's loading from disk, and since the last run had the same ID, that's what was saved.

Here, the code enters the second part of the if statement in check_wrappers, but now it has a single job, as the first failed, and the second succeeded.


In the existing experiment, I have two jobs. One fails, then I recover (similar to what @youcheng-csc did) and re-launch it. One completed, one failed. But having a single failed job doesn't trigger the part of the code we want, the one that handles a list (I noticed this as I had spent some time reviewing the code and writing that unit test).

I increased NUMCHUNKS to 3 and modified the SIM SCRIPT to:

# Fails on the second and third chunks
CHUNK="%CHUNK%"
if [ "$CHUNK" -eq 1 ]
then
  echo "OK!"
else
  echo "Uh oh"
  crashit!  # nonexistent command, so the job exits with a non-zero status
fi
[screenshot]

Here, I ran setstatus so that I had one failed job and two waiting in the wrapper.

[screenshot]

Still not able to reproduce Cheng's issue. Getting closer, but I can't figure out what makes the code enter that branch of the if with a list. (At least we now know what happened.)

@kinow
Member Author

kinow commented Jul 23, 2025

[screenshot]

It finally happened! 🎉

But the issue is that it's not deterministic. I had the debugger running with a few breakpoints, so I wonder if pausing in the part of the code I was looking at could have slowed something down. (Not sure if there are threads/processes doing anything in parallel, or something related to this?)

[screenshots]

This is the job_data DB state. Maybe the sequence of job states is part of the reason for the bug. I made a backup of my pickle file, and will clean everything and re-launch it with the same breakpoints.

@kinow
Member Author

kinow commented Jul 23, 2025

Just before I cleaned everything, I spent some more minutes looking at the variables in different frames in the debugger, and taking note of the DB and its data.

Now I'm really glad I did that, as the bug repeated every time I launched the debugger -- this indicates that it is not a random error, but one based on data/state.

Before, I was chasing how job_list.job_package_map could be empty. But I noticed in the debugger that, in one of the frames, the function/self had no job_package; it was empty (I think it was supposed to be a list or dict).

I had looked at the table in SQLite, and it was indeed empty.

Now I cannot reproduce the bug anymore, but my job_package is populated.

[screenshot]

So now we just need to understand what caused my job_package to be empty. And I suspect @youcheng-csc's DB was in the same state when the issue happened.
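
For anyone wanting to make the same check, here is a quick sketch for peeking into the packages database with Python's sqlite3. The path below is an assumption about a typical experiment layout; point it at wherever your experiment actually keeps its job packages DB.

# Quick inspection of an experiment's job packages DB. The path is an assumed
# placeholder; adjust it to your experiment's actual file.
import sqlite3

db_path = "/path/to/autosubmit/o002/pkl/job_packages_o002.db"  # assumed location

conn = sqlite3.connect(db_path)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
for table in tables:
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table}: {count} rows")  # zero rows everywhere means the packages DB is empty
conn.close()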

@kinow
Member Author

kinow commented Jul 31, 2025

Possible scenarios for the error to happen:

And:

Tomorrow I will set breakpoints in these three locations linked above

@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch from 6a5705c to 59ad37c Compare July 31, 2025 14:50
@kinow kinow moved this from Todo to In Progress in Autosubmit project Jul 31, 2025
@kinow
Member Author

kinow commented Aug 1, 2025

I wasn't able to reproduce the bug today. I started by setting the breakpoints in the locations I mentioned in my previous message, then ran run + recovery + run, like I did the last time it happened. But this time the issue did not happen.

I confirmed I am on the correct branch. The only differences are that MN5 could be in a different "mood", and that I don't have as many breakpoints as I had the last time it happened. I'll start setting more breakpoints to see if I can reproduce the issue.

@kinow
Member Author

kinow commented Aug 4, 2025

Having some difficulty testing this PR today, due to MN5. I'm preparing the Slurm container that we use for tests, as I believe it should suffice for what we need here, and it will be a lot faster to test things out.

@kinow
Member Author

kinow commented Aug 4, 2025

My Slurm Docker container failed to run, but after troubleshooting it with @VindeeR, I managed to get it working.

PLATFORMS:
  LOCALDOCKER:
    TYPE: slurm
    HOST: localDocker
    PROJECT: group
    HPC_PROJECT_ROOT: /run/user/
    USER: root
    QUEUE: gp_ehpc
    MAX_WALLCLOCK: 00:15
    SCRATCH_DIR: /run/user/
    ADD_PROJECT_TO_HOST: false
    TEMP_DIR: ''
    PROCESSORS_PER_NODE: 112
    APP_PARTITION: gp_bsces
    TASKS: 1

At this point, @VindeeR & @manuel-g-castro were helping me, as I couldn't get my Slurm to work. So I took a break and, when I came back, I read the logs calmly.

[2025-08-04T15:44:05.002] debug2: _slurm_connect: failed to connect to 10.88.0.14:6818: Connection refused
[2025-08-04T15:44:05.002] debug2: Error connecting slurm stream socket at 10.88.0.14:6818: Connection refused

The 10.88.0.14 is the Docker slurm container. The 6818 I think is supposed to be the slurm worker, but it didn't start. Looking at the other log file that Erick suggested inspecting, /var/log/slurmd.log, I had:

# cat slurmd.log 
[2025-08-04T15:44:01.765] error: The cgroup mountpoint does not align with the current namespace. Please, ensure all namespaces are correctly mounted. Refer to the slurm cgroup_v2 documentation.
[2025-08-04T15:44:01.765] error: We detected a pid 0 which means you are in a cgroup namespace and a mounted cgroup but with pids from the host that we're not allowed to manage.
[2025-08-04T15:44:01.765] error: cgroup /sys/fs/cgroup contains pids from outside of our pid namespace, so we cannot manage this cgroup.
[2025-08-04T15:44:01.765] error: cannot setup the scope for cgroup
[2025-08-04T15:44:01.765] error: Unable to initialize cgroup plugin
[2025-08-04T15:44:01.765] error: slurmd initialization failed
root@slurmctld:/var/log/slurm# ^C
root@slurmctld:/var/log/slurm# 
logout
Connection to localhost closed.

The line "cgroup /sys/fs/cgroup contains pids from outside of our pid namespace, so we cannot manage this cgroup" seems to be the root cause. So I created a cgroup.

I tried creating a new directory and bind-mounting it into the container, but slurmd didn't find the systemd hierarchy (which it needs) in the cgroup folder.

I tried --pid=host, but that also didn't work.

Then I decided to search for ways to create a separate cgroup space for the container, and found this: https://serverfault.com/questions/1053187/systemd-fails-to-run-in-a-docker-container-when-using-cgroupv2-cgroupns-priva/1054414#1054414

@VindeeR, I think this might be better if it works for you. It gives the container its own space (slice) and makes the Docker command line a bit shorter.

You just update your /etc/docker/daemon.json to match theirs (I also added "debug": true), then systemctl restart docker, and run the container again without the volume and with --cgroupns=private.

$ docker run --rm -it --cgroupns=private --privileged --hostname slurmctld --name slurm-container -p 2222:2222 autosubmit/slurm-openssh-container:25-05-0-1

And update your jobs, then:

  • Create required directories (ssh root@localDocker mkdir -p /run/user/group/root)
  • Run your experiment

With that done, the whole experiment finishes in seconds!!! Before, I had to wait minutes for each test :) It'll save a lot of time troubleshooting future problems like this.

@manuel-g-castro
Contributor

Hi! Great news that you managed to get it working.

I have a question: why do you need to create the folder /run/user/group/root?

Also, for future reference, I will clarify the following comment:

> The 6818 I think is supposed to be the slurm worker

It is the default port for communication with the Slurm daemon (slurmd), the process that runs on every node, launches the jobs, and checks whether they are running.

The other default ports used in this image are 6817, for slurmctld, the controller that aggregates all the information from the nodes and queues and allocates resources to jobs; and 6819, for the accounting daemon (slurmdbd), the interface between slurmctld and the database solutions (MySQL, plain text files, etc.).

@kinow
Member Author

kinow commented Aug 4, 2025

I had to create that folder due to the platform settings. Autosubmit expects the scratch + project + user folder to exist and doesn't create it if it doesn't, so my test experiment failed to run.

Thanks for the info about the ports!

@kinow
Member Author

kinow commented Aug 5, 2025

Interestingly, the issue hasn't happened yet for me in the Docker container. Maybe something in MN5 contributes to this issue happening (do you know if it has ever happened in LUMI, @youcheng-csc?).

$ autosubmit recovery o002 \
    --all -f -s -np && autosubmit -lc DEBUG run o002 ; \
  autosubmit -lc DEBUG cat-log o002 -f o

I had saved my pickle file and DB from when the issue happened in my local development environment last week, and I compared the current one I have (working) with that one (broken).

[screenshots]

Summary of changes in the broken one:

  • No wallclock
  • No tasks, processors, memory...
  • id is zero
  • no remote logs
  • status is 5 (-1 in good)
  • prev and new status are 0 (4 and 5 in good)
  • no write_start
  • check is true
  • total jobs and max waiting are zero (20 in good)
  • no submit time
  • no _script
  • wrapper name is blank
  • no wallclock_in_seconds (a property that relies on wallclock, so that makes sense)
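
In case it helps to redo this comparison later, here is a rough sketch of how the two saved pickles can be diffed attribute by attribute. The file paths are placeholders, unpickling requires the Autosubmit classes to be importable, and each pickle is assumed to deserialize to a list of job objects (adapt it if yours wraps them differently).

# Rough sketch for diffing a "good" and a "broken" job pickle attribute by attribute.
import pickle

def load_jobs(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)

def job_attributes(job):
    # Collect attribute names from both __dict__ and any __slots__ in the class hierarchy.
    names = set(getattr(job, "__dict__", {}))
    for cls in type(job).__mro__:
        names.update(getattr(cls, "__slots__", ()))
    return names

good_jobs = load_jobs("backup/job_list_good.pkl")      # placeholder path
broken_jobs = load_jobs("backup/job_list_broken.pkl")  # placeholder path

for good, broken in zip(good_jobs, broken_jobs):
    for attr in sorted(job_attributes(good) | job_attributes(broken)):
        g = getattr(good, attr, "<missing>")
        b = getattr(broken, attr, "<missing>")
        if g != b:
            print(f"{getattr(good, 'name', '?')}.{attr}: good={g!r} broken={b!r}")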

@kinow
Member Author

kinow commented Aug 5, 2025

This afternoon I started "hacking" some objects to force the code to go through certain parts (e.g. instead of checking whether there are failures, forcing it to believe there are).

I managed to get a job_list object very similar to what I had when the issue happened, by hacking this part of the code:

if self.status == Status.FAILED and self.fail_count >= self.retrials:
    return None

After deleting that if, all the attributes were deleted.

Now, the part I still don't get is that, when the bug happened, I didn't have write_start in the Job object. It's specified in __slots__, and I can't think (right now) of how we can end up without it 💫 But it may be related to what's causing this issue...
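
One generic Python detail that might explain part of it (this is standard __slots__ behaviour, not something specific to our Job class): a name listed in __slots__ only reserves a descriptor, and the attribute does not exist on the instance until something assigns it, so an object can be rebuilt without write_start ever being set.

# Generic Python illustration, not Autosubmit code: a slot that is never assigned
# does not exist on the instance, even though it is declared in __slots__.
class Job:
    __slots__ = ("name", "write_start")

    def __init__(self, name):
        self.name = name  # write_start is deliberately never assigned here

job = Job("a000_SIM_1")
print(hasattr(job, "write_start"))  # False
try:
    job.write_start
except AttributeError as err:
    print("AttributeError:", err)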

@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch from 57ef611 to 014cd93 Compare August 6, 2025 07:54
@kinow
Member Author

kinow commented Aug 7, 2025

Not much progress on it today due to meetings (3 hours uninterrupted 😓). I tested setting some tasks to DELAYED, and with different retries/retrials. I am able to produce some interesting scenarios when messing with the WALLCLOCK of the platform and of the wrapper. I will check in Cheng's experiments whether the wallclock of the jobs and the wrapper were close to the platform's defined max wallclock.

@kinow
Member Author

kinow commented Aug 7, 2025

@kellekai had this same problem in a26i. I copied his experiment and patched my local o002, removing EXTEND_WALLCLOCK from the wrapper (it's in Cheng's; Kai's doesn't have it), and setting retrials, check on submission, nodes, and queue in the job. Ran it a couple of times, hopeful that this would trigger the error locally, but it didn't 😞

@kinow
Member Author

kinow commented Aug 7, 2025

Copied the MARENOSTRUM5 platform from Kai's experiment, changed user & host, then updated my SIM to use it; ran create, run, then recovery + run a couple of times, but had no success replicating the problem.

Looking at Kai's job package DB, both tables (wrappers and jobs) apparently are correctly populated. So that's not part of the bug, which is what I was suspecting before. 🤔

@kinow
Member Author

kinow commented Aug 11, 2025

Tried running the experiment until Slurm jobs were submitted, then killed Autosubmit and submitted a new experiment with create (recovery stops the jobs).

Did that a few (4 or 5) times with MN5, submitting to gp_bsces. I was not able to reproduce the bug that way.

Then I simply ran create and started run + recovery. The problem did not happen, but the plots of the wrappers were not consistent.

[screenshots]

@kellekai's experiment appears to be consistently failing, so now I will spend some time inspecting his Pickle and DB files, and reading most of his recent logs to see if that gives me any clue.

I also tried to use @manuel-g-castro's new Slurm container to launch multiple jobs locally, which would be faster, but we have an issue running the new trixie image on my laptop with Ubuntu 24.04 (I think Manuel mentioned @pablogoitia has the same problem). So all I have for now is MN5 and the old Slurm container, with a single job executed each time.

@kinow
Member Author

kinow commented Aug 12, 2025

Had an idea last night and managed to reproduce the bug again this morning. I started mapping possible scenarios for the bug, and started setting breakpoints/bookmarks to simulate each one.

My first option was to look at job_list's job_package_map and make it empty. On my first try the issue happened, but before I had even made it empty.

I set a breakpoint in save_wrappers, right at its top, on the first if, and exited the debugger immediately (was just checking the frame variables).

Then, I started a new run without recovery. And the bug happened. I took notes of several variables in the debugger, and made more notes about how it would be possible to reach this combination of variables. Will continue working on this scenario today, maybe simulating a few things by hacking the code.

@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch from 1569447 to 23099b2 Compare August 22, 2025 10:43
@kinow
Member Author

kinow commented Aug 22, 2025

Thanks for the review, @dbeltrankyl. I think I addressed all your comments by agreeing with them, deleting TODOs that you had already answered, or creating issues.

Just this one, https://github.com/BSC-ES/autosubmit/pull/2474/files#r2292870824, I couldn't answer very well, I think.

Could you have another look, please, @dbeltrankyl ?

I rebased and pushed the changes, which simply remove the TODO/FIXME comments you answered. If an intermittent test fails (test_send_data or one in test_slurm), I'll kick the build again and then it should pass.

Thanks

@dbeltrankyl
Collaborator

Done! Thanks @kinow

@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch from 23099b2 to b93660f Compare August 26, 2025 06:35
@kinow
Member Author

kinow commented Aug 26, 2025

Rebased, still pending to look at one comment from Dani's feedback.

@kinow
Member Author

kinow commented Sep 15, 2025

Rebased, but I will hold until #2577 is merged, then rebase this one again; after that, this PR should have a lot fewer changes.

@dbeltrankyl
Collaborator

This should be done for 4.1.16. It was already working but blocked by #2577

@dbeltrankyl
Collaborator

> This should be done for 4.1.16. It was already working but blocked by #2577

What I'm not sure about is when #2577 will be ready. Maybe we can merge this one without the other, as it is a DestinE bug. cc @kinow

@kinow
Member Author

kinow commented Oct 22, 2025

#2551 should be ready to be reviewed once @VindeeR has time to finish reviewing #2604 (which contains tests for some of his changes). Then I will rebase this one, and we should be ready to review/merge it.

@kinow
Member Author

kinow commented Oct 23, 2025

Will rebase and fix issues tomorrow 👍 After that, I will jump to the disappearing HPC variables, if nothing else comes up.

@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch 8 times, most recently from 3675d67 to 851b66e Compare October 27, 2025 13:53
@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch 2 times, most recently from 3dc0ec3 to 9c40372 Compare November 19, 2025 22:16
@kinow kinow marked this pull request as draft November 19, 2025 22:16
@kinow
Member Author

kinow commented Nov 19, 2025

Rebased, but there are more changes to be moved to another branch. Will start doing that tomorrow.

…eturn a list.

- Added mypy types
- Added unit and integration tests
- Changed the order in which ``JobList.save_wrappers`` is called in
  ``autosubmit/autosubmit.py``
- Added docstrings, and fixed linter errors (to debug/see code
  execution)
@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch from 9c40372 to 22feacd Compare November 19, 2025 22:21


Successfully merging this pull request may close these issues:

[CRITICAL] Unexpected error: 'list' object has no attribute 'status'
