
Conversation

@kinow
Member

@kinow kinow commented Jul 21, 2025

Closes #2463

Check List

  • I have read CONTRIBUTING.md.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to pyproject.toml.
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included in CHANGELOG.md if this is a change that can affect users.
  • Documentation updated.
  • If this is a bug fix, PR should include a link to the issue (e.g. Closes #1234).

@kinow kinow added this to the 4.1.16 milestone Jul 21, 2025
@kinow kinow self-assigned this Jul 21, 2025
@kinow
Member Author

kinow commented Jul 23, 2025

When you create the workflow without wrappers, with just two jobs, you get two jobs in job_list created in update_genealogy, both with the same ID.

[screenshots]

When you run it, it again calls the same update_genealogy, getting two jobs in the job_list with the same ID, but only one job is executed at a time (they depend on each other).


When the jobs are wrapped, they have different IDs in update_genealogy. When you run it, job_list will have two jobs, different IDs too.

However, in get_in_queue_grouped_id, when it calls self.get_in_queue(platform), it will return two jobs, with the exact same ID.

[screenshot]
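
To make that concrete, here is a minimal sketch of the grouping described above; the names and the function are illustrative only, not Autosubmit's actual get_in_queue_grouped_id. Grouping queued jobs by their remote ID collapses the two wrapped jobs, which share a single Slurm ID, into one entry holding a list of jobs.

# Illustrative sketch only, not the real get_in_queue_grouped_id implementation.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class QueuedJob:
    name: str
    remote_id: int

def group_in_queue_by_id(in_queue):
    """Group queued jobs by their remote (Slurm) ID."""
    grouped = defaultdict(list)
    for job in in_queue:
        grouped[job.remote_id].append(job)
    # Groups of size one stay a single job; larger groups stay as a list of jobs.
    return [jobs[0] if len(jobs) == 1 else jobs for jobs in grouped.values()]

queue = [QueuedJob("a000_SIM_1", 1234567), QueuedJob("a000_SIM_2", 1234567)]
print(group_in_queue_by_id(queue))  # a single entry: a list with the two wrapped jobs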

At that point, when the function goes to check_wrappers, we still don't reproduce the bug from @youcheng-csc, because the code enters the first part of the if statement and not the second, where the bug happened.

[screenshot]

That's why we normally don't see this bug happening.
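
For reference, a tiny hypothetical sketch of the failure shape (illustrative names and logic, not the real check_wrappers): when the queue lookup hands back the wrapped jobs grouped into a list and the code falls into the branch that assumes a single job, accessing .status raises exactly the error reported in the issue.

# Hypothetical sketch of the failure shape; names and logic are illustrative,
# not Autosubmit's actual check_wrappers code.
class Job:
    def __init__(self, name, status):
        self.name = name
        self.status = status

def check_wrappers(job_package_map, in_queue):
    for item in in_queue:
        if job_package_map:
            # First part of the if: wrapper bookkeeping is available, so grouped
            # items are handled as a package. Every run so far took this path.
            print("handled as a wrapper package")
        else:
            # Second part: written as if `item` were always a single Job. When the
            # lookup returned the wrapped jobs grouped into a list, this raises:
            # AttributeError: 'list' object has no attribute 'status'
            print("job status:", item.status)

wrapped = [Job("a000_SIM_1", 4), Job("a000_SIM_2", 5)]
try:
    check_wrappers(job_package_map={}, in_queue=[wrapped])
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'status'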


When you run the same experiment again after it failed, without creating it first, in update_genealogy you now get two identical IDs instead of two different ones 💫 My guess is that it's loading from disk, and since the last run had the same ID, that's what was saved.

Here, the code enters the second part of the if statement in check_wrappers, but now it has a single job, as the first failed, and the second succeeded.


In the existing experiment, I have two jobs. One fails, then I recover (similar to what @youcheng-csc did) and re-launch it. One completed, one failed. But having a single failed job doesn't trigger the part of the code we want, the one that handles a list (I noticed this as I had spent some time reviewing the code and writing that unit test).

I increased NUMCHUNKS to 3 and modified the SIM SCRIPT to:

# Fails on the second and third chunks
CHUNK="%CHUNK%"
if [ "$CHUNK" -eq 1 ]
then
  echo "OK!"
else
  echo "Uh oh"
  crashit!  # nonexistent command, so the job exits with a non-zero status
fi
[screenshot]

Here, I ran setstatus so that I had one failed job and two waiting in the wrapper.

[screenshot]

Still not able to reproduce Cheng's issue. Getting closer, but I can't figure out what makes the code enter that branch of the if with a list. (At least we now know what happened.)

@kinow
Member Author

kinow commented Jul 23, 2025

[screenshot]

It finally happened! 🎉

But the issue is that it's not deterministic. I had the debugger running with a few breakpoints, so I wonder if pausing in the part of the code I was looking at could have slowed something down. (Not sure if there are threads/processes doing anything in parallel, or something related to this?)

[screenshots]

This is the job_data DB state. Maybe the sequence of job states is part of the reason for the bug. I made a backup of my pickle file, and will clean everything and re-launch it with the same breakpoints.

@kinow
Member Author

kinow commented Jul 23, 2025

Just before I cleaned everything, I spent some more minutes looking at the variables in different frames in the debugger, and taking note of the DB and its data.

Now I'm really glad I did that, as the bug repeated every time I launched the debugger -- this indicates that it is not a random error, but one based on data/state.

Before, I was chasing how job_list.job_package_map could be empty. But I noticed in the debugger that, in one of the frames, the function/self had no job_package; it was empty (I think it was supposed to be a list or dict).

I had looked at the table in SQLite, and it was indeed empty.

Now I cannot reproduce the bug anymore, but my job_package is populated.

[screenshot]

So now we just need to understand what caused my job_package to be empty. And I suspect @youcheng-csc's DB was in the same state when the issue happened.
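
For anyone wanting to make the same check, here is a quick sketch for peeking into the packages database with Python's sqlite3. The path below is an assumption about a typical experiment layout; point it at wherever your experiment actually keeps its job packages DB.

# Quick inspection of an experiment's job packages DB. The path is an assumed
# placeholder; adjust it to your experiment's actual file.
import sqlite3

db_path = "/path/to/autosubmit/o002/pkl/job_packages_o002.db"  # assumed location

conn = sqlite3.connect(db_path)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
for table in tables:
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table}: {count} rows")  # zero rows everywhere means the packages DB is empty
conn.close()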

@kinow
Member Author

kinow commented Jul 31, 2025

Possible scenarios for the error to happen:

And:

Tomorrow I will set breakpoints in these three locations linked above

@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch from 6a5705c to 59ad37c Compare July 31, 2025 14:50
@kinow kinow moved this from Todo to In Progress in Autosubmit project Jul 31, 2025
@kinow
Member Author

kinow commented Aug 1, 2025

I wasn't able to reproduce the bug today. I started by setting the breakpoints in the locations I mentioned in my previous message, then ran run + recovery + run, like I did the last time it happened. But this time the issue did not happen.

I confirmed I am on the correct branch. The only differences are that MN5 could be in a different "mood", and that I don't have as many breakpoints as I had the last time it happened. I'll start setting more breakpoints to see if I can reproduce the issue.

@kinow
Member Author

kinow commented Aug 4, 2025

Having some difficulty testing this PR today, due to MN5. I'm preparing the Slurm container that we use for tests, as I believe it should suffice for what we need here, and it will be a lot faster to test things out.

@kinow
Member Author

kinow commented Aug 4, 2025

My Slurm Docker container failed to run, but after troubleshooting it with @VindeeR, I managed to get it working.

PLATFORMS:
  LOCALDOCKER:
    TYPE: slurm
    HOST: localDocker
    PROJECT: group
    HPC_PROJECT_ROOT: /run/user/
    USER: root
    QUEUE: gp_ehpc
    MAX_WALLCLOCK: 00:15
    SCRATCH_DIR: /run/user/
    ADD_PROJECT_TO_HOST: false
    TEMP_DIR: ''
    PROCESSORS_PER_NODE: 112
    APP_PARTITION: gp_bsces
    TASKS: 1

At this point, @VindeeR & @manuel-g-castro were helping me, as I couldn't get my Slurm to work. So I took a break and, when I came back, I read the logs calmly.

[2025-08-04T15:44:05.002] debug2: _slurm_connect: failed to connect to 10.88.0.14:6818: Connection refused
[2025-08-04T15:44:05.002] debug2: Error connecting slurm stream socket at 10.88.0.14:6818: Connection refused

The 10.88.0.14 is the Docker slurm container. The 6818 I think is supposed to be the slurm worker, but it didn't start. Looking at the other log file that Erick suggested inspecting, /var/log/slurmd.log, I had:

# cat slurmd.log 
[2025-08-04T15:44:01.765] error: The cgroup mountpoint does not align with the current namespace. Please, ensure all namespaces are correctly mounted. Refer to the slurm cgroup_v2 documentation.
[2025-08-04T15:44:01.765] error: We detected a pid 0 which means you are in a cgroup namespace and a mounted cgroup but with pids from the host that we're not allowed to manage.
[2025-08-04T15:44:01.765] error: cgroup /sys/fs/cgroup contains pids from outside of our pid namespace, so we cannot manage this cgroup.
[2025-08-04T15:44:01.765] error: cannot setup the scope for cgroup
[2025-08-04T15:44:01.765] error: Unable to initialize cgroup plugin
[2025-08-04T15:44:01.765] error: slurmd initialization failed
root@slurmctld:/var/log/slurm# ^C
root@slurmctld:/var/log/slurm# 
logout
Connection to localhost closed.

The line "cgroup /sys/fs/cgroup contains pids from outside of our pid namespace, so we cannot manage this cgroup" seems to be the root cause. So I created a cgroup.

I tried creating a new directory and bind-mounting it into the container, but slurmd didn't find the systemd hierarchy (which it needs) in the cgroup folder.

I tried --pid=host, but that also didn't work.

Then I decided to search for ways to create a separate cgroup space for the container, and found this: https://serverfault.com/questions/1053187/systemd-fails-to-run-in-a-docker-container-when-using-cgroupv2-cgroupns-priva/1054414#1054414

@VindeeR, I think this might be better if it works for you. It gives the container its own space (slice) and makes the Docker command line a bit shorter.

You just update your /etc/docker/daemon.json to match theirs (I also added "debug": true), then systemctl restart docker, and run the container again without the volume and with --cgroupns=private.

$ docker run --rm -it --cgroupns=private --privileged --hostname slurmctld --name slurm-container -p 2222:2222 autosubmit/slurm-openssh-container:25-05-0-1

And update your jobs, then:

  • Create required directories (ssh root@localDocker mkdir -p /run/user/group/root)
  • Run your experiment

With that done, the whole experiment finishes in seconds!!! Before, I had to wait minutes for each test :) It'll save a lot of time troubleshooting future problems like this.

@manuel-g-castro
Contributor

Hi! Great news that you managed to get it working.

I have a question: why do you need to create the folder /run/user/group/root?

Also, for future reference, I will clarify the following comment:

> The 6818 I think is supposed to be the slurm worker

It is the default port for communication with the Slurm daemon (slurmd), the process that runs on every node, launches the jobs, and checks whether they are running.

The other default ports used in this image are 6817, for slurmctld, the controller that aggregates all the information from the nodes and queues and allocates resources to jobs; and 6819, for the accounting daemon (slurmdbd), the interface between slurmctld and the database solutions (MySQL, plain text files, etc.).

@kinow
Member Author

kinow commented Aug 4, 2025

I had to create that folder due to the platform settings. Autosubmit expects the scratch + project + user folder to exist and doesn't create it if it doesn't, so my test experiment failed to run.

Thanks for the info about the ports!

@kinow
Member Author

kinow commented Aug 5, 2025

Interestingly, the issue hasn't happened yet for me in the Docker container. Maybe something in MN5 contributes to this issue happening (do you know if it has ever happened in LUMI, @youcheng-csc?).

$ autosubmit recovery o002 \
    --all -f -s -np && autosubmit -lc DEBUG run o002 ; \
  autosubmit -lc DEBUG cat-log o002 -f o

I had saved my pickle file and DB from when the issue happened in my local development environment last week, and I compared the current one I have (working) with that one (broken).

[screenshots]

Summary of changes in the broken one:

  • No wallclock
  • No tasks, processors, memory...
  • id is zero
  • no remote logs
  • status is 5 (-1 in good)
  • prev and new status are 0 (4 and 5 in good)
  • no write_start
  • check is true
  • total jobs and max waiting are zero (20 in good)
  • no submit time
  • no _script
  • wrapper name is blank
  • no wallclock_in_seconds (a property that relies on wallclock, so that makes sense)
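
In case it helps to redo this comparison later, here is a rough sketch of how the two saved pickles can be diffed attribute by attribute. The file paths are placeholders, unpickling requires the Autosubmit classes to be importable, and each pickle is assumed to deserialize to a list of job objects (adapt it if yours wraps them differently).

# Rough sketch for diffing a "good" and a "broken" job pickle attribute by attribute.
import pickle

def load_jobs(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)

def job_attributes(job):
    # Collect attribute names from both __dict__ and any __slots__ in the class hierarchy.
    names = set(getattr(job, "__dict__", {}))
    for cls in type(job).__mro__:
        names.update(getattr(cls, "__slots__", ()))
    return names

good_jobs = load_jobs("backup/job_list_good.pkl")      # placeholder path
broken_jobs = load_jobs("backup/job_list_broken.pkl")  # placeholder path

for good, broken in zip(good_jobs, broken_jobs):
    for attr in sorted(job_attributes(good) | job_attributes(broken)):
        g = getattr(good, attr, "<missing>")
        b = getattr(broken, attr, "<missing>")
        if g != b:
            print(f"{getattr(good, 'name', '?')}.{attr}: good={g!r} broken={b!r}")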

@kinow
Member Author

kinow commented Aug 5, 2025

This afternoon I started "hacking" some objects to force the code to go through certain parts (e.g. instead of checking whether there are failures, forcing it to believe there are).

I managed to get a job_list object very similar to what I had when the issue happened, by hacking this part of the code:

if self.status == Status.FAILED and self.fail_count >= self.retrials:
    return None

After deleting that if, all the attributes were deleted.

Now, the part I still don't get is that, when the bug happened, I didn't have write_start in the Job object. It's specified in __slots__, and I can't think (right now) of how we can end up without it 💫 But it may be related to what's causing this issue...
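
One generic Python detail that might explain part of it (this is standard __slots__ behaviour, not something specific to our Job class): a name listed in __slots__ only reserves a descriptor, and the attribute does not exist on the instance until something assigns it, so an object can be rebuilt without write_start ever being set.

# Generic Python illustration, not Autosubmit code: a slot that is never assigned
# does not exist on the instance, even though it is declared in __slots__.
class Job:
    __slots__ = ("name", "write_start")

    def __init__(self, name):
        self.name = name  # write_start is deliberately never assigned here

job = Job("a000_SIM_1")
print(hasattr(job, "write_start"))  # False
try:
    job.write_start
except AttributeError as err:
    print("AttributeError:", err)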

@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch from 57ef611 to 014cd93 Compare August 6, 2025 07:54
@kinow
Member Author

kinow commented Aug 7, 2025

Not much progress on it today due to meetings (3 hours uninterrupted 😓). I tested setting some tasks to DELAYED, and with different retries/retrials. I am able to produce some interesting scenarios when messing with the WALLCLOCK of the platform and of the wrapper. I will check in Cheng's experiments whether the wallclock of the jobs and the wrapper were close to the platform's defined max wallclock.

@kinow
Member Author

kinow commented Aug 7, 2025

@kellekai had this same problem in a26i. I copied his experiment and patched my local o002, removing EXTEND_WALLCLOCK from the wrapper (it's in Cheng's; Kai's doesn't have it), and setting retrials, check on submission, nodes, and queue in the job. Ran it a couple of times, hopeful that this would trigger the error locally, but it didn't 😞

@kinow
Member Author

kinow commented Aug 7, 2025

Copied the MARENOSTRUM5 platform from Kai's experiment, changed user & host, then updated my SIM to use it; ran create, run, then recovery + run a couple of times, but had no success replicating the problem.

Looking at Kai's job package DB, both tables (wrappers and jobs) apparently are correctly populated. So that's not part of the bug, which is what I was suspecting before. 🤔

@kinow
Member Author

kinow commented Aug 11, 2025

Tried running the experiment until Slurm jobs were submitted, then killed Autosubmit and submitted a new experiment with create (recovery stops the jobs).

Did that a few (4 or 5) times with MN5, submitting to gp_bsces. I was not able to reproduce the bug that way.

Then I simply ran create and started run + recovery. The problem did not happen, but the plots of the wrappers were not consistent.

[screenshots]

@kellekai's experiment appears to be consistently failing, so now I will spend some time inspecting his Pickle and DB files, and reading most of his recent logs to see if that gives me any clue.

I also tried to use @manuel-g-castro's new Slurm container to launch multiple jobs locally, which would be faster, but we have an issue running the new trixie image on my laptop with Ubuntu 24.04 (I think Manuel mentioned @pablogoitia has the same problem). So all I have for now is MN5 and the old Slurm container, with a single job executed each time.

@kinow
Member Author

kinow commented Aug 12, 2025

Had an idea last night and managed to reproduce the bug again this morning. I started mapping possible scenarios for the bug, and started setting breakpoints/bookmarks to simulate each one.

My first option was to look at job_list's job_package_map and make it empty. On my first try the issue happened, but before I had even made it empty.

I set a breakpoint in save_wrappers, right at its top, on the first if, and exited the debugger immediately (was just checking the frame variables).

Then, I started a new run without recovery. And the bug happened. I took notes of several variables in the debugger, and made more notes about how it would be possible to reach this combination of variables. Will continue working on this scenario today, maybe simulating a few things by hacking the code.

@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch from 1569447 to 23099b2 Compare August 22, 2025 10:43
@kinow
Member Author

kinow commented Aug 22, 2025

Thanks for the review, @dbeltrankyl. I think I addressed all your comments by agreeing with them, deleting TODOs that you had already answered, or creating issues.

Just this one, https://github.com/BSC-ES/autosubmit/pull/2474/files#r2292870824, I couldn't answer very well, I think.

Could you have another look, please, @dbeltrankyl ?

I rebased and pushed the changes, which simply remove the TODO/FIXME comments you answered. If an intermittent test fails (test_send_data or one in test_slurm), I'll kick the build again and then it should pass.

Thanks

@dbeltrankyl
Collaborator

Done! Thanks @kinow

@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch from 23099b2 to b93660f Compare August 26, 2025 06:35
@kinow
Member Author

kinow commented Aug 26, 2025

Rebased, still pending to look at one comment from Dani's feedback.

@kinow
Member Author

kinow commented Sep 15, 2025

Rebased, but I will hold until #2577 is merged, then rebase this one again; after that, this PR should have a lot fewer changes.

@dbeltrankyl
Collaborator

This should be done for 4.1.16. It was already working but blocked by #2577

@dbeltrankyl
Collaborator

> This should be done for 4.1.16. It was already working but blocked by #2577

What I'm not sure about is when #2577 will be ready. Maybe we can merge this one without the other, as it is a DestinE bug. cc @kinow

@kinow
Member Author

kinow commented Oct 22, 2025

#2551 should be ready to be reviewed once @VindeeR has time to finish reviewing #2604 (which contains tests for some of his changes). Then I will rebase this one, and we should be ready to review/merge it.

@kinow
Member Author

kinow commented Oct 23, 2025

Will rebase and fix issues tomorrow 👍 After that, I will jump to the disappearing HPC variables, if nothing else comes up.

@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch 8 times, most recently from 3675d67 to 851b66e Compare October 27, 2025 13:53
@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch 2 times, most recently from 3dc0ec3 to 9c40372 Compare November 19, 2025 22:16
@kinow kinow marked this pull request as draft November 19, 2025 22:16
@kinow
Member Author

kinow commented Nov 19, 2025

Rebased, but there are more changes to be moved to another branch. Will start doing that tomorrow.

…eturn a list.

- Added mypy types
- Added unit and integration tests
- Changed the order in which ``JobList.save_wrappers`` is called in
  ``autosubmit/autosubmit.py``
- Added docstrings, and fixed linter errors (to debug/see code
  execution)
@kinow kinow force-pushed the CSCDESTINCLIMADT-794 branch from 9c40372 to 22feacd Compare November 19, 2025 22:21


Successfully merging this pull request may close these issues:

[CRITICAL] Unexpected error: 'list' object has no attribute 'status'
