CSCDESTINCLIMADT-794: Fix 'list' object has no attribute 'status' #2474
base: master
Conversation
When you […]
When you run it, it again calls the same […]. When the jobs are wrapped, they have different IDs in […]. However, in […]
At that point, when the function goes to […]
That's why we normally don't see this bug happening. When you run the same experiment after it failed, without creating it first, in […]. Here, the code enters the second part of the […]. In the existing experiment, I have two jobs. One fails, then I recover (similar to what @youcheng-csc did), and re-launch it. One completed, one failed. But having a single failed job doesn't trigger the part of the code we want, with a […]. I increased […]:
```bash
# Fails on the second and third chunks
CHUNK="%CHUNK%"
if [ "$CHUNK" -eq 1 ]
then
echo "OK!"
else
echo "Uh oh"
crashit!
fi
```
Here, I used `setstatus` so I had one job failed and two waiting in the wrapper.
Still not able to reproduce Cheng's issue. Getting closer, but I can't figure out what makes the code enter that branch of the […].
Just before I cleaned everything, I spent some more minutes looking at the variables in different frames in the debugger, and taking note of the DB and its data. Now I'm really grateful I did, as the bug repeated every time I launched the debugger -- this indicates that this is not a random error, but one based on data/state. Before, I was chasing how […]. I had looked at the table in SQLite, and it was indeed empty. Now I cannot reproduce the bug anymore, but my […]
So now we just need to understand what caused my […].
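For the next time the broken state shows up, here is a minimal sketch of how the saved job list could be dumped for comparison. The pickle path and the assumption that the file holds a sequence of job entries with `name` and `status` attributes are mine, not something confirmed in this thread:

```python
import pickle
from pathlib import Path

# Hypothetical path: adjust the expid and pkl directory to the experiment being debugged.
PKL_FILE = Path("~/autosubmit/o002/pkl/job_list_o002.pkl").expanduser()

def dump_job_statuses(pkl_file: Path) -> None:
    """Print whatever the pickled job list holds, one entry per line."""
    with pkl_file.open("rb") as handle:
        jobs = pickle.load(handle)
    for job in jobs:
        # If the entries are job objects, show name/status; otherwise print them raw.
        name = getattr(job, "name", None)
        status = getattr(job, "status", None)
        print(f"{name}: {status}" if name is not None else repr(job))

if __name__ == "__main__":
    dump_job_statuses(PKL_FILE)
```

Running it on the working and the broken copies side by side should make the state differences visible without a debugger session.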
Possible scenarios for the error to happen: […]
And: […]
Tomorrow I will set breakpoints in the three locations linked above.
(force-pushed 6a5705c to 59ad37c)
I wasn't able to reproduce the bug today. I started by setting the breakpoints in the locations that I mentioned in my previous message, then did run + recovery + run, like the last time it happened. But this time the issue did not happen. I confirmed I am on the correct branch. The only differences are that MN5 could be in a different "mood", and I don't have as many breakpoints as I had the last time it happened. I'll start setting more breakpoints to see if I can reproduce the issue.
Having some difficulty testing this PR today, due to MN5. I'm preparing the Slurm container that we use for tests, as I believe that should suffice for what we need here, and it will be a lot faster to test things out.
My Slurm Docker container failed to run, but after troubleshooting it with @VindeeR, I managed to get it working.
At this point, @VindeeR & @manuel-g-castro were helping me as I couldn't get my Slurm to work. So I took a break and, when I came back, I read the logs calmly.
```
[2025-08-04T15:44:05.002] debug2: _slurm_connect: failed to connect to 10.88.0.14:6818: Connection refused
[2025-08-04T15:44:05.002] debug2: Error connecting slurm stream socket at 10.88.0.14:6818: Connection refused
```
The 10.88.0.14 is the Docker slurm container. The 6818 I think is supposed to be the slurm worker, but it didn't start. Looking at the other log file that Erick suggested inspecting:
```
# cat slurmd.log
[2025-08-04T15:44:01.765] error: The cgroup mountpoint does not align with the current namespace. Please, ensure all namespaces are correctly mounted. Refer to the slurm cgroup_v2 documentation.
[2025-08-04T15:44:01.765] error: We detected a pid 0 which means you are in a cgroup namespace and a mounted cgroup but with pids from the host that we're not allowed to manage.
[2025-08-04T15:44:01.765] error: cgroup /sys/fs/cgroup contains pids from outside of our pid namespace, so we cannot manage this cgroup.
[2025-08-04T15:44:01.765] error: cannot setup the scope for cgroup
[2025-08-04T15:44:01.765] error: Unable to initialize cgroup plugin
[2025-08-04T15:44:01.765] error: slurmd initialization failed
```
The line "cgroup /sys/fs/cgroup contains pids from outside of our pid namespace, so we cannot manage this cgroup" seems to be the root cause. So I created a cgroup. I tried creating a new directory and binding it into the container, but it didn't find the systemd hierarchy in the cgroup folder (needed). I tried […]. Then I decided to search for ways to create a separate cgroup space for the container, and found this: https://serverfault.com/questions/1053187/systemd-fails-to-run-in-a-docker-container-when-using-cgroupv2-cgroupns-priva/1054414#1054414
@VindeeR I think this might be better if that works for you. It gives the container its own space (slice) and makes the command line to run Docker a bit shorter. You just update your […]:
```
$ docker run --rm -it --cgroupns=private --privileged --hostname slurmctld --name slurm-container -p 2222:2222 autosubmit/slurm-openssh-container:25-05-0-1
```
And update your jobs […]
With that done, the whole experiment finishes in seconds!!! Before, I had to wait minutes for each test :) It'll save a lot of time troubleshooting future problems like this.
Hi! Great news that you managed to get it working. I have a question: why do you need to create the folder […]?
Also, for future reference, I will clarify the following comment:
It is the default port for communication of the Slurm daemon (slurmd), the process that runs on every node and launches jobs and checks if the process is running. The other default ports used in this image are 6817, for slurmctld, the controller that aggregates all the information from each of the nodes and queues, and allocates resources to jobs; and 6819 for the accounting daemon (slurmdbd), the interface between slurmctld and all the database solutions (mysql, linux files, etc.).
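As a quick illustration of how one could check which of these daemons is actually listening, here is a small sketch; the hostname comes from the logs earlier in this thread, and the port-to-daemon mapping is the one described above, so treat both as assumptions for your own setup:

```python
import socket

# Ports described above; the hostname is the container IP from the earlier logs (an assumption).
SLURM_PORTS = {6817: "slurmctld", 6818: "slurmd", 6819: "slurmdbd"}
HOST = "10.88.0.14"

def check_slurm_ports(host: str) -> None:
    """Try a TCP connection to each Slurm daemon port and report the result."""
    for port, daemon in SLURM_PORTS.items():
        try:
            with socket.create_connection((host, port), timeout=2):
                print(f"{daemon} ({port}): reachable")
        except OSError as err:
            # A 'Connection refused' here matches what the slurmctld log showed for slurmd.
            print(f"{daemon} ({port}): {err}")

if __name__ == "__main__":
    check_slurm_ports(HOST)
```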
I had to create that folder due to the platform settings. AS expects the scratch + project + user folder to exist and doesn't create it if it doesn't, which made my test experiment fail to run. Thanks for the info about the ports!
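For reference, a minimal sketch of pre-creating that directory layout before running the experiment; the exact path components below are hypothetical and should be taken from the platform section of the experiment configuration:

```python
from pathlib import Path

# Hypothetical values; replace with the platform's scratch dir, project, and user.
SCRATCH_DIR = Path("/scratch")
PROJECT = "myproject"
USER = "myuser"

# AS expects <scratch>/<project>/<user> to already exist on the platform.
run_dir = SCRATCH_DIR / PROJECT / USER
run_dir.mkdir(parents=True, exist_ok=True)
print(f"Created (or found) {run_dir}")
```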
Interestingly, the issue hasn't happened yet for me in the Docker container. Maybe something in MN5 contributes to this issue happening (do you know if it has ever happened in LUMI, @youcheng-csc?).
```
$ autosubmit recovery o002 \
    --all -f -s -np && autosubmit -lc DEBUG run o002 ; \
    autosubmit -lc DEBUG cat-log o002 -f o
```
I had saved my pickle file and DB from when the issue happened in my local development environment last week, and I compared the current one I have (working) with that one (broken).
Summary of changes in the broken one: […]
This afternoon I started "hacking" some objects, to force the code to go through some parts (e.g. instead of checking if there are failures, forcing it to believe there are). I managed to get a […] in autosubmit/autosubmit/job/job.py, lines 324 to 325 (at 6e436ca).
Deleting that […]. Now, the part I still don't get is that when the bug happened, I didn't have […]
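To make the failure mode concrete, here is a tiny illustration of how the error in this PR's title arises, and the kind of guard that avoids it. This is my own sketch of the symptom, not the actual code in job.py nor the fix in this PR:

```python
from enum import Enum

class Status(Enum):
    WAITING = "WAITING"
    FAILED = "FAILED"
    COMPLETED = "COMPLETED"

class Job:
    def __init__(self, name: str, status: Status) -> None:
        self.name = name
        self.status = status

def describe(entry) -> str:
    """Return a status string for a single job or for a list of wrapped jobs.

    Accessing `entry.status` when `entry` is actually a list of jobs raises
    AttributeError: 'list' object has no attribute 'status'. Checking for a
    list first avoids it.
    """
    if isinstance(entry, list):
        return ", ".join(f"{job.name}={job.status.value}" for job in entry)
    return f"{entry.name}={entry.status.value}"

# A wrapped entry holds several inner jobs instead of a single job object.
wrapped = [Job("sim_1", Status.COMPLETED), Job("sim_2", Status.FAILED)]
print(describe(wrapped))                       # handled as a list
print(describe(Job("post", Status.WAITING)))   # a single job still works
```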
(force-pushed 57ef611 to 014cd93)
Not much progress on it today due to meetings (3 hours uninterrupted 😓). I tested setting some tasks to DELAYED, and with different retries/retrials. I am able to produce some interesting scenarios when messing with the WALLCLOCK of the platform and of the wrapper. I will check in Cheng's experiments whether the wallclock of the jobs and the wrapper were close to the platform-defined max wallclock.
@kellekai had this same problem in a26i. I copied his experiment, and patched my local […]
Copied Kai's experiment's […]. Looking at Kai's job package DB, both tables (wrappers and jobs) are apparently correctly populated. So that's not part of the bug -- which is what I was suspecting before. 🤔
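For anyone repeating that check, a small sketch of dumping the tables of the job package database with Python's sqlite3 module; the file location is an assumption, and I deliberately read the table names from the file itself rather than hard-coding them:

```python
import sqlite3
from pathlib import Path

# Hypothetical location of the experiment's job packages database.
DB_FILE = Path("~/autosubmit/a26i/pkl/job_packages_a26i.db").expanduser()

def dump_tables(db_file: Path) -> None:
    """List every table in the SQLite file and print its row count and rows."""
    with sqlite3.connect(db_file) as conn:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        for table in tables:
            # Table names come straight from sqlite_master of a local, trusted file.
            rows = conn.execute(f"SELECT * FROM {table}").fetchall()
            print(f"{table}: {len(rows)} rows")
            for row in rows:
                print("  ", row)

if __name__ == "__main__":
    dump_tables(DB_FILE)
```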
Tried running the experiment until Slurm jobs were submitted, then killed Autosubmit and submitted a new experiment with create (recovery stops the jobs). Did that a few (4 or 5) times with MN5, submitting to gp_bsces. I was not able to reproduce the bug that way. Then I did a simple create, then started run + recovery. The problem did not happen, but the plots of the wrappers were not consistent.
@kellekai's experiment appears to be consistently failing, so now I will spend some time inspecting his Pickle and DB files, and reading most of his recent logs to see if that gives me any clue. I also tried to use @manuel-g-castro's new Slurm container to launch multiple jobs locally, which would be faster, but we have an issue with the new trixie image running on my laptop with Ubuntu 24.04 (I think Manuel mentioned @pablogoitia has the same problem). So all I have for now is MN5 and the old Slurm container with a single job executed each time.
Had an idea last night and managed to reproduce the bug again this morning. I started mapping possible scenarios for the bug, and started setting breakpoints/bookmarks to simulate each one. My first option was to look at […]. I set a breakpoint in […]. Then, I started a new […]
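A minimal example of the kind of conditional breakpoint I mean, using Python's built-in `breakpoint()`; the condition is just the suspicious situation from this bug (a list showing up where a single job is expected), not code taken from Autosubmit:

```python
def inspect_entry(entry):
    """Pass values through, but stop in the debugger for the scenario under study."""
    if isinstance(entry, list):
        breakpoint()  # drops into pdb only when the suspicious list appears
    return entry

# Normal values pass through untouched; a list would trigger the debugger.
inspect_entry("a_single_job")
```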
(force-pushed 1569447 to 23099b2)
Thanks for the review, @dbeltrankyl. I think I addressed all your comments by agreeing with them, deleting TODOs that you had already answered, or creating issues. Just this one, https://github.com/BSC-ES/autosubmit/pull/2474/files#r2292870824, I couldn't answer very well, I think. Could you have another look, please, @dbeltrankyl? I rebased and pushed the changes, which simply remove the TODO/FIXME comments you answered. If an intermittent test fails (test_send_data or one in test_slurm), then I'll kick the build and it should pass. Thanks!
Done! Thanks @kinow
(force-pushed 23099b2 to b93660f)
Rebased; still pending a look at one comment from Dani's feedback.
(force-pushed 9833154 to 097ce77)
Rebased, but will hold until #2577 is merged, then rebase this one again, and then this PR should have a lot fewer changes.
This should be done for 4.1.16. It was already working, but blocked by #2577.
Will rebase and fix issues tomorrow 👍 then after that I will jump to the disappearing HPC variables, if nothing else happens.
(force-pushed 3675d67 to 851b66e)
(force-pushed 3dc0ec3 to 9c40372)
Rebased, but there are more changes to be moved to another branch. Will start doing that tomorrow.
…return a list.
- Added mypy types
- Added unit and integration tests
- Changed the order at which ``JobList.save_wrappers`` is called in ``autosubmit/autosubmit.py``
- Added docstrings, and fixed linter errors (to debug/see code execution)
(force-pushed 9c40372 to 22feacd)















Closes #2463
Check List
- CONTRIBUTING.md.
- pyproject.toml.
- CHANGELOG.md if this is a change that can affect users.
- Closes #1234.