Scheduled profiling runs from GitHub Actions `run-profiling.yaml` workflow intermittently failing #1277
Comments
/run profiling

Profiled run of the model failed ❌ 🆔 21780698857
Looks like the change of the frame recording interval from 0.1 to 0.2 seconds in #1278 wasn't sufficient to fix the out of memory issue 😞 as a manually triggered run still failed with error code 247 in the 'rendering HTML output' step.
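For reference, a minimal sketch of how the sampling interval is typically set with `pyinstrument` (the `interval` argument is the real API; `run_simulation` and the output path are placeholders, and the actual wiring in `run_profiling.py` may differ):

```python
from pyinstrument import Profiler

# A larger sampling interval records fewer frames and should give a smaller
# profile to render; 0.2 mirrors the change made in #1278.
profiler = Profiler(interval=0.2)

profiler.start()
run_simulation()  # placeholder for the profiled TLOmodel run
profiler.stop()

# Rendering the HTML report is the step that appears to exhaust memory.
with open("profile.html", "w") as f:
    f.write(profiler.output_html())
```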
Can we turn off the progress bar output, please - my laptop/browser is struggling to load the stdout for the job.
They are the same machines, unfortunately. Do you have a sense of how much memory is required to do the profiling?
Yes, will do - I think the log output, which is always shown if the progress bar is not enabled, is potentially even more verbose though, so probably best to disable the progress bar and redirect the output.
I'll try to do some runs locally to see if I can get an idea of how much memory is required and how the recording interval affects the peak memory usage. Given some runs are successfully completing, I was surprised that doubling the interval didn't solve it, as I naively assumed this would roughly halve the memory requirements and it seemed like we were probably not too far over the memory available. As we're running out of memory when writing the HTML output rather than during the actual profiling, it might be that the peak memory usage is more a function of the maximum call stack depth, as this is what determines the complexity of the HTML output.
Piping to /dev/null seems like an easy solution in this case.
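If shell-level redirection turns out to be awkward in the workflow, one possible alternative (a sketch only; `run_simulation` is a placeholder for whatever emits the noisy output) is to silence stdout from inside the script:

```python
import contextlib
import os

# Equivalent in effect to `> /dev/null` at the shell: discard anything the
# simulation writes to stdout while it runs.
with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
    run_simulation()  # placeholder for the profiled run
```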
Looking at https://github.com/joerick/pyinstrument/blob/main/pyinstrument/renderers/jsonrenderer.py#L53, it seems to suggest it's building the JSON string for the entire frame tree in one recursive call. It would have to be fairly big to use all the memory on the machine, though. Each runner would have, on average, ~2GB. None of the tests use that much memory, even if all the other runners were active while the profiling was running.
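One way to check whether the rendered output really is that big would be something along these lines (a sketch assuming the standard `pyinstrument` API; the profiled workload is a placeholder):

```python
from pyinstrument import Profiler
from pyinstrument.renderers import JSONRenderer

# Render the profile to a string after a local run and check its size, to
# gauge how much memory just holding the rendered output would need.
profiler = Profiler(interval=0.1)
profiler.start()
run_simulation()  # placeholder for the real workload
profiler.stop()

rendered = JSONRenderer().render(profiler.last_session)
print(f"rendered JSON is {len(rendered) / 1024**2:.1f} MiB")
```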
Running locally with …

Maximum resident set sizes for runs with …
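(For anyone reproducing this locally, a minimal sketch of one way to record the peak memory of a run from inside the script; wrapping the whole process in GNU `/usr/bin/time -v`, which reports "Maximum resident set size", works equally well.)

```python
import resource

# Query the peak resident set size at the end of the run. On Linux ru_maxrss
# is reported in kibibytes (on macOS it is in bytes).
peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kib / 1024**2:.2f} GiB")
```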
Could we write our own HTML renderer to stream output to file rather than building the string? Is it for sure that step? Doesn't look so difficult. If we did it nicely, could do a PR upstream. (assuming infinite time here, of course)
From the logs it seems to be consistently failing in the lines src/scripts/profiling/run_profiling.py, lines 257 to 260 at 0914625, as we get the `"Writing {output_html_file}"` message but not the corresponding `"done"`. We could have a go at writing our own renderer; agree it doesn't look like it would be too difficult as the interface is nice and clean. Upstreaming something to `pyinstrument` …
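A rough sketch of what a streaming renderer might look like, writing each frame as it is visited instead of building one giant string. The frame attribute names used here (`function`, `file_path_short`, `line_no`, `time`, `children`) and the `last_session.root_frame()` usage are assumptions about `pyinstrument`'s internals and would need checking against the actual `Frame`/`Session` classes:

```python
import json

def write_frame(frame, fh):
    # Serialise one frame and recurse into its children, writing as we go so
    # only a small slice of the output text is in memory at any time. (For
    # very deep call stacks an explicit stack would avoid Python's recursion
    # limit, which is itself related to the depth issue discussed above.)
    fh.write("{")
    fh.write('"function": %s,' % json.dumps(frame.function))
    fh.write('"file": %s,' % json.dumps(frame.file_path_short))
    fh.write('"line": %d,' % frame.line_no)
    fh.write('"time": %f,' % frame.time)
    fh.write('"children": [')
    for i, child in enumerate(frame.children):
        if i:
            fh.write(",")
        write_frame(child, fh)
    fh.write("]}")

# Hypothetical usage after profiler.stop():
#     with open("profile.json", "w") as fh:
#         write_frame(profiler.last_session.root_frame(), fh)
```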
As the GitHub-provided Actions runners for public repositories now have 16GiB memory (https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/), we could potentially run the profiling on these runners rather than our self-hosted ones. As there is a 6 hour job time limit, this would still require us to reduce the total simulation length, as we're currently averaging a bit less than 10 hours for the profiling runs (it's possible things might run quicker on the GitHub runners compared to our Azure instances, particularly given they now have 4 vCPUs, but they could also be slower of course!). However, it looks like we might need to do this anyway to get things running robustly on our self-hosted runners, unless the decrease in recorded frame depth from excluding Pandas internal frames in #1236 reduces the memory usage when rendering the HTML.
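Back-of-envelope arithmetic for the time limit, assuming run time scales roughly linearly with the simulated period (which is optimistic for a model like this):

```python
current_hours = 10   # approximate current duration on self-hosted runners
limit_hours = 6      # GitHub-hosted runner job time limit
current_years = 5    # current simulation length

max_years = current_years * limit_hours / current_hours
print(f"~{max_years:.1f} simulated years to fit the 6 h limit")  # ~3.0
```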
/run profiling

Profiled run of the model succeeded ✅ 🆔 22895858097
Other than one failure due to other factors that was fixed in #1306, the scheduled profiling runs for the last couple of months have been completing successfully, so closing this for now. |
The scheduled runs of the `run-profiling.yaml` workflow are intermittently failing.

In most cases this appears to be due to an out of memory issue (the process exits with error code 247, which some searching suggests is related to the container running out of memory), for example
https://github.com/UCL/TLOmodel/actions/runs/7304903455/job/19907772489#step:4:16223

This seems to happen after the profiling run has completed, during the writing of the HTML output in the lines

src/scripts/profiling/run_profiling.py, lines 257 to 260 at 0914625

as we get the `"Writing {output_html_file}"` message printed but not the corresponding `"done"`. This is probably therefore the `pyinstrument` HTML renderer taking up too much memory for the large profiles generated. The simplest options to fix this would be to reduce the simulation length (currently 5 years), the population size (currently 50000) or the frame recording interval (currently 0.1s). Alternatively @tamuri, do any of the other self-hosted runners have more memory? Profiling is currently being run on runners with the `test` tag - not sure if those with the `tasks` tag have more resources and/or if it would make sense to use these instead.
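This is not the actual code at lines 257 to 260, just a guess at its shape based on the log messages quoted above; `output_html_file` and `profile_session` stand in for whatever names the script really uses:

```python
from pyinstrument.renderers import HTMLRenderer

print(f"Writing {output_html_file}", end="", flush=True)
with open(output_html_file, "w") as f:
    # render() builds the full HTML document as one in-memory string, which is
    # where the memory appears to blow up, so "done" is never printed.
    f.write(HTMLRenderer().render(profile_session))
print("done")
```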
In the most recent run a different error occurred. This looks like it might be due to the resource files that are under LFS not having been checked out correctly - it looks like the workflow doesn't set the `lfs` option of the `actions/checkout` action to `true`, though I'm not sure why this would have suddenly stopped working if this is the issue, unless there is some sort of caching going on.