Scheduled profiling runs from GitHub Actions run-profiling.yaml workflow intermittently failing #1277

Closed
matt-graham opened this issue Feb 16, 2024 · 16 comments

@matt-graham
Collaborator

The scheduled runs of the run-profiling.yaml workflow are intermittently failing.

In most cases this appears to be due to an out-of-memory issue (the process exits with error code 247, which some searching suggests is related to the container running out of memory), for example

https://github.com/UCL/TLOmodel/actions/runs/7304903455/job/19907772489#step:4:16223

This seems to happen after the profiling run has completed, during the writing of the HTML output in the lines

print(f"Writing {output_html_file}", end="...", flush=True)
with open(output_html_file, "w") as f:
f.write(html_renderer.render(scale_run_session))
print("done")

as we get the "Writing {output_html_file}" message printed but not the corresponding "done". This is therefore probably the pyinstrument HTML renderer taking up too much memory for the large profiles generated. The simplest fix would be to reduce the simulation length (currently 5 years), the population size (currently 50000) or the frame recording interval (currently 0.1s). Alternatively, @tamuri, do any of the other self-hosted runners have more memory? (Profiling is currently being run on runners with the test tag; I'm not sure if those with the tasks tag have more resources and/or if it would make sense to use these instead.)
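
For reference, a minimal sketch of the pattern involved (the busy-loop workload is just a stand-in for the profiled scale_run call; the actual wiring lives in run_profiling.py): the profiler samples the call stack every interval seconds, and the HTML renderer then builds the whole report as a single in-memory string before it is written out.

from pyinstrument import Profiler
from pyinstrument.renderers import HTMLRenderer

def workload():
    # Stand-in for the profiled scale_run simulation
    return sum(i * i for i in range(10**6))

profiler = Profiler(interval=0.2)  # larger interval -> fewer samples recorded
profiler.start()
workload()
profiler.stop()

# The whole HTML report is held in memory here before being written to disk
html = HTMLRenderer().render(profiler.last_session)
with open("profile.html", "w") as f:
    f.write(html)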

In the most recent run a different error occurred:

 [00:26:04:INFO] Starting profiling runs
Traceback (most recent call last):
  File "/runner/_work/TLOmodel/TLOmodel/src/scripts/profiling/run_profiling.py", line 389, in <module>
    run_profiling(**vars(args))
  File "/runner/_work/TLOmodel/TLOmodel/src/scripts/profiling/run_profiling.py", line 232, in run_profiling
    completed_simulation = scale_run(
                           ^^^^^^^^^^
  File "/runner/_work/TLOmodel/TLOmodel/src/scripts/profiling/scale_run.py", line 83, in scale_run
    sim.register(
  File "/runner/_work/TLOmodel/TLOmodel/.tox/profile/lib/python3.11/site-packages/tlo/simulation.py", line 168, in register
    module.read_parameters('')
  File "/runner/_work/TLOmodel/TLOmodel/.tox/profile/lib/python3.11/site-packages/tlo/methods/healthsystem.py", line 776, in read_parameters
    self.load_parameters_from_dataframe(pd.read_csv(
  File "/runner/_work/TLOmodel/TLOmodel/.tox/profile/lib/python3.11/site-packages/tlo/core.py", line 265, in load_parameters_from_dataframe
    resource.set_index('parameter_name', inplace=True)
  File "/runner/_work/TLOmodel/TLOmodel/.tox/profile/lib/python3.11/site-packages/pandas/core/frame.py", line 5859, in set_index
    raise KeyError(f"None of {missing} are in the columns")
KeyError: "None of ['parameter_name'] are in the columns"

This looks like it might be due to the resource files that are under LFS not having been checked out correctly - it looks like the workflow doesn't set the lfs option of the actions/checkout action to true, though I'm not sure why this would have suddenly stopped working, if this is the issue, unless there is some sort of caching going on.
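
If it is an LFS checkout problem, one quick check (a sketch; the resource file path below is illustrative only) is whether the files on the runner are still Git LFS pointer files, which are small text stubs starting with a fixed version line - pd.read_csv on such a stub would explain the missing parameter_name column:

from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path: Path) -> bool:
    # An un-smudged LFS file is a tiny text stub whose first line is the spec version
    with open(path, "rb") as f:
        return f.readline().startswith(LFS_POINTER_PREFIX)

# Illustrative path only - substitute the actual resource file read by healthsystem.py
print(is_lfs_pointer(Path("resources/ResourceFile_HealthSystem_parameters.csv")))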

@matt-graham
Collaborator Author

/run profiling

@ucl-comment-bot

Profiled run of the model failed ❌

🆔 21780698857
⏲️ 624 minutes
#️⃣ 8473a31

@matt-graham
Collaborator Author

Looks like the change of the frame recording interval from 0.1 to 0.2 seconds in #1278 wasn't sufficient to fix the out-of-memory issue 😞 as the manually triggered run still failed with error code 247 in the HTML output rendering step.

@tamuri
Collaborator

tamuri commented Feb 21, 2024

Can we turn off the progress bar output, please - my laptop/browser is struggling to load the std output for the job.

@tamuri
Collaborator

tamuri commented Feb 21, 2024

@tamuri, do any of the other self-hosted runners have more memory? (Profiling is currently being run on runners with the test tag; I'm not sure if those with the tasks tag have more resources and/or if it would make sense to use these instead.)

They are the same machines, unfortunately. Do you have a sense of how much memory is required to do the profiling?

@matt-graham
Collaborator Author

Can we turn off the progress bar output, please - my laptop/browser is struggling to load the std output for the job.

Yes, will do - I think the log output, which is always shown if the progress bar is not enabled, is potentially even more verbose though, so I guess it's probably best to disable the progress bar and redirect stdout to a file or /dev/null? Or should we add an option to disable printing logged output to stdout? I often find the captured stdout in tests to be a pain to navigate through and not usually all that helpful in diagnosing test failures.
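
A rough sketch of the redirect option, using only the standard library (the noisy function is just a stand-in for the simulation's log/progress output going to stdout):

import contextlib
import os

def noisy_simulation():
    # Stand-in for the profiled simulation, which writes per-event log lines to stdout
    for day in range(3):
        print(f"simulating day {day}")

# Everything printed inside the block is discarded; output after it is kept
with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
    noisy_simulation()

print("profiler status messages still visible")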

@matt-graham
Collaborator Author

They are the same machines, unfortunately. Do you have a sense of how much memory is required to do the profiling?

I'll try to do some runs locally to see if I can get an idea of how much memory is required and how the recording interval affects the peak memory usage. Given that some runs are successfully completing, I was surprised doubling the interval didn't solve this, as I naively assumed it would roughly halve the memory requirements and that we were probably not too far over the available memory. As we're running out of memory when writing the HTML output rather than during the actual profiling, it might be that the peak memory usage is more a function of the maximum call stack depth, as this is what determines the complexity of the HTML output.
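
One way to sanity-check the call-stack-depth hunch locally would be something along these lines (a sketch assuming pyinstrument's Session.root_frame() and Frame.children API; the workload is again just a stand-in for scale_run):

from pyinstrument import Profiler

def tree_stats(frame, depth=1):
    # Return (maximum depth, number of frames) of the recorded call tree
    max_depth, n_frames = depth, 1
    for child in frame.children:
        child_depth, child_frames = tree_stats(child, depth + 1)
        max_depth = max(max_depth, child_depth)
        n_frames += child_frames
    return max_depth, n_frames

profiler = Profiler(interval=0.1)
profiler.start()
sum(i * i for i in range(10**6))  # stand-in workload
profiler.stop()

root = profiler.last_session.root_frame()
if root is not None:
    print("max depth, frame count:", tree_stats(root))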

@tamuri
Collaborator

tamuri commented Feb 21, 2024

disable the progress bar and redirect stdout to a file or /dev/null

Piping to /dev/null seems the easiest solution in this case.

it might be that the peak memory usage is more a function of the maximum call stack depth, as this is what determines the complexity of the HTML output.

Looking at https://github.com/joerick/pyinstrument/blob/main/pyinstrument/renderers/jsonrenderer.py#L53, it seems it's building the JSON string for the entire frame tree in one recursive call. It would have to be fairly big to use all the memory on the machine, though. Each runner would have, on average, ~2GB. None of the tests use that much memory, even if all the other runners were active while the profiling was running.

@matt-graham
Collaborator Author

Running

/usr/bin/time -v python src/scripts/profiling/run_profiling.py --html --interval={interval} -y={years} -m={months}

locally with years = 0, months = 2, I get a maximum resident set size of 906MB for interval = 0.1 and 860MB for interval = 0.2, which suggests that changing the recording interval doesn't have a significant effect on peak memory usage, for this simulation length at least. I'm rerunning with years = 1, months = 0 to see if we get a similar pattern on a longer simulation, and also how peak memory usage scales with simulation length.
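
For anyone reproducing this, the same number /usr/bin/time -v reports as "Maximum resident set size" can be read from inside the process with the standard library's resource module (a sketch; note ru_maxrss is in kilobytes on Linux but bytes on macOS):

import resource

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS so far: {peak / 1024:.0f} MB")  # Linux units (KB); divide by 1024**2 on macOS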

@matt-graham
Collaborator Author

matt-graham commented Feb 21, 2024

Maximum resident set sizes for runs with years = 1 and months = 0 were 1701MB for interval = 0.1 and 1631MB for interval = 0.2. This suggests it's the simulation length that's the main determinant of peak memory usage, and that changing the recording interval is unlikely to make much difference. Linearly extrapolating, this would suggest around 5.5GB peak memory usage for interval = 0.1 and 5.3GB for interval = 0.2 at the default 5-year simulation length, which would explain why we're hitting out-of-memory issues. The easiest fix is therefore probably to reduce the simulation length to ~2 years, though this would still have a peak memory usage of around 2.5GB based on the above, so it might still be too resource heavy. Reducing it much further would, I think, be problematic, as we would mainly be profiling the initial model spin-up.
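
Spelling out the extrapolation above as a quick sketch (a straight line through the two measured points for each interval, evaluated at 5 years and 2 years):

def extrapolate(p1, p2, months):
    # Linear fit through two (simulation months, peak MB) measurements
    (m1, mb1), (m2, mb2) = p1, p2
    slope = (mb2 - mb1) / (m2 - m1)
    return mb1 + slope * (months - m1)

measurements = {0.1: ((2, 906), (12, 1701)), 0.2: ((2, 860), (12, 1631))}
for interval, points in measurements.items():
    five_year = extrapolate(*points, 60)
    two_year = extrapolate(*points, 24)
    print(f"interval={interval}: ~{five_year:.0f} MB at 5 years, ~{two_year:.0f} MB at 2 years")
# interval=0.1: ~5517 MB at 5 years, ~2655 MB at 2 years
# interval=0.2: ~5332 MB at 5 years, ~2556 MB at 2 years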

@tamuri
Collaborator

tamuri commented Feb 21, 2024

Could we write our own HTML renderer that streams output to file rather than building the string? Are we sure it's that step? It doesn't look too difficult. If we did it nicely, we could do a PR upstream (assuming infinite time here, of course).

@matt-graham
Collaborator Author

From the logs it seems to consistently be failing in the lines

print(f"Writing {output_html_file}", end="...", flush=True)
with open(output_html_file, "w") as f:
f.write(html_renderer.render(scale_run_session))
print("done")

as we get a "Writing *.html file..." message but no corresponding "done". Admittedly, there didn't seem to be a major peak in memory usage at that point when I was running locally, though maybe it was too quick to spot.

We could have a go at writing our own renderer; I agree it doesn't look like it would be too difficult, as the interface is nice and clean. Upstreaming something to pyinstrument would be good, but I've already got a probably much smaller PR on pyinstrument (joerick/pyinstrument#285) which has been sitting for a while, so I'm not sure the maintainer would have capacity to review something more significant like adding a new renderer.
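
For what it's worth, a very rough sketch of the streaming idea: recursively walk the session's frame tree and write JSON nodes straight to the open file handle, so the full document is never held in memory as one string. This assumes pyinstrument's Frame.function and Frame.children attributes and omits the timing and metadata fields a real renderer would need:

import json

def write_frame_tree(frame, out):
    # Write one node and its children directly to the file handle
    out.write('{"function": %s, "children": [' % json.dumps(frame.function))
    for i, child in enumerate(frame.children):
        if i:
            out.write(",")
        write_frame_tree(child, out)
    out.write("]}")

# Hypothetical usage with the session from the profiling script:
# with open("profile.json", "w") as f:
#     write_frame_tree(scale_run_session.root_frame(), f)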

@matt-graham
Collaborator Author

As the GitHub-provided Actions runners for public repositories now have 16GiB memory (https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/), we could potentially run the profiling on these runners rather than our self-hosted ones. As there is a 6-hour job time limit, this would still require us to reduce the total simulation length, as we're currently averaging a bit less than 10 hours for the profiling runs (it's possible things might run quicker on the GitHub runners compared to our Azure instances, particularly given they now have 4 vCPUs, but they could also be slower of course!). However, it looks like we might need to do this anyway to get things running robustly on our self-hosted runners, unless the decrease in the recorded frame depth in #1236, from excluding Pandas internal frames, reduces memory usage when rendering the HTML.

@matt-graham
Collaborator Author

/run profiling

@ucl-comment-bot

Profiled run of the model succeeded ✅

🆔 22895858097
⏲️ 570 minutes
#️⃣ 92e5b13

@matt-graham
Collaborator Author

Other than one failure due to other factors that was fixed in #1306, the scheduled profiling runs for the last couple of months have been completing successfully, so closing this for now.
