Scheduled profiling runs from GitHub Actions run-profiling.yaml workflow intermittently failing #1277

Closed
matt-graham opened this issue Feb 16, 2024 · 16 comments

@matt-graham
Collaborator

The scheduled runs of the run-profiling.yaml workflow are intermittently failing.

In most cases this appears to be due to an out-of-memory issue (the process exits with error code 247, which some searching suggests is related to the container running out of memory), for example

https://github.com/UCL/TLOmodel/actions/runs/7304903455/job/19907772489#step:4:16223

This seems to happen after the profiling run has completed, during the writing of the HTML output in the lines

print(f"Writing {output_html_file}", end="...", flush=True)
with open(output_html_file, "w") as f:
f.write(html_renderer.render(scale_run_session))
print("done")

as we get the "Writing {output_html_file}" message printed but not the corresponding "done". This is therefore probably the pyinstrument HTML renderer taking up too much memory for the large profiles generated. The simplest fix would be to reduce the simulation length (currently 5 years), the population size (currently 50000) or the frame recording interval (currently 0.1s). Alternatively, @tamuri, do any of the other self-hosted runners have more memory? (Profiling is currently being run on runners with the test tag; I'm not sure if those with the tasks tag have more resources and/or if it would make sense to use these instead.)
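
For reference, a minimal sketch of the pattern involved (the busy-loop workload is just a stand-in for the profiled scale_run call; the actual wiring lives in run_profiling.py): the profiler samples the call stack every interval seconds, and the HTML renderer then builds the whole report as a single in-memory string before it is written out.

from pyinstrument import Profiler
from pyinstrument.renderers import HTMLRenderer

def workload():
    # Stand-in for the profiled scale_run simulation
    return sum(i * i for i in range(10**6))

profiler = Profiler(interval=0.2)  # larger interval -> fewer samples recorded
profiler.start()
workload()
profiler.stop()

# The whole HTML report is held in memory here before being written to disk
html = HTMLRenderer().render(profiler.last_session)
with open("profile.html", "w") as f:
    f.write(html)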

In the most recent run a different error occurred:

 [00:26:04:INFO] Starting profiling runs
Traceback (most recent call last):
  File "/runner/_work/TLOmodel/TLOmodel/src/scripts/profiling/run_profiling.py", line 389, in <module>
    run_profiling(**vars(args))
  File "/runner/_work/TLOmodel/TLOmodel/src/scripts/profiling/run_profiling.py", line 232, in run_profiling
    completed_simulation = scale_run(
                           ^^^^^^^^^^
  File "/runner/_work/TLOmodel/TLOmodel/src/scripts/profiling/scale_run.py", line 83, in scale_run
    sim.register(
  File "/runner/_work/TLOmodel/TLOmodel/.tox/profile/lib/python3.11/site-packages/tlo/simulation.py", line 168, in register
    module.read_parameters('')
  File "/runner/_work/TLOmodel/TLOmodel/.tox/profile/lib/python3.11/site-packages/tlo/methods/healthsystem.py", line 776, in read_parameters
    self.load_parameters_from_dataframe(pd.read_csv(
  File "/runner/_work/TLOmodel/TLOmodel/.tox/profile/lib/python3.11/site-packages/tlo/core.py", line 265, in load_parameters_from_dataframe
    resource.set_index('parameter_name', inplace=True)
  File "/runner/_work/TLOmodel/TLOmodel/.tox/profile/lib/python3.11/site-packages/pandas/core/frame.py", line 5859, in set_index
    raise KeyError(f"None of {missing} are in the columns")
KeyError: "None of ['parameter_name'] are in the columns"

This looks like it might be due to the resource files that are under LFS not having been checked out correctly - it looks like the workflow doesn't set the lfs option of the actions/checkout action to true, though I'm not sure why this would have suddenly stopped working, if this is the issue, unless there is some sort of caching going on.
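
If it is an LFS checkout problem, one quick check (a sketch; the resource file path below is illustrative only) is whether the files on the runner are still Git LFS pointer files, which are small text stubs starting with a fixed version line - pd.read_csv on such a stub would explain the missing parameter_name column:

from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path: Path) -> bool:
    # An un-smudged LFS file is a tiny text stub whose first line is the spec version
    with open(path, "rb") as f:
        return f.readline().startswith(LFS_POINTER_PREFIX)

# Illustrative path only - substitute the actual resource file read by healthsystem.py
print(is_lfs_pointer(Path("resources/ResourceFile_HealthSystem_parameters.csv")))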

@matt-graham
Collaborator Author

/run profiling

@ucl-comment-bot

Profiled run of the model failed ❌

🆔 21780698857
⏲️ 624 minutes
#️⃣ 8473a31

@matt-graham
Collaborator Author

Looks like the change of the frame recording interval from 0.1 to 0.2 seconds in #1278 wasn't sufficient to fix the out-of-memory issue 😞 as the manually triggered run still failed with error code 247 in the HTML output rendering step.

@tamuri
Collaborator

tamuri commented Feb 21, 2024

Can we turn off the progress bar output, please - my laptop/browser is struggling to load the std output for the job.

@tamuri
Collaborator

tamuri commented Feb 21, 2024

@tamuri, do any of the other self-hosted runners have more memory? (Profiling is currently being run on runners with the test tag; I'm not sure if those with the tasks tag have more resources and/or if it would make sense to use these instead.)

They are the same machines, unfortunately. Do you have a sense of how much memory is required to do the profiling?

@matt-graham
Collaborator Author

Can we turn off the progress bar output, please - my laptop/browser is struggling to load the std output for the job.

Yes, will do - I think the log output, which is always shown if the progress bar is not enabled, is potentially even more verbose though, so I guess it's probably best to disable the progress bar and redirect stdout to a file or /dev/null? Or should we add an option to disable printing logged output to stdout? I often find the captured stdout in tests to be a pain to navigate through and not usually all that helpful in diagnosing test failures.
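
A rough sketch of the redirect option, using only the standard library (the noisy function is just a stand-in for the simulation's log/progress output going to stdout):

import contextlib
import os

def noisy_simulation():
    # Stand-in for the profiled simulation, which writes per-event log lines to stdout
    for day in range(3):
        print(f"simulating day {day}")

# Everything printed inside the block is discarded; output after it is kept
with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
    noisy_simulation()

print("profiler status messages still visible")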

@matt-graham
Collaborator Author

They are the same machines, unfortunately. Do you have a sense of how much memory is required to do the profiling?

I'll try to do some runs locally to see if I can get an idea of how much memory is required and how the recording interval affects the peak memory usage. Given that some runs are successfully completing, I was surprised doubling the interval didn't solve this, as I naively assumed it would roughly halve the memory requirements and that we were probably not too far over the available memory. As we're running out of memory when writing the HTML output rather than during the actual profiling, it might be that the peak memory usage is more a function of the maximum call stack depth, as this is what determines the complexity of the HTML output.
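
One way to sanity-check the call-stack-depth hunch locally would be something along these lines (a sketch assuming pyinstrument's Session.root_frame() and Frame.children API; the workload is again just a stand-in for scale_run):

from pyinstrument import Profiler

def tree_stats(frame, depth=1):
    # Return (maximum depth, number of frames) of the recorded call tree
    max_depth, n_frames = depth, 1
    for child in frame.children:
        child_depth, child_frames = tree_stats(child, depth + 1)
        max_depth = max(max_depth, child_depth)
        n_frames += child_frames
    return max_depth, n_frames

profiler = Profiler(interval=0.1)
profiler.start()
sum(i * i for i in range(10**6))  # stand-in workload
profiler.stop()

root = profiler.last_session.root_frame()
if root is not None:
    print("max depth, frame count:", tree_stats(root))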

@tamuri
Collaborator

tamuri commented Feb 21, 2024

disable the progress bar and redirect stdout to a file or /dev/null

Piping to /dev/null seems the easiest solution in this case.

it might be that the peak memory usage is more a function of the maximum call stack depth, as this is what determines the complexity of the HTML output.

Looking at https://github.com/joerick/pyinstrument/blob/main/pyinstrument/renderers/jsonrenderer.py#L53, it seems it's building the JSON string for the entire frame tree in one recursive call. It would have to be fairly big to use all the memory on the machine, though. Each runner would have, on average, ~2GB. None of the tests use that much memory, even if all the other runners were active while the profiling was running.

@matt-graham
Collaborator Author

Running

/usr/bin/time -v python src/scripts/profiling/run_profiling.py --html --interval={interval} -y={years} -m={months}

locally with years = 0, months = 2, I get a maximum resident set size of 906MB for interval = 0.1 and 860MB for interval = 0.2, which suggests that changing the recording interval doesn't have a significant effect on peak memory usage, for this simulation length at least. I'm rerunning with years = 1, months = 0 to see if we get a similar pattern on a longer simulation, and also how peak memory usage scales with simulation length.
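
For anyone reproducing this, the same number /usr/bin/time -v reports as "Maximum resident set size" can be read from inside the process with the standard library's resource module (a sketch; note ru_maxrss is in kilobytes on Linux but bytes on macOS):

import resource

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS so far: {peak / 1024:.0f} MB")  # Linux units (KB); divide by 1024**2 on macOS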

@matt-graham
Collaborator Author

matt-graham commented Feb 21, 2024

Maximum resident set sizes for runs with years = 1 and months = 0 were 1701MB for interval = 0.1 and 1631MB for interval = 0.2. This suggests it's the simulation length that's the main determinant of peak memory usage, and that changing the recording interval is unlikely to make much difference. Linearly extrapolating, this would suggest around 5.5GB peak memory usage for interval = 0.1 and 5.3GB for interval = 0.2 at the default 5-year simulation length, which would explain why we're hitting out-of-memory issues. The easiest fix is therefore probably to reduce the simulation length to ~2 years, though this would still have a peak memory usage of around 2.5GB based on the above, so it might still be too resource heavy. Reducing it much further would, I think, be problematic, as we would mainly be profiling the initial model spin-up.
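
Spelling out the extrapolation above as a quick sketch (a straight line through the two measured points for each interval, evaluated at 5 years and 2 years):

def extrapolate(p1, p2, months):
    # Linear fit through two (simulation months, peak MB) measurements
    (m1, mb1), (m2, mb2) = p1, p2
    slope = (mb2 - mb1) / (m2 - m1)
    return mb1 + slope * (months - m1)

measurements = {0.1: ((2, 906), (12, 1701)), 0.2: ((2, 860), (12, 1631))}
for interval, points in measurements.items():
    five_year = extrapolate(*points, 60)
    two_year = extrapolate(*points, 24)
    print(f"interval={interval}: ~{five_year:.0f} MB at 5 years, ~{two_year:.0f} MB at 2 years")
# interval=0.1: ~5517 MB at 5 years, ~2655 MB at 2 years
# interval=0.2: ~5332 MB at 5 years, ~2556 MB at 2 years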

@tamuri
Collaborator

tamuri commented Feb 21, 2024

Could we write our own HTML renderer that streams output to file rather than building the string? Are we sure it's that step? It doesn't look too difficult. If we did it nicely, we could do a PR upstream (assuming infinite time here, of course).

@matt-graham
Collaborator Author

From the logs it seems to consistently be failing in the lines

print(f"Writing {output_html_file}", end="...", flush=True)
with open(output_html_file, "w") as f:
f.write(html_renderer.render(scale_run_session))
print("done")

as we get a "Writing *.html file..." message but no corresponding "done". Admittedly, there didn't seem to be a major peak in memory usage at that point when I was running locally, though maybe it was too quick to spot.

We could have a go at writing our own renderer; I agree it doesn't look like it would be too difficult, as the interface is nice and clean. Upstreaming something to pyinstrument would be good, but I've already got a probably much smaller PR on pyinstrument (joerick/pyinstrument#285) which has been sitting for a while, so I'm not sure the maintainer would have capacity to review something more significant like adding a new renderer.
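
For what it's worth, a very rough sketch of the streaming idea: recursively walk the session's frame tree and write JSON nodes straight to the open file handle, so the full document is never held in memory as one string. This assumes pyinstrument's Frame.function and Frame.children attributes and omits the timing and metadata fields a real renderer would need:

import json

def write_frame_tree(frame, out):
    # Write one node and its children directly to the file handle
    out.write('{"function": %s, "children": [' % json.dumps(frame.function))
    for i, child in enumerate(frame.children):
        if i:
            out.write(",")
        write_frame_tree(child, out)
    out.write("]}")

# Hypothetical usage with the session from the profiling script:
# with open("profile.json", "w") as f:
#     write_frame_tree(scale_run_session.root_frame(), f)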

@matt-graham
Collaborator Author

As the GitHub-provided Actions runners for public repositories now have 16GiB memory (https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/), we could potentially run the profiling on these runners rather than our self-hosted ones. As there is a 6-hour job time limit, this would still require us to reduce the total simulation length, as we're currently averaging a bit less than 10 hours for the profiling runs (it's possible things might run quicker on the GitHub runners compared to our Azure instances, particularly given they now have 4 vCPUs, but they could also be slower of course!). However, it looks like we might need to do this anyway to get things running robustly on our self-hosted runners, unless the decrease in the recorded frame depth in #1236, from excluding Pandas internal frames, reduces memory usage when rendering the HTML.

@matt-graham
Collaborator Author

/run profiling

@ucl-comment-bot

Profiled run of the model succeeded ✅

🆔 22895858097
⏲️ 570 minutes
#️⃣ 92e5b13

@matt-graham
Collaborator Author

Other than one failure due to other factors that was fixed in #1306, the scheduled profiling runs for the last couple of months have been completing successfully, so closing this for now.
