Skip to content

TMP: Clean up self hosted runaways #95

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 21 commits into from

Conversation

dennisameling
Copy link
Contributor

Companion of #93 for testing CI.

dennisameling and others added 21 commits November 22, 2023 10:23
…ariable

The image SKU was hidden a bit in the azure-arm-template.json. Let's
make it more visible by turning it into a parameter that the CI pipeline
can provide.

Signed-off-by: Dennis Ameling <[email protected]>
Version 23h2 was released on October 31st, 2023. Let's update to it,
so that the runners benefit from the latest improvements to Windows 11
ARM64.

Signed-off-by: Dennis Ameling <[email protected]>
Sometimes, VMs don't get deleted properly. Let's set up a script to
clean up VMs that were created more than 3 hours ago. This script
runs every 6 hours (cron) and we can also invoke it manually if
needed.

Signed-off-by: Dennis Ameling <[email protected]>
This resurrects a GitHub workflow I had originally asked to keep out of
#60,
when I was concerned that we might delete runner VMs while they are
still busy running a workflow.

However, in the meantime it happened already twice that I had missed a
run-away VM that had not been cleaned up for over a week, which did put
a worrisome dent into my Azure credits (and I cannot risk running out of
those because a lot of stuff depends on it, e.g. GitGitGadget).

So let's merge the commit as-is, in preparation for adjusting it a
little bit further in accordance with the experience we gained in the
meantime.

Signed-off-by: Johannes Schindelin <[email protected]>
…o large

In general, if..else..fi constructs are more readable when the first
block is shorter than the second, i.e. to handle the "easy" things
first; According to Cognitive Load Theory, this uses less mental
resources when wrapping our head around the code.

In this instance, there is an even better reason: The next commit will
add _another_ condition, and having it in this order will avoid adding
another nesting level:

	if too young
		skip
	else if busy
		skip
	else
		force delete VM
	fi

Signed-off-by: Johannes Schindelin <[email protected]>
In Git for Windows' automation, we do a lot with Javascript, but
sometimes it is also convenient to use the GitHub CLI (`gh`) in a shell
script.

Sadly, the scope of the `GITHUB_TOKEN` provided in GitHub workflows is
often too limited, and we'd like to authenticate as a GitHub App
instead. Even more sadly, authenticating with `gh` this way is quite
complicated and fraught with problems because the token _needs_ to be
masked in the GitHub workflow logs.

Rejoice! This patch brings a shell script that hides all that nasty
complexity. All it needs are the environment variables:

- GH_APP_ID

- GH_APP_PRIVATE_KEY

- GITHUB_REPOSITORY

and of course `gh` on the `PATH`. That's it!

Signed-off-by: Johannes Schindelin <[email protected]>
So far, we avoid deleting VMs when they are too young.

But we can do better than that: since the VMs are intended to host an
ephemeral runner, and that runner is registered with the current
repository, we can have a look whether GitHub says that the runner is
busy. If it is, well, let's leave it a-running!

Since this query requires a larger scope than the standard GitHub
workflows' `GITHUB_TOKEN` provides, we need to authenticate as the
GitForWindowsHelper GitHub App to do that. This is too complicated,
unfortunately, in a GitHub workflow (at least without the risk of using
non-official GitHub Actions), therefore we use the newly-added shell
script to authenticate the GitHub CLI as a GitHub App (which of course
now requires us to check out the repository).

Signed-off-by: Johannes Schindelin <[email protected]>
The original idea was to wait at least 3 hours before deleting a VM, to
avoid interfering with long-running workflows.

Now that we have a way to query whether the runner _is_ busy, we do not
need to wait that long. Basically, we only need to ensure that the
runner had a chance to register itself and to then pick up the job.
Typically, this takes around 6-7 minutes, so waiting for an hour is
probably even a tad conservative.

Signed-off-by: Johannes Schindelin <[email protected]>
When logging output that we're deleting a VM, it should be marked a bit
more prominently. Let's mark it as a warning:
https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#setting-a-warning-message

This marks it in red in the output.

Conversely, we want to mention when we skipped the deletion, and to make
those messages stick out a bit (but less than warnings), let's use:
https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#setting-a-notice-message

Signed-off-by: Johannes Schindelin <[email protected]>
@dscho dscho closed this Sep 15, 2024
@dscho dscho deleted the clean-up-self-hosted-runaways branch September 15, 2024 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants