Wato experiment #24

Open
wants to merge 11 commits into base: main
101 changes: 89 additions & 12 deletions README.md
@@ -20,26 +20,103 @@ flowchart LR

## Enabling Docker Within CI Jobs

We run our CI jobs inside a container, [actions-runner-image](https://github.com/WATonomous/actions-runner-image). These jobs also launch containers of their own, so we needed to run Docker within Docker.


```mermaid
graph RL
A[Docker Rootless Daemon] -->|Creates| B[Docker Rootless Socket]
C[Actions Runner Image] -->|Mounts| B
C -->|Creates| E[CI Helper Containers]
E -->|Mounts| B
```

Normally it would be a security risk to mount the Docker socket into a container, but because we use [Docker Rootless](https://docs.docker.com/engine/security/rootless/), this risk is mitigated.

> **Note:** All CI Docker commands share the same Docker socket and therefore the same filesystem, so you must configure the working directory of your runners accordingly. In our case this meant placing the working directory in ephemeral storage, via the `/tmp` folder, within a Slurm job.
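
As a rough illustration (the paths and image tag below are assumptions; the real invocation lives in `allocation_scripts/apptainer.sh`), the setup amounts to starting a rootless daemon and pointing every CI container at its socket:

```bash
# Minimal sketch: start a rootless Docker daemon and share its socket with the
# runner container. Paths and the image tag are illustrative, not our exact setup.
export XDG_RUNTIME_DIR=/tmp/run
dockerd-rootless.sh &    # the rootless daemon creates /tmp/run/docker.sock

# Mount the socket into the runner image so nested CI containers reuse the same daemon.
docker run --rm \
    -v /tmp/run/docker.sock:/tmp/run/docker.sock \
    -e DOCKER_HOST=unix:///tmp/run/docker.sock \
    ghcr.io/watonomous/actions-runner-image:latest
```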

## Speeding up the start time for our Actions Runner Image

After we were able to run the actions runner image as a Slurm job using [sbatch](https://slurm.schedmd.com/sbatch.html) and a [custom script](https://github.com/WATonomous/run-gha-on-slurm/blob/main/allocate-ephemeral-runner-from-docker.sh), we ran into the issue of having to pull the actions runner image for every job. From the time the script allocated resources to the time the job began was roughly 2 minutes. When you are running 70+ jobs in a workflow, with some jobs depending on others, this time adds up quickly.

Unfortunately, caching the image in our filesystem was not an elegant solution, because it would require mounting a filesystem directory into the Slurm job. To support multiple concurrent runners we would need multiple directories, which in turn would require a system to manage them and would introduce problems such as starvation and deadlocks.

This led us to investigate several options:
- [Docker pull through cache](https://docs.docker.com/docker-hub/mirror/)
- [Stargz Snapshotter](https://github.com/containerd/stargz-snapshotter)
- [Apptainer](https://apptainer.org/docs/user/main/index.html)

We decided to go with Apptainer, as it was the most straightforward solution. Apptainer lets us package a container image as a single image file that can be kept on the machine and reused as a cache. This allows us to pull the image once (per machine) and use it for all subsequent jobs.
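
The caching step might look roughly like this (the image tag and cache path are illustrative; our actual invocation in `allocation_scripts/apptainer.sh` runs the image from a CVMFS mount instead of a local file):

```bash
# Sketch: convert the OCI image into a local Apptainer image once per machine,
# then reuse it for every subsequent Slurm job. Tag and path are illustrative.
SIF=/tmp/actions-runner-image.sif
if [ ! -f "$SIF" ]; then
    apptainer pull "$SIF" docker://ghcr.io/watonomous/actions-runner-image:latest
fi

# Later jobs skip the pull entirely and start straight from the cached image.
apptainer exec --writable-tmpfs --containall "$SIF" /home/runner/run.sh
```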

## CVMFS ephemeral for caching our Provisioner image

Many of our jobs use a container we call the provisioner, a tool used to provision a service. This image is large, over 1.5 GB, and needs to be rebuilt on every CI run. The freshly built image is then used by the subsequent dependent jobs.

### Previous Solution
Previously, after the provisioner was built, we cached it in S3, via [rgw](https://docs.ceph.com/en/latest/man/8/radosgw/), within our Kubernetes cluster. This worked well while we ran [actions-runner-controller](https://github.com/actions/actions-runner-controller) in the same cluster: a pull took at most 1 hop, and only on a cache miss. Now that the actions runners run on our Slurm cluster, every pull takes at least 1 hop, and 2 on a cache miss.

```mermaid
flowchart TD
SLURM[SLURM Job Node] -->|1 hop| Kubernetes[Kubernetes Node]
Kubernetes -->|0 or 1 depending on cache| RGW[RGW Container]
```
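
Roughly, the old flow amounted to pushing the freshly built image into the RGW-backed bucket and pulling it back in dependent jobs (bucket name and endpoint below are assumptions for illustration):

```bash
# Sketch of the previous S3/RGW flow. Bucket name and endpoint are assumptions.
# After building, the CI job uploads the provisioner image to the RGW-backed bucket...
docker save provisioner:latest | gzip > provisioner.tar.gz
aws s3 cp provisioner.tar.gz s3://ci-cache/provisioner.tar.gz --endpoint-url "$RGW_ENDPOINT"

# ...and dependent jobs download and load it instead of rebuilding.
aws s3 cp s3://ci-cache/provisioner.tar.gz - --endpoint-url "$RGW_ENDPOINT" | gunzip | docker load
```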

### New Idea
We decided to leverage a [CVMFS ephemeral server](https://github.com/WATonomous/cvmfs-ephemeral/) to host ephemeral files such as our provisioner image. In this setup, the provisioner image is cached in a CVMFS repository hosted on a high-performance node in the Kubernetes cluster. By avoiding the Kubernetes RGW cache, which is deployed on a separate, less performant node, we reduce the time it takes to pull the provisioner image.


```mermaid
flowchart TD
SLURM[SLURM Job Node] -->|1 hop| CVMFS[CVMFS Ephemeral Server]
```
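
On the Slurm side, consuming the cached image then becomes a plain read from the CVMFS mount. The repository name and file layout below are assumptions for illustration:

```bash
# Sketch: a Slurm job loads the cached provisioner image straight from the CVMFS mount.
# The repository name and file layout are assumptions, not our exact setup.
PROVISIONER_TARBALL=/cvmfs/ephemeral.example.org/ci-cache/provisioner-latest.tar.gz

gunzip -c "$PROVISIONER_TARBALL" | docker load
```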


## Deployment to Kubernetes

We deployed this on our self-hosted Kubernetes cluster, via the Docker image in this repo. To communicate with the GitHub API, an access token is needed. For our use case, a [personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#about-personal-access-tokens), which provides 5,000 requests per hour, was sufficient. It was provided to the deployment as a Kubernetes environment variable.
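
For example, the access token can be exchanged for short-lived runner registration tokens via the standard GitHub API endpoint (the `GITHUB_PAT` variable name is an assumption):

```bash
# Sketch: exchange the personal access token for a short-lived runner registration token.
# GITHUB_PAT is supplied to the pod as a Kubernetes environment variable.
curl -s -X POST \
    -H "Authorization: Bearer $GITHUB_PAT" \
    -H "Accept: application/vnd.github+json" \
    "https://api.github.com/repos/WATonomous/infra-config/actions/runners/registration-token"
```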

To enable communication with the Slurm controller, we set up a [munge key](https://dun.github.io/munge/). The Python script can then allocate an actions runner by triggering a bash script with `sbatch`.
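
A minimal sketch of that allocation step, with illustrative values (the exported variables mirror those consumed by `allocation_scripts/apptainer.sh`):

```bash
# Sketch: the Python service submits the allocation script to Slurm via sbatch.
# REPO_URL and REGISTRATION_TOKEN are read by allocation_scripts/apptainer.sh.
sbatch --export=ALL,REPO_URL="https://github.com/WATonomous/infra-config",REGISTRATION_TOKEN="$REGISTRATION_TOKEN" \
    allocation_scripts/apptainer.sh
```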


# Speed comparison

## Before:
### User ingestion:
https://github.com/WATonomous/infra-config/actions/runs/12427574689
https://github.com/WATonomous/infra-config/actions/runs/12427547885
https://github.com/WATonomous/infra-config/actions/runs/12424954086
https://github.com/WATonomous/infra-config/actions/runs/12424337584
https://github.com/WATonomous/infra-config/actions/runs/12422341445
https://github.com/WATonomous/infra-config/actions/runs/12420112260

### Master - scheduled:
https://github.com/WATonomous/infra-config/actions/runs/12553360748
https://github.com/WATonomous/infra-config/actions/runs/12530736888
https://github.com/WATonomous/infra-config/actions/runs/12521935862
https://github.com/WATonomous/infra-config/actions/runs/12509665511
https://github.com/WATonomous/infra-config/actions/runs/12487314642
https://github.com/WATonomous/infra-config/actions/runs/12449453195
https://github.com/WATonomous/infra-config/actions/runs/12440178947
https://github.com/WATonomous/infra-config/actions/runs/12422821414

## After:
### User ingestion:
https://github.com/WATonomous/infra-config/actions/runs/12854845210
https://github.com/WATonomous/infra-config/actions/runs/12851571085
https://github.com/WATonomous/infra-config/actions/runs/12850108807
https://github.com/WATonomous/infra-config/actions/runs/12696207371
https://github.com/WATonomous/infra-config/actions/runs/12682617238

### Master - scheduled:
https://github.com/WATonomous/infra-config/actions/runs/12848528449
https://github.com/WATonomous/infra-config/actions/runs/12838792205
https://github.com/WATonomous/infra-config/actions/runs/12819860145
https://github.com/WATonomous/infra-config/actions/runs/12799262666
https://github.com/WATonomous/infra-config/actions/runs/12778673306
https://github.com/WATonomous/infra-config/actions/runs/12738272633

## Next Steps

### References
1. [Docker Rootless](https://docs.docker.com/engine/security/rootless/)
Binary file added after/master-scheduled/12738272633
Binary file added after/master-scheduled/12778673306
Binary file added after/master-scheduled/12799262666
Binary file added after/master-scheduled/12819860145
Binary file added after/master-scheduled/12838792205
Binary file added after/master-scheduled/12848528449
Binary file added after/user-ingestion/12682617238
Binary file added after/user-ingestion/12696207371
Binary file added after/user-ingestion/12696207371.json
Binary file added after/user-ingestion/12850108807
Binary file added after/user-ingestion/12850108807.json
Binary file added after/user-ingestion/12851571085
Binary file added after/user-ingestion/12851571085.json
Binary file added after/user-ingestion/12854845210
Binary file added after/user-ingestion/12854845210.json
2 changes: 1 addition & 1 deletion allocation_scripts/apptainer.sh
@@ -54,7 +54,7 @@ export ACTIONS_RUNNER_IMAGE="/cvmfs/unpacked.cern.ch/ghcr.io/watonomous/actions-

log "INFO Starting Apptainer container and configuring runner"

apptainer exec --writable-tmpfs --containall --fakeroot --bind /dev/fuse --bind /tmp/run/docker.sock:/tmp/run/docker.sock --bind /cvmfs:/cvmfs --bind /tmp:/tmp "$ACTIONS_RUNNER_IMAGE" /bin/bash -c "export DOCKER_HOST=unix:///tmp/run/docker.sock && export RUNNER_ALLOW_RUNASROOT=1 && export PYTHONPATH=/home/runner/.local/lib/python3.10/site-packages && /home/runner/config.sh --work \"${GITHUB_ACTIONS_WKDIR}\" --url \"${REPO_URL}\" --token \"${REGISTRATION_TOKEN}\" --labels \"${LABELS}\" --name \"slurm-${SLURMD_NODENAME}-${SLURM_JOB_ID}\" --unattended --ephemeral && /home/runner/run.sh && /home/runner/config.sh remove --token \"${REMOVAL_TOKEN}\""
apptainer exec --writable-tmpfs --containall --fakeroot --bind /dev/fuse --bind /tmp/run/docker.sock:/tmp/run/docker.sock --bind /cvmfs:/cvmfs --bind /mnt/wato-drive:/mnt/wato-drive --bind /tmp:/tmp "$ACTIONS_RUNNER_IMAGE" /bin/bash -c "export DOCKER_HOST=unix:///tmp/run/docker.sock && export RUNNER_ALLOW_RUNASROOT=1 && export PYTHONPATH=/home/runner/.local/lib/python3.10/site-packages && /home/runner/config.sh --work \"${GITHUB_ACTIONS_WKDIR}\" --url \"${REPO_URL}\" --token \"${REGISTRATION_TOKEN}\" --labels \"${LABELS}\" --name \"slurm-${SLURMD_NODENAME}-${SLURM_JOB_ID}\" --unattended --ephemeral && /home/runner/run.sh && /home/runner/config.sh remove --token \"${REMOVAL_TOKEN}\""

log "INFO Runner removed (Duration: $(($end_time - $start_time)) seconds)"

Binary file added before/master-scheduled/12422821414
Binary file added before/master-scheduled/12440178947
Binary file added before/master-scheduled/12449453195
Binary file added before/master-scheduled/12487314642
Binary file added before/master-scheduled/12509665511
Binary file added before/master-scheduled/12521935862
Binary file added before/master-scheduled/12530736888
Binary file added before/master-scheduled/12553360748
Binary file added before/user-ingestion/12420112260
Binary file added before/user-ingestion/12422341445
Binary file added before/user-ingestion/12424337584
Binary file added before/user-ingestion/12424954086
Binary file added before/user-ingestion/12427547885
Binary file added before/user-ingestion/12427574689
15 changes: 15 additions & 0 deletions config.py
@@ -1,3 +1,18 @@
GITHUB_API_BASE_URL = 'https://api.github.com/repos/WATonomous/infra-config'
GITHUB_REPO_URL = 'https://github.com/WATonomous/infra-config'
ALLOCATE_RUNNER_SCRIPT_PATH = "apptainer.sh"  # relative path from '/allocation_script'


REPOS_TO_MONITOR = [
    {
        'name': 'WATonomous/infra-config',
        'api_base_url': 'https://api.github.com/repos/WATonomous/infra-config',
        'repo_url': 'https://github.com/WATonomous/infra-config'
    },
    {
        'name': 'WATonomous/wato_asd_training',
        'api_base_url': 'https://api.github.com/repos/WATonomous/wato_asd_training',
        'repo_url': 'https://github.com/WATonomous/wato_asd_training'
    }
]