
[CI][Github] Add linux premerge workflow #119635


Merged

Conversation

boomanaiden154
Contributor

This patch adds a Github Actions workflow for Linux premerge. This currently just calls into the existing CI scripts as a starting point.
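Roughly, the workflow takes the shape sketched below; the trigger events, job name, runner label, and script path here are a simplified illustration rather than the exact contents of the patch:

# Hypothetical sketch only; the actual workflow may differ in triggers,
# runner labels, and the CI entry point it calls.
name: Linux Premerge

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  premerge-checks-linux:
    # The real job targets self-hosted, Kubernetes-backed runners; a hosted
    # runner label is used here only to keep the sketch self-contained.
    runs-on: ubuntu-latest
    steps:
      - name: Checkout LLVM
        uses: actions/checkout@v4
        with:
          fetch-depth: 2
      - name: Run premerge checks
        run: |
          # Reuse the existing CI shell scripts rather than re-implementing
          # the build/test logic in the workflow itself (the script name here
          # is a placeholder).
          ./.ci/run-premerge-checks-linux.sh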

@boomanaiden154 marked this pull request as ready for review December 15, 2024 07:30
@boomanaiden154
Contributor Author

This has been tested, but needs a couple of additional patches to be fully functional:

  1. [Github] Default to non-root user in linux CI container #119987
  2. [Github] Add some additional system packages #119988
  3. [CI] Only upload test results if buildkite-agent is present #119954

This just adds the workflow in for testing. There is at least one kink that needs to be worked out on the infrastructure side, namely that sometimes scaling down kills the runner/container pod. We should be able to fix that with an annotation.

Afterwards, I think we want to do a period of post-commit testing and monitor failures there. Once that is working, we should be able to run this concurrently with the existing premerge pipeline to work out any remaining kinks before finally turning the existing pipeline down.

@boomanaiden154
Contributor Author

This is also just a start. I wanted to try and get a working prototype landed, and then we can iterate in tree. The biggest thing to fix is probably the logging, but really fixing that probably involves splitting things into multiple steps on the GHA side, which means splitting the pipeline shell scripts, which I would like to avoid initially if possible. For now, I think the Github raw logs feature works well enough with ctrl+f in the browser. It is definitely a regression from what @DavidSpickett was able to set up with the Buildkite annotations, but I think it will be easier to build similar infrastructure once we have everything moved over to the new infra.
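To make the trade-off concrete, splitting things into steps would look roughly like the sketch below; the per-phase scripts are placeholders that would have to be split out of the current monolithic pipeline script:

# Illustration only: what per-step log grouping would require on the GHA side.
# The per-phase scripts are hypothetical and do not exist yet.
steps:
  - name: Configure
    run: ./.ci/configure.sh
  - name: Build
    run: ./.ci/build.sh
  - name: Test
    run: ./.ci/test.sh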

@DavidSpickett
Collaborator

The biggest thing to fix is probably the logging, but really fixing that probably involves splitting things into multiple steps on the GHA side, which means splitting the pipeline shell scripts, which I would like to avoid initially if possible.
It is definitely a regression from what @DavidSpickett was able to set up with the Buildkite annotations, but I think it will be easier to build similar infrastructure once we have everything moved over to the new infra.

The only reason we needed all the complication to do the test reporting was the lack of steps, so I agree with your approach here.

(and if we're lucky, there is a GitHub reporting plugin that doesn't need docker either)
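One docker-free possibility, sketched below purely for illustration (not something this PR implements), is a final step that appends a Markdown report to GitHub's built-in job summary, which gets rendered above the logs; the summary contents here are placeholders:

# Rough sketch of a docker-free reporting step using the built-in job summary.
- name: Report test results
  if: always()          # run even when the build/test step failed
  run: |
    {
      echo "### Premerge test results"
      echo ""
      echo "See the raw job log for individual failure details."
    } >> "$GITHUB_STEP_SUMMARY"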

@joker-eph
Collaborator

This just adds the workflow in for testing. There is at least one kink that needs to be worked out on the infrastructure side, namely that sometimes scaling down kills the runner/container pod.

Is there no way to wait for the current job to finish?
I implemented downscaling of a Kubernetes cluster for running buildbot workers where the downscale wouldn't just kill the pod but would send a signal to terminate. That would instruct the buildbot master not to schedule new jobs on the worker running on this pod, and the worker would exit at the end of the current job. The pod would then be garbage collected.

modified_dirs=$(echo "$modified_files" | cut -d'/' -f1 | sort -u)

echo $modified_files
echo $modified_dirs
Collaborator


If you mean to leave this here for debugging, should we add better output?
Like:

echo "===== Modified files in the PR ====="
echo $modified_files
echo "===== Modified dirs in the PR ====="
echo $modified_dies


@joker-eph left a comment


LG, seems like a straightforward translation from the BuildKite flow.
What's the transition plan, by the way: are there enough machines provisioned to land this? We'll run the two in parallel until you disconnect the Buildkite Linux flow, right?

@Keenuts
Contributor

Keenuts commented Dec 16, 2024

This just adds the workflow in for testing. There is at least one kink that needs to be worked out on the infrastructure side, namely that sometimes scaling down kills the runner/container pod.

Is there no way to wait for the current job to finish? I implemented downscaling of a Kubernetes cluster for running buildbot workers where the downscale wouldn't just kill the pod but would send a signal to terminate. That would instruct the buildbot master not to schedule new jobs on the worker running on this pod, and the worker would exit at the end of the current job. The pod would then be garbage collected.

In our case we use GCP autoscale for scale up/down. The fact that a pod can be killed is surprising, as the scale-down trigger should be 10 minutes with almost 0 CPU activity on the node (some instance services are always running).

@Keenuts
Contributor

Keenuts commented Dec 16, 2024

LG, seems like a straightforward translation from the BuildKite flow. What's the transition plan, by the way: are there enough machines provisioned to land this? We'll run the two in parallel until you disconnect the Buildkite Linux flow, right?

For now, we have a new quota for this infra on top of the available Buildkite quota. The plan is to run the two presubmits in parallel while we observe how things go, gather metrics, and make sure things are stable. We won't remove any Buildkite machines until the other infra is stable.

@joker-eph
Collaborator

In our case we use GCP autoscale for scale up/down.

Yeah, that's what I was using as well. I don't remember the customization needed, but that's all just Kubernetes under the hood, right? And Kubernetes supports graceful scale-down and draining a service before decommissioning a pod, I believe.
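For illustration, the Kubernetes mechanism being described looks roughly like the pod spec fragment below; the grace period, container image, and drain helper are assumptions rather than the actual runner configuration:

# Sketch only: graceful scale-down for a runner pod. The drain script and
# grace period are hypothetical; the real runner deployment may use a
# different mechanism entirely.
spec:
  terminationGracePeriodSeconds: 7200   # give an in-flight job time to finish
  containers:
    - name: runner
      image: example-runner-image       # placeholder
      lifecycle:
        preStop:
          exec:
            # Hypothetical helper: stop accepting new jobs, then wait for the
            # current one to complete before letting the pod exit.
            command: ["/bin/sh", "-c", "drain-runner --wait"]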

@Keenuts
Contributor

Keenuts commented Dec 16, 2024

In our case we use GCP autoscale for scale up/down.

Yeah, that's what I was using as well. I don't remember the customization needed, but that's all just Kubernetes under the hood, right? And Kubernetes supports graceful scale-down and draining a service before decommissioning a pod, I believe.

Yes, that was my understanding. I remember we had issues with pods getting killed, but that was because of spot instances or quota limits being reached, not autoscale issues.
I'll follow up with Aiden to debug that.

@boomanaiden154
Contributor Author

boomanaiden154 commented Dec 16, 2024

From my understanding, it seems like the autoscaler was seeing nodes with zero CPU and scaling them down, assuming the runner pod could migrate, without accounting for whether a job was actually running on it. At least, that's my working hypothesis. To fix that we should be able to just add an annotation. I need to dig into it more and figure out what exactly is going on, though.
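A minimal sketch of the kind of annotation in question, assuming the runner pods sit behind the standard Kubernetes cluster autoscaler (whether this exact annotation is what the setup needs still has to be confirmed):

# Marks the pod as not safe to evict, so the cluster autoscaler will not
# remove its node while the pod is running. Pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: linux-premerge-runner
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: runner
      image: example-runner-image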

Thanks for the reviews!

@boomanaiden154 merged commit 484a281 into main Dec 16, 2024
9 checks passed
@boomanaiden154 deleted the users/boomanaiden154/github-actions-linux-pipeline branch December 16, 2024 21:30