[CI][Github] Add linux premerge workflow #119635
Conversation
This patch refactors some common functionality in the CI scripts into a separate shell script. This is mainly intended to make that functionality easier to reuse from a GitHub Actions pipeline as we make the switch.
This patch adds a GitHub Actions workflow for Linux premerge. Currently it just calls into the existing CI scripts as a starting point.
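For orientation, a minimal sketch of what such a workflow can look like is shown below; the file path, trigger, runner label, and script invocation are illustrative assumptions, not the exact contents of this patch.

# .github/workflows/premerge.yaml (illustrative path)
name: Linux Premerge

permissions:
  contents: read

on:
  pull_request:
    branches:
      - main

jobs:
  premerge-checks-linux:
    runs-on: ubuntu-latest  # placeholder; the real workflow targets self-hosted runners
    steps:
      - name: Checkout LLVM
        uses: actions/checkout@v4
        with:
          fetch-depth: 2
      - name: Build and test
        run: |
          # Delegate to the existing monolithic CI script; the argument list
          # here is a placeholder, not necessarily the script's real interface.
          ./.ci/monolithic-linux.sh "clang;llvm" "check-clang check-llvm"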
This has been tested, but needs a couple of additional patches to be fully functional:
This just adds the workflow for testing. There is at least one kink to work out on the infrastructure side, namely that sometimes scaling down kills the runner/container pod. We should be able to fix that with an annotation. After that, I think we want a period of post-commit testing where we monitor failures. Once that is working, we should be able to run everything concurrently with the existing premerge pipeline to shake out any remaining issues before finally turning down the existing premerge pipeline.
This is also just a start. I wanted to get a working prototype landed, and then we can iterate in tree. The biggest thing to fix is probably the logging, but really fixing that probably means splitting things into multiple steps on the GHA side, which in turn means splitting the pipeline shell scripts, which I would like to avoid initially if possible. For now, I think GitHub's raw logs feature works well enough with Ctrl+F in the browser. It is definitely a regression from what @DavidSpickett was able to set up with the Buildkite annotations, but I think it will be easier to build similar infrastructure once everything has moved over to the new infra.
The only reason we needed all the complication for the test reporting was the lack of steps, so I agree with your approach here. (And if we're lucky, there is a GitHub reporting plugin that doesn't need Docker either.)
Is there no way to wait for the current job to finish?
modified_dirs=$(echo "$modified_files" | cut -d'/' -f1 | sort -u)

echo $modified_files
echo $modified_dirs
If you mean to leave this here for debugging, should we add better output?
Like:
echo "===== Modified files in the PR ====="
echo $modified_files
echo "===== Modified dirs in the PR ====="
echo $modified_dirs
LG, seems like a straightforward translation from the BuildKite flow.
What's the transition plan, by the way: are there enough machines provisioned to land this? We'll run the two in parallel until you disconnect the Buildkite Linux flow, right?
In our case we use GCP autoscaling for scale up/down. The fact that a pod can be killed is surprising, as the scale-down trigger should be 10 minutes with almost zero CPU activity on the node (some instance services are always running).
For now, we have a new quota for this infra on top of the available Buildkite quota. The plan is to run the two presubmits in parallel while we observe how things go, gather metrics, and make sure things are stable. We won't remove any Buildkite machines until the other infra is stable.
Yeah, that's what I was using as well. I don't remember the customization needed, but that's all just Kubernetes under the hood, right? And I believe Kubernetes supports gracefully scaling down and draining a service before decommissioning a pod.
Yes, that was my understanding. I remember we had issues with pods getting killed, but that was because of spot instances or quota limits being reached, not autoscaling issues.
From my understanding, the autoscaler was seeing nodes with near-zero CPU and scaling them down without checking whether a job was actually running on them, assuming the pod could simply migrate. At least, that's my working hypothesis. To fix that, we should be able to just add an annotation. I need to dig into it more and figure out what exactly is going on, though. Thanks for the reviews!
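If that hypothesis holds, a pod annotation recognized by the Kubernetes cluster autoscaler should be enough to keep runner pods from being evicted mid-job. A minimal sketch, assuming the runner/container pods come from a pod template we control; the names and image below are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: premerge-runner   # placeholder name
  annotations:
    # Tell the cluster autoscaler not to evict this pod when draining a node
    # for scale-down, so in-flight CI jobs are not killed.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: runner
      image: ghcr.io/example/llvm-premerge-runner:latest  # placeholder image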