
[CI][Github] Add linux premerge workflow #119635


Merged

Conversation

boomanaiden154
Contributor

This patch adds a Github Actions workflow for Linux premerge. This currently just calls into the existing CI scripts as a starting point.
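Roughly, the workflow takes the shape sketched below; the trigger events, job name, runner label, and script path here are a simplified illustration rather than the exact contents of the patch:

# Hypothetical sketch only; the actual workflow may differ in triggers,
# runner labels, and the CI entry point it calls.
name: Linux Premerge

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  premerge-checks-linux:
    # The real job targets self-hosted, Kubernetes-backed runners; a hosted
    # runner label is used here only to keep the sketch self-contained.
    runs-on: ubuntu-latest
    steps:
      - name: Checkout LLVM
        uses: actions/checkout@v4
        with:
          fetch-depth: 2
      - name: Run premerge checks
        run: |
          # Reuse the existing CI shell scripts rather than re-implementing
          # the build/test logic in the workflow itself (the script name here
          # is a placeholder).
          ./.ci/run-premerge-checks-linux.sh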

@boomanaiden154 marked this pull request as ready for review December 15, 2024 07:30
@boomanaiden154
Contributor Author

This has been tested, but needs a couple of additional patches to be fully functional:

  1. [Github] Default to non-root user in linux CI container #119987
  2. [Github] Add some additional system packages #119988
  3. [CI] Only upload test results if buildkite-agent is present #119954

This just adds the workflow in for testing. There is at least one kink that needs to be worked out on the infrastructure side, namely that sometimes scaling down kills the runner/container pod. We should be able to fix that with an annotation.

Afterwards, I think we want to do a period of post-commit testing and monitor failures there. Once that is working, we should be able to run this concurrently with the existing premerge pipeline to work out any remaining kinks before finally turning the existing pipeline down.

@boomanaiden154
Contributor Author

This is also just a start. I wanted to try and get a working prototype landed, and then we can iterate in tree. The biggest thing to fix is probably the logging, but really fixing that probably involves splitting things into multiple steps on the GHA side, which means splitting the pipeline shell scripts, which I would like to avoid initially if possible. For now, I think the Github raw logs feature works well enough with ctrl+f in the browser. It is definitely a regression from what @DavidSpickett was able to set up with the Buildkite annotations, but I think it will be easier to build similar infrastructure once we have everything moved over to the new infra.
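To make the trade-off concrete, splitting things into steps would look roughly like the sketch below; the per-phase scripts are placeholders that would have to be split out of the current monolithic pipeline script:

# Illustration only: what per-step log grouping would require on the GHA side.
# The per-phase scripts are hypothetical and do not exist yet.
steps:
  - name: Configure
    run: ./.ci/configure.sh
  - name: Build
    run: ./.ci/build.sh
  - name: Test
    run: ./.ci/test.sh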

@DavidSpickett
Collaborator

The biggest thing to fix is probably the logging, but really fixing that probably involves splitting things into multiple steps on the GHA side, which means splitting the pipeline shell scripts, which I would like to avoid initially if possible.
It is definitely a regression from what @DavidSpickett was able to set up with the Buildkite annotations, but I think it will be easier to build similar infrastructure once we have everything moved over to the new infra.

The only reason we needed all the complication to do the test reporting was the lack of steps, so I agree with your approach here.

(and if we're lucky, there is a GitHub reporting plugin that doesn't need docker either)
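One docker-free possibility, sketched below purely for illustration (not something this PR implements), is a final step that appends a Markdown report to GitHub's built-in job summary, which gets rendered above the logs; the summary contents here are placeholders:

# Rough sketch of a docker-free reporting step using the built-in job summary.
- name: Report test results
  if: always()          # run even when the build/test step failed
  run: |
    {
      echo "### Premerge test results"
      echo ""
      echo "See the raw job log for individual failure details."
    } >> "$GITHUB_STEP_SUMMARY"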

@joker-eph
Collaborator

This just adds the workflow in for testing. There is at least one kink that needs to be worked out on the infrastructure side, namely that sometimes scaling down kills the runner/container pod.

Is there no way to wait for the current job to finish?
I implemented downscaling of a Kubernetes cluster for running buildbot workers where the downscale wouldn't just kill the pod but would send a signal to terminate. That would instruct the buildbot master not to schedule new jobs on the worker running on this pod, and the worker would exit at the end of the current job. The pod would then be garbage collected.

modified_dirs=$(echo "$modified_files" | cut -d'/' -f1 | sort -u)

echo $modified_files
echo $modified_dirs
Collaborator


If you mean to leave this here for debugging, should we add better output?
Like:

echo "===== Modified files in the PR ====="
echo $modified_files
echo "===== Modified dirs in the PR ====="
echo $modified_dies


@joker-eph left a comment


LG, seems like a straightforward translation from the BuildKite flow.
What's the transition plan, by the way: are there enough machines provisioned to land this? We'll run the two in parallel until you disconnect the Buildkite Linux flow, right?

@Keenuts
Contributor

Keenuts commented Dec 16, 2024

This just adds the workflow in for testing. There is at least one kink that needs to be worked out on the infrastructure side, namely that sometimes scaling down kills the runner/container pod.

Is there no way to wait for the current job to finish? I implemented downscaling of a Kubernetes cluster for running buildbot workers where the downscale wouldn't just kill the pod but would send a signal to terminate. That would instruct the buildbot master not to schedule new jobs on the worker running on this pod, and the worker would exit at the end of the current job. The pod would then be garbage collected.

In our case we use GCP autoscale for scale up/down. The fact that a pod can be killed is surprising, as the scale-down trigger should be 10 minutes with almost 0 CPU activity on the node (some instance services are always running).

@Keenuts
Contributor

Keenuts commented Dec 16, 2024

LG, seems like a straightforward translation from the BuildKite flow. What's the transition plan, by the way: are there enough machines provisioned to land this? We'll run the two in parallel until you disconnect the Buildkite Linux flow, right?

For now, we have a new quota for this infra on top of the available Buildkite quota. The plan is to run the two presubmits in parallel while we observe how things go, gather metrics, and make sure things are stable. We won't remove any Buildkite machines until the other infra is stable.

@joker-eph
Collaborator

In our case we use GCP autoscale for scale up/down.

Yeah, that's what I was using as well. I don't remember the customization needed, but that's all just Kubernetes under the hood, right? And Kubernetes supports graceful scale-down and draining a service before decommissioning a pod, I believe.
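For illustration, the Kubernetes mechanism being described looks roughly like the pod spec fragment below; the grace period, container image, and drain helper are assumptions rather than the actual runner configuration:

# Sketch only: graceful scale-down for a runner pod. The drain script and
# grace period are hypothetical; the real runner deployment may use a
# different mechanism entirely.
spec:
  terminationGracePeriodSeconds: 7200   # give an in-flight job time to finish
  containers:
    - name: runner
      image: example-runner-image       # placeholder
      lifecycle:
        preStop:
          exec:
            # Hypothetical helper: stop accepting new jobs, then wait for the
            # current one to complete before letting the pod exit.
            command: ["/bin/sh", "-c", "drain-runner --wait"]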

@Keenuts
Contributor

Keenuts commented Dec 16, 2024

In our case we use GCP autoscale for scale up/down.

Yeah, that's what I was using as well. I don't remember the customization needed, but that's all just Kubernetes under the hood, right? And Kubernetes supports graceful scale-down and draining a service before decommissioning a pod, I believe.

Yes, that was my understanding. I remember we had issues with pods getting killed, but that was because of spot instances or quota limits being reached, not autoscale issues.
I'll follow up with Aiden to debug that.

@boomanaiden154
Contributor Author

boomanaiden154 commented Dec 16, 2024

From my understanding, it seems like the autoscaler was seeing nodes with zero CPU and scaling them down, assuming the runner pod could migrate, without accounting for whether a job was actually running on it. At least, that's my working hypothesis. To fix that we should be able to just add an annotation. I need to dig into it more and figure out what exactly is going on, though.
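A minimal sketch of the kind of annotation in question, assuming the runner pods sit behind the standard Kubernetes cluster autoscaler (whether this exact annotation is what the setup needs still has to be confirmed):

# Marks the pod as not safe to evict, so the cluster autoscaler will not
# remove its node while the pod is running. Pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: linux-premerge-runner
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: runner
      image: example-runner-image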

Thanks for the reviews!

@boomanaiden154 merged commit 484a281 into main Dec 16, 2024
9 checks passed
@boomanaiden154 deleted the users/boomanaiden154/github-actions-linux-pipeline branch December 16, 2024 21:30