
workflows: Rewrite build-ci-container to work on larger runners #117353

Merged: 20 commits into llvm:main on Dec 10, 2024

Conversation

tstellar (Collaborator)

Also switch them over to the new depot runners.

tstellar marked this pull request as ready for review on November 24, 2024 at 00:05.
@boomanaiden154 (Contributor) left a comment

Was just about to start hacking on this today. 😆

Not too familiar with the depot runners. It seems like they're paid self-hosted runners? Is the Foundation paying for them?

llvm-premerge-linux-runners should also work and will point to self-hosted runners on the GCP project handling premerge. We intend to use that runner set to run the transitioned premerge jobs once we migrate to GHA.
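
For reference, pointing a job at that runner set should just be a matter of the runs-on label; a minimal sketch (the workflow name, job name, and steps here are placeholders, not the actual build-ci-container workflow):

name: build-ci-container-sketch   # placeholder name
on: workflow_dispatch
jobs:
  build-ci-container:
    # Target the self-hosted premerge runner set mentioned above.
    runs-on: llvm-premerge-linux-runners
    steps:
      - uses: actions/checkout@v4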

@tstellar (Collaborator, Author)

> Not too familiar with the depot runners. It seems like they're paid self-hosted runners? Is the Foundation paying for them?

The Foundation is providing a small budget for testing them out. We're using 'Depot Managed', which means the instances are deployed in the Foundation's AWS account rather than using machines from Depot's pool.

> llvm-premerge-linux-runners should also work and will point to self-hosted runners on the GCP project handling premerge. We intend to use that runner set to run the transitioned premerge jobs once we migrate to GHA.

Are these ready to use now? I don't have a preference for which runners to use.

@boomanaiden154 (Contributor)

> The Foundation is providing a small budget for testing them out. We're using 'Depot Managed', which means the instances are deployed in the Foundation's AWS account rather than using machines from Depot's pool.

Ah, okay. I didn't realize that had taken place. Thanks for the additional context.

> Are these ready to use now? I don't have a preference for which runners to use.

Theoretically, yes, they are ready to use and should just work. Currently we can only run one job at a time because we don't have enough quota in the region where the cluster is deployed, but that should be enough for testing. @lnihlen is working on getting additional quota. I think it would probably make more sense to use the Google-hosted runners, since they don't require any budget from the Foundation and let us consolidate on a single system.

I think our runners will also be more powerful (configured with 56 vCPUs), but that depends on how the depot runners are configured.

@tstellar (Collaborator, Author)

> The Foundation is providing a small budget for testing them out. We're using 'Depot Managed', which means the instances are deployed in the Foundation's AWS account rather than using machines from Depot's pool.

> Ah, okay. I didn't realize that had taken place. Thanks for the additional context.

> Are these ready to use now? I don't have a preference for which runners to use.

> Theoretically, yes, they are ready to use and should just work. Currently we can only run one job at a time because we don't have enough quota in the region where the cluster is deployed, but that should be enough for testing. @lnihlen is working on getting additional quota. I think it would probably make more sense to use the Google-hosted runners, since they don't require any budget from the Foundation and let us consolidate on a single system.

OK, do you think I should update this PR to use those runners instead or wait a few weeks?

> I think our runners will also be more powerful (configured with 56 vCPUs), but that depends on how the depot runners are configured.

Depot runners have up to 64 CPUs. The number of CPUs is part of the runs-on label, so you can choose the machine size that fits the job.
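
For example (illustrative only; the exact Depot label spelling should be checked against what is configured for the org, and the job shown is not the real workflow):

name: size-example   # placeholder name
on: workflow_dispatch
jobs:
  build-ci-container:
    # Depot-style label: OS/version plus a vCPU-count suffix picks the machine size.
    runs-on: depot-ubuntu-22.04-64
    steps:
      - uses: actions/checkout@v4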

@boomanaiden154 (Contributor)

> OK, do you think I should update this PR to use those runners instead or wait a few weeks?

I would say update it to use the new runners. This job doesn't run very often and probably shouldn't take that long to run on a decent-sized machine.

> Depot runners have up to 64 CPUs. The number of CPUs is part of the runs-on label, so you can choose the machine size that fits the job.

Ah, interesting. We don't have that much flexibility currently, but we could theoretically add the capability. Everything I have seen or can imagine, though, either needs very little compute and should probably run on the free GitHub runners, or needs a lot and would benefit from as many cores as possible.

@tstellar (Collaborator, Author)

> OK, do you think I should update this PR to use those runners instead or wait a few weeks?

> I would say update it to use the new runners. This job doesn't run very often and probably shouldn't take that long to run on a decent-sized machine.

> Depot runners have up to 64 CPUs. The number of CPUs is part of the runs-on label, so you can choose the machine size that fits the job.

> Ah, interesting. We don't have that much flexibility currently, but we could theoretically add the capability. Everything I have seen or can imagine, though, either needs very little compute and should probably run on the free GitHub runners, or needs a lot and would benefit from as many cores as possible.

Yeah, the nice thing about these builds is that they are highly parallel, so if you are paying for runners, you may as well use the one with the most CPUs, because the cost per job ends up being roughly the same no matter how many CPUs you have.
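
As a rough back-of-the-envelope sketch (made-up symbols, assuming a flat per-vCPU hourly price p and near-linear build scaling):

\text{cost per job} \approx \underbrace{(N \cdot p)}_{\text{runner price per hour}} \cdot \underbrace{(T_1 / N)}_{\text{wall-clock hours}} = p \cdot T_1

where N is the vCPU count and T_1 a notional single-core build time; N cancels, so the per-job cost is roughly independent of machine size and the bigger runner just finishes sooner.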

@tstellar (Collaborator, Author)

@boomanaiden154 It looks like the llvm-premerge-linux-runners require jobs to run in a container, and nested containers don't work, so I may have to switch back to the depot runners.

This reverts commit 1f3a36f.

This reverts commit 00f4dcd.

This reverts commit 966fd98.
@boomanaiden154 (Contributor)

> It looks like the llvm-premerge-linux-runners require jobs to run in a container, and nested containers don't work, so I may have to switch back to the depot runners.

Ah, right. I forgot about that detail. My plan was to use https://github.com/GoogleContainerTools/kaniko to build things on the cluster, but that's definitely something I can get to later.
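
For context, the shape of the conflict looks roughly like this (a sketch with placeholder names, not the actual workflow or images):

name: build-ci-container-on-premerge   # placeholder name
on: workflow_dispatch
jobs:
  build-ci-container:
    runs-on: llvm-premerge-linux-runners
    # The premerge runners require the job itself to run inside a container...
    container:
      image: ubuntu:22.04   # placeholder image
    steps:
      - uses: actions/checkout@v4
      - name: Build the CI container image
        # ...so an image build here would need nested containers (a container
        # runtime inside the job container), which doesn't work on these runners.
        run: docker build -t ci-container .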

-DBOOTSTRAP_CLANG_PGO_TRAINING_DATA_SOURCE_DIR=/llvm-project-llvmorg-$LLVM_VERSION/llvm
-DCLANG_DEFAULT_LINKER="lld"

RUN ninja -C ./build stage2-clang-bolt stage2-install-distribution && ninja -C ./build install-distribution && rm -rf ./build
boomanaiden154 (Contributor)

Maybe add a comment on why the rm -rf ./build is here? I believe it was originally added to avoid out-of-disk-space errors. Assuming the depot runners have enough disk, it is probably still useful, as it likely reduces checkpointing time.

tstellar (Collaborator, Author)

I'm inclined to drop it. I think this is a common pattern when producing a final image, but since this is just a builder image, I'm not sure it makes sense here. I don't have a strong opinion, though.

boomanaiden154 (Contributor)

Seems reasonable enough to me.

@boomanaiden154 (Contributor) left a comment

LGTM after addressing the final comment.

We can save money by only supporting one Ubuntu version in depot, and we need to use the oldest version when building the release binaries.
@boomanaiden154 (Contributor)

@tstellar What are your plans for landing this? I wanted to try porting it to the premerge cluster this week, but I would prefer to build off of this patch in-tree.

tstellar merged commit df4c5d5 into llvm:main on Dec 10, 2024. 8 checks passed.
mgehre-amd pushed a commit to Xilinx/llvm-project that referenced this pull request Jan 6, 2025