
workflows: Rewrite build-ci-container to work on larger runners #117353

Merged: 20 commits into llvm:main on Dec 10, 2024

Conversation

tstellar (Collaborator)

Also switch them over to the new depot runners.

tstellar marked this pull request as ready for review on November 24, 2024 at 00:05.
@boomanaiden154 (Contributor) left a comment

Was just about to start hacking on this today. 😆

Not too familiar with the depot runners. It seems like they're paid self-hosted runners? Is the Foundation paying for them?

llvm-premerge-linux-runners should also work and will point to self-hosted runners on the GCP project handling premerge. We intend to use that runner set to run the transitioned premerge jobs once we migrate to GHA.
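
For reference, pointing a job at that runner set should just be a matter of the runs-on label; a minimal sketch (the workflow name, job name, and steps here are placeholders, not the actual build-ci-container workflow):

name: build-ci-container-sketch   # placeholder name
on: workflow_dispatch
jobs:
  build-ci-container:
    # Target the self-hosted premerge runner set mentioned above.
    runs-on: llvm-premerge-linux-runners
    steps:
      - uses: actions/checkout@v4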

@tstellar (Collaborator, Author)

> Not too familiar with the depot runners. It seems like they're paid self-hosted runners? Is the Foundation paying for them?

The Foundation is providing a small budget for testing them out. We're using 'Depot Managed', which means the instances are deployed in the Foundation's AWS account rather than using machines from Depot's pool.

> llvm-premerge-linux-runners should also work and will point to self-hosted runners on the GCP project handling premerge. We intend to use that runner set to run the transitioned premerge jobs once we migrate to GHA.

Are these ready to use now? I don't have a preference for which runners to use.

@boomanaiden154 (Contributor)

> The Foundation is providing a small budget for testing them out. We're using 'Depot Managed', which means the instances are deployed in the Foundation's AWS account rather than using machines from Depot's pool.

Ah, okay. I didn't realize that had taken place. Thanks for the additional context.

> Are these ready to use now? I don't have a preference for which runners to use.

Theoretically, yes, they are ready to use and should just work. Currently we can only run one job at a time because we don't have enough quota in the region where the cluster is deployed, but that should be enough for testing. @lnihlen is working on getting additional quota. I think it would probably make more sense to use the Google-hosted runners, since they don't require any budget from the Foundation and let us consolidate on a single system.

I think our runners will also be more powerful (configured with 56 vCPUs), but that depends on how the depot runners are configured.

@tstellar (Collaborator, Author)

> The Foundation is providing a small budget for testing them out. We're using 'Depot Managed', which means the instances are deployed in the Foundation's AWS account rather than using machines from Depot's pool.

> Ah, okay. I didn't realize that had taken place. Thanks for the additional context.

> Are these ready to use now? I don't have a preference for which runners to use.

> Theoretically, yes, they are ready to use and should just work. Currently we can only run one job at a time because we don't have enough quota in the region where the cluster is deployed, but that should be enough for testing. @lnihlen is working on getting additional quota. I think it would probably make more sense to use the Google-hosted runners, since they don't require any budget from the Foundation and let us consolidate on a single system.

OK, do you think I should update this PR to use those runners instead or wait a few weeks?

> I think our runners will also be more powerful (configured with 56 vCPUs), but that depends on how the depot runners are configured.

Depot runners have up to 64 CPUs. The number of CPUs is part of the runs-on label, so you can choose the machine size that fits the job.
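
For example (illustrative only; the exact Depot label spelling should be checked against what is configured for the org, and the job shown is not the real workflow):

name: size-example   # placeholder name
on: workflow_dispatch
jobs:
  build-ci-container:
    # Depot-style label: OS/version plus a vCPU-count suffix picks the machine size.
    runs-on: depot-ubuntu-22.04-64
    steps:
      - uses: actions/checkout@v4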

@boomanaiden154 (Contributor)

> OK, do you think I should update this PR to use those runners instead or wait a few weeks?

I would say update it to use the new runners. This job doesn't run very often and probably shouldn't take that long to run on a decent-sized machine.

> Depot runners have up to 64 CPUs. The number of CPUs is part of the runs-on label, so you can choose the machine size that fits the job.

Ah, interesting. We don't have that much flexibility currently, but we could theoretically add the capability. Everything I have seen or can imagine, though, either needs very little compute and should probably run on the free GitHub runners, or needs a lot and would benefit from as many cores as possible.

@tstellar (Collaborator, Author)

> OK, do you think I should update this PR to use those runners instead or wait a few weeks?

> I would say update it to use the new runners. This job doesn't run very often and probably shouldn't take that long to run on a decent-sized machine.

> Depot runners have up to 64 CPUs. The number of CPUs is part of the runs-on label, so you can choose the machine size that fits the job.

> Ah, interesting. We don't have that much flexibility currently, but we could theoretically add the capability. Everything I have seen or can imagine, though, either needs very little compute and should probably run on the free GitHub runners, or needs a lot and would benefit from as many cores as possible.

Yeah, the nice thing about these builds is that they are highly parallel, so if you are paying for runners, you may as well use the one with the most CPUs, because the cost per job ends up being roughly the same no matter how many CPUs you have.
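
As a rough back-of-the-envelope sketch (made-up symbols, assuming a flat per-vCPU hourly price p and near-linear build scaling):

\text{cost per job} \approx \underbrace{(N \cdot p)}_{\text{runner price per hour}} \cdot \underbrace{(T_1 / N)}_{\text{wall-clock hours}} = p \cdot T_1

where N is the vCPU count and T_1 a notional single-core build time; N cancels, so the per-job cost is roughly independent of machine size and the bigger runner just finishes sooner.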

@tstellar (Collaborator, Author)

@boomanaiden154 It looks like the llvm-premerge-linux-runners require jobs to run in a container, and nested containers don't work, so I may have to switch back to the depot runners.

This reverts commit 1f3a36f.

This reverts commit 00f4dcd.

This reverts commit 966fd98.
@boomanaiden154 (Contributor)

> It looks like the llvm-premerge-linux-runners require jobs to run in a container, and nested containers don't work, so I may have to switch back to the depot runners.

Ah, right. I forgot about that detail. My plan was to use https://github.com/GoogleContainerTools/kaniko to build things on the cluster, but that's definitely something I can get to later.
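
For context, the shape of the conflict looks roughly like this (a sketch with placeholder names, not the actual workflow or images):

name: build-ci-container-on-premerge   # placeholder name
on: workflow_dispatch
jobs:
  build-ci-container:
    runs-on: llvm-premerge-linux-runners
    # The premerge runners require the job itself to run inside a container...
    container:
      image: ubuntu:22.04   # placeholder image
    steps:
      - uses: actions/checkout@v4
      - name: Build the CI container image
        # ...so an image build here would need nested containers (a container
        # runtime inside the job container), which doesn't work on these runners.
        run: docker build -t ci-container .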

-DBOOTSTRAP_CLANG_PGO_TRAINING_DATA_SOURCE_DIR=/llvm-project-llvmorg-$LLVM_VERSION/llvm
-DCLANG_DEFAULT_LINKER="lld"

RUN ninja -C ./build stage2-clang-bolt stage2-install-distribution && ninja -C ./build install-distribution && rm -rf ./build
boomanaiden154 (Contributor)

Maybe add a comment on why the rm -rf ./build is here? I believe it was originally added to avoid out-of-disk-space errors. Assuming the depot runners have enough disk, it is probably still useful, as it likely reduces checkpointing time.

tstellar (Collaborator, Author)

I'm inclined to drop it. I think this is a common pattern when producing a final image, but since this is just a builder image, I'm not sure it makes sense here. I don't have a strong opinion, though.

boomanaiden154 (Contributor)

Seems reasonable enough to me.

@boomanaiden154 (Contributor) left a comment

LGTM after addressing the final comment.

We can save money by only supporting one Ubuntu version in depot, and we need to use the oldest version when building the release binaries.
@boomanaiden154 (Contributor)

@tstellar What are your plans for landing this? I wanted to try porting it to the premerge cluster this week, but I would prefer to build off of this patch in-tree.

tstellar merged commit df4c5d5 into llvm:main on Dec 10, 2024. 8 checks passed.
mgehre-amd pushed a commit to Xilinx/llvm-project that referenced this pull request Jan 6, 2025