Update Docker images to latest CUDA version and Ubuntu version #610

Open: vrdn-23 wants to merge 2 commits into base: main

Conversation

@vrdn-23 vrdn-23 commented May 22, 2025

What does this PR do?

The Docker base images haven't been updated in a while, so I was wondering if we could port them over to newer base images and the latest Ubuntu LTS version. Let me know if there are any concerns!

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

cc @Narsil @alvarobartt

Collaborator

Narsil commented Jun 2, 2025

What's the rationale here?

Upgrading deps is indeed nice, but CUDA 12.8 is rather new (Jan 2025), so it would make TEI fail to run on any older deployments/nodes. Unless it unlocks things, I don't think we should upgrade at the moment.

Ubuntu 24 should be ok.

Author

vrdn-23 commented Jun 2, 2025

I was hoping to get us upgraded to the latest CUDA 12.x version, since CUDA is mostly backwards compatible within minor releases.
If I understand the link correctly, since the current TEI image is on 12.2, most nodes/deployments will already have the minimum required driver version to run 12.8.

Let me know if I misunderstood something here @Narsil
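
As a rough illustration of the minor-version compatibility argument above, the check amounts to comparing a node's driver version against the minimum driver for the CUDA 12.x family. The sketch below assumes 525.60.13 as that minimum on Linux (the value I recall from NVIDIA's CUDA compatibility documentation; please verify it before relying on it), and the function names are hypothetical. It only models the driver-version comparison, not any additional checks a container runtime may apply.

```python
def parse_version(text):
    """Turn a dotted version string such as '535.54.03' into (535, 54, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


# Assumption: minimum Linux driver for CUDA 12.x minor-version compatibility.
CUDA_12_MIN_DRIVER = parse_version("525.60.13")


def supports_cuda_12_minor_compat(driver_version):
    """Return True if the driver meets the assumed CUDA 12.x minimum."""
    return parse_version(driver_version) >= CUDA_12_MIN_DRIVER
```

The driver version string for a node can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.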

Collaborator

Narsil commented Jun 3, 2025

I think that doesn't hold for NVIDIA containers: NVIDIA/nvidia-container-toolkit#940

It's been a while since I've personally seen this arise, since we try to keep up with newer versions of everything, but the CUDA version of the node has caused issues in the past in clusters I manage.

Is there any particular reason for wanting to upgrade? (The stance here is that if it's not broken, there's no need to fix it, and we can take advantage of a later minor upgrade to do such potentially breaking version upgrades.)

Author

vrdn-23 commented Jun 3, 2025

@Narsil Thanks for pointing out that issue. Forward compatibility is not something that I considered.

Is there any particular reason for wanting to upgrade?

I think the rationale is just to ensure that we don't fall too far behind on dependency upgrades. CUDA 12.2 was released in June 2023, and the driver version shipped for 12.2 is not really compatible with some of the newer GPUs coming out (see the AWS EC2 instance/nvidia-driver compatibility matrix).

I am fine with reverting the PR to just the Ubuntu update, and we can maybe update the CUDA version in a later major TEI release (1.8.0 or 2.0?), but the current change is still technically only a minor version update of the CUDA drivers themselves. So I'm a little ambivalent/curious about how this would fit into the TEI release lifecycle.

Collaborator

Narsil commented Jun 3, 2025

1.8 is fine for those kinds of upgrades.

If you just update Ubuntu, I will definitely merge as-is; otherwise we can leave it as-is and I'll merge when 1.8 hits (there are no plans just yet; usually it happens when there's something significant happening, not necessarily a breaking change).

Again, I think it's welcome in general to update regularly, but having been bitten in the past, and seeing no obvious reason right now, I tend to delay those upgrades rather than including them by default.

Thanks a lot for the PR regardless.

Author

vrdn-23 commented Jun 3, 2025

@Narsil Thanks for the update! I'll revert the CUDA changes then, so it can make the 1.7.1 release.

Author

vrdn-23 commented Jun 3, 2025

Oops. Looks like I was too late! Either way I can keep track of this and raise another PR when the time is right to update to the latest CUDA version. Thanks for the feedback and the discussion!

2 participants