
Codegate 844 #931

Merged
merged 34 commits from codegate-844 into main on Feb 12, 2025
Conversation

therealnb
Contributor

This is a retry of #917, which was initially merged but caused build problems and was reverted in #930.

This PR is to identify ways to fix the build space problem.

@aponcedeleonch
Contributor

My guess is that Pytorch is trying to install some CUDA dependencies and that is causing the build to fail. That's my assumption based on the following build error:

Error: buildx failed with: ERROR: failed to solve: rpc error: code = Unknown desc = write /usr/local/lib/python3.12/site-packages/nvidia/cudnn/lib/libcudnn_engines_precompiled.so.9: no space left on device

CodeGate is meant to run as a container, i.e. most probably we won't have access to GPUs at runtime, hence all the CUDA-related stuff is overhead. I would try to install Pytorch without CUDA dependencies. I haven't tested the solutions proposed in this SO post, but they seem like they could work.

Suggestion from the post: include this in the .toml file:

[tool.poetry.dependencies]
torch = { url = "https://download.pytorch.org/whl/torch_stable.html/cpu/torch-1.3.1%2Bcpu-cp36-cp36m-win_amd64.whl" }

@therealnb
Contributor Author

(codegate-py3.12) nigels-MacBook-Pro:codegate nigel$ poetry lock
Resolving dependencies... (0.3s)

403 Client Error: Forbidden for url: https://download.pytorch.org/whl/torch_stable.html/cpu/torch-2.6.0%2Bcpu-cp36-cp36m-win_amd64.whl

(codegate-py3.12) nigels-MacBook-Pro:codegate nigel$ poetry lock
Resolving dependencies... (0.3s)

403 Client Error: Forbidden for url: https://download.pytorch.org/whl/torch_stable.html/cpu/torch-1.3.1%2Bcpu-cp36-cp36m-win_amd64.whl

@therealnb
Contributor Author

I tried it a different way, manually installing torch with

    - name: Install torch
      run: . `poetry env info --path`/bin/activate; pip install --upgrade torch==2.6.0 -f https://download.pytorch.org/whl/cpu

and taking it out of the toml file. I got this https://github.com/stacklok/codegate/actions/runs/13185369848/job/36806090206?pr=931

It still ran out of disk space.

Any more ideas, @aponcedeleonch ?

@therealnb
Contributor Author

This still ran out of space, even without the model file: https://github.com/stacklok/codegate/actions/runs/13163734423/job/36738711373?pr=931

@aponcedeleonch
Contributor

@therealnb I did a quick test on a PR of just installing Pytorch. I wanted to test whether the culprit was installing Pytorch or something else. It turns out the error does indeed come just from installing Pytorch. See PR #975.

I found this other SO post that has some suggestions on how to free up disk space during the GitHub action; see the accepted answer. Maybe doing something like that can help us. Either that, or increasing the runner disk space (I don't know if that's possible).

@aponcedeleonch
Contributor

aponcedeleonch commented Feb 7, 2025

Not related to the image and Pytorch install problem. I just remembered you mentioned that the first run of the model was slow because the weights needed to be loaded. I was thinking that you could load the weights at server startup so that you already have them when you run the model. There is other stuff that gets initialized or done at server startup, e.g. DB migrations.

This is the function used to startup the server:

def serve( # noqa: C901
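A minimal sketch of that idea, assuming a hypothetical weights path and helper function (not CodeGate's actual startup code):

Python

import torch

# Sketch only: the path and helper below are placeholders, not CodeGate code.
_model = None

def _load_model_at_startup(path="model_weights.pth"):
    """Load the weights once at startup so requests don't pay the loading cost."""
    global _model
    _model = torch.load(path)
    _model.eval()

def serve():  # noqa: C901
    # ... existing startup work (config, DB migrations, ...) ...
    _load_model_at_startup()
    # ... start the web server ...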

@therealnb
Contributor Author

@therealnb I did a quick test on a PR of just installing Pytorch. I wanted to test whether the culprit was installing Pytorch or something else. It turns out the error does indeed come just from installing Pytorch. See PR #975.

I tried the reduced Pytorch in a few ways (long chat with Claude). I think it is just size in general, nothing Pytorch-specific.

I found this other SO post that has some suggestions on how to free up disk space during the GitHub action; see the accepted answer. Maybe doing something like that can help us. Either that, or increasing the runner disk space (I don't know if that's possible).

I wonder if there is some caching that we could kill. I think the decision is to use a bigger runner (@lukehinds @ptelang). If we squeeze this into the current runner, something else will go over the limit.

@therealnb
Contributor Author

Not related to the image and Pytorch install problem. I just remembered you mentioned that the first run of the model was slow because the weights needed to be loaded. I was thinking that you could load the weights at server startup so that you already have them when you run the model. There is other stuff that gets initialized or done at server startup, e.g. DB migrations.

Let's ask Gemini...

When I use pytorch and load an ANN from a file, the first inference is slower than the rest. Is there further internal initialisation that needs to happen?...

You're observing a common phenomenon with PyTorch (and other deep learning frameworks). The first inference after loading a model often takes longer due to various initialization steps that happen behind the scenes. It's not exactly "further internal initialization" in the sense of model parameters being altered, but rather the setup of the execution environment.

Here's a breakdown of the reasons and how to address them:

Reasons for Slow First Inference:

CUDA Context Initialization (if using GPU): If you're using a GPU, the first CUDA operation (which inference is) triggers the initialization of the CUDA context. This involves loading necessary drivers, setting up communication with the GPU, and allocating memory. This is a one-time cost per process.

Just-In-Time (JIT) Compilation (for some operations): PyTorch sometimes uses JIT compilation for certain operations, especially if you're using features like torch.jit.script or tracing. The first time a particular operation is encountered, it might need to be compiled, adding to the initial overhead. Subsequent calls will use the compiled version.

Memory Allocation and Initialization: Even without explicit CUDA context initialization, memory needs to be allocated for the model's weights, activations, and other intermediate values. This can take a small amount of time, especially for larger models.

Framework Overhead: There's always some inherent overhead in the first call to any function or method within a framework. This includes function call setup, argument parsing, and other internal bookkeeping.

How to Pre-trigger/Warm Up:

The standard and recommended way to avoid this slow first inference is to perform a "warm-up" inference. This means running a dummy inference before you start your actual benchmarking or time-sensitive processing.

Python

import time

import torch

# ... load your model ...
model = torch.load("your_model.pth")  # placeholder: path to your saved model
model.eval()  # Important: set to evaluation mode

# Create dummy input data (matching your model's input shape)
input_shape = (3, 224, 224)  # placeholder: replace with your model's input dimensions
dummy_input = torch.randn(1, *input_shape)  # batch size 1

# Warm-up inference
with torch.no_grad():  # Important: disable gradients during warm-up
    _ = model(dummy_input)

# Now, time your actual inference calls
real_input = torch.randn(1, *input_shape)  # placeholder: replace with your real data
start_time = time.time()
with torch.no_grad():
    result = model(real_input)
end_time = time.time()
print(f"Inference time: {end_time - start_time}")

Key Points:

model.eval(): Crucially, set your model to evaluation mode (model.eval()) during both the warm-up and your actual inference. This disables features like dropout and batch normalization that behave differently during training and inference.
torch.no_grad(): Use torch.no_grad() to prevent the calculation of gradients during the warm-up and inference. This saves memory and speeds up the process.
Dummy Input: Create a dummy input tensor that has the correct shape and data type for your model. The actual values don't matter much for the warm-up.
Warm-up Iterations: For very critical applications, you could even perform a few warm-up iterations. This can be especially helpful if you suspect JIT compilation is involved.
CUDA Synchronization (if using GPU): If you're using CUDA, you might want to add torch.cuda.synchronize() after the warm-up inference to ensure that all CUDA operations are completed before you start timing:
Python

if torch.cuda.is_available():
    torch.cuda.synchronize()

By following these practices, you can effectively eliminate the slow first inference and get more consistent and accurate performance measurements. The overhead of the warm-up is negligible compared to the benefits.

This is the function used to startup the server:

def serve( # noqa: C901

We could put a dummy eval in there. I think it is an optimisation and will leave it until we get the thing built.
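Something like this, roughly (placeholder names and shapes, not the actual model code):

Python

import torch

def warm_up(model, input_shape=(1, 768)):  # placeholder shape for the model input
    """Run one throwaway inference so the first real request is not slow."""
    model.eval()
    with torch.no_grad():
        model(torch.randn(*input_shape))

# In serve(), after the model has been loaded:
#     warm_up(model)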

@aponcedeleonch
Contributor

Increasing the runner size does help: #975. We need to discuss if this is the way we want to go.

@therealnb
Contributor Author

I rewrote the whole model eval stuff to save in ONNX format and run the inference in onnxruntime. This makes the whole image smaller.
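Roughly, the change looks like the sketch below; the function names, input shape, and paths are placeholders rather than the code in this PR.

Python

import numpy as np
import onnxruntime as ort
import torch

def export_to_onnx(model, input_shape=(1, 768), path="model.onnx"):
    """Export the trained torch model once, ahead of time (placeholder names)."""
    model.eval()
    dummy_input = torch.randn(*input_shape)
    torch.onnx.export(model, dummy_input, path)

def run_inference(features, path="model.onnx"):
    """Run inference with onnxruntime only, so torch is not needed at runtime."""
    session = ort.InferenceSession(path)
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: features.astype(np.float32)})
    return outputs[0]

The idea is that the runtime image only needs onnxruntime (and numpy) rather than torch, which is where the size saving comes from.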

aponcedeleonch previously approved these changes Feb 12, 2025

@aponcedeleonch left a comment

Very nice!

@therealnb therealnb merged commit b1d055f into main Feb 12, 2025
10 checks passed
@therealnb therealnb deleted the codegate-844 branch February 12, 2025 14:31