Codegate 844 #931
Conversation
Branch updated from f6f164c to 300da89 (Compare)
My guess is that PyTorch is trying to install some CUDA dependencies and that is causing the build to fail. That's my assumption based on the following error at build time.
CodeGate is meant to run as a container, i.e. most probably we won't have access to GPUs at runtime, so all the CUDA-related stuff is pure overhead. I would try to install PyTorch without the CUDA dependencies. I haven't tested the solutions proposed in this SO post, but they look like they could work. Suggestion from the post: include this under [tool.poetry.dependencies]:
torch = { url = "https://download.pytorch.org/whl/torch_stable.html/cpu/torch-1.3.1%2Bcpu-cp36-cp36m-win_amd64.whl" }
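If a CPU-only wheel route like this works, a quick sanity check (a hedged sketch, not part of the build) can confirm the installed torch really was built without CUDA:

```python
import torch

# A CPU-only build reports no CUDA toolkit and no visible GPU.
print(torch.__version__)          # CPU-only wheels typically carry a "+cpu" suffix
print(torch.version.cuda)         # None when torch was built without CUDA
print(torch.cuda.is_available())  # False inside the container
```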
I tried it a different way, manually installing torch and taking it out of the toml file. I got this: https://github.com/stacklok/codegate/actions/runs/13185369848/job/36806090206?pr=931 It still ran out of disk space. Any more ideas, @aponcedeleonch?
This still ran out of space, even without the model file: https://github.com/stacklok/codegate/actions/runs/13163734423/job/36738711373?pr=931
@therealnb I did a quick test on a PR of just installing PyTorch, to check whether the culprit was installing PyTorch or something else. It turns out that the error comes from installing PyTorch alone; see PR #975. I also found this other SO post with suggestions on how to free up disk space during the GitHub action (see the accepted answer). Maybe doing something like that can help us. Either that or increasing the runner disk space (I don't know if that's possible).
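For reference, the usual shape of that SO suggestion is to delete large preinstalled toolchains on the hosted runner before building. It is normally written as shell steps in the workflow; the sketch below expresses the same idea in Python, and the directory list is an assumption about what ships on ubuntu-latest runners:

```python
import shutil
import subprocess

# Large toolchains preinstalled on GitHub-hosted Ubuntu runners that a
# container build job typically does not need (assumed paths).
CANDIDATE_DIRS = [
    "/usr/share/dotnet",       # .NET SDKs
    "/usr/local/lib/android",  # Android SDK
    "/opt/ghc",                # Haskell toolchain
]


def free_runner_space() -> None:
    for path in CANDIDATE_DIRS:
        # The runner user does not own these paths, so go through sudo
        # (passwordless on hosted runners) instead of shutil.rmtree.
        subprocess.run(["sudo", "rm", "-rf", path], check=False)
    total, used, free = shutil.disk_usage("/")
    print(f"Free disk space after cleanup: {free / 1e9:.1f} GB")


if __name__ == "__main__":
    free_runner_space()
```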
Not related to the image and PyTorch install problem: I just remembered you mentioned that the first run of the model was slow because the weights needed to be loaded. You could load the weights at server startup, so that they are already in place when you run the model. There is other stuff that gets initialized or done at server startup, e.g. DB migrations. This is the function used to start up the server: line 257 in dde0570.
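A minimal sketch of that idea, assuming a FastAPI-style app with a lifespan hook; the model class and function names are placeholders, not codegate's actual startup code:

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


class DummyClassifier:
    """Stand-in for the real model; loading weights from disk would happen here."""

    def classify(self, text: str) -> float:
        return 0.0


model: DummyClassifier | None = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    global model
    model = DummyClassifier()        # load weights once, at startup
    model.classify("warm-up input")  # dummy call so any lazy init happens now
    yield                            # requests are served here
    model = None                     # optional cleanup on shutdown


app = FastAPI(lifespan=lifespan)
```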
I tried the reduced PyTorch in a few ways (long chat with Claude). I think it is just size in general - nothing PyTorch-specific.
I wonder if there is some caching that we could kill. I think the decision is to use a bigger runner (@lukehinds @ptelang). If we squeeze this into the current runner, something else will go over the limit.
Let's ask Gemini... "When I use PyTorch and load an ANN from a file, the first inference is slower than the rest. Is there further internal initialisation that needs to happen?"

You're observing a common phenomenon with PyTorch (and other deep learning frameworks). The first inference after loading a model often takes longer due to various initialization steps that happen behind the scenes. It's not exactly "further internal initialization" in the sense of model parameters being altered, but rather the setup of the execution environment. Here's a breakdown of the reasons and how to address them.

Reasons for slow first inference:
- CUDA context initialization (if using GPU): the first CUDA operation (which inference is) triggers the initialization of the CUDA context. This involves loading the necessary drivers, setting up communication with the GPU, and allocating memory. This is a one-time cost per process.
- Just-in-time (JIT) compilation (for some operations): PyTorch sometimes uses JIT compilation for certain operations, especially if you're using features like torch.jit.script or tracing. The first time a particular operation is encountered it might need to be compiled, adding to the initial overhead. Subsequent calls use the compiled version.
- Memory allocation and initialization: even without explicit CUDA context initialization, memory needs to be allocated for the model's weights, activations, and other intermediate values. This can take a small amount of time, especially for larger models.
- Framework overhead: there's always some inherent overhead in the first call to any function or method within a framework. This includes function call setup, argument parsing, and other internal bookkeeping.

How to pre-trigger/warm up: the standard and recommended way to avoid the slow first inference is to perform a "warm-up" inference, i.e. run a dummy inference before you start your actual benchmarking or time-sensitive processing.
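The code block that followed was cut off in the quote; a minimal reconstruction of the warm-up pattern it describes (the model and input shape are placeholders):

```python
import torch
import torch.nn as nn

# Stand-in for a model loaded from a file, e.g. model = torch.load("model.pt")
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()  # inference mode: disables dropout, fixes batch-norm statistics

# Warm-up: one dummy forward pass triggers CUDA context setup (if on GPU),
# JIT compilation and memory allocation before any timed or real inference.
dummy_input = torch.randn(1, 16)
with torch.no_grad():
    _ = model(dummy_input)

# Subsequent inferences now run at steady-state speed.
```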
Key points:
- model.eval(): crucially, set your model to evaluation mode during both the warm-up and your actual inference. This disables features like dropout and batch normalization that behave differently during training and inference.
By following these practices, you can effectively eliminate the slow first inference and get more consistent and accurate performance measurements. The overhead of the warm-up is negligible compared to the benefits.
We could put a dummy eval in there. I think it is an optimisation, and I will leave it until we get the thing built.
Increasing the runner size does help: #975. We need to discuss whether this is the way we want to go.
Guard the torch import so we don't have import errors when torch is not there.
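A minimal sketch of the import-guard pattern that commit describes, with illustrative names (not codegate's actual module layout); callers check the flag before touching any torch API:

```python
# torch is only needed for training, so treat it as an optional dependency.
try:
    import torch  # noqa: F401
    TORCH_AVAILABLE = True
except ImportError:
    torch = None
    TORCH_AVAILABLE = False


def train_classifier(samples: list[str]) -> None:
    if not TORCH_AVAILABLE:
        raise RuntimeError(
            "Training requires torch; install it separately, it is not shipped in the image."
        )
    # ... training code that uses torch goes here ...
```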
I rewrote the whole model eval stuff to save the model in ONNX format and run the inference with onnxruntime. This makes the whole image smaller.
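Roughly the shape of that change, as a sketch (the tiny classifier, file name and input size are placeholders, not the actual codegate model): export once with torch at training time, then ship only the .onnx file and onnxruntime in the image:

```python
import numpy as np
import onnxruntime as ort
import torch
import torch.nn as nn

# --- training side: torch is only needed here, not in the shipped image ---
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()
dummy = torch.randn(1, 16)
torch.onnx.export(
    model, dummy, "classifier.onnx",
    input_names=["input"], output_names=["logits"],
)

# --- runtime side: only onnxruntime and numpy are required ---
session = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])
features = np.random.rand(1, 16).astype(np.float32)
logits = session.run(["logits"], {"input": features})[0]
print(logits)
```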
src/codegate/pipeline/suspicious_commands/suspicious_commands_trainer.py
Very nice!
This is a retry of #917, which was initially merged, caused build problems, and was reverted in #930.
This PR is to identify ways to fix the build space problem.