
Codegate 844 #931

Merged
merged 34 commits from codegate-844 into main on Feb 12, 2025
Conversation

therealnb
Contributor

This is a retry of #917, which was initially merged but caused build problems and was reverted in #930.

This PR is to identify ways to fix the build space problem.

@aponcedeleonch
Contributor

My guess is that Pytorch is trying to install some CUDA dependencies and that is causing the build to fail. That's my assumption based on the following build error:

Error: buildx failed with: ERROR: failed to solve: rpc error: code = Unknown desc = write /usr/local/lib/python3.12/site-packages/nvidia/cudnn/lib/libcudnn_engines_precompiled.so.9: no space left on device

CodeGate is meant to run as a container, i.e. most probably we won't have access to GPUs at runtime, hence all the CUDA-related stuff is overhead. I would try to install Pytorch without CUDA dependencies. I haven't tested the solutions proposed in this SO post, but they seem like they could work.

Suggestion from the post: include this in the .toml file:

[tool.poetry.dependencies]
torch = { url = "https://download.pytorch.org/whl/torch_stable.html/cpu/torch-1.3.1%2Bcpu-cp36-cp36m-win_amd64.whl" }

@therealnb
Contributor Author

(codegate-py3.12) nigels-MacBook-Pro:codegate nigel$ poetry lock
Resolving dependencies... (0.3s)

403 Client Error: Forbidden for url: https://download.pytorch.org/whl/torch_stable.html/cpu/torch-2.6.0%2Bcpu-cp36-cp36m-win_amd64.whl

(codegate-py3.12) nigels-MacBook-Pro:codegate nigel$ poetry lock
Resolving dependencies... (0.3s)

403 Client Error: Forbidden for url: https://download.pytorch.org/whl/torch_stable.html/cpu/torch-1.3.1%2Bcpu-cp36-cp36m-win_amd64.whl

@therealnb
Contributor Author

I tried it a different way, manually installing torch with

    - name: Install torch
      run: . `poetry env info --path`/bin/activate; pip install --upgrade torch==2.6.0 -f https://download.pytorch.org/whl/cpu

and taking it out of the toml file. I got this https://github.com/stacklok/codegate/actions/runs/13185369848/job/36806090206?pr=931

It still ran out of disk space.

Any more ideas, @aponcedeleonch ?

@therealnb
Contributor Author

This still ran out of space, even without the model file: https://github.com/stacklok/codegate/actions/runs/13163734423/job/36738711373?pr=931

@aponcedeleonch
Contributor

@therealnb I did a quick test on a PR of just installing Pytorch. I wanted to test whether the culprit was installing Pytorch or something else. It turns out the error does indeed come just from installing Pytorch. See PR #975.

I found this other SO post that has some suggestions on how to free up disk space during the GitHub action; see the accepted answer. Maybe doing something like that can help us. Either that, or increasing the runner disk space (I don't know if that's possible).

@aponcedeleonch
Contributor

aponcedeleonch commented Feb 7, 2025

Not related to the image and Pytorch install problem. I just remembered you mentioned that the first run of the model was slow because the weights needed to be loaded. I was thinking that you could load the weights at server startup so that you already have them when you run the model. There is other stuff that gets initialized or done at server startup, e.g. DB migrations.

This is the function used to startup the server:

def serve( # noqa: C901
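A minimal sketch of that idea, assuming a hypothetical weights path and helper function (not CodeGate's actual startup code):

Python

import torch

# Sketch only: the path and helper below are placeholders, not CodeGate code.
_model = None

def _load_model_at_startup(path="model_weights.pth"):
    """Load the weights once at startup so requests don't pay the loading cost."""
    global _model
    _model = torch.load(path)
    _model.eval()

def serve():  # noqa: C901
    # ... existing startup work (config, DB migrations, ...) ...
    _load_model_at_startup()
    # ... start the web server ...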

@therealnb
Contributor Author

@therealnb I did a quick test on a PR of just installing Pytorch. I wanted to test whether the culprit was installing Pytorch or something else. It turns out the error does indeed come just from installing Pytorch. See PR #975.

I tried the reduced Pytorch in a few ways (long chat with Claude). I think it is just size in general, nothing Pytorch-specific.

I found this other SO post that has some suggestions on how to free up disk space during the GitHub action; see the accepted answer. Maybe doing something like that can help us. Either that, or increasing the runner disk space (I don't know if that's possible).

I wonder if there is some caching that we could kill. I think the decision is to use a bigger runner (@lukehinds @ptelang). If we squeeze this into the current runner, something else will go over the limit.

@therealnb
Contributor Author

Not related to the image and Pytorch install problem. I just remembered you mentioned that the first run of the model was slow because the weights needed to be loaded. I was thinking that you could load the weights at server startup so that you already have them when you run the model. There is other stuff that gets initialized or done at server startup, e.g. DB migrations.

Let's ask Gemini...

When I use pytorch and load an ANN from a file, the first inference is slower than the rest. Is there further internal initialisation that needs to happen?...

You're observing a common phenomenon with PyTorch (and other deep learning frameworks). The first inference after loading a model often takes longer due to various initialization steps that happen behind the scenes. It's not exactly "further internal initialization" in the sense of model parameters being altered, but rather the setup of the execution environment.

Here's a breakdown of the reasons and how to address them:

Reasons for Slow First Inference:

CUDA Context Initialization (if using GPU): If you're using a GPU, the first CUDA operation (which inference is) triggers the initialization of the CUDA context. This involves loading necessary drivers, setting up communication with the GPU, and allocating memory. This is a one-time cost per process.

Just-In-Time (JIT) Compilation (for some operations): PyTorch sometimes uses JIT compilation for certain operations, especially if you're using features like torch.jit.script or tracing. The first time a particular operation is encountered, it might need to be compiled, adding to the initial overhead. Subsequent calls will use the compiled version.

Memory Allocation and Initialization: Even without explicit CUDA context initialization, memory needs to be allocated for the model's weights, activations, and other intermediate values. This can take a small amount of time, especially for larger models.

Framework Overhead: There's always some inherent overhead in the first call to any function or method within a framework. This includes function call setup, argument parsing, and other internal bookkeeping.

How to Pre-trigger/Warm Up:

The standard and recommended way to avoid this slow first inference is to perform a "warm-up" inference. This means running a dummy inference before you start your actual benchmarking or time-sensitive processing.

Python

import time

import torch

# ... load your model ...
model = torch.load("your_model.pth")  # placeholder: path to your saved model
model.eval()  # Important: set to evaluation mode

# Create dummy input data (matching your model's input shape)
input_shape = (3, 224, 224)  # placeholder: replace with your model's input dimensions
dummy_input = torch.randn(1, *input_shape)  # batch size 1

# Warm-up inference
with torch.no_grad():  # Important: disable gradients during warm-up
    _ = model(dummy_input)

# Now, time your actual inference calls
real_input = torch.randn(1, *input_shape)  # placeholder: replace with your real data
start_time = time.time()
with torch.no_grad():
    result = model(real_input)
end_time = time.time()
print(f"Inference time: {end_time - start_time}")

Key Points:

model.eval(): Crucially, set your model to evaluation mode (model.eval()) during both the warm-up and your actual inference. This disables features like dropout and batch normalization that behave differently during training and inference.
torch.no_grad(): Use torch.no_grad() to prevent the calculation of gradients during the warm-up and inference. This saves memory and speeds up the process.
Dummy Input: Create a dummy input tensor that has the correct shape and data type for your model. The actual values don't matter much for the warm-up.
Warm-up Iterations: For very critical applications, you could even perform a few warm-up iterations. This can be especially helpful if you suspect JIT compilation is involved.
CUDA Synchronization (if using GPU): If you're using CUDA, you might want to add torch.cuda.synchronize() after the warm-up inference to ensure that all CUDA operations are completed before you start timing:
Python

if torch.cuda.is_available():
    torch.cuda.synchronize()

By following these practices, you can effectively eliminate the slow first inference and get more consistent and accurate performance measurements. The overhead of the warm-up is negligible compared to the benefits.

This is the function used to startup the server:

def serve( # noqa: C901

We could put a dummy eval in there. I think it is an optimisation and will leave it until we get the thing built.
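Something like this, roughly (placeholder names and shapes, not the actual model code):

Python

import torch

def warm_up(model, input_shape=(1, 768)):  # placeholder shape for the model input
    """Run one throwaway inference so the first real request is not slow."""
    model.eval()
    with torch.no_grad():
        model(torch.randn(*input_shape))

# In serve(), after the model has been loaded:
#     warm_up(model)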

@aponcedeleonch
Contributor

Increasing the runner size does help: #975. We need to discuss if this is the way we want to go.

@therealnb
Contributor Author

I rewrote the whole model eval stuff to save in ONNX format and run the inference in onnxruntime. This makes the whole image smaller.
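Roughly, the change looks like the sketch below; the function names, input shape, and paths are placeholders rather than the code in this PR.

Python

import numpy as np
import onnxruntime as ort
import torch

def export_to_onnx(model, input_shape=(1, 768), path="model.onnx"):
    """Export the trained torch model once, ahead of time (placeholder names)."""
    model.eval()
    dummy_input = torch.randn(*input_shape)
    torch.onnx.export(model, dummy_input, path)

def run_inference(features, path="model.onnx"):
    """Run inference with onnxruntime only, so torch is not needed at runtime."""
    session = ort.InferenceSession(path)
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: features.astype(np.float32)})
    return outputs[0]

The idea is that the runtime image only needs onnxruntime (and numpy) rather than torch, which is where the size saving comes from.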

aponcedeleonch previously approved these changes Feb 12, 2025

@aponcedeleonch left a comment

Very nice!

@therealnb therealnb merged commit b1d055f into main Feb 12, 2025
10 checks passed
@therealnb therealnb deleted the codegate-844 branch February 12, 2025 14:31