server : (experimental) vision support via libmtmd #12898
Conversation
@qnixsynapse can you capture the raw http request? If the json payload is big, you can share it via a gist
@ngxson Will this be okay? https://gist.github.com/qnixsynapse/a4c61368d05180d3cb6c00f1baedf92c
At minimum I'm asking for this https://wiki.wireshark.org/hyper_text_transfer_protocol , not the raw IP packet
I don't have wireshark installed, unfortunately. But you can still inspect it, for example:
@qnixsynapse I had a problem with my logic, which made it discard the text batch that comes before the image batch. It should be fixed now, could you give it a try?
Btw @ggerganov I'm noting here for visibility: while working on this PR, I realized there are 2 refactorings which can be done in their own dedicated PRs:
And I also have a question regarding the logic around … Edit: optionally one more refactoring, we should split …
@ngxson Nvm, ended up using your fork. On further testing, it seems that llama_batch_size is sometimes exceeded in successive requests.
This was useful mainly before the defragmentation support was added. The reason is that over time the KV cache can become highly fragmented, and even if it has capacity for …
I'll think about the input chunk question today and let you know if I have any thoughts.
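As a standalone illustration of the fragmentation point (this is a toy model of the situation described above, not llama.cpp code): the cache can have enough free cells in total while no contiguous run is long enough to place the batch.

```cpp
// Toy illustration of KV-cache fragmentation: total free capacity is sufficient,
// but the longest contiguous free run is shorter than the batch we want to place.
#include <cstdio>
#include <vector>

// Returns the length of the longest contiguous run of free cells.
static int longest_free_run(const std::vector<bool> & used) {
    int best = 0, cur = 0;
    for (bool u : used) {
        cur = u ? 0 : cur + 1;
        if (cur > best) best = cur;
    }
    return best;
}

int main() {
    // 8-cell cache where every other cell is occupied: 4 cells are free in total,
    // but the longest contiguous free run is only 1 cell.
    const std::vector<bool> used = {true, false, true, false, true, false, true, false};

    const int n_tokens = 3; // batch we would like to place contiguously
    int free_total = 0;
    for (bool u : used) free_total += u ? 0 : 1;

    std::printf("free cells: %d, longest contiguous run: %d, batch size: %d\n",
                free_total, longest_free_run(used), n_tokens);
    // Without defragmentation, this batch cannot be placed even though capacity exists.
    return 0;
}
```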
```cpp
    params.mmproj.hf_repo = params.model.hf_repo;
}
// TODO @ngxson : this will break non-vision model with -hf, need to fix before merging
common_params_handle_model(params.mmproj, params.hf_token, "", true);
```
@ngxson Is it possible to add a `--no-offload-mmproj` param here to keep the mmproj model on the CPU and the larger text model on the GPU? We can use `mtmd_context_params::use_gpu = false` to keep the projector model on the CPU itself.
llama.cpp/examples/llava/mtmd.h, lines 52 to 58 at 0019279:
```cpp
struct mtmd_context_params {
    bool use_gpu = true;
    bool print_timings = true;
    int n_threads = 4;
    enum ggml_log_level verbosity = GGML_LOG_LEVEL_INFO;
    const char * image_marker = "<__image__>";
};
```
This would be useful where GPU VRAM is limited.
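As a rough sketch only (none of this is code from the PR): a hypothetical `--no-offload-mmproj` flag could simply be mapped onto `mtmd_context_params::use_gpu`. The `server_vision_params` struct and `parse_vision_args` helper below are made up for illustration; only the `use_gpu` field shown above comes from `mtmd.h`.

```cpp
// Sketch: wiring a hypothetical --no-offload-mmproj flag to mtmd_context_params::use_gpu.
// server_vision_params and parse_vision_args are illustrative, not real llama.cpp code.
#include <cstring>

struct server_vision_params {
    bool mmproj_use_gpu = true; // flipped to false by --no-offload-mmproj
};

static void parse_vision_args(int argc, char ** argv, server_vision_params & p) {
    for (int i = 1; i < argc; i++) {
        if (std::strcmp(argv[i], "--no-offload-mmproj") == 0) {
            p.mmproj_use_gpu = false; // keep the projector on the CPU; the text model stays on the GPU
        }
    }
}

int main(int argc, char ** argv) {
    server_vision_params p;
    parse_vision_args(argc, argv, p);

    // When creating the mtmd context, the flag would be forwarded roughly as:
    //   mtmd_context_params mparams;
    //   mparams.use_gpu = p.mmproj_use_gpu;
    //   ... pass mparams to the mtmd init call ...
    return p.mmproj_use_gpu ? 0 : 1;
}
```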
Seems like the batch decoding dies when you send a variety of longer requests.
The easiest way to trigger it is to just wiggle the sequence length around, like with this example code:

```python
import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

for mult in [100, 200]:  # (beinsezii) make sure it has to rebuild some cache the 2nd time
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        stream=True,
        messages=[
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "describe what you see in details\n" * mult },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                        },
                    },
                ],
            }
        ],
    )
    for chunk in response:
        print(chunk.choices[0].delta.content, end="")
    print("\n\n")
```
Continuation of #12849
This is my first trial to bring `libmtmd` to `server.cpp`. ONLY GEMMA 3 is supported right now.

The current goals of this PR are:

- `libmtmd` can be used in a different context than CLI, so that I can adapt it progressively in upcoming PRs

There are still quite a lot of problems:
Implementation
The core idea of this implementation is to migrate the input from using a `std::vector<llama_token>` to a `std::vector<server_inp_chunk>`.

An API called `mtmd_input_chunk` was introduced in #12849. The difference between `mtmd_input_chunk` and `server_inp_chunk` is that `server_inp_chunk` only stores a single token in the case of text; in the case of an image, it stores a pointer to the `mtmd_image_tokens`.

The reason I did this is that keeping track of the KV cache this way seems easier (i.e. the code is easier to write). Here we mostly care about the individual tokens; we never need to look into individual image tokens anyway: if an image differs from the last one in the cache, its embeddings will be completely different, so we simply throw away the whole image.
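To make this concrete, here is a rough, self-contained sketch of what such a chunk type could look like. The names and the pointer alias are illustrative guesses, not the PR's exact definitions, and `mtmd_image_tokens` is stubbed out only so the example compiles.

```cpp
// Illustrative sketch of a per-token input chunk: one text token OR a pointer to
// the image tokens produced by libmtmd (stubbed here). Requires C++17.
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

typedef int32_t llama_token;                        // stand-in for the llama.cpp typedef
struct mtmd_image_tokens { /* opaque in libmtmd */ };
using mtmd_image_tokens_ptr = std::unique_ptr<mtmd_image_tokens>;

struct server_inp_chunk {
    // Text case: exactly one token per chunk, which keeps per-token KV tracking simple.
    llama_token tok_text = 0;
    // Image case: owning pointer to the image tokens; if the image differs from the one
    // in the cache, the whole chunk (and its KV entries) is simply thrown away.
    mtmd_image_tokens_ptr tok_image;

    bool is_image() const { return tok_image != nullptr; }
};

int main() {
    std::vector<server_inp_chunk> prompt;

    server_inp_chunk txt;
    txt.tok_text = 1;                                       // a text chunk holds a single token
    prompt.push_back(std::move(txt));

    server_inp_chunk img;
    img.tok_image = std::make_unique<mtmd_image_tokens>();  // an image chunk holds the token pointer
    prompt.push_back(std::move(img));                       // move-only because of unique_ptr

    return prompt.size() == 2 ? 0 : 1;
}
```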
As `mtmd_image_tokens_ptr` uses `unique_ptr` under the hood, a side effect of this change is that we now eliminate some copies when passing the `task` from one function to another, hence the many `std::move` calls added in this change.
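A minimal illustration of that point (the type names below are made up, not the PR's actual ones): once a task owns its image tokens through a `unique_ptr`, the task itself becomes move-only, so every hand-off between functions needs an explicit `std::move`.

```cpp
// Why a unique_ptr member forces std::move: the owning struct cannot be copied,
// so tasks are handed from function to function by transferring ownership.
#include <memory>
#include <utility>
#include <vector>

struct mtmd_image_tokens { /* opaque image token data */ };
using mtmd_image_tokens_ptr = std::unique_ptr<mtmd_image_tokens>;

struct server_task {
    mtmd_image_tokens_ptr image; // non-copyable member => server_task is move-only
};

// Hypothetical queue: taking the task by value forces the caller to hand over ownership.
static void post_task(std::vector<server_task> & queue, server_task task) {
    queue.push_back(std::move(task)); // transfer ownership; no deep copy of the image tokens
}

int main() {
    std::vector<server_task> queue;
    server_task task;
    task.image = std::make_unique<mtmd_image_tokens>();

    // post_task(queue, task);          // would not compile: the copy constructor is deleted
    post_task(queue, std::move(task));  // ok: the task is moved from one function to another
    return queue.size() == 1 ? 0 : 1;
}
```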
TODOs

- support `image_url` input in addition to `base64`
Demo
The server can be run with this command:
Client code, ONLY `base64` input is supported atm:

With the image:
This will output: