
server : (experimental) vision support via libmtmd #12898

Draft: wants to merge 9 commits into base: master

Conversation

@ngxson (Collaborator) commented Apr 11, 2025

Continuation of #12849

This is my first attempt to bring libmtmd into server.cpp. ONLY GEMMA 3 is supported right now.

The current goals of this PR are:

  • To see how libmtmd can be used in a context other than the CLI, so that I can adapt it progressively in upcoming PRs
  • To provide a place to test the integration of other vision models

There are still quite a lot of problems:

  • Many features are hard to make compatible, such as speculative decoding, context shifting, and slot cache save/load
  • Prompt caching is only half working for now:
    • Missing image hash comparison (to know whether we should remove the cached tokens of an image)
    • Sometimes we get a batch with 0 tokens, for example when the same prompt is entered twice (see the sketch just after this list)
  • Batched decoding is disabled for image embedding batches, which will degrade performance when multiple slots are used
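
Not part of the PR, but to make the zero-token point above concrete, here is a minimal sketch (the helper name decode_if_needed is hypothetical) of the kind of guard the server loop could use before calling llama_decode:

#include "llama.h"

// Sketch only, not the PR's actual code: when the whole prompt is already in
// the KV cache (e.g. the same prompt is sent twice), the batch built from the
// non-cached tokens is empty and llama_decode() should not be called on it.
static bool decode_if_needed(llama_context * ctx, llama_batch batch) {
    if (batch.n_tokens == 0) {
        // nothing new to evaluate; the caller can reuse the cached state
        return true;
    }
    // llama_decode returns 0 on success
    return llama_decode(ctx, batch) == 0;
}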

Implementation

The core idea of this implementation is to migrate the input from a std::vector<llama_token> to a std::vector<server_inp_chunk>.

An API called mtmd_input_chunk was introduced in #12849. The difference between mtmd_input_chunk and server_inp_chunk is that server_inp_chunk stores only one single token in the text case; in the image case, it stores a pointer to the mtmd_image_tokens:

struct server_inp_chunk {
    llama_token tok_text; // one single token, not a list of tokens
    mtmd_image_tokens_ptr tok_image;
};

The reason I did this is that keeping track of the KV cache this way seems easier (i.e. the code is easier to write). Here we mostly care about the individual text tokens; we never need to look into individual image tokens anyway. If an image is different from the last one in the cache, its embeddings will be completely different, so we simply throw away the whole image.
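
To illustrate why the chunk representation makes the cache logic easier to reason about, here is a rough sketch (the helper count_reusable_chunks is hypothetical, not code from this PR; it assumes the server_inp_chunk definition above):

#include <cstddef>
#include <vector>

// Count how many leading chunks of the cached prompt can be kept. Text chunks
// are compared token-by-token; image chunks are all-or-nothing, and since the
// image hash comparison mentioned above is not implemented yet, this sketch
// conservatively stops reusing at the first image chunk.
static size_t count_reusable_chunks(const std::vector<server_inp_chunk> & cached,
                                    const std::vector<server_inp_chunk> & incoming) {
    size_t n = 0;
    while (n < cached.size() && n < incoming.size()) {
        const server_inp_chunk & a = cached[n];
        const server_inp_chunk & b = incoming[n];
        if (a.tok_image || b.tok_image) {
            break; // cannot prove the images are identical without a hash
        }
        if (a.tok_text != b.tok_text) {
            break; // first mismatching text token
        }
        n++;
    }
    return n; // KV entries for these leading chunks can be kept
}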

As mtmd_image_tokens_ptr uses unique_ptr under the hood, a side effect of this change is that we now eliminate some copies when passing a task from one function to another, hence the many std::move calls added in this change.
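
A small illustration of that side effect (example types only, assuming the server_inp_chunk definition above): because mtmd_image_tokens_ptr is a std::unique_ptr under the hood, anything holding chunks is move-only, so passing it along means std::move rather than a copy.

#include <utility>
#include <vector>

// Example only: a task-like struct holding chunks can be moved but not copied,
// because std::unique_ptr (inside each chunk) is not copyable.
struct example_task {
    std::vector<server_inp_chunk> chunks;
};

static void push_task(std::vector<example_task> & queue, example_task task) {
    // ownership of the image tokens is transferred; no image token data is copied
    queue.push_back(std::move(task));
}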

TODOs

  • automatically deactivate certain features if vision is enabled; we will work on these features later
  • implement a hash function for images, to keep track of the cache (see the sketch after this list)
  • fix detokenize(server_inp_chunk)
  • add more error handling
  • maybe find a way to batch-decode image embeddings?
  • maybe support remote image_url in addition to base64
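
For the image hash TODO above, one possible approach (not something this PR has settled on) is a simple FNV-1a hash over the raw image bytes, stored next to the cached image chunk so a new request can cheaply check whether the cached image is the same one:

#include <cstddef>
#include <cstdint>

// Sketch of a 64-bit FNV-1a hash over the image bytes; any stable content hash
// would do, this is just a minimal example.
static uint64_t image_hash_fnv1a(const unsigned char * data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;           // FNV prime
    }
    return h;
}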

Demo

The server can be run with this command:

llama-server -hf ggml-org/gemma-3-4b-it-GGUF

Client code (ONLY base64 input is supported atm):

import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.1,
    stream=True,
    messages=[
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "describe what you see in details" },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                    },
                },
            ],
        }
    ],
)

for chunk in response:
    print(chunk.choices[0].delta.content, end="")

print("\n\n")

With the image:

[image: bliss.png]

This will output:

[image: screenshot of the model's streamed description]

@qnixsynapse (Collaborator) commented:

Awesome work. However, I noticed the model usually ignores the text prompt when the prompt is the first in the conversation!
[image: screenshot]

@ngxson (Collaborator, Author) commented Apr 12, 2025

@qnixsynapse can you capture the raw HTTP request? If the JSON payload is big, you can share it via a gist

@qnixsynapse (Collaborator) commented (content not shown)

@ngxson (Collaborator, Author) commented Apr 12, 2025

At minimum I ask for this: https://wiki.wireshark.org/hyper_text_transfer_protocol

not the raw IP packet

@qnixsynapse (Collaborator) commented Apr 12, 2025

I don't have Wireshark installed, unfortunately. But you can still inspect it, for example:

POST /v1/chat/completions HTTP/1.1
Host: localhost:8080
Authorization: Bearer -key
Content-Type: application/json
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Python/3.11 aiohttp/3.11.11
Content-Length: 615117

{"stream": true, "model": "Gemma", "messages": [{"role": "user", "content": [{"type": "text", "text": "Fact check the content in this image please."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,<base64 png data from line 88>"}}]}], "stream_options": {"include_usage": true}, "temperature": 1.0, "top_p": 0.9}

HTTP/1.1 200 OK
Keep-Alive: timeout=5, max=100
Content-Type: text/event-stream
Server: llama.cpp
Transfer-Encoding: chunked
Access-Control-Allow-Origin: 

@ngxson (Collaborator, Author) commented Apr 13, 2025

@qnixsynapse I had a problem with my logic which made it discard a text batch that comes before an image batch.

It should be fixed now, could you give it a try?

@ngxson (Collaborator, Author) commented Apr 13, 2025

Btw @ggerganov, I'm noting this here for visibility: while working on this PR, I realized that there are 2 refactorings which can be done in their own dedicated PRs:

  • The first one is quite simple: currently server_task is passed by copy in some places, and we need to add some std::move
  • The second one is a bit more tricky. Currently, we track everything using a std::vector<llama_token>. However, for multimodal, I introduced the notion of "input chunks" along with libmtmd. The server needs to be adapted to work with chunks of tokens / embeddings instead of a simple list of tokens.
    In the current PR, I'm kind of hacking this by having server_inp_chunk wrap around one single text token (so most of the text-related logic is unchanged). But obviously this brings some complications when dealing with both text + image chunks. Do you have any better ideas to handle this?

And I also have a question regarding the logic around batch_view. IIRC, this exists because sometimes the batch is too large for llama_decode to process, so we may want to reduce the input batch size dynamically. However, we also split the batch into ubatches internally, so I'm wondering if this logic is now obsolete.
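
For context, the batch_view logic in question looks roughly like the following (a paraphrased sketch, not a verbatim copy of server.cpp): the big batch is decoded in windows of at most n_batch tokens, where each window is a "view" that just offsets the pointers of the full llama_batch.

#include <algorithm>
#include "llama.h"

// Rough sketch of decoding a large batch in smaller views. On failure the
// caller could shrink n_batch and retry, which is the behaviour being
// questioned above.
static bool decode_in_views(llama_context * ctx, const llama_batch & batch, int32_t n_batch) {
    for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
        const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);

        llama_batch view = batch;  // shallow copy, then shift the pointers
        view.n_tokens = n_tokens;
        view.token    = batch.token ? batch.token + i : nullptr;
        // note: if batch.embd were used, the offset would be i * n_embd floats
        view.pos      = batch.pos      + i;
        view.n_seq_id = batch.n_seq_id + i;
        view.seq_id   = batch.seq_id   + i;
        view.logits   = batch.logits   + i;

        if (llama_decode(ctx, view) != 0) {
            return false;
        }
    }
    return true;
}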


Edit: optionally one more refactoring: we should split llama-server into different compilation units; currently it may take up to 20s to compile.

@qnixsynapse (Collaborator) commented Apr 14, 2025

@ngxson Can you please refresh this branch with master?

Nvm. Ended up using your fork .. working great!!! 👍

On further testing, it seems that the llama_batch size is sometimes exceeded across successive requests.

common/common.cpp:1161: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

@ggerganov (Member) commented:

> And I also have a question regarding the logic around batch_view. IIRC, this is because sometimes the batch is too large for llama_decode to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatch, so I'm wondering if this logic is now obsolete.

This was useful mainly before defragmentation support was added. The reason is that over time the KV cache can become highly fragmented, and even if it has capacity for n_tokens it won't be able to find a contiguous slot, so attempting to split the batch into smaller chunks was a way to work around this. With defragmentation enabled by default this is now rarely necessary. So yes, this should be simplified in a separate PR.

I'll think about the input chunk question today and let you know if I have any thoughts.

A collaborator left a review comment on the following change:

params.mmproj.hf_repo = params.model.hf_repo;
}
// TODO @ngxson : this will break non-vision model with -hf, need to fix before merging
common_params_handle_model(params.mmproj, params.hf_token, "", true);

@ngxson Is it possible to add a --no-offload-mmproj param here to keep the mmproj model on the CPU and the larger text model on the GPU?

We can use mtmd_context_params with use_gpu = false to keep the projector model on the CPU itself:

struct mtmd_context_params {
    bool use_gpu = true;
    bool print_timings = true;
    int n_threads = 4;
    enum ggml_log_level verbosity = GGML_LOG_LEVEL_INFO;
    const char * image_marker = "<__image__>";
};

It will be useful where the GPU VRAM is limited.
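
A minimal sketch of that idea (the --no-offload-mmproj flag itself is hypothetical and not part of this PR): the server would simply flip use_gpu on the params it hands to libmtmd when creating the multimodal context.

// Assuming the mtmd_context_params definition above, the flag would map to:
mtmd_context_params mparams;    // defaults: use_gpu = true, n_threads = 4, ...
mparams.use_gpu = false;        // keep the mmproj / projector weights on the CPU
// ... then pass mparams to whatever libmtmd init call the server uses, so that
// only the text model is offloaded to the GPU.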

@Beinsezii commented:

Seems like the batch decoding dies when you send a variety of longer requests.

common/common.cpp:1159: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

Easiest way to trigger it is to just wiggle the sequence length around, like with the example code:

import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

for mult in [100, 200]:  # (beinsezii) make sure it has to rebuild some cache the 2nd time
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        stream=True,
        messages=[
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "describe what you see in details\n" * mult },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                        },
                    },
                ],
            }
        ],
    )

    for chunk in response:
        print(chunk.choices[0].delta.content, end="")

    print("\n\n")
