
server : (experimental) vision support via libmtmd #12898

Draft: wants to merge 9 commits into base: master

Conversation

@ngxson (Collaborator) commented Apr 11, 2025

Continuation of #12849

This is my first attempt to bring libmtmd into server.cpp. ONLY GEMMA 3 is supported right now.

The current goals of this PR are:

  • To see how libmtmd can be used in a context other than the CLI, so that I can adapt it progressively in upcoming PRs
  • To provide a place to test the integration of other vision models

There are still quite a lot of problems:

  • Many features are hard to make compatible, such as speculative decoding, context shifting, and slot cache save/load
  • Prompt caching is only half working for now:
    • Missing image hash comparison (to know whether we should remove the cached tokens of an image)
    • Sometimes we get a batch with 0 tokens, for example when the same prompt is entered twice (see the sketch just after this list)
  • Batched decoding is disabled for image embedding batches, which will degrade performance when multiple slots are used
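
Not part of the PR, but to make the zero-token point above concrete, here is a minimal sketch (the helper name decode_if_needed is hypothetical) of the kind of guard the server loop could use before calling llama_decode:

#include "llama.h"

// Sketch only, not the PR's actual code: when the whole prompt is already in
// the KV cache (e.g. the same prompt is sent twice), the batch built from the
// non-cached tokens is empty and llama_decode() should not be called on it.
static bool decode_if_needed(llama_context * ctx, llama_batch batch) {
    if (batch.n_tokens == 0) {
        // nothing new to evaluate; the caller can reuse the cached state
        return true;
    }
    // llama_decode returns 0 on success
    return llama_decode(ctx, batch) == 0;
}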

Implementation

The core idea of this implementation is to migrate the input from a std::vector<llama_token> to a std::vector<server_inp_chunk>.

An API called mtmd_input_chunk was introduced in #12849. The difference between mtmd_input_chunk and server_inp_chunk is that server_inp_chunk stores only one single token in the text case; in the image case, it stores a pointer to the mtmd_image_tokens:

struct server_inp_chunk {
    llama_token tok_text; // one single token, not a list of tokens
    mtmd_image_tokens_ptr tok_image;
};

The reason I did this is that keeping track of the KV cache this way seems easier (i.e. the code is easier to write). Here we mostly care about the individual text tokens; we never need to look into individual image tokens anyway. If an image is different from the last one in the cache, its embeddings will be completely different, so we simply throw away the whole image.
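
To illustrate why the chunk representation makes the cache logic easier to reason about, here is a rough sketch (the helper count_reusable_chunks is hypothetical, not code from this PR; it assumes the server_inp_chunk definition above):

#include <cstddef>
#include <vector>

// Count how many leading chunks of the cached prompt can be kept. Text chunks
// are compared token-by-token; image chunks are all-or-nothing, and since the
// image hash comparison mentioned above is not implemented yet, this sketch
// conservatively stops reusing at the first image chunk.
static size_t count_reusable_chunks(const std::vector<server_inp_chunk> & cached,
                                    const std::vector<server_inp_chunk> & incoming) {
    size_t n = 0;
    while (n < cached.size() && n < incoming.size()) {
        const server_inp_chunk & a = cached[n];
        const server_inp_chunk & b = incoming[n];
        if (a.tok_image || b.tok_image) {
            break; // cannot prove the images are identical without a hash
        }
        if (a.tok_text != b.tok_text) {
            break; // first mismatching text token
        }
        n++;
    }
    return n; // KV entries for these leading chunks can be kept
}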

As mtmd_image_tokens_ptr uses unique_ptr under the hood, a side effect of this change is that we now eliminate some copies when passing a task from one function to another, hence the many std::move calls added in this change.
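
A small illustration of that side effect (example types only, assuming the server_inp_chunk definition above): because mtmd_image_tokens_ptr is a std::unique_ptr under the hood, anything holding chunks is move-only, so passing it along means std::move rather than a copy.

#include <utility>
#include <vector>

// Example only: a task-like struct holding chunks can be moved but not copied,
// because std::unique_ptr (inside each chunk) is not copyable.
struct example_task {
    std::vector<server_inp_chunk> chunks;
};

static void push_task(std::vector<example_task> & queue, example_task task) {
    // ownership of the image tokens is transferred; no image token data is copied
    queue.push_back(std::move(task));
}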

TODOs

  • automatically deactivate certain features if vision is enabled; we will work on these features later
  • implement a hash function for images, to keep track of the cache (see the sketch after this list)
  • fix detokenize(server_inp_chunk)
  • add more error handling
  • maybe find a way to batch-decode image embeddings?
  • maybe support remote image_url in addition to base64
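
For the image hash TODO above, one possible approach (not something this PR has settled on) is a simple FNV-1a hash over the raw image bytes, stored next to the cached image chunk so a new request can cheaply check whether the cached image is the same one:

#include <cstddef>
#include <cstdint>

// Sketch of a 64-bit FNV-1a hash over the image bytes; any stable content hash
// would do, this is just a minimal example.
static uint64_t image_hash_fnv1a(const unsigned char * data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;           // FNV prime
    }
    return h;
}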

Demo

The server can be run with this command:

llama-server -hf ggml-org/gemma-3-4b-it-GGUF

Client code (ONLY base64 input is supported atm):

import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.1,
    stream=True,
    messages=[
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "describe what you see in details" },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                    },
                },
            ],
        }
    ],
)

for chunk in response:
    print(chunk.choices[0].delta.content, end="")

print("\n\n")

With the image:

[image: bliss.png]

This will output:

[image: screenshot of the model's streamed description]

@qnixsynapse (Collaborator) commented:

Awesome work. However, I noticed the model usually ignores the text prompt when the prompt is the first in the conversation!
[image: screenshot]

@ngxson (Collaborator, Author) commented Apr 12, 2025

@qnixsynapse can you capture the raw HTTP request? If the JSON payload is big, you can share it via a gist

@qnixsynapse (Collaborator) commented (content not shown)

@ngxson (Collaborator, Author) commented Apr 12, 2025

At minimum I ask for this: https://wiki.wireshark.org/hyper_text_transfer_protocol

not the raw IP packet

@qnixsynapse (Collaborator) commented Apr 12, 2025

I don't have Wireshark installed, unfortunately. But you can still inspect it, for example:

POST /v1/chat/completions HTTP/1.1
Host: localhost:8080
Authorization: Bearer -key
Content-Type: application/json
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Python/3.11 aiohttp/3.11.11
Content-Length: 615117

{"stream": true, "model": "Gemma", "messages": [{"role": "user", "content": [{"type": "text", "text": "Fact check the content in this image please."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,<base64 png data from line 88>"}}]}], "stream_options": {"include_usage": true}, "temperature": 1.0, "top_p": 0.9}

HTTP/1.1 200 OK
Keep-Alive: timeout=5, max=100
Content-Type: text/event-stream
Server: llama.cpp
Transfer-Encoding: chunked
Access-Control-Allow-Origin: 

@ngxson (Collaborator, Author) commented Apr 13, 2025

@qnixsynapse I had a problem with my logic which made it discard a text batch that comes before an image batch.

It should be fixed now, could you give it a try?

@ngxson (Collaborator, Author) commented Apr 13, 2025

Btw @ggerganov, I'm noting this here for visibility: while working on this PR, I realized that there are 2 refactorings which can be done in their own dedicated PRs:

  • The first one is quite simple: currently server_task is passed by copy in some places, and we need to add some std::move
  • The second one is a bit more tricky. Currently, we track everything using a std::vector<llama_token>. However, for multimodal, I introduced the notion of "input chunks" along with libmtmd. The server needs to be adapted to work with chunks of tokens / embeddings instead of a simple list of tokens.
    In the current PR, I'm kind of hacking this by having server_inp_chunk wrap around one single text token (so most of the text-related logic is unchanged). But obviously this brings some complications when dealing with both text + image chunks. Do you have any better ideas to handle this?

And I also have a question regarding the logic around batch_view. IIRC, this exists because sometimes the batch is too large for llama_decode to process, so we may want to reduce the input batch size dynamically. However, we also split the batch into ubatches internally, so I'm wondering if this logic is now obsolete.
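
For context, the batch_view logic in question looks roughly like the following (a paraphrased sketch, not a verbatim copy of server.cpp): the big batch is decoded in windows of at most n_batch tokens, where each window is a "view" that just offsets the pointers of the full llama_batch.

#include <algorithm>
#include "llama.h"

// Rough sketch of decoding a large batch in smaller views. On failure the
// caller could shrink n_batch and retry, which is the behaviour being
// questioned above.
static bool decode_in_views(llama_context * ctx, const llama_batch & batch, int32_t n_batch) {
    for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
        const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);

        llama_batch view = batch;  // shallow copy, then shift the pointers
        view.n_tokens = n_tokens;
        view.token    = batch.token ? batch.token + i : nullptr;
        // note: if batch.embd were used, the offset would be i * n_embd floats
        view.pos      = batch.pos      + i;
        view.n_seq_id = batch.n_seq_id + i;
        view.seq_id   = batch.seq_id   + i;
        view.logits   = batch.logits   + i;

        if (llama_decode(ctx, view) != 0) {
            return false;
        }
    }
    return true;
}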


Edit: optionally one more refactoring: we should split llama-server into different compilation units; currently it may take up to 20s to compile.

@qnixsynapse (Collaborator) commented Apr 14, 2025

@ngxson Can you please refresh this branch with master?

Nvm. Ended up using your fork .. working great!!! 👍

On further testing, it seems that the llama_batch size is sometimes exceeded across successive requests.

common/common.cpp:1161: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

@ggerganov (Member) commented:

> And I also have a question regarding the logic around batch_view. IIRC, this is because sometimes the batch is too large for llama_decode to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatch, so I'm wondering if this logic is now obsolete.

This was useful mainly before defragmentation support was added. The reason is that over time the KV cache can become highly fragmented, and even if it has capacity for n_tokens it won't be able to find a contiguous slot, so attempting to split the batch into smaller chunks was a way to work around this. With defragmentation enabled by default this is now rarely necessary. So yes, this should be simplified in a separate PR.

I'll think about the input chunk question today and let you know if I have any thoughts.

A collaborator left a review comment on the following change:

params.mmproj.hf_repo = params.model.hf_repo;
}
// TODO @ngxson : this will break non-vision model with -hf, need to fix before merging
common_params_handle_model(params.mmproj, params.hf_token, "", true);

@ngxson Is it possible to add a --no-offload-mmproj param here to keep the mmproj model on the CPU and the larger text model on the GPU?

We can use mtmd_context_params with use_gpu = false to keep the projector model on the CPU itself:

struct mtmd_context_params {
    bool use_gpu = true;
    bool print_timings = true;
    int n_threads = 4;
    enum ggml_log_level verbosity = GGML_LOG_LEVEL_INFO;
    const char * image_marker = "<__image__>";
};

It will be useful where the GPU VRAM is limited.
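
A minimal sketch of that idea (the --no-offload-mmproj flag itself is hypothetical and not part of this PR): the server would simply flip use_gpu on the params it hands to libmtmd when creating the multimodal context.

// Assuming the mtmd_context_params definition above, the flag would map to:
mtmd_context_params mparams;    // defaults: use_gpu = true, n_threads = 4, ...
mparams.use_gpu = false;        // keep the mmproj / projector weights on the CPU
// ... then pass mparams to whatever libmtmd init call the server uses, so that
// only the text model is offloaded to the GPU.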

@Beinsezii commented:

Seems like the batch decoding dies when you send a variety of longer requests.

common/common.cpp:1159: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

Easiest way to trigger it is to just wiggle the sequence length around, like with the example code:

import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

for mult in [100, 200]:  # (beinsezii) make sure it has to rebuild some cache the 2nd time
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        stream=True,
        messages=[
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "describe what you see in details\n" * mult },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                        },
                    },
                ],
            }
        ],
    )

    for chunk in response:
        print(chunk.choices[0].delta.content, end="")

    print("\n\n")
