Context window for LLM #196
I just deployed the model on my iPhone, and currently I can only ask one question at a time and can't ask follow-up questions. Is there a way to keep the context by implementing a context window and keep asking follow-up questions? Thanks
Yes, there are two parts to it. The first is that you need to manage the stream of messages. These are typically going to have a role and content and represent the back and forth. The second is to manage the KVCache -- this isn't strictly required, but the performance gain is considerable: if you were to pass in and maintain the KVCache, you would be able to save the past work. This is beyond what the example code does, but the pieces are certainly there to make your own application that does this.
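For the first part, a minimal sketch of what managing that message stream might look like (the `ChatMessage` and `ChatHistory` types below are illustrative, not types from MLXLMCommon):

```swift
// Illustrative only: a tiny container for the role/content back and forth.
struct ChatMessage {
    enum Role: String { case system, user, assistant }
    let role: Role
    let content: String
}

struct ChatHistory {
    var messages: [ChatMessage] = []

    mutating func add(_ role: ChatMessage.Role, _ content: String) {
        messages.append(ChatMessage(role: role, content: content))
    }

    /// The [[String: String]] shape commonly fed to a chat template.
    var asDictionaries: [[String: String]] {
        messages.map { ["role": $0.role.rawValue, "content": $0.content] }
    }
}

// Usage: append every turn and feed the whole history into the next prompt.
var history = ChatHistory()
history.add(.user, "What is the capital of France?")
history.add(.assistant, "Paris.")
history.add(.user, "And roughly how many people live there?")  // follow-up sees prior turns
```

Whether you render this history through the model's chat template or flatten it into a single prompt string is up to your application; the important part is that every follow-up request includes the earlier turns.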
Thanks! I'll give it a go.
I took a look at implementing KVCache but wasn't able to connect the dots. I see there is a prepare() extension for LLMModel that can be passed a KVCache, but it's not clear to me how I can use it with LLMModelFactory. Or do I have to create my own model like the example here: https://github.com/ml-explore/mlx-swift-examples/tree/main/Libraries/MLXLLM? Any pointers or tips to get me started would be appreciated!
@3DTOPO have you implemented the context window for the first part? I tried to code it with the help of ChatGPT. I manage to retain the history inside formattedHistory and encode it into inputTokens, but for some reason, just before the model generates the answer, the memory is cleared. I'm still debugging this part.
I hadn't gotten that far yet; I thought I'd ask for some advice first. If you want to share the relevant code, I can try to figure out why the memory is cleared.
Ohh, I only edited the UserInput and Evaluate.swift files, and I just posted the code here: https://github.com/johnZYW/LLMplayground. Basically I have a formatted history that appends each user prompt and response and feeds it as input before the next response is generated.
Cool. I'll take a look at it tomorrow - just about time for bed. I'll let you know if I get it working.
In case it's useful, here is a reference implementation in the Python version of MLX LM.
Thanks for the reference. It doesn't really help me though, because the Swift generate functions don't have any way to pass the cache like the example you provided does.
Sure, you can pass the cache in yourself instead of letting it default to a new one each time. If this API proves inadequate, then a PR to fix it would be appreciated!
I did see that a cache could be passed to the tokenizer, but that is different from the Python example you provided, where it is passed with the generate call. And the ModelFactory creates the tokenizer from the model container.
This, from the Python reference:

```python
# Make the initial prompt cache for the model
prompt_cache = make_prompt_cache(model)
```

is the same as:

```swift
let cache = model.newCache(parameters: parameters)
```

and this call:

```python
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)
```

is roughly the same as https://github.com/ml-explore/mlx-swift-examples/tree/main/Libraries/MLXLMCommon#using-a-model:

```swift
let response = try MLXLMCommon.generate(
    input: input, parameters: generateParameters, context: context
) { tokens in .more }
```

The example code shows how you might print the tokens as they are generated, but if you just want to generate the string in one shot, this would do it. Perhaps the confusion is the disconnect between the TokenIterator and generate() -- it isn't obvious what to do with the KVCache, since it hasn't been used this way in the example code. Consider these two generate() functions. The first is the one being called here, and it constructs the TokenIterator internally:

```swift
public func generate(
    input: LMInput, parameters: GenerateParameters, context: ModelContext,
    didGenerate: ([Int]) -> GenerateDisposition
) throws -> GenerateResult {
    let iterator = try TokenIterator(
        input: input, model: context.model, parameters: parameters)
    return generate(
        input: input, context: context, iterator: iterator, didGenerate: didGenerate)
}
```

The second is the generate(input:context:iterator:didGenerate:) called on the last line -- it takes an iterator you construct yourself. If you were to add a TokenIterator built with your own cache and pass it to that second generate(), you could keep the cache, and the past work it represents, between calls.
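To make that concrete, here is a rough sketch of a session object that keeps the cache between turns. It is only a sketch: the `cache:` parameter on TokenIterator and the `[KVCache]` type returned by `newCache(parameters:)` are assumptions, so check them against the MLXLMCommon source you are building against.

```swift
import MLXLMCommon

/// Sketch only: holds one KV cache across turns so each new prompt only has
/// to process tokens the model has not already seen.
final class CachedSession {
    private let context: ModelContext

    // Assumption: newCache(parameters:) returns [KVCache]; check the actual type.
    private var cache: [KVCache]?

    init(context: ModelContext) {
        self.context = context
    }

    func respond(to input: LMInput, parameters: GenerateParameters) throws -> GenerateResult {
        // Create the cache once; this mirrors make_prompt_cache() in the Python version.
        let cache = self.cache ?? context.model.newCache(parameters: parameters)
        self.cache = cache

        // Assumption: TokenIterator exposes a cache parameter so it does not
        // build a fresh cache internally. Verify the exact initializer signature.
        let iterator = try TokenIterator(
            input: input, model: context.model, cache: cache, parameters: parameters)

        // The iterator-taking generate() leaves ownership of the cache with the caller.
        return generate(
            input: input, context: context, iterator: iterator
        ) { _ in .more }
    }
}
```

The same object is a natural place to also keep the chat history from the earlier sketch, so the text history and the KV cache stay in sync.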
Awesome! Thanks for your help. I'll give that a shot. Do I also need to store the cache? How does using a cache affect the maximum context size, if at all?
You would want to keep the cache between calls to generate/TokenIterator -- that is the context state that represents past tokens. @awni can you point at how the context cache affects context size?
The length of the cache (in tokens) plus any generated text should stay below the maximum context size of the model. It's not checked in the Python version, though, as it's quite rare to go over the maximum context size for most modern models.
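A back-of-the-envelope illustration of that rule (the 4,096-token limit and the token counts below are made-up numbers; nothing here reads the real limit from the model):

```swift
/// Tokens left for generation once the cached history and the new prompt are
/// accounted for. `maxContextLength` is model-specific and supplied by you.
func remainingTokens(cachedTokens: Int, promptTokens: Int, maxContextLength: Int) -> Int {
    max(0, maxContextLength - cachedTokens - promptTokens)
}

// 3,500 tokens already in the cache + a 300-token follow-up against a
// 4,096-token context leaves room for about 296 generated tokens.
let budget = remainingTokens(cachedTokens: 3_500, promptTokens: 300, maxContextLength: 4_096)
print(budget)  // 296
```

Once the budget gets close to zero, the application has to intervene: trim or summarize the oldest turns and rebuild the cache, or start a fresh one.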