Context window for LLM #196

Closed

johnZYW opened this issue Feb 6, 2025 · 16 comments

Comments

@johnZYW

johnZYW commented Feb 6, 2025

I just deployed the model on my iPhone, and currently I can only ask one question at a time and can't ask follow-up questions. Is there a way to keep the context by implementing a context window so I can keep asking follow-up questions? Thanks

@davidkoski
Collaborator

Yes, there are two parts to it. The first is that you need to manage the stream of messages. These are typically going to have a role and content and represent the back and forth of the conversation.

The second is to manage the KVCache -- this isn't strictly required, but the performance gain is considerable. If you were to pass in and maintain the KVCache, you would be able to reuse the past work.

This is beyond what the example code does, but the pieces are certainly there to make your own application that does this.
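For the first part, a minimal sketch of the message management might look like this. The ChatMessage and Conversation types are illustrative, not part of the library:

// Illustrative only: keep the conversation as role/content messages and
// re-send the whole history with every new question.
struct ChatMessage {
    enum Role: String { case system, user, assistant }
    let role: Role
    let content: String
}

final class Conversation {
    private(set) var messages: [ChatMessage] = [
        ChatMessage(role: .system, content: "You are a helpful assistant.")
    ]

    func addUser(_ text: String) {
        messages.append(ChatMessage(role: .user, content: text))
    }

    func addAssistant(_ text: String) {
        messages.append(ChatMessage(role: .assistant, content: text))
    }

    // Flatten to [["role": ..., "content": ...]], the dictionary shape that
    // chat-style prompt templates typically consume.
    var asDictionaries: [[String: String]] {
        messages.map { ["role": $0.role.rawValue, "content": $0.content] }
    }
}

On each turn you would append the user message, build the prompt from the accumulated history, generate, and then append the assistant reply before the next question.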

@johnZYW
Author

johnZYW commented Feb 6, 2025

Thanks! I will give it a go.

@3DTOPO

3DTOPO commented Feb 7, 2025

I took a look at implementing KVCache but wasn't able to connect the dots.

I see there is a prepare() extension for LLMModel to pass it a KVCache, but it's not clear to me how I can use it with LLMModelFactory. Or do I have to create my own model like the example here:

https://github.com/ml-explore/mlx-swift-examples/tree/main/Libraries/MLXLLM

Any pointers or tips to get me started would be appreciated!

@johnZYW
Author

johnZYW commented Feb 7, 2025

@3DTOPO have you implemented the context window for the first part? I tried to code it with the help of ChatGPT; I manage to retain the history inside formattedHistory and encode it into inputTokens, but for some reason, just before the model generates the answer, the memory is cleared. I'm still debugging this part.

@3DTOPO

3DTOPO commented Feb 7, 2025

I hadn't gotten that far yet. I thought I'd ask for some advice first.

If you want to share the relevant code, I can try to figure out why the memory is cleared.

@johnZYW
Author

johnZYW commented Feb 7, 2025

Ohh, I only edited the UserInput and Evaluate.swift files, and I just posted them here: https://github.com/johnZYW/LLMplayground. Basically, I have a formatted history that appends each user+response pair and feeds it as input before the next response is generated.

@3DTOPO

3DTOPO commented Feb 7, 2025

Cool. I'll take a look at it tomorrow - just about time for bed. I'll let you know if I get it working.

@awni
Member

awni commented Feb 7, 2025

In case it's useful, here is a reference implementation in the Python version of MLX LM.

@3DTOPO

3DTOPO commented Feb 7, 2025

Thanks for the reference. It doesn't really help me, though, because the Swift generate functions don't have any way to pass the cache the way the example you provided does.

@davidkoski
Collaborator

Sure, you can pass it here, instead of letting it default to a new one each time. If this API proves inadequate, then a PR to fix it would be appreciated!

@3DTOPO

3DTOPO commented Feb 7, 2025

I did see that the cache could be passed to the tokenizer, but that is different from the Python example provided, where it is passed with the generate call.

And the ModelFactory creates the tokenizer from the model container.

@davidkoski
Collaborator

# Make the initial prompt cache for the model
prompt_cache = make_prompt_cache(model)

is the same as:

let cache = model.newCache(parameters: parameters)

and this call:

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)

is roughly the same as: https://github.com/ml-explore/mlx-swift-examples/tree/main/Libraries/MLXLMCommon#using-a-model

    let response = try MLXLMCommon.generate(
        input: input, parameters: generateParameters, context: context
    ) { tokens in .more }

The example code shows how you might print the tokens as they are generated, but if you just want to generate the string in one shot, this would do it.

Perhaps the confusion is the disconnect between the TokenIterator and generate() -- it isn't obvious what to do with this KVCache (since it hasn't been used this way in the example code).

Consider these two functions:

The first is what is being called here, and it constructs the TokenIterator internally:

public func generate(
    input: LMInput, parameters: GenerateParameters, context: ModelContext,
    didGenerate: ([Int]) -> GenerateDisposition
) throws -> GenerateResult {
    let iterator = try TokenIterator(
        input: input, model: context.model, parameters: parameters)
    return generate(
        input: input, context: context, iterator: iterator, didGenerate: didGenerate)
}

If you were to add a cache: [KVCache]? = nil parameter to that and pass it in to the initializer for TokenIterator (right after model:), I think that is probably the missing piece.
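For example, the change might look roughly like this (an untested sketch, assuming the TokenIterator initializer accepts a cache argument as described; not the library as it ships today):

// Sketch of the suggested change: thread an optional KVCache through
// generate() into the TokenIterator instead of always building a new one.
public func generate(
    input: LMInput, parameters: GenerateParameters, context: ModelContext,
    cache: [KVCache]? = nil,
    didGenerate: ([Int]) -> GenerateDisposition
) throws -> GenerateResult {
    let iterator = try TokenIterator(
        input: input, model: context.model, cache: cache, parameters: parameters)
    return generate(
        input: input, context: context, iterator: iterator, didGenerate: didGenerate)
}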

@3DTOPO

3DTOPO commented Feb 7, 2025

Awesome! Thanks for your help. I'll give that a shot.

Do I also need to store the cache?

How does using a cache affect the maximum context size, if at all?

@davidkoski
Collaborator

You would want to keep the cache between calls to generate/TokenIterator -- that is the context state that represents past tokens.
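For illustration, keeping and reusing the cache across turns might look roughly like this. The cache: argument is the assumed addition sketched above, not the current API, and the surrounding UserInput/prepare calls are illustrative rather than exact:

// Illustrative only: create the KVCache once and reuse it on every turn so
// tokens from earlier turns do not have to be re-processed.
let cache = context.model.newCache(parameters: generateParameters)

for question in ["What is MLX?", "And how do I run it on iOS?"] {
    let input = try await context.processor.prepare(input: UserInput(prompt: question))
    let result = try MLXLMCommon.generate(
        input: input, parameters: generateParameters, context: context,
        cache: cache  // the parameter added in the sketch above
    ) { _ in .more }
    print(result.output)
}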

@awni can you point at how the context cache affects context size?

@awni
Member

awni commented Feb 8, 2025

The length of the cache (in tokens) + any generated text should stay below the maximum context size of the model.

It's not checked in the Python version, though, as it's quite rare to exceed the maximum context size for most modern models.
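As a rough illustration of that bookkeeping (the numbers and the token-count variable are placeholders, not values from any API):

// Placeholder arithmetic: tokens already held in the cache plus the tokens
// you are about to generate should stay under the model's context limit.
let maxContextLength = 4096      // model-specific limit, from the model config
let cachedTokenCount = 3500      // however many tokens the cache currently holds
let remainingBudget = maxContextLength - cachedTokenCount   // 596 tokens left

if remainingBudget <= 0 {
    // Trim the history or start a fresh cache before generating again.
}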

@davidkoski
Collaborator

There are now some chat examples and use of KVCache as well. See also #310 and #312.
