Context window for LLM #196
I just deployed the model on my iPhone, and currently I can only ask one question at a time and can't ask follow-up questions. Is there a way to keep the context by implementing a context window and keep asking follow-up questions? Thanks
Yes, there are two parts to it. The first is that you need to manage the stream of messages. These are typically going to have a role and content and represent the back and forth. The second is to manage the KVCache -- this isn't strictly required, but the performance gain is considerable: if you were to pass in and maintain the KVCache, you would be able to save the past work. This is beyond what the example code does, but the pieces are certainly there to make your own application that does this.
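For the first part, a minimal sketch of what managing that message stream might look like (the `ChatMessage` and `ChatHistory` types below are illustrative, not types from MLXLMCommon):

```swift
// Illustrative only: a tiny container for the role/content back and forth.
struct ChatMessage {
    enum Role: String { case system, user, assistant }
    let role: Role
    let content: String
}

struct ChatHistory {
    var messages: [ChatMessage] = []

    mutating func add(_ role: ChatMessage.Role, _ content: String) {
        messages.append(ChatMessage(role: role, content: content))
    }

    /// The [[String: String]] shape commonly fed to a chat template.
    var asDictionaries: [[String: String]] {
        messages.map { ["role": $0.role.rawValue, "content": $0.content] }
    }
}

// Usage: append every turn and feed the whole history into the next prompt.
var history = ChatHistory()
history.add(.user, "What is the capital of France?")
history.add(.assistant, "Paris.")
history.add(.user, "And roughly how many people live there?")  // follow-up sees prior turns
```

Whether you render this history through the model's chat template or flatten it into a single prompt string is up to your application; the important part is that every follow-up request includes the earlier turns.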
Thanks! I'll give it a go.
I took a look at implementing KVCache but wasn't able to connect the dots. I see there is a prepare() extension for LLMModel that can be passed a KVCache, but it's not clear to me how I can use it with LLMModelFactory. Or do I have to create my own model like the example here: https://github.com/ml-explore/mlx-swift-examples/tree/main/Libraries/MLXLLM? Any pointers or tips to get me started would be appreciated!
@3DTOPO have you implemented the context window for the first part? I tried to code it with the help of ChatGPT. I manage to retain the history inside formattedHistory and encode it into inputTokens, but for some reason, just before the model generates the answer, the memory is cleared. I'm still debugging this part.
I hadn't gotten that far yet; I thought I'd ask for some advice first. If you want to share the relevant code, I can try to figure out why the memory is cleared.
Ohh, I only edited the UserInput and Evaluate.swift files, and I just posted the code here: https://github.com/johnZYW/LLMplayground. Basically I have a formatted history that appends each user prompt and response and feeds it as input before the next response is generated.
Cool. I'll take a look at it tomorrow - just about time for bed. I'll let you know if I get it working.
In case it's useful, here is a reference implementation in the Python version of MLX LM.
Thanks for the reference. It doesn't really help me though, because the Swift generate functions don't have any way to pass the cache like the example you provided does.
Sure, you can pass the cache in yourself instead of letting it default to a new one each time. If this API proves inadequate, then a PR to fix it would be appreciated!
I did see that a cache could be passed to the tokenizer, but that is different from the Python example you provided, where it is passed with the generate call. And the ModelFactory creates the tokenizer from the model container.
This, from the Python reference:

```python
# Make the initial prompt cache for the model
prompt_cache = make_prompt_cache(model)
```

is the same as:

```swift
let cache = model.newCache(parameters: parameters)
```

and this call:

```python
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)
```

is roughly the same as https://github.com/ml-explore/mlx-swift-examples/tree/main/Libraries/MLXLMCommon#using-a-model:

```swift
let response = try MLXLMCommon.generate(
    input: input, parameters: generateParameters, context: context
) { tokens in .more }
```

The example code shows how you might print the tokens as they are generated, but if you just want to generate the string in one shot, this would do it. Perhaps the confusion is the disconnect between the TokenIterator and generate() -- it isn't obvious what to do with the KVCache, since it hasn't been used this way in the example code. Consider these two generate() functions. The first is the one being called here, and it constructs the TokenIterator internally:

```swift
public func generate(
    input: LMInput, parameters: GenerateParameters, context: ModelContext,
    didGenerate: ([Int]) -> GenerateDisposition
) throws -> GenerateResult {
    let iterator = try TokenIterator(
        input: input, model: context.model, parameters: parameters)
    return generate(
        input: input, context: context, iterator: iterator, didGenerate: didGenerate)
}
```

The second is the generate(input:context:iterator:didGenerate:) called on the last line -- it takes an iterator you construct yourself. If you were to add a TokenIterator built with your own cache and pass it to that second generate(), you could keep the cache, and the past work it represents, between calls.
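To make that concrete, here is a rough sketch of a session object that keeps the cache between turns. It is only a sketch: the `cache:` parameter on TokenIterator and the `[KVCache]` type returned by `newCache(parameters:)` are assumptions, so check them against the MLXLMCommon source you are building against.

```swift
import MLXLMCommon

/// Sketch only: holds one KV cache across turns so each new prompt only has
/// to process tokens the model has not already seen.
final class CachedSession {
    private let context: ModelContext

    // Assumption: newCache(parameters:) returns [KVCache]; check the actual type.
    private var cache: [KVCache]?

    init(context: ModelContext) {
        self.context = context
    }

    func respond(to input: LMInput, parameters: GenerateParameters) throws -> GenerateResult {
        // Create the cache once; this mirrors make_prompt_cache() in the Python version.
        let cache = self.cache ?? context.model.newCache(parameters: parameters)
        self.cache = cache

        // Assumption: TokenIterator exposes a cache parameter so it does not
        // build a fresh cache internally. Verify the exact initializer signature.
        let iterator = try TokenIterator(
            input: input, model: context.model, cache: cache, parameters: parameters)

        // The iterator-taking generate() leaves ownership of the cache with the caller.
        return generate(
            input: input, context: context, iterator: iterator
        ) { _ in .more }
    }
}
```

The same object is a natural place to also keep the chat history from the earlier sketch, so the text history and the KV cache stay in sync.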
Awesome! Thanks for your help. I'll give that a shot. Do I also need to store the cache? How does using a cache affect the maximum context size, if at all?
You would want to keep the cache between calls to generate/TokenIterator -- that is the context state that represents past tokens. @awni can you point at how the context cache affects context size?
The length of the cache (in tokens) plus any generated text should stay below the maximum context size of the model. It's not checked in the Python version, though, as it's quite rare to go over the maximum context size for most modern models.
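A back-of-the-envelope illustration of that rule (the 4,096-token limit and the token counts below are made-up numbers; nothing here reads the real limit from the model):

```swift
/// Tokens left for generation once the cached history and the new prompt are
/// accounted for. `maxContextLength` is model-specific and supplied by you.
func remainingTokens(cachedTokens: Int, promptTokens: Int, maxContextLength: Int) -> Int {
    max(0, maxContextLength - cachedTokens - promptTokens)
}

// 3,500 tokens already in the cache + a 300-token follow-up against a
// 4,096-token context leaves room for about 296 generated tokens.
let budget = remainingTokens(cachedTokens: 3_500, promptTokens: 300, maxContextLength: 4_096)
print(budget)  // 296
```

Once the budget gets close to zero, the application has to intervene: trim or summarize the oldest turns and rebuild the cache, or start a fresh one.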