I noticed that if we enqueue a dynamic job again after it has finished, it resumes and starts generating again. Is this intentional?
The job won't be in a valid state after finishing, so it's coincidental that this works at all. Failing after ~200 tokens is unsurprising, since that's probably where it crosses a cache page boundary. If this were to be supported, it would have to be implemented the way you suggest anyway: create a new job from the result of the old job, copy the sampling settings etc. across, and enqueue the new job. So there wouldn't be any performance benefit to it. I could maybe see the utility of it as a convenience feature, but I'm not sure it's worth the effort to implement.
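To illustrate the suggested pattern, here's a minimal sketch of "continuing" a finished job by building a fresh one from its output. The `Job`, `SampleSettings`, and `continue_job` names are hypothetical stand-ins, not the library's actual API; the point is just that the new job's prompt is the old prompt plus the old completion, with the sampling settings carried across.

```python
from dataclasses import dataclass, field

@dataclass
class SampleSettings:
    # Hypothetical sampling settings carried from job to job.
    temperature: float = 0.8
    top_p: float = 0.95

@dataclass
class Job:
    input_ids: list                  # prompt tokens
    settings: SampleSettings
    max_new_tokens: int = 256
    completion: list = field(default_factory=list)  # tokens generated so far
    finished: bool = False

def continue_job(old: Job, max_new_tokens: int) -> Job:
    # New prompt = old prompt + everything the old job generated.
    # With prefix caching, that whole sequence is already in the cache,
    # so the new job's prefill is (almost) free.
    return Job(
        input_ids=old.input_ids + old.completion,
        settings=old.settings,       # copy sampling settings across
        max_new_tokens=max_new_tokens,
    )
```

With prefix caching in place, enqueuing the continuation costs roughly the same as reviving the old job would have.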
It would complicate the generator a lot to allow for suspending jobs indefinitely.
The caching solution it does use is very general-purpose, though. If you generate up to some stop condition (or until a job is canceled), then whatever was in the prompt+completion of that job is going to be cached. So the next time you start a job with that same sequence it will skip the prefill entirely, and really there's no performance penalty to speak of compared to "reviving" a completed job.
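As a toy model of that behavior, here's a sketch of prefix reuse, with hypothetical names (the real cache works on fixed-size pages of KV state, not a set of token tuples): a finished or canceled job leaves its prompt+completion in the cache, and the next job only has to prefill whatever extends past the longest cached prefix.

```python
class PrefixCache:
    """Toy model of KV-cache reuse: remembers which token sequences have
    been processed, and reports how much prefill a new job can skip."""

    def __init__(self):
        self.cached = set()  # cached token-prefixes, stored as tuples

    def store(self, tokens):
        # After a job finishes (or is canceled), every prefix of its
        # prompt+completion is effectively cached.
        for i in range(1, len(tokens) + 1):
            self.cached.add(tuple(tokens[:i]))

    def reusable_prefix(self, prompt):
        # Longest cached prefix of the new prompt = tokens whose
        # prefill can be skipped entirely.
        best = 0
        for i in range(1, len(prompt) + 1):
            if tuple(prompt[:i]) in self.cached:
                best = i
        return best
```

Starting a new job with the old job's exact prompt+completion as its prompt then skips the prefill entirely, which is why reviving a completed job would buy nothing performance-wise.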
And this way, all the concerns about managing limited resources are hidden behind an abstraction, while inference under the hood stays very efficient. Imagine some model that may output a
<think>
token whe…