[Core] LoRA: V1 Scheduler optimization #15422
Conversation
Otherwise LGTM
Requesting review from @russellb for the changes in "Structured Outputs" land! 🙌
Requesting a review from @jeejeelee 🙌
Force-pushed from 1ca090c to 7675df8
Besides the variable name, the PR looks good to me!
@varun-sundar-rabindranath Do you have any performance numbers after this PR?
I am running some benchmarks now. I'll add them to the PR 👍
Hi @WoosukKwon - Added benchmark numbers to the PR description. It definitely helps V1 when max_loras < the number of LoRAs used. However, V1 LoRA in this case does lag behind V0 - I think it has to do with the …
LoRA Scheduler Optimization
Running Example:
Let max_loras be set to 4.
Let the waiting queue be: { R1, R2, R3, R4-L1, R5-L3, R6-L2, R7-L4, R8-L5, R9-L5, R10-L5, R11-L5, R12, R13, R14-L1, R15-L2, R16-L3 }
Rx - Request number x. The request doesn't need any LoRA.
Rx-Ly - Request number x that needs LoRA number y.
Why:
In V1 + LoRA, at the moment we stop scheduling waiting requests as soon as we can no longer honor the max_loras user input. This is not optimal, as requests that need no LoRA, or whose LoRA is already scheduled, are blocked from scheduling too.
In the example above, scheduling stops at R8-L5 because activating L5 would exceed max_loras = 4, even though R12 and R13 need no LoRA and R14-L1, R15-L2 and R16-L3 use LoRAs that are already active.
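To make the blocking behavior concrete, here is a minimal, self-contained sketch of the admission logic over the example queue. The request tuples, the schedule function, and its parameters are all made up for illustration; this is not vLLM's actual scheduler code.

```python
# Illustrative sketch only -- names are hypothetical, not vLLM internals.
# A request is (name, lora), where lora is None if no adapter is needed.
WAITING = [
    ("R1", None), ("R2", None), ("R3", None), ("R4", "L1"), ("R5", "L3"),
    ("R6", "L2"), ("R7", "L4"), ("R8", "L5"), ("R9", "L5"), ("R10", "L5"),
    ("R11", "L5"), ("R12", None), ("R13", None), ("R14", "L1"),
    ("R15", "L2"), ("R16", "L3"),
]

def schedule(waiting, max_loras, skip_blocked):
    """Return the request names scheduled in one pass over the queue."""
    scheduled, active_loras = [], set()
    for name, lora in waiting:
        # Schedulable if: no LoRA needed, LoRA already active, or a
        # free LoRA slot remains.
        if lora is None or lora in active_loras or len(active_loras) < max_loras:
            if lora is not None:
                active_loras.add(lora)
            scheduled.append(name)
        elif skip_blocked:
            continue  # this PR: skip and keep scanning the queue
        else:
            break     # old behavior: stop at the first blocked request
    return scheduled

print(schedule(WAITING, max_loras=4, skip_blocked=False))
# ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7']  -- R12..R16 are blocked
```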
What:
This PR updates the scheduling logic to continue scanning the waiting queue for requests that can still be scheduled, instead of stopping at the first request whose LoRA cannot be activated.
With this PR, R12, R13, R14-L1, R15-L2 and R16-L3 are scheduled as well; only R8-L5 through R11-L5 remain waiting.
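Continuing the sketch above, the same pass with skipping enabled picks up the remaining schedulable requests:

```python
print(schedule(WAITING, max_loras=4, skip_blocked=True))
# ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7',
#  'R12', 'R13', 'R14', 'R15', 'R16']  -- only R8..R11 (LoRA L5) wait
```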
How:
We leverage the "skip waiting requests" logic introduced by structured decoding.
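A minimal sketch of that pattern, assuming a deque-based waiting queue; the names and helper callables are hypothetical stand-ins for the scheduler's real admission checks, not vLLM's actual internals. Requests that cannot be scheduled are set aside and prepended back to the waiting queue at the end of the step, so they keep their priority:

```python
from collections import deque

def schedule_step(waiting: deque, can_schedule, do_schedule, token_budget: int):
    """One scheduling pass. `can_schedule` and `do_schedule` are
    hypothetical stand-ins for admission checks and bookkeeping."""
    skipped = deque()
    while waiting and token_budget > 0:
        request = waiting.popleft()
        if not can_schedule(request):
            # Cannot honor this request right now (e.g. it needs a LoRA
            # slot and all max_loras slots are taken) -- set it aside
            # and keep scanning instead of ending the whole pass.
            skipped.append(request)
            continue
        token_budget -= do_schedule(request)  # returns tokens consumed
    # Re-queue skipped requests at the front, preserving their order.
    waiting.extendleft(reversed(skipped))
```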
Benchmarks:
Settings: max_loras = 4, number of LoRA modules = 8, max_num_seqs = 256, max_num_batched_tokens = 4096
Server command: …
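For illustration only, a server launch with these settings might look like the sketch below; the model name and adapter paths are placeholders, not the exact command used for these benchmarks.

```bash
# Hypothetical invocation -- model and adapter paths are placeholders.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-lora \
    --max-loras 4 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 4096 \
    --lora-modules l1=/path/to/lora_1 l2=/path/to/lora_2  # ... 8 modules total
```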
benchmark_serving.py command: …
Results: main V1 vs. main V0 vs. this PR