-
Notifications
You must be signed in to change notification settings - Fork 11.8k
sycl : Implemented reorder Q4_K mmvq #13109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Could you share the GPU type of above test result?
- Have you test the PR by local UT?
- Could you check the detailed output of Q4_K LLM?
I guess the output should be different to legacy code.
ggml/src/ggml-sycl/ggml-sycl.cpp
Outdated
// Dispatch becomes obscure with the reorder, MMVQ when the reorder optimization | ||
// is enabled takes precedence over DMMV, the current if-else implementation | ||
// requires disabling DMMV if both conditions are met | ||
|| (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have same comment and concern:
This code will impact the code path of below. That would lead to the wrong result.
I suggest this PR only optimize the mmvq() function.
You could add another PR to optimize by changing the code path, like by this sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! Let's focus on the review of #12858 firtly.
ggml/src/ggml-sycl/ggml-sycl.cpp
Outdated
@@ -2968,14 +2994,17 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor | |||
// KQ + KQV multi-batch | |||
ggml_sycl_mul_mat_batched_sycl(ctx, src0, src1, dst); | |||
} else if (use_dequantize_mul_mat_vec) { | |||
ggml_sycl_op_mul_mat(ctx, src0, src1, dst, ggml_sycl_op_dequantize_mul_mat_vec, false); | |||
// save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream()); | |||
constexpr bool convert_src1_to_q8_1 = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you follow the solution of PR #13003?
It fixed base issue of reorder Q4_0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rebased this branch and will rebase it again when #12858 is merged, so these changes should make it into this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on the discussion WRT opt_for_reorder
and how to call the function, this will require another rebase, sorry about that. Will keep it to a single commit so you can cherry pick with no issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! Let's focus on the review of #12858 firtly.
@sgeor255 We need promote SYCL backend in related cases. :) |
d1f5b2d
to
105a01d
Compare
I rebased the PR on @Alcpz 's latest changes & updated the description with more performance numbers. |
@NeoZhangJianyu to answer your questions:
I updated the PR description with results from more devices.
Unit tests pass locally (if I understood the question correctly?).
I ran the example script with master @ 8936784
This PR
|
@sgeor255 I cannot resolve my comments (because the "resolve conversation " button is just isn't there for me), consider them resolved 👍🏻 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM
ggml/src/ggml-sycl/ggml-sycl.cpp
Outdated
@@ -2968,14 +2994,17 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor | |||
// KQ + KQV multi-batch | |||
ggml_sycl_mul_mat_batched_sycl(ctx, src0, src1, dst); | |||
} else if (use_dequantize_mul_mat_vec) { | |||
ggml_sycl_op_mul_mat(ctx, src0, src1, dst, ggml_sycl_op_dequantize_mul_mat_vec, false); | |||
// save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream()); | |||
constexpr bool convert_src1_to_q8_1 = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on the discussion WRT opt_for_reorder
and how to call the function, this will require another rebase, sorry about that. Will keep it to a single commit so you can cherry pick with no issues.
105a01d
to
685e02b
Compare
https://github.com/NeoZhangJianyu there's a small improvement for this model too:
build: 105a01d (5223)
build: 105a01d (5223) |
685e02b
to
d7e5179
Compare
d7e5179
to
bb10b4a
Compare
bb10b4a
to
f7e7d2a
Compare
This PR is now rebased on master as #12858 was merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM so far
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall!
f7e7d2a
to
2a2aef0
Compare
|
||
static bool can_use_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static bool can_use_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | |
static bool choose_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nit but can_use
seems more accurate to me since there are more logic later on to make the final decision on the matmul implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use one function to detect if need call deq_mul_mat_vec(), instead of several.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is that the final choice depends on the outputs from can_use_dequantize_mul_mat_vec
and can_use_mul_mat_vec_q
so it can't all be in a single choose_dequantize_mul_mat_vec
currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree with @Rbiessy and that's why it is currently implemented this way.
src0->ne[0] % GGML_SYCL_DMMV_X == 0 && src1->ne[1] == 1; | ||
} | ||
|
||
static bool can_use_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static bool can_use_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | |
static bool choose_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { |
Merging now since this PR includes an important fix with the reorder optimization mentioned here: #13109 (comment) |
This PR enables reorder optimization for Q4_K layout similarly to #12858 . This branch is based off of @Alcpz 's and before that is merged the easiest way to review it is looking at the diff for d1f5b2d .
Some performance numbers below:
Lunar lake
GGML_SYCL_DISABLE_OPT=0
build: 105a01d (5223)
GGML_SYCL_DISABLE_OPT=1
build: 105a01d (5223)
Arc B580 (Battlemage)
GGML_SYCL_DISABLE_OPT=0
build: 105a01d (5223)
GGML_SYCL_DISABLE_OPT=1
build: 105a01d (5223)
Arc A770
GGML_SYCL_DISABLE_OPT=0
build: 105a01d (5223)
GGML_SYCL_DISABLE_OPT=1
build: 105a01d (5223)