sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs #12858
base: master
Conversation
Signed-off-by: Alberto Cabrera <[email protected]>
Force-pushed from 44e199d to 9c8d809
// Dispatch becomes obscure with the reorder, MMVQ when the reorder optimization
// is enabled takes precedence over DMMV, the current if-else implementation
// requires disabling DMMV if both conditions are met
|| (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) {
This PR targets Intel GPUs, so why change the code used for CUDA?
In fact, the src0 reorder never happens on non-Intel GPUs, so this code has no effect there.
I suggest removing it.
Your comment aligns with my suspicion that this change is obscure. This line changes the kernels from DMMV to MMVQ if reorder is enabled and it's supported, so it's no longer only for CUDA devices.
I need to rethink how the dispatcher does the work.
Yes, please ignore this comment.
Another comment:
The reorder behavior affects the code path in this PR: use_dequantize_mul_mat_vec = use_dequantize_mul_mat_vec && !use_mul_mat_vec_q;
This code works well for CUDA but not for Intel GPUs; that's why it was limited to the CUDA backend.
On Intel GPUs, some cases (models) will benefit from it and some will regress.
I suggest removing this behavior: only optimize the OPs via the reorder, and do not change the code path.
From what we have measured, the new MMVQ code path with the reorder optimization is faster on Intel devices as well (cf. the PR description); that's why we suggest enabling it by default now. Could you let us know if you find a model or device where this causes a performance regression?
The code-path change is bundled with the optimization code in this PR, so any side effect of the code-path change is masked by the optimization.
The test results of this PR therefore can't prove that the code-path change has no side effect.
Let me check it.
I used ./examples/sycl/run-llama2.sh to test it on Arc.
I disabled the reorder and only changed the code path to match the CUDA path.
It impacts both performance and output at the same time:
the eval rate drops from 28 to 16 tokens per second, and the output text changes too.
base:
commit 307bfa2 (HEAD -> master, tag: b5126, origin/master, origin/HEAD)
Author: Alan Gray [email protected]
Date: Sun Apr 13 22:12:21 2025 +0100
llama_perf_context_print: prompt eval time = 288.98 ms / 19 tokens ( 15.21 ms per token, 65.75 tokens per second)
llama_perf_context_print: eval time = 13903.83 ms / 399 runs ( 34.85 ms per token, 28.70 tokens per second)
Step 1: Get to know the basics of web design
Step 2: Set up a web hosting account
Step 3: Download a free website builder
Step 4: Set up a domain name
Step 5: Design your website
Step 6: Add content to your site
Step 7: Make the site responsive
Step 8: Add a contact form
Step 9: Add a social media share button
Step 10: Advertise your website
I changed the code path with:
if (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda || true)
llama_perf_context_print: prompt eval time = 210.78 ms / 19 tokens ( 11.09 ms per token, 90.14 tokens per second)
llama_perf_context_print: eval time = 24153.18 ms / 399 runs ( 60.53 ms per token, 16.52 tokens per second)
Step 1: Select the Website Name
Step 2: Select the Website Type
Step 3: Choose a Website Theme
Step 4: Select a Website Builder
Step 5: Register a Domain Name
Step 6: Buy Web Hosting
Step 7: Upload the Website
Step 8: Connect to Social Media
Step 9: Add a Call to Action
Step 10: Improve Search Engine Optimization (SEO)
I tested this PR's code with ./examples/sycl/run-llama2.sh on Arc.
Using this PR: https://github.com/Alcpz/llama.cpp/tree/Alcpz/mmvq_q4_0_reorder
commit 52b1622 (HEAD -> Alcpz/mmvq_q4_0_reorder, origin/Alcpz/mmvq_q4_0_reorder)
Author: Alberto Cabrera [email protected]
Date: Thu Apr 10 16:10:43 2025 +0100
The performance increases, but the result contains more errors:
llama_perf_context_print: prompt eval time = 318.59 ms / 19 tokens ( 16.77 ms per token, 59.64 tokens per second)
llama_perf_context_print: eval time = 13073.32 ms / 399 runs ( 32.77 ms per token, 30.52 tokens per second)
Step 1: Select the Website Name
Step 2: Select the Website Type
Step 3: Choose a Website Theme
Step 4: Choose a Website Name
Step 5: Choose a Website Theme
Step 6: Choose a Website Name
Step 7: Choose a Website Theme
Step 8: Choose a Website Name
Step 9: Choose a Website Theme
Step 10: Choose a Website Theme
Step 11: Choose a Website Name
I updated the code to disable the code-path change with the following:
if (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda){
// Dispatch becomes obscure with the reorder, MMVQ when the reorder optimization
// is enabled takes precedence over DMMV, the current if-else implementation
// requires disabling DMMV if both conditions are met
// || (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) {
use_dequantize_mul_mat_vec = use_dequantize_mul_mat_vec && !use_mul_mat_vec_q;
}
The performance decreases from the non-optimized code's 28 to 25 tokens per second, but the result is correct.
llama_perf_context_print: prompt eval time = 577.75 ms / 19 tokens ( 30.41 ms per token, 32.89 tokens per second)
llama_perf_context_print: eval time = 15765.81 ms / 399 runs ( 39.51 ms per token, 25.31 tokens per second)
Step 1: Get to know the basics of web design
Step 2: Set up a web hosting account
Step 3: Download a free website builder
Step 4: Set up a domain name
Step 5: Design your website
Step 6: Add content to your site
Step 7: Make the site responsive
Step 8: Add a contact form
Step 9: Add a social media share button
Step 10: Advertise your website
When I disable the reorder optimization and use the following code:
if (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda
// Dispatch becomes obscure with the reorder, MMVQ when the reorder optimization
// is enabled takes precedence over DMMV, the current if-else implementation
// requires disabling DMMV if both conditions are met
// || (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) {
|| (ggml_sycl_supports_reorder_mmvq(src0->type))) {
The performance is reduced and the result is wrong.
llama_perf_context_print: prompt eval time = 211.28 ms / 19 tokens ( 11.12 ms per token, 89.93 tokens per second)
llama_perf_context_print: eval time = 23958.10 ms / 399 runs ( 60.05 ms per token, 16.65 tokens per second)
...
Step 1: Select the Website Name
Step 2: Select the Website Type
Step 3: Choose a Website Theme
Step 4: Choose a Website Name
Step 5: Choose a Website Theme
Step 6: Choose a Website Name
Step 7: Choose a Website Theme
Step 8: Choose a Website Name
Step 9: Choose a Website Theme
Step 10: Choose a Website Theme
Signed-off-by: Alberto Cabrera <[email protected]>
// qr number of weights in a byte (described as 'before dequantization')
// for quantization types that has low and high bits split, qr is calculated with
// using the lower bits, e.g for Q6 quants QR6 is 2
// qi size of a block in 32 bit integers
Suggested change:
- // qi size of a block in 32 bit integers
+ // qi number of 32 bit integers needed to represent all the quants from a block (`qs` field)
This is more in line with my understanding of qi; thoughts?
I think it's the number of quants that can be packed into a 32-bit type. That's why, from ggml-common.h, the value of QI4_0 (which is QK4_0 / (4 * QR4_0)) comes out to be 8: each quant in this case is 4 bits, so 2 per byte and therefore 4 * 2 = 8 in a 32-bit type.
"Number of 32 bit integers" sounds like more than one integer could be required, which I do not think is what this field is trying to convey.
The value of QI4_0 is 4 from what I'm seeing: QK4_0 = 32 and QR4_0 = 2, so 32 / (4 * 2).

> "Number of 32 bit integers" sounds like more than one integer could be required, which I do not think is what this field is trying to convey.

That's my understanding of the field, and it seems to align with the comment in llama.cpp/ggml/src/ggml-common.h, line 94 in fccf9ca:
// QI = number of 32 bit integers before dequantization
A lot clearer, I agree.
assert(blocks_per_subgroup > 0);
assert(block_elements_per_subgroup > 0);
Having asserts inside the kernel seems incorrect, seeing as they check constexprs; if they are changed to static_assert they become compile-time checks, which would be better.
constexpr size_t safe_div(const size_t m, const size_t n) {
    return (m + n - 1) / n;
}
nit: I think a better name for it would be ceil_div. safe_div sounds like it checks for overflow or promotes to a higher-precision type for the addition, which it does not.
This PR extends the work introduced in #12035. MMVQ Q4_0 now supports the block_q_t reorder layout.
The improvements are reflected in text generation; the PP512 improvement on the DataMax 1100 is within noise.
The PR includes:
- a reorder_vec_dot_q_sycl struct
- reorder_mul_mat_vec_q4_0_q8_1_sycl
- safe_div, to be more consistent with the naming of other backends for ceil/roundup division

Still pending TODOs:
- Benchmarking

Compiler: ICPX 2025.1
Builds:
GPU & Drivers:
DISABLE_OPT is the value of GGML_SYCL_DISABLE_OPT