sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs #12858

Open · wants to merge 3 commits into master from Alcpz/mmvq_q4_0_reorder

Conversation

@Alcpz (Collaborator) commented Apr 10, 2025

This PR extends the work introduced in #12035.
MMVQ Q4_0 now supports the block_q_t reorder layout.

The improvements are reflected in text generation; the PP512 improvement on the Data Center GPU Max 1100 is within noise.

The PR includes:

  • A refactor of the vec-dot traits, defined in the reorder_vec_dot_q_sycl struct (see the sketch after the TODO list below).
  • A new entrypoint for reordered MMVQ vec dots: reorder_mul_mat_vec_q4_0_q8_1_sycl.
  • A new helper function, safe_div, for consistency with the naming other backends use for ceiling (round-up) division.

Still pending TODOs:

  • Improve and find a proper location for the comment describing the reordered layout
  • Default to DMMV if the reordered Q4_0 is not supported.
  • Get the performance numbers for an Arc 7X0
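
Below is a minimal sketch of what a vec-dot traits struct of this shape might look like, as referenced in the list above. The member names and signatures are illustrative assumptions, not the PR's actual definitions; only the struct name reorder_vec_dot_q_sycl and the ggml constants come from the PR description and ggml-common.h.

```cpp
// Illustrative sketch only (not the PR's real interface): a traits struct
// bundling the per-quant-type constants with the vec-dot for the reordered layout.
template <ggml_type type> struct reorder_vec_dot_q_sycl_sketch;

template <> struct reorder_vec_dot_q_sycl_sketch<GGML_TYPE_Q4_0> {
    static constexpr int qk = QK4_0;  // weights per block (32)
    static constexpr int qi = QI4_0;  // 32-bit integers per block of quants (4)
    static constexpr int qr = QR4_0;  // quants packed per byte (2)

    // One partial dot product between a reordered Q4_0 block and a Q8_1 block;
    // the argument list is a placeholder, not the PR's actual signature.
    float operator()(const void * q4_data, const block_q8_1 * q8, int iqs) const;
};
```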

Benchmarking

Compiler: ICPX 2025.1

Builds:

GPU & Drivers:

  • Intel(R) Arc(TM) B580 Graphics 20.1.0 [1.6.32567+16]
  • Intel(R) Data Center GPU Max 1100 12.60.7 [1.6.32567+18]
  • Lunar Lake, Intel(R) Arc(TM) Graphics 20.4.4 [1.6.32567+16] (iGPU)

DISABLE_OPT is the value of the GGML_SYCL_DISABLE_OPT environment variable.

| GPU | model | backend | ngl | DISABLE_OPT | (02082f1) pp512 | (44e199d) pp512 | (02082f1) tg128 | (44e199d) tg128 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| B580 | qwen2 1.5B Q4_0 | SYCL | 99 | 0 | 6286.16 ± 14.00 | 6233.05 ± 22.66 | 105.35 ± 1.70 | 134.61 ± 5.38 |
| B580 | llama 7B Q4_0 | SYCL | 99 | 0 | 1649.27 ± 1.84 | 1648.96 ± 2.41 | 40.97 ± 0.19 | 65.21 ± 0.21 |
| B580 | phi3 3B Q4_0 | SYCL | 99 | 0 | 2461.62 ± 3.06 | 2462.38 ± 3.46 | 62.36 ± 0.43 | 94.31 ± 0.20 |
| B580 | qwen2 1.5B Q4_0 | SYCL | 99 | 1 | 7863.81 ± 30.10 | 7813.15 ± 55.52 | 100.45 ± 2.72 | 96.97 ± 0.32 |
| B580 | llama 7B Q4_0 | SYCL | 99 | 1 | 2211.87 ± 1.64 | 2212.20 ± 1.83 | 40.03 ± 0.22 | 39.85 ± 0.08 |
| B580 | phi3 3B Q4_0 | SYCL | 99 | 1 | 3133.46 ± 5.73 | 3132.75 ± 4.61 | 61.17 ± 0.34 | 61.75 ± 0.45 |

| GPU | model | backend | ngl | DISABLE_OPT | (02082f1) pp512 | (44e199d) pp512 | (02082f1) tg128 | (44e199d) tg128 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DataMax 1100 | qwen2 1.5B Q4_0 | SYCL | 99 | 0 | 6759.80 ± 38.41 | 7272.88 ± 40.08 | 121.96 ± 1.07 | 143.40 ± 0.90 |
| DataMax 1100 | llama 7B Q4_0 | SYCL | 99 | 0 | 1778.88 ± 6.92 | 1793.16 ± 7.07 | 56.72 ± 0.25 | 71.40 ± 0.41 |
| DataMax 1100 | phi3 3B Q4_0 | SYCL | 99 | 0 | 2863.51 ± 13.92 | 2867.34 ± 4.07 | 92.18 ± 0.20 | 110.15 ± 0.57 |
| DataMax 1100 | qwen2 1.5B Q4_0 | SYCL | 99 | 1 | 9169.12 ± 142.33 | 9350.59 ± 60.30 | 94.20 ± 0.43 | 94.29 ± 0.41 |
| DataMax 1100 | llama 7B Q4_0 | SYCL | 99 | 1 | 2543.61 ± 8.16 | 2553.34 ± 22.99 | 36.27 ± 0.09 | 36.61 ± 0.07 |
| DataMax 1100 | phi3 3B Q4_0 | SYCL | 99 | 1 | 3952.37 ± 24.30 | 3938.14 ± 23.66 | 66.91 ± 0.07 | 67.24 ± 0.15 |

| GPU | model | backend | ngl | DISABLE_OPT | (02082f1) pp512 | (44e199d) pp512 | (02082f1) tg128 | (44e199d) tg128 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arc 140V | qwen2 1.5B Q4_0 | SYCL | 99 | 0 | 1100.09 ± 1.13 | 1127.81 ± 35.95 | 38.14 ± 0.38 | 46.03 ± 0.21 |
| Arc 140V | llama 7B Q4_0 | SYCL | 99 | 0 | 316.47 ± 0.41 | 321.85 ± 5.92 | 13.09 ± 0.73 | 20.52 ± 0.04 |
| Arc 140V | phi3 3B Q4_0 | SYCL | 99 | 0 | 512.94 ± 0.32 | 515.34 ± 1.68 | 20.49 ± 0.33 | 30.56 ± 0.10 |
| Arc 140V | qwen2 1.5B Q4_0 | SYCL | 99 | 1 | 1492.78 ± 60.74 | 1514.96 ± 55.74 | 34.06 ± 0.23 | 33.80 ± 1.07 |
| Arc 140V | llama 7B Q4_0 | SYCL | 99 | 1 | 519.44 ± 0.78 | 393.29 ± 17.66 | 11.69 ± 0.45 | 12.04 ± 0.94 |
| Arc 140V | phi3 3B Q4_0 | SYCL | 99 | 1 | 752.71 ± 21.10 | 787.60 ± 6.11 | 18.77 ± 0.04 | 18.85 ± 0.10 |

@github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) on Apr 10, 2025
@Alcpz force-pushed the Alcpz/mmvq_q4_0_reorder branch from 44e199d to 9c8d809 on April 10, 2025 00:55
// Dispatch becomes obscure with the reorder, MMVQ when the reorder optimization
// is enabled takes precedence over DMMV, the current if-else implementation
// requires disabling DMMV if both conditions are met
|| (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) {
Collaborator:

This PR is aimed at Intel GPUs.
Why change the code path for CUDA?
In fact, reordering src0 won't happen for non-Intel GPUs, so this code has no impact.
I suggest removing it.

Collaborator (author):

Your comment aligns with my suspicion that this change is obscure. This line changes the kernels from DMMV to MMVQ if reorder is enabled and it's supported, so it's no longer only for CUDA devices.

I need to rethink how the dispatcher does the work.
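
Condensed, the dispatch under discussion looks roughly like this; a sketch reconstructed from the snippets quoted in this thread, not the verbatim file contents:

```cpp
// Sketch of the dispatch condition being discussed (reconstructed from the
// quoted review snippets).
if (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda
    || (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) {
    // MMVQ takes precedence over DMMV when both paths would otherwise be selected.
    use_dequantize_mul_mat_vec = use_dequantize_mul_mat_vec && !use_mul_mat_vec_q;
}
```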

Collaborator:

Yes, please ignore this comment.

Collaborator:

Another comment:
The reorder behavior impacts the code path in this PR: use_dequantize_mul_mat_vec = use_dequantize_mul_mat_vec && !use_mul_mat_vec_q;

This code works well for CUDA but not for Intel GPUs; that's why it was limited to the CUDA backend only. Some cases (models) will benefit from it, and some will get worse on Intel GPUs.

I suggest removing this behavior: only optimize the ops via the reorder, and do not change the code path.

Collaborator:

From what we have measured, the new MMVQ code path with the reorder optimization is faster on Intel devices as well (cf. the PR description); that's why we suggest enabling it by default now. Can you let us know if you find a model or device where this causes a performance regression?

Collaborator:

The code-path change is bundled with the optimization code in this PR, so any side effect of the code-path change is masked by the optimization. The test results of this PR therefore can't prove that the code-path change has no side effects.

Let me check it.

@NeoZhangJianyu (Collaborator) commented Apr 14, 2025:

I used ./examples/sycl/run-llama2.sh to test it on Arc.

I disabled the reorder and only changed the code path to match the CUDA path. This affects both the performance and the output:
the eval performance drops from 28 to 16 tokens per second, and the output text changes too.

base:
commit 307bfa2 (HEAD -> master, tag: b5126, origin/master, origin/HEAD)
Author: Alan Gray [email protected]
Date: Sun Apr 13 22:12:21 2025 +0100

llama_perf_context_print: prompt eval time =     288.98 ms /    19 tokens (   15.21 ms per token,    65.75 tokens per second)
llama_perf_context_print:        eval time =   13903.83 ms /   399 runs   (   34.85 ms per token,    28.70 tokens per second)

Step 1: Get to know the basics of web design
Step 2: Set up a web hosting account
Step 3: Download a free website builder
Step 4: Set up a domain name
Step 5: Design your website
Step 6: Add content to your site
Step 7: Make the site responsive
Step 8: Add a contact form
Step 9: Add a social media share button
Step 10: Advertise your website

I changed the code path with:

if (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda || true)

llama_perf_context_print: prompt eval time =     210.78 ms /    19 tokens (   11.09 ms per token,    90.14 tokens per second)
llama_perf_context_print:        eval time =   24153.18 ms /   399 runs   (   60.53 ms per token,    16.52 tokens per second)


Step 1: Select the Website Name
Step 2: Select the Website Type
Step 3: Choose a Website Theme
Step 4: Select a Website Builder
Step 5: Register a Domain Name
Step 6: Buy Web Hosting
Step 7: Upload the Website
Step 8: Connect to Social Media
Step 9: Add a Call to Action
Step 10: Improve Search Engine Optimization (SEO)

@NeoZhangJianyu (Collaborator) commented Apr 14, 2025:

I tested this PR's code with ./examples/sycl/run-llama2.sh on Arc.

With this PR: https://github.com/Alcpz/llama.cpp/tree/Alcpz/mmvq_q4_0_reorder

commit 52b1622 (HEAD -> Alcpz/mmvq_q4_0_reorder, origin/Alcpz/mmvq_q4_0_reorder)
Author: Alberto Cabrera [email protected]
Date: Thu Apr 10 16:10:43 2025 +0100

The performance increases, but the result contains more errors:

llama_perf_context_print: prompt eval time =     318.59 ms /    19 tokens (   16.77 ms per token,    59.64 tokens per second)
llama_perf_context_print:        eval time =   13073.32 ms /   399 runs   (   32.77 ms per token,    30.52 tokens per second)


Step 1: Select the Website Name
Step 2: Select the Website Type
Step 3: Choose a Website Theme
Step 4: Choose a Website Name
Step 5: Choose a Website Theme
Step 6: Choose a Website Name
Step 7: Choose a Website Theme
Step 8: Choose a Website Name
Step 9: Choose a Website Theme
Step 10: Choose a Website Theme
Step 11: Choose a Website Name

I updated the code to disable the code-path change with the following:

if (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda){
        // Dispatch becomes obscure with the reorder, MMVQ when the reorder optimization
        // is enabled takes precedence over DMMV, the current if-else implementation
        // requires disabling DMMV if both conditions are met
//        || (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) {
        use_dequantize_mul_mat_vec = use_dequantize_mul_mat_vec && !use_mul_mat_vec_q;
    }

The performance decreases from the non-optimized code (28 t/s) to 25 t/s, but the result is correct.

llama_perf_context_print: prompt eval time =     577.75 ms /    19 tokens (   30.41 ms per token,    32.89 tokens per second)
llama_perf_context_print:        eval time =   15765.81 ms /   399 runs   (   39.51 ms per token,    25.31 tokens per second)


Step 1: Get to know the basics of web design
Step 2: Set up a web hosting account
Step 3: Download a free website builder
Step 4: Set up a domain name
Step 5: Design your website
Step 6: Add content to your site
Step 7: Make the site responsive
Step 8: Add a contact form
Step 9: Add a social media share button
Step 10: Advertise your website

When I disable the reorder optimization and use the following code:

    if (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda
        // Dispatch becomes obscure with the reorder, MMVQ when the reorder optimization
        // is enabled takes precedence over DMMV, the current if-else implementation
        // requires disabling DMMV if both conditions are met
//        || (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) {
        || (ggml_sycl_supports_reorder_mmvq(src0->type))) {

The performance is reduced and the result is wrong.

llama_perf_context_print: prompt eval time =     211.28 ms /    19 tokens (   11.12 ms per token,    89.93 tokens per second)
llama_perf_context_print:        eval time =   23958.10 ms /   399 runs   (   60.05 ms per token,    16.65 tokens per second)
...

Step 1: Select the Website Name
Step 2: Select the Website Type
Step 3: Choose a Website Theme
Step 4: Choose a Website Name
Step 5: Choose a Website Theme
Step 6: Choose a Website Name
Step 7: Choose a Website Theme
Step 8: Choose a Website Name
Step 9: Choose a Website Theme
Step 10: Choose a Website Theme

// qr number of weights in a byte (described as 'before dequantization')
// for quantization types that has low and high bits split, qr is calculated with
// using the lower bits, e.g for Q6 quants QR6 is 2
// qi size of a block in 32 bit integers
Collaborator:

Suggested change
// qi size of a block in 32 bit integers
// qi number of 32 bit integers needed to represent all the quants from a block (`qs` field)

This is more in line with my understanding of qi; thoughts?

Contributor:

I think it's the number of quants that can be packed into a 32-bit type. That's why, from ggml-common.h, the value of QI4_0 (which is (QK4_0 / (4 * QR4_0))) comes out to be 8, as each quant in this case is 4 bits, thus 2 per byte and therefore 4 * 2 = 8 in a 32-bit type.

"Number of 32 bit integers" sounds like more than one integer could be required, which I do not think is what this field is trying to convey.

Collaborator:

The value of QI4_0 is 4 from what I'm seeing: QK4_0 = 32 and QR4_0 = 2, so 32 / (4 * 2) = 4.

> Number of 32 bit integers sounds like more than one integer could be required, which I do not think is what this field is trying to convey

That's my understanding of the field, and it seems to align with the comment from:

// QI = number of 32 bit integers before dequantization
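
For reference, a minimal worked check of these constants, assuming the ggml-common.h definitions cited in this thread:

```cpp
// Worked check of the Q4_0 constants discussed above (definitions from ggml-common.h).
// QK4_0 = 32  -> quantized weights per block
// QR4_0 = 2   -> 4-bit quants packed per byte
// QI4_0 = QK4_0 / (4 * QR4_0) = 32 / 8 = 4
//   i.e. 32 quants * 4 bits = 16 bytes of quantized data = 4 x 32-bit integers.
static_assert(QK4_0 / (4 * QR4_0) == 4, "QI4_0 evaluates to 4 for Q4_0");
```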

Reply:

A lot clearer, I agree.

Comment on lines +28 to +29
assert(blocks_per_subgroup > 0);
assert(block_elements_per_subgroup > 0);


Having asserts inside the kernel seems incorrect, seeing as they check constexpr values. If the asserts are changed to static_assert, they become compile-time checks, which would be better.
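
A minimal sketch of that suggestion, assuming both quantities are in fact constexpr as described:

```cpp
// Sketch: the same sanity checks expressed as compile-time assertions,
// assuming blocks_per_subgroup and block_elements_per_subgroup are constexpr.
static_assert(blocks_per_subgroup > 0, "blocks_per_subgroup must be positive");
static_assert(block_elements_per_subgroup > 0, "block_elements_per_subgroup must be positive");
```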


Comment on lines +780 to +782
constexpr size_t safe_div(const size_t m, const size_t n) {
return (m + n - 1) / n;
}
Contributor:

nit: I think a better name for it could be ceil_div. safe_div sounds like it checks for overflow or promotes to a higher-precision type for the addition, which it does not.
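
For illustration, the rounding-up behavior that the ceil_div name would convey, using the function exactly as defined in the quoted lines:

```cpp
// safe_div (as quoted above) rounds the quotient up; since it is constexpr,
// its behavior can be checked at compile time.
static_assert(safe_div(32, 8) == 4, "exact division is unchanged");
static_assert(safe_div(33, 8) == 5, "any remainder rounds up");
```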
