
sycl : Implemented reorder Q4_K mmvq #13109


Merged
merged 1 commit into from
May 15, 2025

Conversation

sgeor255
Contributor

@sgeor255 sgeor255 commented Apr 25, 2025

This PR enables the reorder optimization for the Q4_K layout, similar to #12858 . This branch is based off of @Alcpz 's; until that PR is merged, the easiest way to review this one is to look at the diff for d1f5b2d .

Some performance numbers below:

Lunar lake

  • GGML_SYCL_DISABLE_OPT=0
| model | size | params | backend | ngl | threads | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | pp512 | 1593.59 ± 79.66 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | tg128 | 41.43 ± 0.49 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | pp512 | 551.60 ± 2.19 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | tg128 | 17.69 ± 1.04 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | pp512 | 590.18 ± 4.57 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | tg128 | 28.36 ± 0.24 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | pp512 | 507.64 ± 0.92 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | tg128 | 13.61 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | pp512 | 823.78 ± 30.18 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | tg128 | 21.44 ± 0.08 |

build: 105a01d (5223)

  • GGML_SYCL_DISABLE_OPT=1
| model | size | params | backend | ngl | threads | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | pp512 | 1624.32 ± 64.90 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | tg128 | 36.27 ± 0.25 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | pp512 | 552.24 ± 1.20 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | tg128 | 12.83 ± 1.24 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | pp512 | 623.69 ± 3.50 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | tg128 | 24.23 ± 0.58 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | pp512 | 508.55 ± 1.01 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | tg128 | 10.21 ± 0.03 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | pp512 | 820.33 ± 30.67 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | tg128 | 17.72 ± 0.06 |

build: 105a01d (5223)

Arc B580 (Battlemage)

  • GGML_SYCL_DISABLE_OPT=0
| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | pp512 | 7963.47 ± 49.91 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | tg128 | 119.66 ± 1.24 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | pp512 | 2251.25 ± 3.16 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | tg128 | 53.63 ± 0.51 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | pp512 | 5899.09 ± 16.46 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | tg128 | 87.05 ± 2.77 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | pp512 | 2116.96 ± 3.79 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | tg128 | 47.78 ± 0.32 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | pp512 | 3247.42 ± 3.66 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | tg128 | 66.47 ± 0.62 |

build: 105a01d (5223)

  • GGML_SYCL_DISABLE_OPT=1
| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | pp512 | 7900.28 ± 61.92 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | tg128 | 100.15 ± 3.03 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | pp512 | 2250.62 ± 2.25 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | tg128 | 38.05 ± 0.25 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | pp512 | 5925.76 ± 9.85 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | tg128 | 71.27 ± 0.16 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | pp512 | 2114.17 ± 3.93 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | tg128 | 34.39 ± 0.10 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | pp512 | 3265.26 ± 6.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | tg128 | 54.89 ± 0.55 |

build: 105a01d (5223)

Arc A770

  • GGML_SYCL_DISABLE_OPT=0
| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | pp512 | 4540.38 ± 8.00 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | tg128 | 44.47 ± 0.15 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | pp512 | 1753.07 ± 2.08 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | tg128 | 32.04 ± 0.22 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | pp512 | 3785.29 ± 6.46 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | tg128 | 38.65 ± 0.33 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | pp512 | 1702.11 ± 2.83 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | tg128 | 29.26 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | pp512 | 2534.60 ± 0.94 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | tg128 | 34.11 ± 0.32 |

build: 105a01d (5223)

  • GGML_SYCL_DISABLE_OPT=1
| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | pp512 | 4532.79 ± 9.10 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | tg128 | 44.17 ± 0.39 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | pp512 | 1749.38 ± 2.50 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | tg128 | 26.03 ± 0.02 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | pp512 | 3774.80 ± 2.61 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | tg128 | 35.51 ± 0.08 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | pp512 | 1702.25 ± 1.93 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | tg128 | 23.40 ± 0.23 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | pp512 | 2535.88 ± 3.86 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | tg128 | 30.36 ± 0.39 |

build: 105a01d (5223)

@Alcpz Alcpz changed the title from "sycl : Implemented reorder Q4_0 mmvq" to "sycl : Implemented reorder Q4_K mmvq" on Apr 25, 2025
@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (GPU programming language) labels on Apr 25, 2025
Collaborator

@NeoZhangJianyu NeoZhangJianyu left a comment


  1. Could you share the GPU type of the above test results?
  2. Have you tested the PR with local unit tests (UT)?
  3. Could you check the detailed output of a Q4_K LLM?
    I suspect the output may differ from the legacy code.

// Dispatch becomes obscure with the reorder: when the reorder optimization
// is enabled, MMVQ takes precedence over DMMV, so the current if-else
// implementation requires disabling DMMV when both conditions are met.
|| (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) {
Collaborator


I have the same comment and concern: this code will impact the code path below, which could lead to wrong results.

I suggest this PR only optimize the mmvq() function. You could open another PR to optimize the code path, as done by this line.

Contributor Author


Thanks for the comment, this branch is based off of #12858 and the only changes I've added are in d1f5b2d . When #12858 is merged I will rebase again, so whatever conclusion is reached on that PR will propagate here too.

Collaborator


OK! Let's focus on the review of #12858 first.

@@ -2968,14 +2994,17 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
// KQ + KQV multi-batch
ggml_sycl_mul_mat_batched_sycl(ctx, src0, src1, dst);
} else if (use_dequantize_mul_mat_vec) {
ggml_sycl_op_mul_mat(ctx, src0, src1, dst, ggml_sycl_op_dequantize_mul_mat_vec, false);
// save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream());
constexpr bool convert_src1_to_q8_1 = false;
Collaborator


Could you follow the solution of PR #13003?
It fixed the base issue of the Q4_0 reorder.

Contributor Author


I rebased this branch and will rebase it again when #12858 is merged, so these changes should make it into this PR.

Collaborator


Depending on the discussion WRT opt_for_reorder and how to call the function, this will require another rebase, sorry about that. Will keep it to a single commit so you can cherry pick with no issues.

Collaborator


OK! Let's focus on the review of #12858 first.

@NeoZhangJianyu
Collaborator

@sgeor255
Here is a discussion about Q4_K: #13120 (reply in thread)
Could you test the model with this PR?
If the result is good, could you reply with your test result?

We need to promote the SYCL backend in related cases. :)

@sgeor255 sgeor255 force-pushed the svet/mmvq_q4_k_reorder branch 2 times, most recently from d1f5b2d to 105a01d on May 1, 2025 10:49
@sgeor255 sgeor255 requested a review from AD2605 May 1, 2025 10:56
@sgeor255
Contributor Author

sgeor255 commented May 1, 2025

I rebased the PR on @Alcpz 's latest changes & updated the description with more performance numbers.

@sgeor255
Contributor Author

sgeor255 commented May 1, 2025

@NeoZhangJianyu to answer your questions:

  1. Could you share the GPU type of the above test results?

I updated the PR description with results from more devices.

  2. Have you tested the PR with local unit tests (UT)?

Unit tests pass locally (if I understood the question correctly?).

  3. Could you check the detailed output of a Q4_K LLM?
    I suspect the output may differ from the legacy code.

I ran the example script with Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf, output below:

master @ 8936784
sampler seed: 0
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 400, n_keep = 0

user

Building a website can be done in 10 simple steps:
Step 1:assistant

Here are the 10 simple steps to build a website:

**Step 1: Plan Your Website**
 Define the purpose, target audience, and goals of your website. Determine what type of content you will feature, and what features you want to include (e.g., e-commerce, blog, contact form).

**Step 2: Choose a Domain Name**
 Register a unique and memorable domain name that reflects your website's identity and is easy to spell. Make sure to check if the name is available and not already taken by someone else.

**Step 3: Select a Web Host**
 Choose a reliable web hosting service that meets your website's needs in terms of storage, bandwidth, and technical support. Some popular options include Bluehost, HostGator, and SiteGround.

**Step 4: Design Your Website**
 Use a website builder tool or a content management system (CMS) like WordPress to design and layout your website. Choose a theme or template that is responsive and user-friendly.

**Step 5: Add Content**
 Populate your website with high-quality content, including text, images, videos, and other multimedia elements. Make sure to optimize your content for search engines (SEO).

**Step 6: Install a CMS (Optional)**
 If you want to have more control over your website's design and functionality, install a CMS like WordPress, Joomla, or Drupal.

**Step 7: Set Up Navigation**
 Create a logical and intuitive navigation menu that allows visitors to easily find and access different parts of your website.

**Step 8: Add Features and Functionality**
 Add features and functionality to your website, such as contact forms, email newsletters, and e-commerce functionality (if applicable).

**Step 9: Test and Launch**
 Test your website thoroughly to ensure that it is stable, secure, and functions as intended. Launch your website and make it available to the public.

**Step 10: Maintain and Update**
 Regularly update your website's content, plugins, and
This PR
sampler seed: 0
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 400, n_keep = 0

user

Building a website can be done in 10 simple steps:
Step 1:assistant

Here are the 10 simple steps to build a website:

**Step 1: Plan Your Website**
 Define the purpose, target audience, and goals of your website. Determine what type of content you will feature, and what features you want to include (e.g., e-commerce, blog, contact form).

**Step 2: Choose a Domain Name**
 Register a unique and memorable domain name that reflects your website's identity. Ensure it's easy to spell and remember, and consider the extension (e.g., .com, .net, .io).

**Step 3: Select a Web Hosting Service**
 Choose a reliable web hosting service that meets your needs (e.g., bandwidth, storage, customer support). Consider factors like uptime, security, and scalability.

**Step 4: Plan Your Content**
 Develop a content strategy that includes writing engaging articles, creating high-quality images, and planning a content calendar.

**Step 5: Design Your Website**
 Create a visually appealing and user-friendly website design using a website builder, design software, or by hiring a professional designer.

**Step 6: Choose a Content Management System (CMS)**
 Select a CMS like WordPress, Joomla, or Drupal that suits your needs and allows for easy content management.

**Step 7: Install and Customize Your CMS**
 Install the CMS and customize it to your liking using themes, plugins, and widgets.

**Step 8: Create and Add Content**
 Write and publish engaging content, add images and multimedia, and optimize it for search engines.

**Step 9: Test and Launch**
 Test your website for bugs, usability issues, and performance. Launch your website and make any final adjustments.

**Step 10: Maintain and Update**
 Regularly update your website with fresh content, fix bugs, and keep your CMS and plugins up-to-date to ensure a smooth user experience and maintain search engine rankings.

Let me know if you'd like me to expand on any of these steps!

@AD2605
Contributor

AD2605 commented May 2, 2025

@sgeor255 I cannot resolve my comments (the "resolve conversation" button just isn't there for me), so consider them resolved 👍🏻

Collaborator

@Alcpz Alcpz left a comment


Overall LGTM


@sgeor255 sgeor255 force-pushed the svet/mmvq_q4_k_reorder branch from 105a01d to 685e02b on May 2, 2025 13:27
@sgeor255
Contributor Author

sgeor255 commented May 2, 2025

@sgeor255 Here is a discussion about Q4_K: #13120 (reply in thread). Could you test the model with this PR? If the result is good, could you reply with your test result?

We need to promote the SYCL backend in related cases. :)

@NeoZhangJianyu there's a small improvement for this model too:

  • GGML_SYCL_DISABLE_OPT=0
| model | size | params | backend | ngl | threads | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | SYCL | 99 | 8 | none | pp512 | 3681.78 ± 24.68 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | SYCL | 99 | 8 | none | tg128 | 62.10 ± 0.27 |

build: 105a01d (5223)

  • GGML_SYCL_DISABLE_OPT=1
| model | size | params | backend | ngl | threads | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | SYCL | 99 | 8 | none | pp512 | 3721.85 ± 16.25 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | SYCL | 99 | 8 | none | tg128 | 45.49 ± 0.16 |

build: 105a01d (5223)

@sgeor255
Contributor Author

sgeor255 commented May 9, 2025

This PR is now rebased on master as #12858 was merged.

Collaborator

@qnixsynapse qnixsynapse left a comment


LGTM so far

@NeoZhangJianyu
Collaborator

I find the referenced PR #12858 has performance and wrong-result issues.
Please hold this PR until #12858 is confirmed.

Collaborator

@Rbiessy Rbiessy left a comment


LGTM overall!

@sgeor255 sgeor255 force-pushed the svet/mmvq_q4_k_reorder branch from f7e7d2a to 2a2aef0 on May 14, 2025 15:34

static bool can_use_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
Collaborator


Suggested change:
- static bool can_use_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
+ static bool choose_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {

Collaborator


It's a nit, but can_use seems more accurate to me since there is more logic later on to make the final decision on the matmul implementation.

Collaborator


We should use one function to decide whether deq_mul_mat_vec() needs to be called, instead of several.

Collaborator


What I mean is that the final choice depends on the outputs from can_use_dequantize_mul_mat_vec and can_use_mul_mat_vec_q so it can't all be in a single choose_dequantize_mul_mat_vec currently.

Contributor Author


Yes, I agree with @Rbiessy and that's why it is currently implemented this way.

src0->ne[0] % GGML_SYCL_DMMV_X == 0 && src1->ne[1] == 1;
}

static bool can_use_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
Collaborator


Suggested change:
- static bool can_use_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
+ static bool choose_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {

@Rbiessy
Collaborator

Rbiessy commented May 15, 2025

Merging now since this PR includes an important fix with the reorder optimization mentioned here: #13109 (comment)
I think the major concerns have been answered.

@Rbiessy Rbiessy merged commit 64bb51c into ggml-org:master May 15, 2025
44 checks passed