sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs #12858
Open
Alcpz wants to merge 6 commits into ggml-org:master from Alcpz:Alcpz/mmvq_q4_0_reorder
+323 −120
Commits
187451b sycl : Implemented reorder Q4_0 mmvq (Alcpz)
9c8d809 sycl : Fixed mmvq being called when reorder is disabled (Alcpz)
52b1622 sycl : Improved comments in the quants header (Alcpz)
e8555ab Use static_assert (Rbiessy)
b60d637 safe_div -> ceil_div (Rbiessy)
fc768f3 Clarify qi comment (Rbiessy)
Conversation
This PR is named for Intel GPUs, so why change the code for CUDA? In fact, the reorder of src0 won't happen for non-Intel GPUs, so this code has no impact there. I suggest removing it.
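For context, a minimal sketch of the sort of vendor gating described in this comment; the function name and the exact checks are illustrative assumptions, not the actual ggml-sycl code:

```cpp
// Illustrative sketch only: one way the reorder could be gated so that it
// only ever applies on Intel GPUs. should_reorder_src0 is a hypothetical
// helper, not part of ggml-sycl.
#include <sycl/sycl.hpp>
#include <cstdio>

static bool should_reorder_src0(const sycl::device & dev, bool opt_disabled) {
    // Reorder only when the optimization is enabled and the device is an
    // Intel GPU (Intel's PCI vendor ID is 0x8086).
    return !opt_disabled && dev.is_gpu() &&
           dev.get_info<sycl::info::device::vendor_id>() == 0x8086;
}

int main() {
    sycl::device dev{sycl::default_selector_v};
    std::printf("reorder src0: %d\n", should_reorder_src0(dev, /*opt_disabled=*/false));
    return 0;
}
```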
Your comment aligns with my suspicion that this change is obscure. This line switches the kernels from DMMV to MMVQ when the reorder is enabled and supported, so it no longer affects only CUDA devices. I need to rethink how the dispatcher does this work.
Yes, please ignore this comment.
Another comment: the reorder behavior impacts the code path in this PR:

`use_dequantize_mul_mat_vec = use_dequantize_mul_mat_vec && !use_mul_mat_vec_q;`

This code works well for CUDA but not for Intel GPUs; that's why it was limited to the CUDA backend. Some cases (models) will benefit from it, and some will get worse on Intel GPUs. I suggest removing this behavior: only optimize the OPs by reordering, without changing the code path.
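For reference, a small self-contained sketch of the precedence this line creates; the surrounding predicates are stand-ins, not the actual ggml-sycl dispatcher:

```cpp
// Sketch of the dispatch precedence under discussion. mmvq_usable is a
// hypothetical stand-in for the real support check.
#include <cstdio>

// Assumed condition for illustration: MMVQ handles quantized src0 with a
// single src1 column (ne11 == 1).
static bool mmvq_usable(bool src0_is_quantized, int ne11) {
    return src0_is_quantized && ne11 == 1;
}

int main() {
    bool use_mul_mat_vec_q          = mmvq_usable(/*src0_is_quantized=*/true, /*ne11=*/1);
    bool use_dequantize_mul_mat_vec = true; // assume the legacy DMMV conditions hold

    // The line being debated: once MMVQ is usable, DMMV is never taken,
    // which is what changes the legacy code path.
    use_dequantize_mul_mat_vec = use_dequantize_mul_mat_vec && !use_mul_mat_vec_q;

    std::printf("mmvq: %d, dmmv: %d\n", use_mul_mat_vec_q, use_dequantize_mul_mat_vec);
    return 0;
}
```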
From what we have measured, the new mmvq code path with the reorder optimization is faster on Intel devices as well (cf. the PR description), which is why we suggest enabling it by default now. Can you let us know if you find a model or device where this causes a performance regression?
Can you let me know which device you are using exactly? I've been using an Arc B580 and the example works fine with GGML_SYCL_DISABLE_OPT=0, definitely using the new reorder_mul_mat_vec_q4_0_q8_1_sycl.
I've dug a bit deeper, comparing the various sizes of mul_mat_vec used in this example. In short, I can confirm that reorder_mul_mat_vec_q4_0_q8_1_sycl outputs exactly the same values as the existing mmvq kernel (without the reorder), which is itself very close to the CUDA mmvq implementation (from ggml-cuda).

I am computing the top 10 absolute and relative errors compared to the CPU backend, with a fixed seed. With that I can confirm the outputs of the different mmvq implementations are identical. See the example below for the sizes m=k=4096 and n=1, comparing SYCL reorder mmvq with native CUDA mmvq. The accuracy is expected to be lower with mmvq since src1 is quantized, but that does not necessarily translate to lower model accuracy.

I've done the same for the configurations m=11008 n=1 k=4096; m=11008 n=2 k=4096; m=4096 n=1 k=11008; m=4096 n=2 k=11008; and m=4096 n=2 k=4096. Attaching the files below:
test_06.txt
test_05.txt
test_04.txt
test_03.txt
test_02.txt
test_01.txt

With these results I don't understand how you could get incorrect output; do you have an idea of what could go wrong? I think we could add another option to disable mmvq for debugging purposes, but mmvq should become the default for its better performance.
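A minimal, self-contained sketch of the kind of accuracy check described above (largest absolute and relative errors against a reference); the data is a placeholder, and the actual test harness used for the attached files is not shown here:

```cpp
// Compare a backend's output against the CPU reference and report the
// largest absolute and relative errors. The vectors below are stand-ins;
// the real comparison runs over full mul_mat outputs per (m, n, k) config.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Stand-ins for the CPU reference and the SYCL/CUDA mmvq output.
    std::vector<float> ref = {1.00f, -2.50f, 0.75f, 3.25f};
    std::vector<float> out = {1.00f, -2.51f, 0.74f, 3.30f};

    std::vector<float> abs_err(ref.size());
    std::vector<float> rel_err(ref.size());
    for (size_t i = 0; i < ref.size(); ++i) {
        abs_err[i] = std::fabs(out[i] - ref[i]);
        // Guard against division by zero for tiny reference values.
        rel_err[i] = abs_err[i] / std::max(std::fabs(ref[i]), 1e-6f);
    }

    // Sort descending and print the top entries (top 10 in the comment above).
    std::sort(abs_err.rbegin(), abs_err.rend());
    std::sort(rel_err.rbegin(), rel_err.rend());
    size_t top = std::min<size_t>(10, ref.size());
    for (size_t i = 0; i < top; ++i) {
        std::printf("abs %.6g  rel %.6g\n", abs_err[i], rel_err[i]);
    }
    return 0;
}
```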
I use an Arc 770. Your result above matches the code change in my test case: this code forces execution of

`use_dequantize_mul_mat_vec = use_dequantize_mul_mat_vec && !use_mul_mat_vec_q;`

which changes the legacy code path. The choice between calling dequantize_mul_mat_vec() and mul_mat_vec_q() is changed, and that impacts the result. As you said, the OP functions on GPU and CPU should give the same result (within a small error), but the code path affects the final result. I think the optimization should keep the same result as the legacy code. It's fine to change the code path to get better performance, but I suggest dividing this into two PRs: one that optimizes the OP function and one that optimizes the code path. It looks like optimizing the code path affects the correctness of the final result.
@NeoZhangJianyu @Rbiessy Thanks for putting in the effort to review the PR. I'm happy to split the PR in two, but let's first try to understand why this gets worse performance. Could it be a driver/operating system issue, maybe? I'll try to gather more information.
@NeoZhangJianyu, are you testing the PR on Linux or Windows?
Driver and OS generally only impact performance. You must pay attention to the tensor shapes: for example, some code works well for 32 * n but badly for 24 * n. I tested on Linux.
Thanks for the clarification. It's hard to understand the issues you are finding because I don't fully know what you are testing. I'll try to replicate the results locally and, depending on the findings, see whether the PR has to be split or the dispatch can be slightly improved.