[SYCL] Overcoming workaround for mmap() allocation on Windows and remove useless wait #13482


Open

s-Nick wants to merge 3 commits into master from add_win_mmap_support

Conversation

s-Nick
Collaborator

@s-Nick s-Nick commented May 12, 2025

This PR removes the workaround for an mmap bug that affects some Intel GPUs on Linux. The bug is not present on Windows, so there is no reason to keep the workaround there.
This introduces a small OS-dependent split in the codebase, but it yields good performance improvements.
Moreover, it also removes some wait() calls on copies that are not necessary in the SYCL backend, since the in-order queues already serialize those operations (see the sketch below).

The work introduced here is based on #13109
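For illustration, here is a minimal, self-contained SYCL sketch (not the actual ggml-sycl code; buffer names and sizes are made up) of why an explicit wait() after a copy is redundant on an in-order queue: later commands are already ordered after the copy by the queue itself, so a single synchronization before host access suffices.

```cpp
#include <sycl/sycl.hpp>

int main() {
    // In-order queue: commands execute in submission order,
    // like the SYCL backend's default queue.
    sycl::queue q{sycl::property::queue::in_order{}};

    const size_t n = 1024;
    float *src = sycl::malloc_device<float>(n, q);
    float *dst = sycl::malloc_device<float>(n, q);

    q.fill(src, 1.0f, n);
    q.memcpy(dst, src, n * sizeof(float));   // no q.wait() needed after the copy
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        dst[i] *= 2.0f;                      // ordered after the memcpy by the queue
    });

    q.wait();                                // single synchronization before freeing
    sycl::free(src, q);
    sycl::free(dst, q);
    return 0;
}
```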

N.B. All numbers were measured with GGML_SYCL_DISABLE_OPT=0.

Lunar Lake's performance (this PR)

| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | pp512 | 1330.42 ± 6.59 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | tg128 | 58.92 ± 0.46 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | pp512 | 2044.01 ± 13.08 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | tg128 | 44.47 ± 0.13 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | pp512 | 320.23 ± 0.97 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | tg128 | 22.66 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | pp512 | 533.16 ± 1.41 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | tg128 | 15.41 ± 0.44 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | pp512 | 1402.31 ± 7.56 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | tg128 | 28.55 ± 0.06 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | pp512 | 502.78 ± 1.02 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | tg128 | 35.83 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | pp512 | 807.02 ± 2.71 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | tg128 | 23.57 ± 0.08 |

build: 0e1009f (5334)

Lunar Lake's performance (#13109)

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 1323.21 ± 8.43 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 52.47 ± 0.42 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 1994.78 ± 6.69 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 40.50 ± 0.10 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 297.47 ± 0.49 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 21.58 ± 0.08 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 499.53 ± 2.32 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 15.54 ± 0.31 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 907.84 ± 0.56 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 27.54 ± 0.09 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 477.35 ± 0.33 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 33.95 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 757.61 ± 1.53 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 21.80 ± 0.32 |

build: f7e7d2a (5331)

Battlemage (B580) performance (this PR)

| model | size | params | backend | ngl | threads | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | pp512 | 7314.80 ± 23.23 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | tg128 | 71.10 ± 2.21 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | pp512 | 7419.09 ± 27.47 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | tg128 | 88.57 ± 0.12 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | pp512 | 2147.78 ± 6.70 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | tg128 | 40.59 ± 0.07 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | pp512 | 2189.34 ± 2.19 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | tg128 | 38.32 ± 0.02 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | pp512 | 5605.63 ± 22.70 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | tg128 | 72.54 ± 0.29 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | pp512 | 3002.45 ± 4.25 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | tg128 | 62.49 ± 0.04 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | pp512 | 3103.20 ± 3.79 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | tg128 | 58.64 ± 0.01 |

build: 0e1009f (5334)

Battlemage (B580) performance (#13109)

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | 0 | pp512 | 7067.24 ± 53.67 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | 0 | tg128 | 64.51 ± 0.33 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | 0 | pp512 | 7132.89 ± 28.96 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | 0 | tg128 | 78.58 ± 0.19 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | pp512 | 2109.49 ± 2.46 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | tg128 | 38.37 ± 0.11 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | pp512 | 2143.62 ± 0.99 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | tg128 | 36.33 ± 0.03 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | 0 | pp512 | 5322.20 ± 22.77 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | 0 | tg128 | 64.48 ± 0.08 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | pp512 | 2936.43 ± 7.73 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | tg128 | 57.50 ± 0.11 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | pp512 | 3024.06 ± 8.17 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | tg128 | 54.19 ± 0.05 |

build: f7e7d2a (5331)

Logs for different GPUs on Linux

This section collects logs showing that this patch works on Linux without affecting performance or correctness.

Lunar Lake

lnl-test.txt

lnl_bench.txt
master_lnl.txt

Battlemage B580

bmg-test.txt

bmg_bench.txt
master_bmg.txt

PVC

pvc-test.txt

pvc_bench.txt
master_pvc.txt

ARC A770

arc-test.txt

arc_bench.txt
master_arc.txt

llama-cli output

bmg_cli_output.txt
lnl_cli_output.txt
pvc_cli_output.txt
arc_cli_output.txt

@s-Nick s-Nick requested a review from Alcpz May 12, 2025 13:04
@github-actions github-actions bot added the examples, ggml, and SYCL labels May 12, 2025
@NeoZhangJianyu
Collaborator

@s-Nick
This PR title is about mmap(), but there are also code changes to other functions.

Could you clarify the other code changes in this PR?

@s-Nick s-Nick changed the title [SYCL] Overcoming workaround for mmap() allocation on Windows [SYCL] Overcoming workaround for mmap() allocation on Windows and remove useless wait May 15, 2025
s-Nick added 3 commits May 16, 2025 09:01
The default queue is in-order, so many synchronizations with the host are unnecessary.
After some testing I found that mmap is supported on Windows and on many GPUs on Linux, so the workaround is removed on Windows where it is not needed.
The SYCL backend introduced a workaround that allows llama-bench to run without specifying the `--mmp 0` flag.
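As a hypothetical illustration of the OS split described in the commits above (the function name is made up and is not an actual ggml-sycl symbol), the idea is simply to keep the mmap-related workaround only where the issue can still occur:

```cpp
// Illustrative helper only: keep the mmap-related workaround on Linux,
// where the driver issue can still occur, and skip it on Windows.
static bool sycl_needs_mmap_workaround() {
#if defined(_WIN32)
    return false;   // issue not observed on Windows
#else
    return true;    // keep the workaround on Linux
#endif
}
```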
@s-Nick s-Nick force-pushed the add_win_mmap_support branch from 0e1009f to 083f56b Compare May 16, 2025 08:03
@NeoZhangJianyu
Copy link
Collaborator

All wait() calls in the SYCL backend have been verified against the output values.
Don't remove them without detailed testing.

@s-Nick
Collaborator Author

s-Nick commented May 16, 2025

Thank you for your review @NeoZhangJianyu.

I updated the description, adding logs from llama-bench, llama-cli, and test-backend-ops to address your concerns. I hope everything is clear now. If necessary, I can run other tests available in llama.cpp.

@s-Nick s-Nick marked this pull request as ready for review May 16, 2025 13:43
Comment on lines -83 to -86
Note:

- When using SYCL backend, there would be hang issue in some cases. Please set `--mmp 0`.

Contributor

Doesn't this still exist on Linux, and hence we still have the workaround for Linux?
Maybe we should mention that Linux still has it at the moment and Windows does not?
