[SYCL] Overcoming workaround for mmap() allocation on Windows and remove useless wait #13482


Open

s-Nick wants to merge 3 commits into master from add_win_mmap_support

Conversation

s-Nick
Collaborator

@s-Nick s-Nick commented May 12, 2025

This PR removes the workaround for an mmap bug that affects some Intel GPUs on Linux. The bug is not present on Windows, so there is no reason to keep the workaround there.
This introduces a small OS-dependent split in the codebase, but it yields good performance improvements.
Moreover, it also removes some wait() calls on copies that are not necessary in the SYCL backend, since the in-order queues already serialize those operations (see the sketch below).

The work introduced here is based on #13109
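For illustration, here is a minimal, self-contained SYCL sketch (not the actual ggml-sycl code; buffer names and sizes are made up) of why an explicit wait() after a copy is redundant on an in-order queue: later commands are already ordered after the copy by the queue itself, so a single synchronization before host access suffices.

```cpp
#include <sycl/sycl.hpp>

int main() {
    // In-order queue: commands execute in submission order,
    // like the SYCL backend's default queue.
    sycl::queue q{sycl::property::queue::in_order{}};

    const size_t n = 1024;
    float *src = sycl::malloc_device<float>(n, q);
    float *dst = sycl::malloc_device<float>(n, q);

    q.fill(src, 1.0f, n);
    q.memcpy(dst, src, n * sizeof(float));   // no q.wait() needed after the copy
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        dst[i] *= 2.0f;                      // ordered after the memcpy by the queue
    });

    q.wait();                                // single synchronization before freeing
    sycl::free(src, q);
    sycl::free(dst, q);
    return 0;
}
```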

N.B. All numbers were measured with GGML_SYCL_DISABLE_OPT=0.

Lunar Lake's performance (this PR)

| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | pp512 | 1330.42 ± 6.59 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | tg128 | 58.92 ± 0.46 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | pp512 | 2044.01 ± 13.08 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | tg128 | 44.47 ± 0.13 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | pp512 | 320.23 ± 0.97 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | tg128 | 22.66 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | pp512 | 533.16 ± 1.41 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | tg128 | 15.41 ± 0.44 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | pp512 | 1402.31 ± 7.56 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | tg128 | 28.55 ± 0.06 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | pp512 | 502.78 ± 1.02 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | tg128 | 35.83 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | pp512 | 807.02 ± 2.71 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | tg128 | 23.57 ± 0.08 |

build: 0e1009f (5334)

Lunar Lake's performance (#13109)

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 1323.21 ± 8.43 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 52.47 ± 0.42 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 1994.78 ± 6.69 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 40.50 ± 0.10 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 297.47 ± 0.49 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 21.58 ± 0.08 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 499.53 ± 2.32 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 15.54 ± 0.31 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 907.84 ± 0.56 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 27.54 ± 0.09 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 477.35 ± 0.33 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 33.95 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 757.61 ± 1.53 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 21.80 ± 0.32 |

build: f7e7d2a (5331)

Battlemage (B580) performance (this PR)

| model | size | params | backend | ngl | threads | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | pp512 | 7314.80 ± 23.23 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | tg128 | 71.10 ± 2.21 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | pp512 | 7419.09 ± 27.47 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | tg128 | 88.57 ± 0.12 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | pp512 | 2147.78 ± 6.70 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | tg128 | 40.59 ± 0.07 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | pp512 | 2189.34 ± 2.19 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | tg128 | 38.32 ± 0.02 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | pp512 | 5605.63 ± 22.70 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | tg128 | 72.54 ± 0.29 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | pp512 | 3002.45 ± 4.25 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | tg128 | 62.49 ± 0.04 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | pp512 | 3103.20 ± 3.79 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | tg128 | 58.64 ± 0.01 |

build: 0e1009f (5334)

Battlemage (B580) performance (#13109)

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | 0 | pp512 | 7067.24 ± 53.67 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | 0 | tg128 | 64.51 ± 0.33 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | 0 | pp512 | 7132.89 ± 28.96 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | 0 | tg128 | 78.58 ± 0.19 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | pp512 | 2109.49 ± 2.46 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | tg128 | 38.37 ± 0.11 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | pp512 | 2143.62 ± 0.99 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | tg128 | 36.33 ± 0.03 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | 0 | pp512 | 5322.20 ± 22.77 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | 0 | tg128 | 64.48 ± 0.08 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | pp512 | 2936.43 ± 7.73 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | tg128 | 57.50 ± 0.11 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | pp512 | 3024.06 ± 8.17 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | tg128 | 54.19 ± 0.05 |

build: f7e7d2a (5331)

Logs for different GPUs on Linux

This section collects logs showing that this patch works on Linux without affecting performance or correctness.

Lunar Lake

lnl-test.txt

lnl_bench.txt
master_lnl.txt

Battlemage B580

bmg-test.txt

bmg_bench.txt
master_bmg.txt

PVC

pvc-test.txt

pvc_bench.txt
master_pvc.txt

ARC A770

arc-test.txt

arc_bench.txt
master_arc.txt

llama-cli output

bmg_cli_output.txt
lnl_cli_output.txt
pvc_cli_output.txt
arc_cli_output.txt

@s-Nick s-Nick requested a review from Alcpz May 12, 2025 13:04
@github-actions github-actions bot added the examples, ggml, and SYCL labels May 12, 2025
@NeoZhangJianyu
Collaborator

@s-Nick
This PR title is about mmap(), but there are also code changes to other functions.

Could you clarify the other code changes in this PR?

@s-Nick s-Nick changed the title [SYCL] Overcoming workaround for mmap() allocation on Windows [SYCL] Overcoming workaround for mmap() allocation on Windows and remove useless wait May 15, 2025
s-Nick added 3 commits May 16, 2025 09:01
The default queue is in-order, so many synchronizations with the host are unnecessary.
After some testing I found that mmap is supported on Windows and on many GPUs on Linux, so the workaround is removed on Windows where it is not needed.
The SYCL backend introduced a workaround that allows llama-bench to run without specifying the `--mmp 0` flag.
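As a hypothetical illustration of the OS split described in the commits above (the function name is made up and is not an actual ggml-sycl symbol), the idea is simply to keep the mmap-related workaround only where the issue can still occur:

```cpp
// Illustrative helper only: keep the mmap-related workaround on Linux,
// where the driver issue can still occur, and skip it on Windows.
static bool sycl_needs_mmap_workaround() {
#if defined(_WIN32)
    return false;   // issue not observed on Windows
#else
    return true;    // keep the workaround on Linux
#endif
}
```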
@s-Nick s-Nick force-pushed the add_win_mmap_support branch from 0e1009f to 083f56b Compare May 16, 2025 08:03
@NeoZhangJianyu
Copy link
Collaborator

All wait() calls in the SYCL backend have been verified against the output values.
Don't remove them without detailed testing.

@s-Nick
Collaborator Author

s-Nick commented May 16, 2025

Thank you for your review @NeoZhangJianyu.

I updated the description, adding logs from llama-bench, llama-cli, and test-backend-ops to address your concerns. I hope everything is clear now. If necessary, I can run other tests available in llama.cpp.

@s-Nick s-Nick marked this pull request as ready for review May 16, 2025 13:43
Comment on lines -83 to -86
Note:

- When using SYCL backend, there would be hang issue in some cases. Please set `--mmp 0`.

Contributor

Doesn't this still exist on Linux, and hence we still have the workaround for Linux?
Maybe we should mention that Linux still has it at the moment and Windows does not?
