
Feature: Integrate with unified SYCL backend for Intel GPUs #2690


Merged: 92 commits, merged on Jan 28, 2024

Commits
7a4343d
first update for migration
NeoZhangJianyu Dec 27, 2023
2338769
update init_cublas
NeoZhangJianyu Dec 28, 2023
0c00b4f
add debug functio, commit all help code
NeoZhangJianyu Dec 29, 2023
ff83711
step 1
NeoZhangJianyu Dec 29, 2023
02dffb6
step 2
NeoZhangJianyu Dec 29, 2023
43f2c35
step3 add fp16, slower 31->28
NeoZhangJianyu Dec 31, 2023
da752ed
add GGML_LIST_DEVICE function
NeoZhangJianyu Dec 31, 2023
6dd3278
step 5 format device and print
NeoZhangJianyu Dec 31, 2023
3a9d2c5
step6, enhance error check, remove CUDA macro, enhance device id to f…
NeoZhangJianyu Jan 4, 2024
65f895d
support main device is non-zero
NeoZhangJianyu Jan 4, 2024
3b1a743
step7 add debug for code path, rm log
NeoZhangJianyu Jan 6, 2024
c2ef7a9
step 8, rename all macro & func from cuda by sycl
NeoZhangJianyu Jan 7, 2024
69d76c8
fix error of select non-zero device, format device list
NeoZhangJianyu Jan 8, 2024
c709c3c
ren ggml-sycl.hpp -> ggml-sycl.h
NeoZhangJianyu Jan 9, 2024
fa3a586
clear CMAKE to rm unused lib and options
NeoZhangJianyu Jan 9, 2024
3645f25
correct queue: rm dtct:get_queue
NeoZhangJianyu Jan 10, 2024
bd38129
add print tensor function to debug
NeoZhangJianyu Jan 12, 2024
5b53899
fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481
NeoZhangJianyu Jan 13, 2024
a47f5ec
summary dpct definition in one header file to replace folder:dpct
NeoZhangJianyu Jan 13, 2024
c67c2ab
refactor device log
NeoZhangJianyu Jan 13, 2024
c3c5b20
mv dpct definition from folder dpct to ggml-sycl.h
NeoZhangJianyu Jan 15, 2024
ca2cb69
update readme, refactor build script
NeoZhangJianyu Jan 15, 2024
95daece
fix build with sycl
NeoZhangJianyu Jan 15, 2024
a8936f4
set nthread=1 when sycl, increase performance
NeoZhangJianyu Jan 15, 2024
79d30d7
add run script, comment debug code
NeoZhangJianyu Jan 15, 2024
0d6e721
add ls-sycl-device tool
NeoZhangJianyu Jan 15, 2024
7350fd4
add ls-sycl-device, rm unused files
NeoZhangJianyu Jan 15, 2024
09b5619
rm rear space
NeoZhangJianyu Jan 15, 2024
d80dd65
dos2unix
NeoZhangJianyu Jan 15, 2024
593ce00
Update README_sycl.md
NeoZhangJianyu Jan 18, 2024
57e9fba
fix return type
luoyu-intel Jan 18, 2024
d5f7d36
remove sycl version from include path
luoyu-intel Jan 18, 2024
35a0daa
restore rm code to fix hang issue
NeoZhangJianyu Jan 18, 2024
ae941b1
add syc and link for sycl readme
NeoZhangJianyu Jan 19, 2024
e3481fa
rm original sycl code before refactor
NeoZhangJianyu Jan 19, 2024
623d803
fix code err
luoyu-intel Jan 19, 2024
f396a3b
add know issue for pvc hang issue
NeoZhangJianyu Jan 20, 2024
f008cc7
enable SYCL_F16 support
luoyu-intel Jan 22, 2024
67e6b3c
align pr4766
airMeng Jan 23, 2024
533c647
check for sycl blas, better performance
NeoZhangJianyu Jan 23, 2024
dd7f139
cleanup 1
abhilash1910 Jan 23, 2024
b403784
remove extra endif
airMeng Jan 23, 2024
a0a1304
add build&run script, clean CMakefile, update guide by review comments
NeoZhangJianyu Jan 23, 2024
27c08c0
Merge branch 'sycl' of https://github.com/abhilash1910/llama.cpp into…
NeoZhangJianyu Jan 23, 2024
97cbe18
rename macro to intel hardware
NeoZhangJianyu Jan 23, 2024
1ddaf44
editor config format
abhilash1910 Jan 23, 2024
bd716b2
format fixes
abhilash1910 Jan 23, 2024
be31379
format fixes
abhilash1910 Jan 23, 2024
d097e2a
editor format fix
abhilash1910 Jan 23, 2024
88f64b7
Remove unused headers
abhilash1910 Jan 23, 2024
756c4ac
skip build sycl tool for other code path
NeoZhangJianyu Jan 23, 2024
b42a32d
replace tab by space
NeoZhangJianyu Jan 23, 2024
5f83a12
fix blas matmul function
abhilash1910 Jan 23, 2024
d6fc1a0
fix mac build
abhilash1910 Jan 23, 2024
c7e745e
restore hip dependency
abhilash1910 Jan 23, 2024
3bfb846
fix conflict
NeoZhangJianyu Jan 23, 2024
498121b
ren as review comments
NeoZhangJianyu Jan 24, 2024
91b1461
mv internal function to .cpp file
NeoZhangJianyu Jan 24, 2024
816f480
export funciton print_sycl_devices(), mv class dpct definition to sou…
NeoZhangJianyu Jan 24, 2024
7a44a95
update CI/action for sycl code, fix CI error of repeat/dup
NeoZhangJianyu Jan 24, 2024
7babd76
fix action ID format issue
NeoZhangJianyu Jan 24, 2024
04a46c4
rm unused strategy
NeoZhangJianyu Jan 24, 2024
799af05
enable llama_f16 in ci
airMeng Jan 24, 2024
ec5c8bc
fix conflict
NeoZhangJianyu Jan 24, 2024
22e1b45
fix build break on MacOS, due to CI of MacOS depend on external ggml,…
NeoZhangJianyu Jan 24, 2024
238ec31
Merge branch 'master' into sycl
abhilash1910 Jan 24, 2024
67de350
fix ci cases for unsupported data type
NeoZhangJianyu Jan 24, 2024
fb15de3
revert unrelated changed in cuda cmake
airMeng Jan 24, 2024
96186a7
revert hip cmake changes
airMeng Jan 24, 2024
d07a88d
fix indent
airMeng Jan 24, 2024
8dd1b60
add prefix in func name
NeoZhangJianyu Jan 24, 2024
3aabd8a
revert no mmq
airMeng Jan 24, 2024
18742f7
rm cpu blas duplicate
abhilash1910 Jan 24, 2024
0e235fb
fix no_new_line
airMeng Jan 24, 2024
5600118
fix src1->type==F16 bug.
luoyu-intel Jan 24, 2024
eef5faa
pass batch offset for F16 src1
luoyu-intel Jan 24, 2024
5bb93d4
fix batch error
luoyu-intel Jan 24, 2024
0635f84
fix wrong code
luoyu-intel Jan 24, 2024
f1bab50
revert sycl checking in test-sampling
airMeng Jan 25, 2024
66e24c2
pass void as arguments of ggml_backend_sycl_print_sycl_devices
airMeng Jan 25, 2024
b06dca6
remove extra blank line in test-sampling
airMeng Jan 25, 2024
05b7f9b
revert setting n_threads in sycl
airMeng Jan 25, 2024
d6a6505
implement std::isinf for icpx with fast math.
luoyu-intel Jan 26, 2024
174c9a0
Update ci/run.sh
abhilash1910 Jan 26, 2024
c08fec2
Update examples/sycl/run-llama2.sh
abhilash1910 Jan 26, 2024
2cba564
Update examples/sycl/run-llama2.sh
abhilash1910 Jan 26, 2024
f707051
Update CMakeLists.txt
abhilash1910 Jan 26, 2024
45b0618
Update CMakeLists.txt
abhilash1910 Jan 26, 2024
5531754
Update CMakeLists.txt
abhilash1910 Jan 26, 2024
b9ffaab
Update CMakeLists.txt
abhilash1910 Jan 26, 2024
2ab9715
add copyright and MIT license declare
NeoZhangJianyu Jan 26, 2024
d394ca7
update the cmd example
NeoZhangJianyu Jan 28, 2024
52 changes: 39 additions & 13 deletions CMakeLists.txt
@@ -1,5 +1,6 @@
cmake_minimum_required(VERSION 3.14) # for add_link_options and implicit target directories.
project("llama.cpp" C CXX)
include(CheckIncludeFileCXX)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

@@ -96,13 +97,14 @@ set(LLAMA_CUDA_KQUANTS_ITER "2" CACHE STRING "llama: iters./thread per block for
set(LLAMA_CUDA_PEER_MAX_BATCH_SIZE "128" CACHE STRING
"llama: max. batch size for using peer access")
option(LLAMA_HIPBLAS "llama: use hipBLAS" OFF)
option(LLAMA_HIP_UMA "llama: use HIP unified memory architecture" OFF)
option(LLAMA_CLBLAST "llama: use CLBlast" OFF)
option(LLAMA_METAL "llama: use Metal" ${LLAMA_METAL_DEFAULT})
option(LLAMA_METAL_NDEBUG "llama: disable Metal debugging" OFF)
option(LLAMA_METAL_SHADER_DEBUG "llama: compile Metal with -fno-fast-math" OFF)
option(LLAMA_MPI "llama: use MPI" OFF)
option(LLAMA_QKK_64 "llama: use super-block size of 64 for k-quants" OFF)
option(LLAMA_SYCL "llama: use SYCL" OFF)
option(LLAMA_SYCL_F16 "llama: use 16 bit floats for sycl calculations" OFF)

option(LLAMA_BUILD_TESTS "llama: build tests" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_EXAMPLES "llama: build examples" ${LLAMA_STANDALONE})
@@ -121,8 +123,12 @@ include(${CMAKE_CURRENT_SOURCE_DIR}/scripts/build-info.cmake)
#
# Compile flags
#
if (LLAMA_SYCL)
set(CMAKE_CXX_STANDARD 17)
Member (review comment):

How deep is the C++17 dependency in the SYCL backend?

It's okay to optionally include it like this, but I'm wondering if it is realistic to implement this in C++11 at some point - it would be in better harmony with the rest of the codebase.

Collaborator (author) reply:

The icpx compiler expects the C++17 standard, and SYCL depends on that version. We considered keeping the same C++11 version as the rest of the codebase, but it causes compilation errors due to the dependency on C++17 headers.
@NeoZhangJianyu, @AidanBeltonS and others can add on this.

Contributor reply:

Just to add a bit more info: it is not only that the SYCL compiler, icpx, expects C++17. C++17 is a core aspect of the SYCL open standard. Any SYCL 2020 code is expected to be C++17 conformant, so the relationship is deeper than just the specific implementation of the Khronos specification. I would say the dependency between SYCL and C++17 is hard, and SYCL-specific features would likely not work well if compiled as C++11.

From the spec (https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf): "The SYCL specification is now based on the core language of C++17, as described in Section 3.9.1. Features of C++17 are now enabled within the specification, such as deduction guides for class template argument deduction."

else()
set(CMAKE_CXX_STANDARD 11)
endif()

set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED true)
set(CMAKE_C_STANDARD 11)
set(CMAKE_C_STANDARD_REQUIRED true)
@@ -338,18 +344,11 @@ if (LLAMA_CUBLAS)
add_compile_definitions(GGML_CUDA_PEER_MAX_BATCH_SIZE=${LLAMA_CUDA_PEER_MAX_BATCH_SIZE})

if (LLAMA_STATIC)
if (WIN32)
# As of 12.3.1 CUDA Tookit for Windows does not offer a static cublas library
set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} CUDA::cudart_static CUDA::cublas CUDA::cublasLt)
else ()
set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} CUDA::cudart_static CUDA::cublas_static CUDA::cublasLt_static)
endif()
set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} CUDA::cudart_static CUDA::cublas_static CUDA::cublasLt_static)
else()
set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} CUDA::cudart CUDA::cublas CUDA::cublasLt)
endif()

set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} CUDA::cuda_driver)

if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
# 52 == lowest CUDA 12 standard
# 60 == f16 CUDA intrinsics
@@ -426,9 +425,6 @@ if (LLAMA_HIPBLAS)
if (${hipblas_FOUND} AND ${hip_FOUND})
message(STATUS "HIP and hipBLAS found")
add_compile_definitions(GGML_USE_HIPBLAS GGML_USE_CUBLAS)
if (LLAMA_HIP_UMA)
add_compile_definitions(GGML_HIP_UMA)
endif()
add_library(ggml-rocm OBJECT ggml-cuda.cu ggml-cuda.h)
if (BUILD_SHARED_LIBS)
set_target_properties(ggml-rocm PROPERTIES POSITION_INDEPENDENT_CODE ON)
@@ -454,6 +450,35 @@ if (LLAMA_HIPBLAS)
endif()
endif()


if (LLAMA_SYCL)
if ( NOT DEFINED ENV{ONEAPI_ROOT})
message(FATAL_ERROR "Not detect ENV {ONEAPI_ROOT}, please install oneAPI & source it, like: source /opt/intel/oneapi/setvars.sh")
endif()
#todo: AOT

find_package(IntelSYCL REQUIRED)
if (LLAMA_SYCL_F16)
add_compile_definitions(GGML_SYCL_F16)
endif()
add_compile_definitions(GGML_USE_SYCL)

add_compile_options(-I./) #include DPCT
add_compile_options(-I/${SYCL_INCLUDE_DIR})

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-narrowing")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl -L${MKLROOT}/lib")

set(GGML_HEADERS_SYCL ggml.h ggml-sycl.h)
set(GGML_SOURCES_SYCL ggml-sycl.cpp)

set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} sycl OpenCL mkl_core pthread m dl mkl_sycl_blas mkl_intel_ilp64 mkl_tbb_thread)

endif()



function(get_flags CCID CCVER)
set(C_FLAGS "")
set(CXX_FLAGS "")
@@ -790,6 +815,7 @@ add_library(ggml OBJECT
${GGML_SOURCES_METAL} ${GGML_HEADERS_METAL}
${GGML_SOURCES_MPI} ${GGML_HEADERS_MPI}
${GGML_SOURCES_EXTRA} ${GGML_HEADERS_EXTRA}
${GGML_SOURCES_SYCL} ${GGML_HEADERS_SYCL}
)

target_include_directories(ggml PUBLIC . ${LLAMA_EXTRA_INCLUDES})
11 changes: 10 additions & 1 deletion README.md
@@ -63,7 +63,7 @@ The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quant
- AVX, AVX2 and AVX512 support for x86 architectures
- Mixed F16 / F32 precision
- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization support
- CUDA, Metal and OpenCL GPU backend support
- CUDA, Metal, OpenCL, SYCL GPU backend support

The original implementation of `llama.cpp` was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022).
Since then, the project has improved significantly thanks to many contributions. This project is mainly for educational purposes and serves
@@ -597,6 +597,15 @@ Building the program with BLAS support may lead to some performance improvements

You can get a list of platforms and devices from the `clinfo -l` command, etc.

- #### SYCL

SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.

The SYCL backend enables llama.cpp on Intel GPUs (Data Center Max series, Flex series, Arc series, built-in Arc GPUs, and iGPUs).

For detailed info, please refer to [llama.cpp for SYCL](README_sycl.md).


### Prepare Data & Run

```bash
252 changes: 252 additions & 0 deletions README_sycl.md
@@ -0,0 +1,252 @@
# llama.cpp for SYCL

[Background](#background)

[OS](#os)

[Intel GPU](#intel-gpu)

[Linux](#linux)

[Environment Variable](#environment-variable)

[Known Issue](#known-issue)

[Todo](#todo)

## Background

SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators—such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.

oneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. The spec has both direct programming and API-based programming paradigms.

Intel uses SYCL as the direct programming language to support its CPUs, GPUs, and FPGAs.

To avoid reinventing the wheel, this code follows the other backend code paths in llama.cpp (such as OpenBLAS, cuBLAS, and CLBlast). We used the open-source tool [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) (commercial release: [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) to migrate the CUDA code to SYCL.

The SYCL backend of llama.cpp supports Intel GPUs.

For Intel CPUs, we recommend using the x86 build of llama.cpp with Intel MKL instead.

## OS

|OS|Status|Verified|
|-|-|-|
|Linux|Support|Ubuntu 22.04|
|Windows|Ongoing| |


## Intel GPU

|Intel GPU| Status | Verified Model|
|-|-|-|
|Intel Data Center Max Series| Support| Max 1550|
|Intel Data Center Flex Series| Support| Flex 170|
|Intel Arc Series| Support| Arc 770|
|Intel built-in Arc GPU| Support| built-in Arc GPU in Meteor Lake|
|Intel iGPU| Support| iGPU in i5-1250P, i7-1165G7|


## Linux

### Setup Environment

1. Install the Intel GPU driver.

a. Install the Intel GPU driver by following the official guide: [Install GPU Drivers](https://dgpu-docs.intel.com/driver/installation.html).

Note: for an iGPU, install the client GPU driver.

b. Add your user to the video and render groups:

```
sudo usermod -aG render username
sudo usermod -aG video username
```

Note: log out and log back in for the group changes to take effect.
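To confirm the membership after re-login, a quick check can be scripted. This is a minimal sketch, assuming a POSIX shell and the standard `id` utility:

```shell
# Check whether the current user is in the render and video groups.
# "missing" usually means the usermod change has not taken effect yet
# (a re-login is required).
for g in render video; do
  if id -nG | tr ' ' '\n' | grep -qx "$g"; then
    echo "$g: ok"
  else
    echo "$g: missing"
  fi
done
```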

c. Check

```
sudo apt install clinfo
sudo clinfo -l
```

Output (example):

```
Platform #0: Intel(R) OpenCL Graphics
`-- Device #0: Intel(R) Arc(TM) A770 Graphics


Platform #0: Intel(R) OpenCL HD Graphics
`-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
```

2. Install Intel® oneAPI Base toolkit.


a. Please follow the procedure in [Get the Intel® oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).

We recommend installing to the default folder: **/opt/intel/oneapi**.

This guide uses the default folder in its examples. If you installed to a different folder, adjust the paths below accordingly.

b. Check

```
source /opt/intel/oneapi/setvars.sh

sycl-ls
```

There should be one or more Level-Zero devices listed, such as **[ext_oneapi_level_zero:gpu:0]**.

Output (example):
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]

```
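If the device list is long, the Level-Zero entries can be filtered out. Below is a sketch of the filter; the sample output is inlined so the logic can be tried without a GPU, and on a real machine you would pipe `sycl-ls` directly:

```shell
# Count the Level-Zero GPU entries; on real hardware use:
#   sycl-ls | grep -c 'ext_oneapi_level_zero:gpu'
sample='[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics'
echo "$sample" | grep -c 'ext_oneapi_level_zero:gpu'   # prints 1 for this sample
```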

3. Build locally:

```
mkdir -p build
cd build
source /opt/intel/oneapi/setvars.sh

#for FP16
#cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON # faster for long-prompt inference

#for FP32
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

#build example/main only
#cmake --build . --config Release --target main

#build all binary
cmake --build . --config Release -v

```

or

```
./examples/sycl/build.sh
```

Note:

- By default, all binaries are built, which takes more time. To reduce build time, we recommend building only the **example/main** target.

### Run

1. Put the model file in the **models** folder.

2. Enable the oneAPI runtime environment:

```
source /opt/intel/oneapi/setvars.sh
```

3. List the device IDs

Run without parameters:

```
./build/bin/ls-sycl-device

or

./build/bin/main
```

Check the IDs in the startup log, for example:

```
found 4 SYCL devices:
Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3,
max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136
Device 1: Intel(R) FPGA Emulation Device, compute capability 1.2,
max compute_units 24, max work group size 67108864, max sub group size 64, global mem size 67065057280
Device 2: 13th Gen Intel(R) Core(TM) i7-13700K, compute capability 3.0,
max compute_units 24, max work group size 8192, max sub group size 64, global mem size 67065057280
Device 3: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0,
max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136

```

|Attribute|Note|
|-|-|
|compute capability 1.3|Level-Zero runtime; recommended|
|compute capability 3.0|OpenCL runtime; slower than Level-Zero in most cases|
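When scripting, the device indices can be pulled out of a saved startup log with `sed`. This is a sketch with the log text inlined for illustration:

```shell
# Extract SYCL device indices from a saved startup log (sample text inlined).
log='found 4 SYCL devices:
Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3,
Device 2: 13th Gen Intel(R) Core(TM) i7-13700K, compute capability 3.0,'
echo "$log" | sed -n 's/^Device \([0-9][0-9]*\):.*/\1/p'   # prints 0 and 2
```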

4. Set the device ID and run llama.cpp

Set device ID = 0 with **GGML_SYCL_DEVICE=0**:

```
GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
```
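Note that the variable must be passed on the same command line as the binary (or exported with `export`) for the program to see it. A shell-only sketch of the difference, using `sh -c 'echo …'` as a stand-in for the binary:

```shell
unset GGML_SYCL_DEVICE
# A same-line assignment is exported to that one command:
GGML_SYCL_DEVICE=0 sh -c 'echo "child sees: ${GGML_SYCL_DEVICE:-unset}"'   # child sees: 0
# A bare assignment joined with && (as in `VAR=0 && cmd`) sets an
# unexported shell variable, so the child process does not see it:
GGML_SYCL_DEVICE=1 && sh -c 'echo "child sees: ${GGML_SYCL_DEVICE:-unset}"'   # child sees: unset
```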
or run by script:

```
./examples/sycl/run_llama2.sh
```

Note:

- By default, mmap is used to read the model file. In some cases this leads to a hang at startup. We recommend passing **--no-mmap** to disable mmap() and avoid the issue.


5. Check the device ID in the output

For example:
```
Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
```


## Environment Variable

#### Build

|Name|Value|Function|
|-|-|-|
|LLAMA_SYCL|ON (mandatory)|Enable the SYCL build. Mandatory for both FP32 and FP16.|
|LLAMA_SYCL_F16|ON (optional)|Enable FP16 in the SYCL build; faster for long-prompt inference. Leave unset for FP32.|
|CMAKE_C_COMPILER|icx|Use the icx compiler for the SYCL code path|
|CMAKE_CXX_COMPILER|icpx|Use the icpx compiler for the SYCL code path|

#### Running


|Name|Value|Function|
|-|-|-|
|GGML_SYCL_DEVICE|0 (default) or 1|Set the device ID to use. Check the device IDs in the default startup output.|
|GGML_SYCL_DEBUG|0 (default) or 1|Enable debug logging via the GGML_SYCL_DEBUG macro.|

## Known Issue

- Error: `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.

The oneAPI runtime environment is not enabled.

Fix: install the oneAPI base toolkit and enable the environment with `source /opt/intel/oneapi/setvars.sh`.


- Hang during startup

llama.cpp uses mmap by default to read the model file and copy it to the GPU. On some systems, the memcpy misbehaves and blocks.

Solution: add **--no-mmap**.

## Todo

- Support building on Windows.

- Support multiple GPUs.