Since Intel has so far abandoned ipex-llm and the Arc cards...
vLLM v0.11.1rc2.dev221+g49c00fe30 works with 4x A770.
You can build a Docker image from the vLLM repository sources (docker/Dockerfile.xpu):
https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.xpu
docker build -f docker/Dockerfile.xpu -t vllm-xpu-0110 --shm-size=32g .
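For reference, a minimal sketch of how the resulting image could be launched; this is not a verified command. The /dev/dri passthrough is the usual way to expose Intel GPUs to a container, and the /llm/models mount path is an assumption taken from the model path used in the bench command below:

# Sketch: run the XPU image built above with the four A770s visible.
# Adjust the mount, port, and entrypoint to match your setup.
docker run -it --rm \
--device /dev/dri \
--shm-size=32g \
-v /llm/models:/llm/models \
-p 8000:8000 \
vllm-xpu-0110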
However, I do not know how to configure it properly for 4x A770, and I am sure the performance could be higher (2 req/s -> 10+ req/s).
Llama 3.1 8B Instruct FP8
Sometimes the request processing speed reaches 12 req/s, but the process periodically "hangs" and then speeds up again; I have not figured out the reason yet.
Configuration for 1024 tokens in, 512 tokens out:
--max-model-len "2000"
--max-num-batched-tokens "3000"
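For context, a serve invocation using these flags might look like the sketch below. The issue does not show the actual launch command, so --tensor-parallel-size 4 (splitting the model across the four A770s) is my assumption; the model path matches the bench command that follows:

# Sketch, not the author's actual launch command: serve the FP8
# Llama 3.1 8B checkpoint across all four cards via tensor parallelism.
vllm serve /llm/models/LLM-Research/Meta-Llama-3.1-8B-Instruct \
--served-model-name Meta-Llama-3.1-8B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 2000 \
--max-num-batched-tokens 3000 \
--trust-remote-code \
--port 8000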
Test:
vllm bench serve \
--model /llm/models/LLM-Research/Meta-Llama-3.1-8B-Instruct \
--served-model-name Meta-Llama-3.1-8B-Instruct \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 512 \
--ignore-eos \
--num-prompts 1500 \
--trust-remote-code \
--request-rate inf \
--backend vllm \
--port 8000
Ubuntu 25.10, kernel 6.17.3
My numbers for 4x A770 and 2x Xeon E5-2699 v3:
115 requests
1500 requests
