This repository provides a starting point for using the vLLM library on the cluster. vLLM supports both offline and online usage:
- Offline Mode: Load the model once and process your data locally.
- Online Mode: Start a service to generate text via an endpoint, similar to OpenAI's GPT API.
This guide focuses on offline usage, as the online service may block cluster resources.
conda create -n vLLM-Starter python=3.11 -y
conda activate vLLM-Starter
pip install vllm
It is recommended to store LLM models in a central location, as these large files can be shared across different users. By default, we use /ds/models/llms/cache as the storage location, where several LLMs are already stored. To set this directory as the default cache location, you need to configure the environment variable HF_HUB_CACHE. In the examples below, we set the cache location directly in the srun command.
Please replace {script.py} with your script.
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
--time=1-00:00:00 \
python {script.py}
A simple example demonstrating how to use vLLM in offline mode can be found in the file offline_simpleInference.py. This script loads a model and generates text based on a prompt.
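If you just want to see the core API before opening the file, such a script boils down to roughly the following sketch (the model name is only an example and may differ from what offline_simpleInference.py uses):

```python
from vllm import LLM, SamplingParams

# Example prompts and sampling settings; adjust to your needs.
prompts = ["The capital of France is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# The model is downloaded to (or loaded from) the HF_HUB_CACHE directory.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```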
Example
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python offline_simpleInference.py
A slightly more complex example, which loads a model and generates text from a chat-style prompt, can be found in the file offline_chatstyle.py.
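The core of such a script is vLLM's chat interface, which applies the model's chat template for you. A minimal sketch (the model name is an assumption, not necessarily what offline_chatstyle.py uses):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

# Chat-style prompt: a list of role/content messages.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between offline and online inference."},
]

# llm.chat() applies the model's chat template before generating.
outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)
```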
Example
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python offline_chatstyle.py
The GuidedDecodingParams class in vLLM allows you to define the output structure for tasks that require a predefined format, such as Named Entity Recognition (NER).
You can use various methods to guide the decoding process, including regular expressions, JSON objects, grammar, or simple binary choices.
The example can be found in the file offline_structuredOutput.py.
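As a rough illustration of how guided decoding is wired up (a sketch with an assumed model; see offline_structuredOutput.py for the full example):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

# Constrain the output to one of two labels (a simple binary choice).
# GuidedDecodingParams also accepts regex=, json=, or grammar= arguments.
guided = GuidedDecodingParams(choice=["positive", "negative"])
params = SamplingParams(guided_decoding=guided, temperature=0, max_tokens=5)

outputs = llm.generate(["Classify the sentiment: 'The movie was fantastic.' Answer:"], params)
print(outputs[0].outputs[0].text)
```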
Example
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python offline_structuredOutput.py
vLLM can also be applied to vision tasks, such as generating captions for images.
When using vision LLMs, you have to use the model-specific prompt template and provide the appropriate stop_token_ids. Please check the official vLLM GitHub repository for the prompt template and stop_token_ids of your model.
The example can be found in the file offline_visionExample.py, which loads the image in data/example.jpg and generates a caption for it.
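Conceptually, the script passes the image alongside the model-specific prompt template, roughly as in this sketch (the model name and template shown are for LLaVA-1.5 and are only an example, not necessarily what offline_visionExample.py uses):

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("data/example.jpg")

# Prompt template for LLaVA-1.5; other vision models need their own template
# and possibly explicit stop_token_ids in SamplingParams.
prompt = "USER: <image>\nDescribe this image in one sentence.\nASSISTANT:"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```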
Example
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python offline_visionExample.py
As we saw in the previous example, vLLM requires the correct prompt template and stop_token_ids for vision tasks. In the original examples from the vLLM repository, the code loads the LLM for each question; I have modified the code to load the LLM only once and then reuse it for all questions.
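The pattern is simply to construct the LLM once and batch all questions through it; a sketch of the idea (the checkpoint and prompt template are for LLaVA-NeXT with a Mistral backbone and are only an assumption):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Load the model a single time ...
llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf")
image = Image.open("data/example.jpg")

questions = ["What is shown in the image?", "Which colours dominate the picture?"]
# ... and reuse it for all questions (the prompt template is model-specific).
inputs = [
    {"prompt": f"[INST] <image>\n{q} [/INST]", "multi_modal_data": {"image": image}}
    for q in questions
]

outputs = llm.generate(inputs, SamplingParams(max_tokens=128))
for question, output in zip(questions, outputs):
    print(question, "->", output.outputs[0].text)
```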
Example
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python offline_visionImproved.py --model=LLAVANext
LLMs are known for their substantial size, which makes them highly memory-intensive. Quantization is a technique designed to reduce a model's memory footprint. Quantized models can be loaded just like their standard counterparts in vLLM. In cases where a quantized version of your model isn’t readily available, you can perform the quantization yourself. Beyond the general concept, there are various methods and tools available for quantizing models. If you are interested in model quantization for vLLM, we refer to the vLLM documentation. As I currently lack experience with quantization, I cannot provide insights into best practices.
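For example, loading a pre-quantized AWQ checkpoint only requires pointing vLLM at it (the model name below is just a publicly available example, not one bundled with this repository):

```python
from vllm import LLM

# Pre-quantized AWQ checkpoint; the quantization method is passed explicitly.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")

outputs = llm.generate(["Quantization reduces memory usage because"])
print(outputs[0].outputs[0].text)
```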
AWQ quantization
You can find an example for [AWQ quantization](https://arxiv.org/abs/2306.00978) in the file `quantisation_quantizeModel.py`. AWQ quantization uses calibration instances, so you should execute it with GPU resources.

pip install autoawq
srun --partition=RTXA6000-SLT \
--job-name=quantisation \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--gpus-per-task=1 \
--cpus-per-task=3 \
--mem=50G \
--time=1-00:00:00 \
python quantisation_quantizeModel.py
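The quantization step itself roughly follows the AutoAWQ workflow; a minimal sketch (model and output paths are placeholders, and quantisation_quantizeModel.py may differ in its details):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder source model
quant_path = "mistral-7b-instruct-v0.2-awq"         # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Runs calibration on a small dataset, which is why GPU resources are needed.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```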
To be added.
pip install unsloth
To be added.
You can also use vLLM in online/interactive mode. As previously mentioned, this mode is not recommended for production use, but it is useful for testing and debugging. Please make sure to shut down the service once you are done with it, as it consumes allocated resources even when idle. This mode starts a service on the cluster, which you can access via a REST interface. It is similar to the tutorial from perseus-textgen, but in my personal experience less brittle.
Steps:
- Start the service.
- Retrieve the node name using `squeue -u $USER`.
- Access the service documentation at http://$NODE.kl.dfki.de:8000/docs.
vllm serve Qwen/Qwen2.5-1.5B-Instruct
Please set `--download-dir` accordingly.
srun --partition=RTXA6000-SLT \
--job-name=vllm_serve \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
--time=1-00:00:00 \
vllm serve "Qwen/Qwen2.5-1.5B-Instruct" \
--download-dir=/ds/models/llms/cache \
--port=8000
Call this on the head node to get the list of your running jobs:
squeue -u $USER
Then you can access the API documentation at http://$NODE.kl.dfki.de:8000/docs.
In the following commands, replace $NODE with the node name reported by squeue.
curl http://${NODE}.kl.dfki.de:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
If you want to access the service from your local machine, you can forward the port using SSH.
ssh -L 5001:<$NODE>:8000 <username>@<loginnode>
Then you can access the service on your local machine at http://localhost:5001.
curl http://localhost:5001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
For generating completions from Python, check the example in online/remoteGeneration.py.
For chat-style requests, check the example in online/remoteChat.py.
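A typical way to talk to the endpoint from Python is the OpenAI client; a minimal sketch assuming the port forwarding from above (this reflects the general approach, not necessarily the exact code in those scripts):

```python
from openai import OpenAI

# Point the client at the forwarded port (or directly at http://$NODE.kl.dfki.de:8000/v1).
client = OpenAI(base_url="http://localhost:5001/v1", api_key="EMPTY")

# Completion-style request (compare online/remoteGeneration.py).
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)

# Chat-style request (compare online/remoteChat.py).
chat = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke."}],
)
print(chat.choices[0].message.content)
```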