Does vLLM performance change when multiple concurrent queries share large overlapping sections, compared to when they do not? This is checked here with some experiments.
The first test asks the LLM to detect a random number in a list, comparing the case where every query uses the same list of numbers against the case where each query uses a different list. The inputs can be generated automatically, making this easy to test.
This is an extreme example of prefix caching producing performance benefits: a long input query that, for the "same" case, is mostly fixed between queries, paired with a short output. Most practical workloads will see much smaller improvements than this. Still, it is useful for demonstrating that the performance increase from prefix caching is real.
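The input-generation step can be sketched as follows. This is a hypothetical stand-in for generate_detect_num_list.py (the prompt wording and JSON layout are assumptions, not the repository's actual format); the key point is that in the "same" case every prompt begins with one long shared list, giving vLLM's prefix cache a large common prefix to reuse.

```python
import random

def make_query(numbers, target):
    """Build one 'detect number in list' prompt.

    The list comes first so that, when it is shared, all prompts
    begin with a long identical prefix that vLLM can cache.
    """
    prompt = (
        "Here is a list of numbers: "
        + ", ".join(str(n) for n in numbers)
        + f"\nIs the number {target} present in the list above? Answer yes or no."
    )
    return {"prompt": prompt}

def generate_queries(n_queries, list_len=100, same_list=True, seed=0):
    """Generate queries that either share one list (cache-friendly)
    or each use a fresh random list (cache-unfriendly)."""
    rng = random.Random(seed)
    shared = [rng.randint(0, 10**6) for _ in range(list_len)]
    queries = []
    for _ in range(n_queries):
        if same_list:
            numbers = shared
        else:
            numbers = [rng.randint(0, 10**6) for _ in range(list_len)]
        target = rng.randint(0, 10**6)
        queries.append(make_query(numbers, target))
    return queries
```

With same_list=True all prompts differ only in their final question line, so nearly the entire prefill can be served from the prefix cache.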
This test was done on a RunPod server with Nvidia A100 80 GB VRAM GPU.
The 4-bit AWQ quantized Llama 3.3 70B LLM was used for this test.
Set up the environment:
apt update
apt install parallel
python -m venv vllm-env
source vllm-env/bin/activate
python -m pip install -U pip setuptools wheel
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install vllm openai fire
git clone https://github.com/StanHatko/benchmark_llm_overlap_queries
cd benchmark_llm_overlap_queries
Start vLLM server:
vllm serve lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \
--host 127.0.0.1 \
--port 8000 \
--max-model-len 16000 \
--enable-prefix-caching \
--enable-chunked-prefill
Initial test of LLM:
./generate_detect_num_list.py /tmp/llm_test_basic 10 0
./send_local_llm_query.py /tmp/llm_test_basic_000.json
ls -1 /tmp/llm_test_basic_*.json
ls -1 /tmp/llm_test_basic_*.json | parallel -j 10 ./send_local_llm_query.py
time ( ls -1 /tmp/llm_test_basic_*.json | parallel -j 10 ./send_local_llm_query.py )
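The query-sending step can be sketched in Python using only the standard library against vLLM's OpenAI-compatible endpoint. This is a hypothetical stand-in for send_local_llm_query.py: the JSON file layout (a "prompt" key), the model name, and the generation parameters are assumptions, not the repository's actual format.

```python
import json
import sys
import urllib.request

def build_request(prompt, model, max_tokens=50):
    """Assemble a chat-completions request body for the local server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }

def send_query(path, base_url="http://127.0.0.1:8000/v1",
               model="lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit"):
    """Read a JSON query file and POST it to the local vLLM server."""
    with open(path) as f:
        query = json.load(f)
    body = json.dumps(build_request(query["prompt"], model)).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        out = json.load(resp)
    return out["choices"][0]["message"]["content"]

if __name__ == "__main__" and len(sys.argv) > 1:
    print(send_query(sys.argv[1]))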
Check the time per run over 100 runs, with 100 entries generated per run, each entry using a different list, with 50 threads:
./test_llm_detect_num_list_diff.sh >~/test_diff_log.txt 2>~/test_diff_time.txt
cat ~/test_diff_time.txt | grep real | perl -pe 's/.*0m//' | perl -pe 's/s$//'
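The perl pipeline above strips the "real 0m" prefix and the trailing "s", which assumes every run finishes in under a minute. A small Python helper (hypothetical, not part of the repository) can parse the full minutes-and-seconds format and compute the summary statistics directly:

```python
import re
import statistics

def parse_times(text):
    """Extract wall-clock seconds from bash `time` output lines
    of the form 'real<TAB>1m2.345s'."""
    times = []
    for line in text.splitlines():
        m = re.match(r"real\s+(\d+)m([\d.]+)s", line.strip())
        if m:
            times.append(60 * int(m.group(1)) + float(m.group(2)))
    return times

def summarize(times):
    """Mean and sample standard deviation of the run times."""
    return {"mean": statistics.mean(times),
            "stdev": statistics.stdev(times)}
```

For example, summarize(parse_times(open("test_diff_time.txt").read())) gives the per-run mean and standard deviation in one step.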
The time results are in the file time-taken-test2-diff.txt.
The time in seconds has mean
Check the time per run over 100 runs, with 100 entries generated per run, all entries within a run using the same list, with 50 threads:
./test_llm_detect_num_list_same.sh >~/test_same_log.txt 2>~/test_same_time.txt
cat ~/test_same_time.txt | grep real | perl -pe 's/.*0m//' | perl -pe 's/s$//'
The time results are in the file time-taken-test2-same.txt.
The time in seconds has mean
The performance benefit of the "same" case is far larger with this bigger model than with the smaller model below. This makes sense, as a big model gains more from caching large parts of the query than a small model does.
This test was done on a Lambda Labs gpu_1x_a100_sxm4 server in the Virginia, USA region.
The 4-bit GPTQ quantized Llama 3.1 8B LLM was used for this test.
The following steps were done in a conda environment with Python 3.12. Without the conda environment the vLLM server didn't work properly (errors about undefined symbols occurred).
Set up the environment:
sudo apt install parallel
pip install vllm openai fire
pip install --upgrade jinja2
git clone https://github.com/StanHatko/benchmark_llm_overlap_queries
cd benchmark_llm_overlap_queries
Start vLLM server:
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --host 127.0.0.1 --port 8000 --enable-prefix-caching
Initial test of LLM:
./generate_detect_num_list.py /tmp/llm_test_basic 10 0
./send_local_llm_query.py /tmp/llm_test_basic_000.json
ls -1 /tmp/llm_test_basic_*.json
ls -1 /tmp/llm_test_basic_*.json | parallel -j 10 ./send_local_llm_query.py
time ( ls -1 /tmp/llm_test_basic_*.json | parallel -j 10 ./send_local_llm_query.py )
Check the time per run over 100 runs, with 100 entries generated per run, each entry using a different list, with 50 threads:
./test_llm_detect_num_list_diff.sh >~/test_diff_log.txt 2>~/test_diff_time.txt
cat ~/test_diff_time.txt | grep real | perl -pe 's/.*0m//' | perl -pe 's/s$//'
The time results are in the file time-taken-test1-diff.txt.
The time in seconds has mean
Check the time per run over 100 runs, with 100 entries generated per run, all entries within a run using the same list, with 50 threads:
./test_llm_detect_num_list_same.sh >~/test_same_log.txt 2>~/test_same_time.txt
cat ~/test_same_time.txt | grep real | perl -pe 's/.*0m//' | perl -pe 's/s$//'
The time results are in the file time-taken-test1-same.txt.
The time in seconds has mean