Concurrent CPU Integration Tests + Reuse Model Artifacts #655

Merged
merged 1 commit into nod-ai:main on Dec 6, 2024

Conversation

stbaione
Contributor

@stbaione stbaione commented Dec 6, 2024

Description

This PR adds two things:

  1. Reuse model artifacts when multiple tests in the same module request export/compilation of the same artifacts
  2. Add concurrency tests for 2, 4, and 8 requests at the same time.

Reusing model artifacts

Currently, our cpu_llm_server_integration_tests generate new model artifacts for each test, even when tests request the exact same artifacts. This makes the tests take much longer to run than they should, and makes it harder to add more tests without drastically increasing overall test time.

We add a static MODEL_DIR_CACHE, which is just a hashmap that stores { request.params_hash: temporary_dir }. If a test requests the same artifacts as a previous test, we reuse the existing artifacts instead of generating new ones.
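A minimal sketch of the idea, assuming the artifacts are keyed by a hash of the export parameters; the helper name `get_or_create_model_dir` and the hashing details are illustrative, and only `MODEL_DIR_CACHE` itself is named in this PR:

```python
import hashlib
import json
import tempfile

# { params_hash: temporary_dir } -- shared across tests in the module so that
# identical export/compile requests reuse one artifact directory.
MODEL_DIR_CACHE = {}


def get_or_create_model_dir(params: dict) -> str:
    """Return a directory of exported/compiled artifacts for `params`,
    reusing a previously generated directory when the parameter hash matches."""
    params_hash = hashlib.md5(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()

    if params_hash in MODEL_DIR_CACHE:
        return MODEL_DIR_CACHE[params_hash]

    tmp_dir = tempfile.mkdtemp(prefix=f"model_{params_hash[:8]}_")
    # ... export and compile the model artifacts into tmp_dir here ...
    MODEL_DIR_CACHE[params_hash] = tmp_dir
    return tmp_dir
```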

Adding concurrency tests

We recently found a concurrency bug in the Shortfin LLM Server. When sending multiple requests at the same time, we end up with responses that contain incorrect tokens.

This adds basic concurrent integration tests for 2, 4, and 8 requests sent in parallel. Currently they are xfailed, but we will be able to use them to validate our fix when we get there, and to ensure that we don't regress concurrency in the future. We will extend the periodic SGLang integration tests to further exercise concurrency on GPU with more complex prompts, but for a PR-triggered test this should serve as a good guard.
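An illustrative sketch of what such a concurrency test can look like, assuming a Shortfin LLM server already running locally; the server URL, endpoint path, payload shape, and expected completion are placeholders rather than the exact code in this PR:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import pytest
import requests

# Assumed address of the already-running Shortfin LLM server under test.
SERVER_URL = "http://localhost:8000"


def generate(prompt: str) -> str:
    # Illustrative request shape for a /generate-style endpoint.
    resp = requests.post(
        f"{SERVER_URL}/generate",
        json={"text": prompt, "sampling_params": {"max_completion_tokens": 15}},
    )
    resp.raise_for_status()
    return resp.text


@pytest.mark.parametrize("concurrent_requests", [2, 4, 8])
@pytest.mark.xfail(reason="Concurrent requests currently return incorrect tokens")
def test_concurrent_generate(concurrent_requests):
    prompt = "1 2 3 4 5 "
    with ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
        futures = [
            executor.submit(generate, prompt) for _ in range(concurrent_requests)
        ]
        # Every response should contain the expected continuation, even under load.
        for future in as_completed(futures):
            assert "6 7 8 9" in future.result()
```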

…cts,

Add concurrency tests for `2, 4, and 8` concurrent requests for the `trie` and `base` attention caches
@stbaione stbaione requested a review from renxida December 6, 2024 18:00
Contributor

@renxida renxida left a comment


lgtm. once tested feel free to merge!

@stbaione stbaione merged commit 2faadd2 into nod-ai:main Dec 6, 2024
8 checks passed
stbaione added a commit that referenced this pull request Dec 9, 2024
# Description

Previously we were only benchmarking against the `base` caching
algorithm when running our SGLang benchmark tests. This PR allows us to
benchmark against both the `base` and `trie` caching algorithms.

Along with that, it adds the same trick from #655 to be able to reuse
model artifacts instead of generating new ones each time, which
drastically decreases the overall time required to run the benchmark
test.

It also includes a patch needed to run the benchmark script after
syncing the SGLang repo today, and ensures that an error is raised when
one occurs.
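A hedged sketch of what the `base`/`trie` parametrization can look like, assuming the benchmark script is driven from a pytest test as a subprocess; the script name, flag names, and `cached_model_artifacts` fixture are illustrative, and only the `base`/`trie` parametrization, the artifact reuse, and the error propagation come from this commit message:

```python
import subprocess

import pytest


@pytest.mark.parametrize("caching_algorithm", ["base", "trie"])
def test_sglang_benchmark(caching_algorithm, cached_model_artifacts):
    # cached_model_artifacts is assumed to be a fixture backed by the same
    # MODEL_DIR_CACHE trick as #655, so both parametrizations share one
    # export/compile of the model.
    result = subprocess.run(
        [
            "python",
            "benchmark_sglang.py",  # illustrative script name
            "--model-dir",
            str(cached_model_artifacts),
            "--caching-algorithm",
            caching_algorithm,
        ],
        capture_output=True,
        text=True,
    )
    # Propagate benchmark failures instead of letting them pass silently.
    if result.returncode != 0:
        raise RuntimeError(f"SGLang benchmark failed:\n{result.stderr}")
```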
monorimet pushed a commit that referenced this pull request Jan 8, 2025
monorimet pushed a commit that referenced this pull request Jan 8, 2025