Concurrent CPU Integration Tests + Reuse Model Artifacts #655

Merged
merged 1 commit into nod-ai:main on Dec 6, 2024

Conversation

stbaione
Contributor

@stbaione stbaione commented Dec 6, 2024

Description

This PR adds two things:

  1. Reuse model artifacts when multiple tests in the same module request export/compilation of the same artifacts
  2. Add concurrency tests for 2, 4, and 8 requests at the same time.

Reusing model artifacts

Currently, our cpu_llm_server_integration_tests generate new model artifacts for each test, even when tests request the exact same artifacts. This makes the tests take much longer to run than they should, and makes it harder to add more tests without drastically increasing overall test time.

We add a static MODEL_DIR_CACHE, which is just a hashmap that stores { request.params_hash: temporary_dir }. If a test requests the same artifacts as a previous test, we reuse the existing artifacts instead of generating new ones.
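A minimal sketch of the idea, assuming the artifacts are keyed by a hash of the export parameters; the helper name `get_or_create_model_dir` and the hashing details are illustrative, and only `MODEL_DIR_CACHE` itself is named in this PR:

```python
import hashlib
import json
import tempfile

# { params_hash: temporary_dir } -- shared across tests in the module so that
# identical export/compile requests reuse one artifact directory.
MODEL_DIR_CACHE = {}


def get_or_create_model_dir(params: dict) -> str:
    """Return a directory of exported/compiled artifacts for `params`,
    reusing a previously generated directory when the parameter hash matches."""
    params_hash = hashlib.md5(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()

    if params_hash in MODEL_DIR_CACHE:
        return MODEL_DIR_CACHE[params_hash]

    tmp_dir = tempfile.mkdtemp(prefix=f"model_{params_hash[:8]}_")
    # ... export and compile the model artifacts into tmp_dir here ...
    MODEL_DIR_CACHE[params_hash] = tmp_dir
    return tmp_dir
```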

Adding concurrency tests

We recently found a concurrency bug in the Shortfin LLM Server. When sending multiple requests at the same time, we end up with responses that contain incorrect tokens.

This adds basic concurrent integration tests for 2, 4, and 8 requests sent in parallel. Currently they are xfailed, but we will be able to use them to validate our fix when we get there, and to ensure that we don't regress concurrency in the future. We will extend the periodic SGLang integration tests to further exercise concurrency on GPU with more complex prompts, but for a PR-triggered test this should serve as a good guard.
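An illustrative sketch of what such a concurrency test can look like, assuming a Shortfin LLM server already running locally; the server URL, endpoint path, payload shape, and expected completion are placeholders rather than the exact code in this PR:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import pytest
import requests

# Assumed address of the already-running Shortfin LLM server under test.
SERVER_URL = "http://localhost:8000"


def generate(prompt: str) -> str:
    # Illustrative request shape for a /generate-style endpoint.
    resp = requests.post(
        f"{SERVER_URL}/generate",
        json={"text": prompt, "sampling_params": {"max_completion_tokens": 15}},
    )
    resp.raise_for_status()
    return resp.text


@pytest.mark.parametrize("concurrent_requests", [2, 4, 8])
@pytest.mark.xfail(reason="Concurrent requests currently return incorrect tokens")
def test_concurrent_generate(concurrent_requests):
    prompt = "1 2 3 4 5 "
    with ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
        futures = [
            executor.submit(generate, prompt) for _ in range(concurrent_requests)
        ]
        # Every response should contain the expected continuation, even under load.
        for future in as_completed(futures):
            assert "6 7 8 9" in future.result()
```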

…cts,

Add concurrency tests for `2, 4, and 8` concurrent requests for the `trie` and `base` attention caches
@stbaione stbaione requested a review from renxida December 6, 2024 18:00
Contributor

@renxida renxida left a comment


lgtm. once tested feel free to merge!

@stbaione stbaione merged commit 2faadd2 into nod-ai:main Dec 6, 2024
8 checks passed
stbaione added a commit that referenced this pull request Dec 9, 2024
# Description

Previously we were only benchmarking against the `base` caching
algorithm when running our SGLang benchmark tests. This PR allows us to
benchmark against both the `base` and `trie` caching algorithms.

Along with that, it adds the same trick from #655 to be able to reuse
model artifacts instead of generating new ones each time, which
drastically decreases the overall time required to run the benchmark
test.

It also includes a patch needed to run the benchmark script after
syncing the SGLang repo today, and ensures that an error is raised when
one occurs.
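A hedged sketch of what the `base`/`trie` parametrization can look like, assuming the benchmark script is driven from a pytest test as a subprocess; the script name, flag names, and `cached_model_artifacts` fixture are illustrative, and only the `base`/`trie` parametrization, the artifact reuse, and the error propagation come from this commit message:

```python
import subprocess

import pytest


@pytest.mark.parametrize("caching_algorithm", ["base", "trie"])
def test_sglang_benchmark(caching_algorithm, cached_model_artifacts):
    # cached_model_artifacts is assumed to be a fixture backed by the same
    # MODEL_DIR_CACHE trick as #655, so both parametrizations share one
    # export/compile of the model.
    result = subprocess.run(
        [
            "python",
            "benchmark_sglang.py",  # illustrative script name
            "--model-dir",
            str(cached_model_artifacts),
            "--caching-algorithm",
            caching_algorithm,
        ],
        capture_output=True,
        text=True,
    )
    # Propagate benchmark failures instead of letting them pass silently.
    if result.returncode != 0:
        raise RuntimeError(f"SGLang benchmark failed:\n{result.stderr}")
```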
monorimet pushed a commit that referenced this pull request Jan 8, 2025
monorimet pushed a commit that referenced this pull request Jan 8, 2025