High Latency Issue #5

Open
rebemika-amzn opened this issue Jan 6, 2025 · 1 comment

Comments

@rebemika-amzn

Hello, thanks for the repository that nicely integrates multiple KV cache compression (KVCC) methods for benchmarking.

I ran the methods on an NVIDIA L40S GPU but noticed that H2O and StreamingLLM take longer than the baseline during the decoding stage (e.g., H2O is roughly 1.4x slower than the baseline during decoding across different datasets). GPU VRAM usage is lower for StreamingLLM, but higher than the baseline for H2O.

In their papers, H2O and StreamingLLM report improved latency and throughput. Could you provide information on why I might not be getting the same results with the methods integrated here?

Many thanks!

@zirui-ray-liu

For H2O, please take a look at this.

TL;DR: H2O is not directly compatible with FlashAttention. Also, the paper's results are under the offload setting, i.e., offloading the KV cache to CPU DRAM, where the volume of data transferred is the key bottleneck. In this benchmark we consider the online setting (everything is kept in GPU HBM for fast inference).
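To make the incompatibility concrete, here is a minimal sketch (not the repo's actual implementation; the function and parameter names are illustrative) of an H2O-style decode step. The heavy-hitter statistic needs the full attention row over the cache, so `softmax(qK^T)` must be materialized, which is exactly what FlashAttention avoids; the decode path therefore falls back to slower eager attention.

```python
import torch

def h2o_decode_step(q, k_cache, v_cache, acc_scores, budget):
    """Illustrative H2O-style decode step.

    q:          (1, d)  query of the newly generated token
    k_cache:    (t, d)  cached keys
    v_cache:    (t, d)  cached values
    acc_scores: (t,)    accumulated attention mass per cached token
    budget:     int     max number of KV entries to keep
    """
    d = q.shape[-1]
    # The attention row is materialized so acc_scores can be updated --
    # FlashAttention never exposes these probabilities, hence the fallback.
    attn = torch.softmax(q @ k_cache.T / d**0.5, dim=-1)   # (1, t)
    out = attn @ v_cache                                    # (1, d)

    acc_scores = acc_scores + attn.squeeze(0)               # heavy-hitter statistic
    if k_cache.shape[0] > budget:
        # Evict everything except the top-`budget` heavy hitters.
        keep = torch.topk(acc_scores, budget).indices.sort().values
        k_cache, v_cache, acc_scores = k_cache[keep], v_cache[keep], acc_scores[keep]
    return out, k_cache, v_cache, acc_scores
```

The extra score bookkeeping and eviction also keeps additional tensors alive per layer, which is consistent with the higher VRAM you observed for H2O in the online setting.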

For StreamingLLM, it should significantly reduce latency. I have contacted @henryzhongsc to resolve this bug.
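For reference, the expected behavior follows from StreamingLLM's eviction rule: keep a few "attention sink" tokens plus a recent window, so every decode step attends over a bounded cache. A minimal sketch (names are illustrative, not the repo's API):

```python
import torch

def streamingllm_evict(k_cache, v_cache, num_sinks=4, window=1024):
    """Keep the first `num_sinks` tokens (attention sinks) plus the most
    recent `window` tokens; drop everything in between. The retained cache
    is bounded, so decode latency should stay below the full-cache baseline."""
    t = k_cache.shape[0]
    if t <= num_sinks + window:
        return k_cache, v_cache
    keep = torch.cat([
        torch.arange(num_sinks),      # sink tokens at the start
        torch.arange(t - window, t),  # sliding window of recent tokens
    ])
    return k_cache[keep], v_cache[keep]
```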
