High Latency Issue #5

Open
rebemika-amzn opened this issue Jan 6, 2025 · 1 comment

Comments

@rebemika-amzn

Hello, thanks for the repository that nicely integrates multiple KV cache compression (KVCC) methods for benchmarking.

I ran the methods on an NVIDIA L40S GPU but noticed that H2O and StreamingLLM take longer than the baseline during the decoding stage (e.g., H2O is roughly 1.4x slower than the baseline during decoding across different datasets). GPU VRAM usage is lower for StreamingLLM, but higher than the baseline for H2O.

In their papers, H2O and StreamingLLM report improved latency and throughput. Could you provide information on why I might not be getting the same results with the methods integrated here?

Many thanks!

@zirui-ray-liu

For H2O, please take a look at this.

TL;DR: H2O is not directly compatible with FlashAttention. Also, the paper's results are under the offload setting, i.e., offloading the KV cache to CPU DRAM, where the volume of data transferred is the key bottleneck. In this benchmark we consider the online setting (everything is kept in GPU HBM for fast inference).
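To make the incompatibility concrete, here is a minimal sketch (not the repo's actual implementation; the function and parameter names are illustrative) of an H2O-style decode step. The heavy-hitter statistic needs the full attention row over the cache, so `softmax(qK^T)` must be materialized, which is exactly what FlashAttention avoids; the decode path therefore falls back to slower eager attention.

```python
import torch

def h2o_decode_step(q, k_cache, v_cache, acc_scores, budget):
    """Illustrative H2O-style decode step.

    q:          (1, d)  query of the newly generated token
    k_cache:    (t, d)  cached keys
    v_cache:    (t, d)  cached values
    acc_scores: (t,)    accumulated attention mass per cached token
    budget:     int     max number of KV entries to keep
    """
    d = q.shape[-1]
    # The attention row is materialized so acc_scores can be updated --
    # FlashAttention never exposes these probabilities, hence the fallback.
    attn = torch.softmax(q @ k_cache.T / d**0.5, dim=-1)   # (1, t)
    out = attn @ v_cache                                    # (1, d)

    acc_scores = acc_scores + attn.squeeze(0)               # heavy-hitter statistic
    if k_cache.shape[0] > budget:
        # Evict everything except the top-`budget` heavy hitters.
        keep = torch.topk(acc_scores, budget).indices.sort().values
        k_cache, v_cache, acc_scores = k_cache[keep], v_cache[keep], acc_scores[keep]
    return out, k_cache, v_cache, acc_scores
```

The extra score bookkeeping and eviction also keeps additional tensors alive per layer, which is consistent with the higher VRAM you observed for H2O in the online setting.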

For StreamingLLM, it should significantly reduce latency. I have contacted @henryzhongsc to resolve this bug.
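For reference, the expected behavior follows from StreamingLLM's eviction rule: keep a few "attention sink" tokens plus a recent window, so every decode step attends over a bounded cache. A minimal sketch (names are illustrative, not the repo's API):

```python
import torch

def streamingllm_evict(k_cache, v_cache, num_sinks=4, window=1024):
    """Keep the first `num_sinks` tokens (attention sinks) plus the most
    recent `window` tokens; drop everything in between. The retained cache
    is bounded, so decode latency should stay below the full-cache baseline."""
    t = k_cache.shape[0]
    if t <= num_sinks + window:
        return k_cache, v_cache
    keep = torch.cat([
        torch.arange(num_sinks),      # sink tokens at the start
        torch.arange(t - window, t),  # sliding window of recent tokens
    ])
    return k_cache[keep], v_cache[keep]
```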
