Hello, thanks for the repository that nicely integrates multiple KVCC methods to benchmark.
I ran the methods on an NVIDIA L40S GPU but noticed that H2O and StreamingLLM take longer than the baseline during the decoding stage (e.g., H2O is roughly 1.4x slower than the baseline across different datasets). GPU VRAM usage is lower than the baseline for StreamingLLM but higher for H2O.
The H2O and StreamingLLM papers report reduced latency and increased throughput. Could you explain why I might not get the same results with the implementations integrated here?
Many thanks!
TL;DR: H2O is not directly compatible with FlashAttention. Also, the paper evaluates an offload setting, i.e., the KV cache is offloaded to CPU DRAM, where the volume of data transferred is the key bottleneck. This benchmark instead considers an online setting (everything is kept in GPU HBM for fast inference).
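To illustrate why H2O conflicts with FlashAttention: H2O evicts KV entries based on each token's accumulated attention score, but fused FlashAttention kernels never materialize the attention matrix, so those scores are unavailable without falling back to a slower eager attention path. A minimal sketch of the eviction policy (function name and parameters are hypothetical, not the repository's API):

```python
def h2o_keep(cum_scores, recent, budget):
    """Sketch of H2O-style KV retention: keep the last `recent` tokens
    plus the highest cumulative-attention 'heavy hitters' among older
    tokens, up to `budget` entries total.

    `cum_scores[i]` is the attention mass token i has received so far.
    This quantity requires the full attention matrix, which fused
    FlashAttention kernels never materialize -- hence the slowdown.
    """
    n = len(cum_scores)
    recent_idx = list(range(max(0, n - recent), n))
    older = [i for i in range(n) if i not in set(recent_idx)]
    # Rank older tokens by accumulated attention and keep the top ones.
    older.sort(key=lambda i: cum_scores[i], reverse=True)
    heavy = sorted(older[: max(0, budget - len(recent_idx))])
    return heavy + recent_idx
```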
For StreamingLLM, it should significantly reduce latency. I have contacted @henryzhongsc to resolve this bug.
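For context, StreamingLLM's retention rule is cheap precisely because it needs no attention scores at all: it keeps a few initial "attention sink" tokens plus a sliding window of recent tokens. A minimal sketch under that assumption (function name and defaults are illustrative, not the repository's API):

```python
def streamingllm_keep(cache_len, n_sink=4, window=1020):
    """Sketch of StreamingLLM's KV retention: keep the first `n_sink`
    'attention sink' tokens plus the most recent `window` tokens;
    everything in between is evicted. No attention scores are needed,
    so this composes with FlashAttention and should reduce latency."""
    if cache_len <= n_sink + window:
        return list(range(cache_len))  # cache still fits; keep all
    sinks = list(range(n_sink))
    recent = list(range(cache_len - window, cache_len))
    return sinks + recent
```

Because the kept set depends only on positions, the cache size is bounded by `n_sink + window` regardless of sequence length.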