Commit 4b5c9b7

add deft
1 parent 0534689 commit 4b5c9b7

4 files changed: +35 -0 lines changed

@@ -0,0 +1,19 @@
<div align="center">
<img src="deft.jpeg" alt="logo" width="200"></img>
</div>

--------------------------------------------------------------------------------

## TL;DR

We propose DeFT, an IO-aware attention algorithm that makes tree-structured interactions with LLMs efficient by optimizing QKV grouping and attention calculation.

## Abstract

Large language models (LLMs) are increasingly employed for complex tasks that process multiple generation calls in a tree structure with shared prefixes of tokens, such as few-shot prompting, multi-step reasoning, and speculative decoding. However, existing inference systems for tree-based applications are inefficient due to improper partitioning of queries and KV cache during attention calculation. This leads to two main issues: (1) a lack of memory access (IO) reuse for the KV cache of shared prefixes, and (2) poor load balancing. As a result, there is redundant KV cache IO between GPU global memory and shared memory, along with low GPU utilization. To address these challenges, we propose DeFT (Decoding with Flash Tree-Attention), a hardware-efficient attention algorithm with prefix-aware and load-balanced KV cache partitions. DeFT reduces the number of read/write operations on the KV cache during attention calculation through KV-Guided Grouping, a method that avoids repeatedly loading the KV cache of shared prefixes in attention computation. Additionally, we propose Flattened Tree KV Splitting, a mechanism that ensures an even distribution of the KV cache across partitions with little computation redundancy, enhancing GPU utilization during attention computations. By reducing 73-99% of KV cache IO and nearly 100% of the IO for partial results during attention calculation, DeFT achieves up to 2.23x/3.59x speedup in end-to-end/attention latency across three practical tree-based workloads compared to state-of-the-art attention algorithms.
## DeFT Overview
<div align="center">
<img src="DeFT_overview.jpg" alt="overview" width="95%"></img>
</div>
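
Purely as an illustration of the KV-Guided Grouping and partial-result reduction described in the abstract above: the following is a minimal NumPy sketch, not the repository's GPU kernel implementation. The toy tree, tensor shapes, and helper names are assumptions made for this example only.

```python
# Minimal sketch (assumptions throughout): one query per leaf, one KV segment
# per tree node, NumPy arrays standing in for GPU memory tiles.
import numpy as np

d = 64  # head dimension (assumed)

# A toy decoding tree: each node owns a segment of the KV cache, and each
# leaf carries one decoding query whose context is its root-to-leaf path.
tree = {
    "root":    {"parent": None,   "kv_len": 128},  # shared prefix
    "branchA": {"parent": "root", "kv_len": 16},
    "branchB": {"parent": "root", "kv_len": 16},
}
leaves = ["branchA", "branchB"]

rng = np.random.default_rng(0)
kv = {n: (rng.standard_normal((v["kv_len"], d)),
          rng.standard_normal((v["kv_len"], d))) for n, v in tree.items()}
queries = {leaf: rng.standard_normal(d) for leaf in leaves}

def path_to_root(node):
    """Yield a node and all of its ancestors."""
    while node is not None:
        yield node
        node = tree[node]["parent"]

# KV-Guided Grouping: one group per KV segment, holding every query whose
# path passes through that segment. The shared "root" KV is then touched
# once for both queries instead of once per query.
groups = {n: [leaf for leaf in leaves if n in set(path_to_root(leaf))]
          for n in tree}

# Partial attention per (segment, query) pair, kept as an unnormalized
# numerator plus running max and sum, so segments can be merged afterwards
# in the spirit of flash attention's online softmax.
partials = {leaf: [] for leaf in leaves}
for node, members in groups.items():
    K, V = kv[node]
    for leaf in members:
        s = K @ queries[leaf] / np.sqrt(d)   # attention scores for this segment
        m = s.max()
        w = np.exp(s - m)
        partials[leaf].append((w @ V, m, w.sum()))

# Global reduction: merge per-segment partial results into the final output,
# equivalent to softmax attention over each query's full root-to-leaf context.
outputs = {}
for leaf, parts in partials.items():
    m_glob = max(m for _, m, _ in parts)
    numer = sum(num * np.exp(m - m_glob) for num, m, _ in parts)
    denom = sum(z * np.exp(m - m_glob) for _, m, z in parts)
    outputs[leaf] = numer / denom
```

Each query still attends to its full root-to-leaf context, but the shared prefix's KV segment appears in only one group, which is what removes the redundant prefix IO. DeFT additionally splits these groups into evenly sized partitions (Flattened Tree KV Splitting) for GPU load balance, which this sketch omits.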

src/config/Publications.jsx (+16 lines)
@@ -16,6 +16,22 @@ const publications = [
     },
     tags: ["LLM", "Agent"],
   },
+  {
+    key: "yao2024deft",
+    title: "DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference",
+    authors: "Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin",
+    year: "2025",
+    links: {
+      paper: "https://arxiv.org/abs/2404.00242",
+      code: "https://github.com/LINs-lab/DeFT",
+
+    },
+    files: {
+      markdown: require("../assets/publications/yao2024deft/deft.md"),
+    },
+    venue: "ICLR 2025",
+    tags: ["LLM", "Inference", "Efficiency"],
+  },
   {
     key: "feng2024graphrouter",
     title: "Graphrouter: A graph-based router for llm selections",
