ulab-uiuc
diff --git a/src/assets/publications/yao2024deft/DeFT_overview.jpg b/src/assets/publications/yao2024deft/DeFT_overview.jpg
625 KB
diff --git a/src/assets/publications/yao2024deft/deft.jpeg b/src/assets/publications/yao2024deft/deft.jpeg
310 KB
diff --git a/src/assets/publications/yao2024deft/deft.md b/src/assets/publications/yao2024deft/deft.md
Lines changed: 19 additions & 0 deletions
diff --git a/src/config/Publications.jsx b/src/config/Publications.jsx
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,19 @@
+
+<div align="center">
+<img src="deft.jpeg" alt="logo" width="200"></img>
+</div>
+
+--------------------------------------------------------------------------------
+
+## TL;DR
+We propose DeFT, an IO-aware attention algorithm for efficient tree-structured interactions with LLMs by optimizing QKV grouping and attention calculation.
+
+
+
+## Abstract
+Large language models (LLMs) are increasingly employed for complex tasks that process multiple generation calls in a tree structure with shared prefixes of tokens, including few-shot prompting, multi-step reasoning, speculative decoding, etc. However, existing inference systems for tree-based applications are inefficient due to improper partitioning of queries and KV cache during attention calculation. This leads to two main issues: (1) a lack of memory access (IO) reuse for KV cache of shared prefixes, and (2) poor load balancing. As a result, there is redundant KV cache IO between GPU global memory and shared memory, along with low GPU utilization. To address these challenges, we propose DeFT (Decoding with Flash Tree-Attention), a hardware-efficient attention algorithm with prefix-aware and load-balanced KV cache partitions. DeFT reduces the number of read/write operations of KV cache during attention calculation through KV-Guided Grouping, a method that avoids repeatedly loading KV cache of shared prefixes in attention computation. Additionally, we propose Flattened Tree KV Splitting, a mechanism that ensures even distribution of the KV cache across partitions with little computation redundancy, enhancing GPU utilization during attention computations. By reducing KV cache IO by 73-99% and IO for partial results by nearly 100% during attention calculation, DeFT achieves up to 2.23x/3.59x speedup in end-to-end/attention latency across three practical tree-based workloads compared to state-of-the-art attention algorithms.
+
+## DeFT Overview
+<div align="center">
+<img src="DeFT_overview.jpg" alt="overview" width="95%"></img>
+</div>
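
Note on the abstract above: the IO argument behind KV-Guided Grouping can be illustrated with a toy count. The sketch below is not the DeFT kernel and is not part of this change. It assumes a hypothetical decoding tree (`tree`, with made-up `kvTokens` values) and compares an assumed per-query baseline, where every leaf query reloads the KV cache of its whole root-to-leaf path, against loading each node's KV cache once for all queries beneath it. It counts tokens read, not real GPU memory traffic, and does not model Flattened Tree KV Splitting or load balancing.

```javascript
// Toy IO model: counts KV-cache tokens read, not real GPU memory traffic.
// The tree shape and token counts are hypothetical, chosen for illustration.
const tree = {
  kvTokens: 1000, // shared prompt prefix
  children: [
    { kvTokens: 20, children: [] },
    { kvTokens: 30, children: [] },
    {
      kvTokens: 10,
      children: [
        { kvTokens: 5, children: [] },
        { kvTokens: 5, children: [] },
      ],
    },
  ],
};

// Assumed baseline: each leaf query reloads the KV cache of its
// entire root-to-leaf path, so shared prefixes are read repeatedly.
function perQueryLoads(node, prefixTokens = 0) {
  const pathTokens = prefixTokens + node.kvTokens;
  if (node.children.length === 0) return pathTokens;
  return node.children.reduce(
    (sum, child) => sum + perQueryLoads(child, pathTokens),
    0
  );
}

// KV-guided grouping, in the sense described in the abstract: each node's
// KV cache is read once and shared by all queries in its subtree.
function kvGuidedLoads(node) {
  return (
    node.kvTokens +
    node.children.reduce((sum, child) => sum + kvGuidedLoads(child), 0)
  );
}

const baseline = perQueryLoads(tree); // 4 leaves, each re-reading the prefix
const grouped = kvGuidedLoads(tree); // every node read exactly once
const reduction = 1 - grouped / baseline;

console.log({ baseline, grouped, reduction: reduction.toFixed(3) });
```

In this toy tree the baseline reads 4080 tokens and the grouped variant reads 1070. The saving grows with the length of the shared prefix and the number of leaves, which is the intuition behind the IO reduction the abstract reports; the numbers here are illustrative only and say nothing about measured performance.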
@@ -16,6 +16,22 @@ const publications = [
     },
     tags: ["LLM", "Agent"],
   },
+  {
+    key: "yao2024deft",
+    title: "DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference",
+    authors: "Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin",
+    year: "2025",
+    links: {
+      paper: "https://arxiv.org/abs/2404.00242",
+      code: "https://github.com/LINs-lab/DeFT",
+      contact: "[email protected]"
+    },
+    files: {
+      markdown: require("../assets/publications/yao2024deft/deft.md"),
+    },
+    venue: "ICLR 2025",
+    tags: ["LLM", "Inference", "Efficiency"],
+  },
   {
     key: "feng2024graphrouter",
     title: "Graphrouter: A graph-based router for llm selections",