<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-GDXSC5Y2BD"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'G-GDXSC5Y2BD');
</script>

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

<script src="./files/head.js"></script>

<meta name="viewport" content="width=device-width, initial-scale=1">

<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>

<meta name="keywords" content="MIT, Microsoft Research, Machine Learning, Rank Reduction, Computer Science, Artificial Intelligence">

<title>The Truth Is In There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction</title>
<link rel="stylesheet" href="./files/font.css">
<link rel="stylesheet" href="./files/main.css">

<link rel="stylesheet" type="text/css"
      href="https://cdn.rawgit.com/dreampulse/computer-modern-web-font/master/fonts.css">
<style>
body {
  font-family: "Computer Modern Serif", serif;
  font-size: 14pt;
}


* {padding:0;margin:0;box-sizing:border-box;}
#video {
  position: relative;
  padding-bottom: 45%; /* 16:9 */
  height: 0;
}
#video iframe {
  position: absolute;
  top: 0;
  left: 0;
  width: 80%;
  height: 100%;
  transform: translateX(12.5%);
}

</style>


</head>

  <body>

    <div class="outercontainer">
    <div class="container">

    <div class="content project_title">
    <center>
    <br>
    <h2>The Truth Is In There: Improving Reasoning in Language Models <br>with Layer-Selective Rank Reduction</h2>
    <div class="authors">
        <a href="https://pratyushasharma.github.io/">Pratyusha Sharma</a>,
        <a href="https://www.jordantash.com/">Jordan Ash*</a>, and
        <a href="https://dipendramisra.com/">Dipendra Misra*</a>
    </div>
    <div>
    <span class="tag">
        <a href="https://arxiv.org/abs/2312.13558">Paper</a>
        <a href="https://github.com/pratyushasharma/laser">Code</a>
        <a href="files/bib.txt">BibTex</a>
    </span>
    </div>
    </center>
    </div>

    <br><br>

    <div class="content">
    <center>
    <div class="text">
      <p>
      <div class="title"><b>Summary</b></div>
      Transformer-based Large Language Models (LLMs) have become a fixture in modern machine learning.
      Correspondingly, significant resources are allocated towards research that aims to further advance this technology, typically resulting in models of increasing size that are trained on increasing amounts of data.
      This work, however, demonstrates the surprising result that it is often possible to improve the performance of LLMs by simply removing higher-order components of their constituent weight matrices in the multi-layer perceptron (MLP) layers.
      This simple intervention, which we call LAyer-SElective Rank reduction (LASER), can be done on a model after training has completed, and requires no additional parameters or data.
      LASER can dramatically boost predictive performance on question-answering tasks and across the various modalities for which Transformers are used.
      </p>
    </div>
    </center>
    </div>
    <br>
    <br>

    <center>
    <img width="60%" src="main.png">
    <br>
    <br>
    <i>LAyer-SElective Rank reduction (LASER) replaces a specific weight matrix W of the Transformer model with its rank-$k$ approximation and observes the change in the model's behavior. We find that this low-rank approximation, especially for MLP weights in the later layers of the model, often offers surprising benefits to model performance. A minimal code sketch of this operation is given below.</i>
    </center>
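    <br>
    <div class="content">
    <div class="text">
      <p>
      For concreteness, the following is a minimal sketch of the core LASER operation in PyTorch: computing the best rank-$k$ approximation of a weight matrix with a truncated SVD and substituting it for the original matrix. It is an illustration under assumed settings (matrix shape and target rank), not the authors' released implementation.
      </p>
<pre><code>
import torch

def low_rank_approximation(W: torch.Tensor, k: int) -> torch.Tensor:
    """Best rank-k approximation of W in the Frobenius-norm sense."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Illustrative example: keep only the top 1% of singular components.
W = torch.randn(1024, 4096)            # stand-in for an MLP weight matrix
k = max(1, int(0.01 * min(W.shape)))   # assumed target rank
W_reduced = low_rank_approximation(W, k)
</code></pre>
    </div>
    </div>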
    <p>
    <br><br>


    <div class="content">
    <div class="text">
      <center>
      <p>
      <div class="title"><b>Layer-Selective Rank Reduction Improves Generalization</b></div>
      </p>
      <br>
      <br>
      <img width="40%" src="loss.png">


      <br>
      <br>
      <i>The effect of rank reduction across different layer types is not uniform. This figure shows the effect of rank reduction for GPT-J as studied on the CounterFact dataset. The dashed line is the base model's loss. In the attention layers (key, query, value, and output matrices), it is clear that the matrices can be significantly rank-reduced without damaging the learned hypothesis, but doing so yields very little performance gain. In contrast, for the multi-layer perceptron (MLP) layers, rank reduction goes from uniformly harming the model's performance to improving it (around layer 20).</i>
      </center>
      <br>
    </div>
    </div>
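
    <div class="content">
    <div class="text">
      <p>
      To make the layer-selective intervention concrete, the sketch below applies a rank reduction to a single GPT-J MLP matrix (the layer-20 output projection) using the Hugging Face transformers library. The module path (transformer.h[20].mlp.fc_out) follows the Hugging Face GPT-J layout, and the retained-rank fraction is an assumed, illustrative choice rather than a tuned value from the paper.
      </p>
<pre><code>
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

layer, keep_fraction = 20, 0.01                      # illustrative choices
W = model.transformer.h[layer].mlp.fc_out.weight.data

# Rank-k approximation via truncated SVD of the chosen MLP weight matrix.
k = max(1, int(keep_fraction * min(W.shape)))
U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
W_k = (U[:, :k] * S[:k]) @ Vh[:k, :]

# Write the reduced matrix back into the model in its original dtype.
model.transformer.h[layer].mlp.fc_out.weight.data.copy_(W_k.to(W.dtype))
</code></pre>
    </div>
    </div>
    <br>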
    <br>
    <br>

    <div class="content">
    <div class="text">
      <center>
      <p>
      <div class="title"><b>LASER offers a kind of denoising procedure that makes weakly learned facts accessible</b></div>
      <ul>
      <br>
      <img width="40%" src="corrected.png">
      <img width="11%" src="corrected-2.png">
      <br>
      <li>Which datapoints benefit from LASER? We analyze how frequently the "corrected" facts occur in the training data. GPT-J is an ideal test bed for such an analysis since its training data, the Pile, is publicly available. (a) For GPT-J evaluated on CounterFact, we retrieve all datapoints in the training data that mention both the entity of interest and the answer corresponding to each CounterFact sample. (b) A plot of the model's cumulative top-10 accuracy on all datapoints whose facts occur in the training data at most as often as the frequency indicated on the x-axis, showing how the accuracy changes before and after LASER. (c) The largest boost in performance occurs for low-frequency samples: binning the data by the frequency with which the corresponding facts occur in the training data shows that the maximal improvements in accuracy come from datapoints that occur less frequently in the training data, as opposed to those that occur more frequently. A sketch of this analysis appears after this section.</li>
      <br>
      <br>
      </ul>
      </p>
      </center>
    </div>
    </div>
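
    <div class="content">
    <div class="text">
      <p>
      The frequency analysis above can be reproduced in outline as follows. The record format (a per-sample count of training-data mentions plus correctness flags before and after LASER) is a hypothetical stand-in for the actual retrieval pipeline over the Pile; only the cumulative-accuracy computation is shown.
      </p>
<pre><code>
from dataclasses import dataclass

@dataclass
class Record:
    mentions: int        # times the (entity, answer) pair appears in the training data
    correct_base: bool   # top-10 correct before LASER
    correct_laser: bool  # top-10 correct after LASER

def cumulative_accuracy(records, is_correct):
    """Top-10 accuracy over all records with at most f mentions, for increasing f."""
    records = sorted(records, key=lambda r: r.mentions)
    hits, curve = 0, []
    for i, r in enumerate(records, start=1):
        hits += int(is_correct(r))
        curve.append((r.mentions, hits / i))
    return curve

# curve_base  = cumulative_accuracy(data, lambda r: r.correct_base)
# curve_laser = cumulative_accuracy(data, lambda r: r.correct_laser)
</code></pre>
    </div>
    </div>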

    </div>
    </div>

<br><br><br><br>


</div></body></html>