From 0dd0271a3afefb0359ca7e40f918cee13bbaea19 Mon Sep 17 00:00:00 2001 From: j-t-1 <120829237+j-t-1@users.noreply.github.com> Date: Tue, 4 Mar 2025 11:27:52 +0000 Subject: [PATCH] Modify slightly the Bert SQUAD interpretation tutorial --- tutorials/Bert_SQUAD_Interpret2.ipynb | 76 +++++++++++++-------------- 1 file changed, 38 insertions(+), 38 deletions(-) diff --git a/tutorials/Bert_SQUAD_Interpret2.ipynb b/tutorials/Bert_SQUAD_Interpret2.ipynb index b2049de3f1..a055be834a 100644 --- a/tutorials/Bert_SQUAD_Interpret2.ipynb +++ b/tutorials/Bert_SQUAD_Interpret2.ipynb @@ -11,17 +11,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In the second part of interpreting Bert models we look into attention matrices, their importance scores, vector norms and compare them with the results that we found in Part 1.\n", + "In the second part of interpreting BERT models we look into attention matrices, their importance scores and vector norms, and compare them with the results that we found in Part 1.\n", "\n", - "Similar to Part 1 we use Bert Question Answering model fine-tuned on SQUAD dataset using transformers library from Hugging Face: https://huggingface.co/transformers/\n", + "Similar to Part 1 we use the BERT Question Answering model fine-tuned on the SQUAD dataset using the transformers library from Hugging Face: https://huggingface.co/transformers/.\n", "\n", - "In order to be able to use the same setup and reproduce the results form Part 1 we will redefine same setup and helper functions in this tutorial as well. \n", + "In order to be able to use the same setup and reproduce the results from Part 1 we will redefine the same setup and helper functions in this tutorial as well. 
\n", "\n", - "In this tutorial we compare attention matrices with their importance scores when we attribute them to a particular class, and vector norms as proposed in paper: https://arxiv.org/pdf/2004.10102.pdf\n", + "In this tutorial we compare attention matrices with their importance scores when we attribute them to a particular class, and vector norms as proposed in the paper: https://arxiv.org/pdf/2004.10102.pdf.\n", "\n", "We show that the importance scores computed for the attention matrices and specific class are more meaningful than the attention matrices alone or different norm vectors computed for different input activations.\n", "\n", - "Note: Before running this tutorial, please install `seaborn`, `pandas` and `matplotlib`, `transformers`(from hugging face) python packages in addition to `Captum` and `torch` libraries.\n", + "Note: Before running this tutorial, please install the `seaborn`, `pandas`, `matplotlib`, and `transformers` (from Hugging Face) Python packages in addition to `Captum` and `torch` libraries.\n", "\n", "This tutorial was built using transformer version 4.3.0." ] @@ -62,7 +62,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The first step is to fine-tune BERT model on SQUAD dataset. This can be easiy accomplished by following the steps described in hugging face's official web site: https://github.com/huggingface/transformers#run_squadpy-fine-tuning-on-squad-for-question-answering \n", + "The first step is to fine-tune the BERT model on the SQUAD dataset. This can be easily accomplished by following the steps described on Hugging Face's official website: https://github.com/huggingface/transformers#run_squadpy-fine-tuning-on-squad-for-question-answering \n", "\n", "Note that the fine-tuning is done on a `bert-base-uncased` pre-trained model." 
] @@ -116,7 +116,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Defining a custom forward function that will allow us to access the start and end positions of our prediction using `position` input argument." + "Defining a custom forward function that will allow us to access the start and end positions of our prediction using the `position` input argument." ] }, { @@ -140,7 +140,7 @@ "\n", "To do so, we need to define baselines / references, numericalize both the baselines and the inputs. We will define helper functions to achieve that.\n", "\n", - "The cell below defines numericalized special tokens that will be later used for constructing inputs and corresponding baselines/references." + "The cell below defines numericalized special tokens that will be later used for constructing inputs and corresponding baselines / references." ] }, { @@ -158,7 +158,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Below we define a set of helper function for constructing references / baselines for word tokens, token types and position ids." + "Below we define a set of helper functions for constructing baselines / references for word tokens, token types and position IDs." ] }, { @@ -212,7 +212,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's define the `question - text` pair that we'd like to use as an input for our Bert model and interpret what the model was focusing on when predicting an answer to the question from given input text " + "Let's define the `question - text` pair that we'd like to use as an input for our BERT model and interpret what the model was focusing on when predicting an answer to the question from a given input text." ] }, { @@ -270,7 +270,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now let's make predictions using input, token type, position id and a default attention mask." + "Now let's make predictions using input, token type, position ID and a default attention mask." 
] }, { @@ -309,7 +309,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "`output_attentions` represent attention matrices aka attention probabilities for all 12 layers and all 12 heads. It represents softmax-normalized dot-product between the key and query vectors. In the literature (https://www.aclweb.org/anthology/W19-4828.pdf) it has been used as an importance indicator of how much a token attends / relates to another token in the text. In case of translation for example it is a good indicator of how much a token in one language attends to the corresponding translation in another language. In case of Question Answering model it indicates which tokens attend / relate to each other in question, text or answer segment.\n", + "`output_attentions` represents the attention matrices, aka attention probabilities, for all 12 layers and all 12 heads. It represents a softmax-normalized dot-product between the key and query vectors. In the literature (https://www.aclweb.org/anthology/W19-4828.pdf) it has been used as an importance indicator of how much a token attends / relates to another token in the text. In the case of translation, for example, it is a good indicator of how much a token in one language attends to the corresponding translation in another language. In the case of Question Answering models it indicates which tokens attend / relate to each other in a question, text or answer segment.\n", "\n", "Since `output_attentions` contains the layers in a list, we will stack them in order to move everything into a tensor." ] @@ -335,7 +335,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Below helper function will be used for visualizing token-to-token relation / attention scores for all heads in a given layer or for all layers across all heads." + "The below helper function will be used for visualizing token-to-token relation / attention scores for all heads in a given layer or for all layers across all heads." 
] }, { @@ -378,7 +378,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Below helper function will be used for visualizing the importance scores for tokens across all heads in all layers." + "The below helper function will be used for visualizing the importance scores for tokens across all heads in all layers." ] }, { @@ -414,7 +414,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's examine a specific layer. For that reason we will define a fixed layer id that will be used for visualization purposes. The users are free to change this layer if they want to examine a different one.\n" + "Let's examine a specific layer. For that reason we will define a fixed layer ID that will be used for visualization purposes. The users are free to change this layer if they want to examine a different one.\n" ] }, { @@ -459,21 +459,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Based on the visualizations above we observe that there is a high attention set along the diagonals and on an uninformative token such as `[SEP]`. This is something that was observed in previous papers which indicates that attention matrices aren't always a good indicator of finding which tokens are more important or which token is related to which. We observe similar pattern when we examine another layer." + "Based on the visualizations above we observe high attention along the diagonals and on an uninformative token such as `[SEP]`. This is something that was observed in previous papers, which indicates that attention matrices aren't always a good indicator of finding which tokens are more important or which token is related to which. We observe similar patterns when we examine another layer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In the cell below we compute and visualize L2 norm across head axis for all 12 layer. This provides a summary for each layer across all heads." 
+ "In the cell below we compute and visualize the L2 norm across the head axis for all 12 layers. This provides a summary for each layer across all heads." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Defining normalization function depending on pytorch version." + "Defining a normalization function depending on the PyTorch version." ] }, { @@ -517,7 +517,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Based on the visualiziation above we can convince ourselves that attention scores aren't trustworthy measures of importances for token-to-token relations across all layers. We see strong signal along the diagonal and for the `[SEP]` and `[CLS]` tokens. These signals, however, aren't true indicators of what semantic the model learns.\n" + "Based on the visualization above we can convince ourselves that attention scores aren't trustworthy measures of importance for token-to-token relations across all layers. We see a strong signal along the diagonal and for the `[SEP]` and `[CLS]` tokens. These signals, however, aren't true indicators of what semantics the model learns.\n" ] }, { @@ -531,7 +531,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In the cells below we visualize the attribution scores of attention matrices for the start and end position positions prediction and compare with the actual attention matrices. To do so, first of all, we compute the attribution scores using LayerConductance algorithm similar to Part 1." + "In the cells below we visualize the attribution scores of attention matrices for the start and end position predictions and compare with the actual attention matrices. To do so, first of all, we compute the attribution scores using the LayerConductance algorithm, similar to Part 1." ] }, { @@ -564,7 +564,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now let's look into the layers of our network. 
More specifically we would like to look into the distribution of attribution scores for each token across all layers and attribution matrices for each head in all layers in Bert model. \n", + "Now let's look into the layers of our network. More specifically we would like to look into the distribution of attribution scores for each token across all layers and attribution matrices for each head in all layers of the BERT model.\n", "We do that using one of the layer attribution algorithms, namely, layer conductance. However, we encourage you to try out and compare the results with other algorithms as well.\n", "\n", "\n", @@ -584,9 +584,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's iterate over all layers and compute the attributions w.r.t. all tokens in the input and attention matrices. \n", + "Let's iterate over all layers and compute the attributions with respect to all tokens in the input and attention matrices. \n", "\n", - "Note: Since below code is iterating over all layers it can take over 5 seconds. Please be patient!" + "Note: Since the code below iterates over all layers, it can take over 5 seconds. Please be patient!" ] }, { @@ -644,12 +644,12 @@ "\n", "The plot below represents a heatmap of attributions across all layers and tokens for the start position prediction.\n", "\n", - "Note that here we do not have information about different heads. Heads related information will be examined separately when we visualize the attribution scores of the attention matrices w.r.t. the start or end position predictions.\n", + "Note that here we do not have information about different heads. Head-related information will be examined separately when we visualize the attribution scores of the attention matrices with respect to the start or end position predictions.\n", "\n", "It is interesting to observe that the question word `what` gains increasingly high attribution from layer one to ten. 
In the last two layers that importance is slowly diminishing. \n", - "In contrary to `what` token, many other tokens have negative or close to zero attribution in the first 6 layers. \n", + "In contrast to the `what` token, many other tokens have negative or close to zero attribution in the first 6 layers. \n", "\n", - "We start seeing slightly higher attribution in tokens `important`, `us` and `to`. Interestingly token `important` is also assigned high attribution score which is remarkably high in the fifth and sixth layers.\n", + "We start seeing slightly higher attribution in tokens `important`, `us` and `to`. Interestingly, the token `important` is assigned an attribution score that is remarkably high in the fifth and sixth layers.\n", "\n", "Lastly, our correctly predicted token `to` gains increasingly high positive attribution especially in the last two layers.\n" ] @@ -687,7 +687,7 @@ "metadata": {}, "source": [ "Now let's examine the heat map of the attributions for the end position prediction. In the case of end position prediction we again observe high attribution scores for the token `what` in the last 11 layers.\n", - "Correctly predicted end token `kinds` has positive attribution across all layers and it is especially prominent in the last two layers. It's also interesting to observe that `humans` token also has relatively high attribution score in the last two layers." + "Correctly predicted end token `kinds` has positive attribution across all layers and it is especially prominent in the last two layers. It's also interesting to observe that the `humans` token has a relatively high attribution score in the last two layers." ] }, { @@ -738,7 +738,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this section we visualize the attribution scores of start and end position predictions w.r.t. 
attention matrices.\n", + "In this section we visualize the attribution scores of start and end position predictions with respect to attention matrices.\n", "Note that each layer has 12 heads, hence attention matrices. We will first visualize for a specific layer and head, later we will summarize across all heads in order to gain a bigger picture.\n" ] }, @@ -775,7 +775,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As we can see from the visualizations above, in contrary to attention scores the attributions of specific target w.r.t. to those scores are more meaningful and most importantly, they do not attend to `[SEP]` token or show diagonal patterns. We observe that heads 4, 9, 12 and 2 show strong relationship between `what` and `it` tokens when predicting start position, head 10 and 11 between `it` and `it`, heads 8 between `important` and `to` and head 1 between `to` and `what`. Note that `to` token is the start position of the answer token. It is also important to mention that these observations are for a selected `layer`. We can change the index of selected `layer` and examine interesting relationships in other layers." + "As we can see from the visualizations above, in contrast to attention scores the attributions of specific targets with respect to those scores are more meaningful and, most importantly, they do not attend to the `[SEP]` token or show diagonal patterns. We observe that heads 2, 4, 9, and 12 show strong relationships between the `what` and `it` tokens when predicting the start position, heads 10 and 11 between `it` and `it`, head 8 between `important` and `to`, and head 1 between `to` and `what`. Note that the `to` token is the start position of the answer token. It is also important to mention that these observations are for a selected `layer`. We can change the index of the selected `layer` and examine interesting relationships in other layers." 
] }, { @@ -812,14 +812,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "By looking at the visualizations above we can see that the model pays attention to very specific handpicked relationships when making a sprediction for start position. Most notably in the layers 10, 7, 11 and 4 it focuses more on the relationships between `it` and `is`, `important` and `to`." + "By looking at the visualizations above we can see that the model pays attention to very specific handpicked relationships when making a prediction for the start position. Most notably, in layers 10, 7, 11 and 4 it focuses more on the relationships between `it` and `is`, `important` and `to`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now let's run the same experiments for the end position prediction. Below we visualize the attribution scorese of attention matrices for the end position prediction for the selected `layer`." + "Now let's run the same experiments for the end position prediction. Below we visualize the attribution scores of attention matrices for the end position prediction for the selected `layer`." ] }, { @@ -848,7 +848,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As we can see from the visualizations above that for the end position prediction we have stronger attention towards the end of the answer token `kinds`. Here we can see stronger connection between `humans` and `kinds` in the 11th head, `it` and `em`, `power`, `and` in the 5th, 6th and 8th heads. The connections between `it` and `what` are also strong in first couple and 10th heads." + "As we can see from the visualizations above, for the end position prediction we have stronger attention towards the end of the answer token `kinds`. Here we can see stronger connections between `humans` and `kinds` in the 11th head, and between `it` and `em`, `power`, `and` in the 5th, 6th and 8th heads. The connections between `it` and `what` are also strong in the first couple of heads and in the 10th head." 
] }, { @@ -899,7 +899,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this section of the tutorial we will compute Vector norms for activation layers such as ||f(x)||, ||α * f(x)|| and ||Σαf(x)|| as also described in the: https://arxiv.org/pdf/2004.10102.pdf\n", + "In this section of the tutorial we will compute vector norms for activation layers such as ||f(x)||, ||α * f(x)|| and ||Σαf(x)|| as described in: https://arxiv.org/pdf/2004.10102.pdf.\n", "\n", "As also shown in the paper mentioned above, normalized activations are better indicators of importance scores than the attention scores however they aren't as indicative as the attribution scores. This is because normalized activations ||f(x)|| and ||α * f(x)|| aren't attributed to a specific output prediction. From our results we can also see that according to those normalized scores `[SEP]` tokens are insignificant." ] @@ -908,7 +908,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Below we define / extract all parameters that we need to computation vector norms. " + "Below we define / extract all parameters that we need to compute vector norms. " ] }, { @@ -929,7 +929,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In order to compute above mentioned norms we need to get access to dense layer's weights and value vector of the self attention layer." + "In order to compute the above-mentioned norms we need access to the dense layer's weights and the value vector of the self-attention layer." ] }, { @@ -980,7 +980,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In the cell below we perform several transformations with the value layer activations and bring it to the shape so that we can compute different norms. The transformations are done the same way as it is described in the original paper and corresponding github implementation." 
+ "In the cell below we perform several transformations on the value layer activations and reshape them so that we can compute the different norms. The transformations are done the same way as described in the original paper and the corresponding GitHub implementation." ] }, { @@ -1224,14 +1224,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Above visualizations also confirm that the attention scores aren't concentrated on the tokens such as `[CLS]`, `[SEP]` and `.` however we see stronger signals along the diagonals and some patches of stronger signals between certain parts of the text including some tokens in the question part that are relevant in the answer piece." + "The above visualizations also confirm that the attention scores aren't concentrated on tokens such as `[CLS]`, `[SEP]` and `.`; however, we see stronger signals along the diagonals and some patches of stronger signals between certain parts of the text, including some tokens in the question part that are relevant in the answer piece." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "It is important to mention that all experiments were performed for one input sample, namely, `sentence`. In the papers we often see aggregation of the results across multiple samples. For further analysis and more convincing propositions we recommend to conduct the experiments across multiple input samples. In addition to that it would be also interesting to look into the correlation of heads in layer and across different layers." + "It is important to mention that all experiments were performed for one input sample, namely, `sentence`. In the papers we often see aggregation of the results across multiple samples. For further analysis and more convincing propositions we recommend conducting the experiments across multiple input samples. In addition to that it would be interesting to look into the correlation of heads within a layer and across different layers." ] } ],
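For reviewers of this patch, the shape bookkeeping behind the edited cells (stacking the per-layer `output_attentions` list into one tensor, then summarizing each layer with an L2 norm across the head axis) can be sketched as below. This is a minimal illustration, not the tutorial's actual code: it uses NumPy with random data standing in for the fine-tuned BERT model's real attention probabilities, whereas the notebook itself uses `torch.stack` and a norm over the head dimension.

```python
import numpy as np

# Hypothetical stand-in for the tutorial's `output_attentions`: a list of
# 12 per-layer attention tensors, each of shape (batch, heads, seq, seq).
# Random data is used purely to illustrate the shapes involved.
rng = np.random.default_rng(0)
num_layers, num_heads, seq_len = 12, 12, 16
output_attentions = [rng.random((1, num_heads, seq_len, seq_len))
                     for _ in range(num_layers)]

# Stack the per-layer list into a single tensor, as the tutorial does with
# torch.stack: the result has shape (layers, batch, heads, seq, seq).
all_attentions = np.stack(output_attentions)

# L2 norm across the head axis (axis=2) collapses the 12 heads, giving one
# token-to-token summary matrix per layer, mirroring the notebook's
# per-layer head summary.
layer_summaries = np.linalg.norm(all_attentions, axis=2)

print(all_attentions.shape)   # (12, 1, 12, 16, 16)
print(layer_summaries.shape)  # (12, 1, 16, 16)
```

The same pattern applies unchanged in torch, since `torch.stack` and `torch.norm(..., dim=2)` follow the same axis conventions.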