
Evaluating self-attention interpretability through human-grounded experimental protocol


Author: Nicolas Chesneau, Head of Innovation, Research and Development

Date: 31 July 2023

Category: Thought Leadership

In recent years, artificial intelligence algorithms have become more efficient and more complex. It is becoming increasingly difficult, if not impossible, to understand how an algorithm makes a decision, such as classifying a text into a category. Interpretability methods allow us to explain that decision-making process. The advantages are manifold: increased trust in the system, the ability to correct errors, bias detection, etc.

Interpretability is an active field of research, and numerous methods have been proposed. Among them, SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME) are the best-known and most effective methods available. One of their downsides is that they are post-hoc: they are applied after the decision has been made. These methods analyze the model of interest in retrospect, which is one of the reasons they can produce incorrect analyses. Moreover, this kind of analysis is particularly slow, which makes it very difficult to run on an entire population of texts.
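
To make the post-hoc nature of these methods concrete, here is a minimal sketch of a LIME explanation for a toy sentiment classifier. The tiny corpus and the model choice are purely illustrative; each perturbed sample LIME generates is one extra model call, which is where the slowness comes from.

```python
# A minimal sketch of a post-hoc explanation with LIME on a toy sentiment
# classifier. The training texts, labels, and model below are made up for
# illustration; any predict_proba-style text model would work.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

# Hypothetical toy training data (stand-in for a real movie-review corpus).
texts = ["a wonderful, moving film", "brilliant acting and a great script",
         "a dull, boring mess", "terrible plot and awful dialogue"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Post-hoc step: LIME perturbs the input text many times and re-queries the
# model, which is why it needs extra inferences after the decision is made.
explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "a great film with a boring ending",
    model.predict_proba,
    num_features=5,
    num_samples=500,  # each sample is one extra model call
)
print(explanation.as_list())  # word-level weights, e.g. [("great", 0.3), ...]
```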

NLP, or natural language processing, has been particularly affected by recent developments in artificial intelligence. The Transformer architecture revolutionized the field by integrating attention layers into the neural network. Attention coefficients are particularly useful in that they focus on the most important words in a text when making a decision, e.g., classifying a text into a positive sentiment label. We have developed an interpretability algorithm based on analyzing these attention coefficients, CLaSsification-Attention (CLS-A), which assesses how relevant each word is to the decision-making process. CLS-A relies on coefficients already learned by the model, requires no additional inference, and is therefore very quick to compute.
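
The exact aggregation used by CLS-A is detailed in the paper rather than in this post, but the following sketch shows the general idea of reading word relevance directly off the attention coefficients of a fine-tuned Transformer classifier: it averages, over the heads of the last layer, the attention that the [CLS] token pays to each token, reusing the same forward pass as the prediction. The checkpoint name is just an example.

```python
# A minimal sketch of attention-based relevance scores in the spirit of
# CLS-A. This is NOT the exact CLS-A formula; it simply averages, over the
# heads of the last layer, the attention paid by the [CLS] token to every
# token of the input.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, output_attentions=True
)

text = "a great film with a boring ending"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # the same forward pass used for the prediction

# Last-layer attentions: shape (batch, heads, seq_len, seq_len).
# Row 0 is the attention distribution of the [CLS] token.
last_layer = outputs.attentions[-1]
cls_to_tokens = last_layer[0, :, 0, :].mean(dim=0)  # average over heads

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, cls_to_tokens.tolist()):
    print(f"{token:>12s}  {score:.3f}")
```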

We have established an experimental protocol to evaluate the SHAP, LIME, and CLS-A interpretability methods. For this purpose, a simple, automatically calculated metric is not enough: it fails to capture the purpose of an interpretability method, which is to be understandable by a human. Drawing inspiration from current neuroscience protocols, we conducted a full-scale experiment on a human population. The protocol consists of asking participants to annotate one hundred texts for a binary classification task. For each text, some words are highlighted in a more or less intense shade of blue, depending on an underlying interpretability method or a random baseline: the higher the method's coefficient, the deeper the shade of blue. Accuracy and response time are measured to assess each method's ability to help the participant with the annotation task. The higher the accuracy and the shorter the response time, the more relevant the method, since it makes the semantic processing of the text easier.
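
As an illustration of the highlighting step, here is a small sketch that maps word-level coefficients, from any of the methods or from the random baseline, onto shades of blue. The words and scores are made up.

```python
# A minimal sketch of the highlighting used in the protocol: word-level
# coefficients are min-max normalized and rendered as more or less intense
# shades of blue. The words and scores below are purely illustrative.
import html

def highlight(words, scores):
    """Return an HTML snippet where each word's blue background scales
    with its normalized coefficient."""
    lo, hi = min(scores), max(scores)
    spans = []
    for word, score in zip(words, scores):
        alpha = 0.0 if hi == lo else (score - lo) / (hi - lo)
        spans.append(
            f'<span style="background-color: rgba(0, 0, 255, {alpha:.2f})">'
            f"{html.escape(word)}</span>"
        )
    return " ".join(spans)

# Hypothetical coefficients for one short review.
words = ["a", "great", "film", "with", "a", "boring", "ending"]
scores = [0.01, 0.80, 0.30, 0.02, 0.01, 0.65, 0.20]
print(highlight(words, scores))
```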

We tested the different interpretability methods on three movie-review classification tasks: positive vs. negative sentiment, action vs. drama, and horror vs. comedy. We have summarized all the results in the following table:

First, all three interpretability methods show better results than random highlighting, in terms of both reaction time and human classification performance. These results demonstrate the effectiveness of SHAP, LIME, and CLS-A. Moreover, the average reaction time associated with CLS-A is lower in experiments 1 and 3, and accuracy was also higher on average for the participants who were exposed to CLS-A. Complementary experiments, in which we broke down user performance with an Explainable Boosting Machine, show a strong correlation between the quality of an explanation and the reliability of a prediction. In experiments 1 and 3, the efficiency of CLS-A decreases for high probabilities, which may be linked to how easy such texts are to classify in the first place: the highlighting is simply less useful there. We also found that the impact of CLS-A was concentrated on fast and slow responses and tended to merge with the baseline for medium-speed responses, confirming the method's facilitative effect on semantic processing.
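
As an illustration of this kind of breakdown, the sketch below fits an Explainable Boosting Machine on synthetic per-trial data, regressing response time on the highlighting condition and the classifier's probability. The column names and data are placeholders; the real analysis used the collected annotation logs.

```python
# A minimal sketch of a per-trial breakdown with an Explainable Boosting
# Machine. The DataFrame columns and the synthetic data are placeholders.
import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingRegressor
from interpret import show

rng = np.random.default_rng(0)
n = 500
trials = pd.DataFrame({
    "method": rng.choice(["CLS-A", "SHAP", "LIME", "random"], size=n),
    "model_probability": rng.uniform(0.5, 1.0, size=n),
})
# Synthetic response times, only to make the example runnable.
trials["response_time"] = (
    6.0
    - 2.0 * trials["model_probability"]
    - 0.5 * (trials["method"] == "CLS-A")
    + rng.normal(0, 0.5, size=n)
)

ebm = ExplainableBoostingRegressor()
ebm.fit(trials[["method", "model_probability"]], trials["response_time"])

# Global explanation: how each factor shapes response time.
show(ebm.explain_global())
```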

To conclude, we propose a new interpretability method for Transformer models, whose most convincing use cases relate to NLP. There is currently no consensus on a reliable quantitative way to validate interpretability techniques. In the paper, we suggest conducting human-centered experiments that measure how similar the processes at work in the model are to human reasoning. Our study shows the relevance of this approach, especially when it is combined with statistical analysis. Many possibilities remain to be explored. All our experiments were carried out with binary classifiers, so it would be interesting to include classifiers with more than two classes to reach more specific conclusions. It would also be interesting to evaluate feature-importance methods for NLP in the same way.
