[Re] Reproducing Learning to Deceive With Attention-Based Explanations

Rahel Habacker, Andrew Harrison, Mathias Parisot, Ard Snijders
2021 Zenodo  
Based on the intuition that attention in neural networks reflects what the model focuses on, attention weights are increasingly used as explanations for a model's predictions (see Galassi, Lippi, and Torroni [1] for a survey). Pruthi et al. [2] challenge the use of attention-based explanations through a series of experiments with classification and sequence-to-sequence (seq2seq) models. They examine the model's use of impermissible tokens: user-defined tokens that can introduce bias, e.g. gendered pronouns.
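As a hypothetical illustration (not taken from the authors' code), positions of such impermissible tokens could be flagged as follows; the token list here is an invented example restricted to a few gendered words, and the function name is ours.

```python
# Hypothetical sketch: flag positions of user-defined impermissible tokens
# (e.g. gendered pronouns) in a whitespace-tokenized input, so they can
# later be ablated or have their attention penalised.
from typing import List

IMPERMISSIBLE = {"he", "she", "him", "her", "his", "hers", "mr", "mrs", "ms"}

def impermissible_mask(tokens: List[str]) -> List[int]:
    """Return a 0/1 mask with 1 at impermissible-token positions."""
    return [1 if tok.lower().strip(".,") in IMPERMISSIBLE else 0 for tok in tokens]

tokens = "She is a nurse and he is a surgeon".split()
print(impermissible_mask(tokens))  # [1, 0, 0, 0, 0, 1, 0, 0, 0]
```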
Across multiple datasets, the authors show that when the impermissible tokens are removed, model accuracy drops, implying that the models use them for prediction. They then penalise attention paid to the impermissible tokens while keeping them in the input, and train models that retain full accuracy, and hence must still be using the impermissible tokens, yet assign them little attention. As the paper's claims have significant implications for the use of attention-based explanations, we seek to reproduce their results.

Methodology
Using the authors' code, for the classifiers we attempt to reproduce their embedding, BiLSTM, and BERT results on the occupation prediction, gender identification, and SST + Wiki datasets. Further, we reimplemented BERT using HuggingFace's Transformers library [3] with restricted self-attention, so that information cannot flow between permissible and impermissible tokens (see the sketch after this summary). For seq2seq we used the authors' code to reproduce results on the Bigram Flip, Sequence Copy, Sequence Reverse, and English-German (En-De) machine translation datasets. We also refactored the authors' code toward a more uniform, reusable style and ported it to PyTorch Lightning. All experiments were run in approximately 130 GPU hours on a computing cluster with nodes containing Titan RTX GPUs.

Results
We reproduced the authors' results across all models and all available datasets, confirming their findings that attention-based explanations can be manipulated and that models can learn to deceive. We also replicated their BERT results using our reimplemented model. Only one result was less strongly (by more than one standard deviation) in their experimental direction.

What Was Easy
The authors' methods were largely well described and easy to follow, and we could quickly produce the first results, as their code worked straight away with minor adjustments. They were also extremely responsive and helpful via email.

What Was Difficult
Re-implementing the BERT-based classification model for the replication was difficult, as further details on the model architecture, penalty mechanism, and training procedure were needed. Porting the code across to PyTorch Lightning was also challenging.

Communication With Original Authors
There was a continuous email chain with the authors for several weeks during the reproducibility work. They made additional code and datasets available at our request, and provided detailed responses and clarifications to our emailed questions. They encouraged the work, and we wish to thank them for their time and support.
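To make the two mechanisms above concrete, here is a minimal PyTorch sketch of (a) a training loss that penalises attention mass on impermissible tokens and (b) a restricted self-attention mask that blocks information flow between permissible and impermissible tokens. The function names, the penalty weight `lam`, and the exact penalty form are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (illustrative assumptions, not the authors' exact code) of
# penalising attention on impermissible tokens and of restricted self-attention.
import torch
import torch.nn.functional as F

def penalised_loss(logits, labels, attention, impermissible, lam=0.1):
    """
    logits:        (batch, num_classes) classifier outputs
    labels:        (batch,) gold labels
    attention:     (batch, seq_len) attention weights over input tokens
    impermissible: (batch, seq_len) 1.0 at impermissible-token positions
    lam:           penalty weight (hyperparameter; value assumed here)
    """
    task_loss = F.cross_entropy(logits, labels)
    # Average attention mass assigned to impermissible tokens.
    impermissible_mass = (attention * impermissible).sum(dim=-1).mean()
    return task_loss + lam * impermissible_mass

def restricted_attention_mask(impermissible):
    """
    impermissible: (seq_len,) bool tensor, True at impermissible positions.
    Returns a (seq_len, seq_len) additive mask with -inf wherever the query
    and key tokens fall in different groups, so no information flows between
    permissible and impermissible tokens.
    """
    same_group = impermissible.unsqueeze(0) == impermissible.unsqueeze(1)
    mask = torch.zeros_like(same_group, dtype=torch.float)
    mask[~same_group] = float("-inf")
    return mask
```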
doi:10.5281/zenodo.4834146