QUACKIE

An NLP Classification Task With Ground Truth Explanations

Yves Rychener, Xavier Renard, Djamé Seddah, Pascal Frossard, Marcin Detyniecki

Evaluating NLP interpretability is hard, since the actual reasoning of a black-box classifier is unknown. The QUestion Answering for Classification tasK Interpretability Evaluation benchmark takes a novel approach by formulating a classification task for which interpretability ground truth labels arise directly from the definition of the classification problem. It is based on question answering datasets, with the classification task being question answerability. The ground-truth explanation for knowing that a question is answerable with a given context is the sentence containing the answer.


Results

[Interactive results table, filterable by Dataset and Classifier.]

FAQ

The labels for interpretability in QUACKIE are based on question-answering datasets and the simple observation that the most important sentence for knowing whether the answer to a question is contained in a text must be the sentence containing the answer. For each question-context pair, we formulate a classification task asking whether the question is answerable using only the context. Interpretability is performed with respect to the context and should identify the sentence containing the answer as the most important one.
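To make this concrete, the sketch below shows how a single question-context pair yields both a classification label and an interpretability ground truth. The question, context and sentence split are invented for illustration; real instances come from the underlying question-answering datasets.

# Illustrative QUACKIE instance (the example text is made up).
context = (
    "The Eiffel Tower was completed in 1889. "
    "It is located on the Champ de Mars in Paris. "
    "It remains one of the most visited monuments in the world."
)
question = "Where is the Eiffel Tower located?"

# Classification task: is the question answerable with this context?
label = 1  # answerable

# Interpretability ground truth: the index of the sentence containing the answer.
sentences = context.split(". ")
ground_truth_sentence_idx = 1  # "It is located on the Champ de Mars in Paris"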
We provide two classifiers with QUACKIE in order to give a level playing field. The first is a classifier based on a large RoBERTa model, trained directly on question answerability. The second is an ALBERT-based question answering model, from which we extract probability scores for answerability by looking at answer-span scores.
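The exact models and scoring are shipped with QUACKIE; the minimal sketch below only illustrates the general idea behind the second classifier, namely reading an answerability score out of a span-prediction QA model. The checkpoint name and the span-versus-null aggregation are assumptions made for illustration, not the benchmark's own implementation.

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Any SQuAD 2.0-style checkpoint works here; this particular name is an assumption.
model_name = "twmkn9/albert-base-v2-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

def answerability_score(question: str, context: str) -> float:
    """Compare the best answer-span score against the 'no answer' score."""
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    start, end = out.start_logits[0], out.end_logits[0]
    # SQuAD 2.0 models score the "no answer" option at the [CLS] position (index 0).
    null_score = start[0] + end[0]
    best_span_score = start[1:].max() + end[1:].max()
    # Map the margin to (0, 1); higher means "more likely answerable".
    return torch.sigmoid(best_span_score - null_score).item()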
QUACKIE contains three metrics: the primary metrics IoU and HPD, and the secondary metric SNR. IoU (Intersection over Union) measures an interpreter's ability to correctly identify the most important sentence. HPD (Highest Precision for Detection) measures how highly the ground-truth sentence is placed in the ranking produced by the interpreter. Finally, SNR (Signal to Noise Ratio) measures the selectivity of the interpreter, assessing whether the score of the most important sentence is significantly higher than the scores of the other sentences.
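The paper gives the precise definitions of these metrics; the sketch below is one plausible sentence-level reading of the description above, taking an interpreter's per-sentence importance scores and the index of the ground-truth sentence as input. It is meant to illustrate what each metric captures, not to reproduce the reference implementation (for instance, the paper may report SNR on a different scale).

import numpy as np

def iou(scores, gt_idx):
    # With a single ground-truth sentence and the top-scored sentence as the
    # prediction, intersection over union reduces to an exact-match indicator.
    return float(int(np.argmax(scores)) == gt_idx)

def hpd(scores, gt_idx):
    # Precision obtained when selecting sentences down to the rank of the
    # ground truth, i.e. 1 / rank of the ground-truth sentence.
    rank = np.argsort(scores)[::-1].tolist().index(gt_idx) + 1
    return 1.0 / rank

def snr(scores, gt_idx):
    # Squared gap between the ground-truth score and the mean of the other
    # scores, relative to their variance: a measure of selectivity.
    s = np.asarray(scores, dtype=float)
    others = np.delete(s, gt_idx)
    gap = s[gt_idx] - others.mean()
    return float(gap ** 2 / (others.var() + 1e-12))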
We provide a simple evaluation script, run_experiment.py, in our GitHub repository. To use it, instantiate your interpreter in lines 17 and following. The experiment is run with the --run flag; result analysis is done with the --analyse flag.
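As a rough sketch of how an interpreter plugs into the script, the toy class below assigns a random importance score to every sentence. Its name and method signature are assumptions made for illustration; the interface actually expected by run_experiment.py is documented in the QUACKIE repository.

import random

class RandomSentenceInterpreter:
    """Toy interpreter: gives every sentence of the context a random score."""

    def interpret(self, question, sentences):
        return [random.random() for _ in sentences]

# After instantiating the interpreter inside run_experiment.py:
#   python run_experiment.py --run       # run the experiment
#   python run_experiment.py --analyse   # analyse the results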
Systems may use any public or private data for training. There are, however, two exceptions:
  1. Systems may not leverage knowledge about the QA task to come up with an explanation. For example, using a QA model to answer the question and providing its response as the reasoning is not allowed.
  2. Systems may not use the test data for training, that is, the validation data of SQuAD 2.0 as well as the test data of SQuADShifts.
Results can be submitted with a pull request modifying the results.json file. The pull request should include results for all datasets and classifiers. Furthermore, a link to the method (GitHub page or paper) should be included in the More Info column.
Results may also be submitted for methods you did not develop yourself, and we thank you for your contribution. However, the More Info column should still refer to the original publication.
QUACKIE reuses data from SQuAD 2.0 and SQuADShifts; please refer to their respective websites.