Evaluating NLP interpretability is hard, since the actual reasoning of a black-box classifier is unknown. The QUestion and Answering for Classification tasK Interpretability Evaluation benchmark takes a novel approach by formulating a classification task for which interpretability ground-truth labels arise directly from the definition of the classification problem. It is based on question answering datasets, with the classification task being question answerability. The ground-truth interpretation for classifying a question as answerable given a context is the sentence containing the answer.
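To make this construction concrete, the following minimal sketch (assuming a SQuAD-style example; the function `build_classification_example` and all example data are illustrative, not part of the benchmark code) shows how the classification label and the ground-truth sentence index can be derived from a QA example.

```python
# Illustrative sketch (not the benchmark's actual code): turn a SQuAD-style
# QA example into an answerability-classification example whose ground-truth
# interpretation is the sentence containing the answer.
import re

def build_classification_example(context: str, question: str, answer_start: int):
    """Return (input_pair, label, ground_truth_sentence_index)."""
    # Naive sentence split on sentence-final punctuation; a real pipeline
    # would use a proper sentence tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", context)

    # Locate the sentence containing the answer span's start character.
    offset, gt_index = 0, None
    for i, sent in enumerate(sentences):
        if offset <= answer_start < offset + len(sent):
            gt_index = i
            break
        offset += len(sent) + 1  # +1 for the stripped whitespace

    label = 1  # the question is answerable from this context
    return (question, context), label, gt_index

context = ("The Eiffel Tower was completed in 1889. "
           "It is located on the Champ de Mars in Paris.")
question = "When was the Eiffel Tower completed?"
answer_start = context.index("1889")

pair, label, gt = build_classification_example(context, question, answer_start)
print(label, gt)  # 1 0 -> answerable, ground-truth interpretation is sentence 0
```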
Experiments are run with run_experiment.py. For usage, instantiate the interpreter in lines 17 and following. The experiment can be run with the --run flag; result analysis is done with the --analyse flag.
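To illustrate the plug-in point, here is a purely hypothetical interpreter sketch: the class name `RandomInterpreter`, the `interpret` method, and its signature are assumptions and must be adapted to the interface actually expected by run_experiment.py.

```python
# Hypothetical sketch of a custom interpreter. The class and method names
# below are assumptions for illustration, not the benchmark's actual
# interface; the real instantiation point is in run_experiment.py,
# lines 17 and following.
import random


class RandomInterpreter:
    """Scores every context sentence at random -- a trivial baseline."""

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)

    def interpret(self, question: str, sentences: list[str]) -> list[float]:
        # One importance score per sentence; a benchmark of this kind would
        # compare the highest-scoring sentence to the ground-truth sentence.
        return [self._rng.random() for _ in sentences]


# Hypothetical instantiation, to be adapted to the interface expected
# in run_experiment.py:
interpreter = RandomInterpreter(seed=42)
```

Once the interpreter is plugged in, the experiment is launched with `python run_experiment.py --run` and analysed with `python run_experiment.py --analyse`; only these two flags are documented above, so any further arguments are an assumption.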
To submit results, open a pull request with the results.json file. The pull request should include results for all datasets and classifiers. Further, a link to the method (GitHub page or paper) should be included in the More Info column.