TY - GEN
T1 - We Need to Consider Disagreement in Evaluation
AU - Basile, Valerio
AU - Fell, Michael
AU - Fornaciari, Tommaso
AU - Hovy, Dirk
AU - Paun, Silviu
AU - Plank, Barbara
AU - Poesio, Massimo
AU - Uma, Alexandra
PY - 2021
Y1 - 2021
AB - Evaluation is of paramount importance in data-driven research fields such as Natural Language Processing (NLP) and Computer Vision (CV). But current evaluation practice in NLP, except for end-to-end tasks such as machine translation, spoken dialogue systems, or NLG, largely hinges on the existence of a single “ground truth” against which we can meaningfully compare the prediction of a model. However, this assumption is flawed for two reasons. 1) In many cases, more than one answer is correct. 2) Even where there is a single answer, disagreement among annotators is ubiquitous, making it difficult to decide on a gold standard. We discuss three sources of disagreement: from the annotator, the data, and the context, and show how this affects even seemingly objective tasks. Current methods of adjudication, agreement, and evaluation ought to be reconsidered in light of this evidence. Some researchers now propose to address this issue by minimizing disagreement and creating cleaner datasets. We argue that such a simplification is likely to result in oversimplified models, just as it would for end-to-end tasks such as machine translation. Instead, we suggest that we need to improve today’s evaluation practice to better capture such disagreement. Datasets with multiple annotations are becoming more common, as are methods to integrate disagreement into modeling. The logical next step is to extend this to evaluation.
KW - Natural Language Processing (NLP)
KW - Evaluation methodologies
KW - Annotator disagreement
KW - Ground truth
KW - Data annotation
UR - https://aclanthology.org/2021.bppf-1.3/
DO - 10.18653/v1/2021.bppf-1.3
M3 - Article in proceedings
SP - 15
EP - 21
BT - ACL-IJCNLP 2021 Workshop on Benchmarking: Past, Present and Future
PB - Association for Computational Linguistics
ER -