We Need to Talk About train-dev-test Splits

Research output: Conference Article in Proceeding or Book/Report chapter > Article in proceedings > Research > peer-review

Abstract

Standard train-dev-test splits used to benchmark multiple models against each other are ubiquitous in Natural Language Processing (NLP). In this setup, the train data is used for training the model, the development set for evaluating different versions of the proposed model(s) during development, and the test set to confirm the answers to the main research question(s). However, the introduction of neural networks in NLP has led to a different use of these standard splits; the development set is now often used for model selection during the training procedure. Because of this, comparing multiple versions of the same model during development leads to overestimation on the development data. As a result, people have started to compare an increasing number of models on the test data, leading to faster overfitting and "expiration" of our test sets. We propose to use a tune-set when developing neural network methods, which can be used for model picking so that comparing the different versions of a new model can safely be done on the development data.
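
The proposed four-way setup (train, tune, dev, test) can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `split_train_tune` and `pick_checkpoint`, the 10% tune fraction, and the choice to carve the tune set out of the training data are all assumptions made for illustration, since the abstract does not specify where the tune set comes from.

```python
import random


def split_train_tune(train_examples, tune_fraction=0.1, seed=0):
    """Hold out part of the training data as a tune set.

    Hypothetical helper: taking the tune set from the training data and
    using a 10% fraction are assumptions, not details from the abstract.
    Returns (train_part, tune_part) as two disjoint lists.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(train_examples)
    rng.shuffle(shuffled)
    n_tune = max(1, int(len(shuffled) * tune_fraction))
    return shuffled[n_tune:], shuffled[:n_tune]


def pick_checkpoint(checkpoints, tune_set, score_fn):
    """Model picking: return the checkpoint that scores highest on the tune set.

    score_fn(checkpoint, data) is a caller-supplied scoring function
    (hypothetical signature). The development set is never touched here,
    so it stays available for comparing model versions.
    """
    return max(checkpoints, key=lambda ckpt: score_fn(ckpt, tune_set))
```

Under these assumptions, checkpoint selection during training would use only the tune set, the different versions of a model would then be compared on the development set, and the test set would be reserved for the final research question.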
Original language: English
Title of host publication: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Number of pages: 9
Publisher: Association for Computational Linguistics
Publication date: Oct 2021
Pages: 4485
Publication status: Published - Oct 2021
Event: The 2021 Conference on Empirical Methods in Natural Language Processing - Punta Cana, Dominican Republic
Duration: 7 Nov 2021 to 12 Nov 2021
https://2021.emnlp.org/

Conference

Conference: The 2021 Conference on Empirical Methods in Natural Language Processing
Country/Territory: Dominican Republic
City: Punta Cana
Period: 07/11/2021 to 12/11/2021

Keywords

  • Natural Language Processing (NLP)
  • train-dev-test splits
  • neural networks
  • model overfitting
  • development set usage
