Enough Is Enough! A Case Study on the Effect of Data Size for Evaluation Using Universal Dependencies

Research output: Conference article in proceedings or book/report chapter › Article in proceedings › Research › peer-review

Abstract

When creating a new dataset for evaluation, one of the first considerations is its size. If the evaluation data is too small, we risk drawing unsupported conclusions from the results on such data. If, on the other hand, the data is too large, we waste annotation time and money that could have been used to widen the scope of the evaluation (e.g., annotating more domains or languages). Hence, we investigate how the size of evaluation data and a variety of sampling strategies affect evaluation, in order to optimize annotation efforts, using dependency parsing as a test case. We show that for in-language, in-domain datasets, 5,000 tokens are enough to obtain a reliable ranking of different parsers, especially if the data is distant enough from the training split (otherwise, we recommend 10,000). Cross-domain setups require the same amounts, but in cross-lingual setups far less (2,000 tokens) suffices.
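The procedure the abstract describes (repeatedly subsampling an evaluation set at a fixed token budget and checking whether the resulting parser ranking matches the full-data ranking) can be illustrated with a small simulation. The Python sketch below is not the authors' released code: the parser names, accuracies, noise model, and token budgets are hypothetical stand-ins for real per-sentence evaluation output.

```python
"""Minimal sketch of ranking stability under token-budget subsampling.
All parser names and scores are hypothetical; in a real study they would
come from per-sentence scores of actual parsers on a UD treebank."""
import random

# Hypothetical "true" per-token accuracies of three parsers.
PARSERS = {"parser_A": 0.92, "parser_B": 0.90, "parser_C": 0.85}

def make_treebank(n_sents, rng):
    """Simulate per-sentence results: (token count, correct arcs per parser)."""
    sents = []
    for _ in range(n_sents):
        n = rng.randint(5, 40)
        correct = {p: max(0, min(n, round(n * acc + rng.gauss(0, 2.0))))
                   for p, acc in PARSERS.items()}
        sents.append((n, correct))
    return sents

def ranking(sents):
    """Rank parsers by attachment score over the given sentences."""
    total = sum(n for n, _ in sents)
    score = {p: sum(c[p] for _, c in sents) / total for p in PARSERS}
    return sorted(PARSERS, key=score.get, reverse=True)

def sample_tokens(sents, budget, rng):
    """Random sampling strategy: draw whole sentences until ~budget tokens."""
    out, tokens = [], 0
    for s in rng.sample(sents, len(sents)):
        out.append(s)
        tokens += s[0]
        if tokens >= budget:
            break
    return out

rng = random.Random(0)
treebank = make_treebank(3000, rng)
gold = ranking(treebank)  # ranking on the full evaluation data
for budget in (2000, 5000, 10000):
    hits = sum(ranking(sample_tokens(treebank, budget, rng)) == gold
               for _ in range(200))
    print(f"{budget:>6} tokens: matches full-data ranking in {hits}/200 samples")
```

Under this toy model, larger budgets yield more stable rankings, which is the quantity the paper's recommended thresholds (2,000/5,000/10,000 tokens) are calibrated against.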
Original language: Undefined/Unknown
Title of host publication: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Number of pages: 10
Place of publication: Torino, Italia
Publisher: ELRA and ICCL
Publication date: 1 May 2024
Pages: 6167–6176
Publication status: Published - 1 May 2024
Event: International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy
Duration: 20 May 2024 – 25 May 2024
https://aclanthology.org/2024.lrec-main.544/

Conference

Conference: International Conference on Computational Linguistics, Language Resources and Evaluation
Country/Territory: Italy
City: Torino
Period: 20/05/2024 – 25/05/2024
Internet address: https://aclanthology.org/2024.lrec-main.544/

Keywords

  • dataset size optimization
  • evaluation sampling strategies
  • annotation cost
  • dependency parsing
  • cross-domain cross-lingual evaluation
