Identifying Open Challenges in Language Identification

Publication: Conference article in proceedings or book/report chapter › Conference contribution in proceedings › Research › peer review

Abstract

Automatic language identification is a core problem of many Natural Language Processing (NLP) pipelines. A wide variety of architectures and benchmarks have been proposed, often with near-perfect performance. Although previous studies have focused on certain challenging setups (i.e. cross-domain, short inputs), a systematic comparison is missing. We propose a benchmark that allows us to test for the effect of input size, training data size, domain, number of languages, scripts, and language families on performance. We evaluate five popular models on this benchmark and identify which open challenges remain for this task as well as which architectures achieve robust performance. We find that cross-domain setups are the most challenging (although arguably most relevant), and that the number of languages, variety in scripts, and variety in language families have only a small impact on performance. We also contribute practical takeaways: training with 1,000 instances per language and a maximum input length of 100 characters is enough for robust language identification. Based on our findings, we train an accurate (94.41%) multi-domain language identification model on 2,034 languages, for which we also provide an analysis of the remaining errors.
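The practical takeaway in the abstract (capping inputs at 100 characters while relying on simple character-level evidence) can be illustrated with a minimal character-trigram classifier in the style of Cavnar and Trenkle. This is a toy sketch, not the paper's actual model: the two-language corpus, the trigram order, and the cosine scoring are all illustrative choices, and real systems would use on the order of 1,000 training instances per language as the paper recommends.

```python
from collections import Counter

MAX_LEN = 100  # cap input length, following the paper's takeaway

def profile(text, n=3):
    """Character n-gram frequency profile of a (truncated) text."""
    text = text[:MAX_LEN].lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(p, q):
    """Cosine similarity between two n-gram frequency profiles."""
    shared = set(p) & set(q)
    num = sum(p[g] * q[g] for g in shared)
    den = (sum(v * v for v in p.values()) ** 0.5) * \
          (sum(v * v for v in q.values()) ** 0.5)
    return num / den if den else 0.0

def identify(text, lang_profiles):
    """Return the language whose pooled profile best matches the input."""
    probe = profile(text)
    return max(lang_profiles, key=lambda lang: similarity(probe, lang_profiles[lang]))

# Tiny illustrative corpus (real training data would be far larger).
corpus = {
    "eng": ["the quick brown fox jumps over the lazy dog",
            "this is a sentence written in english"],
    "deu": ["der schnelle braune fuchs springt über den faulen hund",
            "dies ist ein satz auf deutsch geschrieben"],
}
# Pool each language's training texts into one profile.
profiles = {lang: sum((profile(s) for s in texts), Counter())
            for lang, texts in corpus.items()}

print(identify("ein kurzer deutscher satz", profiles))
```

The truncation in `profile` is the operative detail: since 100 characters suffice for robust identification, longer inputs can be cut before featurization at essentially no cost in accuracy.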
Original language: English
Title: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Number of pages: 21
Place of publication: Vienna, Austria
Publisher: Association for Computational Linguistics
Publication date: 1 Jul 2025
Pages: 18207-18227
ISBN (Print): 979-8-89176-251-0
DOI
Status: Published - 1 Jul 2025
