Identifying Open Challenges in Language Identification

Research output: Conference Article in Proceeding or Book/Report chapter; Article in proceedings; Research; peer-review

Abstract

Automatic language identification is a core problem of many Natural Language Processing (NLP) pipelines. A wide variety of architectures and benchmarks have been proposed with often near-perfect performance. Although previous studies have focused on certain challenging setups (i.e. cross-domain, short inputs), a systematic comparison is missing. We propose a benchmark that allows us to test for the effect of input size, training data size, domain, number of languages, scripts, and language families on performance. We evaluate five popular models on this benchmark and identify which open challenges remain for this task as well as which architectures achieve robust performance. We find that cross-domain setups are the most challenging (although arguably most relevant), and that number of languages, variety in scripts, and variety in language families have only a small impact on performance. We also contribute practical takeaways: training with 1,000 instances per language and a maximum input length of 100 characters is enough for robust language identification. Based on our findings, we train an accurate (94.41%) multi-domain language identification model on 2,034 languages, for which we also provide an analysis of the remaining errors.
Original language: English
Title of host publication: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Number of pages: 21
Place of Publication: Vienna, Austria
Publisher: Association for Computational Linguistics
Publication date: 1 Jul 2025
Pages: 18207-18227
ISBN (Print): 979-8-89176-251-0
DOIs
Publication status: Published - 1 Jul 2025
Event: Annual Meeting of the Association for Computational Linguistics - Vienna, Austria
Duration: 27 Jul 2025 to 1 Aug 2025
Conference number: 63
https://aclanthology.org/volumes/2025.findings-acl/
https://2025.aclweb.org/

Conference

Conference: Annual Meeting of the Association for Computational Linguistics
Number: 63
Country/Territory: Austria
City: Vienna
Period: 27/07/2025 to 01/08/2025
Internet address

Keywords

  • Language identification
  • Cross-domain evaluation
  • Multilingual benchmarks
  • Script variation and language families
  • Training data efficiency
