Abstract
Automatic language identification is a core problem of many Natural Language Processing (NLP) pipelines. A wide variety of architectures and benchmarks have been proposed, often with near-perfect performance. Although previous studies have focused on certain challenging setups (e.g., cross-domain, short inputs), a systematic comparison is missing. We propose a benchmark that allows us to test for the effect of input size, training data size, domain, number of languages, scripts, and language families on performance. We evaluate five popular models on this benchmark and identify which open challenges remain for this task as well as which architectures achieve robust performance. We find that cross-domain setups are the most challenging (although arguably most relevant), and that number of languages, variety in scripts, and variety in language families have only a small impact on performance. We also contribute practical takeaways: training with 1,000 instances per language and a maximum input length of 100 characters is enough for robust language identification. Based on our findings, we train an accurate (94.41%) multi-domain language identification model on 2,034 languages, for which we also provide an analysis of the remaining errors.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |
| Editors | Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar |
| Number of pages | 21 |
| Place of Publication | Vienna, Austria |
| Publisher | Association for Computational Linguistics |
| Publication date | 1 Jul 2025 |
| Pages | 18207-18227 |
| ISBN (Print) | 979-8-89176-251-0 |
| Publication status | Published - 1 Jul 2025 |
| Event | Annual Meeting of the Association for Computational Linguistics, Vienna, Austria. Duration: 27 Jul 2025 → 1 Aug 2025. Conference number: 63. https://aclanthology.org/volumes/2025.findings-acl/, https://2025.aclweb.org/ |
Conference
| Conference | Annual Meeting of the Association for Computational Linguistics |
|---|---|
| Number | 63 |
| Country/Territory | Austria |
| City | Vienna |
| Period | 27/07/2025 → 01/08/2025 |
Keywords
- Language identification
- Cross-domain evaluation
- Multilingual benchmarks
- Script variation and language families
- Training data efficiency