Abstract
Language typology databases enhance multilingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely used lang2vec toolkit integrates several such databases, but its coverage remains limited, at 28.9%. Previous work on automatically increasing coverage either predicts missing values based on features from other languages or focuses on single features; we propose to use textual data for better-informed feature prediction. To this end, we introduce a multilingual Part-of-Speech (POS) tagger, which achieves over 70% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup that focuses on typology features that are likely to be missing, and show that our approach outperforms previous work in both setups.
| Original language | English |
|---|---|
| Title | Proceedings of the 31st International Conference on Computational Linguistics |
| Editors | Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert |
| Number of pages | 10 |
| Place of publication | Abu Dhabi, UAE |
| Publisher | Association for Computational Linguistics |
| Publication date | 1 Jan 2025 |
| Pages | 6520-6529 |
| Status | Published - 1 Jan 2025 |