data2lang2vec: Data Driven Typological Features Completion

Publication: Conference article in proceedings or book/report chapter; conference contribution in proceedings; research; peer-reviewed

Abstract

Language typology databases enhance multilingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9%. Previous work on automatically increasing coverage either predicts missing values from the features of other languages or focuses on single features; we propose to use textual data for better-informed feature prediction. To this end, we introduce a multilingual Part-of-Speech (POS) tagger that achieves over 70% accuracy across 1,749 languages, and we experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup that focuses on typology features that are likely to be missing, and show that our approach outperforms previous work in both setups.
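
As a rough illustration of the coverage figure quoted in the abstract, the sketch below assumes that coverage means the share of language-feature cells holding a known value. The toy matrix, its language names, and the `coverage` helper are hypothetical and are not part of lang2vec or of the paper's code.

```python
from typing import Optional

# Toy language-by-feature matrix; None marks a missing value.
# Language names and feature values are invented for illustration only.
FEATURES: dict[str, list[Optional[int]]] = {
    "lang_a": [1, None, 0, None],
    "lang_b": [None, None, 1, None],
    "lang_c": [0, 1, None, None],
}


def coverage(matrix: dict[str, list[Optional[int]]]) -> float:
    """Return the fraction of cells that hold a known (non-None) value."""
    total = sum(len(row) for row in matrix.values())
    known = sum(value is not None for row in matrix.values() for value in row)
    # An empty matrix has no cells, so report zero coverage rather than divide by zero.
    return known / total if total else 0.0


print(f"coverage: {coverage(FEATURES):.1%}")
```

Under this assumed definition, predicting a missing feature amounts to filling a `None` cell, which raises the reported coverage.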
Original language: English
Title: Proceedings of the 31st International Conference on Computational Linguistics
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Number of pages: 10
Place of publication: Abu Dhabi, UAE
Publisher: Association for Computational Linguistics
Publication date: 1 Jan 2025
Pages: 6520-6529
Status: Published - 1 Jan 2025
