MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling

Rob van der Goot, Anette Jensen, Emil Allerslev Schledermann, Mikkel Wildner Kildeberg, Nicolaj Larsen, Mike Zhang, Elisa Bassignana

Publication: Conference article in proceedings or book/report chapter; conference contribution in proceedings; research; peer-reviewed

Abstract

Current language models (LMs) mostly exploit subwords as input units based on statistical co-occurrences of characters. Adjacently, previous work has shown that modeling morphemes can aid performance for Natural Language Processing (NLP) models. However, morphemes are challenging to obtain as there is no annotated data in most languages. In this work, we release a wide-coverage Danish morphological segmentation evaluation set. We evaluate a range of unsupervised token segmenters and evaluate the downstream effect of using morphemes as input units for transformer-based LMs. Our results show that popular subword algorithms perform poorly on this task, scoring at most an F1 of 57.6 compared to 68.0 for an unsupervised morphological segmenter (Morfessor). Furthermore, we evaluate a range of segmenters on the task of language modeling.
Original language: English
Title: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Editors: Richard Johansson, Sara Stymne
Number of pages: 7
Place of publication: Tallinn, Estonia
Publisher: University of Tartu Library
Publication date: 1 Mar. 2025
Pages: 223-229
ISBN (print): 978-9908-53-109-0
Status: Published - 1 Mar. 2025
