Abstract
Current language models (LMs) mostly exploit subwords as input units based on statistical co-occurrences of characters. Adjacently, previous work has shown that modeling morphemes can aid performance for Natural Language Processing (NLP) models. However, morphemes are challenging to obtain as there is no annotated data in most languages. In this work, we release a wide-coverage Danish morphological segmentation evaluation set. We evaluate a range of unsupervised token segmenters and assess the downstream effect of using morphemes as input units for transformer-based LMs. Our results show that popular subword algorithms perform poorly on this task, scoring at most an F1 of 57.6 compared to 68.0 for an unsupervised morphological segmenter (Morfessor). Furthermore, we evaluate a range of segmenters on the task of language modeling.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025) |
| Editors | Richard Johansson, Sara Stymne |
| Number of pages | 7 |
| Place of Publication | Tallinn, Estonia |
| Publisher | University of Tartu Library |
| Publication date | 1 Mar 2025 |
| Pages | 223-229 |
| ISBN (Print) | 978-9908-53-109-0 |
| Publication status | Published - 1 Mar 2025 |
| Event | Nordic Conference on Computational Linguistics - Tallinn, Estonia; Duration: 2 Mar 2025 → 5 Mar 2025; Conference number: 25; http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=180454; https://sites.google.com/view/nodalida-bhlt2025/proceedings |
Conference
| Conference | Nordic Conference on Computational Linguistics |
|---|---|
| Number | 25 |
| Country/Territory | Estonia |
| City | Tallinn |
| Period | 02/03/2025 → 05/03/2025 |
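| Internet address | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=180454, https://sites.google.com/view/nodalida-bhlt2025/proceedings |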
Keywords
- Morphological segmentation
- Unsupervised token segmentation
- Danish language processing
- Transformer-based language models
- Morpheme-based input representations
Title
MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling