MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling

Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review

Abstract

Current language models (LMs) mostly exploit subwords as input units based on statistical co-occurrences of characters. Adjacently, previous work has shown that modeling morphemes can aid performance for Natural Language Processing (NLP) models. However, morphemes are challenging to obtain as there is no annotated data in most languages. In this work, we release a wide-coverage Danish morphological segmentation evaluation set. We evaluate a range of unsupervised token segmenters and evaluate the downstream effect of using morphemes as input units for transformer-based LMs. Our results show that popular subword algorithms perform poorly on this task, scoring at most an F1 of 57.6 compared to 68.0 for an unsupervised morphological segmenter (Morfessor). Furthermore, we evaluate a range of segmenters on the task of language modeling.
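
The abstract reports F1 scores for segmenters but does not say how the score is computed. As a minimal sketch, assuming a boundary-level F1 (a common choice for morphological segmentation, not necessarily the one used in the paper), the comparison between gold and predicted segmentations could look like this. The function names `boundaries` and `boundary_f1` and the example words are hypothetical and chosen for illustration only.

```python
def boundaries(segments):
    """Return the internal boundary offsets of a segmented word.

    ["bil", "er", "ne"] -> {3, 5}; an unsegmented word has no boundaries.
    """
    positions = set()
    offset = 0
    for segment in segments[:-1]:
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_f1(gold, predicted):
    """Micro-averaged boundary precision, recall and F1 over a word list.

    Assumption: gold[i] and predicted[i] segment the same word.
    """
    tp = fp = fn = 0
    for gold_segments, pred_segments in zip(gold, predicted):
        if "".join(gold_segments) != "".join(pred_segments):
            raise ValueError("gold and predicted segment different words")
        gold_b = boundaries(gold_segments)
        pred_b = boundaries(pred_segments)
        tp += len(gold_b & pred_b)   # boundaries found in both
        fp += len(pred_b - gold_b)   # predicted but not in gold
        fn += len(gold_b - pred_b)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative data only: "huset" (the house), "bilerne" (the cars).
gold = [["hus", "et"], ["bil", "er", "ne"]]
predicted = [["hu", "set"], ["bil", "erne"]]
precision, recall, f1 = boundary_f1(gold, predicted)
```

In this toy example one of the two predicted boundaries is correct and one of the three gold boundaries is recovered, so precision is 0.5 and recall is 1/3. Micro-averaging over boundaries is one design choice; a per-word macro average would weight short and long words differently.
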
Original language: English
Title of host publication: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Editors: Richard Johansson, Sara Stymne
Number of pages: 7
Place of Publication: Tallinn, Estonia
Publisher: University of Tartu Library
Publication date: 1 Mar 2025
Pages: 223-229
ISBN (Print): 978-9908-53-109-0
Publication status: Published - 1 Mar 2025
Event: Nordic Conference on Computational Linguistics - Tallinn, Estonia
Duration: 2 Mar 2025 - 5 Mar 2025
Conference number: 25
http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=180454
https://sites.google.com/view/nodalida-bhlt2025/proceedings

Conference

Conference: Nordic Conference on Computational Linguistics
Number: 25
Country/Territory: Estonia
City: Tallinn
Period: 02/03/2025 - 05/03/2025

Keywords

  • Morphological segmentation
  • Unsupervised token segmentation
  • Danish language processing
  • Transformer-based language models
  • Morpheme-based input representations
