DanTok: Domain Beats Language for Danish Social Media POS Tagging

Kia Kirstein Hansen, Maria Jung Barrett, Max Müller-Eberstein, Cathrine Damgaard, Trine Eriksen, Rob van der Goot

Research output: Conference Article in Proceeding or Book/Report chapter; Article in proceedings; Research; peer-reviewed

Abstract

Language from social media remains challenging to process automatically, especially for non-English languages. In this work, we introduce the first NLP dataset for TikTok comments and the first Danish social media dataset with part-of-speech annotation. We further supply annotations for normalization, code-switching, and annotator uncertainty. As transferring models to such a highly specialized domain is non-trivial, we conduct an extensive study into which source data and modeling decisions most impact the performance. Surprisingly, transferring from in-domain data, even from a different language, outperforms in-language, out-of-domain training. These benefits nonetheless rely on the underlying language models having been at least partially pre-trained on data from the target language. Using our additional annotation layers, we further analyze how normalization, code-switching, and human uncertainty affect the tagging accuracy.
Original language: English
Title of host publication: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Number of pages: 9
Publication date: 2023
Pages: 271–279
Publication status: Published - 2023

Keywords

  • NLP dataset
  • Social media text
  • Part-of-speech annotation
  • Code-switching
  • Language model transfer
