TY - GEN
T1 - DanTok: Domain Beats Language for Danish Social Media POS Tagging
AU - Hansen, Kia Kirstein
AU - Barrett, Maria Jung
AU - Müller-Eberstein, Max
AU - Damgaard, Cathrine
AU - Eriksen, Trine
AU - van der Goot, Rob
PY - 2023
Y1 - 2023
N2 - Language from social media remains challenging to process automatically, especially for non-English languages. In this work, we introduce the first NLP dataset for TikTok comments and the first Danish social media dataset with part-of-speech annotation. We further supply annotations for normalization, code-switching, and annotator uncertainty. As transferring models to such a highly specialized domain is non-trivial, we conduct an extensive study into which source data and modeling decisions most impact the performance. Surprisingly, transferring from in-domain data, even from a different language, outperforms in-language, out-of-domain training. These benefits nonetheless rely on the underlying language models having been at least partially pre-trained on data from the target language. Using our additional annotation layers, we further analyze how normalization, code-switching, and human uncertainty affect the tagging accuracy.
KW - NLP dataset
KW - Social media text
KW - Part-of-speech annotation
KW - Code-switching
KW - Language model transfer
M3 - Article in proceedings
SP - 271
EP - 279
BT - Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
ER -