Abstract
Language from social media remains challenging to process automatically, especially for non-English languages. In this work, we introduce the first NLP dataset for TikTok comments and the first Danish social media dataset with part-of-speech annotation. We further supply annotations for normalization, code-switching, and annotator uncertainty. As transferring models to such a highly specialized domain is non-trivial, we conduct an extensive study into which source data and modeling decisions most impact the performance. Surprisingly, transferring from in-domain data, even from a different language, outperforms in-language, out-of-domain training. These benefits nonetheless rely on the underlying language models having been at least partially pre-trained on data from the target language. Using our additional annotation layers, we further analyze how normalization, code-switching, and human uncertainty affect the tagging accuracy.
| Original language | English |
|---|---|
| Title | Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) |
| Number of pages | 9 |
| Publication date | 2023 |
| Pages | 271–279 |
| Status | Published - 2023 |
| Event | Nordic Conference on Computational Linguistics, Tórshavn, Faroe Islands; Duration: 22 May 2023 → 24 May 2023; Conference number: 24; https://www.nodalida2023.fo/ |
Conference
| Conference | Nordic Conference on Computational Linguistics |
|---|---|
| Number | 24 |
| Country/Territory | Faroe Islands |
| City | Tórshavn |
| Period | 22 May 2023 → 24 May 2023 |
| Internet address | https://www.nodalida2023.fo/ |
Keywords
- NLP dataset
- Social media text
- Part-of-speech annotation
- Code-switching
- Language model transfer
Fingerprint
Dive into the research topics of 'DanTok: Domain Beats Language for Danish Social Media POS Tagging'. Together they form a unique fingerprint.