Abstract
Social media is notoriously difficult to process for existing natural language processing tools because of spelling errors, non-standard words, shortenings, and non-standard capitalization and punctuation. One way to circumvent these issues is to normalize the input data before processing. Most previous work has focused on a single language, usually English. In this paper, we are the first to propose a model for cross-lingual normalization, with which we participate in the WNUT 2021 shared task. To this end, we use MoNoise as a starting point and make a simple adaptation for cross-lingual application. Our proposed model outperforms the leave-as-is baseline provided by the organizers, which copies the input. Furthermore, we explore a completely different model that converts the task to a sequence labeling task. Performance of this second system is low, as our implementation does not take capitalization into account.
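To make the leave-as-is baseline concrete, the following is a minimal sketch, not the organizers' evaluation code. It assumes a one-to-one alignment between raw and gold tokens and uses plain word-level accuracy as the score. The function names `leave_as_is` and `word_accuracy` and the example sentence are hypothetical and do not come from the shared task data.

```python
from typing import List


def leave_as_is(tokens: List[str]) -> List[str]:
    """Baseline sketch: predict every input token unchanged."""
    return list(tokens)


def word_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of tokens whose prediction matches the gold normalization.

    Assumes one prediction per gold token (an assumption of this sketch).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical example sentence, invented for illustration only.
raw = ["u", "r", "so", "funy", "lol"]
gold = ["you", "are", "so", "funny", "lol"]

baseline_prediction = leave_as_is(raw)
baseline_accuracy = word_accuracy(baseline_prediction, gold)
```

In this toy example only the tokens that need no normalization ("so" and "lol") are scored as correct, which is why any useful normalization model is expected to beat this baseline.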
Original language | English |
---|---|
Title | Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021) |
Number of pages | 4 |
Publisher | Association for Computational Linguistics |
Publication date | Oct. 2021 |
Pages | 510 |
Status | Published - Oct. 2021 |
Event | Seventh Workshop on Noisy User-generated Text (W-NUT 2021) - Duration: 11 Nov. 2021 → 11 Nov. 2021 http://noisy-text.github.io/2021/ |
Conference
Conference | Seventh Workshop on Noisy User-generated Text (W-NUT 2021) |
---|---|
Period | 11/11/2021 → 11/11/2021 |
Internet address |
Keywords
- Social media
- Natural language processing
- Cross-lingual normalization
- Preprocessing
- Sequence labeling
Fingerprint
Dive into the research topics of 'CL-MoNoise: Cross-lingual Lexical Normalization'. Together they form a unique fingerprint.
Projects
- 1 Finished
- Multi-Task Sequence Labeling Under Adverse Conditions
  Plank, B. (PI) & van der Goot, R. (CoI)
  01/04/2019 → 31/08/2020
  Projects: Project › Other