Social media text is notoriously difficult for existing natural language processing tools to process, because of spelling errors, non-standard words, shortenings, and non-standard capitalization and punctuation. One way to circumvent these issues is to normalize the input data before processing. Most previous work has focused on a single language, usually English. In this paper, we are the first to propose a model for cross-lingual normalization, with which we participate in the WNUT 2021 shared task. To this end, we use MoNoise as a starting point and make a simple adaptation for cross-lingual application. Our proposed model outperforms the leave-as-is baseline provided by the organizers, which simply copies the input. Furthermore, we explore a completely different model that converts normalization into a sequence labeling task. Performance of this second system is low, because our implementation does not take capitalization into account.
|Title||Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)|
|Publisher||Association for Computational Linguistics|
|Status||Published - Oct 2021|
|Event||Seventh Workshop on Noisy User-generated Text (W-NUT 2021)|
|Duration||11 Nov 2021 → 11 Nov 2021|
|Conference||Seventh Workshop on Noisy User-generated Text (W-NUT 2021)|
|Period||11/11/2021 → 11/11/2021|