CL-MoNoise: Cross-lingual Lexical Normalization

Publication: Conference article in proceedings or book/report chapter › Conference contribution in proceedings › Research › Peer-reviewed

Abstract

Social media is notoriously difficult to process for existing natural language processing tools, because of spelling errors, non-standard words, shortenings, non-standard capitalization and punctuation. One method to circumvent these issues is to normalize input data before processing. Most previous work has focused on only one language, which is mostly English. In this paper, we are the first to propose a model for cross-lingual normalization, with which we participate in the WNUT 2021 shared task. To this end, we use MoNoise as a starting point, and make a simple adaptation for cross-lingual application. Our proposed model outperforms the leave-as-is baseline provided by the organizers which copies the input. Furthermore, we explore a completely different model which converts the task to a sequence labeling task. Performance of this second system is low, as it does not take capitalization into account in our implementation.
Original language: English
Title: Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Number of pages: 4
Publisher: Association for Computational Linguistics
Publication date: Oct. 2021
Pages: 510
Status: Published - Oct. 2021
Event: Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Duration: 11 Nov. 2021 to 11 Nov. 2021
http://noisy-text.github.io/2021/

Conference

Conference: Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Period: 11/11/2021 to 11/11/2021

Keywords

  • Social media
  • Natural language processing
  • Cross-lingual normalization
  • Preprocessing
  • Sequence labeling
