CL-MoNoise: Cross-lingual Lexical Normalization

Research output: Article in proceedings › Research › peer-review

Abstract

Social media is notoriously difficult to process for existing natural language processing tools because of spelling errors, non-standard words, shortenings, and non-standard capitalization and punctuation. One method to circumvent these issues is to normalize the input data before processing. Most previous work has focused on a single language, mostly English. In this paper, we are the first to propose a model for cross-lingual normalization, with which we participate in the WNUT 2021 shared task. To this end, we use MoNoise as a starting point and make a simple adaptation for cross-lingual application. Our proposed model outperforms the leave-as-is baseline provided by the organizers, which simply copies the input. Furthermore, we explore a completely different model, which converts the task to a sequence labeling task. Performance of this second system is low, as our implementation does not take capitalization into account.
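The sequence-labeling reformulation mentioned in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: each input token is assigned a label that is either a keep marker or its normalized replacement, so an off-the-shelf tagger can be trained to predict the labels. The token names and the `<KEEP>` label are assumptions for the sketch, and, like the paper's second system, it ignores capitalization.

```python
# Hypothetical sketch: lexical normalization framed as sequence labeling.
# Each raw token gets a label that is either a KEEP marker or the
# normalized word itself, assuming a 1-to-1 token alignment.
KEEP = "<KEEP>"

def to_labels(raw_tokens, norm_tokens):
    """Derive per-token labels from a (raw, normalized) sentence pair."""
    return [KEEP if r == n else n for r, n in zip(raw_tokens, norm_tokens)]

def apply_labels(raw_tokens, labels):
    """Reconstruct the normalized sentence from predicted labels."""
    return [r if lab == KEEP else lab for r, lab in zip(raw_tokens, labels)]

raw = ["new", "pix", "comming", "tomoroe"]
norm = ["new", "pictures", "coming", "tomorrow"]
labels = to_labels(raw, norm)
print(labels)                              # ['<KEEP>', 'pictures', 'coming', 'tomorrow']
print(apply_labels(raw, labels) == norm)   # True
```

Framing the task this way reduces normalization to standard token classification, at the cost of handling one-to-many replacements and, as noted above, capitalization.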
Original language: English
Title of host publication: Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Number of pages: 4
Publisher: Association for Computational Linguistics
Publication date: Oct 2021
Pages: 510
Publication status: Published - Oct 2021
Event: Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Duration: 11 Nov 2021 – 11 Nov 2021
http://noisy-text.github.io/2021/

Conference

Conference: Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Period: 11/11/2021 – 11/11/2021
Internet address: http://noisy-text.github.io/2021/

Keywords

  • Social media
  • Natural language processing
  • Cross-lingual normalization
  • Preprocessing
  • Sequence labeling
