Much Gracias: Semi-supervised Code-switch Detection for Spanish-English: How far can we get?

Dana-Maria Iliescu, Rasmus Grand, Rob van der Goot, Sara Qirko

Publication: Conference article in proceedings or book/report chapter › Conference contribution in proceedings › Research › Peer-reviewed

Abstract

Because of globalization, it is becoming increasingly common to use multiple languages in a single utterance, a phenomenon known as code-switching. This results in special linguistic structures and therefore poses many challenges for Natural Language Processing. Existing models for language identification in code-switched data are all supervised, requiring annotated training data, which is only available for a limited number of language pairs. In this paper, we explore semi-supervised approaches that exploit out-of-domain monolingual training data. We experiment with word unigrams, word n-grams, character n-grams, Viterbi decoding, Latent Dirichlet Allocation, Support Vector Machines and Logistic Regression. The Viterbi model was the best semi-supervised model, with a weighted F1 score of 92.23%, whereas a fully supervised state-of-the-art BERT-based model scored 98.43%.
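
To make the Viterbi idea concrete, the following is a minimal sketch of word-level language tagging that uses only monolingual word lists. It is not the authors' implementation. The character-bigram emission model, the add-one smoothing, the fixed `stay_prob` transition probability, the toy word lists and all class and function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return character n-grams of a lowercased word padded with ^ and $."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Add-one smoothed character n-gram model built from monolingual words."""

    def __init__(self, words, n=2):
        self.n = n
        self.counts = Counter(g for w in words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def log_prob(self, word):
        """Log-probability of a word as a sum over its character n-grams."""
        return sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in char_ngrams(word, self.n)
        )


def viterbi_language_tags(tokens, models, stay_prob=0.8):
    """Find the most likely language sequence for the given tokens.

    models maps a language label to a CharNgramModel (emission scores).
    stay_prob is an assumed probability of keeping the same language from
    one token to the next; the remainder is shared by the other languages.
    """
    if not tokens:
        return []
    langs = list(models)
    switch_prob = (1 - stay_prob) / max(len(langs) - 1, 1)
    log_trans = {
        (a, b): math.log(stay_prob if a == b else switch_prob)
        for a in langs
        for b in langs
    }

    # Initialisation: uniform prior over languages plus the first emission.
    scores = {
        lang: -math.log(len(langs)) + models[lang].log_prob(tokens[0])
        for lang in langs
    }
    backpointers = []

    # Recursion: keep the best-scoring predecessor for every language.
    for token in tokens[1:]:
        new_scores, pointers = {}, {}
        for cur in langs:
            prev = max(langs, key=lambda p: scores[p] + log_trans[(p, cur)])
            new_scores[cur] = (
                scores[prev] + log_trans[(prev, cur)] + models[cur].log_prob(token)
            )
            pointers[cur] = prev
        scores = new_scores
        backpointers.append(pointers)

    # Backtrace from the best final state.
    path = [max(langs, key=scores.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy monolingual word lists (hypothetical; a real system would use large corpora).
models = {
    "en": CharNgramModel(["much", "thanks", "the", "house", "with", "friend"]),
    "es": CharNgramModel(["muchas", "gracias", "la", "casa", "con", "amigo"]),
}
print(viterbi_language_tags("much gracias my amigo".split(), models))
```

The transition term is what distinguishes this from tagging each word independently: a high `stay_prob` discourages spurious switches on ambiguous words, while a lower value lets the tagger follow the character statistics more closely.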

Original language: English
Title: Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Number of pages: 6
Publisher: Association for Computational Linguistics
Publication date: Jun 2021
Pages: 65
Status: Published - Jun 2021
Event: Fifth Workshop on Computational Approaches to Linguistic Code-Switching -
Duration: 11 Jun 2021 to 11 Jun 2021
Conference number: 5

Conference

Conference: Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Number: 5
Period: 11/06/2021 to 11/06/2021
Name: Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Keywords

  • Globalization
  • Code-switching
  • Natural Language Processing
  • Semi-supervised learning
  • Language identification
