Much Gracias: Semi-supervised Code-switch Detection for Spanish-English: How far can we get?
Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review
Because of globalization, it is becoming more and more common to use multiple languages in a single utterance, a phenomenon called code-switching. This results in special linguistic structures and therefore poses many challenges for Natural Language Processing. Existing models for language identification in code-switched data are all supervised, requiring annotated training data, which is only available for a limited number of language pairs. In this paper, we explore semi-supervised approaches that exploit out-of-domain monolingual training data. We experiment with word unigrams, word n-grams, character n-grams, Viterbi decoding, Latent Dirichlet Allocation, Support Vector Machines and Logistic Regression. The Viterbi model was the best semi-supervised model, achieving a weighted F1 score of 92.23%, whereas a fully supervised state-of-the-art BERT-based model scored 98.43%.
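As a rough illustration of how Viterbi decoding can assign a language label to each token using only monolingual data, here is a minimal Python sketch. It is not the paper's implementation: the toy word lists in `MONO`, the character-bigram emission model with add-one smoothing, and the `stay=0.8` transition probability are all assumptions chosen for brevity.

```python
import math
from collections import Counter

# Toy monolingual word lists (illustrative assumptions, not the paper's data).
MONO = {
    "en": "the house is very big and i thank you so much for the food".split(),
    "es": "la casa es muy grande y muchas gracias por la comida amigo".split(),
}


def char_bigrams(word):
    """Return the character bigrams of a word, padded with ^ and $."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(words):
    """Count character bigrams over a monolingual word list."""
    counts = Counter(bg for w in words for bg in char_bigrams(w))
    return counts, sum(counts.values())


MODELS = {lang: train(words) for lang, words in MONO.items()}
# Number of distinct bigrams seen in any language, used for add-one smoothing.
VOCAB = len({bg for counts, _ in MODELS.values() for bg in counts})


def emission_logprob(word, lang):
    """Log-probability of a word under one language's smoothed bigram model."""
    counts, total = MODELS[lang]
    return sum(
        math.log((counts[bg] + 1) / (total + VOCAB)) for bg in char_bigrams(word)
    )


def viterbi(tokens, stay=0.8):
    """Return the most likely language label for each token.

    `stay` is an assumed probability of keeping the same language
    between adjacent tokens; the remainder is shared by the switches.
    """
    if not tokens:
        return []
    langs = list(MODELS)
    log_stay = math.log(stay)
    log_switch = math.log((1 - stay) / (len(langs) - 1))

    # Uniform start distribution over languages.
    scores = {
        l: math.log(1 / len(langs)) + emission_logprob(tokens[0], l) for l in langs
    }
    back = []
    for tok in tokens[1:]:
        new_scores, pointers = {}, {}
        for l in langs:
            # Best previous language, taking the transition cost into account.
            best_prev = max(
                langs,
                key=lambda p: scores[p] + (log_stay if p == l else log_switch),
            )
            trans = log_stay if best_prev == l else log_switch
            new_scores[l] = scores[best_prev] + trans + emission_logprob(tok, l)
            pointers[l] = best_prev
        scores = new_scores
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    path = [max(langs, key=scores.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Calling `viterbi("much gracias amigo".split())` returns one label per token. The `stay` parameter controls how strongly the decoder prefers to remain in the same language between adjacent tokens, which is what distinguishes this from labeling each word independently.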
|Title of host publication||Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching|
|Number of pages||6|
|Publisher||Association for Computational Linguistics|
|Publication date||Jun 2021|
|Publication status||Published - Jun 2021|
|Event||Fifth Workshop on Computational Approaches to Linguistic Code-Switching|
|Duration||11 Jun 2021 → 11 Jun 2021|
|Conference number||5|
|Conference||Fifth Workshop on Computational Approaches to Linguistic Code-Switching|
|Period||11/06/2021 → 11/06/2021|
|Series||Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching|