Challenges in Annotating and Parsing Spoken, Code-switched, Frisian-Dutch Data

Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review

Abstract

While high performance has been obtained for high-resource languages, performance on low-resource languages lags behind. In this paper we focus on parsing the low-resource language Frisian. We use a sample of code-switched, spontaneously spoken data, which proves to be a challenging setup. We propose to train a parser specifically tailored to the target domain by selecting instances from multiple treebanks. Specifically, we use Latent Dirichlet Allocation (LDA) with word and character n-grams. We use a deep biaffine parser initialized with mBERT. The best single source treebank (nl_alpino) resulted in an LAS of 54.7, whereas our data selection outperformed this best single transfer treebank and led to an LAS of 55.6 on the test data. Additional experiments consisted of removing diacritics from our Frisian data, creating more similar training data by cropping sentences, and running our best model with XLM-R. These experiments did not improve performance.
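The instance selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_similar_sentences`, the n-gram ranges, the number of topics, and the ranking by cosine similarity to the mean target topic distribution are all assumptions made for illustration, using scikit-learn's LDA.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_union


def select_similar_sentences(candidates, target, k, n_topics=5, seed=0):
    """Return indices of the k candidate sentences closest to the target domain.

    Hypothetical helper: the ranking criterion is an assumption, not the
    procedure from the paper.
    """
    # Word and character n-gram counts (ranges chosen for illustration only).
    features = make_union(
        CountVectorizer(analyzer="word", ngram_range=(1, 2)),
        CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    )
    counts = features.fit_transform(list(candidates) + list(target))

    # Fit LDA on all sentences; each row is a per-sentence topic mixture.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    topics = lda.fit_transform(counts)

    cand_topics = topics[: len(candidates)]
    target_mean = topics[len(candidates):].mean(axis=0)

    # Cosine similarity between each candidate and the mean target mixture.
    norms = np.linalg.norm(cand_topics, axis=1) * np.linalg.norm(target_mean)
    sims = (cand_topics @ target_mean) / norms

    # Highest similarity first; keep the top k indices.
    return np.argsort(-sims)[:k].tolist()
```

In such a setup, `candidates` would hold sentences pooled from several source treebanks and `target` a sample of the spoken Frisian-Dutch data; the selected indices would then define the training set for the parser.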
Original language: English
Title of host publication: Proceedings of the Second Workshop on Domain Adaptation for NLP
Publisher: Association for Computational Linguistics
Publication date: Apr 2021
Pages: 50-58
Publication status: Published - Apr 2021
Event: Workshop on Domain Adaptation for NLP - Kyiv, Ukraine
Duration: 20 Apr 2021 → …
Conference number: 2
https://adapt-nlp.github.io/Adapt-NLP-2021/

Workshop

Workshop: Workshop on Domain Adaptation for NLP
Number: 2
Country/Territory: Ukraine
City: Kyiv
Period: 20/04/2021 → …
Other: Workshop held at EACL conference

Keywords

  • Low-resource languages
  • Frisian parsing
  • Code-switching
  • Spontaneous speech
  • Treebank selection
  • Latent Dirichlet Allocation (LDA)
  • Deep biaffine parser
  • mBERT
  • Diacritic removal
  • XLM-R
