Abstract
This paper explores the difficulties of annotating transcribed spoken Dutch-Frisian code-switched utterances with Universal Dependencies. We make use of data from the FAME! corpus, which consists of transcriptions and audio data. Besides the usual annotation difficulties, this dataset is especially challenging because Frisian is low-resource, the data is informal, and it contains code-switching and non-standard sentence segmentation. As a starting point, two annotators annotated 150 random utterances in three stages of 50 utterances each. After each stage, disagreements were discussed and resolved. An increase of 7.8 UAS and 10.5 LAS points was achieved between the first and third rounds. This paper focuses on the issues that arise when annotating a transcribed speech corpus. To resolve these issues, several solutions are proposed.
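The UAS and LAS figures above can be made concrete with a small sketch. It assumes the scores compare two annotations of the same tokens: UAS counts tokens that receive the same head, and LAS counts tokens that receive the same head and the same relation label. The function name `attachment_scores`, the tuple representation, and the four-token example are hypothetical and are not taken from the paper, whose exact evaluation procedure is not described in this record.

```python
from typing import List, Tuple

# One token's annotation: (head index, dependency relation label).
Arc = Tuple[int, str]


def attachment_scores(a: List[Arc], b: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for two annotations of the same tokens."""
    if len(a) != len(b):
        raise ValueError("both annotations must cover the same tokens")
    if not a:
        raise ValueError("cannot score an empty annotation")
    # UAS: the head matches, whatever the label.
    same_head = sum(1 for (ha, _), (hb, _) in zip(a, b) if ha == hb)
    # LAS: both the head and the relation label match.
    same_arc = sum(1 for x, y in zip(a, b) if x == y)
    n = len(a)
    return 100.0 * same_head / n, 100.0 * same_arc / n


# Hypothetical four-token utterance; heads are 1-based, 0 marks the root.
annotator_1 = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
annotator_2 = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "obl")]

uas, las = attachment_scores(annotator_1, annotator_2)
print(f"UAS: {uas:.1f}  LAS: {las:.1f}")  # prints "UAS: 75.0  LAS: 50.0"
```

In this toy example the annotators disagree on the head of the third token and on the label of the fourth, so three of four heads match (UAS 75.0) and two of four full arcs match (LAS 50.0).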
Original language | English |
---|---|
Publication date | 25 Sept 2021 |
Publication status | Published - 25 Sept 2021 |
Event | RESOURCEFUL-2020: RESOURCEs and representations For Under-resourced Languages and domains, Gothenburg, Sweden. Duration: 25 Nov 2020 → … https://gu-clasp.github.io/resourceful-2020/ |
Workshop
Workshop | RESOURCEFUL-2020 |
---|---|
Location | Gothenburg |
Country/Territory | Sweden |
City | Gothenburg |
Period | 25/11/2020 → … |
Keywords
- annotating transcribed speech
- Dutch-Frisian codeswitching
- Universal Dependencies
- low-resource languages
- informal data challenges
Fingerprint
Dive into the research topics of 'Creating a Universal Dependencies Treebank of Spoken Frisian-Dutch Code-switched Data'. Together they form a unique fingerprint.
Projects
- 1 Finished
- Multi-Task Sequence Labeling Under Adverse Conditions
Plank, B. (PI) & van der Goot, R. (CoI)
01/04/2019 → 31/08/2020
Project: Other