
Slot and Intent Detection Resources for Bavarian and Lithuanian: Assessing Translations vs Natural Queries to Digital Assistants

Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review

Abstract

Digital assistants perform well in high-resource languages like English, where tasks like slot and intent detection (SID) are well-supported. Many recent SID datasets have started to include multiple language varieties. However, it is unclear how realistic these translated datasets are. Therefore, we extend one such dataset, namely xSID-0.4, to include two underrepresented languages: Bavarian, a German dialect, and Lithuanian, a Baltic language. Both language variants have limited speaker populations and are often not included in multilingual projects. In addition to translations, we provide "natural" queries to digital assistants generated by native speakers. We further include utterances from another dataset for Bavarian to build the richest SID dataset available today for a low-resource dialect without standard orthography. We then set out to evaluate models trained on English in a zero-shot scenario on our target language variants. Our evaluation reveals that translated data can produce overly optimistic scores. However, the error patterns in translated and natural datasets are highly similar. Cross-dataset experiments demonstrate that data collection methods influence performance, with scores lower than those achieved with single-dataset translations. This work contributes to enhancing SID datasets for underrepresented languages, yielding NaLiBaSID, a new evaluation dataset for Bavarian and Lithuanian.
Original language: Undefined/Unknown
Title of host publication: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Number of pages: 18
Place of Publication: Torino, Italia
Publisher: ELRA and ICCL
Publication date: 1 May 2024
Pages: 14898-14915
Publication status: Published - 1 May 2024
Event: Joint International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy
Duration: 20 May 2024 to 25 May 2024
https://aclanthology.org/2024.lrec-main.544/
https://aclanthology.org/2024.lrec-main.1054/

Conference

Conference: Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Country/Territory: Italy
City: Torino
Period: 20/05/2024 to 25/05/2024

Keywords

  • Slot filling and intent detection (SID)
  • Multilingual natural language understanding
  • Zero-shot learning
  • Low-resource languages and dialects
  • Dataset creation and cross-dataset evaluation for SID
