Leveraging and Probing Speech Prosody to Improve Spoken Language Processing

Publication: Thesis, PhD thesis

Abstract

Natural Language Processing (NLP) technologies have become ubiquitous, increasingly underpinning diverse applications across society. Although NLP advancements have predominantly focused on written text, especially with the recent emergence of powerful large language models (LLMs), spoken language remains fundamental to human communication. Speech inherently conveys richer and more nuanced information than text alone, encapsulating non-verbal and dynamic elements such as tone, pitch, intonation, pauses, and rhythm, collectively referred to as prosody. Prosody significantly influences the perception and intended meaning of spoken language, especially within tonal languages. Despite its importance, explicit utilization of prosodic features in spoken language processing remains limited, primarily due to the scarcity of high-quality prosodically annotated datasets and the lack of advanced models specifically designed to leverage such information. This thesis addresses these challenges through three contributions. First, it introduces the Akan Cinematic Emotions (AkaCE) dataset, the first multimodal, prosodically annotated resource for an African language (Akan). AkaCE enables comprehensive modeling and analysis of prosody-informed Speech Emotion Recognition, establishing foundational benchmarks and facilitating future NLP research in under-represented African languages. Second, this thesis presents computational models explicitly designed to integrate prosodic annotations, achieving state-of-the-art results across crucial Spoken Language Processing tasks, including Automatic Speech Recognition (demonstrating a Word Error Rate reduction of up to 28.3% on LibriSpeech) and speech-based instruction disambiguation in robotics (resolving ambiguous spoken commands with over 71% accuracy).
Third, this work provides the first rigorous exploration of how multilingual contexts influence prosodic expression, revealing notable prosodic shifts in monolingual speech due to proximity to multilingual discourse. This finding highlights the potential of leveraging prosodic context as a powerful feature to enhance performance in complex NLP tasks such as code-switch detection. Empirical results presented throughout this thesis confirm that explicitly integrating speech prosody not only enhances task-specific model performance but also provides deeper linguistic insights across languages and discourse contexts, paving the way for more expressive, effective, and inclusive spoken language technologies.
Original language: English
Supervisor(s)
  • Schluter, Natalie, Main supervisor
Publisher
Status: Published - 2025
