Abstract
Multilingual automatic phone recognition models can learn language-independent pronunciation patterns from large volumes of spoken data and recognize them across languages. This potential can be harnessed to improve speech technologies for under-resourced languages.
However, these models are typically trained on phonological representations of speech sounds, which do not necessarily reflect the phonetic realization of speech. A mismatch between a phonological symbol and its phonetic realizations can lead to phone confusions and reduce performance.
This thesis introduces a formant-based vowel categorization method aimed at improving cross-lingual vowel recognition by uncovering a vowel’s phonetic quality from its formant frequencies, and reorganizing the vowel categories in a multilingual speech corpus to increase their consistency across languages. The work investigates vowel categories obtained from a trilingual speech corpus of Danish, Norwegian, and Swedish using four categorization techniques. Cross-lingual phone recognition experiments reveal that uniting the vowel categories of different languages into a shared set of formant-based categories can improve cross-lingual recognition of the shared vowels, but can also interfere with recognition of vowels not present in one or more training languages. Nevertheless, improved recognition of individual vowels can translate to improvements in overall phone recognition on languages unseen during training.
To assess their wider applicability in automatic speech recognition (ASR), the investigated vowel representations are also evaluated as part of pronunciation lexicons used in hybrid ASR systems. These experiments, however, do not reveal many conclusive patterns, which demonstrates that hybrid systems are more robust to divergence in pronunciation from the phonological norm. Nonetheless, a qualitative analysis of phone predictions shows that the models trained on formant-based vowel representations can infer the distinctive vowel qualities of an unseen language, especially when their vowel set and training data align with the vowel system of the target language. This indicates that formant-based vowel representations could provide useful information for tasks where phonological description is preferred.
Original language | English |
---|---|
Publisher | IT-Universitetet i København |
Number of pages | 244 |
Status | Published - 2024 |
Series name | ITU-DS |
Number | 232 |
ISSN | 1602-3536 |