Quantifying Linguistic Variation: Data-driven Navigation of Variety Space

Publication: Book / Anthology / Report / Ph.D. thesis › Ph.D. thesis

Abstract

Language emerges naturally from human communication, and as such, linguistic variation across the many possible dimensions of expression is ubiquitous. Higher variation across specific dimensions leads to a decrease in mutual intelligibility, or, in the case of Natural Language Processing (NLP), to decreased model transferability. Linguistics delineates between dimensions such as typology, domain, and register using qualitative definitions; however, these are difficult to apply quantitatively and to combine at scale. NLP, on the other hand, necessitates a quantization of language, and has thus enabled machines to learn data-driven, vectorized representations thereof, which measure language similarity remarkably well, but fall short of explaining exactly how two data points are related. By leveraging probing methods to segment the high-dimensional latent spaces of Language Models (LMs) into subspaces with linguistically interpretable similarity characteristics, we aim to bridge the divide between these two disciplines. Our results for cross-lingual syntax and cross-domain genre demonstrate that corresponding subspaces can be successfully recovered, and consequently used to predict which training data and models transfer well to unseen language varieties and domains. Combining dimensions from across this Variety Space, we further quantify task similarity in an interpretable way, and investigate how linguistic information emerges in LMs during their training. As NLP increasingly relies on general-purpose information stored in LMs to solve myriads of downstream tasks, we argue that quantifying and understanding language and task variation is critical to ensure model robustness and trustworthiness. Towards this goal, our quantitative measures of linguistic variation provide a generally applicable framework grounded in traditional linguistics.
Original language: English
Place of publication: Copenhagen, Denmark
Publisher: IT-Universitetet i København
Number of pages: 280
ISBN (Print): 9788779490338
Status: Published - 2024
Name: ITU-DS
Number: 225
ISSN: 1602-3536
