
DistaLs: a Comprehensive Collection of Language Distance Measures

Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review

Abstract

Languages vary along a wide variety of dimensions. In Natural Language Processing (NLP), it is useful to know how “distant” languages are from each other, so that we can inform NLP models about these differences or predict good transfer languages. Furthermore, it can inform us about how diverse language samples are. However, there are many different perspectives on how distances across languages could be measured, and previous work has predominantly focused on either intuition or a single type of distance, like genealogical or typological distance. Therefore, we propose DistaLs, a toolkit that is designed to provide users with easy access to a wide variety of language distance measures. We also propose a filtered subset, which contains less redundant and more reliable features. DistaLs is designed to be accessible for a variety of use cases, and offers a Python, CLI, and web interface. It is easily updateable, and available as a pip package. Finally, we provide a case study in which we use DistaLs to measure correlations of distance measures with performance on four different morphosyntactic tasks.
Original language: English
Title of host publication: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Editors: Ivan Habernal, Peter Schulam, Jörg Tiedemann
Number of pages: 12
Place of Publication: Suzhou, China
Publisher: Association for Computational Linguistics
Publication date: 1 Nov 2025
Pages: 307-318
ISBN (Print): 979-8-89176-334-0
Publication status: Published - 1 Nov 2025
Event: Conference on Empirical Methods in Natural Language Processing - Suzhou, China
Duration: 4 Nov 2025 to 9 Nov 2025
Conference number: 30
https://2025.emnlp.org/

Conference

Conference: Conference on Empirical Methods in Natural Language Processing
Number: 30
Country/Territory: China
City: Suzhou
Period: 04/11/2025 to 09/11/2025

Keywords

  • language distance measures
  • typological distance
  • genealogical distance
  • morphosyntax
  • DistaLs toolkit
