DECAF: A Dynamically Extensible Corpus Analysis Framework

Publication: Conference article in proceedings · Research · peer-reviewed

Abstract

The study of generalization in Language Models (LMs) requires controlled experiments that can precisely measure complex linguistic variations between training and testing datasets. We introduce DECAF, a framework that enables the analysis and filtering of linguistically-annotated datasets down to the character level. Rather than creating new resources for each experiment, DECAF starts from datasets with existing linguistic annotations, and leverages them to analyze, filter, and generate highly controlled and reproducible experimental settings targeting specific research questions. We demonstrate DECAF’s functionality by adding 28 morphosyntactic annotation layers to the 115M-word BabyLM corpus and indexing the resulting 1.1B annotations to analyze its internal domain variance, and to create a controlled training data curriculum for a small-scale gender bias study. We release DECAF as an open-source Python library, along with the parsed and indexed version of BabyLM, as resources for future generalization research.
Original language: English
Title: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Editors: Pushkar Mishra, Smaranda Muresan, Tao Yu
Number of pages: 12
Place of publication: Vienna, Austria
Publisher: Association for Computational Linguistics
Publication date: April 2025
Pages: 351-362
ISBN (print): 979-8-89176-253-4
DOI
Status: Published - April 2025
