DECAF: A Dynamically Extensible Corpus Analysis Framework

Research output: Conference Article in Proceeding or Book/Report chapter; Article in proceedings; Research; peer-review

Abstract

The study of generalization in Language Models (LMs) requires controlled experiments that can precisely measure complex linguistic variations between training and testing datasets. We introduce DECAF, a framework that enables the analysis and filtering of linguistically-annotated datasets down to the character level. Rather than creating new resources for each experiment, DECAF starts from datasets with existing linguistic annotations, and leverages them to analyze, filter, and generate highly controlled and reproducible experimental settings targeting specific research questions. We demonstrate DECAF’s functionality by adding 28 morphosyntactic annotation layers to the 115M-word BabyLM corpus and indexing the resulting 1.1B annotations to analyze its internal domain variance, and to create a controlled training data curriculum for a small-scale gender bias study. We release DECAF as an open-source Python library, along with the parsed and indexed version of BabyLM, as resources for future generalization research.
Original language: English
Title of host publication: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Editors: Pushkar Mishra, Smaranda Muresan, Tao Yu
Number of pages: 12
Place of Publication: Vienna, Austria
Publisher: Association for Computational Linguistics
Publication date: Apr 2025
Pages: 351-362
ISBN (Print): 979-8-89176-253-4
Publication status: Published - Apr 2025
Event: Association for Computational Linguistics - Austria, Vienna, Austria
Duration: 27 Jul 2025 to 1 Aug 2025
Conference number: 63
https://2025.aclweb.org/
https://aclanthology.org/volumes/2025.acl-long/

Conference

Conference: Association for Computational Linguistics
Number: 63
Location: Austria
Country/Territory: Austria
City: Vienna
Period: 27/07/2025 to 01/08/2025

Keywords

  • Language model generalization
  • Linguistic annotations
  • Dataset curation and filtering
  • Reproducible experimentation
  • Bias analysis in NLP
