
Iterative Structured Knowledge Distillation: Optimizing Language Models Through Layer-by-Layer Distillation

Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review

Abstract

Traditional language model compression techniques such as knowledge distillation require a fixed student architecture, limiting flexibility, while structured pruning methods often fail to preserve performance. This paper introduces Iterative Structured Knowledge Distillation (ISKD), which integrates knowledge distillation and structured pruning by progressively replacing transformer blocks with smaller, efficient versions during training. The study validates ISKD on two transformer-based language models: GPT-2 and Phi-1. ISKD outperforms L1 pruning and achieves performance comparable to knowledge distillation while offering greater flexibility. ISKD reduces model parameters (by 30.68% for GPT-2 and 30.16% for Phi-1) while maintaining at least four-fifths of the original performance on both language modeling and commonsense reasoning tasks. These findings suggest that the method offers a promising balance between model efficiency and accuracy.
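
To make the layer-by-layer idea concrete, the sketch below illustrates one plausible reading of the procedure the abstract describes: each transformer block is swapped, one at a time, for a smaller block that is first trained to reproduce the frozen original block's outputs. This is a minimal PyTorch sketch, not the paper's implementation; the names `ToyBlock`, `CompactBlock`, `hidden_states_before`, and `iskd` are hypothetical, and the MSE layer-matching loss and training schedule are assumptions made for illustration.

```python
# Hypothetical sketch of iterative layer-by-layer distillation.
# Not the authors' released code; all names and losses are illustrative.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 64

class ToyBlock(nn.Module):
    """Stand-in for a full transformer block (here just a residual FFN)."""
    def __init__(self, hidden=HIDDEN):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                nn.Linear(4 * hidden, hidden))
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x):
        return self.norm(x + self.ff(x))

class CompactBlock(nn.Module):
    """Smaller drop-in replacement: narrower FFN, fewer parameters."""
    def __init__(self, hidden=HIDDEN):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                nn.Linear(hidden, hidden))
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x):
        return self.norm(x + self.ff(x))

def hidden_states_before(blocks, x, i):
    """Run the (partly replaced) stack up to, but not including, block i."""
    for block in blocks[:i]:
        x = block(x)
    return x

def iskd(blocks, batches, steps_per_block=50, lr=1e-3):
    """Replace each block in turn with a compact one, training the
    replacement to mimic the frozen original block's output."""
    for i in range(len(blocks)):
        teacher = copy.deepcopy(blocks[i]).eval()   # frozen local teacher
        student = CompactBlock()
        opt = torch.optim.AdamW(student.parameters(), lr=lr)
        for _ in range(steps_per_block):
            x = next(batches)
            with torch.no_grad():
                h = hidden_states_before(blocks, x, i)  # input to block i
                target = teacher(h)                     # teacher's output
            loss = F.mse_loss(student(h), target)       # layer-wise loss
            opt.zero_grad(); loss.backward(); opt.step()
        blocks[i] = student  # commit the compact replacement, then move on
    return blocks

# Illustrative usage on random data only, to show the control flow.
blocks = nn.ModuleList([ToyBlock() for _ in range(4)])
def random_batches():
    while True:
        yield torch.randn(8, 16, HIDDEN)
blocks = iskd(blocks, random_batches())
```

Because each replacement is trained against the current, partly replaced stack, the later students adapt to the earlier ones, which is one way the iterative schedule can differ from distilling all layers at once.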
Original language: English
Title of host publication: Proceedings of the 31st International Conference on Computational Linguistics
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Number of pages: 6
Place of publication: Abu Dhabi, UAE
Publisher: Association for Computational Linguistics
Publication date: 1 Jan 2025
Pages: 6601-6606
Publication status: Published - 1 Jan 2025
Event: International Conference on Computational Linguistics - Abu Dhabi, United Arab Emirates
Duration: 19 Jan 2025 - 24 Jan 2025
Conference number: 31
https://coling2025.org/
https://coling2025.org/calls/main_conference_papers/

Conference

Conference: International Conference on Computational Linguistics
Number: 31
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 19/01/2025 - 24/01/2025
Internet address: https://coling2025.org/

Keywords

  • Iterative Structured Knowledge Distillation
  • Structured pruning
  • Transformer block replacement
  • Language model compression
  • Parameter-efficient NLP

