When BERT Plays the Lottery, All Tickets Are Winning

Sai Prasanna, Anna Rogers, Anna Rumshisky

    Publication: Conference article in proceedings › Research › peer-reviewed

    Abstract

    Much of the recent success in NLP is due to large Transformer-based models such as BERT (Devlin et al., 2019). However, these models have been shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis. For fine-tuned BERT, we show that (a) it is possible to find a subnetwork of elements that achieves performance comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. However, the "bad" subnetworks can be fine-tuned separately to achieve only slightly worse performance than the "good" ones, indicating that most weights in the pre-trained BERT are potentially useful. We also show that the "good" subnetworks vary considerably across GLUE tasks, opening up the possibility of learning what knowledge BERT actually uses at inference time.
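
    The subnetworks studied here are obtained by pruning BERT's self-attention heads and layers. As an illustration only (not the authors' code), the sketch below shows how a candidate subnetwork can be evaluated by disabling individual attention heads through a head mask, assuming the Hugging Face transformers API; the checkpoint name, the example sentence, and the particular heads dropped are arbitrary assumptions for demonstration.

```python
# Minimal sketch: evaluate a BERT "subnetwork" by masking self-attention heads.
# Assumes the Hugging Face transformers library; not the authors' implementation.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # assumption: any BERT checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.eval()

num_layers = model.config.num_hidden_layers   # 12 for BERT-base
num_heads = model.config.num_attention_heads  # 12 for BERT-base

# head_mask[l, h] = 0 disables head h in layer l; 1 keeps it.
# A "good" subnetwork would keep only the heads found important for the task.
head_mask = torch.ones(num_layers, num_heads)
head_mask[0, :6] = 0.0  # hypothetical example: drop the first six heads of layer 0

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs, head_mask=head_mask).logits
print(logits)  # predictions from the pruned subnetwork (classifier head untrained here)
```
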
    Original language: English
    Title: Proceedings of EMNLP
    Number of pages: 22
    Place of publication: Online
    Publisher: Association for Computational Linguistics
    Publication date: 1 Nov 2020
    Pages: 3208-3229
    Status: Published - 1 Nov 2020

    Keywords

    • Natural Language Processing
    • Transformer models
    • Lottery Ticket Hypothesis
    • BERT Fine-tuning
    • Self-attention heads
    • Subnetwork performance
