Abstract
As the input representation for each sub-word, the original BERT architecture proposes the sum of a sub-word embedding,
a position embedding, and a segment embedding. Sub-word and position embeddings are well known and well studied, and
encode lexical information and word position, respectively. In contrast, segment embeddings are less known and have so
far received no attention. The key idea of segment embeddings is to encode to which of the two sentences (segments)
a word belongs—the intuition is to inform the model about the separation of sentences for the next sentence prediction
pre-training task. However, little is known about whether the choice of segment ID affects downstream prediction performance.
In this work, we try to fill this gap and empirically study the impact of altering the segment embedding at inference
time for a variety of pre-trained embeddings and target tasks. We hypothesize that for single-sentence prediction tasks
performance is not affected—neither in mono- nor multilingual setups—while it matters when changing the segment IDs
in paired-sentence tasks. To our surprise, this is not the case. Although no large differences are observed for classification
tasks and monolingual BERT models, word-level multilingual prediction tasks in particular are heavily impacted. For
low-resource syntactic tasks, both the segment embedding and the choice of multilingual BERT model affect performance. We find that
the default setting for the most widely used multilingual BERT model underperforms heavily, and a simple swap of the segment
embeddings yields an average improvement of 2.5 absolute LAS points for dependency parsing across 9 different treebanks.
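
The input representation described in the abstract can be illustrated with a minimal sketch. It uses toy, randomly initialised embedding tables rather than any pre-trained BERT weights, and all names and sizes (`HIDDEN`, `subword_emb`, `input_representation`, and so on) are hypothetical choices made for illustration. Further steps that real BERT implementations apply after the sum, such as layer normalisation and dropout, are omitted. The sketch only shows that, for a single-sentence input, swapping the segment ID from 0 to 1 shifts every token's input vector by the same constant offset.

```python
import random

random.seed(0)

# Toy sizes, assumed for illustration only (not the real BERT dimensions).
HIDDEN = 4
VOCAB = 10
MAX_LEN = 8


def make_table(rows, dim):
    """Build a random embedding table with `rows` vectors of size `dim`."""
    return [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


subword_emb = make_table(VOCAB, HIDDEN)     # one vector per sub-word ID
position_emb = make_table(MAX_LEN, HIDDEN)  # one vector per position
segment_emb = make_table(2, HIDDEN)         # one vector per segment ID (0 or 1)


def input_representation(token_ids, segment_ids):
    """Return, per token, the sum of sub-word, position and segment embeddings."""
    rows = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows.append([
            s + p + g
            for s, p, g in zip(subword_emb[tok], position_emb[pos], segment_emb[seg])
        ])
    return rows


# A single "sentence" of four sub-words, encoded once per segment ID.
token_ids = [2, 5, 7, 3]
default = input_representation(token_ids, [0] * len(token_ids))
swapped = input_representation(token_ids, [1] * len(token_ids))

# The swap changes every token vector by the same offset: segment 1 minus segment 0.
offset = [b - a for a, b in zip(segment_emb[0], segment_emb[1])]
for row_default, row_swapped in zip(default, swapped):
    diff = [b - a for a, b in zip(row_default, row_swapped)]
    print([round(d, 3) for d in diff])
```

Each printed row matches `offset` up to rounding, which makes explicit what a segment swap at inference time means at the input layer. How this shift propagates through the rest of the model is the empirical question the paper studies.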
Original language | English |
---|---|
Title of host publication | Proceedings of the Language Resources and Evaluation Conference |
Publication date | 2022 |
Pages | 1418-1427 |
Publication status | Published - 2022 |
Keywords
- BERT architecture
- sub-word embeddings
- segment embeddings
- multilingual BERT
- dependency parsing