
DAN+: Danish Nested Named Entities and Lexical Normalization

Publication: Article in proceedings or book/report chapter; Conference contribution in proceedings; Research; Peer-reviewed

Abstract

This paper introduces DAN+, a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language. We empirically assess three strategies to model the two-layer Named Entity Recognition (NER) task. We compare transfer capabilities from German versus in-language annotation from scratch. We examine language-specific versus multilingual BERT, and study the effect of lexical normalization on NER. Our results show that 1) the most robust strategy is multi-task learning, which is rivaled by multi-label decoding, 2) BERT-based NER models are sensitive to domain shifts, and 3) in-language BERT and lexical normalization are the most beneficial on the least canonical data. Our results also show that an out-of-domain setup remains challenging, while performance on news plateaus quickly. This highlights the importance of cross-domain evaluation of cross-lingual transfer.
Original language: English
Title (host publication): The 28th International Conference on Computational Linguistics
Publisher: Association for Computational Linguistics
Publication date: Dec. 2020
Pages: 6649–6662
Status: Published - Dec. 2020
Event: International Conference on Computational Linguistics - Barcelona, Spain
Duration: 8 Dec. 2020 to 13 Dec. 2020
Conference number: 28th

Conference

Conference: International Conference on Computational Linguistics
Number: 28th
Country/Territory: Spain
City: Barcelona
Period: 08/12/2020 to 13/12/2020

Keywords

  • DAN+ Corpus
  • Nested Named Entities
  • Cross-lingual Transfer
  • Lexical Normalization
  • Multilingual BERT
