DaNewsroom: A Large-scale Danish Summarisation Dataset

Daniel Varab, Natalie Schluter

Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review

Abstract

Dataset development for automatic summarisation systems is notoriously English-oriented. In this paper we present the first large-scale non-English language dataset specifically curated for automatic summarisation. The document-summary pairs are news articles and manually written summaries in the Danish language. There has previously been no work done to establish a Danish summarisation dataset, nor any published work on the automatic summarisation of Danish. We therefore provide the first automatic summarisation dataset for the Danish language (large-scale or otherwise). To support the comparison of future automatic summarisation systems for Danish, we include system performance on this dataset of strong, well-established unsupervised baseline systems, together with an oracle extractive summariser, which is the first account of automatic summarisation system performance for Danish. Finally, we make all code for automatically acquiring the data freely available and make explicit how this technology can easily be adapted in order to acquire automatic summarisation datasets for further languages.
Original language: English
Title of host publication: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)
Publisher: European Language Resources Association
Publication date: Apr 2020
Pages: 6731–6739
Publication status: Published - Apr 2020
Event: LREC 2020 - Marseille, France
Duration: 17 May 2020 to 22 May 2020
https://lrec2020.lrec-conf.org/en/

Conference

Conference: LREC 2020
Country/Territory: France
City: Marseille
Period: 17/05/2020 to 22/05/2020

Keywords

  • Danish language dataset
  • Automatic summarisation
  • Non-English summarisation
  • Document-summary pairs
  • Unsupervised baseline systems
