Finding the needle in a haystack: Extraction of Informative COVID-19 Danish Tweets

Benjamin Ahrentløv Olsen, Barbara Plank

Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review

Abstract

Finding informative COVID-19 posts in a stream of tweets is very useful to monitor health-related updates. Prior work focused on a balanced data setup and on English, but informative tweets are rare, and English is only one of the many languages spoken in the world. In this work, we introduce a new dataset of 5,000 tweets for finding informative COVID-19 tweets for Danish. In contrast to prior work, which balances the label distribution, we model the problem by keeping its natural distribution. We examine how well a simple probabilistic model and a convolutional neural network (CNN) perform on this task. We find a weighted CNN to work well but it is sensitive to embedding and hyperparameter choices. We hope the contributed dataset is a starting point for further work in this direction.
Original language: English
Title of host publication: Proceedings of the 2021 EMNLP Workshop W-NUT: The Seventh Workshop on Noisy User-generated Text
Publisher: Association for Computational Linguistics
Publication date: 2021
Pages: 11–19
Publication status: Published - 2021

Keywords

  • Informative Tweets
  • COVID-19
  • Danish Language
  • Natural Distribution
  • Convolutional Neural Network (CNN)
