ITU

Directions in abusive language training data, a systematic review: Garbage in, garbage out

Research output: Journal Article or Conference Article in Journal › Journal article › Research › peer-review

Standard

Directions in abusive language training data, a systematic review: Garbage in, garbage out. / Vidgen, Bertie; Derczynski, Leon.

In: PLOS ONE, Vol. 15, No. 12, e0243300, 28.12.2020.


BibTeX

@article{838f076bbf314f599744bae123f0a785,
title = "Directions in abusive language training data, a systematic review: Garbage in, garbage out",
abstract = "Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on creation of a dedicated website for cataloguing abusive language data hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.",
author = "Bertie Vidgen and Leon Derczynski",
year = "2020",
month = dec,
day = "28",
doi = "10.1371/journal.pone.0243300",
language = "English",
volume = "15",
journal = "PLOS ONE",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "12",
pages = "e0243300",
}

RIS

TY - JOUR

T1 - Directions in abusive language training data, a systematic review: Garbage in, garbage out

AU - Vidgen, Bertie

AU - Derczynski, Leon

PY - 2020/12/28

Y1 - 2020/12/28

N2 - Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on creation of a dedicated website for cataloguing abusive language data hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.

AB - Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on creation of a dedicated website for cataloguing abusive language data hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.

U2 - 10.1371/journal.pone.0243300

DO - 10.1371/journal.pone.0243300

M3 - Journal article

VL - 15

JO - PLOS ONE

JF - PLOS ONE

SN - 1932-6203

IS - 12

M1 - e0243300

ER -

ID: 85640372