De-identification of Privacy-related Entities in Job Postings

Kristian Nørgaard Jensen, Mike Zhang, Barbara Plank

Publication: Conference article in proceedings or book/report chapter; conference contribution in proceedings; research; peer-reviewed

Abstract

De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data. It has been well-studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stack Overflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve upon these baselines, we experiment with contextualized embeddings and distantly related auxiliary data via multi-task learning. Our results show that auxiliary data improves de-identification performance.
Original language: English
Title: Proceedings of the 23rd Nordic Conference on Computational Linguistics
Publisher: Association for Computational Linguistics
Publication date: 21 May 2021
Pages: 210-221
Status: Published - 21 May 2021
Event: NoDaLiDa 2021 - Reykjavik, Iceland
Duration: 31 May 2021 → …

Conference

Conference: NoDaLiDa 2021
Location: Reykjavik
Country/Territory: Iceland
Period: 31/05/2021 → …
Name: Linköping Electronic Conference Proceedings
Number: 21
Volume: 178

Keywords

  • De-identification
  • Privacy-related entities
  • Medical domain
  • Privacy-preserving data
  • JobStack corpus
  • Job postings
  • Personal data
  • Baselines
  • Long Short-Term Memory (LSTM)
  • Transformer models
