De-identification of Privacy-related Entities in Job Postings

Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review

Abstract

De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data. It has been well studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stack Overflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve upon these baselines, we experiment with contextualized embeddings and distantly related auxiliary data via multi-task learning. Our results show that auxiliary data improves de-identification performance.
Original language: English
Title of host publication: Proceedings of the 23rd Nordic Conference on Computational Linguistics
Publisher: Association for Computational Linguistics
Publication date: 21 May 2021
Pages: 210-221
Publication status: Published - 21 May 2021
Event: Nordic Conference on Computational Linguistics - Reykjavik, Iceland
Duration: 31 May 2021 to 2 Jun 2021
Conference number: 23

Conference

Conference: Nordic Conference on Computational Linguistics
Number: 23
Country/Territory: Iceland
City: Reykjavik
Period: 31/05/2021 to 02/06/2021
Series: Linköping Electronic Conference Proceedings
Number: 21
Volume: 178

Keywords

  • De-identification
  • Privacy-related entities
  • Medical domain
  • Privacy-preserving data
  • JobStack corpus
  • Job postings
  • Personal data
  • Baselines
  • Long Short-Term Memory (LSTM)
  • Transformer models
