De-identification of Privacy-related Entities in Job Postings

Kristian Nørgaard Jensen, Mike Zhang, Barbara Plank

Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review


De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data. It has been well-studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stack Overflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve upon these baselines, we experiment with contextualized embeddings and distantly related auxiliary data via multi-task learning. Our results show that auxiliary data improves de-identification performance.
Original language: English
Title of host publication: Proceedings of the 23rd Nordic Conference on Computational Linguistics
Publisher: Association for Computational Linguistics
Publication date: 21 May 2021
Publication status: Published - 21 May 2021
Event: NoDaLiDa 2021 - Reykjavik, Iceland
Duration: 31 May 2021 → …


Conference: NoDaLiDa 2021
Period: 31/05/2021 → …
Series: Linköping Electronic Conference Proceedings


