De-identification of Privacy-related Entities in Job Postings

Research output: Conference Article in Proceeding or Book/Report chapter; Article in proceedings; Research; peer-review


De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data. It has been well-studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stack Overflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve upon these baselines, we experiment with contextualized embeddings and distantly related auxiliary data via multi-task learning. Our results show that auxiliary data improves de-identification performance.
Original language: English
Title of host publication: Proceedings of the 23rd Nordic Conference on Computational Linguistics
Publisher: Association for Computational Linguistics
Publication date: 21 May 2021
Publication status: Published - 21 May 2021
Event: NoDaLiDa 2021 - Reykjavik, Iceland
Duration: 31 May 2021 → …


Conference: NoDaLiDa 2021
Period: 31/05/2021 → …
Series: Linköping Electronic Conference Proceedings



ID: 85880204