De-identification is the task of detecting privacy-related entities in text, such as person names, emails, and contact data. It has been well studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stackoverflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve upon these baselines, we experiment with contextualized embeddings and distantly related auxiliary data via multi-task learning. Our results show that auxiliary data improves de-identification performance.
Title of host publication
Proceedings of the 23rd Nordic Conference on Computational Linguistics