De-identification of Privacy-related Entities in Job Postings
Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review
De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data. It has been well studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stack Overflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve upon these baselines, we experiment with contextualized embeddings and distantly related auxiliary data via multi-task learning. Our results show that auxiliary data improves de-identification performance.
Original language | English |
---|---|
Title of host publication | Proceedings of the 23rd Nordic Conference on Computational Linguistics |
Publisher | Association for Computational Linguistics |
Publication date | 21 May 2021 |
Pages | 210-221 |
Publication status | Published - 21 May 2021 |
Event | NoDaLiDa 2021 - Reykjavik, Iceland; Duration: 31 May 2021 → … |
Conference
Conference | NoDaLiDa 2021 |
---|---|
Location | Reykjavik |
Country | Iceland |
Period | 31/05/2021 → … |
Series | Linköping Electronic Conference Proceedings |
---|---|
Number | 21 |
Volume | 178 |
ID: 85880204