Abstract
De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data. It has been well studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stack Overflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve upon these baselines, we experiment with contextualized embeddings and distantly related auxiliary data via multi-task learning. Our results show that auxiliary data improves de-identification performance.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 23rd Nordic Conference on Computational Linguistics |
| Publisher | Association for Computational Linguistics |
| Publication date | 21 May 2021 |
| Pages | 210-221 |
| Publication status | Published - 21 May 2021 |
| Event | Nordic Conference on Computational Linguistics, Reykjavik, Iceland. Duration: 31 May 2021 → 2 Jun 2021. Conference number: 23 |
Conference
| Conference | Nordic Conference on Computational Linguistics |
|---|---|
| Number | 23 |
| Country/Territory | Iceland |
| City | Reykjavik |
| Period | 31/05/2021 → 02/06/2021 |
| Series | Linköping Electronic Conference Proceedings |
| Series number | 21 |
| Series volume | 178 |
Keywords
- De-identification
- Privacy-related entities
- Medical domain
- Privacy-preserving data
- JobStack corpus
- Job postings
- Personal data
- Baselines
- Long Short-Term Memory (LSTM)
- Transformer models
Fingerprint
Research topics of 'De-identification of Privacy-related Entities in Job Postings'.