Abstract
De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data. It has been well studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stack Overflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve upon these baselines, we experiment with contextualized embeddings and distantly related auxiliary data via multi-task learning. Our results show that auxiliary data improves de-identification performance.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 23rd Nordic Conference on Computational Linguistics |
| Publisher | Association for Computational Linguistics |
| Publication date | 21 May 2021 |
| Pages | 210-221 |
| Status | Published - 21 May 2021 |
| Event | Nordic Conference on Computational Linguistics - Reykjavik, Iceland. Duration: 31 May 2021 → 2 Jun 2021. Conference number: 23 |
Conference
| Conference | Nordic Conference on Computational Linguistics |
|---|---|
| Number | 23 |
| Country/Territory | Iceland |
| City | Reykjavik |
| Period | 31/05/2021 → 02/06/2021 |

| Name | Linköping Electronic Conference Proceedings |
|---|---|
| Number | 21 |
| Volume | 178 |
Keywords
- De-identification
- Privacy-related entities
- Medical domain
- Privacy-preserving data
- JobStack corpus
- Job postings
- Personal data
- Baselines
- Long Short-Term Memory (LSTM)
- Transformer models