TY - GEN
T1 - De-identifying an EHR Database
T2 - Anonymity, Correctness and Readability of the Medical Record
AU - Lauesen, Søren
AU - Pantazos, Kostas
AU - Lippert, Søren
PY - 2011
Y1 - 2011
N2 - Electronic health records (EHR) contain a large amount of structured data and free text. Exploring and sharing clinical data can improve healthcare and facilitate the development of medical software. However, revealing confidential information is against ethical principles and laws. We de-identified a Danish EHR database with 437,164 patients. The goal was to generate a version with real medical records, but related to artificial persons. We developed a de-identification algorithm that uses lists of named entities, simple language analysis, and special rules. Our algorithm consists of 3 steps: collect lists of identifiers from the database and external resources, define a replacement for each identifier, and replace identifiers in structured data and free text. Some patient records could not be safely de-identified, so the de-identified database has 323,122 patient records with an acceptable degree of anonymity, readability and correctness (F-measure of 95%). The algorithm has to be adjusted for each culture, language and database.
KW - Electronic health records
KW - De-identification
KW - Artificial persons
KW - Named entity recognition
KW - Healthcare data privacy
M3 - Conference article
SN - 0926-9630
VL - 169
SP - 862
EP - 866
JO - Studies in Health Technology and Informatics
JF - Studies in Health Technology and Informatics
ER -