Principal Investigator
Name
Andreas Maier
Degrees
Ph.D.
Institution
FAU Erlangen Nürnberg
Position Title
Professor, Head of the Pattern Recognition Lab
About this CDAS Project
Study
PLCO
Project ID
PLCO-529
Initial CDAS Request Approval
Sep 26, 2019
Title
Development of a machine-learning algorithm for generating realistic synthetic electronic healthcare records
Summary
Developing new software for the healthcare industry requires access to high-quality, realistic patient data. However, because of legal, privacy, and security regulations, data from real persons is mostly inaccessible to researchers. One approach to this problem is to use anonymized data sold by a range of government, commercial, corporate, insurance, and clinical groups. However, using such data still raises privacy and confidentiality concerns, especially because there have been cases in which anonymized records were successfully re-identified.
Another approach is to use a system that generates completely synthetic healthcare records. One such platform currently available to the public is Synthea, an open-source project that aims to generate synthetic patient records from publicly available data (including US Census Bureau demographics, Centers for Disease Control and Prevention prevalence and incidence rates, and National Institutes of Health reports) combined with models of clinical workflow and disease progression.
Although that approach has shown good results, there is still room for improvement. One problem that has been noted is that Synthea and other synthetic patient generators currently do not model differences in care or the outcomes that may result from care deviations.
The goal of this thesis is to improve the quality of the generated patient data by using machine learning algorithms. To create such an algorithm, publicly available datasets will be used to train a network to identify the necessary patient characteristics, which will help to create realistic synthetic patient data.
Aims

- Develop a lung cancer module for the existing patient generator tool
- Literature research and evaluation of existing algorithms for patient generation
- Development of a machine learning algorithm to learn distributions of patient characteristics, diagnostic and therapeutic procedures, and patient outcomes (complications, survival) from available healthcare population databases
- Validation of the patient generator against real patient population / scientific literature
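
To make the last two aims concrete, the following is a minimal sketch, not the project's actual method. It assumes a deliberately simple baseline: each field's empirical distribution is estimated independently, synthetic patients are sampled from those distributions, and validation is a comparison of field frequencies. The records, the field names (`sex`, `smoker`, `stage`), and all function names are hypothetical and invented for illustration. A real machine-learning model would also need to capture dependencies between fields, which this baseline ignores.

```python
import random
from collections import Counter

# Hypothetical toy records; every value here is invented for illustration
# and does not come from PLCO or any other real dataset.
REAL_RECORDS = [
    {"sex": "F", "smoker": "never", "stage": "I"},
    {"sex": "M", "smoker": "current", "stage": "III"},
    {"sex": "M", "smoker": "former", "stage": "II"},
    {"sex": "F", "smoker": "current", "stage": "IV"},
    {"sex": "M", "smoker": "current", "stage": "III"},
    {"sex": "F", "smoker": "former", "stage": "I"},
]


def fit_marginals(records):
    """Estimate the empirical distribution of each field independently."""
    marginals = {}
    for field in records[0].keys():
        counts = Counter(r[field] for r in records)
        total = sum(counts.values())
        marginals[field] = {value: n / total for value, n in counts.items()}
    return marginals


def sample_patient(marginals, rng):
    """Draw one synthetic patient, sampling each field on its own."""
    patient = {}
    for field, dist in marginals.items():
        values = list(dist.keys())
        weights = list(dist.values())
        patient[field] = rng.choices(values, weights=weights, k=1)[0]
    return patient


def max_marginal_gap(real, synthetic):
    """Largest absolute difference between real and synthetic field frequencies."""
    real_m = fit_marginals(real)
    syn_m = fit_marginals(synthetic)
    gap = 0.0
    for field, dist in real_m.items():
        for value, p in dist.items():
            gap = max(gap, abs(p - syn_m[field].get(value, 0.0)))
    return gap


rng = random.Random(0)  # fixed seed so the sketch is reproducible
marginals = fit_marginals(REAL_RECORDS)
synthetic = [sample_patient(marginals, rng) for _ in range(1000)]
gap = max_marginal_gap(REAL_RECORDS, synthetic)
print(f"largest frequency gap between real and synthetic data: {gap:.3f}")
```

A small gap only shows that the single-field frequencies match. Validating against a real patient population or the literature, as the final aim describes, would also require checking joint behaviour, such as how stage relates to survival.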

Collaborators

FAU Erlangen Nürnberg