Principal Investigator
Maya Carswell
University of Liverpool
Position Title
MSc Student
About this CDAS Project
PLCO
Project ID
Initial CDAS Request Approval
Jun 22, 2022
Using Machine Learning algorithms to predict breast cancer in women, using Electronic Health Record information.
This project aims to compare ML algorithms for predicting breast cancer using only data that would be held on a patient's medical record, regardless of how up to date it is. A scoping literature review found that many ML algorithms for predicting breast cancer have already been published, yet all rely on data that would be difficult to acquire at scale, such as for the general population.
This project will fulfil the MSc Dissertation project for Maya Carswell.
Using the CDAS Breast Cancer Data, this project will extract features that would be readily available within a patient's electronic health record, without having to perform any invasive extra examinations such as blood tests, obtain up-to-date readings such as BMI, or ask the patient to answer any additional questions to supplement the data for the algorithm. A patient's electronic health record is expected to contain information that they supplied when they registered as a patient at a GP practice, or from their most recent appointments, such as smoking status, family history of cancer, and ethnicity. It is also expected to contain further medical information, such as the time they have spent on birth control, whether they have been diagnosed with breast cancer in the past, and whether they are on hormone replacement drugs. When the CDAS data is received, features that a patient's electronic health record would not include will be removed manually. Further data pre-processing, such as handling missing values, will also be performed.
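The feature selection and missing-value handling described above could be sketched roughly as follows. This is a minimal illustration only: the field names are hypothetical placeholders (the real CDAS variable names will differ), and median imputation is just one simple strategy, assumed here for the example rather than taken from the project plan.

```python
from statistics import median

# Hypothetical field names, assumed for illustration; not the CDAS variable names.
EHR_FEATURES = [
    "smoking_status",
    "family_history_of_cancer",
    "ethnicity",
    "years_on_birth_control",
    "on_hormone_replacement",
]


def select_ehr_features(record, features=EHR_FEATURES):
    """Keep only fields assumed to be present in a routine health record."""
    return {name: record.get(name) for name in features}


def impute_missing(records, numeric_fields):
    """Fill missing numeric values with the median of the observed values."""
    filled = [dict(r) for r in records]  # copy so the input is not modified
    for field in numeric_fields:
        observed = [r[field] for r in filled if r[field] is not None]
        fill_value = median(observed) if observed else None
        for r in filled:
            if r[field] is None:
                r[field] = fill_value
    return filled


# Toy records: "bmi" stands in for a feature that would be dropped.
raw = [
    {"smoking_status": "never", "years_on_birth_control": 2, "bmi": 24.1},
    {"smoking_status": "former", "years_on_birth_control": None, "bmi": 30.2},
    {"smoking_status": "current", "years_on_birth_control": 6, "bmi": 27.5},
]
selected = [select_ehr_features(r) for r in raw]
cleaned = impute_missing(selected, ["years_on_birth_control"])
```

In practice the fill value would normally be computed on the training data only and then reused for the test data, so that no information leaks from the test set.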
From here, the data will be used to tune and train several Machine Learning algorithms, including but not limited to Support Vector Machines, Logistic Regression, Random Forest, and Neural Networks, to predict the presence of breast cancer in the women. ROC curves will then be generated, and the area under each curve (AUROC) will be used to test and compare the discriminatory accuracy of the algorithms and to determine the usefulness of each in general medical practice.
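To make the comparison metric concrete, the sketch below computes AUROC directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. It is a plain-Python illustration under that definition, not the project's implementation; the labels and scores are made-up toy values.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 outcomes (1 = breast cancer present).
    scores: predicted scores or probabilities, higher meaning more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 0.75
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; a library routine such as scikit-learn's `roc_auc_score` would be the usual choice on a full dataset. Applying the same function to each algorithm's scores on the same held-out cases gives directly comparable values.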

- To extract data from the CDAS Breast Cancer Dataset that would be easily accessible on a patient's electronic health record.
- To weight the data to account for class imbalance.
- To split the data into test and training sets.
- To apply and fine-tune Machine Learning algorithms, such as Support Vector Machines, Logistic Regression, Random Forest, and Neural Networks, to this dataset.
- To validate the ML algorithms using 10-fold cross-validation.
- To generate ROC curves, AUROC values, and other performance metrics for each algorithm, to determine its discriminatory accuracy.
- To compare the performance metrics of each algorithm and discuss which would be most suitable for use in clinical practice.
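Two of the aims above, class weighting and 10-fold cross-validation, can be illustrated with a short standard-library sketch. It assumes the common "balanced" weighting heuristic (`n_samples / (n_classes * class_count)`) and stratified folds that preserve the class ratio; both are assumptions made for this example, since the aims do not specify a particular scheme.

```python
import random
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (the 'balanced' heuristic)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def stratified_k_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy imbalanced labels: 10 positive cases among 100 (not real study data).
toy_labels = [1] * 10 + [0] * 90
weights = balanced_class_weights(toy_labels)   # minority class gets weight 5.0
folds = stratified_k_folds(toy_labels, k=10)

# Each fold serves once as the validation set; the remaining nine form the training set.
for fold_number, validation_indices in enumerate(folds):
    held_out = set(validation_indices)
    training_indices = [i for i in range(len(toy_labels)) if i not in held_out]
```

Stratifying matters with a rare outcome: without it, some folds could contain very few or no positive cases, which would make per-fold AUROC unstable or undefined.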


Fawada Qaiser, University of Liverpool