Utilization of machine learning to create a predictive model for ovarian cancer based on pelvic ultrasound, CA-125, and clinical characteristics
Machine learning techniques are increasingly used to integrate complex clinical information and generate predictive models that can improve the accuracy of clinical diagnosis. The Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial collected data that are highly pertinent to the diagnosis of ovarian cancer, including detailed and standardized pelvic ultrasound measurements, serum CA-125 levels, and patient information such as family history, contraception use, and medical comorbidities. We aim to use machine learning techniques to generate a predictive model into which clinicians can enter readily available clinical information to obtain an estimated risk that an ovarian mass seen on ultrasound represents ovarian cancer.
Data will be divided into training and testing cohorts. Multiple machine learning models will be developed using the training cohort, and internal validation will then be conducted with data from the testing cohort. If a successful prediction model is generated, future studies may focus on external validation of the model using hospital data, followed by prospective study of the model. Ultimately, a successful prediction model should decrease the rate of false-positive results and unnecessary surgical morbidity while maintaining high sensitivity for the diagnosis of ovarian cancer.
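As a rough illustration of the cohort division described above, the following sketch splits cases into training and testing cohorts while keeping the proportion of cancers similar in both. Stratifying by outcome, the 70/30 ratio, the function name `stratified_split`, and the toy labels are all assumptions made for this example; the project description does not specify how the split will be performed.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), splitting each outcome class separately.

    Splitting within each class keeps a rare outcome (here, cancer)
    represented in both cohorts. This is an illustrative choice only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)  # random assignment within the class
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (made up, not PLCO data): 1 = malignant mass, 0 = benign mass
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels, test_fraction=0.3, seed=42)
```

With these toy labels, the testing cohort receives 3 of the 10 malignant cases and 27 of the 90 benign cases, so both cohorts keep the same 10% prevalence.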
1. Utilize the PLCO dataset (training cohort) including pelvic ultrasound measurements, CA-125 levels, family history, medical history, and health-questionnaire results to determine which factors are associated with ovarian cancer in a patient with an ovarian mass.
-Multiple machine learning models will be developed in order to identify the most accurate one
-To improve the simplicity and clinical utility of the model, we will attempt to sequentially remove predictive characteristics in order to minimize the number of clinical parameters that must be entered into the model without compromising accuracy
2. Internal validation of the model
-Utilize the PLCO dataset (testing cohort) to identify the most accurate machine learning algorithm
-Perform internal validation to provide the area under the receiver operating characteristic curve (AUC) and other measures of accuracy
3. External validation
-Externally validate the model utilizing retrospective hospital-based data in conjunction with the University Hospitals Cleveland Medical Center Institutional Review Board
-Further external validation utilizing prospective and potentially randomized methodology
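
Two steps in the aims above lend themselves to a small sketch: reporting AUC alongside sensitivity and specificity, and sequentially removing predictors while accuracy is preserved. The code below is a minimal illustration under stated assumptions, not the study's method. The function names, the 0.01 tolerance, the feature names, and the toy `score_fn` are hypothetical. In practice `score_fn` would refit a model on the training cohort and return a validated AUC.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive case outranks a
    random negative case, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def backward_elimination(features, score_fn, tolerance=0.01):
    """Greedily drop one feature at a time while the score stays within
    `tolerance` of the full-model score (tolerance is an assumed value)."""
    current = list(features)
    baseline = score_fn(current)
    while len(current) > 1:
        # Score every subset obtained by removing a single feature
        candidates = [
            (score_fn([f for f in current if f != drop]), drop)
            for drop in current
        ]
        best_score, drop = max(candidates)
        if best_score < baseline - tolerance:
            break  # every removal costs too much accuracy
        current.remove(drop)
    return current


# Toy predicted risks and outcomes (made up, not PLCO data)
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
model_auc = auc(scores, labels)
sens, spec = sensitivity_specificity(scores, labels, threshold=0.5)

# Hypothetical stand-in for "refit the model and return its validated AUC"
contribution = {"ca125": 0.15, "mass_volume": 0.10,
                "family_history": 0.005, "age": 0.002}


def score_fn(subset):
    return 0.6 + sum(contribution[f] for f in subset)


kept = backward_elimination(list(contribution), score_fn, tolerance=0.01)
```

In this toy setup the two low-contribution predictors are removed and `kept` retains only `ca125` and `mass_volume`. The elimination loop should rely on training-cohort data only, so that the testing cohort remains an unbiased check of the final model.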
David Sheyn MD, Soumya Ray PhD