Towards Precision Prevention: Applying advanced data analytics and machine learning techniques to improve ovarian cancer screening.
More sensitive and specific screening strategies are needed. Research is ongoing into novel serologic tests and better diagnostic imaging, which may improve disease detection in the future. In the meantime, analyzing available data with advanced data analytics and machine learning techniques may yield better screening strategies that are testable in the near term.
We propose to analyze the PLCO study data using a wider range of advanced data analytics and machine learning techniques than any previous study of ovarian cancer detection. We will study relevant variables in the PLCO ovarian cancer data sets, which cover comprehensive screening, screening abnormalities, diagnostic procedures, medical complications, and treatments. We will use regression and classification techniques, including linear models: both ordinary linear regression and regression with regularization techniques such as Ridge, Lasso, and Elastic Net. We will also test a wide variety of other techniques, such as random forests, gradient boosting, neural networks, decision trees, and support vector machines.
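The model comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the data are synthetic (generated with scikit-learn's `make_classification`, not PLCO records), the rare-outcome rate of about 5% is arbitrary, and penalized logistic regression stands in for the Ridge, Lasso, and Elastic Net penalties because the outcome here is binary. The helper name `penalized_logit` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a screening data set (assumption: ~5% positives).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    n_clusters_per_class=1, weights=[0.95, 0.05], random_state=0,
)


def penalized_logit(l1_ratio):
    """Standardize features, then fit an elastic-net logistic regression.

    l1_ratio=0.0 gives a pure L2 (ridge-style) penalty, 1.0 a pure L1
    (lasso-style) penalty, and values in between mix the two.
    """
    return make_pipeline(
        StandardScaler(),
        LogisticRegression(
            penalty="elasticnet", l1_ratio=l1_ratio, solver="saga",
            C=1.0, max_iter=5000, class_weight="balanced",
        ),
    )


models = {
    "ridge (L2)": penalized_logit(0.0),
    "elastic net": penalized_logit(0.5),
    "lasso (L1)": penalized_logit(1.0),
    "random forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

# Stratified folds keep the rare positive class represented in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, score in auc.items():
    print(f"{name:>14}: mean ROC AUC = {score:.3f}")
```

ROC AUC is used here only as a convenient summary; for a screening application, sensitivity at a fixed high specificity may be the more relevant metric.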
AIMS.
(1) Improve our understanding of the risk factors associated with ovarian cancer, identify better screening procedures, and identify subsets of the general population who might derive greater benefit from screening. Specific questions addressed will include:
• Can we identify a subset of patients who may benefit from evaluation by TVUS and CA125 and who were not picked up in the highest quality studies to date (PLCO and UKCTOCS)?
• Can we identify an algorithm that improves on current methods for calculating the risk of undetected ovarian cancer?
(2) Generate the best ensemble risk models from a variety of constituent models built with a variety of techniques.
(3) Employ advanced data preprocessing techniques to put the data in a format that can be properly analyzed by the techniques mentioned above. Missing data is a major and frequent problem. We have developed sophisticated data imputation techniques for handling missing data that go beyond what existing software packages offer. Our advanced preprocessing and data imputation techniques will also be key differentiators that enhance the value of this research.
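As a rough illustration of how aims (2) and (3) fit together, the sketch below chains a baseline imputation step into a stacked ensemble. It rests on several assumptions: the data are synthetic, values are removed completely at random at an arbitrary 15% rate, and the imputer is scikit-learn's stock median imputation with missingness indicators. It is not the authors' own imputation technique, which the text describes as going beyond existing packages; it only shows where such a step would sit in the workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (assumption), with 15% of entries removed at random.
X, y = make_classification(
    n_samples=1500, n_features=12, n_informative=5,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.3, stratify=y, random_state=0
)

# Constituent models of different kinds; a logistic regression combines
# their cross-validated predicted probabilities into one risk score.
stack = StackingClassifier(
    estimators=[
        ("logit", make_pipeline(StandardScaler(),
                                LogisticRegression(max_iter=1000))),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("boosting", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Baseline imputation: fill with training-set medians and append binary
# columns flagging which values were originally missing.
model = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    stack,
)
model.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"held-out ROC AUC = {test_auc:.3f}")
```

Keeping imputation inside the pipeline means the fill values are learned from the training rows only, so the held-out estimate is not contaminated by test-set statistics. A more elaborate imputer could replace the `SimpleImputer` step without changing the rest of the sketch.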
Arun Iyengar, Ph.D., Thomas J. Watson Research Center, Yorktown Heights, NY USA
Robert C. Knapp, MD, William H. Baker Professor of Gynecology, Emeritus, Harvard Medical School