Principal Investigator
Name
Khyati Shah
Degrees
PGDDS, B.E. (Electronics)
Institution
Liverpool John Moores University, UK
Position Title
MS Student
Email
About this CDAS Project
Study
PLCO
Project ID
PLCO-628
Initial CDAS Request Approval
May 20, 2020
Title
Interpretability of machine learning based prediction models for ovarian cancer classification using PLCO dataset
Summary
Ovarian cancer is the 7th most common cancer and the 8th most common cause of cancer death among women worldwide (The World Ovarian Cancer Coalition Atlas). While a breakthrough in early detection of the disease might be on the horizon, projected incidence and mortality remain high: The American Cancer Society estimates that in the United States in 2020, about 21,750 women will receive a new diagnosis of ovarian cancer and about 13,940 women will die from the disease.

Numerous studies have sought strategies to improve ovarian cancer detection and survival. Some of these studies demonstrated how machine learning models can classify patients into cancer vs. non-cancer categories. It is worth exploring and analyzing the factors behind this stratification:
1. What were the differentiating factors that classified cancer cases vs non-cancer cases?
2. Do multiple machine learning models stratify the cases in a similar way?
3. What features across models influence the prediction?
4. When a particular record is classified as cancer, what are the driving features of prediction for that given data point?

To address these questions, we need to evaluate multiple interpretable models for ovarian cancer diagnosis. The purpose of this research is therefore to use the comprehensive PLCO Ovarian Cancer dataset to examine different methods for explaining machine learning prediction models.

Generally, there is a trade-off between the performance of a model and its interpretability: the better a model performs, the more complex it tends to be, and the less interpretable it becomes.
Various models such as Logistic Regression, XGBoost, and Random Forest will be evaluated on performance metrics such as accuracy, AUC, and F1 score; the main focus of this research, however, is to analyse techniques and frameworks for interpreting the best-performing ML models (see the evaluation sketch below).
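As a rough illustration of this evaluation step, the sketch below trains the three candidate model types and reports accuracy, AUC, and F1 on a held-out split. The synthetic make_classification data is an assumed stand-in for the actual PLCO feature matrix and cancer labels; it is not the project's real pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic, imbalanced stand-in for the PLCO features and cancer labels.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

# Fit each model and compare the three headline metrics on the held-out set.
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.3f} "
          f"AUC={roc_auc_score(y_test, proba):.3f} "
          f"F1={f1_score(y_test, pred):.3f}")
```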
Two aspects of model interpretability will be explored and examined for ovarian cancer stratification (see the sketch after this list):
1. Global interpretation – determine the most influential features in cancer prediction. The aim is to identify a core subset of features that discriminate cancer vs. non-cancer cases using model properties and methods such as RFE (Recursive Feature Elimination) and feature_importances_.
2. Local interpretation – examine the interpretability of individual predictions through model-agnostic methods such as LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations).
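To make the two aspects concrete, the sketch below derives a global ranking from a Random Forest's feature_importances_ and from RFE, then explains a single record locally with SHAP's TreeExplainer (LIME could be substituted for that step). The synthetic data and the choice of Random Forest as the fitted model are assumptions of this example only.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PLCO feature matrix and cancer labels.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

# Global interpretation (a): rank features by the forest's impurity-based
# feature_importances_ attribute.
ranked = np.argsort(rf.feature_importances_)[::-1]
print("Most influential feature indices:", ranked[:5])

# Global interpretation (b): RFE shrinks the model to a core subset of
# discriminating features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_train, y_train)
print("RFE-selected feature indices:", np.where(rfe.support_)[0])

# Local interpretation: SHAP attributes one record's prediction to individual
# features (the shape of the returned values varies across shap versions).
explainer = shap.TreeExplainer(rf)
contributions = explainer.shap_values(X_test[:1])
print("Per-feature SHAP contributions for one record:", contributions)
```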
Aims

1. Explore different frameworks for ML model explainability by:
i. Building multiple machine learning prediction models (e.g., Logistic Regression, Random Forest, XGBoost)
ii. Examining two aspects of interpretability: global interpretation and local interpretation
2. Identify the differentiating factors that classify ovarian cancer cases vs. non-cancer cases
3. Reap the benefits of interpretability, such as directing future data collection, informing human decision-making, and nurturing enduring trust in machine learning.

Collaborators

None