Principal Investigator
Name
Christopher Duckworth
Degrees
Ph.D.
Institution
University of Southampton
Position Title
Senior Research Engineer
Email
About this CDAS Project
Study
PLCO (Learn more about this study)
Project ID
PLCO-1837
Initial CDAS Request Approval
Feb 20, 2025
Title
Considering explainability and privacy when developing machine learning models to predict early risk of cancer
Summary
Late diagnosis of cancer greatly increases the risk of mortality and limits treatment options. Early identification of individuals at risk is critical for improving outcomes. The Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial dataset provides a rich source of longitudinal health data that can be harnessed to predict early risk factors for a variety of cancers. This project aims to develop and optimise machine learning (ML) models leveraging PLCO data to identify individuals at high risk of developing cancer, using patient-level characteristics (e.g., demographics, medical histories) along with screening data. Our modelling will initially focus on pancreatic cancer due to its typically late diagnosis; however, we will generalise our methodology to all cancers tracked by the PLCO first-detection dataset.

Machine learning model development, however, typically focuses solely on accuracy. While risk stratification and early detection are critical, there is a clear need for modelling to also consider explainability (i.e., explainable AI (XAI) and the generation of meaningful features), privacy (e.g., the risk of re-identification) and fairness (i.e., equitable treatment of different populations). Explainability is crucial for highlighting why predictions of risk have been made (important both for trust and for providing feedback on how to reduce risk), and the risk to an individual’s privacy must be minimised as we move towards personalised medicine.

Trade-offs between accuracy, privacy and explainability are made throughout the development cycle of an ML model (e.g., which data to include in modelling, whether features engineered from the data are meaningful to the user, the choice of ML model and its inherent interpretability, and whether XAI is applied). In this project we will develop a variety of ML models and features to examine these trade-offs when predicting early risk of cancer.

This research forms part of an ongoing project, THEMIS 5.0 (https://www.themis-trust.eu/), which has received funding from the EU Horizon Europe research and innovation programme (call CL4-2022-HUMAN-02-01) under grant agreement No. 101121042, and from UK Research and Innovation under the UK government’s Horizon Europe funding guarantee.
Aims

1. Data analysis and feature engineering
- Comparison with current modelling based on other cancer datasets.
- Identification of the full set of candidate features and their correlations.
- Synthesis of additional training data (i.e., synthetic data). This offers additional privacy benefits; however, the impact on model performance must be considered.
- Feature selection. Data will be included only where it improves model performance; inclusion also carries implications for explainability and privacy.
- Consideration of several feature engineering strategies. Strategies such as dimensionality reduction can improve the performance of ML models; however, they generate features that may have no real-world meaning and therefore cannot support explanations through XAI (see the sketch after this list).
- Generalisation of methods to other types of cancer: e.g., prostate, lung, ovarian.
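
To make this trade-off concrete, below is a minimal sketch assuming scikit-learn and an entirely illustrative tabular dataset (the column names and labels are placeholders, not PLCO variables). It contrasts mutual-information feature selection, which keeps original, clinically meaningful columns, with PCA, whose components have no direct real-world interpretation:

```python
# Minimal sketch (illustrative names throughout): contrasting a feature-
# selection strategy that preserves interpretable columns with a
# dimensionality-reduction strategy whose outputs lack real-world meaning.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
# Stand-in for PLCO-style tabular data; all columns are hypothetical.
X = pd.DataFrame(
    rng.normal(size=(500, 6)),
    columns=["age", "bmi", "smoking_years", "family_history",
             "ca19_9_level", "screen_visits"],
)
y = rng.integers(0, 2, size=500)  # placeholder cancer/no-cancer label

# Strategy A: keep the k most informative *original* features. Selected
# columns retain their clinical meaning, so XAI outputs (e.g., SHAP
# values) can be read directly as risk factors.
selector = SelectKBest(mutual_info_classif, k=3).fit(X, y)
print("Selected features:", list(X.columns[selector.get_support()]))

# Strategy B: project onto principal components. Often improves model
# compactness and performance, but "PC1" has no clinical interpretation.
pca = PCA(n_components=3).fit(X)
print("Explained variance:", pca.explained_variance_ratio_.round(3))
```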

2. Machine learning
- Machine learning models of varying complexity (i.e., logistic regression, decision trees, random forests, gradient-boosted decision trees, neural networks) will be considered. A simpler model is inherently more interpretable, though possibly at the cost of performance; a minimal comparison is sketched below.
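
As a minimal sketch of this accuracy/interpretability spectrum (assuming scikit-learn; the dataset, split and metric are illustrative), the shared estimator interface lets models of increasing complexity be swapped in and compared side by side:

```python
# Minimal sketch: fitting models of increasing complexity with a shared
# interface so accuracy and interpretability can be compared side by side.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    # Coefficients are directly readable as (log-odds) risk factors.
    "logistic_regression": LogisticRegression(max_iter=1000),
    # A shallow tree can be inspected as explicit decision rules.
    "decision_tree": DecisionTreeClassifier(max_depth=4),
    # Ensembles and neural nets often score higher but need post-hoc XAI.
    "random_forest": RandomForestClassifier(n_estimators=200),
    "gradient_boosting": GradientBoostingClassifier(),
    "neural_network": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
}
for name, model in models.items():
    auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```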

3. Explainability
- Explainable AI (XAI) techniques (e.g., SHAP, counterfactuals) will be used to identify risk factors.
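
A minimal sketch of the SHAP step, assuming the `shap` package and a fitted single-output tree ensemble; the dataset and model here are illustrative stand-ins rather than the project's final pipeline:

```python
# Minimal sketch: post-hoc explanation of a tree-based risk model with
# SHAP. The dataset and model are illustrative; on real PLCO features,
# named columns make these plots directly readable as risk factors.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer is exact and fast for tree ensembles.
explainer = shap.TreeExplainer(model)
explanation = explainer(X)  # one SHAP value per feature per individual

# Global view: which features most influence predicted risk overall.
shap.plots.beeswarm(explanation)

# Local view: why one individual was flagged; positive SHAP values push
# the prediction towards higher risk (in log-odds for this model).
shap.plots.waterfall(explanation[0])
```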

4. Privacy
- ML models will be subjected to privacy experiments that quantify the relative privacy risk of each modelling strategy. These experiments will involve differential privacy and adversarial attacks.
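
As one illustration, below is a minimal sketch of a loss-threshold membership inference attack, a simple form of adversarial attack; everything here (dataset, model, attack statistic) is illustrative rather than the project's final experimental design:

```python
# Minimal sketch: a loss-threshold membership inference attack. If a
# model's per-sample loss separates training members from non-members,
# an adversary can infer who was in the training set - a privacy leak.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5,
                                            random_state=0)

# Deliberately overfit so the privacy leak is visible in this toy setup.
model = RandomForestClassifier(n_estimators=200).fit(X_in, y_in)

def per_sample_loss(model, X, y):
    """Negative log-likelihood of the true label for each sample."""
    proba = np.clip(model.predict_proba(X)[np.arange(len(y)), y], 1e-12, 1)
    return -np.log(proba)

losses = np.concatenate([per_sample_loss(model, X_in, y_in),
                         per_sample_loss(model, X_out, y_out)])
membership = np.concatenate([np.ones(len(y_in)), np.zeros(len(y_out))])

# Attack AUC of 0.5 means the attacker learns nothing; values well above
# 0.5 quantify membership leakage for a given modelling strategy.
print("Membership inference AUC:",
      roc_auc_score(membership, -losses).round(3))
```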

5. Machine learning model tracking and storage
- Metrics quantifying each model’s relative performance, explainability and privacy will be developed.
- ML models will be tracked and stored with MLflow, together with associated metadata.
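
A minimal sketch of the tracking step, assuming the `mlflow` package; the experiment name and all metric values are placeholders for the composite metrics this project will develop:

```python
# Minimal sketch: logging one model's performance / explainability /
# privacy metrics and the fitted model itself to MLflow. Metric names
# and values are placeholders, not results.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

mlflow.set_experiment("plco-early-cancer-risk")  # illustrative name
with mlflow.start_run(run_name="logistic_regression_baseline"):
    mlflow.log_params({"model": "logistic_regression",
                       "features": "baseline"})
    mlflow.log_metrics({
        "auc": 0.81,                    # placeholder performance metric
        "explainability_score": 0.90,   # placeholder XAI metric
        "membership_attack_auc": 0.52,  # placeholder privacy metric
    })
    # Store the fitted model with its metadata for later comparison.
    mlflow.sklearn.log_model(model, "model")
```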

Collaborators

Adam Harrison, University of Southampton
Steve Taylor, University of Southampton