Considering explainability and privacy when developing machine learning models to predict early risk of cancer
Machine learning (ML) model development, however, is typically focused solely on accuracy. While risk stratification and early detection are critical, there is a clear need for modelling to also consider explainability (i.e., explainable AI (XAI) and the generation of meaningful features), privacy (e.g., the risk of re-identification) and fairness (i.e., equitable treatment of different populations). Explainability is crucial for showing why a risk prediction has been made (important both for trust and for providing feedback on how to reduce risk), and the risk to an individual's privacy must be minimised as we move towards personalised medicine.
Trade-offs between accuracy, privacy and explainability are made throughout the development cycle of an ML model (e.g., which data to include in modelling, whether features engineered from the data have meaning to the user, the choice of ML model and its inherent interpretability, and whether XAI is applied). In this project we will develop a variety of ML models and features to explore these trade-offs when predicting early risk of cancer.
This research is part of an ongoing project, THEMIS 5.0 (https://www.themis-trust.eu/), which has received funding from the EU Horizon Europe research and innovation programme CL4-2022-HUMAN-02-01 under grant agreement No. 101121042, and from UK Research and Innovation under the UK government's Horizon Europe funding guarantee.
1. Data analysis and feature engineering
- Comparison with current modelling approaches that use other cancer data.
- Identifying the full list of candidate features and their correlations.
- Synthesis of additional training data (i.e., synthetic data). This offers additional privacy benefits; however, the impact on model performance must be considered (see the synthetic-data sketch after this list).
- Feature selection. Data should only be included in modelling where it improves performance; inclusion also has implications for explainability and privacy.
- Considering several feature engineering strategies. Strategies such as dimensionality reduction can improve the performance of ML models; however, they generate features that may have no real-world meaning and therefore cannot support explanations through XAI (see the feature-engineering sketch after this list, which contrasts feature selection with dimensionality reduction).
- Generalisation of methods to other types of cancer (e.g., prostate, lung, ovarian).
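As a minimal sketch of the synthetic-data step, assuming a simple Gaussian mixture as the generative model (the project may use more sophisticated generators), real records can be augmented with sampled ones:

```python
# Minimal sketch: generating synthetic training data with a Gaussian
# mixture model. The dataset and all parameter choices are illustrative
# assumptions, not decisions taken in the project.
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

# Stand-in for the real (private) training features.
X_real, _ = make_classification(n_samples=500, n_features=8, random_state=0)

# Fit a simple generative model to the real data...
gmm = GaussianMixture(n_components=5, random_state=0).fit(X_real)

# ...and sample synthetic records from it. Training on these samples
# reduces exposure of real patients' records, but any shift in the
# synthetic distribution may degrade downstream model performance.
X_synth, _ = gmm.sample(200)
print(X_synth.shape)  # (200, 8)
```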
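The following feature-engineering sketch contrasts feature selection with dimensionality reduction to illustrate the explainability trade-off above; the clinical feature names are hypothetical placeholders:

```python
# Minimal sketch: feature selection keeps named, explainable inputs,
# while dimensionality reduction (PCA) produces linear combinations
# with no direct real-world meaning. Feature names are hypothetical.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
df = pd.DataFrame(X, columns=["age", "bmi", "smoking", "alcohol",
                              "family_history", "biomarker_x"])

# Feature selection: the surviving columns retain their clinical names,
# so downstream explanations can refer to them directly.
selector = SelectKBest(mutual_info_classif, k=3).fit(df, y)
print("selected:", list(df.columns[selector.get_support()]))

# Dimensionality reduction: each component mixes all inputs, so an
# explanation in terms of "component 1" has no clinical meaning.
pca = PCA(n_components=3).fit(df)
print("component 1 loadings:",
      dict(zip(df.columns, pca.components_[0].round(2))))
```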
2. Machine learning
- Machine learning models of varying complexity (e.g., logistic regression, decision trees, random forests, gradient-boosted decision trees, neural networks) will be considered. A simpler model is inherently more interpretable, but possibly at the cost of predictive performance (see the sketch below).
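A minimal sketch of this comparison using scikit-learn stand-ins for each model family; the dataset, hyperparameters and accuracy-only scoring are placeholder assumptions:

```python
# Minimal sketch: comparing models of varying complexity on the same
# split. Dataset, hyperparameters and the accuracy-only metric are
# placeholders; the project will also score explainability and privacy.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),  # most interpretable
    "decision_tree": DecisionTreeClassifier(max_depth=4),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "gradient_boosting": GradientBoostingClassifier(),
    "neural_network": MLPClassifier(max_iter=1000),            # least interpretable
}
for name, model in models.items():
    score = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: accuracy={score:.3f}")
```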
3. Explainability
- Explainable AI (XAI) techniques (e.g., SHAP, counterfactuals) will be used to identify risk factors, as sketched below.
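A minimal sketch of applying SHAP to a tree-based risk model; the gradient-boosted model and the data are placeholder assumptions, and counterfactual generation would be a separate experiment:

```python
# Minimal sketch: SHAP values for a tree-based risk model. Model and
# data are placeholders for the real pipeline.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer is efficient for tree ensembles; each SHAP value is a
# per-feature contribution to one individual's predicted risk, which
# can be fed back to the user as "why" the prediction was made.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features drive predicted risk across the cohort.
shap.summary_plot(shap_values, X)
```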
4. Privacy
- ML models will be subject to privacy experiments that quantify the relative privacy risk of each modelling strategy. These experiments will involve differential privacy and adversarial attacks (e.g., membership inference, sketched below).
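As a minimal sketch of such an adversarial attack, the membership inference test below measures how well a model's confidence separates training records from unseen records; the data, model and AUC-based risk score are illustrative assumptions, and the differential privacy experiments would use a dedicated library (e.g., Opacus or diffprivlib):

```python
# Minimal sketch: a confidence-thresholding membership inference attack.
# If the model is much more confident on training records than on unseen
# ones, an attacker can infer who was in the (sensitive) training set.
# Data, model and the AUC-based risk score are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Attacker's signal: the model's confidence in its predicted class.
conf_members = model.predict_proba(X_tr).max(axis=1)      # in training set
conf_non_members = model.predict_proba(X_te).max(axis=1)  # not in training set

# AUC of 0.5 = attacker no better than chance; closer to 1.0 = higher
# privacy risk. This yields one quantitative privacy metric per model.
labels = np.concatenate([np.ones_like(conf_members),
                         np.zeros_like(conf_non_members)])
scores = np.concatenate([conf_members, conf_non_members])
print("membership inference AUC:", roc_auc_score(labels, scores))
```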
5. Machine learning model tracking and storage
- Metrics quantifying each model’s relative performance, explainability and privacy will be developed.
- ML models will be tracked and stored using MLflow, together with associated metadata (see the sketch below).
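A minimal sketch of tracking one run with MLflow; the experiment name and the explainability/privacy metric values are placeholders for the metrics the project will define:

```python
# Minimal sketch: logging a model and its metrics with MLflow. The
# experiment name and the explainability/privacy metric values are
# placeholders for the project's own metrics.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

mlflow.set_experiment("cancer-risk-tradeoffs")
with mlflow.start_run(run_name="logistic_regression_baseline"):
    mlflow.log_param("model_type", "logistic_regression")
    # Accuracy, explainability and privacy would each be scored by the
    # project's own metrics; the values here are placeholders.
    mlflow.log_metric("accuracy", model.score(X, y))
    mlflow.log_metric("explainability_score", 0.9)
    mlflow.log_metric("privacy_risk_auc", 0.55)
    mlflow.sklearn.log_model(model, artifact_path="model")  # model + metadata
```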
Adam Harrison, University of Southampton
Steve Taylor, University of Southampton