Principal Investigator
Name
Isaac Schwabe
Degrees
Student
Institution
University of Reading
Position Title
Undergraduate
Email
About this CDAS Project
Study
PLCO
Project ID
PLCO-1801
Initial CDAS Request Approval
Jan 27, 2025
Title
Predicting Pancreatic Cancer Using Machine Learning and Neural Networks
Summary
This project aims to evaluate how well different machine learning and neural network models predict a person's likelihood of developing pancreatic cancer from patient-level attributes found in clinical datasets. Comparing the models will also inform the use of soft voting, which combines the outputs of multiple models into a single prediction.

Pancreatic cancer is rarely found in its early stages, when it has the highest chance of being cured. This is because the cancer does not generally present symptoms until the tumour has metastasized. Predicting the likelihood of developing pancreatic cancer could improve early detection by enabling screening of those at higher risk of the disease.

Machine learning and neural networks can be used to develop prediction methodologies that aid and improve upon existing risk assessment techniques. These methodologies can identify relationships between patient-level attributes (such as age, ethnicity, sex and prior medical history) and the risk of developing pancreatic cancer. Using this knowledge, techniques such as feature selection and hyperparameter tuning will be employed to increase predictive accuracy.

The outcomes of this project will indicate the efficacy and feasibility of using machine learning and neural network models to create prediction methods that could aid in the early detection of pancreatic cancer.
Aims

1. Exploratory Data Analysis
- Become familiar with existing datasets and identify trends, anomalies and areas that may require cleaning or preprocessing
- Review existing pancreatic cancer datasets and assess their sizes and completeness to ensure that they meet the requirements for machine learning tasks
- Review existing pancreatic cancer prediction models and techniques
- Look for initial patterns and relationships in existing datasets by developing data visualizations and applying statistical techniques

2. Data Analysis and Preprocessing
- Ensure a unified, clean and high-quality dataset for model training
- Handle missing values through imputation techniques
- Normalize or standardize features where required for models sensitive to scale, for example neural networks
- Perform more advanced data visualization and statistical analysis to identify more nuanced relationships and patterns between patient attributes and the risk of developing pancreatic cancer
- Employ dimensionality reduction techniques if required to handle high-dimensional data
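
The imputation and standardization steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, assuming mean imputation and z-score scaling; the project does not state which imputation or scaling method it will use, and the `ages` values are hypothetical, not PLCO data.

```python
from statistics import mean, pstdev

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no scale information.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

# Hypothetical ages with one missing entry (illustrative only).
ages = [50, 60, None, 70]
ages_filled = impute_mean(ages)
ages_scaled = standardize(ages_filled)
```

In a real pipeline the fill value, mean and standard deviation would be computed on the training split only and then reused on the validation and test splits, so that no information leaks from held-out data.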

3. Machine Learning Training
- Create baseline machine learning models and optimize them to predict the risk of pancreatic cancer
- Develop a cohort of baseline machine learning models
- Use cross-validation to ensure robust evaluation of machine learning models
- Compare and contrast models using evaluation metrics such as F1-Score, AUROC, Precision/Recall and Accuracy
- Apply feature engineering techniques and hyperparameter tuning to optimize the models
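
As a rough sketch of the evaluation step, the metrics named above can be computed from binary labels and scores as shown below, alongside a simple k-fold splitter. The function names and the example labels are hypothetical, and the contiguous folds are an assumption made for brevity; shuffled or stratified folds may suit the data better.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

def auroc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Hypothetical labels, hard predictions and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
metrics = classification_metrics(y_true, y_pred)
area = auroc(y_true, scores)
folds = list(k_fold_indices(len(y_true), 3))
```

When positive cases are rare, accuracy alone can look high for a model that never predicts the positive class, which is one reason to report it together with precision, recall, F1 and AUROC.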

4. Neural Network Training
- Create baseline deep learning models
- Develop and train a cohort of neural network models to predict the risk of pancreatic cancer on the aggregated dataset
- Apply dropout and batch normalization to prevent overfitting and improve generalization
- Apply early stopping and learning rate schedulers to refine the training process
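
The early stopping and learning rate scheduling mentioned above can be illustrated independently of any deep learning framework. The sketch below assumes a patience-based stopping rule and a step-decay schedule; the project does not specify either choice, and the function names, default values and loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training ran to the last epoch.
    return len(val_losses) - 1, best_epoch

def step_decay(initial_lr, epoch, drop=0.5, every=10):
    """Step schedule: multiply the learning rate by `drop` every `every` epochs."""
    return initial_lr * (drop ** (epoch // every))

# Hypothetical validation losses: improvement stalls after epoch 2.
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
lr_at_epoch_25 = step_decay(0.01, epoch=25)
```

Here training would halt at epoch 5 and the weights from epoch 2 would be restored, so the later improvement at epoch 6 is never seen. That is the trade-off a small patience value makes.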

5. Compare and contrast models and metrics
- Identify the best performing models based on shared evaluation criteria
- Discuss the benefits and trade-offs of each model
- Highlight interpretability differences between models
- Utilise SHAP to aid model explainability and to explore relationships between input variables and the target attribute
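
SHAP attributes a prediction to individual input features using Shapley values. The sketch below is not the SHAP library's API. It computes exact Shapley values by enumerating every feature coalition, which is feasible only for a handful of features. The `toy_risk` function, its coefficients and the example inputs are hypothetical and carry no clinical meaning.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n in the number of features n.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi

def toy_risk(x):
    """Hypothetical score with one interaction term (arbitrary coefficients)."""
    age, smoker, diabetes = x
    return 0.01 * age + 0.2 * smoker + 0.1 * diabetes + 0.05 * smoker * diabetes

instance = [70, 1, 1]   # hypothetical patient
baseline = [60, 0, 0]   # hypothetical reference point
phi = shapley_values(toy_risk, instance, baseline)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved.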

6. Develop ensemble model
- Combine the strengths of multiple models to improve overall performance of predicting pancreatic cancer risk
- Test multiple ensemble strategies such as weighted averages, stacking, bagging and boosting techniques
- Evaluate these ensemble models through cross-validation
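
Soft voting, as described in the summary, and the weighted averaging listed above both reduce to averaging the class probabilities predicted by each model. A minimal sketch follows, assuming each model outputs a positive-class probability per patient; the model names, probabilities, weights and the 0.5 decision threshold are hypothetical.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-model positive-class probabilities.

    probabilities: one inner list per model, one entry per patient.
    weights: optional per-model weights; equal weights if omitted.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    n_samples = len(probabilities[0])
    return [
        sum(w * model[i] for w, model in zip(weights, probabilities)) / total
        for i in range(n_samples)
    ]

# Hypothetical outputs of three models for three patients.
model_a = [0.9, 0.2, 0.6]
model_b = [0.7, 0.4, 0.4]
model_c = [0.8, 0.1, 0.3]

averaged = soft_vote([model_a, model_b, model_c])
weighted = soft_vote([model_a, model_b, model_c], weights=[2, 1, 1])
labels = [1 if p >= 0.5 else 0 for p in averaged]
```

The weights could, for example, be derived from each model's cross-validated performance. Libraries such as scikit-learn offer a comparable option through `VotingClassifier` with `voting="soft"`.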

Collaborators

N/A