Development and Validation of Cancer Risk Prediction Models Using Publicly Available Cohort Data

Principal Investigator

Name
Luke Stetson

Degrees
PhD

Institution
PreOncology

Position Title
Director of Cancer Risk Modeling and Translational Strategy

Email
luke.stetson@preoncology.com

About this CDAS Project

Study
PLCO

Project ID
PLCO-1986

Initial CDAS Request Approval
Sep 22, 2025

Title
Development and Validation of Cancer Risk Prediction Models Using Publicly Available Cohort Data

Summary
PreOncology Cancer Risk Model Project

Lung cancer remains the leading cause of cancer mortality in the United States, with over 125,000 deaths annually. While low-dose CT screening has been shown to reduce lung cancer mortality, current screening guidelines are based primarily on age and smoking history, leaving many high-risk individuals unidentified and contributing to inequities in early detection. PreOncology aims to develop and validate an integrated, multi-modal cancer risk prediction model that leverages demographic, lifestyle, genetic, medical, and social determinants of health (SDOH) features to enable ultra-early detection, equitable prevention, and precision screening strategies.

The central hypothesis is that combining traditional epidemiologic risk factors with genomic and SDOH data will significantly improve the discrimination, calibration, and clinical utility of cancer risk prediction across diverse U.S. populations.

Aims

Specific Aim 1. Harmonize and integrate multi-cohort data for cancer risk modeling.

We will curate and link data from large prospective cohorts and screening trials (e.g., NLST, MEC, WHI, All of Us), standardizing demographic, lifestyle, clinical, genetic, and SDOH variables. Harmonization will include development of a reproducible ETL pipeline to align coding systems and units (ICD codes, census FIPS codes, smoking pack-years), define consistent index dates and look-back windows, and generate derived variables (e.g., BMI trajectories, screening adherence).
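As an illustration of two of the harmonization steps named above, the following is a minimal sketch, not the project's actual pipeline. It assumes the conventional definition of a pack-year (20 cigarettes per day for one year) and a half-open look-back window ending at the index date. The function names `pack_years` and `in_lookback` are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

CIGARETTES_PER_PACK = 20  # conventional definition of one pack


def pack_years(cigs_per_day: Optional[float],
               years_smoked: Optional[float]) -> Optional[float]:
    """Pack-years = (cigarettes per day / 20) * years smoked.

    Returns None when either input is missing, so that missingness is
    carried forward rather than silently imputed.
    """
    if cigs_per_day is None or years_smoked is None:
        return None
    if cigs_per_day < 0 or years_smoked < 0:
        raise ValueError("smoking inputs must be non-negative")
    return cigs_per_day / CIGARETTES_PER_PACK * years_smoked


def in_lookback(event_date: date, index_date: date, lookback_days: int) -> bool:
    """True if the event falls in [index_date - lookback_days, index_date).

    The window excludes the index date itself (an assumption made here),
    so covariates are measured strictly before follow-up starts.
    """
    start = index_date - timedelta(days=lookback_days)
    return start <= event_date < index_date
```

For example, `pack_years(10, 20)` gives 10 pack-years. Applying a single definition like this across cohorts is one way to make smoking exposure comparable when source studies record it in different units.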

Specific Aim 2. Develop and compare traditional and machine learning risk models.

We will implement and benchmark multiple approaches, including Cox proportional hazards, random survival forests, gradient boosting survival models, and stacked meta-learners combining published risk scores with raw features. Model evaluation will focus on discrimination (C-index, time-dependent AUC), calibration (observed vs predicted), and clinical utility (decision curve analysis). We will also assess the proportional hazards assumption and incorporate time-varying covariates where needed.
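To make two of the evaluation metrics concrete, here is a small pure-Python sketch under simplifying assumptions. `harrell_c_index` implements Harrell's concordance index and skips pairs with tied event times. `net_benefit` computes the decision-curve net benefit at one risk threshold for a binary outcome at a fixed horizon, ignoring censoring. Both function names are illustrative; a real analysis would more likely use an established survival analysis library.

```python
def harrell_c_index(times, events, risks):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i has an observed event
    strictly before subject j's time. Tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def net_benefit(outcomes, probs, threshold):
    """Net benefit = TP/n - FP/n * pt / (1 - pt) at risk threshold pt."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(outcomes)
    tp = sum(1 for y, p in zip(outcomes, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(outcomes, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)
```

Evaluating `net_benefit` over a grid of thresholds, and comparing against "screen everyone" and "screen no one" strategies, yields the decision curve. The C-index summarizes ranking quality only, so it is read alongside calibration rather than in place of it.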

Specific Aim 3. Validate, recalibrate, and prepare the model for clinical translation.

We will perform external validation in independent cohorts, with subgroup analyses by race/ethnicity, sex, and socioeconomic strata. To ensure population-level relevance, we will recalibrate models to SEER-based incidence rates and evaluate generalizability across health systems. Outputs will include a portable, transparent pipeline (scikit-learn/XGBoost + calibration layer) and documentation suitable for regulatory review and future clinical integration.
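One simple form of recalibration to a population incidence rate is an intercept shift on the logit scale, often called calibration-in-the-large. The sketch below is an assumption about how such a calibration layer could work, not the project's stated method. The function name `recalibrate_to_target` is hypothetical, and the target value must come from an external source such as SEER; none is supplied here.

```python
import math


def _logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def _expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def recalibrate_to_target(probs, target_mean, tol=1e-10, max_iter=200):
    """Shift all predictions by one logit-scale constant so that the mean
    predicted risk matches target_mean. Risk ranking is unchanged.

    Assumes every probability lies strictly between 0 and 1.
    Returns (recalibrated_probs, shift).
    """
    logits = [_logit(p) for p in probs]
    lo, hi = -20.0, 20.0  # search bounds for the intercept shift
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        mean = sum(_expit(z + mid) for z in logits) / len(logits)
        if abs(mean - target_mean) < tol:
            break
        # Mean risk increases monotonically with the shift, so bisect.
        if mean < target_mean:
            lo = mid
        else:
            hi = mid
    shift = (lo + hi) / 2.0
    return [_expit(z + shift) for z in logits], shift
```

Because the shift is a single constant, discrimination (for example the C-index) is preserved while the average predicted risk moves to the target. If the relationship between predicted and observed risk differs by subgroup, a slope term or subgroup-specific recalibration would be needed instead.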

Impact: Completion of these aims will yield a validated, equitable, and actionable cancer risk model that can support both clinical decision-making and public health strategies, ultimately enabling earlier detection, personalized screening recommendations, and reduction of cancer disparities.

Collaborators

Luke Stetson (PreOncology)
Jose Barreau, MD (PreOncology)
Steve Smerz (PreOncology)
Joe Fago (PreOncology)