Development and Validation of Cancer Risk Prediction Models Using Publicly Available Cohort Data
Lung cancer remains the leading cause of cancer mortality in the United States, with over 125,000 deaths annually. While low-dose CT screening has been shown to reduce mortality, current screening guidelines are based primarily on age and smoking history, leaving many high-risk individuals unidentified and contributing to inequities in early detection. PreOncology aims to develop and validate an integrated, multi-modal lung cancer risk prediction model that leverages demographic, lifestyle, genetic, medical, and social determinants of health (SDOH) features to enable ultra-early detection, equitable prevention, and precision screening strategies.
The central hypothesis is that combining traditional epidemiologic risk factors with genomic and SDOH data will significantly improve the discrimination, calibration, and clinical utility of lung cancer risk prediction across diverse U.S. populations.
Specific Aim 1. Harmonize and integrate multi-cohort data for lung cancer risk modeling.
We will curate and link data from large prospective cohorts and screening trials (e.g., NLST, MEC, WHI, All of Us), standardizing demographic, lifestyle, clinical, genetic, and SDOH variables. Harmonization will include development of a reproducible ETL pipeline to align coding systems and exposure measures (ICD diagnosis codes, census FIPS geographic codes, smoking pack-years), define consistent index dates and look-back windows, and generate derived variables (e.g., BMI trajectories, screening adherence).
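As an illustration of the kind of harmonization rule the ETL pipeline would encode, the following is a minimal sketch, not the project's implementation. It assumes the conventional definition of a pack-year (20 cigarettes per day for one year) and a look-back window counted in days before the index date; the function names `pack_years` and `in_lookback` are hypothetical.

```python
from datetime import date, timedelta

# Conventional pack size used in the pack-year definition (assumption stated in the lead-in).
CIGARETTES_PER_PACK = 20


def pack_years(cigs_per_day: float, years_smoked: float) -> float:
    """Convert cigarettes per day and smoking duration into pack-years."""
    if cigs_per_day < 0 or years_smoked < 0:
        raise ValueError("smoking intensity and duration must be non-negative")
    return (cigs_per_day / CIGARETTES_PER_PACK) * years_smoked


def in_lookback(event_date: date, index_date: date, window_days: int) -> bool:
    """Return True if event_date falls in the window_days before index_date.

    The index date itself is excluded so that covariates are measured
    strictly before follow-up starts.
    """
    window_start = index_date - timedelta(days=window_days)
    return window_start <= event_date < index_date


if __name__ == "__main__":
    # One pack per day for 30 years.
    print(pack_years(20, 30))
    # A diagnosis recorded five months before the index date, 365-day window.
    print(in_lookback(date(2020, 1, 1), date(2020, 6, 1), 365))
```

Cohorts that record smoking as packs per day or in categorical bands would need their own conversion step before a shared function like this could be applied.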
Specific Aim 2. Develop and compare traditional and machine learning risk models.
We will implement and benchmark multiple approaches, including Cox proportional hazards, random survival forests, gradient boosting survival models, and stacked meta-learners combining published risk scores with raw features. Model evaluation will focus on discrimination (C-index, time-dependent AUC), calibration (observed vs. predicted risk), and clinical utility (decision curve analysis). We will also assess the proportional hazards assumption and incorporate time-varying covariates where needed.
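To make two of the evaluation metrics concrete, the sketch below computes Harrell's C-index and the net benefit used in decision curve analysis. It is a simplified illustration under stated assumptions: pairs with tied event times are skipped, tied risk scores count as half-concordant, and net benefit is computed on a binary outcome at a fixed horizon with censoring ignored. The function names are hypothetical, and a production analysis would more likely rely on an established survival analysis library.

```python
from typing import Sequence


def harrell_c_index(times: Sequence[float], events: Sequence[int],
                    risks: Sequence[float]) -> float:
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i has an observed event and
    a strictly shorter follow-up time than subject j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # censored subjects cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def net_benefit(outcomes: Sequence[int], probs: Sequence[float],
                threshold: float) -> float:
    """Net benefit at one risk threshold: TP/n - FP/n * pt / (1 - pt)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcomes)
    tp = sum(1 for y, p in zip(outcomes, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(outcomes, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))


if __name__ == "__main__":
    # Toy data for illustration only, not study results.
    print(harrell_c_index([1, 2, 3, 4], [1, 1, 0, 1], [0.9, 0.1, 0.5, 0.7]))
    print(net_benefit([1, 0, 1, 0], [0.8, 0.6, 0.3, 0.1], 0.2))
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing the model against the "screen all" and "screen none" strategies.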
Specific Aim 3. Validate, recalibrate, and prepare the model for clinical translation.
We will perform external validation in independent cohorts, with subgroup analyses by race/ethnicity, sex, and socioeconomic strata. To ensure population-level relevance, we will recalibrate models to SEER-based incidence rates and evaluate generalizability across health systems. Outputs will include a portable, transparent pipeline (scikit-learn/XGBoost + calibration layer) and documentation suitable for regulatory review and future clinical integration.
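One simple way to recalibrate predictions to a population incidence rate is an intercept shift on the logit scale (often called calibration-in-the-large). The sketch below is an illustration of that general technique, not the method the project has committed to: it finds, by bisection, the single offset that makes the mean predicted risk equal a target rate. The function name `recalibrate_to_incidence` is hypothetical, and the target value in the example is a placeholder, not an actual SEER figure.

```python
import math
from typing import List, Sequence, Tuple


def _logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def _expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def recalibrate_to_incidence(preds: Sequence[float], target_incidence: float,
                             tol: float = 1e-10) -> Tuple[List[float], float]:
    """Shift all predictions by one logit offset so their mean matches the target.

    Returns the recalibrated risks and the offset. The rank ordering of
    individuals is unchanged, so discrimination is unaffected.
    """
    if not 0.0 < target_incidence < 1.0:
        raise ValueError("target_incidence must lie strictly between 0 and 1")
    eps = 1e-12  # keep probabilities away from 0 and 1 before taking logits
    logits = [_logit(min(max(p, eps), 1.0 - eps)) for p in preds]

    lo, hi = -30.0, 30.0  # search bounds for the offset
    for _ in range(200):
        mid = (lo + hi) / 2.0
        mean_risk = sum(_expit(z + mid) for z in logits) / len(logits)
        # Mean risk increases monotonically with the offset, so bisection applies.
        if mean_risk < target_incidence:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    offset = (lo + hi) / 2.0
    return [_expit(z + offset) for z in logits], offset


if __name__ == "__main__":
    # Placeholder target rate for illustration; not a SEER estimate.
    risks, offset = recalibrate_to_incidence([0.01, 0.02, 0.05, 0.10], 0.03)
    print(offset, sum(risks) / len(risks))
```

An intercept shift corrects only systematic over- or under-prediction; if the spread of predicted risks is also miscalibrated, a slope term or stratum-specific offsets (for example by age and sex) would be needed.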
Impact: Completion of these aims will yield a validated, equitable, and actionable lung cancer risk model that can support both clinical decision-making and public health strategies, ultimately enabling earlier detection, personalized screening recommendations, and reduction of lung cancer disparities.
Luke Stetson PreOncology
Jose Barreau, MD PreOncology
Steve Smerz PreOncology
Joe Fago PreOncology