Principal Investigator
Name
Luke Stetson
Degrees
PhD
Institution
PreOncology
Position Title
Director of Cancer Risk Modeling and Translational Strategy
About this CDAS Project
Study
PLCO
Project ID
PLCO-1986
Initial CDAS Request Approval
Sep 22, 2025
Title
Development and Validation of Cancer Risk Prediction Models Using Publicly Available Cohort Data
Summary
PreOncology Lung Cancer Risk Model Project

Lung cancer remains the leading cause of cancer mortality in the United States, with over 125,000 deaths annually. While low-dose CT screening has been shown to reduce mortality, current screening guidelines are based primarily on age and smoking history, leaving many high-risk individuals unidentified and contributing to inequities in early detection. PreOncology aims to develop and validate an integrated, multi-modal lung cancer risk prediction model that leverages demographic, lifestyle, genetic, medical, and social determinants of health (SDOH) features to enable ultra-early detection, equitable prevention, and precision screening strategies.

The central hypothesis is that combining traditional epidemiologic risk factors with genomic and SDOH data will significantly improve the discrimination, calibration, and clinical utility of lung cancer risk prediction across diverse U.S. populations.
Aims

Specific Aim 1. Harmonize and integrate multi-cohort data for lung cancer risk modeling.

We will curate and link data from large prospective cohorts and screening trials (e.g., NLST, MEC, WHI, All of Us), standardizing demographic, lifestyle, clinical, genetic, and SDOH variables. Harmonization will include developing a reproducible ETL pipeline to align coding systems and units (ICD codes, census FIPS codes, smoking pack-years), define consistent index dates and look-back windows, and generate derived variables (e.g., BMI trajectories, screening adherence).
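
As an illustration of the kind of harmonization step described above, the following is a minimal sketch rather than the project's actual pipeline. The function names, the 20-cigarettes-per-pack convention, and the choice to exclude events that fall on the index date itself are assumptions made for the example.

```python
from datetime import date, timedelta

# Assumption: one pack is 20 cigarettes, the usual pack-year convention.
CIGARETTES_PER_PACK = 20


def pack_years(cigarettes_per_day: float, years_smoked: float) -> float:
    """Convert daily cigarette count and smoking duration to pack-years."""
    return cigarettes_per_day / CIGARETTES_PER_PACK * years_smoked


def county_fips(state_code, county_code) -> str:
    """Build a 5-digit county FIPS code (2-digit state + 3-digit county).

    Accepts ints or zero-padded strings so that cohorts storing the code
    differently end up with the same representation.
    """
    return f"{int(state_code):02d}{int(county_code):03d}"


def in_lookback(event_date: date, index_date: date, window_days: int) -> bool:
    """True if the event falls in the look-back window before the index date.

    Assumption: the window is [index_date - window_days, index_date), so an
    event recorded on the index date itself is not counted as history.
    """
    window_start = index_date - timedelta(days=window_days)
    return window_start <= event_date < index_date
```

For example, `pack_years(20, 30)` gives 30.0 pack-years, and `county_fips(6, 37)` and `county_fips("06", "037")` both give `"06037"`.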

Specific Aim 2. Develop and compare traditional and machine learning risk models.

We will implement and benchmark multiple approaches, including Cox proportional hazards, random survival forests, gradient boosting survival models, and stacked meta-learners combining published risk scores with raw features. Model evaluation will focus on discrimination (C-index, time-dependent AUC), calibration (observed vs. predicted risk), and clinical utility (decision curve analysis). We will also assess the proportional hazards assumption and incorporate time-varying covariates where needed.
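
Two of the evaluation metrics named above can be sketched in plain Python. This is an illustrative sketch under simplifying assumptions, not the project's evaluation code: the C-index below is Harrell's version and ignores pairs with tied follow-up times, and the net benefit treats the outcome as binary at a fixed horizon without adjusting for censoring. The function names are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i had an observed event and a
    strictly shorter follow-up time than subject j. The pair is concordant
    when i also has the higher predicted risk; tied risks count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def net_benefit(outcomes, predicted_risks, threshold):
    """Net benefit at one risk threshold, as used in decision curve analysis.

    NB = TP/n - FP/n * threshold / (1 - threshold), where subjects with
    predicted risk >= threshold are treated as test-positive.
    """
    n = len(outcomes)
    pairs = list(zip(outcomes, predicted_risks))
    true_pos = sum(1 for y, p in pairs if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in pairs if p >= threshold and y == 0)
    return true_pos / n - false_pos / n * threshold / (1 - threshold)
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing the result with the "screen everyone" and "screen no one" strategies.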

Specific Aim 3. Validate, recalibrate, and prepare the model for clinical translation.

We will perform external validation in independent cohorts, with subgroup analyses by race/ethnicity, sex, and socioeconomic strata. To ensure population-level relevance, we will recalibrate models to SEER-based incidence rates and evaluate generalizability across health systems. Outputs will include a portable, transparent pipeline (scikit-learn/XGBoost + calibration layer) and documentation suitable for regulatory review and future clinical integration.
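
One simple form of the recalibration step described above is calibration-in-the-large: shifting every prediction by a constant on the logit scale until the mean predicted risk matches a target incidence. The sketch below is an assumption about how such a layer might look, not the project's method; the function name is hypothetical, and any target value passed in is supplied by the caller (the example does not contain SEER figures).

```python
import math


def _logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def _expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def recalibrate_to_incidence(risks, target_mean, tol=1e-10):
    """Shift predictions on the logit scale so their mean equals target_mean.

    Inputs must lie strictly between 0 and 1. Returns (shift, adjusted_risks).
    The mean adjusted risk is monotone in the shift, so bisection converges.
    A constant logit shift preserves the ranking of individuals, so
    discrimination (e.g. the C-index) is unchanged; only calibration moves.
    """
    logits = [_logit(p) for p in risks]
    lo, hi = -20.0, 20.0  # search bounds for the intercept shift
    for _ in range(200):
        mid = (lo + hi) / 2.0
        mean_risk = sum(_expit(z + mid) for z in logits) / len(logits)
        if mean_risk < target_mean:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    shift = (lo + hi) / 2.0
    return shift, [_expit(z + shift) for z in logits]
```

In practice the shift would typically be estimated separately within strata (for example by age group and sex) so that each stratum matches its own reference incidence rate.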

Impact: Completion of these aims will yield a validated, equitable, and actionable lung cancer risk model that can support both clinical decision-making and public health strategies, ultimately enabling earlier detection, personalized screening recommendations, and reduction of lung cancer disparities.

Collaborators

Luke Stetson, PreOncology
Jose Barreau, MD, PreOncology
Steve Smerz, PreOncology
Joe Fago, PreOncology