AI-Driven Non-Invasive Biomarker Prediction for Lung Cancer Using Self-Supervised Learning on Large-Scale CT Imaging Data

Principal Investigator

Name
Tianyu Han

Degrees
Ph.D.

Institution
University of Pennsylvania

Position Title
Assistant Professor

Email
tianyu.han@pennmedicine.upenn.edu

About this CDAS Project

Study
NLST

Project ID
NLST-1409

Initial CDAS Request Approval
Mar 25, 2025

Title
AI-Driven Non-Invasive Biomarker Prediction for Lung Cancer Using Self-Supervised Learning on Large-Scale CT Imaging Data

Summary
This project aims to develop and validate an AI-based foundation model for non-invasive biomarker prediction in lung cancer using computed tomography (CT) imaging. Leveraging large-scale CT datasets from the Penn Medicine Biobank (PMBB) and the National Lung Screening Trial (NLST), we will pre-train a vision transformer (ViT) with self-supervised learning (SSL) to extract robust imaging features predictive of cancer risk, patient survival, and molecular biomarkers. By integrating efficient attention mechanisms that make full-volume CT processing tractable, we seek to uncover radiologic signatures of tumor biology without requiring repeated biopsies. Successful completion of this work will establish AI-driven imaging biomarkers for lung cancer, with the potential to transform patient stratification and treatment personalization.

Aims

Lung and bronchus cancers remain the leading cause of cancer-related mortality in the U.S. While targeted therapies and immunotherapies improve outcomes, invasive biopsies are often required for treatment decisions, posing challenges for many patients. This project aims to develop an AI-based foundation model for non-invasive biomarker prediction using CT imaging. Leveraging large-scale CT datasets from the Penn Medicine Biobank (PMBB) and the National Lung Screening Trial (NLST), we will train a vision transformer (ViT) model with self-supervised learning (SSL) to extract imaging features predictive of cancer risk, survival, and molecular biomarkers. By correlating imaging features with tumor biology, we aim to advance patient stratification and personalized treatment strategies.

AI-based risk models for lung cancer currently require extensive annotations, limiting scalability. We propose an SSL-trained 3D ViT model to learn task-agnostic imaging representations from whole CT volumes, overcoming annotation constraints. To address computational challenges, we will integrate softmax-free or dilated attention mechanisms and masked autoencoding, allowing efficient full-volume modeling while reducing redundancy.
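
As an illustration of why softmax-free attention helps with full-volume modeling, the following is a minimal sketch of one common softmax-free variant, kernelized (linear) attention. It is not the project's implementation: the elu(x) + 1 feature map, the function names, and the toy sizes are all assumptions made for this example. The point it shows is that reordering the matrix products avoids ever building the N x N attention matrix, so cost grows linearly with the number of tokens N.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a positive feature map (an assumed choice for this sketch).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Softmax-free attention in O(N * d * d_v): never forms the N x N matrix."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                  # (d, d_v) summary of all keys and values
    z = q @ k.sum(axis=0)         # (N,) per-query normalizer
    return (q @ kv) / z[:, None]

def quadratic_reference(q, k, v):
    """Same result computed the expensive way, for checking only."""
    q, k = feature_map(q), feature_map(k)
    scores = q @ k.T              # (N, N) explicit attention matrix
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((64, 8))  # 64 tokens, 8 channels (toy sizes)
k = rng.standard_normal((64, 8))
v = rng.standard_normal((64, 8))
out = linear_attention(q, k, v)
```

Both functions return the same values up to floating-point error; only the order of multiplication, and therefore the memory and compute cost, differs.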

Specific Aim 1: Develop imaging signatures of cancer risk and survival.
We will construct a 3D ViT model for whole-CT analysis, trained on over 500,000 CT scans from PMBB and NLST. Efficient attention mechanisms will enable learning of risk-relevant imaging features without excessive computational burden.
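
To make the whole-CT setting concrete, the sketch below tokenizes a toy volume into non-overlapping 3D patches and hides most of them, in the style of masked autoencoding. The volume shape, the 16-voxel patch size, the 75% mask ratio, and the helper names are hypothetical values chosen for illustration, not parameters taken from this project.

```python
import numpy as np

def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubes, one flat token each."""
    d, h, w = volume.shape
    p = patch
    if d % p or h % p or w % p:
        raise ValueError("pad or crop the volume to a multiple of the patch size")
    x = volume.reshape(d // p, p, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)       # (nd, nh, nw, p, p, p)
    return x.reshape(-1, p ** 3)            # (num_tokens, p^3)

def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted indices of visible tokens and of masked tokens."""
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    perm = rng.permutation(num_tokens)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])

rng = np.random.default_rng(0)
# Toy stand-in for a CT volume; real scans are larger and need resampling first.
volume = rng.standard_normal((64, 128, 128)).astype(np.float32)
tokens = patchify_3d(volume, patch=16)
keep_idx, mask_idx = random_mask(len(tokens), mask_ratio=0.75, rng=rng)
visible = tokens[keep_idx]                  # only these would reach the encoder
print(tokens.shape, visible.shape)          # (256, 4096) (64, 4096)
```

Even this small volume yields 256 tokens; a full-resolution scan yields far more, which is why the encoder benefits from seeing only the visible subset and from attention that scales better than quadratically.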

Specific Aim 2: Predict molecular biomarkers from CT imaging.
We will integrate imaging features with genetic biomarkers to predict mutations, gene expression, and PD-L1 status in non-small cell lung cancer (NSCLC), using vision deep learning models combined with an aggregation transformer. This approach aims to reduce the need for invasive biopsies by identifying radiologic correlates of tumor biology.
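
The aggregation step is not specified further here. As a rough, simplified stand-in, the sketch below pools a set of per-region (or per-scan) feature vectors into a single patient-level vector using single-head attention with a learned query. A full aggregation transformer would stack self-attention layers on top of this idea. All names, dimensions, and the random weights are hypothetical.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(features, query, w_key, w_value):
    """Aggregate (n_items, d) features into one (d_out,) vector.

    The weights show how much each item contributes to the pooled vector.
    """
    keys = features @ w_key                           # (n_items, d_k)
    values = features @ w_value                       # (n_items, d_out)
    scores = keys @ query / np.sqrt(keys.shape[1])    # (n_items,)
    weights = softmax(scores)
    return weights @ values, weights

rng = np.random.default_rng(0)
features = rng.standard_normal((5, 16))   # 5 image regions, 16-dim features (toy)
query = rng.standard_normal(8)            # stands in for a learned query vector
w_key = rng.standard_normal((16, 8))      # stands in for learned projections
w_value = rng.standard_normal((16, 4))
pooled, weights = attention_pool(features, query, w_key, w_value)
```

Because the pooling is a weighted sum, the result does not depend on the order of the input items, and the number of items can vary from patient to patient.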

All methods will be implemented as open-source Python packages. By applying our model to multiple datasets, we aim to establish a robust, AI-driven approach for non-invasive lung cancer biomarker profiling.

Collaborators

Prof. Daniel Truhn, RWTH Aachen University
Prof. Christos Davatzikos, University of Pennsylvania