Development of a Multimodal AI Model for Predicting Future Lung Cancer Risk from Chest CTs in Asymptomatic Individuals
Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Although screening efforts have largely focused on individuals with heavy smoking histories, recent epidemiological trends show a significant rise in lung cancer incidence among never-smokers and light smokers. These cases often fall outside current screening guidelines, leading to delayed diagnoses and poorer prognoses. This shift underscores the pressing need for novel approaches that enable earlier and more individualized assessment of lung cancer risk in broader populations.
Traditional risk models based on demographic and clinical variables have demonstrated some utility in risk stratification. However, their predictive performance, particularly among low-risk individuals, remains suboptimal. Advances in artificial intelligence (AI) and machine learning, especially the emergence of multimodal foundation models, offer a promising solution to this challenge. These models are capable of learning from and integrating diverse data types, including structured clinical data and unstructured medical images, to enhance prediction accuracy across a range of tasks.
Recent studies have shown that incorporating image-based information can significantly improve performance for disease risk prediction tasks. One notable example is the Mirai model, which combines mammography images with clinical variables to predict future breast cancer risk.
Proposal
We propose to develop, train, and validate an image-based risk prediction model that estimates the risk of lung cancer being diagnosed 1 to 6 years after a baseline chest CT exam. This model will integrate patient-level clinical data and chest CT images to estimate personalized, calibrated risk scores.
Methods
Model Development
The proposed AI model will take as input a baseline chest CT and the available clinical variables described under Input Data. The model will consist of two modules: an image encoder and a risk prediction module. The risk prediction module will use patient data and image features to produce multitask outputs: lung cancer risk, age, smoking history, and PLCOm2012 score. The lung cancer risk will be a continuous, calibrated probability score indicating the likelihood of a lung cancer diagnosis within 1 to 6 years post-examination.
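One way to produce risk estimates that never decrease as the horizon lengthens from 1 to 6 years is an additive-hazard output layer, similar in spirit to the one used in Mirai. The sketch below is illustrative only: the function names, the softplus transform, and the example logits are assumptions rather than part of the proposed design, and the resulting scores would still need a separate calibration step.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def cumulative_risk(base_logit: float, hazard_logits: list) -> list:
    """Map one base logit and per-year hazard logits to cumulative risks.

    Each year adds a non-negative increment to the running logit, so the
    risk for year k+1 can never be lower than the risk for year k.
    """
    running = base_logit
    risks = []
    for h in hazard_logits:
        running += softplus(h)
        risks.append(sigmoid(running))
    return risks

# Hypothetical logits for one patient, one hazard term per year (1 to 6).
risks = cumulative_risk(-5.0, [0.1, -0.3, 0.2, 0.0, -1.0, 0.5])
for year, r in enumerate(risks, start=1):
    print(f"risk within {year} year(s): {r:.4f}")
```

In a full model the base and hazard logits would come from the risk prediction module rather than being supplied by hand.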
In the first stage of training, we will use age, smoking history, and PLCOm2012 score as proxy labels. In the later stages, we will incorporate observed clinical outcomes for fine-tuning and validation.
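The staged training described above can be pictured as a weighted multitask loss whose weights change between stages. The following is a minimal sketch under stated assumptions: the stage names, the weight values, the encoding of smoking history as a binary ever-smoker label, and the expression of age in decades are all hypothetical choices made for illustration.

```python
import math

# Hypothetical per-stage task weights (assumed values, not from the proposal).
STAGE_WEIGHTS = {
    "proxy_pretraining": {"age": 1.0, "smoking": 1.0, "plco": 1.0, "cancer": 0.0},
    "outcome_finetuning": {"age": 0.1, "smoking": 0.1, "plco": 0.1, "cancer": 1.0},
}

def squared_error(pred: float, target: float) -> float:
    return (pred - target) ** 2

def binary_cross_entropy(prob: float, label: int, eps: float = 1e-7) -> float:
    p = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

LOSS_FUNCTIONS = {
    "age": squared_error,             # age in decades (assumed scaling)
    "smoking": binary_cross_entropy,  # ever-smoker yes/no (assumed encoding)
    "plco": squared_error,            # PLCOm2012 score as a probability
    "cancer": binary_cross_entropy,   # observed outcome, when available
}

def multitask_loss(pred: dict, target: dict, stage: str) -> float:
    """Weighted sum of task losses; tasks with no target or zero weight are skipped."""
    total = 0.0
    for task, weight in STAGE_WEIGHTS[stage].items():
        if weight == 0.0 or target.get(task) is None:
            continue
        total += weight * LOSS_FUNCTIONS[task](pred[task], target[task])
    return total

pred = {"age": 6.0, "smoking": 0.8, "plco": 0.02, "cancer": 0.1}
proxy_target = {"age": 6.5, "smoking": 1, "plco": 0.03, "cancer": None}
outcome_target = dict(proxy_target, cancer=0)

stage1_loss = multitask_loss(pred, proxy_target, "proxy_pretraining")
stage2_loss = multitask_loss(pred, outcome_target, "outcome_finetuning")
```

Skipping tasks whose target is missing lets the same function serve both stages, since outcome labels are not used during proxy pretraining.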
Input Data
The proposed model will accept two categories of input data:
- Clinical Variables: Age, sex, BMI, smoking history (e.g., pack-years), family history of lung cancer, and history of lung-related diseases such as tuberculosis (TB), chronic obstructive pulmonary disease (COPD), and interstitial lung disease (ILD)
- Image Data: Low-dose chest CT in DICOM format
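The two input categories could be bundled into a single record per baseline exam. The sketch below shows one possible layout; the class name, field names, and the value-plus-missing-flag encoding are assumptions for illustration, and the CT path is only stored here, not read.

```python
from dataclasses import dataclass
from typing import List, Optional

def _encode(value) -> List[float]:
    """Return [value, missing_flag]; missing values become [0.0, 1.0]."""
    if value is None:
        return [0.0, 1.0]
    return [float(value), 0.0]

@dataclass
class BaselineExam:
    """One baseline record; all field names are illustrative assumptions."""
    ct_dicom_dir: str                 # folder holding the low-dose CT DICOM series
    age_years: float
    sex: str                          # "F" or "M"
    bmi: Optional[float] = None
    pack_years: Optional[float] = None
    family_history_lung_cancer: Optional[bool] = None
    history_tb: Optional[bool] = None
    history_copd: Optional[bool] = None
    history_ild: Optional[bool] = None

    def __post_init__(self):
        if self.sex not in ("F", "M"):
            raise ValueError(f"unexpected sex code: {self.sex!r}")

    def clinical_vector(self) -> List[float]:
        """Flatten the clinical variables into a fixed-length numeric vector."""
        vec = [float(self.age_years), 1.0 if self.sex == "M" else 0.0]
        for value in (self.bmi, self.pack_years, self.family_history_lung_cancer,
                      self.history_tb, self.history_copd, self.history_ild):
            vec.extend(_encode(value))
        return vec

# Hypothetical never-smoker with COPD and several unrecorded variables.
exam = BaselineExam(ct_dicom_dir="exams/case_0001", age_years=58, sex="F",
                    pack_years=0.0, history_copd=True)
vec = exam.clinical_vector()
```

Keeping an explicit missing flag for each optional variable lets the model distinguish "not recorded" from a true zero, which matters for fields such as pack-years in never-smokers.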
Data Sources
Model development will be supported by both public and private data sources. Public datasets include the National Lung Screening Trial (NLST) and UK Biobank data, which provide longitudinal data with linked clinical variables and imaging.
Additionally, we hope to collaborate with tertiary hospitals in South Korea, including SNUH and KBSMC, to collect retrospective chest CT exams, clinical baseline characteristics, and follow-up information including biopsy results, surgical pathology, and survival data.
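Follow-up information of this kind has to be turned into training labels for each prediction horizon, and patients whose follow-up ends early contribute only partial information. The sketch below illustrates one way to derive per-year labels with a censoring mask; the function name, the cumulative "within k years" definition, and the example dates are assumptions, not a description of the planned pipeline.

```python
from datetime import date
from typing import List, Optional, Tuple

HORIZON_YEARS = 6
DAYS_PER_YEAR = 365.25

def yearly_labels(baseline: date, last_follow_up: date,
                  diagnosis: Optional[date] = None) -> Tuple[List[int], List[int]]:
    """Return (labels, mask) for horizons of 1..6 years after baseline.

    labels[k-1] is 1 if lung cancer was diagnosed within k years.
    mask[k-1] is 1 if that label is known, and 0 if the patient was
    censored (follow-up ended without a diagnosis) before year k.
    """
    observed_years = (last_follow_up - baseline).days / DAYS_PER_YEAR
    event_years = None
    if diagnosis is not None:
        event_years = (diagnosis - baseline).days / DAYS_PER_YEAR
        observed_years = max(observed_years, event_years)

    labels, mask = [], []
    for k in range(1, HORIZON_YEARS + 1):
        if event_years is not None and event_years <= k:
            labels.append(1)
            mask.append(1)
        elif observed_years >= k:
            labels.append(0)
            mask.append(1)
        else:
            labels.append(0)  # placeholder; excluded from the loss via mask
            mask.append(0)
    return labels, mask

# Hypothetical patient diagnosed about 3.4 years after baseline.
case_labels, case_mask = yearly_labels(
    date(2015, 1, 10), date(2018, 6, 1), diagnosis=date(2018, 6, 1))

# Hypothetical patient lost to follow-up after about 2.1 years, no diagnosis.
cens_labels, cens_mask = yearly_labels(date(2015, 1, 10), date(2017, 3, 1))
```

Masked horizons can then be excluded from the loss, so censored patients still contribute to the years for which their status is known.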
Added Clinical Value
The image-based risk prediction model will support early identification of high-risk individuals beyond traditional screening populations, ultimately improving patient outcomes and clinical decision-making.
This research aims to develop and validate a multimodal artificial intelligence (AI) model that predicts future lung cancer risk in asymptomatic individuals, using baseline chest CT images and integrated clinical data. We hypothesize that the synergistic combination of imaging-derived features with structured clinical variables—such as age, sex, BMI, smoking history, and comorbidities—can provide substantially improved individualized risk stratification compared to current demographic-based models. Leveraging both public datasets (e.g., NLST, UK Biobank) and retrospective cohorts from South Korean tertiary hospitals, the proposed model will output a calibrated, continuous probability score estimating the likelihood of lung cancer development within 1 to 6 years post-baseline. This model will be trained using proxy labels in early stages (e.g., PLCOm2012, smoking history) and refined with actual clinical outcomes. By identifying high-risk individuals who fall outside conventional screening criteria, such as never- and light-smokers, this work seeks to inform earlier clinical interventions and screening expansion. This study is designed with the scientific and translational rigor appropriate for submission to Radiology, Investigative Radiology, NEJM AI, and Nature Medicine, reflecting its potential to contribute meaningfully to the evolving landscape of AI-driven precision medicine.
Hyunsuk Yoo, Evom AI
Hyeonseob Nam, Evom AI
Hyunjae Lee, Evom AI