An Analysis of Design and Evaluation Metrics in Studies Using AI to Predict Future Cancer from Medical Imaging
We believe that researchers too often neglect important steps in the scientific process of designing studies that compare AI tools with alternatives such as a human reader, a currently used automated algorithm, or a combination of the two. These issues are particularly prominent because most of these studies use observational patient care data extracted from existing electronic medical records. For instance, studies using existing medical imaging data have selection bias baked in: patients must have certain risk factors and symptoms for their physician to order an imaging test and for insurance to pay for it. These data sets therefore often overrepresent patients who are at increased risk of cancer, and a model trained on them may give distorted predictions when applied to a population that the training data do not represent. Similarly, because comparisons of AI tools with the standard of care are rarely randomized between two arms, confounding bias may also impede valid comparison.
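To make the selection mechanism concrete, the minimal simulation below is a sketch under assumed effect sizes: an unrecorded symptom variable (hypothetical, as are all names and coefficients here) drives both cancer status and the decision to image, so a model fit only on imaged patients overestimates risk in the broader population even though it looks reasonable on its own case mix.

```python
# Illustrative sketch only: hypothetical effect sizes, not taken from any
# study discussed in this proposal.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200_000

# Observed risk factor (e.g., smoking history) that the model will use.
risk = rng.normal(size=n)
cancer = rng.binomial(1, 1 / (1 + np.exp(-(-4.0 + 1.2 * risk))))

# Early symptoms are more common in patients who in fact have cancer;
# the model never sees this variable, but it drives who gets imaged.
symptom = rng.binomial(1, np.where(cancer == 1, 0.6, 0.05))
imaged = rng.binomial(
    1, 1 / (1 + np.exp(-(-2.0 + 1.0 * risk + 2.5 * symptom)))
).astype(bool)

# Train only on the imaged (EMR-like) subset, then apply everywhere.
X = risk.reshape(-1, 1)
model = LogisticRegression().fit(X[imaged], cancer[imaged])
pred = model.predict_proba(X)[:, 1]

print(f"prevalence, imaged subset  : {cancer[imaged].mean():.3%}")
print(f"prevalence, full population: {cancer.mean():.3%}")
print(f"mean predicted risk, full population: {pred.mean():.3%}  <- overestimates")
print(f"AUC imaged: {roc_auc_score(cancer[imaged], pred[imaged]):.3f}   "
      f"AUC full: {roc_auc_score(cancer, pred):.3f}")
```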
We aim to conduct a systematic review of the design and evaluation metrics used in these studies to identify shortcomings and opportunities for improvement. Based on these findings, we will:
a. Summarize the existing literature in terms of common practices, study designs, and methods for inference when comparing new AI prediction tools with the standard of care, focusing in particular on how selection and confounding bias are handled.
b. Use the data set on thoracic cancer from the National Lung Screening Trial that was used to train Sybil (Mikhael et al., 2023) to evaluate how different methodological approaches, differing in both design and analytic choices, influence prediction outcomes (see the illustrative sketch after this list).
c. Time permitting, apply the best practices learned from (a) and (b) to the problem of predicting esophageal adenocarcinoma using the Yale New Haven Health system database.
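As one example of the analytic choices to be compared under (b), the sketch below contrasts a naive comparison of an AI-assisted read with the standard of care against an inverse-probability-weighted comparison when treatment assignment is not randomized. The severity variable, the routing mechanism, and the effect sizes are hypothetical assumptions for illustration; they are not results from NLST or Sybil.

```python
# Hypothetical sketch: adjusting a non-randomized AI-vs-standard-of-care
# comparison for confounding with inverse-probability-of-treatment weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 100_000

# Sicker patients are routed to the AI-assisted workflow more often,
# confounding a naive comparison of detection rates.
severity = rng.normal(size=n)
ai_read = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * severity))))
# Assumed true effect: the AI workflow adds +0.5 to the log-odds of detection.
detect = rng.binomial(
    1, 1 / (1 + np.exp(-(-2.0 + 1.0 * severity + 0.5 * ai_read)))
)

naive = detect[ai_read == 1].mean() - detect[ai_read == 0].mean()

# Propensity model for receiving the AI read, then IPT weights.
ps = LogisticRegression().fit(
    severity.reshape(-1, 1), ai_read
).predict_proba(severity.reshape(-1, 1))[:, 1]
w = np.where(ai_read == 1, 1 / ps, 1 / (1 - ps))
weighted = (np.average(detect[ai_read == 1], weights=w[ai_read == 1])
            - np.average(detect[ai_read == 0], weights=w[ai_read == 0]))

print(f"naive risk difference   : {naive:+.3f}  (confounded by severity)")
print(f"IPT-weighted difference : {weighted:+.3f}  (closer to the causal effect)")
```

Weighting is only one of several analytic choices (matching, regression adjustment, or target-trial emulation are others) whose influence on the comparison we intend to examine.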