An Analysis of Design and Evaluation Metrics in Studies Using AI to Predict Future Cancer from Medical Imaging
We believe that researchers too often neglect important steps in the scientific process of designing studies that compare AI tools with alternatives such as a human reader, a currently used automated algorithm, or a combination of the two. These issues are particularly prominent because most of these studies use observational patient care data extracted from existing electronic medical records. For instance, studies using existing medical imaging data have selection bias baked in: patients must have certain risk factors and symptoms for their physician to order an imaging test and for insurance to pay for it. These data sets therefore often overrepresent patients who are at increased risk of cancer, and a model trained on them may give distorted predictions when applied to a population that the training data do not represent. Similarly, because comparisons of AI tools with the standard of care are rarely randomized between two arms, confounding bias may also impede valid comparison.
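To make the selection mechanism concrete, the minimal simulation below is a sketch under assumed effect sizes: an unrecorded symptom variable (hypothetical, as are all names and coefficients here) drives both cancer status and the decision to image, so a model fit only on imaged patients overestimates risk in the broader population even though it looks reasonable on its own case mix.

```python
# Illustrative sketch only: hypothetical effect sizes, not taken from any
# study discussed in this proposal.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200_000

# Observed risk factor (e.g., smoking history) that the model will use.
risk = rng.normal(size=n)
cancer = rng.binomial(1, 1 / (1 + np.exp(-(-4.0 + 1.2 * risk))))

# Early symptoms are more common in patients who in fact have cancer;
# the model never sees this variable, but it drives who gets imaged.
symptom = rng.binomial(1, np.where(cancer == 1, 0.6, 0.05))
imaged = rng.binomial(
    1, 1 / (1 + np.exp(-(-2.0 + 1.0 * risk + 2.5 * symptom)))
).astype(bool)

# Train only on the imaged (EMR-like) subset, then apply everywhere.
X = risk.reshape(-1, 1)
model = LogisticRegression().fit(X[imaged], cancer[imaged])
pred = model.predict_proba(X)[:, 1]

print(f"prevalence, imaged subset  : {cancer[imaged].mean():.3%}")
print(f"prevalence, full population: {cancer.mean():.3%}")
print(f"mean predicted risk, full population: {pred.mean():.3%}  <- overestimates")
print(f"AUC imaged: {roc_auc_score(cancer[imaged], pred[imaged]):.3f}   "
      f"AUC full: {roc_auc_score(cancer, pred):.3f}")
```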
We aim to conduct a systematic review of the design and evaluation metrics used in these studies to identify shortcomings and opportunities for improvement. Based on these findings, we will:
a. Summarize the existing literature in terms of common practices, study designs, and methods for inference when comparing new AI prediction tools with the standard of care, focusing in particular on how selection and confounding bias are handled.
b. Use the data set on thoracic cancer from the National Lung Screening Trial that was used to train Sybil (Mikhael et al., 2023) to evaluate how different methodological approaches, differing in both design and analytic choices, influence prediction outcomes (see the illustrative sketch after this list).
c. Time permitting, apply the best practices learned from (a) and (b) to the problem of predicting esophageal adenocarcinoma using the Yale New Haven Health system database.
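As one example of the analytic choices to be compared under (b), the sketch below contrasts a naive comparison of an AI-assisted read with the standard of care against an inverse-probability-weighted comparison when treatment assignment is not randomized. The severity variable, the routing mechanism, and the effect sizes are hypothetical assumptions for illustration; they are not results from NLST or Sybil.

```python
# Hypothetical sketch: adjusting a non-randomized AI-vs-standard-of-care
# comparison for confounding with inverse-probability-of-treatment weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 100_000

# Sicker patients are routed to the AI-assisted workflow more often,
# confounding a naive comparison of detection rates.
severity = rng.normal(size=n)
ai_read = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * severity))))
# Assumed true effect: the AI workflow adds +0.5 to the log-odds of detection.
detect = rng.binomial(
    1, 1 / (1 + np.exp(-(-2.0 + 1.0 * severity + 0.5 * ai_read)))
)

naive = detect[ai_read == 1].mean() - detect[ai_read == 0].mean()

# Propensity model for receiving the AI read, then IPT weights.
ps = LogisticRegression().fit(
    severity.reshape(-1, 1), ai_read
).predict_proba(severity.reshape(-1, 1))[:, 1]
w = np.where(ai_read == 1, 1 / ps, 1 / (1 - ps))
weighted = (np.average(detect[ai_read == 1], weights=w[ai_read == 1])
            - np.average(detect[ai_read == 0], weights=w[ai_read == 0]))

print(f"naive risk difference   : {naive:+.3f}  (confounded by severity)")
print(f"IPT-weighted difference : {weighted:+.3f}  (closer to the causal effect)")
```

Weighting is only one of several analytic choices (matching, regression adjustment, or target-trial emulation are others) whose influence on the comparison we intend to examine.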