Principal Investigator
Name
Ashesh Rambachan
Degrees
Ph.D.
Institution
Massachusetts Institute of Technology
Position Title
Assistant Professor of Economics
About this CDAS Project
Study
NLST
Project ID
NLST-1357
Initial CDAS Request Approval
Nov 12, 2024
Title
Improved Methods for Comparing Human Radiologists to AI Tools
Summary
AI tools are quickly becoming an alternative to human radiologists. As the performance and adoption of predictive AI tools grow in real-world settings, accurately comparing AI performance to human performance is paramount (Liu et al., 2019; Lai et al., 2021; Mullainathan & Obermeyer, 2019; Kleinberg et al., 2017; Agrawal et al., 2018; Rambachan, 2024). Accurate comparisons between AI tools and human decision-makers are also required to design optimal AI-human collaboration systems (Agarwal et al., 2023). In high-stakes settings like radiology, evaluators may choose to automate cases where AI performance exceeds human performance (Mozannar & Sontag, 2020; Raghu et al., 2019; Bansal et al., 2021).

However, in practice, the quality of a radiologist is difficult to measure from observational data. A radiologist's diagnosis reflects two components: 1) the information the radiologist observes and 2) the preferences that translate that information into a decision (Chan et al., 2022). The existing literature typically focuses on the accuracy of the final diagnosis and does not account for preference heterogeneity conditional on the observable characteristics of cases. For example, radiologists may judge that the costs of false-positive and false-negative reports vary across patients in complex ways. We hypothesize that failing to account for this potentially unobservable variation in preferences can produce incorrect and misleading skill comparisons between decision-makers and AI tools, leading to erroneous conclusions about the scope for AI-human collaboration in radiology.
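To illustrate the mechanism, consider a minimal simulation sketch (not part of the proposal; the prevalence, noise levels, and thresholds below are invented purely for illustration). Two readers observe noisy signals of the same disease state. Reader A's signal is strictly more informative, yet naive accuracy ranks the less-informed reader B higher, because the two apply different decision thresholds:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
prevalence = 0.20  # hypothetical disease rate, for illustration only

y = rng.random(n) < prevalence  # latent disease state

# Information: each reader observes a noisy signal of the true state.
# Reader A is strictly more informed (less noise) than reader B.
signal_a = y + rng.normal(0.0, 0.8, n)
signal_b = y + rng.normal(0.0, 1.2, n)

# Preferences: readers map signals into calls with different thresholds.
# Reader A treats missed cancers as very costly and flags aggressively;
# reader B's threshold happens to sit near the accuracy-maximizing point.
call_a = signal_a > 0.0
call_b = signal_b > 2.5

for name, call in [("A (more informed)", call_a), ("B (less informed)", call_b)]:
    acc = (call == y).mean()
    sens = call[y].mean()
    spec = (~call[~y]).mean()
    print(f"Reader {name}: accuracy={acc:.3f} "
          f"sensitivity={sens:.3f} specificity={spec:.3f}")

# Naive accuracy ranks B above A (roughly 0.81 vs 0.58) even though A
# has strictly better information: the gap is driven entirely by the
# thresholds (preferences), not by skill.
```

Here the signals themselves would correctly rank reader A above reader B (A's signal has the higher AUC), but once preferences compress the signals into binary calls, single-operating-point metrics can invert the skill ranking.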
Aims

Our study will quantify the bias that can arise when preference heterogeneity is ignored in a radiology setting. We will then identify the conditions under which the skill of decision-makers and AI tools can be compared without ambiguity and develop practical procedures for making such comparisons. This investigation will also allow us to assess more accurately the potential for humans to add value in AI-human collaboration.

The National Lung Screening Trial (NLST) dataset is uniquely suited for our research objectives for several key reasons:

1) Unique Identification of Radiologists and Cases: The dataset covers numerous radiologists and assigns unique identifiers to both radiologists and cases. These identifiers are essential for computing individual performance metrics, such as the Area Under the Curve (AUC), sensitivity, and specificity, for each radiologist or group of radiologists (a sketch of this computation follows the list).

2) Comprehensive and Reliable Ground Truth: The NLST provides extensive diagnostic follow-up data through systematic medical record abstraction for positive cases (its clear criteria for what constituted a positive screen are particularly valuable), along with documented diagnostic evaluations for cases showing positive findings or other abnormalities. One-year follow-up data confirming cancer-free status in negative cases also serves as negative ground truth.

3) Ability to Train AI Algorithms: The dataset is widely used in the computer science literature as a benchmark for computer vision and cancer detection algorithms (Ardila et al., 2019). Its large collection of radiological images can be used to train and validate AI models, enabling straightforward, direct comparisons between human and AI performance.
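As a concrete sketch of the per-reader computation referenced in point 1 (run here on synthetic stand-in data, since the real records cannot be reproduced; the column names, thresholds, and rates are placeholders, not actual NLST fields):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Synthetic stand-in for the real reads table: one row per (reader, case)
# with the reader's binary screening call and the adjudicated label.
n_cases = 5_000
label = (rng.random(n_cases) < 0.10).astype(int)  # 1 = confirmed cancer
thresholds = {"R1": 0.5, "R2": 1.0, "R3": 1.5}    # preference heterogeneity

reads = pd.concat(
    pd.DataFrame({
        "reader_id": rid,
        "case_id": np.arange(n_cases),
        "label": label,
        # each reader thresholds a private noisy read of the case
        "call": ((label + rng.normal(0, 1.0, n_cases)) > t).astype(int),
    })
    for rid, t in thresholds.items()
)

def reader_metrics(g: pd.DataFrame) -> pd.Series:
    sens = g.loc[g.label == 1, "call"].mean()
    spec = 1 - g.loc[g.label == 0, "call"].mean()
    return pd.Series({
        "sensitivity": sens,
        "specificity": spec,
        # with binary calls, ROC AUC collapses to (sens + spec) / 2
        "auc": roc_auc_score(g["label"], g["call"]),
    })

print(reads.groupby("reader_id")[["label", "call"]].apply(reader_metrics))
```

An AI model scored on the same cases can be appended as one more "reader" in this pipeline; with the model's continuous scores in place of binary calls, the AUC becomes a genuine ranking measure rather than a single operating point.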

These characteristics make the NLST dataset an exceptional choice for our investigation into methods of comparing diagnostic performance between radiologists and AI systems.

Collaborators

Yucheng Shang
PhD Student, MIT Economics

Ray Huang
Research Fellow, MIT Blueprint Labs

Haya Alsharif
Pre-doctoral Fellow, MIT Economics

Nikhil Agarwal
Professor of Economics, MIT