Improved Methods for Comparing Human Radiologists to AI Tools
In practice, however, the quality of a radiologist can be difficult to measure through observational data. A radiologist’s diagnosis consists of two components: 1) the information observed by the radiologist and 2) the preferences of the radiologist that translate that information into a decision (Chan et al., 2022). The existing literature usually focuses on the accuracy of the final diagnosis and does not typically account for preference heterogeneity conditional on the observable characteristics of a case. For example, radiologists may assess that the costs of false positive and false negative reports vary across patients in complex ways. We hypothesize that failing to account for this unobserved variation in preferences can produce misleading skill comparisons between decision-makers and AI tools, leading to erroneous conclusions about the scope of AI-human collaboration in radiology.
Our study will quantify the degree of bias that can arise when preference heterogeneity is ignored in a radiology setting (the stylized sketch below illustrates the mechanism). We then plan to identify the conditions under which the skill of decision-makers and AI tools can be compared without ambiguity, and to develop practical procedures for this comparison. This investigation will also allow us to more accurately assess the potential for humans to add value in human-AI collaboration.
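To fix ideas, the following is a minimal, purely illustrative sketch assuming a stylized signal-plus-threshold model of diagnosis (our simplification for exposition, not a claim about the NLST data-generating process). Two simulated readers observe equally informative signals but apply different decision thresholds, so headline metrics diverge even though their diagnostic information is identical.

```python
# Illustrative only: a stylized signal-plus-threshold model of diagnosis.
# Two radiologists observe equally informative signals but apply different
# thresholds (preferences). Accuracy-style metrics then diverge even though
# diagnostic skill (information quality) is identical.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = rng.binomial(1, 0.2, n)            # true disease status
signal = y + rng.normal(0.0, 1.0, n)   # same signal quality for both readers

for name, tau in [("Reader A (lenient)", 0.2), ("Reader B (strict)", 0.8)]:
    call = signal > tau                # preferences enter only through tau
    sens = call[y == 1].mean()
    spec = (~call)[y == 0].mean()
    acc = (call == y).mean()
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}, accuracy={acc:.2f}")
```

Because both simulated readers lie on the same ROC curve (same AUC), ranking them by sensitivity or raw accuracy alone would mistake preference differences for skill differences.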
The National Lung Screening Trial (NLST) dataset is uniquely suited for our research objectives for several key reasons:
1) Unique Identification of Radiologists and Cases: The dataset includes readings from many radiologists, with unique identifiers for both radiologists and cases. These identifiers are essential for computing individual performance metrics such as the Area Under the Curve (AUC), sensitivity, and specificity for each radiologist or group of radiologists (a minimal computation is sketched after this list).
2) Comprehensive and Reliable Ground Truth: The NLST provides extensive diagnostic follow-up data through systematic medical record abstraction, coupled with documented diagnostic evaluations for cases showing positive findings or other abnormalities; it is particularly valuable that the NLST issued clear guidelines on what constituted a positive case. For negative cases, the one-year follow-up data confirming cancer-free status can serve as the negative ground truth.
3) Ability to Train AI Algorithms: The dataset has been widely used in the computer science literature as a benchmark for computer vision and cancer detection algorithms (Ardila et al., 2019). Its large library of radiological images can be used to train and validate AI models, enabling straightforward, direct comparisons between human and AI performance.
These characteristics make the NLST dataset an exceptional choice for our investigation into methods of comparing diagnostic performance between radiologists and AI systems.
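As a concrete illustration of the per-reader metrics mentioned in point 1, here is a minimal sketch in Python. The column names (radiologist_id, score, confirmed_cancer) are hypothetical stand-ins for NLST reading-level fields, not the dataset's actual schema; an AI model can be compared on the same footing by treating its scores as those of one additional reader.

```python
# A minimal sketch of per-radiologist performance metrics.
# Column names are hypothetical placeholders, not the NLST schema:
#   radiologist_id   - unique reader identifier
#   score            - reader's suspicion score for the case
#   confirmed_cancer - ground truth from diagnostic follow-up (0/1)
import pandas as pd
from sklearn.metrics import roc_auc_score

def per_reader_metrics(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    rows = []
    for rid, g in df.groupby("radiologist_id"):
        call = g["score"] >= threshold                # binarized reading
        truth = g["confirmed_cancer"].astype(bool)    # follow-up ground truth
        rows.append({
            "radiologist_id": rid,
            "n_cases": len(g),
            "auc": roc_auc_score(truth, g["score"]),
            "sensitivity": (call & truth).sum() / truth.sum(),
            "specificity": (~call & ~truth).sum() / (~truth).sum(),
        })
    return pd.DataFrame(rows)
```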
Yucheng Shang
PhD Student, MIT Economics
Ray Huang
Research Fellow, MIT Blueprint Labs
Haya Alsharif
Pre-doctoral Fellow, MIT Economics
Nikhil Agarwal
Professor of Economics, MIT