Towards radiologist-level cancer risk assessment in CT lung screening using deep learning.
- Philips Research, Eindhoven, 5656 AE, The Netherlands.
- Human Longevity, Inc., San Diego, CA 92121, USA.
- Machine Learning Lab, University of Amsterdam, 1090 GH Amsterdam, and Philips Research, Eindhoven, 5656 AE, The Netherlands.
- Philips Research, Hamburg, 22335, Germany.
- Philips Research North America, Cambridge, MA, 02141, USA.
- Lahey Hospital & Medical Center, Burlington, MA, 01805, USA.
- Department of Radiology, University of Chicago, Chicago, IL, 60637, USA.
PURPOSE: Lung cancer is the leading cause of cancer mortality in the US, responsible for more deaths than breast, prostate, colon and pancreatic cancer combined. Large population studies have indicated that low-dose computed tomography (CT) screening of the chest can significantly reduce this death rate. Recently, the usefulness of deep learning (DL) models for lung cancer risk assessment has been demonstrated. However, in many cases model performance is evaluated on small or medium-sized test sets, which does not provide the strong guarantees of generalization and stability needed for clinical adoption. In this work, our goal is to contribute towards clinical adoption by evaluating a deep learning framework on larger, heterogeneous datasets and comparing it with state-of-the-art models.
METHODS: Three low-dose CT lung cancer screening datasets were used: the National Lung Screening Trial (NLST, n = 3410), Lahey Hospital and Medical Center (LHMC, n = 3154) and Kaggle competition data (from both stages, n = 1397 + 505). In addition, we used the University of Chicago data (UCM, a subset of NLST annotated by radiologists, n = 132). In the first stage, our framework employs a nodule detector; in the second stage, both the image context around the nodules and nodule features are used as inputs to a neural network that estimates the malignancy risk for the entire CT scan. We trained our algorithm on part of the NLST dataset and validated it on the other datasets. Special care was taken to ensure that there was no patient overlap between the training and validation sets.
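The requirement that no patient appears in both the training and validation sets can be illustrated with a minimal sketch. It assumes each scan carries a patient identifier and splits at the patient level rather than the scan level. The function name `split_by_patient`, the toy identifiers and the 30% validation fraction are hypothetical and are not taken from the study.

```python
import random


def split_by_patient(scan_to_patient, val_fraction=0.3, seed=0):
    """Split scan IDs into train/validation lists so that no patient is in both.

    scan_to_patient maps a scan ID to its patient ID (hypothetical format).
    """
    patients = sorted(set(scan_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)
    # Hold out whole patients, never individual scans.
    n_val = max(1, round(len(patients) * val_fraction))
    val_patients = set(patients[:n_val])
    train = [s for s, p in scan_to_patient.items() if p not in val_patients]
    val = [s for s, p in scan_to_patient.items() if p in val_patients]
    return train, val


# Hypothetical toy data: two patients have more than one scan.
scans = {
    "scan_a1": "patient_a",
    "scan_a2": "patient_a",
    "scan_b1": "patient_b",
    "scan_c1": "patient_c",
    "scan_c2": "patient_c",
    "scan_d1": "patient_d",
}
train_ids, val_ids = split_by_patient(scans)

# Sanity check: the two sets share no patient.
train_patients = {scans[s] for s in train_ids}
held_out_patients = {scans[s] for s in val_ids}
assert train_patients.isdisjoint(held_out_patients)
```

Splitting on patient IDs matters when one dataset is a subset of another, as with UCM and NLST here, because a scan-level split could place two scans of the same person on both sides.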
RESULTS AND CONCLUSIONS: The proposed deep learning model is shown to: (a) generalize well across all three datasets, achieving AUC between 86% and 94%, with our external test set (LHMC) being at least twice as large as those used in other works; (b) perform better than the widely accepted PanCan Risk Model, achieving 6% and 9% higher AUC in our two test sets; (c) improve on the state of the art represented by the winners of the Kaggle Data Science Bowl 2017 competition on lung cancer screening; (d) perform comparably to radiologists in estimating cancer risk at the patient level.
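For readers unfamiliar with the AUC metric reported above, the following is a minimal sketch of how it can be computed from scan-level risk scores. It uses the standard pairwise definition: the probability that a randomly chosen cancer case receives a higher score than a randomly chosen non-cancer case, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical and do not come from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for cancer, 0 for no cancer; scores: predicted risk per scan.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 2 cancer cases, 3 non-cancer cases.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)  # 5 of 6 pairs ranked correctly, about 0.83
```

The pairwise loop is quadratic in the number of cases, which is acceptable for illustration; a rank-based implementation would be preferable on test sets with thousands of scans.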
- NLST-241: Improved lung cancer screening with cognitive computing (Shawn Stapleton - 2016)