Oatmeal Health is developing OxyGen, a novel 3D lung CT foundation model that applies the masked autoencoder paradigm to the ~98,000 low-dose CT (LDCT) chest scans in the National Lung Screening Trial (NLST) dataset to achieve state-of-the-art (SOTA) generative performance for chest CT scans. The first of its kind trained on such a substantial dataset, OxyGen can accurately regenerate masked portions of CT scans even with substantial input occlusion. Furthermore, it shows robust performance on downstream tasks, including classification of lung nodules as malignant or benign, where it outperforms SOTA models.
By building on a state-of-the-art transformer architecture that has excelled in generative 3D modeling of CT data, and by using an end-to-end self-supervised training scheme that relies purely on masked autoencoding of CT sub-volumes, OxyGen improves upon prior foundation models by benefiting from “scaling laws.” In other words, OxyGen will not require fundamental changes in architecture or modeling approach to dramatically improve performance, only larger models and more CT scans.
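As an illustration of what masked autoencoding of a CT sub-volume involves, the following minimal sketch hides a random subset of patches and scores a reconstruction only on the hidden ones. It is not OxyGen's implementation: the function names, the 75% mask ratio, the flat toy patches, and the use of mean squared error as the "disparity" measure are all assumptions made for the example.

```python
import random

def make_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans; True marks a patch hidden from the encoder."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

def masked_mse(original, reconstructed, mask):
    """Mean squared error computed over the masked (hidden) patches only.

    original, reconstructed: lists of patches, each a flat list of voxel values.
    """
    total, count = 0.0, 0
    for orig_patch, rec_patch, hidden in zip(original, reconstructed, mask):
        if not hidden:
            continue  # visible patches do not contribute to the loss
        for a, b in zip(orig_patch, rec_patch):
            total += (a - b) ** 2
            count += 1
    return total / count if count else 0.0

# Toy sub-volume: 8 patches of 4 voxels each (real patches would be 3D blocks).
patches = [[float(p * 4 + v) for v in range(4)] for p in range(8)]
mask = make_patch_mask(len(patches), mask_ratio=0.75)  # hypothetical ratio

# Two stand-in "reconstructions": all zeros, and a perfect copy.
zeros = [[0.0] * 4 for _ in patches]
loss_zero = masked_mse(patches, zeros, mask)
loss_perfect = masked_mse(patches, patches, mask)
```

In this toy setup a perfect reconstruction yields a loss of zero, and any error on a hidden patch raises it; a real model would replace the stand-in reconstructions with decoder output.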
Oatmeal Health seeks access to the complete NLST dataset to continue training OxyGen and thus continue improving its generative performance. The end goal is for our foundation model to be used to develop AI models that automatically detect and diagnose pulmonary nodules on LDCT scans obtained for lung cancer screening. We intend to market such a computer-aided detection (CADe) and diagnosis (CADx) tool as a potential AI-based alternative to the existing Lung-RADS algorithm.
1) Train the OxyGen model using LDCT data from NLST. The objective will be regeneration of the image, with the loss defined as the disparity between the regenerated and actual image. Model quality will be measured using reconstruction loss.
2) Train the OxyGen model using NLST data to predict 1-year lung cancer diagnosis from an LDCT scan. A separate computer-aided detection (CADe) technology licensed from DeepHealth Technologies will be used to identify all nodules on the scan. Nodule embeddings will be concatenated with a separate embedding of the scan as a whole, and the combined representation will then be decoded jointly. One-year clinical follow-up data will be used to define the outcome of interest. Model quality will be measured using AUROC, AUPRC, sensitivity, specificity, and F1 score. The aim of the project is to outperform the existing clinical standard of care, the Lung-RADS algorithm.
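The evaluation metrics named in aim 2 can be sketched in a few lines of plain Python. This is a hedged illustration, not the project's evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are invented for the example and do not represent any results. AUPRC is omitted for brevity.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive case outscores a negative one
    (ties count as half), computed by direct pairwise comparison."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def threshold_metrics(labels, scores, threshold=0.5):
    """Sensitivity, specificity, and F1 at a fixed decision threshold."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}

# Toy data only: 1 = cancer diagnosed within one year, 0 = no diagnosis.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

area = auroc(labels, scores)
metrics = threshold_metrics(labels, scores, threshold=0.5)
```

AUROC is threshold-free, whereas sensitivity, specificity, and F1 depend on the chosen operating point, so a comparison against Lung-RADS would need the threshold fixed in advance.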
Muhammad Suri, MS, Machine Learning Engineer, Oatmeal Health
Yakov Keselman, PhD, Principal Machine Learning Engineer, Oatmeal Health