Demographic distribution matching between real-world and virtual phantom population.

Authors

Ghosh D, Tushar F, Dahal L, Vancoillie L, Lafata KJ, Samei E, Lo JY, Luo S

Affiliations

Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina, USA.
Center for Virtual Imaging Trials, Carl E. Ravin Advanced Imaging Laboratories, Department of Radiology, Duke University School of Medicine, Durham, North Carolina, USA.

Abstract

BACKGROUND: The adoption of virtual imaging trials (VITs) is rapidly expanding, offering a cost-effective and ethically viable alternative to large-scale clinical trials for imaging system evaluation. However, differences in demographic composition between virtual phantom populations and real-world clinical cohorts can introduce bias in imaging performance assessments, particularly for underrepresented populations. Such discrepancies, if unaddressed, can limit the translational relevance of VIT findings by misrepresenting diagnostic performance across diverse patient groups.

PURPOSE: To address this limitation, we introduce DISTINCT (Distributional Subsampling for Covariate-Targeted Alignment), a statistical framework for selecting demographically aligned subsamples from large clinical datasets to support robust comparisons with virtual cohorts.

METHODS: We applied DISTINCT to the National Lung Screening Trial (NLST) and a companion virtual trial dataset (VLST). The algorithm jointly aligned continuous (age, BMI) and categorical (sex, race, ethnicity) variables by discretizing each covariate and constructing multidimensional bins from the discretized values. For a given target size, DISTINCT samples individuals to match the joint demographic distribution of the reference population. We evaluated the demographic similarity between VLST and progressively larger NLST subsamples using Wasserstein and Kolmogorov-Smirnov (K-S) distances to identify the maximal subsample size with acceptable alignment. After demographic alignment, we evaluated lung cancer risk prediction performance by applying two established NLST risk scores to the aligned subsamples and assessing their stability with receiver operating characteristic (ROC) analysis.
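
The binning and sampling idea described in METHODS can be illustrated with a minimal sketch. This is not the DISTINCT implementation: the bin edges, record field names, function names, and the round-and-cap allocation rule are all assumptions made for illustration, since the abstract does not specify them. Only the K-S distance is shown; the Wasserstein distance is omitted.

```python
import random
from bisect import bisect_right
from collections import Counter, defaultdict

# Hypothetical bin edges; the abstract does not state the discretization used.
AGE_EDGES = [60, 65, 70]
BMI_EDGES = [25, 30]


def bin_key(person):
    """Map one record to a joint (age bin, BMI bin, sex, race, ethnicity) cell."""
    return (
        bisect_right(AGE_EDGES, person["age"]),
        bisect_right(BMI_EDGES, person["bmi"]),
        person["sex"],
        person["race"],
        person["ethnicity"],
    )


def joint_distribution(population):
    """Empirical probability of each joint cell in a population."""
    counts = Counter(bin_key(p) for p in population)
    total = sum(counts.values())
    return {cell: c / total for cell, c in counts.items()}


def aligned_subsample(source, reference, n, seed=0):
    """Draw about n records from source so cell shares follow the reference."""
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for p in source:
        by_cell[bin_key(p)].append(p)
    sample = []
    for cell, share in joint_distribution(reference).items():
        want = round(n * share)
        pool = by_cell.get(cell, [])
        # A cell can contribute at most the records the source has in it.
        sample.extend(rng.sample(pool, min(want, len(pool))))
    return sample


def ks_distance(xs, ys):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )
```

In this sketch, a cell that is scarce in the source caps how many records it can contribute, so the returned sample can fall short of the requested size. That is one simple way to see why alignment quality would degrade as the target size grows, and why a maximal acceptable subsample size exists.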

RESULTS: The DISTINCT algorithm identified a maximal demographically aligned NLST subsample of 9974 participants that preserved similarity to the VLST population. To assess whether such aligned subsets were sufficient for downstream applications, we applied two established NLST lung cancer risk scores and evaluated their performance using ROC analysis. Area under the curve (AUC) estimates stabilized once subsample sizes exceeded approximately 6000 participants, demonstrating that moderately sized aligned subsets provide reliable predictive model evaluation. Stratified analyses revealed demographic-specific variations in AUC, underscoring the importance of covariate alignment for fair and representative comparisons.
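
The AUC stability check described in RESULTS can be sketched as follows. This is an illustrative assumption, not the authors' analysis code: the function names are hypothetical, the subsamples here are drawn uniformly at random rather than by demographic alignment, and the pairwise AUC computation is quadratic in sample size, so it suits small examples only.

```python
import random


def auc(scores, labels):
    """AUC as the chance a random case outscores a random non-case (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auc_by_size(scores, labels, sizes, seed=0):
    """AUC of a risk score on random subsamples of increasing size."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        # Sample record indices without replacement for this subsample size.
        idx = rng.sample(range(len(scores)), n)
        results[n] = auc([scores[i] for i in idx], [labels[i] for i in idx])
    return results
```

Plotting the returned values against subsample size, ideally over repeated seeds, would show where the estimates stop fluctuating. Very small subsamples may contain no cancer cases, in which case `auc` raises an error rather than returning a misleading value.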

CONCLUSION: DISTINCT provides a statistically rigorous and scalable approach for aligning covariates between real and virtual imaging cohorts on demographic sources of variability. Although demonstrated for lung cancer screening with low-dose CT, the framework is broadly applicable to other imaging modalities, diseases, and sources of variability. By enabling fair and representative performance assessments, DISTINCT advances the integration of VITs into imaging research and protocol optimization workflows.

Publication Details

PubMed ID
41746164

Digital Object Identifier
10.1002/mp.70364

Publication
Med Phys. 2026 Mar; Volume 53 (Issue 3): Pages e70364

Related CDAS Studies

NLST

NLST-1020: Virtual NLST (Joseph Lo - 2023)