Principal Investigator
Name
James Long
Degrees
Ph.D.
Institution
University of Texas MD Anderson Cancer Center
Position Title
Assistant Professor
About this CDAS Project
Study
PLCO
Project ID
PLCO-808
Initial CDAS Request Approval
Jul 15, 2021
Title
A Retrospective Study for Construction and Evaluation of Cancer Screening Models from the PLCO Trial: Pancreatic, Prostate, Lung, Colorectal, and Ovarian Cohorts
Summary
This project will construct models to predict the occurrence of pancreatic cancer from clinical and epidemiological features. The most important features (e.g., age, smoking history, diabetes status) will be identified, and model performance will be quantified in terms of AUC, net benefit, and sensitivity at high specificity (> 95%) thresholds. The project will be performed in conjunction with "Longitudinal Proteomic and Metabolomic Predictors of Pancreatic Cyst Malignant Progression and Early Stage Pancreatic Cancer" (PAR-2020-0002), which requests plasma access.
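As an illustration of the three performance scales named above, the following is a minimal sketch in plain Python. The function names, the convention that a subject is called positive when the score exceeds the threshold, and the toy scores are assumptions made for illustration; they are not taken from the project.

```python
def auc(case_scores, control_scores):
    """Empirical AUC: P(case score > control score) + 0.5 * P(tie)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def sensitivity_at_specificity(case_scores, control_scores, specificity=0.95):
    """Sensitivity at the smallest control-derived threshold whose
    empirical specificity reaches the target (positive = score > threshold)."""
    n_controls = len(control_scores)
    threshold = max(control_scores)
    for candidate in sorted(set(control_scores)):
        spec = sum(k <= candidate for k in control_scores) / n_controls
        if spec >= specificity:
            threshold = candidate
            break
    return sum(c > threshold for c in case_scores) / len(case_scores)


def net_benefit(case_probs, control_probs, p_t):
    """Decision-curve net benefit at risk threshold p_t:
    TP/n - FP/n * p_t / (1 - p_t), treating those with risk >= p_t."""
    n = len(case_probs) + len(control_probs)
    tp = sum(p >= p_t for p in case_probs)
    fp = sum(p >= p_t for p in control_probs)
    return tp / n - fp / n * p_t / (1.0 - p_t)


# Hypothetical predicted risks for 4 cases and 5 controls.
cases = [0.9, 0.8, 0.35, 0.7]
controls = [0.1, 0.2, 0.3, 0.4, 0.6]
print(auc(cases, controls))
print(sensitivity_at_specificity(cases, controls, 0.95))
print(net_benefit(cases, controls, 0.5))
```

With so few controls, the 95% specificity threshold is simply the largest control score; in a cohort the size of PLCO the empirical quantile would be far more stable.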

The performance of models constructed in this project (using only clinical / epidemiological features) will be compared to that of models constructed using biomarkers quantified from the plasma requested in PAR-2020-0002. Constructing these clinical/epidemiological-only models is an essential step towards determining the added value of a biomarker, or set of biomarkers, relative to easily available clinical information. In particular, when considering the utility of a protein biomarker X for early detection of pancreatic cancer, the primary question of interest is the difference in performance between model (a), which uses biomarker X and clinical features, and model (b), which uses only clinical features. The biomarker has value if this difference in performance is clinically meaningful, as measured by a performance metric such as area under the receiver operating characteristic curve (AUC).
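The comparison between model (a) and model (b) can be sketched as a difference in AUC with a bootstrap interval. This is only one possible approach, shown under stated assumptions: the function `delta_auc_bootstrap`, the percentile interval, resampling cases and controls separately, and the toy scores are all hypothetical and not the project's stated method. Each subject carries a pair of scores, one from each model, ideally computed on held-out data.

```python
import random


def auc(case_scores, control_scores):
    """Empirical AUC: P(case score > control score) + 0.5 * P(tie)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def delta_auc_bootstrap(cases, controls, n_boot=1000, seed=0):
    """cases / controls: lists of (score_a, score_b) pairs, one per subject.
    Returns (AUC_a - AUC_b, lower, upper) with a percentile 95% interval."""
    rng = random.Random(seed)

    def delta(ca, co):
        auc_a = auc([s[0] for s in ca], [s[0] for s in co])
        auc_b = auc([s[1] for s in ca], [s[1] for s in co])
        return auc_a - auc_b

    point = delta(cases, controls)
    boots = []
    for _ in range(n_boot):
        # Resample subjects (not scores) so the pairing of a and b is kept.
        ca = [rng.choice(cases) for _ in cases]
        co = [rng.choice(controls) for _ in controls]
        boots.append(delta(ca, co))
    boots.sort()
    lower = boots[int(0.025 * n_boot)]
    upper = boots[int(0.975 * n_boot) - 1]
    return point, lower, upper


# Hypothetical (model a, model b) scores on a held-out set.
case_pairs = [(0.9, 0.6), (0.8, 0.3), (0.7, 0.7), (0.4, 0.2)]
control_pairs = [(0.1, 0.1), (0.3, 0.5), (0.5, 0.4), (0.2, 0.25), (0.6, 0.65)]
point, lower, upper = delta_auc_bootstrap(case_pairs, control_pairs)
print(point, lower, upper)
```

Resampling whole subjects keeps the correlation between the two models' scores, which is what makes the interval for the difference narrower than one built from two independent AUC intervals.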

In addition to constructing a multivariate clinical predictive model, the effect of matching in nested case-control biomarker studies will be assessed via simulation. These results will inform the choice of design of nested case-control studies in PLCO.
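The difference between matched and unmatched control selection can be sketched as follows. This is a simplified illustration, not the project's simulation: the function `select_controls`, the record fields (`id`, `case`, `age_group`, `sex`), and the toy cohort are hypothetical, and the sketch ignores follow-up time, whereas a true nested case-control design samples controls from those still at risk when each case is diagnosed.

```python
import random
from collections import defaultdict


def select_controls(cohort, match_keys=(), per_case=1, seed=0):
    """Pick up to `per_case` controls for each case, without replacement.

    With match_keys empty, controls are drawn at random from all non-cases
    (unmatched design). Otherwise each case draws only from non-cases that
    share its values on every key in match_keys (matched design).
    Returns a list of (case_id, control_id) pairs.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for person in cohort:
        if not person["case"]:
            key = tuple(person[k] for k in match_keys)
            pools[key].append(person)
    for pool in pools.values():
        rng.shuffle(pool)

    selected = []
    for person in cohort:
        if person["case"]:
            pool = pools[tuple(person[k] for k in match_keys)]
            for _ in range(min(per_case, len(pool))):
                selected.append((person["id"], pool.pop()["id"]))
    return selected


# Hypothetical toy cohort.
cohort = [
    {"id": 1, "case": True, "age_group": "60-64", "sex": "F"},
    {"id": 2, "case": True, "age_group": "65-69", "sex": "M"},
    {"id": 3, "case": False, "age_group": "60-64", "sex": "F"},
    {"id": 4, "case": False, "age_group": "65-69", "sex": "M"},
    {"id": 5, "case": False, "age_group": "65-69", "sex": "F"},
    {"id": 6, "case": False, "age_group": "60-64", "sex": "M"},
]
matched = select_controls(cohort, match_keys=("age_group", "sex"))
unmatched = select_controls(cohort)
print(matched)
print(unmatched)
```

In the toy cohort each case has exactly one eligible matched control, so the matched result is fully determined, while the unmatched draw depends on the seed.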

Since the statistical issues with cancer screening described above apply to all cancers, the analysis will be repeated for the prostate, lung, colorectal, and ovarian cancer cohorts in the PLCO project.

Metrics such as ROC curves and AUC do not directly capture the effect of screening on survival, the outcome of primary interest. Statistical methods for estimating sojourn time, lead time, overdiagnosis rate, and age-specific sensitivity will be applied to each cancer type. Estimates of lead time, when combined with the stage-shift conditional survival distribution, could be used to estimate the impact of screening on mortality, the question of fundamental interest.
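To make the relationship between sojourn time and lead time concrete, the following is a toy forward simulation, not an estimation method. Every modeling choice here is an assumption made for illustration: the function name, exponentially distributed sojourn times, uniformly distributed onset of the preclinical phase, a constant per-screen sensitivity, and equally spaced screens starting at time 0.

```python
import random


def simulate_lead_times(n=10000, mean_sojourn=3.0, screen_interval=1.0,
                        sensitivity=0.8, n_screens=4, seed=0):
    """Return (lead_time, sojourn_time) pairs for screen-detected cancers.

    Sojourn time: length of the preclinical, screen-detectable phase.
    Lead time: how much earlier screening detects the cancer than
    symptoms would have (clinical diagnosis time minus screen time).
    """
    rng = random.Random(seed)
    detected = []
    for _ in range(n):
        onset = rng.uniform(0.0, n_screens * screen_interval)
        sojourn = rng.expovariate(1.0 / mean_sojourn)
        clinical = onset + sojourn
        for j in range(n_screens):
            t = j * screen_interval
            # A screen can only detect cancer during the preclinical phase.
            if onset <= t < clinical and rng.random() < sensitivity:
                detected.append((clinical - t, sojourn))
                break
    return detected


pairs = simulate_lead_times()
mean_lead = sum(lead for lead, _ in pairs) / len(pairs)
print(len(pairs), mean_lead)
```

Because a screen can only catch a cancer after onset, the lead time of any screen-detected case is bounded above by its sojourn time, which the simulated pairs reflect.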

Finally, we will consider adjusting biomarker performance for baseline covariates such as age, sex, and race [4,5]. Recently developed methodology enables estimation of sensitivity, specificity, and ROC curves in subpopulations defined by categorical and continuous covariates [6]. This will be performed on each of the 5 requested cohorts.
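For a single categorical covariate, the idea behind covariate adjustment can be sketched by comparing each case only with controls from the same stratum. This is a simplified illustration in the spirit of the covariate-adjusted ROC curve of [4]; it is not the API of the package in [6], it does not handle continuous covariates, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def covariate_adjusted_auc(cases, controls):
    """cases / controls: lists of (marker_value, stratum).

    Each case is scored against controls in its own stratum only:
    P(control < case | stratum) + 0.5 * P(tie | stratum).
    The adjusted AUC is the average of these values over cases.
    Cases whose stratum contains no controls are skipped.
    """
    by_stratum = defaultdict(list)
    for marker, stratum in controls:
        by_stratum[stratum].append(marker)

    total, used = 0.0, 0
    for marker, stratum in cases:
        reference = by_stratum.get(stratum)
        if not reference:
            continue
        below = sum(m < marker for m in reference)
        ties = sum(m == marker for m in reference)
        total += (below + 0.5 * ties) / len(reference)
        used += 1
    if used == 0:
        raise ValueError("no case shares a stratum with any control")
    return total / used


# Hypothetical marker that rises with age; cases are older on average.
toy_controls = [(1, "young"), (2, "young"), (3, "young"), (5, "old"), (6, "old")]
toy_cases = [(2.5, "young"), (5.5, "old"), (5.5, "old")]
adjusted = covariate_adjusted_auc(toy_cases, toy_controls)
print(adjusted)
```

In this toy data the pooled AUC is 10/15, about 0.67, while the adjusted value is 5/9, about 0.56; the gap is the part of the apparent discrimination that comes from age rather than from the marker.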
Aims

AIM 1: Identify important predictors of each cancer type (pancreatic, prostate, lung, colorectal, and ovarian). Occurrence of each cancer type will be correlated with each clinical / epidemiological variable. Strength of correlation will be assessed via AUC. This analysis will be repeated with cases and controls defined at different diagnosis lag times, e.g. controls defined as participants with no cancer diagnosis within 1 year, or as participants with no cancer diagnosis within 2 years.

AIM 2: Construct and evaluate multivariate models. Multivariate prediction models will be constructed using a variety of regression tools (e.g. logistic regression, random forests), and performance will be quantified on several scales (AUC, net benefit, sensitivity/specificity). Performance will be evaluated on held-out test sets to avoid overfitting and optimistic estimates of model performance.

AIM 3: Study performance of matched and unmatched nested case-control designs in PLCO. Biomarker levels will be simulated, and cases and controls will be selected using both unmatched and matched designs. Two forms of matching will be considered: 1) matching based on age, sex, material study year, and time from randomization to exit, and 2) matching based on all clinical / epidemiological variables used in the best performing multivariate models from Aim 2. Predictive models will be constructed using the biomarkers and clinical variables. Performance of the three designs (2 matched, 1 unmatched) will be assessed using AUC and sensitivity at high specificity thresholds. For unmatched designs, computation of these performance metrics will follow standard practice. For matched designs, performance metrics will account for matching using methodology developed in [1,2,3]. Of interest are which design obtains the best performance and the uncertainty in the performance measures. The results will inform the design (whether to match, which variables to match on) of future PLCO studies.

AIM 4: Estimate sojourn time, lead time, overdiagnosis rate, and age-specific sensitivity for each PLCO cancer type. The screening performance of the imaging modalities and/or biomarker trajectories (e.g. CA125 and transvaginal ultrasound for ovarian cancer early detection) will be evaluated. New methods will be developed if current methodology does not appear suitable for the PLCO cohorts.

AIM 5: Estimate sensitivity, specificity, and receiver operating characteristic (ROC) curves for each of the 5 cohorts (pancreatic, prostate, lung, colorectal, and ovarian) after adjusting for baseline covariates such as age, sex, race, and other risk factors.

[1] Pepe, M. S., et al. (2013). "Estimating the receiver operating characteristic curve in studies that match controls to cases on covariates." Academic radiology 20(7): 863-873.

[2] Bansal, A. and M. S. Pepe (2013). "Estimating improvement in prediction with matched case–control designs." Lifetime data analysis 19(2): 170-201.

[3] Janes, H. and M. S. Pepe (2008). "Matching in studies of classification accuracy: implications for analysis, efficiency, and assessment of incremental value." Biometrics 64(1): 1-9.

[4] Janes, H. and M. S. Pepe (2009). "Adjusting for covariate effects on classification accuracy using the covariate-adjusted receiver operating characteristic curve." Biometrika 96(2): 371-382.

[5] Alonzo, T. A. and M. S. Pepe (2002). "Distribution-free ROC analysis using binary regression techniques." Biostatistics 3(3): 421-432.

[6] https://github.com/ziyili20/caROC

Collaborators

Jianjun Zhang

Ehsan Irajizad

Related Publications