Efficient Design and Analysis of Two-Phase Biomarker Studies
Principal Investigator
Name
Li Cheung
Degrees
PhD
Institution
National Cancer Institute
Position Title
Principal Investigator
Email
li.cheung@nih.gov
About this CDAS Project
Study
PLCO
Project ID
PLCO-2053
Initial CDAS Request Approval
Jun 8, 2026
Title
Efficient Design and Analysis of Two-Phase Biomarker Studies
Summary
Two-phase biomarker studies are widely used when biomarker assessment is expensive, invasive, or otherwise unavailable on the full study cohort. In a typical design, information from all participants is available in phase 1, while biomarker measurements are collected on a selected phase-2 subsample. This design can improve feasibility and reduce cost, but it also creates statistical challenges related to sampling design, estimation efficiency, missingness, selection bias, and variance estimation.
We propose to use PLCO prostate cancer data to compare approaches for the design and analysis of two-phase biomarker studies. We will treat time to prostate cancer diagnosis, demographic variables, and relevant SNP data (summarized as a polygenic risk score) as the phase-1 data and prostate-specific antigen (PSA) as the expensive biomarker. We will compare the estimation efficiency for the hazard ratio/odds ratio of PSA for prostate cancer when a subsample of the data is selected and analyzed under each design and estimation approach. We may also include a second example using colorectal cancer and CA-125.
The goal is to develop practical, efficient, and robust guidance for epidemiological work on biomarker discovery. The target audience is epidemiologists.
Aims
1. We will compare the estimation efficiency of several two-phase sampling designs commonly used in epidemiological studies, including simple random sampling, case-cohort designs, matched and nested case-control designs (covariate-matched and propensity-matched), stratified outcome-dependent sampling, extreme phenotype sampling, and residual-based sampling.
PLCO prostate cancer data will be used to estimate hazard ratios (HRs) and confidence intervals of PSA for prostate cancer from two-phase samples drawn at several sampling fractions of the full dataset (1/20, 1/10, 1/5, 1/4, 1/3, and 1/2). The bias and efficiency of each sampling method will be assessed against estimates that use the full dataset.
We may also include a second example using PLCO colorectal cancer and CA-125, following the same approach as described for prostate cancer.
2. We will also evaluate the performance of the different two-phase designs in simulation studies. Because the intended paper is a practical guide for epidemiologists, simulations will play a supporting role rather than being the main focus.
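As a rough illustration of the comparison described in the aims, the sketch below draws repeated phase-2 samples at the listed fractions under two of the named designs (simple random sampling and case-cohort) and summarizes bias and variance relative to the full-cohort estimate. It rests on several assumptions that are not part of the project: the cohort is simulated rather than taken from PLCO, the biomarker is dichotomized, and an inverse-probability-weighted 2x2 log odds ratio stands in for the Cox or logistic models a real analysis would use. All function names and parameter values are hypothetical.

```python
import numpy as np

FRACTIONS = [1/20, 1/10, 1/5, 1/4, 1/3, 1/2]  # sampling fractions listed in Aim 1


def simulate_cohort(n, rng):
    """Hypothetical cohort: binary biomarker and a fairly rare binary outcome."""
    high_marker = rng.random(n) < 0.3
    # Assumed values for illustration: baseline log odds -3.0, log odds ratio 1.0.
    logit = -3.0 + 1.0 * high_marker
    is_case = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))
    return high_marker, is_case


def simple_random_sample(n, fraction, rng):
    """Phase-2 indicator and inverse-probability weights under simple random sampling."""
    size = int(round(n * fraction))
    selected = np.zeros(n, dtype=bool)
    selected[rng.choice(n, size=size, replace=False)] = True
    weights = np.where(selected, n / size, 0.0)
    return selected, weights


def case_cohort_sample(is_case, fraction, rng):
    """All cases plus a random subcohort; non-cases are weighted by 1 / fraction."""
    n = is_case.size
    size = int(round(n * fraction))
    subcohort = np.zeros(n, dtype=bool)
    subcohort[rng.choice(n, size=size, replace=False)] = True
    selected = subcohort | is_case
    # Cases are sampled with probability 1, non-cases with probability size / n.
    weights = np.where(is_case, 1.0, n / size) * selected
    return selected, weights


def weighted_log_odds_ratio(high_marker, is_case, weights):
    """Log odds ratio from an inverse-probability-weighted 2x2 table."""
    a = weights[high_marker & is_case].sum()
    b = weights[high_marker & ~is_case].sum()
    c = weights[~high_marker & is_case].sum()
    d = weights[~high_marker & ~is_case].sum()
    return np.log(a * d / (b * c))


def compare_designs(n=20000, replicates=200, seed=0):
    """Empirical bias and variance of each design against the full-cohort estimate."""
    rng = np.random.default_rng(seed)
    high_marker, is_case = simulate_cohort(n, rng)
    full = weighted_log_odds_ratio(high_marker, is_case, np.ones(n))
    results = {}
    for fraction in FRACTIONS:
        srs, cc = [], []
        for _ in range(replicates):
            _, w = simple_random_sample(n, fraction, rng)
            srs.append(weighted_log_odds_ratio(high_marker, is_case, w))
            _, w = case_cohort_sample(is_case, fraction, rng)
            cc.append(weighted_log_odds_ratio(high_marker, is_case, w))
        results[fraction] = {
            "srs_bias": np.mean(srs) - full, "srs_var": np.var(srs, ddof=1),
            "cc_bias": np.mean(cc) - full, "cc_var": np.var(cc, ddof=1),
        }
    return full, results
```

Calling `compare_designs()` returns the full-cohort log odds ratio and a dictionary keyed by sampling fraction. Note that the case-cohort design adds all cases to the subcohort, so at a given fraction it measures the biomarker on more participants than simple random sampling does; a fair comparison would also track the phase-2 sample size.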
Collaborators
Li Cheung, National Cancer Institute
Fangya Mao, National Cancer Institute
Anil Chaturvedi, National Cancer Institute