Sample selection for multiple myeloma proteomics study
Our initial proposal described a nested, individually matched case-control design in which controls were randomly selected and matched to cases on key sources of bias and confounding (sex, race, age, and year of blood collection). However, this approach has limitations. Matched controls are not fully representative of the underlying population, which may reduce generalizability and limits the direct estimation of risk prediction metrics. In addition, risk estimates from matched designs typically rely on conditional logistic regression models, which do not incorporate time-to-event information. This can introduce bias and reduce statistical power when estimating marker-outcome associations and risk prediction.
To address these limitations, we propose an alternative stratified sampling design that stratifies the entire cohort on time-to-event outcomes and auxiliary covariates [1], then randomly selects individuals from each stratum for protein measurement. The time-to-event outcomes will include both binary event status (MM case vs. non-case) and event or censoring time, and we will explore different stratification strategies. Once CGR returns protein measurements for the selected individuals, we will apply inverse probability weighting (IPW) in Cox models to account for the unequal selection probabilities induced by the sampling design. This approach will make our modeling strategy more flexible while minimizing potential sources of bias.
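The sampling and weighting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the selection code for this study: the toy cohort, the field names (`id`, `event`, `time`), the two-level follow-up split at 5 years, and the per-stratum targets are all hypothetical. Each selected individual receives the weight N_h / n_h, the inverse of the sampling fraction in stratum h.

```python
import random
from collections import defaultdict


def stratified_sample(cohort, stratum_of, n_per_stratum, seed=0):
    """Draw a simple random sample within each stratum and attach IPW weights.

    cohort        : list of dicts, one per participant
    stratum_of    : function mapping a participant to a hashable stratum label
    n_per_stratum : dict mapping stratum label -> target sample size
    Returns a list of (participant, weight) pairs with weight = N_h / n_h.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in cohort:
        strata[stratum_of(person)].append(person)

    selected = []
    for label, members in sorted(strata.items(), key=lambda kv: kv[0]):
        # Cap the target at the stratum size (e.g. "take all cases").
        n_h = min(n_per_stratum.get(label, 0), len(members))
        if n_h == 0:
            continue
        weight = len(members) / n_h  # inverse of the sampling fraction
        for person in rng.sample(members, n_h):
            selected.append((person, weight))
    return selected


# Hypothetical toy cohort: 1,000 participants, 50 of them cases.
toy_cohort = [
    {"id": i, "event": 1 if i % 20 == 0 else 0, "time": (i * 7 % 100) / 10.0}
    for i in range(1000)
]


def toy_stratum(person):
    """Stratify on event status and an assumed early/late follow-up split."""
    period = "early" if person["time"] < 5.0 else "late"
    return (person["event"], period)


# Assumed allocation: all cases, 100 non-cases per follow-up stratum.
targets = {(1, "early"): 1000, (1, "late"): 1000, (0, "early"): 100, (0, "late"): 100}
sample = stratified_sample(toy_cohort, toy_stratum, targets)
```

When every stratum is sampled, the weights sum to the cohort size, which is a quick check that the selected set represents the full cohort. The weights would then be supplied to a weighted Cox model with a robust variance estimator.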
To evaluate the feasibility of stratified sampling for our study and to identify eligible participants for our EEMS proposal, we are requesting PLCO data on baseline covariates, serum sample availability, multiple myeloma (MM), and monoclonal gammopathy of undetermined significance (MGUS). Code for generating the sample selection, together with an example analysis that accounts for the selection, will be shared with IMS to preserve institutional memory.
References:
[1] https://doi.org/10.1111/j.1467-9469.2006.00523.x
We aim to develop a stratified sampling strategy to guide sample selection for EEMS-2025-0032 and inform our specimen request.
We will explore the sample selection method, the number of strata and the allocation of samples across them, and the choice of stratification variables, including time-to-event outcomes and relevant covariates (e.g., demographic, clinical, and biomarker data).
The following data are requested to support the proposed analyses:
• PLCO Baseline Dataset
• Cancer and Follow-Up Data
-First primary cancer ICD code.
-Incident multiple myeloma (1, 0).
-Any available multiple myeloma subtype information (e.g., light chain, non-light chain, IgM).
-Date of multiple myeloma diagnosis.
-Date of death.
-Date lost to follow-up.
-Date of cancer diagnosis for any type of cancer.
• Blood Collection Data (array format due to serial samples)
-Serum sample availability with corresponding sample collection dates (e.g., date_sample1, date_sample2).
-Perfect parent serum sample availability (binary: 0 = unavailable, 1 = available for each sample).
-Study year of sample collection (T0 to T5).
• MGUS Data from EEMS Proposal 2005-0017 (array format due to serial samples)
-MGUS test status for each year with available data.
-MGUS subtype information (e.g., light chain, non-light chain, IgM).
-Date of MGUS sample collection, corresponding to study years T0 to T5.
• EEMS Project Inclusion Indicators (binary: 0 = not included, 1 = included)
-EEMS: 2005-0017
-EEMS: 2016-0009
-EEMS: 2018-1019
-PLCO-355
-Genetic data available
The requested data will support a comprehensive evaluation of sample selection strategies and participant eligibility for the proposed study. The data can be structured as one record per individual, with the blood collection and MGUS data arranged as arrays ordered from earliest to latest date.
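As an illustration of the one-record-per-individual layout, the sketch below shows one possible shape for a record with serial serum samples ordered by collection date. All class and field names are hypothetical and chosen only for this example; the actual variable names and date encoding will follow whatever the delivered dataset uses.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class SerumSample:
    study_year: str                 # "T0" through "T5"
    collection_date: date
    perfect_parent_available: int   # 1 = available, 0 = unavailable


@dataclass
class ParticipantRecord:
    participant_id: str             # hypothetical identifier field
    incident_mm: int                # 1 = incident multiple myeloma, 0 = none
    mm_diagnosis_date: Optional[date] = None
    serum_samples: List[SerumSample] = field(default_factory=list)

    def __post_init__(self):
        # Keep the serial samples ordered from earliest to latest date.
        self.serum_samples.sort(key=lambda s: s.collection_date)


# Example record with samples supplied out of order.
example = ParticipantRecord(
    participant_id="A001",
    incident_mm=0,
    serum_samples=[
        SerumSample("T1", date(1996, 5, 1), 1),
        SerumSample("T0", date(1995, 4, 20), 0),
    ],
)
```

The MGUS results could be held in a second list on the same record, ordered the same way.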
Eleanor Watts, MEB, DCEG, NCI, NIH
Steve Moore, MEB, DCEG, NCI, NIH
Fangya Mao, BB, DCEG, NCI, NIH
Li Cheung, BB, DCEG, NCI, NIH