Study
PLCO
Project ID
PLCO-1325
Initial CDAS Request Approval
Sep 18, 2023
Title
Correcting for participation bias in convenience samples using multiple reference samples
Summary
Large-scale probability samples are considered the gold-standard design in applied health research. However, assembling such samples can be challenging because of the high cost of data collection, low response rates, and the time required to develop an accurate sampling frame from which potential participants can be sampled and recruited. To overcome these barriers, health researchers are increasingly adopting convenience sampling strategies and recruiting participants through smartphone apps, social media platforms, membership lists, or opt-in panels composed of individuals who have agreed to take part in surveys periodically. In a probability sample, the selection mechanism is controlled by the survey design, which assigns each participant a survey sampling weight that reflects their probability of selection and thus the number of people in the target population that they represent. This allows findings from the sample to be generalized to the target population. By contrast, in convenience (nonprobability) samples the participation mechanism is unknown. As a result, nonprobability samples often over- or under-represent certain demographic, lifestyle, occupational, and health-related characteristics of the target population, which can, in turn, lead to biased prevalence estimates. Moreover, when factors associated with the outcomes and exposures of interest are also associated with an individual's decision to participate in the sample, exposure-outcome association analyses can also produce biased results. Currently available statistical methods designed to mitigate this bias use a probability reference sample in conjunction with the nonprobability sample to derive pseudo-weights that mimic the "true" survey sampling weights of a probability sample. These methods work well when all the important variables associated with an individual's decision to participate in the nonprobability sample are available in a single reference sample.
However, more than one reference sample will often be required to account for all important variables, and existing statistical methods for deriving pseudo-weights are not designed for this situation. To address the inherent bias of nonprobability samples, the goal of the proposed study is to expand existing statistical methods to accommodate more than one reference sample. This will allow estimates from nonprobability samples to be corrected in a broader set of contexts and, in turn, support more accurate inference from nonprobability samples to the target population. The overall objective of this proposal is to develop methods that generate pseudo-weights for nonprobability samples, and to assess their performance, extending current approaches to situations where important variables associated with the participation mechanism are spread across multiple reference samples. As a starting point, the proposed research will extend current methods to accommodate two reference samples, with the expectation that the findings can be leveraged in future work to integrate three or more reference samples. We will apply the new and existing methods to real-world nonprobability samples, such as the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial cohort sample, and compare the prevalence estimates and exposure-outcome associations obtained by these methods with those from unweighted analyses.
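To make the single-reference-sample idea in the summary concrete, the following minimal sketch derives pseudo-weights by matching a nonprobability sample to one weighted probability reference sample on a single categorical variable. It is a deliberate simplification (a poststratification-style adjustment with a saturated participation model), not the method proposed in this project and not necessarily the form used by existing pseudo-weighting methods. The function names, the age-group cells, and all numbers are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def cell_pseudo_weights(nonprob_cells, reference_cells, reference_weights):
    """Return one pseudo-weight per nonprobability participant.

    Assumption (illustrative only): participation depends solely on the
    cell variable, so each participant in a cell represents
    (estimated population count in cell) / (participants in cell) people.
    """
    # Estimated population size of each cell from the weighted reference sample.
    population_totals = defaultdict(float)
    for cell, weight in zip(reference_cells, reference_weights):
        population_totals[cell] += weight

    # Number of nonprobability participants observed in each cell.
    participant_counts = defaultdict(int)
    for cell in nonprob_cells:
        participant_counts[cell] += 1

    # A cell missing from the reference sample cannot be weighted.
    missing = set(participant_counts) - set(population_totals)
    if missing:
        raise ValueError(f"cells absent from reference sample: {sorted(missing)}")

    return [population_totals[c] / participant_counts[c] for c in nonprob_cells]


def weighted_prevalence(outcomes, weights):
    """Weighted mean of a 0/1 outcome."""
    return sum(y * w for y, w in zip(outcomes, weights)) / sum(weights)


# Hypothetical nonprobability sample: younger people are over-represented.
nonprob_cells = ["young", "young", "young", "young", "old", "old"]
outcomes = [1, 0, 0, 0, 1, 1]

# Hypothetical probability reference sample with survey sampling weights.
reference_cells = ["young", "young", "young", "old", "old"]
reference_weights = [200.0, 200.0, 200.0, 200.0, 200.0]

pseudo_weights = cell_pseudo_weights(nonprob_cells, reference_cells, reference_weights)
unweighted = sum(outcomes) / len(outcomes)
pseudo_weighted = weighted_prevalence(outcomes, pseudo_weights)
```

In this toy example the unweighted prevalence is 0.5, while the pseudo-weighted prevalence is 0.55, because the under-represented older group, which has the higher outcome rate, receives larger pseudo-weights. The sketch also hints at the limitation the project addresses: if a second variable driving participation were available only in a different reference sample, this single-sample calculation could not account for it.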
Aims
The specific objectives are:
Objective 1: Expand the current methods to integrate two reference samples into the generation of pseudo-weights.
Objective 2: Assess the performance of the new methods (in terms of bias, variance, and convergence) and compare it with that of existing methods under various simulated scenarios (e.g., misspecification of the participation model, sample size of the nonprobability sample).
Objective 3: Apply the new and existing methods to real-world nonprobability samples (the cannabis sample and the PLCO sample), and compare the prevalence estimates and exposure-outcome associations obtained by these methods with those from unweighted analyses.
Objective 4: Develop software packages in R and SAS that other health researchers can use to carry out pseudo-weighted analyses of nonprobability samples.
Collaborators
Dr. Victoria Landsman, Dalla Lana School of Public Health, Toronto, Canada
Dr. Lingxiao Wang, University of Virginia, Charlottesville, VA
Dr. Aya Mitani, Dalla Lana School of Public Health, Toronto, Canada