Principal Investigator
Mitchell Gail
M.D., Ph.D.
National Cancer Institute
Position Title
Distinguished Investigator
About this CDAS Project
PLCO (Learn more about this study)
Project ID
Initial CDAS Request Approval
Oct 22, 2020
Statistical methods to identify the “causal SNP” from among SNPs in high LD with an index SNP
This is a methodology project to see whether some proposed statistical methods can help identify the SNPs most likely to be “causal” from a GWAS study. GWAS studies can detect index SNPs that are associated with phenotypes at a genome-wide statistically significant p value. Such an index SNP can lead to functional studies to determine its biologic significance. However, other SNPs in high linkage disequilibrium with the index SNP can also show association with the phenotype, complicating the choice of which SNPs should be investigated for functional evidence. Schaid (Nature Reviews, 2018) reviewed statistical approaches for finding sets of SNPs that contain the “causal” SNP (or SNPs) with high probability. Most of these methods are computationally very intensive and try to identify a set of likely SNPs rather than a single causal SNP. The purpose of this project is to study simpler statistical methods aimed at identifying the causal SNP, under the assumption that there is only one. The method extends to finding the two causal SNPs if it is assumed there are two.
Although this problem arises prominently in GWAS studies, it can figure in other epidemiologic studies where there are several highly correlated exposures, only one (or two) of which is assumed to be causal. A standard approach would be to fit a multivariable model and choose the exposure with the largest Wald statistic (or smallest p value). I have been working on methods that incorporate both marginal and conditional estimates of exposure effects and found these to be much more powerful in simulations for identifying the causal exposure. I would like to test these methods using PLCO GWAS data.
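The standard approach mentioned above can be sketched on simulated data. This is only an illustration of choosing the exposure with the largest multivariable Wald statistic; it is not the proposed method, which is not specified here. The simulation settings (equicorrelated Gaussian exposures standing in for SNPs in high LD, a continuous outcome such as height, the effect size, and the function names `simulate` and `wald_z`) are all assumptions made for the example.

```python
import numpy as np

def wald_z(X, y):
    """Fit ordinary least squares with an intercept and return the
    Wald z-statistic (estimate / standard error) for each column of X."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - design.shape[1])
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return (beta / se)[1:]  # drop the intercept

def simulate(n=20000, p=5, rho=0.9, causal=2, effect=0.3, seed=0):
    """Hypothetical data: p exposures with pairwise correlation rho,
    exactly one of which (index `causal`) affects the outcome."""
    rng = np.random.default_rng(seed)
    cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = effect * X[:, causal] + rng.standard_normal(n)
    return X, y

X, y = simulate()

# Conditional (multivariable) Wald statistics: all exposures in one model.
conditional_z = wald_z(X, y)

# Marginal Wald statistics: one exposure per model, as in a GWAS scan.
marginal_z = np.array([wald_z(X[:, [j]], y)[0] for j in range(X.shape[1])])

# Standard rule: pick the exposure with the largest |Wald statistic|
# in the multivariable model.
pick = int(np.argmax(np.abs(conditional_z)))

print("marginal z:   ", np.round(marginal_z, 1))
print("conditional z:", np.round(conditional_z, 1))
print("selected exposure index:", pick)
```

In this setup every exposure shows a strong marginal association because of the high correlation, while the conditional statistics separate the causal exposure from its neighbors. With real genotypes, smaller samples, or weaker effects, the separation would be less clear, which is the difficulty the project addresses.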
Working with Mitchell Machiela in ITEB, I would like to examine the phenotypes defined by event indicators for all breast cancer, invasive breast cancer, colon cancer, and prostate cancer, and by baseline height. For the cancer outcomes I would need age at entry and the minimum of the times to event, to death, or to end of follow-up. For each phenotype, we need baseline age, gender, BMI, number of affected first-degree relatives, and self-described race/ethnicity. Data on principal components for ancestry and on genetic ancestry make-up will be derived from the genotype data. Initial analyses will focus on persons of European ancestry, but other analyses based on, e.g., African ancestry may offer additional analytical possibilities for identifying the “causal” SNP.
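The time variable described above for the cancer outcomes might be derived as in the following minimal sketch. The function name and the convention of passing `None` for something that did not occur are assumptions for illustration, not PLCO variable definitions.

```python
from typing import Optional, Tuple

def observed_follow_up(
    time_to_event: Optional[float],
    time_to_death: Optional[float],
    time_to_end: float,
) -> Tuple[float, int]:
    """Return (follow-up time, event indicator).

    Times are measured from study entry, in the same unit (e.g. years).
    Follow-up stops at the earliest of event, death, or end of follow-up;
    the indicator is 1 only if the event is what stopped follow-up.
    """
    times = [t for t in (time_to_event, time_to_death, time_to_end)
             if t is not None]
    stop = min(times)
    event = int(time_to_event is not None and time_to_event <= stop)
    return stop, event
```

Age at the end of observation then follows as age at entry plus the returned follow-up time.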
Currently, my proposed collaborator is Dr. Machiela, but if the methods seem useful, other collaborators may be enlisted. At present the only DCEG investigators who might need access to the data are Dr. Machiela and me. It is possible, however, that we would need help with data management from staff at IMS or at the DCEG Cancer Genomics Research Laboratory.

1. To test proposed analytical methods to help identify a causal SNP from among several highly correlated SNPs using PLCO GWAS data
2. To develop and extend the analytical methods if needed, depending on the results of initial analyses

If the procedures seem to work, I expect substantive projects could follow in collaboration with other DCEG investigators, but that is beyond the scope of the proposed project.


Dr. Mitchell Machiela National Cancer Institute, DCEG, ITEB