Principal Investigator
Name
Timothy O'Connor
Degrees
PhD
Institution
University of Maryland School of Medicine
Position Title
Associate Professor
About this CDAS Project
Study
PLCO
Project ID
PLCO-790
Initial CDAS Request Approval
Jul 26, 2021
Title
Genetics of Latin American Diversity (GLAD) database with application to Cancer.
Summary
Latin American (LA) individuals, both inside and outside the U.S., are significantly understudied in genomics. As of 2018, only 1.3% of study participants in the GWAS Catalog were LA, 0.03% were Native American, and 2.4% were African. However, studies of LAs can also provide a number of advantages in identifying novel disease loci, including the use of admixture mapping and simultaneous access to European, Native American, and African haplotypes.

We propose a publicly available database that accommodates the complexity of LA genomic ancestry. This will enable others to perform more powerful admixture mapping in their samples without transferring individual-level data. We will do this by combining sequence and genotyping array data from publicly available sources to create a database of potential controls. We are developing a matching algorithm to connect external samples with these internal controls via summary statistics of genomic background. Through an interactive portal, we can provide scalable access to summary statistics that can be used for association analyses and variant prioritization.

We will create and curate a publicly available database to accommodate the complexity of Latin American genomic ancestry and enable others to perform more powerful associations without transferring individual-level data. The genetics of Latin American individuals are understudied, so this project will collect and curate available resources across many studies. This database will provide the largest pool of controls for Latin American association tests at a genetic variant level and for comparing the ancestry of different segments of the genome, an approach known as admixture mapping. Using non-individual data, the user will summarize the genetics of their disease cohort and use the database to find matching controls with a similar genetic background. Users will then receive a non-individual summary of the genetic variant frequencies and ancestry by genome segment for the matching controls. These statistics can be combined with the original cases to perform a genome-wide association analysis of both the variants and the ancestry by segment. This will allow studies with limited access to controls to augment their samples and increase the power of association testing. As a case study, we will use cancer phenotypes to generate GWAS and polygenic risk scores specific to Latin American individuals.
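To illustrate how summary statistics for matched controls might be combined with a user's own cases, the following is a minimal sketch of a single-variant allelic chi-square test. It is not the project's method: the function name, its parameters, and the choice of a plain 1-degree-of-freedom test are assumptions made for illustration, and a real analysis would also account for ancestry and other covariates.

```python
import math


def allelic_chi_square(case_alt, case_total, ctrl_freq, ctrl_n):
    """Hypothetical helper: 1-df allelic chi-square test from summary data.

    case_alt, case_total: alternate-allele count and total allele count in cases.
    ctrl_freq, ctrl_n: alternate-allele frequency and number of diploid
    controls, as might be returned in a non-individual summary.
    """
    ctrl_total = 2 * ctrl_n              # two alleles per diploid control
    ctrl_alt = ctrl_freq * ctrl_total    # expected alternate-allele count

    # 2x2 table: rows are cases/controls, columns are alt/ref allele counts.
    table = [
        [case_alt, case_total - case_alt],
        [ctrl_alt, ctrl_total - ctrl_alt],
    ]
    n = case_total + ctrl_total
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up numbers: 150 alternate alleles among 1000 case alleles,
# against 500 matched controls with alternate-allele frequency 0.10.
chi2, p = allelic_chi_square(150, 1000, 0.10, 500)
print(f"chi2={chi2:.3f}, p={p:.2e}")
```

The sketch assumes a variant that is polymorphic in the combined sample; a monomorphic variant would give an expected count of zero and would need to be filtered out beforehand.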

We believe this project will bring together the best available data for LA individuals, identify gaps in our understanding of genetic diversity, and allow for more powerful tests of disease association for researchers studying this diverse group.
Aims

Our goals are six-fold:
1) Combine existing publicly available LA and Native American samples imputed with sequence data from the TOPMed Project (N>90K), via the Michigan Imputation Server. The cohorts selected have all been found to have appropriate consent and at least 50 LA samples.
2) Analyze the data, including ancestry of haplotypes, genome-wide ancestry, allele frequencies, and sampling location. Our analysis in this proposal does not focus on origins or ancestry; we utilize admixture dynamics for disease association.
3) Develop a system to match external samples to controls using non-identifiable summary statistics of genomic background and covariate distribution values, e.g., mean and standard deviation of age or BMI. We will use disease status to help eliminate matches that might have the same disease. For instance, those known to be cases for prostate cancer would be eliminated from a search for prostate cancer controls. We will use ascertainment status (e.g., case/control, population sampling, family sampling) to help filter individuals as well. An option will be presented to the user to help them select the appropriate pool for controls.
4) Provide genome-wide genotype frequencies and average ancestry of local haplotypes from selected matches.
5) Identify gaps in our assessment of genomic information from LA individuals.
6) Run a GWAS and generate polygenic risk scores for cancer phenotypes by combining PLCO data with other cancer data already in hand.
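
The matching step in aim 3 could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Control` record, the `match_controls` function, the three-way ancestry tuple, and the z-score threshold are all hypothetical and are not the project's actual algorithm, which is described above only as being under development.

```python
from dataclasses import dataclass


@dataclass
class Control:
    """Hypothetical record for one database control."""
    sample_id: str
    ancestry: tuple          # (European, Native American, African) proportions
    diseases: frozenset = frozenset()  # known case statuses


def match_controls(controls, cohort_mean, cohort_sd, disease, max_z=2.0):
    """Return IDs of controls whose genome-wide ancestry lies within
    max_z standard deviations of the cohort summary on every component,
    excluding controls known to be cases for the disease under study.

    cohort_mean, cohort_sd: per-component summary statistics supplied by
    the user (no individual-level data). Every SD is assumed to be > 0.
    """
    matches = []
    for c in controls:
        if disease in c.diseases:
            continue  # e.g. drop prostate cancer cases from prostate cancer controls
        z_scores = [
            abs(a - m) / s
            for a, m, s in zip(c.ancestry, cohort_mean, cohort_sd)
        ]
        if max(z_scores) <= max_z:
            matches.append(c.sample_id)
    return matches


# Made-up example data.
pool = [
    Control("c1", (0.50, 0.40, 0.10)),
    Control("c2", (0.50, 0.40, 0.10), frozenset({"prostate cancer"})),
    Control("c3", (0.90, 0.05, 0.05)),
]
print(match_controls(pool, (0.50, 0.40, 0.10), (0.10, 0.10, 0.05), "prostate cancer"))
```

Filtering on ascertainment status or on covariates such as age or BMI could be added in the same way, as further conditions inside the loop.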

Collaborators

Ryan Hernandez, UCSF
Ignacio Mata, Cleveland Clinic
Eduardo Tarazona-Santos, Universidade Federal de Minas Gerais
These individuals are collaborating in a consulting capacity and will not have access to individual-level data unless a specific modification or re-application requesting the data is submitted at a future date.