Overview
TOPMed generates scientific resources to enhance understanding of the fundamental biological processes that underlie heart, lung, blood and sleep disorders. It is part of a broader effort to harness data science to drive precision medicine, which aims to provide disease treatments that take into account each person's unique genes and environment. TOPMed integrates -omics data with molecular, behavioral, imaging, environmental, and clinical data from diverse participants in NHLBI's population and epidemiology studies. Integrating these data helps researchers expand their analyses, identify factors that increase or decrease the risk of disease, identify subtypes of disease, and develop more targeted and personalized treatments.
Currently, TOPMed's Freeze 3a includes 26 different studies with approximately 72,000 samples that underwent whole genome sequencing. The studies encompass several experimental designs (e.g., cohort, case-control, family) and many different clinical trait areas (e.g., asthma, COPD, atrial fibrillation, atherosclerosis, sleep). See study descriptions on the Parent Studies Descriptions & Statements page.
TOPMed WGS data are released in multiple waves. The first release, in October 2016, included approximately 8,600 samples in 15 separate dbGaP accessions, followed by four additional accessions in Nov/Dec 2016. These accessions are summarized in Table 1. Some TOPMed studies have previously released genotypic and phenotypic data on dbGaP in “parent” accessions (see Table 1). For those studies, the TOPMed WGS accession contains only WGS-derived data; genotype-phenotype analysis therefore requires access to data from both the parent and the TOPMed WGS accessions. For the studies in Table 1 without a specific parent accession number, the TOPMed WGS accession contains both genotype and phenotype data.
Table 1: Summary of TOPMed Study Accessions (Phase 1)
| TOPMed Study Accession Number | TOPMed Study Name | TOPMed Study PI | Approx. Sample Size - Oct 2016 release | Approx. Sample Size - total expected | Sequencing Center | Parent Study Accession Number | Phenotype Focus |
|---|---|---|---|---|---|---|---|
| phs000920 | NHLBI TOPMed: Genes-environments and Admixture in Latino Asthmatics (GALA II) Study | Esteban Burchard | 978 | 1000 | NYGC³ | phs001180 | Asthma |
| phs000921 | NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study | Esteban Burchard | 485 | 500 | NYGC | NA | Asthma |
| phs001062 | NHLBI TOPMed: Massachusetts General Hospital Atrial Fibrillation (MGH AF) Study¹ | Patrick Ellinor | 274 | 794 | BROAD⁴ | phs001001 | Atrial Fibrillation |
| phs001032 | NHLBI TOPMed: The Vanderbilt Genetic Basis of Atrial Fibrillation¹ | Dawood Darbar | 310 | 1140 | BROAD | NA | Atrial Fibrillation |
| phs000997 | NHLBI TOPMed: The Vanderbilt Atrial Fibrillation Ablation Registry¹ | M. Benjamin Shoemaker | 55 | 121 | BROAD | NA | Atrial Fibrillation |
| phs000993 | NHLBI TOPMed: Heart and Vascular Health Study (HVH)¹ | Susan Heckbert | 73 | 79 | BROAD | phs001013 | Atrial Fibrillation |
| phs001189 | NHLBI TOPMed: The Cleveland Clinic Atrial Fibrillation Study of the CV/Arrhythmia Biobank¹,² | Mina Chung | 0 | 363 | BROAD | phs000820 | Atrial Fibrillation |
| phs001211 | NHLBI TOPMed: Atherosclerosis Risk in Communities¹,² | Alvaro Alonso/Eric Boerwinkle | 0 | 81 | BROAD | phs000280 | Atrial Fibrillation |
| phs001040 | NHLBI TOPMed: Novel Risk Factors for the Development of Atrial Fibrillation in Women¹ | Christine Albert | 111 | 118 | BROAD | NA | Atrial Fibrillation |
| phs001024 | NHLBI TOPMed: Partners HealthCare Biobank¹ | Steven Lubitz | 127 | 128 | BROAD | NA | Atrial Fibrillation |
| phs000974 | NHLBI TOPMed: The Framingham Heart Study¹ | Vasan Ramachandran | 1757 | 4206 | BROAD | phs000007 | General heart, lung & blood (including atrial fibrillation) |
| phs000956 | NHLBI TOPMed: Genetics of Cardiometabolic Health in the Amish | Braxton Mitchell | 930 | 1120 | BROAD | NA | General heart, lung & blood |
| phs000951 | NHLBI TOPMed: Genetic Epidemiology of COPD (COPDGene) | Edwin Silverman | 1136 | 1880 | UW NWGC⁵ | phs000179 | COPD |
| phs000946 | NHLBI TOPMed: Boston Early-Onset COPD Study | Edwin Silverman | 55 | 75 | UW NWGC | phs001161 | COPD |
| phs000988 | NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa | Scott Weiss | 605 | 1082 | UW NWGC | NA | Asthma |
| phs000964 | NHLBI TOPMed: The Jackson Heart Study | Adolfo Correa | 1429 | 3418 | UW NWGC | phs000286 | General heart, lung & blood |
| phs000972 | NHLBI TOPMed: Genome-wide Association Study of Adiposity in Samoans | Stephen McGarvey | 298 | 383 | UW NWGC | phs000914 | Adiposity |
| phs000954 | NHLBI TOPMed: The Cleveland Family Study² | Susan Redline | 0 | 997 | UW NWGC | phs000284 | General heart, lung, blood & sleep |
| phs001143 | NHLBI TOPMed: The Genetics and Epidemiology of Asthma in Barbados² | Kathleen Barnes | 0 | 1096 | Illumina⁶ | NA | Asthma |
| TOTAL NUMBERS | | | 8623 | 18581 | | | |

¹ These studies comprise an atrial fibrillation case-control study, Patrick Ellinor TOPMed project PI
² Data for these studies are scheduled for release in Nov/Dec 2016
³ New York Genome Center
⁴ Broad Institute of MIT and Harvard
⁵ University of Washington Northwest Genomics Center
⁶ Illumina Genomic Services
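The parent-accession rule described above Table 1 can be expressed as a small lookup. The following is a minimal sketch in Python, not part of any TOPMed or dbGaP tooling: the dictionary copies a handful of rows from Table 1, and the names `PARENT_ACCESSION` and `accessions_needed` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional

# Subset of Table 1: TOPMed WGS accession -> parent study accession.
# None stands for "NA" in the table (no separate parent accession).
PARENT_ACCESSION: Dict[str, Optional[str]] = {
    "phs000920": "phs001180",  # GALA II
    "phs000921": None,         # SAGE
    "phs000974": "phs000007",  # Framingham Heart Study
    "phs000956": None,         # Amish
    "phs000951": "phs000179",  # COPDGene
    "phs000964": "phs000286",  # Jackson Heart Study
}


def accessions_needed(topmed_accession: str) -> List[str]:
    """Return the accessions needed for genotype-phenotype analysis.

    If a parent accession exists, phenotypes live there and both are needed;
    otherwise the TOPMed WGS accession holds genotypes and phenotypes.
    """
    parent = PARENT_ACCESSION[topmed_accession]
    if parent is None:
        return [topmed_accession]
    return [topmed_accession, parent]


print(accessions_needed("phs000974"))  # Framingham: WGS plus parent accession
print(accessions_needed("phs000921"))  # SAGE: WGS accession only
```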
The following sections of this document describe methods of data acquisition, processing and quality control (QC) for TOPMed WGS data contained in the 2016 releases. Briefly, approximately 30X whole genome sequencing was performed at several different Sequencing Centers (named in Table 1). All samples for a given study were sequenced at the same center, except for a small number of control samples described below. The reads were aligned to human genome build GRCh37 at each center using similar, but not identical, processing pipelines. The resulting binary alignment and map (BAM) files were transferred from all centers to the TOPMed Informatics Research Center (IRC), where they were re-aligned to build GRCh37 using a common pipeline to produce a set of ‘harmonized’ BAM files. Both the Sequencing Center-specific BAM files and the harmonized BAM files were deposited in the NCBI Sequence Read Archive (SRA), where they were converted to the ‘.sra’ file format. Both center-specific and IRC-harmonized .sra files are available to users with approved access to a given study. The IRC performed joint genotype calling on all samples in the October 2016 releases (along with additional samples to be released later). The resulting VCF files were split by study and consent group for distribution to approved dbGaP users, but they can be reassembled easily for cross-study, pooled analysis because the files for all studies contain the same variant sites. Quality control was performed at each stage of the process by the Sequencing Centers, the IRC and the TOPMed Data Coordinating Center (DCC). Only samples and variants that passed QC are included in the genotype call sets distributed with the 2016 releases.
Sequence/genotype data files provided in the 2016 dbGaP releases include the following:
- Aligned read data for each sample in ‘.sra’ format (which is readily convertible to BAM format). Each sample has two .sra files: one from the Sequencing Center and the other from the IRC
- Genotype call sets (one per chromosome) in ‘.vcf’ format
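Because the VCF files for all studies contain the same variant sites, pooling two studies amounts to joining their sample columns site by site. The sketch below illustrates that idea in plain Python on in-memory VCF lines. It is an illustration under stated assumptions, not the IRC's procedure: the function name `merge_vcf_samples` is hypothetical, both inputs are assumed to list identical sites in the same order with the same FORMAT field, and site-level INFO values (which may differ per study) are simply taken from the first file. For real data a dedicated tool such as `bcftools merge` would be the usual choice.

```python
from typing import List, Tuple


def _split_vcf(lines: List[str]) -> Tuple[List[str], List[str], List[List[str]]]:
    """Separate '##' meta lines, the '#CHROM' header fields, and data records."""
    meta = [ln for ln in lines if ln.startswith("##")]
    header = next(ln for ln in lines if ln.startswith("#CHROM")).split("\t")
    records = [ln.split("\t") for ln in lines if not ln.startswith("#")]
    return meta, header, records


def merge_vcf_samples(vcf_a: List[str], vcf_b: List[str]) -> List[str]:
    """Append the sample columns of vcf_b to vcf_a, checking that sites match."""
    meta_a, header_a, recs_a = _split_vcf(vcf_a)
    _, header_b, recs_b = _split_vcf(vcf_b)
    if len(recs_a) != len(recs_b):
        raise ValueError("files do not contain the same number of sites")

    merged = list(meta_a)
    # Columns 0-8 are the fixed VCF fields plus FORMAT; samples start at index 9.
    merged.append("\t".join(header_a + header_b[9:]))
    for rec_a, rec_b in zip(recs_a, recs_b):
        # Compare CHROM, POS, ID, REF, ALT and the FORMAT field.
        if rec_a[:5] != rec_b[:5] or rec_a[8] != rec_b[8]:
            raise ValueError(f"site mismatch at {rec_a[0]}:{rec_a[1]}")
        merged.append("\t".join(rec_a + rec_b[9:]))
    return merged


# Tiny made-up example: one site, one sample per study.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t"
study_a = ["##fileformat=VCFv4.2", header + "NWD000001", "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1"]
study_b = ["##fileformat=VCFv4.2", header + "NWD000002", "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t1/1"]
for line in merge_vcf_samples(study_a, study_b):
    print(line)
```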
TOPMed DNA sample/sequencing-instance identifiers
Each DNA sample processed by TOPMed was given a unique identifier consisting of “NWD” followed by six digits (e.g. NWD123456). These identifiers are unique across all TOPMed studies. Each NWD identifier is associated with a single study subject identifier used in other dbGaP files (such as phenotypes, pedigrees and consent files). A given subject identifier may link to multiple NWD identifiers when duplicate samples are taken from the same individual. Study investigators assigned NWD IDs to subjects, and their biorepositories assigned DNA samples/NWD IDs to specific bar-coded wells/tubes supplied by their Sequencing Center and recorded those assignments in a sample manifest, along with other metadata (e.g. sex, DNA extraction method). At each Sequencing Center, the NWD ID was propagated through all phases of the pipeline and is the primary identifier in all results files. Each NWD ID resulted in a single sequencing instance (i.e. ‘run’ in SRA terminology).
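The identifier rules above (the “NWD” plus six digits format, one subject per NWD ID, possibly several NWD IDs per subject) can be checked with a few lines of Python. This is a minimal sketch: the function names and the example manifest rows, including the subject identifiers, are hypothetical and are not taken from any TOPMed file.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# "NWD" followed by exactly six digits, e.g. NWD123456.
NWD_PATTERN = re.compile(r"NWD\d{6}")


def is_valid_nwd_id(identifier: str) -> bool:
    """Return True if the whole string has the NWD + six digits form."""
    return NWD_PATTERN.fullmatch(identifier) is not None


def group_nwd_ids_by_subject(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (nwd_id, subject_id) pairs by subject.

    Raises ValueError for a malformed NWD ID or for an NWD ID that appears
    more than once, since each NWD ID maps to a single subject.
    """
    seen = set()
    by_subject: Dict[str, List[str]] = defaultdict(list)
    for nwd_id, subject_id in rows:
        if not is_valid_nwd_id(nwd_id):
            raise ValueError(f"malformed NWD ID: {nwd_id!r}")
        if nwd_id in seen:
            raise ValueError(f"NWD ID listed more than once: {nwd_id}")
        seen.add(nwd_id)
        by_subject[subject_id].append(nwd_id)
    return dict(by_subject)


# Hypothetical manifest rows: subject S1 has a duplicate sample.
manifest = [("NWD123456", "S1"), ("NWD654321", "S1"), ("NWD111111", "S2")]
print(group_nwd_ids_by_subject(manifest))
```

Subjects with more than one NWD ID in the result are the duplicate samples that can be used for discordance checks.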
Control Samples
One parent-offspring trio from the Framingham Heart Study (FHS) was sequenced at each of four Sequencing Centers (family ID 746, subject IDs 13823, 15960 and 20156). All four WGS runs for each subject are provided in the TOPMed FHS accession (phs000974). In addition, HapMap subjects NA12878 (CEU, Lot K6) and NA19238 (YRI, Lot E2) were sequenced at each of the Sequencing Centers in alternation, once approximately every 1000 study samples. The HapMap sequence data will be released publicly as a BioProject in Q4 2016 or Q1 2017.
The average pairwise non-reference genotype discordance rate among 69 pairs of duplicate sequenced samples is 5 × 10⁻⁵ on the set of variants included in this release. The genotype discordance rate is very sensitive to the stringency level used for variant site filtering. This low figure is evidence of the benefit of 30X whole genome sequencing and suggests that the current filtering threshold suitably balances sensitivity and specificity. Note, however, that these 27 control samples were among 4,047 duplicate and related samples which provided a negative training set for the SVM classifier used for site-level filtering (see Variant Filtering section).
To calculate non-reference discordance, the genotypes of each DNA sample are called independently from separate sets of sequence reads, often from different Sequencing Centers. The denominator for each pairwise comparison is the number of sites where at least one of the two samples has a non-reference genotype called (either het or hom-alt). The numerator is the number of sites where the two genotypes disagree.
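The numerator and denominator defined above translate directly into code. The sketch below is a minimal illustration in Python, not the pipeline actually used for the release. It assumes genotypes are encoded as alternate-allele counts (0 = hom-ref, 1 = het, 2 = hom-alt) at the same ordered list of sites, and it makes the further assumption, not stated in the text, that sites where either call is missing (`None`) are skipped. The function name is hypothetical.

```python
from typing import Optional, Sequence


def non_reference_discordance(
    calls_a: Sequence[Optional[int]],
    calls_b: Sequence[Optional[int]],
) -> float:
    """Non-reference discordance between two genotype vectors.

    Denominator: sites where at least one sample has a non-reference call.
    Numerator: those sites where the two genotypes disagree.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("genotype vectors must cover the same sites")

    numerator = 0
    denominator = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue  # assumption: ignore sites with a missing call
        if a > 0 or b > 0:
            denominator += 1
            if a != b:
                numerator += 1
    if denominator == 0:
        return float("nan")  # no non-reference sites to compare
    return numerator / denominator


# Five sites: four have a non-reference call in at least one sample,
# and the genotypes disagree at two of those four.
sample_1 = [0, 1, 2, 1, 0]
sample_2 = [0, 1, 2, 2, 1]
print(non_reference_discordance(sample_1, sample_2))  # 0.5
```

Sites where both samples are homozygous reference are excluded from the denominator, which keeps the rate from being diluted by the large majority of sites at which a given pair carries no variant.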