TOPMed

Data Access: Where are the data?

TOPMed Members

TOPMed Members are designated by TOPMed Project and Center PIs (i.e., data contributors and data generators). TOPMed Working Groups and the originating parent study teams may apply to use TOPMed data "pre-public release" in dbGaP Exchange Areas. This means they can use their own study's TOPMed data before the data are submitted to dbGaP and assigned accessions for controlled access by the scientific community, and before the IRC submits selected data to the NHLBI BioData Catalyst® ecosystem. They may also apply for access to released TOPMed data via the same process as other scientific community members.

The Scientific Community (Those Who are Not TOPMed Members)

TOPMed genomic and pre-existing parent study phenotypic data are available in multiple venues to members of the scientific community who have been granted access through the database of Genotypes and Phenotypes (dbGaP). Individual-level molecular and phenotypic data and select Genomic Summary Results (GSR) are available through controlled-access TOPMed study accessions in NIH-designated repositories; see TOPMed Data Access for the Scientific Community for more information. Publicly available resources include the BRAVO variant server and NHLBI BioData Catalyst®.

Accessing TOPMed data in BioData Catalyst (BDC) lets researchers create cohorts that draw on other datasets (with appropriate access permissions) and use its data analysis tools, applications, and workflows to accelerate their research.

Phenotypes:

When the Parent study has a dbGaP accession that preceded the existence of the TOPMed program, phenotypic data are in the Parent accession. Otherwise, the phenotypic data are in the TOPMed accession. A number of phenotypes have been harmonized across the TOPMed program.

Genotypes:

Unphased genotype calls from TOPMed WGS are available in the TOPMed accession as Variant Call Format (VCF) files. Studies may have multiple sets of VCF files corresponding to the various TOPMed data freezes. The VCF files contain variant-level quality metrics and a support vector machine (SVM) quality filter. The table below summarizes TOPMed WGS characteristics by freeze.

Latest Freezes
TOPMed freeze (methods documents linked)	Date	Genome Build	n_variants	n_samples	n_studies
freeze.5b	Sep 2017	38	582M	56K	32
freeze.8	Feb 2019	38	1.02B	138K	72
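
As a minimal sketch of how the quality filter described above might be applied, the Python below keeps only records whose FILTER column is PASS. It relies on the standard VCF column order and assumes that records failing the SVM filter carry a non-PASS value in that column; the function names are illustrative and are not part of any TOPMed tooling.

```python
import gzip
from typing import Iterator, List, TextIO


def open_vcf(path: str) -> TextIO:
    """Open a plain-text or gzip/bgzip-compressed VCF file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def passing_records(handle: TextIO) -> Iterator[List[str]]:
    """Yield the tab-split fields of each record whose FILTER column is PASS."""
    for line in handle:
        if line.startswith("#"):
            # Skip meta-information lines (##...) and the column header (#CHROM...).
            continue
        fields = line.rstrip("\n").split("\t")
        # Standard VCF column order: CHROM POS ID REF ALT QUAL FILTER INFO ...
        if len(fields) > 6 and fields[6] == "PASS":
            yield fields
```

For files of TOPMed size, a dedicated tool such as bcftools will be much faster; this version is only meant to show where the filter status lives in each record.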

Genotypes (WGS):

Read alignment data (WGS): A limited number of TOPMed Phase 1 Compressed Reference-oriented Alignment Map files (CRAMs) aligned to build 37 are available directly through the dbGaP Sequence Read Archive (SRA). These are accessible via their corresponding TOPMed accessions with dbGaP approval. All other CRAMs, including build 38 alignments for all TOPMed WGS samples, are hosted in NHLBI cloud buckets and accessed using Fusera software.

Non-WGS OMICS:

TOPMed is generating a rich resource of multi-omics data that will include approximately 40,000 samples undergoing RNA-sequencing, 37,000 samples from metabolomics profiling, 57,000 samples from DNA methylation, and 4,000 samples from proteomics assaying. These projected totals cover samples at all stages of progress: DNA/RNA that is currently being extracted, samples that are undergoing sequencing/profiling, and samples that have completed the sequencing/profiling pipelines. Omics data will be released to the scientific community via NIH-designated repositories (dbGaP and BioData Catalyst).

Non-WGS omics pipelines and flowcharts are available on the Methods webpage (scroll down)
Omics progress

How do I apply for access?

Users who want to use controlled-access TOPMed data need to apply for access by following the dbGaP instructions for requesting controlled-access data.

Since participant consent and data use limitations (DULs) differ across and within TOPMed studies, requests for access to controlled-access data must be made for each dataset and applicants need to review DULs carefully to ensure that proposed Research Use Statements (RUS) are consistent with the study-consent group(s) being requested. Furthermore, some TOPMed studies have consent modifiers that may require additional documentation, such as documentation of local IRB approval and/or letters of collaboration with the primary study PI(s).

Applicants should investigate whether phenotype data are deposited in the TOPMed or the Parent accession for the studies of interest. If the latter, then applicants will need to specifically apply for access to the Parent accession for phenotypes in addition to applying to the TOPMed accession for TOPMed WGS genotypes. Phs numbers for TOPMed and Parent accessions are available in the dbGaP methods documents.
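
The rule above can be sketched as a small helper that lists which accessions to request for a study. This is an illustration only: the `Study` record, its field names, and the placeholder accession strings in the usage note are hypothetical, and it assumes that a study with a separate Parent accession keeps its phenotypes there.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Study:
    name: str
    topmed_phs: str                   # TOPMed accession (holds WGS genotypes)
    parent_phs: Optional[str] = None  # Parent accession, if phenotypes live there


def accessions_to_request(study: Study, need_phenotypes: bool = True) -> List[str]:
    """Return the accessions an applicant would need to request for one study."""
    requests = [study.topmed_phs]
    # Phenotypes sit in the Parent accession when one exists (assumption stated above),
    # so it has to be requested in addition to the TOPMed accession.
    if need_phenotypes and study.parent_phs is not None:
        requests.append(study.parent_phs)
    return requests
```

For example, `accessions_to_request(Study("ExampleStudy", "phs_TOPMED_EXAMPLE", "phs_PARENT_EXAMPLE"))` returns both placeholder accessions, while a study with no Parent accession returns only its TOPMed accession. Each accession still has to be requested per consent group, as described above.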

CDS-IRB:

The NHLBI established the Clinical Data Science IRB (CDS-IRB) to provide a useful resource for the research community by offering—at no cost—central review of secondary research proposals utilizing NHLBI datasets for which IRB approval is required.

As a central IRB for research protocols that propose secondary analyses of existing clinical data, the CDS-IRB will address the growing complexity of research and non-traditional uses of biomedical data. Broad utilization of the CDS-IRB will also provide an opportunity for the NHLBI to systematically understand the evolution and range of requests to conduct secondary analyses, recognize emerging trends, and explore ways to enhance data stewardship with the research community.

How do I use the data?

Running mega-analyses across TOPMed studies requires combining genotype and phenotype data across individual dbGaP accessions.

Combining Genotypes:

The Informatics Research Center’s (IRC) joint calling process produces a multi-study VCF file for each chromosome, and each of these files is split into study-specific components. For studies with multiple consent groups, the components are further divided by consent group and deposited in the study’s TOPMed accession. The same variants occur in all VCF components of a given call set. To construct a multi-study VCF file for analysis, a user must apply for access to each study-consent group and reassemble the components. Note that some TOPMed accessions have VCF files for more than one data freeze, so users must take care to select VCF files from the same freeze for their multi-study reassembly. Tools for combining VCF files include vcftools and bcftools.
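
The reassembly step can be sketched as follows. This Python helper groups components by chromosome for a single freeze and builds one `bcftools merge` command per chromosome without running it. The `VcfComponent` record, the output naming, and the choice of `--merge none` are assumptions made for illustration; they do not describe an IRC-recommended procedure. `bcftools merge` expects bgzip-compressed, indexed inputs.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class VcfComponent:
    freeze: str  # e.g. "freeze.8"
    chrom: str   # e.g. "chr1"
    path: str    # local path to one study-consent component (bgzipped, indexed)


def build_merge_commands(
    components: List[VcfComponent], freeze: str, out_dir: str = "."
) -> Dict[str, List[str]]:
    """Build one `bcftools merge` argument list per chromosome for a single freeze."""
    by_chrom: Dict[str, List[str]] = defaultdict(list)
    for comp in components:
        if comp.freeze == freeze:  # never mix components from different freezes
            by_chrom[comp.chrom].append(comp.path)
    if not by_chrom:
        raise ValueError(f"no VCF components found for {freeze}")

    commands: Dict[str, List[str]] = {}
    for chrom, paths in sorted(by_chrom.items()):
        out_path = f"{out_dir}/{freeze}.{chrom}.merged.vcf.gz"
        commands[chrom] = [
            "bcftools", "merge",
            "--merge", "none",       # do not build new multiallelic records
            "--output-type", "z",    # write compressed VCF
            "--output", out_path,
            *sorted(paths),
        ]
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the inputs have been indexed. Filtering on the freeze before grouping is what guards against mixing call sets.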

Combining Phenotypes:

The Parent studies contributing to TOPMed have many phenotypic measures in common, thereby providing opportunities for cross-study analyses to gain power in detecting genetic effects. However, these studies’ designs differ in how their phenotypic data were collected, and in how their data are annotated and structured. Creating harmonized phenotypic data sets for cross-study analyses is therefore a challenging and largely manual process. Users will need to carefully evaluate the source phenotypes and accompanying documentation before attempting to harmonize across studies. The TOPMed phenotype harmonization efforts are described under TOPMed harmonized phenotypes for the scientific community, along with a phenotype tagging project that can assist members of the scientific community in finding related phenotype variables to perform their own harmonization.

A note on Sample/subject identifiers:

The TOPMed ACC centrally assigns each molecular sample in the TOPMed program a unique sample identifier (e.g., for DNA, “NWD” followed by 6 digits), which is used in all files containing TOPMed sequence, genotype, or other molecular data. Subject (aka participant or individual) identifiers are assigned by study investigators and are not guaranteed to be unique across all studies. The subject identifiers are associated with individual-level phenotypic data and, in most cases, are consistent between the TOPMed and Parent accessions for a given study. Mappings between sample and subject identifiers, as well as subject ID aliases, are given in standard dbGaP files labeled as subject-sample mapping and subject consent files.
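
To illustrate the identifier scheme, here is a minimal Python sketch that checks the DNA sample ID format described above and loads a subject-sample mapping into a lookup table. The column names `SUBJECT_ID` and `SAMPLE_ID`, the tab-delimited layout, and the leading `#` comment lines are assumptions about the mapping file. The lookup stores the study name alongside each subject ID because subject identifiers are not guaranteed to be unique across studies.

```python
import csv
import re
from typing import Dict, TextIO, Tuple

# "NWD" followed by exactly six digits, as described for DNA samples.
NWD_PATTERN = re.compile(r"NWD\d{6}")


def is_topmed_dna_sample_id(sample_id: str) -> bool:
    """Return True if the string has the TOPMed DNA sample ID shape."""
    return NWD_PATTERN.fullmatch(sample_id) is not None


def load_sample_to_subject(handle: TextIO, study: str) -> Dict[str, Tuple[str, str]]:
    """Map each sample ID to a (study, subject ID) pair from one mapping file."""
    # Drop blank lines and '#' comment lines before handing rows to the CSV reader.
    data_lines = (
        line for line in handle if line.strip() and not line.startswith("#")
    )
    reader = csv.DictReader(data_lines, delimiter="\t")
    mapping: Dict[str, Tuple[str, str]] = {}
    for row in reader:
        mapping[row["SAMPLE_ID"]] = (study, row["SUBJECT_ID"])
    return mapping
```

Mappings loaded from several studies can be combined with `dict.update`, since the sample IDs are unique program-wide while the `(study, subject ID)` pairs keep same-named subjects from different studies apart.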

Where can I access variant summary data?

The following resources provide summary-level information on variants observed in TOPMed (e.g., allele frequencies, association results), or other non-individual-level data (e.g., imputation server).