This guide is for TOPMed Working Groups. It describes how phenotype data and genotype call sets can be accessed and shared across studies, using the dbGaP TOPMed Exchange Areas (EAs).
The EAs are temporary file storage spaces, provided by dbGaP for the duration of the TOPMed project. Each study has a separate EA, where phenotypes (and related ancillary data) may be stored. Each study-specific EA has a link to SRA files for that study. (SRA files can be converted to BAMs.) There is also a common EA, for the whole TOPMed project, which stores cross-study genotype call sets as well as variant and sample annotation. In the near future, the common EA will contain kinship coefficients, PCs and GRMs calculated across all studies, but these are not available yet. Approval to access any one of the study-specific EAs also provides access to the common EA. The process to obtain approval is described in the “Downloading data” section.
In the following, we assume that phenotype data will be prepared by Working Group members affiliated with each study to be included in the analysis. In the near future, the ACC will begin harmonizing phenotypes across TOPMed studies and will deposit these data in the study-specific EAs for general use.
Data sharing by working group members
Figure: Diagram of data transfers to assemble cross-study phenotype and genotype data at one location for performing cross-study analysis. In this example, three studies (A, B and C) each upload their phenotype data to their respective EAs. Then a downloader for study B downloads those data onto the computing resources of study B, where the cross-study association analysis takes place. Only one analysis site is shown here (Study B’s computers), but combined analyses may occur at multiple sites.
The figure above shows a hypothetical example of data sharing among three studies (A, B, and C) so that cross-study analyses can be performed at one or more study sites. Investigators affiliated with these studies belong to the TOPMed Blood Pressure Working Group. They formed a writing group and decided to run a whole-genome association analysis of BP using individual-level data pooled across studies. This involves the following steps:
- The writing group members decide upon an analysis plan, which includes specification of the outcome (e.g. systolic BP), as well as all covariates, and exclusions.
- Each study assigns a person to assemble the data from their own study, following the analysis plan’s protocol for the outcome and covariates, and any sample exclusions. The analysis plan should specify the columns of phenotype data to be supplied (e.g. outcome, age, sex, study, BP-lowering medication status) and the corresponding variable names. The plan should also address how to handle relatedness and population structure and whether pedigrees should be provided.
- Each study uploads its data file to its own study-specific EA. The “uploader” for each study is an individual authorized by a study PI to upload data to their study’s EA (see “Uploading data”). These individuals may or may not also be Working Group members.
- One or more working group members are chosen as the cross-study analyst(s). The example shows just one cross-study analyst (from study B). This analyst will ask study B’s “downloader” to download the data files from the participating studies’ EAs to the computing resources of Study B (see “Downloading data”). The analyst will also ask the downloader to download the cross-study genotype call set, along with sample and variant annotation; these data are available upon access to any study-specific EA.
- The cross-study analyst will check the phenotype files to make sure they conform to the analysis plan. If the analysis requires measures of relatedness and population structure not available from the general-use versions, the cross-study analyst is responsible for creating these.
- The cross-study analyst will perform association testing with the combined phenotype and genotype data, generally using local computing resources. A study may use cloud computing if this is specified in each of its Data Access Requests and those requests are approved. In addition, the TOPMed Computing Infrastructure Committee is considering various cloud computing options for general usage, but these are not available yet.
- Association test results files will be uploaded to the analyst’s study-specific EA by the uploader for that study. Individuals from the other studies involved in the analysis may download the results files, through their study’s downloader.
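The phenotype file check in the steps above could be sketched as follows. This is a minimal illustration, not a TOPMed tool: the column names (sample_id, study, sbp, age, sex, bp_med) are hypothetical stand-ins for whatever the analysis plan specifies, and the function name is made up for this example. It only checks that the required columns are present, that no required value is empty, and that no sample ID is repeated within a file.

```python
import csv
import io

# Hypothetical column names; the real list comes from the writing group's analysis plan.
REQUIRED_COLUMNS = ["sample_id", "study", "sbp", "age", "sex", "bp_med"]


def check_phenotype_file(handle, required=REQUIRED_COLUMNS):
    """Return a list of problems found in one study's phenotype CSV."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in required if name not in header]
    if problems:
        return problems  # row checks are meaningless without the expected columns

    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        sample_id = row["sample_id"]
        if sample_id in seen_ids:
            problems.append(f"line {line_number}: duplicate sample_id {sample_id}")
        seen_ids.add(sample_id)
        for name in required:
            if not (row[name] or "").strip():
                problems.append(f"line {line_number}: empty value for {name}")
    return problems


# In-memory example: a file that lacks the medication-status column.
example = io.StringIO("sample_id,study,sbp,age,sex\nS1,A,120,54,F\n")
print(check_phenotype_file(example))
```

In practice the analyst would open each downloaded file and pass the file handle instead of the in-memory example, then resolve any reported problems with the contributing study before pooling.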
Details regarding uploading and downloading data
Uploading and downloading require different authorization and file transfer mechanisms:
Uploading data
Study PIs may request uploader status for one or more individuals during their study’s TOPMed dbGaP registration or after it is complete. Files may be uploaded to the study’s EA through the dbGaP Submission Portal: log in to dbGaP with your eRA or myNCBI ID, go to the Submission Portal and click on your TOPMed study's name. When your study's page opens, scroll to the bottom tab labeled “Other files”, click on the menu “Select file type” and select “Exchange area files”. Please use a consistent file-naming convention for uploaded files, such as “StudyName_DomainOrSubdomain_YYYYMMDD_initials.csv”. For more detailed information, please see Uploading to the dbGaP Exchange Area.
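One way to keep file names consistent is to generate them rather than type them. The sketch below builds a name in the StudyName_DomainOrSubdomain_YYYYMMDD_initials.csv pattern suggested above. The function name, the regular expression, and the rule that name parts contain no underscores are assumptions made for this illustration; dbGaP does not enforce them.

```python
import re
from datetime import date

# Assumed reading of the suggested pattern: four underscore-separated parts,
# so the parts themselves must not contain underscores.
EA_NAME = re.compile(r"^[A-Za-z0-9]+_[A-Za-z0-9]+_\d{8}_[A-Za-z]+\.csv$")


def ea_file_name(study, domain, initials, on=None):
    """Build StudyName_DomainOrSubdomain_YYYYMMDD_initials.csv for an upload."""
    on = on or date.today()
    name = f"{study}_{domain}_{on:%Y%m%d}_{initials}.csv"
    if not EA_NAME.match(name):
        raise ValueError(f"name does not follow the convention: {name}")
    return name


print(ea_file_name("StudyB", "BloodPressure", "JD", date(2017, 3, 1)))
```

Rejecting underscores inside a part keeps the name easy to split back into study, domain, date and initials later.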
Submissions to the EA are not curated by dbGaP staff and are not released for access by the general scientific community unless specifically requested by the study PI. In that case, the data must be re-submitted through the Submission Portal. Therefore the EAs allow studies to exchange data that are not necessarily destined for sharing beyond TOPMed investigators. However, TOPMed investigators with approved access to a given study-specific EA may download data from it, so study PIs should note that any data placed in an EA may be broadly available within TOPMed to anyone with approved access to their study’s EA. See also “Confidentiality of Working Group documents”.
Downloading data
Access to data stored in TOPMed EAs is controlled through dbGaP Data Access Requests (DARs), which are reviewed and either approved or disapproved by the NHLBI Data Access Committee (DAC). TOPMed has a generic DAR and expedited review process. This process requires that each applicant be included in a list of eligible investigators, who are nominated by study PIs. In addition, investigators are not added to this list until the study with which they are affiliated has completed their study registration, has uploaded required phenotype data for public release and has submitted a majority of their DNA samples for sequencing. Instructions for how to submit a DAR are given in the Instructions for Online Data Access Request.
In the example given in the figure above, the PI of study B (or one of his/her eligible collaborators) must apply for access to the EAs of studies A, B and C. Once the DAC approves these DARs, the applicant may assign “downloader” status to one or more individuals within his/her group. See NCBI's instructions for how to authorize downloaders. Downloads are done through a web interface at dbGaP. Log in to dbGaP using your eRA Commons ID, select “Controlled Access Data”, go to “My Requests”, select “Request Files” next to the accession of interest, then select the “Provisional files” tab and create a download request. For more detailed information, please see Downloading from the dbGaP Exchange Area.
Confidentiality of Working Group documents
Some Working Groups have asked whether access to data files and other documents within EAs can be limited to members of a particular Working Group. Unfortunately, providing access limited by Working Group membership is beyond the scope of dbGaP support for EAs. However, Working Group members could password-protect their files before submitting them to the EA and share the password only among members of that group. For example, a set of results files can be zip archived with a password prior to upload to the EA; Microsoft files can be protected from the Tools or File tab, depending on version.
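As one possible way to create a password-protected archive, the sketch below builds a command line for the Info-ZIP zip utility, whose -e option prompts for a password on the terminal. It assumes that utility is installed. The helper name and the file names are hypothetical. Python's standard zipfile module cannot write encrypted archives, which is why an external tool is used here.

```python
import shlex


def zip_encrypt_command(archive, files):
    """Build the argument list for Info-ZIP `zip -e`, which prompts for a password."""
    if not files:
        raise ValueError("no files to archive")
    return ["zip", "-e", archive, *files]


# Hypothetical archive and results file names.
command = zip_encrypt_command(
    "StudyB_BPResults_20170301_JD.zip", ["chr1_results.csv", "chr2_results.csv"]
)
print(shlex.join(command))
# To create the archive, pass the list to subprocess.run(command, check=True)
# from an interactive terminal so that zip can prompt for the password.
```

Letting zip prompt for the password keeps it out of the shell history and the process list. Classic zip encryption is considered weak, so groups with stricter requirements may prefer a tool that offers AES encryption.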
Current status of EA content and access
Nearly all of the phase 1 studies and some phase 2 studies have a study-specific EA to which they may upload files. The IRC has uploaded the freeze 4 cross-study genotype call set (~18,000 samples) to the common EA, along with variant quality and related accessory data. The IRC has uploaded BAM files to the Sequence Read Archive; these have been converted to SRA file format. SRA files for a given study are available through that study’s EA. Please consult with dbGaP staff if you want to download more than a few SRA files. dbGaP staff have constructed and placed array-based SNP “fingerprint” files in each of the study-specific EAs that have such data. (The fingerprints consist of genotypes for a selected set of SNPs that many of the commercial SNP arrays have in common.)
Several TOPMed investigators have submitted DARs via a dbGaP project application. However, a limiting factor arises when studies do not have approval from their local IRBs to receive cross-study TOPMed data. (DARs require an IRB approval letter.) We encourage study PIs to obtain their IRB’s approval and begin the DAR process. The ACC is available to answer questions from those filling out the online DAR form once a study has its local IRB approval document ready.
The NHLBI Data Access Committee has approved several DARs from TOPMed investigators across each of the existing EAs. The ACC has downloaded SNP fingerprint files from the EAs to check concordance between prior array-based genotyping and the sequence-based genotype calls. (Results of the concordance checks are being sent to study PIs.) Working Group investigators have also successfully downloaded phenotype files from the study-specific EAs.