
Sharing dbGaP Data in a Cloud Environment

NIH policy

Here is the NIH policy regarding sharing among investigators at different institutions (Data Access, Question #10):

How can investigators share controlled-access human data and analyses with approved collaborators at different institutions while remaining compliant with the Genomic Data Sharing (GDS) Policy?

All sharing of controlled-access data with collaborators must be consistent with the GDS Policy and the NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing (GDS) Policy. Controlled-access data may be shared with collaborators at other institutions if they have obtained approval to access the data through their own dbGaP project request. Such collaborators should be listed on the project request as external collaborators for both projects. Data may be encrypted and mailed to approved collaborators on a hard drive, or shared with approved collaborators over a virtual private network or in a cloud environment, as described in the NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing (GDS) Policy.

See TOPMed guidance (section E) on how to set up a group of collaborators (i.e. “sharing group”) who will share dbGaP data in a common repository, under the mechanism described above.

Example of how this mechanism might be used to share TOPMed genotypic and phenotypic data

As shown in the diagram below, a group of investigators decides to form a sharing group. They submit coordinated applications (Data Access Requests) to dbGaP according to TOPMed guidance (section E). Once the applications of all investigators have been approved, one or more of the investigators assign individuals (“downloaders”) to download the data. Data from the TOPMed Exchange Area(s) may include VCF files and/or harmonized phenotypes, and data from the parent study accessions may include source phenotypes to be harmonized by members of the sharing group. All of these data types may be brought together in a cloud for harmonization and cross-study analysis. All members of the sharing group (and others they designate from their own institutions) can access the data in the cloud repository; this includes members who do not directly download data (e.g. investigators 2 and 4 in the diagram). Although the diagram shows sharing in a cloud, the data may also be shared by members of the sharing group through other mechanisms, as indicated by the NIH policy statement above.

[Diagram: Sharing dbGaP data in a cloud environment. Sharing group investigators 1 to 5 submit dbGaP Data Access Requests. Downloaders retrieve TOPMed Exchange Area accessions (TA, TB, TC, TD; e.g. VCF files and/or harmonized phenotypes from studies A, B, C, D) and parent study accessions (PA, PB, PC, PD; e.g. source phenotypes to be harmonized). In the cloud, source phenotypes are harmonized and then analyzed together with the genotypes.]

Cloud Use Statement

If you plan to use cloud computing, check the box “I am requesting permission to use cloud computing…”. Then provide the Cloud Use Statement using the template below.

Updated Statement (Effective April 1, 2025)

We plan to use cloud computing environments which are described under “Cloud Providers”. Potential users for these cloud environments consist of ‘Internal Collaborators’ on this application and approved investigators included on the Approved Users List (https://tinyurl.com/p6hjt7b5). External collaborators will submit their own dbGaP Data Access Requests. Data will be shared in a given cloud environment among individuals who have approved access for the same set of study-consent groups. Gatekeepers of each cloud environment will provide user accounts for individuals who meet these requirements. The cloud access will be consistent with dbGaP approvals, NIH and TOPMed policies, participant consents, and NIST standards as specified in the NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing Policy and NIH Security Best Practices for Users of Controlled-Access Data.

Cloud environments may include BioData Catalyst or another NIST SP 800-171 compliant system.

Cloud Providers

BioData Catalyst

Provider: Amazon Web Services, Google Cloud Platform
Type: Private
Details:

BioData Catalyst (BDC) is a cloud-based infrastructure where researchers can find, search, access, share, cross-link, and compute on large-scale datasets. It provides tools, applications, and workflows to enable those capabilities in secure workspaces. BDC comprises the Data Commons Framework Services (DCFS), hosted and operated by the U. of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. Workspaces are provided by Terra, hosted and operated by the Broad Institute, and by Seven Bridges, hosted and operated by Velsera. The NHLBI Designated Authorizing Official has recognized the Authority to Operate issued to the Broad Institute, U. of Chicago and Velsera as presenting acceptable risk when used by designated TOPMed stakeholders. BDC is NIST SP 800-53 compliant, including the NIST SP 800-171 controls.
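
For gatekeepers, the sharing rule in the Cloud Use Statement above (data are shared among individuals who have approved access for the same set of study-consent groups) can be illustrated with a short sketch. This is not an official NIH, dbGaP, or TOPMed tool. The user names, the consent-group labels such as "TA.c1", and both function names are hypothetical, and the sketch assumes that "the same set" means an exact match of approved study-consent groups.

```python
from collections import defaultdict

# Hypothetical approvals: user -> set of approved study-consent groups.
# The labels are illustrative only, not real dbGaP accessions.
approvals = {
    "investigator_1": {"TA.c1", "PA.c1"},
    "investigator_2": {"TA.c1", "PA.c1"},
    "investigator_3": {"TA.c1"},
}


def may_share(approvals, user_a, user_b):
    """Return True if both users are approved for exactly the same
    non-empty set of study-consent groups (assumed reading of the rule)."""
    a = set(approvals.get(user_a, ()))
    b = set(approvals.get(user_b, ()))
    return bool(a) and a == b


def group_by_approvals(approvals):
    """Group users by their exact set of approved study-consent groups.
    Each resulting group could share one cloud workspace."""
    workspaces = defaultdict(set)
    for user, consent_groups in approvals.items():
        workspaces[frozenset(consent_groups)].add(user)
    return dict(workspaces)


if __name__ == "__main__":
    print(may_share(approvals, "investigator_1", "investigator_2"))
    print(may_share(approvals, "investigator_1", "investigator_3"))
    for groups, users in group_by_approvals(approvals).items():
        print(sorted(groups), sorted(users))
```

Under this reading, an investigator approved for only a subset of the consent groups (investigator_3 here) would be placed in a separate workspace. Whether a gatekeeper may instead grant access to a shared subset is a policy question this sketch does not answer.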
