
Sharing dbGaP Data in a Cloud Environment

NIH policy

Here is the NIH policy regarding sharing among investigators at different institutions (Data Access, Question #10):

How can investigators share controlled-access human data and analyses with approved collaborators at different institutions while remaining compliant with the Genomic Data Sharing (GDS) Policy?

All sharing of controlled-access data with collaborators must be consistent with the GDS Policy and the NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing (GDS) Policy. Controlled-access data may be shared with collaborators at other institutions if they have obtained approval to access the data through their own dbGaP project request. Such collaborators should be listed on the project request as external collaborators for both projects. Data may be encrypted and mailed to approved collaborators on a hard drive, or shared with approved collaborators over a virtual private network or in a cloud environment, as described in the NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing (GDS) Policy.

See TOPMed guidance (section E) on how to set up a group of collaborators (i.e. “sharing group”) who will share dbGaP data in a common repository, under the mechanism described above.

Example of how this mechanism might be used to share TOPMed genotypic and phenotypic data

As shown in the diagram below, a group of investigators decides to form a sharing group. They submit coordinated applications (Data Access Requests) to dbGaP according to TOPMed guidance (section E). Once the applications of all investigators have been approved, one or more of the investigators assign individuals (“downloaders”) to download the data from the TOPMed Exchange Area(s) and the parent study accessions. The Exchange Area accessions contain, for example, VCF files and/or harmonized phenotypes, and the parent study accessions may include source phenotypes to be harmonized by members of the sharing group. All of these data types may be brought together in a cloud for harmonization and cross-study analysis. All members of the sharing group (and others they designate from their own institutions) can access the data in the cloud repository; this includes members who do not directly download data (e.g. investigators 2 and 4 in the diagram). Although the diagram shows sharing in a cloud, the data may also be shared by members of the sharing group through other mechanisms, as indicated by the NIH policy statement above.

[Figure: Sharing dbGaP data in a cloud environment. Sharing group investigators 1 to 5 submit dbGaP Data Access Requests. Downloaders bring TOPMed Exchange Area accessions (TA, TB, TC, TD; e.g. VCF files and/or harmonized phenotypes from studies A, B, C, D) and parent study accessions (PA, PB, PC, PD; e.g. source phenotypes to be harmonized) into the cloud, where the data are harmonized and analyzed.]

TOPMed Cloud Pilots

Several groups of TOPMed investigators are developing cloud computing platforms with user interfaces that allow TOPMed investigators from multiple groups to access and compute on TOPMed data in a cloud environment. The following documents describe the cloud providers (including security features) and how access to each cloud environment is managed.

  1. Analysis Commons  (contact acadmin@uw.edu or Jen Brody, jeco@u.washington.edu)
  2. FireCloud/Broad Genomics Workbench  (contact Alisa Manning, amanning@broadinstitute.org)
  3. Michigan Encore Platform  (contact Matthew Flickinger, mflick@umich.edu)
  4. OASIS Server  (contact James Perry, JPerry@som.umaryland.edu)
  5. Data STAGE (contact Ben Heavner, bheavner@uw.edu)

Users who wish to use one or more of these cloud pilot services should contact the individuals indicated above and read instructions in the TOPMed Data Sharing Policy, section E.

Use of cloud computing must be specified in your dbGaP Data Access Request.  The following wording is recommended for the "Cloud Use Statement" and "Cloud Provider Information" sections of the DAR.

If you plan to use cloud computing, check the box “I am requesting permission to use cloud computing…”  Then, provide the Cloud Use Statement using the template below. If you plan to use more than one cloud computing service, describe all as shown below.

Cloud Use Statement

(Designed for multiple platforms. Remove items you do not plan to use. This template wording meets the 2000-character limit.)

We plan to use multiple cloud computing environments implemented by the TOPMed Cloud Pilots, which are described under “Cloud Providers” in this application and on the TOPMed website (https://topmed.nhlbi.nih.gov/sharing-dbgap-data-cloud-environment#TOPMed Cloud Pilots). These cloud environments will be used as Platform as a Service (PaaS) to develop and implement analysis methods for genotype-phenotype associations using genome sequence data. Data transfers and cloud access will be supervised by “gatekeeper(s)” as described in each platform’s Cloud Management Plan on the TOPMed website (at link given above). Potential users for these cloud environments are the ‘Internal Collaborators’ on this application and members of the external collaborator list attached to this application (appended to IRB approval). External collaborators will submit their own dbGaP Data Access Requests. Data will be shared in a given cloud environment among individuals who have approved access for the same set of study-consent groups (where ‘study-consent’ refers to a combination of study and consent type). Gatekeepers of each cloud environment will provide user accounts for individuals who meet these requirements. They will ensure that cloud access is consistent with dbGaP approvals, NIH and TOPMed policies, and participant consents, as specified in the “NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing Policy”.
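
As an illustration only (it is not part of the template wording above), the gatekeeper rule in the statement, that data are shared among individuals approved for the same set of study-consent groups, could be sketched in Python as follows. The `Applicant` class, the `may_share` and `eligible_users` functions, and the consent codes are hypothetical names invented for this sketch. Reading “the same set” as exact set equality is also an assumption, and none of the cloud pilots is known to implement its checks this way.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Hypothetical model: a study-consent group is a (study accession, consent code) pair.
StudyConsent = Tuple[str, str]


@dataclass(frozen=True)
class Applicant:
    name: str
    # Study-consent groups approved in this person's own dbGaP Data Access Request.
    approved: FrozenSet[StudyConsent]


def may_share(workspace: FrozenSet[StudyConsent], applicant: Applicant) -> bool:
    """Assumed rule: approvals must equal the workspace's study-consent set."""
    return applicant.approved == workspace


def eligible_users(workspace: FrozenSet[StudyConsent],
                   applicants: List[Applicant]) -> List[str]:
    """Return, in input order, the names a gatekeeper could give accounts to."""
    return [a.name for a in applicants if may_share(workspace, a)]


# Example with made-up accessions (TA, TB) and consent codes (c1).
workspace = frozenset({("TA", "c1"), ("TB", "c1")})
applicants = [
    Applicant("investigator_1", frozenset({("TA", "c1"), ("TB", "c1")})),
    Applicant("investigator_2", frozenset({("TA", "c1")})),
    Applicant("investigator_3", frozenset({("TB", "c1"), ("TA", "c1")})),
]
print(eligible_users(workspace, applicants))
```

In this example only the first and third applicants hold approvals matching the workspace, so only they would be offered accounts. A real gatekeeper would take the approvals from dbGaP rather than from a hand-built list.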

Cloud Providers

Create separate Cloud Provider Information entries for each provider you plan to use.

DNAnexus

Type of Provider: Commercial

Details:

DNAnexus provides a cloud-based data analysis and management platform for storage and analysis of DNA sequence data using cloud computing from Amazon Web Services. By using DNAnexus (as part of a “Statistical Analysis Commons” Pilot) we are able to deploy sophisticated analysis efforts on large-scale phenotypic and genomic datasets quickly and cost-effectively in the cloud. DNAnexus uses data centers in high-security facilities with SAS-70/SSAE-16, PCI Level 1, and FISMA Moderate certifications, and is in the process of completing FedRAMP certification. At the user level, it enforces best practices such as password strength and rotation, session expiration, and client encryption. All data access is carefully controlled, logged for auditing purposes, encrypted end-to-end (both in flight and at rest), integrity-verified, and replicated in at least three physically distinct data centers to protect against loss. Data analysis is constrained to computing nodes that are sandboxed using virtualization and encryption technologies, and are versioned to ensure reproducibility and the ability to track data provenance. The software has undergone multiple third-party audits, including penetration testing by security experts, and the overall system has been certified to ISO 27001, an internationally recognized standard for secure data management processes. DNAnexus has summarized how their platform supports compliance with various US and international regulations and standards in a white paper, including best practices for dbGaP: https://www.dnanexus.com/papers/Compliance_White_Paper.pdf

Amazon Web Services (AWS) is a secure cloud services platform offering compute power, database storage, content delivery and other functionality that will allow us to deploy sophisticated analysis efforts on large-scale phenotypic and genomic datasets quickly and cost-effectively. It is a secure, durable technology platform with industry-recognized certifications and audits: PCI DSS Level 1, ISO 27001, FISMA Moderate, FedRAMP, HIPAA, and SOC 1 (formerly referred to as SAS 70 and/or SSAE 16) and SOC 2 audit reports. Their services and data centers have multiple layers of operational and physical security to ensure the integrity and safety of data. AWS has summarized how their platform supports compliance with controlled-access datasets in a white paper, including best practices for dbGaP: https://d0.awsstatic.com/whitepapers/architecting-for-genomic-data-secu…

FireCloud

Type of Provider: Private

Details:

FireCloud (powered by Broad Genomics’ Workbench) is operated by the Broad Institute at the FISMA (Federal Information Security Management Act) “moderate” level and received Authority to Operate from NCI and NIH in May of 2016. FISMA is a practice of documentation, audit, and organizational risk acceptance. It is centered on the controls outlined in NIST (National Institute of Standards and Technology) Special Publications 800-30 and 800-53. Covered topics include: network penetration testing and assessment by a federally authorized outside firm; maintaining system logs separate from the primary system for forensic analysis; regular review of logs and changes by an in-house auditor; security training and background screening for staff with elevated access to the system; and documented procedures to respond to security incidents.

The FireCloud portal and its underlying platform, Broad Genomics’ Workbench, are hosted on Google’s Cloud Services. See below for details. Since FireCloud requires users to sign in with Google logins, the application operates on top of Google’s world-class security that protects from nation-state level attacks. FireCloud also supports Google’s two-factor authentication. Because FireCloud is a FISMA Moderate system, all logs are audited continually and multiple layers of security are required. These include Web Application Firewalls, weekly scanning, code scanning (dynamic and static), dependency scanning and manual penetration testing. Data analysis is constrained to computing nodes that are sandboxed using Docker within Google’s Pipelines API.

Google Cloud Platform is a cloud computing service by Google that offers hosting on the same supporting infrastructure that Google uses internally for end-user products like Google Search. Google undergoes several independent third-party audits on a regular basis to provide verification of security, privacy and compliance controls, including annual audits for SSAE 16/ISAE 3402 Type II. Google's infrastructure provides reliable information security that can meet or exceed the requirements of HIPAA and protected health information. The Google Cloud Platform has summarized its services with respect to genomics data processing in a white paper here: https://cloud.google.com/files/genomics-data-wp.pdf

University of Michigan ENCORE server/FLUX cluster

Type of Provider: Private

Details:

Flux is a Linux-based high-performance computing (HPC) cluster intended to support parallel and other applications; it is available to all researchers at the University of Michigan. Each Flux compute node comprises multiple CPU cores with at least 4 GB of RAM per core; Flux has more than 19,000 cores. All compute nodes are interconnected with InfiniBand networking. More information about Flux can be found here: http://arc-ts.umich.edu/systems-and-services/flux/

University of Maryland OASIS Server

Type of Provider: Private

Details:

OASIS will be implemented as separate “website instances” for each TOPMed Working Group. For example, there will be an OASIS website for the TOPMed Diabetes Working Group and a different website, with a different URL and different user accounts, for the TOPMed Lipids Working Group. Each “OASIS website instance” will allow access ONLY to information associated with the specific Working Group. Thus, if TOPMed investigators are approved to access data for multiple Working Groups, they will need user accounts on multiple OASIS website instances to use the OASIS features.

During the pilot phase, instances of the OASIS Server will reside on the University of Maryland OASIS Webserver. If the pilot is successful and if additional funding is obtained, instances of the OASIS Server will be migrated/created using Amazon Web Services Website Hosting. These website hosting environments will be “Platforms as a Service” (PaaS) for the various types of analysis and visualizations described above. These services will allow the research community to quickly and flexibly make use of nearly unlimited, on-demand, high-performance computing capacity.

In contrast to traditional clouds, an OASIS website does not allow download of individual-level data. Users may download only summary statistics and analysis results calculated at the levels of variants or groups of variants. Users may upload individual-level phenotype data for use with the analysis and visualization tools. They may also upload analysis results from other tools for web-based visualization. However, these uploaded data cannot be downloaded. There can be no egress of genotypes or phenotypes to an OASIS user.

The University of Maryland OASIS Webserver is a high-performance Linux-based webserver intended to support parallel database searching and re-analysis of omics data. It is a secure environment running the Apache HTTP Server software with strong encryption via the Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols. The server has its own SSL Certificate (SHA-2), and access to all OASIS website instances employs the HTTPS (HTTP Secure) communications protocol. Passwords are encrypted and the requirements for password strength and rotation are enforced. The OASIS website instances and associated user accounts will be managed by the project Gatekeepers as described above. The webserver’s software and physical environments are managed by the University of Maryland IT staff.

Amazon Web Services (AWS) is a secure cloud services platform offering website hosting, compute power, database storage and retrieval and other functionality needed to deploy multiple instances of the OASIS web-based application. It is a secure, durable technology platform with industry-recognized certifications and audits: PCI DSS Level 1, ISO 27001, FISMA Moderate, FedRAMP, HIPAA, and SOC 1 (formerly referred to as SAS 70 and/or SSAE 16) and SOC 2 audit reports. Their services and data centers have multiple layers of operational and physical security to ensure the integrity and safety of data. AWS has summarized how their platform supports compliance with controlled-access datasets including best practices for dbGaP: https://d0.awsstatic.com/whitepapers/architecting-for-genomic-data-secu…

BioData Catalyst

Type of Provider: Private

Details:

The NHLBI-supported BioData Catalyst (formerly Data STAGE: Storage, Toolspace, Access and analytics for biG data Empowerment) (biodatacatalyst.nhlbi.nih.gov) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large-scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces. BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. BioData Catalyst comprises the Data Commons Framework Services (DCFS) hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by Terra, hosted and operated by the Broad Institute; Fair4Cures, hosted and operated by Seven Bridges Genomics; and i2b2/tranSMART, hosted by the University of Chicago and operated by Harvard Medical School. For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators.

Amazon Web Services (AWS) is a secure cloud services platform offering compute power, database storage, content delivery and other functionality that will allow us to deploy sophisticated analysis efforts on large-scale phenotypic and genomic datasets quickly and cost-effectively. It is a secure, durable technology platform with industry-recognized certifications and audits: PCI DSS Level 1, ISO 27001, FISMA Moderate, FedRAMP, HIPAA, and SOC 1 (formerly referred to as SAS 70 and/or SSAE 16) and SOC 2 audit reports. Their services and data centers have multiple layers of operational and physical security to ensure the integrity and safety of data. AWS has summarized how their platform supports compliance with controlled-access datasets in a white paper, including best practices for dbGaP: https://d0.awsstatic.com/whitepapers/architecting-for-genomic-data-secu…

Google Cloud Platform is a cloud computing service by Google that offers hosting on the same supporting infrastructure that Google uses internally for end-user products like Google Search. Google undergoes several independent third-party audits on a regular basis to provide verification of security, privacy and compliance controls, including annual audits for SSAE 16/ISAE 3402 Type II. Google's infrastructure provides reliable information security that can meet or exceed the requirements of HIPAA and protected health information. The Google Cloud Platform has summarized its services with respect to genomics data processing in a white paper here: https://cloud.google.com/files/genomics-data-wp.pdf
