TOPMed

TOPMed Omics Metadata Requirements

Omics Metadata from TOPMed Studies

TOPMed Informatics Research Center
For questions, please contact topmed.informatics@umich.edu


Instructions for studies regarding sample attributes data for TOPMed omics assays

The TOPMed program requires that omics data be submitted to dbGaP, along with thorough documentation of biosampling and laboratory methods, as well as sample provenance. Here we describe the documentation to be provided by the contributing study (generally the study’s coordinating and/or biorepository center). Updated required documents as of February 2023: the required document is described below. In addition to the required metadata and sample mapping, some phenotypes (such as age and gender) are also required, to expedite harmonized data sharing and cross-study data analysis in TOPMed.

Sample attributes metadata files; generally one file per omics data type (RNAseq, metabolomics, proteomics, or methylation) with rows corresponding to samples and columns to sample attributes. A template for the (required) data_dictionary to accompany each sample metadata file describes the variables to be included. The rest of this document refers to specific columns in this table.

Additional protocol documents (related to sample collection and processing, laboratory methods) should accompany phenotype datasets as available. The centers performing the omics assays have been asked to supply additional sample attributes files (including quality control metrics) and associated protocol documents.

TOPMed omics sample metadata

These instructions try to cover all four omics assay types -- RNA sequencing, DNA methylation, metabolomics, proteomics -- and many different study designs in a single document. Please interpret these instructions liberally and provide information which makes sense for your particular study. The omics sample metadata spreadsheet asks for three different classes of information:

Sample-to-subject mapping and subject consent information needed for dbGaP controlled access;
Sample provenance information useful when diagnosing possible batch effects in the data, resolving apparent sample swaps or understanding the relationship among multiple samples and across omics types;
Phenotypic covariates specific to sample collection -- these are sample level covariates (time-varying) rather than subject level covariates (time-invariant).

The requested columns and column names are listed in an associated data_dictionary and further discussed below.

The provenance of each sample submitted by a study to the omics centers is specified by a set of identifiers from the following list and illustrated in the Figure below. These are presented in approximately the order of sample collection, storage and processing. Not all identifiers will be relevant for all studies; details on what is required or relevant are provided below, as well as in the data_dictionary template.

1. “SUBJECT_ID” identifies the human study participant from whom a biological sample is taken. This subject identifier must be the same one used in the subject consent file submitted to dbGaP. It is used to link the omics data to subject consent and phenotype information. dbGaP requires the variable name to be “SUBJECT_ID”.
2. “CONSENT” Consent code assigned by dbGaP for this study subject (e.g. “DS-CVD-IRB-NPU-MDS”).
3. “Age_at_collection” Subject’s age in years at the time of tissue collection. Alternatively, studies can provide age at study entry and the time in years between study entry and biosample collection; please note this (and the relevant column names) in the submitted data dictionary.
4. “Gender” Self-reported gender of study participant.
5. “Primary_biosample_type” (e.g. “venous blood” or “lung epithelium”) specifies the type of the primary biosample. Be as specific as possible, e.g. “psoriatic” versus “non-psoriatic” skin.
6. “Collection_year” Year in which the primary biosample was collected. If not known, this can be given as “NA” (but “Age_at_collection” is required).
7. “Collection_visit” Study visit when the primary biosample was collected. (Relevant for studies with multiple waves of clinic visits.) This column can be all NA for studies where this is not relevant.
8. “BODY_SITE” Anatomical site at which the biosample was taken, e.g. “arm vein”, “lung”, “inner oral cavity”. Required by dbGaP with the column name “BODY_SITE”. Will likely be “arm vein” for most blood-based samples.
9. “HISTOLOGICAL_TYPE” The primary biosample often undergoes additional processing, purification or fractionation to enrich for a specific cell or tissue type before analytes (often DNA or RNA) are extracted for an omics assay. “HISTOLOGICAL_TYPE” describes the resulting tissue type or sub-type, e.g. plasma for proteins and metabolites; PBMC, T-cell, or monocyte for RNA; blood buffy coat for DNA. This column with the column name “HISTOLOGICAL_TYPE” is required by dbGaP.
10. “IS_TUMOR” Tumor status of the sample. For non-cancer studies, all values should be “no”. For cancer studies, values should be “yes” or “no”. This column with column name “IS_TUMOR” is required by dbGaP.
11. “SAMPLE_ID” dbGaP-required name for each sample aliquot submitted to the omics assay center. This sample may be a primary biosample, isolated material or isolated analyte. This identifier must be one of the TOR (RNA), TOE (DNA for methylation), TOM (metabolites), or TOP (proteins) identifiers supplied by the TOPMed IRC. (These identifiers are equivalent to the “NWD” identifiers given to DNA samples for whole genome sequencing.) A “SAMPLE_ID” also represents an instance of the assay. Repeated assays from the same subject must use different SAMPLE_IDs. In some cases, multiple analyte samples from the same analyte isolation batch might be submitted, either as technical replicates or as replacements. Each sample shipped must have a separate “SAMPLE_ID”.
12. “ANALYTE_TYPE” Required by dbGaP. Type of omics assay performed, generally one of “RNA”, “DNA”, “protein” or “metabolite”.

For studies with metabolomics assays: if available, please include an indicator variable “fasting” (0=no, 1=yes) and a “fasting_hours” variable, and note these in the data dictionary.

For studies that include samples from deceased donors: please add an additional column, “hours_post_mortem”, giving the hours post-mortem, if available.
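As a minimal sketch of how a study might check a draft file before upload, the snippet below reads the header of a tab-separated file and reports which of the columns listed above are absent. The function name `missing_columns`, the sample row, and the choice to treat every listed column as mandatory are assumptions made for illustration; the data_dictionary template remains the authority on what is required.

```python
import csv
import io

# Columns named in the list above. Treating all of them as mandatory is an
# assumption of this sketch; the data_dictionary template is authoritative.
EXPECTED_COLUMNS = [
    "SUBJECT_ID", "CONSENT", "Age_at_collection", "Gender",
    "Primary_biosample_type", "Collection_year", "Collection_visit",
    "BODY_SITE", "HISTOLOGICAL_TYPE", "IS_TUMOR", "SAMPLE_ID", "ANALYTE_TYPE",
]


def missing_columns(tsv_text, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from the header row of tsv_text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Hypothetical two-column draft with made-up identifiers, shown only to
# illustrate the call.
draft = "SUBJECT_ID\tSAMPLE_ID\nS001\tTOR_example_1\n"
print(missing_columns(draft))
```

Optional columns such as “fasting”, “fasting_hours” or “hours_post_mortem” are deliberately not checked here, since they apply only to some studies.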

The “SAMPLE_ID” will be the identifier attached to the data returned from each center performing omics assays. The “SUBJECT_ID” identifies the subject from whom the analyte was sampled and connects the omics data to appropriate consent type and clinical phenotypes for that subject.
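The two rules stated for “SAMPLE_ID” (a separate identifier for each shipped sample, and a TOR/TOE/TOM/TOP prefix matching the assay type) lend themselves to a simple check. The sketch below is illustrative only: `sample_id_problems` is a hypothetical helper, and it inspects just the prefix and uniqueness, because the remainder of the identifier format is not described in this document.

```python
from collections import Counter

# Prefixes named in the SAMPLE_ID description, keyed by ANALYTE_TYPE value.
PREFIX_BY_ANALYTE = {"RNA": "TOR", "DNA": "TOE", "metabolite": "TOM", "protein": "TOP"}


def sample_id_problems(rows):
    """Return human-readable problems found in the SAMPLE_ID values of rows.

    rows is a list of dicts with at least SAMPLE_ID and ANALYTE_TYPE keys.
    Only uniqueness and the three-letter prefix are checked.
    """
    problems = []
    counts = Counter(row["SAMPLE_ID"] for row in rows)
    for sample_id, n in counts.items():
        if n > 1:
            problems.append(f"{sample_id}: used {n} times")
    for row in rows:
        prefix = PREFIX_BY_ANALYTE.get(row["ANALYTE_TYPE"])
        if prefix is None:
            problems.append(f"{row['SAMPLE_ID']}: unknown ANALYTE_TYPE {row['ANALYTE_TYPE']!r}")
        elif not row["SAMPLE_ID"].startswith(prefix):
            problems.append(f"{row['SAMPLE_ID']}: expected prefix {prefix}")
    return problems


# Made-up identifiers: a technical replicate must still get its own SAMPLE_ID.
rows = [
    {"SAMPLE_ID": "TOR_example_1", "ANALYTE_TYPE": "RNA"},
    {"SAMPLE_ID": "TOR_example_1", "ANALYTE_TYPE": "RNA"},
]
print(sample_id_problems(rows))
```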

A general note regarding identifiers: All identifiers must be “de-identified” to ensure subject anonymity and privacy of health information. The “SUBJECT_ID” and “SAMPLE_ID” identifiers are fully specified by points (1) and (11) above.

Please use the following formats and file names

Sample attributes metadata file:

Format: tab-separated text file
Name: SampleAttributes_DS_<study_name>_<analyte_type>_<date>

Data Dictionary for sample attributes data file:

Format: tab-separated text file
Name: SampleAttributes_DD_<study_name>_<analyte_type>_<date>

Definitions

<study_name> is an abbreviated name of the study (e.g. FHS)
<analyte_type> is the type of omics assay to be performed on the submitted sample (use DNA, RNA, metabolite, or protein)

<date> is the date that the file was made (as YearMonthDay - e.g. 20191108)

Please upload all prepared files to the TOPMed exchange area for your study. Instructions on how to upload to the exchange area are posted here. Please use the data type “Exchange Area Data” from the drop-down menu “Select file type” under “Other files” near the bottom of the submission portal page. Once the files have been uploaded, please email topmed.informatics@umich.edu; also, please email if you have any questions or issues with the upload.
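The naming convention defined above can be assembled programmatically. The following is a sketch under stated assumptions: `attributes_file_name` is a hypothetical helper, and no file extension is appended because the instructions do not specify one.

```python
from datetime import date

# Values the definitions above allow for <analyte_type>.
ANALYTE_TYPES = ("DNA", "RNA", "metabolite", "protein")


def attributes_file_name(kind, study_name, analyte_type, made_on):
    """Build a sample attributes file name.

    kind is "DS" for the data file or "DD" for its data dictionary;
    made_on is the date the file was made, rendered as YearMonthDay.
    """
    if kind not in ("DS", "DD"):
        raise ValueError(f"kind must be 'DS' or 'DD', got {kind!r}")
    if analyte_type not in ANALYTE_TYPES:
        raise ValueError(f"analyte_type must be one of {ANALYTE_TYPES}")
    return f"SampleAttributes_{kind}_{study_name}_{analyte_type}_{made_on:%Y%m%d}"


print(attributes_file_name("DS", "FHS", "RNA", date(2019, 11, 8)))
# prints SampleAttributes_DS_FHS_RNA_20191108
```

Calling the function twice with the same study, analyte type and date, once with "DS" and once with "DD", keeps the data file and its dictionary paired by name.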


Instructions Prior to February 2023

Sample attributes metadata files; generally one file per omics data type (RNAseq, metabolomics, proteomics, or methylation) with rows corresponding to samples and columns to sample attributes. A template for the (required) data_dictionary to accompany each sample metadata file describes the variables to be included. The rest of this document refers to specific columns in this table.
Protocol documents that describe clinical and laboratory procedures at each step in the process of supplying a sample to the center performing the omics assay.

The centers performing the omics assays have been asked to supply additional sample attributes files (including quality control metrics) and associated protocol documents.

Sample-to-subject mapping and subject consent information needed for dbGaP controlled access;
Sample provenance information useful when diagnosing possible batch effects in the data, resolving apparent sample swaps or understanding the relationship among multiple samples and across omics types;
Phenotypic covariates specific to sample collection -- these are sample level covariates (time-varying) rather than subject level covariates (time-invariant).

The requested columns and column names are listed in an associated data_dictionary and further discussed below.

1. “SUBJECT_ID” identifies the human study participant from whom a biological sample is taken. This subject identifier must be the same one used in the subject consent file submitted to dbGaP. It is used to link the omics data to subject consent and phenotype information. dbGaP requires the variable name to be “SUBJECT_ID”.
2. “CONSENT” Consent code assigned by dbGaP for this study subject (e.g. “DS-CVD-IRB-NPU-MDS”).
3. “Age_at_collection” Subject’s age in years at the time of tissue collection.
4. “Subject_collection_state” Subject’s health status at the time of tissue collection. This includes both immediate conditions such as “fasting hours” or “hours post mortem”, which may be relevant to interpreting the omics assay results, and longer-term conditions indicative of disease progression, such as “HbA1c” for diabetes or “GOLD stage” for COPD. Please use more than one column if appropriate. The column name(s) should describe the measurement being provided.
5. “Primary_biosample_ID” is a unique identifier for tissue sampled directly from the subject - e.g. venous blood (via blood draw) or lung epithelium (via biopsy).
6. “Primary_biosample_type” (e.g. “venous blood” or “lung epithelium”) specifies the type of the primary biosample. Be as specific as possible, e.g. “psoriatic” versus “non-psoriatic” skin.
7. “Collection_year” Year in which the primary biosample was collected.
8. “Collection_visit” Study visit when the primary biosample was collected. (Relevant for studies with multiple waves of clinic visits.)
9. “BODY_SITE” Anatomical site at which the biosample was taken, e.g. “arm vein”, “lung”, “inner oral cavity”.
10. “UBERON_ID” UBERON ontology identifier for the type of primary biosample; the value is “0013756” for “venous blood”; see https://www.ebi.ac.uk/ols/ontologies/uberon for others.
11. “UBERON_term” UBERON ontology term for the type of primary biosample, e.g. “venous blood”; see https://www.ebi.ac.uk/ols/ontologies/uberon for others.
12. “HISTOLOGICAL_TYPE” The primary biosample often undergoes additional processing, purification or fractionation to enrich for a specific cell or tissue type before analytes (often DNA or RNA) are extracted for an omics assay. “HISTOLOGICAL_TYPE” describes the resulting tissue type or sub-type, e.g. plasma for proteins and metabolites; PBMC, T-cell, or monocyte for RNA; blood buffy coat for DNA. This column with the column name “HISTOLOGICAL_TYPE” is required by dbGaP.
13. “IS_TUMOR” Tumor status of the sample. For non-cancer studies, all values should be “no”. For cancer studies, values should be “yes” or “no”. This column with column name “IS_TUMOR” is required by dbGaP.
14. “Analyte_isolated” The type of material, usually DNA or RNA, extracted from either the primary biosample or the derived material described in “HISTOLOGICAL_TYPE” and sent to the omics assay center.
15. “Analyte_isolation_batch” identifies a batch of primary biosamples or derived material samples that together went through the analyte isolation process. All samples in a given batch should be given the same “Analyte_isolation_batch” value. This is often a plate identifier or a date (or range of dates) for the isolation process. These identifiers can be omitted if the study did not isolate specific analytes for submission to the omics assay center (e.g., when submitting serum samples).
16. “Analyte_isolation_year” Year in which the analyte was extracted. Informative for storage time.
17. “Analyte_isolation_lab” Name of laboratory performing the analyte isolation.
18. “SAMPLE_ID” dbGaP-required name for each sample aliquot submitted to the omics assay center. This sample may be a primary biosample, isolated material or isolated analyte. This identifier must be one of the TOR (RNA), TOE (DNA for methylation), TOM (metabolites), or TOP (proteins) identifiers supplied by the TOPMed DCC. (These identifiers are equivalent to the “NWD” identifiers given to DNA samples for whole genome sequencing.) A “SAMPLE_ID” also represents an instance of the assay. Repeated assays from the same subject must use different SAMPLE_IDs. In some cases, multiple analyte samples from the same analyte isolation batch might be submitted, either as technical replicates or as replacements. Each sample shipped must have a separate “SAMPLE_ID”.
19. “ANALYTE_TYPE” Type of omics assay to be performed, generally one of “RNA”, “DNA”, “protein” or “metabolite”.
20. “Sample_container_ID” identifies the sample plate, box or other container used for shipping samples to the omics assay center. These containers generally consist of a 96- or 384-well plate in which samples may be arrayed according to a specific design. These plates may become the batches in which samples are processed at the assay center, in which case a description of how the samples were assigned to plates (and wells therein) may be important in adjusting for batch effects in analysis.
21. “Sample_well_ID” identifies locations within the sample container.
22. “Omics_assay_lab” Name of the assay lab to which samples were sent.

The “SAMPLE_ID” will be the identifier attached to the data returned from each center performing omics assays. Additional levels of identifiers in the sample attributes metadata file for each omics type will allow users of the data to determine whether, for example, metabolite data from one center came from the same blood draw as methylation data from another center (i.e. having the same or different “Primary_biosample_ID”). The “Analyte_isolation_batch” allows investigation of analyte isolation batch effects. “Sample_container_ID” and “Sample_well_ID” may specify batches for processing during assay and are helpful when resolving apparent sample swaps. Of course, the “SUBJECT_ID” identifies the subject from whom the analyte was sampled and connects the omics data to appropriate consent type and clinical phenotypes for that subject.

A general note regarding identifiers: All identifiers must be “ de-identified ” to ensure subject anonymity and privacy of health information. The “SUBJECT_ID” and “SAMPLE_ID” identifiers are fully specified by points (1) and (18) above. The other identifiers can be coined by the study, but the ID system used should be applied consistently across multiple omics projects submitted to dbGaP (past, present, and future) in order to allow for cross-referencing (e.g. was the methylation data released in dbGaP version 3 from the same blood draw as the metabolomics data in version 5?).
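The cross-referencing question posed above (did data from two omics types come from the same blood draw?) reduces to matching “Primary_biosample_ID” values across two sample attributes files. A minimal sketch, assuming each file has already been read into a list of row dictionaries; the helper names and all identifiers are hypothetical:

```python
from collections import defaultdict


def samples_by_biosample(rows):
    """Group SAMPLE_ID values by their Primary_biosample_ID."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Primary_biosample_ID"]].append(row["SAMPLE_ID"])
    return grouped


def shared_biosamples(rows_a, rows_b):
    """Map each Primary_biosample_ID present in both files to its sample IDs.

    The value is a pair: (SAMPLE_IDs from rows_a, SAMPLE_IDs from rows_b).
    """
    a = samples_by_biosample(rows_a)
    b = samples_by_biosample(rows_b)
    return {bid: (a[bid], b[bid]) for bid in a if bid in b}


# Made-up rows standing in for a metabolite file and a methylation file.
metabolite_rows = [
    {"SAMPLE_ID": "TOM_example_1", "Primary_biosample_ID": "draw_A"},
    {"SAMPLE_ID": "TOM_example_2", "Primary_biosample_ID": "draw_B"},
]
methylation_rows = [
    {"SAMPLE_ID": "TOE_example_1", "Primary_biosample_ID": "draw_A"},
    {"SAMPLE_ID": "TOE_example_2", "Primary_biosample_ID": "draw_C"},
]
print(shared_biosamples(metabolite_rows, methylation_rows))
```

This only works if the study applies the same ID system to every submission, which is why consistency across projects is requested.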

FIGURE. Illustration of sample provenance and identifier types

The Figure above illustrates a scenario in which three primary biosamples were collected from a single subject. One whole blood sample (collected into an EDTA tube) was fractionated to obtain two “materials”: serum (extracellular fluid) and buffy coat (mainly leukocytes). The serum was then aliquoted into two tubes, one submitted for metabolite assays and the other for protein assays. DNA was extracted from the buffy coat and submitted for DNA methylation assays. The second whole blood sample (collected into a PAXgene tube) was fractionated to isolate T-cells, from which RNA was isolated and two aliquots were submitted for RNAseq assay. In this case, the first submitted sample failed, so a second RNA sample (from the same batch of isolated RNA) was submitted as a replacement. A third primary biosample (e.g. lung epithelium) was sampled via biopsy and an aliquot of RNA isolated without further tissue fractionation was then submitted for RNAseq.

In this example, the study would submit four different sample attributes metadata files (one per type of analyte to be assayed), each with a data dictionary including all relevant variables specified in the data dictionary template - i.e. separate files for the metabolite, protein, methylation and RNA. For the serum sample used for metabolite and protein assays, “Analyte_isolation_batch” should be omitted because the analytes were not extracted by the study.
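To make the file count concrete, the scenario can be written out as rows and grouped by “ANALYTE_TYPE”. This is a sketch for illustration only: every identifier below is made up, and `rows_per_file` is a hypothetical helper rather than part of any TOPMed tooling.

```python
from collections import defaultdict

# The Figure scenario as rows. All identifiers are invented for illustration.
figure_samples = [
    {"SAMPLE_ID": "TOM_example_1", "Primary_biosample_ID": "blood_EDTA",
     "HISTOLOGICAL_TYPE": "serum", "ANALYTE_TYPE": "metabolite"},
    {"SAMPLE_ID": "TOP_example_1", "Primary_biosample_ID": "blood_EDTA",
     "HISTOLOGICAL_TYPE": "serum", "ANALYTE_TYPE": "protein"},
    {"SAMPLE_ID": "TOE_example_1", "Primary_biosample_ID": "blood_EDTA",
     "HISTOLOGICAL_TYPE": "buffy coat", "ANALYTE_TYPE": "DNA"},
    # Failed RNA sample and its replacement from the same isolation batch:
    # two shipped samples, so two SAMPLE_IDs.
    {"SAMPLE_ID": "TOR_example_1", "Primary_biosample_ID": "blood_PAXgene",
     "HISTOLOGICAL_TYPE": "T-cell", "ANALYTE_TYPE": "RNA"},
    {"SAMPLE_ID": "TOR_example_2", "Primary_biosample_ID": "blood_PAXgene",
     "HISTOLOGICAL_TYPE": "T-cell", "ANALYTE_TYPE": "RNA"},
    {"SAMPLE_ID": "TOR_example_3", "Primary_biosample_ID": "lung_biopsy",
     "HISTOLOGICAL_TYPE": "lung epithelium", "ANALYTE_TYPE": "RNA"},
]


def rows_per_file(samples):
    """Group SAMPLE_IDs by ANALYTE_TYPE, i.e. one group per metadata file."""
    files = defaultdict(list)
    for sample in samples:
        files[sample["ANALYTE_TYPE"]].append(sample["SAMPLE_ID"])
    return dict(files)


for analyte_type, sample_ids in rows_per_file(figure_samples).items():
    print(analyte_type, sample_ids)
```

Six shipped samples from one subject therefore end up in four files, with the RNA file holding three rows.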

Protocol Documents

The following protocols (as pdf files) should accompany the sample attributes file:

Tissue collection document, which should include detailed descriptions of how and when the tissue was (a) obtained (e.g. blood draw, biopsy, post-mortem dissection, etc.), (b) preserved (e.g. PAXgene tube; fixed tissue), and (c) stored (e.g. -70 °C freezer).
Material isolation document, which should include a detailed protocol for how the primary biosample was fractionated (including reagent kit name, number, and company name).
Analyte isolation document, which should include a detailed protocol for how the analyte was isolated from the material or primary biosample (including reagent kit name, number, and company name).
Plate map document, which should include a description of the plate (or other container) map design. For example, samples may have been assigned to plates and wells completely at random except that samples from the same subject were always kept together on the same plate. If there was no design, that should be indicated. For example, it could be stated that samples were plated in the order in which they were stored in the freezer, which does not follow any particular design.

File Formats and Naming

Please use the following formats and file names

Sample attributes metadata file:

Format: tab-separated text file
Name: SampleAttributes_DS_<study_name>_<analyte_type>_<date>

Data Dictionary for sample attributes data file:

Format: tab-separated text file
Name: SampleAttributes_DD_<study_name>_<analyte_type>_<date>

Protocol Document file:

Format: pdf
Name: <document_topic>_<study_name>_<date>

Definitions

<study_name> is an abbreviated name of the study (e.g. FHS)

<analyte_type> is the type of omics assay to be performed on the submitted sample (use DNA, RNA, metabolite, or protein)

<date> is the date that the file was made (as YearMonthDay - e.g. 20191108)

<document_topic> is a short label for the document contents - generally “tissue_collection”, “material_isolation”, “analyte_isolation” or “plate_map”

The data file (and corresponding data dictionary) should include columns for all variables specified in the template data dictionary, except for those that do not apply to the samples submitted for the omics assay (e.g. omit the “Analyte_isolation” variables for submitted serum samples).

File Submission
