
Harmonized Phenotypes

TOPMed Phenotype Harmonization Project

The main goal of the TOPMed harmonization project is to provide harmonized phenotypes that are well-documented, reproducible, and homogeneous across studies. In the harmonized datasets and documentation, “phenotype” refers to an observable characteristic (e.g., diastolic blood pressure) and “variable” refers to the specific vector of data values for a given phenotype (e.g., bp_diastolic_1). To enable reproducibility, all study data were acquired from dbGaP.

Datasets and documentation of the harmonized variables were submitted to two repositories: dbGaP and BioData Catalyst. Full documentation for each harmonized variable is provided in a GitHub repository. The documentation for each harmonized variable includes the identifiers of the original dbGaP study variables used in harmonization as well as the code that was used to transform them into the harmonized variable. This repository also includes a reproducible example that instructs users how to use the documentation to reproduce a simulated harmonized variable.

TOPMed Phenotype Tagging Project

Over 16,000 dbGaP study variables were tagged with 65 phenotype concepts from the heart, lung, blood, and sleep domains. These tags enable researchers to identify variables of interest that can be used in future harmonization efforts. The results of the tagging project are available in the dbGaP user interface. Each tag is mapped to a UMLS Concept Unique Identifier (CUI), which is required for identifying the tagged variables on dbGaP.

Instructions for Identifying Tagged Variables on dbGaP

The following are examples of different methods to search for tagged variables: Entrez search and faceted search.

Entrez search

  • In your web browser, visit the dbGaP Entrez advanced search page.
  • In the search builder, select Common Data Element Resource and enter “umls” into the associated text box, or add “umls[Common Data Element Resource]” to the search box. Alternatively, select Common Data Element Term and enter the CUI of a UMLS term into the associated text box, or add a term such as “C0005890[Common Data Element Term]” to the search box.
  • The Studies tab of the search results displays all of the studies that contain tagged variables.
  • The Variables tab of the search results displays all of the dbGaP variables that are tagged with at least one UMLS term. Click on a variable name to see more information on the variable page.
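
The search terms described in the steps above can also be assembled programmatically before pasting them into the search box. The following is a minimal sketch, not part of the dbGaP tooling: the function names `resource_term`, `cui_term`, and `any_of` are hypothetical, and combining terms with `OR` assumes that the dbGaP search box accepts the usual Entrez boolean operators. The only details taken from this page are the two field names and the example CUI.

```python
import re

# UMLS CUIs are written as "C" followed by seven digits (e.g. C0005890).
CUI_PATTERN = re.compile(r"C\d{7}")


def resource_term(resource="umls"):
    """Return a term matching every variable tagged from one resource."""
    return f"{resource}[Common Data Element Resource]"


def cui_term(cui):
    """Return a term matching variables tagged with a single UMLS CUI."""
    cui = cui.strip().upper()
    if not CUI_PATTERN.fullmatch(cui):
        raise ValueError(f"not a valid UMLS CUI: {cui!r}")
    return f"{cui}[Common Data Element Term]"


def any_of(cuis):
    """Join several CUI terms with OR (hypothetical helper; assumes the
    search box accepts standard Entrez boolean operators)."""
    return " OR ".join(cui_term(c) for c in cuis)


if __name__ == "__main__":
    print(resource_term())
    print(cui_term("C0005890"))
    print(any_of(["C0001779", "C0037315"]))
```

Validating the CUI format up front catches typos such as a missing digit before the query is submitted, which is otherwise easy to mistake for an empty result set.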

Faceted search

  • In your web browser, visit the dbGaP faceted search page.
  • Click on the Variables tab.
  • Under the Common Data Elements filter, check UMLS. This will display all of the dbGaP study variables that are tagged with a UMLS term.
  • For a given variable listed on the right, you can click on the UMLS link to go directly to the variable’s information page with the full UMLS term name.
  • To search for variables tagged with a specific UMLS term, search for the term’s CUI in the search box in the upper left corner of the page.

Mapped Phenotype Tags

| Phenotype Domain | Description | UMLS CUI | UMLS Term | Tag Name (phenotype concept) |
|---|---|---|---|---|
| Supporting phenotypes | Participant age at enrollment or age at which data or biosamples were collected | C0001779 | Age | Age at enrollment/collection |
| Supporting phenotypes | Categorical indicator of the clinical visit (exam, year, etc.) during which observations, measurements and/or other data/biosample collections were made | C0008952 | Clinic Visits | Clinic visit |
| Supporting phenotypes | Qualitative or quantitative indicator describing or quantifying fasting status prior to blood draw for any blood sample-derived measurement | C1976106 | Fasting Status | Fasting |
| Supporting phenotypes | Categorical indicator of the clinic, recruitment site, and/or field center at which participants were recruited, and/or where data or biosamples were collected | C2828208 | Locality | Geographic site |
| Supporting phenotypes | Qualitative or quantitative indicator of use of any kind of medication, vitamin, mineral, or dietary supplement | C0240320 | medication use | Medication/supplement use |
| Supporting phenotypes | Categorical indicator of membership in different study subcohorts (i.e. participant subsets within a study who were recruited at different times and/or enrolled in different sub-studies) | C0599755 | Cohort | Subcohort |
| Sleep | Quantitative measure of Apnea-Hypopnea Index (AHI), a measure of sleep apnea severity | C2111846 | Apnea-hypopnea index procedure | AHI |
| Sleep | Qualitative indicator of sleep apnea status | C0037315 | Sleep Apnea Syndromes | Sleep apnea |
| Smoking | Qualitative and quantitative measures of cigarette smoking history and cigarette smoking status | C1519384 | Smoking History | Cigarette smoking |
| Stroke | Qualitative indicator of hemorrhagic stroke status | C0553692 | Brain hemorrhage | Hemorrhagic stroke |
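
Because dbGaP searches are keyed on the CUI rather than on the tag name, it can help to keep the mapping in a small lookup structure. The sketch below is illustrative only: it hard-codes the ten rows listed on this page (not all 65 phenotype concepts), and the names `TAGS`, `cui_for_tag`, and `tags_in_domain` are hypothetical.

```python
# (phenotype domain, UMLS CUI, UMLS term, tag name), copied from the rows
# listed on this page. This is a subset of the full set of tags.
TAGS = [
    ("Supporting phenotypes", "C0001779", "Age", "Age at enrollment/collection"),
    ("Supporting phenotypes", "C0008952", "Clinic Visits", "Clinic visit"),
    ("Supporting phenotypes", "C1976106", "Fasting Status", "Fasting"),
    ("Supporting phenotypes", "C2828208", "Locality", "Geographic site"),
    ("Supporting phenotypes", "C0240320", "medication use", "Medication/supplement use"),
    ("Supporting phenotypes", "C0599755", "Cohort", "Subcohort"),
    ("Sleep", "C2111846", "Apnea-hypopnea index procedure", "AHI"),
    ("Sleep", "C0037315", "Sleep Apnea Syndromes", "Sleep apnea"),
    ("Smoking", "C1519384", "Smoking History", "Cigarette smoking"),
    ("Stroke", "C0553692", "Brain hemorrhage", "Hemorrhagic stroke"),
]


def cui_for_tag(tag_name):
    """Return the CUI for a tag name, ignoring case and outer whitespace."""
    wanted = tag_name.strip().lower()
    for _domain, cui, _term, tag in TAGS:
        if tag.lower() == wanted:
            return cui
    raise KeyError(tag_name)


def tags_in_domain(domain):
    """Return the tag names in one phenotype domain, in table order."""
    wanted = domain.strip().lower()
    return [tag for d, _cui, _term, tag in TAGS if d.lower() == wanted]


if __name__ == "__main__":
    print(cui_for_tag("Sleep apnea"))
    print(tags_in_domain("Sleep"))
```

The CUI returned by `cui_for_tag` is the value to enter in the Common Data Element Term field or in the faceted search box.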


Information about these projects is available in a published manuscript. If you use the datasets described on this page, please cite the following paper:

Stilp AM, Emery LS, Broome JG, Buth EJ, Khan AT, Laurie CA, Wang FF, Wong Q, Chen D, D’Augustine CM, Heard-Costa NL, Hohensee CR, Johnson WC, Juarez LD, Liu J, Mutalik KM, Raffield LM, Wiggins KL, de Vries PS, Kelly TN, Kooperberg C, Natarajan P, Peloso GM, Peyser PA, Reiner AP, Arnett DK, Aslibekyan S, Barnes KC, Bielak LF, Bis JC, Cade BE, Chen MH, Correa A, Cupples LA, de Andrade M, Ellinor PT, Fornage M, Franceschini N, Gan W, Ganesh SK, Graffelman J, Grove ML, Guo X, Hawley NL, Hsu WL, Jackson RD, Jaquish CE, Johnson AD, Kardia SLR, Kelly S, Lee J, Mathias RA, McGarvey ST, Mitchell BD, Montasser ME, Morrison AC, North KE, Nouraie SM, Oelsner EC, Pankratz N, Rich SS, Rotter JI, Smith JA, Taylor KD, Vasan RS, Weeks DE, Weiss ST, Wilson CG, Yanek LR, Psaty BM, Heckbert SR, Laurie CC. A System for Phenotype Harmonization in the National Heart, Lung, and Blood Institute Trans-Omics for Precision Medicine (TOPMed) Program. Am J Epidemiol. 2021 Oct 1;190(10):1977-1992. doi: 10.1093/aje/kwab115. PMID: 33861317; PMCID: PMC8485147.
