Harmonization strategy
Phenotype harmonization was conducted to enable cross-study analyses. The main goals of this process are to provide harmonized phenotypes that are well-documented, reproducible, and as homogeneous across studies as possible. In harmonized datasets and documents, “phenotype” refers to the general concept of a measurement, and “variable” refers to the specific vector of data values for a given phenotype. The underlying database assigns a “trait_id” to uniquely identify each variable; this identifier appears in some of the documentation.
Collaboration between working groups, studies, and analysts is essential for rigorous phenotype harmonization. Working group members provide domain expertise in their phenotype area, and liaisons from each study provide guidance about which study variables are appropriate to use. Harmonized phenotypes are constructed from “observed” study variables whenever possible, as opposed to using “derived” variables, unless otherwise specified. Analysts rely on the working groups to provide both the initial harmonization algorithm and the component variables to use, with the study liaisons assisting if necessary.
Phenotype data for all studies are acquired from dbGaP. In addition to being a stable repository, dbGaP has already curated and processed the data into a consistent format, which allows for automated processing, and has assigned accession numbers to all phenotypic variables, which allows their provenance to be tracked. Harmonizing phenotypes from the dbGaP data for all available study participants, instead of just those being sequenced in TOPMed, allows non-TOPMed participants to be included in analyses after imputation or future sequencing.
Both the original study phenotypes and the final, harmonized phenotypes are stored in a relational database at the DCC. This setup allows DCC analysts to work with the study data in a consistent format instead of referencing a large number of files. It also provides a mechanism for tracking the provenance of each harmonized phenotype. In addition to storing metadata, the database also tracks the definition of the algorithms used to calculate each harmonized phenotype as well as the exact study phenotypes used in the calculation. Using this information, harmonized phenotypes can automatically be recomputed when updated study data are acquired from dbGaP. It also creates a lasting resource for the broader scientific community, as this detailed information will allow external investigators to augment the phenotypes harmonized in TOPMed with additional, non-TOPMed studies and will likely facilitate additional harmonization in future cross-study projects.
Analysts perform quality control (QC) of component phenotypes and consult with the relevant working group and study liaisons to resolve issues. This work focuses on finding inconsistencies and large batch effects in the study data. The harmonized traits also undergo QC to detect possible errors in the harmonization process; this consists of checking whether most values are within the expected range and evaluating differences among studies, and among sample sets within a study, that may have been handled differently during harmonization. For each phenotype, the DCC analysts provide comments on the harmonization and QC process. The information in these comments should be considered before including a harmonized phenotype in any analysis. The analysts note outliers but do not remove them from the harmonized data set, for two reasons: (1) they may represent extreme effects of rare loss-of-function variants and (2) the definition of an outlier may vary according to the intended use, so users of the data should be able to make their own decisions about exclusionary criteria. Unless otherwise specified, the precision of phenotypic measurements is not harmonized, and values are not rounded to significant digits because the necessary information is generally not available.
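The range check on harmonized values can be sketched as follows. This is a minimal illustration and not the DCC's actual QC code: the function name, the (study, value) record layout, and the example bounds are all assumptions. Out-of-range values are counted per study but never removed, in line with the policy described above.

```python
from collections import defaultdict

def out_of_range_summary(records, low, high):
    """Return, per study, the fraction of non-missing values outside [low, high].

    records: iterable of (study, value) pairs; value may be None (missing).
    The input is only summarized; no values are dropped or modified.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for study, value in records:
        if value is None:
            continue  # missing values are not counted either way
        totals[study] += 1
        if not (low <= value <= high):
            flagged[study] += 1
    return {study: flagged[study] / totals[study] for study in totals}

# Hypothetical height values in cm; 17.0 looks like a unit or entry problem.
example = [("StudyA", 170.0), ("StudyA", 17.0), ("StudyB", 165.0), ("StudyB", None)]
summary = out_of_range_summary(example, low=100, high=250)
```

A study whose fraction is much larger than the others would be a candidate for follow-up with the study liaison, rather than for automatic exclusion.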
For each phenotypic value for a given subject, an associated age at measurement is provided. These age values may have been winsorized by some studies (that is, extreme values were capped to minimize the influence of outliers in the data) and, if so, that winsorization carries through to the harmonized phenotype. For example, a study may give an age value as “>89” or “90+” instead of specific ages for subjects older than 89 years; in this case, that text string was converted to a numeric value of 90. If the study has not winsorized the age measurements, we provide them with no winsorization. Analysts should also take care when working with multiple phenotypic variables at the same time, as variables across datasets, or even within the same dataset, are not necessarily measured at the same time for each subject.
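The conversion of top-coded age strings can be illustrated with a short sketch. The function name is hypothetical, and the general rules (">N" becomes N + 1, "N+" becomes N) are an assumption extrapolated from the single example in the text, where both “>89” and “90+” become 90; individual studies may encode top-coded ages differently.

```python
import re
from typing import Optional

def parse_age(raw: str) -> Optional[float]:
    """Convert a study-reported age string to a number.

    Assumed rules: ">N" is reported as N + 1 and "N+" as N, so that both
    ">89" and "90+" become 90. Plain numeric strings pass through unchanged.
    """
    text = raw.strip()
    if not text:
        return None  # treat empty strings as missing
    match = re.fullmatch(r">\s*(\d+)", text)
    if match:
        return float(int(match.group(1)) + 1)
    match = re.fullmatch(r"(\d+)\s*\+", text)
    if match:
        return float(match.group(1))
    return float(text)  # raises ValueError for unrecognized formats

ages = [parse_age(s) for s in [">89", "90+", "67.5", ""]]
```

Raising an error on unrecognized strings, rather than silently returning a missing value, makes unexpected encodings visible during QC.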
Detailed documentation about which study phenotypes were used and the code that was run to produce a harmonized phenotype is maintained. The data values for each subject can be linked to the documentation using harmonization “unit” variables in each dataset. For each harmonized variable, a paired “unit_at_variable” is provided, whose value indicates where in the documentation to look to find the set of component variables and the algorithm used to harmonize those variables.
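As a rough illustration of how the unit variables can be used, the sketch below groups subjects by the harmonization unit recorded for one variable, so that each group can be matched to its section of the documentation. The row layout, the "subject_id" key, and the column and unit names in the example are hypothetical; the actual names should be taken from the dataset's data dictionary.

```python
from collections import defaultdict

def group_by_unit(rows, unit_column):
    """Group subject IDs by the harmonization unit of one harmonized variable.

    rows: iterable of dicts, one per subject (hypothetical layout).
    unit_column: name of the paired unit column for the variable of interest.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[unit_column]].append(row["subject_id"])
    return dict(groups)

# Hypothetical data: column and unit names are placeholders.
rows = [
    {"subject_id": "S1", "unit_col": "StudyA_exam1"},
    {"subject_id": "S2", "unit_col": "StudyA_exam2"},
    {"subject_id": "S3", "unit_col": "StudyA_exam1"},
]
units = group_by_unit(rows, "unit_col")
```

Each key of the result identifies one documented set of component variables and one harmonization algorithm.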
TOPMed recommends that any phenotype variable be carefully inspected before use in analysis. We recommend caution in the use of categorical variables as covariates in genetic association tests without first checking for categories with low counts, which might cause model-fitting problems. Analysts are advised to view plots of the data distribution to identify potential outliers that they might want to exclude from analysis.
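A check for sparse categories might look like the following sketch. The function name and the default threshold of 10 are assumptions chosen for illustration; an appropriate minimum count depends on the model and the sample size.

```python
from collections import Counter

def low_count_categories(values, min_count=10):
    """Return {category: count} for categories observed fewer than min_count times.

    Missing values (None) are ignored.
    """
    counts = Counter(v for v in values if v is not None)
    return {category: n for category, n in counts.items() if n < min_count}

# Hypothetical categorical covariate with one sparse level.
covariate = ["never"] * 12 + ["former"] * 11 + ["current"] * 3 + [None]
sparse = low_count_categories(covariate)
```

Categories returned by this check could be merged with a related level or excluded before model fitting, depending on the analysis.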