Authors |
S.A. Gagliano Taliun, Y. Li, D. Ray, P. Yajnik, NIMH InPSYght Consortium and NHLBI TOPMed Program, S. Lee, L.J. Scott, S.A. McCarroll, C.N. Pato, G.R. Abecasis, M. Boehnke, H.M. Kang
|
Abstract Text |
Joint analysis of whole genome sequencing (WGS) datasets can boost statistical power to detect disease risk-altering alleles. Here we focus on increasing power of a WGS case-control study by incorporating genetic ancestry-matched external WGS samples as additional controls. A first step to ensure data harmonization is to use a functionally equivalent pipeline to map and process alignments. However, in our experience, false positive associations in the resulting genome-wide comparisons still occur due to between-study differences in sequencing protocols or depth. Our goal is to develop high-specificity variant filtering procedures to eliminate such variants, and enable comparisons of WGS studies processed in a compatible manner.
To illustrate the challenges, we execute a joint analysis of African American WGS samples from the NIMH InPSYght study (N=3K controls, N=5K schizophrenia or bipolar cases, average depth 27x) with control samples of the same ancestry from the NHLBI TOPMed Project (N=15K, average sequencing depth 37x). Samples were sequenced at five centers, but sequence data were jointly processed and genotypes called together. We start with a comparison of InPSYght controls versus TOPMed controls, a scenario where we expect no true positive association signals. We adjust for genetic relatedness, sex and four principal components of ancestry, but observe a modest number of false positive associations: 113 common or low-frequency variants reach genome-wide significance (p ≤ 5×10⁻⁸) and an additional 158 deviate from the expected null distribution. These false positive variants consistently had lower depth and genotyping quality in carriers, which we hypothesized could be driving the spurious associations.
We evaluate strategies for variant filtering using information such as duplicate concordance, Mendelian inconsistencies, sequencing depth and genotype missingness. We show that a strict set of variant filters that removes ~3% of variants (1,153,945 of 41,732,031 variants with allele count >10) enables joint analysis. Specifically, we remove variants with >0.5% genotype discordance between duplicate samples or with a missing genotype in >2% of the duplicate pairs. In the variant-filtered control versus control comparison, no variants reach genome-wide significance or deviate from the null. Our duplicate-based variant filtering strategies allow for the addition of external controls in WGS datasets to boost power to detect disease associations.
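The duplicate-based filter described above can be illustrated with a minimal Python sketch. Only the two thresholds (0.5% discordance, 2% missing pairs) come from the abstract; the `DuplicateStats` record, its field names, the choice to compute discordance over pairs with both genotypes called, and the handling of variants without duplicate pairs are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

# Thresholds stated in the abstract.
MAX_DISCORDANCE = 0.005    # remove if >0.5% discordance between duplicates
MAX_MISSING_PAIRS = 0.02   # remove if >2% of duplicate pairs have a missing genotype


@dataclass
class DuplicateStats:
    """Hypothetical per-variant summary over duplicate sample pairs."""
    variant_id: str
    n_pairs: int           # duplicate pairs evaluated at this variant
    n_missing_pairs: int   # pairs where at least one genotype is missing
    n_discordant: int      # pairs with both genotypes called that disagree


def passes_duplicate_filter(stats: DuplicateStats) -> bool:
    """Return True if the variant is kept, False if it is removed."""
    if stats.n_pairs == 0:
        # Assumption: a variant with no duplicate pairs cannot be assessed
        # and is kept; the abstract does not say how this case is handled.
        return True
    missing_rate = stats.n_missing_pairs / stats.n_pairs
    called_pairs = stats.n_pairs - stats.n_missing_pairs
    # Assumption: discordance is measured among fully called pairs only.
    discordance = stats.n_discordant / called_pairs if called_pairs else 0.0
    return discordance <= MAX_DISCORDANCE and missing_rate <= MAX_MISSING_PAIRS


def filter_variants(all_stats):
    """Return the IDs of variants that survive the duplicate-based filter."""
    return [s.variant_id for s in all_stats if passes_duplicate_filter(s)]
```

For example, a variant with 1,000 duplicate pairs, 10 of them missing and 6 discordant, has a discordance of 6/990 (about 0.6%) and would be removed under these assumptions, while the same variant with 2 discordant pairs would be kept.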
|