Authors |
Quan Sun, Yingxi Yang, Jiawen Chen, Jia Wen, Michael R. Knowles, Charles Kooperberg, Alex Reiner, Laura M. Raffield, April Carson, Stephen Rich, Jerome Rotter, Ruth Loos, Eimear Kenny, Byron C. Jaeger, Yuan-I Min, Christian Fuchsberger, Yun Li
|
Abstract Text |
Since genotype imputation was introduced, researchers have relied on the imputation quality estimated by imputation software to perform post-imputation quality control (QC). However, this quality estimate (denoted Rsq) performs less well for lower-frequency variants. We recently published MagicalRsq, a machine-learning-based imputation quality calibration metric, which leverages additional typed markers from the same cohort and outperforms Rsq as a QC metric. In this work, we extended the original MagicalRsq to allow cross-cohort model training; we call the extension MagicalRsq-X. We removed the cohort-specific estimated minor allele frequency and added LD scores and recombination rates as variant-level features. Leveraging whole genome sequencing data from TOPMed, specifically participants in the BioMe, JHS, WHI and MESA studies, we performed comprehensive cross-cohort evaluations for individuals of European and African ancestry, with global ancestry inferred using the 1000 Genomes and HGDP data as reference. Our results suggest that MagicalRsq-X outperforms Rsq in almost every setting, with a 7.3-14.4% improvement in squared Pearson correlation with true R², corresponding to a gain of 85-218K variants. We further developed a metric to quantify the genetic distance of a target cohort from a reference cohort and showed that this metric could largely explain the performance of MagicalRsq-X models. Finally, in results from one of the largest GWAS of blood cell traits, we found that MagicalRsq-X rescued 9-53 GWAS variants that would have been missed had the original Rsq been used for QC. In conclusion, MagicalRsq-X shows clear superiority for post-imputation QC and can greatly benefit genetic studies by rescuing well-imputed low-frequency and rare variants.
|
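The evaluation criterion named in the abstract, squared Pearson correlation between a quality metric and true R², can be illustrated with a minimal sketch. This is not the authors' code: the function name `squared_pearson` and the toy arrays are hypothetical, and the numbers are made up for illustration only. It assumes one estimated quality value (for example Rsq or a calibrated score) and one true R² value per variant.

```python
import numpy as np


def squared_pearson(estimated, truth):
    """Squared Pearson correlation between two per-variant vectors.

    `estimated` holds a quality metric per variant (e.g. Rsq) and
    `truth` holds the corresponding true R2 values. Both names are
    hypothetical; this is an illustrative helper, not a library API.
    """
    estimated = np.asarray(estimated, dtype=float)
    truth = np.asarray(truth, dtype=float)
    if estimated.shape != truth.shape:
        raise ValueError("inputs must have the same shape")
    # np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r.
    r = np.corrcoef(estimated, truth)[0, 1]
    return float(r ** 2)


# Toy, made-up values for five variants (illustration only).
true_r2 = [0.95, 0.40, 0.80, 0.10, 0.65]
rsq = [0.90, 0.70, 0.75, 0.35, 0.50]          # software-reported estimate
calibrated = [0.93, 0.45, 0.78, 0.15, 0.60]   # hypothetical calibrated score

print("Rsq vs truth:       ", squared_pearson(rsq, true_r2))
print("calibrated vs truth:", squared_pearson(calibrated, true_r2))
```

A metric that tracks true R² more closely yields a value nearer to 1, which is the sense in which one QC metric can be said to outperform another under this criterion.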