AI-STAAR: An ancestry-informed association analysis framework for large-scale multi-ancestry whole genome sequencing studies

Submitted by	Wang, Wenbo
Authors	Wenbo Wang, Laura Y. Zhou, Diptavo Dutta, Yun Li, Tamar Sofer, Nora Franceschini, Zilin Li, Joseph G. Ibrahim, Xihao Li, on behalf of the TOPMed Kidney Function Working Group
Name and Date of Professional Meeting	ASHG Annual Meeting (November 5-9, 2024)
Associated paper proposal(s)	An ancestry-informed association analysis framework of large-scale whole genome sequencing studies, with applications to TOPMed kidney data
Working Group(s)	Kidney Function
Abstract Text	Introduction: Large-scale whole genome sequencing (WGS) studies enable the detection of common and rare variants (RVs) associated with complex diseases or traits. With the increasing availability of WGS data from participants of diverse populations, it is of interest to account for heterogeneity in allelic effect sizes across ancestries in order to improve the statistical power of association analyses and to detect complex trait loci whose underlying causal variants are shared between ancestry groups but have heterogeneous effects. Existing association analysis methods are limited in their ability to leverage multi-ancestry variant effect heterogeneity, especially for under-represented ancestry populations.

Methods: We propose AI-STAAR (Ancestry-Informed STAAR), a powerful and scalable framework for ancestry- and functionally-informed genetic association analysis in biobank-scale multi-ancestry sequencing studies. AI-STAAR improves the power of single variant analysis for common variants and of variant-set analysis for rare variants by modeling potential effect heterogeneity through ensemble weighting informed by ancestry-specific variant allele frequencies and effect sizes, while accounting for population stratification and relatedness within and across ancestries. AI-STAAR further facilitates functionally-informed association analysis of both coding and noncoding RVs by incorporating multiple categorical and quantitative functional annotations for variant grouping and weighting.

Results: We applied AI-STAAR to WGS common and rare variant analysis of derived kidney function traits, estimated glomerular filtration rate (eGFR) and urine albumin-creatinine ratio (UACR), from the NHLBI TOPMed consortium.
Among 45,090 participants with eGFR and 18,869 participants with UACR from diverse ancestries, AI-STAAR detected the single variants 22-40220108-G-A for eGFR and 1-231196875-C-A for UACR, as well as RVs in BAZ2A enhancer regions and in the CIR1 UTR for UACR. These associations were missed by methods that do not account for heterogeneous effects across ancestries. In addition to improving power to detect associations in the presence of effect size heterogeneity, AI-STAAR identifies the ancestry group(s) with the strongest variant associations: 22-40220108-G-A for eGFR and 1-231196875-C-A for UACR were driven by East Asian and European ancestries, respectively, while the RV associations of BAZ2A and CIR1 with UACR were driven by African ancestry.

Summary: AI-STAAR is a powerful and computationally scalable framework that leverages allelic heterogeneity by ancestry for genetic association analysis in multi-ancestry sequencing studies.

Key Words: Genome-sequencing; Genome-wide association; Statistical genetics; Rare variants; Genetic diversity