Abstract Text |
Mixed models for genetic association testing have traditionally accounted for structure among samples by using an empirical genetic relationship matrix (GRM) that measures genetic covariance, genome-wide, from both ancestry and relatedness. However, fitting mixed models in samples with tens or hundreds of thousands of individuals can be a prohibitive computational burden. Here, we address this problem by using a sparse empirical kinship matrix (KM) and ancestry principal components in place of a GRM.
Standard forms of empirical GRMs and KMs estimated from genotype data are dense; i.e., they have no entries equal to zero. To exploit the computational speedups that sparse matrices enable, we make an empirical KM sparse by clustering samples based on their pairwise kinship estimates and setting all inter-cluster estimates to zero; this can also be thought of as approximating low levels of relatedness as "unrelated". In today's large-scale population studies, where individuals in pedigrees are a small proportion of the overall sample, this approximation can be expected to be highly accurate and the computational speedup substantial.
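The sparsification step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes clusters are the connected components of the graph that links any two samples whose kinship estimate exceeds the threshold, and it takes the 4th-degree cutoff as 2^(-11/2), which is about 0.022. The function name `sparsify_kinship` and the toy matrix are hypothetical.

```python
def sparsify_kinship(kinship, threshold):
    """Zero all kinship entries between samples in different clusters.

    Assumption: clusters are the connected components of the graph that
    links samples whose pairwise kinship estimate exceeds `threshold`.
    """
    n = len(kinship)
    parent = list(range(n))  # union-find forest, one node per sample

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge every pair of samples related above the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if kinship[i][j] > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    labels = [find(i) for i in range(n)]

    # Keep intra-cluster estimates as they are; zero inter-cluster ones.
    sparse = [
        [kinship[i][j] if labels[i] == labels[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    return sparse, labels


# Assumed 4th-degree cutoff, roughly 0.022.
THRESHOLD = 2 ** (-11 / 2)

# Hypothetical toy data: samples 0-1 and 1-2 are related above the cutoff,
# 0-2 falls below it, and sample 3 is unrelated to everyone.
kinship = [
    [0.50, 0.25, 0.01, 0.004],
    [0.25, 0.50, 0.0625, 0.002],
    [0.01, 0.0625, 0.50, 0.001],
    [0.004, 0.002, 0.001, 0.50],
]
sparse_km, clusters = sparsify_kinship(kinship, THRESHOLD)
```

In this toy example, samples 0, 1, and 2 form one cluster and sample 3 forms its own. The 0-2 estimate is retained even though it is below the cutoff, because both samples fall in the same cluster; only estimates between clusters are zeroed. After reordering samples by cluster, the result is block diagonal, which is what allows sparse matrix routines to be used.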
To illustrate the computational advantage and statistical impact of using sparse empirical KMs, we performed genetic association analyses using seven red blood cell traits and whole-genome sequencing (WGS) data from TOPMed freeze 6. Between 17,469 and 48,858 samples were available for these traits. Using a 4th-degree relatedness threshold (i.e., kinship > 0.022) and our proposed algorithm, 98.3% to 99.5% of entries in the sparse KM were set to zero, and the largest cluster ranged from 1,667 to 2,459 samples. Compared to using a GRM, using a sparse KM substantially improved computational performance; e.g., fitting the null models for these traits took just 0.5-6.2% of the CPU time and required 1.4-6.7% of the memory. Furthermore, differences in association p-values between the two approaches were small. For these traits, over 99.99% of tests differed in -log10(p) by less than 0.5; i.e., by an amount very unlikely to change the practical interpretation of results. With the level of sparsity attainable in population studies such as TOPMed, we also find that our approach performs favorably compared to SAIGE, another mixed model method designed for analysis of large samples. The use of sparse KMs is a promising and flexible approach to improve the computational efficiency of association testing in large population studies without sacrificing accuracy.
|