Abstract Text |
Large sequencing studies identify an abundance of rare variants, only a small proportion of which are
functional. Identifying this functional subset enables the design of powerful downstream analyses. Because
many functional variants are subject to selection, we aim to use variant age, which carries evidence of
selection, to prioritize these variants.
To estimate a variant’s age, we propose a Bayesian model that leverages the haplotype sharing pattern
surrounding a variant in addition to the sample allele frequency. Intuitively, individuals sharing a recent
mutation are also likely to share a large identity-by-descent (IBD) segment around that position. We model
the length of this IBD segment conditional on the observed mutation as shaped by recent demographics,
such as population expansion and structure. Our new method combines analytical and numerical
solutions from coalescent theory to handle large sample sizes under any specified effective population
size or any history model used to infer the most recent demographics.
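
The intuition that younger shared mutations sit on longer IBD segments can be illustrated with a deliberately simplified sketch. This is not the model proposed here: it assumes a doubleton, a constant diploid effective population size (the hypothetical parameter n_e), an exponential coalescent prior on the carriers' time to the most recent common ancestor (TMRCA), and an IBD length measured in Morgans that is the sum of two exponential distances to the nearest recombination on either side. Under those assumptions the posterior is a Gamma distribution, and the carriers' TMRCA is only a lower bound on the mutation's age. All function names are illustrative.

```python
import math


def posterior_tmrca(ibd_length_morgans, n_e):
    """Gamma (shape, rate) posterior of the carriers' TMRCA in generations.

    Assumed toy model, not the method described in the text:
      prior:      T ~ Exponential(rate = 1 / (2 * n_e))
      likelihood: L | T = t ~ Gamma(shape = 2, rate = 2 * t)
    The product is proportional to t^2 * exp(-t * (2L + 1/(2 n_e))).
    """
    if ibd_length_morgans <= 0 or n_e <= 0:
        raise ValueError("IBD length and n_e must be positive")
    shape = 3.0
    rate = 2.0 * ibd_length_morgans + 1.0 / (2.0 * n_e)
    return shape, rate


def posterior_mean(ibd_length_morgans, n_e):
    """Closed-form posterior mean TMRCA under the toy model."""
    shape, rate = posterior_tmrca(ibd_length_morgans, n_e)
    return shape / rate


def grid_posterior_mean(ibd_length_morgans, n_e, t_max=2000.0, steps=20000):
    """Numerical posterior mean on a grid of TMRCA values.

    The prior line is the only part that would change if a different
    demographic history were assumed.
    """
    dt = t_max / steps
    numerator = 0.0
    denominator = 0.0
    for i in range(1, steps + 1):
        t = i * dt
        likelihood = (2.0 * t) ** 2 * ibd_length_morgans * math.exp(
            -2.0 * t * ibd_length_morgans
        )
        prior = math.exp(-t / (2.0 * n_e)) / (2.0 * n_e)
        weight = likelihood * prior
        numerator += t * weight
        denominator += weight
    return numerator / denominator


if __name__ == "__main__":
    for length_cm in (0.5, 2.0, 8.0):
        mean_age = posterior_mean(length_cm / 100.0, n_e=10_000)
        print(f"{length_cm:4.1f} cM shared -> posterior mean TMRCA {mean_age:8.1f} generations")
```

The grid version mirrors the closed form and shows where a numerical solution would take over once the prior no longer has a convenient analytical shape, for example under population expansion.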
We use simulations to demonstrate that our new approach of jointly modeling IBD lengths, allele frequency,
and population history provides more accurate age estimates than previous methods, and that the tool we
develop scales to samples with more than 100,000 individuals. Applying our method to the TOPMed data
of >100,000 individuals to estimate the ages of all rare variants with sample allele counts below ten, we
observe a twofold enrichment of all protein-altering variants among the youngest 10% of doubletons, and
a fourfold difference in the number of high-impact (frameshift, stop/start lost/gain, etc.) variants when
comparing the youngest to the oldest 10% of doubletons. This is a clear signal that our age estimates can
serve as a functional annotation, provide new insights for interpreting GWAS findings, and help group
variants for association analysis.
|