However, a detailed genome-wide assessment of the impact and distribution of indels was still missing... until now.
In this interesting paper, published in Genome Research, Montgomery et al. address exactly this question, with impressive results. First of all, the authors had to deal with the challenge of calling short indels, which is one of the biggest issues when analyzing NGS data. Starting with DNA sequences from 179 individuals from 3 population groups, they made several optimizations to the standard pipeline used by the 1000 Genomes Project to obtain a set of high-quality indels. Even if indels in homopolymeric regions remain out of reach, the improved pipeline described in the paper is certainly a useful guide for anyone working in the field. Among the other interesting findings, the authors confirmed that rates of indel mutagenesis are highly heterogeneous, with 43-48% of indels occurring in 4.03% of the genome (loci defined as indel hotspots by the authors). They propose polymerase slippage as the main mechanism generating indels, explaining upwards of 3/4 of them, while the insertions among the remainder show sequence features compatible with fork stalling and template switching (FoSTeS).
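To get a feel for how strong the hotspot effect is, the two figures quoted above (43-48% of indels in 4.03% of the genome) can be turned into a fold difference in indel density. This is only a back-of-the-envelope sketch, not a calculation taken from the paper: the function name `hotspot_enrichment` is made up for illustration, and it assumes indel density is simply the share of indels divided by the share of genome.

```python
def hotspot_enrichment(indel_fraction, genome_fraction=0.0403):
    """Fold difference in indel density inside vs. outside hotspots.

    indel_fraction  -- share of all indels that fall in hotspots (0-1)
    genome_fraction -- share of the genome classified as hotspot (0-1);
                       0.0403 is the figure quoted in the abstract
    Assumption (ours, not the paper's): density = share of indels / share of genome.
    """
    if not (0 < indel_fraction < 1 and 0 < genome_fraction < 1):
        raise ValueError("fractions must be strictly between 0 and 1")
    density_inside = indel_fraction / genome_fraction
    density_outside = (1 - indel_fraction) / (1 - genome_fraction)
    return density_inside / density_outside


# Lower and upper bounds of the 43-48% range reported by the authors
for fraction in (0.43, 0.48):
    fold = hotspot_enrichment(fraction)
    print(f"{fraction:.0%} of indels in hotspots -> {fold:.1f}x higher density")
```

Under these assumptions, indel density in hotspots comes out roughly 18 to 22 times higher than in the rest of the genome.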
Take a look!
The origin, evolution and functional impact of short insertion-deletion variants identified in 179 human genomes
- Stephen B Montgomery,
- David Goode,
- Erika Kvikstad,
- Cornelis A Albers,
- Zhengdong Zhang,
- Xinmeng Jasmine Mu,
- Guruprasad Ananda,
- Bryan Howie,
- Konrad J Karczewski,
- Kevin S Smith,
- Vanessa Anaya,
- Rhea Richardson,
- Joe Davis,
- Daniel G MacArthur,
- Arend Sidow,
- Laurent Duret,
- Mark Gerstein,
- Kateryna Markova,
- Jonathan Marchini,
- Gilean A McVean and
- Gerton Lunter
Abstract
Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing 3 diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43-48% of indels occurring in 4.03% of the genome we classify as indel hotspots, while in the remaining 96% their prevalence is 16-times lower than that for SNPs. Polymerase slippage can explain upwards of 3/4 of all indels, including virtually all hotspot indels. The remainder are mostly simple deletions in complex sequence, but insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage showing an excellent fit to observed levels of variation, which enables us to identify a minority of indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, as is well known of frameshift mutations in coding regions, but also longer indels and indels affecting multiple functionally constrained nucleotides are more strongly selected against in various non-coding contexts. We further find that indels are enriched in associations with gene expression, and find evidence for a contribution of nonsense-mediated decay to this association.
Finally, we show that indels can be integrated in existing GWAS studies, and although we do not find direct evidence that potentially causal protein-coding indels are enriched with strong associations to known disease-associated SNPs, many of our findings suggest that the causal variant underlying some of these associations may be indels.