Friday, 4 April 2014

A gold standard dataset of SNPs and indels for benchmarking tools

One of the pains of NGS data analysis is that different tools, different settings and different sequencing platforms produce different variant sets, often with little overlap. Every analysis pipeline has its own peculiar issues and produces its own specific false positive and false negative calls.

A gold standard reference of SNP and indel calls is thus of primary importance for correctly assessing the accuracy and sensitivity of a new NGS pipeline. Robust benchmarking of variant identification pipelines is even more critical now that NGS analysis is moving fast from research to the diagnostic/clinical field.
In this interesting paper in Nature Biotechnology, the authors analyzed 14 different variant datasets from the same genome, NA12878 (chosen as the pilot reference genome by the Genome in a Bottle Consortium), to produce a list of gold standard SNPs and indels and to identify genomic regions that are particularly difficult to call. The final dataset is the most robust produced so far, since it integrates data from 5 different sequencing technologies, 7 read mappers and 3 variant callers to obtain a robust estimation of SNVs.
The standard for evaluating your favorite pipeline is finally here, publicly available on the Genome Comparison and Analytic Testing website!
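
To give a rough idea of what benchmarking against such a gold standard involves, here is a minimal Python sketch that compares a pipeline's calls with a truth set and reports sensitivity and precision. It is not the method used in the paper or by the website: the `Variant` tuple, the `benchmark` function and the toy variants are all invented for illustration, and it assumes that both call sets are already normalized to the same representation and restricted to the high-confidence regions.

```python
from typing import Dict, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant key: chromosome, position, ref and alt alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def benchmark(truth: Set[Variant], calls: Set[Variant]) -> Dict[str, float]:
    """Compare pipeline calls with a truth set by exact match on the variant key."""
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}


# Toy data, invented for illustration only (not real NA12878 calls).
truth = {
    Variant("1", 1000, "A", "G"),
    Variant("1", 2000, "C", "T"),
    Variant("2", 500, "G", "GA"),
    Variant("2", 900, "T", "C"),
}
calls = {
    Variant("1", 1000, "A", "G"),
    Variant("1", 2000, "C", "T"),
    Variant("2", 700, "A", "C"),
}

result = benchmark(truth, calls)
print(result)
```

Exact matching on position and alleles is the simplest possible comparison. Indels in particular can be written in several equivalent ways, so a real evaluation would need to normalize variant representation first and would probably also compare genotypes, not just alleles.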

Nat Biotechnol. 2014 Feb 16. 
Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls.

Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, Salit M. 

Clinical adoption of human genome sequencing requires methods that output genotypes with known accuracy at millions or billions of positions across a genome. Because of substantial discordance among calls made by existing sequencing methods and algorithms, there is a need for a highly accurate set of genotypes across a genome that can be used as a benchmark. Here we present methods to make high-confidence, single-nucleotide polymorphism (SNP), indel and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. We minimize bias toward any method by integrating and arbitrating between 14 data sets from five sequencing technologies, seven read mappers and three variant callers. We identify regions for which no confident genotype call could be made, and classify them into different categories based on reasons for uncertainty. Our genotype calls are publicly available on the Genome Comparison and Analytic Testing website to enable real-time benchmarking of any method.

