Wednesday, 30 April 2014

Pubmed highlight: Comparison of mapping alghoritms

Map millions of reads to some reference genome sequence is in most cases the first step in NGS data analysis. Proper mapping is essential for downstream variant identification and assessing of the quality of each sequenced base. Various tools exist to perform this task and this paper present an interesting new tool to benchmark results from the different aligners. Authors developed CuReSim, a tool able to generate simulated reads dataset resembling differetn NGS technologies, and CuReSimEval, that perform the performance evaluation for a given aligner given the created dataset.
In the paper they apply this new method to compare the performance of some popular aligners (like BWA, TMAP and BowTie) working on Ion Torrent data. "The application of this procedure to Ion Torrent data from the whole genome sequencing of small genomes [...] demonstrate that it is helpful for selecting a mapper based on the intended application, questions to be addressed, and the technology used." they reports in the abstract.

Comparison of mapping algorithms used in high-throughput sequencing: application to Ion Torrent data.
Caboche S, Audebert C, Lemoine Y, Hot D

BACKGROUND: The rapid evolution in high-throughput sequencing (HTS) technologies has opened up new perspectives in several research fields and led to the production of large volumes of sequence data. A fundamental step in HTS data analysis is the mapping of reads onto reference sequences. Choosing a suitable mapper for a given technology and a given application is a subtle task because of the difficulty of evaluating mapping algorithms.
RESULTS: In this paper, we present a benchmark procedure to compare mapping algorithms used in HTS using both real and simulated datasets and considering four evaluation criteria: computational resource and time requirements, robustness of mapping, ability to report positions for reads in repetitive regions, and ability to retrieve true genetic variation positions. To measure robustness, we introduced a new definition for a correctly mapped read taking into account not only the expected start position of the read but also the end position and the number of indels and substitutions. We developed CuReSim, a new read simulator, that is able to generate customized benchmark data for any kind of HTS technology by adjusting parameters to the error types. CuReSim and CuReSimEval, a tool to evaluate the mapping quality of the CuReSim simulated reads, are freely available. We applied our benchmark procedure to evaluate 14 mappers in the context of whole genome sequencing of small genomes with Ion Torrent data for which such a comparison has not yet been established.
CONCLUSIONS: A benchmark procedure to compare HTS data mappers is introduced with a new definition for the mapping correctness as well as tools to generate simulated reads and evaluate mapping quality. The application of this procedure to Ion Torrent data from the whole genome sequencing of small genomes has allowed us to validate our benchmark procedure and demonstrate that it is helpful for selecting a mapper based on the intended application, questions to be addressed, and the technology used. This benchmark procedure can be used to evaluate existing or in-development mappers as well as to optimize parameters of a chosen mapper for any application and any sequencing platform.

Wednesday, 23 April 2014

PubMed Highlight: Importance of annotation tool and transcript dataset for variants analysis

This paper recently published on Genome Medicine analyzes the differences in variants annotation when using different annotation tools and transcript datasets. Authors extensively report on pitfalls and peculiar aspects of the two mostly diffused softwares (ANNOVAR and VEP), using transcripts definition from both RefSeq and Ensembl databases.
From the paper abstract clearly emerges a high level of discrepancy, expecially for functional variants (namely missense and LoF) that are the far the most relevant in NGS analysis. "We find only 44% agreement in annotations for putative loss-of-function variants when using the REFSEQ and ENSEMBL transcript sets as the basis for annotation with ANNOVAR. The rate of matching annotations for loss-of-function and nonsynonymous variants combined is 79% and for all exonic variants it is 83%. When comparing results from ANNOVAR and VEP using ENSEMBL transcripts, matching annotations were seen for only 65% of loss-of-function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy" authors write.

What impress me the most is the low level of concordance between different transcript datasets, reflecting the fact that also the annotation of mRNA forms are far from definitely established.
So be careful with your annotations!

Here the full paper:

Choice of transcripts and software has a large effect on variant annotation
Davis J McCarthy, Peter Humburg, Alexander Kanapin, Manuel A Rivas, Kyle Gaulton, The WGS500 Consortium, Jean-Baptiste Cazier and Peter Donnelly

Monday, 7 April 2014

Some useful tools for your everyday NGS analysis

There are a lot of tools that can assist you at every step of an NGS data analysis. Here are some interesting pieces of software I've recently started using.

SAMStat - published on Bioinformatics
This tool take your sam/bam/fastq files and compute several metrics describing frequency and distribution of bases across all your reads. Results include: stats on MAPQ of reads, distribution of MAPQ across read lenght, nucleotide over-representation across reads, error distribution and identification of over-represented 2-mers and 10-mers. All the data are conveniently presented in a html summary and help you identifying potential issues present in your sequencing experiment. Moreover, these graphs are always useful as quality reports for your presentations!

ngsCAT - published on Bioinformatics
This command line tool provide you with a detailed analysis of your mapped reads given a defined target regions. It compute several metrics and stats: medium coverage, number of bases covered at least n fold, duplicated reads, distribution of on-target reads across chromosomes and uniformity of coverage across target regions. The tool require a bam file and a bed file as inputs and produce several graphs and tabs and a final summary report in html format. Really simple and useful to assess quality of your target capture!

GRAB - published on PLOS ONE
This tool is as simple as cleaver. It takes genotyping information of subjects from various formats (Genotype/Var/masterVar/GVF/VCF/PEDMAP/TPED) and compute their eventual relationship. It works best with whole genome data, but I've tested it also using vcf files from exome sequencing and reducing the default reading window it performs well at least in identifying 1st and 2nd grade relationships. This tool require R installed in your system and it's really fast in performing the analysis. It is useful when you are dealing with a group of samples and you want to verify that there are no members from the same family.

MendelScan - published on American Journal of Human Genetics
This tool is also described by the author on his blog MassGenomics. This software perform variant prioritization for family based exome sequencing studies. It needs some preliminary steps to prepare the necessary input files: a multisample vcf files with variants from the family members, a ped file describing the relations between samples, a ranked list of genes that are mostly expressed in your tissue of interest and the VEP annotated list of your variants. Given these data the tool compute a ranked list of identified variants based on the selected inheritance model (recessive or dominant). Moreover it include two additional modules developed by the authors: the Rare Heterozygous Rule Out and the Shared Identity-by-Descent. The first one "identifies candidate regions consistent with autosomal dominant inheritance based on the idea that a disease-causing haplotype will manifest regions of rare heterozygous variants shared by all affecteds, and an absence of homozygous differences between affected pairs (which would indicate that a pair had no haplotype in common).", while the second one "uses BEAGLE FastIBD results to identify regions of maximum identity-by-descent (IBD) among affected pairs". This tool integrates the canonical prediction scores (such as Polyphen, PhyloP and so on) with gene expression ranking and the newly developed methods to provide a straightforward analysis for your mendelian disease NGS studies!

SPRING - published on PLOS Genetics
Like MendelScan, this is another tool for variant prioritization. This one has also a version working from the web for those that are not familiar with the command line. The tool takes a list of seed genes already known to be involved in the pathology or in similar phenotypes and a list of your candidate missense variants and give you a ranked list of the variants that have a high probability of being associated with the disease. This tools work fine for disease with high genetic heterogeneity so that you can easily and confidently build a list of seed genes. SPRING can then be really useful in prioritizing candidates variants emerging from new studies.

PRADA - published on Bioinformatics
This tool is focused on RNA-Seq and it provide a complete framework for analysis of this kind of data. The tool is a complete solution since it can perform several kind of analysis starting from raw paired-end RNA-seq data: gene expression levels, quality metrics, detection of unsupervised and supervised fusion transcripts, detection of intragenic fusion variants, homology scores and fusion frame classification. As the project page reports:
"PRADA currently supports 7 modules to process and identify abnormalities from RNAseq data:
preprocess: Generates aligned and recalibrated BAM files.
expression: Generates gene expression (RPKM) and quality metrics.
fusion: Identifies candidate gene fusions.
guess-ft: Supervised search for fusion transcripts.
guess-if: Supervised search for intragenic fusions.
homology: Calculates homology between given two genes.
frame: Predicts functional consequence of fusion transcript"

This is a good starting point for those not familiar with RNA-Seq!

Friday, 4 April 2014

A gold standard dataset of SNP and Indels for benchmarking tools

One of the pain in the NGS data analysis is that different tools, different settings and different sequencing platforms produce different and often low overlapping variants. Every analysis pipeline has its peculiar issues and results in specific false positive/negative calls.

Genome in a BottleThe need for a gold standard reference of SNP and indels calls is thus of primary relevance to correctly asses the accuracy and sensibility of new NGS pipeline. The need for a robust benchmarking of variant identification pipelines is even more critical as NGS analysis is fast moving from research to diagnostic/clinical field.
In this interesting paper on Nature Biotechnology, authors have performed the analysis of 14 different variants datasets from the same standard genome NA12878 (choose as the standard reference genome by the Genome in a Bottle Consortium) to produce a list of gold standard SNPs and Indels and identify genomic regions that are particularly difficult to address. The final dataset is the most robust produced so far since it integrate data from 5 different sequencing technology, 7 read mapper and 3 variant callers to obtain a robust estimation of SNVs.
The final standard for evaluation of your favorite pipeline is finally here, publicly available Genome Comparison and Analytic Testing website!

Nat Biotechnol. 2014 Feb 16. 
Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls.

Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, Salit M. 

Clinical adoption of human genome sequencing requires methods that output genotypes with known accuracy at millions or billions of positions across a genome. Because of substantial discordance among calls made by existing sequencing methods and algorithms, there is a need for a highly accurate set of genotypes across a genome that can be used as a benchmark. Here we present methods to make high-confidence, single-nucleotide polymorphism (SNP), indel and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. We minimize bias toward any method by integrating and arbitrating between 14 data sets from five sequencing technologies, seven read mappers and three variant callers. We identify regions for which no confident genotype call could be made, and classify them into different categories based on reasons for uncertainty. Our genotype calls are publicly available on the Genome Comparison and Analytic Testing website to enable real-time benchmarking of any method.