
Monday, 7 April 2014

Some useful tools for your everyday NGS analysis

There are a lot of tools that can assist you at every step of an NGS data analysis. Here are some interesting pieces of software I've recently started using.

SAMStat - published in Bioinformatics
This tool takes your SAM/BAM/FASTQ files and computes several metrics describing the frequency and distribution of bases across all your reads. Results include: stats on the MAPQ of reads, distribution of MAPQ across read length, nucleotide over-representation across reads, error distribution and identification of over-represented 2-mers and 10-mers. All the data are conveniently presented in an HTML summary and help you identify potential issues in your sequencing experiment. Moreover, these graphs are always useful as quality reports for your presentations!
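
To give an idea of what "over-represented 2-mers" means, here is a minimal Python sketch. It is not SAMStat's actual algorithm: the function name `kmer_enrichment` is made up for illustration, and it assumes the expected frequency of a k-mer is simply the product of its single-base frequencies in the same reads.

```python
from collections import Counter

def kmer_enrichment(reads, k=2):
    """Return {kmer: observed / expected} for every k-mer seen in the reads.

    Assumption: the expected frequency of a k-mer is the product of the
    frequencies of its bases, estimated from the reads themselves.
    """
    valid = set("ACGT")
    base_counts = Counter()
    kmer_counts = Counter()
    for read in reads:
        read = read.upper()
        base_counts.update(b for b in read if b in valid)
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing N or other ambiguous symbols.
            if set(kmer) <= valid:
                kmer_counts[kmer] += 1

    total_bases = sum(base_counts.values())
    total_kmers = sum(kmer_counts.values())
    enrichment = {}
    for kmer, observed in kmer_counts.items():
        expected_freq = 1.0
        for base in kmer:
            expected_freq *= base_counts[base] / total_bases
        enrichment[kmer] = observed / (expected_freq * total_kmers)
    return enrichment

if __name__ == "__main__":
    toy_reads = ["ACACACAC", "ACACACAC"]
    for kmer, ratio in sorted(kmer_enrichment(toy_reads).items()):
        print(kmer, round(ratio, 2))
```

A ratio well above 1 flags a k-mer that appears more often than base composition alone would suggest, which is the kind of signal you would then check against adapters or primers.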

ngsCAT - published in Bioinformatics
This command line tool provides you with a detailed analysis of your mapped reads given a set of defined target regions. It computes several metrics and stats: mean coverage, number of bases covered at least n-fold, duplicated reads, distribution of on-target reads across chromosomes and uniformity of coverage across target regions. The tool requires a BAM file and a BED file as inputs and produces several graphs and tables and a final summary report in HTML format. Really simple and useful to assess the quality of your target capture!
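
Two of these metrics are easy to picture with a few lines of Python. The sketch below is not ngsCAT code: `coverage_summary` is a hypothetical helper, and it assumes you already have one depth value per targeted base (for instance exported by a depth tool restricted to your BED regions).

```python
def coverage_summary(depths, thresholds=(1, 10, 20)):
    """Summarise per-base depths over the target regions.

    depths: one integer depth per targeted base (assumed input format).
    Returns the mean coverage and, for each threshold n, the fraction of
    target bases covered at least n-fold.
    """
    if not depths:
        raise ValueError("no target bases supplied")
    n_bases = len(depths)
    mean_coverage = sum(depths) / n_bases
    fraction_covered = {
        t: sum(1 for d in depths if d >= t) / n_bases for t in thresholds
    }
    return {"mean_coverage": mean_coverage, "fraction_covered": fraction_covered}

if __name__ == "__main__":
    toy_depths = [0, 5, 10, 25]
    print(coverage_summary(toy_depths))
```

Reporting fractions rather than raw base counts makes samples with different target sizes directly comparable.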

GRAB - published in PLOS ONE
This tool is as simple as it is clever. It takes genotyping information of subjects in various formats (Genotype/Var/masterVar/GVF/VCF/PEDMAP/TPED) and computes their possible relationships. It works best with whole genome data, but I've also tested it using VCF files from exome sequencing: after reducing the default reading window, it performs well at least in identifying first- and second-degree relationships. This tool requires R installed on your system and it's really fast in performing the analysis. It is useful when you are dealing with a group of samples and you want to verify that there are no members of the same family.
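
If you wonder how genotypes alone can reveal relatedness, a generic identity-by-state (IBS) comparison gives the intuition. The sketch below is only that, an illustration: it is not necessarily what GRAB does internally, `ibs_proportions` is a made-up name, and genotypes are assumed to be coded as alternate allele counts (0, 1, 2) at biallelic sites.

```python
def ibs_proportions(genotypes_a, genotypes_b):
    """Return the fractions of sites where two samples share 0, 1 or 2 alleles.

    Genotypes are alternate allele counts (0, 1 or 2) at the same biallelic
    sites, in the same order; None marks a missing call and is skipped.
    """
    counts = [0, 0, 0]
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue
        shared = 2 - abs(a - b)  # alleles identical by state at this site
        counts[shared] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no sites with calls in both samples")
    return tuple(c / total for c in counts)

if __name__ == "__main__":
    sample_a = [0, 1, 2, 2, None]
    sample_b = [2, 1, 2, 1, 0]
    ibs0, ibs1, ibs2 = ibs_proportions(sample_a, sample_b)
    print(ibs0, ibs1, ibs2)
```

A parent and child share at least one allele at every site, so their IBS0 fraction should be close to zero apart from genotyping errors, while unrelated pairs show a clearly higher value.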

MendelScan - published in The American Journal of Human Genetics
This tool is also described by the author on his blog MassGenomics. This software performs variant prioritization for family-based exome sequencing studies. It needs some preliminary steps to prepare the necessary input files: a multisample VCF file with variants from the family members, a PED file describing the relations between samples, a ranked list of genes that are most expressed in your tissue of interest and the VEP-annotated list of your variants. Given these data, the tool computes a ranked list of identified variants based on the selected inheritance model (recessive or dominant). Moreover, it includes two additional modules developed by the authors: the Rare Heterozygous Rule Out and the Shared Identity-by-Descent. The first one "identifies candidate regions consistent with autosomal dominant inheritance based on the idea that a disease-causing haplotype will manifest regions of rare heterozygous variants shared by all affecteds, and an absence of homozygous differences between affected pairs (which would indicate that a pair had no haplotype in common).", while the second one "uses BEAGLE FastIBD results to identify regions of maximum identity-by-descent (IBD) among affected pairs". This tool integrates the canonical prediction scores (such as PolyPhen, PhyloP and so on) with gene expression ranking and the newly developed methods to provide a straightforward analysis for your Mendelian disease NGS studies!
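
To make the idea of an inheritance model concrete, here is a deliberately simplified segregation check in Python. It is a strict yes/no filter written for illustration, not MendelScan's own logic: the name `fits_model` is hypothetical, and it assumes full penetrance, no missing genotypes and genotypes coded as alternate allele counts.

```python
def fits_model(genotypes, affected, model):
    """Check whether one variant segregates according to an inheritance model.

    genotypes: {sample: alternate allele count (0, 1 or 2)}
    affected:  {sample: True if affected, False otherwise}
    model:     "recessive" or "dominant"
    """
    if model == "recessive":
        # Affected samples are homozygous for the variant allele,
        # unaffected samples are not.
        return all(
            (gt == 2) if affected[sample] else (gt < 2)
            for sample, gt in genotypes.items()
        )
    if model == "dominant":
        # Affected samples carry the variant allele, unaffected samples do not.
        return all(
            (gt >= 1) if affected[sample] else (gt == 0)
            for sample, gt in genotypes.items()
        )
    raise ValueError(f"unknown model: {model}")

if __name__ == "__main__":
    status = {"father": False, "mother": False, "child": True}
    variant = {"father": 1, "mother": 1, "child": 2}
    print(fits_model(variant, status, "recessive"))
    print(fits_model(variant, status, "dominant"))
```

A hard filter like this discards a variant as soon as a single genotype call is wrong, which is one reason a ranked list is more forgiving on real data.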

SPRING - published in PLOS Genetics
Like MendelScan, this is another tool for variant prioritization. It also has a web version for those who are not familiar with the command line. The tool takes a list of seed genes already known to be involved in the pathology or in similar phenotypes, plus a list of your candidate missense variants, and gives you a ranked list of the variants that have a high probability of being associated with the disease. This tool works well for diseases with high genetic heterogeneity, where you can easily and confidently build a list of seed genes. SPRING can then be really useful in prioritizing candidate variants emerging from new studies.

PRADA - published in Bioinformatics
This tool is focused on RNA-Seq and provides a complete framework for the analysis of this kind of data. It is a complete solution since it can perform several kinds of analysis starting from raw paired-end RNA-seq data: gene expression levels, quality metrics, unsupervised and supervised detection of fusion transcripts, detection of intragenic fusion variants, homology scores and fusion frame classification. As the project page reports:
"PRADA currently supports 7 modules to process and identify abnormalities from RNAseq data:
preprocess: Generates aligned and recalibrated BAM files.
expression: Generates gene expression (RPKM) and quality metrics.
fusion: Identifies candidate gene fusions.
guess-ft: Supervised search for fusion transcripts.
guess-if: Supervised search for intragenic fusions.
homology: Calculates homology between given two genes.
frame: Predicts functional consequence of fusion transcript"
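
For readers new to the RPKM unit mentioned in the expression module, this is the standard definition (reads per kilobase of transcript per million mapped reads) written as a small Python function. It only illustrates the formula; the function name and arguments are my own and this is not PRADA code.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

if __name__ == "__main__":
    # 500 reads on a 2 kb gene in a library of 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```

Dividing by both gene length and library size is what lets you compare genes of different lengths and samples sequenced at different depths.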

This is a good starting point for those not familiar with RNA-Seq!
