Sunday, 30 June 2013

PubMed Highlight: Prioritization of synonymous variants

The final step of variant prioritization is a key point in NGS studies focused on the identification of disease-causing mutations. Until now, all the tools developed in this area have considered only missense mutations, relying on various algorithms and on integration with known information to suggest the best causative variants within a list of candidates. However, recent studies have shown that synonymous mutations can also be responsible for disease. The new Silent Variant Analyzer (SilVA), described by Buske et al. in Bioinformatics, is the first effort to prioritize synonymous variants and identify the ones that may be deleterious. I'm sorry, but it seems that we can no longer throw away synonymous SNVs to simplify data analysis...
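For readers less familiar with the term, here is a minimal Python sketch of what makes an SNV "synonymous": the mutated codon still encodes the same amino acid under the standard genetic code. The function name `is_synonymous` and its arguments are our own illustration and are not part of SilVA.

```python
# Standard genetic code, built from the usual compact TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def is_synonymous(codon: str, offset: int, alt: str) -> bool:
    """Return True if replacing codon[offset] with alt keeps the amino acid.

    Hypothetical helper for illustration: codon is a 3-letter DNA codon on
    the coding strand, offset is 0, 1 or 2, alt is the substituted base.
    """
    mutated = codon[:offset] + alt + codon[offset + 1:]
    return CODON_TABLE[codon] == CODON_TABLE[mutated]
```

A filter based on a check like this is exactly what many pipelines use to discard silent variants, which is the step SilVA suggests we reconsider.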

Identification of deleterious synonymous variants in human genomes
Orion J. Buske, AshokKumar Manickaraj, Seema Mital, Peter N. Ray and Michael Brudno

Motivation: The prioritization and identification of disease-causing mutations is one of the most significant challenges in medical genomics. Currently available methods address this problem for non-synonymous single nucleotide variants (SNVs) and variation in promoters/enhancers; however, recent research has implicated synonymous (silent) exonic mutations in a number of disorders.
Results: We have curated 33 such variants from literature and developed the Silent Variant Analyzer (SilVA), a machine-learning approach to separate these from among a large set of rare polymorphisms. We evaluate SilVA’s performance on in silico ‘infection’ experiments, in which we implant known disease-causing mutations into a human genome, and show that for 15 of 33 disorders, we rank the implanted mutation among the top five most deleterious ones. Furthermore, we apply the SilVA method to two additional datasets: synonymous variants associated with Meckel syndrome, and a collection of silent variants clinically observed and stratified by a molecular diagnostics laboratory, and show that SilVA is able to accurately predict the harmfulness of silent variants in these datasets.
Availability: SilVA is open source and is freely available from the project website:

Thursday, 27 June 2013

PubMed Highlight: The state of the art in Genomic Medicine

As people interested in genomic studies, we cannot miss this review published in Science Translational Medicine, which gives a comprehensive overview of the impact and future perspectives of genomics applied to medicine. The authors guide you through a decade of genomic research that led to the identification of genetic causes for many Mendelian diseases as well as the dissection of genetic factors underlying complex diseases. They also show how recent advances in sequencing technology have finally allowed the development of genomics-driven clinical care of patients, at least in some fields such as cancer pharmacogenomics and genetic diagnosis.

Personalized medicine, the final goal that pushed us to decipher the whole DNA sequence, now seems close... or at least NGS has brought this goal within grasp.

Genomic Medicine: A Decade of Successes, Challenges, and Opportunities
Jeanette J. McCarthy, Howard L. McLeod and Geoffrey S. Ginsburg

Sci Transl Med 12 June 2013: Vol. 5, Issue 189, p. 189sr4 

Genomic medicine—an aspirational term 10 years ago—is gaining momentum across the entire clinical continuum from risk assessment in healthy individuals to genome-guided treatment in patients with complex diseases. We review the latest achievements in genome research and their impact on medicine, primarily in the past decade. In most cases, genomic medicine tools remain in the realm of research, but some tools are crossing over into clinical application, where they have the potential to markedly alter the clinical care of patients. In this State of the Art Review, we highlight notable examples including the use of next-generation sequencing in cancer pharmacogenomics, in the diagnosis of rare disorders, and in the tracking of infectious disease outbreaks. We also discuss progress in dissecting the molecular basis of common diseases, the role of the host microbiome, the identification of drug response biomarkers, and the repurposing of drugs. The significant challenges of implementing genomic medicine are examined, along with the innovative solutions being sought. These challenges include the difficulty in establishing clinical validity and utility of tests, how to increase awareness and promote their uptake by clinicians, a changing regulatory and coverage landscape, the need for education, and addressing the ethical aspects of genomics for patients and society. Finally, we consider the future of genomics in medicine and offer a glimpse of the forces shaping genomic medicine, such as fundamental shifts in how we define disease, how medicine is delivered to patients, and how consumers are managing their own health and affecting change.

Tuesday, 25 June 2013

Computational diet: how to make your Gb human genome as light as a few Mb

With the technology rapidly developing, whole exome/genome sequencing of individual subjects is nowadays being performed by many labs all around the world. From the first examples of individual genomes, such as the complete DNA of Venter or Watson, we have now reached a point where the entire 6 billion bases can be sequenced in about a day. With such a high data production rate, data storage has suddenly emerged as a painful thorn in the NGS boot, getting worse with every (sequencing) run!
In a future where genome-based personalized medicine becomes reality, your DNA sequence will need to be stored lifelong, like any other medical record.
Long-term storage of at least the complete DNA sequence could be a crucial factor, since as new genotype-phenotype correlations are discovered the owner of the genome could be updated with the new relevant information.

Different solutions have been proposed to reduce the amount of information that has to be stored for a single genome sequence, so as to make storage of large genomic datasets feasible. At least the final exome/genome sequence of an individual can now be reduced to a few Mb of disk space, so that you can accommodate several of them even on a standard hard disk.
This incredible result is achieved using various compression algorithms. The first generation basically used information on repetitive regions in the genome to reduce the final file size to a few hundred Mb. Second-generation algorithms are instead based on a reference sequence and only store the differences between the reference and the sequenced DNA. The best performing tool until now was DNAZip, which reduced the Watson genome sequence to only 4 Mb... something that you can easily share as an email attachment, as the authors stated. Now Pavlichin et al. from Stanford University have pushed compression even further. Their solution is based on the same approach and takes advantage of the dbSNP database, so that the positions of already known SNPs don't have to be stored in the final file. To further shrink the file size, positions of novel SNPs are not stored individually but as distances from the previous SNP. These improvements, together with a brand new compression function and a haplotype-based trick, push the file size for a single complete genome down to 2.5 Mb.
The main disadvantage is that you need the reference sequence and the dbSNP database in order to reconstruct your genome, but when dealing with thousands of genomes this is a minor drawback.
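
To make the idea more concrete, here is a toy Python sketch of the two tricks described above: known SNPs are stored as one flag per entry of a shared catalog, and novel SNPs are stored as gaps from the previous novel position. This is only our illustration under simplifying assumptions (a tiny hand-made catalog standing in for dbSNP, SNVs only, gaps measured between novel SNPs); it is not the actual format used by DNAZip or by Pavlichin et al., and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Catalog = List[Tuple[int, str]]  # sorted (position, alt base), stands in for dbSNP


def encode(variants: Dict[int, str], catalog: Catalog):
    """Split variants into catalog flags and gap-encoded novel SNPs."""
    catalog_set = set(catalog)
    # One flag per catalog entry: no position needs to be stored for known SNPs.
    flags = [1 if variants.get(pos) == alt else 0 for pos, alt in catalog]
    # Novel SNPs: store the gap from the previous novel SNP, not the position.
    novel = sorted((p, a) for p, a in variants.items() if (p, a) not in catalog_set)
    deltas = []
    prev = 0
    for pos, alt in novel:
        deltas.append((pos - prev, alt))
        prev = pos
    return flags, deltas


def decode(flags: List[int], deltas: List[Tuple[int, str]], catalog: Catalog) -> Dict[int, str]:
    """Rebuild the variant map; requires the same catalog used for encoding."""
    variants = {pos: alt for (pos, alt), flag in zip(catalog, flags) if flag}
    pos = 0
    for gap, alt in deltas:
        pos += gap
        variants[pos] = alt
    return variants
```

Small gaps need fewer bits than absolute genomic positions, which is what makes the follow-up entropy coding step effective. Note that `decode` cannot work without the catalog, which mirrors the drawback mentioned above.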

Unfortunately, less can be done for the bulk of sequence-related information that is so useful for research purposes (such as base and read quality scores, genotype scores, alignment metrics and so on): it seems that you still need a tower of drives to store it all! However, a lot of effort has been devoted to this area too, since the possibility of reanalyzing older datasets with new techniques and new tools can lead to new discoveries. There is now also a prize for data compression, the Sequence Squeeze competition, which also provides updated information on the best performing tools.

PubMed Highlight: Benchmarking of short sequence aligners for NGS

The introduction of NGS technology has posed the problem of fast and accurate mapping of the millions of short sequences produced in every single experiment. To address this challenge, a number of alignment tools have been developed and updated, each one optimized for a specific type of input or a specific kind of alignment problem (gapped, ungapped and so on).
Even if every single aligner has its strengths and pitfalls, a detailed comparison of the overall performance of the various tools is always useful when it comes to choosing the best one for your own analysis. This recent paper published in BMC Bioinformatics reports a benchmark of the most popular aligners, such as Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST. Moreover, the authors have developed a benchmarking suite that can be used to assess the performance of any other aligner of interest.

If you need a compass to orient yourself in the world of aligners, don't forget to also visit the HTS mapper page at EBI, which is really useful and summarizes the main features of each software tool.

Benchmarking short sequence mapping tools
Ayat Hatem, Doruk Bozdağ, Amanda E Toland and Ümit V Çatalyürek
BMC Bioinformatics 2013, 14:184. Published: 7 June 2013

The development of next-generation sequencing instruments has led to the generation of millions of short sequences in a single run. The process of aligning these reads to a reference genome is time consuming and demands the development of fast and accurate alignment tools. However, the current proposed tools make different compromises between the accuracy and the speed of mapping. Moreover, many important aspects are overlooked while comparing the performance of a newly developed tool to the state of the art. Therefore, there is a need for an objective evaluation method that covers all the aspects. In this work, we introduce a benchmarking suite to extensively analyze sequencing tools with respect to various aspects and provide an objective comparison.
We applied our benchmarking tests on 9 well known mapping tools, namely, Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST (mrFAST) using synthetic data and real RNA-Seq data. MAQ and RMAP are based on building hash tables for the reads, whereas the remaining tools are based on indexing the reference genome. The benchmarking tests reveal the strengths and weaknesses of each tool. The results show that no single tool outperforms all others in all metrics. However, Bowtie maintained the best throughput for most of the tests while BWA performed better for longer read lengths. The benchmarking tests are not restricted to the mentioned tools and can be further applied to others.
The mapping process is still a hard problem that is affected by many factors. In this work, we provided a benchmarking suite that reveals and evaluates the different factors affecting the mapping process. Still, there is no tool that outperforms all of the others in all the tests. Therefore, the end user should clearly specify his needs in order to choose the tool that provides the best results.

Wednesday, 19 June 2013

PubMed Highlight: When is coverage deep enough for SNP detection?

This interesting article just published in BMC Bioinformatics gives a scientific answer to one of the most discussed questions in NGS: how many reads do you need for reliable SNP calling? And what is the effect of coverage on the power of your SNP detection study?

And the answer is: "We estimate a local read depth of 13X is required to detect the alleles and genotype of a heterozygous SNV 95% of the time, but only 3X for a homozygous SNV. At a mean on-target read depth of 20X, commonly used for rare disease exome sequencing studies, we predict 5-15% of heterozygous and 1-4% of homozygous SNVs in the targeted regions will be missed."

A must-have paper!

Quantifying single nucleotide variant detection sensitivity in exome sequencing.
BMC Bioinformatics. 2013 Jun 18;14(1):195

Meynert AM, Bicknell LS, Hurles ME, Jackson AP, Taylor MS. 

BACKGROUND: The targeted capture and sequencing of genomic regions has rapidly demonstrated its utility in genetic studies. Inherent in this technology is considerable heterogeneity of target coverage and this is expected to systematically impact our sensitivity to detect genuine polymorphisms. To fully interpret the polymorphisms identified in a genetic study it is often essential to both detect polymorphisms and to understand where and with what probability real polymorphisms may have been missed.
RESULTS: Using down-sampling of 30 deeply sequenced exomes and a set of gold-standard single nucleotide variant (SNV) genotype calls for each sample, we developed an empirical model relating the read depth at a polymorphic site to the probability of calling the correct genotype at that site. We find that measured sensitivity in SNV detection is substantially worse than that predicted from the naive expectation of sampling from a binomial. This calibrated model allows us to produce single nucleotide resolution SNV sensitivity estimates which can be merged to give summary sensitivity measures for any arbitrary partition of the target sequences (nucleotide, exon, gene, pathway, exome). These metrics are directly comparable between platforms and can be combined between samples to give "power estimates" for an entire study. We estimate a local read depth of 13X is required to detect the alleles and genotype of a heterozygous SNV 95% of the time, but only 3X for a homozygous SNV. At a mean on-target read depth of 20X, commonly used for rare disease exome sequencing studies, we predict 5-15% of heterozygous and 1-4% of homozygous SNVs in the targeted regions will be missed.
CONCLUSIONS: Non-reference alleles in the heterozygote state have a high chance of being missed when commonly applied read coverage thresholds are used despite the widely held assumption that there is good polymorphism detection at these coverage levels. Such alleles are likely to be of functional importance in population based studies of rare diseases, somatic mutations in cancer and explaining the "missing heritability" of quantitative traits.
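
The "naive expectation of sampling from a binomial" mentioned in the abstract can be sketched in a few lines of Python. The version below is our own simplification, not the authors' calibrated model: it assumes each read at a heterozygous site shows the alternate allele with probability 0.5, and that a caller needs a minimum number of reads for each allele. The thresholds `min_alt_reads` and `min_ref_reads` are hypothetical values chosen only for illustration.

```python
from math import comb


def naive_het_sensitivity(depth: int,
                          min_alt_reads: int = 2,
                          min_ref_reads: int = 2,
                          p_alt: float = 0.5) -> float:
    """Naive binomial chance of seeing both alleles of a heterozygous SNV.

    Sums P(k alt reads out of `depth`) over every k that leaves at least
    min_alt_reads alternate reads and min_ref_reads reference reads.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    total = 0.0
    for k in range(min_alt_reads, depth - min_ref_reads + 1):
        total += comb(depth, k) * p_alt ** k * (1 - p_alt) ** (depth - k)
    return total
```

Keep in mind the key message of the paper: the sensitivity measured on real exomes is substantially worse than this kind of idealized estimate, so a calculation like this should be read as an optimistic upper bound.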

Thursday, 13 June 2013

PubMed Highlight: Comparison between Ion Proton and Illumina HiSeq for exome sequencing

We have already seen papers comparing the Ion PGM with Illumina technology. Now it is time to muscle up and take the competition to a larger scale! Could the new Proton be a real challenger for the HiSeq king?

Sunday, 9 June 2013

PubMed Highlight: The Netherlands plans to sequence the full genomes of 250 local trios

After the last post on animal genome sequencing, let's get back to some human genomes.

As part of the large European Biobanking and Biomolecular Research Infrastructure (BBMRI), the Dutch group has launched an initiative to sequence at 15X coverage the full DNA of 250 trios from all provinces of the country, with the aim of characterizing genetic variability in the Dutch population. As the authors report:
"The family-based design represents a unique resource to assess the frequency of regional variants, accurately reconstruct haplotypes by family-based phasing, characterize short indels and complex structural variants, and establish the rate of de novo mutational events. GoNL will also serve as a reference panel for imputation in the available genome-wide association studies in Dutch and other cohorts to refine association signals and uncover population-specific variants".

Genomic data will also be integrated with detailed geographic and phenotype information, providing a valuable resource for genotype-phenotype correlation studies and for the evaluation of variant distribution across local and European populations. The initiative has been named Genome of the Netherlands (GoNL) and the data are accessible at the dedicated web page.

Eur J Hum Genet. 2013 May 29
Boomsma DI, Wijmenga C, Slagboom EP, Swertz MA, Karssen LC, Abdellaoui A, Ye K, Guryev V, Vermaat M, van Dijk F, Francioli LC, Jan Hottenga J, Laros JF, Li Q, Li Y, Cao H, Chen R, Du Y, Li N, Cao S, van Setten J, Menelaou A, Pulit SL, Hehir-Kwa JY, Beekman M, Elbers CC, Byelas H, de Craen AJ, Deelen P, Dijkstra M, T den Dunnen J, de Knijff P, Houwing-Duistermaat J, Koval V, Estrada K, Hofman A, Kanterakis A, Enckevort DV, Mai H, Kattenberg M, van Leeuwen EM, Neerincx PB, Oostra B, Rivadeneira F, Suchiman EH, Uitterlinden AG, Willemsen G, Wolffenbuttel BH, Wang J, de Bakker PI, van Ommen GJ, van Duijn CM.

Within the Netherlands a national network of biobanks has been established (Biobanking and Biomolecular Research Infrastructure-Netherlands (BBMRI-NL)) as a national node of the European BBMRI. One of the aims of BBMRI-NL is to enrich biobanks with different types of molecular and phenotype data. Here, we describe the Genome of the Netherlands (GoNL), one of the projects within BBMRI-NL. GoNL is a whole-genome-sequencing project in a representative sample consisting of 250 trio-families from all provinces in the Netherlands, which aims to characterize DNA sequence variation in the Dutch population. The parent-offspring trios include adult individuals ranging in age from 19 to 87 years (mean=53 years; SD=16 years) from birth cohorts 1910-1994. Sequencing was done on blood-derived DNA from uncultured cells and accomplished coverage was 14-15x. The family-based design represents a unique resource to assess the frequency of regional variants, accurately reconstruct haplotypes by family-based phasing, characterize short indels and complex structural variants, and establish the rate of de novo mutational events. GoNL will also serve as a reference panel for imputation in the available genome-wide association studies in Dutch and other cohorts to refine association signals and uncover population-specific variants. GoNL will create a catalog of human genetic variation in this sample that is uniquely characterized with respect to micro-geographic location and a wide range of phenotypes. The resource will be made available to the research and medical community to guide the interpretation of sequencing projects. The present paper summarizes the global characteristics of the project

The White Tigers, the White Gorillas and the Green Turtle

We have been quite busy in the past weeks setting up our own NGS lab, but we haven't stopped looking around for genomics news and new genomes!
Three articles have been published recently: the complete sequences of two species of turtle, to gain insight into the evolution of their peculiar body structure; the complete genome sequence of the only known white gorilla; and a large sequencing and genome-wide association study to identify the cause of albinism in white tigers.

The first study, which appeared at the end of April in Nature Genetics (The draft genomes of soft-shell turtle and green sea turtle yield insights into the development and evolution of the turtle-specific body plan. Wang et al.), describes the complete genome assembly of two kinds of turtle (P. sinensis and C. mydas) and also reports interesting results from embryo studies and comparative developmental studies against chicken embryos. Taken together, these data give a comprehensive picture of turtle evolution and new insight into the crucial factors driving their peculiar body structure. From extensive analysis of embryonic gene expression and comparison with chicken, the authors found that turtle development initially follows the common vertebrate pattern but then diverges after stage TK11; they were also able to identify a set of 233 genes that may be crucial for the turtle-specific body plan.

"Taken together, these results suggest that turtles indeed conform to the developmental hourglass model (Supplementary Fig. 15) by first establishing an ancient vertebrate body plan and by developing turtle-specific characteristics thereafter. The above results suggest that turtle-specific global repatterning of gene regulation begins after TK11 or the phylotypic period. Although turtle and chicken express many shared developmental genes in the embryo during the putative phylotypic period (Fig. 4a and Supplementary Tables 27and 28) and have the fewest expanded or contracted gene family members expressed (Supplementary Fig. 16) at this stage, later stages showed increasing differences in their molecular patterns. We found 233 genes that showed turtle-specific increasing expression patterns after the phylotype (Fig. 4b). Considering that the chicken orthologs did not show this type of increasing expression (Supplementary Figs. 17 and 18), these 233 genes represent attractive candidates for clarifying the genomic nature of turtle-specific morphological oddities"

This paper combines different techniques and involves a lot of NGS experiments. First of all, the genome sequencing of the two specimens was conducted by paired-end sequencing on the Illumina HiSeq 2000 using both short and long insert libraries, for a median coverage of 105X and 82X (estimated genome size of about 2.2 Gb). The complete set of turtle transcripts was assessed by computational analysis, but also supported by RNA-Seq data of the soft-shell turtle obtained with three different approaches: Titanium sequencing, Illumina strand-specific and non-strand-specific RNA-seq, for a total of about 37 Gb of sequence. Finally, the authors investigated miRNAs as well by sequencing small RNAs on the Illumina platform. They then used computational tools and comparison with known miRNAs from other species to infer potential binding sites and conserved miRNA species.

The other two papers deal with some more "exotic" species, trying to dissect the molecular origin of albinism in a white gorilla known as Snowflake and in a family of white tigers raised in captivity. Oculocutaneous albinism in humans is known to be related to mutations in the SLC45A2 gene, and the authors found that this is also the case in the two species considered.
The complete sequence of the white gorilla was published at the end of May in BMC Genomics (The genome sequencing of an albino Western lowland gorilla reveals inbreeding in the wild. Prado-Martinez et al.). The authors reported the 19X whole genome sequencing of the only known white gorilla and compared these data with the human reference genome and two other already sequenced gorilla genomes, searching for SNVs in albinism-related genes. Data analysis led to the identification of a missense mutation in the SLC45A2 gene resulting in the G518R amino acid substitution, which should alter the function of the protein channel, thus leading to albinism in the white gorilla. Interestingly, the genome of Snowflake turned out to contain large runs of homozygosity (ROH), suggesting that its defect may result from a high rate of inbreeding.

The study on white tigers was published at the beginning of June in Current Biology (The genetic basis of white tigers. Xu et al.). The authors performed an extensive genomic analysis on a family of 16 white tigers raised in captivity to track down the molecular defect responsible for albinism in this species. Applying a combined approach based on genome-wide association mapping with restriction-site-associated DNA sequencing (RAD-seq), followed by whole-genome sequencing (WGS) of the three parents, they identified the amino acid substitution A477V in the SLC45A2 gene as the causative mutation. This finding was confirmed by validation in 130 unrelated tigers and by three-dimensional homology modeling, which suggests that the substitution may partially block the transporter channel cavity.

Tuesday, 4 June 2013

PubMed Highlights: Disease Gene Prioritization

The next generation sequencing field is reaching solid robustness as far as techniques are concerned, and most of the weaknesses that characterized the first steps of this new era have now been fixed or reduced. At the current state of the art, what still represents a bottleneck in finding the candidate disease gene, especially in exome and genome sequencing studies, is the gene prioritization step. Linkage or association data are not always available, and the number of variants to focus on remains high. The ability to prioritize the genes in a list can be crucial in a genetic study, and many tools are now available. Here we suggest a chapter, taken from the "Translational Bioinformatics" collection of PLOS Computational Biology, in which several tools are described along with some of their successful applications.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.


Disease-causing aberrations in the normal function of a gene define that gene as a disease gene. Proving a causal link between a gene and a disease experimentally is expensive and time-consuming. Comprehensive prioritization of candidate genes prior to experimental testing drastically reduces the associated costs. Computational gene prioritization is based on various pieces of correlative evidence that associate each gene with the given disease and suggest possible causal links. A fair amount of this evidence comes from high-throughput experimentation. Thus, well-developed methods are necessary to reliably deal with the quantity of information at hand. Existing gene prioritization techniques already significantly improve the outcomes of targeted experimental studies. Faster and more reliable techniques that account for novel data types are necessary for the development of new diagnostics, treatments, and cure for many diseases.