Thursday, 2 May 2013

PubMed Highlight: Zebrafish genome sequenced and the systematic genome-wide analysis of zebrafish protein-coding gene function

A new assembly of the zebrafish (D. rerio) genome has recently been published in Nature, along with a description of the complete set of proteins encoded in the teleost DNA and their relationship to human orthologs. In a second paper, the research group proposes a complete genome-wide analysis of genotype-phenotype correlation for every single protein-coding gene in the assembly.

The first paper describes the latest assembly of the D. rerio genome.
The first assembly of the zebrafish genome was made available in 2002 (Zv1), and now the Zebrafish Genome Project has produced the latest detailed assembly based on NGS Illumina technology. The new version represents a great improvement: it provides better coverage of the entire genome sequence, helps resolve tricky artifacts that were still scattered along the fish genome and allows better identification of the complete set of protein-coding genes (about 26,000). The catalog of small RNAs and repeated elements has also been updated, and some mis-annotated genes have been removed (mostly genes from other species that were assigned to zebrafish in previous assemblies).
Having a complete and robust assembly of this teleost genome is a key factor for the research community, given its important role as an animal model for studies of development and for characterizing the functional impact of mutations in disease genes.

The zebrafish reference genome sequence and its relationship to the human genome. Nature. 2013 Apr 25;496(7446):498-503.
Howe K, et al.
Wellcome Trust Sanger Institute

Abstract

Zebrafish have become a popular organism for the study of vertebrate gene function. The virtually transparent embryos of this species, and the ability to accelerate genetic studies by gene knockdown or overexpression, have led to the widespread use of zebrafish in the detailed investigation of vertebrate gene function and increasingly, the study of human genetic disease. However, for effective modelling of human genetic disease it is important to understand the extent to which zebrafish genes and gene structures are related to orthologous human genes. To examine this, we generated a high-quality sequence assembly of the zebrafish genome, made up of an overlapping set of completely sequenced large-insert clones that were ordered and oriented using a high-resolution high-density meiotic map. Detailed automatic and manual annotation provides evidence of more than 26,000 protein-coding genes, the largest gene set of any vertebrate so far sequenced. Comparison to the human reference genome shows that approximately 70% of human genes have at least one obvious zebrafish orthologue. In addition, the high quality of this genome assembly provides a clearer understanding of key genomic features such as a unique repeat content, a scarcity of pseudogenes, an enrichment of zebrafish-specific genes on chromosome 4 and chromosomal regions that influence sex determination.

The second paper describes a project for the genome-wide characterization of mutations in every single protein-coding gene in D. rerio.
In this perspective, the research group working on the D. rerio genome has published another interesting paper describing their active project to identify and phenotype the disruptive mutations in every zebrafish protein-coding gene, using high-throughput sequencing and efficient chemical mutagenesis. They have already identified potentially disruptive mutations in more than 38% of all known zebrafish protein-coding genes and assessed the effects of each mutation during embryogenesis. Moreover, they have analysed the phenotypic consequences of over 1,000 alleles, making all data available to the community for genotype-phenotype correlation studies.

A systematic genome-wide analysis of zebrafish protein-coding gene function. Nature. 2013 Apr 25;496(7446):494-7.
Kettleborough RN, Busch-Nentwich EM, Harvey SA, Dooley CM, de Bruijn E, van Eeden F, Sealy I, White RJ, Herd C, Nijman IJ, Fényes F, Mehroke S, Scahill C, Gibbons R, Wali N, Carruthers S, Hall A, Yen J, Cuppen E, Stemple DL.
Wellcome Trust Sanger Institute

Abstract

Since the publication of the human reference genome, the identities of specific genes associated with human diseases are being discovered at a rapid rate. A central problem is that the biological activity of these genes is often unclear. Detailed investigations in model vertebrate organisms, typically mice, have been essential for understanding the activities of many orthologues of these disease-associated genes. Although gene-targeting approaches and phenotype analysis have led to a detailed understanding of nearly 6,000 protein-coding genes, this number falls considerably short of the more than 22,000 mouse protein-coding genes. Similarly, in zebrafish genetics, one-by-one gene studies using positional cloning, insertional mutagenesis, antisense morpholino oligonucleotides, targeted re-sequencing, and zinc finger and TAL endonucleases have made substantial contributions to our understanding of the biological activity of vertebrate genes, but again the number of genes studied falls well short of the more than 26,000 zebrafish protein-coding genes. Importantly, for both mice and zebrafish, none of these strategies are particularly suited to the rapid generation of knockouts in thousands of genes and the assessment of their biological activity. Here we describe an active project that aims to identify and phenotype the disruptive mutations in every zebrafish protein-coding gene, using a well-annotated zebrafish reference genome sequence, high-throughput sequencing and efficient chemical mutagenesis. So far we have identified potentially disruptive mutations in more than 38% of all known zebrafish protein-coding genes. We have developed a multi-allelic phenotyping scheme to efficiently assess the effects of each allele during embryogenesis and have analysed the phenotypic consequences of over 1,000 alleles. All mutant alleles and data are available to the community and our phenotyping scheme is adaptable to phenotypic analysis beyond embryogenesis.

Monday, 15 April 2013

CLARITY and the new frontiers of brain-imaging: really a brilliant idea!

This is not actually genomic news, but the new CLARITY brain imaging technique that just appeared in Nature is so fascinating that I have to share the video!

This demonstrates how far we have come in our ability to map the activity of single cells and opens amazing possibilities for future studies, promising to provide knowledge on how the brain responds to stimuli or coordinates body activities.
The construction of a complete and informative map of neuron interactions seems feasible, and it is a hot topic right now. The US BRAIN project (recently funded with $100 million by the Obama administration) and the European Human Brain Project (HBP) (funded with 1 billion euros over ten years as one of the EU FET Flagships) are two huge international initiatives that have just started to pursue this ambitious goal.

Some have already pointed to the brain map as the third revolutionary achievement after the Human Genome Project and the ENCODE project.

See your brain, with plenty of colourful neuron cells...
This time you can literally say: this is a brilliant idea!
And THIS is amazing science!

Wednesday, 10 April 2013

PubMed Highlight: Updating benchtop sequencing performance comparison

The comparison of available NGS benchtop sequencers continues as every platform rapidly upgrades its performance...or at least claims to have done so!

Accurate evaluation of the current performance of each technology is really difficult in such a quickly developing field as NGS...However, a snapshot of the NGS technology scenario is of great help to anyone evaluating which platform best fits their research needs!

Read the complete paper published in Nature Biotechnology!

What's new in this paper compared to the last benchmarks from BGI (published in Journal of Biomedicine and Biotechnology) and the Wellcome Trust Sanger Institute (published in BMC Genomics)?

Two competitors emerge as the leaders in the field: Ion PGM and MiSeq. The data show that some technological and analytic gaps have been closed, with both platforms performing about the same in terms of substitution detection accuracy. MiSeq still performs better than PGM in detecting small indels (about 100-fold lower error rate), with most of the errors due to trouble in sequencing homopolymer runs. MiSeq still has the lowest cost per Mb, while PGM is still the faster and more flexible of the two.
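
To make the homopolymer issue concrete, here is a minimal Python sketch that locates homopolymer runs (stretches of one repeated base) in a read, the kind of region where indel errors tend to concentrate. The function name `homopolymer_runs` and the `min_len=4` threshold are my own illustrative assumptions, not something taken from the paper.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for each run of a single repeated base
    that is at least min_len long (min_len is an illustrative threshold)."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs

# Toy read with a 5-base T run and a 4-base A run
print(homopolymer_runs("ACGTTTTTACAAAAG"))
```

Flagging these positions before comparing indel calls between platforms is one simple way to see how much of the disagreement sits in homopolymers.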

A good recap of the findings can be found in this post from GenomeWeb, where you can also read the first official answers from both Illumina and Life Tech.

Table from Junemann et al., Nature Biotechnology 31(4): 294-296, April 2013

Tuesday, 9 April 2013

A walking-dead pigeon and the return of the mummy...you can do NGS starting from any source!!!

Here we are again with some surprising data from NGS studies.

Besides providing some interesting (and, why not, funny) scientific stories, the following news clearly demonstrates how, relying on cutting-edge sample preparation, NGS technology allows sequencing from basically any input material. The ability to obtain a good sequence of an entire genome starting from a few nanograms, or even picograms, of input DNA opens the way to amazing applications. From cancer genomics to prenatal screening, the ability to sequence the very small amounts of cell-free DNA present in circulating blood has already provided really exciting results. These techniques also promise great improvements in epidemic control and in the detection of contamination and pathogens in food, water or any other sample.
The topic of sequencing from nanograms or picograms is extensively covered in this post on CoreGenomics, which also provides some examples of recently published papers and discusses the new library preparation kits that make NGS from low-input DNA fast and easy!

Now the stories!

In September 2012 we reported on Revive and Restore (see our post here), a company founded with the ambitious and controversial aim of sequencing and reconstructing the genomes of extinct species, with the ultimate goal of eventually bringing them back to life.
We were not sure how far this initiative would go, but last month they surprised us again with the announcement of an actual project to resurrect the extinct passenger pigeon.

Ben Novak, a 26-year-old geneticist, has received support from the company to sequence the entire DNA of this species from a tissue sample received from Chicago's Field Museum in 2011. Working with evolutionary biologist Beth Shapiro at the University of California, Santa Cruz, he plans to complete the genomes of the passenger pigeon and its closest living relative, the band-tailed pigeon. The extinct DNA will then be aligned to the living one to identify all the differences, and finally massive mutagenesis will be performed on band-tailed pigeon DNA to re-create the complete sequence of the passenger species.
However, the idea not only requires hard work, it could also get really dicey. Indeed, according to Shapiro, "because the last common ancestor of the two species flew about 30 million years ago, their genomes will likely differ at millions of locations." Fitting the pieces together will be grueling, if not impossible. GenomeWeb has a post on this, and Wired has also covered the story in this article by Kelly Servick.

Whether one considers this real science or fantasy science, the general topic of de-extinction is getting increasing attention now that DNA sequencing technology makes it possible to effectively assemble genomes from ancient and degraded samples (remember, for example, the Neanderthal genome or the mammoth genome). The collection of DNA from different species is part of some huge projects intended to preserve and study biodiversity, and experts are now discussing whether, and if so how, we should deal with species that go extinct over time. If we as the human race are responsible for the disappearance of a specific organism and we have the ability to bring it back to life, should we do it? What are the risks of re-introducing extinct species into our ecosystem?

Recently a TEDx event was organized to explore the topic of de-extinction, and the recent advances in the field have also received attention from National Geographic and The New York Times (the latter dealing with bringing an extinct frog back to life). Revive and Restore has a list of candidate organisms waiting for de-extinction and is searching for collaborators! Looking at their list, I would like to see a dodo walking again in the garden...but I have my doubts about a saber-toothed cat!!

The second piece of news comes directly from the scientific literature. In their paper recently published in the Journal of Applied Genetics, Khairat et al. from the University of Tübingen report the first metagenome analysis of ancient Egyptian mummies. Their dataset comprises seven sequencing experiments performed on DNA obtained from five randomly selected Third Intermediate to Graeco-Roman Egyptian mummies (806 BC-124 AD) and two unearthed pre-contact Bolivian lowland skeletons. Analyzing the data, they were able to identify genetic material from bacteria, presumably due to contamination from the mummies' conservation procedures, and also from plants, potentially associated with their use in embalming reagents. The paper demonstrates that even DNA from ancient mummies can be a proper template for NGS, despite its age and the several treatments performed on the samples in the course of the conservation protocols.

It all starts from high-quality reads in NGS!

If you are performing an NGS-based experiment, first of all you want to be sure that you are starting from high-quality raw data. Current technology has achieved outstanding robustness, but the quality check on sequencing reads remains the first step in every analysis.
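
To give an idea of what such a quality check measures, here is a minimal Python sketch that parses FASTQ records and computes the mean Phred score of each read. It assumes plain four-line records and Phred+33 encoding (an assumption: some older Illumina files use Phred+64), and the function names `parse_fastq` and `mean_quality` are illustrative, not part of any of the tools mentioned in this post.

```python
import io
from statistics import mean

def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a FASTQ stream
    with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual

def phred_scores(qual, offset=33):
    """Convert a quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]

def mean_quality(qual, offset=33):
    """Mean Phred score of one read."""
    return mean(phred_scores(qual, offset))

# Two toy reads: one uniformly good, one with a poor start
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")
for read_id, seq, qual in parse_fastq(example):
    print(read_id, len(seq), mean_quality(qual))
```

Dedicated tools go much further (per-position plots, GC content, duplication levels), but the per-base Phred score is the raw ingredient behind most of those reports.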

Several tools are available that return stats and graphs from the analysis of your fastq files and, inspired by a new paper that just appeared in PLoS ONE, I want to report a few solutions that I found useful.

First is FastQC. This is a relatively simple tool which takes your fastq or bam/sam file and reports all the essential stats you need to be sure that nothing has gone wrong with your sequencing. It's based on Java, so it can easily run on almost every platform without the need for tricky installation steps.

You can find it on the official web page at Babraham Bioinformatics (Babraham Institute).

Second is NGS QC Toolkit. This is a set of tools for the quality control of next-generation sequencing data. It accepts data in the popular fastq format and provides detailed results in the form of tables and graphs. Moreover, it allows filtering of high-quality sequence data and includes a few other tools that are helpful in NGS data quality control and analysis (format conversion and trimming of the reads, for example).

It is developed by the Indian National Institute of Plant Genome Research and you can find it at its official page here.

Also take a look at the official paper published in PLoS ONE in 2012 by Patel RK & Jain M.
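
The filtering and trimming steps mentioned above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the toolkit's actual algorithm: the function names and the thresholds (`min_q=20`, `min_fraction=0.7`) are assumptions chosen for the example, and Phred+33 encoding is assumed.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end until the last base has a
    Phred score of at least min_q."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def passes_filter(qual, min_q=20, min_fraction=0.7, offset=33):
    """Keep a read only if at least min_fraction of its bases
    have a Phred score of at least min_q."""
    if not qual:
        return False
    good = sum(1 for c in qual if ord(c) - offset >= min_q)
    return good / len(qual) >= min_fraction

# Toy read whose last three bases are low quality ('#' is Phred 2)
seq, qual = "ACGTACGT", "IIIII###"
print(passes_filter(qual))             # judged on the untrimmed read
print(trim_3prime(seq, qual))          # low-quality tail removed
print(passes_filter(trim_3prime(seq, qual)[1]))
```

Whether you trim first and then filter, or the other way round, changes which reads survive, so it is worth checking which order your chosen tool applies.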

Third is the recent QC-chain tool from the paper I cited above. It comprises a set of user-friendly tools for quality assessment and trimming of raw reads (Parallel-QC). Moreover, it has an interesting feature that allows identification, quantification and filtration of unknown contamination to get high-quality clean reads. The authors state that the tool was optimized for parallel computation, promising a processing speed significantly higher than other QC methods...This could be really useful if you routinely deal with a huge volume of data.

QC-chain is developed by the Computation Biology Team at the Qingdao Institute of Bioenergy and Bioprocess Technology, and can be found here at the official web page.

This one also has an official paper, published in PLoS ONE in 2013 by Zhou Q et al.

Saturday, 30 March 2013

PubMed Highlight: CNV detection from exome sequencing data

FishingCNV: a graphical software package for detecting rare copy number variations in exome sequencing data.

Friday, 29 March 2013

PubMed Highlight: Next-Generation sequencing visualization

I've just finished an intensive course on NGS data analysis where command-line-based solutions were, of course, presented as the best way to manage and make sense of data.
Playing with scripts, Unix commands and the R language makes you feel a sort of bioinformatic power. You start to blame all those wet-lab colleagues spending hours on Excel spreadsheets. You are amazed by the results of your latest programming trick and the effectiveness of your command-line skills. Even if this makes you proud, keep in mind that a screen full of symbols and over-a-million-row tables has, for most biologists and geneticists, the same appeal as the flowing characters of The Matrix...As in the famous movie, not everyone can see the meaning behind the code: most of them will just see a bunch of characters and numbers, doubting that this is the real world!
Good visualization of genomic data from NGS experiments makes your results nicer to look at and easier to explain and explore. Moreover, a colorful alignment of reads in genome-browser style or a Circos graph surely makes a better impact when you show it in your presentations! The scientific community constantly asks for visualization tools that simplify the task of explaining and exploring NGS data, so that they become accessible to everyone, even to the old-school ones.

The latest special issue of Briefings in Bioinformatics offers an extensive review of the main visualization tools, with an overview of their specific advantages and main features. Web-based browsers, UCSC Genome Browser, IGV, Tablet, BamView and GBrowse are all covered, making this issue the ideal answer to the colleague asking you: "I've just received this great NGS data, but what are all these bam and vcf files? I want to see them nicely placed on my favourite chromosome!".

Main articles in the special issue:
Jun Wang, Lei Kong, Ge Gao, and Jingchu Luo
A brief introduction to web-based genome browsers

Robert M. Kuhn, David Haussler, and W. James Kent
The UCSC genome browser and associated tools

Lincoln D. Stein
Using GBrowse 2.0 to visualize and share next-generation sequence data

Oscar Westesson, Mitchell Skinner, and Ian Holmes
Visualizing next-generation sequencing data with JBrowse

Helga Thorvaldsdóttir, James T. Robinson, and Jill P. Mesirov
Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration

Iain Milne, Gordon Stephen, Micha Bayer, Peter J.A. Cock, Leighton Pritchard, Linda Cardle, Paul D. Shaw, and David Marshall
Using Tablet for visual exploration of second-generation sequencing data

Tim Carver, Simon R. Harris, Thomas D. Otto, Matthew Berriman, Julian Parkhill, and Jacqueline A. McQuillan
BamView: visualizing and interpretation of next-generation sequencing read alignments

Michael C. Schatz, Adam M. Phillippy, Daniel D. Sommer, Arthur L. Delcher, Daniela Puiu, Giuseppe Narzisi, Steven L. Salzberg, and Mihai Pop
Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies

Thursday, 28 March 2013

Elementary elements in bioinformatics

These days I'm attending an intensive course on NGS data analysis...Every day we deal with about 9 hours of bioinformatics, both theory and scripting...And tons of useful tools have been cited during the course...
Since the names of these programs are anything but easy to remember, I found myself wishing for a summary that gives a compact and organized overview of, and quick access to, the main ones.

Considering that bioinformatics tricks have become as essential as chemical elements, The Elements of Bioinformatics table from Eagle Genomics is an efficient and funny answer to my needs.
If programming, analyzing DNA data and talking about stats and complex biology don't satisfy your need to look nerdy, using this table to remember strangely named tools should improve your reputation as a real geek!!

Have fun (if you read this blog I'm sure you will!)

Thursday, 21 March 2013

PubMed Highlight: The origin, evolution and functional impact of short insertion-deletion variants identified in 179 human genomes

The role of short indels (<50 bp) as main players shaping human genome variability and contributing to various Mendelian diseases has been underlined by several recent findings.
However, a detailed genome-wide assessment of indel impact and distribution was still missing...until now.

In this interesting paper published in Genome Research, Montgomery et al. address exactly this question, with amazing results. First of all, the authors had to deal with the challenge of calling short indels, which is one of the biggest issues when analyzing NGS data. Starting with DNA sequences from 179 individuals from 3 population groups, they made several optimizations to the standard pipeline used by the 1000 Genomes Project to obtain a set of high-quality indels. Even if indels in homopolymeric regions remain out of reach, the improved pipeline described in the paper is certainly a guideline for anyone working in the field. Among the other interesting findings, the authors confirmed that rates of indel mutagenesis are highly heterogeneous, with 43-48% of indels occurring in 4.03% of the genome (loci defined as indel hotspots by the authors), and they proposed polymerase slippage, together with fork stalling and template switching (FoSTeS), as the main mechanisms originating the indels.
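
To get a feeling for how concentrated the hotspots are, here is a small back-of-the-envelope calculation in Python. It only uses the two figures quoted above (43-48% of indels in 4.03% of the genome); the function name `hotspot_enrichment` and the definition of enrichment as a ratio of indel densities are my own assumptions, not a statistic reported by the authors.

```python
def hotspot_enrichment(indel_fraction, genome_fraction):
    """Ratio between indel density inside hotspots and indel density
    in the rest of the genome (density = share of indels / share of genome)."""
    inside = indel_fraction / genome_fraction
    outside = (1 - indel_fraction) / (1 - genome_fraction)
    return inside / outside

for frac in (0.43, 0.48):
    fold = hotspot_enrichment(frac, 0.0403)
    print(f"{frac:.0%} of indels in 4.03% of the genome: "
          f"{fold:.1f}-fold denser inside hotspots")
```

With these numbers the indel density inside hotspots comes out roughly 18 to 22 times higher than in the remaining genome.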

Take a look!

The origin, evolution and functional impact of short insertion-deletion variants identified in 179 human genomes

Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing 3 diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43-48% of indels occurring in 4.03% of the genome we classify as indel hotspots, while in the remaining 96% their prevalence is 16-times lower than that for SNPs. Polymerase slippage can explain upwards of 3/4 of all indels, including virtually all hotspot indels. The remainder are mostly simple deletions in complex sequence, but insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage showing an excellent fit to observed levels of variation, which enables us to identify a minority of indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, as is well known of frameshift mutations in coding regions, but also longer indels and indels affecting multiple functionally constrained nucleotides are more strongly selected against in various non-coding contexts. We further find that indels are enriched in associations with gene expression, and find evidence for a contribution of nonsense-mediated decay to this association.
Finally, we show that indels can be integrated in existing GWAS studies, and although we do not find direct evidence that potentially causal protein-coding indels are enriched with strong associations to known disease-associated SNPs, many of our findings suggest that the causal variant underlying some of these associations may be indels.

Wednesday, 20 March 2013

PubMed Highlight: the genome of HeLa cell line has been sequenced

HeLa cells, sampled in 1951 from the cervical tumor of a woman named Henrietta Lacks, are probably the world's most commonly used human cell line and have been used as a standard for understanding many fundamental biological processes, leading to more than 60,000 scientific publications.
In a new study published in G3 (Genes, Genomes, Genetics), scientists announce that they have successfully sequenced the genome of a HeLa cell line. While previous work had shown that HeLa cells have extra copies of each chromosome and sometimes multiple extra chromosomes, the analysis of the HeLa genome revealed additional features commonly associated with cancer cells, such as the loss of healthy copies of genes. In particular, the researchers found that countless regions of the chromosomes in each cell were arranged in the wrong order and had extra or fewer copies of genes.

The results of the study are also discussed in a Nature commentary.

Published Early Online March 11, 2013, doi:10.1534/g3.113.005777


The Genomic and Transcriptomic Landscape of a HeLa Cell Line

Jonathan J. M. Landry, Paul Theodor Pyl, Tobias Rausch, Thomas Zichner, Manu M. Tekkedil, Adrian M. Stütz, Anna Jauch, Raeka S. Aiyar, Gregoire Pau, Nicolas Delhomme, Julien Gagneur, Jan O. Korbel, Wolfgang Huber and Lars M. Steinmetz

Abstract

HeLa is the most widely used model cell line for studying human cellular and molecular biology. To date, no genomic reference for this cell line has been released, and experiments have relied on the human reference genome. Effective design and interpretation of molecular genetic studies done using HeLa cells requires accurate genomic information. Here we present a detailed genomic and transcriptomic characterization of a HeLa cell line. We performed DNA and RNA sequencing of a HeLa Kyoto cell line and analyzed its mutational portfolio and gene expression profile. Segmentation of the genome according to copy number revealed a remarkably high level of aneuploidy and numerous large structural variants at unprecedented resolution. The extensive genomic rearrangements are indicative of catastrophic chromosome shattering, known as chromothripsis. Our analysis of the HeLa gene expression profile revealed that several pathways, including cell cycle and DNA repair, exhibit significantly different expression patterns from those in normal human tissues. Our results provide the first detailed account of genomic variants in the HeLa genome, yielding insight into their impact on gene expression and cellular function as well as their origins. This study underscores the importance of accounting for the strikingly aberrant characteristics of HeLa cells when designing and interpreting experiments, and has implications for the use of HeLa as a model of human biology.

NGS: News on Genomic Studies
