Friday, 27 December 2013

Happy new Human Genome! GRCh38 is here!

This Christmas brings a long-desired gift for anyone involved in genomics: a brand new release of the Human Genome Assembly. The new GRCh38 is a kind gift from the Genome Reference Consortium (GRC), which includes the Wellcome Trust Sanger Institute (WTSI), the Washington University Genome Sciences Center (WUGSC), the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI). This new version is the first major release in four years and provides two major improvements: fewer gaps and centromere model sequences.


The first draft of the human genome contained around 150,000 gaps; years of hard work reduced them to 357 in the GRCh37 version. The new GRCh38 takes care of several of these gaps, including one on chromosome 10 associated with the mannose receptor C type 1 (MRC1) locus and one on chromosome 17 associated with the chemokine (C-C motif) ligand 3 like 1 and ligand 4 like 1 (CCL3L1/CCL4L1) genes. In addition, sequence from whole-genome shotgun sequencing has been added at nearly 100 of GRCh37's assembly gaps.
Telomeres continue to be represented by default 10 kilobase gaps, while some improvements have been made on acrocentric chromosomes: GRCh38 includes new sequences on the short arms of chromosomes 21 and 22.
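
For readers who want to check gap counts themselves, here is a minimal sketch. It assumes that gaps appear in the assembly FASTA as runs of the letter N; the function name `find_gaps` and the toy sequence are made up for illustration and do not come from the GRC.

```python
import re

def find_gaps(sequence, min_len=1):
    """Return half-open (start, end) intervals for runs of N of at least min_len bases.

    Assumption: assembly gaps are encoded as runs of 'N' (or 'n') in the sequence.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]

# Toy sequence (not real genome data): three runs of N.
toy = "NNNNACGTACGTNNACGTNNNN"
gaps = find_gaps(toy)
print(len(gaps), gaps)
```

With `min_len=10000`, the same function would keep only gaps of at least 10 kilobases, such as the default telomere gaps mentioned above.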

The other major feature added to the new release is the model sequence representation for centromeres and some heterochromatin. Using a method developed by a research team at University of California at Santa Cruz (UCSC) and reads generated during the Venter genome assembly, scientists created models for the centromeres. “These models don’t exactly represent the centromere sequences in the Venter assembly, but they are a good approximation of the ‘average’ centromere in this genome” says Church, a genomicist formerly at the US National Center for Biotechnology Information. Even if these sequence models are not exact representations of any real centromere, they will likely improve genome analysis and allow study of variation in centromere sequences.

GRC recently submitted the data for GRCh38 to GenBank, and the assembly is available with accession GCA_000001405.15. These data are also available by FTP at
However, keep in mind that this sequence is provided without any annotation and that it will take at least a couple of weeks for the NCBI annotation pipeline to process the whole data set and produce a new set of RefSeqs. As the NCBI reports: "The chromosome sequences will continue to have accessions NC_000001-NC_000024, but their versions will update as GRCh38 includes a sequence change for all chromosomes. This process generally takes about 2 weeks, and when that is done we will incorporate these sequences into various analysis and display tools, such as genomic BLAST and genome viewers. Thus, at the end of this process each chromosome will be represented by both an unannotated sequence in GenBank (the original GRC data) and an annotated sequence in the RefSeq collection."
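
Since the accessions stay the same and only the versions change, scripts that hard-code an identifier should compare the part after the dot. The snippet below is a small sketch of that idea; the helper names are hypothetical, and the version numbers in the last example are invented for illustration only, not the real RefSeq versions.

```python
def split_accession(acc):
    """Split 'ACCESSION.VERSION' into (accession, version); version is None if absent."""
    base, _, version = acc.partition(".")
    return base, int(version) if version else None

def chromosome_accessions():
    """List the unversioned chromosome accessions NC_000001 to NC_000024."""
    return [f"NC_{i:06d}" for i in range(1, 25)]

def is_update(old, new):
    """True if 'new' is the same accession as 'old' with a higher version."""
    old_base, old_ver = split_accession(old)
    new_base, new_ver = split_accession(new)
    if old_ver is None or new_ver is None:
        return False
    return old_base == new_base and new_ver > old_ver

print(split_accession("GCA_000001405.15"))
print(chromosome_accessions()[0], chromosome_accessions()[-1])
# Hypothetical version numbers, used only to show the comparison:
print(is_update("NC_000001.1", "NC_000001.2"))
```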

Further details on the properties and development of GRCh38 are reported by the Methagora blog from Nature Methods, the NCBI Insights blog and the GRC consortium blog.

Thursday, 21 November 2013

Fred Sanger, father of sequencing, died at the age of 95

As researchers interested in genomics and NGS, we could not fail to spend a few words in memory of Fred Sanger, the two-time Nobel Prize winner who developed the Sanger sequencing method. He was a dedicated and brilliant scientist who, unusually for someone of his stature, spent most of his career in a laboratory. Even after receiving his first Nobel for the discovery of the insulin protein structure, he shifted his interest to DNA and continued to perform many experiments himself.
Thanks to his genius we have been able to read the sequence of the DNA molecule, deciphering the secrets of genetic information. His work opened the door to sequencing automation, thus laying the foundation for the entire genomic era and, finally, for the assembly of the complete human DNA sequence. To honor his achievements, the Wellcome Trust Sanger Institute at Hinxton, where work on the genome continues, is named after him.
Even if something new has appeared in the field with NGS and its innovative techniques, Fred Sanger is still the real father of sequencing!

"Fred can fairly be called the father of the genomic era: his work laid the foundations of humanity's ability to read and understand the genetic code, which has revolutionized biology and is today contributing to transformative improvements in healthcare," Jeremy Farrar, the director of the Wellcome Trust.

“Fred was one of the outstanding scientists of the last century and it is simply impossible to overestimate the impact he has had on modern genetics and molecular biology. Moreover, by his modest manner and his quiet and determined way of carrying out experiments himself right to the end of his career, he was a superb role model and inspiration for young scientists everywhere." Venki Ramakrishnan, deputy director of the Laboratory for Molecular Biology.

"Fred was an inspiration to many, for his brilliant work, for his quiet determination and for his modesty. He was a outstanding investigator, with a dogged determination to solve questions that have led to transformations in how we perceive our world" - Prof Sir Mike Stratton, director of the Wellcome Trust Sanger Institute

Thursday, 24 October 2013

Flash Report: Oxford Nanopore is announcing an early access program for its MinION nanopore sequencer

The MinION is a USB stick (!!!) DNA sequencer announced by Oxford Nanopore Technologies about a year and a half ago. The company, which is currently exhibiting at the American Society of Human Genetics meeting, yesterday announced the launch of the MinION Access Program.
More details in this GenomeWeb article as well as on this post on Nick Loman's blog. Interesting insights into the DNA sample preparation have been posted on the Future Continuous blog.

MinION Access Programme
In late November, Oxford Nanopore will open registration for a MinION Access Programme (MAP – product preview). This is a substantial but initially controlled programme designed to give life science researchers access to nanopore sequencing technology at no risk and minimal cost.
MAP participants will be at the forefront of applying a completely novel, long-read, real-time sequencing system to existing and new application areas. MAP participants will gain hands-on understanding of the MinION technology, its capabilities and features. They will also play an active role in assessing and developing the system over time. Oxford Nanopore believes that any life science researcher can and should be able to exploit MinION in their own work. Accordingly, Oxford Nanopore is accepting applications for MAP participation from all [1, 2].
About the programme
A substantial number of selected participants will receive a MinION Access programme package. This will include:
* At least one complete MinION system (device, flowcells and software tools).
* MAP participants will be asked to pay a refundable $1,000 deposit on the MinION USB device, plus shipping.
* Oxford Nanopore will provide a regular baseline supply of flowcells sufficient to allow frequent usage of the system. MAP participants will ONLY pay shipping costs on these flowcells. Any additional flowcells required at the participants’ discretion may be available for purchase at a MAP-only price of $999 each plus shipping and taxes.
* Oxford Nanopore will provide Sequencing Preparation Kits. MAP participants may choose to develop their own sample preparation and analysis methods; however, at this stage on an unsupported basis.
What are the terms of the MAP agreement?
Participation in the MAP product preview program will require participants to sign up to an End User License Agreement (EULA) and simple terms intended to allow Oxford Nanopore to further develop the utility of the products, applications and customer support while also maximising scientific benefits for MAP participants. Further details will be provided when registration opens, however in outline:
* MAP participants will be invited to provide Oxford Nanopore with feedback regarding their experiences through channels provided by the company.
* All used flow cells are to be returned to Oxford Nanopore [3].
* MAP participants will receive training and support through an online participant community and support portal.
* MAP participants will go through an initial restricted ‘burn-in’ period, during which test samples will be run and data shared with Oxford Nanopore. After consistent and satisfactory performance has been achieved under pre-agreed criteria, the MAP participants will be able to conduct experiments with their own samples. Data can be published whilst participants are utilising the baseline supply of flowcells.
* MAP participants or Oxford Nanopore may terminate participation in the programme at any time, for any reason. Deposits will be refunded after all of the MAP hardware is returned.
* MAP participants will be the first to publish data from their own samples. Oxford Nanopore does not intend to restrict use or dissemination of the biological results obtained by participants using MinIONs to analyse their own samples. Oxford Nanopore is interested in the quality and performance of the MinION system itself.
* Oxford Nanopore intends to give preferential status for the GridION Access Programme (GAP) when announced to successful participants in the MinION access programme.
* The MinION software will generate reports on the quality of each experiment and will be provided to Oxford Nanopore only to facilitate support and debugging.
Registration process
Registration will open in late November for a specific and limited time period. Oxford Nanopore will operate a controlled release of spaces on the programme.
MAP participants will be notified upon acceptance to the programme. They will then be able to review and accept the EULA before providing the refundable deposit and joining the programme. MAP participants will then receive a login for the participant support portal and a target delivery date for their MinION(s) and initial flow cells.
The online participant support portal will provide training materials, FAQs, support and other information such as data examples from Oxford Nanopore. It will also include a community forum to allow participants to share experiences.
Who can join?
Anybody who is not affiliated with competitors of Oxford Nanopore. Strong preference will be given to biologists/researchers working within the field of applied NGS where long reads, simple workflow, low costs, and real time analysis can be shown to make a key difference. Preference may also be given to individuals/sites opting for multiple MinIONs. If the programme is oversubscribed, some element of fairly applied random selection may be used to further prioritise participants.
1. If you would like us to keep you informed of the opening of this registration please visit our contact page and select the box marked ‘Keep me informed on the MinION Access programme’.
2. The MinION system is for Research Use Only
3. Flowcells can be easily, quickly and thoroughly washed through with water and dried before return.

Thursday, 17 October 2013

Sad but true: six years after acquisition Roche shuts down 454 sequencing business

I think that October 15, 2013 is a sad day for the NGS community. In 2007, a pyrosequencing-based technology originally developed by Jonathan Rothberg made it possible to sequence, for the first time with an NGS-based approach, the complete genome of an individual of our species (James Watson). Now those days are over, and newer technologies are more "commercially sustainable" than some of the ones that made the history of NGS. I'm also wondering what the future holds for the SOLiD system in a world dominated by Illumina and a rapidly growing community of Ion Torrent/Proton users.

By a GenomeWeb staff reporter

NEW YORK (GenomeWeb News) - Roche is shuttering its 454 Life Sciences sequencing operations and laying off about 100 employees, the company confirmed today. The 454 sequencers will be phased out in mid-2016, and the 454 facility in Branford, Conn., will be closed "accordingly," Roche said in a statement e-mailed to GenomeWeb Daily News.

The 100 layoffs will take place during the next three years, and, "Roche is committed to finding socially responsible solutions for the employees affected by these changes," it added.

Until the business is shut down, Roche will provide service and support to 454 instruments, parts, reagents, and consumables.

"Sequencing is a fast-evolving technology," the firm said. "With the continuous efforts of the sequencing unit in building a diverse pipeline of potentially differentiated sequencing technologies, Roche is committed to introducing differentiated and competitive products to the market and offer[ing] a sequencing product pipeline for both life science and clinical applications." Roche bought 454 from Curagen in 2007 for $155 million in cash and stock, saying at the time the deal would solidify its access to future 454 sequencers and enable it to use the tools for in vitro diagnostic applications. Prior to the purchase, Roche had been 454's exclusive distributor starting in 2005. Roche does not break out revenue figures for the 454 business, but in recent years, with the ascent of other sequencing technologies, the 454 instruments were pushed to the research margins. More recently, lower-throughput sequencers such as Illumina's MiSeq and Life Technologies' Ion Torrent systems have further moved the 454 technology toward irrelevance.

Over the past few years, Roche has made efforts to stay involved with the next wave of sequencing technologies in development. It forged alliances with IBM and DNA Electronics in 2010, and in early 2012 it made a hostile bid for Illumina that eventually reached a price tag of $6.7 billion - a bid that was rejected by Illumina.

After being rebuffed by Illumina, Roche continued trying to resuscitate its sequencing operations, and in December 2012, reports surfaced that Roche was again courting Illumina, though no deal materialized.

Then, earlier this year, Roche announced it had ended its deals with IBM and DNA Electronics, and said simultaneously it would shut down its R&D efforts in semiconductor sequencing and nanopore sequencing, resulting in about 60 layoffs at the Branford site.

The company said at the time that it was consolidating its 454 and NimbleGen products into a new sequencing unit covering both life science and clinical diagnostic applications. Dan Zabrowski, head of Roche Applied Science, who was to head the new sequencing unit, told In Sequence then, "We are fully committed to [our] life science business, and this decision did not have an impact on any of our businesses or customers. We continue to think that sequencing is going to play an important role in life science and in the clinic."

Roche is not abandoning the sequencing space entirely, though. Last month the firm forged a deal for up to $75 million with Pacific Biosciences to develop diagnostic products based on PacBio's SMRT technology.

"Roche sees great potential in this sequencing technology which offers unprecedented read length, accuracy, and speed for the development of future sequencing based applications in clinical diagnostics," Roche said.

Tuesday, 20 August 2013

Ready for Avalanche?

It's been a while (I remember rumors from more than a year ago) since Life Tech started talking about a new chemistry based on isothermal amplification that promises to eliminate emPCR, shorten clonal amplification to less than 1 h and provide better uniformity... They called it Avalanche... But nothing more was revealed in the following months, so that Avalanche was becoming like a mythological creature: sure, it's fascinating, but does it really exist?

However, keeping new developments secret until the time of commercial release seems to be common practice in the competitive field of NGS technology... and now Avalanche is finally ready to hit our benches and provide the long-awaited improvement for the SOLiD and Torrent sequencing platforms.
Indeed, researchers from Life Technologies have recently published a paper in PNAS demonstrating the feasibility, efficiency and rapidity of the new method. As anticipated, it is based on isothermal amplification of a properly prepared DNA template using the Bst DNA polymerase and substrate-immobilized primers. Citing the abstract, it uses "a template walking mechanism using a pair of low-melting temperature (Tm) solid-surface homopolymer primers and a low-Tm solution phase primer". The authors report results obtained with the new method applied on a SOLiD 5500 W flowchip, and they are quite exciting: a reaction time of slightly more than 30 min, a high percentage of monoclonal colonies, 3- to 4-fold more mapped reads than with the traditional method, and an easy paired-end protocol.

Since the procedure reported in the paper was developed on SOLiD technology, I expect that the new chemistry will be commercially available on this platform shortly... However, I'm wondering if Life Technologies is already working to extend this innovation to the Torrent sequencers as well, and if they will introduce Avalanche with the PII chip (announced to be out in the coming months) or with the PIII (probably in the first half of 2014). The method demonstrated in the paper requires surface-immobilized primers on a flowcell, so they have to adapt it to work on the small beads that are required for the PGM and Proton chips... or they have to re-think the chips themselves to avoid the use of beads, but I think this solution is not so easy to apply to the current sequencers. Another question that bothers most Torrent customers: will Avalanche be compatible with the current library/template preparation equipment, or will we have to buy some new, expensive piece of hardware to actually do the upgrade?

Now waiting to be trampled by the Avalanche!

For more technical details read the full paper:
Isothermal amplification method for next-generation sequencing. 
Ma Z, Lee RW, Li B, Kenney P, Wang Y, Erikson J, Goyal S, Lao K. 

Sunday, 28 July 2013

Monday, 22 July 2013

Pubmed Highlight: silencing the extra copy of chromosome 21 in Down’s syndrome cells using the XIST gene

Having been part, in 1991, of the team that originally cloned the mouse Xist gene, I was really excited by this news. Scientists at the University of Massachusetts discovered that XIST, the gene involved in X-chromosome inactivation, can be used to turn off the extra chromosome 21 in Down syndrome.

The study has been published in the latest issue of Nature.

Nature. 2013 Jul 17. doi: 10.1038/nature12394.

Translating dosage compensation to trisomy 21.


Department of Cell and Developmental Biology, University of Massachusetts Medical School, 55 Lake Avenue North, Worcester, Massachusetts 01655, USA.


Down's syndrome is a common disorder with enormous medical and social costs, caused by trisomy for chromosome 21. We tested the concept that gene imbalance across an extra chromosome can be de facto corrected by manipulating a single gene, XIST (the X-inactivation gene). Using genome editing with zinc finger nucleases, we inserted a large, inducible XIST transgene into the DYRK1A locus on chromosome 21, in Down's syndrome pluripotent stem cells. The XIST non-coding RNA coats chromosome 21 and triggers stable heterochromatin modifications, chromosome-wide transcriptional silencing and DNA methylation to form a 'chromosome 21 Barr body'. This provides a model to study human chromosome inactivation and creates a system to investigate genomic expression changes and cellular pathologies of trisomy 21, free from genetic and epigenetic noise. Notably, deficits in proliferation and neural rosette formation are rapidly reversed upon silencing one chromosome 21. Successful trisomy silencing in vitro also surmounts the major first step towards potential development of 'chromosome therapy'.

Friday, 19 July 2013

PubMed Highlight: A new type of virus, the Pandoraviruses.

Pandoraviruses: Amoeba Viruses with Genomes Up to 2.5 Mb Reaching That of Parasitic Eukaryotes

Ten years ago, the discovery of Mimivirus, a virus infecting Acanthamoeba, initiated a reappraisal of the upper limits of the viral world, both in terms of particle size (>0.7 micrometers) and genome complexity (>1000 genes), dimensions typical of parasitic bacteria. The diversity of these giant viruses (the Megaviridae) was assessed by sampling a variety of aquatic environments and their associated sediments worldwide. We report the isolation of two giant viruses, one off the coast of central Chile, the other from a freshwater pond near Melbourne (Australia), without morphological or genomic resemblance to any previously defined virus families. Their micrometer-sized ovoid particles contain DNA genomes of at least 2.5 and 1.9 megabases, respectively. These viruses are the first members of the proposed “Pandoravirus” genus, a term reflecting their lack of similarity with previously described microorganisms and the surprises expected from their future study.

Wednesday, 17 July 2013

PubMed Highlight: RNA-Seq analysis made simple

Using RNA-Seq data to assess differential expression and analyze variation in splicing and isoforms is becoming a recurrent task for many labs interested in gene expression. As usual with NGS, generating the data is quite fast and simple, but strong bioinformatic know-how is required to actually answer the biological question.
With this paper published in BMC Bioinformatics, Boria et al. provide a simple and automated analysis pipeline for RNA-Seq data, with the ability to detect differentially expressed genes, differential splicing events and new gene transcripts. The suite is freely available upon registration at this web address.

BMC Bioinformatics. 2013 Apr 22;14 Suppl 7:S10.
NGS-Trex: Next Generation Sequencing Transcriptome profile explorer. 

Boria I, Boatti L, Pesole G, Mignone F. 

BACKGROUND: Next-Generation Sequencing (NGS) technology has exceptionally increased the ability to sequence DNA in a massively parallel and cost-effective manner. Nevertheless, NGS data analysis requires bioinformatics skills and computational resources well beyond the possibilities of many "wet biology" laboratories. Moreover, most of projects only require few sequencing cycles and standard tools or workflows to carry out suitable analyses for the identification and annotation of genes, transcripts and splice variants found in the biological samples under investigation. These projects can take benefits from the availability of easy to use systems to automatically analyse sequences and to mine data without the preventive need of strong bioinformatics background and hardware infrastructure.

RESULTS: To address this issue we developed an automatic system targeted to the analysis of NGS data obtained from large-scale transcriptome studies. This system, we named NGS-Trex (NGS Transcriptome profile explorer) is available through a simple web interface and allows the user to upload raw sequences and easily obtain an accurate characterization of the transcriptome profile after the setting of few parameters required to tune the analysis procedure. The system is also able to assess differential expression at both gene and transcript level (i.e. splicing isoforms) by comparing the expression profile of different samples. By using simple query forms the user can obtain list of genes, transcripts, splice sites ranked and filtered according to several criteria. Data can be viewed as tables, text files or through a simple genome browser which helps the visual inspection of the data.

CONCLUSIONS: NGS-Trex is a simple tool for RNA-Seq data analysis mainly targeted to "wet biology" researchers with limited bioinformatics skills. It offers simple data mining tools to explore transcriptome profiles of samples investigated taking advantage of NGS technologies.

Wednesday, 10 July 2013

PubMed Highlight: Evaluation of bioinformatic tools for prediction of functional impact of missense variants

This interesting paper evaluates the performance (sensitivity and specificity) of 9 different tools commonly used in bioinformatics to predict the functional effect of a missense mutation. The authors also developed a publicly available Web-tool (Variant Effect Prediction) to estimate a consensus score taking into account the results from four different tools (SIFT, PolyPhen2, SNPs&GO and Mutation Assessor).
Since the automated prediction of functional impact is part of most SNV prioritization pipelines, this paper could certainly be useful to develop a robust NGS secondary analysis.

Genomics. 2013 Jul 3. pii: S0888-7543(13)00126-2
Predicting the functional consequences of non-synonymous DNA sequence variants - evaluation of bioinformatics tools and development of a consensus strategy.

Frousios K, Iliopoulos CS, Schlitt T, Simpson MA. 

The study of DNA sequence variation has been transformed by recent advances in DNA sequencing technologies. Determination of the functional consequences of sequence variant alleles offers potential insight as to how genotype may influence phenotype. Even within protein coding regions of the genome, establishing the consequences of variation on gene and protein function is challenging and often requires substantial laboratory investigation. However, a series of bioinformatics tools have been developed to predict whether non-synonymous variants are neutral or disease-causing. In this study we evaluate the performance of nine such methods (SIFT, PolyPhen2, SNPs&GO, PhD-SNP, PANTHER, Mutation Assessor, MutPred, Condel and CAROL) and developed CoVEC (Consensus Variant Effect Classification), a tool that integrates the prediction results from four of these methods. We demonstrate that the CoVEC approach outperforms most individual methods and highlights the benefit of combining results from multiple tools.
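
To give an idea of what "combining results from multiple tools" can look like in practice, here is a minimal sketch using a plain majority vote. This is an assumption made for illustration: the abstract does not say how CoVEC integrates the four predictions, so this is not the authors' method. The labels and the function name `consensus_call` are also hypothetical.

```python
from collections import Counter

def consensus_call(predictions):
    """Majority vote over per-tool calls.

    predictions maps a tool name to 'damaging', 'neutral', or None (no prediction).
    Returns the winning label, 'tie' if the top two labels have equal votes,
    or 'no_call' if no tool produced a prediction.
    """
    votes = Counter(call for call in predictions.values() if call is not None)
    if not votes:
        return "no_call"
    (top_label, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return "tie"
    return top_label

# Made-up calls for a single missense variant (not real tool output).
example = {
    "SIFT": "damaging",
    "PolyPhen2": "damaging",
    "SNPs&GO": "neutral",
    "Mutation Assessor": "damaging",
}
print(consensus_call(example))
```

A real pipeline would probably weight the tools by their measured sensitivity and specificity rather than count each vote equally.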

Made with Love (and Science): first child born following embryo screening with NGS

[Picture of Connor Levy, from NewsWorks]

The news was reported on July 8 at the annual meeting of the European Society of Human Reproduction and Embryology (ESHRE) by Dr Dagan Wells of the NIHR Biomedical Research Centre at the University of Oxford, UK.
According to The Guardian, after standard treatment at a US clinic a Philadelphia couple had 13 in vitro fertilization embryos to choose from. The doctors cultured the embryos for five days, took a few cells from each and sent them to Dr Wells in Oxford for genetic screening. Tests performed using NGS on an Ion Torrent platform showed that while most of the embryos looked healthy, only three had the right number of chromosomes. Based on the screening results, the US doctors transferred one of the healthy embryos into the mother and left the rest in cold storage. The single embryo implanted, and on 18 May 2013 a healthy boy, named Connor, was born.
Apparently the Oxford team has used NGS for testing for aneuploidy, mutations in the cystic fibrosis gene and mtDNA.
Dr Wells, who led the international research team behind the study, said: "Many of the embryos produced during infertility treatments have no chance of becoming a baby because they carry lethal genetic abnormalities. Next generation sequencing improves our ability to detect these abnormalities and helps us identify the embryos with the best chances of producing a viable pregnancy. Potentially, this should lead to improved IVF success rates and a lower risk of miscarriage".
The abstract of the ESHRE communication can be downloaded here.

Tuesday, 2 July 2013

Incredible But True: Life Technologies introduces an amplicon-based exome sequencing kit

Life Technologies has launched an exome capture kit that makes use of its AmpliSeq technology.
According to the manufacturer, the AmpliSeq™ Exome Kit minimizes the high cost and complexity of exome sequencing enabling the enrichment and sequencing of ~294,000 amplicons (!!!).
The kit targets 97% of coding regions, as described by Consensus Coding Sequences (CCDS) annotation, in 12 primer pools for highly specific enrichment of exons within the human genome totaling ~58 Mb (it is not clear to me if the amplicons include sequences like UTRs and miRNA). The novel technology, designed for the Ion Proton platform, delivers >94% of targeted bases covered at 10x even with two exomes per Ion PI chip. Total workflow from DNA to annotated variants of an exome can be achieved in two days, including six hours for exome library preparation and three hours of sequencing time. Compared to hybridization approaches for exome sequencing, one advantage of an amplicon-based approach is that the input amount is small (as little as 50 nanograms).
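
A figure like ">94% of targeted bases covered at 10x" is simply the fraction of target positions with a read depth of at least 10. Here is a minimal sketch of that calculation; the function name and the toy depth values are invented for illustration, and a real analysis would read per-base depths from the output of a coverage tool.

```python
def fraction_covered(depths, min_depth=10):
    """Fraction of positions whose read depth is at least min_depth."""
    if not depths:
        raise ValueError("no positions given")
    return sum(1 for d in depths if d >= min_depth) / len(depths)

# Toy per-base depths for ten target positions (not real data).
toy_depths = [0, 5, 10, 12, 30, 9, 10, 100, 11, 10]
print(f"{fraction_covered(toy_depths):.0%} of bases covered at 10x")
```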
Additional information can be found in this Life Technologies' Application Note.

Sunday, 30 June 2013

PubMed Highlight: Prioritization of synonymous variants

The final step of variant prioritization is a key point in NGS studies focused on the identification of disease-causing mutations. Until now, all the tools developed in this area have considered only missense mutations, relying on various algorithms and on integration with known information to suggest the best causative variants within a list of candidates. However, recent studies have shown that synonymous mutations can also be responsible for disease. The new Silent Variant Analyzer (SilVA), described by Buske et al. in Bioinformatics, is the first effort to prioritize synonymous variants and identify the ones that may be deleterious. I'm sorry, but it seems that we can no longer throw away synonymous SNVs to simplify data analysis...

Identification of deleterious synonymous variants in human genomes
Orion J. Buske, AshokKumar Manickaraj, Seema Mital, Peter N. Ray and Michael Brudno

Motivation: The prioritization and identification of disease-causing mutations is one of the most significant challenges in medical genomics. Currently available methods address this problem for non-synonymous single nucleotide variants (SNVs) and variation in promoters/enhancers; however, recent research has implicated synonymous (silent) exonic mutations in a number of disorders.
Results: We have curated 33 such variants from literature and developed the Silent Variant Analyzer (SilVA), a machine-learning approach to separate these from among a large set of rare polymorphisms. We evaluate SilVA’s performance on in silico ‘infection’ experiments, in which we implant known disease-causing mutations into a human genome, and show that for 15 of 33 disorders, we rank the implanted mutation among the top five most deleterious ones. Furthermore, we apply the SilVA method to two additional datasets: synonymous variants associated with Meckel syndrome, and a collection of silent variants clinically observed and stratified by a molecular diagnostics laboratory, and show that SilVA is able to accurately predict the harmfulness of silent variants in these datasets.
Availability: SilVA is open source and is freely available from the project website:
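
The in silico 'infection' evaluation described in the abstract boils down to a ranking check: score every synonymous variant in a genome, then see whether the implanted disease variant lands in the top five. The sketch below shows only that bookkeeping. The scores are made up and the helper names are hypothetical; it does not use or reproduce SilVA's actual scoring model.

```python
def rank_of(scores, target):
    """1-based rank of target when variants are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1

def in_top_k(scores, target, k=5):
    """True if target is among the k highest-scoring variants."""
    return rank_of(scores, target) <= k

# Made-up harmfulness scores: four background variants plus one implanted variant.
toy_scores = {
    "background_1": 0.10,
    "background_2": 0.35,
    "implanted": 0.80,
    "background_3": 0.92,
    "background_4": 0.05,
}
print(rank_of(toy_scores, "implanted"), in_top_k(toy_scores, "implanted"))
```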

Thursday, 27 June 2013

PubMed Highlight: The state of the art in Genomic Medicine

As people interested in genomic studies, we cannot miss this review, which appeared in Science Translational Medicine and gives a comprehensive overview of the impact and future perspectives of genomics applied to medicine. The authors guide you through a decade of genomic research that led to the identification of genetic causes for many Mendelian diseases, as well as the dissection of genetic factors underlying complex diseases. They also show how recent advances in sequencing technology have finally allowed the development of genomics-driven clinical care of patients, at least in some fields such as cancer pharmacogenomics and genetic diagnosis.

Personalized medicine, the final goal that pushed us to decipher the whole DNA sequence, now seems close... or at least NGS has brought this goal within grasp.

Genomic Medicine: A Decade of Successes, Challenges, and Opportunities
Jeanette J. McCarthy, Howard L. McLeod and Geoffrey S. Ginsburg

Sci Transl Med 12 June 2013: Vol. 5, Issue 189, p. 189sr4 

Genomic medicine—an aspirational term 10 years ago—is gaining momentum across the entire clinical continuum from risk assessment in healthy individuals to genome-guided treatment in patients with complex diseases. We review the latest achievements in genome research and their impact on medicine, primarily in the past decade. In most cases, genomic medicine tools remain in the realm of research, but some tools are crossing over into clinical application, where they have the potential to markedly alter the clinical care of patients. In this State of the Art Review, we highlight notable examples including the use of next-generation sequencing in cancer pharmacogenomics, in the diagnosis of rare disorders, and in the tracking of infectious disease outbreaks. We also discuss progress in dissecting the molecular basis of common diseases, the role of the host microbiome, the identification of drug response biomarkers, and the repurposing of drugs. The significant challenges of implementing genomic medicine are examined, along with the innovative solutions being sought. These challenges include the difficulty in establishing clinical validity and utility of tests, how to increase awareness and promote their uptake by clinicians, a changing regulatory and coverage landscape, the need for education, and addressing the ethical aspects of genomics for patients and society. Finally, we consider the future of genomics in medicine and offer a glimpse of the forces shaping genomic medicine, such as fundamental shifts in how we define disease, how medicine is delivered to patients, and how consumers are managing their own health and affecting change.

Tuesday, 25 June 2013

Computational diet: how to make your Gb human genome as light as a few Mb

With the technology developing rapidly, whole exome/genome sequencing of individual subjects is nowadays being performed by many labs all around the world. From the first examples of individual genomes, such as the complete DNA of Venter or Watson, we have now reached a point where the entire 6 billion bases can be sequenced in about a day. With such a high data production rate, the data storage problem has suddenly emerged as a painful thorn in the NGS boot, getting worse with every (sequencing) run!
In a future where genome-based personalized medicine becomes reality, your DNA sequence will need to be stored for life, like any other medical record.
Long-term storage of at least the complete DNA sequence could be a crucial factor, since the owner of the genome could be updated with new relevant information as new genotype-phenotype correlations are discovered.

Different solutions have been proposed to reduce the amount of information that has to be stored for a single genome sequence, so as to make storage of large genomic datasets feasible. At least the final exome/genome sequence of an individual can now be reduced to a few Mb of disk space, so that you can accommodate several of them even on a standard hard disk.
This incredible result is achieved using various compression algorithms. The first generation basically used information on repetitive regions in the genome to reduce the final file size to a few hundred Mb. Second-generation algorithms are instead based on a reference sequence and only store the differences between the reference and the sequenced DNA. The best performing tool until now was DNAZip, which reduced the Watson genome sequence to only 4 Mb... something that you can easily share as an email attachment, as the authors stated. Now Pavlichin et al. from Stanford University have pushed compression even further. Their solution is based on the same approach and takes advantage of the dbSNP database, so that the positions of already known SNPs don't have to be stored in the final file. To further shrink the file size, positions of novel SNPs are not stored individually but as distances from the previous SNP. These improvements, together with a brand new compression function and a haplotype-based trick, push the file size for a single complete genome down to 2.5 Mb.
The main disadvantage is that you need the reference sequence and the dbSNP database in order to reconstruct your genome, but when dealing with thousands of genomes this is a minor drawback.
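To make the "distances from the previous SNP" idea concrete, here is a minimal Python sketch of delta encoding applied to a sorted list of variant positions. It only illustrates the general technique and is not the actual code or file format used by Pavlichin et al.; the function names and the example positions are made up for illustration.

```python
from typing import List


def delta_encode(positions: List[int]) -> List[int]:
    """Replace each sorted position with its distance from the previous one."""
    deltas = []
    previous = 0
    for pos in positions:
        deltas.append(pos - previous)
        previous = pos
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the absolute positions by summing the distances back up."""
    positions = []
    current = 0
    for gap in deltas:
        current += gap
        positions.append(current)
    return positions


# Hypothetical positions of novel SNPs on one chromosome (illustrative values only).
novel_snp_positions = [1_000_123, 1_000_480, 1_002_915, 1_003_001]
encoded = delta_encode(novel_snp_positions)
print(encoded)

# The encoding is lossless: decoding gives back the original positions.
assert delta_decode(encoded) == novel_snp_positions
```

After the first entry the stored numbers are small gaps rather than large genomic coordinates, and small numbers are what a downstream entropy coder can represent with fewer bits.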

Unfortunately, less can be done for the bulk of sequence-related information that is so useful for research purposes (such as base and read quality scores, genotype scores, alignment metrics and so on): it seems that you still need a tower of drives to store it all! However, a lot of effort has been put into this area too, since the possibility of reanalyzing older datasets with new techniques and tools can lead to new discoveries. There is now even a prize for data compression, the Sequence Squeeze competition, whose site also has updated information on the best performing tools.

PubMed Highlight: Benchmarking of short sequence aligners for NGS

The introduction of NGS technology has posed the problem of fast and accurate mapping of the millions of short sequences produced in every single experiment. To address this challenge, a number of alignment tools have been developed and updated, each one optimized for a specific type of input or a specific kind of alignment problem (gapped, ungapped and so on).
Even if every aligner has its strengths and pitfalls, a detailed comparison of the overall performance of the various tools is always useful when it comes to choosing the best one for your own analysis. This recent paper published in BMC Bioinformatics reports a benchmark of the most popular aligners, such as Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST. Moreover, the authors have developed a benchmarking suite that can be used to assess the performance of any other aligner of interest.

If you need a compass to find your way in the world of aligners, don't forget to also visit the HTS mapper page at EBI, which is really useful and summarizes the main features of each software tool.

Benchmarking short sequence mapping tools
Ayat Hatem, Doruk Bozdağ, Amanda E Toland and Ümit V Çatalyürek
BMC Bioinformatics 2013, 14:184. Published: 7 June 2013

The development of next-generation sequencing instruments has led to the generation of millions of short sequences in a single run. The process of aligning these reads to a reference genome is time consuming and demands the development of fast and accurate alignment tools. However, the current proposed tools make different compromises between the accuracy and the speed of mapping. Moreover, many important aspects are overlooked while comparing the performance of a newly developed tool to the state of the art. Therefore, there is a need for an objective evaluation method that covers all the aspects. In this work, we introduce a benchmarking suite to extensively analyze sequencing tools with respect to various aspects and provide an objective comparison.
We applied our benchmarking tests on 9 well known mapping tools, namely, Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST (mrFAST) using synthetic data and real RNA-Seq data. MAQ and RMAP are based on building hash tables for the reads, whereas the remaining tools are based on indexing the reference genome. The benchmarking tests reveal the strengths and weaknesses of each tool. The results show that no single tool outperforms all others in all metrics. However, Bowtie maintained the best throughput for most of the tests while BWA performed better for longer read lengths. The benchmarking tests are not restricted to the mentioned tools and can be further applied to others.
The mapping process is still a hard problem that is affected by many factors. In this work, we provided a benchmarking suite that reveals and evaluates the different factors affecting the mapping process. Still, there is no tool that outperforms all of the others in all the tests. Therefore, the end user should clearly specify his needs in order to choose the tool that provides the best results.

Wednesday, 19 June 2013

PubMed Highlight: When is coverage deep enough for SNP detection?

This interesting article, just published in BMC Bioinformatics, gives a scientific answer to one of the most discussed questions in NGS: how many reads do you need for reliable SNP calling? And what is the effect of coverage on the power of your SNP detection study?

And the answer is: "We estimate a local read depth of 13X is required to detect the alleles and genotype of a heterozygous SNV 95% of the time, but only 3X for a homozygous SNV. At a mean on-target read depth of 20X, commonly used for rare disease exome sequencing studies, we predict 5-15% of heterozygous and 1-4% of homozygous SNVs in the targeted regions will be missed."

A must have paper!

Quantifying single nucleotide variant detection sensitivity in exome sequencing.
BMC Bioinformatics. 2013 Jun 18;14(1):195

Meynert AM, Bicknell LS, Hurles ME, Jackson AP, Taylor MS. 

BACKGROUND: The targeted capture and sequencing of genomic regions has rapidly demonstrated its utility in genetic studies. Inherent in this technology is considerable heterogeneity of target coverage and this is expected to systematically impact our sensitivity to detect genuine polymorphisms. To fully interpret the polymorphisms identified in a genetic study it is often essential to both detect polymorphisms and to understand where and with what probability real polymorphisms may have been missed.
RESULTS: Using down-sampling of 30 deeply sequenced exomes and a set of gold-standard single nucleotide variant (SNV) genotype calls for each sample, we developed an empirical model relating the read depth at a polymorphic site to the probability of calling the correct genotype at that site. We find that measured sensitivity in SNV detection is substantially worse than that predicted from the naive expectation of sampling from a binomial. This calibrated model allows us to produce single nucleotide resolution SNV sensitivity estimates which can be merged to give summary sensitivity measures for any arbitrary partition of the target sequences (nucleotide, exon, gene, pathway, exome). These metrics are directly comparable between platforms and can be combined between samples to give "power estimates" for an entire study. We estimate a local read depth of 13X is required to detect the alleles and genotype of a heterozygous SNV 95% of the time, but only 3X for a homozygous SNV. At a mean on-target read depth of 20X, commonly used for rare disease exome sequencing studies, we predict 5-15% of heterozygous and 1-4% of homozygous SNVs in the targeted regions will be missed.
CONCLUSIONS: Non-reference alleles in the heterozygote state have a high chance of being missed when commonly applied read coverage thresholds are used despite the widely held assumption that there is good polymorphism detection at these coverage levels. Such alleles are likely to be of functional importance in population based studies of rare diseases, somatic mutations in cancer and explaining the "missing heritability" of quantitative traits.
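The "naive expectation of sampling from a binomial" mentioned in the abstract can be sketched in a few lines of Python. This is only the simple textbook model, not the calibrated empirical model built by the authors: it assumes each read at a heterozygous site shows either allele with probability 0.5, and that a genotype is called only when each allele is seen in at least `min_reads_per_allele` reads. That threshold and the function name are assumptions chosen for illustration.

```python
from math import comb


def prob_het_detected(depth: int, min_reads_per_allele: int = 2) -> float:
    """Naive probability that both alleles of a heterozygous site are each
    seen at least `min_reads_per_allele` times, with reads ~ Binomial(depth, 0.5).
    """
    k = min_reads_per_allele
    favourable = 0
    # Count outcomes where the alternate allele has between k and depth - k reads,
    # so that the reference allele also has at least k reads.
    for alt_reads in range(k, depth - k + 1):
        favourable += comb(depth, alt_reads)
    return favourable / 2 ** depth


# Print the naive expectation for a few local read depths.
for depth in (3, 5, 8, 13, 20):
    print(depth, round(prob_het_detected(depth), 4))
```

The paper's point is that real sensitivity is substantially worse than this idealized curve, so the numbers printed here should be read as an optimistic upper bound rather than as the published estimates.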

Thursday, 13 June 2013

PubMed Highlight: comparison between Ion Proton and Illumina HiSeq for exome sequencing

We have already seen papers comparing the Ion PGM with Illumina technology. Now it is time to muscle up and take the competition to a larger scale! Could the new Proton be a real challenger for the HiSeq king?

Sunday, 9 June 2013

PubMed Highlight: The Netherlands plans to sequence the full genomes of 250 local trios

After the last post on animal genome sequencing, let's get back to some human genomes.

As part of the large European Biobanking and Biomolecular Research Infrastructure (BBMRI), the Dutch group has launched an initiative to sequence the full DNA of 250 trios from all provinces of the country at about 15X coverage, with the aim of characterizing genetic variability in the Dutch population. As the authors report:
"The family-based design represents a unique resource to assess the frequency of regional variants, accurately reconstruct haplotypes by family-based phasing, characterize short indels and complex structural variants, and establish the rate of de novo mutational events. GoNL will also serve as a reference panel for imputation in the available genome-wide association studies in Dutch and other cohorts to refine association signals and uncover population-specific variants".

Genomic data will also be integrated with detailed geographic and phenotype information, providing a valuable resource for genotype-phenotype correlation studies and for evaluating variant distribution across local and European populations. The initiative has been named Genome of the Netherlands (GoNL) and data are accessible at the dedicated web page.

Eur J Hum Genet. 2013 May 29
Boomsma DI, Wijmenga C, Slagboom EP, Swertz MA, Karssen LC, Abdellaoui A, Ye K, Guryev V, Vermaat M, van Dijk F, Francioli LC, Jan Hottenga J, Laros JF, Li Q, Li Y, Cao H, Chen R, Du Y, Li N, Cao S, van Setten J, Menelaou A, Pulit SL, Hehir-Kwa JY, Beekman M, Elbers CC, Byelas H, de Craen AJ, Deelen P, Dijkstra M, T den Dunnen J, de Knijff P, Houwing-Duistermaat J, Koval V, Estrada K, Hofman A, Kanterakis A, Enckevort DV, Mai H, Kattenberg M, van Leeuwen EM, Neerincx PB, Oostra B, Rivadeneira F, Suchiman EH, Uitterlinden AG, Willemsen G, Wolffenbuttel BH, Wang J, de Bakker PI, van Ommen GJ, van Duijn CM.

Within the Netherlands a national network of biobanks has been established (Biobanking and Biomolecular Research Infrastructure-Netherlands (BBMRI-NL)) as a national node of the European BBMRI. One of the aims of BBMRI-NL is to enrich biobanks with different types of molecular and phenotype data. Here, we describe the Genome of the Netherlands (GoNL), one of the projects within BBMRI-NL. GoNL is a whole-genome-sequencing project in a representative sample consisting of 250 trio-families from all provinces in the Netherlands, which aims to characterize DNA sequence variation in the Dutch population. The parent-offspring trios include adult individuals ranging in age from 19 to 87 years (mean=53 years; SD=16 years) from birth cohorts 1910-1994. Sequencing was done on blood-derived DNA from uncultured cells and accomplished coverage was 14-15x. The family-based design represents a unique resource to assess the frequency of regional variants, accurately reconstruct haplotypes by family-based phasing, characterize short indels and complex structural variants, and establish the rate of de novo mutational events. GoNL will also serve as a reference panel for imputation in the available genome-wide association studies in Dutch and other cohorts to refine association signals and uncover population-specific variants. GoNL will create a catalog of human genetic variation in this sample that is uniquely characterized with respect to micro-geographic location and a wide range of phenotypes. The resource will be made available to the research and medical community to guide the interpretation of sequencing projects. The present paper summarizes the global characteristics of the project
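To illustrate why a trio design helps to "establish the rate of de novo mutational events", here is a minimal Python sketch of a Mendelian consistency check. It is a simplified illustration and not the GoNL pipeline: genotypes are represented as pairs of allele strings, the function name is hypothetical, and real de novo calling must also account for sequencing errors and coverage.

```python
from itertools import product
from typing import Tuple

Genotype = Tuple[str, str]


def is_mendelian_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """Return True if the child's genotype can be formed by taking
    one allele from the father and one allele from the mother."""
    child_sorted = tuple(sorted(child))
    for paternal, maternal in product(father, mother):
        if tuple(sorted((paternal, maternal))) == child_sorted:
            return True
    return False


# Child carries a G allele that neither parent has: a candidate de novo event
# (or a genotyping error) that deserves a closer look.
print(is_mendelian_consistent(("A", "G"), ("A", "A"), ("A", "A")))

# Child's genotype is fully explained by the parents.
print(is_mendelian_consistent(("A", "G"), ("A", "A"), ("G", "G")))
```

Sites that fail this check in a trio are the starting candidates for de novo mutations, which is something that cannot be assessed when sequencing unrelated individuals.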

The White Tigers, the White Gorillas and the Green Turtle

We have been quite busy in the past weeks setting up our own NGS sequencing lab, but we haven't stopped looking around for genomics news and new genomes!
Three articles have been published recently: the complete sequences of two species of turtle, giving insight into the evolution of their peculiar body structure; the complete genome sequence of the only known white gorilla; and a large sequencing and genome-wide association study to identify the cause of albinism in white tigers.

The first study, which appeared at the end of April in Nature Genetics (The draft genomes of soft-shell turtle and green sea turtle yield insights into the development and evolution of the turtle-specific body plan. Wang et al.), describes the complete genome assembly of two kinds of turtle (P. sinensis and C. mydas) and also reports interesting results from embryo studies and comparative developmental studies against chicken embryos. Taken together, these data give a comprehensive picture of turtle evolution and new insight into the crucial factors driving their peculiar body structure. From extensive analysis of embryo gene expression and comparison with chicken, the authors found that turtle development initially follows the common vertebrate pattern but then diverges after stage TK11. They were also able to identify a set of 233 genes that may be crucial for the turtle-specific body plan.

"Taken together, these results suggest that turtles indeed conform to the developmental hourglass model (Supplementary Fig. 15) by first establishing an ancient vertebrate body plan and by developing turtle-specific characteristics thereafter. The above results suggest that turtle-specific global repatterning of gene regulation begins after TK11 or the phylotypic period. Although turtle and chicken express many shared developmental genes in the embryo during the putative phylotypic period (Fig. 4a and Supplementary Tables 27 and 28) and have the fewest expanded or contracted gene family members expressed (Supplementary Fig. 16) at this stage, later stages showed increasing differences in their molecular patterns. We found 233 genes that showed turtle-specific increasing expression patterns after the phylotype (Fig. 4b). Considering that the chicken orthologs did not show this type of increasing expression (Supplementary Figs. 17 and 18), these 233 genes represent attractive candidates for clarifying the genomic nature of turtle-specific morphological oddities"

This paper combines different techniques and involves a lot of NGS experiments. First of all, the genome sequencing of the two specimens was conducted by paired-end sequencing on the Illumina HiSeq 2000, using both short and long insert libraries, for a median coverage of 105X and 82X (estimated genome size of about 2.2 Gb). The complete set of turtle transcripts was assessed by computational analysis, but also supported by RNA-Seq data of the soft-shell turtle generated with three different approaches: Titanium sequencing, Illumina strand-specific RNA-seq and Illumina non-strand-specific RNA-seq, for a total of about 37 Gb of sequencing. Finally, the authors investigated miRNAs as well, by sequencing small RNAs on the Illumina platform. They then used computational tools and comparison with known miRNAs from other species to infer potential binding sites and conserved miRNA species.

The other two papers deal with some more "exotic" species, trying to dissect the molecular origin of albinism in a white gorilla known as Snowflake and in a family of white tigers raised in captivity. Oculocutaneous albinism in humans is known to be related to mutations in the SLC45A2 gene, and the authors found that this is also the case in the two species considered.
The complete sequence of the white gorilla was published at the end of May in BMC Genomics (The genome sequencing of an albino Western lowland gorilla reveals inbreeding in the wild. Prado-Martinez et al.). The authors reported the 19X whole genome sequencing of the only known white gorilla and compared these data with the human reference genome and with two other already sequenced gorilla genomes, searching for SNVs in albinism-related genes. Data analysis led to the identification of a missense mutation in the SLC45A2 gene resulting in the G518R amino acid substitution, which is expected to alter the function of the protein channel and thus lead to albinism in the white gorilla. Interestingly, the genome of Snowflake turned out to contain large runs of homozygosity (ROH), suggesting that the defect may result from a high degree of inbreeding.

The study on white tigers was published at the beginning of June in Current Biology (The genetic basis of white tigers. Xu et al.). The authors performed an extensive genomic analysis on a family of 16 tigers raised in captivity to track down the molecular defect responsible for the white coat in this species. Applying a combined approach based on genome-wide association mapping with restriction-site-associated DNA sequencing (RAD-seq), followed by whole-genome sequencing (WGS) of the three parents, they identified the amino acid substitution A477V in the SLC45A2 gene as the causative mutation. This finding was confirmed by validation in 130 unrelated tigers and by three-dimensional homology modeling, which suggests that the substitution may partially block the transporter channel cavity.