Thursday, 30 October 2014

Exome Aggregation Consortium releases its data on 63,000 exomes!

On October 29th, the Exome Aggregation Consortium released its browser, based on the impressive number of 63,000 human exomes.
This database is the largest collection of human exome data so far and provides both a web-based interface to retrieve variants in your gene of interest and a downloadable VCF file containing the list of all the annotated variants.
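If you go for the VCF download rather than the web interface, variants in a gene of interest can be pulled out with a few lines of code. A minimal sketch of a VCF record parser (the field layout follows the VCF specification; the sample record below is illustrative, not taken from the actual release, and real files would be opened with gzip.open):

```python
import gzip  # real ExAC VCFs are gzipped: open with gzip.open(path, "rt")

def parse_vcf_records(lines):
    """Yield (chrom, pos, ref, alt, info_dict) tuples from VCF text lines,
    skipping header lines that start with '#'."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alts, _qual, _filter, info = fields[:8]
        # INFO is a ';'-separated list of KEY=VALUE pairs (or bare flags)
        info_dict = dict(
            kv.split("=", 1) if "=" in kv else (kv, True)
            for kv in info.split(";")
        )
        # multi-allelic sites list several ALT alleles separated by commas
        for alt in alts.split(","):
            yield chrom, int(pos), ref, alt, info_dict

# Example on a minimal, made-up VCF fragment:
sample = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t13372\t.\tG\tC\t608.91\tPASS\tAC=3;AN=8906;AF=3.369e-04",
]
for chrom, pos, ref, alt, info in parse_vcf_records(sample):
    print(chrom, pos, ref, alt, info["AF"])  # → 1 13372 G C 3.369e-04
```

From there, filtering on allele frequency (the AF field) or on the gene annotations in INFO is straightforward.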

The final dataset is based on sequences from several consortia working on complex disorders and also includes 1000G and ESP6500 data.

The first aim of the Consortium is to study the distribution of "human knockouts", that is, people having both copies of a given gene inactivated by severe mutations. The analysis of associated phenotype data promises to reveal a lot of interesting information on the actual role of single human genes. Moreover, the study of subjects carrying inactivating mutations in known disease genes but not showing the expected phenotype could lead to the identification of new therapeutic targets.

See more information on Nature News and Genome Web!

Tuesday, 2 September 2014

PubMed Highlight: New release of ENCODE and modENCODE

Five papers that summarize the latest data from the ENCODE and modENCODE consortia have recently been published in Nature. Together, the publications add more than 1,600 new data sets, bringing the total number of data sets from ENCODE and modENCODE to around 3,300.

The growth of ENCODE and modENCODE data sets.

The authors analyze RNA-Seq data produced in the three species, and an extensive effort was made in Drosophila to investigate genes expressed only in specific tissues, at specific developmental stages or only after specific perturbations. The analysis also identified many new candidate long non-coding RNAs, including ones that overlap with previously defined mutations associated with developmental defects.
Other data sets derive from chromatin binding assays focused on transcription-regulatory factors in human cell lines, Drosophila and C. elegans, and from studies of DNA accessibility and certain modifications to histone proteins. These new chromatin data sets led to the identification of several features common to the three species, such as shared histone-modification patterns around genes and regulatory regions.
The new transcriptome data sets will result in more precise gene annotations in all three species, which should be released soon. Access to the data on chromatin features, regulatory-factor binding sites and regulatory-element predictions seems more difficult: we have to wait for them to be integrated into user-friendly portals for data visualization and flexible analyses. The UCSC Genome Browser, Ensembl and the ENCODE consortium are all working to provide a solution.

Meanwhile, take a look at the papers:
  • Diversity and dynamics of the Drosophila transcriptome
  • Regulatory analysis of the C. elegans genome with spatiotemporal resolution
  • Comparative analysis of metazoan chromatin organization

Monday, 14 July 2014

New challenges in NGS

About a decade after the first appearance of NGS, we have seen incredible improvements in throughput, accuracy and analysis methods, and sequencing is now more widespread and easier to achieve, even for small labs. Researchers have produced tons of sequencing data, and the new technology has allowed us to investigate DNA and human genomic variation at unprecedented scale and precision.
However, beside the milestones achieved, we now have to deal with new challenges that were largely underestimated in the early days of NGS.

MassGenomics has a nice blog post outlining the main ones, which I report here:

Data Storage. 
Where do we put all the data from large genomic sequencing projects? Can we afford the cost of storing everything, or do we have to be more selective about what we keep on our hard drives?

Statistical significance.
GWAS have shown us that large numbers, on the order of tens of thousands of samples, are needed to achieve statistical significance in association studies, particularly for common diseases. Even at the present low price of $1,000 per genome, such a sequencing project would require around $10 million. So we can either reduce our sample size (and thus significance) or create mega-consortia, with all the management issues that entails.
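The arithmetic behind that figure is simple enough to write down (the sample size and per-genome price are the round numbers quoted above, assumptions rather than a real project budget):

```python
# Back-of-the-envelope cost of a well-powered sequencing study,
# using the round numbers quoted above (assumptions, not a real budget).
samples_needed = 10_000      # order of magnitude suggested by GWAS experience
cost_per_genome = 1_000      # the much-advertised "$1,000 genome", in dollars

total_cost = samples_needed * cost_per_genome
print(f"${total_cost:,}")  # → $10,000,000
```

And that covers sequencing alone, before storage, analysis and sample collection costs.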

Samples have become precious resources.
In the present scenario, sequencing power is no longer a limitation. The real problem is finding enough well-characterized samples to sequence!

Functional validation.
Whole genome and whole exome approaches let researchers rapidly identify new variants potentially related to phenotypes. But which of them are truly relevant? Our present knowledge does not allow for a confident prediction of the functional impact of genetic variation, and thus functional studies are often needed to assess the actual role of each variant. These studies, based on cellular or animal models, can be expensive and complicated.

Privacy.
With the large and increasing amount of genomic data available to the community, and with studies showing that people's ancestry and location can be traced using them (at least in a proportion of cases), there are concerns about how "anonymous" these kinds of data can really be. This is going to become a real problem as more and more genomes are sequenced.

Friday, 4 July 2014

PubMed highlight: Literome helps you find relevant papers in the "genomic" literature

This tool mines the "genomic" literature for your gene of interest and reports a list of interactions with other genes, also specifying the kind of relation (inhibit, activate, regulate...). It can also search for a SNP and find phenotypes associated with it by GWAS. You can then filter the results and also report whether the listed interactions are actually real or not.

Good stuff to quickly identify relevant papers in the large amount of genomic research!

Literome: PubMed-scale genomic knowledge base in the cloud

Hoifung Poon, Chris Quirk, Charlie DeZiel and David Heckerman

Motivation: Advances in sequencing technology have led to an exponential growth of genomics data, yet it remains a formidable challenge to interpret such data for identifying disease genes and drug targets. There has been increasing interest in adopting a systems approach that incorporates prior knowledge such as gene networks and genotype–phenotype associations. The majority of such knowledge resides in text such as journal publications, which has been undergoing its own exponential growth. It has thus become a significant bottleneck to identify relevant knowledge for genomic interpretation as well as to keep up with new genomics findings.
Results: In the Literome project, we have developed an automatic curation system to extract genomic knowledge from PubMed articles and made this knowledge available in the cloud with a Web site to facilitate browsing, searching and reasoning. Currently, Literome focuses on two types of knowledge most pertinent to genomic medicine: directed genic interactions such as pathways and genotype–phenotype associations. Users can search for interacting genes and the nature of the interactions, as well as diseases and drugs associated with a single nucleotide polymorphism or gene. Users can also search for indirect connections between two entities, e.g. a gene and a disease might be linked because an interacting gene is associated with a related disease.

Availability and implementation: Literome is freely available at Download for non-commercial use is available via Web services.

Wednesday, 25 June 2014

National Children's Study stopped, waiting for revisions

One of the most ambitious projects, and one of the few attempts to really perform "personal genomics", is (or should I say was) the National Children's Study (NCS), supported by the NIH and the US government.

The project tries to investigate the relationship between genomic and environmental factors, to define their impact on human life and to determine what advantages this kind of genomic screening could provide for human health. This massive longitudinal project would sequence the genomes of 100,000 US babies and collect loads of environmental, lifestyle and medical data on them until the age of 21.
However, the NIH director, Francis Collins, has recently announced that the project will be halted pending a detailed review of the methodologies applied and of the feasibility of completing it in its present form. A few key questions have to be addressed: Is the study actually feasible, particularly in light of budget constraints? If so, what changes need to be made? If not, are there other methods for answering the key research questions the study was designed to address?

As GenomeWeb reports, the National Academy of Sciences (NAS) released a report saying the NCS needs some major changes to its design, management and oversight. The NAS recommendations include making some changes to the core hypotheses behind the study, beefing up scientific input and oversight, and enrolling the subjects during pregnancy, instead of at birth as the current plan foresees.

Monday, 23 June 2014

One banana a day will keep the doctor away

According to GenomeWeb and The Guardian, researchers from Australia are tweaking the genome of the banana in order to get it to deliver higher levels of vitamin A. The study aims to supplement vitamin A in Uganda and other similar populations, where the banana is one of the main food sources and vitamin A deficiency causes blindness and death in children.

The group of professor James Dale, from the Queensland University of Technology, received a $10 million grant from the Bill and Melinda Gates Foundation to support this nine-year project.

Dale said that by 2020 vitamin A-enriched banana varieties would be grown by farmers in Uganda, where about 70% of the population survive on the fruit.

The genome of a baby sequenced before birth raises questions on the opportunities and pitfalls of genome screening

Khan, a graduate student at the University of California, Davis, and a blogger at The Unz Review, decided that he wanted detailed genetic information on his child as soon as he knew that his wife was pregnant. After a genetic test for chromosomal abnormalities, he asked to have the DNA sample back and managed to have the baby's genome sequenced on one of the university's NGS instruments.

MIT Technology Review reports on the whole story, and Khan tells of the many difficulties he faced in getting the genome sequencing done. Most of the medical staff tried to discourage him from performing this kind of test, afraid that the couple could make irreversible decisions, such as pregnancy termination, based on the presence of putative deleterious mutations in the baby's genome. This case raises again the question of how much information can be extracted from a single genome, which part of this information is really useful for medical care and which part is actionable nowadays.

It seems to me that, by now, our ability to robustly correlate genotypes to phenotypes is still limited. This is due to incomplete knowledge about causative and risk-associated mutations, as well as about the molecular and genetic mechanisms that lead from genetic variants to phenotypes. Studies in recent years have demonstrated that this path is not straightforward and that actual phenotypes often depend on the interaction of several genetic components and regulatory mechanisms, leaving aside the environmental factors.
Several disease mutations show incomplete penetrance, and many examples exist of variants linked to phenotypes only in specific populations, so a reliable interpretation of genomic data still seems far away.
However, many decisions can be made knowing your DNA sequence, and this information will become even more interesting as researchers continue to find new associations and elucidate genotype-phenotype correlation mechanisms.
Moreover, if the public health service continues to stand against whole genome screening, people will soon turn to private companies, which can already provide this kind of service. This policy will thus increase the risk of incomplete or misleading interpretations, without any kind of support from medical staff.
A lot remains to be discussed from a practical and ethical point of view, but we have to face the reality that these kinds of tests are going to become easily accessible in the near future, so we also have to find a way to provide correct information to the subjects analyzed.
The topic of genomic risk assessment in healthy people has also been discussed recently in the New England Journal of Medicine, which published a review on clinical whole exome and whole genome sequencing. The journal also presented the hypothetical scenario of a subject who discovers some cancer-affected relatives and wants to undergo genetic testing. They propose two strategies, a gene panel or whole exome/genome sequencing, and the case is open for readers to comment on, with even a poll to vote for your preferred solution.

PubMed Highlight: Complete review of computational biology free courses

This paper is a great resource for anyone looking to get started in computational biology, or just looking for an insight into specific topics ranging from natural language processing to evolutionary theory. The author describes hundreds of video courses that are foundational to a good understanding of computational biology and bioinformatics. The table of contents breaks the curriculum down into 11 "departments", with links to online courses in each subject area:
  • Mathematics Department
  • Computer Science Department
  • Data Science Department
  • Chemistry Department
  • Biology Department
  • Computational Biology Department
  • Evolutionary Biology Department
  • Systems Biology Department
  • Neurosciences Department
  • Translational Sciences Department
  • Humanities Department

Listings in the catalog can take one of three forms: Courses, Current Topics, or Seminars. All listed courses are video-based and free of charge. The author has tested most of the courses, having enrolled in up to a dozen at a time, and shares his experience in this paper, so you can find commentary on the importance of each subject and an opinion on the quality of instruction. For the courses the author completed, listings have an "evaluation" section, which rates the course on difficulty, time requirements, lecture/homework effectiveness, assessment quality and overall impression. Finally, there are also autobiographical annotations reporting why the courses proved useful in a bioinformatics career.

Don't miss this!

PubMed Highlight: VarMod, modelling the functional effects of non-synonymous variants

In Nucleic Acids Research, authors from the University of Kent have published the VarMod tool. By incorporating protein sequence and structural features into non-synonymous variant analysis, their Variant Modeller method provides clues to understanding genotype effects on phenotype, the study authors note. Their proof-of-principle analysis of nearly 3,000 such variants suggests VarMod predicts functional and structural effects with accuracy on par with that offered by the PolyPhen-2 tool.

Unravelling the genotype–phenotype relationship in humans remains a challenging task in genomics studies. Recent advances in sequencing technologies mean there are now thousands of sequenced human genomes, revealing millions of single nucleotide variants (SNVs). For non-synonymous SNVs present in proteins the difficulties of the problem lie in first identifying those nsSNVs that result in a functional change in the protein among the many non-functional variants and in turn linking this functional change to phenotype. Here we present VarMod (Variant Modeller) a method that utilises both protein sequence and structural features to predict nsSNVs that alter protein function. VarMod develops recent observations that functional nsSNVs are enriched at protein–protein interfaces and protein–ligand binding sites and uses these characteristics to make predictions. In benchmarking on a set of nearly 3000 nsSNVs VarMod performance is comparable to an existing state of the art method. The VarMod web server provides extensive resources to investigate the sequence and structural features associated with the predictions including visualisation of protein models and complexes via an interactive JSmol molecular viewer. VarMod is available for use at

First user reports on Oxford Nanopore MinION!

After the start of the early access program, the sequencing community is waiting for the first results and comments on the MinION platform by Oxford Nanopore. This sequencer promises to revolutionize the field and is the first nanopore-based sequencer to have reached the market.

Nick Loman, one of the early customers, has now reported the first results obtained on the new platform: an 8.5 Kb read from P. aeruginosa, showing that MinION can produce useful data even if the accuracy remains low. Analyses of the read by two bioinformatics researchers, who used different alignment tools and posted their results here and here, showed that the read is about 68 percent identical to the P. aeruginosa genome and has many errors, particularly gaps. The main issues seem to lie in the basecalling software, but Oxford Nanopore is working hard to improve it. According also to Konrad Paszkiewicz, another early customer, the device itself is really robust and easy to use, and the library preparation procedure is simple, resulting in low sequencing costs.
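Figures like the 68 percent identity above come from aligning the read to the reference and counting matching columns. A minimal sketch of that computation on a toy alignment (real pipelines derive the numbers from the aligner's output, e.g. CIGAR strings, rather than from aligned strings like these):

```python
def percent_identity(aln_read, aln_ref):
    """Percent identity of a pairwise alignment: matching columns over
    total alignment columns, counting gap columns ('-') as errors, which
    is how misalignments and indels drag down identity for noisy reads."""
    assert len(aln_read) == len(aln_ref), "aligned strings must be same length"
    matches = sum(1 for a, b in zip(aln_read, aln_ref)
                  if a == b and a != "-")
    return 100.0 * matches / len(aln_read)

# Toy example: 8 matching columns out of 10, one gap in each sequence.
read_aln = "ACGT-ACGTA"
ref_aln  = "ACGTTACG-A"
print(percent_identity(read_aln, ref_aln))  # → 80.0
```

On a real nanopore read, the same counting over thousands of columns, gaps included, is what yields an identity figure in the high sixties.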
The mean read length seems to be about 10 Kb, but users have reported even longer reads, up to 40 Kb, covering the entire lambda genome used for testing. So the read length is really promising and places the mature MinION as a natural competitor to PacBio.
The use of MinION seems straightforward: after plugging the sequencer into a USB 3.0 port of a computer, it installs the MinKnow software suite. A program called Metrichor uploads the raw data – ion current traces – to the Amazon Elastic Compute Cloud, where base-calling happens, either 1D base-calling for unidirectional reads or 2D base-calling for bidirectional reads.
Overall, improvements have to be made to the base-calling software, reliability of the flow cells, and library shelf-life, and new software needs to be developed by the community to take advantage of the MinION reads.  Oxford Nanopore said a new chemistry will be available in the coming months, which might include some of these improvements.

In the meantime, many other early access users contacted by the IS website are awaiting the arrival of reagents, are in the midst of burn-in, or have run their own samples but are not ready to talk about their results yet. So we expect many more data, comments and detailed estimates of platform accuracy and data production to come out in the next months! The new MinION has fulfilled expectations in this first test, and there is a lot of promise in this new technology... maybe a new revolution in the field is ready to come!

Other details can be found in this post on GenomeWeb.