Tuesday, 2 September 2014

PubMed Highlight: New relese of ENCODE and modENCODE

Five papers that summarize the latest data from ENCODE and modENCODE consortia have recently been published on Nature. Together, the publications add more than 1,600 new data sets, bringing the total number of data sets from ENCODE and modENCODE to around 3,300.

The growth of ENCODE and modENCODE data sets.

The authors analyze RNA-Seq data produced in the three species and an extensive effort was conducted in Drosophila to investigate genes expressed only in specific tissue, developmental stages or only after specific perturbations.  The analysis also identified many new candidate long non-coding RNAs, including ones that overlap with previously defined mutations that have been associated with developmental defects.
Other data sets derive from chromatin binding assays focused on transcription-regulatory factors in human cell lines, Drosophila and C. elegans; and on study of DNA accessibility and certain modifications to histone proteins. These new chromatin data sets led to identification of several features common to the three species, such as shared histone-modification patterns around genes and regulatory regions.
The new transcriptome data sets will result in more precise gene annotations in all three species, which should be released soon. The access to the data on chromatin features, regulatory-factor binding sites, and the regulatory-element predictions seem more difficult. We have to wait for them to be integrated in user-friendly portals for data visualization and flexible analyses. The UCSC Genome Browser, Ensembl, ENCODE consortium are all working to provide the solution.

Meanwhile take a look to the papers:
Diversity and dynamics of the Drosophila transcriptome Regulatory analysis of the C. elegans genome with spatiotemporal resolution Comparative analysis of metazoan chromatin organization

Monday, 14 July 2014

New challenges in NGS

After about a decade from the first appearance if NGS sequencing we have seen incredible improvements in throughput, accuracy and analysis methods and sequencing is now more diffused and easy to achieve also for small labs. Researchers have produced tons of sequencing data and the new technology allowed us to investigate DNA and human genomic variations at unprecedent scale and precision.
However, beside the milestones achieved, we have now to deal with new challenges that were largely underestimated in the early days of NGS.

MassGenomics has a nice blog post underlining the main ones, that I reported here:

Data Storage. 
Where do we put all those data from large genomic sequencing projects? Can we afford the cost of store everything or we have to be more selectively on what to keep in our hard drives?

Statistical significance.
GWAS studies have showed us that large numbers, in the order of 10 thousands of samples, are needed to achieve statistical significance for association studies, particularly for common diseases. Even when you consider the present low price of 1,000$ / genome it will require around 10 millions $ for such a sequencing project. So we can reduce our sample size (and thus significance) or create mega consortium with all the managing issues.

Samples became precious resources.
In the present scenario sequencing power is not longer a limitation. The real matter is find enough well-characterized samples to sequence!

Functional validation.
Whole genome and whole exome approaches let researchers to rapidly identify new variants potentially related to phenotypes. But which of them are truly relevant? Our present knowledge do not allow for a confident prediction of functional impact of genetic variation and thus functional studies are often needed to assess the actual role of each variants. These studies, based on cellular models or animal models, could be expensive and complicated.

With large and increasing amount of genomic data available to the community and studies showing that people ancestry and living location could be traced using them (at least in a proportion of cases), there are concerns about how "anonymous" these kind of data could really be. This is going to became a real problem has more and more genomes are sequenced.

Friday, 4 July 2014

PubMed highlight: Literome help you find relevant papers in the "genomic" literature

This tool mines the "genomic" literature for your gene of interest and reports a list of interactions with other genes, specifying also the kind of the relation (inhibit, activate, regulate...). It can also search for a SNP and find phenotypes associated to it by GWAS. You can then filter the results and also report if the listed interactions are actually real or not.

Good stuff to quickly identify relevant papers in the large amount of genomic researches!

Literome: PubMed-scale genomic knowledge base in the cloud

Hoifung Poon, Chris Quirk, Charlie DeZiel and David Heckerman

Motivation: Advances in sequencing technology have led to an exponential growth of genomics data, yet it remains a formidable challenge to interpret such data for identifying disease genes and drug targets. There has been increasing interest in adopting a systems approach that incorporates prior knowledge such as gene networks and genotype–phenotype associations. The majority of such knowledge resides in text such as journal publications, which has been undergoing its own exponential growth. It has thus become a significant bottleneck to identify relevant knowledge for genomic interpretation as well as to keep up with new genomics findings.
Results: In the Literome project, we have developed an automatic curation system to extract genomic knowledge from PubMed articles and made this knowledge available in the cloud with a Web site to facilitate browsing, searching and reasoning. Currently, Literome focuses on two types of knowledge most pertinent to genomic medicine: directed genic interactions such as pathways and genotype–phenotype associations. Users can search for interacting genes and the nature of the interactions, as well as diseases and drugs associated with a single nucleotide polymorphism or gene. Users can also search for indirect connections between two entities, e.g. a gene and a disease might be linked because an interacting gene is associated with a related disease.

Availability and implementation: Literome is freely available at Download for non-commercial use is available via Web services.

Wednesday, 25 June 2014

National Children's Study stopped, waiting for revisions

One of the most ambitious project and one of the few attempts to really perform "personal genomics", is (or I may say was) the National Children's Study (NCS) sustained by NIH and the US government.

The project try to investigate the relation between genomics and environmental factors to define their impact on human life and define which advantages this kind of genomic screening could provide for the human health. The massive longitudinal project that would sequence the genomes of 100,000 US babies and collect loads of environmental, lifestyle, and medical data on them until the age of 21.
However the NIH director, Francis Collins, has recently announced that the project will be stopped waiting for a detailed review on the methodologies applied and the opportunity to complete it in its present form. Few key questions has to be addressed: Is the study actually feasible, particularly in light of budget constraints? If so, what changes need to be made? If not, are there other methods for answering the key research questions the study was designed to address?

As GenomeWeb reports, National Academy of Sciences (NAS) released a report saying the NCS needs some major changes to its design, management, and oversight.  The NAS recommendations include making some changes to the core hypotheses behind the study, beefing up scientific input and oversight, and enrolling the subjects during pregnancy, instead of at birth, as is the current plan.

Monday, 23 June 2014

One banana a day will keep the doctor away

According to GenomeWeb and The Guardian, researchers from Australia are tweaking the genome of the banana in order to get it to deliver higher levels of vitamin A. The study is aimed to supplement vitamin A in Uganda and other similar population, where banana is one of the main food sources and deficiency in vitamin A cause blindness and death in children.

The group of professor James Dale, from the Queensland University of Technology, received a $10 million grant from the Bill and Melinda Gates Foundation to support this 9 year project.

Dale said that by 2020 vitamin A-enriched banana varieties would be grown by farmers in Uganda, where about 70% of the population survive on the fruit.

The genome of a baby sequenced before birth raises questions on the opportunity and pitfalls of genome screening

Khan, graduate student at the University of California, Davis, and blogger at The Unz Review, decided that he wanted detailed genetic information on his child as soon as he knew that his wife was pregnant. After a genetic test for chromosomal abnormalities he asked to have the DNA sample back and managed to have the baby's genome sequenced on one of the University NGS instruments.

MIT technology review reports about the whole story and Khan tells on the many difficulties he faced to have the genome sequencing done. Most of the medical staff tried to discourage him from performing this kind of test, afraid that the couple could take irrevocable decisions, such as pregnancy termination, based on the presence of putative deleterious mutations in the baby's genome. This case raised again the question of how much information could be extracted from a single genome, which part of this information is really useful on a medical care basis and which part is actionable in nowadays.

It seems to me that by now our ability to robustly correlate genotypes to phenotypes is still scarce. This is due to incomplete knowledge about causative and risk associated mutations as well as on the molecular and genetic mechanisms that lead from genetic variants to phenotypes. Studies in the last years have demonstrated that this path is not straightforward and the actual phenotypes often depend on the interaction of several genetic components and regulatory mechanisms, living aside the environmental factors.
Several disease mutations show incomplete penetrance and many example exist of variants linked to phenotypes only in specific populations, so a reliable interpretation of genomic data seems far away by now.
However, many decision can be made knowing your DNA sequence and this information will become even more interesting as researchers continue to find new associations and elucidate genotype-phenotype correlation mechanisms.
Moreover, if the health public service continues to stand against whole genome screening, people will soon turn to private companies, that can already provide this kind of services. This policy will thus increase the risk of incomplete or misleading interpretations without any kind of support from medical stuff.
A lot has to be discussed on the practical and ethical point of view, but we have to face the reality that since these kind of tests are going to became easily accessible in the near future, we have also to find a way to provide the correct information to the subject analyzed.
The topic of genomic risk assessment in healthy people has been recently discussed also on the New England Jornal of Medicine, that published a review on clinical whole exome and whole genome sequencing. The journal also presented the hypothetical scenario of a subject which discovers some cancer affected relatives and wants to undergo genetic testing. They propose 2 strategies, gene panel or whole exome/genome sequencing and the case is open for readers to comment with even a pool to vote for your preferred solution.  

PubMed Highlight: Complete review of computational biology free courses

This paper is a great resource for anyone looking to get started in a computational biology, or just looking to an insight on a specific topics ranging from natural language processing to evolutionary theory. The author describes hundreds of video courses that are foundational to a good understanding of computational biology and bioinformatics. The table of contents breaks the curriculum down into 11 "departments" with links to online courses in each subject area:
  • Mathematics Department
  • Computer Science Department
  • Data Science Department
  • Chemistry Department
  • Biology Department
  • Computational Biology Department
  • Evolutionary Biology Department
  • Systems Biology Department
  • Neurosciences Department
  • Translational Sciences Department
  • Humanities Department

Listings in the catalog can take one of three forms: Courses, Current Topics, or Seminars. All listed courses are video-based and free of charge. The author has tested most of the courses, having enrolled in up to a dozen at a time, and he shared his experience in this paper. So you can find commentary on the importance of the subject and an opinion on the quality of instruction. For the courses that the author completed, listings have an "evaluation" section, which ranks the course in difficulty, time requirements, lecture/homework effectiveness, assessment quality, and overall opinions. Finally there are also autobiographical annotations reporting why the courses have revealed useful in a bioinformatics career. 

Don't miss this!

PubMed Highlight: VarMod, modelling the functional effects of non-synonymous variants

On Nucleic Acid Research, authors from Uuniversity of Kent published the varmod tool. By incorporating protein sequence and structural feature cues into the non-synonymous variant analysis, their Variant Modeller method provides clues to understanding genotype effects on phenotype, the study authors note. Their proof-of-principle analysis of 3,000 such variants suggests VarMod predicts protein function and structural effects with accuracy that's on par with that offered by the PolyPhen-2 tool.

Unravelling the genotype–phenotype relationship in humans remains a challenging task in genomics studies. Recent advances in sequencing technologies mean there are now thousands of sequenced human genomes, revealing millions of single nucleotide variants (SNVs). For non-synonymous SNVs present in proteins the difficulties of the problem lie in first identifying those nsSNVs that result in a functional change in the protein among the many non-functional variants and in turn linking this functional change to phenotype. Here we present VarMod (Variant Modeller) a method that utilises both protein sequence and structural features to predict nsSNVs that alter protein function. VarMod develops recent observations that functional nsSNVs are enriched at protein–protein interfaces and protein–ligand binding sites and uses these characteristics to make predictions. In benchmarking on a set of nearly 3000 nsSNVs VarMod performance is comparable to an existing state of the art method. The VarMod web server provides extensive resources to investigate the sequence and structural features associated with the predictions including visualisation of protein models and complexes via an interactive JSmol molecular viewer. VarMod is available for use at

First user reports on Oxford Nanopore MinION!

After the start of the early access program, the sequencing community is waiting for the first results and comments on the MinION platform by Oxford Nanopore. This sequencer promises to revolutionize the field and is the first nanopore based sequencer that have reached the market.

Nick Loman, one of the early customer has now reported the first results obtained on the new platform. It is a 8.5 Kb read from P. aeruginosa showing that MinION can produce useful data, even if the accuracy remains low. Analyses of a read by two bioinformatics researchers, who used different alignment tools and posted their results here and here, showed that the read is about 68 percent identical to the P. aeruginosa genome and has many errors, particularly gaps. Main issues seems to be in the basecalling software, but Oxford Nanopore is working hard to improve it. Accordingly also to Konrad Paszkiewicz, another early customer, the device itself is really robust and easy to use and the library preparation procedure is simple, resulting in low sequencing costs.
The mean read length seems to be about 10 Kb, but users reported even longer reads up to 40 Kb, covering the entire lambda genome used for testing. So the read length is really promising and place the mature MinION as a natural competitor to PacBio.
The use of MinION seems straightforward: after plugging the sequencer into a USB 3.0 port of a computer, it installs the MinKnow software suite. A program called Metrichor uploads the raw data – ion current traces – to the Amazon Elastic Compute Cloud, where base-calling happens, either 1D base-calling for unidirectional reads or 2D base-calling for bidirectional reads.
Overall, improvements have to be made to the base-calling software, reliability of the flow cells, and library shelf-life, and new software needs to be developed by the community to take advantage of the MinION reads.  Oxford Nanopore said a new chemistry will be available in the coming months, which might include some of these improvements.

In the meantime, many other early access users contacted by IS website are awaiting the arrival of reagents, are in the midst of burn-in, or have run their own samples but are not ready to talk about their results yet. So we are expecting many more data and comments and detailed estimation on platform accuracy and data production to be out in the next months! The new minion has fulfilled the excpection in this first test and there is a lot promising about this new technology...maybe a new revolution in the field is ready to come!

Other details can be found on this post on GenomeWeb.

Insects, sheep, polar bears, crow, beans and eucalyptus...all the genomes you want!

I'm always amazed by the explosion of new species genomes since the introduction of NGS. In the last two years the sequencing and assembly of genomes from various animals and plants have accelerated even more and focused also on "exotic" species, so much that now we have almost a new genome per month! All these data can tell us a lot on basic mechanisms of evolution and provide information to study how complex biological processes have developed and why they act the way we see now. Moreover, many species have peculiar properties and produce biopeptides or other biological molecules that could be useful for life science and medicine.
So, here is a quick update of what has been published in the last months!

The amazing spiderman: Social velvet and tarantula genomes to study silk and venom
Authors from BGI-Shenzhen and the Aarhus University reported on Nature Communication the assembly of the full genome of social velvet spider and tarantula spider. Besides the genome sequencing and analysis, authors also performed transcriptome sequencing and proteomic analysis by mass spectroscopy. A de novo assembly of the velvet spider (S. mimosarum) was generated from 91 × coverage sequencing of paired end and mate pair libraries and assembled into contigs and scaffolds spanning 2.55 Gb. Integrating also transcriptome data authors reconstructed a gene set of 27,235 protein-coding gene models. Approximately 400 gene models had no homology to known proteins but were supported by proteomic evidence, identifying putative ‘spider’-specific proteins. The exon-intron structure, unlike other arthropod genomes, is characterized by and intron-exon structure very similar to the human genome. The size estimate of the tarantula genome is about 6 Gb and was sequenced at 40 × coverage from a single female A. geniculata using a similar combination of paired end and mate pair libraries as for the velvet spider. Authors sequenced proteins from different spider tissues (venom, thorax, abdomen, haemolymph and silk), identifying 120 proteins in venom, 15 proteins in silk and 2,122 proteins from body fluid and tissue samples, for a total of 2,193 tarantula proteins. Introns were found to be longer than those of the velvet spider.
Combining three different omics approaches the paper reconstructed species specific gene duplication and the set of peculiar proteins involved in spiders silk and venom. The analysis revealed enrichment in cysteine-rich peptides with neurotoxic effect and proteases, that specifically activate protoxins in the venom of spiders.

Stick insects: a large sequencing effort to study evolution and speciation
In this paper published on Science (it appears on the cover magazine), the authors performed whole genome sequencing on several subjects from different populations of stick insects to investigate the role and mechanism of action of selection in adaptation and parallel speciation. Researchers performed a parallel experiment moving four groups of individuals from the original population from their natural host plant to a new one. They sequenced them and their first offspring generation and analyzed genomic variations and their role in adaptation to the new environment. Comparing genomic changes in the four groups allow analysis of parallel speciation and the genomic mechanisms behind the scene.

Polar bear genome: population genomics to dissect adaptation to extreme environments
On this paper from Cell, authors reconstructed a draft assembly of the polar bear genome and then analyzed 89 complete genomes of polar bear and brown bear using population genomic modeling. Results show that the species diverged 479–343 thousand years ago and that the polar bear lineage have been under stronger positive selection than the brown bears. Several genes specifically selected in polar bears are associated with cardiomyopathy and vascular disease, implying important reorganization of the cardiovascular system. Another group of genes showing strong evidence of selection are those related to lipid metabolism, transport and storage, like APOB. Functional mutations in this gene may explain how polar bears are able to cope with life-long elevated LDL levels.

Sheep genome: now all the major livestock animals have their genome sequence
Researchers from the international sheep genomics consortium published on Science the first complete assembly of the sheep genome. The team build an assembly that spans 2.61 billion bases of the sheep genome to an average depth of around 150-fold. That assembly covers around 99 percent of the sheep's 26 autosomal chromosomes and X chromosome. In addition to the high-quality reference genome, the team generated transcriptome sequences representing 40 sheep tissues, which contributed to its subsequent analysis of sheep features. Like cattle, sheep are known for feed on plants and deriving useful proteins from lignocellulose-laden material with the help of fermentation and microbes in the rumen. Specialized features of the sheep metabolism go to work on volatile fatty acids that gut bugs produce during that process and other adaptations on fatty acid metabolism features seem to feed into the production of wool fibers, which contain lanolin formed from waxy ester molecules. By adding in transcript sequence data for almost 100 samples taken from 40 sheep tissue types, the researchers looked at the protein-coding genes present in the sheep genome and their relationship to those found in 11 other ruminant and non-ruminant mammals.

Two Crow species: genomes reveal what make them look different
Researchers published on Science a genomic study on two crow species, the all-black carrion crow and the gray-coated hooded crow — and find that a very small percentage of the birds' genes are responsible for their different looks. Researchers started by assembling the high-quality reference genome for the hooded crow species C. cornix. The 16.4-million-base assembly — covered to an average depth of 152-fold — contained nearly 20,800 predicted protein-coding genes. The team then resequenced the genomes of 60 hooded or carrion crows at average depths of between 7.1- and 28.6-fold apiece, identifying more than 5.27 million SNPs shared between the two species and more than 8.4 million SNPs in total. Comparison of the two species genomes revealed that varied expression of less than 0.28 percent of the entire genome was enough to maintain different coloration between the two species. This particular 1.95 megabase pair-long area of the genome is located on the avian chromosome 18, and it harbors genes associated with pigment coloration, visual perception, and hormonal balance. Together, the team's findings hint that distinctive physical features are maintained in hooded and carrion crow species despite gene flow across all but a fraction of the genome.

Eucalyptus genome: tandem duplications and essential oils encoded in the DNA
An international team published on Nature a reference genome for the eucalyptus tree. The researchers used whole-genome Sanger sequencing to build the genome assembly of an E. grandis representative belonging to the BRASUZI genotype. Using those sequences, together with bacterial artificial chromosome sequences and a genetic linkage map, the team covered more than 94 percent of the plant's predicted 640 million base sequence at an average depth of nearly seven-fold. To facilitate transcripts identification, they added RNA sequences representing different eucalyptus tissue types and developmental stages and reconstructed 36,376 predicted protein-coding eucalyptus genes. The genomes of a sub-tropical representative from E. grandis BRASUZI and a temperate eucalyptus species called E. globulus were re-sequenced with Illumina instruments. Comparison of the different genomes revealed that eucalyptus displays the greatest number of tandem duplications of any plant genome sequenced so far, and that the duplications have appear to have prioritized genes for wood formation. The plant also has the highest diversity of genes for producing various essential oils.

Common Bean genome: genomic effects of plant domestication
The reference genome for the common bean, Phaseolus vulgaris L., was recently published on Nature Genetics. Authors used a whole-genome shotgun sequencing strategy combining together linear libraries and paired libraries of varying insert sizes, sequenced with the Roche 454 platform. To these data they added 24.1 Gb of Illumina-sequenced fragment libraries and sequences from fosmid libraries and BAC libraries obtained from canonical Sanger platform for a total assembled sequence coverage level of 21.0X. The final assembly covers 473 Mb of the 587-Mb genome and 98% of this sequence is anchored in 11 chromosome-scale pseudomolecules. Using resequencing of 60 wild individuals and 100 landraces from genetically differentiated Mesoamerican and Andean gene pools, the authors performed a genome-wide analysis of dual domestications and confirmed two independent domestications from genetic pools that diverged before human colonization. They also identified a set of genes linked with increased leaf and seed size. These results identify regions of the genome that have undergone intense selection and thus provide targets for future crop improvement efforts.