
Thursday, 30 October 2014

Exome Aggregation Consortium releases its data on 63,000 exomes!


On October 29th, the Exome Aggregation Consortium (ExAC) released its browser, based on an impressive 63,000 human exomes.
This database is the largest collection of human exome data so far and provides both a web-based interface to retrieve variants in your gene of interest and a downloadable VCF file containing all the annotated variants.

The final dataset is based on sequences from several consortia working on complex disorders and also includes 1000G and ESP6500 data.
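If you download the sites VCF (bgzipped and tabix-indexed), pulling out the variants in your gene of interest takes only a few lines. Here is a minimal sketch using pysam; the file name and the region coordinates are placeholders to adapt to the actual release files and to your gene of interest.

```python
# Minimal sketch: extract variants in a region of interest from a bgzipped,
# tabix-indexed sites VCF (file name and coordinates are placeholders).
import pysam

vcf = pysam.VariantFile("ExAC.sites.vcf.gz")  # assumed local copy of the sites VCF

# Hypothetical region on chromosome 1; fetch() takes a 0-based start coordinate
for rec in vcf.fetch("1", 154999999, 155010000):
    af = rec.info.get("AF")  # allele frequency reported in the INFO field
    ac = rec.info.get("AC")  # allele count reported in the INFO field
    print(rec.chrom, rec.pos, rec.ref, ",".join(rec.alts or ()), af, ac)
```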

The first aim of the Consortium is to study the distribution of "human knockout", that is people having both copies of a given gene inactivated by severe mutations. The analysis of associated phenotype data promise to reveal lot of interesting information of the actual role of single human genes. Moreover, the study of subjects carrying inactivating mutations on known disease genes but not showing the expected phenotype could lead to identification of new therapeutic targets.

See more information on Nature News and Genome Web!

Tuesday, 2 September 2014

PubMed Highlight: New release of ENCODE and modENCODE

Five papers summarizing the latest data from the ENCODE and modENCODE consortia have recently been published in Nature. Together, the publications add more than 1,600 new data sets, bringing the total number of data sets from ENCODE and modENCODE to around 3,300.

The growth of ENCODE and modENCODE data sets.

The authors analyze RNA-Seq data produced in the three species, and an extensive effort was made in Drosophila to investigate genes expressed only in specific tissues, developmental stages or after specific perturbations. The analysis also identified many new candidate long non-coding RNAs, including ones that overlap with previously defined mutations associated with developmental defects.
Other data sets derive from chromatin-binding assays focused on transcription-regulatory factors in human cell lines, Drosophila and C. elegans, and from studies of DNA accessibility and of certain histone modifications. These new chromatin data sets led to the identification of several features common to the three species, such as shared histone-modification patterns around genes and regulatory regions.
The new transcriptome data sets will result in more precise gene annotations in all three species, which should be released soon. Access to the data on chromatin features, regulatory-factor binding sites and regulatory-element predictions seems more difficult: we will have to wait for them to be integrated into user-friendly portals for data visualization and flexible analyses. The UCSC Genome Browser, Ensembl and the ENCODE consortium are all working to provide a solution.

Meanwhile, take a look at the papers:
  • Diversity and dynamics of the Drosophila transcriptome
  • Regulatory analysis of the C. elegans genome with spatiotemporal resolution
  • Comparative analysis of metazoan chromatin organization

Monday, 14 July 2014

New challenges in NGS


About a decade after the first appearance of NGS, we have seen incredible improvements in throughput, accuracy and analysis methods, and sequencing is now more widespread and easier to achieve even for small labs. Researchers have produced tons of sequencing data, and the new technology has allowed us to investigate DNA and human genomic variation at unprecedented scale and precision.
However, besides the milestones achieved, we now have to deal with new challenges that were largely underestimated in the early days of NGS.

MassGenomics has a nice blog post outlining the main ones, which I report here:

Data Storage. 
Where do we put all the data from large genomic sequencing projects? Can we afford the cost of storing everything, or do we have to be more selective about what to keep on our hard drives?

Statistical significance.
GWAS have shown us that large numbers, on the order of tens of thousands of samples, are needed to achieve statistical significance in association studies, particularly for common diseases. Even at the current low price of $1,000 per genome, such a sequencing project would cost around $10 million. So we can either reduce our sample size (and thus significance) or create mega-consortia, with all the management issues they bring.

Samples have become precious resources.
In the present scenario sequencing power is no longer a limitation. The real issue is finding enough well-characterized samples to sequence!

Functional validation.
Whole genome and whole exome approaches let researchers rapidly identify new variants potentially related to phenotypes. But which of them are truly relevant? Our present knowledge does not allow a confident prediction of the functional impact of genetic variation, and thus functional studies are often needed to assess the actual role of each variant. These studies, based on cellular or animal models, can be expensive and complicated.

Privacy.
With a large and increasing amount of genomic data available to the community, and with studies showing that people's ancestry and place of residence can be traced from them (at least in a proportion of cases), there are concerns about how "anonymous" these kinds of data can really be. This is going to become a real problem as more and more genomes are sequenced.

Friday, 4 July 2014

PubMed highlight: Literome helps you find relevant papers in the "genomic" literature

This tool mines the "genomic" literature for your gene of interest and reports a list of interactions with other genes, also specifying the kind of relation (inhibit, activate, regulate...). It can also search for a SNP and find phenotypes associated with it by GWAS. You can then filter the results and also flag whether the listed interactions are actually real or not.

Good stuff for quickly identifying relevant papers in the huge amount of genomic research!

Literome: PubMed-scale genomic knowledge base in the cloud

Hoifung Poon, Chris Quirk, Charlie DeZiel and David Heckerman

Abstract
Motivation: Advances in sequencing technology have led to an exponential growth of genomics data, yet it remains a formidable challenge to interpret such data for identifying disease genes and drug targets. There has been increasing interest in adopting a systems approach that incorporates prior knowledge such as gene networks and genotype–phenotype associations. The majority of such knowledge resides in text such as journal publications, which has been undergoing its own exponential growth. It has thus become a significant bottleneck to identify relevant knowledge for genomic interpretation as well as to keep up with new genomics findings.
Results: In the Literome project, we have developed an automatic curation system to extract genomic knowledge from PubMed articles and made this knowledge available in the cloud with a Web site to facilitate browsing, searching and reasoning. Currently, Literome focuses on two types of knowledge most pertinent to genomic medicine: directed genic interactions such as pathways and genotype–phenotype associations. Users can search for interacting genes and the nature of the interactions, as well as diseases and drugs associated with a single nucleotide polymorphism or gene. Users can also search for indirect connections between two entities, e.g. a gene and a disease might be linked because an interacting gene is associated with a related disease.

Availability and implementation: Literome is freely available at literome.azurewebsites.net. Download for non-commercial use is available via Web services.

Wednesday, 25 June 2014

National Children's Study stopped, waiting for revisions

One of the most ambitious projects, and one of the few attempts to really perform "personal genomics", is (or should I say was) the National Children's Study (NCS), supported by the NIH and the US government.

The project aims to investigate the relationship between genomics and environmental factors, to define their impact on human life and to establish which advantages this kind of genomic screening could provide for human health. The massive longitudinal project would sequence the genomes of 100,000 US babies and collect loads of environmental, lifestyle, and medical data on them until the age of 21.
However, the NIH director, Francis Collins, has recently announced that the project will be put on hold pending a detailed review of the methodologies applied and of whether it can be completed in its present form. A few key questions have to be addressed: Is the study actually feasible, particularly in light of budget constraints? If so, what changes need to be made? If not, are there other methods for answering the key research questions the study was designed to address?

As GenomeWeb reports, the National Academy of Sciences (NAS) released a report saying the NCS needs some major changes to its design, management, and oversight. The NAS recommendations include making some changes to the core hypotheses behind the study, beefing up scientific input and oversight, and enrolling subjects during pregnancy instead of at birth, as the current plan foresees.

Monday, 23 June 2014

One banana a day will keep the doctor away

According to GenomeWeb and The Guardian, researchers from Australia are tweaking the genome of the banana in order to get it to deliver higher levels of vitamin A. The study aims to supplement vitamin A in Uganda and other similar populations, where the banana is one of the main food sources and vitamin A deficiency causes blindness and death in children.

The group of Professor James Dale, from the Queensland University of Technology, received a $10 million grant from the Bill and Melinda Gates Foundation to support this nine-year project.

Dale said that by 2020 vitamin A-enriched banana varieties would be grown by farmers in Uganda, where about 70% of the population survive on the fruit.

The genome of a baby sequenced before birth raises questions on the opportunities and pitfalls of genome screening

Khan, a graduate student at the University of California, Davis, and a blogger at The Unz Review, decided that he wanted detailed genetic information on his child as soon as he knew that his wife was pregnant. After a genetic test for chromosomal abnormalities, he asked to have the DNA sample back and managed to have the baby's genome sequenced on one of the university's NGS instruments.

MIT Technology Review reports the whole story, and Khan tells of the many difficulties he faced in getting the genome sequenced. Most of the medical staff tried to discourage him from performing this kind of test, afraid that the couple could take irrevocable decisions, such as pregnancy termination, based on the presence of putatively deleterious mutations in the baby's genome. This case raises again the question of how much information can be extracted from a single genome, which part of this information is really useful for medical care and which part is actionable nowadays.

It seems to me that, for now, our ability to robustly correlate genotypes to phenotypes is still limited. This is due to incomplete knowledge of causative and risk-associated mutations, as well as of the molecular and genetic mechanisms that lead from genetic variants to phenotypes. Studies in recent years have demonstrated that this path is not straightforward and that actual phenotypes often depend on the interaction of several genetic components and regulatory mechanisms, leaving aside environmental factors.
Several disease mutations show incomplete penetrance, and many examples exist of variants linked to phenotypes only in specific populations, so a reliable interpretation of genomic data still seems far away.
However, many decisions can be made knowing your DNA sequence, and this information will become even more interesting as researchers continue to find new associations and elucidate genotype-phenotype correlation mechanisms.
Moreover, if the public health service continues to stand against whole genome screening, people will soon turn to private companies, which can already provide this kind of service. This policy will thus increase the risk of incomplete or misleading interpretations without any kind of support from medical staff.
A lot remains to be discussed from the practical and ethical points of view, but we have to face the reality that, since these kinds of tests are going to become easily accessible in the near future, we also have to find a way to provide correct information to the subjects analyzed.
The topic of genomic risk assessment in healthy people has also recently been discussed in the New England Journal of Medicine, which published a review on clinical whole exome and whole genome sequencing. The journal also presented the hypothetical scenario of a subject who discovers some cancer-affected relatives and wants to undergo genetic testing. They propose two strategies, a gene panel or whole exome/genome sequencing, and the case is open for readers to comment on, with even a poll to vote for your preferred solution.

PubMed Highlight: A complete review of free computational biology courses

This paper is a great resource for anyone looking to get started in computational biology, or just looking for insight into specific topics ranging from natural language processing to evolutionary theory. The author describes hundreds of video courses that are foundational to a good understanding of computational biology and bioinformatics. The table of contents breaks the curriculum down into 11 "departments" with links to online courses in each subject area:
  • Mathematics Department
  • Computer Science Department
  • Data Science Department
  • Chemistry Department
  • Biology Department
  • Computational Biology Department
  • Evolutionary Biology Department
  • Systems Biology Department
  • Neurosciences Department
  • Translational Sciences Department
  • Humanities Department

Listings in the catalog can take one of three forms: Courses, Current Topics, or Seminars. All listed courses are video-based and free of charge. The author has tested most of the courses, having enrolled in up to a dozen at a time, and he shares his experience in this paper, so you can find commentary on the importance of each subject and an opinion on the quality of instruction. For the courses that the author completed, listings have an "evaluation" section, which rates the course on difficulty, time requirements, lecture/homework effectiveness, assessment quality, and overall opinion. Finally, there are also autobiographical annotations reporting why the courses have proved useful in a bioinformatics career.

Don't miss this!

PubMed Highlight: VarMod, modelling the functional effects of non-synonymous variants

In Nucleic Acids Research, authors from the University of Kent published the VarMod tool. By incorporating protein sequence and structural feature cues into non-synonymous variant analysis, their Variant Modeller method provides clues to understanding genotype effects on phenotype, the study authors note. Their proof-of-principle analysis of nearly 3,000 such variants suggests VarMod predicts effects on protein function and structure with an accuracy on par with that offered by the PolyPhen-2 tool.


Abstract
Unravelling the genotype–phenotype relationship in humans remains a challenging task in genomics studies. Recent advances in sequencing technologies mean there are now thousands of sequenced human genomes, revealing millions of single nucleotide variants (SNVs). For non-synonymous SNVs present in proteins the difficulties of the problem lie in first identifying those nsSNVs that result in a functional change in the protein among the many non-functional variants and in turn linking this functional change to phenotype. Here we present VarMod (Variant Modeller) a method that utilises both protein sequence and structural features to predict nsSNVs that alter protein function. VarMod develops recent observations that functional nsSNVs are enriched at protein–protein interfaces and protein–ligand binding sites and uses these characteristics to make predictions. In benchmarking on a set of nearly 3000 nsSNVs VarMod performance is comparable to an existing state of the art method. The VarMod web server provides extensive resources to investigate the sequence and structural features associated with the predictions including visualisation of protein models and complexes via an interactive JSmol molecular viewer. VarMod is available for use at http://www.wasslab.org/varmod.

First user reports on Oxford Nanopore MinION!

After the start of the early access program, the sequencing community is waiting for the first results and comments on the MinION platform by Oxford Nanopore. This sequencer promises to revolutionize the field and is the first nanopore-based sequencer to have reached the market.


Nick Loman, one of the early customers, has now reported the first results obtained on the new platform: an 8.5 kb read from P. aeruginosa, showing that the MinION can produce useful data even if the accuracy remains low. Analyses of a read by two bioinformatics researchers, who used different alignment tools and posted their results here and here, showed that the read is about 68 percent identical to the P. aeruginosa genome and has many errors, particularly gaps. The main issues seem to lie in the basecalling software, but Oxford Nanopore is working hard to improve it. According also to Konrad Paszkiewicz, another early customer, the device itself is really robust and easy to use, and the library preparation procedure is simple, resulting in low sequencing costs.
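For readers who want to reproduce this kind of estimate on their own data, a common rough way to compute per-read identity is from the aligner's NM tag (edit distance) and the number of aligned columns. Here is a minimal sketch with pysam, assuming a BAM from any aligner that fills in the NM tag (the file name is a placeholder):

```python
# Sketch: approximate per-read identity as 1 - NM / alignment_columns,
# where alignment columns count matches, mismatches and indel bases.
import pysam

bam = pysam.AlignmentFile("minion_vs_reference.bam", "rb")  # placeholder file name

for read in bam:
    if read.is_unmapped or not read.has_tag("NM"):
        continue
    nm = read.get_tag("NM")  # edit distance over the aligned portion
    # CIGAR operations counted as alignment columns: M (0), I (1), D (2), = (7), X (8)
    columns = sum(length for op, length in read.cigartuples if op in (0, 1, 2, 7, 8))
    identity = 100.0 * (1 - nm / columns)
    print(read.query_name, f"{identity:.1f}% identity over {columns} columns")
```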
The mean read length seems to be about 10 kb, but users have reported even longer reads, up to 40 kb, covering the entire lambda genome used for testing. So the read length is really promising and places the mature MinION as a natural competitor to PacBio.
Using the MinION seems straightforward: after plugging the sequencer into a USB 3.0 port of a computer, the MinKNOW software suite is installed. A program called Metrichor uploads the raw data – ion current traces – to the Amazon Elastic Compute Cloud, where base-calling happens, either 1D base-calling for unidirectional reads or 2D base-calling for bidirectional reads.
Overall, improvements have to be made to the base-calling software, reliability of the flow cells, and library shelf-life, and new software needs to be developed by the community to take advantage of the MinION reads.  Oxford Nanopore said a new chemistry will be available in the coming months, which might include some of these improvements.

In the meantime, many other early access users contacted by the In Sequence website are awaiting the arrival of reagents, are in the midst of burn-in, or have run their own samples but are not ready to talk about their results yet. So we are expecting many more data, comments and detailed estimates of platform accuracy and data production to come out in the next months! The new MinION has fulfilled expectations in this first test and there is a lot of promise in this new technology... maybe a new revolution in the field is about to come!

Other details can be found in this post on GenomeWeb.

Insects, sheep, polar bears, crows, beans and eucalyptus... all the genomes you want!

I'm always amazed by the explosion of new species' genomes since the introduction of NGS. In the last two years the sequencing and assembly of genomes from various animals and plants have accelerated even more, and have also focused on "exotic" species, so much so that we now have almost a new genome per month! All these data can tell us a lot about the basic mechanisms of evolution and provide information to study how complex biological processes have developed and why they act the way we see now. Moreover, many species have peculiar properties and produce biopeptides or other biological molecules that could be useful for life science and medicine.
So, here is a quick update of what has been published in the last months!

The amazing spiderman: social velvet spider and tarantula genomes to study silk and venom
Authors from BGI-Shenzhen and Aarhus University reported in Nature Communications the assembly of the full genomes of a social velvet spider and a tarantula. Besides the genome sequencing and analysis, the authors also performed transcriptome sequencing and proteomic analysis by mass spectrometry. A de novo assembly of the velvet spider (S. mimosarum) was generated from 91× coverage sequencing of paired-end and mate-pair libraries and assembled into contigs and scaffolds spanning 2.55 Gb. Integrating transcriptome data as well, the authors reconstructed a gene set of 27,235 protein-coding gene models. Approximately 400 gene models had no homology to known proteins but were supported by proteomic evidence, identifying putative ‘spider-specific’ proteins. The gene structure, unlike that of other arthropod genomes, is characterized by an intron-exon organization very similar to that of the human genome. The size estimate of the tarantula genome is about 6 Gb; it was sequenced at 40× coverage from a single female A. geniculata using a combination of paired-end and mate-pair libraries similar to that used for the velvet spider. The authors sequenced proteins from different spider tissues (venom, thorax, abdomen, haemolymph and silk), identifying 120 proteins in venom, 15 proteins in silk and 2,122 proteins from body fluid and tissue samples, for a total of 2,193 tarantula proteins. Introns were found to be longer than those of the velvet spider.
Combining three different omics approaches, the paper reconstructed species-specific gene duplications and the set of peculiar proteins involved in spider silk and venom. The analysis revealed an enrichment in cysteine-rich peptides with neurotoxic effects and in proteases that specifically activate protoxins in spider venom.

Stick insects: a large sequencing effort to study evolution and speciation
In this paper published in Science (it made the journal's cover), the authors performed whole genome sequencing on several individuals from different populations of stick insects to investigate the role and mechanism of action of selection in adaptation and parallel speciation. The researchers performed a parallel experiment, moving four groups of individuals from the original population from their natural host plant to a new one. They sequenced them and their first offspring generation and analyzed genomic variations and their role in adaptation to the new environment. Comparing genomic changes across the four groups allows analysis of parallel speciation and of the genomic mechanisms behind the scenes.

Polar bear genome: population genomics to dissect adaptation to extreme environments
In this paper from Cell, the authors reconstructed a draft assembly of the polar bear genome and then analyzed 89 complete genomes of polar bears and brown bears using population genomic modeling. Results show that the species diverged 479–343 thousand years ago and that the polar bear lineage has been under stronger positive selection than the brown bear one. Several genes specifically selected in polar bears are associated with cardiomyopathy and vascular disease, implying an important reorganization of the cardiovascular system. Another group of genes showing strong evidence of selection are those related to lipid metabolism, transport and storage, such as APOB. Functional mutations in this gene may explain how polar bears are able to cope with life-long elevated LDL levels.

Sheep genome: now all the major livestock animals have their genome sequence
Researchers from the International Sheep Genomics Consortium published in Science the first complete assembly of the sheep genome. The team built an assembly that spans 2.61 billion bases of the sheep genome at an average depth of around 150-fold. That assembly covers around 99 percent of the sheep's 26 autosomes and its X chromosome. In addition to the high-quality reference genome, the team generated transcriptome sequences representing 40 sheep tissues, which contributed to the subsequent analysis of sheep features. Like cattle, sheep are known for feeding on plants and deriving useful proteins from lignocellulose-laden material with the help of fermentation and microbes in the rumen. Specialized features of sheep metabolism act on the volatile fatty acids that gut bugs produce during that process, and other adaptations in fatty acid metabolism seem to feed into the production of wool fibers, which contain lanolin formed from waxy ester molecules. By adding transcript sequence data for almost 100 samples taken from 40 sheep tissue types, the researchers looked at the protein-coding genes present in the sheep genome and their relationship to those found in 11 other ruminant and non-ruminant mammals.

Two crow species: genomes reveal what makes them look different
Researchers published in Science a genomic study of two crow species, the all-black carrion crow and the gray-coated hooded crow, and found that a very small percentage of the birds' genes is responsible for their different looks. The researchers started by assembling a high-quality reference genome for the hooded crow species C. cornix. The 16.4-million-base assembly — covered to an average depth of 152-fold — contained nearly 20,800 predicted protein-coding genes. The team then resequenced the genomes of 60 hooded or carrion crows at average depths of between 7.1- and 28.6-fold apiece, identifying more than 5.27 million SNPs shared between the two species and more than 8.4 million SNPs in total. Comparison of the two species' genomes revealed that varied expression of less than 0.28 percent of the entire genome was enough to maintain the different coloration of the two species. This particular 1.95-megabase region of the genome is located on avian chromosome 18, and it harbors genes associated with pigmentation, visual perception, and hormonal balance. Together, the team's findings hint that distinctive physical features are maintained in hooded and carrion crows despite gene flow across all but a fraction of the genome.

Eucalyptus genome: tandem duplications and essential oils encoded in the DNA
An international team published in Nature a reference genome for the eucalyptus tree. The researchers used whole-genome Sanger sequencing to build the genome assembly of an E. grandis representative of the BRASUZ1 genotype. Using those sequences, together with bacterial artificial chromosome sequences and a genetic linkage map, the team covered more than 94 percent of the plant's predicted 640-million-base sequence at an average depth of nearly seven-fold. To facilitate transcript identification, they added RNA sequences representing different eucalyptus tissue types and developmental stages and reconstructed 36,376 predicted protein-coding eucalyptus genes. The genomes of a sub-tropical E. grandis BRASUZ1 representative and of a temperate eucalyptus species called E. globulus were re-sequenced on Illumina instruments. Comparison of the different genomes revealed that eucalyptus displays the greatest number of tandem duplications of any plant genome sequenced so far, and that the duplications appear to have prioritized genes for wood formation. The plant also has the highest diversity of genes for producing various essential oils.

Common Bean genome: genomic effects of plant domestication
The reference genome for the common bean, Phaseolus vulgaris L., was recently published in Nature Genetics. The authors used a whole-genome shotgun sequencing strategy combining linear libraries and paired libraries of varying insert sizes, sequenced on the Roche 454 platform. To these data they added 24.1 Gb of Illumina-sequenced fragment libraries and sequences from fosmid and BAC libraries obtained on the canonical Sanger platform, for a total assembled sequence coverage of 21.0X. The final assembly covers 473 Mb of the 587-Mb genome, and 98% of this sequence is anchored in 11 chromosome-scale pseudomolecules. Using resequencing data from 60 wild individuals and 100 landraces from the genetically differentiated Mesoamerican and Andean gene pools, the authors performed a genome-wide analysis of dual domestications and confirmed two independent domestications from gene pools that diverged before human colonization. They also identified a set of genes linked to increased leaf and seed size. These results identify regions of the genome that have undergone intense selection and thus provide targets for future crop improvement efforts.

Monday, 19 May 2014

Pubmed highlight: SNP detection tools comparison

Performance comparison of SNP detection tools with illumina exome sequencing data-an assessment using both family pedigree information and sample-matched SNP array data.

Nucleic acids research. 2014 May 15. pii: gku392

Abstract

To apply exome-seq-derived variants in the clinical setting, there is an urgent need to identify the best variant caller(s) from a large collection of available options. We have used an Illumina exome-seq dataset as a benchmark, with two validation scenarios-family pedigree information and SNP array data for the same samples, permitting global high-throughput cross-validation, to evaluate the quality of SNP calls derived from several popular variant discovery tools from both the open-source and commercial communities using a set of designated quality metrics. To the best of our knowledge, this is the first large-scale performance comparison of exome-seq variant discovery tools using high-throughput validation with both Mendelian inheritance checking and SNP array data, which allows us to gain insights into the accuracy of SNP calling through such high-throughput validation in an unprecedented way, whereas the previously reported comparison studies have only assessed concordance of these tools without directly assessing the quality of the derived SNPs. More importantly, the main purpose of our study was to establish a reusable procedure that applies high-throughput validation to compare the quality of SNP discovery tools with a focus on exome-seq, which can be used to compare any forthcoming tool(s) of interest.

Wednesday, 30 April 2014

Pubmed highlight: Comparison of mapping algorithms

Mapping millions of reads to a reference genome sequence is in most cases the first step in NGS data analysis. Proper mapping is essential for downstream variant identification and for assessing the quality of each sequenced base. Various tools exist to perform this task, and this paper presents an interesting new way to benchmark the results of different aligners. The authors developed CuReSim, a tool able to generate simulated read datasets resembling different NGS technologies, and CuReSimEval, which evaluates the performance of a given aligner on the created dataset.
In the paper they apply this new method to compare the performance of some popular aligners (such as BWA, TMAP and Bowtie) on Ion Torrent data. "The application of this procedure to Ion Torrent data from the whole genome sequencing of small genomes [...] demonstrate that it is helpful for selecting a mapper based on the intended application, questions to be addressed, and the technology used," they report in the abstract.
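The core idea behind this kind of evaluation is easy to reproduce. Below is a minimal, hypothetical sketch that assumes the simulator encodes the true contig and 1-based start position in the read name, a convention used by several read simulators; CuReSim's actual naming scheme and CuReSimEval's richer correctness criteria (end position, number of indels and substitutions) go further than this.

```python
# Sketch: fraction of simulated reads mapped within a tolerance of their true position.
# Assumes read names like "<contig>_<true_1_based_start>_<id>"; adapt to your simulator.
import pysam

TOLERANCE = 5  # bases of slack allowed around the true start position

def true_position(read_name):
    contig, start, _ = read_name.rsplit("_", 2)
    return contig, int(start)

bam = pysam.AlignmentFile("simulated_reads.mapped.bam", "rb")  # placeholder file name
total = correct = 0
for read in bam:
    if read.is_unmapped or read.is_secondary or read.is_supplementary:
        continue
    total += 1
    contig, start = true_position(read.query_name)
    # reference_start is 0-based, so shift the simulated 1-based coordinate
    if read.reference_name == contig and abs(read.reference_start - (start - 1)) <= TOLERANCE:
        correct += 1

print(f"{correct}/{total} reads correctly mapped ({100.0 * correct / total:.2f}%)")
```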

Comparison of mapping algorithms used in high-throughput sequencing: application to Ion Torrent data.
Caboche S, Audebert C, Lemoine Y, Hot D

Abstract
BACKGROUND: The rapid evolution in high-throughput sequencing (HTS) technologies has opened up new perspectives in several research fields and led to the production of large volumes of sequence data. A fundamental step in HTS data analysis is the mapping of reads onto reference sequences. Choosing a suitable mapper for a given technology and a given application is a subtle task because of the difficulty of evaluating mapping algorithms.
RESULTS: In this paper, we present a benchmark procedure to compare mapping algorithms used in HTS using both real and simulated datasets and considering four evaluation criteria: computational resource and time requirements, robustness of mapping, ability to report positions for reads in repetitive regions, and ability to retrieve true genetic variation positions. To measure robustness, we introduced a new definition for a correctly mapped read taking into account not only the expected start position of the read but also the end position and the number of indels and substitutions. We developed CuReSim, a new read simulator, that is able to generate customized benchmark data for any kind of HTS technology by adjusting parameters to the error types. CuReSim and CuReSimEval, a tool to evaluate the mapping quality of the CuReSim simulated reads, are freely available. We applied our benchmark procedure to evaluate 14 mappers in the context of whole genome sequencing of small genomes with Ion Torrent data for which such a comparison has not yet been established.
CONCLUSIONS: A benchmark procedure to compare HTS data mappers is introduced with a new definition for the mapping correctness as well as tools to generate simulated reads and evaluate mapping quality. The application of this procedure to Ion Torrent data from the whole genome sequencing of small genomes has allowed us to validate our benchmark procedure and demonstrate that it is helpful for selecting a mapper based on the intended application, questions to be addressed, and the technology used. This benchmark procedure can be used to evaluate existing or in-development mappers as well as to optimize parameters of a chosen mapper for any application and any sequencing platform.

Wednesday, 23 April 2014

PubMed Highlight: Importance of annotation tool and transcript dataset for variants analysis

This paper, recently published in Genome Medicine, analyzes the differences in variant annotation obtained when using different annotation tools and transcript datasets. The authors extensively report on the pitfalls and peculiarities of the two most widely used programs (ANNOVAR and VEP), using transcript definitions from both the RefSeq and Ensembl databases.
The paper abstract clearly shows a high level of discrepancy, especially for functional variants (namely missense and LoF), which are by far the most relevant in NGS analysis. "We find only 44% agreement in annotations for putative loss-of-function variants when using the REFSEQ and ENSEMBL transcript sets as the basis for annotation with ANNOVAR. The rate of matching annotations for loss-of-function and nonsynonymous variants combined is 79% and for all exonic variants it is 83%. When comparing results from ANNOVAR and VEP using ENSEMBL transcripts, matching annotations were seen for only 65% of loss-of-function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy," the authors write.

What impresses me the most is the low level of concordance between different transcript datasets, reflecting the fact that the annotation of mRNA forms is also far from definitively established.
So be careful with your annotations!
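If you want a quick feel for how your own pipeline behaves, a crude check is to annotate the same variants with both tools and count how often the reported consequence agrees. Here is a minimal sketch, assuming two hypothetical tab-delimited exports (chrom, pos, ref, alt, consequence), one per tool; note that a real comparison also has to reconcile the different consequence vocabularies (e.g. "stopgain" vs "stop_gained").

```python
# Sketch: concordance of consequence calls between two annotation exports.
# Both inputs are assumed to be tab-delimited: chrom, pos, ref, alt, consequence.
import csv

def load_annotations(path):
    annotations = {}
    with open(path) as handle:
        for chrom, pos, ref, alt, consequence in csv.reader(handle, delimiter="\t"):
            annotations[(chrom, pos, ref, alt)] = consequence
    return annotations

annovar = load_annotations("annovar_refseq.tsv")  # hypothetical export
vep = load_annotations("vep_ensembl.tsv")         # hypothetical export

shared = set(annovar) & set(vep)
matching = sum(1 for key in shared if annovar[key] == vep[key])
print(f"{matching}/{len(shared)} shared variants with matching annotation "
      f"({100.0 * matching / len(shared):.1f}%)")
```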

Here is the full paper:

Choice of transcripts and software has a large effect on variant annotation
Davis J McCarthy, Peter Humburg, Alexander Kanapin, Manuel A Rivas, Kyle Gaulton, The WGS500 Consortium, Jean-Baptiste Cazier and Peter Donnelly

Monday, 7 April 2014

Some useful tools for your everyday NGS analysis

There are a lot of tools that can assist you at every step of an NGS data analysis. Here are some interesting pieces of software I've recently started using.

SAMStat - published in Bioinformatics
This tool takes your SAM/BAM/FASTQ files and computes several metrics describing the frequency and distribution of bases across all your reads. Results include stats on read MAPQ, the distribution of MAPQ across read length, nucleotide over-representation across reads, error distribution and the identification of over-represented 2-mers and 10-mers. All the data are conveniently presented in an HTML summary and help you identify potential issues in your sequencing experiment. Moreover, these graphs are always useful as quality reports for your presentations!
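As an illustration of the kind of metric it reports, here is a minimal pysam sketch that tabulates the MAPQ distribution of a BAM file (this is not SAMStat itself, just a quick approximation of one of its plots; the file name is a placeholder):

```python
# Sketch: distribution of reads across MAPQ bins, similar to one of the SAMStat plots.
from collections import Counter

import pysam

bins = Counter()
with pysam.AlignmentFile("sample.bam", "rb") as bam:  # placeholder file name
    for read in bam:
        if read.is_unmapped:
            bins["unmapped"] += 1
        elif read.mapping_quality >= 30:
            bins["MAPQ >= 30"] += 1
        elif read.mapping_quality >= 10:
            bins["10 <= MAPQ < 30"] += 1
        else:
            bins["MAPQ < 10"] += 1

total = sum(bins.values())
for label, count in bins.most_common():
    print(f"{label}: {count} reads ({100.0 * count / total:.1f}%)")
```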

ngsCAT - published in Bioinformatics
This command-line tool provides a detailed analysis of your mapped reads given a defined set of target regions. It computes several metrics and stats: mean coverage, number of bases covered at least n-fold, duplicated reads, distribution of on-target reads across chromosomes and uniformity of coverage across target regions. The tool requires a BAM file and a BED file as inputs and produces several graphs and tables plus a final summary report in HTML format. Really simple and useful to assess the quality of your target capture!
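The key numbers are easy to approximate yourself. Here is a minimal sketch (not ngsCAT itself) that uses pysam to compute the fraction of target bases covered at least N-fold from an indexed BAM and a BED file; file names and the threshold are placeholders:

```python
# Sketch: fraction of target bases covered at least MIN_DEPTH-fold, given a
# coordinate-sorted, indexed BAM and a BED file of target regions.
import pysam

MIN_DEPTH = 20
bam = pysam.AlignmentFile("capture.bam", "rb")  # placeholder file name
covered = total = 0

with open("targets.bed") as bed:
    for line in bed:
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        # count_coverage returns four per-base arrays (A, C, G, T counts)
        acgt = bam.count_coverage(chrom, start, end)
        for depth in map(sum, zip(*acgt)):
            total += 1
            if depth >= MIN_DEPTH:
                covered += 1

print(f"{covered}/{total} target bases covered at >= {MIN_DEPTH}x "
      f"({100.0 * covered / total:.1f}%)")
```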

GRAB - published in PLOS ONE
This tool is as simple as it is clever. It takes genotyping information for your subjects in various formats (Genotype/Var/masterVar/GVF/VCF/PEDMAP/TPED) and estimates their relationships. It works best with whole genome data, but I've also tested it using VCF files from exome sequencing and, by reducing the default reading window, it performs well at least in identifying 1st- and 2nd-degree relationships. The tool requires R installed on your system and is really fast in performing the analysis. It is useful when you are dealing with a group of samples and want to verify that there are no members from the same family.
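GRAB has its own statistical model, but the underlying intuition, namely that related samples share more genotypes than unrelated ones, is easy to illustrate. Here is a crude sketch (not GRAB's actual method) that computes pairwise genotype concordance from per-sample genotype dictionaries; the toy data stand in for genotypes loaded from your VCF or genotype files:

```python
# Crude sketch: pairwise genotype concordance (identity-by-state) between samples.
# genotypes[sample] maps a variant key (chrom, pos) to a genotype string like "0/1".
from itertools import combinations

genotypes = {
    "sampleA": {("1", 12345): "0/1", ("1", 67890): "1/1", ("2", 11111): "0/0"},
    "sampleB": {("1", 12345): "0/1", ("1", 67890): "0/1", ("2", 11111): "0/0"},
    "sampleC": {("1", 12345): "1/1", ("1", 67890): "0/0", ("2", 11111): "0/1"},
}  # toy data; in practice these would come from your variant files

for s1, s2 in combinations(genotypes, 2):
    shared = set(genotypes[s1]) & set(genotypes[s2])
    same = sum(1 for site in shared if genotypes[s1][site] == genotypes[s2][site])
    print(f"{s1} vs {s2}: {same}/{len(shared)} concordant genotypes "
          f"({100.0 * same / len(shared):.0f}%)")
```

Closely related pairs stand out with clearly higher concordance; GRAB layers a proper statistical model on top of this kind of signal to assign actual relationship degrees.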

MendelScan - published in the American Journal of Human Genetics
This tool is also described by the author on his blog, MassGenomics. The software performs variant prioritization for family-based exome sequencing studies. It needs some preliminary steps to prepare the necessary input files: a multisample VCF with the variants from the family members, a PED file describing the relationships between samples, a ranked list of the genes most expressed in your tissue of interest and the VEP-annotated list of your variants. Given these data, the tool computes a ranked list of the identified variants based on the selected inheritance model (recessive or dominant). Moreover, it includes two additional modules developed by the authors: Rare Heterozygous Rule Out and Shared Identity-by-Descent. The first one "identifies candidate regions consistent with autosomal dominant inheritance based on the idea that a disease-causing haplotype will manifest regions of rare heterozygous variants shared by all affecteds, and an absence of homozygous differences between affected pairs (which would indicate that a pair had no haplotype in common)", while the second one "uses BEAGLE FastIBD results to identify regions of maximum identity-by-descent (IBD) among affected pairs". This tool integrates the canonical prediction scores (such as PolyPhen, PhyloP and so on) with gene expression ranking and the newly developed methods to provide a straightforward analysis for your Mendelian disease NGS studies!
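The inheritance-model step is easy to sketch. Below is a minimal, hypothetical example (not MendelScan's actual scoring) that keeps variants compatible with a recessive model in a trio: homozygous alternate in the affected child and heterozygous in both unaffected parents. Sample names and the VCF path are placeholders.

```python
# Sketch: naive recessive-model filter on a multisample VCF for a trio.
# Keeps sites where the affected child is hom-alt and both parents are het.
import pysam

CHILD, FATHER, MOTHER = "CHILD", "FATHER", "MOTHER"  # placeholder sample names
vcf = pysam.VariantFile("family.vcf.gz")  # placeholder multisample VCF

def genotype(record, sample):
    return tuple(a for a in record.samples[sample]["GT"] if a is not None)

for rec in vcf:
    child = genotype(rec, CHILD)
    father = genotype(rec, FATHER)
    mother = genotype(rec, MOTHER)
    hom_alt_child = child == (1, 1)
    het_parents = sorted(father) == [0, 1] and sorted(mother) == [0, 1]
    if hom_alt_child and het_parents:
        print(rec.chrom, rec.pos, rec.ref, rec.alts[0], "candidate recessive variant")
```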

SPRING - published in PLOS Genetics
Like MendelScan, this is another tool for variant prioritization. It also has a web version for those who are not familiar with the command line. The tool takes a list of seed genes already known to be involved in the pathology or in similar phenotypes, together with a list of your candidate missense variants, and gives you a ranked list of the variants that have a high probability of being associated with the disease. The tool works well for diseases with high genetic heterogeneity, for which you can easily and confidently build a list of seed genes. SPRING can thus be really useful in prioritizing candidate variants emerging from new studies.

PRADA - published in Bioinformatics
This tool is focused on RNA-Seq and provides a complete framework for the analysis of this kind of data. It is a complete solution, since it can perform several kinds of analysis starting from raw paired-end RNA-seq data: gene expression levels, quality metrics, detection of unsupervised and supervised fusion transcripts, detection of intragenic fusion variants, homology scores and fusion frame classification. As the project page reports:
"PRADA currently supports 7 modules to process and identify abnormalities from RNAseq data:
preprocess: Generates aligned and recalibrated BAM files.
expression: Generates gene expression (RPKM) and quality metrics.
fusion: Identifies candidate gene fusions.
guess-ft: Supervised search for fusion transcripts.
guess-if: Supervised search for intragenic fusions.
homology: Calculates homology between given two genes.
frame: Predicts functional consequence of fusion transcript"

This is a good starting point for those not familiar with RNA-Seq!

Friday, 4 April 2014

A gold standard dataset of SNP and Indels for benchmarking tools

One of the pains of NGS data analysis is that different tools, different settings and different sequencing platforms produce different, and often poorly overlapping, variant calls. Every analysis pipeline has its peculiar issues and results in specific false positive/negative calls.

The need for a gold standard reference set of SNP and indel calls is thus of primary relevance to correctly assess the accuracy and sensitivity of new NGS pipelines. Robust benchmarking of variant identification pipelines is even more critical as NGS analysis moves fast from research to the diagnostic/clinical field.
In this interesting paper in Nature Biotechnology, the authors analyzed 14 different variant datasets from the same standard genome, NA12878 (chosen as the reference genome by the Genome in a Bottle Consortium), to produce a list of gold standard SNPs and indels and to identify genomic regions that are particularly difficult to address. The final dataset is the most robust produced so far, since it integrates data from 5 sequencing technologies, 7 read mappers and 3 variant callers to obtain a robust set of calls.
The standard for evaluating your favorite pipeline is finally here, publicly available on the Genome Comparison and Analytic Testing website!
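With the integrated call set in hand, a first-pass comparison of your own pipeline can be scripted in a few lines. This is only a rough, site-level sketch (file names are placeholders); a serious benchmark should also restrict the comparison to the high-confidence regions and normalize variant representation first.

```python
# Rough sketch: site-level sensitivity and precision of a test VCF against a truth VCF.
import pysam

def load_calls(path):
    calls = set()
    for rec in pysam.VariantFile(path):
        for alt in rec.alts or ():
            calls.add((rec.chrom, rec.pos, rec.ref, alt))
    return calls

truth = load_calls("benchmark_calls.vcf.gz")  # placeholder truth set
test = load_calls("my_pipeline.vcf.gz")       # placeholder test call set

true_positives = len(truth & test)
sensitivity = true_positives / len(truth)
precision = true_positives / len(test)
print(f"TP={true_positives}  sensitivity={sensitivity:.3f}  precision={precision:.3f}")
```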

Nat Biotechnol. 2014 Feb 16. 
Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls.

Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, Salit M. 

Abstract
Clinical adoption of human genome sequencing requires methods that output genotypes with known accuracy at millions or billions of positions across a genome. Because of substantial discordance among calls made by existing sequencing methods and algorithms, there is a need for a highly accurate set of genotypes across a genome that can be used as a benchmark. Here we present methods to make high-confidence, single-nucleotide polymorphism (SNP), indel and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. We minimize bias toward any method by integrating and arbitrating between 14 data sets from five sequencing technologies, seven read mappers and three variant callers. We identify regions for which no confident genotype call could be made, and classify them into different categories based on reasons for uncertainty. Our genotype calls are publicly available on the Genome Comparison and Analytic Testing website to enable real-time benchmarking of any method.

Tuesday, 25 March 2014

Flash Report: Loblolly pine genome is largest ever sequenced

Below is the report from CBSNews.com:

Scientists say they've generated the longest genome sequence to date, unraveling the genetic code of the loblolly pine tree.
Conifers have been around since the age of the dinosaurs, and they have some of the biggest genomes of all living things.
Native to the U.S. Southeast, the loblolly pine (Pinus taeda) can grow over 100 feet (30 meters) tall and has a lengthy genome to match, with 23 billion base pairs. That's more than seven times the size of the human genome, which has 3 billion base pairs. (These pairs form sequences called genes that tell cells how to make proteins.)
"It's a huge genome. But the challenge isn't just collecting all the sequence data. The problem is assembling that sequence into order," study researcher David Neale, a professor of plant sciences at the University of California, Davis, said in a statement.
To simplify this huge genetic puzzle, Neale and colleagues assembled most of the sequence from part of a single pine nut-- a haploid part of the seed with just one set of chromosomes to piece together.
The new research showed that the loblolly genome is bloated with repetitive DNA. In fact, 82 percent of the genome repeats itself, the researchers say.
Understanding the loblolly pine's genetic code could lead to improved breeding of the tree, which is used to make paper and lumber and is being investigated as a potential biofuel, the scientists say.
The loblolly pine joins other recently sequenced conifers, including the Norway spruce (Picea abies), which has 20 billion base pairs. For their next project, the researchers are eyeing the sugar pine, a tree with 35 billion base pairs.
The research was detailed this week in the journals Genetics and Genome Biology.

Friday, 21 February 2014

Flash Report: King Richard III genome is going to be sequenced soon



The list of famous people whose genome has been sequenced is going to get a new member, and a royal one at that. After the discovery of the remains of his body under a car park, scientists at the University of Leicester are going to sequence the genome of King Richard III, who died in the Battle of Bosworth Field in 1485. The project aims to learn more about the King's ancestry and health, and to provide genetic data useful for historians, researchers and the public.
In addition to the scientific aspects, these types of initiatives also represent a good strategy to attract media attention to the institution involved in the sequencing and, why not, to increase the chances of raising funds for other projects with a deeper scientific impact.
In general, choosing to sequence the genomes of famous historical figures can be a good way to acknowledge what they did for their country and a great opportunity for visibility for the institution performing the study. Here is the link to the Reuters news release.

Tuesday, 18 February 2014

Genapsys reveals GENIUS at AGBT


At the last AGBT meeting Genapsys, a small company from California, presented its new "lunch-box" sequencer called GENIUS (Gene Electronic Nano-Integrated Ultra-Sensitive).
This machine, the size of a toaster, is a new kind of NGS sequencer that applies electronic sensor technology and promises to produce up to 100 Gb of sequence in a few hours, with read lengths up to 1,000 bp. Like the MinION, it aims to push the NGS market even further, providing a new generation of sequencers that are incredibly small, cheap and easy to operate. A step further towards the so-called "freedom of sequencing".

Genapsys started an early-access program for the GENIUS platform last week and plans to ship out the first instruments within a few months, followed by a general commercial launch either this year or next year.

The system has an opening in the front to insert a small, square semiconductor-based sequencing chip. A reusable reagent cartridge attaches to the back, and a computer for data processing is integrated. According to the Genapsys CEO's interview with In Sequence, library preparation requires clonal amplification of the template DNA on beads in an emulsion-free, off-instrument process that needs no specialized equipment from the company. While the first version of the GENIUS will require separate sample and library preparation, the company plans to eventually integrate sample prep into the platform. The template beads are then pipetted onto the chip, which has "millions of sensors" and a flat surface with no wells. There, the beads assemble into an array-like pattern, with each bead being individually addressable. This is followed by polymerase-based DNA synthesis in which nucleotides are added sequentially. The system uses an electronic detection method to identify which nucleotide was incorporated, but the detection principle remains secret for now, even if it is not based on pH measurements (as the Ion Torrent technology is).

Three types of chips will be available for the GENIUS, generating up to 1 gigabase, 20 gigabases, or 100 gigabases of data, with the sequencing run taking only a few hours. Estimated sequencing costs per gigabase will be $300 for the smallest chip, $10 for the middle chip, and $1 for the largest chip. The instrument price has not been determined yet, but Genapsys will provide special offers to customers committing to high usage. This will be something like a usage plan that includes the instrument at a lower price, but with a minimum yearly spend on consumables. Esfandyarpour, the Genapsys CEO, said they have already generated sequence data on the platform, though he did not provide specifics. The data quality will be "equivalent or better than the best product out there," he said.

Friday, 14 February 2014

Uzbekistan applies genetic testing to select future Olympians


Strange, but true!
According to this news reported by The Atlantic, beginning in 2015, Uzbekistan will incorporate genetic testing into its search for Olympic athletes.

Rustam Muhamedov and his colleagues from the genetics laboratory of Uzbekistan's Institute of Bioorganic Chemistry are working on developing a set of 50 genes to determine which sport a child is best suited for. The national trainers could then start working with highly predisposed children, knowing exactly which sport best suits their physical characteristics.

"Developed countries throughout the world like the United States, China, and European countries are researching the human genome and have discovered genes that define a propensity for specific sports," Muhamedov tells the Atlantic. "We want to use these methods in order to help select our future champions."
The program, overseen by Uzbekistan's Academy of Sciences, would be "implemented in practice" in early 2015 in cooperation with the National Olympic Committee and several of the country's national sports federations—including soccer, swimming, and rowing.

For now there is no explicit ban on genetic testing or genetic selection of athletes, even if the World Anti-Doping Agency discourages such practices. In the past, there were suspicions that the Chinese government had also applied some sort of genetic testing, or at least encouraged marriage and pregnancy between people with the desired predispositions, with the aim of producing better athletes (see the Yao Ming story as an example).

Many experts doubt that genetic testing can really improve performance more than an excellent training program, given that the genotype-phenotype correlation for many physiological traits relevant to athletes has not yet been fully explored or understood.
The explicit use of genetic testing, however, poses some ethical questions and calls for official rules on its application in sport.

We'll see if the Uzbek effort pushes the country to the top of the Olympic rankings!

Thursday, 13 February 2014

Follow in real time the 2014 AGBT Meeting


The 15th annual Advances in Genome Biology and Technology (AGBT) meeting is being held in Marco Island, Florida, on February 12-15. You can follow the latest announcements via Twitter from the tab at the top of our blog page (the tab will be removed at the end of the meeting).

Here you will find a list of some of the most interesting talks, while the complete agenda is available on the official website of the event.


Wednesday, 12 February 2014

Genomic sequencing in newborns to improve healthcare: pilot projects funded by the NIH!


As NGS technology becomes cheaper and more robust over the next few years and our knowledge of genotype-phenotype associations increases, the idea of performing whole genome sequencing as a standard test on newborns may become an actual healthcare strategy.


The genomes of infants could be sequenced shortly after birth, allowing parents to know which diseases or conditions their child may be affected by or have a propensity for developing, and giving them the chance to possibly head them off or start early treatments.

To test this approach, the US National Institutes of Health has awarded $5 million in research grants to four pilot programs to study newborn screening. The research programs aim to develop the science and technology as well as to investigate the ethical issues related to such screening.
This pilot effort, also covered by The New York Times, is the first to assess the impact of whole genome screening on quality of life and on our ability to provide better healthcare.
Genomic sequencing may reveal many problems that could be treated early in a child’s life, avoiding the diagnostic odyssey that parents can endure when medical problems emerge later, said Dr. Cynthia Powell, winner of one of the research grants.

However, the role of each variant and how to interpret its contribution to disease risk are not yet fully understood, and many questions remain about which variants to report and whether and how genomic findings translate into improved therapies or quality of life. These matters will also be addressed in the funded studies.

“Many changes in the DNA sequence aren't disease-causing,” said Dr. Robert Nussbaum, chief of the genomic medicine division at the School of Medicine of the University of California, San Francisco, and leader of one of the pilot grants. “We aren't very good yet at distinguishing which are and which aren’t.”

“You will get dozens of findings per child that you won’t be able to adequately interpret,” said Dr. Jeffrey Botkin, a professor of pediatrics and chief of medical ethics at the University of Utah. The ethical issues of sequencing are sharply different when it is applied to children. Adults can decide which test information they want to receive, but children won’t usually have that option: their parents will decide, and children may receive information they would rather have ignored once they become adults.

"We are not ready now to deploy whole genome sequencing on a large scale," said Eric Green, the director of the National Human Genome Research Institute, that promote the research program, "but it would be irresponsible not to study the problem."
"We are doing these pilot studies so that when the cost of genomic sequencing comes down, we can answer the question, 'Should we do it?' " he adds.

Here are the four pilot projects funded by the NIH, as reported in the official announcement:
  • Brigham and Women's Hospital and Boston Children's Hospital, Boston
    Principal Investigators: Robert Green, M.D., and Alan Beggs, Ph.D.

    This research project will accelerate the use of genomics in pediatric medicine by creating and safely testing new methods for using information obtained from genomic sequencing in the care of newborns. It will test a new approach to newborn screening, in which genomic data are available as a resource for parents and doctors throughout infancy and childhood to inform health care.  A genetic counselor will provide the genomic sequencing information and newborn screening results to the families.  Parents will then be asked about the impact of receiving genomic sequencing results and if the information was useful to them.  Researchers will try to determine if the parents respond to receiving the genomic sequencing results differently if their newborns are sick and if they respond differently to receiving genomic sequencing results as compared to current newborn screening results. Investigators will also develop a process for reporting results of genomic sequencing to the newborns' doctors and investigate how they act on these results.
     
  • Children's Mercy Hospital - Kansas City, Mo.
    Principal Investigator: Stephen Kingsmore, M.D.

    Many newborns require care in a neonatal intensive care unit (NICU), and this group of newborns has a high rate of disability and death. Given the severity of illness, these newborns may have the most to gain from fast genetic diagnosis through the use of genomic sequencing. The researchers will examine the benefits and risks of using rapid genomic sequencing technology in this NICU population. They also aim to reduce the turnaround time for conducting and receiving genomic sequencing results to 50 hours, which is comparable to other newborn screening tests. The researchers will test if their methods increase the number of diagnoses or decrease the time it takes to reach a diagnosis in NICU newborns. They will also study if genomic sequencing changes the clinical care of newborns in the NICU.  Additionally, the investigators are interested in doctor and parent perspectives and will try to determine if parents' perception of the benefits and risks associated with the results of sequencing change over time.
     
  • University of California, San Francisco 
    Principal Investigator: Robert Nussbaum, M.D.

    This pilot project will explore the potential of exome sequencing as a method of newborn screening for disorders currently screened for and others that are not currently screened for, but where newborns may benefit from screening. The researchers will examine the value of additional information that exome sequencing provides to existing newborn screening that may lead to improved care and treatment. Additionally, the researchers will explore parents' interest in receiving information beyond that typically available from newborn screening tests. The research team also intends to develop a participant protection framework for conducting genomic sequencing during infancy and will explore legal issues related to using genome analysis in newborn screening programs. Together, these studies have the potential to provide public health benefit for newborns and research-based information for policy makers.
     
  • University of North Carolina at Chapel Hill 
    Principal Investigators: Cynthia Powell, M.D., M.S., and Jonathan Berg, M.D., Ph.D.

    In this pilot project, researchers will identify, confront and overcome the challenges that must be met in order to implement genomic sequencing technology to a diverse newborn population. The researchers will sequence the exomes of healthy infants and infants with known conditions such as phenylketonuria, cystic fibrosis or other disorders involving metabolism. Their goal is to help identify the best ways to return results to doctors and parents. The investigators will explore the ethical, legal and social issues involved in helping doctors and parents make informed decisions, and develop best practices for returning results to parents after testing. The researchers will also develop a tool to help parents understand what the results mean and examine extra challenges that doctors may face as this new technology is used. This study will place a special emphasis on including multicultural families.

Monday, 10 February 2014

PubMed Highlights: NGS library preparation

Starting with a robust sequencing library is the first and crucial step to obtaining unbiased, high-quality results from your next-generation sequencer!
Here are a couple of interesting papers reviewing problems and solutions related to NGS library preparation. They also give a concise overview of current library preparation methods and how they fit different downstream applications.
Take a look!

Library preparation methods for next-generation sequencing: Tone down the bias.
Exp Cell Res. 2014 Jan 15

Abstract
Next-generation sequencing (NGS) has caused a revolution in biology. NGS requires the preparation of libraries in which (fragments of) DNA or RNA molecules are fused with adapters followed by PCR amplification and sequencing. It is evident that robust library preparation methods that produce a representative, non-biased source of nucleic acid material from the genome under investigation are of crucial importance. Nevertheless, it has become clear that NGS libraries for all types of applications contain biases that compromise the quality of NGS datasets and can lead to their erroneous interpretation. A detailed knowledge of the nature of these biases will be essential for a careful interpretation of NGS data on the one hand and will help to find ways to improve library quality or to develop bioinformatics tools to compensate for the bias on the other hand. In this review we discuss the literature on bias in the most common NGS library preparation protocols, both for DNA sequencing (DNA-seq) as well as for RNA sequencing (RNA-seq). Strikingly, almost all steps of the various protocols have been reported to introduce bias, especially in the case of RNA-seq, which is technically more challenging than DNA-seq. For each type of bias we discuss methods for improvement with a view to providing some useful advice to the researcher who wishes to convert any kind of raw nucleic acid into an NGS library.


Library construction for next-generation sequencing: Overviews and challenges.
Biotechniques. 2014 Feb

Abstract
High-throughput sequencing, also known as next-generation sequencing (NGS), has revolutionized genomic research. In recent years, NGS technology has steadily improved, with costs dropping and the number and range of sequencing applications increasing exponentially. Here, we examine the critical role of sequencing library quality and consider important challenges when preparing NGS libraries from DNA and RNA sources. Factors such as the quantity and physical characteristics of the RNA or DNA source material as well as the desired application (i.e., genome sequencing, targeted sequencing, RNA-seq, ChIP-seq, RIP-seq, and methylation) are addressed in the context of preparing high quality sequencing libraries. In addition, the current methods for preparing NGS libraries from single cells are also discussed.