Wednesday, 23 April 2014

PubMed Highlight: Importance of annotation tool and transcript dataset for variants analysis

This paper recently published on Genome Medicine analyze the differences in variants annotation when using different annotation tools and transcript dataset. Authors extensively report on pitfalls and peculiar aspects of the two mostly diffused softwares (ANNOVAR and VEP), using transcripts definition from both RefSeq or Ensembl databases.
From the paper abstract clearly emerge an high level of discrepancy, expecially for functional variants (namely missense and LoF) that are the far most important in most analysis. "We find only 44% agreement in annotations for putative loss-of-function variants when using the REFSEQ and ENSEMBL transcript sets as the basis for annotation with ANNOVAR. The rate of matching annotations for loss-of-function and nonsynonymous variants combined is 79% and for all exonic variants it is 83%. When comparing results from ANNOVAR and VEP using ENSEMBL transcripts, matching annotations were seen for only 65% of loss-of-function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy" authors writes.

What impress me the most is the low level of concordance between different transcript datasets, reflecting the fact that also the annotation of mRNA forms are far from definitely established.
So be careful with your annotations!

Here the full paper:

Choice of transcripts and software has a large effect on variant annotation
Davis J McCarthy, Peter Humburg, Alexander Kanapin, Manuel A Rivas, Kyle Gaulton, The WGS500 Consortium, Jean-Baptiste Cazier and Peter Donnelly

Monday, 7 April 2014

Some useful tools for your everyday NGS analysis

There are a lot of tools that can assist you at every step of an NGS data analysis. Here are some interesting pieces of software I've recently started using.

SAMStat - published on Bioinformatics
This tool take your sam/bam/fastq files and compute several metrics describing frequency and distribution of bases across all your reads. Results include: stats on MAPQ of reads, distribution of MAPQ across read lenght, nucleotide over-representation across reads, error distribution and identification of over-represented 2-mers and 10-mers. All the data are conveniently presented in a html summary and help you identifying potential issues present in your sequencing experiment. Moreover, these graphs are always useful as quality reports for your presentations!

ngsCAT - published on Bioinformatics
This command line tool provide you with a detailed analysis of your mapped reads given a defined target regions. It compute several metrics and stats: medium coverage, number of bases covered at least n fold, duplicated reads, distribution of on-target reads across chromosomes and uniformity of coverage across target regions. The tool require a bam file and a bed file as inputs and produce several graphs and tabs and a final summary report in html format. Really simple and useful to assess quality of your target capture!

GRAB - published on PLOS ONE
This tool is as simple as cleaver. It takes genotyping information of subjects from various formats (Genotype/Var/masterVar/GVF/VCF/PEDMAP/TPED) and compute their eventual relationship. It works best with whole genome data, but I've tested it also using vcf files from exome sequencing and reducing the default reading window it performs well at least in identifying 1st and 2nd grade relationships. This tool require R installed in your system and it's really fast in performing the analysis. It is useful when you are dealing with a group of samples and you want to verify that there are no members from the same family.

MendelScan - published on American Journal of Human Genetics
This tool is also described by the author on his blog MassGenomics. This software perform variant prioritization for family based exome sequencing studies. It needs some preliminary steps to prepare the necessary input files: a multisample vcf files with variants from the family members, a ped file describing the relations between samples, a ranked list of genes that are mostly expressed in your tissue of interest and the VEP annotated list of your variants. Given these data the tool compute a ranked list of identified variants based on the selected inheritance model (recessive or dominant). Moreover it include two additional modules developed by the authors: the Rare Heterozygous Rule Out and the Shared Identity-by-Descent. The first one "identifies candidate regions consistent with autosomal dominant inheritance based on the idea that a disease-causing haplotype will manifest regions of rare heterozygous variants shared by all affecteds, and an absence of homozygous differences between affected pairs (which would indicate that a pair had no haplotype in common).", while the second one "uses BEAGLE FastIBD results to identify regions of maximum identity-by-descent (IBD) among affected pairs". This tool integrates the canonical prediction scores (such as Polyphen, PhyloP and so on) with gene expression ranking and the newly developed methods to provide a straightforward analysis for your mendelian disease NGS studies!

SPRING - published on PLOS Genetics
Like MendelScan, this is another tool for variant prioritization. This one has also a version working from the web for those that are not familiar with the command line. The tool takes a list of seed genes already known to be involved in the pathology or in similar phenotypes and a list of your candidate missense variants and give you a ranked list of the variants that have a high probability of being associated with the disease. This tools work fine for disease with high genetic heterogeneity so that you can easily and confidently build a list of seed genes. SPRING can then be really useful in prioritizing candidates variants emerging from new studies.

PRADA - published on Bioinformatics
This tool is focused on RNA-Seq and it provide a complete framework for analysis of this kind of data. The tool is a complete solution since it can perform several kind of analysis starting from raw paired-end RNA-seq data: gene expression levels, quality metrics, detection of unsupervised and supervised fusion transcripts, detection of intragenic fusion variants, homology scores and fusion frame classification. As the project page reports:
"PRADA currently supports 7 modules to process and identify abnormalities from RNAseq data:
preprocess: Generates aligned and recalibrated BAM files.
expression: Generates gene expression (RPKM) and quality metrics.
fusion: Identifies candidate gene fusions.
guess-ft: Supervised search for fusion transcripts.
guess-if: Supervised search for intragenic fusions.
homology: Calculates homology between given two genes.
frame: Predicts functional consequence of fusion transcript"

This is a good starting point for those not familiar with RNA-Seq!

Friday, 4 April 2014

A gold standard dataset of SNP and Indels for benchmarking tools

One of the pain in the NGS data analysis is that different tools, different settings and different sequencing platforms produce different and often low overlapping variants. Every analysis pipeline has its peculiar issues and results in specific false positive/negative calls.

Genome in a BottleThe need for a gold standard reference of SNP and indels calls is thus of primary relevance to correctly asses the accuracy and sensibility of new NGS pipeline. The need for a robust benchmarking of variant identification pipelines is even more critical as NGS analysis is fast moving from research to diagnostic/clinical field.
In this interesting paper on Nature Biotechnology, authors have performed the analysis of 14 different variants datasets from the same standard genome NA12878 (choose as the standard reference genome by the Genome in a Bottle Consortium) to produce a list of gold standard SNPs and Indels and identify genomic regions that are particularly difficult to address. The final dataset is the most robust produced so far since it integrate data from 5 different sequencing technology, 7 read mapper and 3 variant callers to obtain a robust estimation of SNVs.
The final standard for evaluation of your favorite pipeline is finally here, publicly available Genome Comparison and Analytic Testing website!

Nat Biotechnol. 2014 Feb 16. 
Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls.

Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, Salit M. 

Clinical adoption of human genome sequencing requires methods that output genotypes with known accuracy at millions or billions of positions across a genome. Because of substantial discordance among calls made by existing sequencing methods and algorithms, there is a need for a highly accurate set of genotypes across a genome that can be used as a benchmark. Here we present methods to make high-confidence, single-nucleotide polymorphism (SNP), indel and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. We minimize bias toward any method by integrating and arbitrating between 14 data sets from five sequencing technologies, seven read mappers and three variant callers. We identify regions for which no confident genotype call could be made, and classify them into different categories based on reasons for uncertainty. Our genotype calls are publicly available on the Genome Comparison and Analytic Testing website to enable real-time benchmarking of any method.

Tuesday, 25 March 2014

Flash Report: Loblolly pine genome is largest ever sequenced

Below is the report from

Scientists say they've generated the longest genome sequence to date, unraveling the genetic code of the loblolly pine tree.
Conifers have been around since the age of the dinosaurs, and they have some of the biggest genomes of all living things.
Native to the U.S. Southeast, the loblolly pine (Pinus taeda) can grow over 100 feet (30 meters) tall and has a lengthy genome to match, with 23 billion base pairs. That's more than seven times the size of the human genome, which has 3 billion base pairs. (These pairs form sequences called genes that tell cells how to make proteins.)
"It's a huge genome. But the challenge isn't just collecting all the sequence data. The problem is assembling that sequence into order," study researcher David Neale, a professor of plant sciences at the University of California, Davis, said in a statement.
To simplify this huge genetic puzzle, Neale and colleagues assembled most of the sequence from part of a single pine nut-- a haploid part of the seed with just one set of chromosomes to piece together.
The new research showed that the loblolly genome is bloated with repetitive DNA. In fact, 82 percent of the genome repeats itself, the researchers say.
Understanding the loblolly pine's genetic code could lead to improved breeding of the tree, which is used to make paper and lumber and is being investigated as a potential biofuel, the scientists say.
The loblolly pine joins other recently sequenced conifers, including the Norway spruce (Picea abies), which has 20 billion base pairs. For their next project, the researchers are eyeing the sugar pine, a tree with 35 billion base pairs.
The research was detailed this week in the journals Genetics and Genome Biology.

Friday, 21 February 2014

Flash Report: King Richard III genome is going to be sequenced soon

The list of famous people whose genome has been sequenced is going to get a new member, more precisely a royal one. After the discovery of the remains of his body under a car park, scientists of the University of Leicester are going to sequence the genome of King Richard III, who died in the Battle of Bosworth Field in 1485. The project has the purpose to learn more about King's ancestry and health, and to provide genetic data useful for historians, researchers and the public.
In addition to the scientific aspects, these types of initiatives also represent good strategies to attract the attention of the media on the institution involved in the sequencing and, why not, to increase the possibility to raise funds for other projects with a deeper scientific impact.
In general, the choice to sequence the genome of famous historic characters could be a good initiative to acknowledge what they did for their country and a great opportunity of visibility for the institution performing the study. Here is the link to the Reuters news release.

Tuesday, 18 February 2014

Genapsys reveals GENIUS at AGBT

At the last AGBT meeting Genapsys, a small company from California, presented its new "launch-box" sequencer called GENIUS (Gene Electronic Nano-Integrated Ultra-Sensitive).
This machine, with the size of a toaster, is a new kind of NGS sequencer that apply electronic sensor technology and promise to produce up to 100 Gb of sequences in few hours, with read length up to 1000 bp. Like the MinION they propose to push the NGS market even further, providing a new generation of sequencher which are incredibly small, cheap and easy to operate. A step further to the so called "freedom of sequencing".

Genapsys started an early-access program for the GENIUS platform last week and plans to ship out the first instruments within a few months, followed by a general commercial launch either this year or next year.

The system has an opening in the front to insert a small, square semiconductor-based sequencing chip. A reusable reagent cartridge attaches to the back, and a computer for data processing is integrated. Based on the Genapsys CEO interview at InSequence, library preparation require clonal amplification of the template DNA on beads in an emulsion-free off-instrument process that requires no specialized equipment from the company. While the first version of the Genius will require separate sample and library preparation, the company plans to integrate sample prep into the platform eventually. The template beads are then pipetted into the chip, which has "millions of sensors" and a flat surface with no wells. There, the beads assemble into an array-like pattern, with each bead being individually addressable. This is followed by polymerase-based DNA synthesis where nucleotides are added sequentially. The system uses an electronic detection method to identify which nucleotide was incorporated, but the detection principle remain secreted by now, even if it is not based on pH measurements (like the Ion Torrent technology).

Three types of chips will be available for the Genius, generating up to 1 gigabase, 20 gigabases, or 100 gigabases of data with the sequencing run taking only few hours. Estimated sequencing costs per gigabase will be $300 for the smallest chip, $10 for the middle chip, and $1 for the largest chip. Even if the instrument price has not been determined yet but, Genapsys will provide special offers to customers committing to high usage. This will be something like an usage plan that include the instrument at a lower price, but with a minimum expense in consumable per year. Esfandyarpour, Genapsys CEO, said they have already generated sequence data on the platform, though he did not provide specifics. The data quality will be "equivalent or better than the best product out there," he said.

Friday, 14 February 2014

Uzbekistan apply genetic testing to select for future olympians

Strange, but true!
According to this news reported by The Atlantic, beginning in 2015, Uzbekistan will incorporate genetic testing into its search for Olympic athletes.

Rustam Muhamedov and his colleagues from Uzbekistan's Institute of Bioorganic Chemistry's genetics laboratory are working on developing a set of 50 genes to determine what sport a child is best suited for. So the national trainers would start working on high predisposed children and know exactly which sport is more suited to their physical characteristics.

"Developed countries throughout the world like the United States, China, and European countries are researching the human genome and have discovered genes that define a propensity for specific sports," Muhamedov tells the Atlantic. "We want to use these methods in order to help select our future champions."
The program, overseen by Uzbekistan's Academy of Sciences, would be "implemented in practice" in early 2015 in cooperation with the National Olympic Committee and several of the country's national sports federations—including soccer, swimming, and rowing.

By now there is no explicit ban against genetic testing or genetic selection on athletes, even if
the World Anti-Doping Agency discourages such practices. In the past, there was suspicions that also China government had applied some sort of genetic testing or at least encouraged marriage and pregnancy between people with desired predisposition with the aim to generate better athletes (see the Yao Ming story as example).

Many experts doubt that genetic testing could really improve performance more than an excellent training program, given that the genotype-phenotype correlation on many physiological traits relevant for sportsmen is not yet explored nor fully understood.
The explicit use of genetic testing however pose some ethical question and claim for an official rule on its application in sport practice.

We'll see if the Uzbekistan effort will push the country at the top of olympic ranking! 

Thursday, 13 February 2014

Follow in real time the 2014 AGBT Meeting

The 15th annual Advances in Genome Biology and Technology (AGBT) meeting is being held in Marco Island, Florida, on February 12 -15. You can follow the latest announcements via Twitter from the top tab we placed in our blog page (the tab will be removed at the end of the meeting)

You will find here a list of some among the most interesting talks, while the complete agenda is available on the official web site of the event.

Wednesday, 12 February 2014

Genomic sequencing in newborn to improve healthcare: pilot projects funded by NIH!

As NGS technology became cheaper and more robust in the next few years and our knowledge about genotype-phenotype associations increase, the idea of performing whole genome sequencing as a standard test on newborns may became an actual strategy in healthcare.

The genomes of infants may be able to be sequenced shortly after birth and allow parents to know what diseases or conditions their child may be affected by or have a propensity for developing, giving them the chance to possibly head them off or start on early treatments.

To test this approach, the US National Institutes of Health has awarded $5 million in research grants to four pilot programs to study newborn screening. The research programs aim to develop the science and technology ad well as to investigate the ethical issues related to such screening.
This pilot project, covered also by The New York Times, is the first effort to asses impact of whole genome screening on the quality of life and on our ability to provide better healthcare. 
Genomic sequencing may reveal many problems that could be treated early in a child’s life, avoiding the diagnostic odyssey that parents can endure when medical problems emerge later, said Dr. Cynthia Powell, winner of one of the research grants.

However, the role of each and every variant and how to interpret their contribution to disease risk is not yet fully understood and many questions remain on which variants to report and if and how genomic findings translate into improved therapies or life quality. This matter will also be addressed in the funded studies.

“Many changes in the DNA sequence aren't disease-causing,” said Dr. Robert Nussbaum, chief of the genomic medicine division at the School of Medicine of the University of California, San Francisco, and leader of one of the pilot grants. “We aren't very good yet at distinguishing which are and which aren’t.”

“You will get dozens of findings per child that you won’t be able to adequately interpret,” said , Dr. Jeffrey Botkin, a professor of pediatrics and chief of medical ethics at the University of Utah. The ethical issues of sequencing are sharply different when it is applied to children. Adults can decide which test information they want to receive, but children won’t usually have that option. Their parents will decide and children will get information that he will rather prefer to ignore when he became adult.

"We are not ready now to deploy whole genome sequencing on a large scale," said Eric Green, the director of the National Human Genome Research Institute, that promote the research program, "but it would be irresponsible not to study the problem."
"We are doing these pilot studies so that when the cost of genomic sequencing comes down, we can answer the question, 'Should we do it?' " he adds.

Here are the four pilot project funded by the NIH, as reported in the official news:
  • Brigham and Women's Hospital and Boston Children's Hospital, Boston
    Principal Investigators: Robert Green, M.D., and Alan Beggs, Ph.D.

    This research project will accelerate the use of genomics in pediatric medicine by creating and safely testing new methods for using information obtained from genomic sequencing in the care of newborns. It will test a new approach to newborn screening, in which genomic data are available as a resource for parents and doctors throughout infancy and childhood to inform health care.  A genetic counselor will provide the genomic sequencing information and newborn screening results to the families.  Parents will then be asked about the impact of receiving genomic sequencing results and if the information was useful to them.  Researchers will try to determine if the parents respond to receiving the genomic sequencing results differently if their newborns are sick and if they respond differently to receiving genomic sequencing results as compared to current newborn screening results. Investigators will also develop a process for reporting results of genomic sequencing to the newborns' doctors and investigate how they act on these results.
  • Children's Mercy Hospital - Kansas City, Mo.
    Principal Investigator: Stephen Kingsmore, M.D.

    Many newborns require care in a neonatal intensive care unit (NICU), and this group of newborns has a high rate of disability and death. Given the severity of illness, these newborns may have the most to gain from fast genetic diagnosis through the use of genomic sequencing. The researchers will examine the benefits and risks of using rapid genomic sequencing technology in this NICU population. They also aim to reduce the turnaround time for conducting and receiving genomic sequencing results to 50 hours, which is comparable to other newborn screening tests. The researchers will test if their methods increase the number of diagnoses or decrease the time it takes to reach a diagnosis in NICU newborns. They will also study if genomic sequencing changes the clinical care of newborns in the NICU.  Additionally, the investigators are interested in doctor and parent perspectives and will try to determine if parents' perception of the benefits and risks associated with the results of sequencing change over time.
  • University of California, San Francisco 
    Principal Investigator: Robert Nussbaum, M.D.

    This pilot project will explore the potential of exome sequencing as a method of newborn screening for disorders currently screened for and others that are not currently screened for, but where newborns may benefit from screening. The researchers will examine the value of additional information that exome sequencing provides to existing newborn screening that may lead to improved care and treatment. Additionally, the researchers will explore parents' interest in receiving information beyond that typically available from newborn screening tests. The research team also intends to develop a participant protection framework for conducting genomic sequencing during infancy and will explore legal issues related to using genome analysis in newborn screening programs. Together, these studies have the potential to provide public health benefit for newborns and research-based information for policy makers.
  • University of North Carolina at Chapel Hill 
    Principal Investigators: Cynthia Powell, M.D., M.S., and Jonathan Berg, M.D., Ph.D.

    In this pilot project, researchers will identify, confront and overcome the challenges that must be met in order to implement genomic sequencing technology to a diverse newborn population. The researchers will sequence the exomes of healthy infants and infants with known conditions such as phenylketonuria, cystic fibrosis or other disorders involving metabolism. Their goal is to help identify the best ways to return results to doctors and parents. The investigators will explore the ethical, legal and social issues involved in helping doctors and parents make informed decisions, and develop best practices for returning results to parents after testing. The researchers will also develop a tool to help parents understand what the results mean and examine extra challenges that doctors may face as this new technology is used. This study will place a special emphasis on including multicultural families.

Monday, 10 February 2014

PubMed Highlights: NGS library prepartion

Starting with a robust sequencing library is the first and crucial step to obtain unbiased, high-quality results from your next generation sequencher!
Here are a couple of interesting paper reviewing problems and solutions related to the NGS library preparation. They also give a sinthetic overview on the present library prepartion methods and how they fit to different downstream applications.
Take a look!

Library preparation methods for next-generation sequencing: Tone down the bias.
Exp Cell Res. 2014 Jan 15

Next-generation sequencing (NGS) has caused a revolution in biology. NGS requires the preparation of libraries in which (fragments of) DNA or RNA molecules are fused with adapters followed by PCR amplification and sequencing. It is evident that robust library preparation methods that produce a representative, non-biased source of nucleic acid material from the genome under investigation are of crucial importance. Nevertheless, it has become clear that NGS libraries for all types of applications contain biases that compromise the quality of NGS datasets and can lead to their erroneous interpretation. A detailed knowledge of the nature of these biases will be essential for a careful interpretation of NGS data on the one hand and will help to find ways to improve library quality or to develop bioinformatics tools to compensate for the bias on the other hand. In this review we discuss the literature on bias in the most common NGS library preparation protocols, both for DNA sequencing (DNA-seq) as well as for RNA sequencing (RNA-seq). Strikingly, almost all steps of the various protocols have been reported to introduce bias, especially in the case of RNA-seq, which is technically more challenging than DNA-seq. For each type of bias we discuss methods for improvement with a view to providing some useful advice to the researcher who wishes to convert any kind of raw nucleic acid into an NGS library.

High-throughput sequencing, also known as next-generation sequencing (NGS), has revolutionized genomic research. In recent years, NGS technology has steadily improved, with costs dropping and the number and range of sequencing applications increasing exponentially. Here, we examine the critical role of sequencing library quality and consider important challenges when preparing NGS libraries from DNA and RNA sources. Factors such as the quantity and physical characteristics of the RNA or DNA source material as well as the desired application (i.e., genome sequencing, targeted sequencing, RNA-seq, ChIP-seq, RIP-seq, and methylation) are addressed in the context of preparing high quality sequencing libraries. In addition, the current methods for preparing NGS libraries from single cells are also discussed.