Friday, 25 January 2013

PubMed Highlight: a compass in the vast sea of NGS analysis tools

One of the most productive fields in NGS-related research is certainly data analysis tools, from raw sequence data tools down to variant annotation and prioritization software. Many different projects have been made freely available by academic research groups, and several others have been developed by commercial companies. As usual, every piece of software has its strengths and pitfalls and is best suited for a specific application. However, some are certainly better than others, and trying to find your way in this vast sea of tools can be frustrating.
Luckily, experienced researchers have come to help, providing these two brief but complete surveys on error-correction methods and exome variant analysis tools. Surely a useful compass to help navigate the NGS sea! Take a look!

A survey of error-correction methods for next-generation sequencing.
Brief Bioinform. 2013 Jan;14(1):56-66. doi: 10.1093/bib/bbs015. Epub 2012 Apr 6.

Yang X, Chockalingam SP, Aluru S. 

Error Correction is important for most next-generation sequencing applications because highly accurate sequenced reads will likely lead to higher quality results. Many techniques for error correction of sequencing data from next-gen platforms have been developed in the recent years. However, compared with the fast development of sequencing technologies, there is a lack of standardized evaluation procedure for different error-correction methods, making it difficult to assess their relative merits and demerits. In this article, we provide a comprehensive review of many error-correction methods, and establish a common set of benchmark data and evaluation criteria to provide a comparative assessment. We present experimental results on quality, run-time, memory usage and scalability of several error-correction methods. Apart from providing explicit recommendations useful to practitioners, the review serves to identify the current state of the art and promising directions for future research. Availability: All error-correction programs used in this article are downloaded from hosting websites. The evaluation tool kit is publicly available at:

Brief Bioinform. 2013 Jan 21

Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z. 

Recent advances in genome sequencing technologies provide unprecedented opportunities to characterize individual genomic landscapes and identify mutations relevant for diagnosis and therapy. Specifically, whole-exome sequencing using next-generation sequencing (NGS) technologies is gaining popularity in the human genetics community due to the moderate costs, manageable data amounts and straightforward interpretation of analysis results. While whole-exome and, in the near future, whole-genome sequencing are becoming commodities, data analysis still poses significant challenges and led to the development of a plethora of tools supporting specific parts of the analysis workflow or providing a complete solution. Here, we surveyed 205 tools for whole-genome/whole-exome sequencing data analysis supporting five distinct analytical steps: quality assessment, alignment, variant identification, variant annotation and visualization. We report an overview of the functionality, features and specific requirements of the individual tools. We then selected 32 programs for variant identification, variant annotation and visualization, which were subjected to hands-on evaluation using four data sets: one set of exome data from two patients with a rare disease for testing identification of germline mutations, two cancer data sets for testing variant callers for somatic mutations, copy number variations and structural variations, and one semi-synthetic data set for testing identification of copy number variations. Our comprehensive survey and evaluation of NGS tools provides a valuable guideline for human geneticists working on Mendelian disorders, complex diseases and cancers.

Thursday, 24 January 2013

PubMed Highlight: direct haplotyping of a human genome

In this recent article, authors from the J. Craig Venter Institute explore a new method to reconstruct chromosome-length haplotypes using genotyping and NGS data. With this procedure one can leverage short-read sequencing of haploid sperm-cell genomes to reconstruct chromosome-length haplotypes for any individual whose diploid genome is known. Moreover, they also demonstrate the ability to detect recombination events with a median resolution of less than 100 kb.
InSequence also has a post covering this innovative paper!

Genome Res. 2013 Jan 2

Kirkness EF, Grindberg RV, Yee-Greenbaum J, Marshall CR, Scherer SW, Lasken RS, Venter JC. 
The J Craig Venter Institute

There is increasing evidence that the phenotypic effects of genomic sequence variants are best understood in terms of variant haplotypes rather than as isolated polymorphisms. Haplotype analysis is also critically important for uncovering population histories, and for the study of evolutionary genetics. Although the sequencing of individual human genomes to reveal personal collections of sequence variants is now well established, there has been slower progress in the phasing of these variants into pairs of haplotypes along each pair of chromosomes. Here, we have developed a distinct approach to haplotyping that can yield chromosome-length haplotypes, including the vast majority of heterozygous SNPs in an individual human genome. This approach exploits the haploid nature of sperm cells, and employs a combination of genotyping and low-coverage sequencing on a short-read platform. In addition to generating chromosome-length haplotypes, the approach can directly identify recombination events (averaging 1.1 per chromosome) with a median resolution of less than 100 kb.

Wednesday, 23 January 2013

What customers say about 2012 in the NGS field: Illumina still more usable but Life is promising!

An interesting survey that recently appeared on InSequence (from GenomeWeb) takes stock of customers' opinions on the NGS platforms available today. The survey posed 26 questions to 99 respondents selected from GenomeWeb readers. Around 55 percent of them work in a government or academic setting, while the others come from hospitals, reference labs and biopharmaceutical firms.

The picture is quite clear: Illumina is still unbeaten in throughput, accuracy and usability, with a slight advantage in ease of sample preparation. Life Tech (basically the PGM and Proton platforms) confirmed its dominance in reagent and instrument prices and run time. The two solutions are substantially even on read length. Moreover, Life Technologies' ability to deliver the promised rapid technological improvements on the ION platforms seems to pay off: the ION platforms appear to be better regarded for future development, as more users answered positively when asked whether the company's sequencers are promising for future improvements.

This reflects quite well the present situation in the NGS market, with Illumina still dominating with around 60%, and Life rising thanks to its new platforms, now at about 25%. As a consequence of its claimed high accuracy and throughput, the Illumina HiSeq is also still the preferred platform for eventual large clinical applications, while MiSeq has only recently taken a slight advantage over PGM, following Illumina's announcement of the CLIA-approved MiSeqDx.

You can also follow changes in customers' perspectives over the last 6 months by comparing the latest survey with the one conducted in April 2012.

There is another interesting trend emerging: most customers declared that they will increase their sequence production by at least 50% in 2013 over 2012 and that they prefer to do it in house. In fact, 49 out of 99 respondents disagreed or strongly disagreed with the statement that they'd be more likely to outsource sequencing. However, it seems that this increase in production will not be achieved by adding "hardware", since around half of the respondents either disagreed or strongly disagreed with the statement "I will purchase a new NGS instrument in 12 months" and only a quarter agreed or strongly agreed with it.

PubMed Highlights: Quantitative visualization of DNA G-quadruplex structures in human cells.

Quantitative visualization of DNA G-quadruplex structures in human cells.
Nature Chemistry 2013, Jan 20

Giulia Biffi, David Tannahill, John McCafferty and Shankar Balasubramanian.

Department of Biochemistry, University of Cambridge

Four-stranded G-quadruplex nucleic acid structures are of great interest as their high thermodynamic stability under near-physiological conditions suggests that they could form in cells. Here we report the generation and application of an engineered, structure-specific antibody employed to quantitatively visualize DNA G-quadruplex structures in human cells. We show explicitly that G-quadruplex formation in DNA is modulated during cell-cycle progression and that endogenous G-quadruplex DNA structures can be stabilized by a small-molecule ligand. Together these findings provide substantive evidence for the formation of G-quadruplex structures in the genome of mammalian cells and corroborate the application of stabilizing ligands in a cellular context to target G-quadruplexes and intervene with their function.

Monday, 21 January 2013

Flash Report: Genome can reveal your surname!

Researchers from the Whitehead Institute using a computer, an Internet connection, and publicly accessible online resources were able to identify nearly 50 individuals who had anonymously submitted personal genetic material as participants in genomic studies.
The group’s work was described in a paper (Identifying Personal Genomes by Surname Inference) published this week in the journal Science. The study sought to show how the full names and identities of genomic research participants can be determined under certain circumstances, even when their genetic information is held in de-identified form within databases. Through surname inference, the study was able to discover the family names of the men by submitting the short tandem repeats on their Y chromosomes (Y-STRs) to publicly accessible databases maintained by genealogists and genetic genealogy companies, which store Y-STR data by surname. The algorithm identified the surname of one in eight of those tested. In one case, the researchers managed to identify a man's name and the fact that he lived in California, all from his Y chromosome. They also worked with the Y-chromosome information of Craig Venter, who headed the Celera genome project, narrowing the identity down to only two men in California. The research points to a handful of useful applications, such as locating relatives and identifying bodies after natural disasters and other calamities. But there is also a more sinister side: if a person publishes his genome on the Internet, even anonymously, his identity could be exposed.
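
To make the surname-inference idea more concrete, here is a minimal sketch of matching a Y-STR profile against a surname-indexed database. It is not the method used in the paper: the surnames, repeat counts, function names and the `min_matches` threshold are all invented for illustration, and only a handful of markers are used.

```python
# Toy genealogy database: surname -> Y-STR profile (marker -> repeat count).
# Surnames and repeat counts are invented for illustration only.
TOY_DATABASE = {
    "Smith": {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13},
    "Jones": {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 13},
    "Brown": {"DYS19": 14, "DYS390": 25, "DYS391": 11, "DYS393": 12},
}


def count_matches(query, record):
    """Number of query markers whose repeat count equals the record's."""
    return sum(1 for marker, repeats in query.items()
               if record.get(marker) == repeats)


def infer_surname(query, database, min_matches):
    """Return the best-matching surname, or None when no record reaches
    min_matches or when the top score is tied between surnames."""
    scores = sorted(((count_matches(query, rec), name)
                     for name, rec in database.items()), reverse=True)
    if not scores or scores[0][0] < min_matches:
        return None
    if len(scores) > 1 and scores[1][0] == scores[0][0]:
        return None  # ambiguous: do not guess
    return scores[0][1]


query_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
print(infer_surname(query_profile, TOY_DATABASE, min_matches=4))
```

Returning `None` on ties or weak matches reflects the fact that only a fraction of profiles can be assigned a surname at all.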

Flash Report: Share your genome sequence using your smartphone

Apparently we are not there yet, but this possibility is just around the corner. Computer scientists at UC Irvine have devised a smartphone app that can store your DNA – and perhaps one day allow your partner, your relatives or your doctor to scan it (we can imagine a couple on a date holding their smartphones together, and instantly determining what their children will look like – or whether the kids might be predisposed to genetic disease).
The article "Latest app? Store your DNA on a smartphone" discusses a number of possible future scenarios.
A more scientific approach to these issues is provided by this paper (GenoDroid: Are Privacy-Preserving Genomic Tests Ready for Prime Time?) that explores the viability and practicality of privacy-agile computational genomic tests in the portable and pervasive setting of modern smartphones.

Wednesday, 9 January 2013

Flash Report: Oxford Nanopore announces further collaborations

2012 is over and Oxford Nanopore Technologies has failed to maintain its commitment to commercialize the revolutionary GridION and MinION DNA strand sequencing products directly to customers within the past year.
However, a new press release dated 8 Jan 2013 reveals that something is moving forward. The company announced that it has completed a series of agreements with leading academic research institutions (including University of Illinois at Urbana-Champaign, Brown University, Stanford University, Boston University, University of Cambridge and University of Southampton) to further develop and exploit nanopore sensing technology for the analysis of DNA, RNA, proteins and other single molecules.
Will 2013 be the year of nanopore-based NGS sequencing?

Monday, 7 January 2013

Essential Human Genome Numbers

Recent large population sequencing studies performed with NGS technologies have provided updated and more accurate estimates of human genome variability (such as the average number of missense mutations in an individual exome). These numbers have become a must-know for every genomics geek, and I often find myself in trouble during genomics conversations, since I can't get them fixed in my mind... They tend to pop up unexpectedly during conversation and I feel like the dumb one missing something essential. So I've decided to put all the interesting values together for rapid and effective reference. I hope this will help me memorize them and provide a way to impress lab mates with fast and accurate genomic answers! And if you really want to amaze someone, don't forget to look at BioNumbers too! It has dozens of fascinating numbers from biology (who has never wondered how large the biggest eukaryotic cell is?).

A little note before starting: when values in the literature are discordant, I've reported the different estimates with their own references. In particular, I think that the differences between the values reported by (1) and (2) can be explained mainly by 2 causes:
1. In (1), the NHLBI GO Exome Sequencing Project uses either Roche/Nimblegen or Agilent reagents for exome capture, while in (2) the 1000 Genomes Project considers the exome portion as defined by GENCODE. So the first includes sequences based on the NCBI Consensus CDS database (CCDS) (containing protein-coding genes and some miRNA and snoRNA + UTRs), while the second includes all protein-coding loci with alternatively transcribed variants, non-coding loci with transcript evidence, and pseudogenes.
2. The dataset analyzed by the 1000 Genomes Project comprises a wider representation of Asian and African populations, resulting in a higher average number of variants, since the reference genome currently adopted is essentially based on subjects of American/Caucasian origin.

SNVs in Exomes:
Average SNVs in an individual exome: 12400-15000 (average 13600) (of which 66% heterozygous) (1); 24000 (2)
Average indels per individual (2): 440 
Expected novel SNVs per exome given present data in public databases: 200-500 (1); however note that for any exome sequence 3.3% of observed heterozygous variants are predicted to be novel based on a recent model about human population growth (3).
Number of SNVs with functional effect on protein-coding genes expected in one genome: 320-510 (about 95% of functional SNVs are rare, MAF < 0.5%).

Observed variability per individual:
Synonymous SNVs: 7600 (1); 13-16 k (2).
Missense SNVs: 5700 (1); 11-14 k (2).
Splice affecting SNVs: 12 (1); 12-28 (2).
Stop-gain SNVs: 35 (1); 34-57 (2).
Indels in protein-coding genes: 110-186 (2).
Frameshift indels: 30-50 (2).
SNVs in disease genes reported by HGMD: 41-84 (2).
SNVs in COSMIC (Catalogue Of Somatic Mutations In Cancer) genes: 33-51 (2).
Mean number of SNVs per gene: 30-40 (2).
Large deletions (>100kb) per exome: 39 (2).

Note also that, due to the recent exponential growth of the human population, rare SNVs are expected to account for about 15-20% of total diversity (13).

Variants in Genome (2):
SNPs / genome (autosomes - ChrX): 3.6 M - 105 k.
Indels / genome (autosomes - ChrX): 344 k - 13 k.
Large deletions / genome (autosomes - ChrX): 717 - 26.

De novo SNVs (4):
Mutation rate per gene per cell division: 10^-6 to 10^-7 (5).
Observed de novo SNVs per individual: 74, giving a mutation rate of 1.18 x 10^-8 per position.
Observed de novo indels per individual: 3, giving a mutation rate of 4 x 10^-10 per position, with deletions being 3 times more frequent than insertions.
Observed de novo CNVs (>100kb) per individual: 1 de novo every 50 individuals. However, it's worth noting that 10% of subjects with Intellectual Disability, Autism Spectrum Disorders and Schizophrenia present large CNVs.
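
As a rough check on the per-position rates above, the arithmetic can be sketched as follows. The haploid genome size of about 3.1 Gb and the function name are assumptions made for illustration; the studies summarized in (4) may have used a different count of callable positions, which is why the results only approximately match the quoted rates.

```python
# Assumed haploid human genome size in base pairs (roughly 3.1 Gb).
# This value is an illustration, not the denominator used in reference (4).
HAPLOID_GENOME_BP = 3.1e9


def per_position_rate(de_novo_events, haploid_bp=HAPLOID_GENOME_BP):
    """Return de novo events per position per generation.

    A child inherits two haploid genomes, so the events are spread
    over 2 * haploid_bp positions.
    """
    return de_novo_events / (2 * haploid_bp)


snv_rate = per_position_rate(74)    # 74 de novo SNVs per individual
indel_rate = per_position_rate(3)   # 3 de novo indels per individual

print(f"SNV rate:   {snv_rate:.2e} per position")
print(f"Indel rate: {indel_rate:.2e} per position")
```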

The number of de novo SNVs and CNVs is strongly influenced by parental age and ethnicity, with an increase of about two mutations per year. An exponential model estimates paternal mutations doubling every 16.5 years (6). On the other hand, maternal age correlates with increased probability of aneuploidies.
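
The two ways of describing the age effect can be compared with a small sketch. The function names and the choice of a reference age are assumptions for illustration; only the 16.5-year doubling time and the figure of about two mutations per year come from the text above.

```python
DOUBLING_TIME_YEARS = 16.5  # doubling time quoted above from reference (6)


def relative_paternal_mutations(age, reference_age):
    """Fold change in paternal de novo mutations between two paternal ages,
    under a pure exponential model with a fixed doubling time."""
    return 2 ** ((age - reference_age) / DOUBLING_TIME_YEARS)


def linear_extra_mutations(age, reference_age, per_year=2.0):
    """Extra de novo mutations under the simpler linear picture of
    about two additional mutations per year."""
    return per_year * (age - reference_age)


# Compare a 36.5-year-old father with a 20-year-old father.
print(relative_paternal_mutations(36.5, 20))  # fold change, exponential model
print(linear_extra_mutations(36.5, 20))       # extra mutations, linear model
```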

Loss-of-Function variants:
LoF sites per individual: 100-120 (estimated in 7); 30-40 (observed in 1).
Number of genes completely inactivated due to homozygous LoF: about 20 (estimated in 7); at least 1 (observed in 1).
It has been reported that genes affected by LoF variants are relatively less evolutionarily conserved, showing a higher ratio of protein-altering to silent substitutions in coding regions between human and macaque (P = 2.8 × 10^−52) and less evolutionary conservation in their promoter regions (GERP score; P = 3.7 × 10^−16). On average, they have more closely related gene family members (paralogs) than other genes (P = 0.0058) and show greater sequence identity to paralogs (P = 0.0068). These data suggest that LoF variants mainly strike genes with redundant or non-essential function (7).


(1) Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes (May 2012)
(3) Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants (May 2012)
(4) De novo mutations in human genetic disease (Aug 2012)

Friday, 4 January 2013

PubMed Highlight: Genes contributing to pain sensitivity in the normal population: an exome sequencing study

This interesting study applies exome sequencing to identify rare variants associated with pain sensitivity. The approach used by the authors relies both on a twin study design (they tested two groups of about 200 healthy volunteers from the TwinsUK project) and on the selection of extreme phenotypes (the authors compared individuals performing at the extreme ranges of standard pain tests). The angiotensin-related pathway emerged as a good candidate for pain sensitivity modulation.

Genes contributing to pain sensitivity in the normal population: an exome sequencing study
PLoS Genet. 2012 Dec;8(12)

Williams FM, Scollen S, Cao D, Memari Y, Hyde CL, Zhang B, Sidders B, Ziemek D, Shi Y, Harris J, Harrow I, Dougherty B, Malarstig A, McEwen R, Stephens JC, Patel K, Menni C, Shin SY, Hodgkiss D, Surdulescu G, He W, Jin X, McMahon SB, Soranzo N, John S, Wang J, Spector TD.

Department of Twin Research and Genetic Epidemiology, King's College London, London, United Kingdom. 

Sensitivity to pain varies considerably between individuals and is known to be heritable. Increased sensitivity to experimental pain is a risk factor for developing chronic pain, a common and debilitating but poorly understood symptom. To understand mechanisms underlying pain sensitivity and to search for rare gene variants (MAF<5%) influencing pain sensitivity, we explored the genetic variation in individuals' responses to experimental pain. Quantitative sensory testing to heat pain was performed in 2,500 volunteers from TwinsUK (TUK): exome sequencing to a depth of 70× was carried out on DNA from singletons at the high and low ends of the heat pain sensitivity distribution in two separate subsamples. Thus in TUK1, 101 pain-sensitive and 102 pain-insensitive were examined, while in TUK2 there were 114 and 96 individuals respectively. A combination of methods was used to test the association between rare variants and pain sensitivity, and the function of the genes identified was explored using network analysis. Using causal reasoning analysis on the genes with different patterns of SNVs by pain sensitivity status, we observed a significant enrichment of variants in genes of the angiotensin pathway (Bonferroni corrected p = 3.8×10^-4). This pathway is already implicated in animal models and human studies of pain, supporting the notion that it may provide fruitful new targets in pain management. The approach of sequencing extreme exome variation in normal individuals has provided important insights into gene networks mediating pain sensitivity in humans and will be applicable to other common complex traits.

Wednesday, 2 January 2013

Cell-free fetal DNA sequencing in the era of NGS: New opportunities for prenatal diagnosis

The idea that free fetal DNA is present in maternal blood is quite old and dates back to the end of the 90s when, starting from the observation that cancer cell DNA could be found freely circulating in human blood serum, Lo et al. demonstrated that fetal cell-free DNA could likewise be detected in a specimen of the mother's blood (see the original paper). A nice overview of advances in prenatal technology can be found in this article published in Wired.

After this discovery, several studies emerged and confirmed the possibility of using this easily accessible fetal DNA to detect well-known chromosome abnormalities previously diagnosed by invasive and risky procedures such as amniocentesis or chorionic villus sampling (CVS). In the first years of the new millennium, techniques based on free circulating fetal DNA attracted increasing attention for Rhesus D (RhD) genotyping and detection of chr 21 and chr 18 trisomies (see this review in Nature). Since they have proved to be as accurate as the classical tests requiring either amniocentesis or CVS but as low-risk as a blood draw, a commercial test kit based on this principle first hit the market in October 2011. Known as cell-free fetal DNA testing, it’s now offered by Sequenom, Verinata, and Ariosa Diagnostics.

The main advantages of detecting free fetal DNA in maternal blood are the quick and early response and the total absence of risk for the fetus itself. In fact, fetal DNA can be detected as early as the 6th week of pregnancy, and the test simply requires a sample of the mother's blood. This is a great achievement in the management of difficult pregnancies, relieving parents of the stressful choice between the risk of having a severely affected child and the risk of harming a healthy fetus with an invasive procedure. Moreover, an early response leaves more time for counseling and the decision-making process.
Cell-free fetal DNA tests are for now limited to the detection of some well-known trisomy-related conditions (Down, Edwards and Patau syndromes) and a few other chromosomal abnormalities such as Klinefelter or Turner syndromes. Moreover, despite its advantages and reliability, cell-free fetal DNA testing is still not widespread in clinical practice, and sometimes it is not even mentioned by physicians, so that requests are often driven by the mothers themselves, looking online for a safer test. At the end of November 2012, the American Congress of Obstetricians and Gynecologists released a long-awaited opinion on the cell-free fetal DNA test. The group recommended it for patients at an increased risk for chromosomal defects, including those over age 35 and those with a history of trisomy pregnancies. These guidelines will hopefully make doctors more aware of the new test (many ob-gyns apparently have never even heard of it).

New and rapid improvements in cell-free fetal DNA analysis will now come from applying NGS techniques, potentially leading to a new era of prenatal genomic diagnosis: deeper, faster, and risk free. In 2012, a paper published in Science Translational Medicine demonstrated the feasibility of whole-genome sequencing of fetal DNA extracted from the mother's blood. The investigators integrated the haplotype-resolved genome sequence of the mother, the shotgun sequence of the father, and the deep sequencing of cell-free DNA in maternal plasma (maternal and fetal) to non-invasively predict the whole-genome sequence of a fetus. This approach requires a paternal sample to determine the fetal genome, which is a practical limitation to its clinical application. Another paper in Nature goes even further, eliminating the need for a paternal sample. Shotgun sequencing of the plasma cell-free DNA was performed, and the relative amounts of parental haplotypes were measured by counting the number of alleles specific to each parental haplotype (‘markers’). The paternally inherited haplotypes were reconstructed by detection of paternal-specific alleles, followed by imputation at linked positions using reference haplotypes from the 1000 Genomes project. This method allowed deduction of the inheritance of each parental haplotype and construction of the full inherited fetal genome. Moreover, the authors were able to detect in the DNA of the fetus a deletion on chr 22 responsible for DiGeorge syndrome. Although technical and analytical challenges remain, in particular in correctly detecting de novo fetal genetic variations, these studies open new possibilities for applying cell-free fetal DNA tests to the identification of point mutations and other small rearrangements.
A complete overview of recent advances in genomic prenatal diagnosis is given in this review published in Trends in Genetics.
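
The allele-counting idea described above can be illustrated with a deliberately simplified sketch. Because cell-free DNA in maternal plasma mixes maternal and fetal fragments, the maternal haplotype transmitted to the fetus should be slightly over-represented among reads at sites where the mother's two haplotypes differ. The read counts, haplotype labels and function name below are hypothetical; a real analysis would add a statistical test and account for sequencing error, and this is not the pipeline used in the cited papers.

```python
def transmitted_haplotype(marker_counts):
    """Pick the maternal haplotype with more supporting reads.

    marker_counts: list of (reads_hap_a, reads_hap_b) tuples, one per
    marker site within a haplotype block where the two maternal
    haplotypes carry different alleles.
    Returns (label, fraction_a), where label is "A", "B" or None on a
    tie, and fraction_a is the share of reads supporting haplotype A.
    """
    total_a = sum(a for a, _ in marker_counts)
    total_b = sum(b for _, b in marker_counts)
    total = total_a + total_b
    if total == 0:
        raise ValueError("no reads at marker sites")
    fraction_a = total_a / total
    if total_a == total_b:
        return None, fraction_a
    return ("A" if total_a > total_b else "B"), fraction_a


# Hypothetical read counts at four marker sites in one block.
block = [(55, 45), (60, 48), (52, 47), (58, 44)]
label, fraction = transmitted_haplotype(block)
print(label, round(fraction, 3))
```

Pooling counts over all markers in a block, rather than calling each site separately, is what makes a small imbalance detectable at all.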

There is no doubt that prenatal DNA-based diagnosis will develop rapidly and that WGS of fetal DNA will become feasible and accurate in a short time. This raises further ethical and practical questions about how the genetic data will be managed and stored and about which information should be reported. The leading opinion for now is that only proven disease-causing variations will be reported, and many suggest limiting the list even further to only those mutations with severe consequences. The main concern, in fact, is to avoid reproductive selection being applied in the case of mild pathologies or, even worse, against unwanted traits defined by cultural background. On the other hand, one advantage of having whole-genome information is the possibility of adopting therapies or particular lifestyle habits for conditions in which early intervention could limit or avoid symptoms. Prenatal genome sequencing also poses the question of who owns the data, or at least who is responsible for it. The test actually acquires information on a subject who can't give consent and who may be unwilling to know their DNA code at all once grown up. But this information could also have a greatly beneficial impact for the subject, since it may one day reveal details essential for their health as research constantly updates our knowledge about the functional impact of genetic variants. So who is ultimately in charge of the updates, and who will decide on access to the genetic information? A possible solution could be some kind of encryption strategy (with a decryption code available to the owner only) that guarantees information security. For example, I found an interesting solution (developed by Emiliano De Cristofaro when he was at the University of California) that also allows for selective decryption of DNA sequence, giving access to a specific region while leaving the rest of the genome unrevealed.

Promises and questions!

PubMed Highlight: Whole-genome sequencing in autism identifies hot spots for de novo germline mutation

In this paper, which appeared in Cell in December, the authors apply WGS to 10 monozygotic twins to find new genetic variants associated with autism. At the same time, they also provide a better definition and mapping of genomic mutation hotspots, confirming the idea that genetic variations do not occur randomly throughout the genome. Moreover, they found that these highly variable regions seem to occur at highly conserved loci and particularly affect genes related to brain function and development. Even if the causes of the higher mutation rate are not fully understood, the hotspots could play an important role in autism and other genetic diseases. Based on their data, the authors also developed a predictive model to gauge a region's mutability index.

Cell. 2012 Dec 21;151(7):1431-42. doi: 10.1016/j.cell.2012.11.019. 

Michaelson JJ, Shi Y, Gujral M, Zheng H, Malhotra D, Jin X, Jian M, Liu G, Greer D, Bhandari A, Wu W, Corominas R, Peoples A, Koren A, Gore A, Kang S, Lin GN, Estabillo J, Gadomski T, Singh B, Zhang K, Akshoomoff N, Corsello C, McCarroll S, Iakoucheva LM, Li Y, Wang J, Sebat J. 

De novo mutation plays an important role in autism spectrum disorders (ASDs). Notably, pathogenic copy number variants (CNVs) are characterized by high mutation rates. We hypothesize that hypermutability is a property of ASD genes and may also include nucleotide-substitution hot spots. We investigated global patterns of germline mutation by whole-genome sequencing of monozygotic twins concordant for ASD and their parents. Mutation rates varied widely throughout the genome (by 100-fold) and could be explained by intrinsic characteristics of DNA sequence and chromatin structure. Dense clusters of mutations within individual genomes were attributable to compound mutation or gene conversion. Hypermutability was a characteristic of genes involved in ASD and other diseases. In addition, genes impacted by mutations in this study were associated with ASD in independent exome-sequencing data sets. Our findings suggest that regional hypermutation is a significant factor shaping patterns of genetic variation and disease risk in humans.