Monday, 7 April 2014

Some useful tools for your everyday NGS analysis

There are a lot of tools that can assist you at every step of an NGS data analysis. Here are some interesting pieces of software I've recently started using.

SAMStat - published in Bioinformatics
This tool takes your SAM/BAM/FASTQ files and computes several metrics describing the frequency and distribution of bases across all your reads. Results include: stats on the MAPQ of reads, distribution of MAPQ across read length, nucleotide over-representation across reads, error distribution and identification of over-represented 2-mers and 10-mers. All the data are conveniently presented in an HTML summary and help you identify potential issues in your sequencing experiment. Moreover, these graphs are always useful as quality reports for your presentations!
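To give an idea of what "over-represented k-mers" means, here is a minimal Python sketch. It is not SAMStat's actual method: the function names are hypothetical and the comparison against a uniform expectation (every k-mer equally likely) is a simplifying assumption made only for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (overlapping windows) across a list of read sequences."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def overrepresented(counts, k, fold=2.0):
    """Return {k-mer: observed/expected} for k-mers at least `fold` times
    more frequent than expected if all 4**k k-mers were equally likely
    (an assumption made for this sketch only)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    expected = total / 4 ** k
    return {
        kmer: n / expected
        for kmer, n in counts.items()
        if n / expected >= fold
    }


if __name__ == "__main__":
    reads = ["ACACAC", "ACGT"]  # toy reads, not real data
    counts = kmer_counts(reads, 2)
    print(overrepresented(counts, 2, fold=3.0))
```

A real tool would use a background model derived from the data or the reference rather than a uniform one.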

ngsCAT - published in Bioinformatics
This command line tool provides a detailed analysis of your mapped reads given a set of defined target regions. It computes several metrics and stats: mean coverage, number of bases covered at least n-fold, duplicated reads, distribution of on-target reads across chromosomes and uniformity of coverage across target regions. The tool requires a BAM file and a BED file as inputs and produces several graphs and tables and a final summary report in HTML format. Really simple and useful to assess the quality of your target capture!
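As an illustration of two of these metrics, the sketch below computes the mean coverage and the percentage of target bases covered at least n-fold from a plain list of per-base depths. This is not ngsCAT's code; the function name and the default thresholds are assumptions chosen for the example, and in practice the depths would come from your BAM file restricted to the BED regions.

```python
from statistics import mean


def coverage_summary(depths, thresholds=(1, 10, 20)):
    """Summarize per-base depths over the target regions.

    depths: one integer depth per target base.
    thresholds: fold values for the "covered at least n-fold" metric
                (illustrative defaults).
    """
    if not depths:
        raise ValueError("no target bases provided")
    n_bases = len(depths)
    summary = {"mean_coverage": mean(depths)}
    for t in thresholds:
        covered = sum(1 for d in depths if d >= t)
        summary[f"pct_bases_ge_{t}x"] = 100.0 * covered / n_bases
    return summary


if __name__ == "__main__":
    toy_depths = [0, 5, 10, 20, 40]  # toy values, not real data
    for name, value in coverage_summary(toy_depths).items():
        print(name, value)
```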

GRAB - published in PLOS ONE
This tool is as simple as it is clever. It takes genotyping information for your subjects in various formats (Genotype/Var/masterVar/GVF/VCF/PEDMAP/TPED) and computes their possible relationships. It works best with whole genome data, but I've also tested it using VCF files from exome sequencing and, after reducing the default reading window, it performs well, at least in identifying first- and second-degree relationships. This tool requires R to be installed on your system and it's really fast in performing the analysis. It is useful when you are dealing with a group of samples and want to verify that there are no members of the same family.

MendelScan - published in the American Journal of Human Genetics
This tool is also described by the author on his blog, MassGenomics. This software performs variant prioritization for family-based exome sequencing studies. It needs some preliminary steps to prepare the necessary input files: a multisample VCF file with variants from the family members, a PED file describing the relations between samples, a ranked list of the genes most expressed in your tissue of interest and the VEP-annotated list of your variants. Given these data, the tool computes a ranked list of identified variants based on the selected inheritance model (recessive or dominant). Moreover, it includes two additional modules developed by the authors: the Rare Heterozygous Rule Out and the Shared Identity-by-Descent. The first one "identifies candidate regions consistent with autosomal dominant inheritance based on the idea that a disease-causing haplotype will manifest regions of rare heterozygous variants shared by all affecteds, and an absence of homozygous differences between affected pairs (which would indicate that a pair had no haplotype in common)", while the second one "uses BEAGLE FastIBD results to identify regions of maximum identity-by-descent (IBD) among affected pairs". This tool integrates the canonical prediction scores (such as Polyphen, PhyloP and so on) with gene expression ranking and the newly developed methods to provide a straightforward analysis for your Mendelian disease NGS studies!

SPRING - published in PLOS Genetics
Like MendelScan, this is another tool for variant prioritization. This one also has a web version for those who are not familiar with the command line. The tool takes a list of seed genes already known to be involved in the pathology or in similar phenotypes and a list of your candidate missense variants, and gives you a ranked list of the variants that have a high probability of being associated with the disease. This tool works well for diseases with high genetic heterogeneity, for which you can easily and confidently build a list of seed genes. SPRING can then be really useful in prioritizing candidate variants emerging from new studies.

PRADA - published in Bioinformatics
This tool is focused on RNA-Seq and provides a complete framework for the analysis of this kind of data. It is a complete solution, since it can perform several kinds of analysis starting from raw paired-end RNA-seq data: gene expression levels, quality metrics, unsupervised and supervised detection of fusion transcripts, detection of intragenic fusion variants, homology scores and fusion frame classification. As the project page reports:
"PRADA currently supports 7 modules to process and identify abnormalities from RNAseq data:
preprocess: Generates aligned and recalibrated BAM files.
expression: Generates gene expression (RPKM) and quality metrics.
fusion: Identifies candidate gene fusions.
guess-ft: Supervised search for fusion transcripts.
guess-if: Supervised search for intragenic fusions.
homology: Calculates homology between given two genes.
frame: Predicts functional consequence of fusion transcript"

This is a good starting point for those not familiar with RNA-Seq!

Friday, 4 April 2014

A gold standard dataset of SNP and Indels for benchmarking tools

One of the pains of NGS data analysis is that different tools, different settings and different sequencing platforms produce different and often poorly overlapping sets of variants. Every analysis pipeline has its own peculiar issues and produces specific false positive/negative calls.

The need for a gold standard reference of SNP and indel calls is thus of primary relevance to correctly assess the accuracy and sensitivity of new NGS pipelines. Robust benchmarking of variant identification pipelines is even more critical as NGS analysis is fast moving from research to the diagnostic/clinical field.
In this interesting paper in Nature Biotechnology, the authors analyzed 14 different variant datasets from the same standard genome, NA12878 (chosen as the standard reference genome by the Genome in a Bottle Consortium), to produce a list of gold standard SNPs and indels and to identify genomic regions that are particularly difficult to address. The final dataset is the most robust produced so far, since it integrates data from 5 different sequencing technologies, 7 read mappers and 3 variant callers to obtain a robust estimation of SNVs.
The final standard for evaluating your favorite pipeline is finally here, publicly available on the Genome Comparison and Analytic Testing website!
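To make the idea of benchmarking concrete, here is a minimal Python sketch of how a pipeline's calls could be scored against a gold standard set. It is only an illustration under strong assumptions: variants are represented as already-normalized (chrom, pos, ref, alt) tuples, both sets are assumed to be restricted to the confidently callable regions, and the function name is hypothetical. It is not the method used by the paper or by the website.

```python
def benchmark_calls(calls, truth):
    """Compare pipeline calls with a gold standard set.

    Both arguments are iterables of (chrom, pos, ref, alt) tuples,
    assumed to be normalized the same way (an assumption of this sketch).
    """
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called and in the gold standard
    fp = len(calls - truth)   # called but not in the gold standard
    fn = len(truth - calls)   # in the gold standard but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}


if __name__ == "__main__":
    # Toy variants, not real NA12878 calls.
    truth = [("1", 100, "A", "G"), ("1", 200, "C", "T"),
             ("2", 300, "G", "A"), ("2", 400, "T", "TA")]
    calls = [("1", 100, "A", "G"), ("2", 300, "G", "A"),
             ("3", 500, "C", "G")]
    print(benchmark_calls(calls, truth))
```

Real comparisons are harder than a set intersection, mainly because the same indel can be written in several equivalent ways, which is one reason a curated benchmark is so valuable.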

Nat Biotechnol. 2014 Feb 16. 
Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls.

Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, Salit M. 

Clinical adoption of human genome sequencing requires methods that output genotypes with known accuracy at millions or billions of positions across a genome. Because of substantial discordance among calls made by existing sequencing methods and algorithms, there is a need for a highly accurate set of genotypes across a genome that can be used as a benchmark. Here we present methods to make high-confidence, single-nucleotide polymorphism (SNP), indel and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. We minimize bias toward any method by integrating and arbitrating between 14 data sets from five sequencing technologies, seven read mappers and three variant callers. We identify regions for which no confident genotype call could be made, and classify them into different categories based on reasons for uncertainty. Our genotype calls are publicly available on the Genome Comparison and Analytic Testing website to enable real-time benchmarking of any method.

Tuesday, 25 March 2014

Flash Report: Loblolly pine genome is largest ever sequenced

Below is the report from

Scientists say they've generated the longest genome sequence to date, unraveling the genetic code of the loblolly pine tree.
Conifers have been around since the age of the dinosaurs, and they have some of the biggest genomes of all living things.
Native to the U.S. Southeast, the loblolly pine (Pinus taeda) can grow over 100 feet (30 meters) tall and has a lengthy genome to match, with 23 billion base pairs. That's more than seven times the size of the human genome, which has 3 billion base pairs. (These pairs form sequences called genes that tell cells how to make proteins.)
"It's a huge genome. But the challenge isn't just collecting all the sequence data. The problem is assembling that sequence into order," study researcher David Neale, a professor of plant sciences at the University of California, Davis, said in a statement.
To simplify this huge genetic puzzle, Neale and colleagues assembled most of the sequence from part of a single pine nut-- a haploid part of the seed with just one set of chromosomes to piece together.
The new research showed that the loblolly genome is bloated with repetitive DNA. In fact, 82 percent of the genome repeats itself, the researchers say.
Understanding the loblolly pine's genetic code could lead to improved breeding of the tree, which is used to make paper and lumber and is being investigated as a potential biofuel, the scientists say.
The loblolly pine joins other recently sequenced conifers, including the Norway spruce (Picea abies), which has 20 billion base pairs. For their next project, the researchers are eyeing the sugar pine, a tree with 35 billion base pairs.
The research was detailed this week in the journals Genetics and Genome Biology.

Friday, 21 February 2014

Flash Report: King Richard III genome is going to be sequenced soon

The list of famous people whose genome has been sequenced is going to get a new member, and a royal one at that. After the discovery of his remains under a car park, scientists at the University of Leicester are going to sequence the genome of King Richard III, who died in the Battle of Bosworth Field in 1485. The project aims to learn more about the King's ancestry and health, and to provide genetic data useful for historians, researchers and the public.
In addition to the scientific aspects, initiatives of this type are also a good strategy to attract media attention to the institution involved in the sequencing and, why not, to improve the chances of raising funds for other projects with a deeper scientific impact.
In general, the choice to sequence the genome of famous historical figures could be a good way to acknowledge what they did for their country and a great opportunity for visibility for the institution performing the study. Here is the link to the Reuters news release.

Tuesday, 18 February 2014

Genapsys reveals GENIUS at AGBT

At the last AGBT meeting Genapsys, a small company from California, presented its new "lunch-box" sequencer called GENIUS (Gene Electronic Nano-Integrated Ultra-Sensitive).
This machine, about the size of a toaster, is a new kind of NGS sequencer that applies electronic sensor technology and promises to produce up to 100 Gb of sequence in a few hours, with read lengths up to 1000 bp. Like the MinION, it aims to push the NGS market even further, providing a new generation of sequencers that are incredibly small, cheap and easy to operate. A step further towards the so-called "freedom of sequencing".

Genapsys started an early-access program for the GENIUS platform last week and plans to ship out the first instruments within a few months, followed by a general commercial launch either this year or next year.

The system has an opening in the front to insert a small, square semiconductor-based sequencing chip. A reusable reagent cartridge attaches to the back, and a computer for data processing is integrated. According to the Genapsys CEO's interview with InSequence, library preparation requires clonal amplification of the template DNA on beads in an emulsion-free off-instrument process that requires no specialized equipment from the company. While the first version of the Genius will require separate sample and library preparation, the company plans to integrate sample prep into the platform eventually. The template beads are then pipetted into the chip, which has "millions of sensors" and a flat surface with no wells. There, the beads assemble into an array-like pattern, with each bead being individually addressable. This is followed by polymerase-based DNA synthesis where nucleotides are added sequentially. The system uses an electronic detection method to identify which nucleotide was incorporated, but the detection principle remains secret for now, although it is not based on pH measurements (as the Ion Torrent technology is).

Three types of chips will be available for the Genius, generating up to 1 gigabase, 20 gigabases, or 100 gigabases of data, with the sequencing run taking only a few hours. Estimated sequencing costs per gigabase will be $300 for the smallest chip, $10 for the middle chip, and $1 for the largest chip. Although the instrument price has not been determined yet, Genapsys will provide special offers to customers committing to high usage. This will be something like a usage plan that includes the instrument at a lower price, but with a minimum spend on consumables per year. Esfandyarpour, Genapsys CEO, said they have already generated sequence data on the platform, though he did not provide specifics. The data quality will be "equivalent or better than the best product out there," he said.
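The per-run arithmetic implied by these figures can be sketched in a few lines of Python. This assumes that each chip is run at its maximum quoted yield and that the per-gigabase estimate covers the whole run; the chip labels are just illustrative names, not Genapsys product names.

```python
# (maximum yield in Gb, estimated cost in USD per Gb) as quoted above.
# The dictionary keys are illustrative labels, not official product names.
CHIPS = {
    "small": (1, 300),
    "medium": (20, 10),
    "large": (100, 1),
}


def run_cost(max_yield_gb, usd_per_gb):
    """Estimated cost of one run, assuming the chip reaches its maximum yield."""
    return max_yield_gb * usd_per_gb


if __name__ == "__main__":
    for label, (yield_gb, usd_per_gb) in CHIPS.items():
        print(f"{label}: {yield_gb} Gb at ${usd_per_gb}/Gb "
              f"= ${run_cost(yield_gb, usd_per_gb)} per run")
```

Under these assumptions the runs would cost about $300, $200 and $100 respectively, so the largest chip would be the cheapest per run as well as per gigabase.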

Friday, 14 February 2014

Uzbekistan applies genetic testing to select future Olympians

Strange, but true!
According to this news reported by The Atlantic, beginning in 2015, Uzbekistan will incorporate genetic testing into its search for Olympic athletes.

Rustam Muhamedov and his colleagues from the genetics laboratory of Uzbekistan's Institute of Bioorganic Chemistry are working on developing a set of 50 genes to determine what sport a child is best suited for. The national trainers would then start working with highly predisposed children, knowing exactly which sport best suits their physical characteristics.

"Developed countries throughout the world like the United States, China, and European countries are researching the human genome and have discovered genes that define a propensity for specific sports," Muhamedov tells the Atlantic. "We want to use these methods in order to help select our future champions."
The program, overseen by Uzbekistan's Academy of Sciences, would be "implemented in practice" in early 2015 in cooperation with the National Olympic Committee and several of the country's national sports federations—including soccer, swimming, and rowing.

So far there is no explicit ban on genetic testing or genetic selection of athletes, even if the World Anti-Doping Agency discourages such practices. In the past, there were suspicions that the Chinese government had also applied some sort of genetic testing, or at least encouraged marriage and pregnancy between people with the desired predisposition, with the aim of generating better athletes (see the Yao Ming story as an example).

Many experts doubt that genetic testing could really improve performance more than an excellent training program, given that the genotype-phenotype correlation for many physiological traits relevant to athletes has not yet been fully explored or understood.
The explicit use of genetic testing, however, poses some ethical questions and calls for official rules on its application in sport.

We'll see if the Uzbekistan effort will push the country to the top of the Olympic rankings!

Thursday, 13 February 2014

Follow in real time the 2014 AGBT Meeting

The 15th annual Advances in Genome Biology and Technology (AGBT) meeting is being held in Marco Island, Florida, on February 12-15. You can follow the latest announcements via Twitter from the top tab we placed on our blog page (the tab will be removed at the end of the meeting).

Here you will find a list of some of the most interesting talks, while the complete agenda is available on the official website of the event.

Wednesday, 12 February 2014

Genomic sequencing in newborns to improve healthcare: pilot projects funded by NIH!

As NGS technology becomes cheaper and more robust over the next few years and our knowledge about genotype-phenotype associations increases, the idea of performing whole genome sequencing as a standard test on newborns may become an actual strategy in healthcare.

The genomes of infants may be able to be sequenced shortly after birth and allow parents to know what diseases or conditions their child may be affected by or have a propensity for developing, giving them the chance to possibly head them off or start on early treatments.

To test this approach, the US National Institutes of Health has awarded $5 million in research grants to four pilot programs to study newborn screening. The research programs aim to develop the science and technology as well as to investigate the ethical issues related to such screening.
This pilot project, also covered by The New York Times, is the first effort to assess the impact of whole genome screening on quality of life and on our ability to provide better healthcare.
Genomic sequencing may reveal many problems that could be treated early in a child’s life, avoiding the diagnostic odyssey that parents can endure when medical problems emerge later, said Dr. Cynthia Powell, winner of one of the research grants.

However, the role of each and every variant and how to interpret their contribution to disease risk is not yet fully understood and many questions remain on which variants to report and if and how genomic findings translate into improved therapies or life quality. This matter will also be addressed in the funded studies.

“Many changes in the DNA sequence aren't disease-causing,” said Dr. Robert Nussbaum, chief of the genomic medicine division at the School of Medicine of the University of California, San Francisco, and leader of one of the pilot grants. “We aren't very good yet at distinguishing which are and which aren’t.”

“You will get dozens of findings per child that you won’t be able to adequately interpret,” said Dr. Jeffrey Botkin, a professor of pediatrics and chief of medical ethics at the University of Utah. The ethical issues of sequencing are sharply different when it is applied to children. Adults can decide which test information they want to receive, but children won’t usually have that option. Their parents will decide, and children may end up with information that they would rather not have known once they become adults.

"We are not ready now to deploy whole genome sequencing on a large scale," said Eric Green, the director of the National Human Genome Research Institute, which promotes the research program, "but it would be irresponsible not to study the problem."
"We are doing these pilot studies so that when the cost of genomic sequencing comes down, we can answer the question, 'Should we do it?' " he adds.

Here are the four pilot projects funded by the NIH, as reported in the official news:
  • Brigham and Women's Hospital and Boston Children's Hospital, Boston
    Principal Investigators: Robert Green, M.D., and Alan Beggs, Ph.D.

    This research project will accelerate the use of genomics in pediatric medicine by creating and safely testing new methods for using information obtained from genomic sequencing in the care of newborns. It will test a new approach to newborn screening, in which genomic data are available as a resource for parents and doctors throughout infancy and childhood to inform health care.  A genetic counselor will provide the genomic sequencing information and newborn screening results to the families.  Parents will then be asked about the impact of receiving genomic sequencing results and if the information was useful to them.  Researchers will try to determine if the parents respond to receiving the genomic sequencing results differently if their newborns are sick and if they respond differently to receiving genomic sequencing results as compared to current newborn screening results. Investigators will also develop a process for reporting results of genomic sequencing to the newborns' doctors and investigate how they act on these results.
  • Children's Mercy Hospital - Kansas City, Mo.
    Principal Investigator: Stephen Kingsmore, M.D.

    Many newborns require care in a neonatal intensive care unit (NICU), and this group of newborns has a high rate of disability and death. Given the severity of illness, these newborns may have the most to gain from fast genetic diagnosis through the use of genomic sequencing. The researchers will examine the benefits and risks of using rapid genomic sequencing technology in this NICU population. They also aim to reduce the turnaround time for conducting and receiving genomic sequencing results to 50 hours, which is comparable to other newborn screening tests. The researchers will test if their methods increase the number of diagnoses or decrease the time it takes to reach a diagnosis in NICU newborns. They will also study if genomic sequencing changes the clinical care of newborns in the NICU.  Additionally, the investigators are interested in doctor and parent perspectives and will try to determine if parents' perception of the benefits and risks associated with the results of sequencing change over time.
  • University of California, San Francisco 
    Principal Investigator: Robert Nussbaum, M.D.

    This pilot project will explore the potential of exome sequencing as a method of newborn screening for disorders currently screened for and others that are not currently screened for, but where newborns may benefit from screening. The researchers will examine the value of additional information that exome sequencing provides to existing newborn screening that may lead to improved care and treatment. Additionally, the researchers will explore parents' interest in receiving information beyond that typically available from newborn screening tests. The research team also intends to develop a participant protection framework for conducting genomic sequencing during infancy and will explore legal issues related to using genome analysis in newborn screening programs. Together, these studies have the potential to provide public health benefit for newborns and research-based information for policy makers.
  • University of North Carolina at Chapel Hill 
    Principal Investigators: Cynthia Powell, M.D., M.S., and Jonathan Berg, M.D., Ph.D.

    In this pilot project, researchers will identify, confront and overcome the challenges that must be met in order to implement genomic sequencing technology to a diverse newborn population. The researchers will sequence the exomes of healthy infants and infants with known conditions such as phenylketonuria, cystic fibrosis or other disorders involving metabolism. Their goal is to help identify the best ways to return results to doctors and parents. The investigators will explore the ethical, legal and social issues involved in helping doctors and parents make informed decisions, and develop best practices for returning results to parents after testing. The researchers will also develop a tool to help parents understand what the results mean and examine extra challenges that doctors may face as this new technology is used. This study will place a special emphasis on including multicultural families.

Monday, 10 February 2014

PubMed Highlights: NGS library preparation

Starting with a robust sequencing library is the first and crucial step to obtain unbiased, high-quality results from your next generation sequencer!
Here are a couple of interesting papers reviewing problems and solutions related to NGS library preparation. They also give a concise overview of the current library preparation methods and how they fit different downstream applications.
Take a look!

Library preparation methods for next-generation sequencing: Tone down the bias.
Exp Cell Res. 2014 Jan 15

Next-generation sequencing (NGS) has caused a revolution in biology. NGS requires the preparation of libraries in which (fragments of) DNA or RNA molecules are fused with adapters followed by PCR amplification and sequencing. It is evident that robust library preparation methods that produce a representative, non-biased source of nucleic acid material from the genome under investigation are of crucial importance. Nevertheless, it has become clear that NGS libraries for all types of applications contain biases that compromise the quality of NGS datasets and can lead to their erroneous interpretation. A detailed knowledge of the nature of these biases will be essential for a careful interpretation of NGS data on the one hand and will help to find ways to improve library quality or to develop bioinformatics tools to compensate for the bias on the other hand. In this review we discuss the literature on bias in the most common NGS library preparation protocols, both for DNA sequencing (DNA-seq) as well as for RNA sequencing (RNA-seq). Strikingly, almost all steps of the various protocols have been reported to introduce bias, especially in the case of RNA-seq, which is technically more challenging than DNA-seq. For each type of bias we discuss methods for improvement with a view to providing some useful advice to the researcher who wishes to convert any kind of raw nucleic acid into an NGS library.

High-throughput sequencing, also known as next-generation sequencing (NGS), has revolutionized genomic research. In recent years, NGS technology has steadily improved, with costs dropping and the number and range of sequencing applications increasing exponentially. Here, we examine the critical role of sequencing library quality and consider important challenges when preparing NGS libraries from DNA and RNA sources. Factors such as the quantity and physical characteristics of the RNA or DNA source material as well as the desired application (i.e., genome sequencing, targeted sequencing, RNA-seq, ChIP-seq, RIP-seq, and methylation) are addressed in the context of preparing high quality sequencing libraries. In addition, the current methods for preparing NGS libraries from single cells are also discussed.

Monday, 3 February 2014

The sequencing frenzy! Not only human genomes!

With NGS cutting down the costs, Illumina pushing hard to increase sequencing production and some expectation from the new nanopore technology, the past year saw an explosion of genome sequencing.
Lots of organisms, even some bizarre creatures, had their genome sequenced and lots of new sequencing programs have been started, aiming to sequence thousands and thousands of new genomes in the next few years!
I have a personal interest in evolutionary and comparative genomics, so I always appreciate a new genome and I'm particularly intrigued by the genomes of exotic organisms... Sometimes they may not be so informative, but it is always fun to tell your friends the story of the genome of the white tiger!
I made a rapid survey of what I've missed in the last couple of months and here are some cool new genomes:

The Burmese python genome (total of 1.44 Gb) and the King cobra genome (total of 1.66 Gb).

These two snake genomes were published in December in PNAS and give new insights into the evolution of snakes and the peculiar adaptations related to their metabolism and to venom production. Both papers report results from genome sequencing as well as transcriptome characterization, providing a complete picture of several interesting and poorly understood aspects of snake biology.

The first paper is focused on the molecular basis of morphological and physiological adaptations in snakes. Positive selection acted in ancestral snakes on many genes related to metabolism, development, lungs, eyes, heart, kidney, and skeletal structure, all highly modified features in snakes. To better study the genetic basis of the extreme phenotypes of the python, the authors also compared the python genome with the king cobra genome and genomic samples from other snakes. They also performed a detailed transcriptome analysis and found responsive genes associated with metabolism, development, and also mammalian diseases.

The second paper is focused on snake venom, a fascinating toxic protein cocktail. The authors investigate the evolution of this complex biological weapon by sequencing the genome of the king cobra and performing transcriptome analysis to assess the composition of venom gland expressed genes, small RNAs, and secreted venom proteins. They found that "toxin genes important for prey capture have massively expanded by gene duplication and evolved under positive selection, resulting in protein neofunctionalization". There is a lot of interest in animal venom as a source of new bio-active peptides with possible applications as human drugs, and this article advances the field by providing lots of new information on the origin and evolution of venom proteins.

The Locust genome (total of 6.5 Gb)

The genome of L. migratoria was published in Nature in January.
There is no doubt that locusts are among the world’s most destructive agricultural pests, as also shown by their role as a divine punishment! Locusts are grasshopper species and they show a remarkable capacity for swarming and long-distance migration. Locust swarms form suddenly from the congregation of billions of insects, and they can fly hundreds of kilometres each day, and even cross oceans. They are also quite voracious, and a single individual can consume its own body weight in food every day! The authors of this paper combined genome sequencing with a set of transcriptome and methylome data from gregarious and solitarious locusts to get insights into the adaptations behind the locust machine! They revealed peculiar findings on the neuronal regulatory mechanisms underlying phase change in the locust, together with a significant expansion of gene families associated with energy consumption and detoxification, consistent with long-distance flight capacity and phytophagy. Moreover, they also identified hundreds of potential insecticide target genes, such as ion channels, G-protein-coupled receptors and lethal genes. Beware, locusts!

The Elephant shark genome (0.93 Gb)

In this paper published in Nature in January, the authors report the whole-genome analysis of a cartilaginous fish, the elephant shark (Callorhinchus milii). This genome will provide new insights into the evolution of gnathostomes from jawless vertebrates, a transition accompanied by many morphological and phenotypic innovations: jaws, paired appendages and an adaptive immune system based on immunoglobulins, T-cell receptors and major histocompatibility complex (MHC). Moreover, they also found a lack of genes encoding secreted calcium-binding phosphoproteins, suggesting an explanation for the absence of bone.
They also found that "the C. milii genome is the slowest evolving of all known vertebrates and features extensive synteny conservation with tetrapod genomes, making it a good model for comparative analyses of gnathostome genomes". The paper also analyzes some peculiar aspects of the adaptive immune system of cartilaginous fishes: "it lacks the canonical CD4 co-receptor and most transcription factors, cytokines and cytokine receptors related to the CD4 lineage, despite the presence of polymorphic major histocompatibility complex class II molecules. It thus presents a new model for understanding the origin of adaptive immunity."

The giant Galápagos tortoise (C. nigra) transcriptome

In this study published in Genome Biology in December, the authors performed transcriptome sequencing on five C. nigra individuals from three distinct subspecies. They also analyzed samples from the congeneric red-footed tortoise C. carbonaria and from the Spanish pond turtle Mauremys leprosa. To get a complete picture of tortoise evolution, previously published transcriptome data from the European pond turtle Emys orbicularis and the pond slider Trachemys scripta were also considered.
Based on this dataset, they performed a population genomic study of the giant Galápagos tortoise, a species endemic to the Galápagos archipelago. C. nigra is an interesting species: it is the largest known living terrestrial turtle and can live well beyond 100 years. From mtDNA analyses the authors suggested that "this insular species has been isolated from the South American continent during millions of years". C. nigra is therefore a perfect model for the study of adaptation following island colonization, and the results "point to island endemic species as a promising model for the study of the deleterious effects on genome evolution of a reduced long-term population size". Among other interesting results, the authors found a reduced diversity of immunity genes, supporting the hypothesis of attenuated pathogen diversity in the island-restricted habitat, and an increased selective pressure on genes involved in the response to stress, potentially related to the climatic instability and the extended lifespan of this species.

After these intriguing examples, here are the major sequencing programs that promise to provide us with more and more genomes in the next few years:

This project aims to sequence the genomes of 10,000 vertebrate species, covering amphibians, birds, reptiles, mammals, fishes and teleosts. The declared goal is "To understand how complex animal life evolved through changes in DNA and use this knowledge to become better stewards of the planet."
The project is co-directed by David Haussler (Howard Hughes Medical Institute, University of California, Santa Cruz); Stephen J. O'Brien (Chief Scientific Officer, Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg, Russia) and Oliver A. Ryder (San Diego Zoo, Institute for Conservation Research, San Diego, CA). Its collaborators and promoters also include BGI.

The i5K Insect and other Arthropod Genome Sequencing Initiative
Started officially in 2011, "the i5k initiative plans to sequence the genomes of 5,000 insect and related arthropod species over the next 5 years. This project will be transformative because it aims to sequence the genomes of all insect species known to be important to worldwide agriculture, food safety, medicine, and energy production; all those used as models in biology; the most abundant in world ecosystems; and representatives in every branch of insect phylogeny so as to achieve a deep understanding of arthropod evolution and phylogeny". The collaborators on the project have already produced more than 60 genomes, including various species of Drosophila, Apis mellifera, Bombyx mori, Aedes aegypti, Anopheles gambiae, Ixodes scapularis and many others.

Supported by BGI and the China National Genebank (CNGB), this project started at the end of 2013 and aims to "unveil the mysteries of the origin, evolution and diversification of the largest group of vertebrates." Moreover, "all data generated from Fish T1K will be made available publicly through CNGB, ensuring that scientists have access to new developments and trends in fish research and the use of RNA-seq technology."

This is another insect-related initiative. Insects are one of the most species-rich groups of metazoan organisms. They play a pivotal role in most non-marine ecosystems and they are of enormous economic and medical importance. With about 20 international partners involved, the 1K Insect Transcriptome Evolution project "aims to study the transcriptomes (that is the entirety of expressed genes) of more than 1,000 insect species encompassing all recognized insect orders. For each species, so-called ESTs (Expressed Sequence Tags) will be produced using next generation sequencing techniques (NGS). [...]. The expected data will allow inferring for the first time a robust phylogenetic backbone tree of insects. Furthermore, the project includes the development of new software for data quality assessment and analysis."

The Global Invertebrate Genomics Alliance (GIGA) is an initiative started in 2013 that brings together diverse scientists "with the intent of growing a collaborative network that can address the major problems associated with genomic sequencing of a large taxonomic spectrum - sample collection and processing, data handling, sequence annotation, alignment and access, as well as intellectual property issues." The entire project is focused on (non-insect/non-nematode) invertebrates, which "comprise over 70% of all described metazoan species diversity, yet most of their genomes (complete hereditary material, DNA code) remain relatively unknown and understudied".