Tuesday, 25 June 2013
Computational diet: how to make your Gb human genome as light as a few Mb
With technology developing rapidly, whole exome/genome sequencing of individual subjects is nowadays being performed by many labs all around the world. From the first examples of individual genomes, such as the complete DNA of Venter or Watson, we have now reached a point where the entire 6 billion bases can be sequenced in about a day. With such a high data production rate, data storage has suddenly emerged as a painful thorn in the NGS boot, getting worse with every (sequencing) run!
In a future where genome-based personalized medicine becomes a reality, your DNA sequence will need to be stored throughout your life, like any other medical record.
Long-term storage of at least the complete DNA sequence could be crucial: as new genotype-phenotype correlations are discovered, the owner of the genome could be updated with the new relevant information.
Different solutions have been proposed to reduce the amount of information that has to be stored for a single genome sequence, so as to make storage of large genomic datasets feasible. At least the final exome/genome sequence of an individual can now be reduced to a few Mb of disk space, so that you can fit several of them even on a standard hard disk.
This incredible result is achieved using various compression algorithms. The first generation basically used information on repetitive regions in the genome to reduce the final file size to a few hundred Mb. Second-generation algorithms are instead based on a reference sequence and only store the differences between the reference and the sequenced DNA. The best performing tool until now was DNAZip, which reduced the Watson genome sequence to only 4 Mb, something that you can easily share as an email attachment, as the authors stated. Now Pavlichin et al. from Stanford University have pushed compression even further. Their solution is based on the same approach and takes advantage of the dbSNP database, so that the positions of already known SNPs don't have to be stored in the final file. To further shrink the file size, positions of novel SNPs are not stored individually but as distances from the previous SNP. These improvements, together with a brand new compression function and a haplotype-based trick, push the file size for a single complete genome down to 2.5 Mb.
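To make the two position tricks described above more concrete, here is a minimal Python sketch. It is only a toy illustration and not the actual file format or code of Pavlichin et al. or DNAZip. The function names, the `catalogue` list (a tiny stand-in for dbSNP) and the assumption that every catalogued site has a single alternative base are all made up for the example. Known SNPs become one flag per catalogue entry, and novel SNPs are stored as gaps from the previous novel SNP.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for dbSNP: sorted (position, alternative base) pairs.
Catalogue = List[Tuple[int, str]]


def encode_variants(sample: Dict[int, str], catalogue: Catalogue):
    """Encode a sample's SNPs (position -> base) against a known-SNP catalogue."""
    # One flag per catalogued site: 1 if the sample carries that exact SNP.
    flags = [1 if sample.get(pos) == alt else 0 for pos, alt in catalogue]
    matched = {pos for (pos, alt), flag in zip(catalogue, flags) if flag}

    # Everything else is "novel": store the gap from the previous novel SNP
    # instead of the absolute position, which gives small, compressible numbers.
    novel_positions = sorted(pos for pos in sample if pos not in matched)
    gaps, previous = [], 0
    for pos in novel_positions:
        gaps.append(pos - previous)
        previous = pos
    novel_bases = [sample[pos] for pos in novel_positions]
    return flags, gaps, novel_bases


def decode_variants(flags, gaps, novel_bases, catalogue: Catalogue) -> Dict[int, str]:
    """Rebuild the position -> base map; needs the same catalogue used to encode."""
    sample = {pos: alt for (pos, alt), flag in zip(catalogue, flags) if flag}
    position = 0
    for gap, base in zip(gaps, novel_bases):
        position += gap  # running sum turns gaps back into absolute positions
        sample[position] = base
    return sample
```

In a real tool the flags and gaps would then go through an entropy coder, and the reference sequence would still be needed to turn the SNP list back into a full genome. The sketch only shows why the numbers that have to be stored end up small.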
The main disadvantage is that you need the reference sequence and the dbSNP database in order to reconstruct your genome, but when dealing with thousands of genomes this is a minor drawback.
Unfortunately, less can be done for the bulk of sequence-related information that is so useful for research purposes (such as base and read quality scores, genotype scores, alignment metrics and so on). It seems that you still need a tower of drives to store it all! However, a lot of effort has been put into this area too, since the possibility of reanalyzing older datasets with new techniques and new tools can lead to new discoveries. There is now even a prize for data compression, the Sequence Squeeze competition, which also has updated information on the best performing tools.