Monday, 14 May 2012

CGHub at UCSC: Cancer data at Petabyte scale

Several recent announcements have made genomic data freely available online, supporting the idea that genomics has entered a new era in which data production greatly exceeds the capacity to analyze it. The solution is to make all these data available to the community and let other research groups continue the analysis. The recent CGHub initiative from UCSC and NCI goes exactly in this direction and stands out for the sheer size of its dataset.

The University of California, Santa Cruz, best known for its Genome Browser, is now building a petabyte-scale data repository for cancer genome projects, called CGHub (Cancer Genomics Hub).
The project is funded by the National Cancer Institute with over $10M. It started with 5 petabytes available and can grow as needed by the three main NCI cancer genome projects: The Cancer Genome Atlas (TCGA), the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) and the Cancer Genome Characterization Initiative (CGCI).
CGHub is therefore expected to become the largest available dataset on cancer genomics, ranging from adult cancers to childhood cancers to HIV-related tumors.

Currently, TCGA alone generates about 10 terabytes of data each month, and its output is expected to increase rapidly, reaching a total of 10 petabytes of DNA and RNA data from the 10,000 patients involved.
Besides providing the storage space, UCSC is also working on innovative software solutions to grant rapid and flexible access to this huge amount of data, as well as to compress it and simplify the management of petabyte-scale storage.
"We would like to compress the data down to one tenth of its current size and that will not be possible without losing some information," said David Haussler, a professor of biomolecular engineering at UCSC in charge of the CGHub project. At present, the cancer genomics community is "working very hard to decide what information we can sacrifice in these very valuable data."
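To put these figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above; the decimal conversion (1 petabyte = 1,000 terabytes) and the function names are my own assumptions for illustration, not anything published by CGHub.

```python
# Back-of-the-envelope arithmetic on the figures quoted in this post.
# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000


def months_to_fill(capacity_pb, rate_tb_per_month):
    """Months needed to fill a given capacity at a constant monthly rate."""
    return capacity_pb * TB_PER_PB / rate_tb_per_month


def per_patient_tb(total_pb, n_patients):
    """Average amount of data per patient, in terabytes."""
    return total_pb * TB_PER_PB / n_patients


def compressed_size_pb(total_pb, factor):
    """Size after compressing by the given factor (10 means one tenth)."""
    return total_pb / factor


# 5 PB initial capacity at today's TCGA rate of 10 TB per month
print(months_to_fill(5, 10))        # 500.0 months
# 10 PB expected in total from 10,000 patients
print(per_patient_tb(10, 10_000))   # 1.0 TB per patient
# The tenfold compression Haussler mentions, applied to 10 PB
print(compressed_size_pb(10, 10))   # 1.0 PB
```

At the current rate it would take about 500 months to fill the initial 5 petabytes, which shows how steeply TCGA output would have to grow to reach the expected 10 petabytes. That total works out to roughly 1 terabyte per patient, and a tenfold compression would bring it down to about 1 petabyte.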
More information can be found in this post from GenomeWeb BioInform and on the official pages of TCGA, TARGET and CGCI.

