
Friday 27 December 2013

Happy new Human Genome! GRCh38 is here!

This Christmas brings a long-desired gift for anyone involved in genomics: a brand new release of the human genome assembly. The new GRCh38 is a kind gift from the Genome Reference Consortium (GRC), which includes the Wellcome Trust Sanger Institute (WTSI), the Washington University Genome Sciences Center (WUGSC), the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI). This new version is the first major release in four years and provides two major improvements: fewer gaps and centromere model sequences.

 




The first draft of the human genome contained around 150,000 gaps; years of hard work reduced them to 357 in GRCh37. The new GRCh38 addresses several of these gaps, including one on chromosome 10 associated with the mannose receptor C type 1 (MRC1) locus and one on chromosome 17 associated with the chemokine (C-C motif) ligand 3-like 1 and ligand 4-like 1 (CCL3L1/CCL4L1) genes. In addition, sequence from whole-genome shotgun sequencing has been added at nearly 100 of GRCh37's assembly gaps.
Telomeres continue to be represented by default 10-kilobase gaps, while some improvements have been made to the acrocentric chromosomes: GRCh38 includes new sequences on the short arms of chromosomes 21 and 22.
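
For readers who want to see what a "gap" looks like in practice: in FASTA sequence files, unresolved regions appear as runs of the letter N. The snippet below is only a minimal sketch, assuming the chromosome sequence has already been loaded into a Python string; the function name `find_gaps` and the toy sequence are hypothetical and are not taken from any GRC or NCBI tool.

```python
import re


def find_gaps(sequence, min_length=1):
    """Return (start, end) pairs for each run of N bases.

    Coordinates are 0-based and half-open, so end - start is the gap length.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_length
    ]


# Toy sequence for illustration only, not real genome data:
# one 10-base gap and one 3-base gap.
toy = "ACGT" + "N" * 10 + "GATTACA" + "NNN" + "TT"
gaps = find_gaps(toy)
long_gaps = find_gaps(toy, min_length=5)
```

On a real chromosome, calling it with `min_length=10000` would keep only runs at least as long as the default 10-kilobase placeholders mentioned above.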

The other major feature added to the new release is the model sequence representation for centromeres and some heterochromatin. Using a method developed by a research team at the University of California, Santa Cruz (UCSC) and reads generated during the Venter genome assembly, scientists created models for the centromeres. “These models don’t exactly represent the centromere sequences in the Venter assembly, but they are a good approximation of the ‘average’ centromere in this genome,” says Church, a genomicist formerly at the US National Center for Biotechnology Information. Even though these sequence models are not exact representations of any real centromere, they will likely improve genome analysis and allow the study of variation in centromere sequences.


The GRC recently submitted the data for GRCh38 to GenBank, and the assembly is available under accession GCA_000001405.15. These data are also available by FTP at ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38.
However, keep in mind that this sequence is provided without any annotation, and that it will take at least a couple of weeks for the NCBI annotation pipeline to process the data and produce a new set of RefSeqs. As the NCBI reports: "The chromosome sequences will continue to have accessions NC_000001-NC_000024, but their versions will update as GRCh38 includes a sequence change for all chromosomes. This process generally takes about 2 weeks, and when that is done we will incorporate these sequences into various analysis and display tools, such as genomic BLAST and genome viewers. Thus, at the end of this process each chromosome will be represented by both an unannotated sequence in GenBank (the original GRC data) and an annotated sequence in the RefSeq collection."
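
Since the accessions stay the same and only the versions change, it may help to see how an identifier such as GCA_000001405.15 splits into a base accession and a version number. The following is a small sketch rather than an official NCBI utility: both function names are hypothetical, and the mapping assumes the usual convention that NC_000001 to NC_000022 are the autosomes, NC_000023 is X and NC_000024 is Y. No specific GRCh38 version numbers are assumed.

```python
def refseq_chromosome_accessions():
    """Map chromosome names to unversioned RefSeq accessions NC_000001..NC_000024."""
    names = [str(i) for i in range(1, 23)] + ["X", "Y"]
    return {name: f"NC_{i:06d}" for i, name in enumerate(names, start=1)}


def split_accession(accession):
    """Split 'BASE.VERSION' into (base, version); version is None if absent."""
    base, _, version = accession.partition(".")
    return base, int(version) if version else None


accessions = refseq_chromosome_accessions()
assembly_base, assembly_version = split_accession("GCA_000001405.15")
```

Comparing the version part before and after the RefSeq update is one simple way to check whether a downloaded chromosome sequence belongs to the old or the new assembly.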

Further details on the properties and development of GRCh38 are reported by the Methagora blog from Nature Methods, the NCBI Insights blog and the GRC consortium blog.
