
Monday, 7 January 2013

Essential Human Genome Numbers

Recent large population sequencing studies performed with NGS technologies have provided updated and more accurate estimates of human genome variability (such as the average number of missense mutations in an individual exome). These numbers have become a must-know for every genomics geek, and I often find myself in trouble during genomics conversations, since I can't get them fixed in my mind... They tend to pop up unexpectedly, and I feel like the dumb one missing something essential. So I've decided to put all the interesting values together for a rapid and effective reference. I hope this will help me memorize them and provide a way to impress lab mates with fast and accurate genomic answers! And if you really want to amaze someone, don't forget to also look at BioNumbers! It has dozens of fascinating numbers from biology (who has never wondered how large the biggest eukaryotic cell is?).


A little note before starting: when values in the literature are discordant, I've reported the different estimates with their own references. In particular, I think the differences between the values reported by (1) and (2) can be explained mainly by two causes:
1. In (1), the NHLBI GO Exome Sequencing Project uses either Roche/Nimblegen or Agilent reagents for exome capture, while in (2) the 1000 Genomes Project considers the exome portion as defined by GENCODE. So the first includes sequences based on the NCBI Consensus CDS database (CCDS) (containing protein-coding genes and some miRNA and snoRNA + UTRs), while the second includes all protein-coding loci with alternatively transcribed variants, non-coding loci with transcript evidence, and pseudogenes.
2. The dataset analyzed by the 1000 Genomes Project comprises a wider representation of Asian and African populations, resulting in a higher average number of variants, since the reference genome currently adopted is essentially based on subjects of American/Caucasian origin.

SNVs in Exomes:
Average SNVs in an individual exome: 12400-15000 (average 13600) (of which 66% heterozygous) (1); 24000 (2)
Average indels per individual (2): 440 
Expected novel SNVs per exome given present data in public databases: 200-500 (1); note, however, that for any exome sequence 3.3% of observed heterozygous variants are predicted to be novel, based on a recent model of human population growth (3).
Number of SNVs with functional effect on protein-coding genes expected in one genome: 320-510 (about 95% of functional SNVs are rare, MAF < 0.5%).
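
As a rough sanity check, the two estimates of novel SNVs per exome quoted above can be compared with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes that the 3.3% figure from (3) can be applied directly to the heterozygous fraction of the average exome from (1), which is my own simplification and not something either study claims.

```python
# Back-of-the-envelope check using only figures quoted in this post.
avg_snvs_per_exome = 13_600   # average SNVs per individual exome, from (1)
het_fraction = 0.66           # fraction of those SNVs that are heterozygous, from (1)
novel_het_fraction = 0.033    # heterozygous variants predicted to be novel, from (3)

# Assumption (mine): the 3.3% applies to the heterozygous SNVs of study (1).
het_snvs = avg_snvs_per_exome * het_fraction
expected_novel = het_snvs * novel_het_fraction

print(f"Heterozygous SNVs per exome: {het_snvs:.0f}")
print(f"Predicted novel heterozygous SNVs: {expected_novel:.0f}")
```

Under these assumptions the model-based prediction comes out at roughly 300 novel SNVs, which sits inside the 200-500 range reported by (1).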

Observed variability per individual:
Synonymous SNVs: 7600 (1); 13-16 k (2).
Missense SNVs: 5700 (1); 11-14 k (2).
Splice affecting SNVs: 12 (1); 12-28 (2).
Stop-gain SNVs: 35 (1); 34-57 (2).
Indels in protein-coding genes: 110-186 (2).
Frameshift indels: 30-50 (2).
SNVs in disease genes reported by HGMD: 41-84 (2).
SNVs in COSMIC (Catalogue Of Somatic Mutations In Cancer) genes: 33-51 (2).
Mean number of SNVs per gene: 30-40 (2).
Large deletions (>100kb) per exome: 39 (2).
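
To see how the two studies compare beyond raw counts, one can look at the missense-to-synonymous ratio implied by the numbers above. The snippet below is just an illustrative sketch: for (2) it uses the midpoints of the quoted ranges, which is an arbitrary simplification of mine.

```python
# Per-individual counts quoted above; ranges from (2) are kept as (low, high).
study_1 = {"synonymous": 7600, "missense": 5700}
study_2 = {"synonymous": (13_000, 16_000), "missense": (11_000, 14_000)}


def midpoint(bounds):
    """Midpoint of a (low, high) range; an arbitrary summary of the range."""
    low, high = bounds
    return (low + high) / 2


ratio_1 = study_1["missense"] / study_1["synonymous"]
ratio_2 = midpoint(study_2["missense"]) / midpoint(study_2["synonymous"])

print(f"Missense/synonymous ratio, study (1): {ratio_1:.2f}")
print(f"Missense/synonymous ratio, study (2): {ratio_2:.2f}")
```

Even though (2) reports roughly twice as many variants, the two ratios stay in the same ballpark, which fits the idea that the gap comes mostly from how the exome is defined and which populations are sampled.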

Note also that, because of the recent exponential growth of the human population, rare SNVs are expected to account for about 15-20% of total diversity (13).

Variants in Genome (2):
SNPs per genome: 3.6 M (autosomes); 105 k (ChrX).
Indels per genome: 344 k (autosomes); 13 k (ChrX).
Large deletions per genome: 717 (autosomes); 26 (ChrX).

De novo mutations (4):
Mutation rate per gene per cell division: 10^-6 to 10^-7 (5).
Observed de novo SNVs per individual: 74, giving a mutation rate of 1.18 x 10^-8 per position.
Observed de novo indels per individual: 3, giving a mutation rate of 4 x 10^-10 per position, with deletions being 3 times more frequent than insertions.
Observed de novo CNVs (>100kb) per individual: 1 de novo every 50 individuals. However, it's worth noting that 10% of subjects with Intellectual Disability, Autism Spectrum Disorders and Schizophrenia present large CNVs.
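
The per-position rates above follow from dividing the number of de novo events by the number of positions in a diploid genome. Here is a minimal sketch of that arithmetic, assuming a haploid genome of about 3.1 billion positions. That size is my assumption and is not stated in the sources, which work on the callable portion of the genome, so take this as a plausibility check only.

```python
HAPLOID_GENOME_BP = 3.1e9  # assumed haploid genome size (not taken from the sources)


def per_position_rate(de_novo_count, haploid_bp=HAPLOID_GENOME_BP):
    """De novo events per position per generation, counting both genome copies."""
    return de_novo_count / (2 * haploid_bp)


snv_rate = per_position_rate(74)   # 74 de novo SNVs per individual
indel_rate = per_position_rate(3)  # 3 de novo indels per individual

print(f"De novo SNV rate per position:   {snv_rate:.2e}")
print(f"De novo indel rate per position: {indel_rate:.2e}")
```

With this assumed genome size the SNV rate lands very close to the reported 1.18 x 10^-8, and the indel rate has the same order of magnitude as the reported value.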

The number of de novo SNVs and CNVs is strongly influenced by parental age and ethnicity, with an increase of about two mutations per year of paternal age. An exponential model estimates that paternal mutations double every 16.5 years (6). Maternal age, on the other hand, correlates with an increased probability of aneuploidies.
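
To get a feel for what "doubling every 16.5 years" means, here is a small sketch of the exponential model. It only computes a fold change relative to a reference age; the reference age of 20 and the function name are hypothetical choices of mine, not values taken from (6).

```python
DOUBLING_TIME_YEARS = 16.5  # doubling time of paternal mutations, from (6)


def relative_paternal_mutations(age, reference_age=20.0):
    """Fold change in paternal de novo mutations relative to reference_age.

    reference_age is an arbitrary baseline chosen for illustration.
    """
    return 2 ** ((age - reference_age) / DOUBLING_TIME_YEARS)


for age in (20, 30, 36.5, 40):
    fold = relative_paternal_mutations(age)
    print(f"Father aged {age}: {fold:.2f}x the mutations of a 20-year-old father")
```

By construction, a father who is 16.5 years older than the baseline is expected to transmit twice as many de novo mutations.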

Loss-of-Function variants:
LoF sites per individual: 100-120 (estimated in 7); 30-40 (observed in 1).
Number of genes completely inactivated due to homozygous LoF: about 20 (estimated in 7); at least 1 (observed in 1).
It has been reported that genes affected by LoF variants are relatively less evolutionarily conserved, showing a higher ratio of protein-altering to silent substitutions in coding regions between human and macaque (P = 2.8 x 10^-52) and less evolutionary conservation in their promoter regions (GERP score; P = 3.7 x 10^-16). On average, they have more closely related gene family members (paralogs) than other genes (P = 0.0058) and show greater sequence identity to paralogs (P = 0.0068). These data suggest that LoF variants mainly strike genes with redundant or non-essential functions (7).

References:

(1) Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes (May 2012)
(3) Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants (May 2012)
(4) De novo mutations in human genetic disease (Aug 2012)



