Introduction to genetic variation He Zhang Bioinformatics Core Facility 6/22/2016
Outline: Basic concepts of genetic variation; Genetic variation in human populations; Variation and genetic disorders; Databases and resources
Human genetic variation: Genetic variation is the set of genetic differences both within and among populations. No two humans, including monozygotic twins, are genetically identical. On average, in terms of DNA sequence, any two humans are more than 99% similar.
Primary sources of genetic variation: Random mutations are the ultimate source of genetic variation. They arise when DNA fails to copy accurately, or are induced by chemicals or radiation. Crossing over and random segregation during meiosis can result in the production of new alleles or new combinations of alleles.
Hereditary mutations (inherited mutations): Hereditary mutations are inherited from a parent and are usually present throughout a person's life in virtually every cell in the body. These mutations are also called germline mutations because they are present in the parent's egg or sperm cells, which are also called germ cells.
Acquired mutations: Acquired mutations may occur relatively early in development or at any later time throughout the lifespan, and generally affect fewer cells. These changes can be caused by environmental factors such as ultraviolet radiation from the sun, or can occur if a mistake is made as DNA copies itself during cell division.
Acquired mutations: Acquired mutations in somatic cells (cells other than sperm and egg cells) cannot be passed on to the next generation. (Donald Freed, et al, 2014)
Acquired mutations can be inherited in some cases: Acquired mutations may occur at an early stage of development and affect both germ cells and somatic cells. In some cases, an acquired mutation occurs in a person's egg or sperm cell but is not present in any of the person's other cells. In other cases, the mutation occurs in the fertilized egg shortly after the egg and sperm cells unite. (Donald Freed, et al, 2014)
Mosaic mutations: Acquired mutations that happen in a single cell during embryonic development can lead to a situation called mosaicism. These genetic changes are not present in a parent's egg or sperm cells, or in the fertilized egg, but happen a bit later, when the embryo includes several cells. As all the cells divide during growth and development, cells that arise from the cell with the altered gene will have the mutation, while other cells will not.
De novo mutations: De novo mutations are operationally defined as genotypes observed in a child but not in either parent. They may originate in a parental germ cell or postzygotically. (Donald Freed, et al, 2014)
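The operational definition above can be sketched as a simple trio check. This is only an illustration: the function name and the example genotypes are hypothetical, the definition is read here as "the child carries an allele seen in neither parent", and sequencing or genotyping errors are ignored.

```python
def is_de_novo(child, mother, father):
    """Return True if the child carries an allele absent from both parents.

    Each genotype is a pair of alleles at one site, e.g. ("A", "G").
    Simplifying assumption: all three genotype calls are error-free.
    """
    parental_alleles = set(mother) | set(father)
    return any(allele not in parental_alleles for allele in child)


# Hypothetical genotypes at a single site.
print(is_de_novo(("A", "G"), ("A", "A"), ("A", "A")))  # True: G is in neither parent
print(is_de_novo(("A", "G"), ("A", "G"), ("A", "A")))  # False: G is inherited from the mother
```

In practice, a candidate found this way could have arisen in a parental germ cell or postzygotically; the check alone cannot tell the two apart.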
Mutation rate in human genome: The overall error rate of DNA polymerase is 10^-8 per base pair. Repair enzymes fix 99% of these lesions, for an overall error rate of 10^-10 per bp. The mutation rate in some somatic cells can reach 1.06 x 10^-6 per bp (David Araten, et al, 2005). There are ~40 germline de novo mutations per generation (Donald Conrad, et al, 2011). ~1,500 non-germline de novo mutations were observed per person (Donald Conrad, et al, 2011).
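The arithmetic behind the net error rate can be checked in a few lines. The polymerase error rate and repair efficiency are the figures from the slide; the haploid genome size of 3.2 x 10^9 bp is an assumption added here only to turn the per-bp rate into a per-genome number.

```python
polymerase_error_rate = 1e-8  # errors per bp (from the slide)
repair_efficiency = 0.99      # fraction of lesions fixed by repair enzymes (from the slide)

# Errors that escape repair: 1% of the polymerase errors.
net_error_rate = polymerase_error_rate * (1 - repair_efficiency)

genome_size_bp = 3.2e9  # ASSUMPTION: approximate haploid genome size, not from the slide
expected_errors = net_error_rate * genome_size_bp

print(f"{net_error_rate:.1e} errors per bp")                     # 1.0e-10 errors per bp
print(f"{expected_errors:.2f} errors per haploid genome copy")   # 0.32 errors per haploid genome copy
```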
Types of genetic variation: Single Nucleotide Polymorphism (SNP); Insertion/Deletion (Indel); Copy Number Variation (CNV); Rearrangement (inversion, translocation, segmental duplication); Numerical variation (polyploidy, aneuploidy)
Single Nucleotide Polymorphism (SNP): A SNP is a variation at a specific position in the genome in which a single nucleotide is exchanged for another. Transitions: replacement of a purine base with another purine, or of a pyrimidine with another pyrimidine (A <-> G, C <-> T). Transversions: replacement of a purine with a pyrimidine or vice versa (A <-> C, A <-> T, C <-> G, G <-> T). The transition/transversion ratio (ts/tv) is ~2 for the whole genome and ~3 for the whole exome. Figure legend: α = transition, β = transversion.
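The transition and transversion definitions above translate directly into code. The sketch below classifies single-base substitutions and computes a ts/tv ratio; the function names and the example SNP list are hypothetical, and real variant sets would come from a variant file rather than a hand-written list.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a valid SNP: {ref}>{alt}")
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(snps):
    """Compute the transition/transversion ratio for (ref, alt) pairs."""
    counts = Counter(classify_snp(ref, alt) for ref, alt in snps)
    if counts["transversion"] == 0:
        raise ValueError("ts/tv is undefined without transversions")
    return counts["transition"] / counts["transversion"]


# Hypothetical SNP list: three transitions and one transversion.
example_snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
print(ts_tv_ratio(example_snps))  # 3.0
```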
Effects of SNPs in coding sequence: A silent mutation changes a codon but does not affect the protein sequence, although different codons can lead to different protein expression levels. A missense mutation changes a codon so that it encodes a different amino acid. A nonsense mutation converts an amino acid codon into a termination codon, which shortens the protein because the stop codon interrupts its normal code. A read-through mutation changes a stop codon to a sense codon. A splice site mutation results in one or more introns remaining in the mature mRNA and may lead to the production of abnormal proteins.
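The first four categories above can be told apart by translating the codon before and after the change. The sketch below builds the standard genetic code and classifies a codon substitution; the function name is hypothetical, and splice site mutations are not covered because they depend on intron/exon boundaries rather than on the codon alone.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, alt_codon):
    """Classify a codon change as silent, missense, nonsense or read-through."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"      # amino acid codon -> stop codon
    if ref_aa == "*":
        return "read-through"  # stop codon -> sense codon
    return "missense"


print(classify_coding_snp("GAA", "GAG"))  # silent (Glu -> Glu)
print(classify_coding_snp("GAG", "GTG"))  # missense (Glu -> Val)
print(classify_coding_snp("TGG", "TGA"))  # nonsense (Trp -> stop)
print(classify_coding_snp("TAA", "CAA"))  # read-through (stop -> Gln)
```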
Insertion/Deletion (Indel): Insertions add one or more extra nucleotides into the DNA. Deletions remove one or more nucleotides from the DNA. They are usually caused by transposable elements or by errors during replication of repeating elements.
Effect of Indels in coding sequence: A reading-frame shift occurs when the number of nucleotides inserted into or deleted from the coding sequence of a gene is not divisible by three, so the message in the gene is no longer correctly parsed. Indels can also insert or delete one or more amino acids, or alter splicing of the mRNA.
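The divisible-by-three rule above is easy to express in code. This sketch compares the lengths of the reference and alternate alleles; the function name and example alleles are hypothetical, and effects on splicing are not considered.

```python
def classify_coding_indel(ref, alt):
    """Classify a coding indel as 'frameshift' or 'in-frame' from allele lengths."""
    length_change = abs(len(alt) - len(ref))
    if length_change == 0:
        raise ValueError("alleles have equal length, so this is not an indel")
    # A length change that is not a multiple of three shifts the reading frame.
    return "frameshift" if length_change % 3 else "in-frame"


print(classify_coding_indel("A", "AT"))    # frameshift (1 bp inserted)
print(classify_coding_indel("A", "ATGC"))  # in-frame (3 bp inserted, one extra codon)
print(classify_coding_indel("ACGT", "AC")) # frameshift (2 bp deleted)
```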
Inversion: An inversion is a chromosome rearrangement in which a segment of a chromosome is reversed end to end. Inversions do not change the overall amount of genetic material.
Effect of inversions An Introduction to Genetic Analysis. 7th edition
Translocation: A translocation is a chromosome abnormality caused by rearrangement of parts between nonhomologous chromosomes. (Braude P, et al, 2002)
Effect of translocations: A balanced translocation is an even exchange of material with no genetic information extra or missing, and ideally full functionality. In an unbalanced translocation, the exchange of chromosome material is unequal, resulting in extra or missing genes.
Unbalanced translocation
Copy Number Variation (CNV): A copy-number variation (CNV) is a difference in the genome caused by the deletion or duplication of large regions of DNA on a chromosome. Duplications produce multiple copies of the affected chromosomal regions, increasing the dosage of the genes located within them. Deletions of large chromosomal regions lead to loss of the genes within those regions. Recent research indicates that approximately two thirds of the entire human genome is composed of repeats and 4.8-9.5% of the human genome can be classified as copy number variations (Mehdi Zarrei, et al, 2015).
Polyploidy and Aneuploidy: Polyploidy refers to a numerical change in a whole set of chromosomes. In humans it occurs as triploidy, with 69 chromosomes, and tetraploidy, with 92 chromosomes. Aneuploidy refers to a numerical change in part of the chromosome set. Karyotypes with 45 or 47 chromosomes are the common aneuploidies found in humans.
Genetic variations in human populations
Human reference genome: The human genome is the complete set of nucleic acid sequences for humans (Homo sapiens). Haploid human genome: 22 autosomes, X chromosome, Y chromosome.
Human reference genome: The human reference genome does not correspond to any actual human individual. Genome Reference Consortium human genome build 37 (GRCh37) is a mosaic haploid genome derived from 13 anonymous volunteers; one male accounts for 66% of the total. The latest human reference genome (GRCh38) integrated whole-genome sequencing data from other projects to improve completeness, but still has gaps covering ~5% of the genome.
How many variants in human genomes: A typical genome differs from the reference human genome at 4.1 million to 5.0 million sites. >99.9% of variants consist of SNPs and short indels. A typical genome also contains 2,100 to 2,500 structural variants (affecting ~20 million bases of sequence).

                  AFR     AMR     EAS     EUR     SAS
Samples           661     347     504     503     489
Mean coverage     8.2     7.6     7.7     7.4     8
SNPs              4.31M   3.64M   3.55M   3.53M   3.60M
Indels            625k    557k    546k    546k    556k
Large deletions   1.1k    949     940     939     947
CNVs              170     153     158     157     165
Inversions        12      9       10      9       11

(The 1000 Genomes Project Consortium, 2015)
Loss of function variants in human genome: Human genomes typically contain ~100 genuine loss of function (LoF) variants, with ~20 genes completely inactivated.
Genetic diversity in different populations (figure panels: Europe, East Asian, South Asian, America, Africa)
Modern humans originated from Africa (L. Luca Cavalli-Sforza & Marcus W. Feldman, 2003)
Bottleneck effects during migrations reduced the diversity of human genetic variation (Michael C. Campbell and Sarah A. Tishkoff, 2009). Estimated effective population size (Albert Tenesa et al, 2009): 3,100 for non-African populations and 7,500 for the African population.
Genetic variation exists between populations: Founder effects and past small population sizes (which increase the likelihood of genetic drift) may have had an important influence on neutral differences between populations. Natural selection may favor an allele in a specific environment if it gives the individuals carrying it a competitive advantage. Genetic drift causes some neutral mutations to become fixed in a population, or to disappear from it, at random.
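The random fixation or loss of neutral alleles can be illustrated with a small simulation. The slide does not name a model; the Wright-Fisher model is swapped in here as a common textbook choice, and the function name, parameters and seed are all hypothetical. Each generation, every allele copy is drawn at random according to the previous generation's allele frequency.

```python
import random


def wright_fisher(pop_size, start_freq, generations, seed=0):
    """Simulate drift of a neutral allele in a diploid population.

    Returns the allele-frequency trajectory, stopping early once the
    allele is fixed (frequency 1.0) or lost (frequency 0.0).
    """
    rng = random.Random(seed)   # seeded so the run is reproducible
    n_copies = 2 * pop_size     # diploid: two allele copies per individual
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Binomial sampling: each copy carries the allele with probability freq.
        count = sum(rng.random() < freq for _ in range(n_copies))
        freq = count / n_copies
        trajectory.append(freq)
        if freq in (0.0, 1.0):
            break
    return trajectory


# Smaller populations tend to reach fixation or loss in fewer generations.
for size in (10, 1000):
    run = wright_fisher(pop_size=size, start_freq=0.5, generations=200)
    print(size, len(run) - 1, run[-1])
```

Comparing the two population sizes gives a feel for why small founder populations and bottlenecks leave a stronger drift signal.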
Genes mirror geography within Europe (Nature. 2008 Nov 6; 456(7218): 98-101)
Variations and genetic disorders
Genetic variants and health: Most of the variants in the human genome don't affect health. A typical human genome contains ~100 loss of function (LoF) variants, with ~20 genes completely inactivated (Daniel MacArthur, et al, 2012). LoF variants found in healthy individuals fall into several overlapping categories: (1) severe recessive disease alleles in the heterozygous state; (2) alleles that are less deleterious but nonetheless have an impact on phenotype and disease risk; (3) benign LoF variation in redundant genes; (4) genuine variants that do not seriously disrupt gene function.
Genetic disorder: A genetic disorder is a disease caused in whole or in part by a change in the DNA sequence away from the normal sequence. Genetic disorders can be caused by a mutation in one gene (monogenic disorder), by mutations in multiple genes (multifactorial inheritance disorder), by a combination of gene mutations and environmental factors, or by damage to chromosomes (changes in the number or structure of entire chromosomes, the structures that carry genes).
Monogenic disorders: Monogenic disorders (single-gene disorders, Mendelian disorders) are caused by mutations in a single gene. These are usually rare diseases. The mutation may be present on one or both chromosomes. Over 4000 human diseases are caused by single-gene defects. Examples: sickle cell disease, cystic fibrosis.
Multifactorial inheritance disorders: Multifactorial inheritance disorders are caused by a combination of variations in different genes, often acting together with environmental factors. The effect of each variant or gene is usually small. Many common diseases, including cardiovascular disease, diabetes, and most cancers, are examples of such disorders.
Chromosome disorders: Chromosome disorders are caused by an excess or deficiency of the genes that are located on chromosomes, or by structural changes within chromosomes. Down syndrome is caused by an extra copy of chromosome 21 (called trisomy 21). Prader-Willi syndrome is caused by the absence or non-expression of a group of genes on chromosome 15.
Genetic Mapping in Human Diseases: Genetic mapping is the localization of genes underlying phenotypes on the basis of correlation with DNA variation. Methods: linkage analysis and association study.
Linkage analysis: Genetic linkage analysis is a statistical method used to associate the functionality of genes with their location on chromosomes. It is based on the observation that genes that reside physically close together on a chromosome remain linked during meiosis. If a disease is often passed to offspring along with specific marker genes, it can be concluded that the gene(s) responsible for the disease are located close to these markers. A pedigree is required for linkage analysis.
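One common way to quantify the evidence that a marker and a disease gene are linked is the LOD score, which is not named on the slide and is introduced here as an assumption. The sketch below uses the simplest textbook form, for phase-known and fully informative meioses: it compares the likelihood of the observed recombinants at recombination fraction theta with the likelihood under free recombination (theta = 0.5). The function name and the counts are hypothetical.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD score for a given recombination fraction theta (0 <= theta <= 0.5).

    Assumes phase-known, fully informative meioses counted from a pedigree.
    """
    if not 0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    non_recombinants = meioses - recombinants
    likelihood_linked = theta**recombinants * (1 - theta) ** non_recombinants
    likelihood_unlinked = 0.5**meioses
    if likelihood_linked == 0:
        return float("-inf")  # theta = 0 is impossible once a recombinant is seen
    return math.log10(likelihood_linked / likelihood_unlinked)


# Hypothetical pedigree data: 1 recombinant among 10 informative meioses.
for theta in (0.05, 0.1, 0.2, 0.5):
    print(theta, round(lod_score(1, 10, theta), 2))
```

The score is largest near theta = recombinants / meioses and falls to 0 at theta = 0.5, where the marker carries no information about the disease gene.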
Linkage analysis
Association study: Genetic association studies test for a correlation between disease status and genetic variation. Case-control studies investigate whether the allele frequency differs significantly between the cases and their matched control group.
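A case-control comparison of allele frequencies can be sketched as a chi-square test on a 2x2 table of allele counts. The slide does not specify a test, so the Pearson chi-square test (1 degree of freedom, no continuity correction) is an assumption, and the function name and counts are hypothetical. All four expected counts are assumed to be non-zero.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 df) on allele counts in cases vs controls.

    Returns (chi2, p_value). Assumes every row and column total is non-zero.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 100 cases and 100 controls, i.e. 200 alleles per group.
chi2, p = allelic_chi_square(case_alt=60, case_ref=140, control_alt=40, control_ref=160)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # chi2 = 5.33, p = 0.021
```

In a population-based design, a small p-value like this could also reflect population stratification rather than a true disease association.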
Population-based design: Cases and controls are unrelated. Easier to collect. Susceptible to population stratification bias.
Family-based design: Cases and controls are related (parents, sibs, etc.). Commonly used design: case-parent trios. Not susceptible to population stratification bias. Not easy to collect. Not appropriate for late-onset diseases. Figure legend: female, male, disease-affected, healthy.
Databases and resources
NCBI dbSNP and dbVar: The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic polymorphisms. dbVar is NCBI's database of genomic structural variation (SV).
1000 Genomes Project: The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied. The project analyzed 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Exome Aggregation Consortium: The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a variety of large-scale sequencing projects. The data set spans 60,706 unrelated individuals.
OMIM: Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily (http://omim.org).

Class of phenotype                               Phenotype   Gene*
Single gene disorders and traits                 4,728       3,182
Susceptibility to complex disease or infection   700         499
"Nondiseases"                                    141         111
Somatic cell genetic disease                     202         115
NCBI ClinVar: ClinVar is a public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.