Introduction to the Genetics of Complex Disease Jeremiah M. Scharf, MD, PhD Departments of Neurology, Psychiatry and Center for Human Genetic Research Massachusetts General Hospital
Breakthroughs in Genome Science
- 2001: Human Genome Project (sequence)
- 2005: HapMap Project (common variation)
- 2010: 1000 Genomes Project (rare variation)
- 2012: ENCODE Project (function)
Patterns of Inheritance: Single Gene Disorders
- Dominant (example: Huntington disease): a single gene causes disease; disease requires one copy of the mutation
- Recessive (example: sickle cell anemia): a single gene causes disease; disease requires two copies of the mutation
Complex Disorders
- Inheritance pattern: multifactorial or complex
- Not due to a single gene
- Several or many genes may contribute
- Each may have a small effect by itself
- Effects may depend on interaction with the environment and with other genes (epistasis)
Complex Disease Genetics
- Most common medical illnesses are genetically complex
- They aggregate in families but don't show Mendelian segregation
- Multiple genes contribute to disease in each individual
- Incomplete penetrance and variable expression (penetrance = probability of disease given the risk genotype)
- Gene-gene and gene-environment interaction
Chain of Genetic Research (question: study methods)
- Is the disorder familial? Family study
- How much do genes contribute? Twin and adoption studies
- What genes are involved? Linkage, association, sequencing
- How do genes cause disease? Functional and biological studies
Adapted from Faraone and Tsuang, 1995.
Does it Run in Families?
- Compare prevalence (risk) in relatives of affected probands to prevalence in relatives of unaffected controls
- Recurrence risk ratio: λ₁ = (risk to a first-degree relative of an affected individual) / (prevalence in the general population)
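As a minimal sketch of the recurrence risk ratio on this slide, the function below divides the risk in first-degree relatives of affected probands by the population prevalence. The function name and the example numbers are hypothetical and chosen only for illustration; they are not estimates for any real disorder.

```python
def recurrence_risk_ratio(risk_in_relatives: float, population_prevalence: float) -> float:
    """Risk to first-degree relatives of affected probands divided by population prevalence."""
    if not 0.0 <= risk_in_relatives <= 1.0:
        raise ValueError("risk_in_relatives must be a probability")
    if not 0.0 < population_prevalence <= 1.0:
        raise ValueError("population_prevalence must be in (0, 1]")
    return risk_in_relatives / population_prevalence


# Hypothetical values for illustration only: 10% risk in relatives, 1% prevalence.
ratio = recurrence_risk_ratio(0.10, 0.01)
print(f"recurrence risk ratio = {ratio:.1f}")
```

A ratio near 1 indicates no familial aggregation; larger values indicate stronger aggregation, which may reflect shared genes, shared environment, or both.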
Familial relative risk (RR) for various neuropsychiatric disorders
[Figure: disorders arranged from Mendelian (monogenic, deterministic; e.g. SNCA, Parkin, APP, PS1, PS2) to complex inheritance (genetic + non-genetic "environmental" factors, probabilistic)]
Source: Textbook of Neuropsychiatry and Behavioral Neurosciences, 5th Edition. Eds. Yudofsky SC, Hales RE. 2008, American Psychiatric Publishing, Inc. www.appi.org
Twin Studies: Is it Genetic?
- Compare concordance in monozygotic (MZ) vs. dizygotic (DZ) twins
- MZ > DZ implies a genetic contribution
- MZ < 100% implies an environmental contribution
- Heritability (h²): proportion of phenotypic variance (in a population) attributable to genetic factors
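One classical way to turn the MZ/DZ comparison into a number is Falconer's approximation, h² ≈ 2(r_MZ - r_DZ), where r is the twin-pair correlation for the trait. This formula is not on the slide; it is brought in here as an illustration that rests on strong assumptions (purely additive genetic effects, equally shared environment in MZ and DZ pairs). The function name and the correlations below are made up.

```python
def falconer_estimates(r_mz: float, r_dz: float) -> dict:
    """Rough variance components from twin correlations (Falconer's approximation)."""
    h2 = 2.0 * (r_mz - r_dz)  # additive genetic share (heritability)
    c2 = 2.0 * r_dz - r_mz    # shared (family) environment share
    e2 = 1.0 - r_mz           # non-shared environment share, including measurement error
    return {"h2": h2, "c2": c2, "e2": e2}


# Hypothetical twin correlations, for illustration only.
estimates = falconer_estimates(r_mz=0.8, r_dz=0.5)
print({name: round(value, 2) for name, value in estimates.items()})
```

The three components sum to 1 by construction, which is a property of the approximation rather than evidence that the model fits.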
Heritability Caveats
A heritability of 60% means:
- that at least one gene operates on the trait
- that 60% of the individual differences in that population can be attributed to differences in the additive effects of certain genes
A heritability of 60% does not mean:
- that the trait of any one individual is 60% determined by his or her genes and 40% determined by his or her environment
- that environmental interventions cannot have striking effects
Heritability also:
- ignores heterogeneity in mode of inheritance
- depends on the degree of genetic and environmental variability in the population
Courtesy: Shaun Purcell
Estimated Heritability
Disorder/trait: approx. h²
- Autism: 80%
- Schizophrenia: 80%
- Bipolar disorder: 60-80%
- Attention deficit disorder: ~75%
- Tourette syndrome: 60-80%
- Inflammatory bowel disease: 65-75%
- Multiple sclerosis: 55%
- Alcohol/drug addiction: 55%
- Major depression: 40%
- Anxiety disorders: 30-45%
- Breast cancer: 25%
Where are the genes? Molecular Genetic Methods
- Linkage analysis: examines the coinheritance of the phenotype with markers of known chromosomal location. Primary application: genome scans ("Where")
- Association analysis: examines correlation between specific genetic variants and presence of the phenotype. Primary application: candidate gene and genomewide studies ("Which")
Linkage vs. Association
- Question. Linkage: Where are the genes? Association: Which alleles confer risk?
- Best suited for. Linkage: Mendelian disease. Association: complex disease
- Genomic scope. Linkage: whole genome. Association: [candidate gene] or whole genome
- Subjects. Linkage: families. Association: case/control or nuclear families
- Markers. Linkage: microsatellites or SNPs. Association: SNPs
- Typical marker spacing. Linkage: < 10 Mb. Association: < 10 kb
Genetic Architecture
Landscape of mutations that collectively contribute to disease
- Major gene, large effect. Example: Huntington's disease
- Many genes (polygenic), small effects. Example: height
McCarthy et al., 2008; Sullivan et al., 2012
Genetic methods target different types of mutations
- Linkage: early-onset AD (APP, PS1/2), cystic fibrosis
- Copy number variants: VCFS/DiGeorge, Williams syndrome, idiopathic neurodevelopmental disorders
- Next-gen sequencing
- Association (family-based, case-control): late-onset AD (APOE); common disorders such as inflammatory bowel disease, multiple sclerosis, type 2 diabetes, schizophrenia
McCarthy et al., 2008; Sullivan et al., 2012
SINGLE NUCLEOTIDE POLYMORPHISMS (SNPs): Most common form of human genetic variation ACGGCGCGCATCGCTGATCGATGGCTCGTG ACAGCAGCTACGACATGACGCAGCGCCAAC GGGCTAGCTAGCTTTAGTTTCCCCGAAAGCG CGAGCGACGCTCGATCGCTCGATCGACGGC T GCGCATCGCTGATCGATGGCTCGTGACAGC AGCTACGACATGACGCAGCGCCAACGGGCT AGCTAGCTTTAGTTTCCCCGAAAGCGCGAGC GACGCTCGATCGCTCGATCGACGGCGCGCA TCGCTGATCGATGGCTCGTGACAGCAGCTA CGACATGACGCAGCGCCGACGGCGCGCATC GCTGATCGATGGCTCGTGACAGCAGCTACG
Association Analysis: Co-inheritance of Alleles and Disease Across Families
- Cases vs. controls: are alleles more common in cases than controls?
- Trios: are alleles transmitted to affected offspring more than 50% of the time?
[Figure: A/G genotypes of cases, controls, and parent-offspring trios]
Association Studies are Like Other Epidemiologic Studies
General question: is exposure associated with disease? Example: is smoking associated with MI?
- Smokers (+): 120 cases (MI+), 50 controls (MI-)
- Non-smokers (-): 54 cases (MI+), 100 controls (MI-)
χ² = 41.0, p < .0001
OR = (120*100)/(50*54) = 4.44
Alleles as Exposures
Are alleles more common in cases than controls? I.e., is the G allele associated with MI?
- G allele: 120 in cases (MI+), 50 in controls (MI-)
- A allele: 54 in cases (MI+), 100 in controls (MI-)
χ² = 41.0, p < .0001
OR = (120*100)/(50*54) = 4.44
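The arithmetic behind the 2x2 tables on these two slides can be checked with a few lines of Python. This is a minimal sketch using the slide's own counts; the function names are illustrative, the chi-square is the standard Pearson statistic without continuity correction, and the p-value uses the 1-degree-of-freedom identity p = erfc(sqrt(χ²/2)).

```python
import math


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Cross-product ratio for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Counts from the slide: exposed (G or smoking +) row first, cases (MI+) column first.
a, b, c, d = 120, 50, 54, 100
chi2 = chi_square_2x2(a, b, c, d)
print(f"OR = {odds_ratio(a, b, c, d):.2f}")
print(f"chi-square = {chi2:.1f}, p = {p_value_1df(chi2):.2e}")
```

Treating alleles as the unit of observation, as the slide does, assumes the two alleles within a person are independent; that is a simplification.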
Family-based Association Analysis: Transmission/disequilibrium test (TDT)
[Figure: trios with alleles 1 and 2, showing which parental alleles are transmitted and which are not]
2x2 table of transmitted vs. not transmitted parental alleles, cells a, b, c, d: a = 95, c = 50, b = 120, d = 195
χ²(TDT) = (b - c)² / (b + c) = (120 - 50)² / (120 + 50) ≈ 28.8, p < .0001
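The TDT statistic on this slide, (b - c)² / (b + c), uses only the two discordant cells, that is, transmissions from heterozygous parents. A minimal sketch with the slide's counts (b = 120, c = 50) follows; the function names are illustrative, and the p-value again uses the 1-degree-of-freedom identity p = erfc(sqrt(χ²/2)).

```python
import math


def tdt_statistic(b: int, c: int) -> float:
    """McNemar-type TDT chi-square from the two discordant transmission counts."""
    if b + c == 0:
        raise ValueError("no informative (heterozygous-parent) transmissions")
    return (b - c) ** 2 / (b + c)


def p_value_1df(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Discordant counts from the slide.
chi2 = tdt_statistic(b=120, c=50)
print(f"TDT chi-square = {chi2:.2f}, p = {p_value_1df(chi2):.2e}")
```

Because the comparison is made within families, the test is not affected by ancestry differences between cases and unrelated controls.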
Association Study Pitfalls
Problem: false positives
- Multiple testing (genes x SNPs x phenotypes)
- Low prior probability for any SNP (even for the best candidate gene!)
Solutions: correct for multiple testing; independent replication!
Problem: false negatives
- Modest effect sizes of susceptibility alleles
- Vast majority of studies are underpowered
- Typical odds ratios for GWAS loci = 1.1-1.3
- Detection requires samples of tens of thousands
Solution: increase sample size
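The false-positive problem on this slide can be made concrete with Bayes' rule: the chance that a "significant" SNP is a true association depends on the prior probability, the power, and the significance threshold. The sketch below is an illustration only; the function name, the prior of 1 in 10,000, and the power of 50% are hypothetical values, not figures from the slides.

```python
def positive_predictive_value(prior: float, power: float, alpha: float) -> float:
    """P(true association | significant result) via Bayes' rule."""
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs: 1 in 10,000 tested SNPs is truly associated, 50% power.
prior, power = 1e-4, 0.5
for alpha in (0.05, 5e-8):
    ppv = positive_predictive_value(prior, power, alpha)
    print(f"alpha = {alpha:g}: P(true | significant) = {ppv:.4f}")
```

Under these assumed inputs, nearly all hits at a nominal 0.05 threshold are false, while nearly all hits at a stringent threshold are true, which is the case for multiple-testing correction and replication.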
Association and Linkage Disequilibrium Hirschhorn and Daly, 2005
LD and Haplotypes
- Linkage disequilibrium (LD): correlation in the population between alleles at two loci, i.e. nonrandom association of alleles at linked loci
- Haplotype: a series of alleles at linked loci along a single chromosome
- Haplotype (LD) blocks: genomic regions of LD. The human genome shows a block-like structure with limited haplotype diversity (Gabriel et al., Science, 2002)
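LD between two biallelic loci is usually summarized by D = f(AB) - p(A)p(B) and by r² = D² / (p(A)(1 - p(A))p(B)(1 - p(B))). These are the standard definitions rather than formulas shown on the slide, and the sketch below assumes that phased haplotype frequencies are available. The function name and the example frequencies are hypothetical.

```python
def ld_statistics(f_AB: float, f_Ab: float, f_aB: float, f_ab: float) -> tuple:
    """Return (D, r_squared) from the four haplotype frequencies of two biallelic loci."""
    if abs(f_AB + f_Ab + f_aB + f_ab - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = f_AB + f_Ab  # frequency of allele A at locus 1
    p_B = f_AB + f_aB  # frequency of allele B at locus 2
    d = f_AB - p_A * p_B
    denom = p_A * (1.0 - p_A) * p_B * (1.0 - p_B)
    if denom == 0.0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom


# Hypothetical haplotype frequencies, for illustration only.
for label, freqs in [
    ("complete LD", (0.5, 0.0, 0.0, 0.5)),
    ("partial LD", (0.4, 0.1, 0.1, 0.4)),
    ("equilibrium", (0.25, 0.25, 0.25, 0.25)),
]:
    d, r2 = ld_statistics(*freqs)
    print(f"{label}: D = {d:.3f}, r^2 = {r2:.2f}")
```

An r² close to 1 means one SNP is a near-perfect proxy for the other, which is the property that tag SNP selection relies on.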
Haplotype: A-A-T
Tag SNPs
GWAS
[Figure: association methods, family-based and case-control]
McCarthy et al., 2008; Sullivan et al., 2012
The GWAS Era
- Before 2006: only a handful of genes had been found for any common medical disorders like diabetes, heart disease, inflammatory bowel disease, arthritis
- Since 2006: thousands of confirmed genetic findings for major medical diseases
What happened?
- Powerful DNA chip technology
- Computational advances
- Whole genome analysis
- Much larger studies
Genomewide Association Studies (GWAS)
- Microarray-based genotyping technique
- Assays common DNA variants (SNPs) that tag blocks of DNA across the human genome
- Mean DNA block size: ~10-20 kb (10,000-20,000 DNA bases)
- Much finer resolution than linkage studies
- Each chip assays > 1 million SNP markers in a single experiment
Sources: www.nature.com/.../v5/n5/full/nmeth0508-447.html; http://www.illumina.com; http://www.sanger.ac.uk
Genomewide Association Study (GWAS)
- DNA microarray (DNA chip) with 500K-5M SNPs covering the genome
- Allele frequencies usually > 5%
- Examine for each SNP: allele frequency differences between cases and controls, or correlation between allele count and a quantitative trait
- Threshold for significance: p < 5 x 10⁻⁸
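The genome-wide threshold is commonly motivated as a Bonferroni correction of a 0.05 error rate for roughly one million independent common-variant tests. That rationale is a widely used convention, not something stated on the slide. The sketch below shows the arithmetic and how the threshold would be applied; the function names, SNP identifiers, and p-values are made up for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def genome_wide_hits(p_values: dict, threshold: float) -> list:
    """Names of SNPs whose p-value falls below the threshold."""
    return [snp for snp, p in p_values.items() if p < threshold]


threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"threshold = {threshold:.1e}")

# Hypothetical SNP identifiers and p-values, for illustration only.
example = {"snp_a": 3e-9, "snp_b": 2e-6, "snp_c": 0.04}
print(genome_wide_hits(example, threshold))
```

Here only the first of the three made-up SNPs passes; a p-value of 2e-6 would look striking in a single-test setting but is expected by chance when a million tests are run.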
Published Genome-Wide Associations through 05/2013
[Figure: published GWA at p ≤ 5 x 10⁻⁸]
Size Matters
- N = 183,727; loci: 180; variance: 10%
- N = 249,796; loci: 32; variance: 2.5%
Number of GWAS loci vs. number of cases
- Schizophrenia: ~4 / 1,000
- Crohn's: ~10 genes / 1,000 cases
- Adult height: ~3 / 1,000
- (Bipolar disorder: ~1 gene / 1,000 cases)
Key Plots Summarizing GWAS
- Manhattan plot
- Q-Q plot
- Regional plot
So, you found an association. Is it due to:
- True association with the causal variant?
- Spurious association due to confounding (population stratification)?
- Linkage disequilibrium with a nearby causal variant?
- Chance? Indexed by the p value, but beware multiple testing!
Population Genetics Study of allele frequency distribution and change
Hardy-Weinberg Equilibrium
Assumptions: large population, no mutation, no selection, random mating, no migration
- Allele frequencies: [A] = p, [a] = q, p + q = 1
- Genotype frequencies: [AA] = p², [Aa] = 2pq, [aa] = q²
- Frequencies remain stable
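The Hardy-Weinberg proportions on this slide are often used as a genotyping quality check: estimate p from the observed genotype counts, compute the expected counts p²N, 2pqN, q²N, and compare them with a chi-square statistic. The sketch below is one simple way to do that; the function names and genotype counts are hypothetical.

```python
def hwe_expected_counts(n_AA: int, n_Aa: int, n_aa: int) -> tuple:
    """Expected genotype counts (AA, Aa, aa) under Hardy-Weinberg equilibrium."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele a
    return p * p * n, 2.0 * p * q * n, q * q * n


def hwe_chi_square(n_AA: int, n_Aa: int, n_aa: int) -> float:
    """Chi-square goodness-of-fit statistic against the HWE expectation."""
    expected = hwe_expected_counts(n_AA, n_Aa, n_aa)
    if min(expected) == 0.0:
        raise ValueError("locus must be polymorphic")
    observed = (n_AA, n_Aa, n_aa)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical genotype counts, for illustration only.
print(f"in equilibrium:    chi-square = {hwe_chi_square(360, 480, 160):.2f}")
print(f"heterozygote loss: chi-square = {hwe_chi_square(400, 400, 200):.2f}")
```

Both examples have p = 0.6; the second has fewer heterozygotes than expected, the kind of departure that can signal genotyping error or population structure.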
With genome-wide SNP data, population structure can be detectable to very fine scales... Novembre et al (2008)
Population Stratification Differences in allele frequencies between cases and controls due to systematic differences in ancestry rather than association of genes with disease.
Population Allele Differences Can Confound Association Studies
Does an A/G SNP in the CNR1 gene cause MI?
- Cases recruited from MGH patients: 55% European American, 20% African American
- Controls recruited from volunteers: 85% European American, 5% African American
- Allele frequencies by ancestry: European American .7 / .3; African American .4 / .6
- Cases (MI+) vs. controls (MI-): p < .0001
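The confounding on this slide can be illustrated by computing the allele frequency each sample would show if the SNP had no effect on MI at all. This sketch rests on several assumptions that are not stated on the slide: the first listed frequency in each group (.7 and .4) is taken to be the same allele, called G here; only the two listed ancestry groups are used, with their proportions renormalized because the slide's percentages do not sum to 100%; and the function name is hypothetical.

```python
def mixture_allele_frequency(ancestry_proportions: dict, allele_freq_by_ancestry: dict) -> float:
    """Allele frequency in a sample that mixes ancestry groups (proportions renormalized)."""
    total = sum(ancestry_proportions.values())
    weighted = sum(
        proportion * allele_freq_by_ancestry[group]
        for group, proportion in ancestry_proportions.items()
    )
    return weighted / total


# Assumed G-allele frequencies by ancestry (see the assumptions in the lead-in).
g_freq = {"European American": 0.7, "African American": 0.4}
cases = {"European American": 0.55, "African American": 0.20}
controls = {"European American": 0.85, "African American": 0.05}

print(f"G frequency in cases:    {mixture_allele_frequency(cases, g_freq):.3f}")
print(f"G frequency in controls: {mixture_allele_frequency(controls, g_freq):.3f}")
```

Under these assumptions the two samples differ in G frequency purely because of their ancestry mix, so a large enough study would report a significant association even though the allele does nothing.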
Association ≠ Causality
Ioannidis et al. 2009, Nature Rev Genet
Copy Number Variants
McCarthy et al., 2008; Sullivan et al., 2012
Copy Number Variation
- Structural variations of > 1 kb
- Low copy repeats are a common mechanism: highly homologous sequence elements arising from segmental duplication
- E.g. as a cause of psychiatric illness:
  - VCFS/DiGeorge syndrome: microdeletion on 22q, 20-30% incidence of psychotic illness
  - Autism: de novo CNVs in > 10% of sporadic cases?
Large, rare CNVs are found across neurodevelopmental disorders From Morrow, JAACAP 2010
Next-Generation Sequencing 2001: $3 billion 2017-2018: <$1,000
Bras et al. Nature Rev Neurosci, 2012
Related Methods: The -omics
- Functional genomics: a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects to describe gene and protein functions and interactions. It focuses on dynamic aspects such as gene transcription, translation, and protein-protein interactions, as opposed to the static aspects of genomic information such as DNA sequence or structures.
- Transcriptomics (expression profiling): examines the expression level of mRNAs in a given cell population, often using high-throughput techniques based on microarray technology.
- Proteomics: examines the full complement of proteins and their structure, quantity, and function.
- Metabolomics: examines the whole set of small-molecule metabolites (such as metabolic intermediates, hormones and other signalling molecules, and secondary metabolites) found within a biological sample or organism.
- Interactomics: examines the whole set of molecular interactions in cells.
ENCODE Project
Roadmap Epigenomics Consortium. Nature 518, 317-330 (19 February 2015). doi:10.1038/nature14248
Genomics In Silico
- Bioinformatics: research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
- Computational biology: the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
NIH BISTIC definition
Systems Biology: The Big Picture Oltvai, Science, 2002
Translating Genetic Findings to Novel Therapies How Do We Get There From Here?
[Figure: translational pipeline]
- Model systems: skin cells, induced stem cells, neurons, glia, animal models
- Steps: confirmed genetic variants; biological characterization; develop functional assays; small molecule screening; preclinical and safety studies; proof-of-concept trials; larger clinical trials
Summary
- Most common diseases are complex
  - Aggregate in families with non-Mendelian patterns of inheritance
  - Multiple genes of varying effect
  - +/- gene-gene interaction (epistasis), gene-environment interaction
- Association analysis is the most common method for identifying susceptibility alleles
  - Interpret with care: beware false positives
  - Replication is essential
- Exome and whole genome sequencing are now feasible and successful in identifying rare variants related to Mendelian and complex disorders
- Ultimately, whole genome sequencing may become the preferred approach
Brief Break?