THE GENETIC STRUCTURE, FUNCTION AND RELEVANCE TO DISEASE OF THE SALIVARY AGGLUTININ GENE (DMBT1)


THE GENETIC STRUCTURE, FUNCTION AND RELEVANCE TO DISEASE OF THE SALIVARY AGGLUTININ GENE (DMBT1)

Thesis submitted for the degree of Doctor of Philosophy at the University of Leicester

by Shamik Polley MVSc
Department of Genetics
University of Leicester
2014

ABSTRACT

The Genetic Structure, Function and Relevance to Disease of the Salivary Agglutinin Gene (DMBT1)

Shamik Polley

Salivary agglutinin, encoded by the gene DMBT1, is a multifunctional high molecular mass glycoprotein (340 kDa) that acts as a pattern recognition receptor (PRR) in innate immunity and mediates epithelial differentiation. The central region of the protein contains 13 tandemly repeated scavenger receptor cysteine-rich (SRCR) domains that are copy number variable and bind to bacteria and viruses. The paralogue ratio test (PRT) was used to estimate the exact copy number of two distinct copy number variable regions (CNV1 and CNV2) of the DMBT1 gene, and the results were compared with other CNV estimation assays. Both CNV1 and CNV2 at DMBT1 were multiallelic CNVs, and diploid copy number varied between populations. The de novo mutation rate at CNV1 and CNV2 of DMBT1 was estimated using a segregation study of 520 samples from 40 multigenerational CEPH families; a high mutation rate was found at both loci (CNV1: 1.4% and CNV2: 3.3% per generation). The evolutionary basis of CNV at DMBT1 was examined using 971 samples from 52 populations from the Human Genome Diversity Panel (HGDP-CEPH). The study found that the subsistence history of human populations affected the frequency distribution of both CNVs at DMBT1. Given the increase in dental caries following the development of agriculture, and the likely causative role played by an increase in Streptococcus mutans following the transition to a starch-rich diet, the present study suggests that this has favoured CNV1 and CNV2 alleles at DMBT1 with more S. mutans-binding SRCR domains in agricultural populations. Because of the functional importance of DMBT1, the study analysed the association of DMBT1 copy number with disease in several cohorts. The study found no evidence of association between DMBT1 copy number and Crohn's disease (n=2900), urinary tract infection (UTI; n=405), vesicoureteral reflux (VUR; n=625), chronic obstructive pulmonary disease (COPD; n=241) or asthma (n=850). A significant association was found between CNV2 copy number and baseline HIV viral load (n=987) measured just before anti-retroviral therapy.

Acknowledgements

There are a number of people to whom I would like to extend my sincere gratitude and thanks. I would not have been able to complete this project without the assistance and support of these people.

First of all, I would like to express my deepest gratitude to my wonderful supervisor, Dr. Ed Hollox, to whom I am forever indebted. My thesis would not have been possible without his constant guidance, encouragement, support, patience and enthusiasm.

I wish to thank all collaborators: Dr. David Hains, University of Tennessee, USA; Dr. Jack Satsangi, University of Edinburgh, UK; Dr. Christopher Mathew, King's College London, UK; Dr. Paal Skyt Andersen, Statens Serum Institute, Copenhagen, Denmark; Prof. John Yates, University of Cambridge, UK; Dr. Valentina Cipriani, Institute of Ophthalmology, University College London, UK; Dr. Robert Bals, Saarland University, Germany; Dr. Eleni Aklillu and Prof. Lars Lindquist, Karolinska Institute, Sweden; and Prof. Martin Tobin, Dr. Louise Wain and Dr. Ioanna Ntalla, University of Leicester, UK, for providing DNA samples and helping with data analysis.

I wish to thank Prof. Jan Mollenhauer, University of Southern Denmark, Prof. Mark Jobling, Prof. Sir Alec Jeffreys and Prof. Yuri Dubrova for control DNA samples and for the use of machines and instruments. I wish to thank the members of my first-year thesis committee, Dr. Flaviano Giorgini and Dr. Celia May, for their helpful advice. Many thanks to Dr. Rob Hardwick and Dr. Lee Machado for their help in the early stages of my research.

Many thanks to my fellow PhD students, Angelica Vittori, Barbara Ottolini, Razan Abujaber and Ezgi Kucukkilic, for making the lab a pleasant and inspiring place to work and learn. I would also like to thank all past and present members of the Hollox group and the Jobling group at the University of Leicester for all their help, encouragement and friendship.

I wish to thank the Ministry of Social Justice & Empowerment, Government of India, for financial support during my PhD.

And finally, heartfelt thanks to my fiancée Somdatta, my family and friends for their love and support throughout this four-year endeavour abroad, as always. I would like to thank my parents for their continuous encouragement and passion for my education.

TABLE OF CONTENTS

INTRODUCTION
Copy number Variation
Classes of CNVs
Functional consequences of CNVs
Mechanisms of structural change
CNV detection methods
Southern blotting and Pulsed Field Gel Electrophoresis
Fibre-FISH (Fluorescence in situ hybridization)
Array comparative genomic hybridization
Representational Oligonucleotide Microarray Analysis (ROMA)
Quantitative PCR (qPCR)
Multiplex ligation dependent probe amplification (MLPA)
Multiplex amplifiable probe hybridization (MAPH)
Fosmid Paired-End Sequencing
Paralogue ratio test (PRT)
Deleted in Malignant Brain Tumours 1 (DMBT1)
Genomic Structure of the DMBT1 gene
Expression and localisation of DMBT1
The domain organization of DMBT1
The role of DMBT1 in epithelial and stem cell differentiation
Role of DMBT1 in Innate immunity
Bacteria-binding domain on DMBT1
Hydroxyapatite-binding domain on DMBT1
Interaction of DMBT1 with viruses
Interaction of DMBT1 with endogenous protein ligands
The glycosylation pattern of DMBT1
Role of DMBT1 in the complement pathway
Involvement of DMBT1 in mechanism of fertilization
DMBT1 binding region to Streptococcus mutans Ag I/II
Evidence of copy number variation at DMBT1
AIMS OF THE STUDY

2 MATERIALS AND METHODS
DNA samples used
HapMap samples
Cell lines samples
ECACC Human Random Control (HRC) samples
CEPH family samples
HGDP-CEPH panel
Leicester local volunteers
Crohn's samples
African HIV cohort
Lung disease cohort
UTI and VUR cohort
DMBT1 Sequence processing and bioinformatics
Sequence analysis of SRCR repeats of DMBT1
Evidence of copy number variation on DMBT1
Growing of lymphoblastoid cell lines
Genomic DNA extraction from lymphoblastoid cell lines
Characterization of CNV1 region of DMBT1
Long-range PCR
PCR for the long allele
Block-specific long PCR
Analysis of aCGH data for CNV1
Designing of primers for CNV1 region
PRT assays for CNV1 of DMBT1
Characterization of CNV2 region of DMBT1
Long PCR spanning CNV2 region
Analysis of aCGH data for CNV2
PRT assays for CNV2 of DMBT1
Designing of probes for physical mapping of DMBT1
Synthesis of DMBT1 probe
Fibre-FISH molecular combing methods
Analysis of Fibre-FISH
Sample Preparation for PFGE
Genomic DNA extraction from lymphoblastoid cell lines

Digestion of liquid DNA Samples
Pulsed field Gel Electrophoresis conditions
Southern blot analysis
Gel depurination, denaturation and neutralization
Transfer of DNA to membrane
Synthesis of DMBT1 probe
Probe labeling and recovery
Hybridization
Washing blot
Preparing blot for exposure
Autoradiography
Analysis of DNA fragment size after PFGE
Estimation of allele and genotype frequency for mutation estimation
Simple tandem repeats (STR) analysis
DMBT1-m1
DMBT1-m2
Detection of de novo mutation
Estimation of mutation rate
Worldwide distribution of CNV1 and CNV2 copy number of DMBT1 gene
Relationship of copy number variation in HGDP populations
Analysis of Pathogen Richness
Analysis of DMBT1 copy number variation due to human life style adaptations
Isolation of Genomic DNA from buccal cells
Designing of PCR primers for C-terminal region of Ag I/II of S. mutans
PCR amplification of C-terminal region of Ag I/II of S. mutans
Extraction of PCR product from Agarose gel
Sequencing of PCR product using internal sequencing primers
Sequence read and alignment
Sequence diversity
Phylogenetic analysis
McDonald-Kreitman test
Input sequences for McDonald-Kreitman test
Sequence analysis for McDonald-Kreitman test
Allele frequency spectrum

2.33 Secretor status assay
Primer design for secretor status assay
PCR amplification and restriction digestion
Regression Analysis
Comparison between aCGH and PRT for Crohn's samples
Statistical analysis for case-control study
Analysis of Crohn's disease samples
Analysis of African HIV cohorts
Analysis of lung disease cohorts
Analysis of Vesicoureteral Reflux (VUR) and Urinary Tract Infections (UTI) cohorts

CHARACTERIZATION OF COPY NUMBER VARIATION OF THE HUMAN DMBT1 GENE
DMBT1 Sequence processing and bioinformatics
Sequence relationship of SRCR repeats
Evidence of copy number variable region of DMBT1
PRT results for HapMap samples
Analysis of CNV1 region in HapMap samples
Analysis of CNV2 region in HapMap samples
Discussion

ANALYSIS OF COPY NUMBER VARIATION OF DMBT1 USING PHYSICAL MAPPING APPROACHES
Introduction
Copy number variation of DMBT1 using Fibre-FISH
Analysis of DNA fibres
Measurements of DNA fibre lengths
Analysis of the DMBT1 region in YRI family Y045
Analysis of DMBT1 region in 1447 family
Copy number variation of DMBT1 using PFGE
Selection of restriction enzyme for genomic DNA digestion
Selection of DNA samples
Southern blotting analysis of liquid genomic DNA
DNA size analysis using Southern blotting
Discussion

5 ANALYSIS OF SEGREGATION PATTERNS AND DE NOVO MUTATION RATES AT THE DMBT1 GENE
Aim of the study
Estimation of DMBT1 copy number in CEPH pedigree samples
Analysis of CNV1 copy number in CEPH pedigree samples
Distribution of integer CNV1 copy number
Analysis of CNV2 copy number
Distribution of integer CNV2 copy number
Allelic architecture and copy number genotype
Detection of de novo mutation
Estimation of mutation rate
Discussion

DETERMINATION OF EXTENT OF DIVERSITY AND EVOLUTIONARY BASIS OF DMBT1 COPY NUMBER IN GLOBAL POPULATIONS
Introduction
Estimation of DMBT1 copy number in HGDP samples
Analysis of CNV1 copy number in HGDP samples
Distribution of CNV1 diploid copy number in HGDP
Distribution of CNV1 diploid copy number in different geographical regions
Distribution of CNV1 diploid copy number in different populations
Analysis of CNV2 copy number
Distribution of CNV2 diploid copy number in HGDP
Distribution of CNV2 diploid copy number in different geographical regions
Distribution of CNV2 diploid copy number in different populations
Estimation and distribution of total SRCR copy number in different geographical regions
Analysis of CNV1 and CNV2 copy number association
Analysis of pathogen-driven selection on DMBT1 Copy number variation
Kendall rank correlation for pathogen richness
Partial Mantel tests for pathogen richness
DMBT1 copy number variation due to human life style adaptations
Analysis of human life style adaptations using agriculture data as dichotomous variables
Analysis of human life style adaptations using agriculture data as relative amount of activity

6.6 Discussion

ANALYSIS OF DIVERSITY OF THE SALIVARY AGGLUTININ-BINDING PROTEIN OF STREPTOCOCCUS MUTANS
Aim of the study
Analysis of S. mutans sequence
Sequence diversity of SpaP gene of S. mutans
Phylogenetic analysis of SpaP gene of S. mutans
Analysis of DMBT1 binding regions of S. mutans
McDonald-Kreitman test for SpaP gene of S. mutans
Allele frequency spectrum of SpaP of S. mutans
Estimation of DMBT1 Copy number in Leicester local volunteers
Estimation of CNV1 copy number in Leicester local volunteers
Distribution of CNV1 copy number in Leicester local volunteers
Estimation of CNV2 copy number in Leicester local volunteers
Distribution of CNV2 copy number in Leicester local volunteers
Secretor status of Leicester local volunteers
Analysis of SpaP genotype and CNV1 and CNV2 copy number of DMBT1
Discussion

ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN CROHN'S DISEASE PATIENTS
Introduction
The rationale for study
Estimation of DMBT1 copy number in Crohn's samples
Copy number estimation of English Crohn's samples
Copy number estimation of Scottish Crohn's samples
Copy number estimation of Danish Crohn's samples
Comparison of aCGH and PRT for copy number estimation
Distribution of diploid copy number in the Crohn's samples
Distribution of CNV1 copy number in Crohn's samples
Distribution of CNV2 copy number in Crohn's samples
Distribution of SRCR copy number in Crohn's samples
Association of DMBT1 copy number with Crohn's disease
Discussion

ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN AFRICAN HIV COHORTS

9.1 Introduction
Estimation of DMBT1 copy number in African HIV cohorts
Estimation of CNV1 copy number in African HIV cohorts
Estimation of CNV2 copy number in African HIV cohorts
Distribution of CNV1 and CNV2 copy number in African HIV cohorts
Association of copy number with clinical parameters in African HIV cohorts
Discussion

ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN LUNG DISEASE COHORTS
Introduction
Estimation of DMBT1 copy number in lung disease cohorts
Estimation of DMBT1 copy number in Gedling cohort
Estimation of DMBT1 Copy number in Leicester Respiratory cohort (LRC)
Distribution of CNV1 and CNV2 copy number in lung disease cohorts
Association study in the Gedling and LRC cohorts
Discussion

ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN VUR AND UTI SAMPLES
Introduction
Estimation of DMBT1 copy number in VUR and UTI samples
Estimation of CNV1 copy number in VUR and UTI samples
Estimation of CNV2 copy number in VUR and UTI samples
Distribution of CNV1 and CNV2 copy number in VUR and UTI samples
Secretor status in VUR and UTI samples
Association study in VUR and UTI samples
Discussion

DISCUSSION

BIBLIOGRAPHY

LIST OF FIGURES

Figure 1: Diallelic and multiallelic copy number variation
Figure 2: Schematic picture of Southern blotting methods
Figure 3: AMY1 copy number estimation using high-resolution Fibre-FISH
Figure 4: Schematic picture of array-based comparative genome hybridization (array-CGH)
Figure 5: Representational oligonucleotide microarray analysis (ROMA) for copy number detection
Figure 6: Multiplex PCR-based methods for the identification of copy-number variants
Figure 7: The paired-end sequencing methodology for detection of structural variation
Figure 8: Schematic picture describing different steps of PRT
Figure 9: Domain organization of DMBT1 and DMBT1 orthologs
Figure 10: Schematic of gp340 structures, indicating conserved cysteines of SRCR domains
Figure 11: Schematic picture of bacteria-binding on SRCR domains of DMBT1
Figure 12: Schematic presentation of SRCR domain, highlighting the hydroxyapatite (HA)-binding domain and bacteria-binding domain on DMBT1
Figure 13: Schematic picture of virus-binding regions on SRCR domains of DMBT1
Figure 14: Cartoon showing the primary sequence of S. mutans UA159 AgI/II
Figure 15: UCSC genome browser screenshot showing the exon-intron structure with three DMBT1 gene annotations from different transcripts
Figure 16: Schematic illustration of PRT1 assay
Figure 17: Schematic illustration of PRT2 assay
Figure 18: An example of a GeneMapper electropherogram showing the test locus and the reference locus in the PRT1 assay
Figure 19: An example of a GeneMapper electropherogram showing the test locus and the reference locus in the PRT2 assay
Figure 20: The calibration standard on reference DNA samples for PRT1 from a single PCR reaction
Figure 21: The calibration standard on reference DNA samples for PRT2 from a single PCR reaction
Figure 22: Schematic illustration of PRT3 assay
Figure 23: Schematic illustration of PRT4 assay
Figure 24: Schematic illustration of PRT5 assay
Figure 25: An example of a GeneMapper electropherogram showing the test locus and the reference locus in the PRT3 assay
Figure 26: An example of a GeneMapper electropherogram showing the test locus and the reference locus in the PRT4 assay
Figure 27: An example of a GeneMapper electropherogram showing the test locus and the reference locus in the PRT5 assay
Figure 28: The calibration standard on reference DNA samples for PRT3 from a single PCR reaction
Figure 29: The calibration standard on reference DNA samples for PRT4 from a single PCR reaction

Figure 30: Strategic picture of primer design for synthesis of Fibre-FISH probes
Figure 31: Schematic picture showing primer locations used to amplify the C-terminal region of Ag I/II of S. mutans
Figure 32: Schematic picture showing PCR-RFLP strategy of rs of FUT2 gene
Figure 33: Electropherogram of secretor PCR products showing three possible genotypes of rs of FUT2 gene using Leicester local samples
Figure 34: Dot plot analysis of DMBT1 gene (exons and introns) against itself
Figure 35: Nucleotide sequence relationship of SRCR repeats
Figure 36: Copy number variable regions in DMBT1 gene
Figure 37: Histogram of raw PRT ratio of PRT1 and PRT2 in the HapMap Phase I DNA samples
Figure 38: Assessment of CNV1 copy number assay quality in HapMap samples
Figure 39: Scree plot for PCA of aCGH data generated using Agilent 210 K CNV chip for CNV1 in HapMap samples. The X-axis shows the number of principal components
Figure 40: Scatter plot of mean unrounded copy number value of CNV1 and first PC of Agilent aCGH data for CNV1 region of HapMap samples
Figure 41: Histogram of mean normalized PRT ratio of CNV1 for HapMap samples. X-axis shows mean PRT ratio of CNV1
Figure 42: Output of the clustering procedure using the PRT transformed data of CNV1 for HapMap samples
Figure 43: Analysis of integer copy number calling for CNV1 in HapMap samples
Figure 44: Frequencies of integer copy number of CNV1 per diploid genome in different HapMap populations
Figure 45: Schematic illustration of long range PCR for genotyping CNV1 region of DMBT1
Figure 46: Genotyping of CNV1 allele of good quality DNA samples using long range PCR
Figure 47: Genotyping of CNV1 allele of freeze-thawed DNA using long range PCR
Figure 48: Top panel showed location of long allele specific PCR
Figure 49: (A) Schematic presentation of block specific long PCR
Figure 50: Genotype frequency of CNV1 region in different HapMap populations
Figure 51: Histogram of raw PRT ratio of PRT3 and PRT4 in the HapMap Phase I DNA samples
Figure 52: Histogram of raw PRT ratio of PRT5 in the HapMap Phase I DNA samples
Figure 53: Assessment of CNV2 copy number assay quality
Figure 54: Assessment of CNV2 copy number assay quality
Figure 55: Assessment of CNV2 copy number assay quality
Figure 56: Scree plot for PCA of aCGH data generated using Agilent 210 K CNV chip for CNV1 in HapMap samples
Figure 57: Scatter plot produced by mean unrounded copy number value and first PC of Agilent aCGH signal of CNV2 of HapMap samples
Figure 58: Histogram of normalized PRT ratio of CNV2 for HapMap samples

Figure 59: Output of the clustering procedure using the PRT transformed data of CNV2 for HapMap samples
Figure 60: Analysis of integer copy number calling of CNV2 in HapMap samples
Figure 61: Frequencies of integer copy number of CNV2 per diploid genome in different HapMap populations
Figure 62: Schematic presentation of CNV2 copy number estimation using long PCR
Figure 63: Schematic picture of DMBT1 Fibre-FISH
Figure 64: Analysis of DNA fibre of DMBT1 region
Figure 65: Fibre-FISH image on DNA from cell lines derived from YRI HapMap trio Y045
Figure 66: Individual measurements of DMBT1 probe fibre length of Y045 family
Figure 67: Individual measurements of DMBT1 probe fibre length of 1447 family
Figure 68: Selection of restriction enzyme for DMBT1 gene
Figure 69: Agarose gel shows smearing of gDNA after Sca I digestion
Figure 70: Southern blot analysis of genomic DNA using DMBT1 SRCR probe
Figure 71: Standard curve using known size standard of PFGE ladder
Figure 72: Histogram of mean normalized PRT ratio of CNV1 in CEPH pedigree samples
Figure 73: Output of the clustering procedure using the PRT transformed data of CNV1 in CEPH pedigree samples
Figure 74: Analysis of integer copy number calling
Figure 75: Histogram of raw PRT ratio of PRT3 and PRT4 in the CEPH pedigree samples
Figure 76: Scatter plot produced by PRT3 and PRT4 assays of CNV2 estimation in CEPH pedigree samples
Figure 77: Histogram of mean normalized PRT ratio of CNV2 in CEPH pedigree samples
Figure 78: Output of the clustering procedure using the PRT transformed data of CNV2 in CEPH pedigree samples
Figure 79: Analysis of integer copy number calling of CNV2 in CEPH family
Figure 80: CoNVEM analysis for CNVs of DMBT1 using unrelated parents data from 40 CEPH families
Figure 81: Comparison of allele frequencies for CNVs of DMBT1 using unrelated parents data from 40 CEPH pedigrees
Figure 82: Analysis of CEPH/FRENCH pedigree 12 for detection of de novo mutation
Figure 83: Analysis of CEPH/FRENCH pedigree 1424 for detection of de novo mutation
Figure 84: Analysis of CEPH/FRENCH pedigree 1362 for detection of de novo mutation
Figure 85: Histogram of raw PRT ratio of PRT1 and PRT2 in the HGDP-CEPH samples
Figure 86: Scatter plot produced by PRT1 and PRT2 assays of CNV1 estimation in HGDP samples
Figure 87: Histogram of mean normalized PRT ratio of CNV1 for HGDP samples
Figure 88: Output of the clustering procedure using the PRT transformed data of CNV1 for HGDP samples
Figure 89: Analysis of CNV1 integer copy number calling using CNVtools

Figure 90: Population distribution of diploid CNV1 copy number
Figure 91: Frequency distribution of worldwide CNV1 copy number in HGDP continental regions
Figure 92: Histogram of raw PRT ratio of PRT3 and PRT4 in the HGDP-CEPH samples
Figure 93: Scatter plot produced by PRT3 and PRT4 assays of CNV2 estimation in HGDP samples
Figure 94: Histogram of mean normalized PRT ratio of CNV2 for HGDP samples
Figure 95: Output of the clustering procedure using the PRT transformed data of CNV2 for HGDP samples
Figure 96: Analysis of CNV1 integer copy number calling using CNVtools for the HGDP samples
Figure 97: Population distribution of CNV2 copy number
Figure 98: Frequency distribution of worldwide CNV2 copy number in HGDP continental regions
Figure 99: Schematic picture of DMBT1 region to estimate total SRCR copy number
Figure 100: Frequency distribution of total diploid SRCR copy number in HGDP samples
Figure 101: The pattern of CNV1 and CNV2 copy number variation in different HGDP individuals
Figure 102: The pattern of copy number variation at CNV1 and CNV2 in different HGDP populations
Figure 103: The variation pattern of copy number for CNV1 and CNV2 in different HGDP regions
Figure 104: Molecular phylogenetic analysis for all samples by Maximum Likelihood method in MEGA6 using nucleotide sequences
Figure 105: Molecular phylogenetic analysis for European samples by Maximum Likelihood method in MEGA6 using nucleotide sequences
Figure 106: Molecular phylogenetic analysis for all samples by Maximum Likelihood method in MEGA6 using amino acid sequences
Figure 107: Molecular phylogenetic analysis for European samples by Maximum Likelihood method in MEGA6 using amino acid sequences
Figure 108: Sequence logos showing pattern of aligned nucleotide sequences of Ad1 region of SpaP gene of S. mutans
Figure 109: Sequence logos showing pattern of aligned nucleotide sequences of Ad2 region of SpaP gene of S. mutans. The height of each nucleotide is proportional to its frequency, and the most common nucleotide is on top. The nucleotide numbering corresponds to the main DNA sequence used in the study
Figure 110: Sequence logos showing pattern of aligned amino acid sequences of Ad1 region of Ag I/II of S. mutans
Figure 111: Sequence logos showing pattern of aligned amino acid sequences of Ad2 region of Ag I/II of S. mutans
Figure 112: Analysis of allele frequency spectrum in S. mutans for all samples
Figure 113: Analysis of allele frequency spectrum in S. mutans for EU samples
Figure 114: Histogram of raw PRT ratio of PRT1, PRT2 and mean CNV1 PRT ratio in the Leicester local samples

Figure 115: Scatter plot produced by raw ratio from PRT1 and PRT2 assays of CNV1 estimation in the Leicester local samples
Figure 116: Output of the clustering procedure using the PRT transformed data of CNV1 in Leicester local samples
Figure 117: Analysis of integer copy number calling of CNV1 in local samples
Figure 118: Histogram of raw PRT ratio of PRT3, PRT4 and mean CNV2 PRT ratio in the Leicester samples
Figure 119: Scatter plot produced by raw ratio from PRT3 and PRT4 assays of CNV2 estimation in the Leicester samples
Figure 120: Output of the clustering procedure using the PRT transformed data of CNV2 in local samples
Figure 121: Analysis of CNV2 integer copy number calling in Leicester local samples
Figure 122: Scatter plot produced by PRT1 and PRT2 assays, used to estimate diploid copy number of CNV1 in English Crohn's and control samples
Figure 123: Histogram of mean unrounded normalized PRT ratio of CNV1 for English Crohn's and control samples
Figure 124: Output of the clustering procedure using the PRT transformed data of CNV1 for English Crohn's and control samples
Figure 125: Analysis of integer CNV1 copy number calling for English Crohn's disease and control samples
Figure 126: Scatter plot produced by PRT ratio of PRT3 and PRT4 assays of CNV2 estimation in English Crohn's and control samples
Figure 127: Histogram of mean normalized PRT ratio of CNV2 in English CD samples
Figure 128: Output of the clustering procedure using the PRT transformed data of CNV2 for English CD samples
Figure 129: Analysis of integer CNV2 copy number calling for English Crohn's disease and control samples
Figure 130: Scatter plot produced by PRT1 and PRT2 assays, used to estimate diploid copy number of CNV1 in Scottish Crohn's samples
Figure 131: Output of the clustering procedure using the PRT transformed data of CNV1 for Scottish CD samples
Figure 132: Analysis of CNV1 integer copy number calling using CNVtools for Scottish Crohn's samples
Figure 133: Scatter plot produced by PRT ratio of PRT3 and PRT4 assays of CNV2 estimation in Scottish Crohn's samples
Figure 134: Output of the clustering procedure using the PRT transformed data of CNV2 for Scottish Crohn's samples

Figure 135: Analysis of CNV2 integer copy number calling using CNVtools for Scottish Crohn's samples
Figure 136: Histogram of mean normalized PRT ratio of CNV1 for Danish Crohn's samples
Figure 137: Output of the clustering procedure using the PRT transformed data of CNV1 for Danish CD samples
Figure 138: Analysis of CNV1 integer copy number calling using CNVtools for Danish IBD samples
Figure 139: Scatter plot using raw PRT ratio of PRT3 and PRT4 assays, used to estimate diploid copy number of CNV2 in Danish CD samples
Figure 140: Output of the clustering procedure using the PRT transformed data of CNV2 for Danish samples
Figure 141: Analysis of CNV2 integer copy number calling using CNVtools for Danish CD samples
Figure 142: Scree plot for PCA of first normalized aCGH data generated using Agilent 210 K CNV chip for CNV1 in Crohn's disease cohort
Figure 143: Scree plot for PCA of second normalized aCGH data generated using Agilent 210 K CNV chip for CNV1 in Crohn's disease cohort
Figure 144: Scree plot for PCA of first normalized aCGH data generated using Agilent 210 K CNV chip for CNV2 in Crohn's disease cohort
Figure 145: Scree plot for PCA of second normalized aCGH data generated using Agilent 210 K CNV chip for CNV2 in the Crohn's disease cohort
Figure 146: Distribution of diploid copy number for CNV1 of DMBT1 in CD cohorts
Figure 147: Distribution of diploid copy number for CNV2 of DMBT1 in CD cohorts
Figure 148: Distribution of total SRCR domain of DMBT1 in the English Crohn's cohorts
Figure 149: Distribution of total SRCR domain of DMBT1 in the Scottish Crohn's cohorts
Figure 150: Distribution of total SRCR domain of DMBT1 in the Danish Crohn's cohorts
Figure 151: Worldwide HIV prevalence among adults (adapted from WHO-HIV department)
Figure 152: Regional HIV and AIDS statistics according to WHO-HIV department
Figure 153: Output of the clustering procedure using the PRT transformed data of CNV1 for HIV samples
Figure 154: Analysis of integer CNV1 copy number calling for HIV samples
Figure 155: Output of the clustering procedure using the PRT transformed data of CNV2 for HIV samples
Figure 156: Analysis of integer CNV2 copy number calling for HIV cohort
Figure 157: Distribution of diploid copy number of CNV1 and CNV2 at DMBT1
Figure 158: Distribution of diploid copy number of SRCR at DMBT1
Figure 159: Analysis of integer CNV1 copy number calling for Gedling COPD samples
Figure 160: Scatter plot produced by raw ratio of PRT3 and PRT4 assays, used to estimate diploid copy number of CNV2 in the Gedling samples
Figure 161: Analysis of integer CNV2 copy number calling for Gedling COPD samples

Figure 162: Scatter plot produced by raw ratio of PRT1 and PRT2 assays, used to estimate diploid copy number of CNV1 in LRC samples
Figure 163: Analysis of integer copy number calling for LRC samples
Figure 164: Scatter plot produced by raw ratio of PRT3 and PRT4 assays, used to estimate diploid copy number of CNV2 in LRC samples
Figure 165: Analysis of integer CNV2 copy number calling for LRC samples
Figure 166: Scatter plots of unadjusted raw (A and B) and adjusted (for age, age², sex, height) inverse normally transformed (C and D) FEV1/FVC against average raw (A and C) and integer CNV2 copy number (B and D) of DMBT1 in LRC
Figure 167: Cumulative frequency distribution of average raw PRT ratio of DMBT1 CNV1 (left) and DMBT1 CNV2 (right) in Gedling COPD cases and controls
Figure 168: Cumulative frequency distribution of average raw PRT ratio of DMBT1 CNV1 and DMBT1 CNV2 in doctor-diagnosed asthma cases and controls from Gedling cohort
Figure 169: Cumulative frequency distribution of average raw PRT ratio of DMBT1 CNV1 and DMBT1 CNV2 in doctor-diagnosed asthma cases and controls from LRC
Figure 170: Histogram of mean unrounded normalized PRT ratio of CNV1 for VUR, UTI and control samples
Figure 171: Output of the clustering procedure using the PRT transformed data of CNV2 for VUR & UTI cohorts
Figure 172: Analysis of integer copy number calling for VUR and UTI cohort
Figure 173: Histogram of mean normalized PRT ratio of CNV2 for VUR, UTI and control samples
Figure 174: Output of the clustering procedure using the PRT transformed data of CNV2 for VUR and UTI samples
Figure 175: Analysis of integer CNV2 copy number calling for UTI and VUR samples
Figure 176: Agarose gel electrophoresis to determine secretor status using PCR-RFLP in VUR and UTI cohort

18 LIST OF TABLES Table 1: DMBT1 synonyms and orthologs in different organisms. 18 Table 2: Summary of samples analysed for Crohn's study. 36 Table 3: PCR Primers used to amplify deletion allele of CNV1. 41 Table 4: Primer sequences used to amplify for CNV1 long allele. 41 Table 5 : Primer sequences used to amplify block-specific long PCR. 41 Table 6: Primer sequences used in PRT1 assay for CNV1. 43 Table 7: Primer sequences used in PRT2 assay for CNV1. 44 Table 8: Long PCR primers used to amplify CNV2 region. 49 Table 9: Primer sequences used in PRT3 assay for CNV2. 50 Table 10: Primer sequences use in PRT4 assay for CNV2. 51 Table 11: Primer sequences use in PRT5 assay for CNV2. 53 Table 12: Primer sequences for amplification of DMBT1 probes. 58 Table 13: The conditions used to resolve DMBT1 restriction fragments after Sca I digestion 62 Table 14: Primer sequences used to amplify DMBT1-m1 for STR analysis. 64 Table 15: Primer sequences used to amplify DMBT1-m2 for STR analysis. 65 Table 16: Primer sequences used to amplify C-terminal regions of SpaP gene of S. mutans. 69 Table 17: Internal sequencing primers sequences used to sequence full length full length PCR product of C-terminal region of SpaP gene of S. mutans. 70 Table 18: Contingency table for McDonald-Kreitman test of S. mutans from all samples. 73 Table 19: Contingency table for McDonald-Kreitman test of S. mutans from European samples. 73 Table 20: Primers used to amplify rs of FUT2 gene for secretor status assay. 75 Table 21: The fragments size, genotypes and secretor status of rs of FUT2 gene based on PCR- RFLP. 76 Table 22: CNV1 copy number frequencies in HapMap samples. 90 Table 23: PCR fragments for Validation of CNV1 copy number using long-range PCR. 91 Table 24: Combination of different PCR assays for validation of CNV1 copy number. 93 Table 25: CNV2 copy number frequencies in HapMap samples. 
Table 26: Family trio with integer copy number of CNV1 and CNV2 at DMBT1 for family Y
Table 27: Estimated size of one unit CNV1 allele of DMBT1 using samples of HapMap YRI family Y045 by Fibre-FISH.
Table 28: Family trio with integer copy number of CNV1 and CNV2 at DMBT1 for CEU family
Table 29: Estimated size of one unit of CNV2 of DMBT1 using samples of HapMap CEU family 1447 by Fibre-FISH.
Table 30: The samples with integer copy number used for PFGE. Integer copy numbers of the samples were measured using PRT assays.

Table 31: Estimated size of DMBT1 region from different control DNA samples using PFGE combined with Southern blotting.
Table 32: Frequency of CNV1 copy number of DMBT1 in CEPH pedigrees.
Table 33: Frequency distribution of CNV2 copy number of CEPH family.
Table 34: Estimated genotype frequencies for CNV1 after CoNVEM analysis.
Table 35: Estimated genotype frequencies for CNV2 based on result1 after CoNVEM analysis.
Table 36: Estimated genotype frequencies for CNV2 based on result2 after CoNVEM analysis.
Table 37: CNV1 diploid copy number frequencies in HGDP samples.
Table 38: Diploid CNV1 copy number frequencies in HGDP continental regions.
Table 39: CNV1 copy number frequencies in HGDP American populations.
Table 40: CNV1 copy number frequencies in HGDP South Asia populations.
Table 41: CNV1 copy number frequencies in HGDP East Asia populations.
Table 42: CNV1 copy number frequencies in HGDP European populations.
Table 43: CNV1 copy number frequencies in HGDP Middle East and Oceania populations.
Table 44: CNV1 copy number frequencies in HGDP Sub-Saharan Africa populations.
Table 45: CNV2 copy number frequencies in HGDP samples.
Table 46: CNV2 copy number frequencies in HGDP continental regions.
Table 47: CNV2 copy number frequencies in HGDP American populations.
Table 48: CNV2 copy number frequencies in HGDP South Asia populations.
Table 49: CNV2 copy number frequencies in HGDP East Asia populations.
Table 50: CNV2 copy number frequencies in HGDP European populations.
Table 51: CNV2 copy number frequencies in HGDP Middle East and Oceania populations.
Table 52: CNV2 copy number frequencies in HGDP Sub-Saharan African populations.
Table 53: Mean unrounded copy number for CNV1 and CNV2 in different HGDP populations.
Table 54: Mean CNV1 and CNV2 copy number for different geographical regions in HGDP samples.
Table 55: The Kendall correlations with the richness of viruses, helminths, bacteria and protozoa.
Table 56: Partial Mantel correlations (using distance from Africa).
Table 57: Correlations with copy number variable of DMBT1 and a human life style variable (agriculture variable) as dichotomous variables.
Table 58: Correlations with copy number variable of DMBT1 and human life style as relative amount of human activity using partial Mantel tests.
Table 59: Regression analysis with copy number variable of DMBT1 and human life style as relative amount of human activity spent.
Table 60: Sequence diversity of C-terminal region of SpaP gene of S. mutans.
Table 61: Summary results of McDonald-Kreitman test of Antigen I/II of S. mutans using sequences from all samples.
Table 62: Summary results of McDonald-Kreitman test of Antigen I/II of S. mutans using sequences from European samples.

Table 63: Frequency of synonymous and non-synonymous polymorphisms of S. mutans from all samples.
Table 64: Frequency of synonymous and non-synonymous polymorphisms of S. mutans from European samples.
Table 65: CNV1 copy number frequencies in the Leicester samples.
Table 66: CNV2 copy number frequencies in the Leicester samples.
Table 67: Genotype frequency, allele frequency and secretor status of the Leicester samples.
Table 68: Summary table relating regression analysis of polymorphic alleles of S. mutans with CNVs and secretor status of all Leicester samples.
Table 69: Summary table relating regression analysis of polymorphic alleles of S. mutans with CNVs and secretor status of European samples.
Table 70: Comparison of CNV1, CNV2 and SRCR copy number frequency in Crohn's patients and controls of three different Crohn's cohorts.
Table 71: Comparison of DMBT1 deletion allele frequency in Crohn's patients and controls of three different Crohn's cohorts.
Table 72: Copy number frequencies of CNV1 at DMBT1 in African HIV cohorts.
Table 73: Copy number frequencies of CNV2 at DMBT1 in HIV samples.
Table 74: Copy number frequencies of total SRCR of DMBT1 in HIV cohorts.
Table 75: Tests of association of copy number with HIV load pre-HAART.
Table 76: Tests of association of copy number with CD4 count during HAART.
Table 77: Copy number frequencies of CNV1 at DMBT1 in respiratory disease cohorts.
Table 78: Copy number frequencies of CNV2 at DMBT1 in respiratory disease cohorts.
Table 79: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with lung function in Gedling cohort.
Table 80: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with lung function in LRC. Significant results are marked with an asterisk.
Table 81: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with lung COPD in the Gedling cohort.
Table 82: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with asthma (doctor diagnosed) in Gedling cohort.
Table 83: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with asthma-ics in LRC.
Table 84: Copy number frequencies of CNV1 at DMBT1 in VUR cohort.
Table 85: Copy number frequencies of CNV2 at DMBT1 in VUR and UTI cohort.
Table 86: Genotype frequencies of FUT2 gene for secretor status in VUR and controls.
Table 87: Summary of samples and tests of association of copy number at DMBT1 with VUR samples.
Table 88: Summary of samples and tests of association of copy number at DMBT1 with UTI samples.
Table 89: Summary of samples and tests of association of copy number at DMBT1 with VUR and UTI samples.

LIST OF ABBREVIATIONS
aCGH: Array-comparative genomic hybridisation
bp: Base pairs
CD: Crohn's disease
CEPH: Centre d'Étude du Polymorphisme Humain
CNV: Copy number variation
CUB: C1r/C1s Uegf Bmp1
ddNTPs: Dideoxynucleotide triphosphates
DGV: Database of Genomic Variants
DMBT1: Deleted in Malignant Brain Tumours 1
DMBT1pbs1: DMBT1 pathogen binding site 1
DNA: Deoxyribonucleic acid
dNTPs: Deoxynucleotide triphosphates
ECM: Extracellular matrix
FISH: Fluorescence in situ hybridisation
gDNA: Genomic DNA
gp120: HIV envelope surface glycoprotein 120
gp340: Cell surface glycoprotein 340
GWAS: Genome-wide association study
HA: Hydroxyapatite
HAART: Highly active antiretroviral therapy
HIV-1: Human immunodeficiency virus type 1
IgG: Immunoglobulin G
Indels: Insertions and deletions
kDa: Kilodalton
LCRs: Low copy repeats
NAHR: Non-allelic homologous recombination
NaOH: Sodium hydroxide
ng: Nanogram
PAMP: Pathogen-associated molecular pattern
PCR: Polymerase chain reaction
PRR: Pattern recognition receptor
PRT: Paralogue ratio test
RFLP: Restriction fragment length polymorphism
SAG: Salivary agglutinin
SID: SRCR interspersed domain
SNP: Single nucleotide polymorphism
SRCR: Scavenger receptor cysteine-rich
STRs: Short tandem repeats
UCSC: University of California Santa Cruz
VNTR: Variable number tandem repeat
ZP: Zona pellucida
μg: Microgram

LIST OF WEB RESOURCES
The URLs for data presented in the thesis are as follows:
Human Genome Diversity Panel
International HapMap Project
UCSC Human Genome Browser
UCSC In-Silico PCR
Database of Genomic Variants
R Project
Coriell Institute for Medical Research
Human Random Control DNA Panels
Repeat Masker
Basic Local Alignment Search Tool (BLAST)
Clustal Omega
Primer3
Gideon database
Sequence Manipulation Suite
WebLogo 3
McDonald and Kreitman test
GraphPad
ImageJ
HIV/AIDS

1 INTRODUCTION

1.1 Copy number variation

The human genome shows extensive variation in different forms. Copy number variants (CNVs) account for a major proportion of human genetic polymorphism (Craddock et al., 2010) and contribute to the differences between individual humans (Hastings et al., 2009). CNVs also play an important role in genetic susceptibility to common disease. Human genetic variation is the diversity in alleles of human genes, and represents the total amount of genetic diversity within the human genome at both the individual and the population level (Conrad et al., 2010; Sudmant et al., 2010; Zhang et al., 2009). Recent studies have reported that variation exists in the human genome at different levels: large microscopically visible chromosome anomalies (several kilobase to megabase pairs), submicroscopic copy number variation of DNA segments (tens to thousands of kilobase pairs) and the single base pair.

CNVs are widespread in human genomes and are a major source of genetic variation in humans (Iafrate et al., 2004; Sebat et al., 2004). Deletions, insertions and duplications of DNA segments ranging from several kilobases (kb) to megabases (Mb) in size, present at variable number in comparison with a reference genome, are collectively referred to as copy number variants (CNVs) (Conrad et al., 2010). A CNV can be a simple tandem duplication, or may involve complex gains or losses of homologous sequences at multiple sites in the genome (Figure 1). Recent studies show that up to 12% of the genome is subject to CNV (Conrad et al., 2010; Iafrate et al., 2004; Kidd et al., 2010; Korbel et al., 2007; Redon et al., 2006). It has been reported that copy number can vary between organs and tissues of the same individual, and that CNVs can arise both meiotically and somatically (Piotrowski et al., 2008). Most CNVs are benign variants and do not directly cause disease.
Genes involved in the development and activity of both the immune system and the brain tend to be enriched in CNVs (Feuk et al., 2006; Zhang et al., 2009).

The simplest type of copy number variation in the human genome arises from deletion or duplication of a gene. A diploid genome normally contains two copies of a particular gene, one on each chromosome. CNVs can be categorized as diallelic or multiallelic. Diallelic CNVs have two alleles and can produce three different genotypes, for both deletion and duplication events (Figure 1). A simple deletion can therefore result in a diploid copy number of two, one or zero (Figure 1). Similarly, after a simple duplication a diploid genome can contain two, three or four copies of the gene (Figure 1). However, the pattern of deletion or duplication events in the genome is not always simple and can result in complex copy number variation, known as multiallelic copy number variation (Wain et al., 2009). Successive rounds of duplication can produce multiallelic copy number variants in a diploid genome. Multiallelic CNVs have more than two alleles and can produce more than three genotypes (Figure 1). Generally, the deleted or duplicated genomic segments vary in size from a few hundred to several million bp and can contain an entire gene, part of a gene, a region outside a gene or, for larger variants, several genes.

Figure 1: Diallelic and multiallelic copy number variation (Wain et al., 2009). Diallelic locus (grey) and flanking loci (green and blue) with copy number variation caused by (A) deletion and (B) duplication, each showing the locus with (i) normal diploid copy number, (ii) heterozygous state, and (iii) homozygous state. (C) Multiallelic locus showing (i) normal copy number, (ii) multiple rounds of duplication on one chromosome and a deletion on the homologous chromosome, (iii) duplication on one chromosome and no deletion on the homologous chromosome, (iv) multiple rounds of duplication on one chromosome and no deletion on the homologous chromosome, (v) one round of duplication on each chromosome, (vi) one round of duplication on one chromosome and multiple rounds of duplication on the homologous chromosome, and (vii) multiple rounds of duplication on both chromosomes. Multiallelic assays measure total diploid copy number but cannot distinguish the genotypes of (ii) and (iii), or of (iv) and (v).
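The point made in the Figure 1 caption, that an assay reporting only total diploid copy number cannot separate some genotypes, can be illustrated with a minimal sketch. The function name and the allele sets below are hypothetical choices for illustration (alleles are written as copies per chromosome), not data from this study.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def genotypes_by_diploid_total(allele_copy_numbers):
    """Group unordered pairs of alleles by the diploid copy number they sum to.

    Each allele is given as the number of copies carried on one chromosome.
    """
    groups = defaultdict(list)
    for a, b in combinations_with_replacement(sorted(allele_copy_numbers), 2):
        groups[a + b].append((a, b))
    return dict(groups)


# Diallelic deletion locus: alleles with 0 or 1 copy per chromosome.
diallelic_deletion = genotypes_by_diploid_total([0, 1])

# Hypothetical multiallelic locus: alleles with 0, 1, 2 or 3 copies per chromosome.
multiallelic = genotypes_by_diploid_total([0, 1, 2, 3])

for total, pairs in sorted(multiallelic.items()):
    status = "ambiguous" if len(pairs) > 1 else "unique"
    print(f"diploid copy number {total}: {pairs} ({status})")
```

In this sketch every diploid total at the diallelic locus maps to a single genotype, whereas at the multiallelic locus a total of three copies can arise from either 3+0 or 2+1, and a total of four from either 3+1 or 2+2.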

1.1.1 Classes of CNVs

Based on their mutational origin and the molecular mechanism of their formation, CNVs can be divided into two classes, frequently termed recurrent and non-recurrent CNVs. The mutation rates are thought to differ between recurrent and non-recurrent CNVs (Hollox & Hoh, 2014).

Recurrent CNVs

Recurrent CNVs occur in regions containing large segmental duplications and are mainly generated by non-allelic homologous recombination. % of normal polymorphic CNVs can be classified as recurrent CNVs (Conrad et al., 2010). Recurrent CNVs can occur anywhere in the genome, but hotspots for these CNVs are mainly found in subtelomeric and pericentromeric regions (Conrad & Hurles, 2009; Redon et al., 2006).

Non-recurrent CNVs

Non-recurrent CNVs involve large genomic regions, and break-point analysis shows that minimal or no homology is required for their formation (Conrad et al., 2010). Non-recurrent CNVs can be generated by non-homologous end joining (NHEJ) or fork-stalling and template switching (FoSTeS) mechanisms (Zhang et al., 2009), and many non-recurrent CNVs are unique (Hollox & Hoh, 2014). The majority of benign CNVs and a large percentage of pathogenic CNVs fall into this class, and they sometimes show extremely deleterious phenotypes (Arlt et al., 2009; Arlt et al., 2011).

1.1.2 Functional consequences of CNVs

The functional importance of many CNVs is relatively clear: reduced copy number of a gene can be correlated with reduced expression, while duplicated copies of a gene can lead to increased expression (McCarroll & Altshuler, 2007; Stranger et al., 2007). 85%-95% of CNVs in humans and mice were reported to be associated with a change in expression of the affected genes (Stranger et al., 2007). CNVs are thought to be a major driving force in evolution and adaptation (Hancock et al., 2010; Iskow et al., 2012; Zhang et al., 2009).
Additional copies of genes provide redundancy in sequence, so that some copies maintain the original function while extra copies are free to evolve new or modified functions (Inoue & Lupski, 2002). Copy number variation of specific genes can offer a selective advantage in human adaptation and evolution. For example, the amount of salivary amylase is directly proportional to the copy number of the AMY1 gene. The higher copy number of AMY1 in starch-consuming populations suggests that a high copy number of AMY1 may be advantageous in starch-eating individuals (Perry et al., 2008). However, much variation in the copy number of specific genes is disadvantageous and leads to a group of pathological conditions known as genomic disorders (Lupski & Ph, 2007). Copy number change in human somatic cells leads to cancer formation and progression (Volik et al., 2006) and contributes to cancer proneness (Frank et al., 2007). CNVs have been reported to confer risk of complex disease, including susceptibility to autism (Kumar et al., 2008; Marshall et al., 2008; Sebat et al., 2004, 2007), schizophrenia (Stefansson et al., 2009; Walsh et al., 2008; Xu et al., 2008), Crohn's disease (McCarroll et al., 2009), psoriasis (Hollox et al., 2008) and systemic lupus erythematosus (Aitman et al., 2006). Several copy number variable genes, including those encoding known metabolizing enzymes such as CYP2D6 and GSTM1 and potential drug targets such as CCL3L1, may also make significant contributions to pharmacogenomic studies (Ouahchi et al., 2006).

1.1.3 Mechanisms of structural change

Heritable CNVs are produced by germline genomic rearrangements that result in gains or losses of DNA segments. The different mechanisms of chromosomal structural change have been studied in model organisms, mainly yeast, Escherichia coli and Drosophila. Recent findings in these model organisms have suggested different mechanisms for copy number variation in human genomes (Hastings et al., 2009). The mechanisms of copy number variation in the human genome can be broadly divided into two groups: homologous recombination (HR) and nonhomologous recombination (Alkan et al., 2011; Sudmant et al., 2010). Homologous sequences at least 300 bp long are required for HR in eukaryotic cells, whereas nonhomologous recombination mechanisms typically use short homologous DNA sequences, called microhomologies, to guide repair. The microhomologies are often present in single-stranded overhangs on the ends of double-strand breaks (Hastings et al., 2009).
HR is the basis of several mechanisms of accurate DNA repair that use another highly identical sequence in the genome, such as a segmental duplication or related interspersed repetitive elements, to repair a damaged sequence. Chromosomal structural change and copy number variation can occur by HR when the repair mechanisms use homologous sequences in different chromosomal positions. This is called non-allelic or ectopic homologous recombination (NAHR). In contrast, nonhomologous recombination mechanisms change the copy number of genes because they use sequence from a non-homologous template (Lupski & Stankiewicz, 2005; Stankiewicz & Lupski, 2002). NAHR events are more frequent than NHEJ events, with estimates of up to 10⁻⁴ per locus per generation (Shaffer & Lupski, 2000), and tend to be associated with larger CNVs (Redon et al., 2006). Environmental factors and localized DNA conformations are likely to influence the rate of NHEJ events, which in general has been estimated at less than 10⁻⁷ per generation, similar to the estimated mutation rate of single nucleotide polymorphisms (SNPs) of 10⁻⁸ per locus per generation (Conrad & Hurles, 2009).

1.2 CNV detection methods

Recent studies have provided increasing evidence that CNVs have an important role in conferring differences in infectious disease susceptibility. Determining the accurate copy number of a particular gene is more challenging than SNP genotyping, because CNV typing measures a quantitative difference rather than a qualitative one. In the early days of CNV typing, the commonly applied methods were Southern blotting, fluorescence in situ hybridization (FISH) and quantitative PCR (qPCR). In recent years, different methods have been developed to study CNVs with greater accuracy and precision (Fiegler et al., 2006; McCarroll & Altshuler, 2007). The accuracy and precision of a CNV typing method also depend on the availability of well-characterized copy number reference controls against which results can be compared. Many recent methods also allow many different genomic regions to be studied in parallel, sometimes on a genome-wide basis. At present, however, no single methodology can accurately genotype all CNV classes in large case-control studies; power comes from combining methods and repeat typing (Hollox et al., 2008). Diverse approaches have been developed over the years, with good improvements in detection accuracy and precision.

1.2.1 Southern blotting and pulsed-field gel electrophoresis

Southern blotting was first devised by E. M. Southern (1975). It was used as the standard method to detect gene deletion or duplication in the genome and was routinely used for genetic fingerprinting and paternity testing (Figure 2).
In Southern blotting, restriction enzyme-digested fragments of DNA are transferred from an electrophoresis gel to a nitrocellulose or nylon membrane. The immobilized DNA is hybridised with probes that specifically target individual sequences in the blotted DNA. Alterations in restriction fragment size appear as novel bands on the blot (Mellars & Gomez, 2011). The resolution of conventional agarose gel electrophoresis depends on the migration of DNA molecules through relatively small gel pores. Large random coils of DNA cannot be resolved through a much smaller gel matrix, leading to size-independent mobility and loss of resolution. The resolution and accuracy of copy number measurement can be improved by using pulsed-field gel electrophoresis (PFGE) in combination with Southern blotting. The periodic alteration of the electric field in PFGE produces continuous re-orientation of DNA molecules and allows the resolution of large DNA fragments.

Southern blotting has intrinsic limitations: it is laborious and time consuming, and requires large amounts of high-quality DNA. It is not suited to automation because it involves many steps, including DNA digestion, electrophoresis, blotting and hybridization. Southern blotting can analyze only a limited number of loci per blot (a maximum of 10 to 15 when good probes are available). PFGE allows more scope for the detection of deletions and duplications in larger genomic regions, with the ability to resolve DNA sequences of up to 2 Mb. In a semi-quantitative approach, the intensity of probe hybridization to a specific target is compared with that of a control locus and a control sample, but uneven transfer of DNA to the nylon membrane or incomplete washing of probes can result in misinterpretation of band intensities.
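The semi-quantitative comparison described above, in which the target band is compared with a control locus and with a control sample, can be sketched as a double ratio. This is a minimal illustration under stated assumptions, not the procedure used in this thesis: the function name is hypothetical, the band intensities are invented, and the control sample is assumed to carry a known copy number (two by default).

```python
def southern_copy_number(test_target, test_control_locus,
                         ref_target, ref_control_locus,
                         ref_copy_number=2):
    """Estimate copy number from band intensities by a double ratio.

    The target band in each lane is first divided by the control-locus band
    in the same lane, and the test-lane ratio is then scaled by the ratio in
    a control sample of assumed known copy number.
    """
    test_ratio = test_target / test_control_locus
    ref_ratio = ref_target / ref_control_locus
    return ref_copy_number * test_ratio / ref_ratio


# Invented intensities: the test lane shows 1.5 times the normalised signal
# of a two-copy control sample.
estimate = southern_copy_number(1500, 1000, 1000, 1000)
print(estimate)

# A loading or transfer difference that scales every band in the test lane
# by the same factor cancels in the within-lane ratio.
half_loaded = southern_copy_number(750, 500, 1000, 1000)
print(half_loaded)
```

The within-lane ratio only corrects for effects that scale both bands in a lane equally. Uneven transfer across fragment sizes or incomplete washing affects the two bands differently and is not removed by this normalisation, which is the source of misinterpretation noted above.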

Figure 2: Schematic picture of the Southern blotting method (adapted from Essential Cell Biology, Second Edition, Garland Science). (A) Separation of DNA by electrophoresis. (B) Transfer of DNA to nitrocellulose or nylon paper. (C) The nitrocellulose sheet is carefully peeled off the gel. (D) Hybridization of the membrane with buffer containing a radioactively labeled DNA probe specific for the required DNA sequence. (E) Specific DNA fragments appear as DNA bands on the autoradiograph.

1.2.2 Fibre-FISH (fluorescence in situ hybridization)

Fibre-FISH is a modified FISH technique developed for high resolution mapping of genes and chromosomal regions on fibres of chromatin or DNA. Fibre-FISH permits physical ordering of DNA probes to a resolution of 1000 bp. This high resolution allows assessment of gaps and overlaps in contigs and analysis of segmental duplications and copy number variation. In Fibre-FISH, the chromatin/DNA fibres are released from interphase nuclei and stretched on a glass slide by means of salt or solvent extraction. After stretching, the DNA fibres are fixed on a microscope slide before hybridization. The uniformity and reproducibility of DNA stretching improved significantly after introduction of the molecular combing protocol (Bensimon et al., 1994), in which DNA molecules attached at one end to a glass surface are extended and aligned by the action of a receding air/water meniscus. Fibre-FISH allows the determination of copy number per allele, which is important for studies of inheritance and disease. Salivary amylase gene (AMY1) copy number was successfully genotyped using Fibre-FISH (Perry et al., 2008) (Figure 3). Fibre-FISH has a labour intensive workflow and low throughput and requires high quality samples, and highly variable regions are difficult to interpret because of overlapping signals (Cantsilieris et al., 2012).

Figure 3: AMY1 copy number estimation using high-resolution Fibre-FISH (Perry et al., 2008). Red (~10 kb) and green (~8 kb) probes encompass the entire AMY1 gene and a retrotransposon directly upstream of (and unique to) AMY1, respectively. (a) Individual with 14 diploid AMY1 gene copies, showing one allele with 10 copies and the other with four copies. (b) Individual with 6 diploid AMY1 gene copies, consistent with Fibre-FISH results.

1.2.3 Array comparative genomic hybridization

Array comparative genomic hybridization (aCGH) was developed for high resolution, genome-wide screening of segmental genomic copy number variations (CNVs). This technique identifies unbalanced structural and numerical chromosomal abnormalities (Baris et al., 2007; Chin et al., 2007; Jaillard et al., 2010). aCGH allows comprehensive interrogation of thousands of discrete genomic loci for DNA copy number gains and losses. Routine karyotype analysis is not sensitive enough to detect subtle chromosome rearrangements (less than 4 Mb). Higher resolution and throughput, together with possibilities for automation, robustness, simplicity, high reproducibility and precise mapping of aberrations, are therefore the most significant advantages of aCGH over cytogenetic methods (Miller et al., 2010). aCGH is gradually replacing cytogenetic methods in an increasing number of genetics laboratories (Ahn et al., 2013).

In aCGH, equal amounts of labelled genomic DNA from a test and a reference sample are co-hybridized to an array containing the DNA targets. Genomic DNA of the patient and control are differentially labelled with Cyanine 3 (Cy3) and Cyanine 5 (Cy5) (Figure 4). Hybridization of the repetitive sequences can be blocked by the addition of Cot-1 DNA. The slides are scanned into image files using a microarray scanner. The spot intensities are measured and analyzed for copy number (Ahn et al., 2013; Baris et al., 2007; Feuk et al., 2006; Jaillard et al., 2010). The resulting ratio of the fluorescence intensities is proportional to the ratio of the copy numbers of DNA sequences in the test and reference genomes. If the intensities of the fluorescent dyes are equal on a probe, this region of the patient's genome is interpreted as having an equal quantity of DNA in the test and reference samples; an altered Cy3:Cy5 ratio indicates a loss or a gain of patient DNA at that specific genomic region.

Figure 4: Schematic picture of array-based comparative genome hybridization (array-CGH), adapted from Feuk et al. (2006). The reference and test DNA samples are differentially labelled with fluorescent tags (Cy5 and Cy3, respectively) and, after repetitive elements are blocked using Cot-1 DNA, hybridized to genomic arrays. After hybridization, the fluorescence ratio (Cy3:Cy5) reveals copy number differences between the two DNA samples. Typically, in array-CGH, the initial labelling of the reference and test DNA samples is reversed for a second hybridization ("dye-swap") (left and right sides of the panel) to detect spurious signals. The red line represents the original hybridization and the blue line represents the reciprocal hybridization.

1.2.4 Representational oligonucleotide microarray analysis (ROMA)

Representational oligonucleotide microarray analysis (ROMA) is a variant of array-CGH designed to search for CNVs (Figure 5). The reference and test DNA samples are made into representations to reduce sample complexity before hybridization. DNA is digested with a common restriction enzyme that has uniformly distributed cleavage sites (BglII is shown in Figure 5) and ligated with common adaptors containing PCR primer sites.
Ligated fragments are amplified by PCR under controlled conditions so that only DNA of less than 1.2 kb is amplified, thereby reducing the complexity of the DNA. The PCR-amplified fragments are hybridized to the array for copy number detection (Lucito et al., 2003; Sebat et al., 2004). As in the common microarray method, the reference and test DNA samples are tagged with different fluorescent dyes, usually green or red. Oligonucleotide probes (around base pairs) are designed computationally and either spotted onto glass or synthesized on silica by laser photochemistry, with many copies of a single probe comprising each dot. If the gene is present equally in both samples, the dot glows yellow. A mostly-red or a mostly-green dot indicates a deletion or duplication, respectively, in the gene. The number of copies of a gene can be estimated from the intensity of the colour of each dot (Lucito et al., 2003). One limitation of the ROMA technique is that PCR can only amplify ~200,000 fragments of DNA, comprising approximately 2.5% of the human genome (Lucito et al., 2003).

Figure 5: Representational oligonucleotide microarray analysis (ROMA) for copy number detection (Feuk et al., 2006). DNA is digested with a common restriction enzyme with uniformly distributed cleavage sites (BglII). Adaptors with PCR primer sites are ligated to each fragment and the fragments are amplified by PCR. Only DNA of less than 1.2 kb (yellow) is amplified and hybridized to the array.

1.2.5 Quantitative PCR (qPCR)

Quantitative PCR (qPCR) is a high throughput technique to measure and validate copy number variation of a gene. qPCR measures PCR amplicons in real time, and the fractional cycle number (Ct) at which amplification reaches a defined threshold during the exponential phase of the reaction indicates the amount of starting template. The absolute or relative quantity of an unknown sample is measured using a standard curve, which is constructed by amplifying known amounts of target DNA (usually a serial dilution), plotting the resultant Ct values against log concentration and fitting a linear trend line to the data.
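The standard-curve quantitation described above can be sketched in a few lines. This is a minimal illustration, assuming an ordinary least-squares fit of Ct against log10 of the input quantity; the function names are hypothetical and the dilution series and Ct values are invented for the example rather than taken from any experiment.

```python
import math


def fit_standard_curve(quantities, ct_values):
    """Fit Ct = slope * log10(quantity) + intercept by least squares."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quantity_from_ct(ct, slope, intercept):
    """Read the starting quantity of an unknown sample off the fitted line."""
    return 10 ** ((ct - intercept) / slope)


# Invented ten-fold serial dilution (ng of template) with invented Ct values.
standards_ng = [100, 10, 1, 0.1]
standards_ct = [20.00, 23.32, 26.64, 29.96]

slope, intercept = fit_standard_curve(standards_ng, standards_ct)
unknown_ng = quantity_from_ct(24.98, slope, intercept)
print(round(slope, 2), round(intercept, 2), round(unknown_ng, 2))
```

Because Ct falls as the amount of starting template rises, the fitted slope is negative, and a small shift in Ct corresponds to a multiplicative change in the estimated quantity.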
The accumulation of PCR amplicon is measured by fluorescence-based chemistry, using either DNA intercalating dyes such as SYBR green or probe-based methodologies such as TaqMan, Scorpion and molecular beacons (Cantsilieris & White, 2013; Chen et al., 2006; Fellermann et al., 2006; Linzmeier & Ganz, 2005). Quantitative PCR is quick and powerful, requires very small amounts of DNA (5-10 ng) and is able to detect very small deletions or duplications in the genome. However, the number of targets is limited by the number of fluorophores available and the detection capabilities of the instrument (Cantsilieris et al., 2012). Quantitative real-time PCR is mainly used to confirm or validate copy number variation (Fellermann et al., 2006; Linzmeier & Ganz, 2005), but its resolution is not sufficient to distinguish between high copy numbers (more than 4 copies) of a gene (Hollox et al., 2008).

1.2.6 Multiplex ligation-dependent probe amplification (MLPA)

Multiplex ligation-dependent probe amplification (MLPA) was first introduced in 2002 by Schouten and co-workers and has been widely applied in a variety of clinical and research situations (Schouten et al., 2002). It has proven to be an efficient and reliable technique for the detection and validation of copy number variation (Hills et al., 2010; Janssen et al., 2005; Pedersen et al., 2010). In MLPA, two sequence-tagged half probes are annealed to adjacent sites on the genomic target sequence and ligated using a thermostable DNA ligase. The ligated probes are subsequently amplified with universal PCR primers (one of which is fluorescently labelled) and quantified using electrophoresis (Figure 6b). Each PCR product has a distinct size, which allows identification of specific DNA fragments. The amount of ligated probe is proportional to the copy number of the target gene and can be quantified after fractionating the ligated PCR products by capillary electrophoresis. A typical capillary-based MLPA assay allows quantification of up to 45 distinct sequences of unknown copy number. In each MLPA experiment, reference probes are also included in the probe mix to calculate the unknown copy number. The reference probes are designed from non-variable chromosomal regions and are assumed to have a normal copy number (n=2) in both test samples and control samples. Several reference samples are recommended in order to estimate experimental variability. Groth and co-workers estimated beta-defensin copy number in 44 different samples using MLPA, and a noticeable correlation was observed with other techniques, such as PRT and quantitative PCR (Groth et al., 2008).
MLPA detects copy number variation of up to 45 distinct genomic sequences in a single reaction using small amounts of starting DNA (20 ng) and does not require cells for chromosome spreads. An MLPA assay can target any genomic sequences for copy number analysis, irrespective of their size or proximity to each other. MLPA allows more accurate determination of the size of deletions or duplications than FISH or qPCR (Janssen et al., 2005). MLPA is high throughput and results can be obtained within 20 hours. There are significant challenges in designing custom probes for regions not yet commercially available as kits. A list of criteria (probe length, Tm, secondary structure, GC content, nucleotide composition at the ligation site, sequence uniqueness, avoidance of known SNPs, etc.) needs to be satisfied to improve the likelihood of a successful MLPA assay. Unknown SNPs in the probe binding regions may affect MLPA results and appear as exon deletions.
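The use of reference probes and reference samples in MLPA, as described above, can be sketched as a two-step normalisation. This is a minimal sketch under stated assumptions rather than a validated analysis pipeline: the function names, probe names and peak values are hypothetical, and the reference samples are assumed to carry the normal copy number (n=2) at the target probe.

```python
from statistics import mean, stdev


def normalise(peaks, target, reference_probes):
    """Divide the target probe signal by the mean reference-probe signal."""
    return peaks[target] / mean(peaks[p] for p in reference_probes)


def mlpa_copy_number(test_peaks, reference_samples, target,
                     reference_probes, normal_copies=2):
    """Return (estimated copy number, relative spread across reference samples).

    The spread is the standard deviation of the normalised target signal in
    the reference samples divided by its mean, as a rough indication of
    experimental variability.
    """
    test_norm = normalise(test_peaks, target, reference_probes)
    ref_norms = [normalise(s, target, reference_probes)
                 for s in reference_samples]
    dosage = test_norm / mean(ref_norms)
    spread = stdev(ref_norms) / mean(ref_norms)
    return normal_copies * dosage, spread


# Invented peak areas for one target probe and two reference probes.
reference_probes = ["ref1", "ref2"]
reference_samples = [
    {"target": 1000, "ref1": 1000, "ref2": 1000},
    {"target": 1100, "ref1": 1000, "ref2": 1000},
    {"target": 900, "ref1": 1000, "ref2": 1000},
]
test_sample = {"target": 1500, "ref1": 900, "ref2": 1100}

copies, spread = mlpa_copy_number(test_sample, reference_samples,
                                  "target", reference_probes)
print(round(copies, 2), round(spread, 2))
```

Using several reference samples, rather than one, is what makes the spread estimate possible; with a single reference sample the copy number could still be computed but its reliability could not be judged from the data.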

1.2.7 Multiplex amplifiable probe hybridization (MAPH)

Multiplex amplifiable probe hybridization (MAPH) is a PCR-based method of quantifying multiple genomic loci in a single reaction (Armour et al., 2000; Hollox et al., 2002). The technique is based on the quantitative amplification of multiple probes that have been hybridized to immobilized genomic DNA (Figure 6a). MAPH probes are generated by cloning the target sequences into a plasmid vector, followed by PCR amplification of the cloned sequence using primers directed to the vector, so that all PCR products share the same flanking sequence. The probes therefore differ in length but carry identical tails, which allows them all to be amplified in a single PCR with one universal primer pair. Denatured genomic DNA is spotted onto a nylon filter and hybridized with a set of probes corresponding to the target sequences. The membranes are washed rigorously to remove unbound probe, so that the amount of each specifically bound probe that remains is proportional to the copy number of its target. The probes are then stripped from the membrane, amplified simultaneously with the universal primer pair and separated by electrophoresis. Test and control probes are compared on the basis of band intensities or peak areas/peak heights, depending on the detection method. A reduced band intensity or peak area/peak height compared to internal control probes indicates a deletion, and an increased band intensity or peak area/peak height indicates a duplication. Armour et al. multiplexed up to 40 probes in a single reaction and resolved them simultaneously by gel electrophoresis (Armour et al., 2000). MAPH has been used to measure beta-defensin copy number (Armour et al., 2000; Hollox et al., 2002). The design of probes for MAPH is far simpler than MLPA probe generation. MAPH works with double-stranded DNA probes that are obtained from cloning or PCR.
SNPs in the probe-binding regions are unlikely to affect MAPH, but if only part of a region targeted by a MAPH probe is deleted, the probe may still hybridize and the target will be scored as present. The washing steps in the MAPH technique, necessary to remove unbound probe, may also introduce a contamination risk. MAPH requires 1 µg of DNA for reliable and reproducible results.

Figure 6: Multiplex PCR-based methods for the identification of copy-number variants (adapted from Feuk et al., 2006). (a) In multiplex amplifiable probe hybridization (MAPH), probes (red) of different sizes are normally cloned into vectors and amplified by PCR such that each end is flanked by the same sequence (blue). The genomic DNA is fixed to a membrane and the probes are hybridized to it. Unbound probes are removed by rigorous washing and the bound probes are then stripped from the membrane. The amount of each probe present at this stage is proportional to its copy number in the target genomic DNA. The probes are amplified with a universal primer pair and size-separated by gel electrophoresis. Changes in peak heights relative to control (non-CNV) DNA indicate the copy number. (b) In multiplex ligation-dependent probe amplification (MLPA), two probes are designed for each target region, which hybridize adjacent to each other (probes for two regions are shown in red and yellow). As in MAPH, all probe pairs are flanked by universal primer sites (blue). The probes are hybridized to genomic DNA and adjacent probes are ligated, joining the two primer sites together. The number of ligated probes is proportional to the target copy number. After denaturation, the ligated probes are amplified by PCR. A stuffer sequence is sometimes added to one of the probes carrying a universal primer site, which allows each probe set to produce a fragment of a different size. Size separation by gel electrophoresis is carried out as with MAPH, to detect deletions and duplications.

Fosmid Paired-End Sequencing

Fosmid paired-end sequencing was used to characterize structural variation in the human genome (Tuzun et al., 2005). The assay sequences both ends of the clones in a fosmid genome library (representing a single individual) and maps the paired end-sequences to the human genome reference sequence assembly (Figure 7).
This method can identify deletions, insertions and even inversions by finding discordant regions where multiple fosmids show discrepancies in length and/or orientation when their end-sequence pairs are compared with the reference sequence. When the end-sequence pairs of a fosmid map closer together (<32 kb) on the reference genome, this indicates an insertion in the fosmid sequence, whereas a deletion is indicated when the end-sequence pairs map farther apart (>48 kb) on the reference chromosome. In the case of a sequence inversion, the fosmid end-sequences have inconsistent orientation.
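
The decision rule described above (span below 32 kb, span above 48 kb, or inconsistent orientation) can be written down as a small classifier. This Python sketch is a simplified illustration, not the pipeline of Tuzun et al.: the function names and coordinates are hypothetical, it assumes that a concordant pair maps with the left end on the "+" strand and the right end on the "-" strand, and it assumes that all pairs passed in come from the same candidate region. The requirement for at least two supporting pairs is applied to every class of call, which is also an assumption.

```python
from collections import Counter

MIN_SPAN_BP = 32_000  # pairs mapping closer than this suggest an insertion
MAX_SPAN_BP = 48_000  # pairs mapping farther apart than this suggest a deletion


def classify_end_pair(left_pos, right_pos, left_strand, right_strand):
    """Classify one fosmid end-sequence pair mapped to the reference genome."""
    # Assumed concordant orientation: the two end reads face each other.
    if (left_strand, right_strand) != ("+", "-"):
        return "inversion"
    span = right_pos - left_pos
    if span < MIN_SPAN_BP:
        return "insertion"
    if span > MAX_SPAN_BP:
        return "deletion"
    return "concordant"


def supported_calls(pairs, min_support=2):
    """Keep only discordant classes seen in at least min_support pairs."""
    counts = Counter(classify_end_pair(*pair) for pair in pairs)
    return {
        call: n
        for call, n in counts.items()
        if call != "concordant" and n >= min_support
    }


# Hypothetical end-sequence pairs from one candidate region.
pairs = [
    (100_000, 140_000, "+", "-"),  # 40 kb span, expected size
    (101_000, 127_000, "+", "-"),  # 26 kb span
    (102_000, 128_000, "+", "-"),  # 26 kb span
    (99_000, 159_000, "+", "-"),   # 60 kb span, seen only once
    (103_000, 143_000, "+", "+"),  # inconsistent orientation, seen only once
]

calls = supported_calls(pairs)
print(calls)
```

Requiring more than one discordant fosmid before making a call reduces the chance that a single chimeric or mis-mapped clone is reported as a structural variant.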

Figure 7: The paired-end sequencing methodology for detection of structural variation (Tuzun et al., 2005). A deletion is defined when fosmid end-sequence pairs span >48 kb, and an insertion is defined when two or more end-sequence pairs span <32 kb.

Paralogue ratio test (PRT)

The principles of the paralogue ratio test (PRT), which estimates the actual copy number of a gene, have been described by Armour and colleagues (Armour et al., 2007). PRT is a comparative PCR-based method in which the same primer pair is used to co-amplify a copy number variable test region and a non-variable reference region (Deutsch et al., 2004; Armour et al., 2007; Hollox et al., 2008). Using identical primers improves reproducibility by making the amplification kinetics of the test and reference loci very similar. The PCR amplicons can be distinguished easily by capillary electrophoresis on the basis of their small size difference (Figure 8). The PRT assay requires careful primer design, so that the primers amplify the copy number variable region of interest and one non-copy number variable reference region of the genome. Reference sequences on other chromosomes are usually preferred, to minimize potential gene conversion between the test and reference sequences. Experiments can be performed in duplicate by using two different fluorescent dyes to label the same primer. Raw copy number is determined by calculating the ratio of the peak areas of the test and reference amplicons. The precision and accuracy of the PRT assay are equivalent to those of MLPA and MAPH, but PRT has the advantage of high-throughput, cost-effective CNV typing of large sample cohorts using a small amount (5 ng) of DNA (Armour et al., 2007; Hardwick et al., 2014; Machado et al., 2013; Aklillu et al., 2013; Abu Bakar et al., 2009; Hollox et al., 2008; Wain et al., 2014; Hardwick et al., 2012; Hollox et al., 2008).
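
The ratio calculation behind a raw PRT copy number estimate can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the procedure used in this thesis: the function names and peak areas are hypothetical, the reference locus is assumed to be present at two copies per diploid genome, and the test and reference amplicons are assumed to amplify with equal efficiency. A real assay may additionally need calibration against samples of known copy number.

```python
from statistics import mean


def prt_raw_copy_number(test_area, reference_area, reference_copies=2):
    """Raw copy number from the test:reference peak-area ratio."""
    if reference_area <= 0:
        raise ValueError("reference peak area must be positive")
    return reference_copies * test_area / reference_area


def prt_duplicate_estimate(measurements, reference_copies=2):
    """Average the raw estimates from duplicate measurements.

    measurements is a list of (test_area, reference_area) tuples, for
    example one tuple per fluorescent dye used to label the primer.
    """
    return mean(
        prt_raw_copy_number(test, ref, reference_copies)
        for test, ref in measurements
    )


# Hypothetical peak areas for one sample measured with two dyes.
dye_measurements = [(1500.0, 1000.0), (1450.0, 1000.0)]

raw = prt_duplicate_estimate(dye_measurements)
integer_call = round(raw)
print(raw, integer_call)
```

Averaging the two dye measurements before rounding uses the duplicate as a check: a large disagreement between the two raw values would flag the sample for repeat typing.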

Figure 8: Schematic describing the steps of PRT. Both test (CNV) and reference regions are PCR amplified using fluorescently labelled primers. The PCR products are quantified by capillary electrophoresis and copy number is estimated by comparing the amount of test product with the amount of reference product.

1.3 Deleted in Malignant Brain Tumours 1 (DMBT1)

The mammalian immune system protects the host against microbial infections and comprises innate and adaptive components. The adaptive immune mechanism is mediated by antigen receptors (Medzhitov and Janeway, 1997; Iwasaki & Medzhitov, 2010), whereas the innate immune mechanism is mediated by pattern recognition receptors (PRRs) (Medzhitov, 2007). PRRs are proteins expressed by cells of the innate immune system (Akira et al., 2006; Iwasaki & Medzhitov, 2010; Karin et al., 2006). Innate immunity is a broad-spectrum, non-adaptive and evolutionarily older defense mechanism that provides immediate defense after challenge by bacterial and viral pathogens (Medzhitov, 2007; Medzhitov and Janeway, 1997; Medzhitov and Janeway, 2000). Innate immunity participates in the regulation of the inflammatory response and represents a link between infection and inflammation at the molecular level (Iwasaki & Medzhitov, 2010; Karin et al., 2006). In humans, the Deleted in Malignant Brain Tumours 1 (DMBT1) protein has two proposed functions, in regenerative processes and in pathogen defense (Bikker et al., 2002, 2004; Holmskov et al., 1999; Mollenhauer et al., 2001; Mollenhauer et al., 2007; Müller et al., 2012). DMBT1

works as a PRR in innate immunity (End et al., 2009; Hansen et al., 2011; Kang & Reid, 2003). The DMBT1 protein is also commonly known as cell surface glycoprotein 340 (gp-340) and as salivary agglutinin (SAG), and interacts with microbial structures to deliver a danger signal to the host cell (Holmskov et al., 1997; Ligtenberg et al., 2007; Meylan et al., 2006).

Genomic Structure of the DMBT1 gene

The Deleted in Malignant Brain Tumours 1 (DMBT1) gene is located on human chromosome 10q. This region was first identified through losses of heterozygosity (LOH) in a variety of tumours (Gray et al., 1995). The gene spans a genomic region of about 80 kb and consists of 55 exons (Holmskov et al., 1999; Mollenhauer et al., 2007; Mollenhauer et al., 1999; Mollenhauer et al., 1997). The six exons at the N-terminal end of DMBT1 encode the signal peptide and a motif of approximately 90 amino acids of unknown function. The major part of the genomic sequence of DMBT1 encodes 13 highly similar scavenger receptor cysteine-rich (SRCR) domains separated by SIDs (SRCR-interspersed domains). The homologous SRCRs/SIDs are followed by two CUB (C1r/C1s Uegf Bmp1) domains flanking the 14th SRCR domain and then a ZP (zona pellucida) domain toward the C-terminus (Holmskov et al., 1999; Mollenhauer et al., 2007; Mollenhauer et al., 2002; Mollenhauer et al., 2002). The SRCR domains and SIDs form tandemly repeated units of about 3-4 kb in length, each comprising three exons and the associated intronic sequences. The first exon of every repeat unit codes for an SRCR domain and the following two small exons form a SID (Holmskov et al., 1999; Madsen et al., 2010). All repeat units show an extraordinarily high degree of sequence similarity (87-100% identical), including the intronic sequences. Genetic polymorphism leads to different DMBT1 alleles with an altered number of SRCR domains and SIDs (Mollenhauer et al., 2007; Mollenhauer et al., 1999; Mollenhauer et al., 2002).
Two consecutive SRCR domains are usually separated by a SID, although this pattern is disrupted between SRCR4 and SRCR5, which are separated by a single exon encoding the amino-terminal half of a SID; the exon encoding the carboxy-terminal half of the SID is absent (Mollenhauer et al., 1999). Previous studies reported various DMBT1 alleles with an altered number of SRCR domains and SIDs due to copy number variation in the tandemly repeated units (Mollenhauer et al., 1999; Mollenhauer et al., 2002). Two allelic variants, classified as the largest and the shortest variant, were very common in normal individuals. The largest allelic variant contains 13 central SRCR domains and SIDs, whereas the shortest variant lacks 5 of the 13 central SRCR domains and SIDs (Mollenhauer et al., 2002; Renner et al., 2007).

1.3.2 Expression and localisation of DMBT1

The largest DMBT1 transcript, DMBT1/8kb.2, consists of 7656 nucleotides, including the 5'-UTR, exons 1-16 and further downstream exons. The smallest human DMBT1 variant, DMBT1/6kb.1, comprises 5802 nt. The 6 kb transcript unambiguously lacks exons 15 (carboxy-terminal half of SID3) to 24 (SRCR7). Exon utilization also differs in the 3' region, where one SRCR domain and one SID are missing in the 6 kb transcript. DMBT1 is mainly expressed in epithelial cells and associated glands, in particular in the respiratory and gastrointestinal tracts (Renner et al., 2007). In humans, the DMBT1 gene is strongly expressed in lung and weakly expressed in brain (Mollenhauer et al., 1997; Holmskov et al., 1999). DMBT1 is expressed at high levels in salivary glands (Ericson and Rundegren, 1983; Prakobphol et al., 2000) and at low levels in mammary gland, uterus, testis and prostate (Holmskov et al., 1999). The protein has been detected in human tear fluid (Schulz et al., 2002), sweat glands, hair follicles and epidermis (Mollenhauer et al., 2003), liver (Sasaki et al., 2002) and alveolar macrophages (Holmskov et al., 1999). In general, the gene is expressed most strongly in epithelia, and the protein is usually observed on the apical cell surface or in luminal exocrine secretions. In most epithelial tissues, DMBT1 is expressed at low to moderate levels under normal conditions. Upregulation of DMBT1 has been observed in various pathophysiological contexts, such as bacterial infection, inflammation, tumour-flanking tissue and carcinogen exposure. Expression has also been noted in other tissues, such as the brain, and in immune cells. In cancer, the DMBT1 gene shows both decreased (Mollenhauer et al., 2002) and increased expression in tumours and cell lines (Mollenhauer et al., 2001).
Differentiation of kidney intercalated tubule epithelial cells is altered when they are grown on culture dishes coated with the purified DMBT1 gene product called hensin (Takito & Al-Awqati, 2004).

The domain organization of DMBT1

The domain organization of the DMBT1 protein features 13 scavenger receptor cysteine-rich (SRCR) domains, each separated by an SRCR-interspersed domain (SID), except for SRCRs 4 and 5, which are contiguous. The first six exons encode the signal peptide and a motif of approximately 90 amino acids of unknown function. This is followed by the repeated pattern of 13 SRCR domains separated by SIDs, which constitutes the main functional region of the DMBT1 protein. The SRCR domains are followed by a CUB domain (C1r/C1s Uegf Bmp1), another SRCR domain, a second CUB domain, and finally a zona pellucida (ZP) domain (Mollenhauer et al., 2000; Holmskov et al., 1999; Kang & Reid,

2003; Ligtenberg et al., 2010; Mollenhauer et al., 2007). The protein exists both in a soluble form and in association with the membranes of alveolar macrophages (Tino & Wright, 1999b). The largest known protein variant (DMBT1/8kb.2) comprises 2413 amino acids with a calculated molecular weight of 265 kDa. The smallest known variant (DMBT1/6kb.1) contains 1785 amino acids with a calculated molecular weight of 196 kDa (Holmskov et al., 1999). When the protein was first identified, it was suggested that the protein variants might arise through genetic polymorphisms and/or alternative splicing at the mRNA level (Holmskov et al., 1997, 1999; Mollenhauer et al., 2001). DMBT1 orthologs have been identified in various mammals, such as mouse (dmbt1, CRP-ductin, vomeroglandin, muclin, apactin), rat (dmbt1, ebnerin, pancrin), rabbit (hensin), cow (bovine gall bladder mucin), pig (dmbt1) and rhesus monkey (H3) (Table 1). All DMBT1 ortholog proteins contain SRCR, CUB and ZP domains, although the number and order of these domains vary between animals (De Lisle & Ziemer, 2000; Li & Snyder, 1995; Matsushita et al., 2000; Tandon & De Lisle, 2004) (Figure 9).

Table 1: DMBT1 synonyms and orthologs in different organisms (Ligtenberg et al., 2010).

Organism | Protein | Function | Localization
Human | DMBT1; gp-340; SAG | Tumour suppression; Epithelial cell differentiation; Innate immunity | Skin and mucosal surfaces; Lung; Salivary gland
Mouse | CRP-ductin; Vomeroglandin; Muclin; Apactin | Mucosal defense; Epithelial differentiation; Pheromone perception; Sorting receptor | Intestinal crypts; Vomeronasal organ; Pancreatic and intestinal mucus; Pancreatic acinar cell
Rabbit | Hensin | Terminal differentiation of kidney epithelial intercalated cells and embryonic stem cells | Kidney collecting duct
Rat | Ebnerin; Pancrin | Liver regeneration; Taste perception | Small intestine; Taste buds
Cow | Bovine gallbladder mucin | Cholesterol-binding; Gallstone formation | Gall-bladder epithelium

Figure 9: Domain organization of DMBT1 and DMBT1 orthologs. The largest and the smallest human variants are shown as DMBT1/8 kb.2 and DMBT1/6 kb.1 respectively. The DMBT1 prototype features 13 SRCR domains separated by SIDs. The SRCR domains are followed by a short Thr-rich region, a CUB domain, a 14th SRCR domain, a Ser-Thr-Pro-rich region, a second CUB domain and a zona pellucida domain. The DMBT1 orthologs from the rabbit (Hensin), mouse (CRP-ductin-alpha), rat (Ebnerin) and pig are shown with varying numbers of SRCR domains, followed by one or more CUB domains, and share a C-terminal SRCR domain, CUB domains and a zona pellucida domain. The signal peptide is followed by a domain of unknown function that is indicated with a gray box (Ligtenberg et al., 2010).

Salivary agglutinin (DMBT1 SAG), a high molecular mass glycoprotein that showed calcium-dependent binding to Streptococcus mutans, was isolated from parotid saliva (Ericson & Rundegren, 1983). DMBT1 GP340 was purified from broncho-alveolar lavage fluid of patients with alveolar proteinosis (Holmskov et al., 1997). Several studies have shown that the protein core of DMBT1 SAG is identical to that of lung glycoprotein 340 (DMBT1 GP340) and Deleted in Malignant Brain Tumours 1 (DMBT1) (Holmskov et al., 1999; Ligtenberg et al., 2001; Prakobphol et al., 2000). DMBT1 SAG and DMBT1 GP340 cross-react with monoclonal antibodies raised against isolates of DMBT1 from different sources (Ligtenberg et al., 2007). The DMBT1 SAG and DMBT1 GP340 proteins are DMBT1 isoforms, encoded by the DMBT1 gene. A collecting duct protein, Hensin, is also a DMBT1 isoform and is involved in the in vitro plasticity of intercalated cell polarity (Al-Awqati et al., 2011; Gao et al., 2010; Takito et al., 1999). Genetic polymorphism in different individuals results in variation of DMBT1 with different numbers of SRCR domains.
In addition, different isoforms may arise through alternative splicing or differential post-translational modifications, such as glycosylation (Holmskov et al., 1997).

Different isoforms of DMBT1 are not uniformly named in human, mouse, rat, rabbit, pig, horse and cattle. The existing nomenclature in the literature is complex and confusing, and isoforms are often named on the basis of tissue-specific localization. To avoid confusion, all isoforms of the DMBT1 protein are described as DMBT1 in this thesis.

SRCR Domains

The most prevalent and repeated domains in DMBT1 are the scavenger receptor cysteine-rich (SRCR) domains (Resnick et al., 1994; Sarrias et al., 2004; Mollenhauer et al., 1999; Bikker et al., 2004). SRCR domains are found from lower invertebrates (sponges) to higher vertebrates (mammals) (Resnick et al., 1994). The SRCR domain is an ancient and highly conserved protein domain that occurs in soluble or membrane-bound receptors expressed by hematopoietic and non-hematopoietic cells (Resnick et al., 1994; Sarrias et al., 2004). SRCR domains are classified into two groups. Group A SRCR domains contain six cysteine residues and are encoded by two exons, whereas group B SRCR domains have eight cysteine residues and are encoded by a single exon (Resnick et al., 1994). The SRCR domains of DMBT1 belong to group B, which is involved in ligand binding in the innate immune system (Sarrias et al., 2004). The disulfide bond pattern of group B SRCR domains is C1-C4, C2-C7, C3-C8 and C5-C6 (Figure 10), and is highly conserved in each SRCR domain of the SRCR superfamily (Resnick et al., 1994). Group B SRCR domains are responsible for broad host-pathogen interactions and bind a wide range of bacteria, including Streptococcus mutans, Escherichia coli, Lactobacillus casei, Helicobacter pylori and Prevotella intermedia, and also mediate agglutination of S. mutans (Cornejo et al., 2013; Esberg et al., 2012; Prakobphol et al., 2000).

Figure 10: Schematic of gp340 structures, indicating conserved cysteines of SRCR domains (Malamud & Wahl, 2010). The positions of cysteine residues and disulfide bond patterns for SRCR domains of DMBT1 are shown in detail. The 14 SRCR domains of DMBT1 contain 8 conserved cysteine residues and are separated by SRCR-interspersed domains (SIDs).

CUB Domains

The CUB domain is a structural motif that is found exclusively in extracellular and plasma membrane-associated proteins (Bork & Beckmann, 1993). The domain was named after the three proteins in which it was first recognized: complement subcomponents (C1s/C1r), an embryonic sea urchin protein (Uegf; sea urchin epidermal growth factor) and bone morphogenetic protein 1 (Bmp1). CUB domains are mainly found in proteins that are involved in developmental processes (Bork & Beckmann, 1993). They are involved in a diverse range of functions, such as complement activation, developmental patterning, tissue repair, axon guidance and angiogenesis, cell signalling, fertilisation, haemostasis, inflammation, neurotransmission, receptor-mediated endocytosis and tumour suppression (Ligtenberg et al., 2010). CUB domains are involved in oligomerisation and recognition of substrates, and also in protein-protein and glycosaminoglycan-protein interactions (Romero et al., 1997). Four conserved cysteines of the CUB domain form two disulfide bridges (C1-C2, C3-C4) within a beta-barrel fold (Romero et al., 1997).

Zona Pellucida Domains

The zona pellucida (ZP) domain is found in extracellular proteins in many mammals. The zona pellucida itself is a thick extracellular coat that surrounds all mammalian eggs and preimplantation embryos. The ZP domain is located near the C-terminus of the extracellular

proteins, and this is also the case for DMBT1 (Holmskov et al., 1999; Madsen et al., 2010; Mollenhauer et al., 1999). The ZP domain contains approximately 260 amino acids with eight conserved cysteine (Cys) residues that form intramolecular disulfide bonds (Jovine et al., 2005; Jovine et al., 2004). The ZP domain plays a role in protein polymerization and is associated with proteins that polymerize into higher-order structures, such as filaments and matrices (e.g., egg ZP, inner ear tectorial membrane, nematode cuticle and fly tracheal system). ZP domains are also found in glycoproteins involved in development, acoustic perception, immunity and cancer (Jovine et al., 2002), and are involved in protein oligomerization (Jovine et al., 2005). ZP domains also play an important role during agglutination of DMBT1 SAG (Oho et al., 1998).

The role of DMBT1 in epithelial and stem cell differentiation

The intercalated cells (ICs) of the kidney play a crucial role in acid-base homeostasis by regulating the expression of acid-base transporters (Brown et al., 1988). Intercalated cells exist in two functionally distinct subtypes: the β-type secretes bicarbonate (HCO3-) into the urine, whereas the α-type secretes acid (H+) (Schwartz et al., 2002; Takito et al., 1999). As their acid-base transporters, α-intercalated cells (A-IC) express apical H+-ATPases and the basolateral bicarbonate exchanger AE1, whereas β-intercalated cells (B-IC) express basolateral H+-ATPases and the apical bicarbonate exchanger pendrin (Alper et al., 1989; Wagner et al., 2004). The role of DMBT1 in epithelial cell differentiation was studied through functional studies of its orthologs, rabbit Hensin and mouse CRP-ductin (Al-Awqati et al., 2003; Al-Awqati, 2011; Schwartz et al., 2002; Takito et al., 1999; Vijayakumar et al., 2013; Vijayakumar et al., 2006).
In acidic environments, kidney epithelial cells reversed their polarity by changing the position of transmembrane ion transporters from the apical to the basal membrane (Schwartz et al., 2002). The reversal of kidney epithelial cell polarity and the induction of terminal differentiation of intercalated epithelial cells were mediated by an extracellular matrix (ECM) protein called Hensin, which is encoded by DMBT1 (Schwartz et al., 2002; Takito et al., 1999; Vijayakumar et al., 2013, 2006). Metabolic acidosis induced conversion of the collecting tubule from a state of bicarbonate (HCO3-) secretion to bicarbonate absorption (i.e., H+ secretion). The total number of intercalated cells was unchanged by metabolic acidosis, but the number of β-intercalated cells was reduced. The β-intercalated cells were converted to α-intercalated cells, and the resulting increase in α-intercalated cells compensated for the reduced number of β-intercalated cells (Gao et al., 2010). DMBT1 is secreted as a soluble monomer and becomes active after polymerization and deposition in the ECM. The

polymerization of hensin is mediated by integrins in association with CypA and galectin 3 (Al-Awqati et al., 2011; Al-Awqati, 2011). Previous studies reported that deletion of hensin was associated with accumulation of β-intercalated cells and a lack of α-intercalated cells. The deletion allele of DMBT1 blocked the conversion of β- to α-intercalated cells and induced distal renal tubular acidosis (Gao et al., 2010). DMBT1 plays an important role in the proliferation and differentiation of stem cells of the intestinal crypt and the isthmus of the human gastric mucosa, as well as mediating the physiological renewal of gastrointestinal epithelia (Kang et al., 2002; Kang & Reid, 2003). The strong signals and distinct spatial distribution of DMBT1 in the gastrointestinal epithelium and epidermal cells of foetuses, rather than adults, indicated that DMBT1 might play an important role in development (Mollenhauer et al., 2000).

Role of DMBT1 in Innate immunity

Innate immunity is usually regarded as the first line of host defense against invading pathogens (Akira et al., 2006). Its main function is to detect and eliminate invading non-self molecules through the recognition of pathogen-associated molecular patterns (PAMPs), which are found on bacterial surfaces but are not shared by the host. The innate immune receptors that recognize PAMPs are termed pattern-recognition receptors (PRRs) (Akira et al., 2006; Kumar et al., 2011). DMBT1 acts as a pattern recognition receptor against invading pathogens (Hansen et al., 2011; Ligtenberg et al., 2010). DMBT1 recognizes and binds both Gram-positive and Gram-negative bacteria through Lrr motifs, which appear to be conserved on bacterial surface proteins (Loimaranta et al., 2009; Meylan et al., 2006).
PAMPs recognized by innate immune receptors include lipopolysaccharide (LPS) from Gram-negative bacteria, lipoteichoic acid (LTA) and peptidoglycan from Gram-positive bacteria, and β-glucans and mannans from yeast (Loimaranta et al., 2009). DMBT1 is involved in the body's first line of defence and acts as an innate immune molecule in humans (Bikker et al., 2002, 2004; Holmskov et al., 1999; Ligtenberg et al., 2010; Ligtenberg et al., 2001; Madsen et al., 2013; Prakobphol et al., 2000). Salivary agglutinin (SAG) was first discovered as a major non-immunoglobulin component of saliva with bacteria-binding activity (Ericson & Rundegren, 1983). DMBT1 acts as a typical pattern recognition receptor (PRR) in the body's innate anti-microbial defence and interacts with both invading pathogens and local defense components (Resnick et al., 1994). Because of its epithelia-associated distribution in the mucosa and in the ducts of glands, DMBT1 binds invading pathogens at the inner surfaces of the body (Mueller et al., 2002; Sasaki et al., 2003). DMBT1 SAG aggregates S. mutans bacteria and plays an important role in the prevention of caries through hindrance of bacterial

adhesion (Cornejo et al., 2013; Esberg et al., 2012; Pecharki et al., 2005; Prakobphol et al., 2000; Seki et al., 2003). DMBT1 binds to a great diversity of Gram-positive and Gram-negative bacteria, for example Escherichia coli, Lactobacillus casei, Haemophilus influenzae, Klebsiella oxytoca, Helicobacter pylori, Staphylococcus aureus, Streptococcus pneumoniae, Streptococcus agalactiae, Bacteroides fragilis, Salmonella and many more (Bikker et al., 2004; Bikker et al., 2002; Ligtenberg et al., 2010; Madsen et al., 2010; Prakobphol et al., 2000; Rosenstiel et al., 2007). DMBT1 interacts not only with bacteria but also with viruses, namely HIV and influenza A viruses, and inhibits viral infection in vitro (Hartshorn et al., 2003; Nagashunmugam et al., 1998).

Bacteria-binding domain on DMBT1

DMBT1 binds and agglutinates Gram-positive and Gram-negative bacteria and plays a very important role in innate immunity. Bikker and colleagues used the enzyme Lys-C to digest a DMBT1 protein of 1722 amino acids containing only SRCR domains and SIDs. Using synthetic peptides, a 16-mer peptide loop of the SRCR domain (peptide SRCRP2; QGRVEVLYRGSWGTVC) was identified as an effective bacteria-binding domain (Bikker et al., 2002). The SRCRP2 peptide was first identified on the basis of its binding to S. mutans and also binds a number of other bacteria, including Streptococcus gordonii, Staphylococcus aureus, Escherichia coli and Helicobacter pylori. The SRCRP2 region showed 100% identity in 8 out of 13 SRCR domains (Figure 11). The presence of repeated units of SRCRP2 endows DMBT1 with a general bacteria-binding capacity for both Gram-positive and Gram-negative bacteria (Bikker et al., 2002).

Figure 11: Schematic of the bacteria-binding SRCR domains of DMBT1. The top line of sequence logos shows the pattern of aligned amino acid sequences of the SRCR domains; the bottom line shows the consensus amino acid sequence after multiple sequence alignment.
The bacteria-binding 16-mer peptide sequence (SRCRP2) and the 11-mer motif (DMBT1pbs1) are shown in green-dotted and blue boxes respectively.

The minimal bacteria-binding site on SRCRP2 was further narrowed down by N- and C-terminal truncation of SRCRP2, using overlapping 16-mer peptides. The minimum binding domain was pinpointed to an 11 amino acid motif characterized by effective bacteria-binding,

designated DMBT1 pathogen-binding site 1 (DMBT1pbs1; GRVEVLYRGSW) (Figure 11). An alanine substitution scan revealed that substitution of Val-3, Glu-4, Val-5, Leu-6 or Trp-11 resulted in a drastic decrease in bacteria binding, showing that these five residues are critical for effective binding in the DMBT1pbs1 motif (xxVEVLxxxxW) (Bikker et al., 2004). All five residues were highly conserved within the 13 SRCR domains of DMBT1, the only difference being the presence of Ile-5 in the SRCR1 domain. Bacterial agglutination was induced more rapidly by DMBT1pbs1 than by the SRCRP2 peptide (Bikker et al., 2004).

Hydroxyapatite-binding domain on DMBT1

DMBT1 is a high molecular weight glycoprotein in human saliva that modulates microbial growth and colonization on the dental enamel, which is mainly composed of hydroxyapatite (HA) (Bikker et al., 2013). DMBT1 plays a dual role with regard to bacterial homeostasis (Loimaranta et al., 2005): in soluble form, DMBT1 binds bacteria and mediates their aggregation, whereas when bound to HA it provides a site for bacterial colonization and microbial growth (Brady et al., 1992; Kishimoto et al., 1989). Bikker and co-workers identified the HA-binding region of DMBT1: an 18-mer peptide of the SRCR domain (SRCRP3; DDSWDTNDANVVCRQLGA) mediated effective binding of DMBT1 to HA (Bikker et al., 2013). A detailed view of the sequence of P3 revealed four negatively charged aspartic acid residues (D), three of which (D34, D35 and D38) are present on the surface of SAG and mediate HA binding via calcium ions (Figure 12) (Bikker et al., 2013).

Figure 12: Schematic presentation of the SRCR domain, highlighting the hydroxyapatite (HA)-binding domain and the bacteria-binding domain on DMBT1. (A) The bacteria-binding domain (P2) and the hydroxyapatite-binding domain (P3) are shown in yellow and orange respectively.
(B) Three negatively charged aspartic acid (D) residues of P3 (D34, D35 and D38) are involved in HA binding, and HA binding is supported by further negatively charged amino acids: glutamic acid (E) 101 and D102 of P7, and D73 and D74 of P4. The orientation of the negatively charged residues within P3, P4 and P7 plays an important role in HA binding. Positively charged residues are shown in red. Adapted from Bikker et al. (2013).
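
The peptide sequences quoted in the two preceding subsections can be checked against each other with a few lines of code. The Python sketch below is only an illustrative consistency check using the sequences given in the text; the variable and function names are hypothetical. The mapping of SRCRP3 onto SRCR domain numbering assumes that the peptide starts at residue 34, which is inferred from the residue numbers D34, D35 and D38 quoted above and is not stated explicitly in the text.

```python
SRCRP2 = "QGRVEVLYRGSWGTVC"    # 16-mer bacteria-binding peptide
DMBT1PBS1 = "GRVEVLYRGSW"      # 11-mer minimal bacteria-binding motif
SRCRP3 = "DDSWDTNDANVVCRQLGA"  # 18-mer hydroxyapatite-binding peptide

# Residues found to be critical in the alanine scan (1-based, within DMBT1pbs1).
CRITICAL_POSITIONS = (3, 4, 5, 6, 11)

# Assumed position of the first SRCRP3 residue within the SRCR domain.
SRCRP3_START = 34


def motif_pattern(peptide, critical_positions):
    """Keep critical residues and replace all other positions with 'x'."""
    return "".join(
        residue if position in critical_positions else "x"
        for position, residue in enumerate(peptide, start=1)
    )


# Zero-based offset of the 11-mer inside the 16-mer (-1 would mean absent).
offset = SRCRP2.find(DMBT1PBS1)

pattern = motif_pattern(DMBT1PBS1, CRITICAL_POSITIONS)

# Aspartic acid positions in SRCRP3, in peptide and in assumed domain numbering.
aspartates = [i for i, residue in enumerate(SRCRP3, start=1) if residue == "D"]
domain_positions = [SRCRP3_START + i - 1 for i in aspartates]

print(offset, pattern, aspartates, domain_positions)
```

The check shows that DMBT1pbs1 is fully contained within SRCRP2 and that SRCRP3 carries four aspartic acid residues, the first three of which fall at positions 34, 35 and 38 under the assumed numbering.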

1.3.8 Interaction of DMBT1 with viruses

DMBT1 binds to different viruses, including HIV-1 (Wu et al., 2004; Wu et al., 2006) and influenza A virus (IAV) (Hartshorn et al., 2006, 2007b). HIV-1 binds to the CD4 receptor through the viral envelope glycoprotein gp120, leading to conformational changes in gp120 that allow high-affinity interaction with chemokine receptors (Sattentau & Moore, 1991; Trkola et al., 1996b). DMBT1 interacts with gp120 of HIV-1 in a calcium-dependent manner (Wu et al., 2006). The DMBT1-binding region on gp120 is a linear, highly conserved sequence located near the stem of the V3 loop, which is critical for chemokine receptor binding during viral infection (Wu et al., 2004). DMBT1 thereby blocks the access of gp120 to the chemokine receptor. The first SRCR domain and one half of the first SID of DMBT1 bind to the same HIV-1 V3 sequences that interact with full-length DMBT1 (Wu et al., 2006). Different synthetic fragments of the SRCR1 domain of DMBT1 have been shown to inhibit HIV-1 infection through binding to the N-terminal flank of the V3 loop of HIV-1 gp120 (Chu et al., 2013). Chu and co-workers mapped the gp120-binding region of SRCR1 using twenty overlapping 15-mer peptides covering the complete SRCR1 sequence and identified six peptides with a high binding index in a solid-phase ELISA (Figure 13). An alanine substitution scan revealed that Asp34 and Asp35 in P5 and P6, and Asn96 and Glu101 in P19 and P20, were the critical residues in the interaction of SRCR1 with gp120. Three peptide loops (P5, P6 and P20) showed the highest binding index, and P5 and P6 coincided with the bacteria-binding SRCRP2 sequence.

Figure 13: Schematic picture of virus-binding regions on the SRCR domains of DMBT1. The top line of sequence logos shows the pattern of aligned amino acid sequences of the SRCR domains; the bottom line shows the consensus amino acid sequence after multiple sequence alignment. Six peptides (P5, P6, P14, P15, P19 and P20) of SRCR1 with a high binding index for gp120 of HIV-1 are shown in boxes along with other SRCR sequences.

DMBT1 has broad-spectrum activity against influenza A virus (IAV), which is responsible for outbreaks of influenza, and inhibits its haemagglutination activity and infectivity (Hartshorn et al., 2006, 2007b). DMBT1 has broad antiviral activity against human, equine and porcine IAV strains and neutralises virus by binding to sialic acid residues present at the cell surface of the respiratory tract (Hartshorn et al., 2006, 2007b). The antiviral activity of DMBT1 against specific IAV strains varies among donors, depending on the relative abundance of specific sialic acid linkages on DMBT1. DMBT1 has direct virus-neutralising properties and displays cooperative interactions with surfactant protein D (SP-D) in viral neutralisation and aggregation (Hartshorn et al., 2006).

Interaction of DMBT1 with endogenous protein ligands

DMBT1 binds to a wide range of endogenous proteins, such as secretory IgA, MUC5B, surfactant proteins A and D (SP-A and SP-D), complement factor C1q, lactoferrin and albumin (Boackle et al., 1993; Ligtenberg et al., 2004; Ligtenberg et al., 2001; Oho et al., 1998; Thornton et al., 2001; Tino & Wright, 1999). DMBT1 binds to secretory IgA in a calcium-dependent manner and has a cooperative effect on bacterial aggregation (Armstrong et al., 1993; Rundegren & Arnold, 1987). DMBT1 interacts with IgA through an 11-amino-acid motif (DMBT1pbs1; GRVEVLYRGSW), the same motif that is responsible for bacterial binding (Ligtenberg et al., 2004). DMBT1 binds to bovine and human lactoferrin, a member of the transferrin family (End et al., 2005; Mitoma et al., 2001) that has various functions in the innate immune system. Lactoferrin sequesters iron from the local environment, thereby inhibiting microbial growth, and it prevents the formation of Pseudomonas biofilms (Singh et al., 2002). In addition, antimicrobial peptides are released upon proteolytic degradation of lactoferrin (Groenink et al., 1999). Bovine lactoferrin inhibits DMBT1 binding to the S. mutans protein antigen Pac, which belongs to the Ag I/II family (Mitoma et al., 2001). The same peptide of DMBT1 that is responsible for S. mutans binding also mediates binding of lactoferrin (Mitoma et al., 2001; Oho et al., 1998).
Bovine lactoferrin residues (SCAFDEFFSQSCA) are important for DMBT1 binding. Although the homologous sequence in human lactoferrin is slightly different (SCKFDEYFSQSCA), human lactoferrin also interacts with DMBT1 (End et al., 2005). DMBT1 was isolated as a soluble receptor for SP-D and also binds SP-A (Holmskov et al., 1997; Tino & Wright, 1999). SP-A and SP-D are collagen-containing, calcium-dependent (C-type) lectins called collectins (Kishore et al., 1996). SP-D forms oligomers of 4-8 subunits. Each subunit is composed of three identical polypeptides of 43 kDa held together by disulphide bonds and non-covalent interactions at the N-terminal ends of the chains. Each polypeptide consists of a short N-terminal region, followed by a collagen-like sequence, a short α-helical sequence and the carbohydrate recognition domain. DMBT1 binds to the carbohydrate recognition domain of SP-D through a calcium-dependent protein-protein interaction (Holmskov et al., 1997). SP-A and SP-D are involved in a range of immune functions, including viral neutralisation, aggregation and killing of bacteria and fungi, and clearance of apoptotic and necrotic cells. In immunologically naive lungs they downregulate inflammatory reactions, but when challenged with LPS or apoptotic cells they induce phagocytosis by macrophages, pro-inflammatory cytokine production and enhancement of adaptive immune responses (Gardai et al., 2003).

The glycosylation pattern of DMBT1

DMBT1 in human saliva is highly glycosylated by post-translational modification, in which monosaccharides are sequentially added by glycosyltransferases in the Golgi apparatus and endoplasmic reticulum. There are two main types of glycosylation, N-glycosylation and O-glycosylation (Amado et al., 1999). In N-glycosylation the polysaccharide is linked to asparagine, whereas in O-glycosylation the polysaccharide is linked to serine or threonine via N-acetylgalactosamine (GalNAc). DMBT1 contains up to 14 potential N-linked glycosylation sites (Holmskov et al., 1999) and numerous potential O-linked sites in the SIDs. The SRCR domains also contain a few potential O-glycosylation sites. The classical human secretor locus (Se), the FUT2 gene, encodes an α1,2-fucosyltransferase. The Se locus determines the presence of blood group antigens in secretory fluids and thus the secretor status of the ABO antigens (Hazra et al., 2008). In the first step of the generation of blood group antigens, α1,2-fucosyltransferase couples a fucose to galactose (Fucα1-2Gal). Non-secretor individuals have a different carbohydrate composition compared to secretor individuals (Hazra et al., 2008). Non-secretors have Lea and Lex structures on DMBT1 (SAG), whereas secretors have Leb, Ley and ABH structures in addition to Lea and Lex (Eriksson et al., 2007; Ligtenberg et al., 2000). DMBT1 from secretor individuals has a higher molecular mass than DMBT1 from non-secretors (Eriksson et al., 2007).
The blood group antigens, such as ABH and the Lea antigens, are used as ligands by bacterial receptors and thus might be involved in bacterial binding (Ligtenberg et al., 2000).

Role of DMBT1 in the complement pathway

The complement system is a biochemical cascade of the innate immune system and is involved in a wide range of functions in the human body, ranging from elimination of microorganisms through phagocytosis of viral, bacterial or fungal pathogens, to control of adaptive immunity through opsonization and leukocyte chemotaxis, and removal of immune complexes and apoptotic cells. The complement system comprises more than 40 soluble and surface-bound proteins and is mediated through three different activation pathways: the classical pathway, the alternative pathway and the lectin pathway. The classical pathway is activated when C1q binds to targets directly, or through antibodies or C-reactive protein. The lectin pathway is activated when mannose-binding lectin (MBL) binds to mannose, N-acetylglucosamine (GlcNAc) or other carbohydrate-containing structures on microbes. DMBT1 was found to be associated with the lectin pathway of complement activation (Leito et al., 2011; Reichhardt et al., 2012). DMBT1 binds to MBL through a direct protein-protein interaction but plays a dual role in modifying complement activation. Soluble SAG inhibited Candida albicans-induced complement activation, whereas surface- or membrane-associated DMBT1 activated the complement pathway (Leito et al., 2011; Reichhardt et al., 2012). DMBT1 was also found to initiate the complement system through the classical pathway when bound to C1q (Boackle et al., 1993). DMBT1 (SAG) interacts with C1q of the complement system and induces a series of inflammatory responses (Boackle et al., 1993). DMBT1 is secreted in saliva, but C1q is a serum component. During oral inflammation the two components mix and initiate local complement activation. Another group reported complement activation by SAG through the classical pathway in MBL-deficient serum (Reichhardt et al., 2012). The tandemly repeated SRCR domains of DMBT1 interact with several ligands at the same time, and more SRCR domains bind more MBL. The interaction of soluble DMBT1 with MBL and C1q leads to agglutination of complement proteins, thus inhibiting lectin pathway activation and subsequent inflammation (Reichhardt et al., 2012).

Involvement of DMBT1 in mechanism of fertilization

The involvement of DMBT1 in the mechanism of fertilization has been analyzed in two animals, horse (Equus ferus caballus) and pig (Sus scrofa). DMBT1 was identified as sperm-binding glycoprotein (SBG) in the pig (Ambruosi et al., 2013) and was located to two different chromosomes, 6 and 14.
SBG was implicated in sperm selection through acrosome alteration and suppression of the motility of sperm with premature capacitation (Teijeiro & Marini, 2012; Teijeiro et al., 2008; Teijeiro et al., 2012) and is also involved in sperm-oviduct interaction (Teijeiro et al., 2012). SBG is highly expressed in the oviduct, where final gamete maturation takes place and which stores viable spermatozoa for fertilization (Ambruosi et al., 2013). SBG was found associated with sperm periacrosomal membranes (Teijeiro & Marini, 2012) and is involved in maintaining sperm acrosome integrity in the pig (Teijeiro et al., 2008). The secretion of SBG in different sections of the porcine oviduct (ampulla and isthmus) depends on the stage of the estrus cycle, and SBG secretion is mainly observed from the end of the follicular phase in both ampulla and isthmus. SBG was found mainly in the apical region of non-ciliated secretory cells and also in the lumen of both ampulla and isthmus during the follicular stage and preovulatory phase (Ambruosi et al., 2013). SBG mediates oocyte and sperm interaction in association with integrins, proteins previously characterized for their involvement in gamete interaction (Vijayakumar et al., 2008). The interaction of SBG with sperm was shown to be mediated by psoriasin (S100A7), localized at the head of sperm cells (Teijeiro & Marini, 2012). SBG was involved in negative selection of sperm in the oviduct, and selection was mediated by damaging the sperm that started capacitation just after arrival in the oviduct (Teijeiro et al., 2012). However, another study reported that SBG-deficient knock-out mice were fertile, indicating that SBG was not crucial for reproduction or sperm selection and might be substituted by another mechanism (De Lisle et al., 2008).

DMBT1 binding to Streptococcus mutans Ag I/II

Dental caries is a chronic infectious disease that involves dental plaque on teeth. It is characterized by pain, bone infection and demineralisation of the tooth tissues, which occurs at low pH due to bacterial fermentation of carbohydrates. The main causative organism of dental caries is Streptococcus mutans, a Gram-positive, chain-forming, nearly spherical bacterium (Cornejo et al., 2013; Esberg et al., 2012). Saliva-mediated adhesion and high numbers of mutans streptococci in saliva or on the tooth HA are predictors of dental caries (Bradshaw & Lynch, 2013; Cornejo et al., 2013; Esberg et al., 2012; Jonasson et al., 2007). Multiple surface polypeptides (adhesins) are expressed by streptococci for adhesion to SAG. Oral streptococci express three families of adhesins (Antigen I/II (AgI/II)-, Csh- and Fap1-family proteins) (Jakubovics & Kolenbrander, 2010; Jakubovics et al., 2009; Jakubovics et al., 2005). AgI/II adhesins are expressed by most streptococci in the oral cavity (Larson et al., 2011). The AgI/II family protein in S. mutans is SpaP, also called Pac (Guo et al., 2013; Jakubovics et al., 2005; Pieralise et al., 2013). The interaction of SpaP and SAG is very important for bacterial attachment to teeth.
AgI/II homologs carry a signal sequence at the N-terminus, followed by the alanine-rich (A), variable (V) and proline-rich (P) regions, succeeded by the C-terminal region and the membrane-spanning domain that anchors the protein to the bacterial cell wall (Figure 14). In earlier studies, two SAG adherence regions on AgI/II were identified, one at the alanine-rich N-terminal region of the AgI/II adhesin (Larson et al., 2010) and another at the C-terminus (C123) (Larson et al., 2011; Purushotham & Deivanayagam, 2014). Allelic variability of SAG among donors, and the associated post-translational modifications, are also important for the interaction between SAG and AgI/II (Purushotham & Deivanayagam, 2014).

Figure 14: Cartoon of the primary sequence of S. mutans UA159 AgI/II. From Purushotham & Deivanayagam (2014).

Evidence of copy number variation at DMBT1

Genome-wide association studies indicate that the DMBT1 region is highly copy number variable. The Wellcome Trust Case Control Consortium (WTCCC) used the Agilent CGH platform for CNV typing of the DMBT1 gene in four HapMap populations: the Yoruba in Ibadan, Nigeria; Japanese in Tokyo, Japan; Han Chinese in Beijing, China; and the CEPH (U.S. Utah residents with ancestry from northern and western Europe) (Conrad et al., 2010). The Database of Genomic Variants is the most up-to-date repository for results of comprehensive genome-wide screening for copy number variation. It shows that the middle portion of DMBT1, containing the tandemly repeated SRCR and SID domains, is more copy number variable than the 3′ and 5′ flanking regions (Figure 15). Previous studies using long-range PCR, Southern blotting and SSCP analysis showed extensive polymorphism at DMBT1. Three deletion hot spots were suggested in the DMBT1 gene: 5′ upstream, DMBT1 repeats 2-4 to 2-7 (deletion 2-4/2-7) and DMBT1 repeats 2-9 and 2-10 (deletion 2-9/2-10). The regions of DMBT1 at repeats 2-4/2-7 and 2-9/2-10 showed deletion polymorphisms in CEPH individuals (Sasaki et al., 2002).

Figure 15: UCSC Genome Browser screenshot showing the exon-intron structure with three DMBT1 gene annotations from different transcripts. Variants from the Database of Genomic Variants are shown below, with red indicating loss of signal, blue indicating gain of signal and brown indicating both loss and gain of signal in different samples. The tandemly repeated SRCR domains are more copy number variable than other regions (shown in the green rectangle). The previously reported 2-4/2-7 and 2-9/2-10 polymorphisms are also indicated on the reference genome (assembly hg19).

1.4 AIMS OF THE STUDY

1. Characterization and validation of copy number variation of the human DMBT1 gene.
2. Analysis of segregation patterns and de novo mutation rates at the DMBT1 gene.
3. Determination of the extent of diversity and the evolutionary basis of DMBT1 copy number in global populations.
4. Analysis of diversity of the salivary agglutinin-binding protein of Streptococcus mutans.
5. Investigation of copy number variation of DMBT1 in different disease cohorts.

2 MATERIALS AND METHODS

2.1 DNA samples used

HapMap samples

The International HapMap Project is a worldwide effort initiated in October 2002 to study the genetic diversity of four different populations: the African Yoruba from Nigeria (YRI), the Japanese from Tokyo (JPT), the Han Chinese from Beijing (CHB) and the European-descent collection from Utah (CEU) (The International HapMap Consortium, 2003). The 90 HapMap CEPH (CEU) DNA samples are from the U.S. population with Northern and Western European ancestry and were collected by the Centre d'Etude du Polymorphisme Humain (CEPH), located in Paris, France. The 90 HapMap Yoruban (YRI) samples were collected from the people of Ibadan, Nigeria. The HapMap CEPH and HapMap Yoruban (YRI) collections each comprise 30 sets of samples from two parents and an adult child (each such set is called a trio). The 90 HapMap Asian samples consist of 45 from unrelated Han Chinese from Beijing (CHB), China, and 45 from unrelated Japanese from Tokyo (JPT), Japan. All HapMap DNA samples are part of the International HapMap Project. The blood samples were converted into cell lines by Epstein-Barr virus transformation of peripheral blood lymphocytes by the Coriell Institute. Coriell provides DNA and cell lines from the samples for research projects that have been approved by the appropriate ethics committees.

Cell line samples

Lymphoblastoid cell lines (LCLs) were obtained from the Coriell cell repository. B-lymphoblastoid cell lines were established at Coriell Cell Repositories by transformation of B-lymphocytes (isolated from peripheral blood mononuclear cells) with Epstein-Barr virus (EBV) using phytohaemagglutinin as a mitogen (Henderson et al., 1977). Lymphoblastoid cells are small (7-9 micron), round cells that grow as loose aggregates in suspension.
The cultures obtained from the Coriell cell repository were stored in liquid nitrogen (-196 °C).

ECACC Human Random Control (HRC) samples

ECACC Human Random Control (HRC) panels were used as reference standards in all assays. The HRC DNA samples represent a control population of 480 UK Caucasian blood donors. The DNA samples were extracted from lymphoblastoid cell lines derived by Epstein-Barr virus (EBV) transformation of peripheral blood lymphocytes from single-donor blood samples. The genomic DNA was provided as a solution at a standard concentration of 100 ng/μl in 10 mM Tris-HCl buffer (pH 8.0) with 1 mM EDTA. Generally, 5-10 ng/μl was used as the working concentration in this study.

CEPH family samples

A set of lymphoblastoid cell lines from 809 individuals of European origin in 62 three-generation pedigrees was developed by the Centre d'Etude du Polymorphisme Humain (CEPH) (Stevens et al., 2012). The CEPH trios that form part of HapMap are a subset of these. The CEPH samples serve as a large reference collection and have been used extensively as a benchmark for analysis of genetic variants, to create linkage maps of the human genome, and to provide samples to the HapMap and 1000 Genomes projects (NIH/CEPH Collaborative Mapping Group, 1992). A set of CEPH families obtained from the Coriell Institute was used in the segregation analysis. A total of 522 members from 40 reference families were included in our study. Thirty-one of the forty CEPH families were obtained as three-generation pedigrees and the others were two-generation only; in total the set comprised 323 offspring, 80 parents and 119 grandparents.

HGDP-CEPH panel

In total, 1,035 individuals from the HGDP-CEPH panel were typed for DMBT1 copy number. Previous studies identified atypical and duplicated samples and pairs of close relatives within the HGDP-CEPH panel, and three standardized subsets (H1048, H971 and H952) were recommended for most population-genetic studies (Rosenberg, 2006). The subset H971 was constructed by excluding first-degree relative pairs, and all analyses in the present study were based on subset H971.

Leicester local volunteers

DNA was extracted from 20-50 ml of mouthwash from local volunteers (students and staff from the University of Leicester) using an in-house protocol by Dr Edward Hollox, Razan Abujaber and Eyeman Khier.
Ethical approval was granted by the University of Leicester Committee for Research Ethics Concerning Human Subjects (Non-NHS).

Crohn's samples

Three different Crohn's cohorts were used to study the genetic association of DMBT1 with Crohn's disease. The cohorts are described below and details of the sample sets are shown in Table 2.

England ("London") Crohn's samples

White European patients with Crohn's disease were recruited from specialist IBD clinics in London and Newcastle, as reported elsewhere (Prescott et al., 2007), after informed consent and ethical review (REC 05/Q0502/127). Patients were recruited from Guy's and St Thomas' Hospitals, London, St Mark's Hospital, London, and the Royal Victoria Infirmary, Newcastle, United Kingdom. Human Random Control (HRC) samples from the UK (from the ECACC collection) were used as control samples for the London Crohn's disease cohort. The English Crohn's sample of 980 cases and 480 UK controls (HRC) was genotyped with cases and controls distributed on each of the DNA plates. In total, 688 samples from the English Crohn's sample were also analysed using an Agilent 210k aCGH chip for the WTCCC study.

Table 2: Summary of samples analysed for the Crohn's study. The table reports, for the English, Scottish and Danish cohorts and by disease status (Crohn's or control), the total number of samples and the percentages of males and females with CNV1, CNV2 and SRCR copy number genotypes of DMBT1 and disease information.

Scotland ("Edinburgh") Crohn's sample

The patients in the Scotland ("Edinburgh") Crohn's sample were recruited at the Western General Hospital, Edinburgh, Scotland, which is a tertiary referral centre for IBD in South-East Scotland. A detailed description of the cohort is given elsewhere (Aldhous et al., 2010). Written consent was obtained from each patient prior to inclusion in the study. Crohn's patients who attended the inflammatory bowel disease (IBD) clinic at the Western General Hospital were included in this cohort.
Unrelated spouses or friends of IBD patients, or blood samples obtained from the Scottish Blood Transfusion Service, were used as healthy controls. The study protocol was approved by the Medicine and Oncology Subcommittee of the Lothian Local Research Ethics Committee (LREC 2000/4/192). A total of 340 controls and 348 cases were recruited in the Edinburgh Crohn's cohort, and cases and controls were randomly distributed across DNA plates. A total of 97 samples from the Edinburgh Crohn's cohort were also analysed as part of the WTCCC study on copy number variation, using an Agilent 210k aCGH chip.

Danish Crohn's samples

Three hundred and ninety ethnic Danish CD patients fulfilling the international diagnostic criteria for IBD were recruited from a well-defined geographical region (Copenhagen capital area) during a two-year period beginning on January 1, 2003. The Danish Crohn's sample is described in detail elsewhere (Jespersgaard et al., 2011a; Vind et al., 2008). A total of 155 Crohn's disease samples were included in our study. One hundred and seventy-five healthy blood donors from a Danish blood bank were included as controls.

African HIV cohorts

The study protocol was approved by the Institutional Review Board at the Faculty of Medicine, Addis Ababa University, Addis Ababa, Ethiopia, and the Ethiopian Science and Technology Ministry, Ethiopia; the regional ethical review board in Stockholm at the Karolinska Institutet, Stockholm, Sweden; and the ethical review committee of Muhimbili University of Health and Allied Sciences, Dar es Salaam, Tanzania. Written informed consent was obtained from each subject before the start of the study. The study was run in parallel at tuberculosis and HIV clinics in Addis Ababa, Ethiopia and Dar es Salaam, Tanzania, and details are described elsewhere (Machado et al., 2013; Aklillu et al., 2013; Hardwick et al., 2012). A total of 649 newly diagnosed ART-naive Ethiopian patients living in Addis Ababa, and Tanzanian HIV-only and HIV/tuberculosis co-infected study participants (n = 353), were recruited and enrolled prospectively.
Study participants were recruited and followed up for up to 1 year to monitor clinical, virological and immunological outcomes of HAART.

Lung disease cohorts

Gedling cohort

A total of 1176 individuals of European ancestry, for whom DNA and non-missing relevant phenotype data (age, sex, height, FEV1, FVC, smoking status, asthma diagnosis) were available, were typed for DMBT1 copy number using PRT. The details of the Gedling cohort are described elsewhere (Wain et al., 2014). In brief, individuals with asthma confirmed by a doctor's diagnosis were classed as doctor-diagnosed asthma (Britton et al., 1995), and individuals with age > 40, smoking pack-years > 5, % predicted FEV1 < 80% and FEV1/FVC < 0.7 were defined as COPD cases (Vestbo et al., 2013). Individuals with age > 40, pack-years > 5, % predicted FEV1 > 80% and FEV1/FVC > 0.7 were used as COPD controls for the Gedling cohort. Individuals with a doctor's diagnosis of asthma were excluded from the COPD case and control sets.

Leicester Respiratory Cohort (LRC)

In the Leicester Respiratory Cohort (LRC) dataset, perinatal data were collected at birth, and data on growth and development were acquired prospectively during childhood in Leicestershire, UK, as described in detail elsewhere (Kuehni et al., 2007; Wain et al., 2014). A total of 689 patients of self-reported European ancestry, aged less than 16 years and with non-missing relevant phenotype data (age, sex, height, FEV1, FVC and asthma variables), were included in the study. The LRC is an unselected population-based cohort and most children reporting a doctor's diagnosis of asthma had mild disease. Children were therefore divided into two subpopulations. Children who had ever reported frequent wheeze during the past 12 months (over four attacks, or always accompanied by shortness of breath) and were taking inhaled corticosteroids at the time of the questionnaire were defined as asthmatics. Controls for the analysis were all children who had never reported frequent wheeze during the past 12 months, were not taking inhaled corticosteroids at the time of the questionnaire, and did not have a doctor's diagnosis of asthma.

UTI (Urinary Tract Infection) and VUR (Vesicoureteral Reflux) cohorts

The study was approved by the Nationwide Children's Hospital Institutional Review Board, Nationwide Children's Hospital, Columbus, Ohio, United States of America. A total of 570 patients enrolled in the RIVUR study (children aged 2 to 72 months with VUR and UTI) were typed to estimate diploid copy number at the CNV1 and CNV2 loci. After excluding samples of non-European origin from the RIVUR study, a total of 420 patients of European origin were analysed for association. A total of 415 individuals without any history of UTI or VUR were typed as control samples; of these, 310 samples of European origin were included in the association study.
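The COPD case and control definitions given for the Gedling cohort earlier in this section amount to a simple decision rule. The sketch below is an illustration only: the class and function names are hypothetical, and the thresholds are applied as strict inequalities exactly as worded in the text, so an individual lying exactly on a threshold, or with discordant FEV1 and FEV1/FVC values, falls into neither set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Participant:
    age: float                     # years
    pack_years: float              # smoking pack-years
    fev1_pct_predicted: float      # % predicted FEV1
    fev1_fvc: float                # FEV1/FVC ratio
    doctor_diagnosed_asthma: bool  # doctor's diagnosis of asthma


def copd_status(p: Participant) -> Optional[str]:
    """Return 'case', 'control', or None if the individual is in neither set."""
    if p.doctor_diagnosed_asthma:
        return None  # asthma diagnoses are excluded from both sets
    if not (p.age > 40 and p.pack_years > 5):
        return None  # age and smoking criteria are shared by cases and controls
    if p.fev1_pct_predicted < 80 and p.fev1_fvc < 0.7:
        return "case"
    if p.fev1_pct_predicted > 80 and p.fev1_fvc > 0.7:
        return "control"
    return None  # boundary or discordant lung function values


if __name__ == "__main__":
    # Hypothetical individuals, not data from the cohort.
    print(copd_status(Participant(62, 20, 65, 0.60, False)))
    print(copd_status(Participant(55, 10, 95, 0.80, False)))
    print(copd_status(Participant(62, 20, 65, 0.60, True)))
```

Returning `None` for individuals outside both definitions keeps the rule explicit about who is left out of the COPD analysis, rather than defaulting them to controls.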
2.2 DMBT1 sequence processing and bioinformatics

The genomic sequence of the DMBT1 gene (chr10:124,320, ,403,252) on the Human Feb (GRCh37/hg19) assembly was retrieved from the UCSC Genome Browser, and repeated sequences (Alus and L1s) were masked with the RepeatMasker Web Server. After masking repetitive sequences, dot plot analysis of the entire sequenced region (chr10:124,320, ,403,252; 83,072 bp) against itself was performed with the Basic Local Alignment Search Tool (BLAST) to identify regions of local similarity between sequences.
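The principle behind a self-against-self dot plot can be illustrated with a much simpler exact-match approach than BLAST. The sketch below is not the method used in this study: it indexes exact k-mers in a short, hypothetical toy sequence (not DMBT1) and counts matches per diagonal offset, so that a tandem repeat of period p appears as a run of matches at offset p and its multiples. All names and the word size k are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical 8-bp unit repeated three times between short flanks.
TOY = "GG" + "ACGTTGCA" * 3 + "TT"


def self_dotplot(seq, k=4):
    """Return sorted (i, j) pairs, i < j, whose k-mers at 0-based i and j match."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    hits = []
    for positions in index.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                hits.append((positions[a], positions[b]))
    return sorted(hits)


def diagonal_offsets(hits):
    """Count matches per diagonal offset (j - i)."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    return dict(counts)


if __name__ == "__main__":
    print(diagonal_offsets(self_dotplot(TOY, k=4)))
```

In the toy sequence all off-diagonal matches fall at offsets 8 and 16, the period of the repeat unit and twice that period. Real genomic repeats diverge in sequence, which is why an alignment-based tool such as BLAST is used instead of exact k-mer matching.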

Two different algorithms (megablast and discontiguous megablast) available in the Nucleotide BLAST program were run using the default parameters, such as a maximum of 100 target sequences, after filtering low-complexity regions (Zhang et al., 2000).

2.3 Sequence analysis of SRCR repeats of DMBT1

MEGA6 (Tamura et al., 2013) was used to generate a maximum-likelihood tree for the DMBT1 SRCR repeats, containing both exonic and intronic sequences. The nucleotide sequence similarity of the SRCR repeats was inferred using the Maximum Likelihood method based on the General Time Reversible model (Nei and Kumar, 2000). Initial tree(s) for the heuristic search were obtained by applying the Neighbor-Joining method to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach. All positions containing gaps and missing data were eliminated from the final dataset.

2.4 Evidence of copy number variation on DMBT1

The Database of Genomic Variants (DGV) was analyzed for copy number variation in the DMBT1 region. DGV is the most up-to-date repository providing a comprehensive summary of structural variation in the human genome. The database treats genomic alterations involving segments of DNA larger than 50 bp as structural variation and only represents structural variation in healthy control samples (MacDonald et al., 2014). DGV shows loss of signal in red, gain of signal in blue and both gain and loss in brown, compared to the reference genome. The Wellcome Trust Case Control Consortium (WTCCC) used the Agilent 210k aCGH chip for CNV typing, and a total of 28 DMBT1-specific probes covering a 40 kb genomic region of the DMBT1 gene were analyzed in four HapMap phase I populations, CEU, YRI, CHB and JPT (Conrad et al., 2010). Previous studies had also characterized deletion polymorphisms at DMBT1 using long-range PCR, Southern blotting and SSCP (Sasaki et al., 2001, 2002).
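Section 2.3 states that all alignment positions containing gaps or missing data were eliminated before tree building. A minimal sketch of that complete-deletion step is given below, under stated assumptions: the alignment is a hypothetical toy example, the function names are illustrative, and a simple proportion of differing sites (p-distance) stands in for the Maximum Composite Likelihood distances, which are not reimplemented here.

```python
def complete_deletion(alignment, missing="-N?"):
    """Drop every column in which any sequence has a gap or missing symbol."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("sequences must be aligned to the same length")
    keep = [c for c in range(length)
            if all(alignment[n][c].upper() not in missing for n in names)]
    return {n: "".join(alignment[n][c] for c in keep) for n in names}


def p_distance(a, b):
    """Proportion of differing sites between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need non-empty sequences of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical aligned repeats, not DMBT1 sequence.
TOY_ALIGNMENT = {
    "repeat_A": "ACGT-ACGTA",
    "repeat_B": "ACGTTACNTA",
    "repeat_C": "ACCTTACGTA",
}

if __name__ == "__main__":
    cleaned = complete_deletion(TOY_ALIGNMENT)
    print(cleaned)
    print(p_distance(cleaned["repeat_A"], cleaned["repeat_C"]))
```

Complete deletion removes a column for all sequences as soon as one sequence is gapped there, so every pairwise distance is computed over the same set of sites, at the cost of discarding informative positions when gaps are frequent.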
2.5 Growing of lymphoblastoid cell lines

The lymphoblastoid cell lines were grown in RPMI 1640 medium supplemented with 2 mM glutamine and 10% fetal calf serum, in a humidified incubator at 37 °C with 5% CO2. The cell lines were grown in suspension culture, with cells clumped in loose aggregates. The cell aggregates were dissociated by gently agitating the culture or by gentle pipetting. After three to four days, depending on cell growth, the cultures were subcultured with fresh medium. When the pH of a culture becomes acidic, the medium changes from red to yellow if phenol red is used as an indicator.

2.6 Genomic DNA extraction from lymphoblastoid cell lines

Genomic DNA was isolated by a standard phenol/chloroform extraction protocol (Sambrook & Russell, 2001) with minor modifications. All solutions were prepared from fresh solvents and marked separately for use in genomic DNA extraction only. Genomic DNA was isolated from lymphoblastoid cell lines by proteinase K digestion (20 mg/ml) at 55 °C for 2 hours, followed by phenol/chloroform extraction and ethanol precipitation.

2.7 Characterization of CNV1 region of DMBT1

Long-range PCR for CNV1 region of DMBT1

Previous studies showed three deletion hot spots in the DMBT1 gene: 5′ upstream, DMBT1 repeats 2-4 to 2-7 (deletion 2-4/2-7) and DMBT1 repeats 2-9 and 2-10 (deletion 2-9/2-10). The region spanning DMBT1 repeats 2-4 to 2-7 is called the CNV1 region of DMBT1. Constitutional DMBT1 deletions were assessed by long-range PCR with primers flanking the deletion (L1595 in intron 10 and L1591 in intron 25; reference sequence: GenBank AJ243211), as described previously (Sasaki et al., 2001, 2002). Long PCRs were performed in a total volume of 25 μl on a Veriti thermal cycler (ABI) with 1X long-range PCR buffer (from an 11.1X stock), 5 μM of each primer (forward and reverse), 0.6 U of Taq DNA polymerase (Kapa Biosystems), 0.03 U of Pfu DNA polymerase (Stratagene) and ng of good-quality DNA. Primer pairs outside the deleted region were used to co-amplify both the long allele and the deleted allele of DMBT1 (Table 3). Thermal cycling conditions were an initial denaturation at 94 °C for 1 min; a first stage of 20 cycles of 94 °C for 15 s and 68 °C for 10 min; a second stage of 12 cycles of 94 °C for 15 s and 68 °C for 10 min (plus 15 s/cycle); and a single final incubation at 72 °C for 10 min.
We used two control genomic DNAs, U87-MG and U343-MG (gift from Dr. Jan Mollenhauer, University of Southern Denmark, Denmark), to standardize the long PCR. U87-MG represents the intact configuration (largest allelic variant) and U343-MG carries the internal deletion (shortest allelic variant). 10 μl of PCR product was mixed with 2 μl of 6X loading dye and separated on a 0.8% agarose gel (0.5 μg/ml ethidium bromide) in 0.5X TBE buffer. 200 ng of size standard (HyperLadder I, BIOLINE) was run alongside the samples to check the size of the amplified PCR products.

Table 3: PCR primers used to amplify the deletion allele of CNV1.
Oligo name     Primer sequence (5′-3′)          Product size
DMBT1-L1595    CTGCTGAGCATTGCCTGTGTTCTA         Long allele: 16.9 kb
DMBT1-L1591    GTCATATCAGCTCTGAATAGAAAAGTGC     Deleted allele: 4.2 kb

PCR for the long allele for CNV1 region of DMBT1

Specific primers were designed and used to amplify the long allele of DMBT1 (Table 4). The PCR reactions contained 10 ng of DNA, 1X low dNTPs buffer, 5 μM of each primer (forward and reverse) and 0.5 U of Taq DNA polymerase (Kapa Biosystems) in a total volume of 10 μl. Cycling conditions were an initial denaturation at 94 °C for 4 minutes, followed by 35 cycles of 94 °C for 30 seconds, 63 °C for 30 seconds and 72 °C for 30 seconds, and a final extension at 72 °C for 10 minutes. The PCR products were separated on a 3% agarose gel (containing 0.5 μg/ml ethidium bromide) in 0.5X TBE buffer. 200 ng of size standard (HyperLadder I, BIOLINE) was run alongside the samples to check the size of the amplified PCR products.

Table 4: Primer sequences used to amplify the CNV1 long allele.
Oligo name     Primer sequence (5′-3′)     Product size
DMBT1-SR47r    TGAGTGCTTGACTGCAATTC        230 bp
DMBT1-SR47f    GGTTGACACAAAACCAACCC

Block-specific long PCR for CNV1 region of DMBT1

The shared PCR primers (Table 5) were designed to amplify different regions in order to identify non-variable and variable regions of the DMBT1 gene. Block-specific primers were designed by eye from multiple sequence alignments and amplified three regions of different fragment length within the SRCR/SID domains using long PCR.

Table 5: Primer sequences used for block-specific long PCR.
Oligo name    Primer sequence (5′-3′)        Product size
DMBT1-F3F     GGATGATGTGCGCTGCTCAGGACA       1240 bp, 1549 bp, 1119 bp
DMBT1-F3R     CTGGGGACTCACCTGGCCTG

Long PCRs were performed in a total volume of 25 μl on a Veriti thermal cycler (ABI) with 1X long-range PCR buffer (from 11.1X stock), 5 μM of each primer (forward and reverse), 0.6 U of Taq DNA polymerase (Kapa Biosystems), 0.03 U of Pfu DNA polymerase (Stratagene) and ng of good quality DNA. Thermal cycling conditions were an initial denaturation at 94 °C for 1 min; a first stage of 16 cycles of 94 °C for 15 s and 68 °C for 10 min; a second stage of 12 cycles of 94 °C for 15 s and 68 °C for 10 min (plus 15 s/cycle); and a single chase phase at 72 °C for 10 min. 10 μl of PCR product was mixed with 2 μl of 6X loading dye and separated on a 0.8% agarose gel (0.5 μg/ml ethidium bromide) in 0.5X TBE buffer. 200 ng of size standard (HyperLadder I) was run alongside the samples to check the size of the amplified PCR products.

Analysis of aCGH data for CNV1 of DMBT1

The array data generated for the CNV association study conducted by the Wellcome Trust Case Control Consortium (WTCCC) were used for analysis of CNV1 of DMBT1. A total of 12 probes covered the genomic region chr10: containing CNV1. These were analyzed for 269 individuals (HAPMAPPT01, HAPMAPPT2 and HAPMAPPT3 plates) from four geographically diverse populations: CEU, YRI, CHB and JPT. Principal component analysis (PCA) of the probe signals for the CNV1 region was conducted using CNVtools, a package written in the R programming language. A scree plot showing the percentage of variation in the array CGH data represented by each principal component was produced with a program written in R.

Designing of primers for CNV1 region of DMBT1

In a paralogue ratio test (PRT) assay, a single primer pair is used to amplify two genomic regions (a CNV region and a non-CNV region) in a single PCR reaction. The success of a PRT assay depends entirely on designing primers that amplify both regions with one primer pair. Designing PRT primers that give PCR products of different sizes is the most important and most difficult part of the assay. The genomic sequence of the DMBT1 gene (chr10:124,320, ,403,252) on the Human Feb (GRCh37/hg19) Assembly was retrieved from the UCSC Genome Browser, and repeated sequences (Alus) were masked with the RepeatMasker Web Server. The different repeat sequences were aligned using the multiple sequence alignment program Clustal Omega, and misaligned regions containing one or more indels were targeted for primer design for both the CNV1 and CNV2 regions.
All possible primer pairs for the different CNV regions of the DMBT1 gene were selected by eye. The UCSC In-Silico PCR tool, available on the UCSC Genome Browser, was used to check the specificity of the primers for the test and reference sequences. The primers and assays used to detect copy number at the two CNV regions are described in the following sections.
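The property that a PRT primer pair must satisfy (one pair, two templates, two distinguishable product sizes) can be illustrated with a minimal sketch. This is not the UCSC In-Silico PCR tool and not the procedure used in the thesis: the primers and templates below are short, invented sequences, and matching is exact-string only.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def amplicon_size(template, forward, reverse):
    """Length of the product a primer pair would give on `template`.

    Exact matching only; returns None if either primer site is absent.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return end + len(reverse_site) - start


# Hypothetical primers and toy templates (not DMBT1 sequence).
forward = "ACGTTGCA"
reverse = "GGATCTAC"  # binds the site GTAGATCC on the forward strand
reference = "TT" + "ACGTTGCA" + "A" * 20 + "GTAGATCC" + "GG"
test = "CC" + "ACGTTGCA" + "A" * 10 + "GTAGATCC" + "AA"

ref_size = amplicon_size(reference, forward, reverse)
test_size = amplicon_size(test, forward, reverse)
print(ref_size, test_size, ref_size - test_size)
```

A pair is a usable PRT candidate in this toy sense when both sizes are defined and differ by enough base pairs to be resolved by capillary electrophoresis.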

2.7.6 PRT assays for CNV1 of DMBT1

Two independent PRT assays, PRT1 and PRT2, were designed for the CNV1 region of DMBT1.

PRT1

The PRT1 assay was designed to estimate CNV1 copy number of the DMBT1 gene. All 13 SRCR/SID blocks were aligned and small misaligned regions were targeted for the PRT1 assay (Figure 16). The copy number variable (test) and non-variable (reference) SRCR/SID blocks were amplified using one pair of primers within the DMBT1 gene. The test product came from the 3rd SRCR/SID block and the reference product from the 7th SRCR/SID block (Table 6).

Table 6: Primer sequences used in the PRT1 assay for CNV1.
Primer                            Sequence (5′-3′)        Product size (bp)
Forward (5′ fluorescent label)    CTTGAGCCTTCATAAACC      Reference: 153
Reverse                           CTAAGGAATGTTCCACACT     Test: 143

PCR was carried out in a total volume of 10 μl containing 1 μl of 10X low dNTPs buffer, 5 μM of each primer (labelled forward and reverse), 0.5 U of Taq DNA polymerase (Kapa Biosystems), 1 μl of 10X KAPA Taq Buffer A (Kapa Biosystems) and 5-10 ng of template DNA. The thermal cycling conditions for PRT1 were an initial denaturation at 94 °C for 4 minutes, followed by 25 cycles of 94 °C for 30 seconds, 62 °C for 30 seconds and 72 °C for 30 seconds, and a final extension at 70 °C for 10 minutes.

Figure 16: Schematic illustration of the PRT1 assay. (A) Positions of the PRT1 PCR products shown in the UCSC Genome Browser on the Human Feb (GRCh37/hg19) Assembly. The sequences and product sizes of the test and reference PCR products are shown in red and green boxes respectively. (B) Alignment of the test and reference PCR products, with sequence differences indicated by dots. (C) Electropherogram of the test and reference loci of the PRT1 assay showing different copy numbers.

1.0 μl of PCR product was mixed with 0.1 μl of MapMarker 1000 size standard (BioVentures, Inc., USA) and 10 μl of HiDi formamide (Applied Biosystems, UK) before being size-separated by capillary electrophoresis on a 3130xl genetic analyser (Applied Biosystems, UK) according to the manufacturer's instructions. The 10 bp size difference between the reference and test products was sufficient for resolution by capillary electrophoresis. The peak area data were analysed using GeneMapper software v.3.7 (Applied Biosystems, UK).

PRT2

The PRT2 assay was a second PRT system developed to validate the CNV1 copy numbers obtained from the PRT1 assay (Figure 17). The same primer design strategy was applied as for PRT1, but different test and reference regions were targeted. The test product was amplified from the 5th SRCR/SID block and the reference product from the 1st SRCR/SID block (Table 7). The 5 bp size difference between the reference and test products was sufficient for resolution by capillary electrophoresis.

Table 7: Primer sequences used in the PRT2 assay for CNV1.
Primer                            Sequence (5′-3′)         Product size (bp)
Forward (5′ fluorescent label)    TCCACTGGGGTCACAGG        Reference: 229
Reverse                           CTACAGGGGAACACAAGAAC     Test: 234

The amounts of all PCR components were as for the PRT1 assay, except that 3 μM of each primer (labelled forward and reverse) was used in a total volume of 10 μl. The thermal cycling conditions for PRT2 were an initial denaturation at 94 °C for 4 minutes, followed by 24 cycles of 94 °C for 30 seconds, 63 °C for 30 seconds and 72 °C for 30 seconds, and a final extension at 70 °C for 10 minutes. An aliquot of 0.8 μl of PCR product was mixed with 0.1 μl of MapMarker 1000 size standard (BioVentures, Inc., USA) and 10 μl of HiDi formamide (Applied Biosystems, UK) before being size-separated by capillary electrophoresis as for PRT1.

Figure 17: Schematic illustration of the PRT2 assay. (A) Positions of the PRT2 PCR products shown in the UCSC Genome Browser on the Human Feb (GRCh37/hg19) Assembly. The sequences and product sizes of the test and reference PCR products are shown in red and green boxes respectively. (B) Alignment of the test and reference PCR products, with sequence differences indicated by dots. (C) Electropherogram of the test and reference loci of the PRT2 assay showing different copy numbers.

CNV1 data analysis after fractionation by capillary electrophoresis

For the PRT1 assay, peak areas corresponding to the 143 bp test fragment from the copy number variable CNV1 region and the 153 bp reference fragment from the non-variable region were recorded using Genotyper software (Applied Biosystems). For the PRT2 assay, peak areas corresponding to the 234 bp test fragment from the CNV1 region and the 229 bp reference fragment from the non-variable region were recorded using the same software. The peak area ratios 143 bp/153 bp (PRT1) and 234 bp/229 bp (PRT2) were calculated for CNV1 of DMBT1. The observed raw peak area ratios were compared with the known ratios of standard control samples, and the resulting least-squares linear regression was used to normalise the raw peak area ratios of unknown samples. Eight selected HapMap DNA samples were used as calibration standards in every PRT assay to minimize experimental variation. The normalised ratios from PRT1 and PRT2 were compared to assess the sensitivity and specificity of CNV1 copy number estimation.

Testing and quality control of capillary electrophoresis result of CNV1 of DMBT1

After capillary electrophoresis, the peak area data from the PRT1 (Figure 18) and PRT2 (Figure 19) assays were checked using GeneMapper software. Specific panels and bins were used to analyse the peaks from each PRT assay; the bins define the expected positions of the test and reference peaks. The peak areas of both the test and reference peaks were used to estimate copy number. The peaks at both loci were examined by eye and the peak area data were exported from GeneMapper. A set of eight reference DNA samples was included in every PRT experiment, and any discrepancy in the PRT ratios of these control DNAs was treated as experimental error. PRT assays with altered PRT ratios in the control DNA samples were repeated and new PRT ratios were calculated for copy number estimation. Samples with a peak area of more than 40,000 or less than 1000 at either the test or the reference locus were excluded from PRT ratio estimation and rerun.

Figure 18: An example of a GeneMapper electropherogram showing the test locus and the reference locus in the PRT1 assay. The genomic ratio of test:reference copies for PRT1 is shown to the left of the test peak.
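The peak area thresholds and the raw test/reference ratio described above can be sketched as follows. This is an illustration only, with invented peak areas; the function names are hypothetical and the thesis performed these steps in GeneMapper and by inspection rather than with this code.

```python
MIN_AREA = 1000    # peaks below this were rerun
MAX_AREA = 40000   # peaks above this were rerun


def passes_qc(test_area, ref_area):
    """True if both peak areas fall inside the accepted window."""
    return all(MIN_AREA <= area <= MAX_AREA for area in (test_area, ref_area))


def raw_prt_ratio(test_area, ref_area):
    """Raw test/reference peak area ratio, or None if the sample fails QC."""
    if not passes_qc(test_area, ref_area):
        return None
    return test_area / ref_area


# Invented peak areas: (sample label, test peak area, reference peak area)
samples = [("s1", 12000, 8000), ("s2", 45000, 8000), ("s3", 500, 8000)]
for label, test_area, ref_area in samples:
    print(label, raw_prt_ratio(test_area, ref_area))
```

Samples returning None would be re-amplified or rerun rather than used for copy number estimation.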

Figure 19: An example of a GeneMapper electropherogram showing the test locus and the reference locus in the PRT2 assay. The genomic ratio of test:reference copies for PRT2 is shown to the left of the reference peak.

Normalization of raw PRT data of CNV1 of DMBT1

The data from the two separate PRT assays were used to estimate copy number across CNV1 of DMBT1. The samples were normalized using eight known HapMap positive control samples (NA copies, NA copies, NA copies, NA copies, NA copies, NA copies, NA copies NA copies). Every experimental PCR plate included one no-template control and seven known DNA samples. The raw PRT ratios from both assays were normalized using the data from the known samples to overcome plate-specific biases. The normalized data from the PRT1 and PRT2 assays were compared, and the sensitivity and specificity of both assays were judged from the clustering of the data. Samples that showed discrepancies from the clusters were retyped before further analysis. The average raw ratio of the CNV1-specific PRT assays was calculated, compared with Agilent aCGH data, and used to estimate the integer copy number of the CNV1 region of DMBT1. The PRT ratios from PRT1 (Figure 20) and PRT2 (Figure 21) were normalised using the reference DNA samples. For the control DNA samples, the known (expected) PRT ratio was plotted on the X axis and the observed PRT ratio on the Y axis, and the correlation between known and observed ratios was calculated. Eight reference DNA samples were used to calibrate the data for each PRT assay. The normalised PRT ratios of unknown DNA samples were calculated from the linear regression of the reference/control DNA samples.
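The calibration step can be sketched as an ordinary least-squares fit of observed against known ratios in the control samples, followed by inversion of the fitted line for unknown samples. The control values below are invented, and reading "normalised based on the linear regression" as inversion of the fitted line is an assumption about the exact calculation.

```python
def fit_line(known, observed):
    """Least-squares slope and intercept of observed = intercept + slope * known."""
    n = len(known)
    mean_x = sum(known) / n
    mean_y = sum(observed) / n
    sxx = sum((x - mean_x) ** 2 for x in known)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(known, observed))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def normalise(raw_ratio, slope, intercept):
    """Map a raw peak area ratio back onto the calibrated ratio scale."""
    return (raw_ratio - intercept) / slope


# Invented calibration data for one plate: known test/reference copy
# ratios of control samples and the raw ratios observed for them.
known = [0.5, 1.0, 1.5, 2.0]
observed = [0.5, 0.9, 1.3, 1.7]

slope, intercept = fit_line(known, observed)
print(round(slope, 3), round(intercept, 3))
print(round(normalise(1.1, slope, intercept), 3))
```

Fitting the line separately for each plate is what lets the controls absorb plate-specific bias in the raw ratios.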

Figure 20: Calibration standard curve from reference DNA samples for PRT1 from a single PCR experiment. The linear regression is used to normalize the PRT1 ratios of unknown samples from that experiment.

Figure 21: Calibration standard curve from reference DNA samples for PRT2 from a single PCR experiment. The linear regression is used to normalize the PRT2 ratios of unknown samples from that experiment.

Comparison of PRT value and aCGH data of CNV1 of DMBT1

The copy number values obtained by PRT were compared with the WTCCC array CGH data. The first principal component (PC1) of the array CGH data was compared with the mean unrounded PRT ratio to examine copy number calling for CNV1 using the PRT assays.

Estimation of integer copy number for CNV1 region of DMBT1

Integer copy number was called by combining information from both assays (PRT1 and PRT2). The average of the normalized PRT1 and PRT2 ratios was used to estimate the integer copy number of CNV1. Integer copy numbers were inferred using a Gaussian mixture model implemented in the statistical language R (package CNVtools (Barnes et al., 2008)). CNVtools requires one-dimensional normalized data, one value per sample, for integer copy number estimation. The package was initially designed to analyze CNV data generated by the Wellcome Trust Case Control Consortium, which performed a genome-wide association scan for eight common disorders (Barnes et al., 2008). The number of components in the CNVtools analysis was determined from the histogram of average normalized PRT ratios for the CNV1 assays. The mean PRT ratios were transformed to have a standard deviation of 1, as recommended for CNVtools analysis. CNVtools called the copy number of each sample from its posterior probability of belonging to each cluster of CNV1 data. Finally, the actual integer copy number was assigned on the basis of previous aCGH data and revalidated with long-range allele-specific PCR.

2.8 Characterization of CNV2 region of DMBT1

Long PCR spanning CNV2 region

To design primers for the CNV2-specific long PCR, I obtained the genomic sequence of the DMBT1 region (chr10:124,320, ,403,252) from the Human Feb (GRCh37/hg19) Assembly in the UCSC Genome Browser. I aligned the different repeat blocks of the SRCR/SID domains and selected unique regions for designing primers for the CNV2 region of DMBT1. The forward primer was selected from the 5′ side of the CNV2 region and the reverse primer from the 3′ side (Table 8). The PCR amplified products of different sizes depending on the copy number of CNV2 of DMBT1.

Table 8: Long PCR primers used to amplify the CNV2 region.
Oligo name    Primer sequence (5′-3′)
SRCR2-DF      AACACCAGTTGTGTAACTGAGACCC
SRCR2-DR      GCCTTCCCAGGTGACTTTGGCAGTC
Products (bp): allele with complete deletion; allele with one copy of CNV; allele with two copies of CNV; allele with three copies of CNV

Long PCRs were performed in a total volume of 25 μl on a Veriti thermal cycler (ABI) with 1X long-range PCR buffer (from 11.1X stock), 5 μM of each primer (forward [SRCR2-DF] and reverse [SRCR2-DR]), 0.6 U of Taq DNA polymerase (Kapa Biosystems), 0.03 U of Pfu DNA polymerase (Stratagene) and 25-50 ng of good quality DNA. Thermal cycling conditions were an initial denaturation at 94 °C for 1 min; a first stage of 20 cycles of 94 °C for 15 s and 68 °C for 10 min; a second stage of 12 cycles of 94 °C for 15 s and 68 °C for 10 min (plus 15 s/cycle); and a single chase phase at 72 °C for 10 min. 10 μl of PCR product was mixed with 2 μl of 6X loading dye and separated on a 0.8% agarose gel (0.5 μg/ml ethidium bromide) in 0.5X TBE buffer for 3-4 hours. 200 ng of size standard (HyperLadder I) was run alongside the samples to check the size of the amplified PCR products.

Analysis of aCGH data for CNV2 of DMBT1

The array data generated for the CNV association study conducted by the Wellcome Trust Case Control Consortium (WTCCC) were used for analysis of CNV2 of DMBT1. A total of eighteen CNV2-specific probes covering almost 20 kb of genomic sequence were analysed. Principal component analysis (PCA) was performed and a scree plot produced using R, as described for the CNV1 analysis.

PRT for CNV2 of DMBT1

In total, three PRT assays (PRT3, PRT4 and PRT5) were designed to estimate copy number for the CNV2 region of DMBT1. In the figures, reference regions are annotated in green and test regions in red.

PRT3

The first assay designed to estimate copy number of CNV2 of the DMBT1 gene was PRT3 (Figure 22). The PRT3 primers were designed manually from multiple sequence alignments of SRCR/SID blocks, as for the PRT1 and PRT2 assays. The UCSC In-Silico PCR tool, available on the UCSC Genome Browser, was used to check the specificity of the primers for the test and reference regions. The test product was amplified from the 9-10th SRCR/SID blocks and the reference product from the 8th SRCR/SID block, close to the reference region of PRT1 (Table 9).

Table 9: Primer sequences used in the PRT3 assay for CNV2.
Primer                            Sequence (5′-3′)              Product size (bp)
Forward (5′ fluorescent label)    TGTGGTCACTTAGGACAGGGGACC      Reference: 198
Reverse                           CCTCACAGTGAGAGGATCCC          Test: 201

The 3 bp size difference between the reference and test products was sufficient for resolution by capillary electrophoresis.
The PCR reactions were performed with 5-10 ng of template DNA in a total volume of 10 μl containing 1 μl of 10X low dNTPs buffer, 3 μM of each primer (labelled forward and reverse), 0.5 U of Taq DNA polymerase (Kapa Biosystems) and 1 μl of 10X KAPA Taq Buffer A (Kapa Biosystems). Thermal cycling conditions were an initial denaturation at 94 °C for 4 minutes, followed by 24 cycles of 94 °C for 30 seconds, 62 °C for 30 seconds and 72 °C for 30 seconds, and a final extension at 72 °C for 10 minutes. A small amount (0.8 μl) of PCR product was mixed with 0.1 μl of MapMarker 1000 size standard (BioVentures, Inc., USA) and 10 μl of HiDi formamide (Applied Biosystems, UK), and the fragments were separated on a 3130xl genetic analyser (Applied Biosystems, UK) according to the manufacturer's instructions. The peak area data were analysed using GeneMapper software v.3.7 (Applied Biosystems, UK).

Figure 22: Schematic illustration of the PRT3 assay. (A) Positions of the PRT3 PCR products shown in the UCSC Genome Browser on the Human Feb (GRCh37/hg19) Assembly. The sequences and product sizes of the test and reference PCR products are shown in red and green boxes respectively. (B) Alignment of the test and reference PCR products, with sequence differences indicated by dots. (C) Electropherogram of the test and reference loci of the PRT3 assay showing different copy numbers.

PRT4

The test region was similar to that of PRT3 (9-11th SRCR/SID blocks), but four reference regions from different blocks were amplified (2nd, 8th, 12th and 13th SRCR/SID blocks) (Figure 23). The minimum size difference between the reference products and the test product was 3 bp, which can be resolved by capillary electrophoresis (Table 10).

Table 10: Primer sequences used in the PRT4 assay for CNV2.
Primer                            Sequence (5′-3′)         Product size (bp)
Forward (5′ fluorescent label)    GGTGTCATCTGCTCAGGTG      Reference: 267, 270, 279, 273
Reverse                           TCTTCTGCCTCTTCCTTGC      Test: 282

Figure 23: Schematic illustration of the PRT4 assay. (A) Positions of the PRT4 PCR products shown in the UCSC Genome Browser on the Human Feb (GRCh37/hg19) Assembly. The sequences and product sizes of the test and reference PCR products are shown in red and green boxes respectively. (B) Alignment of the test and reference PCR products, with sequence differences indicated by dots. (C) Electropherogram of the test and reference loci of the PRT4 assay showing different copy numbers.

PRT5

The genomic sequence of the DMBT1 region (chr10:124,320, ,403,252) on the Human Feb (GRCh37/hg19) Assembly was retrieved from the UCSC Genome Browser, and repeated sequences (Alus and L1s) were masked with the RepeatMasker Web Server. PRT primers were designed using PRTPrimer (Veal et al., 2013) on ALICE, a High Performance Computing (HPC) environment hosted by IT Services, University of Leicester. PRTPrimer, run on ALICE, provided all possible primer pairs for the different regions of the DMBT1 gene. The UCSC In-Silico PCR tool, available on the UCSC Genome Browser, was used to check the specificity of the primers for the test and reference sequences (Table 11). The test region was similar to that of PRT3 and PRT4 (9-11th SRCR/SID blocks), but the reference product was amplified from a non-CNV region on a different chromosome (chromosome 9) (Figure 24). The 8 bp size difference between the reference and test products was sufficient for resolution by capillary electrophoresis.

Table 11: Primer sequences used in the PRT5 assay for CNV2.
Primer                            Sequence (5′-3′)         Product size (bp)
Forward (5′ fluorescent label)    GATCACATCCTTCACAAGGT     Test: 159
Reverse                           CCAGTTGTCTGCTCTCAAAC     Reference: 167 (chr9)

The PCR reactions were performed with 5-10 ng of template DNA in a total volume of 10 μl containing 1 μl of 10X low dNTPs buffer, 3 μM of each primer (labelled forward and reverse), 0.5 U of Taq DNA polymerase (Kapa Biosystems) and 1 μl of 10X KAPA Taq Buffer A (Kapa Biosystems). Thermal cycling conditions were an initial denaturation at 94 °C for 4 minutes, followed by 24 cycles of 94 °C for 30 seconds, 62 °C for 30 seconds and 72 °C for 30 seconds, and a final extension at 72 °C for 10 minutes. A small amount (0.8 μl) of PCR product was mixed with 0.1 μl of MapMarker 400 size standard (BioVentures, Inc., USA) and 10 μl of HiDi formamide (Applied Biosystems, UK); fragment separation and data analysis were performed as for the PRT3 and PRT4 assays.

Figure 24: Schematic illustration of the PRT5 assay. (A) Positions of the PRT5 PCR products shown in the UCSC Genome Browser on the Human Feb (GRCh37/hg19) Assembly. The sequences and product sizes of the test and reference PCR products are shown in red and green boxes respectively. (B) Alignment of the test and reference PCR products, with sequence differences indicated by dots. (C) Electropherogram of the test and reference loci of the PRT5 assay showing different copy numbers.

CNV2 data analysis after fractionation by capillary electrophoresis

For the PRT3 assay, peak areas corresponding to the 198 bp test fragment from the copy number variable CNV2 region and the 201 bp reference fragment from the non-variable region were recorded (Figure 25). For the PRT4 assay, peak areas corresponding to the 282 bp test fragment from the CNV2 region and the 267 bp, 270 bp, 273 bp and 279 bp reference fragments from the non-variable region were recorded (Figure 26). The PRT4 assay amplified four reference regions from different SRCR blocks of DMBT1. The average peak area of the four reference regions was compared with each individual reference peak to check the consistency of the reference regions. The ratio of the 282 bp peak area to the mean peak area of the reference regions was used for CNV2 integer copy number estimation in the PRT4 assay. Eight selected HapMap DNA samples were used as calibration standards in every PRT assay to reduce experimental variation. A PRT5 assay was also designed for CNV2 copy number estimation. Its 159 bp test product was amplified from the CNV2 region, whereas its 167 bp reference product was amplified from chromosome 9. The peak area ratio of the test (159 bp) to the reference (167 bp) product was used to estimate CNV2 copy number (Figure 27).

Testing and quality control of capillary electrophoresis result of CNV2 of DMBT1

The peak area data from the PRT3, PRT4 and PRT5 assays were checked using GeneMapper software. Different panels and bins were used for each PRT assay, according to the product sizes of the test and reference loci and the labelled primers. Any PRT assay with unexpected PRT ratios in the control DNA samples was repeated to reduce experimental error. Peak areas were accepted for both the test and reference loci when they were more than 1000 and less than 40,000; a peak area of more than 40,000 was considered a saturated peak in GeneMapper.

Figure 25: An example of a GeneMapper electropherogram showing the test locus and the reference locus in the PRT3 assay. The genomic ratio of test:reference copies for PRT3 is shown to the left of the reference peak.
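For the PRT4 assay, the use of the mean of four reference peaks and the consistency check of each reference peak against that mean can be sketched as below. The peak areas are invented and the function names are hypothetical; this is an illustration, not the analysis code used in the thesis.

```python
from statistics import mean


def prt4_ratio(test_area, ref_areas):
    """Ratio of the test peak area to the mean of the reference peak areas."""
    return test_area / mean(ref_areas.values())


def reference_consistency(ref_areas):
    """Each reference peak area relative to the mean of all reference peaks.

    Values far from 1.0 flag a reference product that behaves differently
    from the others in this sample.
    """
    avg = mean(ref_areas.values())
    return {size: area / avg for size, area in ref_areas.items()}


# Invented peak areas keyed by fragment size in bp.
ref_areas = {267: 8000, 270: 8200, 273: 7800, 279: 8000}
test_area = 12000  # 282 bp test peak

print(prt4_ratio(test_area, ref_areas))
print(reference_consistency(ref_areas))
```

Averaging over four reference products makes the denominator less sensitive to amplification noise at any single reference block.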

Figure 26: An example of a GeneMapper electropherogram showing the test locus and the reference loci in the PRT4 assay. The genomic ratio of test:average reference copies for PRT4 is shown to the left of the reference peaks.

Figure 27: An example of a GeneMapper electropherogram showing the test locus and the reference locus in the PRT5 assay. The genomic ratio of test:reference copies for PRT5 is shown to the left of the test peak.

Normalization of raw PRT data of CNV2 of DMBT1

The samples were normalized using eight known HapMap positive control samples (NA copies, NA copies, NA copies, NA copies, NA copies, NA copies, NA copies, NA copies), obtained from the Coriell Institute. Every experimental PCR plate included one no-template control and seven known DNA samples, as in the CNV1 assays. The raw PRT ratios from the three PRT assays were normalized using the data from the known samples to overcome PCR-specific biases. The normalized data from the PRT3 (Figure 28) and PRT4 (Figure 29) assays were compared, and the sensitivity and specificity of both assays were judged from the clustering of the data. The PRT ratios from PRT3 and PRT4 were normalised using the reference DNA samples. For the control DNA samples, the known copy number ratio was plotted on the X axis and the observed PRT ratio on the Y axis, and the correlation (r²) between the known and observed raw ratios was calculated for both PRT3 and PRT4. A total of eight reference DNA samples were used to calibrate the raw PRT data. The normalised PRT ratios of unknown DNA samples were calculated from the linear regression of the reference/control DNA samples.

Figure 28: Calibration standard curve from reference DNA samples for PRT3 from a single PCR experiment. The linear regression was used to normalize the PRT3 ratios of unknown samples from that experiment.

Figure 29: Calibration standard curve from reference DNA samples for PRT4 from a single PCR experiment. The linear regression was used to normalize the PRT4 ratios of unknown samples from that experiment.

Comparison of PRT value and aCGH data of CNV2 of DMBT1

The average of the normalized PRT3 and PRT4 ratios was used to estimate the integer copy number of CNV2. The PRT copy number values were compared with array CGH data previously generated using the Agilent 210K CNV chip for the Wellcome Trust Case Control Consortium (WTCCC) in HapMap populations (CEU, YRI and CHB/JPT) (Conrad et al., 2010). Eighteen CNV2-specific probes covering a 20 kb genomic region were selected, and principal component analysis (PCA) of the probe signals was conducted for the CNV2 region using R as before (Barnes et al., 2008). The eigenvalue was plotted against the component number in a scree plot to identify the appropriate number of principal components. The first principal component (PC1) of the array CGH data was compared with the mean unrounded PRT ratio to examine copy number calling for CNV2 using the PRT assays.

Estimation of integer copy number for CNV2 region of DMBT1

The diploid copy numbers were inferred using a Gaussian mixture model implemented in the statistical language R (package CNVtools (Barnes et al., 2008)). CNVtools requires one-dimensional normalized data, one value per sample, for integer copy number estimation. The number of components for the mixture model was determined from the histogram of mean PRT ratios for CNV2. The mean PRT ratios were transformed to have a standard deviation of 1, as recommended for the CNVtools package. CNVtools called the copy number of each sample from its posterior probability of belonging to each cluster of CNV2 data. A Q score was estimated to check the clustering quality of the PRT values; a Q score of more than 4 indicated very good clustering. Finally, the actual integer diploid copy number was assigned on the basis of long-range PCR.

2.9 Designing of probes for physical mapping of DMBT1

Each SRCR block contains a single exon coding for an SRCR domain and two exons coding for a SID domain. The number of SRCR blocks (repeats) varies between human genomes, but 13 repeats are common in the haploid human genome. The genomic sequence of the DMBT1 region was downloaded from the Human Feb (GRCh37/hg19) Assembly in the UCSC Genome Browser. Both exonic and intronic sequences of the SRCR blocks were identified and named SRCR1 to SRCR13. The last SRCR block (SRCR13) differs from the others because it contains only an SRCR domain and no SID domain. A multiple sequence alignment of all SRCR blocks did not show a sufficiently similar region for primer design, mainly because the sequence of the first SRCR repeat (SRCR1) is more divergent than that of the other blocks. When SRCR1 was excluded from the multiple sequence alignment, the alignment quality was good.
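Returning briefly to the copy number calling used for CNV1 and CNV2: the idea of scaling the mean PRT ratios to unit standard deviation, fitting a one-dimensional Gaussian mixture and assigning each sample to the cluster with the highest posterior probability can be sketched as below. This is a simplified expectation-maximisation routine written for illustration, not the CNVtools implementation; the ratios, starting means, shared-variance assumption and all function names are assumptions, and the mapping from cluster to integer copy number is left to external evidence (aCGH or long PCR), as in the text.

```python
import math
from statistics import pstdev


def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def call_clusters(ratios, start_means, start_sd=0.3, n_iter=50, min_sd=0.05):
    """Fit a 1-D Gaussian mixture (shared SD) by EM and call clusters.

    Ratios are first divided by their standard deviation so that the data
    have SD 1; `start_means` must be given on that scaled axis.
    """
    scale = pstdev(ratios)
    data = [r / scale for r in ratios]
    means = list(start_means)
    k = len(means)
    weights = [1.0 / k] * k
    sd = start_sd
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each sample.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, sd) for w, m in zip(weights, means)]
            total = sum(dens) or 1e-300  # guard against division by zero
            resp.append([d / total for d in dens])
        # M-step: update weights, means and the shared SD.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            if nj > 0:
                means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - means[j]) ** 2
                  for r, x in zip(resp, data) for j in range(k)) / len(data)
        sd = max(math.sqrt(var), min_sd)
    calls = [max(range(k), key=lambda j: r[j]) for r in resp]
    posteriors = [max(r) for r in resp]
    return calls, posteriors, means


# Invented mean PRT ratios forming three clusters.
ratios = [0.48, 0.52, 0.50, 0.97, 1.03, 1.00, 1.01, 1.49, 1.52, 1.47]
# Hypothetical starting means, as if read from a histogram of scaled ratios.
calls, posteriors, means = call_clusters(ratios, [1.3, 2.6, 3.9])
print(calls)
```

Samples with a low maximum posterior probability would be candidates for retyping before an integer copy number is assigned.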
Based on the regions of high identity across the SRCR blocks, four regions that covered the maximum number of SRCR repeats were selected manually. Two pairs of PCR primers were designed for the SRCR repeats to give a product approximately 4 kb long (Table 12). The smaller PCR products (1.5 to 2 kb) were amplified easily using 11.1X long-range PCR buffer. The primers were designed with degenerate nucleotides to amplify the maximum number of SRCR repeats. The primer design strategy for the Fibre-FISH probes and details of the primer sequences are given below (Figure 30).

80 Figure 30: Strategic picture of primer design for synthesis of Fibre-FISH probes. A) Tandem repeated SRCR domains are indicated and numbered. B) Single SRCR domain is indicated in cartoon with primer location. Each SRCR domain was amplified using two pairs of primers. Primers for first and second regions are indicated with red and green arrows respectively. Alignments of primers binding are shown where aligned nucleotides are indicated in yellow and variable nucleotides are highlighted in red. Table 12: Primer sequences for amplification of DMBT1 probes Synthesis of DMBT1 probe Primer name Primer sequence (5-3 ) DMBT1-P1F CTGAGGCTGGTGAATGGA DMBT1-P1R TATCCCTYTCCCTGCCCRAGCA DMBT1-P2F TCAGCAATGGYRTCWGATGT DMBT1-P2R CTACAGGGGAACACAAGA Both primer pairs were used to generate the fibre-fish probes. The first primer pair (DMBT1- P1F and DMBT1-P1R) amplified the 5 region of SRCR blocks and second primer pair (DMBT1- P2F and DMBT1-P2R) amplified the 3 region of SRCR blocks. Long PCRs were performed in a total volume of 25μl reactions on a Veriti thermal cycler (ABI) with 1X of 11.1X long range PCR buffer, 5μM of each primer (forward and reverse primer), 0.6U of Taq DNA Polymerase (Kapa Biosystems) and 0.03U Pfu DNA polymerase (Stratagene) and ng of good quality DNA. Long PCR reactions were performed with thermo cycler conditions of an initial denaturation of 94 C for 1 min, a first stage consisting of 16 cycles each of 94 C for 15s and 68 C for 10 min, and a second stage consisting of 12 cycles each of 94 C for 15s and 68 C for 10 min (plus 15s/cycle); these were followed by a single chase phase of 72 C for 10 min. 25μl of PCR 58

product was mixed with 5 µl of 6X loading dye and separated on a 0.8% agarose gel (0.5 μg/ml ethidium bromide) using 0.5X TBE buffer. 200 ng of size standard (HyperLadder I, Bioline) was run alongside the samples to check the size of the amplified PCR bands. The PCR fragments of both primer pairs were excised from the agarose gel and purified using the QIAquick Gel Extraction Kit.

Fibre-FISH molecular combing methods

Fibre-FISH was performed by Dr. Sandra Gomes and Fengtang Yang at the Wellcome Trust Sanger Institute, Hinxton, UK. The human lymphoblastoid B-cell lines used in this study were purchased from the Coriell Institute for Medical Research and consist of 2 trios from the NIGMS Human Genetic Cell Repository collection: family Y045 (GM19200, GM19201 and GM19202) and family 1447 (GM12762, GM12763 and GM12753). The probes consisted of PCR products from the SRCR repetitive regions of the DMBT1 gene, a fosmid clone (G248P8718G1) and a BAC clone (RP11-144H6), selected from the UCSC Genome Browser (GRCh37/hg19 assembly). BAC DNA was purified using a PhasePrep BAC DNA kit (Sigma-Aldrich) and amplified using a GenomePlex Whole Genome Amplification (WGA) kit (Sigma-Aldrich), in both cases following the manufacturer's protocol. SRCR products were labelled with digoxigenin-11-dUTP (Roche) and DNP-11-dUTP (Perkin Elmer), the G248P8718G1 clone was labelled with digoxigenin-11-dUTP (Roche) and the RP11-144H6 clone was labelled with biotin-16-dUTP (Roche), using a modified WGA re-amplification kit (Sigma-Aldrich) as described before (Gribble et al., 2013). Single DNA molecule fibres were prepared by the Molecular Combing method (Michalet et al., 1997) following the manufacturer's instructions (Genomic Vision).
Briefly, the cells were washed in 1× PBS (Invitrogen) and embedded in 1.2% LMP agarose (Lonza) plugs (1 million cells/plug), followed by overnight incubation (16-18 hours) at 50 °C with digestion solution [8:1:1, 0.5 M EDTA: 10% sarkosyl: proteinase K (Ambion)]. The next day the plugs were washed with 1× TE (10 mM Tris, 1 mM EDTA, pH 8.0), stained with YOYO-1 (YOYO-1 Iodide, 1 mM solution in DMSO; Molecular Probes, Life Technologies), transferred to 0.5 M MES buffer (pH 5.5), melted at 70 °C for 25 minutes and incubated overnight (16-18 hours) at 42 °C with β-agarase enzyme (BioLabs). The following day coated coverslips (Genomic Vision) were soaked in the DNA solution and pulled out of the solution at a constant vertical speed using a molecular combing system (Genomic Vision), which produces single-molecule DNA fibres with a constant stretching factor (2 kb = 1 μm). The coverslips were then baked for 4 hours at 68 °C.

The probe mix was denatured at 65 °C for 10 minutes before being applied onto the coverslips, and hybridisation was carried out in a 37 °C incubator overnight. Post-hybridisation washes consisted of two rounds of 50% (v/v) formamide/2× SSC (0.30 M sodium citrate, M NaCl, pH 7.0) followed by two additional washes in 2× SSC. All washes were done at 25 °C for 5 minutes each. Digoxigenin-labelled probes were detected using a 1:100 dilution of monoclonal mouse anti-DIG antibody (Sigma-Aldrich) and a 1:100 dilution of Texas Red-X-conjugated goat anti-mouse IgG (Invitrogen); DNP-labelled probes were detected using a 1:100 dilution of Alexa 488-conjugated rabbit anti-DNP IgG and a 1:100 dilution of Alexa 488-conjugated donkey anti-rabbit IgG (Invitrogen); biotin-labelled probes were detected with one layer of a 1:100 dilution of Cy3-avidin (Sigma-Aldrich). After detection, slides were mounted with SlowFade Gold mounting solution containing 4',6-diamidino-2-phenylindole (Invitrogen). Images were visualised on a Zeiss AxioImager D1 fluorescent microscope equipped with narrow band-pass filters for DAPI, FITC, Cy3 and Texas Red fluorescence and an ORCA-EA CCD camera (Hamamatsu). Digital images were captured and processed using the SmartCapture software (Digital Scientific UK).

Analysis of fibre-FISH

DNA fibres were captured for every sample. The lengths of the reference and test regions were measured independently using ImageJ software (Abràmoff et al., 2004). Only well-stretched fibres were used for further analysis; broken fibres were excluded. To reduce measurement error, the labelled fibre lengths were measured independently by two individuals (Dr. Ed Hollox and Shamik Polley). The numbers of red and green spots, which represent the first and second parts of each SRCR domain, were also counted by eye. The lengths of the labelled DMBT1 regions (test) were calibrated against the size of the reference fosmid clone, labelled in (colour), which is 40 kb in size.
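The calibration described above amounts to a simple proportion between the test and reference signals on the same fibre. The sketch below illustrates it under stated assumptions: the function names are hypothetical, lengths are in any consistent unit (for example ImageJ pixels), and averaging the two observers' values is an assumption about how the independent measurements might be combined, not a description of the procedure actually used.

```python
def calibrate_fibre_kb(test_length, reference_length, reference_kb=40.0):
    """Convert a measured test-signal length to kb.

    The reference fosmid signal on the same fibre is taken to span
    `reference_kb` (40 kb in the text), so the test length is scaled by
    reference_kb / reference_length.
    """
    if reference_length <= 0:
        raise ValueError("reference length must be positive")
    return test_length / reference_length * reference_kb


def mean_of_observers(first, second):
    """Average two independent measurements of the same fibre (assumption)."""
    return (first + second) / 2.0


if __name__ == "__main__":
    # Hypothetical pixel measurements, for illustration only.
    reference_px = mean_of_observers(98.0, 102.0)
    test_px = mean_of_observers(248.0, 252.0)
    print(calibrate_fibre_kb(test_px, reference_px), "kb")
```

Calibrating each fibre against its own reference signal avoids relying on the nominal stretching factor, which can vary slightly between fibres.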
The scatter plots of all labelled fibre length measurements for both families were plotted using R (v ).

Sample Preparation for PFGE

For pulsed field gel electrophoresis, liquid genomic DNA samples were used.

Genomic DNA extraction from lymphoblastoid cell lines

Genomic DNA was isolated by a standard phenol/chloroform extraction protocol (Sambrook and Russell, 2001). All the solutions were prepared from fresh solvents and marked as genomic DNA only. Genomic DNA was isolated from lymphoblastoid cell lines by proteinase K

digestion (20 mg/ml) at 55 °C for 2 hours, followed by phenol/chloroform extraction and ethanol precipitation.

Digestion of liquid DNA Samples

The DMBT1 gene spans a genomic region of about 83 kb on the Human (GRCh37/hg19) assembly. The fragment sizes after restriction (Sca I) digestion of DMBT1 were expected to be between 40 and 200 kb, depending on the CNV1 and CNV2 copy number of DMBT1. The high quality DNA was stored at 4 °C to avoid multiple freeze-thaws. µg of gDNA was digested overnight using units of Sca I (NEB). After overnight digestion at 37 °C, the enzyme was inactivated at 80 °C for 20 minutes. Complete digestion of gDNA was confirmed by agarose gel electrophoresis, based on the presence of uniform smearing with no compact, high-molecular-weight band.

Pulsed field Gel Electrophoresis conditions

A 1% (w/v) agarose gel with a thickness of approximately 5-6 mm was used to separate the long, digested DNA fragments in pulsed field gel electrophoresis. For liquid samples the digested DNA was added to the wells using pipette tips with large openings. A thin well comb (0.75 mm) gave better resolution and sharper bands than a thick well comb (1.5 mm). For agarose blocks a thick well comb (1.5 mm) was used to make room for the agarose plug, and 2/3 of the digested plug was loaded in each well. The rest of the well was filled with 1% agarose solution and allowed to solidify. The same ladder was used for both liquid samples and agarose plug samples (MidRange I PFG Marker, NEB). About 1 mm of ladder gel was used for a single well and placed at the bottom of the well using a spatula. The well was sealed using 1% agarose. In PFGE, DNA mobility is sensitive to changes in buffer temperature. As the buffer temperature increases, the mobility of the DNA increases, but band sharpness and resolution decrease. The buffer was chilled to 14 °C during the PFGE run to maintain band sharpness and to dissipate heat generated during prolonged runs.
Standard Tris-borate-EDTA (TBE), at a concentration of 0.5X, is the most commonly used buffer in pulsed field electrophoresis. In pulsed field electrophoresis, DNA molecules are subjected to alternating electric fields, each imposed for a period called the switch time. Each time the field is switched, the DNA molecules must change direction, or reorient, in the gel matrix. Larger molecules take longer to reorient and therefore have less time to move during each pulse, so they migrate more slowly than smaller molecules. DNA migration increases with increases in voltage or field strength. However, greater migration is accompanied by decreased band sharpness. In general, as the size of the DNA molecules increases, the field strength should decrease. Decreasing the included angle

will decrease the resolution of smaller DNAs by causing them to pile up on each other. The CHEF-DR III system (Clamped Homogeneous Electric Fields; Bio-Rad) was used for the copy number study of DMBT1, using the conditions recommended by Bio-Rad with only small modifications (Table 13).

Table 13: The conditions used to resolve DMBT1 restriction fragments after Sca I digestion

% Agarose           1%
Buffer              0.5x TBE
Temperature         14 °C
Voltage             6 V/cm
Pulse parameters    1-10 sec
Run time            16 hours
Angle

Southern blot analysis

Gel depurination, denaturation and neutralization

After fractionation by CHEF electrophoresis, the gel was stained with 0.5 mg/ml ethidium bromide (Sigma-Aldrich) and photographed. The DNA was nicked by exposure to UV light for 5 minutes. The DNA was partially depurinated by placing the gel in 300 ml depurination solution (0.25 M HCl) with gentle shaking for 10 min. The gel was denatured in 300 ml denaturation solution (0.5 M NaOH/1.5 M NaCl) for 30 min and neutralized in 400 ml neutralization solution (1.5 M NaCl/0.5 M Tris) for 30 min with gentle shaking.

Transfer of DNA to membrane

The gel was covered with a presoaked nylon membrane and rolled with a serological pipette to remove air bubbles. The DNA was blotted for h onto the nylon membrane by ascending capillary transfer, using 20× SSC (3 M sodium chloride and 300 mM trisodium citrate; pH 7.0) as the transfer buffer. After blotting, the membrane was carefully transferred onto Whatman paper and baked for 15 min at 80 °C. The membrane was placed in a UV cross-linker (Hoefer UVC 500 Ultraviolet Crosslinker) with the DNA side facing upward, and the DNA was cross-linked onto the membrane at µJ/cm² for a few seconds.

Synthesis of DMBT1 probe

The probe used in Fibre-FISH was also used for PFGE. The purified PCR products were labelled with a radioactive isotope (32P) and used as a probe for hybridization experiments.

Probe labeling and recovery

Both the DMBT1-specific and PFG marker-specific probes were labelled with α-32P-dCTP (1000 Ci/mmol). 100 ng of DMBT1-specific probe and 20 ng of purified λ DNA were mixed separately in two tubes with 1.2 µl of BSA (10 mg/ml, Sigma-Aldrich), 1 µl of Klenow fragment (5 U/µl), 6 µl of OLB (oligo labelling buffer) and 1.5 µl of α-32P-dCTP (1000 Ci/mmol), and incubated overnight. The labelled probes were recovered by ethanol precipitation with 70 µl of oligo stop solution (OSS), 30 µl of salmon sperm DNA (3 mg/ml, Sigma-Aldrich), 30 µl of sodium acetate (3 M, pH 5.2) and 425 µl of ethanol (100%).

Hybridization

The membrane was incubated overnight in 10 ml hybridization solution (7% SDS, 0.5 M Na2PO4 pH 7.2, 1 mM EDTA) containing 600 μl of DMBT1-specific gene probe and 30 μl of PFG ladder probe in a rotating hybridization oven at 65 °C.

Washing blot

After overnight incubation the membrane was washed once in 30 ml of Phosphate Wash Solution (40 mM NaHPO4, 0.5% SDS) and twice in 50 ml of High Stringency Wash Solution (0.1x SSC, 0.01% SDS) on a rotational shaker for 15 min.

Preparing blot for exposure

The blot was wrapped in Saran Wrap, placed into an appropriately sized cassette with an intensifying screen and kept in the -80 °C freezer for 2-3 weeks, depending on the strength of the signal.

Autoradiography

The cassette was warmed to room temperature. In the dark room, the x-ray film (Fuji RX100) was dipped into the developing solution (RG X-Ray Developer, Champion Photochemistry) for a time that depended on the strength of the signal, to visualize the band pattern. The developing reaction was terminated using STOP solution (1% acetic acid), and the film was then dipped in the final tank containing fixer (RG X-Ray Fixer, Champion Photochemistry) until it appeared clear.

Analysis of DNA fragment size after PFGE

Samples of different copy number for DMBT1 were selected for fragment size analysis.
The migration distances of the DNA fragments were measured using ImageJ software. The standard curve was plotted using DNA fragments of

known size and the distance migrated by each of these DNA fragments. The sizes of unknown DNA fragments were determined from the standard curve of the PFGE ladder.

Estimation of allele and genotype frequency for mutation estimation

The diploid copy number of the DMBT1 CNV1 and CNV2 regions was estimated using PRT assays, without any information about the genotype status of either CNV. The population allelic spectrum, the allelic copy number frequency distribution and the expected copy number genotype and class distribution were estimated using the web-based program CoNVEM (Gaunt et al., 2010). CoNVEM uses an expectation maximization approach for the analysis of CNV data under Hardy-Weinberg equilibrium (HWE). I used the total count of each observed copy number class for CNV1 and CNV2 as input, and allele and genotype frequencies were calculated under HWE.

Simple tandem repeats (STR) analysis

Two simple tandem repeat (STR) assays were designed to study the segregation of variable copy number haplotypes from parents to children in the well-established CEPH families.

DMBT1-m1

DMBT1-m1 is a tetranucleotide (TTTC) repeat marker containing a dinucleotide (TC) repeat unit within it. The repeats cover 165 bp (chr10: ) on the Human Feb (GRCh37/hg19) assembly on the UCSC Genome Browser. The region amplified is outside the CNV regions of the DMBT1 gene (3' of the SRCR region).

Primer Design and PCR amplification

The genomic sequence of the repeat region, along with 100 bases from both ends, was retrieved from the UCSC Genome Browser. Primers (Table 14) were designed using Primer3 software, and the genomic location of the PCR product was checked using the UCSC In-Silico PCR tool available on the UCSC Genome Browser.

Table 14: Primer sequences used to amplify DMBT1-m1 for STR analysis.
STR         Primer sequence (5'-3')                                            Product size (bp)
dmbt1-m1    5' fluorescently labelled (HEX): CTCCAGAGGGTAGGGTATCTGCTCTG        282
            Reverse: GTGACAGAGCGAGACTCCATGTC

5-10 ng of template DNA was amplified in a total volume of 10 µl using 1 µl of 10X low dNTPs buffer, 5 µM of each primer (labelled forward and reverse primer) and 0.3 U Taq DNA

Polymerase (Kapa Biosystems) on a Veriti thermal cycler (ABI). The cycling conditions were an initial denaturation at 94 °C for 4 minutes, followed by 25 cycles of 94 °C for 30 seconds, 63 °C for 30 seconds and 72 °C for 30 seconds, followed by a final extension at 72 °C for 10 minutes.

Analysis of PCR products of DMBT1-m1

1.0 μl of PCR product was mixed with 0.1 μl MapMarker 1000 size standard (BioVentures, Inc., USA) and 10 μl HiDi formamide (Applied Biosystems, UK), and fragments were separated on a 3130xl genetic analyser (Applied Biosystems, UK) as mentioned before ( ).

DMBT1-m2

For DMBT1-m2 microsatellite analysis, a simple tandem repeat with a repeat unit of eight bases (TGCTGCTG) was amplified. The repeats cover 202 bp (chr10: ) on the Human Feb (GRCh37/hg19) assembly on the UCSC Genome Browser. The region amplified is outside the DMBT1 gene, just after its last exon.

Primer Design and PCR amplification

The primers were designed using Primer3 software, and the genomic location of the PCR product was checked using the UCSC In-Silico PCR tool available on the UCSC Genome Browser. The details of the primers used for DMBT1-m2 are given below (Table 15). The PCR conditions and ABI conditions were the same as those used for DMBT1-m1.

Table 15: Primer sequences used to amplify DMBT1-m2 for STR analysis.

STR         Primer sequence (5'-3')                                  Product size (bp)
dmbt1-m2    5' fluorescently labelled (FAM): GGGCACAAGCTATGTCAC      275
            Reverse: CATTCATTCCCTCGTCCATGC

2.18 Detection of de novo mutation

Different combinations of haplotypes can be generated by Mendelian segregation of structurally variable alleles. Most copy number measurements estimate the total copy number per diploid genome (diplotype) without any information about the underlying genotype. Segregation analysis in families is a powerful method to determine copy number genotypes.
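The logic of inferring copy number genotypes from segregation can be sketched as follows. This is a simplified illustration for a single CNV only: it enumerates the ways each parent's diploid copy number could be split into two haplotypes and keeps the combinations that can produce every child's diploid copy number. The function names and the `max_allele` cap are hypothetical, and the sketch ignores the flanking STR markers and the joint CNV1/CNV2 haplotypes that the full analysis relies on.

```python
from itertools import product


def splits(diploid_cn, max_allele=10):
    """All unordered haplotype pairs (a, b), a <= b, with a + b == diploid_cn."""
    return [(a, diploid_cn - a)
            for a in range(diploid_cn // 2 + 1)
            if diploid_cn - a <= max_allele]


def consistent_parental_genotypes(father_cn, mother_cn, child_cns, max_allele=10):
    """Parental genotype pairs under which every child is a Mendelian outcome."""
    results = []
    for father, mother in product(splits(father_cn, max_allele),
                                  splits(mother_cn, max_allele)):
        # Each child inherits one haplotype from each parent.
        possible = {f + m for f in father for m in mother}
        if all(cn in possible for cn in child_cns):
            results.append((father, mother))
    return results


if __name__ == "__main__":
    # Hypothetical family: both parents carry 2 copies; children carry 0, 2 and 4.
    print(consistent_parental_genotypes(2, 2, [0, 2, 4]))
    # An empty list flags a family that no Mendelian combination explains,
    # i.e. a candidate de novo event or a copy number typing error.
    print(consistent_parental_genotypes(2, 2, [5]))
```

Families with many children are the most informative, because each additional child rules out more of the candidate parental genotypes.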
The DMBT1 copy number genotypes and the combinations of CNV1 and CNV2 haplotypes were determined by Mendelian segregation analysis from parents to offspring, using four loci from different regions of the DMBT1 gene. The copy number distribution in HapMap and HGDP

samples showed that both CNV1 and CNV2 were multiallelic and that CNV2 varies widely, from 0 to 10 copies per diploid human genome. The pedigrees were drawn using the package kinship2 (v ) in the R project for statistical computing, without segregation of CNV haplotypes. The four possible combinations of haplotypes in the parental generation were predicted using information from both CNVs and the two STRs in a large number of offspring.

Estimation of mutation rate

The total number of de novo copy number mutations for CNV1 and CNV2 was identified across all CEPH families, and the mutation rate was estimated. CNV loci that did not follow the parental haplotypes were considered de novo copy number mutations. We took a conservative approach and called as de novo copy number changes only those events that could not be explained by a crossover in a non-CNV region in one of the parental haplotypes. This is a conservative assumption, as some of the excluded events may be genuine crossovers within the CNV region that cause copy number changes. The mutation rate was estimated with binomial 95% confidence intervals, calculated by the Clopper-Pearson method using the package binom (v ) in the R project for statistical computing.

Worldwide distribution of CNV1 and CNV2 copy number of DMBT1 gene

The distributions of integer copy number of CNV1 and CNV2 in the H971 subset of the HGDP-CEPH panel are shown as bar plots and in table form. The distribution patterns are shown for different geographical regions and also at the population level within geographical regions. The global copy number distribution was drawn using the package rworldmap (v ) written in R (v ). The package produces a world map with pie charts placed according to longitude and latitude coordinates. The size of each pie indicates the sample size.

Relationship of copy number variation in HGDP populations

The HGDP-CEPH populations showed different patterns of copy number variation in the CNV1 and CNV2 regions.
Populations with low copy number at the CNV1 region show higher copy number at the CNV2 locus. The pattern of copy number changes is shown at the individual level using a scatter plot in which the mean unrounded copy number value of CNV1 is compared with the mean unrounded copy number value of CNV2, drawn using R (v ). A similar plot was also drawn for the populations and geographical regions studied in the HGDP-CEPH panel.
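For the binomial confidence interval attached to the mutation rate estimate (section on estimation of mutation rate above), the Clopper-Pearson interval can be sketched without any statistics library by inverting the binomial distribution numerically. This is an illustrative re-implementation under stated assumptions, not the binom package itself; the function names are hypothetical and the counts in the usage example are invented.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(func, target, iterations=200):
    """Bisection for func(p) == target on [0, 1], with func decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if func(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k events in n transmissions."""
    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 7 de novo events in 500 transmissions.
    events, transmissions = 7, 500
    low, high = clopper_pearson(events, transmissions)
    print(f"rate = {events / transmissions:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
```

The exact interval is conservative, which suits a rare-event estimate such as a per-generation mutation rate; for very large counts a library routine based on the beta distribution would be more efficient than this direct summation.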

2.22 Analysis of Pathogen Richness

Pathogen absence/presence matrices were constructed on the basis of the Gideon database for the 21 countries where the HGDP populations are located, as described previously (Fumagalli et al., 2009, 2010, 2011a). The Gideon database is updated weekly from WHO reports, National Health Ministries, PubMed searches and epidemiology meetings. The Gideon Epidemiology module also follows the status of known infectious diseases globally, as well as in individual countries, indicating each disease's history, incidence and distribution per country. For each HGDP population, pathogen diversity was calculated from these data. Only species or genera that are transmitted in the 21 countries were taken into account; cases of transmission caused by tourism and immigration were not. Species that had recently been eradicated, for example as a result of vaccination campaigns, were recorded as present in the matrix. Malaria prevalence was obtained from both the Gideon and World Health Organization (WHO) databases as previously described (Pozzoli et al., 2010). I calculated Kendall's τ rank correlation coefficient between the mean copy number of each CNV within the DMBT1 gene and pathogen diversity in the HGDP-CEPH populations. Kendall's τ is a non-parametric statistic that measures the degree of correspondence between two variables; it does not take the demographic history of human populations into account. A few HGDP-CEPH populations have been affected by recent or ancient admixture (Li et al., 2008) and by population history, migratory events and genetic drift (Handley et al., 2007). Partial Mantel tests were therefore used to calculate correlations that account for the demographic history of human populations, using the distance between populations as a co-matrix. Distances from Africa were calculated previously by Handley et al.
using a model of human migration that progressed from east Africa along landmasses while avoiding mountain regions with altitudes over 2,000 m, and were used in our analysis to account for human demographic history (Handley et al., 2007). The matrices were computed as pairwise Euclidean distances in variant frequency, distance from east Africa, and pathogen diversity or malaria prevalence (from either the WHO or the Gideon database). Partial Mantel correlations were performed with the R package VEGAN (v ). The statistical significance of the correlation tests within continental regions was assessed by performing 10,000 permutations of pathogen diversity or malaria prevalence. The HGDP populations were grouped as suggested previously (i.e., Africa, Europe, America, central South Asia, East Asia, and Oceania), and Middle Eastern populations were grouped with Europeans (Li et al., 2008).
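The idea behind the partial Mantel test can be sketched in a few lines. The version below is an illustrative re-implementation under stated assumptions, not the VEGAN routine: it uses Pearson correlation on the upper triangles of the distance matrices, removes the effect of the third (control) matrix by partial correlation, and assesses significance by permuting the rows and columns of the first matrix. All function names are hypothetical and the population values in the usage example are invented.

```python
import math
import random


def distance_matrix(values):
    """Pairwise Euclidean distances between one-dimensional population values."""
    return [[abs(a - b) for b in values] for a in values]


def upper_triangle(matrix):
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def partial_mantel(x, y, z, permutations=999, seed=0):
    """Return (partial Mantel statistic, one-sided permutation p-value)."""
    rng = random.Random(seed)
    yv, zv = upper_triangle(y), upper_triangle(z)
    observed = partial_r(upper_triangle(x), yv, zv)
    n, hits = len(x), 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of x together so it stays a distance matrix.
        permuted = [[x[i][j] for j in order] for i in order]
        if partial_r(upper_triangle(permuted), yv, zv) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    # Invented values for five populations, for illustration only.
    pathogen_diversity = [210.0, 190.0, 260.0, 240.0, 300.0]
    mean_copy_number = [2.1, 2.4, 3.0, 3.3, 3.9]
    distance_from_africa = [1.0, 4.5, 7.2, 12.8, 20.1]
    stat, p_value = partial_mantel(distance_matrix(pathogen_diversity),
                                   distance_matrix(mean_copy_number),
                                   distance_matrix(distance_from_africa))
    print(f"partial Mantel r = {stat:.3f}, p = {p_value:.3f}")
```

Permuting whole populations, rather than individual pairwise distances, respects the non-independence of the entries in a distance matrix, which is the reason a Mantel-type test is used here instead of an ordinary correlation test.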

2.23 Analysis of DMBT1 copy number variation due to human life style adaptations

Since evolving in Africa approximately a kya (White et al., 2003), modern human populations have adopted a broad range of habitats and a variety of subsistence modes. We observe a wide range of physiological and morphological variation in human populations, some of which was undoubtedly shaped by genetic adaptations to local environments. However, identifying the genetic variants (SNPs/CNVs) underlying adaptive phenotypes is challenging, as the current patterns of human genetic variation result not only from selective but also from demographic processes. Adaptation to diet was investigated by examining correlations between genetic variation and different subsistence strategies among populations (Hancock et al., 2010). Subsistence strategy data for human populations were collected from Murdock's Ethnographic Atlas (1967). Subsistence strategies are quantified as the relative amount of human activity spent in agriculture, animal husbandry, fishing, and hunting/gathering (Fumagalli et al., 2011a). The 4 strategies sum to 1 in each population, so they are not independent. The correlations between genetic variation (CNV1, CNV2 and total SRCR data) and the different subsistence strategies (agriculture, animal husbandry, fishing, and hunting/gathering) of HGDP populations were examined using partial Mantel tests. To determine the sign of each association, I also performed linear regression, using distance from Africa (dfa) as a covariate.

Isolation of Genomic DNA from buccal cells

DNA was isolated from buccal cells with a rapid method using proteinase K digestion, phenol-chloroform extraction and ethanol precipitation. The buccal cells collected in mouthwash were pelleted by centrifugation at 1500g for 15 minutes and washed with 1x TE buffer (10 mM Tris-Cl [pH 8.0], 1 mM EDTA [pH 8.0]).
The cell pellet was resuspended in 345 µl lysis buffer (100 mM NaCl, 100 mM Tris-Cl [pH 8.0], 25 mM EDTA [pH 8.0], 0.5% SDS) and 20 µl RNase A (10 mg/ml), and the suspension was incubated at 37 °C for 30 minutes. 35 µl of proteinase K (20 mg/ml) was added to the cell suspension, which was digested at 58 °C for 2 hours. 400 µl of phenol:chloroform:IAA was mixed with the suspension and transferred to a Gel Lock tube. After centrifugation, the aqueous layer containing DNA was transferred to a fresh tube and the DNA was pelleted using 3 M sodium acetate (pH 5.6) and isopropanol. The dry DNA pellet was dissolved in 1x TE buffer.

2.25 Designing of PCR primers for C-terminal region of Ag I/II of S. mutans

The Ad1 and Ad2 sequences within the C-terminal region of Ag I/II of S. mutans are highly conserved and species-specific. To amplify both adhesion-mediating regions of Ag I/II, PCR primers were designed by Dr. Ed Hollox (Department of Genetics, University of Leicester) using Primer3 software, placing the two primers (Table 16) outside the two adhesion-mediating regions (Figure 31).

Figure 31: Schematic showing the primer locations used to amplify the C-terminal region of Ag I/II of S. mutans. The forward and reverse primers are shown in green and red respectively. The locations of the two adhesion-mediating regions (Ad1 and Ad2) are shown in rectangular boxes with consensus amino acid sequences.

Table 16: Primer sequences used to amplify C-terminal regions of SpaP gene of S. mutans.

Oligo name    Primer sequence (5'-3')            Product size (bp)
Smutans-F     ACTGTTCATTTCCATTACTTTAAACTAGC      1038
Smutans-R     GTTAATCTTAGGAACATTATTGATAACG

2.26 PCR amplification of C-terminal region of Ag I/II of S. mutans

The KAPA HiFi HotStart PCR Kit was used to amplify S. mutans gDNA according to the manufacturer's instructions (Kapa Biosystems, USA). The PCR products were amplified in a total volume of 25 µl with 5 µl of 5X KAPA HiFi Fidelity Buffer (containing 2 mM MgCl2 at 1X), 5 µM of each primer (forward and reverse), 0.5 U KAPA HiFi (HotStart) DNA Polymerase and ng of template DNA per 25 µl reaction, which was sufficient to achieve robust and sensitive amplification. The following KAPA HiFi cycling parameters were used for high-fidelity PCR amplification: initial denaturation at 95 °C for 4 minutes, followed by 35 cycles of 98 °C for 20 seconds, 58 °C for 30 seconds and 72 °C for 30 seconds, followed by a final extension at 72 °C for 5 minutes.
To achieve maximum PCR efficiency, denaturation in each PCR cycle was performed for 20 seconds at 98 °C to ensure complete denaturation of the accumulating amplification product in the high-salt KAPA HiFi buffer.

2.27 Extraction of PCR product from Agarose gel

After PCR amplification the amplified product was separated by electrophoresis on a 1% agarose gel using 0.5X TBE buffer for 60 to 75 minutes. The PCR bands were cut from the agarose gel and extracted using a column-based method, according to the manufacturer's instructions (QIAquick Gel Extraction Kit, Qiagen). The purified PCR products were quantified using a spectrophotometer (NanoDrop 1000 Spectrophotometer, Thermo Scientific).

Sequencing of PCR product using internal sequencing primers

The sequencing reactions were performed in a total volume of 10 µl with ng of purified PCR product using the BigDye Terminator v3.1 Cycle Sequencing Kit, according to the manufacturer's instructions (Applied Biosystems). To achieve full sequence coverage of the full-length PCR product, I designed two internal sequencing primers (Table 17) for use in the sequencing reactions. Each sequencing reaction was performed with 1 µl BigDye Terminator Ready Reaction Mix, 2 µl of 5X BigDye Terminator buffer and 3.5 µM sequencing primer, with the following cycling parameters: initial denaturation at 96 °C for 1 minute, followed by 25 cycles of 96 °C for 10 seconds, 50 °C for 5 seconds and 60 °C for 4 minutes; reactions were stored at 4 °C before purification. The sequencing reactions were cleaned up using a column-based method according to the manufacturer's instructions (Performa DTR Gel Filtration Cartridges, Edge BioSystems). The cleaned-up sequencing reactions were submitted to the Protein and Nucleic Acid Chemistry Laboratory (PNACL), University of Leicester, UK. The products were analysed using the 3730 automated sequencer.

Table 17: Internal sequencing primer sequences used to sequence the full-length PCR product of the C-terminal region of the SpaP gene of S. mutans.
Oligo name    Primer sequence (5'-3')
Smut_inF      CGCTTCTTCTGGATAATCATCTACATAG
Smut_inR      ATCCAATGTTGTTCGGGTGACAACT

2.29 Sequence read and alignment

All the sequencing reads were checked using FinchTV. All ambiguous nucleotides were re-called based on the sequencing chromatogram. Noisy sequence data from both ends were removed, and the complete sequence was obtained by merging the forward and reverse reads, oriented on the forward strand of the reference sequence. All the sequences were stored in FASTA format. The nucleotide sequences were translated using the

Sequence Manipulation Suite program. Multiple sequence alignments were performed for both nucleotide and amino acid sequences using the ClustalX program (Thompson et al., 1997), and the output file was saved in MSF format for use in phylogenetic analysis. The alignment files were also exported as NEXUS alignments for statistical analysis using the DnaSP software (Librado & Rozas, 2009).

Sequence diversity

Multiple sequence alignments were performed for both nucleotide and amino acid sequences using the ClustalX program (Thompson et al., 1997), and the output file was exported in MSF format for analysis in MEGA6 software (Tamura et al., 2013). The MSF files of both nucleotide and amino acid sequences were imported into MEGA6, converted to MEGA files and analysed to check the sequence diversity at both the nucleotide and amino acid levels. The sequence diversity of the two adhesion-mediating regions (Ad1 and Ad2) was analysed. Sequence diversity plots were generated at both the nucleotide and amino acid levels using WebLogo 3 software (Crooks et al., 2004; Schneider & Stephens, 1990).

Phylogenetic analysis

MEGA6 (Tamura et al., 2013) was used to generate maximum-likelihood trees, with 500 bootstrap replicates, for both nucleotide and amino acid sequences. The evolutionary history of the nucleotide sequences was inferred using the Maximum Likelihood (ML) method based on the General Time Reversible model (Nei and Kumar, 2000). Initial tree(s) for the heuristic search were obtained by applying the Neighbor-Joining method to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach. The evolutionary history of the amino acid sequences was inferred using the Maximum Likelihood method based on the JTT matrix-based model.
Initial tree(s) for the heuristic search were obtained by applying the Neighbor-Joining method to a matrix of pairwise distances estimated using a JTT model.

McDonald-Kreitman test

The McDonald-Kreitman test (MKT) was performed to detect the signature of natural selection at the molecular level. In the MKT, the amount of non-synonymous and synonymous polymorphism within a species is compared with the amount of non-synonymous and

synonymous divergence between species. Under the null hypothesis, all nonsynonymous mutations are expected to be neutral, and the ratio of nonsynonymous to synonymous variation within species (Pn/Ps) is expected to equal the ratio of nonsynonymous to synonymous variation between species (Dn/Ds). The MKT was calculated according to the guidelines given on the standard and generalized MKT website (Egea et al., 2008).

Input sequences for McDonald-Kreitman test

A set of homologous sequences from related species was downloaded using nucleotide BLAST, available at NCBI-BLAST. The homologous sequences from six species related to Streptococcus mutans (Streptococcus downei, Streptococcus oralis, Streptococcus sanguinis, Streptococcus sobrinus, Streptococcus gordonii and Streptococcus intermedius) were stored in FASTA format. A few sequences of Streptococcus mutans contained degenerate nucleotides, which the MKT website does not recognise, so each nucleotide sequence containing degenerate bases was divided into two separate sequences. The modified Streptococcus mutans sequences were aligned using the alignment program Muscle (Edgar, 2004), as recommended by the MKT website.

Sequence analysis for McDonald-Kreitman test

The 1008 bp coding sequences of the SpaP gene of S. mutans were analysed with the McDonald-Kreitman test. Two sets of sequencing data (all samples and European samples) were analysed. The MKT website uses a maximum parsimony criterion to estimate the number of synonymous and nonsynonymous changes within sequences. The website also applies the Jukes and Cantor correction to the divergence estimates to take multiple mutational hits into account. The contingency tables in Table 18 and Table 19 were used for the McDonald-Kreitman test of S. mutans using all samples and European samples, respectively.
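The comparison at the heart of the test can be illustrated with a short sketch. It computes the two ratios, Pn/Ps and Dn/Ds, and tests the 2x2 table for independence with a two-sided Fisher's exact test. Fisher's exact test is swapped in here as a simple, self-contained choice; it is an assumption, not necessarily the statistic reported by the MKT website. The function names are hypothetical, and the example counts are those that survive in the last row of Table 18 (S. mutans versus S. intermedius), assuming the column order polymorphism, divergence.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d
    denominator = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                  if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


def mk_ratios(ps, pn, ds, dn):
    """Return (Pn/Ps, Dn/Ds); the two are expected to be equal under neutrality."""
    return pn / ps, dn / ds


if __name__ == "__main__":
    # Counts read from the last row of Table 18 (column order assumed).
    ps, ds = 85, 4   # synonymous: polymorphism, divergence
    pn, dn = 55, 6   # nonsynonymous: polymorphism, divergence
    pn_ps, dn_ds = mk_ratios(ps, pn, ds, dn)
    p_value = fisher_exact_two_sided(ps, ds, pn, dn)
    print(f"Pn/Ps = {pn_ps:.3f}, Dn/Ds = {dn_ds:.3f}, Fisher p = {p_value:.3f}")
```

An excess of nonsynonymous divergence relative to polymorphism (Dn/Ds greater than Pn/Ps) is the pattern usually read as evidence of positive selection, but only when the test on the table is significant.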

Table 18: Contingency table for McDonald-Kreitman test of S. mutans from all samples.

Polymorphism           Divergence                  Variation       Polymorphism   Divergence
Streptococcus mutans   Streptococcus downei        Synonymous
                                                   Nonsynonymous
Streptococcus mutans   Streptococcus oralis        Synonymous
                                                   Nonsynonymous
Streptococcus mutans   Streptococcus sanguinis     Synonymous
                                                   Nonsynonymous
Streptococcus mutans   Streptococcus sobrinus      Synonymous
                                                   Nonsynonymous
Streptococcus mutans   Streptococcus gordonii      Synonymous
                                                   Nonsynonymous
Streptococcus mutans   Streptococcus intermedius   Synonymous      85             4
                                                   Nonsynonymous   55             6

Table 19: Contingency table for McDonald-Kreitman test of S. mutans from European samples.

Polymorphism           Divergence                  Variation       Polymorphism   Divergence
Streptococcus mutans   Streptococcus downei        Synonymous
                                                   Nonsynonymous
Streptococcus mutans   Streptococcus oralis        Synonymous
                                                   Nonsynonymous
Streptococcus mutans   Streptococcus sanguinis     Synonymous
                                                   Nonsynonymous
Streptococcus mutans   Streptococcus sobrinus      Synonymous
                                                   Nonsynonymous
Streptococcus mutans   Streptococcus gordonii      Synonymous
                                                   Nonsynonymous
Streptococcus mutans   Streptococcus intermedius   Synonymous      67             6
                                                   Nonsynonymous

Allele frequency spectrum

The NEXUS alignment file of the SpaP gene of S. mutans was used for statistical analysis in DnaSP (Librado & Rozas, 2009), and the CLUSTAL alignment was used in BioEdit (Hall, 1999). The positional nucleotide numerical summary was calculated using BioEdit. The minor allele frequency was calculated from the positional nucleotide numerical summary using Excel, and synonymous and non-synonymous alleles were identified using DnaSP.
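The minor allele frequency calculation from a positional nucleotide summary can be sketched as follows. This is only an illustration of the arithmetic that was done in Excel: the function name and the toy alignment are hypothetical, and the sketch assumes that ambiguous or gap characters are ignored and that only sites with exactly two nucleotides are reported.

```python
from collections import Counter

def minor_allele_frequencies(aligned_seqs, valid="ACGT"):
    """Return {1-based site: minor allele frequency} for biallelic sites."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    result = {}
    for pos in range(length):
        # Count each unambiguous nucleotide observed in this alignment column.
        counts = Counter(
            s[pos].upper() for s in aligned_seqs if s[pos].upper() in valid
        )
        if len(counts) == 2:  # variable site with exactly two alleles
            result[pos + 1] = min(counts.values()) / sum(counts.values())
    return result

# Hypothetical toy alignment (not thesis data).
toy_alignment = ["ATGC", "ATGA", "ACGA", "ATGA"]
print(minor_allele_frequencies(toy_alignment))
```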

The variant sites of S. mutans were grouped by the frequency of synonymous and non-synonymous polymorphisms in both the all-samples and European-samples sets. Line charts of the allele frequency of synonymous and non-synonymous polymorphisms for all and European samples were plotted using the R (v ) project for statistical computing. A two-tailed Fisher's exact test was performed with GraphPad Software using the frequency data of synonymous and non-synonymous polymorphisms.

Secretor status assay

The classic human secretor locus (Se), FUT2, encodes alpha-(1,2) fucosyltransferase, which regulates the expression of ABH and Lewis blood group antigens on secreted molecules, mucosal cells and in body fluids, and also determines the secretion status of the ABO antigens (Hazra et al., 2008). The FUT2 nonsense mutation encoding W143X (allele A) has been reported as the primary nonsecretor allele in Europeans, Africans and Iranians, with a frequency of approximately 50% (Hazra et al., 2008). A secretor (dominant) individual has demonstrable ABH blood group antigen in saliva and other body fluids, whereas a nonsecretor does not. DMBT1 is glycosylated by the enzyme that confers Secretor (Se) blood group status, and it has been speculated that the variant size (copy number) and glycoforms of DMBT1 modulate differential aggregation of S. mutans (Eriksson et al., 2007).

Primer design for secretor status assay

A PCR-RFLP assay (Table 20) was designed by Dr Ed. Hollox (Department of Genetics, University of Leicester, UK) for the SNP (rs601338) that determines Se blood group status. rs601338 (nucleotide position 428), encoding W143X, is a nonsense mutation, and nonsecretors are frequently homozygous for the FUT2 W143X polymorphism, which results in an inactive FUT2 protein (Hazra et al., 2008). The genotype of rs601338 was called from the G or A nucleotide at position 428 (G428A).
Nucleotide G at position 428 encodes tryptophan (W, codon TGG), and a transition to nucleotide A results in a premature stop codon (codon TAG) (Figure 32).
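The effect of the G to A change on codon 143 can be checked with a few lines of code. This is a minimal sketch with hypothetical function names; it assumes that position 428 is numbered from the first base of the coding sequence, which is consistent with the change falling in codon 143.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codon_position(nt_position):
    """Map a 1-based coding nucleotide position to (codon number, base within codon)."""
    return (nt_position - 1) // 3 + 1, (nt_position - 1) % 3 + 1

def codon_143(allele):
    """Codon 143 of FUT2 with the given base at its middle position."""
    return "T" + allele + "G"

def outcome(allele):
    """Return 'W' for tryptophan or 'stop' for a premature stop codon."""
    codon = codon_143(allele)
    if codon in STOP_CODONS:
        return "stop"
    if codon == "TGG":
        return "W"
    raise ValueError(f"unexpected codon {codon}")

print(codon_position(428))         # codon number and position within the codon
print(outcome("G"), outcome("A"))  # common allele, derived allele
```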

Figure 32: Schematic picture showing the PCR-RFLP strategy for rs601338 of the FUT2 gene. The top line indicates the encoded amino acids and the bottom line the nucleotide sequence; the common allele and its encoded amino acid are indicated in red, and the derived allele and its encoded amino acid in green. Positions of the forward (>) and reverse (<) primers are shown in red and blue; Ava II restriction sites are shown with dotted arrows (red for the rs601338 genotype and green for the digestion control).

Table 20: Primers used to amplify rs601338 of the FUT2 gene for the secretor status assay.

Oligo name   Primer sequence (5'-3')   Product size (bp)
secretor F   GAGTACGTCCGCTTCACC        216
secretor R   CTTCCACACTTTTGGCATGA

PCR amplification and restriction digestion

The PCR reactions were performed with 5-10 ng of DNA as template in a total volume of 10 µl containing 5 µM of each primer (secretor F and secretor R), 0.5 µl of 2.5 mM dNTP mix (Promega), 0.5 U Taq DNA Polymerase (Kapa Biosystems) and 1 µl of 10X KAPA Taq Buffer A (15 mM Mg2+, Kapa Biosystems). The thermal cycling conditions were: initial denaturation at 94 °C for 4 minutes, followed by 24 cycles of 94 °C for 30 seconds, 58 °C for 30 seconds and 72 °C for 30 seconds, followed by a final extension at 72 °C for 10 minutes. After PCR, 5 µl of 1X CutSmart Buffer (NEB) containing 2 units of Ava II enzyme (NEB) was added to each well and incubated at 37 °C for 1-2 hours. After digestion of the PCR product, 3 µl of 6X loading buffer was added to each well and the digested products were separated on a 3% agarose gel (containing 0.5 μg/ml ethidium bromide, Sigma-Aldrich) in 0.5x TBE buffer for 2 hours. 200 ng of size standard (HyperLadder V, Bioline) was run alongside the samples to check the size of the amplified PCR bands. The band sizes and the corresponding secretor status are given in Table 21.

Table 21: Fragment sizes, genotypes and secretor status for rs601338 of the FUT2 gene based on PCR-RFLP.

Size of PCR-RFLP bands (bp)   Genotype   Status
130, 51 and 35                GG         Secretor
165, 130, 51 and 35           GA         Secretor
165 and 51                    AA         Non secretor

The PCR products of the three genotypes were extracted from the agarose gel and sequenced in a total volume of 10 µl with ng of purified PCR products using the BigDye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems), according to the manufacturer's instructions. The secretor status of each sample was deduced by genotyping rs601338 in FUT2 by restriction enzyme digestion, calling genotypes GG and GA as secretors and genotype AA as nonsecretors. Three representative samples of each of the three genotypes from the Leicester local samples were validated with Sanger sequencing (Figure 33).
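The band patterns in Table 21 translate directly into a lookup from observed fragments to genotype and secretor status. The sketch below is illustrative only (the function and constant names are hypothetical); it also checks that each homozygous pattern adds up to the 216 bp product listed in Table 20.

```python
PRODUCT_SIZE_BP = 216  # undigested PCR product (Table 20)

# Band patterns from Table 21, keyed by the set of fragment sizes in bp.
BAND_PATTERNS = {
    frozenset({130, 51, 35}): ("GG", "Secretor"),
    frozenset({165, 130, 51, 35}): ("GA", "Secretor"),
    frozenset({165, 51}): ("AA", "Non secretor"),
}

def call_secretor(bands_bp):
    """Return (genotype, status) for a list of observed band sizes in bp."""
    key = frozenset(bands_bp)
    if key not in BAND_PATTERNS:
        raise ValueError(f"unrecognised band pattern: {sorted(key)}")
    return BAND_PATTERNS[key]

# Fragments from a single homozygous allele must sum to the full product.
assert sum({130, 51, 35}) == PRODUCT_SIZE_BP
assert sum({165, 51}) == PRODUCT_SIZE_BP

print(call_secretor([165, 130, 51, 35]))
```

The 51 bp fragment appears in every pattern because it comes from the digestion-control Ava II site, so its absence would indicate failed digestion rather than a genotype.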

Figure 33: Electropherogram of secretor PCR products showing the three possible genotypes of rs601338 of the FUT2 gene in Leicester local samples.

Regression Analysis

Regression analysis is used to determine whether or not a relationship exists between variables. In this study, I searched for a relationship between the variable nucleotides of the C-terminal region of the Ag I/II gene of S. mutans and the genotype of the individuals from whom the S. mutans isolates were sequenced. Sequence analysis of S. mutans showed that two types of nucleotide were present at every variable site of the Ag I/II gene, so the sequencing data for the variable nucleotides were converted into a binary format. Logistic regression was performed for all S. mutans samples with geographical region, CNV1, CNV2 and secretor status of the individuals as predictor variables. For samples isolated from European regions, the three genotype variables (CNV1, CNV2 and secretor status) were used as predictor variables. The logistic regression analyses for all samples and for European samples were performed using the R (v ) project for statistical computing.

Comparison between aCGH and PRT for Crohn's samples

The copy number estimates of DMBT1 obtained with the PRT assays were revalidated in HapMap samples using long-range PCR, fibre-FISH and PFGE. Accurate determination of DMBT1 copy number in large numbers of samples is important for robust CNV case-control association studies. Some of the Scottish and English Crohn's samples were also analysed as part of the WTCCC CNV association study using an Agilent 210k aCGH chip. A direct comparison of results from the same samples typed by the Agilent 210k aCGH chip and by the PRT assays enabled analysis of the artefactual influences on copy number determination using the Agilent 210k aCGH platform.

Statistical analysis for case-control study

Analysis of Crohn's disease samples

Summary statistics and t-tests were calculated using Microsoft Excel. Mean gene copy numbers of patients and control individuals were compared with a t-test for CNV1, CNV2 and total SRCR, and differences were considered significant if the two-tailed p value was below 0.05 (P<0.05). A two-tailed Fisher's exact test was performed with GraphPad Software.

Analysis of African HIV cohorts

To analyse the effect of CNV1, CNV2 and total SRCR at DMBT1 on HIV load at initiation of HAART, a generalised linear model was constructed using SPSS 20.0 (IBM). The model was calculated using type III sum of squares ANOVA, and goodness-of-fit was analysed using Wald statistics. Initial viral load was used as the dependent variable, population and disease status were considered as fixed predictor factors, and CD4 count and genotype were used as scalar predictor variables. The effect of DMBT1 CNVs on CD4+ count after initiation of HAART was estimated using a generalised linear mixed model. The dependent variable (CD4+ count) was modelled as a Gaussian distribution, and the effect of CNV1, CNV2 and total SRCR copy number on CD4+ count following initiation of HAART was calculated using STATA version 13 (StataCorp, Texas, USA). In this model, population and disease status were assigned as fixed factors, initial CD4+ count and time since HAART initiation were used as scalar covariates, and integer copy number was used as an ordinal covariate.
Type III sum of squares ANOVA was calculated, and the variance was corrected to allow for multiple CD4+ time-point readings from a single patient.

Analysis of lung disease cohorts

The statistical analysis for the Gedling and LRC cohorts was performed by Dr. Ioanna Ntalla and Dr. Louise Wain, Department of Health Sciences, University of Leicester, UK. The association of the mean raw PRT copy number of CNV1 and CNV2 with lung function (FEV1, FVC, FEV1/FVC) in all individuals was tested using a linear regression model, whereas COPD status (Gedling) and doctor-diagnosed asthma status (Gedling and LRC) were analysed using a logistic regression model. The association of integer diploid copy number of CNV1, CNV2 and total SRCR copy number with lung function, COPD status and asthma status was also calculated using the same models. An inverse normal transformation was applied to FEV1, FVC and FEV1/FVC. Age, age squared, sex and height were included as covariates in both models. All association analyses were performed in R (v ).

Analysis of Vesicoureteral Reflux (VUR) and Urinary Tract Infections (UTI) cohorts

A generalised linear model was constructed using SPSS 20.0 (IBM) to analyse the effect of CNV1 and CNV2 at DMBT1 in the VUR and UTI samples. The model was calculated using type III sum of squares ANOVA, and goodness-of-fit was analysed using Wald statistics. Disease cohort was used as the dependent variable, secretor status was considered as a fixed predictor factor, and CNV1 and CNV2 copy number at DMBT1 were used as scalar predictor variables.
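The inverse normal transformation applied to the lung function measures in the lung disease cohort analysis can be sketched as a rank-based transform. The thesis does not state which variant was used, so this example assumes the rank-based form with the Blom offset (c = 3/8) and average ranks for ties; the function name and the FEV1 values are hypothetical.

```python
from statistics import NormalDist

def inverse_normal_transform(values, c=3 / 8):
    """Rank-based inverse normal transform; ties receive their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values starting at sorted position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    # Convert each rank to a quantile in (0, 1), then to a normal score.
    return [standard_normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]

# Hypothetical FEV1 values in litres (not thesis data).
fev1 = [3.1, 2.4, 4.0, 2.9, 3.6]
print([round(z, 3) for z in inverse_normal_transform(fev1)])
```

The transformed values keep the rank order of the raw measurements but follow an approximately standard normal distribution, which is why the transform is often used before linear regression on skewed traits.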

3 CHARACTERIZATION OF COPY NUMBER VARIATION OF THE HUMAN DMBT1 GENE

3.1 DMBT1 Sequence processing and bioinformatics

Dot plot analysis of the entire sequenced region against itself was performed with the Basic Local Alignment Search Tool (BLAST) to identify regions of local similarity between sequences after masking repetitive sequences (Alus).

Figure 34: Dot plot analysis of the DMBT1 gene (exons and introns) against itself. The upper panel shows the chromosomal location of DMBT1, the second panel shows three DMBT1 gene annotations from different transcripts and the lower panel shows the segmentally duplicated regions of DMBT1; the thirteen homologous sequences representing the thirteen SRCR blocks are indicated with arrows.

Regions of high sequence identity are shown as dark diagonal lines (Figure 34). The main diagonal shows the self-match of the DMBT1 gene, and the other prominent diagonals show the pattern of direct repeat matches. The middle portion of the DMBT1 gene shows extensive duplications that contain the SRCR/SID domains of DMBT1. The two algorithms (Megablast and discontiguous Megablast) identified a similar pattern in the dot plots, but Megablast showed shorter diagonals than discontiguous Megablast. Megablast, which identified sequences with more than 95% sequence identity without allowing for gaps, showed that some regions of each block (the exons) shared high homology. In the DMBT1 gene region, thirteen copies of a 3-4 kb genomic element, each harbouring the exons for one SRCR domain and two exons coding for the subsequent SID, are indicated by arrows (Figure 34). A small region (~2 kb) upstream of DMBT1 also appeared duplicated, but this was outside the SRCR/SID blocks. Sequence analysis indicated that the bacteria-binding SRCR domains of DMBT1 are part of a tandemly repeated region and that, because of the high sequence similarity and the interaction with pathogens, the number of repeats may vary between human populations.

3.2 Sequence relationship of SRCR repeats

The maximum-likelihood tree of the 3-4 kb nucleotide sequences of the SRCR repeats showed that repeats from the tandemly repeated SRCR region of DMBT1 were closer to each other than to the SRCR14 repeat (Figure 35), indicating that the first thirteen SRCR repeats were more similar to each other than to SRCR14. The topology of the first thirteen SRCR repeats indicated that they were polyphyletic taxa forming two clades. The major clade included six SRCR repeats (SRCR2, 6, 8, 9, 10 and 11) and the minor clade included four SRCR repeats (SRCR3, 4, 5 and 7). There were three monophyletic groups within the two clades, and all three groups showed high sequence similarity: SRCR2-8 (94%), SRCR3-7 (98%) and SRCR10-11 (97%). SRCR12 and SRCR13 were polyphyletic and showed lower sequence identity (83%). SRCR1 branched from the minor clade but was completely different from the others. The sequence analysis showed that the first thirteen SRCR repeats were very similar to SRCR14, but different enough to design SRCR repeat-specific primers for most of them.

Figure 35: Nucleotide sequence relationship of SRCR repeats. The maximum-likelihood tree shows the relationship of the SRCR repeats (3-4 kb) carrying the exon sequences of the SRCR coding domains. Scale bar indicates 0.1 substitutions per site.
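Pairwise sequence identities such as those quoted for the SRCR repeat pairs can be computed from an alignment as sketched below. The thesis does not state how the percentages were derived, so this example assumes one common definition (identical positions divided by aligned positions, skipping any column with a gap); the function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two aligned sequences, ignoring gapped columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must have the same aligned length")
    compared = matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        matches += x == y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared

# Hypothetical toy alignment (not SRCR sequence data).
print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
```

Other definitions (for example, counting gaps as mismatches) give lower values for the same alignment, so identities are only comparable when the same convention is used throughout.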

3.3 Evidence of copy number variable region of DMBT1

Genome-wide analyses had indicated that the DMBT1 region is highly copy number variable (Conrad et al., 2010). The Database of Genomic Variants reported that the middle portion of DMBT1, which contains the tandemly repeated SRCR and SID domains, was copy number variable, and that two regions of the SRCR/SID domains were copy number variable. A previous study identified two regions (2-4/2-7 and 2-9/2-10) of SRCR domains at DMBT1 that showed deletion polymorphism in CEPH individuals (Sasaki et al., 2002). WTCCC identified the DMBT1 regions using the Agilent CGH platform, and 12 probes mapped within (hg18, chr10: ), a region similar to that previously characterised as deletion polymorphism 2-4/2-7 using long PCR. A further 18 probes (hg18, chr10: ) covered the 2-9/2-10 deletion region, but they were not analysed for copy number calling. A principal component analysis of the Agilent probes covering the 2-4/2-7 region was performed, and the first principal component (1PC) showed clearly separated clusters with no overlap. The 1PC of the probes covering the 2-9/2-10 deletion region also indicated clustering, but the clusters were less well separated than for the 2-4/2-7 region. The copy number information in the Database of Genomic Variants shows evidence of deletion polymorphism, and the Agilent aCGH data showed that two regions of the DMBT1 gene are copy number variable. These two regions of DMBT1 are called CNV1 and CNV2 in this study (Figure 36).

The human genome reference sequence was built from the sequence data available in the archival database, with the short sequenced fragments joined through overlapping regions into continuous sequences ("contigs") to assemble a genome. Multiple locus-specific repeats of the DMBT1 region might have been joined in the wrong places, or tandemly repeated SRCR/SID sequences might have been missed because of their high similarity. Three gene models for the DMBT1 region were annotated, and exons and introns were numbered according to transcriptional and/or protein evidence.
The three gene models were annotated as DMBT1 transcript variant 1, DMBT1 transcript variant 2 and DMBT1 transcript variant 3. Different copy number variable DMBT1 gene models were identified and annotated based on transcriptional data. The exons and introns of the DMBT1 reference sequence were annotated and numbered using the transcript data of the smaller transcript, so one repeat containing three exons was not annotated in the reference sequence.

Figure 36: Copy number variable regions in DMBT1 gene. Database of Genomic Variants shows a comprehensive summary of structural variation (blue indicates a gain, red indicates a loss and brown indicates both a gain and loss in size) in DMBT1. The two copy number variable regions are indicated as CNV1 and CNV2.

PRT results for HapMap samples

The copy number variable regions (CNV1 and CNV2) mentioned previously were typed using PRT assays. The details of copy number estimation for both CNVs are described below.

Analysis of CNV1 region in HapMap samples

The analysis was performed step-by-step to validate the quality of copy number calling and to estimate the total copy number of CNV1 in HapMap samples.

Distribution of raw PRT ratio for CNV1

The CNV1 copy number was estimated using two PRT assays, PRT1 and PRT2. The raw PRT ratio of both PRT assays was normalised using standard reference DNA. Histograms of the distribution of the normalized PRT ratio of PRT1 (left side) and PRT2 (right side) (Figure 37) were analysed for the 270 HapMap Phase I DNA samples. The distribution of the raw PRT ratio showed very clear clustering in the HapMap Phase I DNA samples for both PRT assays.

Figure 37: Histogram of raw PRT ratio of PRT1 (left side) and PRT2 (right side) in the HapMap Phase I DNA samples. The x-axis shows the raw PRT ratio for CNV1 in both plots.

Comparison of raw PRT1 and PRT2 ratio

The two assays for CNV1 were compared using a scatterplot, with the raw PRT1 ratio on the X axis and the raw PRT2 ratio on the Y axis, to look for a relationship between them. The points clustered along a line, and the data clusters were well separated with no overlap (Figure 38). The graph indicated that the two assays were correlated with each other and indicated the sensitivity and specificity of CNV1 copy number estimation. Both assays produced four well-defined clusters but did not produce a distinct cluster for higher-copy samples because of the small number of such samples.
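The agreement between two PRT assays typed on the same samples can be summarised with a correlation coefficient and a per-sample mean ratio. This is a minimal sketch: the thesis assesses agreement visually from the scatterplot, so using Pearson's r here is an assumption, and the function names and ratio values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def mean_ratio(assay_a, assay_b):
    """Per-sample mean of two normalised PRT ratios."""
    return [(a + b) / 2 for a, b in zip(assay_a, assay_b)]

# Hypothetical normalised ratios for five samples (not thesis data).
prt1 = [2.05, 2.98, 4.02, 3.95, 5.10]
prt2 = [1.96, 3.05, 3.90, 4.08, 4.95]
print(round(pearson_r(prt1, prt2), 3))
print(mean_ratio(prt1, prt2))
```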

Figure 38: Assessment of CNV1 copy number assay quality in HapMap samples. The scatterplot and associated histograms show the raw PRT value generated by PRT1 (x-axis) plotted against the raw PRT value generated by PRT2 (y-axis) for the 270 HapMap Phase I DNA samples.

Comparison of PRT value and aCGH data for CNV1

Principal component analysis was used to compare the aCGH data and the raw PRT ratio of CNV1 for the HapMap samples. A scree plot (Figure 39) was plotted to determine the quality of the different principal components and to decide which principal components might be useful for validation of copy number calling of CNV1. The scree plot for the PCA displayed the total recall score of each principal component. It showed that the first principal component explained most of the variation (almost 90%) in the aCGH data generated using the Agilent 210K CNV chip. The first principal component was therefore used to validate the quality of integer copy number calling of CNV1 using the PRT ratio.
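The proportion of variance explained by the first principal component, which is what a scree plot reports for its first bar, can be estimated as sketched below. This is not the procedure used in the thesis: it is a dependency-free illustration that uses power iteration on the covariance matrix of a samples-by-probes table, and the function name and the probe values are hypothetical.

```python
from math import sqrt

def first_pc_variance_fraction(rows, iterations=500):
    """Fraction of total variance explained by the first principal component.

    rows: list of samples, each a list of probe values of equal length.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    total_variance = sum(cov[i][i] for i in range(p))
    if total_variance == 0:
        raise ValueError("data have zero variance")
    # Power iteration; an uneven start vector makes it unlikely to be
    # exactly orthogonal to the leading eigenvector.
    v = [float(j + 1) for j in range(p)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    norm_v = sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    # Rayleigh quotient of the unit vector gives the leading eigenvalue.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue / total_variance

# Hypothetical log-ratio values for four samples across three probes.
probes = [[0.9, 1.1, 1.0], [0.1, -0.1, 0.0], [-1.0, -0.9, -1.1], [0.0, 0.1, -0.1]]
print(round(first_pc_variance_fraction(probes), 3))
```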

Figure 39: Scree plot for PCA of aCGH data generated using the Agilent 210K CNV chip for CNV1 in HapMap samples. The X-axis shows the number of principal components.

The average PRT ratios were compared with the first PC of the array CGH data previously generated using the Agilent 210K CNV chip. There was a clear positive correlation between the two methods. The data generated by the PRT assays clustered effectively into integer copy numbers, whereas there was considerable overlap between the copy number values produced by aCGH (Figure 40).

Figure 40: Scatter plot of mean unrounded copy number value of CNV1 and first PC of Agilent aCGH data for the CNV1 region of HapMap samples. Each point represents an individual sample; the different symbols and colours reflect the copy number clusters for CNV1.

Copy number estimation for CNV1

The R package CNVtools (v ) was used to calculate the copy number of CNV1 from the transformed PRT data. The histogram of the average PRT data for PRT1 and PRT2 was plotted using R. The distribution of the average PRT ratio indicated five well-separated clusters (Figure 41). The number of clusters was used to infer the integer copy number of CNV1 using the R package CNVtools (v ).

Figure 41: Histogram of mean normalized PRT ratio of CNV1 for HapMap samples. The X-axis shows the mean PRT ratio of CNV1.

Integer copy numbers were inferred from the transformed PRT data using a Gaussian mixture model. A mixture model of five components was fitted, based on the clustering of the normalized PRT data (Figure 42). The resulting clustering quality score (Q) was . The posterior probabilities of the integer copy number call for each sample are plotted in Figure 43. The posterior probability was > 0.99, indicating very high quality of copy number calling for CNV1.

Figure 42: Output of the clustering procedure using the PRT transformed data of CNV1 for HapMap samples. The coloured lines show the posterior probability for each of the five copy number classes (copy number = 2; 3; 4; 5 or 6). The X-axis shows the transformed PRT ratio of CNV1.
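The posterior probability of each copy number class under a Gaussian mixture model can be illustrated as below. This is a sketch of the general calculation, not of CNVtools itself, which also estimates the mixture parameters from the data. The component means and standard deviation are hypothetical; the weights are the overall CNV1 copy number frequencies reported for the HapMap samples.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def posterior_copy_number(x, components):
    """Posterior probability of each copy number class for a PRT value x.

    components: list of (copy_number, weight, mean, sd) tuples.
    """
    weighted = {cn: w * normal_pdf(x, m, s) for cn, w, m, s in components}
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("value is too far from every component")
    return {cn: value / total for cn, value in weighted.items()}

# Five components for copy numbers 2 to 6. Means and sd are assumptions.
components = [
    (2, 0.02, 2.0, 0.15),
    (3, 0.17, 3.0, 0.15),
    (4, 0.59, 4.0, 0.15),
    (5, 0.19, 5.0, 0.15),
    (6, 0.03, 6.0, 0.15),
]

posterior = posterior_copy_number(4.05, components)
best_call = max(posterior, key=posterior.get)
print(best_call, round(posterior[best_call], 4))
```

A sample lying midway between two component means receives a split posterior, which is the situation a posterior threshold is designed to flag.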

Figure 43: Analysis of integer copy number calling for CNV1 in HapMap samples. The scatter plot and associated histograms show the mean unrounded copy number values generated by PRT1 and PRT2 plotted against the posterior probabilities of the integer copy number call for HapMap samples.

Distribution of CNV1 integer copy number

The CNV1 copy numbers per diploid genome were measured by the PRT1 and PRT2 assays in four HapMap populations: the European-descent collection from Utah (CEU, n=90), the Japanese from Tokyo (JPT, n=45), the Han Chinese from Beijing (CHB, n=45) and the African Yoruba from Nigeria (YRI, n=90). The CNV1 copy number was estimated in a total of 270 HapMap samples (Table 22). The study found five diploid copy number classes in these HapMap populations, with diploid copy number ranging from 2 to 6 and a modal copy number of 4 (59%) (Figure 44).
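The frequencies, modal class and mean diploid copy number follow directly from the copy number counts. The sketch below uses the per-population counts reported in Table 22 and derives the combined totals by summation; the function and variable names are hypothetical, and the means are computed here rather than copied from the table.

```python
# Per-population CNV1 diploid copy number counts, as reported in Table 22.
counts = {
    "CEU": {2: 1, 3: 16, 4: 73, 5: 0, 6: 0},
    "JPT+CHB": {2: 5, 3: 27, 4: 41, 5: 17, 6: 0},
    "YRI": {2: 0, 3: 2, 4: 45, 5: 34, 6: 9},
}

def summarise(copy_number_counts):
    """Return (n, frequencies, mean copy number, modal copy number)."""
    n = sum(copy_number_counts.values())
    frequencies = {cn: c / n for cn, c in copy_number_counts.items()}
    mean = sum(cn * c for cn, c in copy_number_counts.items()) / n
    modal = max(copy_number_counts, key=copy_number_counts.get)
    return n, frequencies, mean, modal

# Combined counts across the three population groups.
total_counts = {cn: sum(pop[cn] for pop in counts.values()) for cn in range(2, 7)}

n, frequencies, mean, modal = summarise(total_counts)
print(n, modal, round(frequencies[modal], 2), round(mean, 2))
for name, pop_counts in counts.items():
    print(name, round(summarise(pop_counts)[2], 2))
```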

Table 22: CNV1 copy number frequencies in HapMap samples. Values are copy number counts (frequencies).

Diploid copy number   Total HapMap samples   CEU samples   JPT+CHB samples   YRI samples
2                     6 (0.02)               1 (0.01)      5 (0.06)          0 (0.00)
3                     45 (0.17)              16 (0.18)     27 (0.30)         2 (0.02)
4                     159 (0.59)             73 (0.81)     41 (0.45)         45 (0.50)
5                     51 (0.19)              0 (0.00)      17 (0.19)         34 (0.38)
6                     9 (0.03)               0 (0.00)      0 (0.00)          9 (0.10)
Total
Mean

Figure 44: Frequencies of integer copy number of CNV1 per diploid genome in different HapMap populations.

Genotype CNV1 using long-range PCR

All 270 HapMap samples were also genotyped for CNV1 using long-range PCR, which gave three genotypes. The long-range PCR fragments and the corresponding diploid copy numbers are shown in Table 23. All individuals with 2 copies produced only the 4.2 kb PCR product and were called consistently. Individuals with 3 copies always produced the smaller (4.2 kb) PCR band, but the larger band (16.9 kb) (Figure 45) was present irregularly, depending on the quality of the DNA. Individuals with 4 or more copies of CNV1 could not be genotyped because of the limitations of long PCR. Long-range PCR could therefore genotype only 2-copy individuals and could not distinguish higher copy numbers.

Table 23: PCR fragments for validation of CNV1 copy number using long-range PCR.

CNV1 copy number   PCR fragment        Genotype status
2                  4.2 kb              Homozygous deletion
3                  4.2 kb, 16.9 kb     Heterozygous deletion
4                  16.9 kb             Without deletion fragment
5 or more          16.9 kb, 29.6 kb    Without deletion fragment

Figure 45: Schematic illustration of long-range PCR for genotyping the CNV1 region of DMBT1. The positions of the forward and reverse primers are shown with red and blue boxes respectively. The long-range PCR amplifies both the red and green regions for the long allele (16.9 kb), but for the deletion allele (4.2 kb) it amplifies only the green region.

Figure 46: Genotyping of the CNV1 allele of good-quality DNA samples using long-range PCR. The 16.9 kb band represents the long allele (CNV1 copy number 2) and the 4.2 kb band indicates the deletion allele (CNV1 copy number 1) of the DMBT1 gene. (Long allele: L; Deletion allele: D; size standard DNA ladder: M)

Good-quality genomic DNA amplified both the long and short alleles of the CNV1 region (Figure 46), depending on the copy number of the CNV1 region. Fragmented genomic DNA sometimes did not amplify the long allele of CNV1 (Figure 47). The quality of the gDNA was very important for successful amplification of the long allele.

Figure 47: Genotyping of the CNV1 allele of freeze-thawed DNA using long-range PCR. The 4.2 kb band indicates the deletion allele, but the 16.9 kb long-allele-specific bands are absent because of the quality of the DNA. (Long allele: L; Deletion allele: D; size standard DNA ladder: M)

Long-allele-specific PCR

The long-allele-specific PCR was designed to overcome this limitation of long-range PCR. Samples that did not amplify the long allele in long-range PCR were amplified using long-allele-specific PCR (230 bp) and genotyped by combining both PCR results (Figure 48).

Figure 48: Top panel shows the location of the long-allele-specific PCR. The PCR primer positions are indicated with two red arrows. The positions of the forward and reverse primers are shown with red and blue boxes respectively. The 230 bp PCR band indicates the long allele. (Long allele: L; Deletion allele: D; size standard DNA ladder: M)

Block specific long PCR

A total of 270 samples from all HapMap populations were typed using block-specific long-range PCR. The assay amplifies three blocks of variable size from three different regions of the DMBT1 gene within the SRCR/SID blocks. Samples with diploid CNV1 copy number 2 showed complete deletion of the second block (1549 bp), but the remaining two bands were always present in the agarose gel (Figure 49).

Figure 49: (A) Schematic presentation of block-specific long PCR. The positions of the PCR primers are indicated with red arrows and the PCR products are indicated below. (B) Agarose gel picture of the DMBT1 block-specific PCR product. The absence of the 1549 bp PCR band of the second block is indicated with red arrows in samples 2 and 15 (M: DNA marker; N: no template control).

The homozygous deleted samples (CNV1 copy number 2) were double-checked using block-specific long PCR. The second block was completely missing in those samples, but the assay did not genotype individuals with CNV1 copy number 3 or more. The genotyping strategy for CNV1 is shown in detail in Table 24.

Table 24: Combination of different PCR assays for validation of CNV1 copy number.

CNV1 copy number   Long-range PCR   Long allele specific PCR   Genotype   Block specific long PCR
2                  Present (D)      Absent                     DD         2 bands (no 2nd block)
3                  Present (D)      Present (L)                LD         3 bands
4                  Absent           Present (L)                LL         3 bands

The highest frequency of both the DD and LD genotypes (CNV1 copy number 2 and 3 respectively) was found in the HapMap JPT+CHB population (6% and 30% respectively), followed by HapMap CEU (1% and 18% respectively) (Figure 50). Only 2% of the HapMap African samples (YRI) showed the LD genotype. The genotype frequency and copy number frequency were the same for the HapMap samples, but the main disadvantage of this approach was that it did not distinguish individuals with 4, 5 and 6 copies of CNV1.
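The decision logic of Table 24 can be written as a small function that combines the two PCR results into a genotype. This is a minimal sketch with hypothetical names; it encodes only the three rows of Table 24 and treats the LL genotype as "4 or more" copies, since the PCR assays cannot separate 4, 5 and 6 copies.

```python
def cnv1_pcr_genotype(deletion_band, long_allele_band):
    """Combine PCR results into a CNV1 genotype following Table 24.

    deletion_band: True if the deletion (D) band is seen in long-range PCR.
    long_allele_band: True if the long-allele-specific band is seen.
    """
    if deletion_band and not long_allele_band:
        return "DD"
    if deletion_band and long_allele_band:
        return "LD"
    if long_allele_band:
        return "LL"
    raise ValueError("no band in either assay: PCR failure or untyped sample")

# Diploid CNV1 copy number implied by each genotype.
GENOTYPE_TO_COPY_NUMBER = {"DD": "2", "LD": "3", "LL": "4 or more"}

for d_band, l_band in [(True, False), (True, True), (False, True)]:
    genotype = cnv1_pcr_genotype(d_band, l_band)
    print(genotype, GENOTYPE_TO_COPY_NUMBER[genotype])
```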

Figure 50: Genotype frequency of CNV1 region in different HapMap populations.

Analysis of CNV2 region in HapMap samples

Details of the analyses to validate the quality of copy number calling and to estimate the total copy number of CNV2 in HapMap samples are shown below.

Distribution of raw PRT ratio for CNV2 in HapMap samples

The CNV2 copy number was estimated using three PRT assays: PRT3, PRT4 and PRT5. Histograms of the distribution of the normalized PRT ratio of PRT3 (left side, red) and PRT4 (right side, green) were analysed for the 270 HapMap Phase I DNA samples (Figure 51). The distribution of the raw ratio of both PRT3 and PRT4 showed very good clustering in the HapMap Phase I DNA samples.

Figure 51: Histogram of raw PRT ratio of PRT3 (left side, red) and PRT4 (right side, green) in the HapMap Phase I DNA samples. The X-axis shows the raw PRT ratio for CNV2.

The histogram of the distribution of the raw PRT ratio of PRT5 was analysed for the 270 HapMap Phase I DNA samples. The distribution of the raw ratio of PRT5 showed clustering in the HapMap Phase I DNA samples, but the distribution pattern and the number of clusters differed from those of the PRT3 and PRT4 assays (Figure 52).

Figure 52: Histogram of raw PRT ratio of PRT5 in the HapMap Phase I DNA samples.

Comparison of PRT3 and PRT4 raw ratio for HapMap samples

Scatter plot analysis of the normalized PRT ratios from the two assays (PRT3 and PRT4) showed a strong correlation and distinct clustering around integer values, indicating the robustness of both PRT assays (Figure 53). The scatter plot also indicates that data from both assays are equally good for CNV2 copy number estimation using CNVtools.

Figure 53: Assessment of CNV2 copy number assay quality. The scatterplot and associated histograms show the raw PRT value generated by PRT3 (x-axis) plotted against the raw PRT value generated by PRT4 (y-axis) for the 270 HapMap Phase I DNA samples.

Comparison of PRT3 and PRT5 raw ratio for HapMap samples

The scatter plot analysis of the PRT ratios from the two assays (PRT3 and PRT5) did not show any notable association (Figure 54). The scatter plot indicated that samples within each cluster of the PRT3 ratio gave a wide range of ratios in the PRT5 assay. The histogram of the PRT3 ratio showed well-separated data clusters compared with the data clusters of PRT5.

Figure 54: Assessment of CNV2 copy number assay quality. The scatterplot and associated histograms show the raw PRT3 value (x-axis) plotted against the raw PRT5 value (y-axis) for the 270 HapMap Phase I DNA samples.

The data clustering of the PRT5 ratio indicates that data from the PRT5 assay are not reliable for CNV2 copy number estimation using CNVtools.

Comparison of PRT4 and PRT5 raw ratio for HapMap samples

The scatter plot analysis of the raw PRT4 and PRT5 ratios showed no significant clustering of the PRT5 data and indicated poor correlation between the ratios from the two assays (Figure 55).

Figure 55: Assessment of CNV2 copy number assay quality. The scatterplot and associated histograms show the raw PRT4 value (x-axis) plotted against the raw PRT5 value (y-axis) for the 270 HapMap Phase I DNA samples.

Comparison of PRT value and aCGH for CNV2 in HapMap samples

To determine which principal components of the multivariate aCGH data might be useful for validation of copy number calling, a scree plot for the PCA (Figure 56) and the association of each principal component with the total recall score were examined. The scree plot showed that the first principal component explained most of the variation (almost 60%) in the aCGH data generated using the Agilent 210K CNV chip. The first principal component was therefore used to validate the quality of integer copy number calling of CNV2 using the PRT ratio.

Figure 56: Scree plot for PCA of aCGH data generated using the Agilent 210K CNV chip for CNV2 in HapMap samples. The X-axis shows the number of principal components for the CNV2 aCGH data.

The scatter plot of the normalized PRT ratios from PRT3 and PRT4 indicated well-separated clusters for CNV2 of DMBT1. The average normalized PRT ratios of PRT3 and PRT4 were used to infer the integer copy number of CNV2. The average PRT ratios were compared with the first PC of the array CGH data previously generated using the Agilent 210K CNV chip. The scatter plot analysis of the PRT ratios and the 1PC of the aCGH data did not show any notable association (Figure 57). The histogram of the average PRT ratio showed well-separated data clusters, in contrast to the aCGH data, where there was considerable overlap of 1PC classes for the CNV2 region in HapMap samples. The scatter plot indicated the efficiency and robustness of the PRT assays in generating good-quality data for CNV2 copy number estimation using CNVtools.

Figure 57: Scatter plot of mean unrounded copy number value against the first PC of the Agilent aCGH signal for CNV2 in HapMap samples. Each point represents an individual sample; colours and shapes indicate the different copy number classes of CNV2.

Copy number estimations of CNV2

CNVtools was also used to calculate the copy number of CNV2 from transformed PRT data, as for the measurement of CNV1 copy number (Barnes et al., 2008). The average of the PRT3 and PRT4 ratios was used to draw a histogram (Figure 58). The distribution of average PRT ratios indicated eight clusters, and a number of samples with a ratio >5 were treated as outliers. Based on the number of clusters, a mixture model of eight components was used to measure the integer copy number of CNV2 using CNVtools.

Figure 58: Histogram of normalized PRT ratio of CNV2 for HapMap samples.

Integer copy numbers were inferred from transformed PRT data using a Gaussian mixture model in the R package CNVtools. A mixture model of eight components was fitted, based on the clustering of the normalized PRT data, but a number of outliers were present (Figure 59). These outliers were the higher copy number samples and were excluded from the CNVtools analysis to improve the clustering quality of the transformed PRT data. The resulting clustering quality score (Q) was 5.5. The posterior probabilities of the integer copy number call for each sample are plotted in Figure 60. Most samples showed a posterior probability > 0.9, indicating very high quality copy number calling for CNV2. Where this probability was below 0.8, the mean of a duplicate test was used to call the integer copy number.

Figure 59: Output of the clustering procedure using the transformed PRT data of CNV2 for HapMap samples. The coloured lines show the posterior probability for each of the eight copy number classes (copy number = 1; 2; 3; 4; 5; 6; 7; 8).
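The posterior probabilities used above to judge call quality can be illustrated with a minimal sketch. This is not the CNVtools implementation: it assumes the mixture has already been fitted, and the component weights, means and standard deviation below are hypothetical values chosen only for illustration. The 0.8 threshold is the one quoted in the text for triggering a duplicate test.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_copy_number(ratio, components):
    """Posterior probability of each copy number class for one sample.

    components: list of (copy_number, weight, mean, sd) tuples.
    """
    joint = {cn: w * normal_pdf(ratio, m, s) for cn, w, m, s in components}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("ratio is too far from every cluster; treat as an outlier")
    return {cn: p / total for cn, p in joint.items()}

def call_copy_number(ratio, components, threshold=0.8):
    """Return (best class, its posterior, whether the call passes the threshold)."""
    post = posterior_copy_number(ratio, components)
    best = max(post, key=post.get)
    return best, post[best], post[best] >= threshold

# Hypothetical mixture: eight equally weighted classes centred on 1..8, SD 0.15.
COMPONENTS = [(cn, 1.0 / 8, float(cn), 0.15) for cn in range(1, 9)]

print(call_copy_number(4.05, COMPONENTS))  # clear call for class 4
print(call_copy_number(4.50, COMPONENTS))  # ambiguous between classes 4 and 5
```

A sample lying midway between two cluster centres receives a posterior near 0.5 for each neighbouring class, which is the situation in which a repeat measurement is most useful.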

Figure 60: Analysis of integer copy number calling of CNV2 in HapMap samples. The scatter plot and associated histograms show mean unrounded copy number values generated by PRT3 and PRT4 plotted against the posterior probabilities of the integer copy number call for HapMap samples.

Distribution of CNV2 integer copy number

CNV2 copy numbers per diploid genome were measured using the PRT3 and PRT4 assays in four HapMap populations: CEU, JPT, CHB and YRI. CNV2 copy number was estimated in a total of 269 HapMap samples. Eleven diploid copy number classes were found for the CNV2 region in the HapMap populations, ranging from 1 to 11, with a mean copy number of 4.41 per diploid genome (Figure 61). Most HapMap CEU samples fell into the higher diploid copy number classes (>4) for CNV2, whereas most individuals from the HapMap Asian (JPT and CHB) and HapMap African (YRI) populations fell into the lower diploid copy number classes (<4) (Table 25).
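The sample totals and mean copy numbers quoted for the HapMap populations can be recomputed from the per-class counts in Table 25. The sketch below is only an arithmetic check; the counts are copied from the table, and the function name is illustrative.

```python
# Copy number counts from Table 25, for diploid copy numbers 1 to 11.
COUNTS = {
    "Total":   [6, 46, 35, 70, 42, 32, 15, 14, 7, 1, 1],
    "CEU":     [0, 1, 10, 24, 16, 20, 6, 8, 4, 1, 0],
    "JPT+CHB": [4, 19, 13, 24, 12, 8, 6, 2, 1, 0, 1],
    "YRI":     [2, 26, 12, 22, 14, 4, 3, 4, 2, 0, 0],
}

def summarise(counts):
    """Return (number of samples, mean diploid copy number rounded to 2 d.p.)."""
    n = sum(counts)
    mean = sum(cn * c for cn, c in enumerate(counts, start=1)) / n
    return n, round(mean, 2)

for population, counts in COUNTS.items():
    print(population, summarise(counts))
# Total (269, 4.41)
# CEU (90, 5.33)
# JPT+CHB (90, 4.03)
# YRI (89, 3.85)
```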

Table 25: CNV2 copy number frequencies in HapMap samples. Values are copy number counts, with frequencies in parentheses.

Diploid copy number   Total HapMap   CEU         JPT+CHB     YRI
1                     6 (0.02)       0 (0.00)    4 (0.04)    2 (0.02)
2                     46 (0.17)      1 (0.01)    19 (0.21)   26 (0.29)
3                     35 (0.13)      10 (0.11)   13 (0.14)   12 (0.14)
4                     70 (0.26)      24 (0.27)   24 (0.27)   22 (0.25)
5                     42 (0.16)      16 (0.18)   12 (0.13)   14 (0.16)
6                     32 (0.12)      20 (0.22)   8 (0.09)    4 (0.05)
7                     15 (0.06)      6 (0.07)    6 (0.07)    3 (0.03)
8                     14 (0.05)      8 (0.09)    2 (0.02)    4 (0.05)
9                     7 (0.03)       4 (0.04)    1 (0.01)    2 (0.02)
10                    1 (<0.01)      1 (0.01)    0 (0.00)    0 (0.00)
11                    1 (<0.01)      0 (0.00)    1 (0.01)    0 (0.00)
Total                 269            90          90          89
Mean                  4.41           5.33        4.03        3.85

Figure 61: Frequencies of integer copy number of CNV2 per diploid genome in different HapMap populations.

Validation of CNV2 data using long range PCR

Long-range PCR was used to validate the integer copy number of the CNV2 region. Individuals that showed low copy number (0-3 copies) of CNV2 were used for the assay. The long-range PCR successfully genotyped all low copy number (1-4) CNV2 samples among the HapMap samples.

Figure 62: Top: Schematic representation of CNV2 copy number estimation using long PCR. The CNV2 region is indicated with a red arrow and the positions of the PCR primers are indicated with arrows. CNV2 copy number-specific PCR product sizes are indicated below for 0 to 4 copies per haploid genome. Bottom: agarose gel of CNV2-specific PCR products from low copy number individuals. Total diploid copy numbers are shown above the gel and genotype status is shown below it. The copy number-specific PCR bands are indicated with arrows on the left side of the gel.

The long PCR consistently called individuals with 1 copy (allelic status 0 and 1) and 2 copies (allelic status 1 and 1), and called individuals with 3 and 4 copies depending on allelic status and DNA quality. In individuals with 3 (allelic status 2 and 1) or 4 (allelic status 2 and 2) diploid copies of CNV2, the 2-copy allele (11362 bp) was amplified depending on DNA quality. The long PCR identified the smaller allele but did not amplify alleles with 3 or more copies (Figure 62).

3.5 Discussion

Two regions (CNV1 and CNV2) of DMBT1 were genotyped using different PCR-based PRT assays. The two regions had been described previously as the 2-4/2-7 and 2-9/2-10 deletion polymorphisms using long PCR (Sasaki et al., 2002). The present study called diploid copy number without any copy number bias. The CNV1 (2-4/2-7) region was previously identified using long PCR, but that approach did not differentiate higher copy numbers (4 or more) (Renner et al., 2007; Sasaki et al., 2002). Long PCR also depends on the quality of the genomic DNA. The same problem was noticed for copy number calling of CNV2 (2-9/2-10) using long PCR.

However, the present study genotyped higher copy number individuals accurately, without any copy number bias, for both CNVs at DMBT1 using PRT-based assays. The WTCCC called CNV1 copy number using aCGH data on HapMap samples (Conrad et al., 2010). The present study showed a positive correlation between the two assays, but the PRT assays clustered better, without any overlap even at higher copy numbers. The aCGH data did not call CNV2 copy number, whereas the PRT assay did, although the clustering for higher copy numbers was not as good as for the CNV1 assays. The limitation of long PCR for both CNVs was not found in the PRT assays.

The present study found that both CNV1 and CNV2 were multiallelic CNVs in all HapMap populations. The modal copy number for CNV1 was 4 in all populations, but the copy number distribution differed between HapMap populations. African individuals showed a higher frequency (48%) of higher copy number CNV1 (copy number 5 and 6) than the European population (0%). The opposite trend was seen for lower copy numbers (2 and 3) of CNV1, which were more frequent in European samples (19%) than in African samples (2%). Both higher (19%) and lower (36%) copy number CNV1 were present in the CHB and JPT samples.

The mean CNV2 copy number also differed between HapMap populations. It was higher in the HapMap CEU population (5.33) than in HapMap JPT and CHB (4.03) and HapMap YRI (3.85). The African samples showed a higher frequency of the lower copy number classes (4 or less), whereas the European samples showed a higher frequency of the higher copy number classes (5 or more). The present study shows that PRT assays are robust in accurately calling both CNVs of DMBT1 and can be used to estimate both CNVs of DMBT1 in large case-control studies.

4 ANALYSIS OF COPY NUMBER VARIATION OF DMBT1 USING PHYSICAL MAPPING APPROACHES

4.1 Introduction

DMBT1 copy number was characterized accurately using PRT assays. Alternative approaches that were simultaneously convenient, economical and accurate were also explored for determining copy number at the multi-allelic CNV1 and CNV2 loci of DMBT1. DMBT1 copy number was analysed using two physical mapping approaches: Fibre-FISH (fluorescence in situ hybridization) and pulsed-field gel electrophoresis (PFGE). Fibre-FISH is a visual technique that provides an alternative way to identify chromosomal abnormalities from metaphase or interphase spreads using fluorescent probes. The strength of FISH lies in the direct visualization of DNA copy number at the single-cell level, and it also facilitates the identification of balanced structural variants such as inversions and translocations (Molina et al., 2012). Fibre-FISH has been applied satisfactorily in several high-resolution physical mapping studies (Leipoldt et al., 2007; Protopopov et al., 2003; Raap et al., 1996) and as a validation technique in CNV studies (Iafrate et al., 2004; Perry et al., 2008). PFGE in combination with Southern blotting is a powerful technique with respect to resolution (ability to resolve DNA sequences >12 Mb) and accuracy for measuring copy number (Carvalho & Lupski, 2008; Herschleb et al., 2007; Leach & Glaser, 1998). PFGE is useful both for restriction mapping and for characterizing large genomic rearrangements. A key advantage of both Fibre-FISH and PFGE is that they allow copy number to be determined per allele, which is important for studies of inheritance and disease. The aim of this study was to assess the potential of Fibre-FISH and PFGE to support PRT-based investigations of copy number variation within the DMBT1 genomic region.
4.2 Copy number variation of DMBT1 using Fibre-FISH

Analysis of DNA fibres

Molecular combing generated DNA fibres that were stretched into straight lines, and these were used for measuring DMBT1 copy number. The BAC clone (RP11-144H16), labelled with biotin-16-dUTP (Roche), covered the entire genomic region of DMBT1 together with its 5′ and 3′ flanking regions and appears as a blue thread along the DNA fibres. Each SRCR and SID block was covered with two probes, labelled with digoxigenin-11-dUTP (Roche) and DNP-11-dUTP (Perkin Elmer). As a result, each SRCR/SID block appeared as small red and green dots on the DNA fibres, with the aim that each SRCR domain was represented by one red and one green dot (Figure 63). A fosmid clone (G248P8718G1) from outside the DMBT1 genomic region was selected as the reference region, labelled with digoxigenin-11-dUTP (Roche) and shown as a red line at the 3′ end of the DMBT1 gene. A long blue thread indicated an intact DNA fibre, and fibres lacking both ends were considered broken. The 40 kb reference region was used to calibrate the fibre length of the SRCR region. The lengths of the test region, the reference region and the gap between them were measured using ImageJ software.

Figure 63: Schematic picture of DMBT1 Fibre-FISH. (A) Labelling strategy for the DMBT1 regions with different colour combinations. The whole region containing the DMBT1 gene is labelled using the BAC clone (blue thread), the CNV (test) regions are labelled using two PCR products (red and green dots) and the reference (non-CNV) region is labelled with the fosmid clone (red line). (B) The two CNV regions (CNV1 and CNV2) of DMBT1 are indicated with dotted boxes. (C) Available CNV data for the DMBT1 region from the DGV database (red, deletion; blue, duplication; brown, both deletion and duplication).

Measurements of DNA fibre lengths

Because every SRCR block was expected to be labelled with one red and one green dot, the total number of paired green and red dots should represent the total number of SRCR domains on a particular DNA fibre. I counted the red and green dots but encountered several problems. Firstly, in a few cases both red and green dots were missing from the middle portion of the fibre. Secondly, the red or the green dot was sometimes not labelled properly, so the alternating pattern of red and green dots was interrupted.
The PCR did not amplify the first and last SRCR blocks because the intronic regions of those blocks have lower sequence similarity than those of the other SRCR blocks. Each SRCR block consists of one exon for SRCR and two exons for SID, although one SID exon is absent between the 4th and 5th SRCR blocks; this may alter probe binding in this region. To overcome the difficulties in counting red and green dots, the lengths of the test and reference regions were measured using ImageJ software. The gap length between the test and reference regions was also measured, to check the length ratio of the reference and gap regions. Good-quality DNA fibres were used to measure fibre length (Figure 64). The length of the test region was calibrated using the 40 kb reference region on the same DNA fibre. Two independent measurements were compared to analyse the DNA fibres.

Figure 64: Analysis of DNA fibres of the DMBT1 region. A schematic picture of the DMBT1 region is shown above the DNA fibres. An example of the DNA fibres used for length measurement in the GM12673 sample of family 1447 is shown.

Analysis of the DMBT1 region in YRI family Y045

The trio of YRI family Y045 was analysed for copy number validation of the CNV1 region of DMBT1 using the Fibre-FISH assay (Figure 65). The integer copy number of CNV1 differed between family members, but the CNV2 region was non-variable. Because CNV2 was the same for all three samples, variation in the length of the test region represents CNV1 copy number variation. The integer copy numbers of the CNV1 and CNV2 regions for Y045 are given in Table 26.

Table 26: Family trio with integer copy number of CNV1 and CNV2 at DMBT1. The CNV1 copy number was variable but the CNV2 copy number was the same for all samples of family Y045.

Sample ID   Sex      Family   Relationship   CNV1   CNV2
GM19200     Male     Y045     father         4      2
GM19201     Female   Y045     mother         6      2
GM19202     Female   Y045     child          5      2

Figure 65: Fibre-FISH image of DNA from cell lines derived from YRI HapMap trio Y045. This trio was selected because copy number is invariant at CNV2 and varies at CNV1. The two alleles of different length can be distinguished in the heterozygous daughter GM19202.

Two independent measurements of the test regions of the DNA fibres were compared for all samples using a scatter plot.

Figure 66: Individual measurements of DMBT1 probe fibre length in family Y045. Each point represents two independent measurements of a single fibre, taken by Dr Ed Hollox (y-axis) and me (x-axis). Several broken and bent fibres were excluded from measurement for each individual. The mother clearly shows a longer DMBT1 allele than the father, and the daughter has both long and short alleles. All measurements (both x- and y-axis) are in kb, and the length difference between the 2-copy and 3-copy alleles was 13 kb, as predicted from long PCR and sequence analysis.

Because the selected samples were all identical in copy number for CNV2, the clusters of fibre length measurements indicate the allelic status of the integer copy number for the CNV1 region. The fibres of GM19200 (father) and GM19201 (mother) each formed a single cluster, but the two clusters differed from each other (~29.8 kb for the father and ~42.8 kb for the mother) (Figure 66). The fibres of GM19202 (daughter) formed two clusters (one at ~32.8 kb and the other at ~42.3 kb); the first represents the allele inherited from the father and the second the allele inherited from the mother. The CNV1 integer copy number of the daughter (GM19202, copy number 5) indicated that the smaller allele contained 2 copies of the CNV1 repeat and the larger allele 3 copies. The CNV1 allelic status was therefore 2+2 for the father (GM19200, CNV1 copy number 4), 3+2 for the daughter (GM19202, CNV1 copy number 5) and 3+3 for the mother (GM19201, CNV1 copy number 6).

The size of one unit of CNV1 was estimated by comparing the average fibre length of GM19200 (father) and GM19201 (mother). The genomic size difference between the 2-copy and 3-copy CNV1 alleles was almost 13 kb (Table 27), very similar to that predicted from the long PCR and analysis of gene structure (~12.7 kb). The size difference between the 2-copy and 3-copy CNV1 alleles in GM19202 (daughter) was 9.5 kb. This size difference indicated that the CNV2 allelic status of the daughter was 2 and 0, unlike GM19200 (father) and GM19201 (mother), where the allelic status was 1 and 1 at the CNV2 locus. The fibre lengths of GM19202 (daughter) therefore indicated that one unit of CNV2 changed the size by almost 3.5 kb, equivalent to the size of one SRCR domain.

Table 27: Estimated size of one unit of the CNV1 allele of DMBT1 using samples of HapMap YRI family Y045 by Fibre-FISH.

DNA sample   Size estimated using Fibre-FISH   CNV1 allele status   Size difference (kb)
GM19200      29.8 kb                           2 and 2              13 kb
GM19201      42.8 kb                           3 and 3
GM19202      32.8 kb and 42.3 kb               3 and 2              9.5 kb

Analysis of DMBT1 region in 1447 family

The maternal trio of the extended CEPH/UTAH pedigree 1447 was analysed for copy number validation of DMBT1. The integer copy number of CNV1 was the same for all three samples (4), but the CNV2 region was variable, so variation in the length of the test region represents CNV2 copy number variation. The integer copy numbers of the CNV1 and CNV2 regions are given in Table 28.

Table 28: Family trio with integer copy number of CNV1 and CNV2 at DMBT1. The CNV1 copy number was invariant but the CNV2 copy number was variable among the samples of CEU family 1447.

Sample ID   Sex      Family   Relationship   CNV1   CNV2
GM12762     Male     1447     Father         4      8
GM12763     Female   1447     Mother         4      3
GM12753     Female   1447     Daughter       4      5

Two independent measurements of the test regions of the DNA fibres were compared for these samples. The fibre lengths formed two clusters in every sample (Figure 67), indicating that the CNV2 region was heterozygous in all three samples.
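The length calibration and per-unit size estimate used in the Fibre-FISH analysis reduce to simple ratios, sketched below. The 40 kb reference length and the cluster centres for family Y045 are taken from the text; the function names and the image measurements in arbitrary units are hypothetical and serve only as an illustration.

```python
from statistics import mean

REFERENCE_KB = 40.0  # length of the fosmid reference region given in the text

def calibrate_kb(test_length, reference_length, reference_kb=REFERENCE_KB):
    """Convert a measured fibre length to kb using the reference on the same fibre.

    test_length and reference_length are in the same arbitrary image units.
    """
    return test_length / reference_length * reference_kb

def unit_size_kb(short_allele_kb, long_allele_kb, copy_difference=1):
    """Mean length difference between two allele clusters, per repeat unit."""
    return (mean(long_allele_kb) - mean(short_allele_kb)) / copy_difference

# Hypothetical image measurements (arbitrary units) for one fibre.
print(round(calibrate_kb(test_length=321.0, reference_length=300.0), 1))  # 42.8

# Cluster centres reported for family Y045: father (2+2) and mother (3+3).
print(round(unit_size_kb([29.8], [42.8]), 1))  # 13.0
```

In practice each cluster would contain many fibre measurements, so `unit_size_kb` takes lists and compares cluster means.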

Figure 67: Individual measurements of DMBT1 probe fibre length in family 1447. Each point represents two independent measurements of a single fibre, taken by Dr Ed Hollox (y-axis) and me (x-axis). Several broken and bent fibres were excluded from measurement for each individual. All measurements (both x- and y-axis) are in kb, and the length difference for a single unit of CNV2 is as predicted from long PCR and sequence analysis.

Because the selected samples were all identical in copy number for CNV1, the clusters of fibre length measurements indicate the allelic status of the integer copy number for the CNV2 region. The fibres of GM12762 (father) showed two fibre lengths for the CNV2 region (~46.4 kb and ~33.3 kb). The fibres of GM12763 (mother) also showed a length difference for the CNV2 region, but the two clusters were closer together (~34.5 kb and ~28.8 kb). The same pattern was seen for GM12753 (daughter), whose two clusters were also close together (~36.4 kb and ~31.8 kb). Based on the integer copy number of GM12763 (mother), the shared allele may contain either 2 or 3 copies of CNV2, but a 3-copy allele did not explain the integer CNV2 copy number of 5 for GM12753 (daughter). The CNV2 allelic status was 6+2 for GM12762 (father, CNV2 copy number 8), and 2+1 and 3+2 for GM12763 (mother, CNV2 copy number 3) and GM12753 (daughter, CNV2 copy number 5), respectively.

The size of one unit of CNV2 was estimated by comparing the average fibre lengths of GM12763 (mother) and GM12753 (daughter). The genomic size difference between the 2-copy and 3-copy CNV2 alleles was almost 4.6 kb in GM12753 (daughter), and the difference between the 2-copy and 1-copy CNV2 alleles was almost 4.7 kb in GM12763 (mother). The size differences in mother and daughter were almost the same and similar to that predicted from the long PCR and analysis of gene structure (~4.2 kb). The size difference between the 2-copy and 6-copy CNV2 alleles in GM12762 (father), which corresponds to 4 copies of CNV2, was 13.1 kb, lower than expected. The fibre length differences indicated that one unit of CNV2 changed the size by 4.6 to 4.7 kb, close to the size of a single SRCR domain (Table 29).

Table 29: Estimated size of one unit of CNV2 of DMBT1 using samples of HapMap CEU family 1447 by Fibre-FISH. Columns: DNA sample; size estimated using Fibre-FISH; CNV2 allele status; size difference (kb).

Copy number variation of DMBT1 using PFGE

Selection of restriction enzyme for genomic DNA digestion

The choice of restriction enzyme was the most important part of the pulsed-field gel electrophoresis. My aim was to select a restriction enzyme that digests the genomic DNA outside the copy number variable regions of DMBT1. I selected Sca I to digest the genomic DNA, based on the Human Feb. 2009 (GRCh37/hg19) assembly on the UCSC Genome Browser (Figure 68).

Figure 68: Selection of restriction enzyme for the DMBT1 gene. Restriction sites of Sca I are indicated with arrows, and fragment sizes are shown between adjacent cut sites within the DMBT1 region. The two copy number variable regions (CNV1 and CNV2) of DMBT1 are indicated with red and green arrows.

Selection of DNA samples

DNA samples were selected based on estimates of the integer copy number of the CNV1 and CNV2 regions of DMBT1 from the PRT assays. For DNA samples with 2 copies of CNV1, long-range PCR produced a single PCR band of ~4.2 kb, whereas samples with 3 copies produced two PCR bands, of ~4.2 kb and 16.9 kb. Two copies of CNV1 (allelic copy number 1 and 1) therefore represent a homozygous deletion, whereas 3 copies (allelic copy number 2 and 1) represent a heterozygous deletion.
Samples with four copies of CNV1 produced a single PCR band of 16.9 kb without any deleted fragment, indicating a homozygous non-deleted allele for CNV1. A change of one CNV1 copy therefore alters the DNA size by almost 12.7 kb at the genomic level. The long PCR identified the allelic status of CNV2 mainly for low copy number samples (0-4 copies), and the DNA size difference was ~4 kb for each unit of CNV2 copy number variation.

A total of four HapMap DNA samples were selected for size estimation of different copy numbers of the DMBT1 gene. Three DNA samples (NA18956, NA18555 and NA18517) varied in CNV1 copy number (2 to 4) but were invariant in CNV2 copy number (integer copy number 2). The allelic status of the CNV2 region was 1+1 for these three samples, whereas for the fourth sample (NA18507) the CNV2 allelic status was 2+1 (integer CNV2 copy number 3). The CNV1 and CNV2 integer copy numbers for all samples are given in Table 30.

Table 30: The samples, with integer copy numbers, used for PFGE. Integer copy numbers of the samples were measured using PRT assays. Columns: sample ID; CNV1 total copy number and genotype; CNV2 total copy number and genotype.

The genomic DNA was digested using the Sca I restriction enzyme. All the relevant Sca I restriction sites lie outside both copy number variable regions. Sca I is not sensitive to any known type of methylation. The positions of the restriction sites and the sizes of the restriction fragments are shown in Figure 68.

Southern blotting analysis of liquid genomic DNA

The digested DNA samples were fractionated on a 1% agarose gel by CHEF electrophoresis at 14 °C for 16 h with a 10 s switch time. A thin well comb (0.75 mm) was used to improve band intensity, and samples were loaded using pipette tips with large openings to avoid fragmentation during handling. The gel was stained with ethidium bromide and checked for smearing of the digested DNA samples before Southern blot hybridization. The distribution pattern of Sca I restriction fragments was the same for all four samples, as was the intensity of the smear, so the gel was suitable for Southern blotting. The DNA was transferred from the gel to the membrane and processed for Southern blotting (Figure 69).

Figure 69: Agarose gel showing uniform smearing of gDNA after Sca I digestion. The smearing of gDNA indicates complete digestion before transfer to the membrane for Southern blotting. M indicates the DNA size standard (HyperLadder I, Bioline), and DNA samples are indicated above each lane.

DNA size analysis using Southern blotting

Southern blot analysis showed different band patterns for the different genomic DNA samples (Figure 70). It showed one band in lane 2 (NA18956), two bands (one small and one large) in lane 3 (NA18555), one band (equal in length to the larger band in lane 3) in lane 4 (NA18517) and one band in lane 5 (NA18507).

Figure 70: Southern blot analysis of genomic DNA using the DMBT1 SRCR probe. The DNA was digested with the restriction enzyme Sca I and fractionated by CHEF electrophoresis (BioRad). The DNA was fragmented using UV nicking and transferred to the membrane. M indicates the DNA size standard (PFGE ladder), in kb.

The PFGE ladder was run on both sides of the gel (lanes 1 and 6) to estimate the size of the DMBT1 fragments. The migration distances of the known DNA fragments of both ladders were measured using ImageJ software. The mean migration distance was calculated, and a standard curve was plotted from the DNA size (kb) and migration distance (mm) of each fragment (Figure 71).

Figure 71: Standard curve using the known size standards of the PFGE ladder.

Table 31: Estimated size of the DMBT1 region from different control DNA samples using PFGE combined with Southern blotting. The difference in length between the 2-copy CNV1 allele and the 3-copy CNV1 allele was almost 12 kb, as predicted from the long PCR (~12.7 kb) and analysis of gene structure. Columns: DNA sample; size estimated using PFGE; allele status; size difference. Two samples were homozygous; the heterozygous sample gave two bands, one of 69 kb, with a difference of almost 12 kb between the longer and shorter alleles; the allele status of NA18507 was unknown and its band did not explain the size difference.

The PFGE for sample NA18507 produced one strong band of 47 kb, which did not explain the allelic status of the sample. The band might have been generated by a Sca I restriction site within the NA18507 allele. The size differences are shown in Table 31.

Discussion

The experiments in this chapter were designed to characterize the allelic status of the DMBT1 gene using physical mapping approaches. DMBT1 copy number was analysed and validated using two physical mapping approaches: Fibre-FISH (fluorescence in situ hybridization) and pulsed-field gel electrophoresis (PFGE). Both methods estimated DNA size differences and allelic status at the chromosomal level, and the estimates were similar to those predicted from DNA sequence analysis and long PCR. Fibre-FISH estimated a genomic size difference of ~13 kb for one unit of CNV1, whereas PFGE estimated ~12 kb. Sequence analysis and long PCR predicted a size difference of 12.7 kb, covering 4 tandem-repeated SRCR domains (SRCR3 to SRCR4). Fibre-FISH estimated a size difference of 4.6 to 4.7 kb for one unit of CNV2, indicating that a single SRCR domain is involved in a one-unit change of CNV2. The predicted size difference, estimated from sequence analysis and long PCR assays, was 4.2 kb. Because the size difference for one unit of CNV2 is small (4.2 kb), PFGE did not resolve the DNA fragments accurately enough for single-unit CNV2 changes. PFGE was therefore used to estimate the size difference for the CNV1 locus only, which covers 12.7 kb at the genomic level.

The present study indicates that both the Fibre-FISH and PFGE approaches have considerable advantages. Both approaches successfully estimated the allelic status of both CNVs at DMBT1, which was not possible with the PRT or aCGH approaches. Fibre-FISH and PFGE also estimated, almost exactly, the size of the DNA fragments at the chromosomal level according to the copy number of DMBT1. However, both techniques were labour-intensive and low-throughput and required high molecular weight DNA or agarose-embedded DNA, which is not always available from archived samples. In addition, highly variable regions are difficult to interpret in Fibre-FISH if signals overlap, and characterization of rearrangements using PFGE depends on restriction enzyme sites (Cantsilieris et al., 2012).
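Estimating fragment size from the PFGE ladder, as described in this chapter, amounts to fitting a standard curve and reading unknown bands off it. The sketch below assumes a linear relationship between log10(size) and migration distance, which is a common approximation and not necessarily the curve shape used for Figure 71. The ladder values are synthetic and hypothetical, generated only so that the example runs.

```python
import math

def fit_standard_curve(sizes_kb, distances_mm):
    """Least-squares fit of log10(size) = intercept + slope * distance.

    Returns (slope, intercept).
    """
    logs = [math.log10(s) for s in sizes_kb]
    n = len(logs)
    mean_x = sum(distances_mm) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances_mm)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances_mm, logs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def estimate_size_kb(distance_mm, slope, intercept):
    """Read a fragment size (kb) off the fitted curve."""
    return 10 ** (intercept + slope * distance_mm)

# Hypothetical ladder: migration distances (mm) and synthetic sizes (kb).
LADDER_MM = [10.0, 20.0, 30.0, 40.0]
LADDER_KB = [10 ** (2.2 - 0.02 * d) for d in LADDER_MM]

SLOPE, INTERCEPT = fit_standard_curve(LADDER_KB, LADDER_MM)
print(round(estimate_size_kb(25.0, SLOPE, INTERCEPT), 1))  # about 50.1
```

With a real ladder the mean of the two ladder lanes would be used for each fragment, as in the text, and the fit should be restricted to the size range in which the relationship is close to linear.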

5 ANALYSIS OF SEGREGATION PATTERNS AND DE NOVO MUTATION RATES AT THE DMBT1 GENE

5.1 Aim of the study

Three-generation CEPH pedigrees were typed to infer the mutation rate of both CNV1 and CNV2 of DMBT1, based on the segregation of the copy number variable regions from parents to offspring. The PRT assays estimate total copy number per diploid genome without detecting the allelic status (genotype) of a particular locus. Predicting the genotype from diploid copy number is essential for following the segregation of paternal and maternal haplotypes from parents to children. The allele spectrum underlying the diploid copy numbers of both CNV1 and CNV2 was determined using CoNVEM (Gaunt et al., 2010). CoNVEM uses an expectation-maximization approach to estimate, from diploid CNV data, both the allelic copy number frequency distribution and the expected copy number genotype and class distribution under Hardy-Weinberg equilibrium (HWE). De novo copy number mutations were assigned manually, based on haplotypes of DMBT1 and segregation information from two microsatellite markers (DMBT1-m1 and DMBT1-m2) located outside the copy number variable regions of the DMBT1 gene.

5.2 Estimation of DMBT1 copy number in CEPH pedigree samples

Analysis of CNV1 copy number in CEPH pedigree samples

The histogram of average PRT ratios for CNV1 indicated four well-separated clusters without any overlap (Figure 72). The number of clusters (4) was used to measure the integer copy number of CNV1 using CNVtools.
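The expectation-maximization idea described in Section 5.1 can be sketched in a few lines. This is a generic EM under Hardy-Weinberg equilibrium, not the CoNVEM code; the function name, the allele sets and the diploid counts are hypothetical and chosen only for illustration.

```python
def em_allele_frequencies(diploid_counts, alleles, iterations=200):
    """Estimate haploid allele frequencies from diploid copy number counts.

    diploid_counts: {diploid copy number: number of individuals}
    alleles: haploid copy numbers assumed to exist in the population
    """
    freq = {a: 1.0 / len(alleles) for a in alleles}
    n_chromosomes = 2 * sum(diploid_counts.values())
    for _ in range(iterations):
        expected = {a: 0.0 for a in alleles}
        for total, n in diploid_counts.items():
            # E-step: ordered allele pairs (a, b) with a + b == total,
            # weighted by their probability under HWE.
            pairs = [(a, total - a) for a in alleles if (total - a) in freq]
            weights = [freq[a] * freq[b] for a, b in pairs]
            norm = sum(weights)
            if norm == 0.0:
                continue  # class cannot be explained by the assumed alleles
            for (a, b), w in zip(pairs, weights):
                share = n * w / norm
                expected[a] += share
                expected[b] += share
        # M-step: expected allele counts become the new frequencies.
        freq = {a: expected[a] / n_chromosomes for a in alleles}
    return freq

# Hypothetical data in exact HWE for alleles 1 (p = 0.2) and 2 (p = 0.8).
print(em_allele_frequencies({2: 4, 3: 32, 4: 64}, alleles=[1, 2]))
```

With only two alleles every diploid class has a single possible genotype, so the estimate is exact. With three or more alleles a class such as diploid copy number 4 can arise from 2+2 or 1+3, and the E-step splits those individuals according to the current frequencies.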

Figure 72: Histogram of mean normalized PRT ratio of CNV1 in CEPH pedigree samples. The x-axis shows the mean unrounded PRT ratio.

The mean PRT ratios were transformed to have a standard deviation of 1 to facilitate clustering, and integer copy numbers were inferred from the transformed PRT data using a Gaussian mixture model. A mixture model of four components was fitted, based on the clustering of the normalized PRT data (Figure 73). A clustering quality score (Q) was calculated for the fitted model. The first cluster indicated a diploid copy number of 2, and the remaining clusters were assigned diploid copy numbers 3, 4 and 5, respectively.

Figure 73: Output of the clustering procedure using the transformed PRT data of CNV1 in CEPH pedigree samples. The coloured lines show the posterior probability for each of the four copy number classes (copy number = 2; 3; 4 or 5). The x-axis shows the transformed PRT ratio for CNV1.
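The transformation to unit standard deviation mentioned above can be sketched as a simple rescaling. The choice of the sample standard deviation and the example ratios are assumptions made for illustration; the text does not state exactly how the scaling was computed.

```python
from statistics import stdev

def scale_to_unit_sd(ratios):
    """Divide every ratio by the sample standard deviation of the data set."""
    sd = stdev(ratios)
    return [r / sd for r in ratios]

ratios = [1.02, 1.48, 1.51, 2.03, 1.97, 2.49]  # hypothetical mean PRT ratios
scaled = scale_to_unit_sd(ratios)
print(round(stdev(scaled), 6))  # 1.0
```

Because every value is divided by the same constant, the relative spacing of the clusters is unchanged; only the scale on which the mixture model is fitted differs.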

The posterior probabilities of the integer copy number call for each sample were plotted (Figure 74). The posterior probabilities of the copy number calls in the CEPH pedigree samples were > 0.99, indicating very high quality copy number measurement for CNV1. The posterior probability scores of individuals showing discrepant (non-Mendelian) copy number calls are highlighted with red crosses, to indicate that these were likely to be real events rather than copy number typing errors.

Figure 74: Analysis of integer CNV1 copy number calling in CEPH pedigree samples. The scatter plot and associated histogram show mean unrounded copy number values generated by PRT1 and PRT2 plotted against the posterior probabilities of the integer copy number call. Samples that show discrepant (non-Mendelian) copy number calls are shown by red crosses.

Distribution of integer CNV1 copy number

In total, 522 CEPH pedigree samples were typed for CNV1 copy number. In the CEPH pedigrees, diploid copy number varied from 2 to 5, with a modal copy number of 4 (73%). The mean copy number for CNV1 was 3.82 per diploid genome. The count and frequency of each copy number class are shown in Table 32.

Table 32: Frequency of CNV1 copy number of DMBT1 in CEPH pedigrees. Columns: diploid copy number; copy number count; frequency. Total: 522 samples; mean copy number: 3.82.

Analysis of CNV2 copy number

The raw PRT ratios of the PRT3 and PRT4 assays were normalized using standard reference DNA. The histograms of PRT ratios for PRT3 (red) and PRT4 (green) in 522 CEPH pedigree samples showed a similar distribution for both assays and very good quality data clusters (Figure 75).

Figure 75: Histograms of PRT ratio for PRT3 (left) and PRT4 (right) in the CEPH pedigree samples. The x-axis shows the PRT ratio for CNV2 in both plots.

The PRT ratios from PRT3 and PRT4 were compared using a scatter plot to check the specificity of the assays for the CNV2 locus in CEPH pedigrees (Figure 76). The scatter plot showed very good quality clusters without any overlap. The raw PRT ratio was higher for a few samples, and the clusters were noisier for those samples. Samples that showed poor correlation between the two assays were retyped using the PRT assays before CNVtools analysis.

Figure 76: Scatter plot produced by PRT3 and PRT4 assays of CNV2 estimation in CEPH pedigree samples.

The histogram of the average PRT ratio of CNV2 in CEPH pedigree samples indicated eight clusters, and a few samples showed a high PRT ratio (Figure 77). The number of clusters (8) was used as the number of components when inferring the integer copy number of CNV2 using CNVtools.

Figure 77: Histogram of mean normalised PRT ratio of CNV2 in CEPH pedigree samples. The x-axis shows the mean PRT ratio of CNV2.

To facilitate clustering in the CNVtools analysis, mean PRT ratios were transformed to have a standard deviation of 1. A mixture model of eight components was fitted after excluding samples with higher PRT ratios as outliers. Integer copy numbers were inferred from the transformed PRT data using a Gaussian mixture model (Figure 78). The resulting clustering quality score (Q) was The first cluster indicated diploid copy number 2 and the remaining clusters were assigned diploid copy numbers from 3 to 10, respectively.

Figure 78: Output of the clustering procedure using the PRT transformed data of CNV2 in CEPH pedigree samples. The coloured lines show the posterior probability for each of the copy number classes (copy number = 2, 3, 4, 5, 6, 7, 8, 9 or 10). The x-axis shows the transformed PRT ratio of CNV2.

CNV2 was a multi-allelic copy number variation, and total copy number was found to vary from 2 to 10 per diploid genome in the CEPH families. For the CNV2 region, the posterior probability of the copy number call was greater than 0.95 in most of the CEPH pedigree samples (Figure 79), indicating very high quality of copy number measurement for CNV2. The posterior probability scores of individuals showing discrepant copy number calls are highlighted with red crosses. For all individuals showing discrepant copy number calls, the posterior probability was greater than 0.99, except for two individuals where the posterior probability was greater than The posterior probabilities indicated that the discrepant copy number calls were likely to be real events, not copy number typing errors.
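The posterior probabilities described above can be illustrated with a minimal sketch. This is not the CNVtools implementation: it assumes a one-dimensional Gaussian mixture whose component means, shared standard deviation and mixing weights are already known, and all numerical values below are hypothetical. It simply applies Bayes' rule to one transformed PRT ratio.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def copy_number_posteriors(x, means, sd, weights):
    """Posterior probability of each copy number class for one ratio x."""
    joint = {cn: weights[cn] * normal_pdf(x, means[cn], sd) for cn in means}
    total = sum(joint.values())
    return {cn: value / total for cn, value in joint.items()}

# Hypothetical mixture: one component per diploid copy number 2..10,
# evenly spaced means, equal weights, shared standard deviation.
means = {cn: 0.5 * cn for cn in range(2, 11)}
weights = {cn: 1 / len(means) for cn in means}

post = copy_number_posteriors(2.52, means, sd=0.08, weights=weights)
call = max(post, key=post.get)   # integer copy number with the highest posterior
confident = post[call] > 0.99    # same style of threshold as quoted in the text
```

In a real analysis the component parameters would be estimated from the data rather than fixed in advance, which is what the mixture-model fit does.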

Figure 79: Analysis of integer copy number calling of CNV2 in CEPH families. Scatter plot and associated histogram show mean unrounded copy number values generated by PRT3 and PRT4, plotted against posterior probabilities of the integer copy number call for CEPH pedigree samples. The samples that show discrepant (non-Mendelian) copy number calls are indicated by red crosses.

Distribution of integer CNV2 copy number

The CNV2 copy number was estimated in 522 CEPH samples. A total of nine diploid copy number classes were found in the CEPH pedigrees, with diploid copy number between 2 and 10. Three diploid copy number classes (4, 5 and 6) were very common in CEPH samples, with a modal copy number of 5 (25%). The mean copy number for CNV2 was 5.31 per diploid genome in the CEPH pedigrees. The frequency distribution of CNV2 copy number classes in the CEPH pedigrees is shown in Table 33.

Table 33: Frequency distribution of CNV2 copy number in CEPH families.
Diploid copy number    Copy number count    Frequency
Total                  522
Mean copy number       5.31

Allelic architecture and copy number genotype

The PRT assays estimate diploid copy number for both CNV1 and CNV2 of DMBT1 but give no information on the underlying alleles. Inferring genotypes at both CNV1 and CNV2 was therefore important for establishing the haploid state of the DMBT1 region. The allelic architecture and copy number genotypes were estimated using the CoNVEM program (Gaunt et al., 2010), which estimates both the allelic copy number frequency distribution and the expected copy number genotype and class distribution under Hardy-Weinberg equilibrium (HWE). The diploid copy numbers of CNV1 and CNV2 from 80 unrelated individuals (parents) of the 40 CEPH pedigrees were used for CoNVEM analysis. Individuals whose genotypes did not follow the segregation of parental haplotypes for a particular allele were considered to be mutants. The CoNVEM analysis gave three potential solutions for CNV1, and the solution with the lowest chi-square (chi-square = 0.83) fitted the CNV1 class data well (Figure 80). For CNV2 there were two possible solutions (chi-square = 0.86 and 1.82), and both solutions explained the CNV2 class data (Figure 80).
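The relationship between allele frequencies and diploid copy number classes under HWE can be sketched as follows. This is an illustrative sketch only, not the CoNVEM code: the function names, the allele range and all counts are hypothetical. The first function computes expected class frequencies from allele frequencies; the second runs a simple expectation-maximisation loop in the opposite direction, from observed class counts to allele frequencies.

```python
from collections import defaultdict

def expected_class_frequencies(allele_freqs):
    """Expected diploid copy number class frequencies under HWE."""
    classes = defaultdict(float)
    for a, pa in allele_freqs.items():
        for b, pb in allele_freqs.items():
            classes[a + b] += pa * pb   # ordered pairs, so heterozygotes count twice
    return dict(classes)

def em_allele_frequencies(class_counts, max_allele, iterations=200):
    """Estimate allele frequencies from diploid class counts by EM."""
    alleles = range(max_allele + 1)
    freqs = {a: 1.0 / (max_allele + 1) for a in alleles}
    for _ in range(iterations):
        allele_counts = dict.fromkeys(alleles, 0.0)
        for total, count in class_counts.items():
            # E-step: split each class over the allele pairs that sum to it
            pairs = [(a, total - a) for a in alleles if 0 <= total - a <= max_allele]
            pair_weights = [freqs[a] * freqs[b] for a, b in pairs]
            norm = sum(pair_weights)
            if norm == 0.0:
                continue   # class cannot be formed from the allowed alleles
            for (a, b), w in zip(pairs, pair_weights):
                allele_counts[a] += count * w / norm
                allele_counts[b] += count * w / norm
        assigned = sum(allele_counts.values())
        if assigned == 0.0:
            break
        # M-step: renormalise expected allele counts to frequencies
        freqs = {a: allele_counts[a] / assigned for a in alleles}
    return freqs

# Hypothetical allele frequencies and class counts, for illustration only.
expected = expected_class_frequencies({1: 0.10, 2: 0.80, 3: 0.10})
estimated = em_allele_frequencies({3: 10, 4: 60, 5: 10}, max_allele=3)
```

Because several different allele combinations can give the same diploid total, an EM fit of this kind can settle on different solutions from different starting points, which is consistent with the multiple solutions reported for each locus.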

Figure 80: CoNVEM analysis for CNVs of DMBT1 using data from the unrelated parents of 40 CEPH families. The left bar plot compares observed and expected CNV1 frequencies, and the right bar plot compares the observed CNV2 frequencies with the two sets of expected CNV2 frequencies.

The application of the CoNVEM approach to the CNV1 data suggested that the 2-copy allele was particularly prominent in CEPH pedigrees compared with the 1-copy and 3-copy alleles. The allele frequencies observed in the segregation analysis were compared with the expected CNV1 allele frequencies; the observed and expected allele frequencies were very similar for CNV1 (Figure 81).

Figure 81: Comparison of allele frequencies for CNVs of DMBT1 using data from the unrelated parents of 40 CEPH pedigrees. The left bar plot compares observed and CoNVEM-estimated allele frequencies for CNV1, and the right bar plot compares the observed allele frequencies with the two sets of CoNVEM-estimated allele frequencies for CNV2.

The CoNVEM approach gave two equally likely solutions for the CNV2 data, with alleles ranging from the 0-copy allele (completely deleted) to the 7-copy allele. The observed allele frequencies showed that the 2-copy and 3-copy alleles were more common in the CEPH pedigrees than the other alleles. According to the CoNVEM estimates, the 2-copy allele was the most frequent in result 1, whereas the 3-copy allele was the most frequent in result 2 in unrelated CEPH samples. Neither estimated CNV2 allelic distribution matched the observed allele frequencies.

The CoNVEM analysis also estimated the possible genotype frequency matrix for the CNV classes, which was very useful in the segregation analysis for predicting the haplotype status of parental genotypes. For CNV1, CoNVEM gave all possible genotype combinations; the result was reliable and matched our observed genotypes well (Table 34).

Table 34: Estimated genotype frequencies for CNV1 after CoNVEM analysis.
Allele

The CoNVEM approach produced two possible genotype combinations for the CNV2 locus. Both combinations explained the CNV2 locus (Table 35 and Table 36), but neither matched the observed genotype data. CoNVEM uses the expectation-maximization (EM) algorithm, which is normally used to estimate haplotype frequencies for single nucleotide polymorphisms (SNPs). CoNVEM analysis estimates a solution for the allele frequencies assuming perfect HWE and random sampling of the population. EM algorithms only estimate population allele frequencies and cannot determine individual-level genotypes.

Table 35: Estimated genotype frequencies for CNV2 based on result 1 after CoNVEM analysis.
Allele

Table 36: Estimated genotype frequencies for CNV2 based on result 2 after CoNVEM analysis.
Allele

Detection of de novo mutation

De novo mutations were identified by Mendelian segregation analysis. All possible parental haplotypes were generated manually, taking into account the CNV and STR information from all offspring. The CNVs and STRs of the offspring were then described in terms of the segregation of parental haplotypes. CNVs that did not match the parental haplotypes, and that could not be explained by a crossover in a non-CNV region in one of the parental haplotypes, were identified as de novo mutations at that locus. Two CEPH families (CEPH/FRENCH pedigrees 12 and 1424) are shown with all haplotypes and a de novo mutation for CNV1 and CNV2 of DMBT1 in Figure 82 and Figure 83.

Figure 82: Analysis of CEPH/FRENCH pedigree 12 for detection of de novo mutation. The total copy number and STR data of each family member are shown next to each sample, and allelic status is shown in solid coloured boxes based on the parental allele. The de novo mutation is highlighted in red boxes. Offspring GM12567 shows a de novo mutation at the CNV1 locus, but it is difficult to determine whether the maternal or the paternal haplotype is mutated. The diploid CNV1 copy number for the sample is 5, but the two inherited parental haplotypes carry CNV1 alleles 2 and 2, giving an expected total of 4. The mutation is therefore a gain of CNV1.
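The comparison made for offspring GM12567 can be written as a small check. This is a minimal sketch that assumes the two inherited parental haplotypes have already been identified (in the study this was done manually from CNV and STR data); the function name and return strings are hypothetical.

```python
def classify_offspring(paternal_allele, maternal_allele, observed_diploid):
    """Compare an observed diploid copy number with the sum of the two
    inherited parental alleles and report any discrepancy."""
    expected = paternal_allele + maternal_allele
    difference = observed_diploid - expected
    if difference == 0:
        return "consistent with Mendelian inheritance"
    if difference > 0:
        return f"gain of {difference} repeat unit(s)"
    return f"loss of {-difference} repeat unit(s)"

# Values from the pedigree 12 example: inherited alleles 2 and 2, observed total 5.
result = classify_offspring(2, 2, 5)
```

A check of this kind flags a discrepancy but cannot tell which parental haplotype changed, and it does not distinguish mutation from recombination between parental haplotypes.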

Figure 83: Analysis of CEPH/FRENCH pedigree 1424 for detection of de novo mutation. The total copy number and STR data of each family member are shown next to each sample, and allelic status is shown in solid coloured boxes based on the parental allele. The de novo mutation is highlighted in red boxes. Offspring GM11922 shows a de novo mutation at the CNV2 locus, but it is difficult to determine whether the paternal or the maternal allele is mutated. The diploid CNV2 copy number for the sample is 6, but the two inherited parental haplotypes carry CNV2 alleles 5 and 2, giving an expected total of 7. The mutation is therefore a loss of CNV2.

Estimation of mutation rate

Detection of de novo mutations showed that some events were ambiguous and could be explained by either mutation or recombination (Figure 84). After removing 4 events for CNV2

only, 9 CNV1 events and 20 CNV2 events from 632 meioses remained. De novo copy number mutations were identified at both loci, with a mutation rate at CNV1 of 1.4% per gamete (9 out of 632 meioses, 95%CI %) and a mutation rate at CNV2 of 3.2% per gamete (21 out of 632 meioses, 95%CI %). These mutation rate estimates place both loci amongst the most highly mutating loci known, with comparable rates seen only for non-coding minisatellites (Jeffreys et al., 1988). All mutations were a loss or gain of one CNV repeat unit, with no evidence of a bias towards loss or gain.

Figure 84: Analysis of CEPH/FRENCH pedigree 1362 for detection of de novo mutation. The children with ambiguous de novo events are shown with both combinations. The upper combination shows the recombination event and the lower combination shows the mutation event for both children. Offspring GM11982 and GM11988 show a copy number change for CNV, which might be explained as either de novo mutation or recombination.

5.4 Discussion

CEPH pedigrees serve as a large reference collection that has been widely used as a benchmark for the analysis of genetic variants, the estimation of de novo mutation, and a variety of other applications (Bosch et al., 2009; Eriksson & Manica, 2011; Kirszenbaum et al., 1997; Martin et al., 2012; Rosenberg, 2006b; Stevens et al., 2012). Here a total of 40 multi-generation pedigrees were analysed to better understand the allelic architecture of both CNVs and to estimate the de novo mutation rate for the two CNVs of the DMBT1 gene. Analysis of different control populations

(HapMap, HGDP and HRC) showed that both regions are multi-allelic and highly copy number variable in different populations. High allelic diversity was found for CNV1 and CNV2 of DMBT1, and various combinations of copy number genotype were found at both loci in different CEPH individuals. CoNVEM analysis of the CNV1 locus gave one statistically and biologically plausible solution, and so might be considered more reliable. CoNVEM analysis of the CNV2 locus indicated two statistically plausible solutions, but neither model explained the observed allelic distribution for CNV2. The greater complexity of the CNV spectrum at the CNV2 locus might largely be attributable to its higher mutation rate.

The mutation rate of both CNVs was estimated using the allelic architecture and copy number genotypes of DMBT1. The mutation rate at CNV1 was calculated to be 1.4% per gamete and the mutation rate at CNV2 was estimated to be 3.3% per gamete. The mutation rate for both loci was higher than the mutation rates observed at other multicopy CNV loci. All mutations were either a gain or a loss of a single CNV repeat unit, without any evidence of a bias towards loss or gain. The estimated mutation rate of both loci is equivalent to the mutation rate of non-coding minisatellites (Jeffreys et al., 1988). The CNV regions are tandem repeats of highly similar sequences. As a result, non-allelic homologous recombination may be responsible for the formation of de novo copy number variation at both loci. It can be speculated that, since DMBT1 acts as a pattern recognition receptor in the innate immune system and binds Gram-positive and Gram-negative bacteria as well as viruses, infectious diseases are perhaps indirectly responsible for the high mutation rate and population-specific evolution of both CNVs of the DMBT1 gene.
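The per-gamete mutation rates quoted above are simple proportions of events per meiosis. The sketch below reproduces the point estimates from the counts given in parentheses in the results (9 and 21 events in 632 meioses) and attaches a 95% confidence interval. The Wilson score interval is used here as an assumption, since the interval method used in the study is not stated.

```python
import math

def wilson_interval(events, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = events / trials
    denom = 1 + z ** 2 / trials
    centre = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return centre - half_width, centre + half_width

meioses = 632
cnv1_rate = 9 / meioses                      # about 1.4% per gamete
cnv2_rate = 21 / meioses                     # about 3.3% per gamete
cnv1_low, cnv1_high = wilson_interval(9, meioses)
cnv2_low, cnv2_high = wilson_interval(21, meioses)
```

Other interval methods, such as an exact binomial interval, would give slightly different bounds for counts this small.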

6 DETERMINATION OF EXTENT OF DIVERSITY AND EVOLUTIONARY BASIS OF DMBT1 COPY NUMBER IN GLOBAL POPULATIONS

6.1 Introduction

The HGDP collection is the most complete worldwide human DNA collection that is available to not-for-profit researchers. The Human Genome Diversity Project (HGDP) provides a resource aimed at promoting worldwide research on human genetic diversity, with the ultimate goal of understanding how and when patterns of diversity were formed. A resource of 1063 cultured lymphoblastoid cell lines (LCLs) from 1050 individuals in 52 world populations is banked at the Foundation Jean Dausset-CEPH in Paris. These LCLs were collected from various laboratories by the HGDP and CEPH in order to provide unlimited supplies of DNA and RNA for studies of sequence diversity and the history of modern human populations. Information for each LCL is limited to the sex of the individual, the population and the geographic origin.

6.2 Estimation of DMBT1 copy number in HGDP samples

Analysis of CNV1 copy number in HGDP samples

The CNV1 copy number of DMBT1 was estimated using two PRT assays, PRT1 and PRT2. The raw PRT ratios of both PRT assays were normalised using standard reference DNA. The histograms of the normalised PRT ratios of PRT1 (red) and PRT2 (green) were analysed in 971 HGDP-CEPH samples (Figure 85). The distribution of PRT ratios showed very clear clustering in the HGDP-CEPH panel for both PRT assays.

Figure 85: Histogram of raw PRT ratio of PRT1 (left side) and PRT2 (right side) in the HGDP-CEPH samples. The x-axis shows PRT ratio for CNV1 in HGDP-CEPH samples.

The PRT ratios of the PRT1 and PRT2 assays were compared to check the quality of the data from both assays. The normalised PRT1 ratio was plotted on the x-axis against the PRT2 ratio on the y-axis, and the clustering of PRT ratios was checked. The scatter plot showed good quality clusters without any overlap of PRT ratios (Figure 86).

Figure 86: Scatter plot produced by PRT1 and PRT2 assays of CNV1 estimation in HGDP samples.

The histogram of the average PRT data indicated six well-separated clusters (Figure 87). The number of clusters was used as the number of components when inferring the integer copy number of CNV1 using the R package, CNVtools v

Figure 87: Histogram of mean normalised PRT ratio of CNV1 for HGDP samples. The x-axis shows mean PRT ratio for CNV1 in HGDP-CEPH samples.

Integer copy numbers were inferred from the transformed PRT data using a Gaussian mixture model. A mixture model of six components was fitted, based on the clustering of the normalised PRT data (Figure 88). The resulting clustering quality score (Q) was The first cluster indicated diploid copy number 2 and the rest of the clusters were assigned diploid copy numbers 3, 4, 5, 6 and 7, respectively.

Figure 88: Output of the clustering procedure using the PRT transformed data of CNV1 for HGDP samples. The coloured lines show the posterior probability for each of the six copy number classes (copy number = 2, 3, 4, 5, 6 or 7).

The posterior probabilities of the integer copy number call for each sample are plotted in Figure 89. The posterior probability of the copy number call for HGDP samples was > 0.99, indicating very high quality of copy number measurement for CNV1.

Figure 89: Analysis of CNV1 integer copy number calling using CNVtools. Clustering of the data followed by fitting of a Gaussian mixture model allowed integer copy number calling from the normalised mean unrounded values of CNV1 generated by PRT1 and PRT2. The Bayesian posterior probabilities of each CNV1 copy number call are shown for the HGDP samples.

Distribution of CNV1 diploid copy number in HGDP

A total of 971 individuals from 52 populations in 7 geographical regions were genotyped for CNV1 copy number. CNV1 diploid copy number ranged from 2 to 7. Of the 971 samples, 630 (65%) had diploid copy number 4. A total of 174 samples (18%) had diploid copy number 3 and 104 individuals (11%) had diploid copy number 5. The copy number counts and frequencies of CNV1 diploid copy number are presented in Table 37.

Table 37: CNV1 diploid copy number frequencies in HGDP samples.
Diploid copy number    Copy number count    Copy number frequency
<0.01
Total                  971
Mean                   4.00

The worldwide copy number distribution is also presented as pie charts on a world map (Figure 90). The size of each pie chart reflects the number of individuals in a particular population, and the sectors show the frequencies of the diploid copy number classes.

Figure 90: Population distribution of diploid CNV1 copy number. Distribution of CNV1 integer copy number in the HGDP populations; pie charts are sized in proportion to sample size.

Distribution of CNV1 diploid copy number in different geographical regions

The HGDP-CEPH panel represents a total of seven geographical regions: Africa, Europe, the Middle East, central/south Asia, East Asia, Oceania and America. The diploid CNV1 copy number data were summarised by region to show regional variation in CNV1 copy number. The frequency of each copy number class is presented in Figure 91 and Table 38.

Figure 91: Frequency distribution of worldwide CNV1 copy number in HGDP continental regions.

Table 38: Diploid CNV1 copy number frequencies in HGDP continental regions. Entries are copy number count (frequency).
Diploid copy number   America     South Asia   East Asia    Europe       Middle East   Oceania     Africa
2                     0 (0)       1 (0.005)    7 (0.030)    4 (0.025)    0 (0)         2 (0.07)    0 (0)
3                     1 (0.01)    51 (0.25)    66 (0.28)    27 (0.17)    12 (0.07)     12 (0.40)   5 (0.05)
4                     42 (0.64)   133 (0.65)   108 (0.47)   122 (0.77)   151 (0.89)    14 (0.47)   60 (0.54)
5                     16 (0.25)   13 (0.06)    35 (0.15)    4 (0.025)    7 (0.04)      1 (0.03)    28 (0.25)
6                     7 (0.10)    5 (0.025)    14 (0.060)   1 (0.006)    0 (0)         1 (0.03)    17 (0.15)
7                     0 (0)       1 (0.005)    2 (0.009)    0 (0)        0 (0)         0 (0)       1 (0.009)
Total                 66          204          232          158          170           30          111

The modal CNV1 copy number was 4 in all geographical regions, but the pattern of copy number classes differed between regions. Diploid CNV1 copy number ranged from 3 to 6 in American populations, where the major classes were 4 and 5 copies, found in 64% and 25% of samples, respectively. All copy number classes (2-7) were found in South Asian and East Asian populations, with copy number 4 in 65% and 47% of samples, respectively. The frequency of copy number 3 was similar in South Asia (0.25) and East Asia (0.28), but the frequency of 5 copies of CNV1 was higher in East Asia (0.15) than in South Asia (0.06). The frequency distributions of 3 and 4

copies were 0.17 and 0.77, respectively, in European populations, with copy number ranging from 2 to 6. The highest frequency (0.89) of 4 copies was found in the Middle East populations, with a few samples of copy number 3 (7%). The sample size for Oceania was 30, and the two major copy number classes were 3 (0.40) and 4 (0.47). Copy number ranged from 3 to 7 in African populations, and 54% of the samples showed copy number 4. Among HGDP populations, African populations showed higher frequencies of 5 and 6 copies of CNV1 (25% and 15%, respectively). The high frequency of copy number 6 was found only in African populations, in striking contrast to other regions.

Distribution of CNV1 diploid copy number in different populations

The details of all populations from America, central/south Asia, East Asia, Europe, Middle East, Oceania and Africa and their distribution of CNV1 copy number are presented below (Table 39, Table 40, Table 41, Table 42, Table 43 and Table 44).

Table 39: CNV1 copy number frequencies in HGDP American populations. Entries are copy number count (frequency).
Diploid copy number   Colombia   Karitiana   Maya        Pima       Surui
2                     0 (0)      0 (0)       0 (0)       0 (0)      0 (0)
3                     0 (0)      0 (0)       1 (0.05)    0 (0)      0 (0)
4                     4 (0.57)   8 (0.57)    13 (0.59)   8 (0.57)   9 (1.0)
5                     1 (0.14)   5 (0.36)    5 (0.23)    5 (0.36)   0 (0)
6                     2 (0.29)   1 (0.07)    3 (0.14)    1 (0.07)   0 (0)
7                     0 (0)      0 (0)       0 (0)       0 (0)      0 (0)
Total                 7          14          22          14         9

Table 40: CNV1 copy number frequencies in HGDP South Asia populations. Entries are CNV1 copy number count (frequency).
Copy number   Balochi     Brahui      Burusho     Hazara      Kalash      Makrani     Pathan      Sindhi      Uygur
2             0 (0)       0 (0)       0 (0)       0 (0)       0 (0)       0 (0)       1 (0.04)    0 (0)       0 (0)
3             8 (0.33)    6 (0.24)    9 (0.36)    4 (0.17)    8 (0.33)    4 (0.16)    6 (0.25)    4 (0.17)    2 (0.20)
4             14 (0.58)   18 (0.72)   14 (0.56)   13 (0.57)   16 (0.67)   19 (0.76)   16 (0.67)   19 (0.79)   4 (0.40)
5             2 (0.08)    0 (0)       2 (0.08)    2 (0.09)    0 (0)       2 (0.08)    1 (0.04)    1 (0.04)    3 (0.30)
6             0 (0)       1 (0.04)    0 (0)       3 (0.13)    0 (0)       0 (0)       0 (0)       0 (0)       1 (0.10)
7             0 (0)       0 (0)       0 (0)       1 (0.04)    0 (0)       0 (0)       0 (0)       0 (0)       0 (0)
Total         24          25          25          23          24          25          24          24          10

Table 41: CNV1 copy number frequencies in HGDP East Asia populations. Entries are CNV1 copy number count (frequency).
Copy number   Cambodian   Dai        Daur       Han         Hezhen     Japanese    Lahu       Miaozu     Mongola
2             0 (0)       0 (0)      0 (0)      1 (0.02)    0 (0)      1 (0.03)    1 (0.13)   0 (0)      0 (0)
3             2 (0.20)    3 (0.30)   2 (0.20)   13 (0.30)   3 (0.33)   10 (0.34)   2 (0.25)   1 (0.10)   4 (0.40)
4             6 (0.60)    5 (0.50)   5 (0.50)   17 (0.39)   5 (0.56)   14 (0.48)   3 (0.38)   5 (0.50)   4 (0.40)
5             2 (0.20)    0 (0)      1 (0.10)   12 (0.27)   0 (0)      4 (0.14)    2 (0.25)   4 (0.40)   2 (0.20)
6             0 (0)       1 (0.10)   2 (0.20)   1 (0.02)    1 (0.11)   0 (0)       0 (0)      0 (0)      0 (0)
7             0 (0)       1 (0.10)   0 (0)      0 (0)       0 (0)      0 (0)       0 (0)      0 (0)      0 (0)
Total         10          10         10         44          9          29          8          10         10

Copy number   Naxi       Orogen     She        Tu         Tujia      Xibo       Yakut       Yizu
2             1 (0.11)   0 (0)      0 (0)      0 (0)      0 (0)      1 (0.11)   1 (0.04)    1 (0.10)
3             4 (0.44)   3 (0.33)   0 (0)      2 (0.20)   5 (0.50)   2 (0.22)   7 (0.28)    3 (0.30)
4             4 (0.44)   2 (0.22)   9 (0.90)   6 (0.60)   4 (0.40)   4 (0.44)   12 (0.48)   3 (0.30)
5             0 (0)      0 (0)      1 (0.10)   2 (0.20)   0 (0)      1 (0.11)   3 (0.12)    1 (0.10)
6             0 (0)      3 (0.33)   0 (0)      0 (0)      1 (0.10)   1 (0.11)   2 (0.08)    2 (0.20)
7             0 (0)      1 (0.11)   0 (0)      0 (0)      0 (0)      0 (0)      0 (0)       0 (0)
Total         9          9          10         10         10         9          25          10

Table 42: CNV1 copy number frequencies in HGDP European populations. Entries are CNV1 copy number count (frequency).
Copy number   Adygei      French      French Basque   North Italian   Orcadian    Russian     Sardinian   Tuscan
2             0 (0)       0 (0)       3 (0.13)        1 (0.08)        0 (0)       0 (0)       0 (0)       0 (0)
3             0 (0)       5 (0.18)    7 (0.29)        2 (0.15)        2 (0.13)    5 (0.20)    6 (0.21)    0 (0)
4             17 (1.00)   21 (0.75)   14 (0.58)       10 (0.77)       10 (0.67)   20 (0.80)   22 (0.79)   8 (1.00)
5             0 (0)       2 (0.07)    0 (0)           0 (0)           2 (0.13)    0 (0)       0 (0)       0 (0)
6             0 (0)       0 (0)       0 (0)           0 (0)           1 (0.07)    0 (0)       0 (0)       0 (0)
7             0 (0)       0 (0)       0 (0)           0 (0)           0 (0)       0 (0)       0 (0)       0 (0)
Total         17          28          24              13              15          25          28          8

Table 43: CNV1 copy number frequencies in HGDP Middle East and Oceania populations. Entries are CNV1 copy number count (frequency).
Copy number   Bedouin     Druze       Palestinian   Mozabite    NAN Melanesian   Papuan
2             0 (0)       0 (0)       0 (0)         0 (0)       1 (0.08)         1 (0.06)
3             4 (0.09)    2 (0.05)    4 (0.08)      2 (0.07)    9 (0.69)         3 (0.18)
4             43 (0.91)   42 (0.95)   43 (0.86)     23 (0.79)   3 (0.23)         11 (0.65)
5             0 (0)       0 (0)       3 (0.06)      4 (0.14)    0 (0)            1 (0.06)
6             0 (0)       0 (0)       0 (0)         0 (0)       0 (0)            1 (0.06)
7             0 (0)       0 (0)       0 (0)         0 (0)       0 (0)            0 (0)
Total         47          44          50            29          13               17

Table 44: CNV1 copy number frequencies in HGDP Sub-Saharan Africa populations. Entries are CNV1 copy number count (frequency).
Copy number   Bantu Kenya   Bantu South Africa   Biaka Pygmy   Mandenka    Mbuti Pygmy   San        Yoruba
2             0 (0)         0 (0)                0 (0)         0 (0)       0 (0)         0 (0)      0 (0)
3             1 (0.09)      0 (0)                0 (0)         2 (0.08)    0 (0)         0 (0)      2 (0.09)
4             5 (0.45)      5 (0.62)             18 (0.67)     15 (0.63)   6 (0.46)      3 (0.50)   8 (0.36)
5             5 (0.45)      3 (0.38)             2 (0.07)      5 (0.21)    6 (0.46)      1 (0.17)   6 (0.27)
6             0 (0)         0 (0)                6 (0.22)      2 (0.08)    1 (0.08)      2 (0.33)   6 (0.27)
7             0 (0)         0 (0)                1 (0.04)      0 (0)       0 (0)         0 (0)      0 (0)
Total         11            8                    27            24          13            6          22

Analysis of CNV2 copy number

Two PRT assays, PRT3 and PRT4, were used to estimate CNV2 copy number in the HGDP panel (Figure 92). The raw PRT ratios of both PRT assays were normalised using standard reference DNA. The distributions of the normalised PRT ratios of PRT3 (red) and PRT4 (green) were analysed in 971 HGDP-CEPH samples. The distribution of PRT ratios showed very clear clustering in the HGDP-CEPH panel for both PRT assays.

Figure 92: Histogram of raw PRT ratio of PRT3 (left side) and PRT4 (right side) in the HGDP-CEPH samples.

The scatter plot showed very good quality clusters without any overlap of data. The clusters for higher copy numbers contained fewer samples and were slightly noisy, but were good enough to estimate copy number using CNVtools (Figure 93).

Figure 93: Scatter plot produced by PRT3 and PRT4 assays of CNV2 estimation in HGDP samples.

The histogram of the mean PRT ratio showed well-defined clusters of mean PRT values. The histogram of mean PRT values indicated a total of 12 clusters, with the mean value difference

between two neighbouring clusters being 0.5 (Figure 94). The histogram therefore suggested a 12-component model for the CNVtools analysis.

Figure 94: Histogram of mean normalised PRT ratio of CNV2 for HGDP samples. The x-axis indicates the mean PRT ratio of the CNV2 region at DMBT1.

Integer copy numbers were inferred from the transformed PRT data of CNV2 using a Gaussian mixture model. A mixture model of twelve components was fitted first, based on the histogram of the mean PRT data of CNV2, but the clustering was very noisy and a few data points were assigned to two overlapping clusters. The main reason for the noise was that very few samples had low or high PRT ratios. After excluding samples with low PRT ratios (mean ratio around 0) and extremely high ratios (> 4.25), the CNV2 data were re-analysed using a mixture model of eight components, and CNVtools produced very good data clusters without any overlap (Figure 95). The resulting clustering quality score (Q) for CNV2 data was Based on long-range PCR of the CNV2 region, the actual copy number was assigned to each copy number cluster. The first cluster was assigned a diploid CNV2 copy number of 1, and the remaining clusters were assigned diploid copy numbers of 2, 3, 4, 5, 6, 7 and 8, respectively.
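The exclusion step described above can be sketched as a simple partition of samples by mean PRT ratio before the mixture model is fitted. The upper cutoff of 4.25 comes from the text; the lower cutoff, the function name and the sample identifiers are hypothetical, since the text only says that the excluded low ratios were around 0.

```python
def partition_ratios(mean_ratios, low_cutoff=0.25, high_cutoff=4.25):
    """Split samples into low, core and high groups by mean PRT ratio.

    low_cutoff is an assumed value; high_cutoff follows the text (> 4.25).
    """
    low = {s: r for s, r in mean_ratios.items() if r < low_cutoff}
    high = {s: r for s, r in mean_ratios.items() if r > high_cutoff}
    core = {s: r for s, r in mean_ratios.items()
            if low_cutoff <= r <= high_cutoff}
    return low, core, high

# Hypothetical sample identifiers and ratios, for illustration only.
ratios = {"s1": 0.02, "s2": 1.48, "s3": 2.51, "s4": 4.60}
low, core, high = partition_ratios(ratios)
# 'core' would go to the mixture model; 'low' and 'high' would be
# assigned copy numbers manually and checked by another method.
```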

Figure 95: Output of the clustering procedure using the PRT transformed data of CNV2 for HGDP samples. The coloured lines show the posterior probability for each of the eight copy number classes (copy number = 1, 2, 3, 4, 5, 6, 7 or 8).

The samples with low PRT ratios were assigned manually as diploid copy number 0 (total deletion of CNV2) and revalidated using long-range PCR. The samples with extreme mean PRT values were assigned manually as diploid copy number 9, 10 or 11, based on the mean PRT value of CNV2. Posterior probabilities of the integer copy number call for each sample are plotted in Figure 96. The posterior probabilities for most of the samples were above 0.95, and values above 0.80 indicated that copy number calling for CNV2 was very good for the HGDP samples. Samples with posterior probabilities < 0.75 were retyped and revalidated using long-range PCR.

Figure 96: Analysis of CNV2 integer copy number calling using CNVtools for the HGDP samples. Clustering of the data followed by fitting of a Gaussian mixture model allows integer copy number calling from the normalised mean unrounded values of CNV2 generated by PRT3 and PRT4. The Bayesian posterior probabilities of each CNV2 copy number call are shown for HGDP samples.

Distribution of CNV2 diploid copy number in HGDP

A wide range of diploid copy number (from 0 to 11) was found for CNV2 in HGDP populations (Figure 97). The mean CNV2 copy number for HGDP populations was 4.37, and most of the HGDP samples (88%) showed a diploid copy number from 2 to 6 for CNV2. The detailed copy number counts and frequencies of CNV2 diploid copy number are presented in Table 45.

Table 45: CNV2 copy number frequencies in HGDP samples.
Diploid copy number    Copy number count    Copy number frequency
Total                  971
Mean                   4.37

Figure 97: Population distribution of CNV2 copy number in the HGDP samples. Distribution of CNV2 integer copy number in the HGDP populations; pie sizes are drawn in proportion to sample size.

Distribution of CNV2 diploid copy number in different geographical regions

The distribution of diploid CNV2 copy number was analysed to present global variation in CNV2 copy number. The frequency of each copy number class is presented in Figure 98. The copy number of CNV2 across all populations ranges from 0 (complete deletion) to 11 copies (Table 46). The 4-copy class of CNV2 is common in all regions except Europe, Oceania and Africa. The 5-copy CNV2 class is common in European populations (0.31) but in

African populations the 2-copy class of CNV2 was common (0.51). The 6-copy class of CNV2 was common in Oceania, where it was found in 47% of samples.

Figure 98: Frequency distribution of worldwide CNV2 copy number in HGDP continental regions.

Table 46: CNV2 copy number frequencies in HGDP continental regions. Entries are copy number count (frequency).
Diploid copy number   America     South Asia   East Asia   Europe      Middle East   Oceania     Africa
0                     0 (0)       1 (<0.01)    3 (0.01)    0 (0)       0 (0)         0 (0)       0 (0)
1                     0 (0)       3 (0.01)     15 (0.07)   0 (0)       0 (0)         0 (0)       3 (0.03)
2                     11 (0.17)   35 (0.17)    35 (0.15)   3 (0.02)    7 (0.04)      2 (0.07)    56 (0.51)
3                     3 (0.05)    33 (0.16)    25 (0.11)   17 (0.11)   18 (0.11)     0 (0)       13 (0.12)
4                     22 (0.33)   47 (0.23)    71 (0.31)   23 (0.15)   55 (0.32)     10 (0.33)   24 (0.22)
5                     4 (0.06)    29 (0.14)    32 (0.14)   49 (0.31)   41 (0.24)     2 (0.07)    11 (0.10)
6                     23 (0.35)   38 (0.19)    29 (0.13)   40 (0.25)   40 (0.24)     14 (0.47)   2 (0.02)
7                     2 (0.03)    9 (0.04)     13 (0.06)   16 (0.10)   5 (0.03)      1 (0.03)    2 (0.02)
8                     0 (0)       7 (0.03)     6 (0.03)    9 (0.06)    3 (0.01)      0 (0)       0 (0)
9                     0 (0)       2 (0.01)     1 (<0.01)   1 (0.01)    0 (0)         1 (0.03)    0 (0)
10                    0 (0)       0 (0)        2 (0.01)    0 (0)       1 (<0.01)     0 (0)       0 (0)
11                    1 (0.02)    0 (0)        0 (0)       0 (0)       0 (0)         0 (0)       0 (0)
Total                 66          204          232         158         170           30          111
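The frequencies in Table 46 follow directly from the counts. As a minimal sketch, the code below takes the Oceania column of Table 46 and computes the class frequencies, the modal class and the mean diploid copy number; the variable names are illustrative only.

```python
# Oceania CNV2 counts from Table 46: diploid copy number -> number of samples
oceania_counts = {2: 2, 4: 10, 5: 2, 6: 14, 7: 1, 9: 1}

n_samples = sum(oceania_counts.values())
frequencies = {cn: count / n_samples for cn, count in oceania_counts.items()}
modal_class = max(oceania_counts, key=oceania_counts.get)
mean_copy_number = sum(cn * count for cn, count in oceania_counts.items()) / n_samples

for cn in sorted(frequencies):
    print(f"copy number {cn}: {oceania_counts[cn]} ({frequencies[cn]:.2f})")
```

The same calculation applies to any column of the regional or population tables.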

6.2.8 Distribution of CNV2 diploid copy number in different populations

The details of all populations from America, central/south Asia, East Asia, Europe, Middle East, Oceania and Africa and their distribution of CNV2 copy number are presented below (Table 47, Table 48, Table 49, Table 50, Table 51 and Table 52).

Table 47: CNV2 copy number frequencies in HGDP American populations. Entries are copy number count (frequency).
Diploid copy number   Colombia   Karitiana   Maya       Pima        Surui
0                     0 (0)      0 (0)       0 (0)      0 (0)       0 (0)
1                     0 (0)      0 (0)       0 (0)      0 (0)       0 (0)
2                     2 (0.29)   3 (0.21)    4 (0.18)   2 (0.15)    0 (0)
3                     0 (0)      0 (0)       2 (0.09)   1 (0.07)    0 (0)
4                     2 (0.29)   6 (0.43)    4 (0.18)   10 (0.71)   0 (0)
5                     0 (0)      0 (0)       4 (0.18)   0 (0)       0 (0)
6                     1 (0.13)   5 (0.36)    7 (0.32)   1 (0.07)    9 (1.0)
7                     2 (0.29)   0 (0)       0 (0)      0 (0)       0 (0)
8                     0 (0)      0 (0)       0 (0)      0 (0)       0 (0)
9                     0 (0)      0 (0)       0 (0)      0 (0)       0 (0)
10                    0 (0)      0 (0)       0 (0)      0 (0)       0 (0)
11                    0 (0)      0 (0)       1 (0.05)   0 (0)       0 (0)
Total                 7          14          22         14          9

Table 48: CNV2 copy number frequencies in HGDP South Asia populations. Entries are CNV2 copy number count (frequency).
Copy number   Balochi    Brahui     Burusho    Hazara     Kalash     Makrani    Pathan     Sindhi     Uygur
0             0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      1 (0.04)   0 (0)
1             0 (0)      0 (0)      0 (0)      1 (0.04)   0 (0)      0 (0)      0 (0)      0 (0)      2 (0.20)
2             6 (0.25)   5 (0.20)   5 (0.20)   2 (0.09)   1 (0.04)   4 (0.16)   7 (0.29)   2 (0.08)   3 (0.30)
3             3 (0.13)   4 (0.16)   7 (0.28)   3 (0.13)   1 (0.04)   3 (0.12)   5 (0.21)   4 (0.17)   3 (0.30)
4             6 (0.25)   5 (0.20)   5 (0.20)   3 (0.13)   7 (0.29)   6 (0.24)   6 (0.25)   7 (0.29)   2 (0.20)
5             4 (0.17)   2 (0.08)   1 (0.04)   6 (0.26)   6 (0.25)   2 (0.08)   2 (0.08)   6 (0.25)   0 (0)
6             2 (0.08)   5 (0.20)   5 (0.20)   5 (0.22)   7 (0.29)   7 (0.28)   3 (0.13)   4 (0.17)   0 (0)
7             0 (0)      3 (0.12)   1 (0.04)   2 (0.09)   2 (0.08)   0 (0)      1 (0.04)   0 (0)      0 (0)
8             3 (0.13)   1 (0.04)   1 (0.04)   1 (0.04)   0 (0)      1 (0.04)   0 (0)      0 (0)      0 (0)
9             0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      2 (0.08)   0 (0)      0 (0)      0 (0)
10            0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)
11            0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)
Total         24         25         25         23         24         25         24         24         10

Table 49: CNV2 copy number frequencies in HGDP East Asia populations. Cells show CNV2 count (frequency).

Copy number  Cambodian  Dai        Daur       Han        Hezhen     Japanese   Lahu       Miaozu     Mongola
0            1 (0.10)   0 (0)      0 (0)      1 (0.02)   0 (0)      0 (0)      0 (0)      1 (0.10)   0 (0)
1            4 (0.40)   0 (0)      0 (0)      2 (0.05)   0 (0)      1 (0.03)   1 (0.13)   1 (0.10)   0 (0)
2            0 (0)      0 (0)      1 (0.10)   6 (0.14)   2 (0.22)   3 (0.10)   3 (0.38)   0 (0)      0 (0)
3            1 (0.10)   0 (0)      1 (0.10)   5 (0.11)   4 (0.44)   5 (0.17)   0 (0)      0 (0)      0 (0)
4            3 (0.30)   1 (0.10)   3 (0.30)   14 (0.32)  1 (0.11)   8 (0.28)   3 (0.38)   3 (0.30)   4 (0.40)
5            0 (0)      4 (0.40)   2 (0.20)   7 (0.16)   1 (0.11)   6 (0.21)   0 (0)      0 (0)      4 (0.40)
6            1 (0.10)   2 (0.20)   1 (0.10)   4 (0.09)   0 (0)      3 (0.10)   1 (0.13)   2 (0.20)   1 (0.10)
7            0 (0)      0 (0)      2 (0.20)   2 (0.05)   1 (0.11)   2 (0.07)   0 (0)      3 (0.30)   0 (0)
8            0 (0)      1 (0.10)   0 (0)      2 (0.05)   0 (0)      1 (0.03)   0 (0)      0 (0)      1 (0.10)
9            0 (0)      1 (0.10)   0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)
10           0 (0)      1 (0.10)   0 (0)      1 (0.02)   0 (0)      0 (0)      0 (0)      0 (0)      0 (0)
11           0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)
Total

Copy number  Naxi       Oroqen     She        Tu         Tujia      Xibo       Yakut      Yizu
0            0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)
1            0 (0)      0 (0)      0 (0)      1 (0.10)   2 (0.20)   0 (0)      2 (0.08)   1 (0.10)
2            1 (0.11)   2 (0.22)   3 (0.30)   3 (0.30)   2 (0.20)   3 (0.33)   2 (0.08)   4 (0.40)
3            2 (0.22)   0 (0)      2 (0.20)   2 (0.20)   0 (0)      0 (0)      1 (0.04)   2 (0.20)
4            4 (0.44)   5 (0.56)   2 (0.20)   2 (0.20)   3 (0.30)   3 (0.33)   10 (0.40)  2 (0.20)
5            1 (0.11)   2 (0.22)   0 (0)      0 (0)      1 (0.10)   2 (0.22)   2 (0.08)   0 (0)
6            1 (0.11)   0 (0)      2 (0.20)   2 (0.20)   2 (0.20)   0 (0)      6 (0.24)   1 (0.10)
7            0 (0)      0 (0)      1 (0.10)   0 (0)      0 (0)      1 (0.11)   1 (0.04)   0 (0)
8            0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      1 (0.04)   0 (0)
9            0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)
10           0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)
11           0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)      0 (0)
Total

Table 50: CNV2 copy number frequencies in HGDP European populations. Cells show CNV2 count (frequency).

Copy number  Adygei     French     French Basque  North Italian  Orcadian   Russian    Sardinian  Tuscan
0            0 (0)      0 (0)      0 (0)          0 (0)          0 (0)      0 (0)      0 (0)      0 (0)
1            0 (0)      0 (0)      0 (0)          0 (0)          0 (0)      0 (0)      0 (0)      0 (0)
2            0 (0)      1 (0.04)   0 (0)          0 (0)          1 (0.07)   0 (0)      1 (0.04)   0 (0)
3            3 (0.18)   3 (0.11)   2 (0.08)       0 (0)          2 (0.13)   3 (0.12)   4 (0.14)   0 (0)
4            2 (0.12)   5 (0.18)   4 (0.17)       3 (0.23)       1 (0.07)   4 (0.16)   3 (0.11)   1 (0.13)
5            5 (0.29)   9 (0.32)   14 (0.58)      2 (0.15)       3 (0.20)   8 (0.32)   4 (0.14)   4 (0.50)
6            5 (0.29)   5 (0.18)   3 (0.13)       5 (0.38)       3 (0.20)   8 (0.32)   8 (0.29)   3 (0.37)
7            1 (0.06)   3 (0.11)   1 (0.04)       1 (0.08)       5 (0.33)   0 (0)      5 (0.18)   0 (0)
8            1 (0.06)   2 (0.07)   0 (0)          2 (0.15)       0 (0)      1 (0.04)   3 (0.11)   0 (0)
9            0 (0)      0 (0)      0 (0)          0 (0)          0 (0)      1 (0.04)   0 (0)      0 (0)
10           0 (0)      0 (0)      0 (0)          0 (0)          0 (0)      0 (0)      0 (0)      0 (0)
11           0 (0)      0 (0)      0 (0)          0 (0)          0 (0)      0 (0)      0 (0)      0 (0)
Total

Table 51: CNV2 copy number frequencies in HGDP Middle East and Oceania populations. Cells show CNV2 count (frequency).

Copy number  Bedouin    Druze      Palestinian  Mozabite   NAN Melanesian  Papuan
0            0 (0)      0 (0)      0 (0)        0 (0)      0 (0)           0 (0)
1            0 (0)      0 (0)      0 (0)        0 (0)      0 (0)           0 (0)
2            1 (0.02)   1 (0.02)   0 (0)        5 (0.17)   2 (0.15)        0 (0)
3            5 (0.11)   5 (0.11)   6 (0.12)     2 (0.07)   0 (0)           0 (0)
4            16 (0.34)  15 (0.34)  16 (0.32)    8 (0.28)   7 (0.54)        3 (0.18)
5            13 (0.28)  8 (0.18)   16 (0.32)    4 (0.14)   0 (0)           2 (0.12)
6            11 (0.23)  12 (0.27)  9 (0.18)     8 (0.28)   3 (0.23)        11 (0.65)
7            1 (0.02)   1 (0.02)   2 (0.04)     1 (0.03)   0 (0)           1 (0.06)
8            0 (0)      2 (0.05)   1 (0.02)     0 (0)      0 (0)           0 (0)
9            0 (0)      0 (0)      0 (0)        0 (0)      1 (0.08)        0 (0)
10           0 (0)      0 (0)      0 (0)        1 (0.03)   0 (0)           0 (0)
11           0 (0)      0 (0)      0 (0)        0 (0)      0 (0)           0 (0)
Total

Table 52: CNV2 copy number frequencies in HGDP Sub-Saharan African populations. Cells show CNV2 count (frequency).

Copy number  Bantu Kenya  Bantu South Africa  Biaka Pygmy  Mandenka   Mbuti Pygmy  San        Yoruba
0            0 (0)        0 (0)               0 (0)        0 (0)      0 (0)        0 (0)      0 (0)
1            1 (0.09)     0 (0)               0 (0)        2 (0.08)   0 (0)        0 (0)      0 (0)
2            5 (0.45)     2 (0.25)            20 (0.74)    8 (0.33)   9 (0.69)     2 (0.33)   10 (0.45)
3            0 (0)        1 (0.13)            4 (0.15)     3 (0.13)   1 (0.08)     3 (0.50)   1 (0.05)
4            2 (0.18)     3 (0.38)            1 (0.04)     8 (0.33)   3 (0.23)     1 (0.17)   6 (0.27)
5            1 (0.09)     2 (0.25)            2 (0.07)     1 (0.04)   0 (0)        0 (0)      5 (0.23)
6            1 (0.09)     0 (0)               0 (0)        1 (0.04)   0 (0)        0 (0)      0 (0)
7            1 (0.09)     0 (0)               0 (0)        1 (0.04)   0 (0)        0 (0)      0 (0)
8            0 (0)        0 (0)               0 (0)        0 (0)      0 (0)        0 (0)      0 (0)
9            0 (0)        0 (0)               0 (0)        0 (0)      0 (0)        0 (0)      0 (0)
10           0 (0)        0 (0)               0 (0)        0 (0)      0 (0)        0 (0)      0 (0)
11           0 (0)        0 (0)               0 (0)        0 (0)      0 (0)        0 (0)      0 (0)
Total

Estimation and distribution of total SRCR copy number in different geographical regions

The PCR-based paralogue ratio test estimated the diploid copy number of two CNV regions within the SRCR domains, but the sum of the CNV1 and CNV2 copy numbers does not directly represent the total number of SRCR domains in a diploid human genome. Estimating the total number of SRCR domains is useful for studying the global copy number status of SRCR domains in different human populations, and for establishing the effect of the total number of SRCR domains in case-control and cohort studies. The DNA size differences between different CNV1 and CNV2 alleles were estimated using three different assays: long PCR, fibre-FISH and PFGE. A change of one unit of the CNV1 allele corresponds to a 12.7 kb size difference at the genomic level and covers four SRCR domains (SRCR3, SRCR4, SRCR5 and SRCR6), based on the UCSC Genome Browser (GRCh37/hg19 assembly).
Long-range PCR indicates that a change of one unit of CNV2 produces a 4059 bp size difference at the genomic level, which covers a single SRCR domain based on the UCSC Genome Browser (GRCh37/hg19 assembly). CNV1 covers the first eight tandemly repeated SRCR domains (SRCR1 to SRCR8), whereas CNV2 covers the next three tandemly repeated SRCR domains (SRCR9 to SRCR11), based on the same assembly. In the PRT4 assay, the SRCR12 and SRCR13 domains were used as

reference regions that showed no variation in any individual. An individual with a copy number of a at CNV1 carries a total of 4a SRCR domains in that region, whereas a copy number of b at CNV2 represents a total of b SRCR domains per diploid genome. The last two SRCR domains (SRCR12 and SRCR13) were fixed (non-CNV) in all individuals, so the diploid copy number of the non-CNV region is always 4.

Figure 99: Schematic picture of the DMBT1 region. The CNV1 region shows copy number a, the CNV2 region shows copy number b, and the last two SRCR domains are non-variable, so there are a total of 4 SRCR domains in this region.

If an individual carries a copies of CNV1 and b copies of CNV2, then the total number of SRCR domains in that individual is (4a + b + 4) per diploid human genome (Figure 99). The diploid copy number of SRCR domains was calculated in the HGDP samples. The increase in CNV2 copy number was mirrored in part by a decrease in CNV1 copy number in HGDP populations, but the total number of SRCR domains per diploid DMBT1 was highly variable across all continental groups (Figure 100). The canonical DMBT1 sequence has 13 tandemly repeated SRCR domains, so an individual homozygous for this canonical sequence will have 26 SRCR domains per diploid genome.

Figure 100: Frequency distribution of total diploid SRCR copy number in HGDP samples. Bar chart showing the calculated diploid tandemly repeated SRCR domain count of DMBT1 in the HGDP samples, sorted by continent of origin.
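The (4a + b + 4) calculation described above can be written as a short Python sketch. The function name and argument names are illustrative assumptions, not part of the original analysis; only the formula itself comes from the text.

```python
def total_srcr_domains(cnv1_copies: int, cnv2_copies: int) -> int:
    """Total tandemly repeated SRCR domains per diploid genome.

    cnv1_copies: diploid copy number at CNV1 (a); each unit spans 4 SRCR domains.
    cnv2_copies: diploid copy number at CNV2 (b); each unit spans 1 SRCR domain.
    The non-variable SRCR12 and SRCR13 contribute 4 domains per diploid genome.
    """
    if cnv1_copies < 0 or cnv2_copies < 0:
        raise ValueError("copy numbers must be non-negative")
    return 4 * cnv1_copies + cnv2_copies + 4


# Canonical sequence, homozygous: a = 4 (SRCR1 to SRCR8 on each chromosome)
# and b = 6 (SRCR9 to SRCR11 on each chromosome).
print(total_srcr_domains(4, 6))  # 26
```

With a = 4 and b = 6 the sketch reproduces the 26 diploid SRCR domains expected for an individual homozygous for the canonical sequence.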

6.3.1 Analysis of CNV1 and CNV2 copy number association

Pattern of copy number variation in different HGDP individuals

The diploid copy numbers of CNV1 and CNV2 were used to measure the correlation between the two copy number variable regions of DMBT1. The mean unrounded copy number values of CNV1 and CNV2 were compared across all HGDP individuals. The mean unrounded copy number values were doubled so that the CNV1 and CNV2 data cluster around integer values. For samples in which CNV1 clustered around integer value 2 (diploid copy number 4), CNV2 clusters were found at integer values from 0 to 8 (diploid copy number value from 1 to 8). Samples with CNV1 clusters around 3 and 4 (diploid copy number 5 and 6 respectively) were found with CNV2 clusters around integer values 2 to 4 (diploid copy number 2 to 4). The scatter plot indicated that samples whose CNV1 values clustered around higher PRT values had CNV2 values clustered around lower PRT values. However, the scatter plot did not show any particular pattern of copy number change at the individual level: there was no correlation between copy number at CNV1 and CNV2 (r² = 0.01 for all HGDP samples) (Figure 101).

Figure 101: The pattern of CNV1 and CNV2 copy number variation in different HGDP individuals.

Pattern of copy number variation in different HGDP populations

The pattern of copy number variation was also compared at the population level (Table 53). The populations of each geographical region are shown in different colours (Figure 102). The scatter plot shows the population-specific pattern for CNV1 and CNV2 copy number, and there was a clear negative relationship at the population level (r² = 0.11). Populations with a higher CNV1 value showed a lower copy number value for CNV2, with some outlier populations. The Naxi and Melanesian populations showed low copy number for both CNV1 and CNV2, but the opposite pattern was found for Surui and Dai, where both CNV1 and CNV2 showed higher copy number. These populations also differed from other populations of the same region. The East Asian populations showed a wide range of copy number patterns compared to other regions. The two populations from the Oceania region showed two different patterns.

Figure 102: The pattern of copy number variation at CNV1 and CNV2 in different HGDP populations.

Table 53: Mean unrounded copy number for CNV1 and CNV2 in different HGDP populations.

Population CNV1 CNV2 Population CNV1 CNV2 Bantu_N.E Adygei Bantu_S.E French Biaka_Pygmies French_Basque Mandenka North_Italian Mbuti_Pygmies Orcadian San Russian Yoruba Sardinian Colombians Tuscan Karitiana Bedouin Maya Druze Pima Mozabite Surui 4 6 Palestinian Cambodians NAN_Melanesian Dai Papuan Daur Balochi Han Brahui Hezhen Burusho Japanese Hazara Lahu Kalash Miaozu Makrani Mongola Pathan Naxi Sindhi Oroqen Uygur She Xibo Tu Yakut Tujia Yizu

Pattern of copy number distribution in different geographical regions of the HGDP populations

The mean diploid copy numbers for CNV1 and CNV2 of seven geographical regions (Table 54) were also compared. The pattern in the Oceania region differed markedly from the other regions for both CNV1 (3.52) and CNV2 (5.07). Four regions (Europe, Middle East, South Asia and East Asia) showed mean CNV1 copy number values near 4.00 ( ), but their mean CNV2 copy number values differed. The mean CNV1 copy number was higher for the two remaining regions (America and Africa), but the mean CNV2 copy number for African samples (3.06) was low compared to American samples (4.67).

Table 54: Mean CNV1 and CNV2 copy number for different geographical regions in HGDP samples.

Region CNV1 CNV2 Africa America East Asia Europe Middle East Oceania South Asia

There was a clear negative relationship between CNV1 and CNV2 at the continental level (r² = 0.43) (Figure 103). The stronger relationship at the continental level than at the population level might be due to the evolution of both loci in different locations across the world. A lower CNV1 copy number was compensated by a higher CNV2 copy number, which maintained the total SRCR copy number within a restricted range. Pathogen richness or the development of agriculture in different populations might be a selective pressure influencing the frequency distributions of CNV1 and CNV2.

Figure 103: The variation pattern of copy number for CNV1 and CNV2 in different HGDP regions.

6.4 Analysis of pathogen-driven selection on DMBT1 copy number variation

Genes involved in innate immunity and inflammation are known to be susceptible to genetic variation. DMBT1 acts as a pattern recognition receptor in innate immunity and binds both gram-positive and gram-negative bacteria, as well as viruses. The global copy number variation at CNV1 and CNV2 of DMBT1 may be influenced by pathogen-driven selection that modulates susceptibility to infectious agents in human populations. Under pathogen-driven selection, the copy number distributions at CNV1 and CNV2 of DMBT1 would be expected to vary with the pathogens to which populations are exposed. Pathogens can be divided into two major groups: micro- and macropathogens. Viruses, bacteria, fungi and protozoa form the micropathogen group, and the macropathogens include insects, arthropods and helminths. The macropathogen group is essentially equivalent to helminths, as parasitic worms are the most abundant class (90% of species/genera) within this group.

Kendall rank correlation for pathogen richness

To test whether pathogen-driven selection has been acting on DMBT1, the two copy number variable regions (CNV1 and CNV2) were genotyped in HGDP populations distributed worldwide. A Kendall correlation was performed between both CNV1 and CNV2 copy number and the richness of viruses, bacteria, protozoa and helminths in HGDP-CEPH populations. Virus and protozoa richness correlated strongly with both CNV1 and CNV2 in HGDP-CEPH populations (Table 55). For virus richness, the Kendall rank correlation found a positive relationship with the mean CNV1 copy number of HGDP-CEPH populations (Kendall's rank correlation coefficient = 0.32; p = 0.001) and a negative relationship with the mean CNV2 copy number (Kendall's rank correlation coefficient = -0.24; p = 0.015). The Kendall correlation did not find any significant correlation between total SRCR domains and virus richness.
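As a minimal sketch of the rank correlation used above, the following pure-Python function computes Kendall's tau between population mean copy number and pathogen richness. It is an illustration only: the thesis does not state which tau variant or software was used, so the tau-b form (which corrects for ties) is an assumption, the p-value is not computed, and the example numbers are hypothetical rather than taken from the HGDP data.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (ties allowed)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: contributes to neither count
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    if denom == 0:
        raise ValueError("tau is undefined when one variable is constant")
    return (concordant - discordant) / denom


# Hypothetical population means and richness counts, for illustration only.
mean_cnv1 = [3.5, 3.9, 4.0, 4.2, 4.6]
virus_richness = [12, 15, 14, 20, 25]
tau = kendall_tau_b(mean_cnv1, virus_richness)
```

A positive tau indicates that populations with higher richness tend to have higher mean copy number, and a negative tau the reverse, which is how the signs reported for CNV1 and CNV2 are read.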
Table 55: The Kendall correlations with the richness of viruses, helminths, bacteria and protozoa.

Virus Helminths Bacteria Protozoa Locus tau p tau p tau p tau p CNV CNV SRCR

The Kendall rank correlation also found a similar type of relationship with mean CNV1 and CNV2 data for the protozoa database, which were strongly correlated across geographic locations. The Kendall rank correlation found a positive relationship (Kendall's rank correlation

coefficient = 0.37; p = ) with mean CNV1 copy number of HGDP-CEPH populations, and a negative relationship (Kendall's rank correlation coefficient = -0.38; p = ) with mean CNV2 copy number of HGDP-CEPH populations for protozoa richness. The Kendall correlation did not find any significant correlation between total SRCR domains and protozoa richness. A similar type of relationship was found for helminth richness with mean CNV1 (Kendall's rank correlation coefficient = 0.12; p = 0.24) and mean CNV2 (Kendall's rank correlation coefficient = -0.14; p = 0.15), but neither relationship was statistically significant. For bacterial richness, the Kendall correlation analysis showed the opposite pattern compared to the virus, helminth and protozoa data. The Kendall rank correlation found a negative relationship (Kendall's rank correlation coefficient = -0.12; p = 0.25) with mean CNV1 copy number, and a positive relationship (Kendall's rank correlation coefficient = ; p = ) with mean CNV2 copy number of HGDP-CEPH populations, but these relationships were not statistically significant.

Partial Mantel tests for pathogen richness

The pattern of global human genetic variation is influenced by geography and by powerful selective forces. Genetic differentiation at a selected gene in human populations should reflect both a signal of ancient demography and its own peculiar signature of selection (Fumagalli et al., 2009, 2011b; Hancock et al., 2010). The Kendall rank correlation investigates the association between copy number variation of DMBT1 and pathogen richness without considering distance from Africa. A partial Mantel test can investigate the link between pairwise genetic distances and pairwise differences in the selecting force after accounting for geographic distance (Handley et al., 2007).
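The logic of a partial Mantel test can be sketched in pure Python as follows. This is a hedged illustration under stated assumptions, not the procedure used in the thesis: it assumes Pearson correlation on the upper triangles of three symmetric distance matrices, a one-sided permutation p-value obtained by jointly permuting the rows and columns of the genetic matrix, and hypothetical example values. All function names are invented for this sketch.

```python
import random
from math import sqrt


def abs_diff_matrix(values):
    """Pairwise absolute differences, used here as a simple distance matrix."""
    return [[abs(a - b) for b in values] for a in values]


def upper_triangle(matrix, order=None):
    """Flatten the upper triangle, optionally after permuting rows and columns."""
    n = len(matrix)
    idx = list(range(n)) if order is None else order
    return [matrix[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def partial_r(a, b, c):
    """Correlation of a and b after removing the linear effect of c."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def partial_mantel(genetic, selection, geographic, n_perm=999, seed=0):
    """Return (partial Mantel r, one-sided permutation p-value)."""
    rng = random.Random(seed)
    b, c = upper_triangle(selection), upper_triangle(geographic)
    observed = partial_r(upper_triangle(genetic), b, c)
    order = list(range(len(genetic)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # relabel populations in the genetic matrix only
        if partial_r(upper_triangle(genetic, order), b, c) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical per-population values, for illustration only.
genetic = abs_diff_matrix([2.0, 2.5, 3.9, 4.2, 5.1, 5.8])      # mean copy number
selection = abs_diff_matrix([10, 14, 22, 21, 30, 35])          # pathogen richness
geographic = abs_diff_matrix([0.0, 3.0, 5.0, 9.0, 12.0, 20.0])  # distance from Africa
r, p = partial_mantel(genetic, selection, geographic)
```

Permuting whole rows and columns, rather than individual matrix cells, keeps the dependence among distances that share a population, which is the reason a Mantel-type test is used instead of an ordinary correlation test on the pairwise values.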
To test whether pathogen-driven selection has acted on DMBT1, the two copy number variable regions (CNV1 and CNV2) were analysed against pathogen richness in 52 human populations using partial Mantel tests. Copy number variation at the CNV2 region was significantly correlated with bacteria richness (Table 56).

Table 56: Partial Mantel correlations (using distance from Africa) [r = Mantel statistic; p = significance].

Locus Virus Helminths Bacteria Protozoa r p r p r p r p CNV CNV SRCR

The Kendall rank correlation found that both CNV1 and CNV2 copy number were significantly correlated with virus and protozoa richness, but the partial Mantel test did not find any

significance with virus or protozoa richness for either locus. The association seen for both CNVs may therefore be influenced by distance from Africa. The mean copy number of the CNV2 region was not significantly correlated with bacteria richness in the Kendall rank correlation, but a significant correlation was found in partial Mantel tests when distance from Africa was considered as a cofactor for genetic variation in human populations. In summary, the CNV2 copy number variation in HGDP-CEPH populations showed evidence of both a signal of ancient demography and bacteria-driven selection.

6.5 DMBT1 copy number variation due to human life style adaptations

After the evolution of modern humans in Africa (White et al., 2003), human populations adopted a variety of life styles and dietary components in order to occupy an exceptionally broad range of habitats. Worldwide human genetic variation was undoubtedly shaped, at least in part, by genetic adaptation to human life styles and food habits (Hancock et al., 2010). Dental caries affects a large part of the world's population and is characterized by progressive dissolution of dental tissues by organic acids produced by bacteria in dental plaque (Pieralise et al., 2013). S. mutans is the most prevalent bacterium in human oral flora and is widely recognized as a key causative agent of human dental caries (Cornejo et al., 2013). The prevalence of dental caries in human populations increased in post-agricultural societies (5–50%) compared to Mesolithic hunter-gatherers (0–2%) (Cornejo et al., 2013). The transition of human populations from hunting and gathering to agricultural societies increased the consumption of carbohydrates. The estimated timing of the start of a demographic expansion in S.
mutans in the human mouth was approximately 10,000 years ago (95% confidence interval [CI]: 3,268–14,344 years ago), coincident with the onset of human agriculture (Cornejo et al., 2013). To adapt to the new niche of the agricultural human mouth, S. mutans needed to develop or increase its efficiency in metabolizing sugar and to develop defences against human immunity. DMBT1 (also known as salivary agglutinin [DMBT1 SAG]) acts as a caries susceptibility protein by increasing the adhesion and colonization of S. mutans (Jonasson et al., 2007). The AgI/II adhesin SpaP (or PAc) is the principal surface adhesin expressed by S. mutans and interacts with gp-340 (Jonasson et al., 2007). The present study suggests that genetic adaptation guided by human life style and food habits may have led to a shift in copy number variation at the DMBT1 region. To test whether human life style and food habits have shaped this pattern, DMBT1 copy number variation was analysed across 52 HGDP-CEPH populations against human life style data, with distance from Africa as a cofactor for demographic processes.

6.5.1 Analysis of human life style adaptations using agriculture data as dichotomous variables

A preliminary analysis relating both CNV regions in DMBT1 to human life style in the HGDP populations was performed using human life style as a dichotomous variable. The HGDP populations were divided into agricultural and non-agricultural groups. A detailed description of the dichotomous life style variables is given elsewhere (Hancock et al., 2010). To test whether life style variables act on copy number variation of DMBT1, both copy number variable regions (CNV1 and CNV2) were analysed in 52 HGDP populations, accounting for distance from Africa. Using the agriculture data as a dichotomous variable, marginal significance was obtained for both CNV1 and CNV2 (Table 57). To determine the direction of the association between copy number variation and the agriculture data, regression analysis was performed using distance from Africa (dfa) as a covariate. The regression analysis showed associations with agriculture in opposite directions for the two loci: CNV1 copy number was negatively correlated with agriculture, whereas CNV2 copy number was positively correlated.

Table 57: Correlations with copy number variable of DMBT1 and a human life style variable (agriculture variable) as dichotomous variables.

DMBT1 copy number beta p CNV1 copy number CNV2 copy number

Analysis of human life style adaptations using agriculture data as a relative amount of activity

A detailed analysis relating the CNV regions in DMBT1 to human life style, measured as the relative amount of human activity in agriculture, animal husbandry, fishing, and hunting/gathering, was conducted using the same data set. A detailed description of the life style variables is given elsewhere (Fumagalli et al., 2011b). The percentage of activity spent in each of the examined subsistence activities was retrieved from Murdock's Ethnographic Atlas (1967).
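The regression with distance from Africa as a covariate, used above to obtain the direction of association for the dichotomous agriculture variable, can be sketched as an ordinary least-squares fit with two predictors. This is a minimal sketch under assumptions: the thesis does not state the software or exact model, the function name is invented, no p-value is computed, and the example numbers are hypothetical.

```python
def ols_with_covariate(y, x, covariate):
    """Least-squares fit of y = intercept + beta_x * x + beta_cov * covariate.

    Returns (intercept, beta_x, beta_cov). The sign of beta_x gives the
    direction of association with x after adjusting for the covariate.
    """
    n = len(y)
    if not (len(x) == len(covariate) == n) or n < 3:
        raise ValueError("need three equal-length sequences with n >= 3")
    my, mx, mc = sum(y) / n, sum(x) / n, sum(covariate) / n
    # Centred sums of squares and cross-products.
    sxx = sum((a - mx) ** 2 for a in x)
    scc = sum((c - mc) ** 2 for c in covariate)
    sxc = sum((a - mx) * (c - mc) for a, c in zip(x, covariate))
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    scy = sum((c - mc) * (b - my) for c, b in zip(covariate, y))
    det = sxx * scc - sxc ** 2
    if det == 0:
        raise ValueError("predictors are collinear or constant")
    beta_x = (scc * sxy - sxc * scy) / det
    beta_cov = (sxx * scy - sxc * sxy) / det
    return my - beta_x * mx - beta_cov * mc, beta_x, beta_cov


# Hypothetical data: 1 = agricultural, 0 = non-agricultural population.
agriculture = [0, 1, 0, 1, 1, 0]
dfa = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # distance from Africa, arbitrary units
mean_cnv1 = [5.0 - 0.5 * a + 0.1 * d for a, d in zip(agriculture, dfa)]
intercept, beta_agri, beta_dfa = ols_with_covariate(mean_cnv1, agriculture, dfa)
```

In this constructed example beta_agri is negative, which corresponds to the kind of negative association with agriculture reported for CNV1; a positive coefficient would correspond to the pattern reported for CNV2.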
To test whether the relative amount of human activity is associated with copy number variation of DMBT1, both copy number variable regions (CNV1 and CNV2) were analysed using the partial Mantel test, incorporating distance from Africa (Table 58). Statistically significant values were obtained for agriculture, animal husbandry, and hunting and gathering using partial Mantel tests.

Table 58: Correlations with copy number variable of DMBT1 and human life style as relative amount of human activity using partial Mantel tests.

Locus Agriculture Fishing Hunting and Gathering Animal husbandry r p r p r p r p CNV CNV SRCR

To determine the sign of each association, linear regression was performed (Table 59), using distance from Africa (dfa) as a covariate. The associations for CNV1 and CNV2 had opposite trends for agriculture and for hunting and gathering. The pattern for animal husbandry was similar to that for agriculture. For the agriculture data, CNV1 copy number was negatively correlated (β=-0.001, p=0.009) but the opposite trend was obtained for CNV2 copy number (β=0.027, p=0.003). For the animal husbandry data, CNV1 copy number was negatively correlated (β=-0.014, p=0.001) but the opposite trend was obtained for CNV2 copy number (β=0.023, p=0.039). For the hunting and gathering data, CNV1 copy number was positively correlated (β=0.008, p=0.001) but the opposite trend was obtained for CNV2 copy number (β=-0.019, p=0.002).

Table 59: Regression analysis with copy number variable of DMBT1 and human life style as relative amount of human activity spent.

Locus Agriculture Fishing Hunting and Gathering Animal husbandry beta p beta p beta p beta p CNV CNV SRCR

Discussion

In this chapter the global diversity of DMBT1 was determined based on the diploid copy number of CNV1 and CNV2 in the CEPH-Human Genome Diversity Project (HGDP) panel of 971 individuals from 52 populations worldwide (Magalhães et al., 2012; Rosenberg, 2006). A similar range was observed for CNV1 (2-7 copies per diploid genome) and for CNV2 (0-11 copies per diploid genome) as in the HapMap samples.
Although there was no detectable relationship between diploid copy number at CNV1 and CNV2 at the individual level (r² = 0.01 for all HGDP), there was a clear negative relationship at the population level (r² = 0.11) and at the continental level (r² = 0.43).

It has been reported that human adaptation to diet and different life styles had an important effect on the human genome (Hancock et al., 2010). Pathogen-driven selection has also been identified as a selective pressure throughout the human genome (Fumagalli et al., 2009, 2011b; Pozzoli et al., 2010). Given that the SRCR domain is known to bind a wide range of pathogens, the present study considered that pathogen diversity might be a selective pressure influencing the frequency distributions of CNV1 and CNV2. The Kendall correlation, without considering distance from Africa, found a strong positive relationship between mean CNV1 and both virus and protozoa richness, but a strong negative relationship between mean CNV2 and both virus and protozoa richness in HGDP-CEPH populations. However, partial Mantel tests considering distance from Africa found a marginal correlation between mean CNV2 and bacteria richness. The present study found a significant negative relationship between agriculture and the mean CNV1 copy number, whereas a significant positive relationship was found with hunting and gathering for this locus. A significant positive relationship was found between agriculture and mean CNV2 copy number, but the opposite trend was found for the same locus with hunting and gathering. This finding suggests that the subsistence history (development of agriculture) of a population has affected the frequency distribution of both CNVs within DMBT1.

7 ANALYSIS OF DIVERSITY OF THE SALIVARY AGGLUTININ-BINDING PROTEIN OF STREPTOCOCCUS MUTANS

7.1 Aim of the study

As mentioned previously, DMBT1 binds Streptococcus mutans in a calcium-dependent manner through antigen I/II polypeptides, a group of surface receptors on S. mutans (Esberg et al., 2012; Kelly et al., 1995; Kelly et al., 1990; Larson et al., 2011). Antigen I/II polypeptides have been characterized under different names (antigen B, P1, Pac, SpaP, MSL-1) in S. mutans and have been studied extensively as candidates for vaccine development (Kelly et al., 1995). SpaP encodes the surface antigen AgI/II, which is the ligand for human SAG (Kelly et al., 1990) and contains two binding domains for human SAG, Ad1 and Ad2 (Kelly et al., 1995). Human SAG has been shown to be glycosylated by the enzyme alpha-(1,2) fucosyltransferase, encoded by the FUT2 gene, which is responsible for the secretor status of ABO and Lewis blood group antigens (Eriksson et al., 2007; Ligtenberg et al., 2000). S. mutans binding to host SAG is thought to be mediated, in part, by glycosylated SAG residues. The present study was designed to dissect the possible evolutionary relationship between CNVs at DMBT1 and S. mutans. An approximately 1 kb region from the C-terminal region of the S. mutans SpaP gene was sequenced from human mouthwash DNA from volunteers of different ethnicity/geographical origin. The aim was to identify the demographic history and signals of adaptation of different strains of S. mutans. The common secretor polymorphism was also included as a predictor variable in this analysis.

7.2 Analysis of S. mutans sequences

Sequence diversity of SpaP gene of S. mutans

I genotyped a total of 149 local volunteers (students and staff from the University of Leicester), including 110 volunteers of European origin. Almost 1 kb of the C-terminal region of the SpaP gene (agglutinin receptor of S. mutans, WP_ from 92 to 427 amino acids) of S.
mutans, known to contain two binding domains for human DMBT1, was Sanger sequenced from the oral cavity of each individual. A 1008 bp region of the SpaP gene, encoding 336 amino acids, was used in all our analyses. 118 samples from a total of 147 samples (93 of the 110 European samples) produced one clear unambiguous sequence, but across samples there was extensive diversity (Table 60). This observation strongly suggests very low within-mouth

diversity of S. mutans, and that most people are colonized by only one strain, which is consistent with results from previous work (Cornejo et al., 2013). However, alignment of the sequences showed 136 polymorphisms, 54 of which alter an amino acid, reflecting very high levels of diversity between hosts, at least for this locus.

Table 60: Sequence diversity of C-terminal region of SpaP gene of S. mutans.

Sample information Sequence information Geographical origin Sample size Within-mouth diversity Total polymorphism Synonymous changes Nonsynonymous changes Worldwide Europe

Phylogenetic analysis of SpaP gene of S. mutans

A 1008 bp sequence from the C-terminal region of the SpaP gene of S. mutans was used for phylogenetic analysis, and the analysis involved 254 nucleotide sequences. A total of 29 sequences showed sequence variation at one or more nucleotide positions, indicating more than one variant of S. mutans in those samples. These 29 sequences were each divided into two sequences, so a total of 178 nucleotide sequences were obtained from all samples of S. mutans. A total of 64 known strains of S. mutans from around the world and 12 homologous sequences from the species most closely related to S. mutans (4 from Streptococcus intermedius (CP , AP , AB , CP ), 1 from S. downei (AB ), 2 from S. gordonii (U , U ), 1 from S. oralis (FR ), 1 from S. sanguinis (CP ), 3 from S. sobrinus (D , X , S )) were included for phylogenetic analysis in MEGA6 (Tamura et al., 2013).

Figure 104: Molecular phylogenetic analysis for all samples (including EU) by the Maximum Likelihood method in MEGA6 using nucleotide sequences. The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involves 254 nucleotide sequences. All positions containing gaps and missing data are eliminated. There are a total of 1008 positions in the final dataset. Solid round shapes indicate S. mutans sequences and solid diamond shapes indicate sequences from species related to S. mutans. Reference sequences of S. mutans and related species are shown in black, sequences from European individuals are shown in red and sequences from non-European individuals are shown in green.

The phylogenetic tree (Figure 104) showed that 8 SpaP sequences from S. downei, S. gordonii, S. oralis, S. sanguinis and S. sobrinus form a clade distinct from our S. mutans sequences. The other sequences formed two clades; the major clade included most sample sequences and all published sequences from known strains of S. mutans. The sister clade included 26 sample sequences and four published SpaP sequences from S. intermedius, without any sequence from known strains of S. mutans. It is likely that these 26 sequences were in fact from S.

intermedius, and they were therefore removed from subsequent analyses. Another phylogenetic tree (Figure 105) was constructed using S. mutans nucleotide sequences obtained from European individuals. The phylogenetic tree of European samples was similar to the previous tree. Similar patterns were also seen when Maximum Likelihood (ML) trees were drawn using amino acid sequences of all samples (Figure 106) and of European samples (Figure 107). Comparison of all sequences with known strains of S. mutans from around the world showed that our sample represented a large proportion of the known global diversity. The present analyses of ML trees using both nucleotide and amino acid sequences suggest that there is no geographical differentiation of S. mutans strains, consistent with previous work (Cornejo et al., 2013; Do et al., 2010).

Figure 105: Molecular phylogenetic analysis for European samples by the Maximum Likelihood method in MEGA6 using nucleotide sequences. The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involves 203 nucleotide sequences, excluding non-European samples. All positions containing gaps and missing data are eliminated. There are a total of 1008 positions in the final dataset. Solid round shapes indicate S. mutans sequences and solid diamond shapes indicate sequences from species related to S. mutans. Reference sequences of S. mutans and related species are shown in black, and sequences from European individuals are shown in red.

Figure 106: Molecular phylogenetic analysis of all samples by the Maximum Likelihood method in MEGA6 using amino acid sequences. The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involves 254 amino acid sequences. All positions containing gaps and missing data were eliminated, leaving a total of 336 positions in the final dataset. Solid circles indicate S. mutans sequences and solid diamonds indicate sequences from species related to S. mutans. Reference sequences of S. mutans and related species are shown in black, sequences from European regions in red and sequences from outside European regions in green.

Figure 107: Molecular phylogenetic analysis of European samples by the Maximum Likelihood method in MEGA6 using amino acid sequences. The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involves 203 amino acid sequences, excluding non-European samples. All positions containing gaps and missing data were eliminated, leaving a total of 336 positions in the final dataset. Solid circles indicate S. mutans sequences and solid diamonds indicate sequences from species related to S. mutans. Reference sequences of S. mutans and related species are shown in black and sequences from European individuals in red.

Analysis of DMBT1 binding regions of S. mutans

The C-terminal region of the SpaP gene contains two binding regions for DMBT1, Ad1 and Ad2 (Kelly et al., 1999). The 120 bp Ad1 region showed 14 polymorphic sites (11.7%), reflecting a very high level of sequence diversity between hosts (Figure 108).

Figure 108: Sequence logos showing the pattern of aligned nucleotide sequences of the Ad1 region of the SpaP gene of S. mutans. The height of each nucleotide is proportional to its frequency and the most common nucleotide is on top. Nucleotide numbering corresponds to the main DNA sequence used in the study.

The second DMBT1 (SAG) binding region of the SpaP gene, Ad2, was 90 bp long and contained 9 polymorphic sites (10%) (Figure 109), indicating a high level of sequence diversity in the Ad2 region as well. The sequence analysis therefore indicated a very high level of diversity between hosts for both binding regions of the SpaP gene.

Figure 109: Sequence logos showing the pattern of aligned nucleotide sequences of the Ad2 region of the SpaP gene of S. mutans. The height of each nucleotide is proportional to its frequency and the most common nucleotide is on top. Nucleotide numbering corresponds to the main DNA sequence used in the study.

The Ad1 region of Ag I/II comprised 40 amino acids, of which 6 positions were variable at the amino acid level (Figure 110), giving 85% identity between hosts.

Figure 110: Sequence logos showing the pattern of aligned amino acid sequences of the Ad1 region of Ag I/II of S. mutans. The height of each amino acid is proportional to its frequency and the most common amino acid is on top. Amino acid numbering corresponds to the main amino acid sequence used in the study.
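The percentages of polymorphic sites quoted for Ad1 and Ad2 are simple column counts over an alignment. A minimal sketch of that calculation follows, assuming the sequences are already aligned to equal length; the function names and the toy alignment are hypothetical and are not taken from the thesis data.

```python
def polymorphic_sites(alignment):
    """Return 0-based column indices at which aligned nucleotide sequences differ."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    sites = []
    for col in range(length):
        # Gaps and ambiguous bases are ignored when deciding whether a column varies.
        observed = {seq[col].upper() for seq in alignment} - {"-", "N"}
        if len(observed) > 1:
            sites.append(col)
    return sites


def percent_polymorphic(alignment):
    """Percentage of alignment columns that are polymorphic."""
    return 100.0 * len(polymorphic_sites(alignment)) / len(alignment[0])


# Hypothetical toy alignment (illustration only, not thesis data).
toy_alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACCTACGTAT",
    "ACGTACGAAC",
]
print(polymorphic_sites(toy_alignment))
print(percent_polymorphic(toy_alignment))
```

Applied to a 120 bp region with 14 variable columns, the same calculation gives 14/120, or about 11.7%.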

The Ad2 region of Ag I/II comprised 30 amino acids, of which only two positions were variable between hosts, indicating that Ad2 was less variable than Ad1 (Figure 111).

Figure 111: Sequence logos showing the pattern of aligned amino acid sequences of the Ad2 region of Ag I/II of S. mutans. The height of each amino acid is proportional to its frequency and the most common amino acid is on top. Amino acid numbering corresponds to the main amino acid sequence used in the study.

McDonald-Kreitman test for SpaP gene of S. mutans

The test used in this section, proposed by McDonald and Kreitman (Egea et al., 2008; Stoletzki & Eyre-Walker, 2011), compares the amount of nonsynonymous and synonymous polymorphism within a species with the amount of nonsynonymous and synonymous fixed differences between species. Under neutrality, the ratio of nonsynonymous to synonymous polymorphism within species should equal the ratio of nonsynonymous to synonymous fixed differences between species. Under positive selection, nonsynonymous substitutions between species are expected to increase relative to nonsynonymous polymorphisms within species.

A total of 1008 nucleotides from 156 coding nucleotide sequences of the SpaP gene of S. mutans, from all samples of different geographical regions, were compared with homologous sequences of similar size from the species most closely related to S. mutans (S. downei, S. gordonii, S. intermedius, S. oralis, S. sanguinis and S. sobrinus) to try to detect the signature of natural selection at the molecular level. The neutrality index (NI), the proportion of adaptive substitutions (α), and the associated chi-square (χ²) and p-value for S. mutans were calculated using the sequences of each closely related species as the divergence data (Table 61). The McDonald-Kreitman test showed neutrality indices greater than 1 for all species, with a significant deviation from neutrality (p<0.003), except in the case of S. intermedius.

Table 61: Summary results of the McDonald-Kreitman test of Antigen I/II of S. mutans using sequences from all samples.
Divergence species | Neutrality Index (NI) | Proportion of adaptive substitutions (α) | χ² value | p-value
Streptococcus downei
Streptococcus oralis
Streptococcus sanguinis
Streptococcus sobrinus
Streptococcus gordonii
Streptococcus intermedius
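The quantities reported for the McDonald-Kreitman test can be illustrated with a short sketch. It uses the standard definitions NI = (Pn/Ps)/(Dn/Ds) and α = 1 − NI, and a Pearson chi-square on the 2×2 table without continuity correction; whether the thesis applied a correction is not stated, so that choice is an assumption. The counts in the example are hypothetical and are not the thesis results.

```python
import math


def mcdonald_kreitman(dn, ds, pn, ps):
    """Return (NI, alpha, chi2, p_value) for a McDonald-Kreitman 2x2 table.

    dn, ds: nonsynonymous and synonymous fixed differences between species.
    pn, ps: nonsynonymous and synonymous polymorphisms within the species.
    """
    if min(dn, ds, pn, ps) <= 0:
        raise ValueError("all four counts must be positive for this sketch")
    ni = (pn / ps) / (dn / ds)      # neutrality index
    alpha = 1.0 - ni                # proportion of adaptive substitutions
    n = dn + ds + pn + ps
    # Pearson chi-square for a 2x2 table, no continuity correction (assumption).
    chi2 = n * (dn * ps - ds * pn) ** 2 / (
        (dn + ds) * (pn + ps) * (dn + pn) * (ds + ps)
    )
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return ni, alpha, chi2, p_value


# Hypothetical counts for illustration only.
ni, alpha, chi2, p_value = mcdonald_kreitman(dn=10, ds=20, pn=30, ps=20)
print(ni, alpha, chi2, p_value)
```

An NI above 1 (negative α) reflects an excess of nonsynonymous polymorphism relative to divergence, which is the pattern interpreted in the text as negative selection.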

To reduce demographic bias in the analysis of the SpaP region of S. mutans, sequences from a single geographical region were reanalysed using the McDonald-Kreitman test. A total of 110 coding nucleotide sequences of S. mutans of European origin, each containing 1008 nucleotides, and orthologous sequences from six closely related species were used to detect the signature of natural selection at the molecular level (Table 62). The McDonald-Kreitman test again showed a significant deviation from neutrality (p<0.003), except with S. intermedius.

Table 62: Summary results of the McDonald-Kreitman test of SpaP of S. mutans using sequences from European samples.
Divergence species | Neutrality Index (NI) | Proportion of adaptive substitutions (α) | χ² value | p-value
Streptococcus oralis
Streptococcus sanguinis
Streptococcus sobrinus
Streptococcus gordonii
Streptococcus intermedius

The neutrality index (NI) of S. mutans was greater than 1 against all species, which indicated an excess of polymorphic variation in the terminal region of the SpaP gene of S. mutans. This excess, specifically a higher than expected level of non-synonymous polymorphism in the tested region, raised the neutrality index, and all neutrality indices departed from the expectation under the neutral model (p<0.003). The neutrality index therefore indicated negative selection in the tested region of S. mutans.

Allele frequency spectrum of SpaP of S. mutans

Selection acting on a gene or region modifies and shapes the spectrum of frequencies at the variant sites of that gene. Both selection and demography alter these frequencies and under normal conditions, in the absence of selection, a population maintains an excess of intermediate frequency alleles. The allele frequency spectrum was drawn using the variant sites of SpaP of S. mutans. Two separate plots were drawn, one using all the samples and one using only the European samples of S. mutans.

Allele frequency spectrum of SpaP of S. mutans using frequency data

The allele frequency spectrum (AFS) of the SpaP region of S. mutans was plotted using the SNP frequencies of all samples from different geographical regions, without considering demographic boundaries. The AFS showed a high proportion of low frequency alleles for both synonymous and non-synonymous polymorphisms (Figure 112). The low allele frequencies of the variant sites (both synonymous and non-synonymous) in S. mutans populations indicated that the variant sites have arisen relatively recently, suggesting an expanding population, consistent with previous work (Cornejo et al., 2013; Do et al., 2010). Because non-synonymous polymorphisms are enriched at low allele frequencies, negative selection appears to have acted on the sequence of the SpaP region in S. mutans.

Figure 112: Analysis of the allele frequency spectrum in S. mutans. The observed distribution of the number of replacement SNPs at a given frequency in the sample is shown.

The variant sites of SpaP of S. mutans were classified into two groups, frequency ≤2% and frequency >2%, based on the frequencies of synonymous and non-synonymous polymorphisms in all samples. A two-tailed Fisher's exact test was performed using the synonymous and non-synonymous polymorphism data of all samples from different geographical regions. The association between frequency class and type of polymorphism (non-synonymous or synonymous) was statistically significant (p=0.0327) (Table 63). Overall, non-synonymous polymorphisms were less frequent than synonymous polymorphisms, but they were enriched at low allele frequencies relative to synonymous polymorphisms. A large number of unique single substitutions was observed in S. mutans in a previous study (Cornejo et al., 2013), indicating a recent expansion of S. mutans populations.

Table 63: Frequencies of synonymous and non-synonymous polymorphisms of S. mutans from all samples.
Frequency of variant sites | Non-synonymous change | Synonymous change
Frequency ≤2%
Frequency >2%
Two-tailed Fisher's exact test, p=0.0327

Figure 113: Analysis of the allele frequency spectrum in S. mutans. The observed distribution of the number of replacement SNPs at a given frequency in the European samples is shown.

The variant sites of S. mutans isolated from the European population were grouped in the same way, based on the frequencies of synonymous and non-synonymous polymorphisms. A two-tailed Fisher's exact test showed that the association between frequency class and type of polymorphism (non-synonymous or synonymous) was statistically significant (Table 64). Non-synonymous polymorphisms were less frequent overall than synonymous polymorphisms, but the AFS showed an enrichment of non-synonymous polymorphisms at low allele frequencies compared to synonymous polymorphisms, indicating that these mutations were slightly deleterious in the European samples (Figure 113).
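The comparison in Tables 63 and 64 is a two-tailed Fisher's exact test on a 2×2 table (frequency class by type of polymorphism). A minimal pure-Python sketch is given below, assuming the common definition of the two-tailed p-value as the sum of the probabilities of all tables no more likely than the observed one; the counts in the example are hypothetical and are not the thesis data.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    For example: rows are frequency classes (<=2%, >2%) and columns are
    non-synonymous and synonymous variant-site counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(
        prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical counts for illustration only.
print(fisher_exact_two_tailed(3, 1, 1, 3))
```

With real data, the four arguments would be the counts of non-synonymous and synonymous variant sites in the ≤2% and >2% frequency classes.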

Table 64: Frequencies of synonymous and non-synonymous polymorphisms of S. mutans from European samples.
Frequency of variant sites | Non-synonymous change | Synonymous change
Frequency ≤2%
Frequency >2%
Two-tailed Fisher's exact test

Estimation of DMBT1 copy number in Leicester local volunteers

Estimation of CNV1 copy number in Leicester local volunteers

Raw ratios of the PRT1 and PRT2 assays from 149 Leicester volunteer samples were used for estimation of CNV1 copy number. The raw PRT ratios of both assays were normalised using standard reference DNA. The distributions of the normalised PRT ratios of PRT1 (red) and PRT2 (green) showed very good clustering for both assays (Figure 114).

Figure 114: Histogram of raw PRT ratio of PRT1 (red), PRT2 (green) and mean CNV1 PRT ratio (gray) in the Leicester local samples.

The histogram of the mean PRT ratio of CNV1 in the Leicester local samples indicated four clusters (gray), with one sample showing a higher PRT ratio (Figure 114). A four-component model was therefore used to call the integer copy number of CNV1 with CNVtools. Two independent PRT assays were used to estimate CNV1 copy number, and their mean ratio was used to call the integer copy number. The scatter plot of the two assays (Figure 115) showed very good quality clusters without any overlap between clusters.

Figure 115: Scatter plot produced from the raw ratios of the PRT1 and PRT2 assays for CNV1 estimation in the Leicester local samples.

The mean PRT ratios of CNV1 were transformed to have a standard deviation of 1, as recommended for CNVtools analysis. A mixture model of four components fitted the data well, and one extremely high PRT ratio was treated as an outlier. Integer copy numbers were inferred from the transformed PRT data using a Gaussian mixture model (Figure 116), and a clustering quality score (Q) was calculated for the resulting fit. The first cluster corresponded to a diploid copy number of 2 and the remaining clusters were assigned diploid copy numbers of 3, 4 and 5, respectively.
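The idea behind calling integer copy number from clustered PRT ratios can be sketched with a small one-dimensional Gaussian mixture fitted by expectation-maximisation. This is not the CNVtools implementation: it is a simplified illustration that assumes a shared variance across components, user-supplied starting means, and no transformation of the ratios to unit standard deviation. The function names and the input ratios are hypothetical.

```python
import math


def _posteriors(ratios, means, weights, var):
    """E-step: posterior probability of each component for each ratio."""
    post = []
    for x in ratios:
        logp = [
            math.log(w) - 0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for m, w in zip(means, weights)
        ]
        top = max(logp)  # subtract the maximum for numerical stability
        unnorm = [math.exp(v - top) for v in logp]
        total = sum(unnorm)
        post.append([u / total for u in unnorm])
    return post


def fit_copy_number_mixture(ratios, init_means, n_iter=50, min_var=1e-6):
    """Fit a shared-variance 1D Gaussian mixture; return (means, posteriors)."""
    k, n = len(init_means), len(ratios)
    means = list(init_means)
    weights = [1.0 / k] * k
    var = 0.01  # starting variance (assumption)
    for _ in range(n_iter):
        post = _posteriors(ratios, means, weights, var)
        # M-step: update weights, means and the shared variance.
        for j in range(k):
            nj = sum(p[j] for p in post)
            if nj > 1e-12:
                weights[j] = nj / n
                means[j] = sum(p[j] * x for p, x in zip(post, ratios)) / nj
        var = max(
            min_var,
            sum(p[j] * (x - means[j]) ** 2 for p, x in zip(post, ratios) for j in range(k)) / n,
        )
    return means, _posteriors(ratios, means, weights, var)


def call_copy_numbers(posteriors, first_copy_number=2):
    """Assign each sample the copy number class with the highest posterior."""
    calls = []
    for p in posteriors:
        best = max(range(len(p)), key=lambda j: p[j])
        calls.append((first_copy_number + best, p[best]))
    return calls


# Hypothetical mean PRT ratios, clustered about 0.5 apart (illustration only).
ratios = [0.98, 1.02, 1.00, 1.48, 1.50, 1.53, 1.97, 2.00, 2.04, 2.49, 2.51]
means, post = fit_copy_number_mixture(ratios, init_means=[1.0, 1.5, 2.0, 2.5])
calls = call_copy_numbers(post)
print(calls)
```

Each call is paired with its posterior probability, which is the quantity plotted per sample to judge the quality of copy number calling.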

Figure 116: Output of the clustering procedure using the transformed PRT data of CNV1 in the Leicester local samples. The coloured lines show the posterior probability for each of the four copy number classes (copy number = 2, 3, 4 and 5). The x-axis shows the transformed PRT ratio for CNV1 in the Leicester local samples.

The posterior probabilities of the integer copy number call for each sample were plotted (Figure 117) to assess the quality of CNV1 copy number calling. The posterior probabilities of CNV1 copy number calls in the Leicester local samples were greater than 0.99, indicating very high quality copy number measurement for CNV1.

Figure 117: Analysis of integer copy number calling of CNV1 in the Leicester local samples. The scatter plot and associated histograms show the mean unrounded copy number values, generated from the mean PRT ratio of PRT1 and PRT2, plotted against the posterior probabilities of the integer copy number call of CNV1 for the Leicester samples.

Distribution of CNV1 copy number in Leicester local volunteers

A total of 149 individuals were genotyped to estimate the diploid copy number of the CNV1 region. CNV1 copy number varied from 2 to 7 per diploid genome, with a modal copy number of 4 (70% of individuals) and 23% of individuals carrying copy number 3 (Table 65).

Table 65: CNV1 copy number frequencies in the Leicester samples.
Diploid copy number | Leicester local samples: copy number count, frequency | HapMap CEU parents: copy number count, frequency
Total
Average CNV1 copy number

7.3.3 Estimation of CNV2 copy number in Leicester local volunteers

Two PRT assays, PRT3 and PRT4, were used for estimation of CNV2 copy number in the Leicester local samples. The raw PRT ratios of both assays were normalised using standard reference DNA. The distributions of the normalised PRT ratios of PRT3 (red) and PRT4 (green) were analysed in 149 local volunteer samples (Figure 118).

Figure 118: Histogram of raw PRT ratio of PRT3 (red), PRT4 (green) and mean CNV2 PRT ratio (gray) in the Leicester local samples.

The histogram of the mean PRT ratio of CNV2 in the Leicester local samples indicated six clusters (gray), with a few samples showing lower or higher PRT ratios (Figure 118). A six-component model was therefore used to call the integer copy number of CNV2 with CNVtools. The quality of the raw PRT ratios from PRT3 and PRT4 was assessed using a scatter plot (Figure 119). The scatter plot showed very good quality clusters, with only a few overlapping PRT ratios at the higher ratios.

Figure 119: Scatter plot produced from the raw ratios of the PRT3 and PRT4 assays for CNV2 estimation in the Leicester samples.

The mean PRT ratios were transformed to have a standard deviation of 1 to facilitate clustering in the CNVtools analysis. A mixture model of six components fitted the data well, and extremely low and high PRT ratios were treated as outliers. Integer copy numbers were inferred from the transformed PRT data using a Gaussian mixture model (Figure 120), and a clustering quality score (Q) was calculated for the resulting fit. The first cluster corresponded to a diploid copy number of 2 and the remaining clusters were assigned diploid copy numbers of 3, 4, 5, 6 and 7, respectively.

Figure 120: Output of the clustering procedure using the transformed PRT data of CNV2 in the Leicester local samples. The coloured lines show the posterior probability for each of the six copy number classes (copy number = 2, 3, 4, 5, 6 and 7).

The posterior probabilities of the integer copy number call for each sample were plotted (Figure 121). The posterior probabilities of copy number calls in the Leicester local samples were greater than 0.99, except for three samples whose posterior probabilities were lower but still greater than 0.80, indicating very high quality copy number measurement for CNV2.

Figure 121: Analysis of CNV2 integer copy number calling in the Leicester local samples. The scatter plot and associated histograms show the mean unrounded copy number values, generated from the mean PRT ratio of PRT3 and PRT4, plotted against the posterior probabilities of the integer copy number call for the Leicester samples.

Distribution of CNV2 copy number in Leicester local volunteers

A total of 149 individuals were genotyped to estimate the diploid copy number of the CNV2 region. CNV2 copy number varied from 0 to 9 per diploid genome (Table 66).

Table 66: CNV2 copy number frequencies in the Leicester local samples.
Diploid copy number | Leicester local samples: copy number count, frequency | HapMap CEU parents: copy number count, frequency
Total
Average CNV2 copy number

7.3.5 Secretor status of Leicester local volunteers

The genotype frequencies, and the frequencies of secretor status as deduced from genotyping, are presented in Table 67.

Table 67: Genotype frequency, allele frequency and secretor status of Leicester local samples.
Genotype | Count | Frequency | Allele | Frequency | Secretor status | Count | Frequency
GG | | | G | 0.56 | Secretor | |
GA | | | A | 0.44 | | |
AA | | | | | Non secretor | |

The frequency of the A allele in our samples was 44%. This differs from the 50% seen in the general European population, possibly because the sample included individuals from South Asian populations.

Analysis of SpaP genotype and CNV1 and CNV2 copy number of DMBT1

Multiple testing can produce significant association results by chance, but a significant association within the binding region was of particular interest in the present study. Logistic regression analysis showed that a total of eight out of 136 alleles were associated with either copy number or secretor status. Two of these polymorphisms (out of 54 non-synonymous polymorphisms) were non-synonymous changes. One of the two was within the second DMBT1 binding region (Ad2) of the SpaP gene of S. mutans (Table 68); its derived allele was associated with a lower copy number at CNV2 (p=0.019). The other associated non-synonymous allele was outside the DMBT1 binding regions of Ag I/II of S. mutans, and its association with lower CNV2 copy number was marginal (p=0.046).

Table 68: Summary of the regression analysis of polymorphic alleles of S. mutans with CNVs and secretor status for all samples. Synonymous and non-synonymous changes are indicated with S and N, and green and blue boxes indicate the Ad1 and Ad2 regions of the SpaP gene, respectively.
Rows: CNV1 (p-value of association); CNV2 (p-value of association); Secretor; Position of nucleotide
Type of polymorphism: S, N, N, S, S, S, S, S
Nucleotide change: A-G, C-G, C-A, C-T, A-G, A-G, T-C, T-C
Amino acid change: Thr to Ser (amino acid position 84); Ala to Asp (amino acid position 97)

To examine possible association of derived alleles with copy number and secretor status in individuals of European origin, logistic regression was performed using the polymorphism data of the European sequences. The SpaP region of S. mutans from European samples showed a total of 104 polymorphisms, of which 42 were non-synonymous. Logistic regression analysis showed that a total of seven alleles were associated with either copy number or secretor status, and one of these polymorphisms was a non-synonymous change (Table 69). The associated non-synonymous change was within the second DMBT1 (SAG) binding region (Ad2) of Ag I/II of S. mutans and was the same change that was associated in the analysis of all samples. The derived allele was associated with lower copy number at both CNV1 (p=0.038) and CNV2 (p=0.017). The other six polymorphisms were associated with either the CNVs or secretor status but did not change the amino acid.

Table 69: Summary of the regression analysis of polymorphic alleles of S. mutans with CNVs and secretor status for European samples. Synonymous and non-synonymous changes are indicated with S and N, and green and blue boxes indicate the Ad1 and Ad2 regions of the SpaP gene, respectively.
Rows: CNV1 (p-value of association); CNV2 (p-value of association); Secretor; Position of nucleotide
Type of polymorphism: S, N, S, S, S, S, S
Nucleotide change: A-G, C-A, T-A, C-T, A-G, A-G, T-C
Amino acid change: Ala to Asp

7.4 Discussion

It has been suggested that vertical transmission from mother to child is the main pathway for S. mutans acquisition (Alaluusua et al. 1996), so a strong pattern of geographic differentiation could be produced over time. A number of studies have demonstrated substantial genetic heterogeneity across clinical samples of S. mutans (Alaluusua et al., 1996; Emanuelsson et al., 1998). The phylogenetic analysis indicated that there was no geographical differentiation among samples from mouthwash DNA of different geographical origins. This result is consistent with previous results from other groups (Cornejo et al., 2013; Do et al., 2010). Cornejo et al. analysed a total of 57 strains of S. mutans representing different countries of origin but did not find any geographical structuring of diversity.
The present study, with sampling of a restricted population, effectively captured the global diversity of sequences analysed elsewhere and showed no intra-mouth sequence diversity for S. mutans, consistent with previous findings (Cornejo et al., 2013; Do et al., 2010). The lack of geographical differentiation of S. mutans in this study indicates that horizontal transmission of S. mutans is also possible in humans, and a similar pattern was observed in previous studies (Doméjean et al., 2010; Emanuelsson et al., 1998).

In chapter 6.5, I analysed the relationship between CNV1 and CNV2 copy number and agricultural and non-agricultural populations. Previous studies reported that the S. mutans population started expanding exponentially approximately 10,000 years ago (95% confidence interval [CI]: 3,296-14,344 years ago), coincident with the human transition to a starch-rich diet, which may have contributed to the successful adaptation of S. mutans to its new niche, the human mouth (Cornejo et al., 2013). The transition of human lifestyle from hunting and gathering to agriculture was associated with a change in the composition of the oral microbiota and broadly coincides with the estimated timing of a demographic expansion in S. mutans (Cornejo et al., 2013; Humphrey et al., 2014). S. mutans is widely recognised as one of the key etiological agents of human dental caries.

To study possible selection on S. mutans, the allele frequency spectrum (AFS) of a 1008 bp region of the SpaP gene of S. mutans was analysed. The SpaP region showed high sequence diversity at the DNA level (136 variant sites for all samples and 104 for European samples), and most variants were either singletons (unique) or at low allele frequency. Mutations at low frequency are a signature of recently expanded populations, and the AFS analysis of the Ag I/II region is therefore consistent with a recent expansion of S. mutans. The majority of mutations at very low frequency were non-synonymous. The proportion of non-synonymous polymorphisms at low allele frequencies indicates that negative selection is acting on the Ag I/II region of S. mutans. The McDonald-Kreitman test using homologous sequences of the most closely related species again argues for weak negative selection being the dominant force shaping diversity in S. mutans. The present study suggests that S. mutans has evolved in response to DMBT1: regression analysis found a significant association between DMBT1 copy number and a non-synonymous variant of S. mutans, which was associated with lower copy number at both CNV1 and CNV2.
The overall pattern of variation in the SAG-binding region of Ag I/II was that of weak negative selection across the population, where new amino acid changes were fixed within the host but were selected against when transferred from host to host. The present study also supports the lack of geographical structure in S. mutans, as the present sampling of a restricted population effectively captures the global diversity of sequences analysed elsewhere. The present study provides a framework for understanding the full nature and functional effect of sequence variation at DMBT1.

8 ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN CROHN'S DISEASE PATIENTS

8.1 Introduction

Inflammatory bowel disease (IBD) represents a group of chronic, relapsing and remitting inflammatory intestinal conditions comprising two main subtypes: Crohn's disease (CD) and ulcerative colitis (UC) (Hugot et al., 2001; Van Limbergen et al., 2009). UC and CD are characterised by both overlapping and distinct clinical and pathological features. The two diseases differ in the intestinal localisation and features of the inflammation. The inflammation of UC is limited to the colon; it starts from the rectum, spreads proximally in a continuous fashion and frequently involves the periappendiceal region. By contrast, Crohn's disease can occur anywhere in the gastrointestinal tract from mouth to anus. The inflammation of CD is non-continuous and most commonly involves the terminal ileum or the perianal region. Unlike ulcerative colitis, CD is commonly associated with complications such as strictures, abscesses and fistulas. The pathogenesis of CD is incompletely understood but is considered to be multifactorial: a combination of the genetic makeup of the individual and environmental risk factors, such as altered luminal bacteria and enhanced intestinal permeability, plays a role in the dysregulation of intestinal immunity, leading to CD. Individuals of any age can be affected by CD, although incidence peaks in early adult life, and the estimated prevalence is 1 in 1000 in western countries. In recent years, genome-wide association studies (GWASs) have successfully identified 99 non-overlapping genetic risk loci, including 28 that are shared between CD and UC (Anderson et al., 2011; Bernard et al., 2011; Franke et al., 2012).

8.2 The rationale for study

DMBT1 is strongly expressed in the organs of the gastrointestinal (GI) system as a heavily sulfated membrane glycoprotein, also known as mucin-like glycoprotein (Muclin) (De Lisle et al. 2008).
In humans, a deletion allele of DMBT1 with a reduced number of scavenger receptor cysteine-rich (SRCR) domains was found to be associated with an increased risk of CD, but not UC, in a medium-sized case-control study (368 CD cases and 346 controls) (Renner et al. 2007). In the same study, Renner et al. also reported that Dmbt1-/- mice showed increased sensitivity to dextran sulphate sodium (DSS)-induced colitis and that TNFα, IL-6 and NOD2/CARD15 expression levels were elevated during inflammation (Renner et al. 2007). The intracellular pathogen receptor NOD2 targets DMBT1 via NF-κB activation, and strong up-regulation was found in the inflamed intestinal mucosa of Crohn's disease patients with wild-type, but not with mutant, NOD2 (Rosenstiel et al. 2007). A non-coding DMBT1 SNP was found to be associated with decreased overall DMBT1 expression in the colon and increased CD susceptibility (Diegelmann et al., 2013).

The Wellcome Trust Case Control Consortium (WTCCC) conducted a genome-wide CNV association study in eight common diseases, which included Crohn's disease. The deletion allele (CNV1) was effectively assayed by the Agilent 210K array CGH chip on the HapMap samples (Conrad et al. 2010), and no association was found to be significant at the genome-wide level. However, the other complex CNV (CNV2), towards the end of DMBT1, is not well assayed by the Agilent 210K chip. Because the literature gives DMBT1 a higher prior probability of association than a random CNV, the study was designed to use a different copy number typing approach (the PRT assay) on a large case-control cohort, and so I typed CNV1 and CNV2 to test for association between copy number and CD. The previous study reported that deleted variants of DMBT1 (CNV1) were associated with Crohn's disease using relatively small sample sizes, and no published study has yet attempted to replicate this finding. The main objective of the present study was to replicate the association using more DNA samples from Crohn's disease cases from three different centres. The study also compared two different approaches for determining copy number at CNV1 and CNV2 of DMBT1 where both aCGH and PRT data existed.

8.3 Estimation of DMBT1 copy number in Crohn's samples

Copy number estimation of English Crohn's samples

CNV1 copy number estimation in English Crohn's samples

The PRT ratios from PRT1 and PRT2 were compared, and the scatter plot indicated a positive relationship between the raw PRT ratios of the two assays (r²=0.96). The scatter plot showed very good quality clusters without any overlap (Figure 122).
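The agreement between two PRT assays, reported here as r², can be computed directly from paired ratios. The sketch below uses the squared Pearson correlation; the helper that flags discordant samples for retyping, and its threshold of 0.25, are hypothetical illustrations rather than the criterion used in the thesis.

```python
def pearson_r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists of ratios."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy ** 2 / (sxx * syy)


def discordant_samples(prt_a, prt_b, max_abs_diff=0.25):
    """Indices of samples whose two assay ratios differ by more than a threshold.

    The threshold is a hypothetical value chosen for illustration.
    """
    return [i for i, (a, b) in enumerate(zip(prt_a, prt_b)) if abs(a - b) > max_abs_diff]


# Hypothetical normalised ratios from two assays (illustration only).
prt1 = [1.00, 1.50, 2.00, 2.50, 1.02, 1.98]
prt2 = [1.02, 1.90, 2.01, 2.47, 0.99, 2.03]
print(pearson_r_squared(prt1, prt2))
print(discordant_samples(prt1, prt2))
```

Samples flagged in this way would be natural candidates for retyping before the mean ratio is passed to the clustering step.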

Figure 122: Scatter plot produced from the PRT1 and PRT2 assays used to estimate the diploid copy number of CNV1 in English Crohn's and control samples.

The histogram of the mean PRT ratio showed good clustering and indicated a total of 4 clusters, with a difference of 0.5 between the mean values of adjacent clusters (Figure 123). The histogram therefore supported the use of a four-component model in the CNVtools analysis.

Figure 123: Histogram of the mean unrounded normalised PRT ratio of CNV1 for English Crohn's and control samples.

The mean PRT ratios were transformed to have a standard deviation of 1, as recommended for CNVtools, and integer copy numbers were inferred from the transformed PRT data using a Gaussian mixture model. A mixture model of four components was fitted, based on the clustering of the normalised PRT data (Figure 124). A clustering quality score (Q) was calculated to check the quality of the clusters. The first cluster corresponded to a diploid copy number of 2 and the remaining clusters were assigned diploid copy numbers of 3, 4 and 5, respectively.

Figure 124: Output of the clustering procedure using the transformed PRT data of CNV1 for English Crohn's and control samples. The coloured lines show the posterior probability for each of the four copy number classes (copy number = 2, 3, 4 and 5).

Posterior probabilities of the integer copy number call for each sample are plotted in Figure 125. The posterior probabilities for most of the samples were more than 0.99, and those of five samples were lower but still more than 0.80, indicating that copy number calling for CNV1 was very good for the English Crohn's samples.

Figure 125: Analysis of integer copy number calling. The scatter plot and associated histograms show the mean unrounded copy number values of CNV1, generated from PRT1 and PRT2, plotted against the posterior probabilities of the integer copy number call for English Crohn's disease and control samples.

CNV2 copy number estimation in English Crohn's samples

The PRT ratios from PRT3 and PRT4 were compared using a scatter plot to check the specificity of the assays for the CNV2 locus in the English Crohn's samples (Figure 126). The scatter plot showed very good quality clusters without any overlap for the lower copy number clusters, and the raw PRT ratios of PRT3 and PRT4 were well correlated (r²=0.89). The clusters were somewhat noisier for higher copy number samples. Samples that showed poor agreement between the two assays were retyped using the PRT assays before the CNVtools analysis.

Figure 126: Scatter plot produced from the PRT ratios of the PRT3 and PRT4 assays for CNV2 estimation in English Crohn's and control samples.

The histogram of the mean PRT ratio of CNV2 in the English CD samples indicated eight clusters (Figure 127). An eight-component model was therefore used to call the integer copy number of CNV2 with CNVtools.

Figure 127: Histogram of the mean normalised PRT ratio of CNV2 in English CD samples. The x-axis shows the mean PRT ratio for CNV2.

Integer copy numbers were inferred from the transformed PRT data of CNV2 using a Gaussian mixture model. A mixture model of eight components was used first, based on the histogram of the mean PRT data of CNV2 (Figure 128). After excluding the lowest PRT ratios (mean ratio around 0), the CNV2 data were re-analysed using a mixture model of eight components and CNVtools produced very good clusters. A clustering quality score (Q) was calculated for the CNV2 data. The actual copy number of each cluster was assigned on the basis of long-range PCR of the CNV2 region. The first cluster was assigned a diploid CNV2 copy number of 2 and the remaining clusters were assigned diploid copy numbers of 3, 4, 5, 6, 7, 8 and 9, respectively.

Figure 128: Output of the clustering procedure using the transformed PRT data of CNV2 for English CD samples. The coloured lines show the posterior probability for each of the eight copy number classes (copy number = 2, 3, 4, 5, 6, 7, 8 or 9).

Posterior probabilities of the integer copy number call for each sample are plotted in Figure 129. The posterior probabilities for most of the samples were more than 0.95, and posterior probabilities of more than 0.80 were taken to indicate good copy number calling for CNV2. A total of 120 samples (10%) with posterior probabilities < 0.75 were either retyped or called manually using duplicate PRT ratios. Overall, the posterior probabilities of CNV2 for the English samples indicated that copy number calling was not as good as for CNV1.

Figure 129: Analysis of integer copy number calling. Scatter plot and associated histograms show mean unrounded copy number values of CNV2 generated by PRT3 and PRT4 plotted against posterior probabilities of the integer copy number call for English Crohn's disease and control samples.

Copy number estimation of Scottish Crohn's samples

CNV1 copy number estimation in Scottish Crohn's samples

The scatter plot of the raw PRT ratios of PRT1 and PRT2 indicated good correlation (r² = 0.96) without any overlapping clusters (Figure 130).

Figure 130: Scatter plot of the PRT1 and PRT2 assays used to estimate the diploid copy number of CNV1 in Scottish Crohn's and control samples.

A histogram of the mean PRT ratio of PRT1 and PRT2 in the Scottish Crohn's patient cohort was examined to determine the number of clusters before CNVtools analysis. The histogram of the mean PRT ratio of CNV1 indicated four cluster components for the CNVtools analysis. The mean PRT ratios were transformed to have a standard deviation of 1 before CNVtools analysis, and integer copy numbers were called from the transformed PRT data using a Gaussian mixture model. A mixture model of four components was fitted, based on clustering of the normalized PRT data (Figure 131). The quality score was measured to check the quality of the clusters and the resulting clustering quality score (Q) was The first cluster indicated a diploid copy number of 2 and the remaining clusters were assigned diploid copy numbers of 3, 4 and 5, respectively.
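The calling step described above (scaling the mean PRT ratios to a standard deviation of 1, then assigning each sample to the mixture component with the highest posterior probability) can be illustrated with a short sketch. This is not CNVtools, which is an R package that also estimates the mixture parameters; here the component weights, means and standard deviations are hypothetical values supplied by hand, and the function names and the `retype_below` threshold are assumptions made for illustration.

```python
import math
from statistics import pstdev


def scale_to_unit_sd(ratios):
    """Divide the ratios by their standard deviation so the result has SD 1."""
    sd = pstdev(ratios)
    return [r / sd for r in ratios]


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def call_copy_number(x, components, retype_below=0.75):
    """Return (copy_number, posterior, needs_retyping) for one sample.

    components: list of (copy_number, weight, mean, sd) tuples describing an
    already fitted Gaussian mixture (hypothetical values in this sketch).
    """
    weighted = [(cn, w * normal_pdf(x, m, s)) for cn, w, m, s in components]
    total = sum(density for _, density in weighted)
    if total == 0.0:  # value far from every component
        return None, 0.0, True
    copy_number, best = max(weighted, key=lambda item: item[1])
    posterior = best / total  # Bayes' rule
    return copy_number, posterior, posterior < retype_below


# Hypothetical four-component mixture for diploid copy numbers 2 to 5.
components = [(2, 0.05, 1.0, 0.1), (3, 0.20, 1.5, 0.1),
              (4, 0.70, 2.0, 0.1), (5, 0.05, 2.5, 0.1)]
print(call_copy_number(2.0, components))
print(call_copy_number(1.75, components))
```

A sample lying midway between two cluster means receives a posterior probability well below 1, which is the situation in which samples were retyped or called manually.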

Figure 131: Output of the clustering procedure using the PRT transformed data of CNV1 for Scottish CD and control samples. The coloured lines show the posterior probability for each of the four copy number classes (copy number = 2, 3, 4 and 5).

CNVtools called the integer copy number, with a posterior probability for each sample, using a Gaussian mixture model. Posterior probabilities of the integer copy number call were plotted for all samples. The posterior probabilities for most of the samples were between 0.95 and 0.99 (Figure 132). Only two samples had posterior probabilities below 0.95, indicating that copy number calling for CNV1 was very good for the Scottish Crohn's samples.

Figure 132: Analysis of CNV1 integer copy number calling using CNVtools. Clustering of the data followed by fitting of a Gaussian mixture model allows integer copy number calling from the normalized mean unrounded values of CNV1 generated by PRT1 and PRT2. The Bayesian posterior probabilities of each CNV1 copy number call are shown for Scottish Crohn's and control samples.

CNV2 copy number estimation in Scottish Crohn's samples

The PRT ratios from PRT3 and PRT4 were compared using a scatter plot to check the specificity of the assays for the CNV2 locus in the Scottish Crohn's samples (Figure 133). The raw PRT ratios of PRT3 and PRT4 were highly correlated (r² = 0.93), and the scatter plot showed well-separated clusters with no overlap at lower PRT ratios. Samples that showed poor agreement between the two assays were retyped using both PRT assays before CNVtools analysis.

Figure 133: Scatter plot of the PRT ratios from the PRT3 and PRT4 assays used for CNV2 estimation in Scottish Crohn's and control samples.

The histogram of the average PRT ratio of the CNV2 data in the Scottish CD samples indicated eight clusters (Figure 134). Eight clusters were therefore used to call the integer copy number of CNV2 using CNVtools. Samples with a higher mean PRT ratio (> 5) were excluded from the CNVtools analysis and called manually based on the mean PRT ratio. A mixture model of eight components was used, based on the histogram of the mean PRT data of CNV2 (Figure 135). The clustering quality score (Q) for CNV2 data was Based on long-range PCR of the CNV2 region, the actual copy number of the first cluster was assigned. The first cluster was assigned a diploid CNV2 copy number of 2, and the remaining clusters were assigned diploid copy numbers of 3, 4, 5, 6, 7, 8 and 9, respectively.

Figure 134: Output of the clustering procedure using the PRT transformed data of CNV2 for Scottish Crohn's and control samples. The coloured lines show the posterior probability for each of the eight copy number classes (copy number = 2, 3, 4, 5, 6, 7, 8 and 9).

The quality of copy number calling for CNV2 was assessed from the posterior probabilities of the integer copy number call for each sample (Figure 135). The posterior probabilities for most of the samples were more than 0.95, and calls with posterior probabilities of more than 0.80 were considered good. A total of 80 samples with posterior probabilities < 0.75 (11%) were either retyped or called manually using the duplicate PRT ratios. Overall, the posterior probabilities of CNV2 for the Scottish samples indicated that the quality of copy number calling was similar to that in the English CD cohort.

Figure 135: Analysis of CNV2 integer copy number calling using CNVtools. Clustering of the data followed by fitting of a Gaussian mixture model allows integer copy number calling from the normalized mean unrounded values of CNV2 generated by PRT3 and PRT4. The Bayesian posterior probabilities of each CNV2 copy number call are shown for Scottish Crohn's and control samples.

Copy number estimation in Danish Crohn's samples

CNV1 copy number estimation in Danish Crohn's samples

A histogram of the mean PRT ratio of CNV1 in the Danish Crohn's samples was examined to determine the number of clusters before CNVtools analysis. The histogram indicated four cluster components for the CNVtools analysis (Figure 136).

Figure 136: Histogram of mean normalized PRT ratio of CNV1 for Danish Crohn's and control samples.

The average raw PRT ratios were transformed to have a standard deviation of 1 before CNVtools analysis, and integer copy numbers were called from the transformed PRT data using a Gaussian mixture model in CNVtools. A mixture model of four components fitted well (Figure 137), and the quality score was measured to check the quality of the clusters. The resulting clustering quality score (Q) was The first cluster indicated a diploid copy number of 2 and the remaining clusters were assigned diploid copy numbers of 3, 4 and 5, respectively.

Figure 137: Output of the clustering procedure using the PRT transformed data of CNV1 for Danish CD and control samples. The coloured lines show the posterior probability for each of the four copy number classes (copy number = 2, 3, 4 and 5).

Posterior probabilities of the integer CNV1 copy number call for each sample were plotted for the Danish CD samples (Figure 138). The posterior probabilities for most of the samples were greater than 0.99, indicating that CNVtools called them cleanly without any overlap between clusters. The posterior probability for one sample was poor (approximately 0.61), so the sample was retyped and called as 3 copies, based on the raw PRT value. The overall copy number calling of CNV1 was very good for the Danish IBD samples, making the data suitable for a case-control association study.

Figure 138: Analysis of CNV1 integer copy number calling using CNVtools. Clustering of the data followed by fitting of a Gaussian mixture model allows integer copy number calling from the normalized mean unrounded values of CNV1 generated by PRT1 and PRT2. The Bayesian posterior probabilities of each CNV1 copy number call are shown for Danish CD and control samples.

CNV2 copy number estimation in Danish Crohn's samples

The sensitivity and specificity of the PRT assays PRT3 and PRT4 in the Danish CD samples were compared using a scatter plot (Figure 139). The scatter plot showed high correlation (r² = 0.94) between the raw PRT ratios of PRT3 and PRT4 and produced well-separated clusters with no overlap for the first five clusters. The samples with higher PRT ratios showed poor

clustering around integer values, and their copy number was called by eye based on the averaged PRT ratio.

Figure 139: Scatter plot of the raw PRT ratios of the PRT3 and PRT4 assays used to estimate the diploid copy number of CNV2 in Danish CD and control samples.

The histogram of the averaged PRT ratio of the CNV2 data in the Danish CD samples indicated seven clusters (Figure 140), and samples with raw PRT ratios greater than 4.5 were treated as outliers. Seven clusters were therefore used to call the integer copy number of CNV2 using CNVtools. The outlier samples were called by eye using the mean PRT ratio. A mixture model of seven components was used, based on the histogram of the mean PRT data of CNV2. The clustering quality score (Q) for the CNV2 data was 4.15, greater than for the English or Scottish CD cohorts. Based on long-range PCR and prior knowledge of the CNV2 ratio, an actual copy number was assigned to each copy number cluster. The first cluster was assigned a diploid CNV2 copy number of 2, and the remaining clusters were assigned diploid copy numbers of 3, 4, 5, 6, 7 and 8, respectively.

Figure 140: Output of the clustering procedure using the PRT transformed data of CNV2 for Danish CD and control samples. The coloured lines show the posterior probability for each of the seven copy number classes (copy number = 2, 3, 4, 5, 6, 7 and 8).

The quality of copy number calling for CNV2 was evaluated using the posterior probabilities of the integer copy number call for each sample (Figure 141). The posterior probabilities for most of the samples were more than 0.95, and calls with posterior probabilities of more than 0.80 were also considered good. A total of 49 samples with posterior probabilities < 0.75 (7%) were either retyped or called by eye using the retyped PRT ratio.

Figure 141: Analysis of CNV2 integer copy number calling using CNVtools. Clustering of the data followed by fitting of a Gaussian mixture model allows integer copy number calling from the normalized mean unrounded values of CNV2 generated by PRT3 and PRT4. The Bayesian posterior probabilities of each CNV2 copy number call are shown for Danish CD and control samples.

8.4 Comparison of aCGH and PRT for copy number estimation

To validate the CNV1 and CNV2 copy number calls for the Crohn's disease samples, the mean unrounded PRT ratio was compared with array CGH data previously generated using an Agilent 210k aCGH chip as part of the WTCCC genome-wide CNV association study. A total of 785 Crohn's disease samples from our cohorts (97 Scottish and 688 English Crohn's samples) had been analysed as part of the WTCCC CNV association study. A principal component analysis was used to compare the aCGH data and the raw PRT ratio for both the CNV1 and CNV2 regions. The aCGH data for the Crohn's disease cohort were normalised in two different ways: in the first normalisation (normalised1), the log of the ratio of the red and green channel data (log(R/G)) was used, whereas in the second normalisation (normalised2), the log of the ratio of the quantile normalised red and green channel data (log(QNorm(R)/QNorm(G))) was

calculated. A scree plot was drawn for both the first and second normalised values to determine the proportion of variation described by the different principal components, and to decide which principal components might be useful for validating the copy number calling of CNV1. The scree plot for the first normalised aCGH data shows that the first principal component describes 65% of the variation (Figure 142 A), reflecting the underlying copy number variation of the samples. The scatter plot of the normalised PRT ratios obtained from PRT1 and PRT2 indicated good correlation without any overlapping clusters for the English (r² = 0.96) and Scottish (r² = 0.90) Crohn's samples. The average PRT ratio of the PRT1 and PRT2 assays was used for integer copy number estimation of CNV1. The average PRT ratios were compared with the first PC of the first normalised array CGH data. There was a clear positive correlation between the two methods for all samples (r² = 0.75), and at the population level the correlation was better for the English (r² = 0.77) than for the Scottish (r² = 0.62) Crohn's samples. The data generated by both the PRT and aCGH assays clustered effectively, although the aCGH values showed a limited amount of overlap between copy number clusters (Figure 142 B-D). The clusters produced by the average PRT ratio were better separated than the clusters of the 1PC generated from the aCGH data of CNV1.
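The quantity read from a scree plot, the share of total variance carried by the first principal component, and the per-sample 1PC scores that were compared with the PRT ratios can be sketched as below. This is a minimal illustration using power iteration on the covariance matrix, not the software used in the thesis; the function name, the starting vector and the toy matrix of log ratios are assumptions made here.

```python
import math


def first_pc(rows, iterations=200):
    """Return (variance_share, scores) for the first principal component.

    rows: samples x probes matrix of normalised log ratios (toy data here).
    Assumes at least two samples and non-zero total variance.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix between probes.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    total_variance = sum(cov[i][i] for i in range(p))
    # Power iteration from an arbitrary non-uniform starting vector.
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return eigenvalue / total_variance, scores


# Toy data: four samples, two probes that vary together.
share, scores = first_pc([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
print(share, scores)
```

The scores returned for each sample play the role of the 1PC values that were plotted against the mean unrounded PRT ratio.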

Figure 142: (A) Scree plot for PCA of the first normalised aCGH data generated using the Agilent 210K CNV chip for CNV1 in Crohn's disease samples. The x-axis shows the number of principal components. (B, C, D) Scatter plots showing the correlation of the mean unrounded copy number value of CNV1 and the 1PC of the first normalised Agilent aCGH data in all (B), English (C) and Scottish (D) Crohn's disease samples.

The scree plot for the second normalised aCGH data from the Crohn's cohort showed that the first principal component explained the most variation (almost 65%), as observed previously for the first normalised data of CNV1 (Figure 143 A). The first principal component of the second normalised aCGH data was therefore used to validate the quality of integer copy number calling of CNV1 using the PRT ratio. The data generated by the PRT and aCGH assays clustered effectively with a moderate correlation (r² = 0.65), and at the population level the correlation was higher for the English CD cohort (r² = 0.66) than for the Scottish CD cohort (r² = 0.61) (Figure 143). The second normalised data produced more overlap between aCGH copy number clusters than the first normalised data. The clusters produced by the average PRT ratio were more distinct than the clusters of the 1PC of the second normalised aCGH data of CNV1 in the Crohn's disease cohort.

Figure 143: (A) Scree plot for PCA of the second normalised aCGH data generated using the Agilent 210K CNV chip for CNV1 in Crohn's disease samples. The x-axis shows the number of principal components. (B, C, D) Scatter plots showing the correlation of the mean unrounded copy number value of CNV1 and the 1PC of the second normalised Agilent aCGH data in all (B), English (C) and Scottish (D) Crohn's disease samples.

To validate CNV2 copy number calling, a principal component analysis was performed on the aCGH data of the CNV2 region for all CD samples, and the aCGH data and the mean unrounded raw PRT ratio were compared. As for CNV1, both the first and second normalisations were used for the CNV2 region, and all analyses were performed as for the CNV1 region. The scree plot showed that the first principal component of the first normalised aCGH data of CNV2 explained the most variation (almost 58%) in the Crohn's disease cohort (Figure 144 A), although this was lower than for CNV1. Two independent PRT assays (PRT3 and PRT4) were used to estimate the integer copy number of CNV2 in the English and Scottish Crohn's disease cohorts. The normalised PRT ratios of PRT3 and PRT4 were compared, and the scatter plot indicated good correlation for the English (r² = 0.89) and Scottish (r² = 0.93) Crohn's samples, with some overlap between clusters at higher PRT ratios. The mean unrounded PRT ratio generated by the PRT3 and PRT4 assays was used for integer copy number

estimation of CNV2. The mean unrounded PRT ratios were compared with the first PC of the first normalised array CGH data. The scatter plot of the PRT ratios and the 1PC of the aCGH data showed a moderate positive correlation (r² = 0.55), lower than for the CNV1 region (Figure 144 B, C, D). The highest correlation was seen for the Scottish Crohn's samples (r² = 0.66), followed by the English Crohn's samples (r² = 0.54). The histogram of the average PRT ratio produced good clusters for CNV2 in the Crohn's disease cohort, but the clusters were not as good as for the CNV1 region and overlapped at higher PRT values. For the CNV2 region, the first PC of the first normalised array CGH data did not show any evidence of clustering.

Figure 144: (A) Scree plot for PCA of the first normalised aCGH data generated using the Agilent 210K CNV chip for CNV2 in Crohn's disease samples. The x-axis shows the number of principal components. (B, C, D) Scatter plots showing the correlation of the mean unrounded copy number value of CNV2 and the 1PC of the first normalised Agilent aCGH data in all (B), English (C) and Scottish (D) Crohn's disease samples.

The scree plot for the second normalised aCGH data of the Crohn's disease cohort showed that the first principal component explained the most variation (almost 65%), more than the first principal component of the first normalised data for the CNV2 region (Figure 145). The scatter plot of the mean unrounded PRT ratios and the 1PC of the second normalised array CGH data showed a lower correlation (r² = 0.42) than any other data set used in our analysis. The correlation was higher for the Scottish Crohn's samples (r² = 0.47) than for the English Crohn's cohort (r² = 0.42) (Figure 145). Neither the first nor the second normalised 1PC showed any evidence of clustering around integer copy numbers for the CNV2 region.

Figure 145: (A) Scree plot for PCA of the second normalised aCGH data generated using the Agilent 210K CNV chip for CNV2 in Crohn's disease samples. (B, C, D) Scatter plots showing the correlation of the mean unrounded copy number value of CNV2 and the 1PC of the second normalised Agilent aCGH data in all (B), English (C) and Scottish (D) Crohn's disease samples.

In previous work on the HapMap samples, scatter plots showed good clustering around integer copy numbers, and CNV1 was called well using both aCGH with the Agilent 210k aCGH chip and

PRT assays. For CNV2, the PRT assays showed evidence of satisfactory clustering, although it was poor at higher copy numbers, and there was no evidence of clustering for the aCGH data. Comparison of aCGH and PRT ratios in the Crohn's patients and controls produced results similar to those for the HapMap samples for both CNV1 and CNV2, but it should be noted that, for CNV2, both aCGH and PRT showed poorer clustering. Clustering was still evident for the PRT data, but not at all for the array CGH data.

8.5 Distribution of diploid copy number in the Crohn's samples

Distribution of CNV1 copy number in Crohn's samples

The distribution of diploid copy number for CNV1 is shown in Figure 146. It was clear that CNV1 was a multiallelic CNV with copy number varying between 2 and 4 per diploid genome. The modal copy number was 4 in both case and control samples for all three Crohn's cohorts. The mean copy number for cases and controls was almost the same in the English (CD = 3.82 and controls = 3.85) and Scottish (CD and controls = 3.84) cohorts. The mean CNV1 value for the Danish Crohn's samples (3.83) was higher than for the control samples (3.73).

Figure 146: Distribution of diploid copy number for CNV1 of DMBT1 in CD samples. Bar graphs illustrating distributions of copy number determined from paralogue ratio test data using CNVtools Gaussian mixture distributions in the English, Scottish and Danish Crohn's samples (from left to right).

Distribution of CNV2 copy number in Crohn's samples

The distribution of diploid copy number for CNV2 is shown in Figure 147. It was clear that CNV2 was a multiallelic CNV with a copy number varying between 2 and 10 per diploid genome in the English and Scottish cohorts, whereas in the Danish cohort the CNV2 diploid copy number ranged between 2 and 9.
The mean copy number for cases was lower than for controls in the English cohort (CD = 4.93 and controls = 5.15), but the opposite trend was observed in the Scottish (CD = 5.27 and controls = 5.15) and Danish (CD = 5.34 and controls = 4.98) cohorts.

Figure 147: Distribution of diploid copy number for CNV2 of DMBT1 in CD samples. Bar graphs illustrating distributions of copy number determined from paralogue ratio test data using CNVtools Gaussian mixture distributions in the English, Scottish and Danish Crohn's samples (from left to right).

Distribution of SRCR copy number in Crohn's samples

The total copy number of SRCR domains was estimated from the diploid copy numbers of the CNV1 and CNV2 regions of DMBT1 together, with the non-CNV region also included. The distributions of total SRCR copy number for the English, Scottish and Danish cohorts are shown in Figure 148, Figure 149 and Figure 150, respectively. The total number of SRCR domains ranged between 16 and 31 per diploid genome in the English and Scottish cohorts, but in the Danish cohort the range was from 16 to 32. The three major classes of total SRCR domain number (24, 25 and 26) were found in all three cohorts.

Figure 148: Distribution of total SRCR domain number of DMBT1 in the English Crohn's samples. Bar graph illustrating the distribution of total SRCR domain number in the English Crohn's samples.

Figure 149: Distribution of total SRCR domain number of DMBT1 in the Scottish Crohn's samples. Bar graph illustrating the distribution of total SRCR domain number in the Scottish Crohn's samples.

Figure 150: Distribution of total SRCR domain number of DMBT1 in the Danish Crohn's samples. Bar graph illustrating the distribution of total SRCR domain number in the Danish Crohn's samples.

8.6 Association of DMBT1 copy number with Crohn's disease

There were no significant differences between cases and controls for CNV1 in any of the three CD populations. For CNV2, two of the cohorts (English and Danish) showed a slight difference that achieved modest statistical significance, but the trend was in the opposite direction in each cohort (Table 70). The mean CNV2 copy number was lower in CD cases (4.93) than

controls (5.15) in the English Crohn's samples, but in the Danish Crohn's samples the mean CNV2 copy number was higher in CD cases (5.34) than in Danish controls (4.98). The same trend as in the Danish samples was seen in the Scottish Crohn's samples (CD = 5.27 and controls = 5.15), but the difference was not significant. There was a significant difference (p = 0.005) between the mean SRCR copy number of cases and controls in the English Crohn's samples, although this did not replicate in either the Scottish (p = 0.355) or Danish (p = 0.863) Crohn's samples.

Table 70: Comparison of CNV1, CNV2 and SRCR copy number frequency in Crohn's patients and controls of three different Crohn's cohorts.

CD cohort         | English CD                  | Scottish CD                 | Danish CD
Disease status    | Crohn's | control | P value | Crohn's | control | P value | Crohn's | control | P value
CNV1 copy number  |
CNV2 copy number  |
SRCR copy number  |

A previous study of people of similar ethnic background (Caucasian), including 367 Italian Crohn's patients and 346 controls without a history of IBD, showed that a deletion allele of DMBT1 was associated with an increased risk of CD (P = ; odds ratio, 1.75) (Renner et al. 2007). The deletion allele, described previously as a diallelic copy number variation, corresponds to the CNV1 region in our study. From our study it was clear that CNV1 was a multiallelic CNV with copy number ranging between 2 and 5 per diploid genome in Caucasians. One CNV1 copy number variable unit represents a block of 4 SRCR domains, and diploid copy numbers of 2, 3 and 4 for CNV1 represent homozygous deletion, heterozygous deletion and the normal genotype of the longest allelic version, respectively. To determine whether the deletion allele increases the risk of CD in our cohorts, the numbers of deleted and non-deleted alleles were counted in CD cases and controls for the three cohorts, and no statistically significant association was found between the presence or absence of the deletion allele and CD.
The present study showed that the deletion allele frequency (10-14%) was almost equal in cases and controls for the three CD cohorts (Table 71).

Table 71: Comparison of DMBT1 deletion allele frequency in Crohn's patients and controls of three different Crohn's cohorts.

CD cohort            | English CD               | Scottish CD              | Danish CD
Allele               | Deleted    | Non-deleted | Deleted    | Non-deleted | Deleted    | Non-deleted
Cases                | 172 (10%)  | 1490 (90%)  | 67 (10%)   | 629 (90%)   | 44 (14%)   | 266 (86%)
Controls             | 83 (9%)    | 877 (91%)   | 67 (10%)   | 613 (90%)   | 34 (10%)   | 314 (90%)
Fisher's exact test  | p=0.17                   | p=0.93                   | p=
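The allele-counting comparison above can be sketched in a few lines. This is an illustrative reimplementation, not the software used in the thesis: the function names are hypothetical, the two-sided p-value uses the common convention of summing all tables no more probable than the observed one, and treating a diploid copy number of 5 as carrying no deletion allele is an assumption made here.

```python
from math import comb


def deletion_allele_counts(diploid_copy_numbers):
    """Count (deleted, non_deleted) CNV1 alleles.

    Diploid copy numbers 2, 3 and 4 are taken to carry 2, 1 and 0 deletion
    alleles; copy numbers above 4 are assumed to carry none.
    """
    deleted = sum(max(0, 4 - cn) for cn in diploid_copy_numbers)
    return deleted, 2 * len(diploid_copy_numbers) - deleted


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    p_value = sum(p for p in (probability(x) for x in range(low, high + 1))
                  if p <= observed * (1 + 1e-7))
    return min(1.0, p_value)


# Allele counts for the English cohort as given in Table 71:
# cases 172 deleted / 1490 non-deleted, controls 83 deleted / 877 non-deleted.
print(fisher_exact_two_sided(172, 1490, 83, 877))
```

Counting alleles rather than individuals doubles the sample size in the table, which is why each row of Table 71 sums to twice the number of samples typed.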

8.7 Discussion

A previous study had reported that a deletion allele of DMBT1, resulting in a reduced number of SRCR domains, was associated with an increased risk of CD (Renner et al. 2007). The previously reported deletion allele represents a low copy number at CNV1; 2 and 3 copies of CNV1 represent a homozygous and a heterozygous deletion, respectively. The present study did not find any significant association between CD and mean CNV1 copy number. An allelic model of CNV1 also found no evidence of the association reported previously in the Italian CD cohort (Renner et al. 2007). A significant association between CNV2 copy number and CD was found in two cohorts, although it was not repeated in the third cohort. The trend of mean copy number was in the opposite direction in the two cohorts that showed association, which suggests that it is unlikely to be a true genetic association. A significant association between total SRCR copy number and CD was found in the English cohort, but the association was not replicated in the Scottish or Danish CD cohorts. The modest statistically significant difference in CNV2 may be due to a very subtle differential bias between cases and controls. For both the English and Danish DNA plates, the case and control samples were aliquoted on different plates without random distribution. In addition, in the English cohort the case and control DNA were collected from different sources. For the English CD study, random human (HRC) DNA isolated from lymphoblastoid cell lines was used as disease-free control samples. Healthy blood donors from a Danish blood bank were included as controls in the Danish Crohn's cohort. For the Scottish cohort, however, the case and control DNA were distributed randomly across plates, and 92% of the samples were collected from blood, with the rest from saliva.
This study suggests that copy number variation at DMBT1 is not a risk factor for CD pathogenesis, and similar results were reported previously by the WTCCC genome-wide association study of CNVs for CD (Craddock et al. 2010). Our study showed that the WTCCC would have called CNV1, but not CNV2. The previously reported association between the DMBT1 deletion allele and an increased risk of CD in an Italian cohort may not have been a true genetic association and might be a chance association due to subtle shifts in allele frequency. The frequency of the deletion allele was lower in all three of our cohorts (10-14%) than in the Italian CD cohort (22%) (Renner et al. 2007). The previous studies considered CNV1 to be diallelic, although the present study shows that both CNVs are multiallelic. Sometimes loss of SRCR copies due to a deletion allele at CNV1 might be

compensated for by CNV2 copy number, resulting in no change in the total SRCR copy number in the diploid genome. DMBT1 is a glycoprotein, and the nature and patterns of DMBT1 glycosylation remain to be clarified. Copy number together with glycosylation of DMBT1 might help to explain CD pathogenesis.

9 ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN AFRICAN HIV COHORTS

9.1 Introduction

Human immunodeficiency virus infection/acquired immunodeficiency syndrome (HIV/AIDS) is caused by HIV (types 1 and 2), of the family Retroviridae, which is prevalent worldwide. A total of 35.3 million adults and children are infected with HIV (WHO-HIV department report, Figure 151 and Figure 152), and in 2006 it was predicted that it would be the largest cause of morbidity by 2030, as measured by disability-adjusted life-years (Mathers & Loncar, 2006). The disease prevalence is not the same for all countries or continents. The highest disease burden of HIV is in African countries, with 9.2% prevalence in Addis Ababa in Ethiopia and over 10% in Dar-es-Salaam in Tanzania (Aklillu et al., 2013).

Figure 151: Worldwide HIV prevalence among adults (adapted from WHO-HIV department).

Various genetic studies have been conducted to establish the genetic contribution to HIV susceptibility in African populations, mainly from Western countries. A genome-wide association study has not yet found any significant signal of HIV susceptibility (Petrovski et al., 2011). Copy number variation often shows complex patterns of linkage disequilibrium with surrounding SNPs, and previous studies may have missed associations with structurally complex regions, mainly copy number variation (Locke et al., 2006).

Figure 152: Regional HIV and AIDS statistics according to the WHO-HIV department.

The retrovirus HIV-1 infects host cells through its viral envelope glycoprotein (gp120), displayed as a host cell-derived lipid bilayer and virus-encoded envelope glycoprotein. HIV-1 infects host cells by binding to the CD4 molecule through the envelope glycoprotein (gp120). Binding results in conformational changes in gp120 and forms a discontinuous region for high-affinity interaction with the chemokine receptor (Levy, 1993). Chemokine receptor binding triggers additional conformational changes in gp120 that eventually lead to the fusion of viral and cellular membranes. For effective binding and efficient entry into the host cell, HIV-1 needs CD4 and a co-receptor, such as CCR5 or CXCR4 (Farzan et al., 1997; Pleskoff et al., 1997; Sattentau & Moore, 1991; Stein & Engleman, 1991; Trkola et al., 1996a; Wu et al., 1996), and the chemokine receptor-binding sites are induced by CD4 interaction (Wu et al. 1996). DMBT1 behaves as a secreted or membrane-linked protein (Sasaki et al., 2002; Sasaki et al., 2003) and binds HIV-1 through the envelope glycoprotein (gp120). The DMBT1-binding site on gp120 appears to be a distinct inhibitory binding site, different from the CD4-binding site on gp120 (Wu et al., 2004, 2006). However, previous studies on the role of DMBT1 in HIV-1 infectivity have been contradictory. DMBT1 is expressed as a soluble protein in human saliva, where it binds the HIV-1 gp120 protein through protein-protein interactions (Wu et al. 2004; Wu et al. 2006; Chu et al. 2013), and as a membrane-associated protein in cervical and vaginal epithelial cells. DMBT1 facilitates HIV trans-infection and plays a role in sexual transmission (Stoddard et al., 2007). In vitro, it has been reported that DMBT1 binds to HIV-1 and inhibits human

239 immunodeficiency virus type 1 infection (Nagashunmugam et al., 1998). Monocyte derived macrophages also express DMBT1 and enhance the efficiency of HIV-1 infection by increasing the local concentration of infectious virus (Cannon et al., 2008). Previous studies identified a HIV-1 gp120 binding site on the SRCR1 domain of DMBT1 (Chu et al. 2013; Wu et al. 2006) and three different regions of SRCR domains directly interact with gp120 (Chu et al. 2013). The present study has characterized the SRCR regions by considering two CNVs (CNV1 and CNV2) and showed inter-population copy number variation. The hypothesis is that extra SRCR domains may facilitate improved binding to HIV-1 directly. At present, no study of DMBT1 copy number with HIV status is available. In this study, two cohorts of HIV patients from Ethiopia and Tanzania were analyzed for association of copy number at DMBT1 with viral load immediately prior to highly active antiretroviral therapy (HAART). We also tried to discover any effect of DMBT1 copy number on response to HAART. 9.2 Estimation of DMBT1 copy number in African HIV cohorts Estimation of CNV1 copy number in African HIV cohorts Diploid copy number was estimated using the mean unrounded PRT ratio of CNV1 in HIV samples. The histogram analysis of average PRT ratio showed good clusters and indicated a total of 4 clusters with mean value difference between the two respective clusters of almost 0.5. The histogram data indicated total a 4 of clusters in CNVtools analysis. Samples showing PRT ratios greater than 2.5 were considered outliers. The mean PRT ratios of CNV1 for all samples were transformed to have a standard deviation of 1 for improving the quality of clustering and integer copy numbers were inferred from transformed PRT data using a Gaussian mixture model in CNVtools. A mixture model of four components was fitted, based on clustering of the normalized PRT data (Figure 153). 
A clustering quality score (Q) was calculated to check the quality of the clusters. The first cluster indicated a diploid copy number of 3 and the remaining clusters were assigned diploid copy numbers of 4, 5 and 6, respectively.
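The clustering step described above can be illustrated with a small sketch. CNVtools is an R package, and the Python below is not its implementation: it is only a minimal one-dimensional Gaussian mixture fitted by expectation-maximisation, with a shared standard deviation and user-supplied starting means. The function names, the starting means, the toy ratios and the omission of the standardisation step are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_mixture(ratios: Sequence[float], start_means: Sequence[float],
                sd: float = 0.1, n_iter: int = 50,
                min_sd: float = 0.02) -> Tuple[List[float], List[List[float]]]:
    """Fit a 1-D Gaussian mixture (shared sd) by EM.

    Returns the fitted component means and, for each sample, the
    posterior probability of belonging to each component.
    """
    k = len(start_means)
    means = list(start_means)
    weights = [1.0 / k] * k
    post: List[List[float]] = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample.
        post = []
        for x in ratios:
            dens = [w * normal_pdf(x, m, sd) for w, m in zip(weights, means)]
            total = sum(dens)
            post.append([d / total for d in dens])
        # M-step: update means, weights and the shared standard deviation.
        for j in range(k):
            nj = sum(p[j] for p in post)
            if nj > 0:
                means[j] = sum(p[j] * x for p, x in zip(post, ratios)) / nj
            weights[j] = nj / len(ratios)
        var = sum(p[j] * (x - means[j]) ** 2
                  for p, x in zip(post, ratios) for j in range(k)) / len(ratios)
        sd = max(math.sqrt(var), min_sd)  # floor avoids a degenerate fit
    return means, post


def call_copy_number(post: Sequence[Sequence[float]],
                     copy_numbers: Sequence[int]) -> List[Tuple[int, float]]:
    """Assign each sample the most probable copy number and its posterior."""
    calls = []
    for p in post:
        best = max(range(len(p)), key=lambda j: p[j])
        calls.append((copy_numbers[best], p[best]))
    return calls


# Toy PRT ratios (hypothetical values, not data from the study).
toy_ratios = [1.48, 1.52, 1.50, 2.01, 1.98, 2.00, 2.03, 2.49, 2.52, 3.00, 2.97]
toy_means, toy_post = fit_mixture(toy_ratios, start_means=[1.5, 2.0, 2.5, 3.0])
toy_calls = call_copy_number(toy_post, copy_numbers=[3, 4, 5, 6])
```

Each entry of `toy_calls` pairs an integer copy number with its posterior probability, which is the quantity plotted against the unrounded ratio in the calling-quality figures.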

Figure 153: Output of the clustering procedure using the transformed PRT data of CNV1 for HIV samples. The coloured lines show the posterior probability for each of the four copy number classes (copy number = 3, 4, 5 or 6). The x-axis indicates the transformed PRT ratio of CNV1.

Posterior probabilities of the integer copy number call for each sample are plotted in Figure 154. The posterior probabilities for most of the samples were above 0.95, and for two samples they were above 0.80, indicating that copy number calling for CNV1 was very good for the HIV samples.

Figure 154: Analysis of integer copy number calling. Scatter plot and associated histograms show mean unrounded copy number values of CNV1 generated by PRT1 and PRT2, plotted against posterior probabilities of the integer copy number call for HIV samples.

Estimation of CNV2 copy number in African HIV cohorts

The mean PRT ratio for CNV2 ranged from 0 to 5.0, but most samples showed PRT ratios from 0.5 to 3.0. Samples with a mean PRT ratio around 0 or greater than 3.2 were treated as outliers and were excluded from the CNVtools analysis. The histogram of the average PRT ratio of CNV2 in HIV samples indicated six clusters (Figure 155), so six components were used to infer the integer copy number of CNV2 with CNVtools. The mean PRT ratios were transformed to have a standard deviation of 1, as recommended by CNVtools, and integer copy numbers were inferred from the transformed PRT data using a Gaussian mixture model. A mixture model of six components was fitted, based on clustering of the normalized PRT data (Figure 155). A clustering quality score (Q) was calculated to check the quality of the clusters. The first cluster indicated a diploid copy number of 1 and the remaining clusters were assigned diploid copy numbers of 2, 3, 4, 5 and 6, respectively.

Figure 155: Output of the clustering procedure using the transformed PRT data of CNV2 for HIV samples. The coloured lines show the posterior probability for each of the six copy number classes (copy number = 1, 2, 3, 4, 5 or 6). The x-axis indicates the transformed PRT ratio of CNV2.

Posterior probabilities of the integer copy number call of CNV2 for each sample are plotted in Figure 156. The posterior probabilities for most of the samples were above 0.99, and calls with posterior probabilities greater than 0.80 were considered good, indicating that copy number calling for CNV2 was very good for the HIV samples. Samples with posterior probabilities of less than 0.80 (almost 2%) were retyped before final copy number calling. Outlier samples were genotyped by visual inspection of the data, based on the mean PRT ratio and on the observation that each additional copy of CNV2 shifted the cluster mean by 0.5 relative to the previous cluster.
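The two quality-control rules above (retyping low-confidence calls and assigning outliers by cluster spacing) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the treatment of a posterior of exactly 0.80 as acceptable is a choice made here, and the position of the first cluster (`first_cluster_mean = 0.5` for one copy) is assumed from the reported range of ratios rather than stated in the text.

```python
def triage_call(posterior: float, threshold: float = 0.80) -> str:
    """Accept a copy number call or flag the sample for retyping."""
    return "accept" if posterior >= threshold else "retype"


def manual_copy_number(mean_ratio: float,
                       first_cluster_mean: float = 0.5,  # assumed, see lead-in
                       first_copy: int = 1,
                       step: float = 0.5) -> int:
    """Call an outlier sample by counting 0.5-wide steps from the first cluster."""
    steps = round((mean_ratio - first_cluster_mean) / step)
    return first_copy + steps
```

For example, under these assumptions a sample with a mean ratio of 3.6 would be called as 7 copies, and a call with a posterior of 0.75 would be sent for retyping.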

Figure 156: Analysis of integer copy number calling. Scatter plot and associated histograms show mean unrounded copy number values of CNV2 generated by PRT3 and PRT4, plotted against posterior probabilities of the integer copy number call for the HIV cohort.

Distribution of CNV1 and CNV2 copy number in African HIV cohorts

A total of 1002 individuals from two populations (Ethiopian and Tanzanian) were genotyped for CNV1 and CNV2 copy number. Both CNVs at DMBT1 were multiallelic, as in other African populations (HapMap YRI and Zambian samples). The diploid copy number of CNV1 (for classes found in more than 1% of samples) varied from 3 to 7, with a modal CNV1 copy number of 4 (67%) (Figure 157). The mean CNV1 copy number for all HIV samples was 4.25, although at the population level it was higher in Tanzanian samples (4.50) than in Ethiopian samples (4.10). The details of the CNV1 copy number distributions in HIV and HIV+TB co-infected samples in the two populations are shown in Table 72. The CNV1 copy number range for the two African HIV populations is higher than that of European populations (HapMap CEU and HGDP populations of

European origin). The Ethiopian population showed the greatest range (2-9 copies), although three copy number classes (2, 8 and 9) were found in only three Ethiopian samples, one for each class. The common copy number range for CNV1 was between 3 and 6 copies per diploid genome, and one Tanzanian sample showed a copy number of 7.

Table 72: Copy number frequencies of CNV1 at DMBT1 in African HIV cohort.

| Diploid copy number | Ethiopian HIV count (frequency) | Ethiopian HIV+TB count (frequency) | Tanzanian HIV count (frequency) | Tanzanian HIV+TB count (frequency) |
|---|---|---|---|---|
| 2 | 1 (<0.01) | 0 (0) | 0 (0) | 0 (0) |
| 3 | 25 (0.09) | 38 (0.11) | 11 (0.06) | 6 (0.04) |
| 4 | (0.73) | 282 (0.78) | 98 (0.50) | 75 (0.51) |
| 5 | 37 (0.13) | 31 (0.09) | 64 (0.33) | 54 (0.37) |
| 6 | 7 (0.02) | 8 (0.02) | 22 (0.11) | 12 (0.08) |
| 7 | 4 (0.01) | 2 (<0.01) | 1 (<0.01) | 0 (0) |
| 8 | 1 (<0.01) | 1 (<0.01) | 0 (0) | 0 (0) |
| 9 | 1 (<0.01) | 0 (0) | 0 (0) | 0 (0) |
| Total | | | | |
| Mean | | | | |

Figure 157: Distribution of diploid copy number of CNV1 and CNV2 at DMBT1. Bar graphs illustrating the distribution of copy number, determined in CNVtools by Gaussian mixture distributions, in the African HIV cohort using paralogue ratio test data.

The diploid copy number of CNV2 (for classes found in more than 1% of samples) varied from 1 to 8, with a modal CNV2 copy number of 4 (31%). The mean CNV2 copy number for all HIV samples was 3.70, although at the population level it was higher in Ethiopian samples (3.85) than in Tanzanian samples (3.40). The Ethiopian population showed a CNV2 copy number range between 1 and 10 copies per diploid genome, but in the Tanzanian samples

CNV2 copy number was distributed differently (0-7). The details of the CNV2 copy number distributions in HIV and HIV+TB co-infected samples in the two populations are shown in Table 73.

Table 73: Copy number frequencies of CNV2 at DMBT1 in HIV samples.

| Diploid copy number | Ethiopian HIV count (frequency) | Ethiopian HIV+TB count (frequency) | Tanzanian HIV count (frequency) | Tanzanian HIV+TB count (frequency) |
|---|---|---|---|---|
| 0 | 0 (0) | 0 (0) | 0 (0) | 1 (<0.01) |
| 1 | 1 (<0.01) | 5 (0.01) | 13 (0.07) | 4 (0.03) |
| 2 | 55 (0.20) | 64 (0.18) | 63 (0.32) | 51 (0.35) |
| 3 | 47 (0.17) | 71 (0.20) | 28 (0.14) | 22 (0.15) |
| 4 | (0.37) | 127 (0.35) | 46 (0.23) | 27 (0.18) |
| 5 | 37 (0.13) | 54 (0.15) | 32 (0.16) | 24 (0.16) |
| 6 | 26 (0.09) | 33 (0.10) | 12 (0.06) | 13 (0.09) |
| 7 | 5 (0.02) | 6 (0.02) | 2 (0.01) | 5 (0.03) |
| 8 | 2 (<0.01) | 1 (<0.01) | 0 (0) | 0 (0) |
| 9 | 2 (<0.01) | 0 (0) | 0 (0) | 0 (0) |
| 10 | 1 (<0.01) | 1 (<0.01) | 0 (0) | 0 (0) |
| Total | | | | |
| Mean | | | | |

The distribution patterns of diploid copy number for both CNVs differed between the two HIV populations. Lower CNV1 copy numbers (3 and 4) were more frequent in Ethiopian samples, and the opposite trend was found in Tanzanian samples. For CNV2, lower copy numbers (1 and 2) were more frequent in Tanzanian samples, whereas a CNV2 copy number of 4 was more frequent in Ethiopian samples. The different distributions of CNV1 and CNV2 also influenced total SRCR copy number patterns in the African HIV cohort. One unit of CNV1 represents four copies of SRCR domains, whereas one unit of CNV2 contributes one copy of SRCR in the diploid genome.
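The arithmetic in the last sentence can be written out as a short sketch. The weights of four and one are taken from the text; the function name is hypothetical, and `invariant_domains` is a placeholder for any SRCR domains lying outside the two CNVs, whose number is not stated in this section and is therefore left at zero by default.

```python
def srcr_domain_count(cnv1: int, cnv2: int, invariant_domains: int = 0) -> int:
    """Diploid SRCR domain count from CNV1 and CNV2 diploid copy numbers.

    Each CNV1 unit carries four SRCR domains and each CNV2 unit carries one.
    invariant_domains is a hypothetical offset for domains outside both CNVs.
    """
    if cnv1 < 0 or cnv2 < 0 or invariant_domains < 0:
        raise ValueError("copy numbers must be non-negative")
    return 4 * cnv1 + cnv2 + invariant_domains
```

Because CNV1 carries four times the weight of CNV2, a shift of one CNV1 copy changes the total as much as four CNV2 copies, which is why the CNV1 distribution dominates differences in total SRCR number between populations.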

Figure 158: Distribution of diploid copy number of SRCR at DMBT1. Bar graphs illustrating the distribution of total SRCR copy number (frequency >1%), determined using CNV1 and CNV2, in the African HIV cohort.

The total SRCR copy number in the diploid genome ranged from 16 to 45, although copy number classes with a frequency of at least 1% in at least one population ranged from 18 to 34 in African HIV samples (Figure 158). The mean SRCR copy number was higher in Tanzanian samples (25.39) than in Ethiopian samples (24.26). The higher frequency of high CNV1 copy numbers increased the total SRCR copy number in Tanzanian samples. The detailed distributions of total SRCR copy number in HIV and HIV+TB co-infected samples in the two populations are shown in Table 74.

Table 74: Copy number frequencies of total SRCR of DMBT1 in HIV cohorts.

| Total SRCR number | Ethiopian HIV count (frequency) | Ethiopian HIV+TB count (frequency) | Tanzanian HIV count (frequency) | Tanzanian HIV+TB count (frequency) |
|---|---|---|---|---|
| 16 | 1 (<0.01) | 0 (0) | 0 (0) | 0 (0) |
| 18 | 0 (0) | 4 (0.01) | 6 (0.03) | 5 (0.03) |
| 19 | (0.04) | 13 (0.04) | 3 (0.01) | 1 (<0.01) |
| 20 | 8 (0.03) | 10 (0.03) | 1 (<0.01) | 0 (0) |
| 21 | 5 (0.02) | 15 (0.04) | 7 (0.04) | 3 (0.02) |
| 22 | (0.13) | 47 (0.13) | 31 (0.16) | 27 (0.18) |
| 23 | (0.11) | 52 (0.14) | 21 (0.11) | 17 (0.12) |
| 24 | (0.29) | 100 (0.28) | 21 (0.11) | 16 (0.11) |
| 25 | (0.11) | 42 (0.12) | 17 (0.09) | 7 (0.05) |
| 26 | (0.13) | 37 | (0.11) | 20 (0.14) |
| 27 | 7 (0.03) | 11 (0.03) | 5 (0.02) | 4 (0.03) |
| 28 | (0.05) | 16 (0.04) | 18 (0.09) | 11 (0.08) |
| 29 | 4 (0.01) | 1 (<0.01) | 17 (0.09) | 14 (0.09) |
| 30 | 5 (0.02) | 6 (0.02) | 17 (0.09) | 12 (0.08) |
| 31 | 1 (<0.01) | 1 (<0.01) | 1 (<0.01) | 4 (0.03) |
| 32 | 4 (0.01) | 2 (<0.01) | 0 (0) | 1 (<0.01) |
| 33 | 0 (0) | 1 (<0.01) | 0 (0) | 4 (0.03) |
| 34 | 5 (0.02) | 2 (<0.01) | 0 (0) | 0 (0) |
| 35 | 1 (<0.01) | 0 (0) | 0 (0) | 1 (<0.01) |
| 36 | 0 (0) | 1 (<0.01) | 0 (0) | 0 (0) |
| 37 | 0 (0) | 0 (0) | 1 (<0.01) | 0 (0) |
| 38 | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| 39 | 1 (<0.01) | 0 (0) | 0 (0) | 0 (0) |
| 42 | 0 (0) | 1 (<0.01) | 0 (0) | 0 (0) |
| 45 | 1 (<0.01) | 0 (0) | 0 (0) | 0 (0) |

9.3 Association of copy number with clinical parameters in African HIV cohorts

A generalised linear model was used to investigate the effect of CNV1, CNV2 and total SRCR copy number on viral load immediately prior to HAART. Population of origin, TB co-infection status and CD4+ count immediately prior to HAART were included as cofactors. In the CNV1 model, viral load was significantly associated with population of origin, TB co-infection and CD4+ count, but not with CNV1 copy number. In the CNV2 model, viral load was significantly associated with population of origin, TB co-infection, CD4+ count and CNV2 copy number (Table 75). In the total SRCR model, no significant association was found with total SRCR copy number, although population of origin, TB co-infection and CD4+ count were again significantly associated.
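The structure of such a model can be sketched in a few lines. The sketch below is a simplification, not the analysis used in the study: it fits an ordinary least-squares model (the Gaussian, identity-link special case of a generalised linear model) with an intercept, three cofactors and copy number as predictors. The software, link function and any transformation of viral load are not specified in the text, so the function names, variable coding and toy data are all assumptions.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]],
                        b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(covariates: Sequence[Sequence[float]],
            response: Sequence[float]) -> List[float]:
    """Least-squares fit; returns [intercept, beta_1, ..., beta_k]."""
    design = [[1.0] + list(row) for row in covariates]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy rows: (population 0/1, TB co-infection 0/1, CD4+ count, copy number).
# Hypothetical values, not data from the study.
toy_rows = [(0, 0, 100, 2), (1, 0, 200, 3), (0, 1, 150, 4), (1, 1, 300, 5),
            (0, 0, 250, 6), (1, 0, 120, 1), (0, 1, 400, 3), (1, 1, 180, 2)]
toy_load = [5.0 - 1.0 * p - 0.5 * t - 0.005 * cd4 - 0.1 * cn
            for p, t, cd4, cn in toy_rows]
toy_beta = fit_ols(toy_rows, toy_load)
```

The last element of `toy_beta` plays the role of the copy number coefficient reported in Table 75: a negative value means that higher copy number goes with lower viral load once the cofactors are accounted for.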

Table 75: Tests of association of copy number with HIV load pre-HAART in African HIV cohorts.

| Model term | DMBT1-CNV1 β coefficient (95% CI) (copies/ml) | P value | DMBT1-CNV2 β coefficient (95% CI) (copies/ml) | P value | DMBT1-SRCR β coefficient (95% CI) (copies/ml) | P value |
|---|---|---|---|---|---|---|
| Population | (-1.144, ) | | (-1.057, ) | | (-1.158, ) | |
| TB co-infected | (-0.726, ) | | (-0.747, ) | | (-0.730, ) | |
| CD4+ count | (-0.006, ) | | (-0.006, ) | | (-0.005, ) | |
| Copy number | (-0.236, 0.190) | | ( , ) | | (-0.080, 0.016) | |

The effects of CNV1, CNV2 and SRCR copy number on immune reconstitution following HAART were also investigated using CD4+ count at 12, 24, 36 and 48 weeks after initiation of treatment. A multivariate linear mixed effects model was used to account for the repeated measurements from the same patients at different time points. Significant associations were found with time since initiation of treatment, population of origin and CD4+ level at the initiation of treatment, but there was no effect of any copy number at DMBT1 (Table 76).

Table 76: Tests of association of copy number with CD4 count during HAART in African HIV cohorts.

| Model term | DMBT1-CNV1 β coefficient (95% CI) (copies/ml) | P value | DMBT1-CNV2 β coefficient (95% CI) (copies/ml) | P value | DMBT1-SRCR β coefficient (95% CI) (copies/ml) | P value |
|---|---|---|---|---|---|---|
| Time after HAART initiation (weeks) | 2.78 (1.27, 4.30) | | (1.30, 4.31) | | (1.28, 4.30) | |
| Baseline CD4 | (0.36, 1.31) | | (0.35, 1.30) | | (0.37, 1.31) | |
| Population | (2.30, ) | | (-9.79, ) | | (3.21, 118.77) | |
| Copy number | (-55.67, 17.34) | | (-30.77, 4.32) | | (-14.29, 1.63) | |

Discussion

It has been shown that DMBT1 expressed on the cell surface of female genital tract epithelial cells (Stoddard et al., 2007, 2009) acts as a binding protein for the human immunodeficiency virus type 1 (HIV-1) envelope and maintains viral infectivity for several days (Chu et al., 2013; Wu et al., 2006).
Previous studies showed that multiple human vaginal and cervical epithelial cell lines expressed DMBT1 and promoted transcytosis of HIV from the apical to the basolateral surface (Cannon et al., 2008; Stoddard et al., 2007). DMBT1 expressed on macrophages binds HIV-1 Env with high affinity and enhances infection by increasing

local concentrations of virus at the cell surface (Cannon et al., 2008). DMBT1 expressed on macrophages promotes their infection by increasing fusion with the HIV-1 envelope, and inhibition of DMBT1 binding to HIV-1 Env by specific antibodies or by an inhibitory peptide derived from gp120 reduced infection of macrophages by up to 90% (Cannon et al., 2008). HIV-1 is believed to infect host cells after the viral envelope glycoprotein (gp120) binds the CD4 molecule (Levy, 1993). The interaction of CD4 with gp120 leads to conformational changes in gp120 and creates a discontinuous region for high-affinity interaction with the chemokine receptor (Sattentau & Moore, 1991; Trkola et al., 1996b; Wu et al., 1996). Chemokine receptor binding triggers additional conformational changes in gp120 and mediates fusion of viral and cellular membranes. For effective binding and cell entry, HIV-1 requires CD4 and a co-receptor, such as CCR5 or CXCR4. In individuals with higher SRCR copy numbers, DMBT1 might block gp120 of HIV-1 more effectively and so modulate viral infectivity.

This study determined the diploid copy number of two CNV regions (CNV1 and CNV2) of DMBT1 using PCR-based methods in 1002 African HIV patients. The two CNVs determine the total number of SRCR domains in the diploid human genome at DMBT1. Total SRCR copy number was calculated from CNV1 and CNV2 copy number and was also included in the present study. All three copy number measures were compared with extensive clinical data prior to HAART initiation, as well as during the initial stages of HAART treatment. A similar analysis model has been used previously for other copy number variable genes (Aklillu et al., 2013; Hardwick et al., 2012; Machado et al., 2013). The present study suggests that CNV2 copy number has a mild effect on viral load during acute viral infection, just prior to initiation of HAART (p = 0.023), but found no association with CNV1 or total SRCR copy number.
The direction of the effect was that higher CNV2 copy number was associated with lower viral load. Higher CNV2 copy number might block the chemokine co-receptor binding sequence of gp120, resulting in inhibition of HIV-1 infection. Previous studies by other groups proposed that soluble DMBT1 interacted with the viral envelope gp120 (Wu et al., 2006) and inhibited HIV-1 infection (Etsuko & Wei, 2001; Fox et al., 1989; Nagashunmugam et al., 1997, 1998). The DMBT1-gp120 interaction is mediated by a short sequence located at the N-terminal flank of the gp120 V3 loop (Wu et al., 2004a) that is critical for the chemokine co-receptor interaction. This interaction might block access of the gp120 sequence to the chemokine co-receptor, which would account for the inhibitory activity of DMBT1. The present study did not find any significant association of CNVs at DMBT1 with immune reconstitution following initiation of HAART. It might be argued that copy number variation at DMBT1 does not influence immunological reconstitution. Although high-quality copy number typing was used, problems remain, particularly in distinguishing the number of

SRCR domains effectively interacting with HIV-1. The present study did not consider inter-individual variability due to differential glycosylation, which might alter the interactions of DMBT1 with gp120. Thorough functional approaches are needed to study the number of gp120-interacting SRCR domains, which may contribute to inter-individual differences in the DMBT1 and HIV-1 interaction.

10 ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN LUNG DISEASE COHORTS

10.1 Introduction

Chronic obstructive pulmonary disease (COPD) is characterised by irreversible airway obstruction, is a leading cause of morbidity and mortality (Ram et al., 2011) and represents a substantial economic and social burden worldwide. The Global Burden of Disease Studies predicted that it would become the third commonest cause of death by 2020 irrespective of public-health intervention (Mathers & Loncar, 2006; Murray et al., 2001). Because of differences in the diagnosis of COPD, lung function as a quantitative trait is often used in genetic association studies in addition to, or instead of, COPD disease status (Myint et al., 2005; Schünemann et al., 2000; Strachan, 1992). Lung function is measured by the ratio of forced expired volume in 1 second (FEV1) to forced vital capacity (FVC). Recent genome-wide association studies (GWAS) of single nucleotide polymorphisms (SNPs) reported 26 genomic regions significantly associated with FEV1 and/or FEV1/FVC (Hancock et al., 2010; Repapi et al., 2010). Although the major risk factor is tobacco exposure, there is a substantial heritable component in COPD (Silverman et al., 1998). A previous study used lung function to examine the genetic determinants of COPD (Wain et al., 2014).

Asthma is a chronic inflammatory disease of the airways characterised by reversible airflow obstruction and bronchospasm. A combination of genetic and environmental factors might be responsible for asthma (Martinez, 2007). The prevalence of asthma has increased at an alarming rate and more than 300 million people worldwide are affected (Braman, 2006; Masoli et al., 2004). At least 10 genomic regions, including genes encoding proteins involved in the immune response, have been associated with asthma using GWAS (Himes et al., 2009; Moffatt et al., 2010), but only 4% of asthma heritability is explained by these variants (Cookson & Moffatt, 2011).
DMBT1 is highly expressed by mucosal epithelia and alveolar type II cells within the respiratory tract and plays an important role in lung immune defence (Mollenhauer et al., 2000, 2001; Mollenhauer et al., 2002; Müller et al., 2008). Various developmental and clinical factors, such as maturity, age and bacterial infection, were found to be associated with expression levels of respiratory DMBT1 (Müller et al., 2008), and DMBT1 levels increased significantly with lung maturity and neonatal infections. A high concentration of DMBT1 was also reported in the lungs of neonates with bacterial infections (Müller et al., 2008). Low levels of DMBT1 contribute to the high risk of infections with group B streptococci (GBS), E. coli and

Staph. aureus in the respiratory tract of neonates (Ligtenberg et al., 2010). GBS are the major pathogens producing neonatal pneumonia (Doran et al., 2002) and these bacteria are aggregated by DMBT1 (Bikker et al., 2004; Loimaranta et al., 2005). Higher respiratory DMBT1 levels in infants provided protection against pulmonary infection. The lung surfactant protein SP-D regulates host immune defence, modulates inflammatory responses, and also plays a significant role in colonisation and the onset of bacteraemia in the lung (Crouch, 2000; Hartshorn et al., 2007a; Jounblat et al., 2005). Respiratory DMBT1, in association with the membranes of alveolar macrophages, acts as an opsonin receptor for SP-D. DMBT1 binds to SP-D through a protein-protein interaction in a calcium-dependent manner and promotes phagocytosis by binding to microbial surface carbohydrates (Holmskov et al., 1999). DMBT1 is expressed in hyaline membranes during respiratory distress syndrome and increased lung surface tension in vitro (Müller et al., 2007). In the adult lung, DMBT1 is normally expressed at a low level, but previous studies demonstrated up-regulation upon respiratory inflammation (Mollenhauer et al., 2001). DMBT1 acts as a pattern recognition receptor (PRR) in innate immune defence. The protein binds and aggregates a range of pathogenic bacteria and viruses (Bikker et al., 2002, 2004; Prakobphol et al., 2000; Renner et al., 2007; Wu et al., 2004a), stimulates phagocytosis by alveolar macrophages and interacts with important components of the innate immune system (e.g. surfactant protein D, SP-A and secretory IgA) (Boackle et al., 1993; Holmskov et al., 1997, 1999; Ligtenberg et al., 2004). The copy number variable SRCR domain of DMBT1 binds to bacteria (Bikker et al., 2002, 2004) and viruses (Wu et al., 2004a). The region is also responsible for binding endogenous protein ligands (SP-A, SP-D, IgA) (Holmskov et al., 1999).
Most of the genetic determinants of lung function, COPD and asthma are yet to be found. The two CNVs of DMBT1 might be poorly tagged by the variants commonly studied in GWAS, and may explain some of the unexplained genetic variation in these clinically important traits. Because of the potential role of DMBT1 in innate immunity and terminal differentiation of epithelium, the study was designed on the hypothesis that copy number variation at DMBT1 may influence the quantitative measures of lung function in all individuals (FEV1, FVC and FEV1/FVC) and the risk of asthma and COPD in subsets of each cohort.

10.2 Estimation of DMBT1 copy number in lung disease cohorts

Estimation of DMBT1 copy number in Gedling cohort

Copy number estimation for CNV1 in Gedling cohort

The normalised raw PRT ratio of CNV1 showed four clusters, and a single sample with a higher PRT ratio was treated as an outlier. Based on the histogram of raw CNV1 values, four components were used to estimate integer copy number using CNVtools. The CNVtools analysis of the raw CNV1 values in the Gedling samples indicated well-separated clusters without any overlapping values, and the clustering quality score was very good (9.54). The quality of integer copy number calling was assessed using the posterior probabilities of all samples. Almost all samples showed high posterior probabilities for copy number calling, although three samples showed posterior probabilities of less than 0.80 (Figure 159). Those samples were retyped and copy numbers were assigned based on the new PRT ratio.

Figure 159: Analysis of integer copy number calling. Scatter plot and associated histograms show mean unrounded copy number values of CNV1 generated by PRT1 and PRT2, plotted against posterior probabilities of the integer copy number call for Gedling samples.

Copy number estimation for CNV2 in Gedling cohort

The PRT ratios from PRT3 and PRT4 were compared using a scatter plot to check the specificity of the assays for the CNV2 locus in the Gedling samples (Figure 160). The raw PRT ratios of PRT3 and PRT4 were positively correlated (r² = 0.94) and ranged from 1.0 to 4.5. The mean CNV2 value showed a total of seven well-separated clusters, and PRT ratios greater than 4.5 were considered outliers.

Figure 160: Scatter plot produced by raw ratios of the PRT3 and PRT4 assays, used to estimate diploid copy number of CNV2 in the Gedling samples.

The mean PRT ratios of CNV2 in the Gedling samples were transformed to have a standard deviation of 1 and integer copy numbers were estimated from the transformed PRT data using a Gaussian mixture model in CNVtools (Barnes et al., 2008). A mixture model of seven components was used, based on clustering of the normalized PRT data (Figure 161). A clustering quality score (Q) was calculated to check the quality of the clusters.

Figure 161: Analysis of integer copy number calling. Scatter plot and associated histogram show mean unrounded copy number values of CNV2 generated by PRT3 and PRT4, plotted against posterior probabilities of the integer copy number call for Gedling samples.

Estimation of DMBT1 copy number in Leicester Respiratory cohort (LRC)

Copy number estimation for CNV1 in LRC

The scatter plot of the raw PRT ratios of PRT1 and PRT2 indicated good correlation (r² = 0.88) without any overlapping clusters (Figure 162).
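The concordance check between two PRT assays can be sketched as follows. This is a minimal illustration, assuming that the reported r² is the squared Pearson correlation between the raw ratios of the two assays and that the per-sample value carried forward is their arithmetic mean; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson_r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return r * r


def mean_prt_ratio(assay_a: Sequence[float], assay_b: Sequence[float]) -> List[float]:
    """Per-sample mean of two PRT assay ratios."""
    return [(a + b) / 2.0 for a, b in zip(assay_a, assay_b)]
```

A high r² indicates that both assays report the same underlying copy number, so averaging them mainly reduces measurement noise before clustering.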

Figure 162: Scatter plot produced by raw ratios of the PRT1 and PRT2 assays, used to estimate diploid copy number of CNV1 in LRC samples.

The histogram of the mean PRT ratio of CNV1 in LRC indicated four clusters, so four components were used in the CNVtools analysis. Samples with an average PRT ratio greater than 2.0 were not included in the CNVtools analysis. The mean PRT ratios were transformed to have a standard deviation of 1 and integer copy numbers were called from the transformed PRT data using a Gaussian mixture model in CNVtools (Barnes et al., 2008). A clustering quality score (Q) was calculated for CNV1 in LRC. Posterior probabilities of the integer copy number call were plotted for all samples. The posterior probabilities for most of the samples were greater than 0.99 (Figure 163), indicating that copy number calling for CNV1 was very good for LRC, without any overlapping clusters.

Figure 163: Analysis of integer copy number calling. Scatter plot and associated histogram show mean unrounded copy number values of CNV1 generated by PRT1 and PRT2, plotted against posterior probabilities of the integer copy number call for LRC samples.

Copy number estimation for CNV2 in LRC

The scatter plot showed high correlation (r² = 0.93) between the raw PRT ratios of PRT3 and PRT4 and produced very good quality clusters (Figure 164). The distribution patterns of the raw PRT ratio for both assays were the same, and the majority of LRC samples showed raw ratios of 1.0 or above. The histogram of the mean CNV2 ratio indicated seven well-separated clusters for CNV2.

Figure 164: Scatter plot produced by raw ratios of the PRT3 and PRT4 assays, used to estimate diploid copy number of CNV2 in LRC samples.

A mixture model of seven components was used, based on the histogram of the mean PRT data of CNV2 in LRC, and copy numbers were estimated using a Gaussian mixture model in CNVtools (Barnes et al., 2008). The clustering quality score (Q) for CNV2 in LRC was 4.32. The quality of copy number calling for CNV2 was evaluated using the posterior probability of the integer copy number call for each sample. The posterior probabilities for most of the samples were above 0.95, and calls with posterior probabilities above 0.80 were also considered good. A total of 36 samples (6%) with posterior probabilities below 0.70 were either retyped or called manually using the duplicate PRT ratios (Figure 165).

Figure 165: Analysis of integer copy number calling. Scatter plot and associated histogram show mean unrounded copy number values of CNV2 generated by PRT3 and PRT4, plotted against posterior probabilities of the integer copy number call for LRC samples.

Distribution of CNV1 and CNV2 copy number in lung disease cohorts

The copy number of the multiallelic CNV1 of DMBT1 varied from 2 to 6 in the diploid human genome. The modal CNV1 copy number was four for both respiratory cohorts. The mean copy number of CNV1 in the Gedling samples was 3.81. The details of the CNV1 copy number distributions for the two cohorts are shown in Table 77.

Table 77: Copy number frequencies of CNV1 at DMBT1 in lung disease cohorts. Columns: CNV1 copy number, then count and frequency for the Gedling cohort and for the Leicester respiratory cohort; the final rows give the total and the mean.

The distribution of CNV2 copy number differed between the two respiratory cohorts. The lowest copy number class was the same for both cohorts, although the highest copy number class varied between them. The mean CNV2 copy number was higher in the Gedling samples (5.16) than in the LRC samples. The detailed distribution of CNV2 copy number for both cohorts is given in Table 78.

Table 78: Copy number frequencies of CNV2 at DMBT1 in lung disease cohorts. Columns: CNV2 copy number, then count and frequency for the Gedling cohort and for the Leicester respiratory cohort; the final rows give the total and the mean.

Association study in the Gedling and LRC cohorts

The possible association of the CNV1 and CNV2 raw PRT ratios of DMBT1 with lung function (FEV1, FVC and FEV1/FVC) was tested separately in the Gedling (Table 79) and LRC cohorts (Table 80). Association analysis was also performed using the integer copy number of CNV1 and CNV2 with all traits, but the results were very similar to those for the raw PRT values. Because the integer copy numbers of CNV1 and CNV2 determine the total SRCR domain copy number of DMBT1 in the diploid genome, association analysis of total SRCR copy number with all traits was also undertaken. There was no evidence for association of CNV1, CNV2 or total SRCR copy number with lung function in all individuals in either cohort, except for an association of CNV2 with FEV1/FVC in LRC. All analyses were adjusted for age, age², sex and height. The analyses were also performed without adjustment, but the results were very similar to the adjusted results and non-significant (results not shown). The adult Gedling samples were stratified by smoking status and association analyses were performed, but there was no evidence of association (Table 79).

Table 79: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with lung function in the Gedling cohort. Columns: β, SE and p for each of CNV1, CNV2 and SRCR domains. Rows: FEV1, FVC and FEV1/FVC, each for average and integer copy number; the FEV1/FVC rows are further split by smoking status (all, never, ever).

A significant association (p < 0.05) of CNV2 copy number with FEV1/FVC was observed within LRC (p = 0.003), and results were similar for the raw PRT ratio and the integer copy number of CNV2. The association was negative: increasing copy number was associated with decreasing FEV1/FVC, both for the unadjusted raw values and for the adjusted, inverse normally transformed values (Figure 166).

Table 80: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with lung function in LRC. Significant results are marked with an asterisk. Columns: β, SE and P for each of CNV1, CNV2 and SRCR domains. Rows: FEV1, FVC and FEV1/FVC, each for average and integer CNV; both FEV1/FVC rows carry an asterisk.

Figure 166: Scatter plots of unadjusted raw (A and B) and adjusted (for age, age², sex and height) inverse normally transformed (C and D) FEV1/FVC against average raw (A and C) and integer (B and D) CNV2 copy number of DMBT1 in LRC.

The association of the CNV1 and CNV2 raw PRT ratios of DMBT1 with COPD was analysed in the Gedling cohort (Table 81). The association analysis was also performed using the integer copy number of CNV1 and CNV2 and total SRCR copy number. The study did not observe any evidence of association of DMBT1 copy number (CNV1, CNV2 and SRCR) with COPD in the Gedling population (Figure 167).

Table 81: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with COPD in the Gedling cohort. Columns: N (cases:controls), OR, CI and p. Rows: CNV1 (average and integer copy number), CNV2 (average and integer copy number) and SRCR domains (integer copy number); each row is based on 47 cases.
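The inverse normal transformation applied to FEV1/FVC in Figure 166 can be sketched as a rank-based transform. The exact variant used is not stated in the text, so this is an assumption: ranks (with ties averaged) are converted to quantiles with an offset `c` and mapped through the standard normal inverse cumulative distribution function. The function name and the default offset of 0.5 are choices made for illustration.

```python
from statistics import NormalDist
from typing import List, Sequence


def inverse_normal_transform(values: Sequence[float], c: float = 0.5) -> List[float]:
    """Rank-based inverse normal transform.

    Each value is replaced by the standard normal quantile of
    (rank - c) / (n - 2c + 1); tied values share their average rank.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]
```

The transform keeps the ordering of individuals but forces the trait onto a normal scale, which makes regression coefficients less sensitive to skewed or bounded measures such as a ratio.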

Figure 167: Cumulative frequency distribution of the average raw PRT ratio of DMBT1 CNV1 (left) and DMBT1 CNV2 (right) in Gedling COPD cases and controls. The x-axis indicates the average raw PRT copy number of CNV1 (left) and CNV2 (right).

The association of the CNV1 and CNV2 raw PRT ratios of DMBT1 with asthma was tested separately in the Gedling (Table 82) and LRC cohorts (Table 83). Association analysis was also performed using the integer copy number of CNV1 and CNV2 and total SRCR copy number. There was no evidence for association of CNV1, CNV2 or total SRCR copy number with asthma in either the Gedling cohort (Figure 168) or LRC (Figure 169). All analyses were adjusted for age, age², sex and height; they were also performed without adjustment, but the results were very similar to the adjusted results and non-significant (results not shown).

Figure 168: Cumulative frequency distribution of the average raw PRT ratio of DMBT1 CNV1 (left) and DMBT1 CNV2 (right) in doctor-diagnosed asthma cases and controls from the Gedling cohort. The x-axis indicates the average raw PRT copy number of CNV1 (left) and CNV2 (right).

Table 82: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with asthma (doctor diagnosed) in Gedling cohort. Columns: N (cases:controls), OR, CI and p. Rows: CNV1 (average and integer copy number), CNV2 (average and integer copy number) and SRCR domains (integer copy number); each row is based on 50 cases.

Figure 169: Cumulative frequency distribution of average raw PRT ratio of DMBT1 CNV1 (left) and DMBT1 CNV2 (right) in doctor diagnosed asthma cases and controls from LRC. The x-axis indicates average raw PRT copy number of CNV1 (left) and CNV2 (right).

Table 83: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with asthma-ics in LRC. Columns: OR, CI and P. Rows: CNV1 (average copy number and maximum likelihood integer copy number), CNV2 (average copy number and maximum likelihood integer copy number) and SRCR domains (maximum likelihood integer copy number).

10.5 Discussion

The present study estimated diploid copy number in two different European cohorts using PRT assays. The distributions of diploid copy number in the two cohorts were very similar for the CNV1 locus, whereas the distributions for CNV2 differed slightly. The study was designed to test association with lung function as a quantitative measurement (FEV1 and FEV1/FVC) in all individuals in the Gedling and LRC cohorts. The association with COPD and asthma, using appropriate cases and controls in Gedling and LRC, was also investigated.

DMBT1 is highly expressed in the lung and plays an important role in lung immune response and defence. The copy number variable SRCR domains play an important role in pathogen binding (bacteria and viruses) and in the regulation of inflammation. The total number of SRCR domains depends on CNV1 and CNV2 copy number. Both CNVs are multiallelic and show extensive inter-individual variation. No previous study has reported a role of DMBT1 copy number in lung function, COPD or asthma.

The present study did not show any strong evidence of association of DMBT1 copy number with either lung function or lung traits. A significant association (p=0.003 for both raw and integer CNV2 copy number) of CNV2 copy number with FEV1/FVC was observed in the LRC, but there was no significant association of CNV2 copy number with FEV1/FVC in the Gedling cohort (p= and for raw and integer CNV2 copy number). The ratio of FEV1 to FVC is commonly used as a diagnostic criterion for COPD. That the association of CNV2 copy number with FEV1/FVC in LRC was not replicated in the Gedling cohort might be due to differences in lung function measures between the two cohorts. The study thus presented no evidence of association of DMBT1 copy number with risk of COPD. The small numbers of cases (47) and controls (194) within the Gedling COPD cohort would be expected to limit the statistical power to detect association with COPD risk.
No association of DMBT1 copy number with asthma was found in either the Gedling or the LRC cohort. As for the Gedling COPD cohort, the limited number of cases in the two cohorts might also limit the statistical power to detect association with asthma. In conclusion, the present study found no evidence that copy number of CNV1, CNV2 or total SRCR domains at DMBT1 affects susceptibility to COPD or asthma.

11 ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN VUR AND UTI SAMPLES

11.1 Introduction

Urinary tract infection (UTI) is one of the most common bacterial infections in children less than 6 years old (Beetz, 2006). The incidence of UTI is higher in girls (3-7 %) than in boys (1-2 %) in children less than 6 years old with total number of incidence of UTI between and of the annual US birth cohort by the age of 6 years (Elder et al., 1997). UTIs cause permanent renal parenchymal damage and scarring in children with vesicoureteral reflux (VUR) (Koff et al., 1998; Weiss et al., 1992). VUR is characterised by abnormal retrograde flow of urine from the bladder to the kidneys and is found in 30% to 40% of children with a UTI (Chesney et al., 2008). The socioeconomic burden of VUR is large and has increased rapidly over the past two decades. VUR is normally associated with UTIs, acute pyelonephritis and subsequent renal scarring. Almost 40% of children with UTIs show VUR and approximately 50% of children with VUR develop pyelonephritis and renal scarring (Elder et al., 1997).

The dual function of DMBT1 in mucosal defence and epithelial differentiation was first proposed by Mollenhauer and co-workers (Mollenhauer et al., 2001). The function of DMBT1 in epithelial differentiation was established through studies of its orthologues, rabbit hensin and mouse CRP-ductin (Al-Awqati et al., 2003; Takito et al., 1999). Hensin was found to be implicated in the change of polarity and terminal differentiation of intercalated epithelial cells in the kidney (Al-Awqati et al., 2011; Al-Awqati et al., 2003). The renal collecting tubules are involved in controlling metabolic acidosis (i.e. pH of the blood), and acid-base transport is mediated by two canonical cell types: the β-intercalated cell secretes bicarbonate (HCO3−), whereas acid (H+) secretion is mediated by the α-intercalated cell.
DMBT1 is expressed as either a secreted or a membrane-linked protein (Sasaki et al., 2002); it is secreted as a soluble monomer and becomes active after polymerisation and deposition in the extracellular matrix (ECM) (Al-Awqati, 2011). Metabolic acidosis converts β-intercalated cells to α-intercalated cells, and consequently the collecting tubule converts from a state of HCO3− secretion to H+ secretion. In vitro studies showed that, in an acid medium, β-intercalated cells are remodelled to α-intercalated cells, and that the effect is mediated by DMBT1 (Schwartz et al., 2002). The deletion of DMBT1 from intercalated cells results in the development of complete distal renal tubular acidosis (dRTA) due to the absence of typical α-intercalated cells (Gao et al., 2010). The polymerisation of hensin regulates the switch from β-intercalated cells to α-intercalated cells and leads to acidification of the urine. The acidic environment of the cortical collecting duct of the kidney may contribute to a last line of defence of the kidney against infection. Further down, in the ureter and bladder, acidic urine also helps prevent recurrent UTIs.

DMBT1 is highly expressed in kidney intercalated cells and polymerises to multimeric hensin by binding of the SRCR domains of hensin to integrins on the cell surface (Al-Awqati, 2011). DMBT1 is a highly glycosylated protein, and its glycosylation includes the addition of terminal fucose mediated by the FUT2 gene. FUT2 also determines the secretor status of individuals. The blood group antigens and Lewis antigens were found to be present on the DMBT1 protein and to be involved in bacterial aggregation mediated by DMBT1 (Ligtenberg et al., 2000).

The present study was designed with the hypothesis that more SRCR domains promote polymerisation to integrins, facilitating the conversion of β-intercalated to α-intercalated cells and so promoting acid urine. Extra SRCR domains may also improve direct binding to bacteria in recurrent UTI in VUR patients. The study will result in novel information regarding the role of copy number variation at DMBT1 in the development of VUR.

Estimation of DMBT1 copy number in VUR and UTI samples

Estimation of CNV1 copy number in VUR and UTI samples

Histogram analysis of the averaged PRT ratio showed well-separated clusters: five in total, with a difference of 0.5 between the mean values of adjacent clusters. The histogram therefore indicated five clusters for the CNVtools analysis (Figure 170).

Figure 170: Histogram of mean unrounded normalized PRT ratio of CNV1 for VUR, UTI and control samples. The x-axis shows the raw PRT ratio of CNV1.

The mean CNV1 ratios from the two PRT assays were transformed to have a standard deviation of 1, as recommended by CNVtools, and integer copy numbers were inferred from the transformed PRT data using a Gaussian mixture model. A mixture model of five components was fitted, based on clustering of the normalized PRT data (Figure 171). The high clustering quality score (Q = 9.16) indicated that CNVtools called integer copy number for CNV1 with no overlap between clusters of mean CNV1 values. The first cluster indicated a diploid copy number of 2 and the remaining clusters were assigned diploid copy numbers 3, 4, 5 and 6, respectively.

Figure 171: Output of the clustering procedure using the PRT transformed data of CNV1 for the VUR and UTI cohort. The coloured lines show the posterior probability for each of the five copy number classes (copy number = 2; 3; 4; 5 or 6).

Posterior probabilities of the integer copy number call of CNV1 for each sample from the VUR and control study are plotted in Figure 172. The posterior probabilities for most of the samples were more than Two samples showed posterior probabilities of less than 0.80; those samples were called as integer copy number 4 and validated using long range PCR.
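The posterior probabilities used to judge each integer copy number call can be illustrated with a minimal sketch. It is not the CNVtools implementation (an R package) and does not fit the mixture; it only evaluates Bayes' rule for a mixture whose parameters are already known. The function names, the shared standard deviation and the mixing weights are assumptions made for illustration. The cluster means assume that the five clusters lie at PRT ratios of 1.0 to 3.0, consistent with the 0.5 spacing reported above but not stated explicitly in the text.

```python
import math

# Hypothetical mixture parameters (assumptions, not fitted values from the study).
CLUSTER_MEANS = {2: 1.0, 3: 1.5, 4: 2.0, 5: 2.5, 6: 3.0}   # copy number -> mean PRT ratio
CLUSTER_SD = 0.05                                          # assumed shared standard deviation
CLUSTER_WEIGHTS = {2: 0.01, 3: 0.18, 4: 0.76, 5: 0.04, 6: 0.01}  # assumed mixing proportions


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_copy_number(ratio, means, sd, weights):
    """Return {copy number: posterior probability} for one PRT ratio."""
    joint = {cn: weights[cn] * normal_pdf(ratio, means[cn], sd) for cn in means}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("ratio is too far from every cluster to be classified")
    return {cn: value / total for cn, value in joint.items()}


def call_copy_number(ratio, means, sd, weights, threshold=0.80):
    """Return (best copy number, its posterior, whether it reaches the threshold)."""
    posterior = posterior_copy_number(ratio, means, sd, weights)
    best = max(posterior, key=posterior.get)
    return best, posterior[best], posterior[best] >= threshold
```

Under these assumed parameters, a ratio of 2.02 is called as copy number 4 with a posterior close to 1, whereas a ratio of 2.27, lying between two clusters, receives a posterior of roughly 0.74 and would fall below a 0.80 threshold, the kind of call that was checked by long range PCR in the study.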

Figure 172: Analysis of integer copy number calling. Scatter plot and associated histogram show mean unrounded copy number values of CNV1 generated by PRT1 and PRT2, plotted against posterior probabilities of the integer copy number call for the VUR and UTI cohort.

Estimation of CNV2 copy number in VUR and UTI samples

The PRT ratios from PRT3 and PRT4 were positively correlated (r² = 0.91), without any overlap at lower PRT ratios (Figure 173). The distribution patterns of raw PRT ratio were the same for both assays, ranging from 0.5 to 4.5. The mean CNV2 values showed nine clusters, and PRT ratios of more than 4.5 were considered outliers.
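The agreement between two PRT assays is summarised above by the squared Pearson correlation coefficient. A minimal sketch of that calculation follows; the function name is an assumption made for illustration, and the paired ratios a reader would pass in are not taken from the thesis.

```python
def pearson_r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists of ratios."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined when one assay shows no variation")
    return (sxy * sxy) / (sxx * syy)
```

Passing the per-sample ratios of one assay as `xs` and those of the other as `ys` gives a value between 0 and 1; values near 1 indicate that the two assays rank and space the samples almost identically.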

Figure 173: Histogram of mean normalized PRT ratio of CNV2 for VUR, UTI and control samples. The x-axis indicates the mean PRT ratio of CNV2.

The mean PRT ratios were transformed to have a standard deviation of 1, as recommended by CNVtools, and integer copy numbers were estimated from the transformed PRT data using a Gaussian mixture model. A mixture model of nine components was fitted, based on clustering of the normalized PRT data (Figure 174). The quality scores were measured to check the quality of the clusters and the resulting clustering quality score (Q) was The first cluster indicates a diploid copy number of 1 and the remaining clusters were assigned diploid copy numbers 2, 3, 4, 5, 6, 7, 8 and 9, respectively.

Figure 174: Output of the clustering procedure using the PRT transformed data of CNV2 for VUR and UTI samples. The coloured lines show the posterior probability for each of the nine copy number classes (copy number = 1; 2; 3; 4; 5; 6; 7; 8 or 9).

The quality of CNV2 copy number calling was evaluated using the posterior probabilities of the integer copy number call of CNV2 for all samples from the RIVUR study (Figure 175). The posterior probabilities for most of the samples were greater than 0.96, and five samples showed lower posterior probabilities that were still greater than 0.80, indicating that the copy number calling for CNV2 was very good for the VUR samples. The outlier samples showed higher CNV2 ratios, and their integer diploid copy numbers were called manually based on the mean CNV2 ratio of those samples.

Figure 175: Analysis of integer copy number calling. Scatter plot and associated histograms show mean unrounded copy number values of CNV2 generated by PRT3 and PRT4, plotted against posterior probabilities of the integer copy number call for UTI and VUR samples.

Distribution of CNV1 and CNV2 copy number in VUR and UTI samples

CNV1 at DMBT1 is a multiallelic copy number variation, as also reported in other cohorts. The CNV1 diploid copy number varied from 2 to 6, with a modal copy number of 4 (Table 84). The mean copy number of CNV1 in control, VUR, UTI and VUR+UTI samples was 3.85, 3.80, 3.88 and 3.82, respectively.

Table 84: Copy number frequencies of CNV1 at DMBT1 in VUR samples.

Diploid copy number   Control count (frequency)   VUR count (frequency)   Non-VUR UTI count (frequency)   VUR+UTI count (frequency)
2                     4 (0.01)                    3 (0.01)                1 (0.01)                        4 (0.01)
3                     53 (0.17)                   61 (0.19)               19 (0.20)                       80 (0.19)
4                     (0.77)                      245 (0.78)              69 (0.72)                       314 (0.76)
5                     12 (0.04)                   7 (0.02)                5 (0.05)                        12 (0.03)
6                     2 (<0.01)                   0 (0)                   2 (0.02)                        2 (<0.01)
Total
Mean

The CNV2 locus was also a multiallelic copy number variation; the CNV2 diploid copy number varied from 2 to 12 without any single modal copy number. Three CNV2 copy number classes were very common in all sample cohorts. The mean CNV2 copy number for the UTI samples (5.33) was higher than for the other sample cohorts. The details of the CNV2 copy number distributions in children from the RIVUR study are shown in Table 85.

Table 85: Copy number frequencies of CNV2 at DMBT1 in the VUR and UTI cohort.

Diploid copy number   Control count (frequency)   RIVUR count (frequency)   UTI count (frequency)   RIVUR+UTI count (frequency)
2                     9 (0.03)                    15 (0.05)                 3 (0.04)                18 (0.04)
3                     27 (0.10)                   23 (0.07)                 7 (0.08)                30 (0.08)
4                     65 (0.23)                   69 (0.22)                 14 (0.17)               83 (0.21)
5                     78 (0.27)                   80 (0.25)                 21 (0.25)               101 (0.25)
6                     67 (0.24)                   78 (0.25)                 21 (0.25)               99 (0.25)
7                     20 (0.07)                   31 (0.10)                 11 (0.13)               42 (0.11)
8                     12 (0.04)                   15 (0.05)                 6 (0.07)                21 (0.05)
9                     6 (0.02)                    4 (0.01)                  1 (0.01)                5 (0.01)
10                    0 (0)                       0 (0)                     0 (0)                   0 (0)
11                    0 (0)                       1 (<0.01)                 0 (0)                   1 (<0.01)
12                    1 (<0.01)                   0 (0)                     0 (0)                   0 (0)
Total
Mean

Secretor status in VUR and UTI samples

The DMBT1 protein shows extensive glycosylation mediated by the enzyme alpha-(1,2) fucosyltransferase, encoded by the FUT2 gene. FUT2 regulates the expression or lack of expression of ABH and Lewis blood group antigens on secreted molecules and determines the secretion status of the ABO antigens (Hazra et al., 2008). The variant sizes and glycoforms of gp-340 have been shown to modulate differential aggregation of bacteria and viruses (Eriksson et al., 2007).

Figure 176: Agarose gel electrophoresis to determine secretor status using PCR-RFLP in the VUR and UTI cohort. The sizes of the DNA fragments of the DNA ladder (Hyperladder™ 25bp, Bioline) are indicated at the right side and the allele-specific PCR-RFLP bands are indicated at the left side of the gel. The genotypes of the VUR samples are indicated on top of the gel.
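The mean copy number quoted for each cohort is a count-weighted average over the copy number classes. The sketch below recomputes it from the counts listed in Table 85; the dictionary layout and the function name are assumptions made for illustration, and classes with a count of zero are simply omitted.

```python
# Counts per diploid CNV2 copy number class, transcribed from Table 85.
CNV2_COUNTS = {
    "Control":   {2: 9, 3: 27, 4: 65, 5: 78, 6: 67, 7: 20, 8: 12, 9: 6, 12: 1},
    "RIVUR":     {2: 15, 3: 23, 4: 69, 5: 80, 6: 78, 7: 31, 8: 15, 9: 4, 11: 1},
    "UTI":       {2: 3, 3: 7, 4: 14, 5: 21, 6: 21, 7: 11, 8: 6, 9: 1},
    "RIVUR+UTI": {2: 18, 3: 30, 4: 83, 5: 101, 6: 99, 7: 42, 8: 21, 9: 5, 11: 1},
}


def mean_copy_number(counts):
    """Count-weighted mean copy number for one cohort."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cohort has no samples")
    return sum(copy_number * n for copy_number, n in counts.items()) / total


if __name__ == "__main__":
    for cohort, counts in CNV2_COUNTS.items():
        print(cohort, sum(counts.values()), round(mean_copy_number(counts), 2))
```

With these counts the UTI column gives 448/84, which rounds to the 5.33 quoted in the text, and it is the largest of the four cohort means.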




