ASSOCIATION BETWEEN CC CHEMOKINE LIGAND 3-LIKE-1 (CCL3L1) GENE COPY NUMBER AND RHEUMATOID ARTHRITIS IN AFRICAN AMERICANS MAWULI K.


ASSOCIATION BETWEEN CC CHEMOKINE LIGAND 3-LIKE-1 (CCL3L1) GENE COPY NUMBER AND RHEUMATOID ARTHRITIS IN AFRICAN AMERICANS

by

MAWULI K. NYAKU

SADEEP SHRESTHA, COMMITTEE CHAIR
BRAHIM AISSANI
JEFFREY C. EDBERG
SHAWN E. LEVY
HEMANT K. TIWARI

A DISSERTATION

Submitted to the graduate faculty of The University of Alabama at Birmingham, in partial fulfillment of the requirements for the degree of Doctor of Public Health

BIRMINGHAM, ALABAMA

2012

Copyright by Mawuli K. Nyaku 2012

ASSOCIATION BETWEEN CC CHEMOKINE LIGAND 3-LIKE-1 (CCL3L1) GENE COPY NUMBER AND RHEUMATOID ARTHRITIS IN AFRICAN AMERICANS

MAWULI K. NYAKU

DOCTOR OF PUBLIC HEALTH

ABSTRACT

Purpose: Gene copy number of the CC Chemokine ligand 3-Like-1 (CCL3L1), located on chromosome 17 position q12, varies between ethnicities. Previously, a CCL3L1 gene copy number greater than the ethnic median has been associated with an increased risk of rheumatoid arthritis (RA). Three later studies found no association between CCL3L1 and RA. All of these studies were conducted in non-African American populations, and an objective of this study was to determine this association in an African American cohort. CCL3L1 shares significant homology with three other genes within the same cluster: CC Chemokine ligand 3 (CCL3), CC Chemokine ligand 3-Like-2 (CCL3L2) and CC Chemokine ligand 3-Like-3 (CCL3L3). Primers and probes that have been used to determine CCL3L1 gene copy number have been specific to two or more of the genes within the cluster, potentially resulting in spurious association findings. The first step of this study was to characterize the nucleotide structure of CCL3L1 and CCL3L3, which share 100% homology in the exonic region and 98.7% in the whole gene region, and to develop a pyrosequencing assay that accurately quantifies CCL3L1 gene copy number.

Methods: The CCL3L1 and CCL3L3 genes were sequenced, and a pyrosequencing assay was developed that quantifies CCL3L1 and CCL3L3 gene copy number using the CCL3 gene, which has two copies per diploid genome, as the reference. Finally, CCL3L1 gene copy number was quantified for 747 African American RA cases and 375 frequency-matched African American controls, and the association between CCL3L1 gene copy number and RA was determined.

Results: Based on the characterization of CCL3L1 and CCL3L3, 12 novel single nucleotide polymorphisms (SNPs) and five known SNPs were identified. The median CCL3L1 gene copy number in the African American cohort was three, and a CCL3L1 gene copy number of <3 or >3 was not associated with the risk of RA.

Conclusions: The results of this study agree with three previous studies, although those studies were conducted in non-African American populations. The characterization of CCL3L1 and CCL3L3, however, provides a platform for accurately quantifying CCL3L1 gene copy number in future association studies.

Keywords: CC Chemokine ligand 3-Like-1 (CCL3L1), Rheumatoid Arthritis, African Americans, Pyrosequencing, Gene Copy Number
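The reference-based quantification described in the Methods can be illustrated with a small calculation. The sketch below is a simplification under stated assumptions, not the dissertation's assay: it assumes that, at a nucleotide position distinguishing CCL3 from CCL3L1/CCL3L3, the assay reports the percentage of signal coming from the CCL3 allele, and that CCL3 is present at exactly two copies per diploid genome. The function name and the example percentages are hypothetical.

```python
def estimate_copy_number(pct_reference: float, reference_copies: int = 2) -> float:
    """Estimate target gene copies from the reference allele percentage.

    If the reference gene contributes r templates and the target gene
    contributes n, the reference fraction is f = r / (r + n), which
    rearranges to n = r * (1 - f) / f.
    """
    if not 0 < pct_reference <= 100:
        raise ValueError("pct_reference must be in (0, 100]")
    fraction = pct_reference / 100.0
    return reference_copies * (1.0 - fraction) / fraction


# Hypothetical readings (not data from the study): a 40% reference
# signal corresponds to three target copies under these assumptions.
for pct in (100.0, 50.0, 40.0):
    print(pct, round(estimate_copy_number(pct), 2))
```

In practice, measured percentages are noisy, so the estimate would normally be rounded or binned into integer copy-number classes before any association analysis.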

DEDICATION

I dedicate this dissertation to my Dad, my wife Margaret and my children Mya and Marie for the constant support, joy and encouragement I have experienced from them during this journey. Mlor Petee Mle Walewher Koon. Mawuay Oyare Muo

ACKNOWLEDGEMENTS

I am very grateful to the following individuals who were extremely helpful in guiding me through the successful completion of this project:

Dr Sadeep Shrestha, my mentor and advisor, for his support, encouragement and guidance during my graduate school career.

Dr Jeffrey C. Edberg and Travis Ptacek for their enormous help during the design of the pyrosequencing assay.

Dr Brahim Aissani and Dr Hemant K. Tiwari for their thoughtful inputs, comments and suggestions regarding this dissertation.

Dr Shawn E. Levy for the wealth of knowledge shared during the process of gene characterization.

Dr Michael Crowley and Erica Lin at the Heflin Genomics Core for their tireless help with running the pyrosequencing assays.

TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER

1 INTRODUCTION
  1.1 Specific Aims
  1.2 Public Health Significance

2 REVIEW OF LITERATURE
  2.1 Introduction
  2.2 Epidemiology of RA
  2.3 Classification and Diagnosis
  2.4 Predictors and Risk Factors of RA
  2.5 Clinical Management of RA
  2.6 Genomewide Association Studies of RA
  2.7 Genetic Studies of RA in African Americans
  2.8 Copy Number Variation
  2.9 Homology of CCL3L1 to Gene Cluster Localized to Chromosome 17 q12
  2.10 Functional Structure and Expression of CCL3L1
  2.11 Association between CCL3L1 Copy Number and RA
  2.12 Mechanism of Action between Chemokines and RA

3 CCL3L1 AND CCL3L3 GENE CHARACTERIZATION VIA LONG-RANGE CONSENSUS-PCR AMPLIFICATION
  3.1 Introduction
  3.2 Study Population
  3.3 Materials and Methods
    3.3.a Sequencing Strategy
    3.3.b Sequence Regions
    3.3.c Long-Range, Consensus-PCR
    3.3.d Large-Fragment Cloning
    3.3.e Template Preparation
    3.3.f DNA Sequencing
    3.3.g Sequencing Data Analysis
    3.3.h Grouping of Clone Sequences
  3.4 Results
  3.5 Discussion

4 CCL3L1 AND CCL3L3 GENE CHARACTERIZATION VIA SHORT-RANGE AMPLICON SEQUENCING
  4.1 Introduction
  4.2 Study Population
  4.3 Materials and Methods
    4.3.a Sequencing Strategy
    4.3.b Boost/Nest Primers
    4.3.c PCR Protocol
    4.3.d DNA Sequencing
  4.4 Results
  4.5 Discussion

5 PYROSEQUENCING
  5.1 Introduction
  5.2 Background
  5.3 Assay Design
  5.4 Primer Sequences
  5.5 PCR/Pyrosequencing Conditions
  5.6 Interpretation of Pyrosequencing Results

6 ASSOCIATION STUDY BETWEEN CCL3L1 GENE COPY NUMBER AND RA
  6.1 Introduction
  6.2 Study Population
  6.3 Materials and Methods
    6.3.a Study Design
    6.3.b Statistical Analysis
  6.4 Results
  6.5 Discussion

7 SUMMARY AND CONCLUSION
  7.1 Introduction
  7.2 Quantification of CCL3L1 Gene Copy Number
  7.3 Strengths
  7.4 Limitations
  7.5 Applications and Implications of Results
  7.6 Future Directions and Studies

LIST OF REFERENCES

APPENDIX
  A IRB APPROVAL FOR RESEARCH PROJECT
  B ALIGNMENT OF CCL3, CCL3L1, CCL3L2, AND CCL3L3
  C BOOST AND NEST SEQUENCES
  D SUMMARY OF VARIANTS FOUND DURING CCL3L1/CCL3L3 SEQUENCING
  E CHARACTERISTICS OF CCL3L1/CCL3L3 CODING SNPs MAPPED TO HUMAN REFERENCE SEQUENCE HG19

LIST OF TABLES

Table

1 Strongest SNP-risk alleles in genomewide association studies involving RA
2 (%) Identity matrix between the CCL3L-related genes in complete (5' UTR, exons, introns and 3' UTR) genomic sequence (bottom left) and exonic regions (top right)
3 Association studies involving CCL3L1 gene copy number and risk of RA
4 Genotype results based on BLAT analysis
5 Summary of genotypes in haplotypes matching CCL3
6 Nest primers used in the PCR reaction to characterize CCL3L1 and CCL3L3
7 Boost primers used in the PCR reaction to characterize CCL3L1 and CCL3L3
8 Cumulative genotype table of known SNPs discovered
9 Cumulative genotype table of novel SNPs discovered
10 Population frequency distribution of known SNPs discovered
11 Enrollment characteristics of African American rheumatoid arthritis cases and African American controls in the CLEAR* registry
12 Copy number distribution of CCL3L1 among African American cases and controls in the CLEAR* registry
13 CCL3L1 gene copy number and risk of rheumatoid arthritis (RA) in the CLEAR* registry
14 CCL3L1/CCL3L3 gene copy number as a continuous variable and risk of rheumatoid arthritis (RA) in the CLEAR* registry
15 Association between CCL3L1/CCL3L3 gene copy number as a continuous variable and rheumatoid factor positive, anti-CCP antibody positive and baseline radiographic erosions
16 Specificity of primer pairs used in association studies between CCL3L1 and RA

LIST OF FIGURES

Figure

1 Location of the CCL3 chemokine genes on chromosome 17 q12
2 Schematic of Boost/Nest sequencing approach. Primers are aligned to the CCL3L1 gene on chromosome 17, including a 250bp region at each flanking region.
3 Gene alignment depicting nucleotide position used in the primer design process for pyrosequencing assays
4 Vector/Graph coordinate of % CCL3 vs ALL on the Y axis and % CCL3L1/CCL3L3 vs ALL on the X and Z axes
5 Vector/Graph coordinate of % CCL3L1/CCL3L3+CCL3L2 on the Y axis and % CCL3L1/CCL3L3 vs ALL on the X and Z axes
6 Vector/Graph coordinate of % CCL3L1/CCL3L3 + CCL3L3/CCL3L1 vs CCL3 on the Y axis and % CCL3 vs ALL on the X and Z axes
7 Box plot indicating the distribution of CCL3L1 gene copy number. The horizontal line within the blue box indicates the median and the red dot indicates the mean
8 Distribution of CCL3L1/CCL3L3 gene copy number in rheumatoid arthritis cases and frequency-matched controls. Statistical significance for a difference in the distribution of the gene copy number of CCL3L1 in cases and controls was assessed by the χ²-test. d.f., degrees of freedom; P, significance value

LIST OF ABBREVIATIONS

ACR        American College of Rheumatology
AIMs       Ancestry-Informative Markers
Anti-CCP   Autoantibodies to Citrullinated Peptide
ATP        Adenosine-5'-triphosphate
CCL1       Chemokine (C-C motif) ligand 1
CCL3       CC Chemokine ligand 3
CCL3L1     CC Chemokine ligand 3-Like-1
CCL3L2     CC Chemokine ligand 3-Like-2
CCL3L3     CC Chemokine ligand 3-Like-3
CCR1       C-C chemokine receptor type 1
CCR3       C-C chemokine receptor type 3
CCR5       C-C chemokine receptor type 5
Chr 17q12  Chromosome 17 Position q12
CI         Confidence Interval
CLEAR      Consortium for the Longitudinal Evaluation of African Americans with Early Rheumatoid Arthritis
CNV        Copy Number Variation
CNP        Copy Number Polymorphism
CRP        C-reactive Protein
dbSNP      Single Nucleotide Polymorphism Database
DEXA       Dual-energy X-ray absorptiometry
DMARDS     Disease Modifying Antirheumatic Drugs
DNA        Deoxyribonucleic acid
EBV        Epstein-Barr Virus
ESR        Erythrocyte Sedimentation Rate
EULAR      European League Against Rheumatism
GWAS       Genomewide Association Study
HapMap     Haplotype Map
HAQ        Health Assessment Questionnaire
HIV-1      Human Immunodeficiency Virus-1
HLA        Human Leucocyte Antigen
IL4        Interleukin 4
MRI        Magnetic Resonance Imaging
NCBI       National Center for Biotechnology Information
NHGRI      National Human Genome Research Institute
NIAMS      National Institute of Arthritis and Musculoskeletal and Skin Diseases
NSAID      Non-steroidal Anti-inflammatory Drug
OR         Odds Ratio
PCR        Polymerase Chain Reaction
PRT        Paralogue Ratio Test
QV         Quality Value
RA         Rheumatoid Arthritis
RF         Rheumatoid Factor
RT-PCR     Real Time Polymerase Chain Reaction
SE         Shared Epitope
SLE        Systemic Lupus Erythematosus
SNP        Single Nucleotide Polymorphism
Tm         Annealing temperature
TNF        Tumor Necrosis Factor
UCSC       University of California, Santa Cruz
UKBS       United Kingdom Blood Services
UTR        Untranslated Region
WTCCC      Wellcome Trust Case-Control Consortium

CHAPTER 1

INTRODUCTION

The goal of this study was to determine the association between copy number variation (CNV) of the gene CC Chemokine ligand 3-like 1 (CCL3L1) and the risk of rheumatoid arthritis (RA) in an African American cohort. Despite advancements in the clinical management of RA, associated mortality rates continue to rise steadily, and RA patients have an approximately 50% risk of premature mortality, although associated co-morbidities may partly explain this trend. The underlying causes of RA are still not clearly understood; however, genetic factors are known to contribute significantly to the risk. Until recently, genetic studies of RA in African Americans were very rare, partly because of the underrepresentation of this population in both observational studies and randomized clinical trials. The establishment of the Consortium for the Longitudinal Evaluation of African-Americans with early RA (CLEAR) in 2000 has been very important in this regard, and several association studies, including genetic studies of RA, have since been conducted.

CNV of CCL3L1 has been associated with various autoimmune diseases such as systemic lupus erythematosus (SLE), Crohn's disease, psoriasis and RA; however, results from association studies have largely been inconsistent. A defining feature of RA is excessive production of ligands for the β-chemokine receptors CCR5, CCR1 and CCR3. CCL3L1 is a potent ligand for these receptors, and expression levels of CCL3L1 may influence RA susceptibility. CCL3L1, however, shares 100% homology with the exonic region of the gene CC Chemokine ligand 3-like 3 (CCL3L3) and 98.7% homology with its total genomic structure. In addition, a high level of homology, upwards of 70%, is shared between CCL3L1 and both CC Chemokine ligand 3-like-2 (CCL3L2) and CC Chemokine ligand 3 (CCL3). Because of the high homology shared between these genes, primers and probes used to target CCL3L1 in association studies have not been specific. Characterizing the genomic structures of CCL3L1 and CCL3L3 is thus critical to avoid spurious interpretation in association studies. Against this backdrop, the overall goal of this project was to develop a methodology that accurately quantifies CCL3L1 copy number and to apply this method to determine the association between CCL3L1 gene copy number and RA. The hypothesis and specific aims of this study are:

1.1 Specific Aims

1. To characterize the nucleotide structures of the chemokine genes CCL3L1 and CCL3L3 and develop a pyrosequencing methodology that effectively quantifies CCL3L1 gene copy number.

2. To investigate the association between copy number variation of CCL3L1 and RA in an African American cohort.

Hypothesis: Chemokine inflammatory responses have been shown to play a role in the pathogenesis of autoimmunity, and excessive or inappropriate production of chemokine (CC)-receptor ligands is a characteristic feature of RA. Thus, genetic variations altering chemokine gene expression levels may influence susceptibility to RA.

1.2 Public Health Significance

Genetic studies evaluating the risk of RA are perhaps more crucial now than ever, given that genetic factors may account for a significant share of the risk of RA. In addition, the increasing mortality rates in patients with RA are alarming, and fully elucidating the genetic factors that influence the risk of RA may be useful in predicting risk. Over the past decade there has been a gradual increase in genetic studies in African Americans with RA; however, the association between CCL3L1 copy number and RA has not been investigated in the African American population. The few studies that have tested the association between CCL3L1 copy number and the risk of RA used Caucasian sample populations. Since genetic risk factors for RA may differ between ethnicities, findings from this study may highlight some of these differences. In addition, this project will act as a stepping stone for future association studies between CCL3L1 and RA and other autoimmune diseases in the African American population.

CCL3L1 copy number has been associated with numerous disease outcomes including HIV-1, systemic lupus erythematosus, and Kawasaki disease. The non-specificity of primers and probes to CCL3L1 in current association studies may have contributed to the variability and inconsistency in findings. Characterization of the nucleotide structure of CCL3L1 will thus have far-reaching implications for accurately quantifying CCL3L1 copy number, not only for this project but also in future association studies involving CCL3L1 copy number and risk of disease.

CHAPTER 2

REVIEW OF LITERATURE

2.1 Introduction

Gene copy number of the CC Chemokine ligand 3-Like-1 (CCL3L1), located on chromosome 17 position q12, has been associated with various diseases including HIV-1, psoriasis, Kawasaki disease, and autoimmune diseases including rheumatoid arthritis. Association studies of CCL3L1 gene copy number and disease involving African Americans are rare. A major objective of this thesis was to investigate the association between the gene copy number of CCL3L1 and RA in an African American cohort. CCL3L1 also shares significant homology with three genes within the same cluster: CCL3, CCL3L2 and CCL3L3. Because of this homology, primers and probes targeting CCL3L1 have been non-specific. A second objective of this thesis is to characterize the nucleotide structures of CCL3L1 and CCL3L3, which share the highest homology. Based on this characterization, a third objective of this thesis is to develop a pyrosequencing assay that can be used to accurately quantify CCL3L1 gene copy number. Here, a review of the literature involving RA and gene copy number of CCL3L1, including other genes within the cluster, is provided.
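The homology figures quoted for these genes (for example, 100% in the exonic region and 98.7% across the whole gene for CCL3L1 and CCL3L3) are percent-identity values over an alignment. As a minimal sketch of how such a value is obtained, the function below counts matching columns in two pre-aligned sequences. The function name, the gap convention and the toy sequences are assumptions for illustration and are not CCL3L1 or CCL3L3 sequence.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned columns where both sequences carry the same base.

    Both inputs must already be aligned (equal length, '-' for gaps).
    Columns with a gap in either sequence are counted as mismatches here;
    other conventions exist and give slightly different values.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(
        1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b and a != "-"
    )
    return 100.0 * matches / len(seq_a)


# Toy alignment with one mismatch in ten columns (made-up sequence).
print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
```

A primer or probe is specific to one gene only if it overlaps a column where the genes differ, which is why near-100% identity makes CCL3L1-specific assay design difficult.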

2.2 Epidemiology of RA

Rheumatoid arthritis (RA) is a chronic autoimmune disease characterized by chronic inflammation, progressive deterioration of joint function, increased comorbidity and excess mortality [1, 2]. In observational studies, RA is portrayed as a serious long-term disease with dominant extra-articular features (including nodules, vasculitis, neuropathy, pericarditis, interstitial lung disease and eye involvement), limited treatment options and poor outcomes [3]. It is currently known that RA patients have about a 50% risk of premature mortality, with a 3 to 10 year decrease in life expectancy compared with the general population; however, these observations were made only in the Caucasian population [1]. The high premature mortality risk is troubling given that the underlying cause of mortality in RA patients is not fully understood. However, several reasons may explain it. First, the mortality rates for men and women with RA (2.5 and 2.4 per 100 person-years, respectively) have been fairly constant over recent decades, while over the same period mortality rates in the general population have fallen substantially, resulting in a widening mortality gap between individuals with RA and individuals in the general population [4]. Also, the mean age at RA diagnosis and of prevalent cases is increasing, mainly as a result of demographic changes. This in turn is accompanied by an increasing prevalence of comorbidities, in particular cardiovascular and respiratory disease. In addition, the higher prevalence of smoking, diabetes and physical inactivity in RA cases contributes to the burden of co-morbidities [5].

Studies in Caucasian populations indicate that RA is 3 times more frequent in females than in males [6]. Prevalence rises with age and is highest in women older than 65 [3]. Although RA has been described in different populations, Native Americans have the highest prevalence rate of 3.1% [7]. In Caucasian North American and European populations, prevalence rates range from 0.5% to 1% [7, 8], and incidence rates in developed countries range from 5 to 50 per 100 000 adults and increase with age [9, 10].

2.3 Classification and Diagnosis

RA, being a complex disease, does not have a single defining feature because of its wide spectrum of manifestations and outcomes, ranging from mild, limited disease to severe, debilitating disease. Phenotypically, RA can be subdivided based on the presence or absence of autoantibodies such as rheumatoid factor (RF) or autoantibodies to cyclic citrullinated peptide (anti-CCP). The American College of Rheumatology (ACR) 1987 classification criteria [11] had poor sensitivity and specificity for classifying patients with early RA [12]. This classification failed to identify individuals with very early RA who subsequently developed severe or debilitating RA [13]. These concerns led the ACR and the European League Against Rheumatism (EULAR) in 2010 to devise new criteria for classifying early arthritis that assess joint involvement, serology, acute-phase reactants and duration of symptoms [14].

Diagnosis of RA involves a combination of core measures, combined indices and imaging. Core measures assess joint inflammation, laboratory measures such as erythrocyte sedimentation rate and/or C-reactive protein, pain, global assessment, disability, fatigue and depression [15, 16]. The combined indices combine counts of swollen and tender joints across 28 joints (hands, arms and knees), the patient's global assessment and the erythrocyte sedimentation rate (ESR) to indicate the patient's current status [15]. Imaging is done by radiography of the hands and feet.

Currently reversible and irreversible structural changes can be assessed using ultrasound and MRI [17, 18]. 2.4 Predictors and Risk Factors for RA There are several environmental factors that have been associated with the risk of developing RA with varying degrees. These include alcohol intake, coffee intake, vitamin D status, oral contraceptive use, low socioeconomic status, ultraviolet light, climatic differences, early life environmental factors, smoking and silica dust exposure. While the supporting evidence for most risk factors is weak [19] there is substantial evidence for early life environmental factors, smoking and silica dust exposure. Maternal smoking ( 10 cigarettes/day) during pregnancy increases the odds of developing RA by about 3 times [20] and the odds for a high birth weight ( 4000-4540 g) range from 1.2-3.3 [21, 22]. Several epidemiologic studies have linked silica dust exposure to the risk of developing seropositive RA [23-25]. Increased risk and severity of RA has been associated with smoking [26-29]. In addition, the presence of environmental cigarette smoke and a personal history of smoking are associated with the probability of positivity for rheumatoid factor (RF) [27, 28, 30]. The effect of smoking seems to be strong, consistent and dose-dependent [31]. Several infectious organisms both viruses and bacteria including Epstein-Barr virus (EBV) [32], Parvovirus B19 [31], Streptococcus, Mycoplasma, Proteus and Escherichia coli [33] have been linked to a higher risk of RA. Although no single infectious agent appears to be predominant, roughly 10-20% of patients presenting with early RA have serological evidence of recent infection that could trigger RA [34]. Genetic factors are attributable to 50% of the risk of developing RA [35] 7

and heritability of RA is estimated to be between 53-65% [36]. The human leucocyte antigen (HLA), specifically HLA-DRB1 shared epitope is the strongest genetic link to RA and is associated with both disease severity and susceptibility [37]. In Caucasian patients, two copies are associated with a relative risk of developing RA of 3.8-6 [38]. Other loci have also been associated with RA in multiple genome-wide association studies (GWASs) [39, 40]. Predictors of RA also include clinical factors (the number of swollen and tender joints and the presence of extraarticular manifestations and erosions at baseline) [41], laboratory data (presence of rheumatoid factor [RF] and/or anti-cyclic citrullinated peptide antibodies [anti-ccp] [42], higher erythrocyte sedimentation rate [ESR] or C-reactive protein [CRP] levels), and other variables (educational level and score in the Health Assessment Questionnaire [HAQ]). 2.5 Clinical Management of RA Treatment goals for RA seek to attain a state of disease remission that has been defined as an absence of articular and extra-articular inflammation and disease activity with the hope of improved long-term effects such as decreased functional disability [43]. However, currently no gold standard exists regarding what specifically constitutes disease remission or indices that should be used to assess disease activity [44]. Traditionally, control of symptoms of RA was via the use of analgesics to reduce pain and non-steroidal anti-inflammatory drugs (NSAIDs) to lessen pain and stiffness. Although their mode of action is not completely understood, disease-modifying antirheumatic drugs (DMARDS) are currently extensively used to reduce joint swelling and pain, decrease acute-phase markers, limit progressive joint damage and improve 8

function. DMARDs now represent the standard of care for RA, as they are recommended by major rheumatologic societies and received by more than 90% of RA patients in rheumatic disease specialty practices [45-47]. In addition, US national care quality organizations have included DMARD treatment for RA as a performance standard [48]. Biological agents such as TNF inhibitors are also licensed for management of RA and are highly effective [49-53]. Bacterial, fungal and viral infections are, however, a concern with the use of biological agents [54]. Biological agents are usually combined with methotrexate or leflunomide to reduce antibody formation and to increase efficacy [55]. Glucocorticoids are also used in the management of RA, reducing synovitis in the short term and decreasing joint damage in the long term [56]. The development of biological agents for treating RA has not always translated into a significantly better quality of life. For instance, there is a group of patients whose disease remains active and progressive despite optimal use of traditional DMARDs. One explanation for this observation is that data focusing exclusively on patients with an onset of symptoms since the start of the biologic era are only starting to be published [5].

2.6 Genomewide Association Studies of RA

Genomewide association studies (GWAS) generally involve comparing two groups of individuals (usually hundreds or thousands), one group with the disease of interest and one disease-free group. Within both groups, hundreds of thousands of single-nucleotide polymorphisms (SNPs) are tested for disease association. The association involves scanning the genomic DNA of individuals with markers that detect genetic variations associated with a disease of interest. Genomewide association studies have

significantly increased since the completion of the Human Genome Project in 2003 and the International HapMap Project in 2005. Genomewide association studies targeting complex traits are more complicated than those targeting single-gene disorders. For instance, a genetic variant associated with a complex disease trait may in reality only cause disease in concert with a host of genetic and environmental factors, each contributing minimally to the net effect. Some of the genetic or environmental factors may be absolutely necessary for disease to occur. Genomewide association studies over the past few years have unearthed SNPs implicating hundreds of robustly replicated loci for common traits [57]. A drawback of genomewide association studies is that although they can determine locations within the genome that may be associated with disease risk, only a few of the SNPs that have been identified demonstrate a clear functional implication relevant to the mechanism of disease [58]. Genomewide association studies conducted to date targeting RA are listed in Table 1 [59-71], together with the strongest SNP-risk alleles found in each study. Although most of the studies had sample sizes large enough to detect small associations, they were conducted mostly in populations other than African Americans, which further underscores the importance of conducting association studies in this population.
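The case-control comparison described in section 2.6 can be sketched for a single SNP as a 2x2 allelic test. This is a minimal illustration, not the method of any study cited here: the allele counts are hypothetical, the function name `allelic_association` is an assumption, and the genome-wide threshold of 5e-8 is the usual convention rather than a value taken from this text.

```python
import math

def allelic_association(case_risk, case_other, ctrl_risk, ctrl_other):
    """Pearson chi-square (1 df) and odds ratio for a 2x2 table of allele counts."""
    a, b, c, d = case_risk, case_other, ctrl_risk, ctrl_other
    n = a + b + c + d
    # Chi-square statistic for a 2x2 table, without continuity correction
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio

# Hypothetical counts of risk and non-risk alleles in cases and controls
chi2, p, odds = allelic_association(600, 1400, 500, 1500)
print(f"chi2={chi2:.2f}, p={p:.2e}, OR={odds:.2f}")

# Because hundreds of thousands of SNPs are tested, a stringent threshold
# (conventionally p < 5e-8) is applied before an association is reported.
print("genome-wide significant:", p < 5e-8)
```

With these made-up counts the SNP would look nominally associated but would not survive the genome-wide threshold, which illustrates why large sample sizes are needed to detect small effects.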

Table 1 Strongest SNP-risk alleles in Genomewide association studies involving RA

Publication | Strongest SNP-risk Allele
Hu HJ et al. Exp Mol Med, 2011 | rs805297-A
Eleftherohorinou H et al. Hum Mol Genet, 2011 | rs9268853-C, rs9272219, rs1063635, rs1610677, rs743777-G
Terao C et al. Hum Mol Genet, 2011 | rs9296015, rs2075876-A, rs2240335
Freudenberg J et al. Arthritis Rheum, 2011 | rs7765379, rs2062583, rs1600249, rs12831974
Zhernakova A et al. PLoS Genet, 2011 | rs653178-C, rs10892279, rs1893217-G, rs864537, rs1953126-T, rs2298428-T, rs7574865-T, rs11203203-A, rs7579944, rs1876518, rs975730, rs11984075-G, rs1020388, rs1772408, rs10876993
Freudenberg J et al. Arthritis Rheum, 2011 | rs7765379, rs2240335, rs2062583, rs1600249, rs12831974
Kochi Y et al. Nat Genet, 2010 | rs13192471-G, rs3093024-T, rs7574865-T, rs2230926-C
Stahl EA et al. Nat Genet, 2010 | rs6910071-G, rs2476601-A, rs874040-C, rs11676922-T, rs6920220-A, rs706778-T, rs6859219-C, rs3093023-A, rs10488631-C, rs951005-A, rs934734-G, rs4810485-T, rs3087243-G, rs26232-C, rs13315591-C, rs7155603-G, rs3761847-G, rs7574865-T, rs12131057-G, rs13119723-A, rs13031237-T, rs2872507-A, rs4750316-G, rs17374222-A, rs840016-C, rs10865035-A, rs3890745-T, rs11203203-A, rs3184504-T
Gregersen PK et al. Nat Genet, 2009 | rs2476601, rs13017599-A, rs231735-T, rs2736340-A, rs881375
Raychaudhuri S et al. Nat Genet, 2008 | rs6457620, rs6679677, rs6920220, rs4810485-G, rs2812378-G, rs1678542-C, rs3890745-T, rs4750316-G, rs42041-G
Julia A et al. Arthritis Rheum, 2008 | rs6457617, rs2002842-A
Plenge RM et al. Nat Genet, 2007 | rs10499194-C, rs6920220
Plenge RM et al. N Engl J Med, 2007 | rs660895, rs3761847-G, rs2476601
WTCCC. Nature, 2007 | rs6457617-T, rs615672, rs6679677-A, rs11761231-C, rs743777-G, rs2837960-G, rs3816587-C

2.7 Genetic Studies of RA in African Americans

Prior to the establishment of the Consortium for the Longitudinal Evaluation of African-Americans with early RA (CLEAR) in 2000, genetic studies in the African-American population were almost non-existent. Historically, this population has been more reluctant than Caucasians to participate in clinical research [72]. As a result, African Americans have been under-represented in established RA cohorts. There are several reasons why genetic studies of RA in this population are as important as in other populations. First, RA is known to occur in all races and in all parts of the world and, with some exceptions, the disease prevalence is similar across populations around the world [73]. In the US, prevalence rates are similar among persons from rural or urban areas and between persons from differing socioeconomic or occupational backgrounds [74]. Therefore, it is important that the racial composition of studies involving RA patients approach that of the general population in a defined geographical area. Next, the prevalence of RA in both Caucasians and African Americans is approximately 0.5-1% [75]; however, this does not prove that the genetic factors influencing susceptibility, severity or outcome are similar in both populations. Recently, very significant headway has been made in genetic studies involving African Americans. In RA patients of European ancestry, HLA-DRB1 alleles containing the shared epitope (SE), a common sequence at amino acids 70-74 (QKRAA) in the third hypervariable region of the β-chain [76-78], are found in ~50-70% of this population [79, 80]. An association study done in African Americans found a strong association between the HLA-DRB1 alleles containing the SE and susceptibility to RA [81]. The study also found that the SE association was strongest in the subset of African American patients

with anti-CCP antibodies, as in Europeans with RA [82]. Additionally, the study found a higher degree of European ancestry among African Americans with SE alleles, suggesting that a genetic risk factor for RA was introduced into the African American population through admixture, thus making individuals more susceptible to subsequent environmental or unknown factors that trigger the disease. In another study, the association of interleukin-4 receptor (IL4R) single-nucleotide polymorphisms (SNPs) with rheumatoid nodules was examined in African Americans with RA [83]. Studies have suggested that IL4R could play a role in the pathogenesis of RA [84, 85]. The study found that the IL4R SNPs rs1801275 and rs1805010 were associated with rheumatoid nodules in autoantibody-positive African American RA patients with at least one HLA-DRB1 allele encoding the SE. A different study investigated whether associations identified among individuals of European ancestry, such as that of the HLA-DRB1 SE, also hold in African Americans. Findings from this study suggested that the majority of RA risk alleles showed similar odds ratios in African Americans with RA as in RA patients of European ancestry [81].

2.8 Copy Number Variation

Copy number variation (CNV) is an alteration of the genomic DNA that results in cells having an abnormal number of copies of one or more sections of DNA. CNVs correspond to large sections of the genome that have been either deleted or duplicated. Over the course of the past decade, it has become increasingly evident that a major difference between individuals is variation in the copy number of segments of their genomes. CNVs have also been referred to as copy number polymorphisms (CNPs) or

CNVs with a frequency >1% [86, 87]. CNVs are widely distributed even in healthy individuals and contribute substantially to population-based genomic variation [88-91]. CNV is known to account for about 12% of human genomic DNA [92]. It has also been proposed that CNV is a major driving force in the rapid evolution that continues to occur both within humans and in the great ape lineage [93]. CNVs are known to emerge both meiotically and somatically, as shown by the observation that identical twins can have different CNVs [94]. Further, within the same individual, repeated sequences can vary in number between tissues or organs [95]. The specific mechanisms that lead to a change in copy number are thought to include homologous recombination and non-homologous repair mechanisms [96]. The advent of whole genome sequencing technology has greatly expanded knowledge of the large numbers of polymorphisms in the species examined. Apart from humans, copy number variations have been found in a wide variety of organisms including fruit flies, mice, chickens, maize, chimpanzees, rhesus macaques and cows [97-108]. It is thus not surprising that there are currently several CNV databases, including The Hospital for Sick Children's Database of Genomic Variants and the Wellcome Trust Sanger Institute's DECIPHER. Detection of CNVs has been a significant challenge and continues to be one despite significant advancements in CNV detection technology. One method that has been used in the detection of CNVs is hybridization-based mapping. The methodology works on the premise that when the genome of an individual is compared to a reference genome, areas within the individual's genome that are deleted or duplicated will show a decrease or an increase of DNA, respectively. Although this may seem to

be a very simplistic assumption, it holds very significant scientific relevance. Another method used in detecting CNVs is paired-end mapping, which in principle works by comparing the lengths of captured genomic fragments to a reference genomic sequence. Thus a captured fragment longer than the reference genomic sequence is expected to contain an insertion or duplication, while a captured fragment shorter than the reference genomic sequence is expected to contain a deletion. Both methods above are used during the initial discovery process for CNVs and differ from CNV quantification methods such as RT-PCR and the paralogue ratio test (PRT). There are inherent limitations regarding detection and quantification of gene copy number. Detection of CNVs is usually based on a reference genomic sequence of a known or unknown random individual. Copy number detection for non-humans is also dependent on having a reference genomic sequence for the organism in question. The problem with using the genomic sequence of an individual as a reference is that an apparent duplication in a test genome could well reflect a deletion in the corresponding part of the reference individual's genome. Similarly, an apparent deletion relative to the reference sequence could potentially be a novel duplication in the reference genomic sequence. Also, individuals of different ethnicities may have differing genomic sequences, so it may not be accurate to compare the genomic sequence of an Asian individual to that of a Caucasian individual in detecting CNV. Perhaps a way to address this issue is to generate a reference genomic sequence that combines several individuals from different ethnicities. This may provide a more accurate method for detecting CNVs. Quantification of gene copy number has remained a challenging issue despite the development of new quantification methods. Very frequently, findings from studies testing the association

between gene copy number and disease cannot be replicated. A major example is the finding from Gonzalez et al. that having a CCL3L1 gene copy number greater than the ethnic median decreases susceptibility to HIV-1 and progression to disease [109]. Although this finding was replicated by three independent studies [110-112], four other independent studies did not find an association between CCL3L1 gene copy number and HIV-1 [113-116]. Recently, Shrestha et al. highlighted the challenges in measuring gene copy numbers greater than two [117]. Traditional gene copy number quantification methods involving PCR suffer from decreasing precision and accuracy as copy number increases. There is currently still no gold standard for quantifying gene copy number; however, there has been significant headway in this regard with methods including next generation sequencing.

2.9 Homology of CCL3L1 to Gene Cluster Localized to Chromosome 17 q12

The chemokines are a superfamily of small, structurally related, secreted cytokines that are involved in immunoregulatory and inflammatory processes. Most chemokine genes are clustered in specific chromosomal locations [118]. The CXC cluster located on chromosome 4q12-21 and the CC cluster located on chromosome 17q11.2-q12 are the main clusters encoding essential inflammatory cytokines [93]. CCL3L1 is one of several chemokine genes clustered on the q-arm of chromosome 17, specifically at 17q12 (Figure 1). The q-arm of human chromosome 17 is known to have multiple regions of genomic instability where gene duplications, chromosomal rearrangements and CNVs are common [119, 120]. High-resolution CNV data show that this region has very extensive architectural complexity, in that

smaller CNVs are embedded within larger ones and breakpoints vary between individuals [119, 121]. CCL3L1 (Gene ID: 6349) and CCL3L3 (Gene ID: 414062), both located on chromosome 17q12, share 100% of their exonic genomic structure and 98.7% of their total genomic structure. As a result of this very high homology, primers and probes targeted to these genes have not always been specific [122]. Two other genes within this cluster, CCL3 (Gene ID: 6348) and CCL3L2 (Gene ID: 390788), also share sequence homologies ranging from 56% to 93% with CCL3L1 and CCL3L3. The reference genomic sequences for CCL3 (34415602-34417506), CCL3L1 (34623842-34625730), CCL3L2 (34610211-34611454) and CCL3L3 (34522262-34524142) are mapped to the human chromosome 17 sequence, accession # NC_000017.10, and indicated in Figure 1; their homologies are indicated in Table 2. An alignment of all four genes indicating their untranslated, exonic and intronic regions is depicted in Appendix B. The high homology between these genes underscores the need to differentiate their genomic structures accurately in order to quantify the copy numbers of these genes in individuals. The highest copy numbers of the CCL3L1 gene are found in Sub-Saharan African populations, with a median copy number of six, whereas Europeans present with the lowest copy numbers, with a median of two. Individuals with zero copies of CCL3L1 exist but are very rare, and their proportion is less than 5% in all ethnicities [109, 123].
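Because median CCL3L1 copy number differs between populations, the association studies discussed in this chapter compare an individual's copy number to the median of his or her own population rather than to a fixed value. The comparison can be sketched as follows; the dictionary keys and the function name `classify_copy_number` are illustrative assumptions, and only the two medians stated above are encoded.

```python
# Median CCL3L1 copy numbers as stated in the text above
POPULATION_MEDIAN = {
    "sub_saharan_african": 6,
    "european": 2,
}

def classify_copy_number(copies, population):
    """Place an individual's copy number relative to the population median."""
    if copies < 0:
        raise ValueError("copy number cannot be negative")
    median = POPULATION_MEDIAN[population]
    if copies < median:
        return "below median"
    if copies > median:
        return "above median"
    return "at median"

# The same copy number is classified differently in the two populations
for population in POPULATION_MEDIAN:
    print(population, classify_copy_number(3, population))
```

The example makes the practical point that a copy number of three is above the median in Europeans but below it in Sub-Saharan Africans, so population assignment matters as much as accurate quantification.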

Figure 1. Location of the CCL3 chemokine genes on chromosome 17q12

Table 2 (%) Identity matrix between the CCL3L-related genes in complete (5' UTR, exons, introns and 3' UTR) genomic sequence (bottom left) and exonic regions (top right)

       | CCL3  | CCL3L1 | CCL3L2 | CCL3L3
CCL3   | -     | 0.957  | 0.713  | 0.957
CCL3L1 | 0.929 | -      | 0.719  | 1
CCL3L2 | 0.525 | 0.553  | -      | 0.719
CCL3L3 | 0.921 | 0.987  | 0.56   | -
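Table 2 packs two comparisons into one matrix, which is easy to misread. The sketch below encodes the table as printed and looks up either value for a pair of genes; the names `GENES`, `MATRIX` and `identity` are illustrative assumptions, not part of the original analysis.

```python
GENES = ["CCL3", "CCL3L1", "CCL3L2", "CCL3L3"]

# Table 2 as printed: values above the diagonal are exonic identities,
# values below the diagonal are complete genomic identities.
MATRIX = [
    [None, 0.957, 0.713, 0.957],
    [0.929, None, 0.719, 1.0],
    [0.525, 0.553, None, 0.719],
    [0.921, 0.987, 0.56, None],
]

def identity(gene_a, gene_b, region):
    """Return the identity between two genes for 'exonic' or 'complete' sequence."""
    if region not in ("exonic", "complete"):
        raise ValueError("region must be 'exonic' or 'complete'")
    i, j = sorted((GENES.index(gene_a), GENES.index(gene_b)))
    if i == j:
        raise ValueError("choose two different genes")
    # With i < j, MATRIX[i][j] is top right and MATRIX[j][i] is bottom left
    return MATRIX[i][j] if region == "exonic" else MATRIX[j][i]

print(identity("CCL3L1", "CCL3L3", "exonic"))
print(identity("CCL3L1", "CCL3L3", "complete"))
```

The two lookups return the 100% exonic and 98.7% complete genomic identity quoted in the text for CCL3L1 and CCL3L3.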

2.10 Functional Structure and Expression of CCL3L1

Compared with CCL3, the CCL3L1 mature protein has a proline (P) in position 2 instead of a serine (S), and a serine or glycine (G) in the region between cysteines 3 and 4. The CCL3L1 protein binds efficiently to CCR5, CCR1 and CCR3 [124]. CCL3L1 is also very potent in inducing intracellular Ca2+ signaling and chemotaxis through CCR5. Truncated forms of CCL3L1 show an increased binding affinity for CCR1 and CCR5, which makes them highly efficient monocyte and lymphocyte chemoattractants [125]. Gene copy number regulates the production of CCL3L1 at the mRNA and protein level, as evidenced by the positive association of increasing CCL3L copy number with CCL3L1 mRNA production and protein secretion [109, 126, 127].

2.11 Association between CCL3L1 Copy Number and RA

CNV of CCL3L1 has been associated with a host of diseases, the most notable being viral infections and autoimmune diseases. Four studies to date have investigated the association between CCL3L1 gene copy number and susceptibility to RA in Caucasians (Table 3). The first study found that individuals with a copy number higher than two (the population median) had a higher risk for RA (OR 1.30, 95% CI 1.00-1.54, p=0.003) [128]. Three later studies, however, found no evidence of an association between CCL3L1 gene copy number and RA [129-131]. The study finding an association was conducted in individuals from New Zealand. The study suggested an association at the single-locus level between CCL3L1 copy number and RA. In addition, the study demonstrated a statistical interaction between CCL3L1 and CCR5 that supports the hypothesis that copy number variation in CCL3L1 influences susceptibility to RA and

that CCL3L1-initiated signaling through CCR5 is a checkpoint in RA. Within the same study, a parallel association study performed in individuals from the United Kingdom did not find an association between CCL3L1 gene copy number and RA. An insufficient sample size may seem to explain this lack of association. Similarly, in the association studies conducted by Mamtani et al. and Carpenter et al., the smaller sample size seems a plausible reason for the lack of an association between CCL3L1 gene copy number and the risk of RA [130, 131]. However, no association was found in the study conducted by the Wellcome Trust Case Control Consortium (WTCCC), which involved 17,304 cases and controls [129]. The finding from the WTCCC indicates that sample size may not be the sole variable contributing to the lack of an association. A number of factors, including flawed results in the initial study, genetic or environmental differences between the populations studied, the method used in quantifying CCL3L1 gene copy number, and the quality or source of the DNA used, may have contributed to the lack of association findings.

Table 3 Association studies involving CCL3L1 gene copy number and risk of RA

Study | Population | Sample Size | Association
McKinney et al. Ann Rheum Dis 2008; 67:409-413 | New Zealanders | 834 cases and 938 controls | OR 1.30, 95% CI 1.00-1.54, p=0.003
McKinney et al. Ann Rheum Dis 2008; 67:409-413 | United Kingdom | 302 cases and 255 controls | None
Mamtani et al. Genes and Immunity 2010; 11:155-160 | Colombians | 158 cases and 409 controls | None
The Wellcome Trust Case Control Consortium. Nature 2010; 464:713-720 | United Kingdom | 17,304 cases and controls | None
Carpenter et al. BMC Genomics 2011; 12:418 | United Kingdom | 252 cases and 252 controls | None
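When reading an entry such as the one in the first row of Table 3, it helps to recall how an odds ratio, its 95% confidence interval and a p-value relate under the common Wald (normal) approximation on the log scale. The sketch below recovers the implied standard error, z statistic and two-sided p-value from a reported interval. It is an assumption that a Wald interval is appropriate here; the original study may have used a different test, so the implied p-value need not equal the reported one. The function name `wald_from_ci` is hypothetical.

```python
import math

def wald_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Implied SE of log(OR), z statistic and two-sided p-value from a 95% CI."""
    # A Wald interval is log(OR) +/- z_crit * SE, so its width gives the SE
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p

# Values from the first row of Table 3
se, z, p = wald_from_ci(1.30, 1.00, 1.54)
print(f"SE(log OR)={se:.3f}, z={z:.2f}, implied two-sided p={p:.3f}")
```

A check of this kind is also a quick way to compare effect estimates across the studies in Table 3 once intervals for the null studies are available.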

2.12 Mechanism of Action between Chemokines and RA

A characteristic of some tissue-specific autoimmune diseases, including RA, is a substantial infiltration of tissues by inflammatory cells. Several factors working simultaneously mediate this response, including localized and systemic concentrations of regulatory chemokines and the number and type of receptors expressed by different leukocyte populations [128]. CCR5-expressing leukocytes have been associated with RA disease progression [132, 133]. A defining feature of RA is an excessive production of chemokine (CC)-receptor ligands (β chemokines) including CCL5, CCL3 and CCL4 [134-136]. Experimentally induced arthritis can be partially blocked using selective CCR5 antagonists or anti-CCR5 antibodies in animal models [137, 138]. Anti-CCR5 and anti-CCL3 antibodies decrease autoimmune symptoms [139, 140]. Some anti-rheumatic drugs, including dexamethasone and KE-298, act by inhibiting the production of CCL5 in humans [141], and the CCR5-Δ32 deletion has been found to be protective against RA [142]. With CCL3L1 being a potent ligand for CCR1, CCR3 and CCR5, the expression levels of CCL3L1 may influence RA susceptibility [128]. It has also been proposed that CCL3L1 gene copy number variation might cause differences in T cell activation via the CCR5 receptor. This is because variable CCL3L1 gene copy numbers lead to differences in CCL3L1 mRNA dosage, CCL3L1 protein expression and chemokine secretion, and CCR5 receptor binding, which ultimately affect T cell functions [143].

CHAPTER 3

CCL3L1 AND CCL3L3 GENE CHARACTERIZATION VIA LONG-RANGE CONSENSUS-PCR AMPLIFICATION

3.1 Introduction

The initial characterization of the genomic sequences of CCL3L1 and CCL3L3 was done using long-range consensus PCR amplification primers. CCL3, which shares significant homology with both CCL3L1 and CCL3L3, was also characterized. Characterization of CCL3L1, CCL3L3 and CCL3 was performed in eight individuals of Yoruban origin. The purpose of characterizing genes localized to the Chr 17 q12 cluster was to design primers and probes specific to CCL3L1 for association testing between CCL3L1 gene copy number and the risk of RA.

3.2 Study Population

Characterization of CCL3L1 and CCL3L3 was carried out in 8 unrelated HapMap samples of Yoruba descent from Ibadan, Nigeria. High quality genomic DNA of all samples was obtained from the Coriell Institute for Medical Research. Use of these samples for research purposes was in compliance with the National Human Genome Research Institute (NHGRI) regulations for the protection of human subjects.

3.3 Materials and Methods

3.3.a Sequencing Strategy

The initial approach for sequencing was to use long-range, consensus-PCR amplification to co-amplify the genes CCL3, CCL3L1 and CCL3L3, then clone that mixture and sequence multiple clones. It was expected that the sequence of each clone would belong to one of six possible haplotypes (up to two versions per individual sample of each of three genes). By comparing these haplotypes to the known reference sequences of the three genes, each haplotype sequence could be characterized as CCL3, CCL3L1, or CCL3L3.

3.3.b Sequence Regions

The sequence of CCL3 is mapped to the human reference sequence hg18 in the region Chr17:34,415,602-34,417,506 on the - strand. The UCSC genome browser lists both CCL3L1 and CCL3L3 ambiguously, each occurring at two different positions. However, NCBI gives the FASTA sequences of both CCL3L1 and CCL3L3. The NCBI FASTA sequences of the three genes map to the following positions: (CCL3: 1917 bp, Chr 17: 34,415,602-34,417,506), (CCL3L1: 1919 bp, Chr 17: 34,623,842-34,625,730), (CCL3L3: 1919 bp, Chr 17: 34,522,262-34,524,142).

3.3.c Long-Range, Consensus-PCR

A single pair of PCR primers was chosen to co-amplify all three genes. The two primers chosen were:

F_CCL3 5'-CCTCCTCACCCCCAGATT-3'

R_CCL3 5'-TTCACCTCTTCCTAATCTTTGCCTA-3'

These primers were chosen to have a perfect match to priming sites immediately upstream and downstream of all three genes, without interference from any known sequence variations within or near each priming site. The primers were also chosen to be locally unique and to have favorable properties related to base composition and annealing temperature (Tm). Using this primer pair with in silico PCR results in the unique selection of the three regions of interest. The predicted amplification product sizes are 2046 bp, 2048 bp, and 2048 bp for CCL3, CCL3L1, and CCL3L3, respectively. Long-range PCR was performed on each of the eight genomic samples and also on one control DNA sample obtained from a cell line.

3.3.d Large-Fragment Cloning

PCR products were run on a 0.8% agarose gel, visualized with crystal violet dye, and compared to size standards. All samples produced one clear band with a size of approximately 2 kb. The 2 kb product was cut out of the gel and extracted with purification materials included with the TOPO XL PCR Cloning kit (Invitrogen). Long-range PCR products were cloned into a TOPO XL PCR cloning vector. This system uses a TA cloning vector and is recommended for inserts of up to 10 kb. Per the manufacturer's instructions, electro-competent cells (from the same kit) were transformed with the vector, plated in the presence of antibiotic, and incubated. Thirty-six clones from each plate were picked and cultured in a 96-well format.
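The in silico PCR check described in section 3.3.c can be approximated with a simple string search: locate the forward primer, locate the reverse complement of the reverse primer downstream, and report the distance between them. The sketch below is not the tool used in this work. It assumes perfect primer matches, searches one strand only, and is demonstrated on a made-up template rather than on the chromosome 17 reference; the function names are illustrative.

```python
FORWARD = "CCTCCTCACCCCCAGATT"          # F_CCL3, from section 3.3.c
REVERSE = "TTCACCTCTTCCTAATCTTTGCCTA"   # R_CCL3, from section 3.3.c

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def predicted_products(template, forward=FORWARD, reverse=REVERSE):
    """Return (start, size) for each perfect-match amplicon on the given strand."""
    template = template.upper()
    # On the template strand the reverse primer appears as its reverse complement
    rev_site = reverse_complement(reverse)
    products = []
    start = template.find(forward)
    while start != -1:
        end = template.find(rev_site, start + len(forward))
        if end != -1:
            products.append((start, end + len(rev_site) - start))
        start = template.find(forward, start + 1)
    return products

# Made-up template: 4 bp flank, forward site, 100 bp insert, reverse site, 4 bp flank
template = "AAAA" + FORWARD + "G" * 100 + reverse_complement(REVERSE) + "TTTT"
print(predicted_products(template))
```

Run against the real reference sequence, a search of this kind would be expected to report the three product sizes quoted above, provided the reference matches the priming sites exactly.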

3.3.e Template Preparation

Diluted cultures were transferred to a denaturing buffer that was part of the TempliPhi DNA Sequencing Template Amplification kit (GE HealthCare/Amersham Biosciences). This buffer causes the release of plasmid DNA but not bacterial DNA. Cultures were heated, cooled, spun, and transferred to fresh plates containing the TempliPhi enzyme and other components. This mixture was incubated at 30 °C for 18 hours to promote amplification of the plasmid templates. These products were then spun and heated to 65 °C to inactivate the enzyme.

3.3.f DNA Sequencing

Plasmid templates were then used in DNA sequencing reactions using the Big Dye version 3.1 sequencing kit (Applied Biosystems). Initially, 18 internal sequencing primers (9 forward primers and 9 reverse primers) were tested on the control DNA clones. These primers were selected at positions roughly 200 bp apart on each strand. It was found that 16 of these primers produced good sequencing results. From these, 10 internal sequencing primers (5 forward and 5 reverse) were chosen that would provide adequate coverage of the 2 kb region in both directions. In addition, the universal vector sequencing priming sites (M13F and M13R) were used, for a total of 12 sequencing reactions per clone. Cycle sequencing was carried out with an annealing temperature of 50 °C, an elongation temperature of 60 °C, and a denaturation temperature of 96 °C, for a total of 30 cycles. Sequencing reaction products were run on an ABI 3730XL DNA sequencer with a 50 cm capillary array using standard run mode.

3.3.g Sequencing Data Analysis

A proprietary sequencing analysis program called Agent (developed by Celera) was used to align sequencing reads and produce contigs associated with each clone. This system displays sequence information in an Excel format, using one Excel cell per base character, and provides estimated quality scores for all base calls. Sequencing reports were prepared for each of the eight samples plus the control sample. In most cases, a high-confidence, full-length nucleotide sequence was obtained for all 36 clones. All sequencing results were aligned to the known 1917-bp CCL3 reference sequence.

3.3.h Grouping of Clone Sequences

A proprietary haplotype analysis program was used to compare patterns of variation among the different clone sequences for a given genomic sample. Because of the possibility of random PCR errors, the program only recognizes a variation as being present if it occurs in at least two clones. Also, a given overall sequence pattern was only considered to be a unique haplotype pattern if at least two clones contained that exact or nearly exact pattern. Since up to six possible sequence patterns per sample were expected, up to six distinct patterns were searched for among the 36 clones of each sample. For each such pattern, all clone sequences related to that pattern were reported as a group and the consensus sequence of that group was determined. Only one to three distinct sequence patterns per genomic sample were found (even though as many as six were expected).
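The two filtering rules in section 3.3.h can be illustrated with a short sketch. The actual haplotype program is proprietary and its internals are not described here, so the code below is only one plausible reading of the stated rules: a variant must appear in at least two clones, and a pattern must be carried by at least two clones. "Nearly exact" matching is approximated by dropping unsupported variants before grouping. The clone names and variants are invented for the example.

```python
from collections import Counter, defaultdict

def group_clones(clone_variants, min_clones=2):
    """Group clones by supported variant pattern.

    clone_variants maps a clone id to the set of (position, base)
    differences from the reference found in that clone.
    """
    # Rule 1: keep only variants seen in at least `min_clones` clones,
    # which discards isolated differences such as random PCR errors
    counts = Counter(v for variants in clone_variants.values() for v in variants)
    supported = {v for v, n in counts.items() if n >= min_clones}

    groups = defaultdict(list)
    for clone_id, variants in clone_variants.items():
        pattern = frozenset(variants & supported)
        groups[pattern].append(clone_id)

    # Rule 2: keep only patterns carried by at least `min_clones` clones
    return {p: ids for p, ids in groups.items() if len(ids) >= min_clones}

# Invented clones for one sample: c3 and c6 each carry one singleton variant
clones = {
    "c1": {(574, "T"), (983, "T")},
    "c2": {(574, "T"), (983, "T")},
    "c3": {(574, "T"), (983, "T"), (1500, "A")},
    "c4": set(),
    "c5": set(),
    "c6": {(100, "G")},
}
groups = group_clones(clones)
for pattern, ids in groups.items():
    print(sorted(pattern), sorted(ids))
```

In this example the six clones collapse into two haplotype patterns, one matching the reference and one carrying the two shared variants, with the singleton differences treated as noise.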

3.4 Results

A BLAT analysis was performed on each of the haplotype sequences found. The identity of each sequence to the known sequences of CCL3, CCL3L1, and CCL3L3 is displayed in Table 4. Table 5 shows all variations found between the sequencing results and the hg18 reference sequence of CCL3, with variations highlighted in yellow. A total of 13 polymorphic sites were found, all of which were single nucleotide polymorphisms and all of which were known variations found in dbSNP (build 130). Among the eight experimental samples, as many as 16 CCL3-related haplotypes were found, but only six different haplotype patterns were found. All sequences from the eight samples were exact or near-perfect matches to the CCL3 reference sequence. Only in the control sample was the presence of CCL3L1 detected. CCL3L3 was not detected in any sample.

Table 4 Genotyping results based on BLAT analysis

Haplotype Sequence | Identity to CCL3 | Identity to CCL3L1 | Identity to CCL3L3 | Sequence Category
Sample1_seq1 | 99.90% | 94.70% | 94.80% | CCL3
Sample2_seq1 | 99.90% | 94.70% | 94.80% | CCL3
Sample3_seq1 | 99.90% | 94.70% | 94.80% | CCL3
Sample3_seq2 | 99.70% | 94.60% | 94.70% | CCL3
Sample4_seq1 | 99.90% | 94.70% | 94.80% | CCL3
Sample4_seq2 | 99.80% | 94.80% | 94.90% | CCL3
Sample5_seq1 | 99.90% | 94.70% | 94.80% | CCL3
Sample5_seq2 | 99.60% | 94.40% | 94.60% | CCL3
Sample6_seq1 | 99.90% | 94.70% | 94.80% | CCL3
Sample6_seq2 | 99.90% | 94.70% | 94.80% | CCL3
Sample7_seq1 | 99.90% | 94.70% | 94.80% | CCL3
Sample7_seq2 | 100.00% | 94.80% | 94.90% | CCL3
Sample8_seq1 | 99.70% | 94.60% | 94.70% | CCL3
Sample8_seq2 | 99.90% | 94.70% | 94.80% | CCL3
Standard Positive Control_seq1 | 99.90% | 94.70% | 94.80% | CCL3
Standard Positive Control_seq2 | 94.70% | 99.90% | 99.80% | CCL3L1
Standard Positive Control_seq3 | 94.60% | 100.00% | 99.80% | CCL3L1
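The "Sequence Category" column of Table 4 is consistent with assigning each haplotype to the reference gene it matches best. The text does not state the exact rule that was applied, so the sketch below simply assumes a highest-identity rule, with ties resolved in favor of the gene listed first. The names `categorize` and `TABLE4_ROWS` are illustrative, and only three rows of Table 4 are reproduced.

```python
REFERENCES = ("CCL3", "CCL3L1", "CCL3L3")

def categorize(identities):
    """Return the reference gene with the highest percent identity.

    `identities` holds the identity to CCL3, CCL3L1 and CCL3L3, in that order.
    On a tie, the gene listed first in REFERENCES is returned.
    """
    best = max(range(len(REFERENCES)), key=lambda i: identities[i])
    return REFERENCES[best]

# Three rows taken from Table 4
TABLE4_ROWS = {
    "Sample7_seq2": (100.00, 94.80, 94.90),
    "Standard Positive Control_seq2": (94.70, 99.90, 99.80),
    "Standard Positive Control_seq3": (94.60, 100.00, 99.80),
}

for name, identities in TABLE4_ROWS.items():
    print(name, categorize(identities))
```

The margin between CCL3L1 and CCL3L3 in the control rows is only 0.1 to 0.2 percentage points, which reflects the very high homology between the two genes and shows why this kind of assignment is sensitive to sequencing errors.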

Table 5 Summary of genotypes in haplotypes matching CCL3

CCL3 Reference | Position No. | Sample1_seq1 | Sample2_seq1 | Sample3_seq1 | Sample3_seq2 | Sample4_seq1 | Sample4_seq2 | Sample5_seq1 | Sample5_seq2 | Sample6_seq1 | Sample6_seq2 | Sample7_seq1 | Sample7_seq2 | Sample8_seq1 | Sample8_seq2 | Standard Positive Control_seq1 | Known SNPs Found | Chromosome Position
C | 228 | C | C | C | C | C | C | C | T | C | C | C | C | C | C | C | rs35511254 | 34,417,292
C | 574 | C | C | C | T | C | C | C | T | C | C | C | C | T | C | C | rs1719134 | 34,416,946
T | 855 | C | C | C | C | C | C | C | C | C | C | C | T | C | C | C | rs1719133 | 34,416,665
C | 983 | C | C | C | T | C | C | C | T | C | C | C | C | T | C | C | rs1130371 | 34,416,537
C | 1067 | T | T | T | T | T | T | T | T | T | T | T | C | T | T | T | rs1719131 | 34,416,453
T | 1121 | T | T | T | T | T | A | T | T | T | T | T | T | T | T | T | rs6505507 | 34,416,399
C | 1123 | C | C | C | C | C | T | C | C | C | C | C | C | C | C | C | rs6505506 | 34,416,397
A | 1274 | A | A | A | G | A | A | A | G | A | A | A | A | G | A | A | rs1719130 | 34,416,246
T | 1369 | T | T | T | T | G | G | T | T | T | T | T | T | T | T | T | rs5029410 | 34,416,151
G | 1457 | G | G | G | G | G | G | G | T | G | G | G | G | G | G | G | rs34171309 | 34,416,063
G | 1535 | G | G | G | G | G | G | G | G | G | A | G | G | G | G | G | rs5029407 | 34,415,985
C | 1757 | C | C | C | G | C | C | C | G | C | C | C | C | G | C | C | rs1063340 | 34,415,763
A | 1800 | A | A | A | G | A | A | A | A | A | A | A | A | G | A | A | rs8951 | 34,415,720

3.5 Discussion

The long-range consensus-PCR method was designed to provide an effective means of accurately determining the sequence and variations of genomic regions that are otherwise very difficult to assess because of their high degree of homology to other regions. It is thus perplexing that the results did not meet expectations: the method failed to find any copies whatsoever of two of the three genes under investigation (CCL3L1 and CCL3L3). However, the results with a DNA control sample showed that the assay was capable of detecting CCL3L1 if present. Based solely on findings from the method used, the following conclusions are drawn. First, the most likely reason that there were no occurrences of either CCL3L1 or CCL3L3 in the samples tested was that the CCL3L1 and CCL3L3 genes were not present in these samples. Alternatively, the genes may have been present, but either the PCR or the cloning procedure excluded them. This could have happened if, for example, the actual PCR priming sites for CCL3L1 and CCL3L3 are different from those reported in the human genome reference sequence hg18. However, since it was possible to detect two versions of CCL3L1 in the control DNA sample, it appears that the procedures are capable of detecting CCL3L1 at least some of the time. At the initiation of this project, the focus was on determining the sequences of CCL3, CCL3L1, and CCL3L3 in the genomic samples. A preliminary assay was not devised to determine which of those three genes might actually be present, since it was assumed that they all would be. In order to further explore potential reasons why the long-range consensus-PCR method may have failed, a short-range amplicon sequencing assay was developed that

indeed showed that all three genes could be amplified in small amplicons. Various longrange combinations were then tried to determine if alternative primers could yield more of the other genes. In some cases, it was possible to get somewhat more of CCL3L1 and CCL3L3 however this was not enough to retry cloning unless a very large amount of clones were to be screened. A follow-up experiment was conducted involving the use of short-range amplicon sequencing assays to sequence the flanking regions of the three genes. There were two sets of assays. One set was specific for the regions flanking CCL3, and the other was specific for the pair of genes CCL3L1 and CCL3L3. The theory behind this method was that perhaps these flanking regions were substantially different from what is published, and that these differences were making it difficult to use a set of consensus primers that recognize the flanking sequences outside of CCL3L1 and CCL3L3. High quality sequencing results were obtained in all cases, but the flanking sequences found were exactly what is published, which means that there is still no explanation for why the longrange PCR results were so biased towards just CCL3. An additional finding from the short-range amplicon method results was that they showed that it is possible to correctly sequence the flanking regions of CCL3 without contamination of CCL3L1 and CCL3L3, and it is possible to correctly sequence the flanking regions of the combined CCL3L1 plus CCL3L3 without contamination from CCL3. Based on these findings, it seemed plausible that the short-range amplicon sequencing strategy could be successfully applied to the genes themselves, to give a specific sequence for all of CCL3 and a specific sequence for the combined CCL3L1 plus CCL3L3. 33

In conclusion the results obtained dampen any hopes that a robust assay for these genes can be developed using the cloning method however the short-range amplicon method looked promising. This was the basis for the next round of CCL3L1 and CCL3L3 gene characterization. 34

CHAPTER 4 CCL3L1 AND CCL3L3 GENE CHARACTERIZATION VIA SHORT-RANGE AMPLICON SEQUENCING

4.1 Introduction

The initial characterization of CCL3L1 and CCL3L3 using long-range consensus PCR amplification primers was unsuccessful; hence a different approach involving short-range amplicon sequencing was used. The short-range amplicon design involved six Boost primer pairs that generated fragments ranging from 655 to 936 base pairs, spanning the length of the CCL3L1/CCL3L3 reference sequence including the flanking regions at the 5' and 3' ends. To improve the accuracy of the CCL3L1/CCL3L3 characterization, the Boost fragments were designed to overlap with each other. Nest primers were designed based on each Boost sequence. The Nest fragment lengths ranged from 571 to 616 base pairs. The nest product from the PCR reaction was sequenced and used to characterize the nucleotide sequence of CCL3L1 and CCL3L3.

4.2 Study Population

Characterization of CCL3L1 and CCL3L3 was carried out in 24 unrelated HapMap samples of Yoruba descent from Ibadan, Nigeria. High-quality genomic DNA for all samples was obtained from the Coriell Institute for Medical Research. Use of these samples for research purposes was in compliance with the National Human Genome Research Institute (NHGRI) regulations for the protection of human subjects.

4.3 Methods and Materials

4.3.a Sequencing Strategy

The methodology involved a two-step boost/nest PCR strategy. A boost reaction was performed first on a larger fragment, and its product was used as the template for the nest reaction. The nest product was then sequenced. CCL3L1 and CCL3L3 were characterized in a combined fashion, since both genes can be sequenced together without contamination from other genes.

4.3.b Boost/Nest Primers

The boost/nest primers were designed using the combined CCL3L1 and CCL3L3 genes and their flanking regions. The sequence mapped to the February 2009 human reference sequence (hg19) was used as the reference. Single-stranded DNA primers were designed to be between 15 and 30 bases in length, and to maximize specificity the %G+C of the primers was kept near 50%. Primers were also purified via gel electrophoresis to avoid potential problems. In addition, care was taken to ensure that primer sequences did not complement each other, particularly at the 3' end, in order to avoid template-independent amplification of the primer sequences, which could lead to larger primer artifacts. The complete boost/nest primer design is depicted in Figure 2, the specific primers are shown in Tables 6 and 7, and the boost/nest fragments are shown in Appendix C.
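The primer design rules in Section 4.3.b (length of 15 to 30 bases, G+C content near 50%, no complementarity at the 3' ends) can be expressed as a simple check. The following is a minimal sketch, not the procedure used in the thesis: the function names and the 4-base window for the 3'-end comparison are assumptions made for illustration. The example pair is the CCL3L1_1 boost primer pair from Table 7.

```python
# Sketch of the primer design checks described in Section 4.3.b.
# The 4-base window used for the 3'-end check is an illustrative assumption,
# not a value taken from the thesis.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in a primer."""
    primer = primer.upper()
    return 100.0 * sum(base in "GC" for base in primer) / len(primer)


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def three_prime_complementary(primer_a: str, primer_b: str, window: int = 4) -> bool:
    """True if the last `window` bases of the two primers can base-pair."""
    tail_a = primer_a.upper()[-window:]
    tail_b = primer_b.upper()[-window:]
    return tail_a == reverse_complement(tail_b)


def check_primer_pair(forward: str, reverse: str) -> dict:
    """Summarize length, G+C content and 3'-end complementarity for a pair."""
    return {
        "lengths_ok": all(15 <= len(p) <= 30 for p in (forward, reverse)),
        "gc_percent": (gc_percent(forward), gc_percent(reverse)),
        "three_prime_clash": three_prime_complementary(forward, reverse),
    }


# CCL3L1_1 boost primers as listed in Table 7.
report = check_primer_pair("CAGGAGAAACCCCATG", "TGGGAGACCTAGGGT")
print(report)
```

A check of this kind only screens for exact complementarity over the chosen window; partial 3' complementarity would need a more permissive comparison.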

4.3.c PCR Protocol

The AmpliTaq Gold PCR Master Mix from Applied Biosystems, supplied at a 2X concentration, was used according to the manufacturer's instructions, except that the reactions were scaled down 10-fold: a 5 µl reaction was run instead of the 50 µl reaction described in the protocol, for both the boost and nest reactions. Because of the relatively high G+C content of the template, a three-temperature thermal cycling method was used, as recommended by the manufacturer. PCR conditions were as follows: initial enzyme activation at 94 °C for 4 min; 30 cycles of denaturing at 94 °C for 20 s, annealing at 55 °C for 25 s and extending at 72 °C for 1 min; and a final hold at 72 °C for 7 min. The DNA samples were run on an Eppendorf Mastercycler 384. Millipore Montage PCR384 plates were used for PCR cleanup. Cleanup was performed for the nest reaction but not for the boost reaction.

4.3.d DNA Sequencing

A 5 µl, 0.25x reaction was run during the preparation of samples for cycle sequencing. BigDye 3.1 chemistry was used, and Millipore SEQ384 plates were used for dye terminator removal.
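The cycling conditions and reaction scaling in Section 4.3.c can be written down as a small calculation. This is a minimal sketch: the constant and function names are hypothetical, the time excludes temperature ramping, and the master mix volume assumes only that a 2X mix makes up half of the final reaction volume.

```python
# Cycling parameters as stated in Section 4.3.c (times in seconds).
INITIAL_ACTIVATION_S = 4 * 60   # 94 C, 4 min
CYCLES = 30
DENATURE_S = 20                 # 94 C
ANNEAL_S = 25                   # 55 C
EXTEND_S = 60                   # 72 C
FINAL_HOLD_S = 7 * 60           # 72 C, 7 min


def programmed_time_seconds() -> int:
    """Total programmed hold time, ignoring ramping between temperatures."""
    per_cycle = DENATURE_S + ANNEAL_S + EXTEND_S
    return INITIAL_ACTIVATION_S + CYCLES * per_cycle + FINAL_HOLD_S


def master_mix_volume_ul(reaction_volume_ul: float, mix_concentration: float = 2.0) -> float:
    """Volume of concentrated master mix needed for a 1X final concentration."""
    return reaction_volume_ul / mix_concentration


total_s = programmed_time_seconds()
print(f"Programmed time: {total_s} s ({total_s / 60:.1f} min)")
print(f"2X master mix in a 5 ul reaction: {master_mix_volume_ul(5.0)} ul")
```

The actual run time on the instrument will be longer than the programmed time because of ramping.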

Table 6 Nest primers used in the PCR reaction to characterize CCL3L1 and CCL3L3

Fragment Name  NST5'                NST3'                          N-LEN  N_GC%
CCL3L1_1       AGTTAGAGAGTAGCTCAGA  CCCTGTTTTTCTATCTGTAC           601    50
CCL3L1_2       CCACGTGAGTCCATGTT    AGGCATTTGGGGGGT                577    47
CCL3L1_3       TCTCCAGCTCACCC       ACACCTCAGTGCCC                 616    53
CCL3L1_4       TGCCCTCCTCAACCA      GAGGAAGAGTTAAGCAC              611    59
CCL3L1_5       GAGTGAGGTGGGTG       AACACACTGTGAAATCAAAAATAAATTAT  571    54
CCL3L1_6       AGTGGGTCCAGAAATAC    AAGTGGATAACTCTGTCG             597    46

Table 7 Boost primers used in the PCR reaction to characterize CCL3L1 and CCL3L3

Fragment Name  BST5'                  BST3'               B-LEN  B_GC%
CCL3L1_1       CAGGAGAAACCCCATG       TGGGAGACCTAGGGT     744    50
CCL3L1_2       TCTGCAACCAGGTCC        CTTCTGATCCCTGAGTG   655    47
CCL3L1_3       GGGTTCAAAACGAATCAGTTT  CTGACTCTGTAACACCC   732    53
CCL3L1_4       GACATTTCTCTGCAAAACC    CCTGCCGGCCTCT       705    59
CCL3L1_5       CTGAGCTGTGACTCG        CACTGTGAGGGAAGGT    936    54
CCL3L1_6       ACAGCTTCCTAACCAAGA     CTAGAGCGTGCATATTAC  714    46

Figure 2 Schematic of Boost/Nest sequencing approach. Primers are aligned to the CCL3L1 gene on chromosome 17, including a 250 bp flanking region on each side.

4.4 Results

The combined sequencing of CCL3L1/CCL3L3 in the 24 Yoruba individuals from HapMap identified a total of 17 SNPs relative to the minus strand of hg19, which served as the reference sequence. Five of the SNPs have previously been reported in the NCBI database (dbSNP build 131): rs2944 (chromosome 17 position 34,624,169), rs17850251 (34,624,269), rs1804185 (34,624,839), rs1828283 (34,625,253) and rs2073462 (34,625,500). Three of the five known SNPs (rs2944, rs1828283 and rs2073462) are located in intronic regions of the CCL3L1/CCL3L3 genes. rs17850251 is a nonsynonymous SNP (it changes the codon to one that codes for a different amino acid), while rs1804185 is a synonymous SNP (it changes a codon to a different codon for the same amino acid).

The reference allele for rs2944 is T; of the 24 Yoruba individuals, four were homozygous for the reference allele, one was homozygous for the alternate allele C, and 19 were heterozygotes or mixed base-calls. For rs17850251 (reference allele T), 16 individuals were homozygous for the reference allele, none were homozygous for the alternate allele C, seven were heterozygotes, and allele information was not determined for one individual. For rs1804185 (reference allele A), 12 individuals were homozygous for the reference allele, none were homozygous for the alternate allele, 11 were heterozygotes, and allele information was not determined for one individual. For rs1828283 (reference allele G), three individuals were homozygous for the reference allele, four were homozygous for the alternate allele C, and 17 were heterozygotes. For rs2073462 (reference allele C), three individuals were homozygous for the reference allele, two were homozygous for the alternate allele T, and 19 were heterozygotes. The genotype table of known SNPs is shown in Table 8.

For locations at which variation was observed in at least one sample, quality values (QVs) are given in brackets to the right of each base-call. In most cases, these quality scores are determined by assessing multiple electropherograms (usually a forward and a reverse sequencing read). For homozygous calls (A, C, G, or T), the estimated confidence in the call (as opposed to some other call, including heterozygote calls) is 90% for QV=20, 97% for QV=30, 99% for QV=40, 99.7% for QV=50, and 99.9% for QV=60. For heterozygote calls (R, Y, M, K, W, and S), the estimated confidence in the call (as opposed to any other possible call) is 90% for QV=10, 97% for QV=15, 99% for QV=20, 99.7% for QV=25, and 99.9% for QV=30.

Table 10 shows the population frequencies of the five known SNPs found during sequencing. Reference SNPs rs2944, rs1828283 and rs2073462 have different reference alleles in CCL3L1 and CCL3L3. In NCBI's reference sequences, rs2944 is listed as a T in CCL3L1 (NG_023369.1) and as a C in CCL3L3 (NG_023325.1). Population frequency data submitted to NCBI's database for a single Caucasian individual show 50% each for the T and C alleles. For rs1828283, CCL3L1 carries a G allele (NG_023369.1) while CCL3L3 carries a C allele (NG_023325.1). Population frequency information exists for two Caucasians, one with 50% each of the G and C alleles and the other with 100% G allele. For rs2073462, CCL3L1 is listed with a C allele (NG_023369.1) while CCL3L3 is listed with a T allele (NG_023325.1). The population frequency in two Caucasians is 100% for the C allele. In 752 unrelated Japanese individuals, the C allele frequency is 49% and the T allele frequency is 51%.
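The quality-value scale quoted above for homozygous and heterozygote calls can be encoded as a lookup. The sketch below is illustrative only: the function names are hypothetical, and treating each tabulated QV as a threshold (so that a call with QV 29 is assigned the confidence listed for QV 25) is an assumption, since the text gives confidence only at the tabulated values.

```python
import re

# Confidence (%) at the tabulated quality values, as quoted in Section 4.4.
HOMOZYGOUS_CONFIDENCE = {20: 90.0, 30: 97.0, 40: 99.0, 50: 99.7, 60: 99.9}
HETEROZYGOUS_CONFIDENCE = {10: 90.0, 15: 97.0, 20: 99.0, 25: 99.7, 30: 99.9}

# A base-call as printed in Tables 8 and 9, e.g. "Y(29)" or "T(65)".
CALL_PATTERN = re.compile(r"^([ACGTRYMKWS])\((\d+)\)$")


def parse_call(text: str):
    """Split a call such as 'Y(29)' into its base code and quality value."""
    match = CALL_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized base-call: {text!r}")
    return match.group(1), int(match.group(2))


def confidence_floor(text: str):
    """Highest tabulated confidence whose QV threshold the call reaches.

    Returns None when the QV is below the lowest tabulated value.
    Assumption: tabulated QVs are treated as thresholds.
    """
    base, qv = parse_call(text)
    table = HOMOZYGOUS_CONFIDENCE if base in "ACGT" else HETEROZYGOUS_CONFIDENCE
    reached = [conf for threshold, conf in table.items() if qv >= threshold]
    return max(reached) if reached else None


for call in ("T(65)", "Y(29)", "A(33)", "Y(2)"):
    print(call, confidence_floor(call))
```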

The cumulative genotype table of the novel SNPs found is shown in Table 9. A total of 12 novel SNPs were identified; three of these, at chromosome 17 positions 34,611,029, 34,611,160 and 34,611,199, are located in the flanking region of CCL3L1 and CCL3L3. The other novel SNPs are located in intronic regions, except for the SNP at chromosome position 34,624,348, which is a nonsynonymous SNP. The summary of variants found and the characteristics of the coding SNPs are given in Appendices D and E.

Table 8 Cumulative genotype table of known SNPs discovered

Chromosome Position  34,624,169  34,624,269  34,624,839  34,625,253  34,625,500
Reference SNP ID     rs2944      rs17850251  rs1804185   rs1828283   rs2073462
Gene                 CCL3L1, CCL3L3 (all five SNPs)
Reference Allele     T           T           A           G           C
Y1                   Y(29)       Y(2)        A(33)       S(17)       Y(26)
Y2                   T(65)       T(62)       A(64)       G(65)       T(49)
Y3                   Y(35)       T(31)       R(21)       S(24)       Y(16)
Y4                   T(66)       T(63)       R(40)       S(30)       Y(31)
Y5                   C(65)       T(67)       A(67)       C(66)       C(64)
Y6                   Y(33)       T(64)       A(67)       S(24)       Y(2)
Y7                   T(59)       T(65)       R(36)       C(68)       Y(39)
Y8                   T(67)       Y(2)        R(34)       S(31)       Y(4)
Y9                   Y(41)       Y(10)       A(62)       G(66)       Y(30)
Y10                  Y(30)       Y(26)       R(17)       S(11)       Y(16)
Y11                  Y(38)       Y(2)        R(15)       S(18)       Y(17)
Y12                  Y(39)       T(64)       R(35)       S(23)       Y(2)
Y13                  Y(38)       T(33)       R(12)       S(3)        Y(30)
Y14                  Y(34)       Y(33)       R(39)       C(68)       Y(26)
Y15                  Y(32)       -           A(33)       C(37)       C(54)
Y16                  Y(36)       Y(6)        A(64)       S(33)       T(66)
Y17                  Y(26)       T(44)       A(66)       S(21)       Y(27)
Y18                  Y(37)       T(62)       R(23)       S(13)       Y(8)
Y19                  Y(33)       T(43)       A(66)       G(68)       Y(20)
Y20                  Y(26)       T(62)       -           S(25)       Y(2)
Y21                  Y(24)       T(51)       A(66)       S(25)       Y(23)
Y22                  Y(8)        T(64)       R(16)       S(6)        C(29)
Y23                  Y(37)       T(26)       A(64)       S(21)       Y(5)
Y24                  Y(42)       T(64)       A(64)       S(32)       Y(18)

R = A/G, Y = C/T, S = C/G; - = not determined
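The genotype counts reported in Section 4.4 can be reproduced from the base-calls in Table 8. The sketch below tallies the rs2944 column; the function name and the category labels are hypothetical, and the calls are entered without their quality values.

```python
from collections import Counter

# IUPAC ambiguity codes for mixed (two-base) calls.
IUPAC_HET = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def tally_calls(calls, ref: str, alt: str) -> Counter:
    """Count reference, alternate and mixed base-calls for one SNP."""
    het_code = IUPAC_HET[frozenset((ref, alt))]
    labels = {ref: "reference", alt: "alternate", het_code: "mixed"}
    return Counter(labels.get(call, "other") for call in calls)


# rs2944 calls for samples Y1 to Y24, read from Table 8 (Y9 to Y24 are all Y).
RS2944_CALLS = ["Y", "T", "Y", "T", "C", "Y", "T", "T"] + ["Y"] * 16

counts = tally_calls(RS2944_CALLS, ref="T", alt="C")
print(counts)
```

For rs2944 this gives 4 reference, 1 alternate and 19 mixed calls, matching the counts stated in the text.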

Table 9 Cumulative genotype table of novel SNPs discovered

Chromosome Position: 34,611,029 34,611,160 34,611,199 34,623,978 34,624,086 34,624,187 34,624,208 34,624,242 34,624,348 34,624,964 34,625,070 34,625,723
Reference SNP ID: Novel* Novel* Novel* Novel Novel Novel Novel Novel Novel Novel Novel Novel
Reference Allele: A C A A A G G G T T T G

Y1 R(2) C(54) A(45) A(60) R(9) G(56) G(55) G(56) T(65) T(31) T(34) R(23)
Y2 A(66) C(65) A(65) A(66) A(66) G(58) G(56) G(58) T(66) T(62) Y(18) R(22)
Y3 R(3) C(65) A(54) R(18) A(68) G(4) G(28) G(14) T(33) T(62) T(59) R(28)
Y4 R(37) C(64) A(66) A(66) A(67) G(59) G(54) G(54) T(66) T(64) Y(29) R(34)
Y5 A(63) C(59) W(7) A(64) A(68) G(59) G(54) G(59) T(68) T(60) T(57) G(66)
Y6 A(66) C(66) W(6) R(5) A(68) G(63) G(57) G(57) T(66) T(62) T(59) R(2)
Y7 A(66) C(63) A(68) A(67) A(66) G(57) G(55) G(57) T(65) K(34) T(59) G(55)
Y8 R(35) C(66) A(67) R(7) A(64) G(59) G(57) G(59) T(66) T(62) T(61) R(7)
Y9 R(13) C(63) A(67) A(64) A(66) G(58) G(57) G(58) Y(2) T(62) T(59) R(25)
Y10 R(14) Y(23) A(65) A(66) A(63) G(62) R(16) G(62) Y(17) T(64) T(61) R(23)
Y11 R(5) C(64) A(66) R(22) A(64) G(60) G(59) G(60) T(66) T(62) T(57) R(10)
Y12 R(11) C(46) A(62) A(66) A(68) R(9) G(55) R(5) T(64) T(71) T(59) R(8)
Y13 A(66) C(64) A(66) R(2) R(8) G(31) G(29) G(29) T(33) T(62) T(63) R(12)
Y14 R(33) C(62) A(65) A(66) A(66) R(26) G(59) G(59) T(66) T(62) T(59) G(66)
Y15 A(66) C(65) A(66) A(62) A(66) T(30) T(26) G(57)
Y16 R(31) C(64) A(66) A(65) A(66) G(59) G(57) G(59) T(66) T(58) T(57) G(52)
Y17 R(26) C(54) A(67) A(66) A(64) G(36) G(57) G(36) T(52) T(64) T(57) G(56)
Y18 A(68) C(63) W(2) R(27) A(66) G(57) G(57) G(57) T(66) T(62) T(62) G(66)
Y19 R(15) C(65) A(66) R(15) A(68) G(59) G(55) G(59) T(67) T(62) T(59) R(15)
Y20 R(33) C(64) W(9) A(66) A(64) G(56) G(57) G(56) T(65) G(62)
Y21 R(20) C(65) W(19) A(68) A(66) G(56) G(52) G(56) T(54) T(60) T(57) G(64)
Y22 A(66) C(65) A(66) R(2) A(65) G(55) G(57) G(55) T(70) T(62) T(70) G(55)
Y23 A(68) C(64) W(2) A(64) A(64) G(57) G(57) G(57) Y(2) T(61) T(59) R(16)
Y24 R(16) C(64) A(64) A(66) A(67) G(57) G(57) G(57) T(66) T(65) T(59) G(57)

*SNPs in flanking region of sequence

Table 10 Population frequency distribution of known SNPs discovered

Reference SNP ID  Variant Function       Reference Allele  CCL3L1  CCL3L3  Population Frequency  Ethnicity
rs2944            Intronic               T                 T       C       T: 50%, C: 50%        Caucasian
rs1828283         Intronic               G                 G       C       G: 50%, C: 50%        Caucasian
                                                                           G: 100%               Caucasian
rs2073462         Intronic               C                 C       T       C: 100%               Caucasian
                                                                           C: 49%, T: 51%        Japanese (Unrelated)
rs1804185         Coding, Synonymous     A                 A       A       A: 49%, G: 51%        Caucasians
rs17850251        Coding, Nonsynonymous  T                 T       T       Information Unavailable

4.5 Discussion

The major objective for this phase of the project was to characterize the nucleotide structures of CCL3L1 and CCL3L3, in essence to provide a means of accurately distinguishing between these two genes. This was a significant challenge because of the very high homology between the genes. Notably, three of the nucleotides that could be used to distinguish between CCL3L1 and CCL3L3 were found as SNPs during this characterization (rs2944, rs1828283 and rs2073462). It was thus important to determine whether these SNPs represent differences between the genes or variation within the genes. In the 24 Yoruba samples sequenced, heterozygous calls were found in over 70% of individuals for each of these SNPs. Assuming no deviations, the data suggest that Hardy-Weinberg frequencies were maintained and that the nucleotide differences could be the result of copy number differences between the genes or of other factors such as mutations.

No inferences can be made from the publicly available population frequencies of rs2944 or rs1828283 because of the very limited sample sizes (a maximum of two individuals). For rs2073462, however, an almost equal frequency of alleles (C: 49%, T: 51%) was observed in 752 unrelated Japanese individuals. The position on chromosome 17 corresponding to rs2073462 is characterized as a C allele for CCL3L1 and as a T allele for CCL3L3 in NCBI's reference sequences. Although the Japanese represent a single ethnicity, these results may suggest that the nucleotide differences are between CCL3L1 and CCL3L3 rather than within each gene.
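A comparison between observed genotype counts and Hardy-Weinberg expectations can be sketched as follows, using the rs2944 counts reported in Section 4.4 (4 reference, 19 mixed, 1 alternate). This is an illustrative calculation under the standard biallelic, single-locus model, with a hypothetical function name. Because CCL3L1 and CCL3L3 are sequenced together, a mixed base-call may reflect a difference between gene copies rather than a true heterozygote, so the single-locus assumption is itself open to question here.

```python
def hardy_weinberg_expected(n_ref_hom: int, n_het: int, n_alt_hom: int):
    """Expected genotype counts under Hardy-Weinberg proportions.

    Allele frequencies are estimated from the observed counts, assuming a
    biallelic SNP at a single diploid locus (an assumption of this sketch).
    """
    n = n_ref_hom + n_het + n_alt_hom
    p = (2 * n_ref_hom + n_het) / (2 * n)   # reference allele frequency
    q = 1.0 - p                             # alternate allele frequency
    return p * p * n, 2 * p * q * n, q * q * n


observed = (4, 19, 1)  # rs2944: reference, mixed, alternate calls
expected = hardy_weinberg_expected(*observed)
for label, obs, exp in zip(("ref/ref", "ref/alt", "alt/alt"), observed, expected):
    print(f"{label}: observed {obs}, expected {exp:.2f}")
```

A formal test (for example a chi-square or exact test) would be needed to judge whether any difference between the observed and expected counts is significant.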

An important outcome of this characterization effort is that one of the novel SNPs found (chromosome 17 position 34,624,348) falls in the reverse primer and forward primer regions of the assays used by Shao et al. and Townson et al., respectively, to quantify CCL3L1 gene copy number by RT-PCR [115, 126], although this SNP has a frequency of only 6% in the 24 Yoruba samples. It has previously been reported that the primers and probes used in current RT-PCR methods to quantify CCL3L1 gene copy number are not specific and may result in inaccurate quantification [122]. The current findings provide additional grounds for caution in using these specific primers to quantify CCL3L1 gene copy number by RT-PCR.

In summary, the SNPs found will play a crucial role moving forward, and the SNP information determined in this project will serve several purposes. First, one of the central themes of this thesis was to develop a pyrosequencing assay that could be used to accurately determine CCL3L1 copy number. Since the effectiveness of a pyrosequencing primer in accurately capturing the target gene depends on avoiding potential SNP sites within the genomic sequence, the novel SNP data will contribute to developing highly specific primers. Second, as previously mentioned, the probes and primers currently used in RT-PCR assays to quantify CCL3L1 gene copy number are not specific. The gene characterization findings from this study will therefore be essential in developing new primers and probes that specifically target CCL3L1, including for other quantification approaches such as RT-PCR and the Paralogue Ratio Test (PRT). Finally, the current findings add significantly to the literature and could serve as a basis for future characterization efforts.
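The 6% frequency quoted for the novel SNP at position 34,624,348 is consistent with a simple allele count. In the sketch below, the three mixed (Y) calls at this position are read from Table 9 (samples Y9, Y10 and Y23); treating all 24 samples as diploid and all other samples as carrying only the reference allele are assumptions of this illustration, and the function name is hypothetical.

```python
def alternate_allele_frequency(n_het: int, n_alt_hom: int, n_samples: int) -> float:
    """Alternate allele frequency from genotype counts in diploid samples."""
    return (n_het + 2 * n_alt_hom) / (2 * n_samples)


# Three mixed (Y) calls, no alternate homozygotes, 24 samples.
frequency = alternate_allele_frequency(n_het=3, n_alt_hom=0, n_samples=24)
print(f"{frequency:.4f}")
```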

CHAPTER 5 PYROSEQUENCING

5.1 Introduction

The pyrosequencing assay design was based on the characterization of the CCL3L1 and CCL3L3 nucleotide sequences. It comprised four assays that could be used in combination to determine the gene copy number of all four genes within the chromosome 17q12 cluster (CCL3, CCL3L1, CCL3L2 and CCL3L3). Three methods are proposed for quantifying gene copy number. The first involves generating vector/graph coordinates by plotting various combinations of the assays to produce clusters from which gene copy number can be determined. The second uses CCL3, which is known to have two copies per diploid genome, as the reference for quantifying the copy number of the other genes within the chromosome 17q12 cluster (CCL3L1, CCL3L2 and CCL3L3). The third combines gene copy number from RT-PCR assays with gene copy number from the pyrosequencing assays run on the same samples to generate a single gene copy number estimate. Pyrosequencing inherently does not distinguish between the CCL3L1 and CCL3L3 genes; hence quantification results are classified as CCL3L1 or CCL3L3.
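The second method, which uses the two diploid copies of CCL3 as the reference, can be illustrated with a short calculation. This is a sketch under stated assumptions rather than the analysis used in the thesis: it assumes that the target and CCL3 templates amplify with equal efficiency and that the pyrosequencing signal at a gene-distinguishing nucleotide is proportional to the number of template copies. The function name and the `target_fraction` parameter are hypothetical.

```python
CCL3_COPIES = 2  # CCL3 copies per diploid genome, as stated in Section 5.1


def copy_number_from_fraction(target_fraction: float,
                              reference_copies: int = CCL3_COPIES) -> float:
    """Estimate target gene copy number from its share of the signal.

    target_fraction is the share of the combined signal (target plus CCL3)
    attributed to the target-specific nucleotide, so that
    target_fraction = n / (n + reference_copies) for n target copies.
    """
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target_fraction must be in the range [0, 1)")
    return reference_copies * target_fraction / (1.0 - target_fraction)


for fraction in (0.0, 0.5, 0.6):
    estimate = copy_number_from_fraction(fraction)
    print(f"fraction {fraction:.2f} -> about {estimate:.2f} copies")
```

The estimate becomes increasingly sensitive to measurement error as the target fraction approaches 1, which limits the resolution at high copy numbers.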

5.2 Background

Pyrosequencing is a non-electrophoretic, real-time DNA sequencing technology [144, 145] that provides a direct sequencing method for the detection of single nucleotide polymorphisms. A primer is hybridized to a single-stranded PCR template, and the sequencing analysis is started by the addition of nucleotides. The nucleotides are added sequentially, and through coupled enzymatic reactions the polymerase-catalyzed incorporation of nucleotides can be monitored as light peaks in a pyrogram. Pyrosequencing has previously been used successfully to determine gene copy numbers [146, 147]. Peak heights in a pyrogram are directly proportional to the light emitted from the pyrosequencing reaction and can therefore be quantified.

In summary, a four-enzyme cocktail is used: DNA polymerase, sulfurylase, luciferase and apyrase. A sequencing primer placed adjacent to the SNP to be analyzed (usually within 3 nucleotides) is annealed to a single-stranded DNA template. Sequence elongation occurs when a nucleotide is added to the mix and incorporated into the DNA strand by DNA polymerase. Pyrophosphate is released and converted into ATP by sulfurylase. The ATP is then used by luciferase in the conversion of luciferin to oxyluciferin, which produces light. The unused nucleotides are removed by apyrase, and the reaction is ready for the next round of nucleotide additions. The amount of light produced by luciferase is directly proportional to the amount of nucleotide incorporated into the DNA strand.

5.3 Assay Design

The design of the assay involved four genes, CCL3, CCL3L1, CCL3L2 and CCL3L3, all located on chromosome 17q12. All four genes were aligned to find loci that are identical except for a single nucleotide difference, for PCR primer positioning. Four assay designs involving amplification of various gene pairs were developed to enable accurate quantification of CCL3L1. The design of the four assays also makes it possible to quantify the other three genes in the cluster. A PCR design in which only CCL3L1 was amplified is referred to as CCL3L1 vs ALL. The primer design in which only CCL3 was amplified is referred to as CCL3 vs ALL. The primer design in which both CCL3L1 and CCL3L2 were amplified is referred to as CCL3L1+CCL3L2 vs ALL. Finally, a primer design in which both CCL3L1 and CCL3L3 were amplified in reference to CCL3 is denoted CCL3L1+CCL3L3 vs CCL3 (Figure 3).
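The proportionality between peak height and incorporated nucleotides described in Section 5.2, on which the four assay designs rely, can be illustrated with an idealized pyrogram simulation. This is a sketch only: the template sequences and dispensation orders are made up for illustration and are not taken from the CCL3 gene cluster, the function names are hypothetical, and real pyrograms include noise and non-linear homopolymer responses that are ignored here.

```python
def simulate_pyrogram(sequence_to_synthesize: str, dispensation_order: str):
    """Idealized peak heights for one template.

    sequence_to_synthesize is the strand extended from the sequencing primer
    (5' to 3'). Each dispensed nucleotide is incorporated as many times as it
    matches consecutive next positions, giving a peak of that height.
    """
    position = 0
    peaks = []
    for nucleotide in dispensation_order:
        height = 0
        while (position < len(sequence_to_synthesize)
               and sequence_to_synthesize[position] == nucleotide):
            height += 1
            position += 1
        peaks.append((nucleotide, height))
    return peaks


def mixed_peaks(templates: dict, dispensation_order: str):
    """Sum of peak heights over several templates, weighted by copy number."""
    totals = [0] * len(dispensation_order)
    for sequence, copies in templates.items():
        for i, (_, height) in enumerate(simulate_pyrogram(sequence, dispensation_order)):
            totals[i] += copies * height
    return totals


# Single made-up template: a two-base homopolymer gives a double-height peak.
print(simulate_pyrogram("GGTAC", "ACGTACGT"))

# Two made-up templates differing at the first base, present at 2 and 1 copies.
print(mixed_peaks({"CA": 2, "TA": 1}, "CTA"))
```

In the second example the C and T peaks have heights 2 and 1, reflecting the 2:1 ratio of the two templates, while the shared A peak has height 3.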

Figure 3. Gene alignment depicting the nucleotide positions used in the primer design process for the pyrosequencing assay.