Whole-genome detection of disease-associated deletions or excess homozygosity in a case control study of rheumatoid arthritis


HMG Advance Access published December 21, 2012
Human Molecular Genetics, 2012, 1–13
doi:10.1093/hmg/dds512

Chih-Chieh Wu 1,*, Sanjay Shete 2, Eun-Ji Jo 4, Yaji Xu 5, Emily Y. Lu 3, Wei V. Chen 3 and Christopher I. Amos 3,6

1 Department of Epidemiology, 2 Department of Biostatistics and 3 Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA, 4 Duncan Cancer Center, Baylor College of Medicine, Houston, TX, USA, 5 Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, USA, and 6 Department of Community and Family Medicine, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA

Received July 23, 2012; Revised November 1, 2012; Accepted November 29, 2012

Unlike genome-wide association studies, few comprehensive studies of copy number variation's contribution to complex human disease susceptibility have been performed. Copy number variations are abundant in humans and represent one of the least well-studied classes of genetic variants; in addition, known rheumatoid arthritis susceptibility loci explain only a portion of familial clustering. Therefore, we performed a genome-wide study of association between deletion or excess homozygosity and rheumatoid arthritis using high-density 550 K SNP genotype data from a genome-wide association study. We used a genome-wide statistical method that we recently developed to test each contiguous SNP locus between 868 cases and 1194 controls to detect excess homozygosity or deletion variants that influence susceptibility. Our method is designed to detect statistically significant evidence of deletions or homozygosity at individual SNPs for SNP-by-SNP analyses and to combine the information among neighboring SNPs for cluster analyses.
In addition to successfully detecting the known deletion variants in the major histocompatibility complex, we identified 4.3 and 28 kb clusters on chromosomes 10p and 13q, respectively, which were significant at a Bonferroni-type-corrected 0.05 nominal significance level. Independently, we performed analyses using PennCNV, an algorithm for identifying and cataloging copy numbers for individuals based on a hidden Markov model, and identified cases and controls that had chromosomal segments with copy number <2. Using Fisher's exact test to compare the numbers of cases and controls with copy number <2 per SNP, we identified 26 significant SNPs (protective; more controls than cases) aggregating on chromosome 14 with $P$-values $< 10^{-8}$.

*To whom correspondence should be addressed at: Department of Epidemiology, Unit 1340, The University of Texas MD Anderson Cancer Center, 1155 Pressler Street, Houston, TX 77030, USA. Tel: +1 7137453977; Fax: +1 7137928261; Email: ccwu@mdanderson.org

© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

INTRODUCTION

Studies of the human genome have demonstrated extensive and widespread copy number variations (CNVs) of DNA sequences, such as deletions, insertions, duplications and complex multi-site variants, that indicate the presence of variable numbers of copies of large genomic regions (mostly >1 kb in size) among individuals. Comprehensive whole-genome reference maps of human CNVs have been constructed using SNP microarrays and array comparative genomic hybridization (1–3). Genomic deletions represent a variant class that is often associated with disease. Three concurrent studies that specifically investigated common deletion polymorphisms in healthy individuals demonstrated that deletion variants of various sizes are ubiquitous; they also provided comprehensive maps of deletions in the human genome (4–6). These studies provided important baseline information to enable the discovery of CNV classes and facilitate whole-genome studies of associations between disease and CNVs.

Deletion variants have long been known to cause microdeletion syndromes, such as DiGeorge syndrome, Prader-Willi syndrome and Wilms tumor (7), and are frequently observed in patients with neurodevelopmental disorders, such as autism and schizophrenia (8–11). Recent discoveries include a common 20 kb deletion upstream of the IRGM gene that is associated with Crohn's disease, a 45 kb deletion upstream of NEGR1 that is associated with body mass index, and a deletion and duplication of KIR that is associated with HIV-1 control (12–14). A study of CNVs as trait-associated polymorphisms and expression quantitative trait loci that influence phenotype by altering gene regulation demonstrated that they contribute to the genetics of certain disease classes, such as autoimmune disorders and metabolic traits (15). However, controversy exists; it has yet to be fully ascertained to what extent CNVs account for the missing heritability that is undetected by genome-wide association studies (1,15–17). In fact, compared with genome-wide association studies, few comprehensive whole-genome studies exist of the contribution of CNVs to susceptibility across a wide variety of common, complex human diseases (15,17). CNVs remain one of the least well-studied classes of genetic variants. More recently, a nucleotide-resolution map of CNVs based on whole-genome DNA sequencing data from 185 individuals in the 1000 Genomes Project was constructed, enabling the discovery, genotyping and imputation of CNVs and serving as a resource for sequencing-based association studies (18).

Rheumatoid arthritis (RA) is a common autoimmune disorder of unknown etiology; it is characterized by the destruction of the synovial joints, resulting in severe disability. It has a complex mode of inheritance and is influenced by both genetic and environmental risk factors. It affects 1% of individuals of European ancestry, with an estimated sibling recurrence risk of 5–10 (19–21).
In addition to the established susceptibility loci of HLA-DRB1 and PTPN22 (protein tyrosine phosphatase, non-receptor type 22) in patients with severe anti-CCP-positive RA, several associated alleles of modest risk at newly identified loci have been reproducibly discovered in recent genome-wide association studies, including REL, STAT4, TNFAIP3 and BLK. On the basis of estimates from a recent meta-analysis, validated RA risk alleles at major histocompatibility complex (MHC) and non-MHC loci explained 12 and 4% of phenotypic variance, respectively; a large portion of heritable variation remains to be discovered (22).

We recently developed a genome-wide statistical method for detecting disease-associated deletion variants or excess homozygosity using high-density SNP genotype data from genome-wide association studies (23). Our method is based on identifying areas in which the homozygosity of cases differs from that of controls and is structured to test each contiguous SNP locus across the whole genome between a group of cases and a group of controls from a genome-wide association study. The method has proved to be useful and robust in the presence of linkage disequilibrium. It provides outcomes for SNP-by-SNP analyses and for cluster analyses on the basis of combined evidence from multiple neighboring SNPs in case-control studies. Genome-wide association studies are designed to discover individual disease-associated SNPs; in contrast, methods for detecting CNVs and deletions are generally designed to find small chromosomal segments (4,6,23–25).

In this study, we used our method to perform a comprehensive genome-wide study of associations between common deletion variants or excess homozygosity and RA susceptibility using an Illumina HumanHap550 array in 868 RA patients and 1194 controls from the North American Rheumatoid Arthritis Consortium (20).
The SNP-by-SNP analyses identified individual significant SNPs over the whole genome at a nominal significance level of $10^{-8}$; the cluster analyses detected candidate deleted segments in which at least two neighboring significant SNPs were overly aggregated. In addition to successfully detecting known deleterious deletion variants in the HLA-DRB1 and C4 genes that increase RA risk in the MHC region, we identified additional 4.3 and 28 kb clusters on chromosomes 10p (5 316 846–5 321 159) and 13q (20 783 404–20 811 429), respectively, which were significant at a 0.05 nominal significance level after correction for multiple comparisons.

Independently, we performed analyses using the PennCNV method and identified cases and controls that had chromosomal segments with copy number <2. PennCNV is an algorithm for identifying and cataloging copy numbers for individuals on the basis of a hidden Markov model (25). Using Fisher's exact test to compare the numbers of cases and controls per SNP, we identified 26 significant SNPs (protective; more controls than cases) that were overly aggregated on chromosome 14 with $P$-values $< 10^{-8}$ and an additional 49 SNPs on chromosomes 2, 14 and 20 with $P$-values between $10^{-8}$ and $10^{-5}$.

In this report, we extend genome-wide association studies to the detection of deletions and excess homozygosity in order to find additional common genetic variants that influence RA susceptibility. We also provide a strategy and analytical framework that can be used at no additional cost: using SNP and intensity data from genome-wide association studies to detect disease-associated deletion variants or excess homozygosity and to identify individual patients with commonly shared disease-associated deletion variants.

RESULTS

For SNP-by-SNP analyses, we performed the z-score test to assess the statistical significance of differences in homozygosity proportions between the 868 cases and 1194 controls at each of the 550 K contiguous SNP loci.
We found that 535 individual SNPs reached genome-wide significance (defined as $P < 10^{-8}$). Table 1 shows the numbers of SNPs genotyped, missing SNPs, SNPs tested and significant SNPs by chromosome and arm. The number of SNPs tested is the difference between the number of SNPs genotyped and the number of missing SNPs. Figure 1 displays a graphical summary of the genome-wide association scan between deletion variants or excess homozygosity and RA risk, in which SNPs are plotted according to their chromosomal locations against the values of $-\log_{10}(P)$. The largest association signal lies in the MHC region, with a maximal aggregation of neighboring significant SNPs. We identified the deleterious deletion variants that encompassed the HLA-DRB1 and C4 genes in the MHC region, in which deletions and CNVs were previously discovered in RA patients (26,27).

Deletions in the HLA-DRB1 region are a common characteristic of HLA class II haplotypes, and the major DR4 and DR9 haplotypes associated with RA belong to a related haplotype family with multiple DRB loci, including several pseudogenes. In contrast, the DR1 haplotypes associated with RA are members of a distinct family of haplotypes that have fewer DRB loci. Thus, we would expect to find copy number differences in DRB genes between RA cases and controls, which contain haplotype families with more variable numbers of DRB loci.

Table 1. Frequencies of SNPs, missing SNPs, SNPs tested and significant SNPs by chromosome and arm

| Chromosome | Arm | Number of SNPs genotyped | Number of missing SNPs | Number of SNPs tested (a) | Number of significant SNPs (b) |
|---|---|---|---|---|---|
| 1 | p | 21 533 | 81 | 21 452 | 19 |
| 1 | q | 19 396 | 104 | 19 292 | 13 |
| 2 | p | 18 526 | 98 | 18 428 | 9 |
| 2 | q | 25 564 | 105 | 25 459 | 22 |
| 3 | p | 18 457 | 55 | 18 402 | 8 |
| 3 | q | 18 233 | 94 | 18 139 | 7 |
| 4 | p | 9488 | 35 | 9453 | 3 |
| 4 | q | 23 140 | 105 | 23 035 | 10 |
| 5 | p | 9106 | 37 | 9069 | 11 |
| 5 | q | 24 506 | 96 | 24 410 | 13 |
| 6 | p | 13 964 | 47 | 13 917 | 81 |
| 6 | q | 21 610 | 67 | 21 543 | 12 |
| 7 | p | 13 249 | 30 | 13 219 | 9 |
| 7 | q | 15 995 | 69 | 15 926 | 14 |
| 8 | p | 12 222 | 64 | 12 158 | 3 |
| 8 | q | 18 768 | 74 | 18 694 | 13 |
| 9 | p | 10 878 | 31 | 10 847 | 6 |
| 9 | q | 15 250 | 45 | 15 205 | 19 |
| 10 | p | 9616 | 18 | 9598 | 7 |
| 10 | q | 18 715 | 53 | 18 662 | 21 |
| 11 | p | 10 550 | 49 | 10 501 | 11 |
| 11 | q | 15 927 | 46 | 15 881 | 13 |
| 12 | p | 8048 | 43 | 8005 | 7 |
| 12 | q | 18 317 | 79 | 18 238 | 11 |
| 13 | p | | | | |
| 13 | q | 20 242 | 84 | 20 158 | 11 |
| 14 | p | | | | |
| 14 | q | 17 951 | 62 | 17 889 | 9 |
| 15 | p | | | | |
| 15 | q | 16 166 | 47 | 16 119 | 19 |
| 16 | p | 6382 | 38 | 6344 | 13 |
| 16 | q | 10 078 | 43 | 10 035 | 11 |
| 17 | p | 4526 | 11 | 4515 | 10 |
| 17 | q | 9501 | 31 | 9470 | 17 |
| 18 | p | 3515 | 3 | 3512 | 4 |
| 18 | q | 12 935 | 63 | 12 872 | 9 |
| 19 | p | 3704 | 5 | 3699 | 20 |
| 19 | q | 5532 | 13 | 5519 | 12 |
| 20 | p | 6697 | 17 | 6680 | 7 |
| 20 | q | 7146 | 27 | 7119 | 17 |
| 21 | p | 1 | | 1 | |
| 21 | q | 8050 | 18 | 8032 | 13 |
| 22 | p | | | | |
| 22 | q | 8205 | 33 | 8172 | 21 |
| Total | | | | 529 669 | 535 |

(a) The number of SNPs tested is the difference in counts between SNP genotypes and missing SNPs.
(b) These SNPs were statistically significant by the z-score test at a nominal significance level of $10^{-8}$ for SNP-by-SNP analyses.
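The SNP-by-SNP z-score test described above compares the proportion of homozygous genotype calls at one SNP between cases and controls. The Python sketch below illustrates such a comparison under stated assumptions: it uses a pooled two-proportion z statistic with a two-sided normal P-value, whereas the exact statistic is specified in reference 23 and may differ. The function name and the counts in the usage lines are hypothetical.

```python
import math

def homozygosity_z_test(hom_cases, n_cases, hom_controls, n_controls):
    """Pooled two-proportion z-test on homozygosity proportions (assumed form).

    Returns (z, two_sided_p). A positive z means a higher homozygosity
    proportion in cases than in controls.
    """
    p_cases = hom_cases / n_cases
    p_controls = hom_controls / n_controls
    pooled = (hom_cases + hom_controls) / (n_cases + n_controls)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_cases + 1.0 / n_controls))
    if se == 0.0:
        # Every genotype (or none) is homozygous: nothing to test.
        return 0.0, 1.0
    z = (p_cases - p_controls) / se
    # Two-sided P-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical counts for one SNP: 500 of 868 cases and
# 500 of 1194 controls carry a homozygous genotype.
z, p = homozygosity_z_test(500, 868, 500, 1194)
print(f"z = {z:.2f}, P = {p:.2e}, passes 1e-8 threshold: {p < 1e-8}")
```

Applying the function to every tested SNP and keeping those with $P < 10^{-8}$ would give the list of significant SNPs that the cluster analyses take as input.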
In this study, a cluster was defined as two or more significant SNPs gathered on a short chromosomal segment of predetermined length, on the basis of the outcome of the SNP-by-SNP analysis in the first stage. Because the tagged SNPs genotyped in genome-wide association studies are not uniformly distributed over the whole genome, and because gene-sparse regions may have fewer SNPs genotyped and higher probabilities of containing genomic deletions, we used two different cluster criteria to determine the minimal length of a chromosomal segment that accommodates multiple adjacent significant SNPs. The first criterion is that two successive significant SNPs are separated by 20 or fewer SNP loci; the second is that two successive significant SNPs are separated by a maximum distance of 100 kb. Under these criteria, a cluster begins with a significant SNP locus and ends with another significant SNP locus. Clusters can extend continuously in this way to accommodate more than two significant SNPs. The mean distance between adjacent SNPs was 5.39 kb in this application, so a 20-SNP-locus chromosomal segment spans a mean of 107.8 kb. We previously used extensive simulations to demonstrate that our method is effective at detecting disease-associated deletions and excess homozygosity under these cluster criteria (23).

Cluster analysis under the first criterion

Under the first cluster criterion of two successive significant SNPs separated by no more than 20 SNP loci, we identified 14 distinct clusters of neighboring significant SNPs over the whole genome. Each is described in detail in Table 2. Common variants of the first cluster in the MHC region contributed the strongest statistical signal of risk. We found that 54 significant SNPs were overly aggregated on a short segment of 252 contiguous SNP loci in the first cluster. Excluding the two significant SNPs at the ends of the cluster, 52 significant SNPs lay inside this cluster.
In this case, $T = 13\,917$, $k = 81$, $w = 252$ and $x = 54$ were used for formula (1). The null probability was $p_0 = 5.82 \times 10^{-3}$ $(= 81/13\,917)$, as 13 917 contiguous SNP loci were tested individually on the p arm of chromosome 6 and 81 of them were significant at the nominal significance level of $10^{-8}$ (shown in the 5th and 6th columns and 12th row of Table 1). The exact P-value of this cluster was $\Pr(X \ge 54 \mid w = 252,\ p_0 = 5.82 \times 10^{-3}) = 2.91 \times 10^{-66}$, using formula (1). It is noteworthy that the P-values obtained directly from expression (1) are not corrected for multiple comparisons. The chromosomal segment that encompasses this cluster could occur at other locations along chromosome 6p; we must take this into account when assessing statistical significance using this test for cluster analyses. We used a Bonferroni-type correction to adjust the P-value by multiplying it by the ratio of $T$ (the total number of SNPs tested over a chromosomal region) to $w$ (the number of SNPs that encompass the cluster of interest) (23). In this case, the corrected P-value is $1.61 \times 10^{-64}$ $(= 2.91 \times 10^{-66} \times 13\,917/252)$. Because only one cluster is present on chromosome 6p, we also used the scan test and obtained a P-value of $3.43 \times 10^{-70}$. Both the cluster test and the scan test demonstrated that this 627 kb clustering segment on chromosome 6p (32 182 782–32 810 427) is highly significant.
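The arithmetic in this paragraph can be checked with a short Python sketch. Formula (1) is not reproduced in this section, so the sketch assumes that it is the upper tail of a binomial distribution with $w$ trials and success probability $p_0 = k/T$; this assumption appears consistent with the cluster-test P-values reported in Table 2 for the 6p, 10p and 13q clusters, but it should be confirmed against reference 23. The function names are illustrative.

```python
import math

def cluster_p_value(T, k, w, x):
    """Pr(X >= x) for X ~ Binomial(w, p0), p0 = k / T (assumed form of formula (1)).

    T -- SNPs tested on the chromosome arm
    k -- significant SNPs on that arm
    w -- SNP loci spanned by the cluster
    x -- significant SNPs inside the cluster
    """
    p0 = k / T
    return sum(math.comb(w, j) * p0**j * (1.0 - p0)**(w - j)
               for j in range(x, w + 1))

def corrected_p_value(p, T, w):
    """Bonferroni-type correction described in the text: multiply P by T / w."""
    return p * T / w

# (T, k, w, x) taken from Tables 1 and 2 for three clusters.
for label, T, k, w, x in [("6p", 13917, 81, 252, 54),
                          ("10p", 9598, 7, 2, 2),
                          ("13q", 20158, 11, 8, 2)]:
    p = cluster_p_value(T, k, w, x)
    print(f"{label}: P = {p:.2e}, corrected P = {corrected_p_value(p, T, w):.2e}")
```

The correction treats the arm as roughly $T/w$ non-overlapping windows of the cluster's width, which is why a two-SNP cluster on a long arm needs a very small uncorrected P-value to remain significant.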

We analyzed the remaining 13 distinct clusters using the same approach; the corresponding results are shown in Table 2. In contrast with the first cluster on chromosome 6p, each of these 13 clusters contained exactly two significant SNPs. The clusters of significant SNPs on the 4.3 kb segment of chromosome 10p (5 316 846–5 321 159) and the 28 kb segment of chromosome 13q (20 783 404–20 811 429) had corrected P-values of $2.55 \times 10^{-3}$ and $2.10 \times 10^{-2}$, respectively; these were significant at a 0.05 nominal significance level after correction for multiple comparisons. The corresponding P-values of the scan test for these two clusters were $8.74 \times 10^{-3}$ and $4.34 \times 10^{-2}$. Detailed information on these two clusters is presented in the fourth and seventh columns of Table 2.

It is important to determine the pattern of linkage disequilibrium between adjacent significant SNPs in clusters. We used $r^2$ to measure the magnitude of linkage disequilibrium between the two adjacent significant SNPs in each of these two clusters. The $r^2$ values between significant SNPs were 0.387 for cases and 0.310 for controls on the 4.3 kb cluster of chromosome 10p, and 0.121 for cases and 0.070 for controls on the 28 kb cluster of chromosome 13q.

The scan-test P-values of the 8th, 13th and 14th clusters of Table 2 are not available because the scan test only assesses the largest cluster on a chromosome arm, and the 9th and 12th clusters of Table 2 are the largest on chromosomes 19p and 22q, respectively.

Figure 1. Genome-wide scan of association between the homozygosity level and rheumatoid arthritis, using the z-score test for SNP-by-SNP analyses. SNPs were plotted according to their chromosomal locations against the values of $-\log_{10}(P)$ from the z-score test. The largest association signal lay in the MHC region, with a maximal aggregation of neighboring significant SNPs (nominal significance level, $10^{-8}$) that encompassed the HLA-DRB1 and C4 genes, in which deleterious deletions had been previously discovered in patients.

Table 2. Cluster analysis under the first criterion of two successive significant SNPs separated by 20 or fewer SNP loci

| Cluster number | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Chromosome | 6p | 7q | 10p | 10q | 11p | 13q | 16p |
| Position (a) (kb) | 32 182.782–32 810.427 | 148 176.586–148 326.588 | 5 316.846–5 321.159 | 105 335.191–105 403.030 | 45 242.379–45 297.296 | 20 783.404–20 811.429 | 1 066.544–1 091.324 |
| Cluster size (kb) | 627.645 | 150.002 | 4.313 | 67.839 | 54.917 | 28.025 | 24.780 |
| No. of significant SNPs | 54 | 2 | 2 | 2 | 2 | 2 | 2 |
| No. of SNPs encompassed by cluster | 252 | 19 | 2 | 16 | 10 | 8 | 7 |
| P-value of cluster test | $2.91 \times 10^{-66}$ | $1.31 \times 10^{-4}$ | $5.32 \times 10^{-7}$ | $1.50 \times 10^{-4}$ | $4.91 \times 10^{-5}$ | $8.32 \times 10^{-6}$ | $8.76 \times 10^{-5}$ |
| Corrected P-value of cluster test | $1.61 \times 10^{-64}$ | 0.110 | $2.55 \times 10^{-3}$ | 0.175 | $5.16 \times 10^{-2}$ | $2.10 \times 10^{-2}$ | $7.94 \times 10^{-2}$ |
| P-value of scan test | $3.43 \times 10^{-70}$ | 0.212 | $8.74 \times 10^{-3}$ | 0.351 | 0.103 | $4.34 \times 10^{-2}$ | 0.169 |

| Cluster number | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|
| Chromosome | 19p | 19p | 20q | 21q | 22q | 22q | 22q |
| Position (a) (kb) | 2 054.962–2 165.057 | 19 083.070–19 117.870 | 49 383.424–49 422.842 | 14 121.682–14 367.339 | 24 086.564–24 108.959 | 42 601.072–42 611.432 | 48 493.142–48 620.780 |
| Cluster size (kb) | 110.095 | 34.800 | 39.418 | 245.657 | 22.395 | 10.360 | 127.638 |
| No. of significant SNPs | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| No. of SNPs encompassed by cluster | 18 | 8 | 8 | 14 | 3 | 10 | 14 |
| P-value of cluster test | $4.22 \times 10^{-3}$ | $8.01 \times 10^{-4}$ | $1.58 \times 10^{-4}$ | $2.35 \times 10^{-4}$ | $1.98 \times 10^{-5}$ | $2.93 \times 10^{-4}$ | $5.89 \times 10^{-4}$ |
| Corrected P-value of cluster test | 0.868 | 0.370 | 0.141 | 0.135 | $5.39 \times 10^{-2}$ | 0.240 | 0.344 |
| P-value of scan test (b) | NA | 0.774 | 0.298 | 0.264 | 0.153 | NA | NA |

(a) Build 35.
(b) The scan-test P-values of clusters that are not the largest on a chromosome arm are not available and are shown as NA.

Cluster analysis under the second criterion

Table 3. Cluster analysis under the second criterion of a maximum distance of 100 kb between two successive significant SNPs

| Cluster number | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Chromosome | 6p | 6p | 6p | 6p | 6p | 10p | 10q |
| Position (a) (kb) | 31 133.030–31 203.780 | 31 652.168–31 723.146 | 32 182.782–32 536.263 | 32 680.229–32 810.427 | 33 194.227–33 293.896 | 5 316.846–5 321.159 | 105 335.191–105 403.030 |
| Cluster size (kb) | 70.750 | 70.978 | 353.481 | 130.198 | 99.669 | 4.313 | 67.839 |
| No. of significant SNPs | 2 | 2 | 34 | 20 | 2 | 2 | 2 |
| No. of SNPs encompassed by cluster | 37 | 23 | 164 | 85 | 46 | 2 | 16 |
| P-value of cluster test | $1.97 \times 10^{-2}$ | $7.90 \times 10^{-3}$ | $8.39 \times 10^{-42}$ | $1.95 \times 10^{-26}$ | $5.16 \times 10^{-2}$ | $5.32 \times 10^{-7}$ | $1.50 \times 10^{-4}$ |
| Corrected P-value of cluster test | | | $7.12 \times 10^{-40}$ | $3.19 \times 10^{-24}$ | | $2.55 \times 10^{-3}$ | 0.175 |
| P-value of scan test (b) | NA | NA | $5.38 \times 10^{-23}$ | NA | NA | $8.74 \times 10^{-3}$ | 0.351 |

| Cluster number | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|
| Chromosome | 11p | 13q | 16p | 19p | 20q | 22q | 22q |
| Position (a) (kb) | 45 242.379–45 297.296 | 20 783.404–20 811.429 | 1 066.544–1 091.324 | 19 083.070–19 117.870 | 49 383.424–49 422.842 | 24 086.564–24 108.959 | 42 601.072–42 611.432 |
| Cluster size (kb) | 54.917 | 28.025 | 24.780 | 34.800 | 39.418 | 22.395 | 10.360 |
| No. of significant SNPs | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| No. of SNPs encompassed by cluster | 10 | 8 | 7 | 8 | 8 | 3 | 10 |
| P-value of cluster test | $4.91 \times 10^{-5}$ | $8.32 \times 10^{-6}$ | $8.76 \times 10^{-5}$ | $8.01 \times 10^{-4}$ | $1.58 \times 10^{-4}$ | $1.98 \times 10^{-5}$ | $2.93 \times 10^{-4}$ |
| Corrected P-value of cluster test | $5.16 \times 10^{-2}$ | $2.10 \times 10^{-2}$ | $7.94 \times 10^{-2}$ | 0.370 | 0.141 | $5.39 \times 10^{-2}$ | 0.240 |
| P-value of scan test (b) | 0.103 | $4.34 \times 10^{-2}$ | 0.169 | 0.774 | 0.298 | 0.153 | NA |

(a) Build 35.
(b) The scan-test P-values of clusters that are not the largest on a chromosome arm are not available and are shown as NA.

Under the second cluster criterion of a maximum distance of 100 kb between two adjacent significant SNPs, we identified 14 distinct clusters of neighboring significant SNPs, each of which is described in detail in Table 3. The strongest association signal remained in the MHC region, which contained five distinct clusters on chromosome 6p rather than the single cluster shown in Table 2.
The largest cluster found using the first criterion on chromosome 6p in Table 2 was split into two adjacent clusters (the third and fourth clusters of Table 3) under the second cluster criterion because only three SNPs were genotyped in the 144 kb gap between these two clusters. Both clusters were large (353 kb and 130 kb in size) and highly significant by our cluster test or scan test. The remaining clusters contained exactly two significant SNPs each. Besides the clusters on chromosome 6p, the same two clusters on chromosomes 10p and 13q were significant at a corrected 0.05 nominal significance level, using a Bonferroni-type correction, as under the first cluster criterion. In total, four clusters of significant SNPs under the second cluster criterion, shown in Table 3, were statistically significant at a 0.05 nominal significance level after correction for multiple comparisons. However, the two largest clusters in the MHC region (the third and fourth clusters of Table 3) together span the same segment as the single largest cluster found under the first cluster criterion (the first cluster of Table 2).

In conclusion, both our cluster test and scan test identified three nearly identical clusters of neighboring significant SNPs under either cluster criterion at a corrected 0.05 nominal significance level: the known deleterious deletion variants in the MHC region, a 4.3 kb segment of chromosome 10p and a 28 kb segment of chromosome 13q. Because genomic variants are not uniformly distributed and genotyped over the whole genome, it is more prudent in real-data analyses to perform separate association analyses under both cluster criteria than to rely on a single criterion.

We used the proposed logistic regression framework extension (shown in the "Test for SNP-by-SNP Analyses on the First Stage" section) to assess the significance of excess homozygosity in these three clusters of significant SNPs, accounting for population stratification.
Our analysis showed that 58 of the 252 SNPs in the MHC region (32 182.782–32 810.427 kb) were significant at a nominal significance level of $10^{-8}$. In addition, the two significant SNPs in each of the clusters on chromosomes 10p and 13q also remained highly significant using this logistic regression extension. These results indicate that our cluster-based method is robust to population stratification in this application.

In addition, under either cluster criterion, we found that three clusters of significant SNPs were of borderline statistical significance. These clusters were located on chromosomes 11p, 16p and 22q and had corrected P-values of $5.16 \times 10^{-2}$, $7.94 \times 10^{-2}$ and $5.39 \times 10^{-2}$, respectively. The corresponding $r^2$ values between significant SNPs were <0.01 for cases and controls on chromosomes 11p and 16p, and 0.213 for cases and 0.167 for controls on chromosome 22q.
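The two cluster criteria compared in this section can be sketched in Python as follows. This is an illustration only, not the authors' implementation: it assumes that "separated by 20 or fewer SNP loci" means an index difference of at most 20 between successive significant SNPs on one chromosome arm, and the function name and toy data are hypothetical.

```python
def find_clusters(positions, significant, max_loci=None, max_bp=None):
    """Group significant SNPs on one chromosome arm into clusters.

    positions   -- SNP coordinates in bp, sorted in ascending order
    significant -- booleans, True where the SNP passed the SNP-by-SNP test
    max_loci    -- first criterion: link successive significant SNPs whose
                   index difference is at most this value (assumed reading)
    max_bp      -- second criterion: link them if at most this many bp apart
    Returns (first_index, last_index, n_significant, n_loci_spanned) tuples.
    """
    if (max_loci is None) == (max_bp is None):
        raise ValueError("specify exactly one of max_loci or max_bp")

    def linked(prev, cur):
        if max_loci is not None:
            return cur - prev <= max_loci
        return positions[cur] - positions[prev] <= max_bp

    clusters, current = [], []
    for i in (j for j, flag in enumerate(significant) if flag):
        if current and not linked(current[-1], i):
            if len(current) >= 2:  # a cluster needs two or more significant SNPs
                clusters.append((current[0], current[-1], len(current),
                                 current[-1] - current[0] + 1))
            current = []
        current.append(i)
    if len(current) >= 2:
        clusters.append((current[0], current[-1], len(current),
                         current[-1] - current[0] + 1))
    return clusters

# Toy arm: a sparsely genotyped 144 kb gap separates two pairs of significant SNPs.
positions = [0, 1_000, 2_000, 50_000, 100_000, 146_000, 147_000]
significant = [True, False, True, False, False, True, True]
print(find_clusters(positions, significant, max_loci=20))      # one merged cluster
print(find_clusters(positions, significant, max_bp=100_000))   # two clusters
```

The toy data mirror the behaviour reported for chromosome 6p: a gap that contains few genotyped SNPs is bridged by the locus-count criterion but not by the 100 kb distance criterion, so the same significant SNPs form one cluster under the first criterion and two under the second.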

Human Molecular Genetics, 2012

Figure 2. Genome-wide scan of association between rheumatoid arthritis and deletions (copy number = 0 or 1) defined by PennCNV, using Fisher's exact test. SNPs were plotted according to their chromosomal locations against -log10(P-values) from the two-sided Fisher's exact test. We identified 26 significant SNPs aggregating on chromosome 14 with P-values <10^-8 and an additional 49 SNPs on chromosomes 2, 14 and 20 with P-values of 10^-5 to 10^-8. The SNPs on chromosomes 2 and 14 were protective; those on chromosome 20 were associated with increased RA risk.

Whole-genome scan of RA-association using PennCNV

The SNP-based statistical method that we developed is designed to detect disease-associated deletion variants or excess homozygosity; it is structured to test each contiguous SNP locus between a group of cases and a group of controls from a genome-wide association study (23). In contrast, PennCNV is an algorithm that calls individual-level copy numbers, providing position-specific copy numbers (25). We used PennCNV to obtain whole-genome CNV maps for the 891 RA cases and 601 controls that had available intensity data. PennCNV outputs small chromosomal segments with copy numbers other than two. We detected 62 162 CNVs with a median size of 54 kb: cases had 44 729 CNVs with a median size of 64 kb, and controls had 17 433 with a median size of 32 kb. We first used PennCNV to identify cases and controls that had chromosomal segments with copy number = 0 or 1; we then used Fisher's exact test to assess the statistical significance of the association between RA risk and deletions (copy number = 0 or 1 determined by PennCNV) by comparing the numbers of cases and controls per SNP locus. In Figure 2, we present a graphical SNP-by-SNP summary of the whole-genome scan of association between deletions and RA risk, in which SNPs are plotted according to their chromosomal locations against -log10(P-values).
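As a rough illustration of the per-SNP comparison described above, the following Python sketch computes Fisher's exact test for a 2 × 2 table of deletion carriers versus non-carriers among cases and controls. It is not the code used in the study. The function name `fisher_exact_2x2` is hypothetical, and defining the two-sided P-value as the sum of the probabilities of all tables no more likely than the observed one is an assumption of this example. The counts are those reported in this paper for the shared chromosome 19p segment (12 of 891 cases and 1 of 601 controls).

```python
from math import comb


def fisher_exact_2x2(case_del, n_cases, ctrl_del, n_ctrls):
    """Fisher's exact test for deletion carriers among cases vs. controls.

    Returns (two_sided_p, one_sided_p), where the one-sided P-value is the
    probability of at least `case_del` carriers among the cases.
    Hypothetical helper for illustration; not taken from the paper.
    """
    carriers = case_del + ctrl_del      # all individuals with copy number 0 or 1
    total = n_cases + n_ctrls
    denom = comb(total, carriers)

    # Possible numbers of carriers among cases, given the fixed margins.
    lo = max(0, carriers - n_ctrls)
    hi = min(carriers, n_cases)

    # Hypergeometric probability of each possible table.
    probs = {
        k: comb(n_cases, k) * comb(n_ctrls, carriers - k) / denom
        for k in range(lo, hi + 1)
    }

    p_obs = probs[case_del]
    # Two-sided: add up tables that are no more likely than the observed one
    # (the small relative tolerance guards against floating-point ties).
    two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7))
    # One-sided: more carriers among cases than observed, or the same number.
    one_sided = sum(p for k, p in probs.items() if k >= case_del)
    return min(two_sided, 1.0), min(one_sided, 1.0)


# Counts reported for the chromosome 19p segment: 12/891 cases, 1/601 controls.
two_sided, one_sided = fisher_exact_2x2(12, 891, 1, 601)
print(f"one-sided P = {one_sided:.3g}, two-sided P = {two_sided:.3g}")
```

With these counts, the two values should come out close to the one-sided and two-sided P-values reported in the paper for this segment. Exact integer binomial coefficients are used so that the sketch needs only the standard library; a statistics package would normally be preferable for genome-wide use.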
The P-values of the two-sided Fisher's exact test were calculated and are shown in the figure. We identified 26 significant SNPs (protective; more controls than cases) clustering on chromosome 14 with P-values <10^-8. An amplified display of the -log10(P-values) by physical position over this small region is shown in Figure 3. In addition, we found 49 SNPs with P-values between 10^-5 and 10^-8: 9 SNPs on chromosome 20 increased RA risk (more cases than controls), whereas 35 SNPs on chromosome 14 and 5 SNPs on chromosome 2 decreased RA risk (more controls than cases). Table 4 shows all 75 SNPs with P-values <10^-5, including their positions, names and exact P-values. We also present the corresponding numbers of cases and controls with copy number = 0 or 1 for each SNP in the table. There were 891 RA cases and 601 controls with available intensity data for the PennCNV analyses; thus, the numbers of cases and controls with copy number other than 0 or 1 can be obtained by subtraction for calculating the P-values of Fisher's exact test. It is noteworthy that, unlike our cluster-based approach, the PennCNV method did not detect the known deleterious deletion variants that encompass the HLA-DRB1 and C4 genes in the MHC region. The largest association signal appeared on a 165 kb segment of chromosome 14q (21 834 952–21 999 998), which spans 59 SNP loci and contains all 26 SNPs significant at the nominal significance level of 10^-8.

Figure 3. Amplification of the association scan on chromosome 14, 20.5–23 Mb, between rheumatoid arthritis and deletions (copy number = 0 or 1) defined by PennCNV, using Fisher's exact test. The largest association signal appears on a 165 kb segment of chromosome 14q (21 834 952–21 999 998), which spans 59 SNP loci and contains all 26 SNPs significant at the nominal significance level of 10^-8. Twenty-four consecutive SNPs were statistically significant on a 46.5 kb segment of chromosome 14q (21 834 952–21 881 469).
Table 4. Regions of the genome showing evidence of association between rheumatoid arthritis and deletions (copy number = 1 or 0) by PennCNV

| No. | Chromosome | Position | SNP | No. of cases with copy no. = 0,1 (a) | No. of controls with copy no. = 0,1 (a) | P-value of two-sided Fisher's exact test (b) |
|---|---|---|---|---|---|---|
| 1 | 14 | 21 852 217 | rs11845134 | 13 | 61 | 4.10 × 10^-14 |
| 2 | 14 | 21 849 683 | rs7146411 | 12 | 58 | 9.08 × 10^-14 |
| 3 | 14 | 21 850 339 | rs3811259 | 12 | 58 | 9.08 × 10^-14 |
| 4 | 14 | 21 850 502 | rs11850894 | 12 | 58 | 9.08 × 10^-14 |
| 5 | 14 | 21 834 952 | rs12588739 | 6 | 43 | 3.39 × 10^-12 |
| 6 | 14 | 21 837 485 | rs722448 | 6 | 43 | 3.39 × 10^-12 |
| 7 | 14 | 21 856 055 | rs1474477 | 18 | 62 | 5.24 × 10^-12 |
| 8 | 14 | 21 859 477 | rs8007403 | 24 | 70 | 5.28 × 10^-12 |
| 9 | 14 | 21 860 760 | rs916048 | 25 | 71 | 8.69 × 10^-12 |
| 10 | 14 | 21 857 381 | rs10047935 | 18 | 60 | 1.93 × 10^-11 |
| 11 | 14 | 21 861 403 | rs2204990 | 26 | 70 | 2.56 × 10^-11 |
| 12 | 14 | 21 841 092 | rs741713 | 9 | 46 | 3.12 × 10^-11 |
| 13 | 14 | 21 841 139 | rs1076549 | 9 | 46 | 3.12 × 10^-11 |
| 14 | 14 | 21 841 963 | rs2009858 | 9 | 46 | 3.12 × 10^-11 |
| 15 | 14 | 21 838 610 | rs3811260 | 7 | 42 | 3.98 × 10^-11 |
| 16 | 14 | 21 845 319 | rs1540268 | 11 | 49 | 4.11 × 10^-11 |
| 17 | 14 | 21 845 708 | rs10142594 | 11 | 49 | 4.11 × 10^-11 |
| 18 | 14 | 21 864 135 | rs17793809 | 27 | 70 | 6.52 × 10^-11 |
| 19 | 14 | 21 867 816 | rs4981422 | 27 | 69 | 1.18 × 10^-10 |
| 20 | 14 | 21 869 910 | rs11627649 | 27 | 69 | 1.18 × 10^-10 |
| 21 | 14 | 21 842 503 | rs1467891 | 9 | 44 | 1.28 × 10^-10 |
| 22 | 14 | 21 862 055 | rs11847479 | 26 | 64 | 1.37 × 10^-9 |
| 23 | 14 | 21 878 594 | rs4981423 | 17 | 52 | 1.78 × 10^-9 |
| 24 | 14 | 21 881 469 | rs3811256 | 17 | 51 | 3.38 × 10^-9 |
| 25 | 14 | 21 999 540 | rs10162417 | 22 | 56 | 9.05 × 10^-9 |
| 26 | 14 | 21 999 998 | rs10131293 | 22 | 56 | 9.05 × 10^-9 |
| 27 | 14 | 21 885 790 | rs2032442 | 14 | 45 | 1.46 × 10^-8 |
| 28 | 14 | 21 886 996 | rs12436199 | 14 | 45 | 1.46 × 10^-8 |
| 29 | 14 | 22 000 627 | rs2733776 | 22 | 54 | 2.95 × 10^-8 |
| 30 | 14 | 21 831 090 | rs10483271 | 7 | 33 | 5.90 × 10^-8 |
| 31 | 14 | 21 832 139 | rs17198314 | 7 | 33 | 5.90 × 10^-8 |
| 32 | 14 | 21 832 903 | rs17198328 | 7 | 33 | 5.90 × 10^-8 |
| 33 | 14 | 21 898 729 | rs2331662 | 14 | 42 | 9.73 × 10^-8 |
| 34 | 14 | 21 996 759 | rs17794083 | 26 | 57 | 1.05 × 10^-7 |
| 35 | 14 | 21 995 192 | rs1882704 | 28 | 58 | 1.94 × 10^-7 |
| 36 | 14 | 21 827 106 | rs2001022 | 7 | 31 | 2.25 × 10^-7 |
| **37** | **20** | **35 462 245** | **rs1570209** | **96** | **22** | **2.45 × 10^-7** |
| 38 | 14 | 21 994 034 | rs2242545 | 29 | 59 | 2.49 × 10^-7 |
| 39 | 14 | 21 985 656 | rs12147516 | 52 | 83 | 2.87 × 10^-7 |
| 40 | 14 | 21 986 886 | rs10483273 | 52 | 83 | 2.87 × 10^-7 |
| **41** | **20** | **35 442 559** | **rs6090585** | **62** | **9** | **3.36 × 10^-7** |
| **42** | **20** | **35 443 071** | **rs6018199** | **62** | **9** | **3.36 × 10^-7** |
| 43 | 14 | 21 826 110 | rs10129606 | 7 | 30 | 4.42 × 10^-7 |
| 44 | 2 | 208 064 035 | rs918843 | 34 | 63 | 5.25 × 10^-7 |
| 45 | 2 | 208 064 167 | rs918842 | 34 | 63 | 5.25 × 10^-7 |
| 46 | 2 | 208 064 454 | rs2551649 | 34 | 63 | 5.25 × 10^-7 |
| 47 | 2 | 208 065 237 | rs6755425 | 34 | 63 | 5.25 × 10^-7 |
| 48 | 2 | 208 066 083 | rs959668 | 34 | 63 | 5.25 × 10^-7 |
| 49 | 14 | 21 991 120 | rs11848747 | 32 | 61 | 5.32 × 10^-7 |
| **50** | **20** | **35 440 545** | **rs12329503** | **61** | **9** | **5.49 × 10^-7** |
| **51** | **20** | **35 485 009** | **rs6018428** | **98** | **24** | **6.36 × 10^-7** |
| **52** | **20** | **35 485 260** | **rs6018432** | **98** | **24** | **6.36 × 10^-7** |
| **53** | **20** | **35 438 689** | **rs6094509** | **60** | **9** | **9.03 × 10^-7** |
| **54** | **20** | **35 475 054** | **rs11905013** | **97** | **24** | **9.49 × 10^-7** |
| **55** | **20** | **35 476 320** | **rs4810624** | **97** | **24** | **9.49 × 10^-7** |
| 56 | 14 | 22 009 307 | rs8020193 | 17 | 42 | 1.14 × 10^-6 |
| 57 | 14 | 22 002 896 | rs10483275 | 21 | 47 | 1.49 × 10^-6 |
| 58 | 14 | 21 908 470 | rs8014927 | 14 | 38 | 1.79 × 10^-6 |
| 59 | 14 | 21 822 713 | rs17116039 | 8 | 30 | 1.84 × 10^-6 |
| 60 | 14 | 21 819 582 | rs4435168 | 13 | 36 | 2.13 × 10^-6 |
| 61 | 14 | 21 973 302 | rs2141988 | 49 | 75 | 2.30 × 10^-6 |
| 62 | 14 | 21 973 771 | rs3811232 | 49 | 75 | 2.30 × 10^-6 |
| 63 | 14 | 21 974 905 | rs8021297 | 49 | 75 | 2.30 × 10^-6 |
| 64 | 14 | 22 010 682 | rs10483277 | 15 | 38 | 2.87 × 10^-6 |
| 65 | 14 | 21 970 760 | rs6572449 | 49 | 74 | 4.98 × 10^-6 |
| 66 | 14 | 21 972 830 | rs7142158 | 49 | 74 | 4.98 × 10^-6 |
| 67 | 14 | 21 975 565 | rs11623995 | 49 | 74 | 4.98 × 10^-6 |
| 68 | 14 | 21 976 908 | rs11157596 | 49 | 74 | 4.98 × 10^-6 |
| 69 | 14 | 21 914 810 | rs4982619 | 14 | 36 | 5.51 × 10^-6 |
| 70 | 14 | 21 816 895 | rs3811266 | 13 | 34 | 7.13 × 10^-6 |
| 71 | 14 | 21 817 304 | rs4982599 | 13 | 34 | 7.13 × 10^-6 |
| 72 | 14 | 21 931 475 | rs12891257 | 13 | 34 | 7.13 × 10^-6 |
| 73 | 14 | 21 933 475 | rs10142552 | 13 | 34 | 7.13 × 10^-6 |
| 74 | 14 | 21 928 200 | rs3811247 | 12 | 33 | 7.51 × 10^-6 |
| 75 | 14 | 21 929 322 | rs3811244 | 12 | 33 | 7.51 × 10^-6 |

(a) There were 891 RA cases and 601 controls in the PennCNV analyses. Given the numbers of cases and controls with copy number = 0 or 1, the numbers of cases and controls with copy number other than 0 or 1 can be obtained by subtraction to calculate the P-values of Fisher's exact test.
(b) The two-sided Fisher's exact test was used to assess the statistical significance of the association between RA risk and deletions (copy number = 0 or 1).

Notably, we found that 24 consecutive SNPs were statistically significant on a 46.5 kb segment of chromosome 14q (21 834 952–21 881 469). The respective maps of this region for cases and controls, shown in Figure 4, suggest that at least four distinct loci in separate linkage disequilibrium blocks are present on the 46.5 kb segment that accommodates the 24 consecutive significant SNPs, and that at least eight distinct loci in separate linkage disequilibrium blocks are present on the 165 kb segment of chromosome 14q (21 834 952–21 999 998) that accommodates all 26 significant SNPs. This region contains the T-cell receptor alpha chain, which is rearranged in T-cells. As different T-cells show different rearrangements, the DNA intensity across this region would be decreased, while heterozygosity calling of genotypes would not be altered, hence explaining the differences between PennCNV and the homozygosity clustering approach. We used the proposed logistic regression framework extension (shown in the Test for SNP-by-SNP Analyses on the First Stage section) to assess the significance of deletions (copy number = 0 or 1 determined by PennCNV) in the top-signal region of chromosome 14q, accounting for population stratification. Our analysis showed that this region remained highly significant. In addition, we found that nine consecutive SNPs on a 46.6 kb segment of chromosome 20 (35 438 689–35 485 260) were associated with increased RA risk with P-values of 10^-6 to 10^-7 (shown in bold in Table 4); five consecutive SNPs on a 2 kb segment of chromosome 2 (208 064 035–208 066 083) were associated with decreased RA risk with a P-value of 5.25 × 10^-7.
The proto-oncogene tyrosine-protein kinase SRC lies in the 46.6 kb chromosomal segment of chromosome 20.

Additional analysis outcome of cluster-based and PennCNV methods combined

Twelve RA patients and one control shared a 6.6 kb deleted segment with copy number = 1 by PennCNV on chromosome 19p (2 060 157–2 066 790) that spans two SNP loci. This segment also lay between two adjacent significant SNPs on chromosome 19p (2 054 962–2 165 057) identified by our cluster-based method (shown on the lower second column of Table 2). This cluster of significant SNPs was not statistically significant at a corrected 0.05 nominal significance level, using a Bonferroni-type correction, by our cluster test. The 12 RA patients shared a 15.4 kb segment on chromosome 19p (2 051 346–2 066 790) that spans four SNP loci. AP3D1 (adaptor-related protein complex 3, delta 1) lies in this region. Fisher's one-sided (two-sided) exact test comparing 12/891 versus 1/601 gives a P-value of 1.17 × 10^-2 (1.98 × 10^-2); significantly more RA cases than controls carried this 6.6 kb deleted segment. Supplementary Material, Table S1 provides data on the 12 identified RA patients and 1 control, including their respective affection statuses, copy numbers, deletion segment lengths, starting and ending deletion SNPs and starting and ending physical deletion positions.

DISCUSSION

Because known RA susceptibility loci explain only a small portion of familial clustering (22) and because CNVs are abundant in humans and represent one of the least well-studied classes of genetic variants (18), we attempted to account for some of the unexplained heritability by performing a genome-wide study of association between deletions or excess homozygosity and RA risk. We analyzed high-density 550 K SNP genotype data from a genome-wide association study of RA (20).
In the SNP-by-SNP analysis using our method (23), we detected the strongest association signal in the MHC region, with a maximal aggregation of neighboring significant SNPs at the nominal significance level of 10^-8; this region encompasses known deletion variants on the HLA-DRB1 and C4 genes. We observed a complex and extensive linkage disequilibrium pattern among significant SNPs in this region. The subsequent cluster analysis is designed to detect clusters of two or more neighboring significant SNPs overly aggregated on a small chromosomal segment and to test for statistical significance of clustering. In addition to successfully detecting known deleterious deletion variants on the HLA-DRB1 and C4 genes in the MHC region (shown in the second column of Table 2), we identified 4.3 and 28 kb clusters of significant SNPs on chromosomes 10p and 13q (shown in the fourth and seventh columns of Table 2) using our cluster test and scan test; these clusters were significant at a corrected 0.05 nominal significance level, adjusted for multiple comparisons.

Several RA-associated alleles of modest risk at new loci have been discovered in recent genome-wide association studies. We evaluated the significance status of the neighboring SNPs that encompassed these associated loci, including PTPN22, STAT4, CTLA4, REL, HLA-DRB1, TNFAIP3, BLK, TRAF1-C5, PRKCQ and CD40. For each locus, we evaluated 100 adjacent SNPs (50 SNPs on each side of the associated locus) from the SNP-by-SNP analysis outcome. Thirty-two significant SNPs encompassed HLA-DRB1, 4 encompassed C4 and 1 (rs2572386) was 114 kb from BLK. Given the complex and extensive linkage disequilibrium pattern in the MHC region, it may not be surprising that many significant SNPs neighbor HLA-DRB1. Further fine-mapping studies are required to determine whether additional risk deletion variants exist besides HLA-DRB1 and C4 in the MHC region.

Figure 4. (A and B) The haplotype maps of chromosome 14q (21 834 952–22 000 629). The first figure is the haplotype map for cases (A) and the second for controls (B). We used the value of D′ to create the linkage disequilibrium blocks of these two haplotype maps. These two figures suggest that at least four distinct loci, in separate linkage disequilibrium blocks, are present on the 46.5 kb segment of chromosome 14q (21 834 952–21 881 469); this segment accommodates 24 consecutive significant SNPs. At least eight distinct loci, in separate linkage disequilibrium blocks, are present on the 165 kb segment of chromosome 14q (21 834 952–21 999 998); this segment accommodates all 26 significant SNPs.

Independently, we performed PennCNV analyses and obtained whole-genome CNV maps for 891 RA cases and 601 controls with available intensity data. We first identified cases and controls that had chromosomal segments with copy number = 0 or 1; we then used Fisher's exact test to compare the numbers of cases and controls per SNP locus, testing the statistical significance of the association between RA risk and deletions (copy number = 0 or 1 by PennCNV). In Figure 2, we present a graphical SNP-by-SNP summary in which SNPs are plotted according to their chromosomal locations against -log10(P-values). We identified 26 significant SNPs aggregating on chromosome 14 with P-values <10^-8 and an additional 49 SNPs on chromosomes 2, 14 and 20 with P-values of 10^-5 to 10^-8.
The SNPs that were found on chromosomes 2 and 14 are protective (more controls than cases); those that were found on chromosome 20 increased RA risk (more cases than controls). The 75 SNPs with P-values <10^-5 are presented in Table 4.

The cluster-based and PennCNV methods are different approaches to investigating the relationships between disease status and deletion variants. The cluster-based method is structured to identify commonly shared excess homozygosity among patients with a genetic disorder, providing strong evidence that the genes in the deleted or excess homozygosity region predispose patients to the disease. It uses a two-stage design to evaluate the association with complex human traits from high-density SNP genotype data in genome-wide association studies (23). The evidence of genomic deletions that are associated with disease is further enhanced by observing successive or neighboring SNPs with excess homozygosity in cases compared with controls in our cluster-based scheme. In contrast, the PennCNV method is an algorithm for cataloging and identifying copy numbers for individuals, using intensity data on the basis of a hidden Markov model (25). We used PennCNV to identify cases and controls that had chromosomal segments with copy number = 0 or 1 and used Fisher's exact test to assess the statistical significance of the association between RA risk and deletions by comparing the numbers of cases and controls per SNP locus. The cluster-based and PennCNV methods may be sensitive to different aspects of data and observation, thus providing different information for discovering associated deletion variants or excess homozygosity in RA patients. Notably, our cluster-based method identified the strongest signals on a chromosomal segment that encompassed known deleterious deletion variants on the HLA-DRB1 and C4 genes, but the PennCNV analysis did not detect statistical significance in the MHC region. We performed another cluster-based analysis using a smaller data set of 851 RA cases and 571 controls that was included in the PennCNV analysis and was a subset of the 868 RA cases and 1194 controls in our original cluster analysis.
The cluster-based method remained effective and identified the largest association signal, with a maximal aggregation of 50 neighboring significant SNPs, in the MHC region. Supplementary Material, Figure S1 displays a graphical summary of the genome-wide association scan between deletion variants or excess homozygosity and RA risk on the basis of this smaller data set; SNPs are plotted according to their chromosomal locations against -log10(P-values). The smaller data set is therefore not likely to be the major reason that the PennCNV method failed to detect the known deleterious deletion variants in the MHC region. The cluster-based method also detected a segment on chromosome 19p (2 054 962–2 165 057) that was bounded by two adjacent significant SNPs but was not statistically significant at a corrected 0.05 nominal significance level, using a Bonferroni correction, by our cluster test (shown on the lower second column of Table 2). The PennCNV analysis identified 12 RA patients and 1 control who shared a 6.6 kb segment with copy number = 1 on chromosome 19p (2 060 157–2 066 790) that lay within the segment described by our cluster-based approach. We used Fisher's one-sided (two-sided) exact test to compare cases (12/891) and controls (1/601) and obtained a P-value of 1.17 × 10^-2 (1.98 × 10^-2): significantly more RA cases than controls carried this 6.6 kb chromosomal segment. Supplementary Material, Table S1 presents detailed information on these 13 individuals and their respective deletion segments. Several experimental methods are available to validate deletion variants or excess homozygosity, such as fluorescent in situ hybridization, two-color fluorescence intensity, PCR amplification and quantitative PCR.
Biological confirmation and molecular validation of the top-signal chromosomal segments detected by the cluster and PennCNV analyses, including those on chromosomes 10p, 13q, 14q and 19p, are warranted in the future. In this study, we (i) used our cluster-based method to perform a whole-genome scan of disease-associated deletions or excess homozygosity and identified novel 4.3 and 28 kb clusters on chromosomes 10p and 13q, respectively, at a corrected 0.05 nominal significance level; (ii) used PennCNV and Fisher's exact test to independently perform a whole-genome analysis of association with deletion variants and identified 26 significant SNPs that were overly aggregated on a 165 kb segment of chromosome 14q at a nominal significance level of 10^-8; (iii) identified 12 RA cases and 1 control who shared a 6.6 kb segment with copy number = 1, determined by PennCNV, on chromosome 19p in a region that was also identified by our cluster-based method; and (iv) proposed a novel logistic regression method to perform additional analyses for deletions and excess homozygosity, accounting for population stratification. In contrast to the design of genome-wide association studies, in which a point-wise approach is used to find individual disease-associated SNPs, segment-wise approaches are generally used to discover small chromosomal CNV segments. Existing SNP-based approaches and algorithms, including our cluster-based method, are structured to identify deletion variants or excess homozygosity by observing aberrant SNP patterns in a run of consecutive SNPs (4,6,23–25). If we find statistically significant evidence of excess homozygosity at individual SNPs in the SNP-by-SNP analyses, we use the cluster-based statistical approach to combine information from multiple neighboring SNPs and find a run of tightly adjacent significant SNPs associated with a disease of interest.
In this report, we also provide a strategy and analytical framework that can be used, at no additional cost, to detect disease-associated deletion variants or excess homozygosity and to identify individual patients with commonly shared disease-associated deletion variants, using SNP and intensity data from a genome-wide association study. In addition to unbalanced structural variants, low-frequency and rare variants may explain a portion of the missing heritability of many common human diseases. The high-density SNP genotype data in genome-wide association studies are more likely to capture common CNVs than low-frequency ones. Furthermore, early commercial SNP array platforms were designed to be biased against SNP genotyping near CNV regions. These factors may limit the sensitivity and scope of SNP-based CNV association studies. However, newer generations of SNP arrays have been designed to eliminate much of the bias against capturing genomic segments affected by CNVs and provide higher-resolution maps of CNVs, enabling more effective and efficient CNV association studies using SNPs (28,29). The recent nucleotide-resolution CNV map based on whole-genome DNA sequencing data will further enable robust investigation in sequencing-based CNV association studies (18).

MATERIALS AND METHODS

Study population

To evaluate the potential role of deletion variants of CNVs that influence case-control status on a whole-genome scale, we used data from the North American Rheumatoid Arthritis Consortium, genotyped on the Illumina HumanHap550 array. The study population consisted of 868 cases and 1194 controls from North America and was previously reported in a genome-wide association study of RA susceptibility loci (20). All patients were anti-CCP-positive and met the criteria for RA adopted by the American College of Rheumatology in 1987. Cases and controls were self-reported as white. Genotyping was performed on the SNP assay with Infinium HumanHap550 (Illumina), and 54 080 SNPs were genotyped in samples from cases and controls. The data set was filtered on the basis of SNP genotype call rate (>95% completeness), minor allele frequency (>0.01) and the Hardy-Weinberg proportion (P ≥ 10^-5). Patients and controls whose percentages of missing genotypes were >5%, who had non-European ancestry, who were related, or who had evidence of DNA contamination were removed from the analysis. Written informed consent was obtained from all subjects who provided blood samples, in accordance with protocols approved by the local institutional review boards. More details of the sample collection are described elsewhere (20).

SNP-based statistical method in a two-stage design

Current molecular technologies and SNP genotyping methods have technical challenges that result in relatively limited resolution; they are not capable of effectively identifying and cataloging CNVs in whole-genome array scans. CNVs, and genomic deletions in particular, can perturb the collection of SNP genotype data in CNV regions, causing SNP intensity data to cluster poorly and SNP genotypes in hemizygous deletion regions to be observed as homozygous for the allele that is present (4,6,24).
We recently proposed and developed a statistical method that uses a two-stage design to detect deletion variants or excess homozygosity that are associated with complex human traits from high-density SNP genotype data in genome-wide association studies. The method was designed for single-SNP analyses on the first stage and utilizes evidence from multiple adjacent SNPs, combined with a cluster-based approach, on the second stage in case-control studies (23). SNP-based methods, including our cluster-based method, are not capable of effectively distinguishing between homozygosity and deletions. The identification of excess homozygosity regions in multiple cases forms the basis of our method. It was structured to detect commonly shared deletion variants or excess homozygosity among patients with a genetic disorder, providing strong evidence that the genes in the deleted region predispose patients to the disease.

Test for SNP-by-SNP analyses on the first stage

We compared the level of homozygosity at each contiguous SNP locus by using normal approximations to test the significance of differences in homozygosity proportions between cases and controls on the first stage. This test infers the presence of genomic deletions associated with disease by assessing the statistical significance of higher homozygosity proportions in cases than in controls. Letting $\hat{p}_1$ and $\hat{p}_2$ be the respective estimates of homozygosity proportions in cases and controls at a single SNP locus and $\hat{p}$ be their weighted average, the normal deviate $Z$ is the difference in proportions, $\hat{p}_1 - \hat{p}_2$, divided by its standard error:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}},$$

where $n_1$ and $n_2$ represent the sample sizes of cases and controls, respectively. This z-score test can be performed on each contiguous SNP locus along the whole human genome. The above method does not account for covariates in the model (e.g. eigenvectors for population stratification, age and sex).
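The first-stage z-score test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `homozygosity_z_test` is hypothetical, the "weighted average" is taken to be the usual sample-size-weighted (pooled) proportion, the P-value is taken as the upper tail of the standard normal distribution (higher homozygosity in cases), and the counts in the usage line are made up.

```python
from math import erfc, sqrt


def homozygosity_z_test(hom_cases, n_cases, hom_ctrls, n_ctrls):
    """Two-proportion z-score for excess homozygosity in cases at one SNP.

    Returns (z, one_sided_p). The one-sided P-value tests for a higher
    homozygosity proportion in cases than in controls.
    Hypothetical helper for illustration; not taken from the paper.
    """
    p1 = hom_cases / n_cases            # homozygosity proportion in cases
    p2 = hom_ctrls / n_ctrls            # homozygosity proportion in controls
    # Pooled proportion: average of p1 and p2 weighted by sample size (assumed).
    p = (hom_cases + hom_ctrls) / (n_cases + n_ctrls)
    se = sqrt(p * (1 - p) * (1 / n_cases + 1 / n_ctrls))
    if se == 0:
        raise ValueError("all genotypes are homozygous or none are; z is undefined")
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * erfc(z / sqrt(2))
    return z, p_value


# Made-up counts: 60 of 100 cases and 50 of 100 controls are homozygous.
z, p_value = homozygosity_z_test(60, 100, 50, 100)
print(f"z = {z:.3f}, one-sided P = {p_value:.3f}")
```

In a genome-wide scan, the same function would be applied to every SNP locus in turn, and the resulting P-values compared against the chosen nominal significance level.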
Therefore, we considered a logistic regression framework extension of this approach to assess the significance of excess homozygosity or CNV as follows:

$$\log\frac{\Pr(\text{individual is a case})}{\Pr(\text{individual is a control})} = \beta_0 + \beta_1 x + \beta_2\,\text{eigenvectors} + \beta_3\,\text{covariates},$$

where $x$ is the indicator of homozygosity status (or copy number = 0, 1) at a SNP locus for an individual, and $\beta_2$ is a vector with the same dimension as the number of eigenvectors adjusted for population stratification. In our analyses, we adjusted for the eigenvectors corresponding to the top four significant eigenvalues, as in the original genome-wide association study (20).

Test for cluster analyses on the second stage

Evidence of disease-associated genomic deletions can be enhanced by observing successive or neighboring SNPs with excess homozygosity in cases compared with controls in our cluster-based scheme. In addition, this scheme can delineate the extent of the minimal regions of common genomic deletions among patients, indicating the critical region of disease. Our cluster test is useful for further investigation of the outcomes of the SNP-by-SNP analyses on the first stage and is designed to assess the statistical significance of multiple clusters of SNPs with excess homozygosity in cases compared with controls. Suppose that T SNP loci over a chromosomal region are tested using the z-score test for SNP-by-SNP analyses, of which k SNP loci have significantly higher homozygosity proportions in cases than in controls. Consider the frequency of significant SNP loci occurring within a narrow segment of interest compared with the frequency of significant SNP loci over the whole region. Suppose that the narrow segment of interest encompasses w SNP loci, among which x SNP loci have significant excess homozygosity in cases aggregating in this segment.
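To make the notation T, k, w and x concrete, the sketch below computes an upper-tail binomial probability for observing at least x significant loci in a window of w loci. It assumes, as the text goes on to state, that loci are independently and equally likely to be significant. Taking the per-locus probability to be k/T is an assumption of this example, so the paper's own P-value formula may differ. The function name `cluster_binomial_p` and the numbers in the usage line are hypothetical.

```python
from math import comb


def cluster_binomial_p(T, k, w, x):
    """Upper-tail probability P(X >= x) for X ~ Binomial(w, k / T).

    T: SNP loci tested over the region; k: significant loci among them;
    w: loci in the narrow segment; x: significant loci in that segment.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if T <= 0 or not (0 <= k <= T and 0 <= x <= w <= T):
        raise ValueError("expected T > 0, 0 <= k <= T and 0 <= x <= w <= T")
    q = k / T  # region-wide frequency of significant loci (assumed null rate)
    return sum(comb(w, j) * q**j * (1 - q) ** (w - j) for j in range(x, w + 1))


# Made-up numbers: 40 significant loci among 10 000 tested,
# 5 of which fall within a window of 20 loci.
print(cluster_binomial_p(T=10_000, k=40, w=20, x=5))
```

A small value indicates that significant loci are more densely aggregated in the window than the region-wide frequency would suggest; any correction for testing multiple windows would have to be applied separately.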
What interests us is to determine whether the observation of x significant SNP loci in the segment that contains w SNP loci is statistically significant compared with the occurrence of k significant SNP loci over T SNP loci. Assuming that each of the T SNP loci tested is independently and equally likely to have a significant excess homozygosity proportion in cases and assuming that X represents the number of significant SNP loci within the segment that contains w SNP loci, the statistical test for cluster analyses is based on the random variable X with a binomial distribution. The P-value formula for this cluster test under the null hypothesis of random