Supplementary Methods. 1. Cancer Genetic Markers of Susceptibility (CGEMS) Prostate Cancer Genome-Wide Association Scan

The CGEMS data portal provides public access to summary results for approximately 550,000 SNPs genotyped in the CGEMS prostate cancer scan (Phase 1A with HumanHap300 and Phase 1B with HumanHap240, both from Illumina, San Diego, CA) in more than 1,100 prostate cancer patients and an equivalent number of controls. Analysis of nearly 550,000 SNP genotypes per subject provides approximately 90% coverage of common SNPs based on HapMap Phase 2 with minor allele frequency (MAF) greater than 0.05 in the European population and a linkage disequilibrium coefficient threshold of r² > 0.8 using TagZilla (http://tagzilla.nci.nih.gov/) [1-3].

Study Population

The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial is a large, randomized controlled trial of approximately 155,000 men and women. Participants are randomized to either a screening or a control arm. Each year after enrollment, subjects are asked to notify the study of any cancers diagnosed in the past year using the Annual Study Update (ASU). The trial is designed to test the efficacy of cancer screening to prevent early death from prostate, lung, colorectal and ovarian cancer. The collection of questionnaire data and biospecimens (e.g., repeated blood samples and, in some instances, buccal cell samples) allows investigation of early markers for cancer as well as the etiology of common cancers [4-6]. PLCO enrollment began in 1993 and ended in 2001. Recruitment included men and women aged 55 to 74 with no reported history of prostate, lung, colorectal or ovarian cancer, although prior diagnoses of other cancers were acceptable.

The CGEMS cohort consisted of men enrolled in the screening arm of the PLCO Trial who:
1. were White and non-Hispanic;
2. had no history of prostate cancer before randomization;
3. had at least one PLCO prostate cancer screen (PSA testing) before October 1, 2003;
4. had completed a Baseline Questionnaire about risk factors for cancer;
5. had signed informed consent;
6. had provided a blood sample sufficient to meet one of the following criteria:
   a. at least 11 µg of DNA after extraction,
   b. at least 1 vial of buffy coat, or
   c. at least 7 vials of whole blood; and
7. for controls, had returned at least one Annual Study Update (ASU).

Based on these criteria, 28,521 men were included in the CGEMS sub-cohort.

CGEMS distinguishes between non-aggressive and aggressive cases of prostate cancer at the time of diagnosis. The two subtypes are defined as follows:
1. Non-aggressive: cases with a Gleason score < 7 and stage < III.
2. Aggressive: cases with a Gleason score ≥ 7 or stage ≥ III.

Study enrollment began on October 1, 1993. Consequently, study years in the PLCO Trial are counted according to the Federal fiscal year (FY), October 1 to the following September 30. All men diagnosed with prostate cancer between enrollment and the end of FY2001 were considered for inclusion in CGEMS. Because of our interest in the clinically more significant but less common aggressive form of prostate cancer, we increased the fraction of aggressive cases in the CGEMS case series by extending eligibility for cases diagnosed with aggressive prostate cancer through the end of FY2003.

A total of 1,361 subjects with prostate cancer met the eligibility criteria and were considered for the CGEMS project; 737 cancers were aggressive and 624 were non-aggressive. Of the eligible cases, all aggressive cases (n=737) were chosen to be cases in the CGEMS prostate cancer study. Of the 624 men found to have non-aggressive tumors, the 493 men (79.0%) whose diagnosis was temporally closest to the first screening were included in this study.

Controls were selected by incidence-density sampling. The first step was the creation of non-overlapping sets of cases characterized by:
1. Calendar year (FY) of entry into the cohort;
2. Age at entry in five-year intervals (55-59, 60-64, 65-69, 70-74); and
3. Number of years under follow-up between enrollment and diagnosis of prostate cancer.

Next, for each case set, we identified eligible men among all 28,521 men in the CGEMS cohort who met each of the following three criteria:
1. Same year of entry into the cohort as the case set;
2. Same five-year age-at-entry interval (55-59, 60-64, 65-69, 70-74) as the case set; and
3. Observed through the year of follow-up in the case set with no prostate cancer diagnosis.

In incidence-density sampling, a male subject is included as a control for a given case set independently of eligibility and selection as a control for other case sets and independently of future diagnosis as a case. Our goal was to obtain a random sample of controls from the corresponding risk set with size equal to the number of cases in the set. For cases diagnosed before 2002, some samples had already been extracted for another prostate cancer study within the PLCO cohort. To assure that all eligible subjects had equal inclusion probability for this study, we replaced some of the previously selected controls with newly enrolled cohort members according to a random selection rule that ensured that the chance of inclusion as a control in the CGEMS study for a given case was the same for each man eligible to be a control.

In total, 1,204 different men, representing 1,230 control selections, were identified as controls at least once (1,179 subjects sampled once, 24 subjects sampled twice, and one subject sampled three times). Characteristics of the 1,230 controls selected are in Table 1 below. Forty-six control subject selections, derived from 44 eligible subjects, subsequently developed prostate cancer; 32 of these subjects were included as cases and 12 were among the men with non-aggressive prostate cancer not included in the analysis.

Table 1. Potential CGEMS participants from the PLCO cohort

                                                      Number of times selected as a control
Case status                                           0        1       2    3    Total
No prostate cancer diagnosis                          26,000   1,136   23   1    27,160
Prostate cancer diagnosis, not selected as a case     119      11      1    0    131
Prostate cancer diagnosis, selected as a case         1,198    32      0    0    1,230
All subjects                                          27,317   1,179   24   1    28,521

2. Replication Study Characteristics

i) The Health Professionals Follow-up Study

The Health Professionals Follow-up Study (HPFS) began in 1986 and is an ongoing prospective cohort study of 51,529 United States male dentists, optometrists, osteopaths, podiatrists, pharmacists, and veterinarians 40 to 75 years of age. The baseline questionnaire provided information on age, marital status, height and weight, ancestry, medications, smoking history, disease history, physical activity, and diet. At baseline the cohort was 97% white, 2% Asian American, and 1% African American. The median follow-up through 2005 was 10.5 years (range 2-19 years). Self-reported prostate cancer diagnoses were confirmed by obtaining medical and/or pathology records. Prostate cancer deaths are reported by family members in response to follow-up questionnaires, discovered by the postal system, or identified through the National Death Index.
Questionnaires are sent every two years to surviving men to update exposure and medical history. In 1993 and 1994, a blood specimen was collected from 18,018 men without a prior diagnosis of cancer. Prostate cancer cases are matched to controls on birth year (±1 year) and ethnicity. Controls are selected from those who were cancer-free at the time of the case's diagnosis and had a prostate-specific antigen test after the date of blood draw.

ii) The Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC)

The Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC Study) was initiated in 1985 as a randomized controlled trial of 29,133 Caucasian male smokers 50-69 years old in southwestern Finland that tested the effects of vitamin E and beta-carotene on cancer incidence. It was continued as a longitudinal cohort following the intervention period. Current follow-up exists for up to 20 years (16 years at the time the nested case-control set was created for the BPC3). Cancer incidence and overall mortality are 100% ascertained through follow-up with national registries in Finland. Serum and questionnaire data were collected from all participants at baseline, and whole blood (for DNA) was collected in 1992 from 20,243 men. The ATBC Study nested set for the BPC3 included 1,058 prostate cancer cases and 1,058 controls without prostate cancer who were matched on age, intervention assignment, and date of blood draw.

iii) The American Cancer Society Cancer Prevention Study II Nutrition Cohort (CPS-II)

The American Cancer Society (ACS) Cancer Prevention Study II Nutrition Cohort (CPS-II) was established in 1992; the cohort includes over 86,000 men and 97,000 women from 21 U.S. states who completed a mailed questionnaire in 1992. At baseline, the cohort was 97% white and the median age of participants was 63 (range: 40-92). Starting in 1997, follow-up questionnaires have been sent to surviving cohort members every other year to update exposure information and to ascertain the occurrence of new cases of cancer; a response rate above 90% has been achieved for each follow-up questionnaire. Incident cancers are verified through medical records, state cancer registries, or death certificates. From 1998 to 2001, blood samples were collected from a subgroup of 39,376 cohort members. To further supplement the DNA resources, buccal cell samples were collected by mail from an additional 70,004 cohort members during 2000-2001. Cases of prostate cancer were frequency matched to controls on single year of age, ethnicity, date of sample collection and specimen type (blood or buccal cell sample). All controls were selected from individuals who were cancer-free (except for non-melanoma skin cancer) at the beginning of the interval preceding the diagnosis of each case. Similarly, cases were not eligible if they had a history of another cancer other than non-melanoma skin cancer prior to their diagnosis of prostate cancer. For all cases, exposure information was collected by questionnaire before the cancer diagnosis. However, collection of some biologic samples occurred after cancer diagnosis or immediately preceding cancer diagnosis.

iv) The CeRePP French Prostate Case-Control Study

The French Prostate Cancer Case-Control Study (CeRePP), also known as the ProGene study, began in July 1994. Patients treated in three French Departments of Urology (Paris, Brest and Nancy) were recruited for the study. Patients who had histologically confirmed prostate cancer were included after giving informed consent, at which time a blood sample was collected. Clinical and histological data on the probands were obtained from medical records. Each participating patient received a questionnaire by mail to collect family data on first-degree relatives (date of birth and, if deceased, age at death and cause of death) and the occurrence of cancers in the family.

Controls were recruited from participants in a systematic health screening program funded by the French National Health Insurance who were found to be unaffected (asymptomatic for prostate cancer, normal digital rectal examination and total PSA < 4 ng/ml). They were recruited in the same geographical areas as the three hospitals. They also completed a questionnaire to collect information on the occurrence of cancers in their first-degree relatives. Additional information for both cases and controls included the month and year of blood donation, height and weight at diagnosis/screening, history of smoking at blood donation and family history of prostate cancer in first-degree relatives. Additional information for cases included tumor stage (TNM), tumor grade (Gleason score), tumor histology and death (year, date and whether due to prostate cancer). Cases were matched with controls within each recruiting location and by five-year age interval at interview/diagnosis.

3. Genome-Wide Association Study of PLCO in CGEMS

Sample handling of PLCO

DNA samples were received from the PLCO bio-repository and visually inspected for adequate fluid in individual tubes. Three measurements of DNA quantification were performed according to the standard procedures at the Core Genotyping Facility of the National Cancer Institute [7]. These include PicoGreen analysis, optical density spectrophotometry and real-time PCR (http://cgf.nci.nih.gov/dnaquant.cfm). Samples were also analyzed with 15 short tandem repeats and the Amelogenin marker in the Identifiler™ Assay (ABI, Foster City, CA). Samples that completed fewer than 13 of the 15 microsatellite markers were excluded and not deemed suitable for additional genotyping.

Table 2. Final set of PLCO samples genotyped in CGEMS

                                              Number of times selected as a control
Status at initiation of CGEMS project         0       1       2    3    Total
Prostate cancer-negative during follow-up     0       1,087   22   1    1,110
Diagnosed with non-aggressive cancer          466     26      1    0    493
Diagnosed with aggressive cancer              679     16      0    0    695
All subjects                                  1,145   1,129   23   1    2,298

After final review and sample handling, 1,188 of 1,361 (87.2%) eligible cases were genotyped in CGEMS. A total of 1,188 men with prostate cancer and 1,110 men not diagnosed with cancer are included in our analysis. For quality control analysis, 49 DNA samples from PLCO were genotyped in duplicate. We also genotyped 100 DNAs from Centre d'Etude du Polymorphisme Humain (CEPH, Paris, France) families, of which 77 were genotyped in duplicate.

Selection of SNPs

Genotyping of the CGEMS Prostate Cancer Study was performed under contract by Illumina Corporation in two parts: Phase 1A used the Sentrix HumanHap300 genotyping assay and Phase 1B used the Sentrix HumanHap240 [8-10]. Together, the chips constitute a fixed panel of 561,494 tag SNPs identified following the method initially described by Carlson et al. [11]. This selection was performed using data from the International HapMap project (http://www.hapmap.org/), with a threshold for the linkage disequilibrium statistic of r² > 0.7 for non-genic regions and r² > 0.8 for genic regions [9]. For the European population, this panel is expected to cover close to 90% of the common SNPs in HapMap phase 1 at a threshold of r² > 0.8 as evaluated by the TagZilla (http://tagzilla.nci.nih.gov/) program.

Quality control

Initial Assessment of Call Rates

A total of 561,494 SNP genotype assays were attempted on 2,540 DNA samples in two phases, Phase 1A (HumanHap300) and Phase 1B (HumanHap240) (see Table 3). If the completion rate for any sample was below 90%, the sample was excluded from further analysis; 18 CEPH samples and 7 PLCO samples were excluded by this criterion. 175 CEPH samples (including 73 duplicate DNAs, among them one quadruplicate set) and 2,340 PLCO samples (including 49 duplicate DNAs) passed. A total of 14,901 SNPs (~2.6% overall) failed to provide accurate genotype results, either because of low locus call rates (<90%) or because they were monomorphic across the study. Subsequent quality control analysis was performed on the remaining 546,593 SNPs.

Table 3. Sample completion rates (%)

                             Typed on      Typed on      Completion rate,   Completion rate,
Phase         Sample subset  HumanHap300   HumanHap240   successful* SNPs   all SNPs
Phase 1A      All            X                           99.731             95.587
Phase 1A      PLCO           X                           99.742             96.534
Phase 1A      CEPH           X                           99.602             96.399
Phase 1B      All                          X             99.850             98.351
Phase 1B      PLCO                         X             99.852             98.801
Phase 1B      CEPH                         X             99.789             98.738
Phase 1A+1B   All            X                           99.448             91.578
Phase 1A+1B   All                          X             99.687             98.637
Phase 1A+1B   All            X             X             99.796             98.222
Phase 1A+1B   PLCO           X                           94.899             98.043
Phase 1A+1B   PLCO                         X             N/A**              N/A**
Phase 1A+1B   PLCO           X             X             99.798             98.224
Phase 1A+1B   CEPH           X                           99.600             96.397
Phase 1A+1B   CEPH                         X             99.702             98.653
Phase 1A+1B   CEPH           X             X             99.589             98.018

* Successful SNPs are those with >90% completion across all attempted samples.
** Samples that failed in Phase 1A were not attempted in Phase 1B.

Overall, 219 of 243 (90%) CEPH samples and 2,243 of 2,340 (96%) PLCO samples generated high-performance genotype calls (>99% completion rate). Figure 1 provides the distribution of the completion rate for the 97 remaining samples. The lowest accepted call rate was 94.24% in Phase 1A and 97.27% in Phase 1B. Overall, the average completion rate for the PLCO samples for 546,593 SNPs is 99.8%. This rate was not significantly different in the controls compared to the combined aggressive and non-aggressive case groups (status at initiation of CGEMS project, t test p=0.30, Kruskal-Wallis p=0.46) and remained non-significant when the two case phenotypes were separated (Kruskal-Wallis p=0.76). No additional subjects were excluded from the association analysis based on completion rate.

Call rate for SNPs

A total of 538,548 (96%) SNPs yielded a call rate higher than 99%. Figure 2 provides the distribution of the call rate for the 22,946 remaining SNPs. The lowest call rate retained for our analysis was 89%.

Concordance rate

The genotype concordance rate for SNP assays was evaluated based on three comparisons:

1. Genotypes generated on CEPH DNA. Of the CEPH DNAs, 77 had been provided twice as separate aliquots from the same DNA preparation, then genotyped and evaluated for genotype concordance. An average of 305,400 genotype comparisons were performed for each DNA pair. Between 0 and 1,342 discordances (average 56) were observed within each pair comparison, yielding a discordance rate of 1.85 × 10⁻⁴.

2. Genotypes generated on duplicate DNA from PLCO. From PLCO, 56 pairs of control DNA samples were provided as separate aliquots from the same DNA preparation and analyzed separately; the pairs performed comparably, thereby providing reliable data. The DNA samples were selected randomly. Analysis of the discrepancies within these pairs revealed results similar to those for the CEPH DNA duplicates. An average discordance rate of 1.88 × 10⁻⁴ was observed.

3. Comparison of CGEMS and HapMap genotypes. 28 samples genotyped in CGEMS were also genotyped by the International HapMap Consortium [2,3]. The discordance rate between the CGEMS Phase 1A and HapMap genotypes for these individuals was 1.7 × 10⁻³, which is an order of magnitude higher than that observed for PLCO duplicates in the CGEMS project. Notably, the International HapMap Consortium received its DNA from the Coriell Institute (Camden, New Jersey) and CGEMS received DNA from the CEPH. Although the DNAs were obtained from cell lines derived from the same individual, they were extracted from cell lines propagated at different times and locations and processed independently. The distribution of discordances was as follows: 94% were between homozygote and heterozygote calls and only 6% were homozygote/homozygote discordances.

Hardy-Weinberg Proportions in control DNA

Genotype data for all autosomal SNPs were tested in the PLCO control group for deviation from Hardy-Weinberg proportions [12]. Significant p values (p < 0.05) were observed for 30,887 SNPs (5.8%).

Final sample and SNP selection for association analysis

Three DNA samples revealed a large number of heterozygous loci on the X chromosome, suggesting sample mix-up. In addition, three pairs of samples were found to be unexpected duplicates. These 9 subjects were removed from subsequent analysis. Thus, association analysis was performed on the final set of 2,282 subjects described in Table 4.

Table 4. Final set of PLCO samples analyzed for association

                                              Number of times selected as a control
Status at initiation of CGEMS project         0       1       2    3    Total
Never developed prostate cancer               0       1,082   22   1    1,105
Diagnosed with non-aggressive cancer          461     26      1    0    488
Diagnosed with aggressive cancer              673     16      0    0    689
All subjects                                  1,134   1,124   23   1    2,282

Association Analysis

The primary analysis of the CGEMS prostate GWAS explores the association between single SNPs and prostate cancer susceptibility in 561,494 SNPs per subject. By maximizing genome coverage for a given number of SNPs, we increase the opportunity to pursue different working hypotheses and different regions of interest now and in the future. Similarly, our agnostic approach to the analysis does not take gene function or prior information in prostate or other cancers into consideration. The analytic approach assumes no structure to the risk across the 3 possible genotypes at each locus. This approach maintains power to detect recessive or over-dominant alleles at the cost of a small decrease in power relative to an Armitage trend test for the detection of alleles with a multiplicative effect.
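The trade-off between a genotype-based test and an Armitage trend test can be illustrated with a minimal sketch in Python. This is not the CGEMS analysis code: the counts are invented for illustration, the function names are hypothetical, and only the standard library is used. For a 2-by-3 table of case status by genotype, the sketch computes the 2-degree-of-freedom Pearson statistic and the 1-degree-of-freedom Cochran-Armitage trend statistic with additive scores (0, 1, 2).

```python
import math

def genotype_test(cases, controls):
    """Pearson chi-squared test (2 df) on a 2-by-3 table of genotype counts.

    Assumes all three genotype classes are observed at least once.
    """
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    stat = 0.0
    for r, s in zip(cases, controls):
        col = r + s
        for observed, row_total in ((r, n_cases), (s, n_controls)):
            expected = row_total * col / n
            stat += (observed - expected) ** 2 / expected
    # For 2 df the chi-squared upper tail has the closed form exp(-x/2).
    return stat, math.exp(-stat / 2.0)

def trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test (1 df) with the given genotype scores."""
    n_cases = sum(cases)
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    sw_r = sum(w * r for w, r in zip(scores, cases))
    sw_n = sum(w * t for w, t in zip(scores, totals))
    sw2_n = sum(w * w * t for w, t in zip(scores, totals))
    numerator = n * (n * sw_r - n_cases * sw_n) ** 2
    denominator = n_cases * (n - n_cases) * (n * sw2_n - sw_n ** 2)
    stat = numerator / denominator
    # For 1 df the chi-squared upper tail is erfc(sqrt(x/2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Invented over-dominant pattern: heterozygotes are enriched among cases.
cases = [300, 500, 200]      # genotype counts AA, AB, BB (hypothetical)
controls = [350, 400, 250]
geno_stat, geno_p = genotype_test(cases, controls)
trend_stat, trend_p = trend_test(cases, controls)
print(f"genotype test: chi2={geno_stat:.2f}, p={geno_p:.2e}")
print(f"trend test:    chi2={trend_stat:.2f}, p={trend_p:.2e}")
```

With these invented counts the trend statistic is zero while the genotype test is strongly significant, which is the kind of over-dominant signal the genotype-based approach is meant to retain.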
Prostate cancer stage and grade at diagnosis are important predictors of survival; they may also have different etiologic factors. Therefore, we distinguish between non-aggressive and aggressive prostate cancer in the analysis. Essentially, our analysis combines the effects observed when the two case phenotypes are examined separately. Our analysis has power to identify susceptibility loci specific to aggressive or non-aggressive prostate cancer, at a small cost in power for loci with the same odds ratio in aggressive and non-aggressive cases.

Analytic approaches

We present results from two distinct analytic approaches. The first scheme is more frequently used in case-control studies. The second scheme takes full advantage of the prospective nature of the PLCO cohort and the power from incidence-density sampling.

Analysis with single selection

This scheme, which will be more familiar to non-epidemiologists, does not account for the dynamic nature of the cohort. Genotypes of individuals who have been selected as a case in the relevant phenotype case group are counted once as a case and never as a control. Individuals who have been selected several times as controls but had not developed prostate cancer during follow-up are counted only once in the control group.

Analysis accounting for incidence-density sampling

An analysis of controls selected for cases identified in a cohort that accounts for the dynamic nature of the cohort, including development of disease during follow-up and the timing of entry to and exit from follow-up, may have more power to detect an association than the single-selection method. The main feature of incidence-density sampling, as used for control selection here, is that controls are selected independently for each case among those who are at risk at the time of the diagnosis of the case; i.e., among those who would have become a case in the study had they developed disease at the same time. Inclusion as a control for a given case set is independent of future diagnosis as a case, of selection as a control for other case sets, and of entry and exit times. Thus, an individual may be included both as a case and as a control. The genotypes of a man who has been selected multiple times are taken into account each time he is selected; his covariates that vary with time, such as age, are defined differently each time, depending on the characteristics of the case set for which he was selected as a control [13].
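The control-selection rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually run for CGEMS: the `Man` record, its field names, the `sample_controls` function and the toy cohort are all hypothetical, and sampling within a risk set is done without replacement using a seeded generator. Case sets are keyed by year of entry, age-at-entry interval and follow-up year of diagnosis, and each risk set is drawn independently of the others.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Man:
    man_id: int
    entry_fy: int                   # fiscal year of entry into the cohort
    age_group: str                  # five-year age-at-entry interval
    years_observed: int             # years of follow-up available
    dx_year: Optional[int] = None   # follow-up year of diagnosis, if any

def sample_controls(cohort, cases, seed=0):
    """Draw controls for each case set independently of all other sets."""
    rng = random.Random(seed)
    case_sets = defaultdict(list)
    for case in cases:
        case_sets[(case.entry_fy, case.age_group, case.dx_year)].append(case)
    selections = {}
    for key in sorted(case_sets):
        entry_fy, age_group, t = key
        # Risk set: same entry year and age interval, observed through
        # follow-up year t with no prostate cancer diagnosis by then.
        risk_set = [m for m in cohort
                    if m.entry_fy == entry_fy
                    and m.age_group == age_group
                    and m.years_observed >= t
                    and (m.dx_year is None or m.dx_year > t)]
        k = min(len(case_sets[key]), len(risk_set))
        selections[key] = [m.man_id for m in rng.sample(risk_set, k)]
    return selections

# Toy cohort (invented): man 2 is a future case and remains eligible as a
# control for the case set of man 1, who was diagnosed earlier.
cohort = [
    Man(1, 1994, "55-59", years_observed=2, dx_year=2),
    Man(2, 1994, "55-59", years_observed=5, dx_year=5),
    Man(3, 1994, "55-59", years_observed=8),
    Man(4, 1994, "60-64", years_observed=8),
]
cases = [m for m in cohort if m.dx_year is not None]
selections = sample_controls(cohort, cases, seed=1)
print(selections)
```

Because each risk set is sampled on its own, the same man can appear in several entries of `selections`, and a man who is later diagnosed can still serve as a control for an earlier case set, as the text describes.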
Genotypes

In order to maintain high power to detect SNPs that act through non-multiplicative models (such as complete recessivity or over-dominance), we provide analyses of the data based on genotype frequencies. Each of the three possible genotype states is considered separately. Accordingly, for autosomal loci, the analysis uses a statistical test with two degrees of freedom for each case phenotype considered separately (the aggressive and non-aggressive phenotypes together yield 4 degrees of freedom with 3 genotypes). For tests involving X-linked loci, a single degree of freedom is used per case phenotype.

Single SNP statistics

In order to expedite public access to the data, the first-pass analysis of the CGEMS data aims at detecting association of single SNPs with prostate cancer susceptibility. Multi-SNP approaches, such as haplotype association, have not yet been performed.

Statistical tests

We performed four sets of analyses. For each test, the analysis included 561,494 SNPs, 491 cases diagnosed with non-aggressive tumors and 691 cases diagnosed with aggressive tumors. The characteristics of the four tests are:

1. Single selection, unadjusted score test
   1,105 controls that were not diagnosed with prostate cancer at the start of follow-up.
   A 3-by-3 contingency table of genotypes by phenotypes was constructed.
   No adjustment for covariates.
   The p-value from the standard test of independence was computed from the 3-by-3 contingency table, based on a chi-squared test with up to 4 degrees of freedom.

2. Single selection, adjusted score test
   1,105 controls that were not diagnosed with prostate cancer at the start of follow-up.
   Polytomous logistic regression was performed.
      o Two case phenotypes with a common set of controls;
      o The regression variable was a two-indicator variable for genotype.
   Adjustment for
      o Age group at randomization (4 groups),
      o Region of recruitment (9 regions), and
      o Indicator variable for cases diagnosed within one year of entry to the trial.
   The p-value was obtained from a score test with up to 4 degrees of freedom.

3. Incidence-density sampling, unadjusted score test
   1,174 controls selected using an incidence-density sampling strategy.
   A 3-by-3 contingency table of genotypes by phenotypes was constructed.
   No adjustment for covariates.
   The p-value from the standard test of independence was computed from this table, based on a chi-squared test with 4 degrees of freedom (or fewer if there were empty cells).

4. Incidence-density sampling, adjusted score test
   1,174 controls selected using an incidence-density sampling strategy.
   Polytomous logistic regression was performed.
      o Two case phenotypes with a common set of controls;
      o The regression variable was a two-indicator variable for genotype.
   Adjustment for
      o Age group at randomization (4 groups),
      o Region of recruitment (9 regions), and
      o 0-1 variable that indicates cases diagnosed within one year of entry to the trial.
   The p-value was obtained from a score test with up to 4 degrees of freedom.
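The unadjusted tests (1 and 3) can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for CGEMS: the function names and the example counts are invented, and the handling of "empty cells" is an assumption, namely that a genotype column or phenotype row with no observations is dropped, which reduces the degrees of freedom below 4.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1   # closed form for df = 1
    else:
        q, k = math.exp(-half), 2              # closed form for df = 2
    # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q

def independence_test(table):
    """Pearson test on table[i][j]: phenotype row i by genotype column j."""
    rows = [r for r in table if sum(r) > 0]
    keep = [j for j in range(len(table[0])) if sum(r[j] for r in rows) > 0]
    rows = [[r[j] for j in keep] for r in rows]   # assumption: drop empty rows/columns
    n = sum(sum(r) for r in rows)
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(r[j] for r in rows) for j in range(len(keep))]
    stat = 0.0
    for i, r in enumerate(rows):
        for j, observed in enumerate(r):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(keep) - 1)
    p_value = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, df, p_value

# Invented counts: rows are controls, non-aggressive cases, aggressive cases.
stat, df, p_value = independence_test([[10, 20, 30], [20, 20, 20], [30, 20, 10]])
```

With all three genotypes observed the test has 4 degrees of freedom; for a SNP with no homozygous variant carriers the same function returns a 2-degree-of-freedom test.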

4. Follow-up Genotype Analysis

Genotype analysis with TaqMan™ assays for rs1442795, rs7837688 and rs6983267 was performed on samples at the Core Genotyping Facility, except that, for rs1442975 only, the CPS-II (ACS) and HPFS samples were genotyped at the Harvard School of Public Health. The same assay with comparable conditions was used. The primers and probes for the assays are listed on the SNP500Cancer website (http://snp500cancer.nci.nih.gov/). All assays were analyzed on ABI 7900HT units with commercial software. Genotype assays were optimized with bi-directional sequence analysis of 102 subjects in the SNP500Cancer Program.

5. Statistical Analysis of Follow-up Studies

For the follow-up replication studies, all one- and two-SNP analyses were conducted using unconditional logistic regression, adjusted for age (in five-year categories), study, and, for the two studies in which center of recruitment was available (i.e., PLCO and the French Prostate Case Control study), center. The odds ratio for a single SNP was modeled codominantly (no constraints on the odds ratio $OR_1$ for heterozygotes or the odds ratio $OR_2$ for homozygous carriers of the risk allele), multiplicatively on the odds scale ($OR_2 = OR_1^2$), and additively on the absolute risk scale ($OR_2 = 2\,OR_1 - 1$). Similarly, the joint effect of two SNPs was modeled multiplicatively (the odds ratio for genotype $i$ at the first locus and genotype $j$ at the second is $OR_{ij} = OR_{i0} \times OR_{0j}$, where 0 denotes the reference genotype), additively on the absolute risk scale ($OR_{ij} = OR_{i0} + OR_{0j} - 1$), and generally ($OR_{ij}$ unconstrained, except for $OR_{00} = 1$). Goodness-of-fit was assessed using likelihood ratio tests for nested models and the Akaike Information and Bayesian Information Criteria for non-nested models. We analyzed each study separately, the pooled replication studies, and all studies combined. We used the incidence-density sampling approach (described above) for PLCO.

6. Detection of Recombination Hotspots

We used SequenceLDhot 14 to search for recombination hotspots. This program uses an approximate marginal likelihood 15 to calculate likelihood ratio (LR) statistics for the presence of a hotspot at a series of putative hotspot positions across the region of interest. We analyzed subsets of 50 controls and, as recommended 14, in each case we used PHASE v2.1 16,17 (run using the MR flag) to infer the haplotypes and background recombination rates. We searched for 2 kb hotspots across a grid that allowed for hotspots starting every 1 kb. To test the robustness of our results, we repeated the analysis for five different subsets of 50 controls.
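The search grid described above (2 kb windows whose start positions are 1 kb apart) can be written down explicitly. The short Python sketch below only enumerates candidate windows; the function name and the region coordinates are hypothetical, and it does not reproduce anything SequenceLDhot does internally.

```python
def hotspot_windows(region_start, region_end, width=2000, step=1000):
    """Return (start, end) pairs of candidate hotspots lying inside the region."""
    windows = []
    start = region_start
    while start + width <= region_end:
        windows.append((start, start + width))
        start += step
    return windows

# Made-up 5 kb region, in base pairs relative to its left edge.
windows = hotspot_windows(0, 5000)
```

Adjacent candidate windows overlap by 1 kb, so a true hotspot will typically raise the LR statistic at more than one grid position.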

In each of the five cases we found substantial evidence for a hotspot (LR values between 23 and 37 across the five different samples), with the most likely position lying between rs7841264 and rs1447293, at positions 128535996 and 128541502. After a crude correction for multiple testing, the p-values for a hotspot lie between $5 \times 10^{-5}$ and $4 \times 10^{-8}$ for the five different samples. To infer the strength of the hotspot we used PHASE v2.1 under the -mr2 flag, conditioning on the detected 2 kb hotspot between rs7841264 and rs1447293 18. This program estimates both the background recombination rate and the recombination rate within the hotspot. Across the five different subsets of 50 controls we obtained estimates of the population-scaled recombination rate of 0.24 per kb for the background rate (95% confidence interval 0.14-0.34) and 130 per kb for the hotspot (95% confidence interval 50-270).

Appendix for Statistical Analysis of CGEMS GWAS in PLCO

COMPUTATION OF 4 D.F. SCORE TEST FOR CGEMS ANALYSIS

The association of each SNP with advanced and non-advanced prostate cancer was tested using a 4 d.f. score test based on a polytomous logistic regression model 19. If $Y = 0$, $Y = 1$ and $Y = 2$ denote controls, non-advanced and advanced prostate cancer cases, respectively, we specify the probability of observing a phenotype category as a function of the genotype data ($G$) and a set of co-factors ($Z$) as

\[
\begin{aligned}
\Pr(Y = 0 \mid G, Z) &= K \\
\Pr(Y = 1 \mid G, Z) &= K \exp\{\alpha_1 + \gamma_1 Z + \beta_{11} I(G = 1) + \beta_{12} I(G = 2)\} \\
\Pr(Y = 2 \mid G, Z) &= K \exp\{\alpha_2 + \gamma_2 Z + \beta_{21} I(G = 1) + \beta_{22} I(G = 2)\}
\end{aligned}
\tag{0.1}
\]

where $K$ denotes a normalizing constant, and $I(G = 1)$ and $I(G = 2)$ denote indicator variables for the heterozygous and homozygous variant genotypes for a given SNP. In (0.1), the parameters $\beta_{11}$ and $\beta_{12}$ denote the log-odds-ratios associated with the heterozygous and homozygous variant genotypes for non-advanced cases, and $\beta_{21}$ and $\beta_{22}$ denote those for advanced cases.
The null hypothesis of interest is

\[
H_0 : \beta_{11} = \beta_{12} = \beta_{21} = \beta_{22} = 0,
\]

i.e., that carrying 1 or 2 variant alleles is not associated with aggressive or non-aggressive tumors. The parameters $\alpha_1$ and $\alpha_2$ determine the baseline probability of non-advanced and advanced prostate cancer for subjects with the homozygous wild-type genotype ($G = 0$) and reference values of the covariates. The parameters $\gamma_1$ and $\gamma_2$ denote the log-odds-ratios associated with the covariates ($Z$) for non-advanced and advanced cases, respectively.
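To make the role of the normalizing constant and of the four genotype parameters concrete, the following Python sketch evaluates the three category probabilities of model (0.1) for one subject. It is an illustration under stated assumptions: the function name, the argument layout and the parameter values are invented, and the covariate vector is treated as a plain list of numbers.

```python
import math

def polytomous_probs(g, z, alpha, gamma, beta):
    """Return [Pr(Y=0), Pr(Y=1), Pr(Y=2)] given genotype g (0, 1 or 2) and covariates z.

    alpha[i], gamma[i] and beta[i] = (beta_i1, beta_i2) hold the parameters for
    non-advanced cases (i = 0 in the lists) and advanced cases (i = 1 in the lists).
    """
    linear = []
    for a, coefs, (b_het, b_hom) in zip(alpha, gamma, beta):
        eta = a + sum(c * x for c, x in zip(coefs, z))
        eta += b_het * (g == 1) + b_hom * (g == 2)   # genotype indicators
        linear.append(eta)
    k = 1.0 / (1.0 + sum(math.exp(e) for e in linear))  # normalizing constant K
    return [k] + [k * math.exp(e) for e in linear]

# Invented values: a heterozygote with odds ratio 2 for non-advanced disease only.
probs = polytomous_probs(g=1, z=[0.0], alpha=[0.0, 0.0], gamma=[[0.0], [0.0]],
                         beta=[(math.log(2), math.log(4)), (0.0, 0.0)])
```

Under the null hypothesis all four genotype parameters are zero, so the returned probabilities no longer depend on the genotype; this is the property the score test exploits, since only the null model has to be fitted.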

We implemented an Iteratively Reweighted Least Squares (IRLS) algorithm 20 for obtaining the maximum-likelihood estimates of the parameters from model (0.1).

Steps for score-test calculations

1) Obtain estimates of covariate effects under the null
Estimate $\theta_1 = (\alpha_1, \gamma_1)$ and $\theta_2 = (\alpha_2, \gamma_2)$ using the IRLS algorithm, setting the design matrix to $X = [1 \;\; Z]$.

2) Compute the score vector
The score function $S_{ij}$ for $\beta_{ij}$ is given by

\[
S_{ij} = \sum_{m=1}^{N_0 + N_1} I(G_m = j)\,\{ I(Y_m = i) - p_i(Z_m) \},
\]

where $p_i(Z) = \Pr_{\hat{\theta}_i}(Y = i \mid Z)$ is computed using the estimate of $\theta_i$ obtained from step (1). Alternatively, the score can be written in terms of cell frequencies (as opposed to individual-level data) as follows. Suppose the covariate $Z$ defines a total of $K$ strata. Let $n_{ijk}$ and $N_{jk}$ be the numbers of subjects in the data with $(Y = i, G = j, Z = z_k)$ and $(G = j, Z = z_k)$, respectively. Then the score $S_{ij}$ can be written as

\[
S_{ij} = \sum_{k=1}^{K} \{ n_{ijk} - N_{jk}\, p_i(z_k) \},
\]

which has the usual (O-E) (difference between observed and expected) form. Define the score vector $S = (S_{11}, S_{12}, S_{21}, S_{22})$. Next, we need the variance of the score vector.

3) Variance calculation
Steps for variance calculations:
3.1) Define the design matrix $X = [1 \;\; Z \;\; I(G = 1) \;\; I(G = 2)]$.
3.2) Based on this design matrix and the parameter values $(\theta_1 = \hat{\theta}_1, \beta_{11} = 0, \beta_{12} = 0, \theta_2 = \hat{\theta}_2, \beta_{21} = 0, \beta_{22} = 0)$, compute the information matrix, which can be obtained as part of the IRLS algorithm.

3.3) Get $V = I^{-1}$ and extract from $V$ the 4-by-4 sub-matrix, say $V_{\beta\beta}$, that corresponds to the rows and columns for the four $\beta_{ij}$ parameters.

4) The final test statistic is now given by

\[
T = S^{\top} V_{\beta\beta}\, S.
\]

Handling missing data on genotypes

The null model in step (1) should be fitted using all subjects that have covariate data $Z$, irrespective of whether those subjects have missing genotypes. Note that, under the null, this gives the statistically most efficient estimate of the covariate effect parameters $\theta_1$ and $\theta_2$. It is also computationally very efficient, as the null polytomous regression model has to be fitted only once. Alternatively, for computing the test of association for a particular SNP, one can estimate the parameters of the null model using only the data from those subjects who have complete genotype data for that SNP, but this will require fitting up to 550,000 different null polytomous models. Once $\theta_1$ and $\theta_2$ are estimated using the full data, the score-vector calculation for a particular SNP remains as above, except that the sums now involve only those subjects who have complete genotype data on that particular SNP. The information matrix ($I$) calculation for a particular SNP similarly involves only those subjects with complete genotype data for that SNP, except that the sub-matrix of $I$ that corresponds to the parameters $\theta_1$ and $\theta_2$ should be computed based on all the subjects that went into the estimation of these parameters. This sub-matrix is simply given by the information matrix computed in step (1).
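The cell-frequency (observed minus expected) form of the score in step 2 can be illustrated with a short Python sketch. It covers only the score vector, not the variance in step 3; the function name, the data layout and the counts are hypothetical, and the null probabilities $p_i(z_k)$ are assumed to have been estimated beforehand as in step (1). Consistent with the missing-data rule above, the counts would include only subjects genotyped for the SNP in question, while the null probabilities come from the fit on all subjects.

```python
def score_vector(counts, p_null):
    """Return [S11, S12, S21, S22] from stratified cell counts.

    counts[k][i][j]: subjects in stratum k with phenotype i (0, 1, 2)
                     and genotype j (0, 1, 2).
    p_null[k][i]:    null-model probability of phenotype i in stratum k.
    """
    score = []
    for i in (1, 2):          # non-advanced, advanced cases
        for j in (1, 2):      # heterozygous, homozygous variant
            total = 0.0
            for cells, probs in zip(counts, p_null):
                n_jk = sum(cells[y][j] for y in range(3))   # N_jk
                total += cells[i][j] - n_jk * probs[i]      # observed - expected
            score.append(total)
    return score

# One invented stratum: rows are phenotypes 0-2, columns are genotypes 0-2.
counts = [[[50, 40, 10], [20, 35, 5], [30, 5, 5]]]
p_null = [[0.5, 0.3, 0.2]]   # phenotype proportions in the same stratum
S = score_vector(counts, p_null)
```

In this example heterozygotes are over-represented among non-advanced cases (35 observed against 24 expected) and under-represented among advanced cases, which shows up as scores of opposite sign for $S_{11}$ and $S_{21}$.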

Reference List

1. Barrett, J.C. & Cardon, L.R. Evaluating coverage of genome-wide association studies. Nat. Genet. 38, 659-662 (2006).
2. The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299-1320 (2005).
3. The International HapMap Consortium. The International HapMap Project. Nature 426, 789-796 (2003).
4. Gohagan, J.K., Prorok, P.C., Hayes, R.B. & Kramer, B.S. The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial of the National Cancer Institute: history, organization, and status. Control Clin. Trials 21, 251S-272S (2000).
5. Hayes, R.B. et al. Etiologic and early marker studies in the prostate, lung, colorectal and ovarian (PLCO) cancer screening trial. Control Clin. Trials 21, 349S-355S (2000).
6. Hayes, R.B. et al. Methods for etiologic and early marker investigations in the PLCO trial. Mutat. Res. 592, 147-154 (2005).
7. Haque, K.A. et al. Performance of high-throughput DNA quantification methods. BMC Biotechnol. 3, 20 (2003).
8. Gunderson, K.L. et al. Whole-genome genotyping. Methods Enzymol. 410, 359-376 (2006).
9. Gunderson, K.L. et al. Whole-genome genotyping of haplotype tag single nucleotide polymorphisms. Pharmacogenomics 7, 641-648 (2006).
10. Steemers, F.J. & Gunderson, K.L. Illumina, Inc. Pharmacogenomics 6, 777-782 (2005).
11. Carlson, C.S. et al. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet. 74, 106-120 (2004).
12. Wigginton, J.E., Cutler, D.J. & Abecasis, G.R. A note on exact tests of Hardy-Weinberg equilibrium. Am. J. Hum. Genet. 76, 887-893 (2005).
13. Wacholder, S., Silverman, D.T., McLaughlin, J.K. & Mandel, J.S. Selection of controls in case-control studies. III. Design options. Am. J. Epidemiol. 135, 1042-1050 (1992).
14. Fearnhead, P. SequenceLDhot: detecting recombination hotspots. Bioinformatics 22, 3061-3066 (2006).
15. Fearnhead, P. & Donnelly, P. Approximate likelihood methods for estimating local recombination rates. Journal of the Royal Statistical Society Series B-Statistical Methodology 64, 657-680 (2002).
16. Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213-2233 (2003).
17. Stephens, M. & Donnelly, P. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet. 73, 1162-1169 (2003).
18. Fearnhead, P. & Smith, N.G. A novel method with improved power to detect recombination hotspots from polymorphism data reveals multiple hotspots in human genes. Am. J. Hum. Genet. 77, 781-794 (2005).
19. Hosmer, D.W. & Lemeshow, S. Applied Logistic Regression. John Wiley and Sons, Hoboken, NJ (2000).
20. McCullagh, P. & Nelder, J.A. Generalized Linear Models. Chapman and Hall, London (1989).