Identifying Susceptibility Genes for Familial Pancreatic Cancer Using Novel High-Resolution Genome Interrogation Platforms


by

Wigdan Ridha Al-Sukhni

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Institute of Medical Science
University of Toronto

Copyright by Wigdan Ridha Al-Sukhni 2012

Abstract

Familial Pancreatic Cancer (FPC) is a cancer syndrome characterized by clustering of pancreatic cancer in families, but most FPC cases do not have a known genetic etiology. Understanding genetic predisposition to pancreatic cancer is important for improving screening as well as treatment. The central aim of this thesis is to identify candidate susceptibility genes for FPC, and I used three approaches of increasing resolution.

First, based on a candidate-gene approach, I hypothesized that BRCA1 is inactivated by loss-of-heterozygosity (LOH) in pancreatic adenocarcinoma of germline mutation carriers. I demonstrated that 5/7 pancreatic tumors from BRCA1-mutation carriers show LOH, compared to only 1/9 sporadic tumors, suggesting that BRCA1 inactivation is involved in tumorigenesis in germline mutation carriers.

Second, I hypothesized that the germline genomes of FPC subjects differ in copy-number profile from healthy genomes, and that regions affected by rare deletions or duplications in FPC subjects overlap candidate tumor-suppressors or oncogenes. I found no significant difference in the global copy-number profile of FPC and control genomes, but I identified 93 copy-number variable genomic regions unique to FPC subjects, overlapping 88 genes of which several have functional roles in cancer development. I investigated one duplication to sequence the breakpoints, but I found that this duplication did not segregate with disease in the affected family.

Third, I hypothesized that in a family with multiple pancreatic cancer patients, genes containing rare variants shared by the affected members constitute susceptibility genes. Using next-generation sequencing to capture most bases in coding regions of the genome, I interrogated the germline exome of three relatives who died of pancreatic cancer and a relative who is healthy at advanced age. I identified a short-list of nine candidate genes with unreported mutations shared by the three affected relatives and absent in the unaffected relative, of which a few had functional relevance to tumorigenesis. I performed Sanger sequencing to screen an unrelated cohort of approximately 70 FPC patients for mutations in the top two candidate genes, but I found no additional rare variants in those genes.

In conclusion, I present a list of candidate FPC susceptibility genes for further validation and investigation in future studies.
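As a rough illustration of how the LOH comparison quoted in the abstract (5 of 7 carrier tumors versus 1 of 9 sporadic tumors) could be checked, the sketch below runs a two-sided Fisher's exact test on those counts. The choice of Fisher's exact test is an assumption made here for illustration; this section does not say which test the thesis used, and the function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Hypothetical helper for illustration; not taken from the thesis.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    # Range of values the top-left cell can take with the margins held fixed.
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Unnormalised hypergeometric weights, kept as exact integers.
    weights = {
        k: comb(row1, k) * comb(row2, col1 - k)
        for k in range(k_min, k_max + 1)
    }
    observed = weights[a]
    # Sum every table that is at least as unlikely as the observed one.
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / comb(total, col1)


# Counts from the abstract: LOH in 5 of 7 carrier tumors and 1 of 9 sporadic tumors.
p_value = fisher_exact_two_sided(5, 2, 1, 8)
print(f"two-sided p = {p_value:.4f}")
```

With these counts the two-sided p-value works out to 280/8008, or about 0.035. With samples this small the result is only indicative.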

Acknowledgments

My research would not have been possible without the contribution of the following individuals: A. Borgida, S. Holter, H. Rothenmund, and K. Smith at the Ontario Pancreas Cancer Study and Ontario Familial Gastrointestinal Cancer Registry for patient recruitment and selection; T. Selander of the Samuel Lunenfeld Research Institute Biospecimen Repository for DNA extraction; S. Joe (Gallinger Lab) for script-writing; N. Zwingerman, A. Gropper, and S. Moore (Gallinger Lab) for assistance with qPCR; A. Lionel (Scherer Lab) for computational analysis of Affy6.0 data on Birdsuite and iPattern; Q. Trinh (McPherson Lab) for computational analysis of exome data; R. Grant (Gallinger Lab) for assistance with exome data interpretation; H. Kim and T. McPherson (Gallinger Lab) for assistance with PCR and Sanger validation of exome variants; K. Hay, J. Keating, and S. Levitt (Gallinger Lab) for administrative support; J. McPherson (Ontario Institute for Cancer Research) for exome sequencing data; and C. Marshall, D. Pinto, D. Merico (The Centre for Applied Genomics), A. Shlien and D. Malkin (Malkin Lab) for their advice on my data analysis and manuscript preparations.

My sincere gratitude to the Pancreatic Cancer Genetic Epidemiology Consortium (PACGENE) (PI - G. Petersen, Mayo) for being an invaluable source of DNA samples and insight into pancreatic cancer genetics.

I am very grateful to my Program Advisory Committee (Gary Bader, Steven Narod, Stephen Scherer) for their insightful feedback and advice throughout the five years of my PhD. In particular, their thoughtful review of my manuscripts and thesis was most helpful and deeply appreciated.

To my supervisor, Steve Gallinger: I cannot adequately thank you in this crowded page for all that your mentorship has meant to me since I first met you seven years ago. You pushed me when I needed pushing and supported me when I was afraid of falling. You listened patiently to my complaints. You cared about my success. I will always appreciate your open-mindedness, your integrity, and your compassion. I feel most fortunate that I am able to call you my mentor and friend. Thank you for everything.

A special thank you to M. Crump for helping me maneuver around some unexpected bumps in the road of my PhD, and for exemplifying the compassionate clinician.

I dedicate this thesis to my beautiful family:

To Mama and Baba: Your love for me has been the greatest gift and blessing in my life; it is the reason for who I am today. Thank you for supporting my aspirations even when you did not always understand where they were taking me.

To Eisar, Mayce, Mohammed, and Bann: Thank you for putting up with me on my worst days. I am proud of you all.

To my aunts, uncles, and cousins in Iraq and elsewhere: Thank you for keeping me alive in your hearts despite the long years and oceans separating us. You inspire me.

I am grateful for the financial support received from the CIHR Vanier Doctoral Research Award, Lustgarten grant, Invest-in-Research grant from Princess Margaret Hospital, Canadian Society for Surgical Oncology grant, Johnson & Johnson research award, American HepatoPancreaticoBiliary Association grant, and the Department of Surgery at the University of Toronto.

Table of Contents

Abstract
Acknowledgments
List of Tables
List of Figures
List of Appendices
Abbreviations
Chapter 1 Literature Review
    Pancreatic Cancer
    Copy Number Variation
    Whole-Exome Sequencing
Chapter 2 Loss of Heterozygosity at BRCA1 Locus in Pancreatic Adenocarcinoma
    Abstract
    Introduction
    Materials & Methods
    Results
    Discussion
Chapter 3 Germline Genomic Copy Number Variation in Familial Pancreatic Cancer
    Abstract
    Introduction
    Materials & Methods
    Results
    Discussion
Chapter 4 Exome Sequencing in a Familial Pancreatic Cancer Kindred
    Abstract
    Introduction
    Materials & Methods
    Results
    Discussion
Chapter 5 General Discussion, Conclusions, and Future Directions
References
Appendices

List of Tables

Table 1 Studies estimating risk of pancreatic adenocarcinoma in relatives of affected patients
Table 2 Summary of published studies reporting germline genomic copy-number variation in nondisease samples
Table 3 Studies using exome-sequencing to identify genetic cause of disease
Table 4 Characteristics of BRCA1 mutation carriers and sporadic pancreatic cancer patients
Table 5 Pedigree summary for BRCA1 mutation carriers
Table 6 LOH results for BRCA1 mutation carriers and sporadic pancreatic cancer cases
Table 7 Proportion of high-confidence losses in cases and controls
Table 8 Proportion of high-confidence gains in cases and controls
Table 9 CNVs called by each of Birdsuite and iPattern in 36 samples on Affymetrix 6.0 array
Table 10 High confidence CNV profile of cases vs. controls (excluding EBV-derived samples and excluding controls with data from only one chip)
Table 11 FPC specific CNVs
Table 12 Genes whose coding regions are affected by FPC-specific CNVs
Table 13 Summary of raw sequence data from Illumina GAII for each subject
Table 14 Sanger validation data for selected SNVs in each exome subject
Table 15 Sanger validation data for selected indels in each exome subject
Table 16 Number of variants identified in each exome subject
Table 17 Genes containing variants identified by filtration model #1, 2, 3, and/or 4
Table 18 Additional candidate variants in untranslated regions shared by exome subjects

List of Figures

Figure 1 Location of BRCA1 microsatellite markers on chromosome 17
Figure 2 Sample electropherogram of microsatellite marker fragment analysis
Figure 3 Three representative matched-pair electropherograms for microsatellite LOH
Figure 4 Representative sequencing result for an individual with 5382insC germline BRCA1 mutation
Figure 5 Analysis of 500K arrays in FPC cases and controls
Figure 6 Criteria for merging CNVs
Figure 7 CNV prioritization plan
Figure 8 Gains and losses identified in FPC cases by each algorithm/chip
Figure 9 Gains and losses identified in controls by each algorithm/chip
Figure 10 Duplications overlapping TGFBR3 gene
Figure 11 Pedigree of case ID-203, indicating results of qPCR testing for duplication G_97
Figure 12 Fine-mapping the breakpoint of duplication overlapping TGFBR3 using qPCR walk-along method
Figure 13 PCR gel demonstrating amplification of ~1.5-2kb fragment containing G_97 duplication breakpoint in case Id_203
Figure 14 G_97 duplication breakpoint mapping by Sanger sequencing
Figure 15 PCR gel illustrating amplification of test regions and duplication breakpoint in case Id-203 and affected sister
Figure 16 FPC-specific losses and gains on autosomal chromosomes
Figure 17 Pedigree of FPC kindred investigated by exome sequencing
Figure 18 Average coverage of bases in target region of exome per subject
Figure 19 Read-depth per base in target region of exome in each subject
Figure 20 Genome-wide distribution of all SNVs identified in each exome subject
Figure 21 Genome-wide distribution of SNVs excluding synonymous variants in each exome subject
Figure 22 Genome-wide distribution of SNVs not reported in dbSNP131 in each exome subject

List of Appendices

Table S1 Primers for BRCA1 microsatellite markers
Table S2 BRCA1 mutations sequencing primers
Table S3 FPC cases in CNV study
Table S4 Controls (OFCCR and FGICR) in CNV study
Table S5 Primers for qPCR validation of CNVs
Table S6 Primers for qPCR breakpoint mapping of TGFBR3-transecting duplication
Table S7 High- and low-confidence losses on Affy500K array in FPC cases
Table S8 High- and low-confidence gains on Affy500K array in FPC cases
Table S9 High- and low-confidence losses on Affy500K array in controls
Table S10 High- and low-confidence gains on Affy500K array in controls
Table S11 High-confidence CNVs on Affy 6.0 array in FPC cases
Table S12 High-confidence CNVs on Affy 6.0 array in controls
Figure S1 qPCR of region D_180
Figure S2 qPCR of region D_19
Figure S3 qPCR of region D_128
Figure S4 qPCR of region D_152
Figure S5 qPCR of region D_234 (primer A)
Figure S6 qPCR of region D_234 (primer B)
Figure S7 qPCR of region D_143 (primer A)
Figure S8 qPCR of region D_143 (primer B)
Figure S9 qPCR of region D_220
Figure S10 qPCR of region D_30 & D_36
Figure S11 qPCR of region D_40
Figure S12 qPCR of region D_105 (primer A)
Figure S13 qPCR of region D_105 (primer B)
Figure S14 qPCR of region D_83
Figure S15 qPCR of region D_48
Figure S16 qPCR of region D_125
Figure S17 qPCR of region D_134
Figure S18 qPCR of region D_142 (primer A)
Figure S19 qPCR of region D_142 (primer B)
Figure S20 qPCR of region D_56
Figure S21 qPCR of region G_225
Figure S22 qPCR of region G_226
Figure S23 qPCR of region G_365 (primer A)
Figure S24 qPCR of region G_365 (primer B)
Figure S25 qPCR of region G_369
Figure S26 qPCR of region G_380
Figure S27 qPCR of region G_407
Figure S28 qPCR of region G_603/604
Figure S29 qPCR of region G_69
Figure S30 qPCR of region G_88
Figure S31 Region: G_97 (primer A) ID_27
Figure S32 Region: G_97 (primer B) ID_27
Figure S33 Region: G_97 (primer A) ID_203 and family members
Figure S34 Region: G_97 (primer A) ID_203's family members
Figure S35 Region: G_97 (primer A) ID_203 and family members
Figure S36 Region: G_97 (primer A) ID_203's family members
Figure S37 Region: G_97 (primer A) ID_203's family members
Figure S38 Region: G_97 (primer B) ID_203 and family members
Figure S39 T_Out_1 Fine-mapping G_97 breakpoint in Id_203
Figure S40 T_Out_2 Fine-mapping G_97 breakpoint in Id_203
Figure S41 T_Out_3 Fine-mapping G_97 breakpoint in Id_203
Figure S42 T_Out_4 Fine-mapping G_97 breakpoint in Id_203
Figure S43 O_In_2 Fine-mapping G_97 breakpoint in Id_203
Figure S44 O_Out_1 Fine-mapping G_97 breakpoint in Id_203
Figure S45 O_Out_5 Fine-mapping G_97 breakpoint in Id_203

Abbreviations

AD - autosomal dominant
AGTC - Analytical Genetics Technology Centre
AJ - Ashkenazi Jewish
AML - acute myeloid leukemia
AR - autosomal recessive
BAC - bacterial artificial chromosome
BC - breast cancer
CCDS - Collaborative Consensus Coding Sequence
CGH - comparative genomic hybridization
ChIP-seq - chromatin immunoprecipitation sequencing
CIN - chromosomal instability
CNV - copy number variation
Conc - concordant
COSMIC - Catalogue of Somatic Mutations in Cancer
CRC - colorectal cancer
CSI - chromosomal structure instability
ddNTPs - dideoxynucleotide triphosphates
del - deletion
DGV - Database of Genomic Variants
Disc - discordant
EBV - Epstein-Barr virus
FAMMM - familial atypical multiple mole melanoma
FDR - first degree relative
FFPE - formalin-fixed paraffin-embedded
FGICR - familial gastrointestinal cancer registry
FISH - fluorescence in-situ hybridization
FN - false negative
FoSTeS - fork stalling and template switching
FP - false positive
FPC - familial pancreatic cancer
GB - gallbladder
GDB - human genome database
GST - glutathione-S-transferase
GTC - genotyping console
GWAS - genome wide association study
HBOC - hereditary breast and ovarian cancer
Het - heterozygous
HMM - hidden Markov model
Homo - homozygous
HP - hereditary pancreatitis
HR - hazard ratio
ICGC - International Cancer Genome Consortium
IHGSC - International Human Genome Sequencing Consortium
Ins - insertion
IPMN - intraductal papillary mucinous neoplasm
LCL - lymphoblastoid cell lines
LD - linkage disequilibrium
LOD - logarithm of odds
LOH - loss of heterozygosity
MAF - minor allele frequency
MCN - mucinous cystic neoplasm
MEI - mobile element insertion
MLPA - multiplex ligation probe amplification
MMBIR - microhomology-mediated break-induced replication
MSKCC - Memorial Sloan Kettering Cancer Centre
NAHR - nonallelic homologous recombination
NBPF - neuroblastoma breakpoint family
NCBI - National Centre for Biotechnology Information
NFPTR - National Familial Pancreas Tumor Registry
NHEJ - nonhomologous end joining
NIH - National Institute of Health
NGS - next generation sequencing
NK - natural killer cell
nsSNV - nonsynonymous single nucleotide variants
OC - ovarian cancer
OFCCR - Ontario Familial Colon Cancer Registry
OHI - Ottawa Heart Institute
OMIM - Online Mendelian Inheritance in Man
OPCS - Ontario Pancreas Cancer Study
OR - odds ratio
OR genes - olfactory receptor genes
QC - quality control
PACGENE - Pancreatic Cancer Genetic Epidemiology Consortium
PanIN - pancreatic intraepithelial neoplasia
PARP - poly-(ADP-ribose)-polymerase
PC - pancreatic cancer
PCR - polymerase chain reaction
PGFE - pulsed gel field electrophoresis
PJS - Peutz-Jeghers syndrome
qPCR - quantitative polymerase chain reaction
qRT-PCR - quantitative reverse-transcription polymerase chain reaction
ROMA - representational oligonucleotide microarray analysis
RR - relative risk
SDR - second degree relative
SEER - surveillance, epidemiology and end results
SIR - standardized incidence ratio
SNP - single nucleotide polymorphism
SNV - single nucleotide variants
SPC - sporadic pancreatic cancer
TCAG - The Centre for Applied Genomics
TN - true negative
TP - true positive
UCSC - University of California, Santa Cruz
UPD - uniparental disomy
UTR - untranslated region
VNTR - variable nucleotide tandem repeat
WT - wildtype

Chapter 1 - Literature Review

1. Pancreatic Cancer

1.1 Pathology and epidemiology

Pancreatic ductal adenocarcinoma (otherwise known as pancreatic cancer) is a highly lethal invasive epithelial neoplasm with ductal differentiation that obscures the lobular pattern of normal pancreatic parenchyma. Pancreatic cancer grossly appears as a firm, highly sclerotic mass with poorly circumscribed borders. Microscopically, infiltrating gland-forming neoplastic cells are commonly surrounded by non-neoplastic stroma in a characteristically intense desmoplastic reaction, which often results in low tumor cellularity.[1]

Pancreatic cancer is the fourth leading cause of cancer death in North America. The estimated number of incident cases and deaths due to pancreatic cancer in the US in 2010 was 43,140 and 36,800, respectively.[2] In Canada, the estimated number of new cases and deaths from pancreatic cancer in 2011 was 4,100 and 3,800, respectively.[3] Age-adjusted incidence in the U.S. based on SEER (Surveillance, Epidemiology and End Results) data between was 12 per 100,000 men and women; total lifetime risk was 1.45% (approximately 0.5% by age 70).[2]

Due to the retroperitoneal location of the pancreas and the lack of specific symptoms of early pancreatic cancer, most patients present with advanced disease that precludes surgical resection. For those patients, the only treatment option is palliation, and despite many trials of various chemotherapeutic and molecular-target drugs and/or radiotherapy, median survival is 9-11 months.[4] For patients who do undergo surgical resection of localized pancreatic cancer, 80-85% ultimately recur locally and/or systemically, resulting in 5-year survival of <20% and overall 5-year survival for all pancreatic cancer patients of <5%.

1.2 Molecular biology

Three distinct pre-invasive lesions have been identified as precursors for pancreatic adenocarcinoma: pancreatic intraepithelial neoplasia (PanIN), intraductal papillary mucinous neoplasms (IPMNs), and mucinous cystic neoplasms (MCNs). Each of these lesions has been associated with increased risk of cancer, and the arising cancer has been shown to develop from cells within the precursor.

PanINs are microscopic lesions in the smaller pancreatic ducts, and they are associated with a progressive spectrum of cytologic and architectural atypia (corresponding to the classification of PanIN-1A, PanIN-1B, PanIN-2, and PanIN-3).[6] Mouse models of pancreatic cancer develop lesions very similar to human PanINs, and molecular analyses have demonstrated that PanINs sequentially accumulate genetic alterations found in invasive cancer, suggesting an adenoma-to-carcinoma progression model akin to that of colorectal cancer.[7] However, the natural history of PanINs is not yet clear: while it is evident that advanced-stage PanIN-3 lesions are tightly associated with cancer [8], early-stage PanIN-1 lesions are quite common and are most prevalent in older subjects.[9] Moreover, PanINs are frequently multi-focal, and although endoscopic ultrasound can detect parenchymal changes associated with PanINs, it does so at less than 100% specificity.[10,11] Therefore, deciding if and when to resect pancreata with suspected PanIN lesions is contentious.

IPMNs are grossly visible cystic lesions with direct communication to the main or branch pancreatic ducts. The mutational spectrum of IPMNs differs somewhat from that of PanINs and invasive adenocarcinoma, suggesting an alternate path of development.[12] Main-duct IPMNs are associated with up to 40% risk of malignant transformation and usually are resected, especially if they are growing and/or larger than 3 cm, demonstrate mural nodularity on imaging, or are associated with main duct dilation.[13] However, branch-duct IPMNs are more challenging to manage as their natural history is less clear. They are associated with up to 15% risk of malignancy, and most authorities recommend resection if the branch-duct IPMN exceeds 3 cm in size or has mural nodules or other suggestion of malignancy, but it is unclear what to do with smaller lesions since most branch-duct IPMNs remain unchanged over long-term follow-up.[13,14] Since IPMNs are often multifocal, patients who undergo subtotal pancreatic resections would need to continue surveillance for potential cancer recurrence.

MCNs are rare, mucin-producing cystic lesions not directly communicating with the pancreatic ducts and with a distinctive ovarian-type stromal epithelium.[15] They only account for approximately 1% of pancreatic cancers, but if detected they should always be resected: they carry a 40% chance of malignancy, and the cure rate is 100% if the MCN is resected before invasive carcinoma develops, whereas it is only 50-60% if cancer is present at the time of resection.[15]

Molecular analyses have identified a variety of genetic, epigenetic, and genomic alterations in pancreatic adenocarcinoma. The most common genetic mutation is Kras2 activation, present in 90-95% of cases; it also appears to be one of the earliest changes that promote tumor development, as evidenced by its presence in 36% of PanIN-1A and the fact that mice engineered to express the activated Kras G12D mutant develop PanIN-like lesions and eventually invasive pancreatic carcinoma.[7] Kras2 is a well-established proto-oncogene, part of the RAS family of GTP-binding proteins, which are involved in proliferation, cell survival, cytoskeletal modeling, motility, and other cellular functions.[16] In pancreatic cancer, activating mutations primarily occurring in codon 12 cause constitutive activation of the intracellular signal transduction function of the expressed protein. This constitutive signaling appears to be necessary for maintenance of pancreatic cancer, in addition to initiating its development.[17] Other oncogenes activated in pancreatic cancer include BRAF [18], AKT2 [19], c-Myc [17], and EGFR [17].

Moreover, constitutive activation of the Hedgehog developmental signaling pathway has also been implicated in the development of pancreatic cancer. The mammalian Hedgehog signaling pathway appears to play a critical role in developmental patterning and mature tissue homeostasis, and it has been observed to be dysregulated in many cancers, including pancreatic cancer.[20] In fact, Hedgehog signaling activation appears to be one of the initiating events in pancreatic cancer, as evidenced by ligand overexpression in PanINs [21] and IPMNs [22] and the fact that Hedgehog signaling cooperates with the Kras G12D mutant in mouse models to promote development of PanINs.[23] Hedgehog signaling also appears to be important in regulating metastases.[24]

While the Kras G12D mutation is necessary for development of pancreatic cancer in mice, latency to tumor development is significantly shortened if additional inactivating mutations of the tumor suppressor genes TP53, p16, or BRCA2 are added.[25] Inactivation of all three tumor suppressor genes, along with others, has been identified in pancreatic adenocarcinoma. Inactivating mutations (homozygous deletions, intragenic mutations plus loss of the second allele, or epigenetic silencing) of p16 are found in approximately 90% of tumors.[26] This gene is a well-known tumor suppressor that codes for a cyclin-dependent kinase inhibitor that blocks progression through the G1-S checkpoint of the cell cycle. TP53, the "guardian of the genome," is involved in maintenance of genomic stability, apoptosis, and activation of DNA repair (among its many functions), and is inactivated in 50-75% of pancreatic cancers (almost always via intragenic mutations coupled with loss of the second allele).[27] Another tumor suppressor gene commonly inactivated in pancreatic cancer (in about 55% of cases) is SMAD4, a critical signaling intermediate in the transforming growth factor (TGF)-beta pathway; its loss provides a selective growth advantage to affected cells.[28] Patients who undergo resection and whose pancreatic cancer has loss of SMAD4 function have worse prognosis than age- and stage-matched patients without SMAD4 mutations.[29] Other tumor suppressor genes inactivated at a lower frequency (5-10%) include BRCA2, STK11, TGFBR1, and TGFBR2.[26] Of note, p16 inactivation appears to be a relatively early event in tumor development, as it is detectable in PanIN-2 lesions, whereas TP53, SMAD4, and BRCA2 mutations are not seen until the PanIN-3 stage.[7]

Genomic instability is a hallmark of most solid tumors, including pancreatic cancer. The types of genomic rearrangements commonly identified in pancreatic adenocarcinoma are reviewed elsewhere (see "Literature Review - CNVs and Cancer"). Telomere shortening, which predisposes to end-to-end chromosomal fusions and breakage during anaphase and thus generates amplifications and deletions in the daughter cell genomes, is a very frequent and early event in pancreatic cancer development, demonstrated in over 90% of the earliest-stage PanINs.[30] It is believed that the inactivation of TP53 allows the survival of pre-invasive cells that develop a heavy burden of genomic instability as a result of telomere attrition, permitting them to progress through the activation of oncogenes and inactivation of tumor suppressor genes to invasive status.[31] It should be noted that most invasive pancreatic cancers appear to reactivate telomerase, mitigating the degree of genomic instability and helping to stabilize the neoplastic cells.[32]

In addition to genetic and genomic alterations, epigenetic silencing of tumor suppressor genes (via methylation of CpG islands in the 5' regulatory regions) is frequently observed in pancreatic adenocarcinoma.[33] Conversely, hypomethylation of candidate oncogenes (which are overexpressed in pancreatic cancer) has also been observed.[34] MicroRNAs have also been implicated in pancreatic cancer tumorigenesis, both as potential tumor suppressors and as oncogenes.[35] Furthermore, inflammation and the tumor micro-environment appear to have a role in pancreatic tumorigenesis.[36]

Jones et al. [37] examined the genomic profile of pancreatic adenocarcinoma in depth by sequencing the coding regions of 20,661 genes in 24 pancreatic adenocarcinomas as well as hybridizing tumor DNA to a high-resolution single nucleotide polymorphism (SNP) array to detect genomic rearrangements. The authors identified 1,562 somatic mutations in 1,007 genes, of which 74.5% were missense, nonsense, small insertions/deletions, or splice-site/untranslated region (UTR) changes. The average number of mutated genes per tumor (48) was much lower than the number of mutations discovered in breast cancer (101) or colorectal cancer (77) in previous studies, and one potential explanation given is that the cells which initiate pancreatic tumorigenesis are likely to have undergone fewer divisions than tumor-initiating cells in breast or colorectal cancer. Gene-set analyses of the genes mutated in pancreatic cancer identified 69 gene sets that were altered in most pancreatic tumors, of which 31 gene sets can be grouped in 12 core signaling pathways with discernible functional relevance to neoplasia, which were affected in % of the pancreatic tumors.
Notably, although the 12 core pathways were altered in almost all cancers, the specific genes mutated in each tumor differed significantly across patients, aside from the few frequently mutated genes discussed above. These results emphasize the importance of the pathway approach to understanding tumorigenesis, and suggest that successful anti-cancer therapy may depend more on targeting pathways than on targeting individual genes.

A subsequent study applied massively parallel sequencing to sequence the entire genome of metastases from seven of the subjects included in the previous study.[38] On average, two-thirds of mutations detected in each metastasis were also present in the paired primary tumor and were called "founders", while the remaining mutations that were only identified in metastases were termed "progressors". Subclones that led to the development of metastases were identified within each primary tumor. The authors devised a mathematical model for calculating the timing of different stages of pancreatic cancer development and estimated that it takes an average of 11.7 years from the initiation of tumorigenesis until the generation of the cell that develops into the parental clone; another 6.8 years were estimated for the evolution into subclones with metastatic capacity, and 2.7 years until the death of the patient. It should be noted that most of the tumors in this study were not from familial cases, and tumors with highly-penetrant germline predisposing mutations may follow a different evolutionary timeline and pathway. Nonetheless, it appears that a significant window of opportunity for screening and curative intervention exists, if it is possible to identify tumors before metastatic subclones develop.

1.3 Risk factors

The list of putative risk factors for pancreatic cancer is long, with wide variability in the degree of risk conferred and the strength of evidence for the association. Age is strongly correlated with increased risk of pancreatic cancer, with the median age for diagnosis at 72 years and more than two-thirds of cases occurring after age Race is also a factor, with African-Americans having substantially higher rates of pancreatic cancer than white, Asian, or Hispanic Americans.[2] Perhaps the strongest association of a risk factor exists for tobacco use, as numerous studies have demonstrated that smoking can double lifetime risk and the estimated population attributable risk is 25%.[39] Other risk factors with low-to-moderate contribution to pancreatic cancer include alcohol consumption [40], obesity [40], occupational exposure to certain chemicals [41], long-standing diabetes mellitus [42], and Helicobacter pylori infection [43]. However, only smoking has been consistently associated with pancreatic cancer. Chronic pancreatitis is associated with up to 13-fold increased risk of pancreatic cancer, and the risk is even higher in patients with hereditary pancreatitis, which is caused by genetic mutations (e.g. PRSS1, SPINK1).[44] Possible protective factors include allergies [45], Vitamin D intake [46] (although this is contentious [47]), and consumption of citrus fruit [48] and a Mediterranean diet [49].

The role of germline genetic factors predisposing to pancreatic cancer is a subject of numerous studies and ongoing collaborations.
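To make the smoking figures above more concrete, the following sketch relates a relative risk of 2 ("smoking can double lifetime risk") to the quoted population attributable risk of 25% using Levin's formula, PAF = p(RR - 1) / (1 + p(RR - 1)), where p is the exposure prevalence. Using Levin's formula here is an assumption for illustration; the cited study may have estimated attributable risk differently, and the function names are hypothetical.

```python
def attributable_fraction(prevalence, relative_risk):
    """Levin's population attributable fraction for a single exposure."""
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


def implied_prevalence(paf, relative_risk):
    """Invert Levin's formula: exposure prevalence implied by a given PAF."""
    return paf / ((relative_risk - 1.0) * (1.0 - paf))


rr_smoking = 2.0    # "smoking can double lifetime risk" (from the text)
paf_smoking = 0.25  # population attributable risk quoted in the text

# Illustrative only: the prevalence that would reconcile the two quoted figures.
prevalence = implied_prevalence(paf_smoking, rr_smoking)
print(f"implied exposure prevalence = {prevalence:.3f}")
print(f"round-trip PAF = {attributable_fraction(prevalence, rr_smoking):.3f}")
```

Under these assumptions the two quoted numbers are mutually consistent if roughly one third of the population is exposed. That figure is a by-product of the formula, not a prevalence reported in the text.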
Polymorphisms in the following genes have been associated with increased or decreased risk of sporadic pancreatic cancer: GCKR (odds ratio (OR) = 2.14 ) 50, IGF1 and IGF1R (OR = ) 51, IGFPB1 (OR = 1.46) 51, SSTR5 (OR = 1.62) 52, [MGMT (OR = 0.6), PMS2 (OR = 1.44), PMS2L3 (OR = 5.54)] 53, HNF1A (OR = ) 54, SDF1 (OR = 2.74) 55, [FTO (OR = 1.12), MNTR1B (OR = 1.11), MADD (OR = 1.14)] 56, ALDH2 (OR = 1.37) 57, HK2 (OR= 0.68 in diabetic/3.69 in nondiabetic) 58, [PPARG (OR = 0.21), NR5A2 (OR = ), ADIPOQ (OR = 0.67), GGT1 (OR = 1.86) 59, CASP9 (OR = ) 60, CAPN10 (OR = 1.57) 61, p21 (OR = 1.70) 62, CYP1B1 (OR = 0.67) 63, CFTR (OR = 1.4; OR = 1.83 if diagnosed under age 60) 64, GSTP1 (OR = 3.09 if diagnosed under age 50) 65, CYP17A1 (OR = ) 66, PPARG in conjunction with high-dose Vitamin A (OR = 2.80) 67, PTGS2 (OR = ) 68, MMS19L (OR = 0.7/1.34) 69, IL1beta (OR = 2.0 for unresectable cancer) 70, [LIG3 (OR = 0.23), ATM (OR = 2.55)] 71, IGF2 (OR = 0.07) 72, [MTHFR (OR = 4.50), MTR (OR = 2.65), MTRR (OR = 3.35) in heavy drinkers] 73, MTRR (OR = ) 74, [FasL (OR = ), CASP8 (OR = )] 75, NAT2 (slow-type, OR = 5.7) 76, XRCC2 in smokers (OR = 2.32) 77, ERCC2 in smokers (OR =

19 6 0.46) 78, [MTHFR (OR = ), TYMS (OR = 2.19)] 79, NAT1-rapid type (OR = 1.5) 80, RNASEL (OR = ) 81, UGT1A17 (OR= ) 82, XRCC1 in smokers (OR = 7.0 in women/or = 2.4 in men) 83. Pathways affected by those genes include diabetes mellitus type II and glucose metabolism, insulin growth factors, somatostatin, DNA repair, tumor growth, alcohol metabolization, obesity, glutathione metabolism, cytochrome P450, cystic fibrosis transductance regulator, fatty acid storage, cyclooxygensase-2, nucleotide excision repair, inflammation, folate metabolism, cell cycle and cell death, and toxin detoxification. Many of the aforementioned studies suggest gene-environment interactions. To date, four genome-wide association studies (GWAS) of pancreatic cancer have been published: two related GWAS were conducted on subjects drawn from 12 cohort studies and 9 case-control studies (mostly of European ancestry) 84-85, a study performed in a Japanese population 86, and the most recent study was in a Chinese population. 87 While SNPs in several loci were observed to be associated at sufficiently low p-values to suggest statistical significance (7q36-SHH, 15q14-gene desert) 84, (13q22.1- near KLF5 and KLF12,1q32.1-NR5A2, 5p15.33-CLPTM1L-TERT) 85, (6p25.3-FOXQ1, 12p11.21-BICD1, 7q36.2-DPP6) 86, (21q21.3 BACH1, 5p13.1 DAB2, 10q26.11 near PRLHR, 21q22.3 near TFF1, 22q13.32 near FAM19A5) 87, to date only one association has been successfully replicated in additional studies: the ABO blood group locus at 9q34. In the GWAS by Amundadottir et al. 84, the ABO locus was identified as a potential associated locus in the initial phase of the study and confirmed in a replication case-control set (odds ratio (OR) per non-o allele = 1.20). This association of non-o blood group with pancreatic cancer risk was further replicated in other case-control studies (OR , OR , OR , protective O-blood type OR ). Furthermore, Wolpin et al. 
92 reported a higher risk of pancreatic cancer for carriers of the A(1) variant of the A-allele, which has a higher glycosyltransferase activity than the A(2) allele (OR 1.38). In addition, Risch et al. 89 observed increased risk of pancreatic cancer in non-O blood group subjects who are seropositive for H. pylori but negative for its virulence protein CagA (OR 2.78). Analyses in non-Caucasian populations found similar risk effects of the non-O alleles (OR ; OR ). Wang et al. 95 also found evidence for an additive effect of A blood type with Hepatitis B infection. It should be noted that the association of non-O blood type with pancreatic cancer predates these GWAS; one of the earliest reports suggesting an association was in The British Medical Journal in How blood type mediates pancreatic cancer risk and tumorigenesis is unknown 97, but it appears that approximately 20% of pancreatic cancers in European populations are attributable to non-O blood type status. 88 Higher-penetrance genes may also predispose to pancreatic cancer, as shown by the co-occurrence of pancreatic cancer with several known cancer syndromes. The highest known risk is associated with Peutz-Jeghers syndrome (PJS), caused by germline mutations of STK11. This autosomal dominant syndrome is associated with melanocytic macules on the lips and buccal mucosa, gastrointestinal

hamartomas, and cancer. The lifetime risk of pancreatic cancer in PJS patients is up to 132-fold relative to the general population, or about 66% by age ,99 Another condition associated with up to 80-fold higher risk of pancreatic cancer is hereditary pancreatitis, most commonly caused by mutations in PRSS1 in an autosomal dominant fashion (although SPINK1 mutations have also been implicated). Familial atypical multiple mole melanoma (FAMMM) is an autosomal dominant syndrome characterized by multiple nevi and increased risk of cancers, predominantly melanoma and pancreatic adenocarcinoma. The primary genetic cause of FAMMM is mutations in CDKN2A/p16, and carriers (particularly of the p16-Leiden founder mutation) have up to 47-fold increased risk of developing pancreatic cancer. 102 Some genes that cause hereditary breast and ovarian cancer also raise the risk of pancreatic cancer. To date, the gene contributing to the largest proportion of hereditary pancreatic cancer is BRCA2, which is estimated to raise lifetime risk of pancreatic cancer by 3.5- to 10-fold and accounts for up to 19% of high-risk families (although the contribution of BRCA2 may be population dependent, as it appears to be significantly lower in German, Korean, and Spanish populations). Although most BRCA2 families with pancreatic cancer also cluster breast and/or ovarian cancer, some families are characterized by exclusive presence of pancreatic cancer 112, and even apparently sporadic cases have been demonstrated to carry deleterious germline BRCA2 mutations. 
113 Interestingly, while the BRCA2 locus was first proposed to contain a cancer-associated gene via linkage to familial breast cancer, 114 the localization of the gene itself and the suggestion of its tumor-suppressor role were facilitated by the discovery of a homozygous deletion at 13q12 in a pancreatic adenocarcinoma. Germline mutations of other Fanconi-anemia pathway genes have been reported in pancreatic cancer families, but the magnitude of risk associated with these genes is unclear: PALB2 in ~0.9-4% of families, BRCA1 in % of families (although Axilbund et al. failed to find mutations in a series of 66 familial pancreatic cancer patients 123), ATM in 2.4% of families 124, and FANCC and FANCG mutations in young-onset pancreatic cancer subjects 125, although these genes do not appear to contribute significantly to familial pancreatic cancer. Several other syndromes associated with risk of pancreatic cancer include Lynch syndrome (caused by mutations of the mismatch repair genes MLH1, MSH2, MSH6, PMS2 or TACSTD1-3 deletion), Li-Fraumeni syndrome (caused by mutations of TP53) 133, Familial Adenomatous Polyposis (caused by mutations of APC) 134, and cystic fibrosis (caused by mutations of CFTR) 135. However, the contribution of known genetic syndromes to the overall heritability of pancreatic cancer is limited; approximately 10% of all pancreatic cancer cases appear to be familial or hereditary and most do not have a known genetic explanation. 136 Perhaps the earliest indications that a familial pancreatic cancer syndrome exists were several case reports and case series in the 1970s and 1980s describing clusters of pancreatic cancer in first- and second-degree blood relatives ( ). Subsequently, both retrospective

case-control and prospective cohort studies suggested increased risk of pancreatic cancer in close relatives of patients compared to the general population (Table 1).

Table 1 - Studies estimating risk of pancreatic adenocarcinoma in relatives of affected patients

Paper | Type of Study | Description | Risk of pancreatic cancer in relatives of patients
Ghadirian et al. 144 | Case-control | 179 cases vs. 179 controls (French Canadian) | OR in subjects with positive family history = 13 (p<0.001)
Fernandez et al. 145 | Case-control | 362 cases vs. controls (Italian) | OR in FDR of affected cases = 3.0 (95% CI )
Silverman et al. 146 | Case-control | 484 cases vs. controls (US) | OR in FDR of affected cases = 3.2 (95% CI )
Schenk et al. 147 | Case-control | 247 cases vs. 420 controls (US) | OR in FDR of affected cases = 2.49 (95% CI )
Ghadirian et al. 148 | Case-control | 174 cases vs. 136 controls (Canada) | OR in FDR of affected cases = 5.0 (p=0.01)
Inoue et al. 149 | Case-control | 200 cases vs. controls (Japan) | OR in subjects with positive family history = 2.09 (95% CI )
Rulyak et al. 150 | Nested case-control | 251 members of 28 families (US) | OR with each affected FDR = 1.8 (95% CI )
Cote et al. 151 | Case-control | 247 cases vs. 420 controls (US) | OR in subjects with positive family history = 2.49 (95% CI )
Hassan et al. 152 | Case-control | 808 cases vs. 808 controls (US) | OR in FDR of affected cases = 3.3 (95% CI ); OR in SDR of affected cases = 2.9 (95% CI )
Jacobs et al. 153 | Case-control | 1,183 cases vs. 1,205 controls (US, Europe, China) | OR in FDR of affected cases = 1.76 (95% CI )
Matsubayashi et al. 154 | Case-control | 577 cases vs. 577 controls (Japan) | OR in FDR of affected cases = 2.5 (p=0.02)
Coughlin et al. 155 | Cohort | 1.1 million US | RR for PC mortality in FDR of affected cases (males) = 1.5 (95% CI ); (females) = 1.7 (95% CI )
Tersmette et al. 156 | Cohort | Prospectively followed 150 FPC kindreds and 191 SPC kindreds from NFPTR | SIR in FPC relatives if 2 or more affecteds = 18.3 (95% CI ); SIR in FPC relatives if 3 or more affecteds = 56.6 ( ) [no significant elevated risk in SPC relatives SIR in FDRs = 6.5 ( )]
Hemminki et al. 157 | Cohort | 10.2 million Swedish (21,000 PC cases) | SIR for children of affected cases = 1.73 (95% CI )
Klein et al. 158 | Cohort | Prospectively followed 370 FPC kindreds and 468 SPC kindreds from NFPTR | SIR in FDRs of FPC affecteds = 9.0 ( ); if 1 FDR affected, SIR = 4.5 (95% CI ); if 2 FDRs affected, SIR = 6.4 (95% CI ); if 3 or more FDRs affected, SIR = 32 (95% CI ) [no significant elevated risk in FDRs of SPC affecteds, SIR = 1.8 (95% CI ) or spouses/unrelated relatives, SIR = 2.4 (95% CI )]
Jacob et al. 159 | Cohort | 1.1 million (US) | RR for PC mortality in FDR of affected cases = 1.66 (95% CI )
Brune et al. 160 | Cohort | Prospectively followed 1,718 kindreds from NFPTR | SIR in FDR of FPC affected = 6.79 (95% CI ); if 1 FDR affected, SIR = 6.86 (95% CI ); if 2 FDRs affected, SIR = 3.97 (95% CI ); if 3 or more FDRs affected, SIR = (95% CI ); young-onset (< 50 years) in FDR associated with SIR = 9.31 (95% CI ); late-onset (> 50 years) in FDR associated with SIR = 6.34 (95% CI )

OR = odds ratio; 95% CI = 95% confidence interval; FDR = first-degree relative; SDR = second-degree relative; PC = pancreatic cancer; SIR = standardized incidence ratio; RR = relative risk; FPC = familial pancreatic cancer (at least 1 pair of affected FDRs); SPC = sporadic pancreatic cancer (no affected FDR pairs); NFPTR = National Familial Pancreas Tumor Registry at Johns Hopkins University ( )

Segregation analysis of 287 families with an index case of pancreatic cancer recruited by Johns Hopkins Medical Institutions supports the hypothesis that a major gene is involved in pancreatic cancer risk, with the most likely model including the autosomal dominant inheritance of a rare allele. 161 The degree of risk is linked to the number of affected relatives, the degree of relation, and the age of onset of disease in relatives. Three large cohort studies following kindreds recruited by the National Familial Pancreas Tumor Registry (NFPTR) at Johns Hopkins Medical Institutions found risk in first-degree relatives (FDR) of affected patients in families with at least one pair of affected first-degree relatives of if only one FDR is affected, if two FDRs are affected, and if three or more FDRs are affected. 156,158,160 Moreover, the younger the age of onset of cancer in the affected relative, the higher the risk in first-degree relatives (hazard ratio (HR) 1.55 per decreased year of onset). 
160 It is not clear whether the average age of onset of pancreatic cancer is significantly lower in FPC, as many studies found no difference in age of onset of disease between FPC and sporadic cases 143,144,156,162,163 and even the few studies that identified a difference found it to be rather small (65-68 yrs in FPC vs. 70 yrs in SEER database). 160,164,165 However, there is evidence for genetic anticipation in FPC families, with members of each successive generation developing cancer on average 6-15 years younger than the previous generation. 166,167,168,169 There is strong evidence for gene-environment interaction in FPC, particularly with respect to tobacco use; smokers in FPC kindreds developed pancreatic cancer a decade earlier than non-smokers 168, and the relative risk of developing cancer in smokers from FPC families is approximately 19-fold that of the general population. 158 In some cancer syndromes, there is a significant difference in survival between familial and sporadic cases (e.g. colorectal cancer), but it is not clear that there is such a difference in FPC. Several studies have found no difference in survival between sporadic and familial pancreatic cancer. 143,164,170,171 Ji et al. 172 found that familial cases had worse outcome than sporadic cases (HR = 1.37) in the Swedish Family Cancer database, while Yeo et al. 173 identified significantly worse survival in unresected FPC cases compared to unresected sporadic cases but no significant difference for resected cases. Interestingly,

recent anecdotal reports and small series of FPC patients with mutations in BRCA-related genes who were treated with platinum-based chemotherapy, topoisomerase inhibitors, or poly-ADP-ribose-polymerase (PARP1) inhibitors suggest that this subset of familial cases may have good chemotherapy responses and improved survival compared to sporadic cases. Aside from the difference in inactivation of the BRCA-related pathway between familial and sporadic cases (up to a fifth of FPC tumors vs. less than 10% in sporadic cases), there has been limited investigation into molecular genetic and pathologic differences between familial and sporadic pancreatic cancers. Pancreata from FPC subjects appear to have increased prevalence of precursor lesions (PanINs and IPMNs) compared to sporadic pancreatic cancer. 179,180 Studies analyzing the rate and genome-wide distribution of loss-of-heterozygosity (LOH) have shown conflicting results: Abe et al. 181 identified LOH at approximately 50% of informative markers in 20 FPC tumors while a similar study in 82 sporadic tumors found the average LOH rate to be 25% 182, but a third study that used a SNP array to identify LOH in 26 pancreatic cancer cell lines found a rate of LOH similar to that in familial tumors (average 43%). 183 Differences in LOH rates aside, the pattern of LOH across the genome appeared similar across all three studies. Brune et al. 184 analyzed familial tumors for KRAS mutations, TP53 and SMAD4 expression, and methylation rate of seven genes previously shown to be hypermethylated in sporadic tumors, and found no significant difference between familial and sporadic tumors. Given all the evidence supporting the existence of at least one major gene explaining the heritability of pancreatic cancer in high-risk families, much effort has been directed at attempting to identify the responsible gene, including genetic linkage. 
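The LOH rates quoted above are expressed as a fraction of informative markers. As a rough illustration only, the sketch below counts a marker as informative when the germline genotype is heterozygous and calls LOH when the matched tumor genotype is homozygous. The function name, the two-letter genotype encoding, and the toy genotypes are assumptions made for this example; the cited studies used their own allelic-imbalance criteria, which must also account for normal-cell contamination of tumor samples.

```python
def loh_rate(normal_genotypes, tumor_genotypes):
    """Return the fraction of informative markers that show LOH.

    Genotypes are two-character strings such as "AB" (hypothetical encoding).
    A marker is informative only if the normal (germline) genotype is
    heterozygous; LOH is called when the tumor genotype at that marker
    is homozygous.
    """
    informative = 0
    lost = 0
    for normal, tumor in zip(normal_genotypes, tumor_genotypes):
        if normal[0] == normal[1]:
            continue  # homozygous in germline: marker is uninformative
        informative += 1
        if tumor[0] == tumor[1]:
            lost += 1  # heterozygosity present in germline, absent in tumor
    if informative == 0:
        return None  # no informative markers, rate is undefined
    return lost / informative


# Toy data (not from any study): six markers, four of them informative.
normal = ["AB", "AA", "AB", "AB", "BB", "AB"]
tumor = ["AA", "AA", "AB", "BB", "BB", "AB"]
rate = loh_rate(normal, tumor)
print(rate)
```

With these toy genotypes, two of the four informative markers lose heterozygosity in the tumor.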
Linkage analysis is a statistical tool which uses family-based data and the likelihood of recombination between loci on a chromosomal arm to identify genomic regions that appear to be transmitted to affected members of the family more frequently than by chance alone. Since linkage analysis was successful in mapping the location of and facilitating the identification of highly-penetrant genes in many cancer syndromes (e.g. APC in Familial Adenomatous Polyposis 185; BRCA1 and BRCA2 in Hereditary Breast and Ovarian Cancer syndrome 114,186), this technique has been applied to the study of FPC. Familial registries fostered the collection of high-risk families, and a large North American consortium has pooled the resources of six major sites: the Pancreatic Cancer Genetic Epidemiology Consortium (PACGENE). 165 This National Institutes of Health (NIH)-funded collaboration includes the University of Toronto, Mayo Clinic, Johns Hopkins University, MD Anderson Cancer Centre, Dana Farber Cancer Institute, and Karmanos Cancer Institute. Each site prospectively identifies pancreatic cancer patients with a family history of at least two affected members. If a pedigree is deemed suitable for linkage analysis (with the help of linkage simulation programs), probands are asked to consent to contact their relatives for recruitment to the study. Consenting individuals complete questionnaires about clinical and family history and provide blood samples for DNA extraction.
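To make the idea of a linkage statistic concrete, the following is a minimal sketch of a two-point LOD score for the simplest textbook case: fully informative, phase-known meioses in which each meiosis can be scored as recombinant or non-recombinant between a marker and the disease locus. The function names and the example counts are illustrative assumptions; real FPC linkage analyses rely on dedicated pedigree-likelihood software that handles unknown phase, incomplete penetrance, and phenocopies.

```python
import math


def lod_score(recombinants, meioses, theta):
    """Two-point LOD score at recombination fraction theta.

    LOD(theta) = log10[ theta^r * (1 - theta)^(n - r) / 0.5^n ],
    where r is the number of recombinant meioses out of n informative ones.
    Assumes phase-known, fully informative meioses (a simplification).
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    non_recombinants = meioses - recombinants
    log_likelihood_linked = (recombinants * math.log10(theta)
                             + non_recombinants * math.log10(1 - theta))
    log_likelihood_unlinked = meioses * math.log10(0.5)
    return log_likelihood_linked - log_likelihood_unlinked


def max_lod(recombinants, meioses):
    """Scan theta from 0.01 to 0.50 and return (best_theta, best_lod)."""
    thetas = [i / 100 for i in range(1, 51)]
    best_theta = max(thetas, key=lambda t: lod_score(recombinants, meioses, t))
    return best_theta, lod_score(recombinants, meioses, best_theta)


# Hypothetical family data: 1 recombinant among 10 informative meioses.
best_theta, best_lod = max_lod(1, 10)
print(best_theta, round(best_lod, 2))
```

By convention a LOD score of 3 or more is usually taken as significant evidence of linkage, so the hypothetical ten meioses above would fall short of that threshold. This illustrates why a small number of available meioses per family limits the power of the method.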

Linkage efforts in FPC have yielded limited results. The linkage work by PACGENE is ongoing, but to date no highly significant loci have emerged. Investigators at the University of Washington (not connected to PACGENE) published results of a linkage analysis conducted in a single FPC family (identified as Family X) characterized by four generations of affected members with an autosomal dominant pattern of inheritance suggesting high penetrance, young age of onset (median age 43), and concomitant endocrine and/or exocrine pancreatic insufficiency. 187 Based on a genome-wide screen using 373 microsatellite markers, significant linkage with LOD (logarithm of odds) scores was identified on chromosome 4q32-34. Although other centres failed to find a significant association at this locus in European 188 or North American 189 FPC kindreds, the University of Washington group subsequently claimed to have pinpointed PALLD, coding for palladin, a cytoskeleton scaffold protein. 190 They demonstrated a variant (P239S) that segregated only with the affected members of the family linked to 4q32-34, and they further presented evidence of PALLD overexpression in premalignant and cancerous pancreatic tissue. However, significant doubt has been cast on the likelihood that PALLD is the responsible gene for FPC, or at least that it is a significant cause of this cancer syndrome. Due to the large number of candidate genes in the 4q32-34 locus, Pogue-Geile et al. 187 were unable to screen all candidates for mutations in Family X. Rather, they used a custom expression microarray to analyze RNA extracted from whole tissue PanIN in one of the affected members of Family X and in another 10 sporadic pancreatic cancers. PALLD appeared to have the highest expression, and it was based on this finding that this gene was sequenced in Family X. However, Salaria et al. 
191 used immunohistochemistry of 177 pancreatic adenocarcinomas to show that palladin overexpression was primarily localized to non-neoplastic stroma, with 96.6% of tumors demonstrating overexpression in the stroma and only 12.4% showing overexpression in neoplastic cells. Furthermore, three studies of Canadian, US, and European families found no deleterious PALLD mutations in any other FPC families. Zogopoulos et al. 192 genotyped the P239S variant in 51 familial cases, 33 early-onset cases, and 555 controls and found it in only one familial case diagnosed at age 74 (they did not have DNA available for the other family members) and in one 91-year-old unaffected control. Slater et al. 193 sequenced the locus containing the variant in 74 FPC families and found no mutations. Finally, Klein et al. 194 performed sequencing on 92% of the coding region of the entire PALLD gene in 48 FPC cases and found no deleterious mutations. Since the PACGENE linkage study has not yet been completed, it is not known if any other loci will be reliably linked to FPC. Some of the challenges associated with applying linkage analysis to FPC are: (1) small number of affected individuals per family and rapid mortality, precluding recruitment and limiting the number of meioses available to perform the analysis; (2) penetrance of the FPC gene(s) is likely lower than in previously mapped hereditary cancer syndromes, reducing the power of linkage analysis; (3) there is increasing evidence for locus heterogeneity in the etiology of FPC. To date, only BRCA2 has been

shown to account for a substantial portion of familial cases, while all other identified genes appear to be responsible for fewer than 5% of cases each. Locus heterogeneity is a significant confounder of linkage analysis, and the lack of distinguishing phenotypic or pedigree characteristics among families makes it very difficult to confidently separate cases that are likely caused by different genes; (4) reduction of power in linkage analysis due to phenocopies. Given all these challenges, it is evident that other techniques are needed in the effort to identify germline genetic alterations that predispose to FPC.

2. Copy Number Variation

2.1 Copy Number Variation: a novel paradigm

Our understanding of the nature and degree of variation in the human genome has accelerated in the past few years. Until recently, single nucleotide polymorphisms (SNPs) appeared to be the most frequent and important source of genomic variation in humans. Significant efforts have been directed at identifying and genotyping SNPs in different populations, and numerous disease association and linkage studies have been conducted using SNPs as genomic markers. Yet, the development of higher-resolution genomic scanning technologies has highlighted a previously under-recognized but clearly significant submicroscopic structural variation in the human genome. Structural variants encompass copy-number variants (CNVs) (defined as genomic segments which are present in variable copy numbers when comparing two or more genomes) as well as inversions, novel sequence or mobile element insertions, and translocations. 195 The original definition of CNVs used 1,000 base pairs as a lower-limit size threshold, to differentiate them from smaller insertions/deletions. However, more recently the spectrum of CNVs has been expanded to include any variants larger than 50 bp, reflecting the identification of smaller variants using sequencing technologies. 
195 Although CNVs at certain loci had long been recognized as polymorphisms in normal individuals (e.g. alpha-globin gene family; Rhesus blood group) as well as the cause of genomic disorders (e.g. Charcot-Marie-Tooth neuropathy type IA; Williams-Beuren syndrome; Potocki-Lupski syndrome), 196 the ubiquitous presence of CNVs in normal human genomes first became apparent with the publication of two genome-wide studies in 2004. Since that time, more CNV-detection surveys, with continually improving genomic coverage and resolution, have reported thousands of CNVs affecting all human chromosomes in apparently normal individuals (see Table 2). While the number of known SNPs (~11 million) exceeds that of CNVs, the proportion of genomic sequence that is different between any two genomes due to indels/CNVs is approximately 12-fold that of SNPs (1.2% vs. 0.1%). 238
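The two size definitions and the 12-fold comparison above can be restated as a short sketch. The constant and function names are hypothetical and chosen only for illustration; the thresholds and percentages are the ones quoted in the text, and the label "indel" for sub-threshold events is a simplification.

```python
ORIGINAL_CNV_MIN_BP = 1_000  # original definition: at least 1,000 bp
EXPANDED_CNV_MIN_BP = 50     # expanded definition: larger than 50 bp


def classify_variant(length_bp, expanded_definition=True):
    """Label a copy-number change as "CNV" or "indel" by its length in bp."""
    if expanded_definition:
        is_cnv = length_bp > EXPANDED_CNV_MIN_BP
    else:
        is_cnv = length_bp >= ORIGINAL_CNV_MIN_BP
    return "CNV" if is_cnv else "indel"


# A 500 bp deletion is a CNV only under the expanded definition.
print(classify_variant(500, expanded_definition=True))
print(classify_variant(500, expanded_definition=False))

# Proportion of sequence differing between two genomes, as quoted in the text.
indel_cnv_percent = 1.2
snp_percent = 0.1
fold_difference = indel_cnv_percent / snp_percent
print(round(fold_difference, 1))
```

The practical consequence is that CNV counts from sequencing-based studies and from older array-based studies are not directly comparable unless the same size threshold is applied to both.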

26 13 Table 2 - Summary of published studies reporting germline genomic copy-number variation in non-disease samples Study (Year Published) Population Sebat et al. 20 (2004) 197 ethnically diverse individuals Iafrate et al. 55 (2004) 198 ethnically diverse individuals (39 unrelated healthy controls + 16 individuals with known chromosomal imbalances) Sharp et al. 47 (2005) 199 ethnically diverse individuals Tuzun et al. Single (2005) 200 female NA15510 (fosmid library) Primary CNV detection method acgh: ROMA (85,000 probes, 35kb apart; Bgl II restriction enzyme) acgh: BAC array (2632 clones, 1Mb apart) acgh: BAC array (2194 clones, targeting 130 segmental duplication regions) In-silico Fosmid end sequence pair mapping Reference genome 12 samples (mostly from a single male sample); single ref per hybridization experiment Pooled male or female normal samples Single male sample NCBI reference human genome Build 35 (hg17) Source of DNA Blood, sperm, cell lines Whole blood + cell lines Cell lines Number of CNVs Size of reported CNVs 76 Average = 465kb 255 Average = 150kb 160 (represent 119 regions if merge BACs <250kb apart) Average BAC insert size = 164kb, some CNVs involve > 1 clone n/a 297 Median = 15.7 kb (8-329kb) Proportion of CNVs detected in > 1 sample Number of CNVs confirmed within same study CNV confirmation methods 41% 11/12 FISH, hybridization to HIND III ROMA platform 40% 19/19 qpcr, FISH 55% 7/11 FISH n/a 16/57 33/40 BAC array (comparing 97 genomes) Sequencing of fosmid inserts 30 YRI Conrad et al. (2006) 201 trios + 30 CEU trios (HapMap) McCarroll et al. (2006) HapMap individuals (4 ethnic groups) In-silico: Assessment of Mendelian inconsistencies in trios n/a n/a 586 (396 in YRI; 228 in CEU) YRI median = 8.5kb ( kb) CEU median = 10.6 kb ( kb) n/a n/a 541 Median = 7 kb In-silico: Analysis of Mendelian (1-745kb) 7/11 PCR 61% 92/105 qpcr, hybridization to custom high-density oligo array 51% 90/541 FISH, allelespecific fluorescence measure, PCR, qpcr

27 14 Hinds et al. 24 (2006) 203 ethnically diverse individuals (Discovery panel) Locke et al. 269 (2006) 204 HapMap individuals Mills et al. 36 (2006) 205 individuals (different ethnic groups) Redon et al. 270 (2006) 206 HapMap individuals (4 ethnic groups) Simon- Sanchez et al. (2007) wellphenotyped Cauasians, from NINDS study transmission errors, HW disequilibrium, null genotypes acgh: Highdensity oligo custom array acgh: BAC array (2007 clones, targeting 130 segmental duplication regions) In-silico: Computa -tional alignment of DNA resequencing traces from SNP studies to reference genome acgh: Whole Genome Tiling Path array (26,574 BACs) + SNP array intensity comparis on: 500K SNP platform SNP array intensity comparis on: 1)109,36 5 genecentric SNP array NCBI reference human genome Build 35 (hg17) Single male reference (NA10851) for acgh; pairwise comparison between all samples for 500K Reference genotyping clusters (used in Illuminaspecific CNVdetection algorithms) Cell lines Cell lines 215 Median = 0.75kb 384 (in 222 regions, if merge BACs < 250kb apart) (70bp 10kb) Average = 436kb NCBI reference human genome (build not indicated) Wellcharacterized single male sample (GM15724) (145kb- 1.4Mb) n/a 294,498 2bp- 9989bp Cell lines Cell lines 1447 merged CNVRs (913 on WGTP platform; 980 on 500K platform) Average = 341kb (WGTP) 206kb (500K SNP) 340 ~20kb 3Mb (for nonheterosomic CNVs) 67% 100/215 PCR 67% 136/207 Custom highdensity oligo array ~50% 173/ /189 PCR, sequencing 43% of all CNVs 5 13/24 Locusspecific quantitative assay Replicated on both platforms qpcr replication of CNV detection in DNA from whole blood

28 15 Wong et al. 95 samples (2007) 208 (include healthy blood donors, cancer screening program participants, 16 distinct ethnic groups) Levy et al. Single (2007) 209 diploid genome of Craig Venter Korbel et 2 al. (2007) 210 previously analyzed female subjects: NA15510 (presumed European ancestry) and NA18505 (YRI) Pinto et al. 506 (2007) 211 controls of North German descent (PopGen study) Wang et al. 112 (2007) 212 HapMap individuals (4 ethnic groups) 2) 300K SNP array acgh: BAC array (26,363 clones) In-silico: Random shotgun sequencing, comparison to NCBI reference genome acgh: 244K oligo array; 385 oligo array; 2 different SNP array platforms In-silico: Pairedend sequence mapping (generated by nextgeneration massive parallel sequencing) SNP array intensity comparis on: 500K SNP array SNP array intensity comparison: 550K Single male reference NCBI reference genome Build 36 for one-toone mapping of insertions/ deletions Single male reference (NA10851) for acgh and SNP array comparisons NCBI reference human genome Build 36 Multiple references Reference genotyping clusters (used in Illuminaspecific Whole blood, cell lines Whole blood Cell lines Cell lines Cell lines 3654 >40 kb 22% detected in >2 samples 919,584 indels (600 1kb in size) + 62 CNVs 1175 total (422 in NA15510 ; 753 in NA18505 ) Indels = 1-82,711 bp (average bp) CNV (~8kb- 2Mb) Majority <10kb, but variants up to >1Mb detected 1023 CNVRs (430 highconfidence; i.e. 
detected by 2 algorithms) Average size of highconfidence CNVRs = 369kb 2633 Average 31.5kb- 61.2kb (depending on ethnic n/a 37/40 indels 89% of 249 variants tested in individuals from 4 population 4% of CNVRs in >2% of population 265 Confirmed in 5 cases on oligo array 132/261 (NA15510) 328/616 (NA18505) 95 (NA15510) 97 (NA18505) 31/48 (NA15510) Comparison to fosmid clones from 8 other individuals PCR (+ sequencing breakpoints in a subset of amplicons) Also present in Celera assembly acgh with oligo tiling arrays comparing NA15510 to NA /1010 Overlap with CNVRs called in 269 HapMap samples analyzed with identical algorithms to PopGen % of CNVs were also detected in parents Assumes high heritability of CNVs, compares to CNVs called in parents

29 16 SNP array CNVdetection algorithms) group) 3 CNVs PCR, resequencing of breakpoints Zogopoulo us et al. (2007) controls from Ontario Familial Colorectal Cancer Registry (Canada); mostly Caucasian desmith et 50 males al. (2007) 214 (north French origin) Jakobsson et al. (2008) individuals, from 29 populations (Human Genome Diversity Project) Perry et al. 30 HapMap (2008) 216 individuals (4 populations ) Takahashi et al. (2008) healthy Japanese offspring of atomic bomb survivors SNP array intensity comparison: 100K and 500K arrays acgh-- 2-stages: 1) 185K oligo genomewide array (in 35 individuals) 2) custom highdensity 244K array SNP array intensity comparis on: Illumina Infinium Human HapMap 500 Beadchip acgh: Custom oligo array (470,163 probes) targeting CNVs previousl y detected by Redon et al. (2006) acgh: 2238 BAC custom array Multiple references Pooled references for 185K array; single female reference (NA15510) for 244K array Reference genotyping clusters (used in Illuminaspecific CNVdetection algorithms) Single male reference (NA10851) One male and one female Japanese Blood 578 CNVRs Blood 9244 multiprobe CNVs (1469 CNVRs) Cell lines Cell lines Cell lines 6089 singleprobe CNVs (4705 CNVRs) 3552 (map 1428 loci) 2664 (map 1153 loci) to to 251 (mapping to 30 regions) Average = 408kb (12bp 4.5Mb) Median 4.4kb Average = 82.7kb (deletion) 130.4kb (duplicati on) (2kb- 998kb) 15-33% smaller CNVs than detected by Redon et al. (2006) in same sample Average: 120kb (deletion) 160kb (duplicati on) < 7% are detected in >1% of population 4 qpcr 45% 90-95% of common CNVRs detected on 185K array 21 Replication on 244K array PCR, MLPA 50% 23/51 Sequencing over breakpoints 53% 14/14 rare CNV regions qpcr, FISH, PGFE- Southern Blot, sequencing) Wheeler et Single In-silico: (sequence Blood 163,608 (2bp- n/a Excellent acgh

30 17 al. (2008) 218 McCarroll et al. (2008) 219 diploid genome of James Watson 270 HapMap Nextgeneration sequencing, comparison to NCBI reference human genome + acgh: 244K oligo array million probe array (3 experiments with 2 different references) SNP microarra y (Affy6.0) 9 HapMap SNP Cooper et al. (2008) 220 microarray (Illumina ) Kidd et al. 8 HapMap (2008) 221 samples (4 ethnic groups) Bentley et al. (2008) 222 Single YRI male (NA18507) Wang et al. Single (2008) 223 Asian male (Han Chinese) In-sliico: Fosmidend sequence pair mapping In-silico: Paired reads of massively parallel sequencing In-silico: pairedend reads of massively mapping) NCBI reference human genome Build 36 (acgh) a) standard Caucasian male ref and b) NA HapMap Reference genotyping cluster NCBI reference human genome Build 35 NCBI reference human genome Build 36 NCBI reference human genome Build 36 Cell lines Cell lines Cell lines Cell line 23 CNVs (by acgh) 38,896bp ) indels (by sequence comparison) 26kb- 1.6Mb concordance in CNV calls when using same reference on different oligo arrays (data not shown) experiments against NA10851 on 244k and 2.1 million probe arrays 3048 CNVs (1320 CNVRs) 50% 27 loci qpcr % Fosmid sequence alignment date 7184 predicted nonredundant CNVs >6kb 50% 1471 MCD analysis (multiple complete restriction enzyme digest); High-density oligo arrays and SNP arrays; Correlation to SNP genotyping data for 130 deletions; Full-length sequencing of fosmid clones 4116 n/a blood 2474 Median = 492 bp n/a

31 18 Gusev et al (2009) 224 individuals from Kosrae island (Micronesia ) parallel sequencing In-silico: Uses novel algorithm to identify gaps in identityby-state stretches of SNP genotypes Itsara et al. (2009) SNP microarrays (Illumina ) Cell lines; blood Used other computational methods and compared to previous reports 13,843 (map to 3476 CNVRs) 77% Crossplatform comparison (to CGH array) Shaikh et 2026 (1320 al. (2009) 226 Caucasian; 694 African- American; 12 Asian- American) Kim et al. Single (2009) 227 Korean male (AK1) SNP microarray (Illumina HumanH ap550) In-silico: pairedend reads of massively parallel sequencing and endsequences of BAC clones Reference genotyping clusters (used in Illuminaspecific CNVdetection algorithms) NCBI reference human genome Build 36 Reference for CGH arrays not identified Blood 54,462 (nonunique CNVs map to 3272 CNVRs) Blood, sperm Median = 8kb bp- 2Mb 77.8% 16/ / /21 n/a qpcr array-based comparison (affy vs illumina) comparison to previously published data of a HapMap samples (Kidd et al) Sequence data complemented microarray data Ahn et al. Single (2009) 228 Korean male acgh: custom 24M microarray; SNP arrays In-silico: pairedend reads of massively parallel sequenc- NCBI reference human genome Build 36 Blood Kb n/a 2344 Detected in DGV (no direct confirmation)

32 19 ing Matsuzaki et al. (2009) 229 McKernan et al. (2009) 230 McElroy et al. (2009) 231 Conrad et al. (2009) HapMap YRI samples Single YRI male (NA18507) 385 African Americans and 435 White Americans Discovery in 40 females (19 CEU + 20 YRI + 1 diversity panel); genotyping in 450 HapMap acgh: Custom oligonucleotide microarrays In-silico: ABI SOLiD pairedend and splitreads (ligationbased sequencing assay) SNP array (Affy 500K) Discovery: Nimble- Gen 42M arrays Genotyping: Custom Agilent 105k arrays; SNP array (Illumina Infinium Human66 0W) Signal compared to normalized signal of all 90 samples NCBI reference human genome 50 African Americans females (derived from blood) Discovery: NA10851 Genotyping : pooled DNA of 10 European samples (9 males + 1 female) Cell lines Cell line Cell lines + Blood Cell lines 6578 Median = 4.9kb /40 qpcr (also compared to findings of previous studies % agreement)) kb n/a n/a n/a 1362 in African Americans in White Americans (map to 412 African- American unique CNVRs; 580 Whiteunique CNVRs; 76 shared CNVRs) Mean duplication = 827kb; mean deletion = 703kb 11,700 Median = 2.7kb 174 CNVRs 49% 79/99 (qpcr) 3 loci qpcr 15% FDR (microarray ) qpcr; other microarrays

33 20 Alkan et al. 3 (2009) 233 individuals Lin et al. 813 (2009) 234 Taiwanese individuals Readdepth of massively parallel sequencing reads Illumina 550K Bead- Chip Reference human genome Reference genotyping cluster Cell lines Blood 4452 (map to 1025 CNVRs) % of all variants Mean = 497kb 365 CNVRs 17/25 acgh FISH 279/365 CNVRs Identified on Affy 500K array Li et al (2009) 235 Caucasians and 700 Han Chinese Altshuler et 1184 al. (2010) 236 (HapMap3-11 populations ) Ju et al. Single (2010) 237 Caucasian male (HapMap NA10851) Pang et al. Single (2010) 238 diploid genome of Craig Venter SNP array (Affymetrix 500K) SNP arrays (Affymetrix 6.0 and Illumina 1M arrays) Data from previous acgh studies that used NA10851 as reference + readdepth of NA10851 massively parallel sequencing In silico: de novo assembly comparis on; pairedend reads; splitreads acgh: Agilent 24M + Nimble- Gen 42M arrays Half the samples were used as references for the other half and viceversa Reference genotyping clusters 73 individuals (from Conrad et al, 2010 and Park et al. 2010) NA15510 for Agilent 24M and NimbleGen 42M arrays Blood 2381 Median = 195kb Cell lines Cell line Whole blood 856 Median = 7.2 kb 1309 Median = 2.7kb 808,179 insertions or deletions (2641 1kb) (1-1.7Mb) 27.6% 680/985 overlap DGV All CNPs detected in 1% of population Compared to DGV No experimental validation n/a FDR of algorithms determined by comparing to CGH data for 34 individuals n/a n/a n/a n/a 89/96 SVs identified by sequence analysis 20/25 CNVs identified by microarrays 11,140 SVs in common to this study and Levy et al Compared to SVs called in previous analysis of same genome (Levy et al) PCR/qPCR SNP arrays: Affyme-

34 21 trix Illumina 1M Park et al. 30 females (2010) 239 (10 Korean; 10 HapMap Chinese; 10 HapMap Japanese) acgh: 24M custom Agilent arrays Single male reference (NA10851) Cell lines 20,099 (map to 5177 loci) Median = 2.7kb (438bp- 1.1Mb) 39% 106/116 loci qpcr Teague et al. (2010) 240 NA15510, NA10860, NA18994 Optical Mapping (singlemolecule restrictio n mapping) NCBI reference human genome Build 35 Cell lines kbmegabase s >1/3 all variants 42-61% (depends on platform being compared against) Compared to fosmid-end sequencing, paired-end sequencing, SNP array (Affy6.0), tiling arary CGH Kidd et al. 9 HapMap (2010) 241 individuals NCBI reference human genome Build 35 Cell lines Identifying fosmidend clones that did not map to reference genome 2363 novel insertion sites (correspond to 720 loci) Median = 1kb (1-20kb) 192 loci Sequencing, genotyping

35 22 Kidd et al. 17 (2010) 242 individuals Capillary end sequencing of fosmid clones NCBI reference human genome Build 35 Cell lines 973 n/a n/a n/a n/a Schuster et 5 al. (2010) 243 individuals Read depth acgh NCBI reference human genome Blood 187 n/a n/a n/a n/a Yim et al (2010) 244 Korean individuals SNP array (Affy5.0) NA pooled 100 Korean females Blood (map to 4003 CNVRs) Median 18.9kb 656 CNVRs in 1% of samples 14/16 loci qpcr Gayan et al. 801 (2010) 245 Spanish individuals SNP array (Affymetrix 250 NspI array) 25 female samples from other studies Blood 11,743 Median 150.7kb 623 CNVs present in >2 individuals 519 CNVs previously described Comparison to DGV (no experimental validation) The 1000 Genome Project Consortium (2010) 246 ; Mills et al. (2011) 247 Three pilots: (1) 3 trios from 2 families deep sequencing (avg 42x) (2) 179 unrelated low depth (2-6x) (3) deep sequencing Pairedend mapping, readdepth analysis, split-read analysis, and sequence assembly of massively parallel NCBI reference human genome Cell lines 14,327 50bp - ~1Mb <10% FDR PCR acgh

of exons of 1000 genes in 697 individuals (avg >50x) sequencing Chen et al (2011) 248 individuals from three European populations SNP array (Illumina Infinium HumanHap 300) Reference genotyping cluster Blood 4016 (map to 743 CNVRs) Mean = 205kb CNVRs Overlap with reported CNVs in DGV (no experimental validation done) Moon et al. Discovery: (2011) Korean individuals Genotyping : 8842 Korean individuals aCGH array (NimbleGen 3 x 720K) + SNP array (Affy 5.0) NA10851 Blood 8779 (576 CNVRs chosen for frequency analysis) Median length of 576 CNVRs = 113kb (1kb-4.56Mb) 807 CNVRs (576 chosen for frequency analysis in larger sample set) 66.7%-100% positive predictive values for 20 randomly chosen CNVRs TaqMan assays

Studies listed in chronological order by publication date. CGH, comparative genomic hybridization; oligo, oligonucleotide; FISH, fluorescence in situ hybridization; ROMA, representational oligonucleotide microarray analysis; qPCR, quantitative polymerase chain reaction; BAC, bacterial artificial chromosome; YRI, Yoruba in Ibadan, Nigeria; CEU, Utah residents with ancestry from northern and western Europe; NCBI, National Center for Biotechnology Information; PFGE, pulsed-field gel electrophoresis; MLPA, multiplex ligation-dependent probe amplification

2.2 CNV Databases

The Database of Genomic Variants (DGV) was founded in conjunction with the publication of the first few CNVs in 2004 by Sebat et al. 197 and Iafrate et al. 198, to catalogue past and future discoveries of structural variants in the human genome. Curated by The Centre for Applied Genomics (TCAG) in Toronto, the database aims to summarize published data on structural variation detected in healthy control samples, and it is periodically updated as new data become available. 198 At this time, the DGV presents data from each study separately, merging overlapping CNV calls (in the same direction) only across samples within the same study.
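The merging rule described above (overlapping calls in the same direction are combined, but only within one study) can be illustrated with a minimal sketch. The tuple layout, the function name `merge_calls`, and the rule that any overlap of at least one base triggers a merge are assumptions made for illustration; the text does not specify the exact criteria the DGV applies.

```python
from collections import defaultdict


def merge_calls(calls):
    """Merge overlapping CNV calls that share study, chromosome and direction.

    calls: iterable of (study, chrom, start, end, direction) tuples with
    inclusive coordinates (a hypothetical layout, not a DGV file format).
    """
    groups = defaultdict(list)
    for study, chrom, start, end, direction in calls:
        # Calls from different studies or of opposite direction never merge.
        groups[(study, chrom, direction)].append((start, end))

    merged = []
    for (study, chrom, direction), intervals in sorted(groups.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlaps the region being built
                cur_end = max(cur_end, end)
            else:
                merged.append((study, chrom, cur_start, cur_end, direction))
                cur_start, cur_end = start, end
        merged.append((study, chrom, cur_start, cur_end, direction))
    return merged
```

In this sketch a loss and a gain at the same locus remain separate entries, as do identical calls reported by two different studies.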
Moreover, calls made by different platforms in the same study are also presented separately. Regions are displayed in relation to the human genome reference assembly (Build 35/May 2004, Build 36/March 2006, or GRCh37/Feb 2009). The latest version of the DGV (updated Nov 02, 2010) contains 101,923 entries mapped to the human genome Build 36, corresponding to 66,741 CNVs >1kb (mapping to 15,963 genomic loci), 34,229 InDels (relative gains or losses between 100bp and 1000bp in size), and 953 inversions. Forty-two published articles are cited as the source of data in the DGV. A beta version of the database was released in October 2011; it provides access to data in partner databases at the European Bioinformatics Institute (DGVa) and the National Center for Biotechnology Information (dbVar). The DGVa repository has been the primary supplier of data to the DGV. dbVar includes structural variants from multiple species and also includes data from clinical studies (non-healthy populations). Future submission of CNV data will be managed by DGVa and dbVar, while the role of the DGV will be to manually curate and visualize selected studies to allow better interpretation of the clinical significance of CNVs. Clinically significant CNVs (mainly those linked to genomic syndromes) are catalogued in DECIPHER 250 (DatabasE of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources) and ECARUCA 251 (European Cytogeneticists Association Register of Unbalanced Chromosome Aberrations). In addition, there are several data sources for copy number alterations detected in tumors or cancer cell lines, including The Wellcome Trust Sanger Institute Cancer Genome Project 252 and the Pancreatic Expression Database 253.

2.3 Discovery and Genotyping of CNVs

A variety of platforms and algorithms have been applied to CNV detection, with a wide range of resolution, coverage, and signal-to-noise ratio, resulting in significant non-overlap between the CNVs detected by different platforms in the same samples.
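One way to quantify the non-overlap between platforms is to count how many calls from one platform are matched by a call from another. The sketch below uses a reciprocal-overlap criterion with a 50% threshold; both the criterion and the threshold are assumptions chosen for illustration rather than the method of any study cited here, and the function names are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for half-open intervals a and b."""
    (a_start, a_end), (b_start, b_end) = a, b
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def concordant_calls(calls_a, calls_b, threshold=0.5):
    """Calls from platform A matched by at least one call from platform B.

    Each call is a (chrom, start, end) tuple. A match requires the same
    chromosome and a reciprocal overlap of at least `threshold`.
    """
    matched = []
    for chrom_a, start_a, end_a in calls_a:
        for chrom_b, start_b, end_b in calls_b:
            same_chrom = chrom_a == chrom_b
            if same_chrom and reciprocal_overlap(
                (start_a, end_a), (start_b, end_b)
            ) >= threshold:
                matched.append((chrom_a, start_a, end_a))
                break
    return matched
```

A reciprocal criterion prevents a very large call on a low-resolution platform from counting as confirmation of a small call that it merely contains.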
The earliest studies mapping CNVs in the human genome were based on fluorescence in situ hybridization (FISH) and spectral karyotyping and were limited in resolution to large variants (>500kb), most of which were associated with disease. 254 Later, genome-wide CNV mapping became possible with array comparative genomic hybridization (aCGH), a technique in which fluorescently labeled DNA samples from two sources are competitively hybridized to a single array of immobilized target DNA sequences, and computational algorithms are used to analyze the hybridization ratio of the test and reference samples. The DNA targets on the arrays originally comprised Bacterial Artificial Chromosome (BAC) clones but were later made of long oligonucleotides. 195 Early CGH arrays were of low resolution (the typical CNV size detectable by these platforms was greater than 100kb), and they significantly overestimated the true number of bases affected by CNVs. 197,198 Later, high-density oligonucleotide tiling CGH microarrays became available, allowing more accurate determination of CNV breakpoints and detection of many more small CNVs. 232

One important consideration in the use of CGH arrays for CNV detection is the reference sample. In any given aCGH experiment, it is not possible to distinguish a copy number loss in the test sample from a gain in the reference sample in the same region (or vice versa), since both scenarios generate the same hybridization signal ratio. Moreover, a loss or gain present in both samples would be missed entirely (since the signal ratio would appear to be 1). Ideally, the reference sample genome should be well characterized using a variety of methods, and the same reference sample should be hybridized against all test samples in an experiment to allow better comparison of the results. To date, several individuals have had their genomes extensively mapped and have been used repeatedly in CNV studies (HapMap NA10851, NA18507, NA15510).

Another type of microarray used for CNV detection is the SNP array. Originally designed to genotype SNPs for genome-wide association studies, these arrays contain multiple probes corresponding to each selected SNP, and a single test DNA sample is hybridized to each array. Various computational algorithms have been developed to estimate copy number at each SNP location from the hybridization intensity data; the two primary approaches are hidden Markov models and segmentation. Earlier SNP arrays had lower resolution and coverage for CNV detection due to the nature of SNP selection (focused on tag SNPs with a minimum allele frequency of 1% to maximize coverage of the genome while minimizing cost, and avoiding SNPs in regions prone to genotyping error, as flagged by violations of Hardy-Weinberg equilibrium or Mendelian inheritance errors).
206,213 More recent SNP arrays from Affymetrix and Illumina not only have a higher density of SNPs distributed genome-wide (approximately 1 million) but also include probes for known CNV regions, allowing discovery of smaller CNVs and genotyping of polymorphic CNVs. 219,220 Compared to CGH arrays, SNP arrays have the added advantage of SNP genotype information, which can be used to detect CNVs (by analyzing B-allele frequency, the proportion of total allele signal contributed by a single allele) and to provide information on loss-of-heterozygosity (LOH) and uniparental disomy (UPD). Both CGH and SNP microarrays are limited to detecting CNVs that map to regions present in the reference genome on which the microarray design was based. Moreover, neither platform distinguishes between tandem and interspersed duplications, and both tend to be more sensitive in detecting deletions than duplications (due to a larger signal ratio differential between 2 and 1 copies than between 2 and 3 copies, for example). 195 Furthermore, even the highest-resolution arrays available lose sensitivity in genome-wide detection of CNVs smaller than 10kb. 219

Sequence-based methods are increasingly used to bridge the gap in mapping the full extent of genome variability. Even in the early days of CNV discovery, several CNV papers were published based on mining of genotyping errors , fosmid paired-ends 200,221, and massively parallel sequencing of paired ends of 3-kb fragments. 210 Since then, many more studies have used data from next-generation sequencing technologies to identify CNVs, although substantial bioinformatic challenges remain in analyzing these data. The four main methods of using sequencing data to identify CNVs are 255 : (1) identifying read-pairs whose mapping span is inconsistent with the reference genome; (2) identifying regions with significantly increased or reduced read-depth compared to the distribution of read-depth across the (presumed diploid) genome; (3) identifying split-reads, in which the alignment of a read to the reference genome is broken; (4) sequence assembly. To date the most commonly used method has been read-pair mapping. All four approaches are limited in their sensitivity, specificity, and breakpoint accuracy depending on read length, insert size, and physical coverage. Future directions in CNV detection include nascent technologies such as optical mapping 256, nanochannel flow cells 257, and emulsion picolitre droplet PCR 258, which are being developed to allow high-throughput detection of CNVs at the level of individual cells and/or molecules.

Multiple studies have demonstrated significant non-overlap between different platforms and algorithms when analyzing the same samples. 211,259 Given the variability in sensitivity and specificity of CNV detection by the various platforms to date, validation is essential. Validation of detected CNVs has taken two main forms in most studies: detection of the same (or overlapping) variants by different studies, and replication within the same study (different array platform, PCR, qPCR, FISH, or other experimental methods).
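As a concrete illustration of method (2) above, read-depth analysis, the following minimal sketch labels fixed-size genomic windows by comparing each window's read count to the genome-wide median, which is taken to represent two copies. The function name, the thresholds, and the absence of any GC or mappability correction are simplifying assumptions and do not correspond to a published algorithm. The expected log2 ratios are log2(1/2) = -1 for a heterozygous deletion and log2(3/2), about 0.58, for a single-copy gain, which also shows why deletions stand out more clearly than duplications.

```python
import math
from statistics import median


def read_depth_calls(window_counts, loss_threshold=-0.6, gain_threshold=0.4):
    """Label each window 'loss', 'gain' or 'normal' from its read count.

    window_counts: reads mapped to consecutive equal-sized windows.
    The thresholds are illustrative log2-ratio cut-offs, not tuned values.
    """
    baseline = median(window_counts)  # assumed to reflect the diploid state
    if baseline <= 0:
        raise ValueError("median read count must be positive")

    labels = []
    for count in window_counts:
        if count == 0:
            labels.append("loss")  # no reads: log2 ratio is undefined
            continue
        log_ratio = math.log2(count / baseline)
        if log_ratio <= loss_threshold:
            labels.append("loss")
        elif log_ratio >= gain_threshold:
            labels.append("gain")
        else:
            labels.append("normal")
    return labels
```

A real caller would also merge adjacent windows with the same label into segments and assess their statistical significance; this sketch stops at per-window labels.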
Overlap with regions identified in previous studies lends support to the variability of those specific regions in the human genome, although many of the non-overlapping regions are also real (as demonstrated by other replication methods). Similarly, replication on different platforms or with different calling algorithms adds validity to CNVs detected in any tested sample, but regions identified by a single approach can also be real. Experimental replication of CNVs provides the highest level of validation, but those methods are often time-consuming and not optimized for high-throughput testing of multiple regions and samples. As a result, most studies experimentally validated only a subset of their detected CNVs (Table 2). However, high-throughput validation techniques have become available (e.g. Sequenom) 260, so most CNVs published in the future should be confirmed more readily.

While most early CNV studies focused on variant discovery, determining disease association with specific CNVs requires accurate genotyping of the CNVs of interest. A number of techniques have been employed for genotyping, including PCR-based (e.g. PCR across breakpoints; quantitative PCR; multiplex methodologies that assay multiple loci at once), SNP-array-based (e.g. customizing arrays using the Illumina GoldenGate assay for specific CNVs; using tag SNPs to impute common CNVs that are in high linkage disequilibrium (LD) with the tag SNP), aCGH-based (e.g. customized high-density tiling arrays with probes for known CNVs), and sequencing-based (e.g. building a library of breakpoints discovered and validated in previous sequencing-based studies and comparing future de novo sequences against it to rapidly genotype CNVs in those locations; calibrating aCGH data using sequencing-based data to obtain absolute copy numbers). 195 Accurate genotyping is easier for deletions than for duplications, and is particularly challenging in multi-allelic regions.

2.4 Structure and mechanism of CNV formation

Several mechanisms of genomic rearrangement that predispose to duplications and deletions have been identified, driven by structural motifs in the genome. One of the earliest observations in CNV surveys was the association of CNVs with segmental duplications. 197,198,199,200,206,208,209,212 Segmental duplications (also called low-copy repeats or duplicons) are genomic regions ≥1kb in size with ≥90% sequence homology, present in multiple copies and covering approximately 5% of the human genome. 261 Segmental duplications, particularly those with 97% or greater sequence identity and less than 10Mb distance between them, can cause misalignment of homologous chromosomes or sister chromatids and mediate non-allelic homologous recombination (NAHR), thus producing genomic duplications and deletions of the regions flanked by the segmental duplications. 262 In addition, segmental duplications themselves may be CNVs if they are not yet fixed in the human genome and vary in copy number between individuals. 199 Most recurrent CNVs appear to be caused by NAHR mediated by segmental duplications. However, not all CNVs are associated with segmental duplications, and other mechanisms have been implicated in CNV formation. Repetitive elements found in the breakpoint junctions of CNVs include Alu SINEs, L1 LINEs, and long terminal repeats.
210,247 Other mechanisms associated with CNV formation include non-homologous end-joining (NHEJ), retrotransposition events (otherwise known as mobile element insertion, or MEI), Variable Number Tandem Repeat (VNTR) expansion/contraction events, replication Fork Stalling and Template Switching (FoSTeS), and microhomology-mediated break-induced replication (MMBIR). 263 In some cases, a parental inversion may predispose to de novo unbalanced variants in the children, as in the 17q21.31 microdeletion syndrome. 264 Multiple studies have noted certain genomic locations as hotspots for CNVs, including 6cen, 8pter, 15q13-14, 11q11, 19q13, and 7q ,212,210,221 Some regions, such as 8p23, appear to be hotspots for recombination as well as sequence variation, showing an enrichment of both structural variants and SNPs 205,221. In a recent report analyzing next-generation sequencing data from the 1000 Genomes Project, structural variants were found to cluster into hotspots by the mechanism of their formation, with VNTRs clustering near the centromeres and NAHR events near the telomeres 247. Possible explanations for genomic variation hotspots include: older evolutionary age of the target genomic segments; a biological functional effect of the involved regions driving selective pressure to maintain diverse alleles; or a complete lack of functional importance and selective pressure. 205,

2.5 Population Genetics of CNVs

The population genetics of CNVs is somewhat more complex than that of SNPs. Both forms of variation may occur de novo or be inherited, but the de novo mutation rate for CNVs has been estimated to be 2-4 orders of magnitude greater than for single base mutations. Certain genomic regions are susceptible to recurrent rearrangements due to their structure (e.g. flanked by segmental duplications), but when Mendelian inheritance was specifically investigated, most common CNVs were indeed inherited from a parent. 219 Studies have differed in their power to detect common versus rare CNVs, thus yielding conflicting data on the proportion of CNVs in the genome that are polymorphic (frequency >1%). Earlier SNP arrays and lower-resolution CGH arrays tended to be biased against common CNVs, so the majority of CNVs identified using those platforms were rare in the general population. However, higher-resolution SNP arrays (such as Illumina 1M and Affymetrix 6.0) as well as very high-density custom CGH arrays succeeded in detecting and genotyping a significant proportion of common CNVs over 1kb, and it is evident that most of the variation between any two individuals at that resolution is due to common CNVs that obey Hardy-Weinberg equilibrium. 219,232 Sequencing-based technologies have identified more CNVs of smaller size, and the data are a mix of rare and common CNVs. 247 Most common CNPs are biallelic (with a bias for detecting deletions on the platforms used), and most of those were found to be tagged well by SNPs of similar frequencies, suggesting that they are ancestral events.
219 CNPs that are in strong LD with tagging SNPs can be easily genotyped in association studies, thus facilitating their study. However, SNP taggability depends on the frequency as well as the density of nearby SNPs, meaning that some CNVs of lower frequency or in regions sparsely populated by SNPs will need to be genotyped directly. The same is true for complex CNVs or CNVs that have multiple copy number alleles, as those also tend to be in poor LD with nearby SNPs. Studies in populations of different ethnicities have suggested population differentiation in the frequency of some CNVs, and some CNVs do appear to be population-specific ,232 In keeping with the out-of-Africa hypothesis, African populations have been found to have a higher number of rare or low-frequency CNVs than non-African populations. 229 These findings emphasize the importance of matching the ethnicity of cases and controls in association studies to minimize spurious associations of population-specific CNVs with disease.
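How well a SNP tags a biallelic CNP can be summarized by the squared correlation between the two genotypes across individuals. The sketch below is a simplified illustration under stated assumptions: genotypes are coded as 0, 1 or 2 copies of one allele, phase is ignored (so the value approximates, rather than equals, haplotype-based r²), and the function name is hypothetical.

```python
def genotype_r2(cnv_genotypes, snp_genotypes):
    """Squared Pearson correlation between two genotype vectors.

    Each list holds one value per individual (0, 1 or 2 copies of an allele).
    Returns 0.0 when either variant is monomorphic in the sample.
    """
    n = len(cnv_genotypes)
    if n != len(snp_genotypes) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")

    mean_c = sum(cnv_genotypes) / n
    mean_s = sum(snp_genotypes) / n
    cov = sum(
        (c - mean_c) * (s - mean_s)
        for c, s in zip(cnv_genotypes, snp_genotypes)
    )
    var_c = sum((c - mean_c) ** 2 for c in cnv_genotypes)
    var_s = sum((s - mean_s) ** 2 for s in snp_genotypes)
    if var_c == 0 or var_s == 0:
        return 0.0
    return cov * cov / (var_c * var_s)
```

A value near 1 indicates that the SNP genotype predicts the CNP genotype almost perfectly, so the CNP can be assessed indirectly; a low value indicates that direct genotyping is needed, as the paragraph above notes for rare, complex, or multi-allelic CNVs.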

2.6 Phenotypic impact of CNVs

The earliest known CNVs, usually large genomic deletions and duplications often encompassing many genes, were invariably linked to significant genomic disorders. With the discovery of ubiquitous CNVs in healthy controls, interpreting the functional significance of such genomic alterations became more complex. Of note, many studies have observed a bias against genic CNVs in general, and large genic deletions in particular 265,232, suggesting that such genomic alterations negatively impact fitness and undergo purifying selection. Interestingly, there is also some evidence of positive selection (or potentially reduced purifying selection 266 ) acting on some genes, such as the salivary amylase gene AMY1, which is present in higher copy number in humans than in other primates and in higher copy number in human populations with high-starch diets than in populations with traditionally low-starch diets. 267 By contrast, many common CNVs have been identified at high frequencies in all human populations and appear to have only a modest effect, if any, on phenotype. Early CNV surveys identified a large number of genes as copy number variable, but care must be exercised in interpreting those results: those early platforms tended to overestimate the size of CNVs, so the number and identity of the genes reported in earlier studies may be inaccurate. However, even more recent studies, with the power to identify smaller CNVs with more accurate breakpoints, have detected thousands of genes that are affected at least in part by deletions or duplications. For example, Pang et al. 238 reported an extensive analysis of the diploid genome of Dr. Craig Venter based on multiple microarray and sequencing platforms, and they identified 189 genes completely encompassed by gains or losses and an additional 4,867 genes whose exons were impacted by CNVs.
While they did find an overall paucity of CNVs affecting genes associated with autosomal dominant or recessive diseases, cancer syndromes, and imprinted and dosage-sensitive genes, 573 of the CNV genes were in the Online Mendelian Inheritance in Man (OMIM) database. Conrad et al. 232 used a discovery cohort of 20 CEU and 20 YRI HapMap individuals to detect common CNVs using a high-density CGH array, then genotyped 450 HapMap samples at approximately 5,000 common CNVs. On average, they found 445/1,098 CNVs overlapping 622 genes between any two individuals, and they identified 2,698 genes affected by CNVs in the total sample set. Over half of partial gene deletions were predicted to induce frameshifts, and 267 genes appeared to be affected by unambiguous loss-of-function CNVs. Genes affected by CNVs appeared to be enriched for extracellular functions such as cell adhesion, recognition, and communication, whereas they appeared to be biased away from intracellular functions such as metabolic and biosynthetic pathways. These results are consistent with those of previous and subsequent CNV surveys, which also reported enrichment of immune and defense responses as well as neurological system processes. 239,268,247 Those latter functions are also proposed to have been involved in the adaptive differentiation of humans and chimpanzees. 269
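The distinction drawn above between genes completely encompassed by a CNV and genes only partially affected can be sketched as a simple interval comparison. This is an illustrative simplification: it compares whole gene spans rather than individual exons, uses half-open coordinates, and the function name and labels are assumptions rather than the procedure of the cited studies.

```python
def classify_gene(gene, cnvs):
    """Classify a gene as 'encompassed', 'partial' or 'unaffected'.

    gene: (chrom, start, end); cnvs: list of (chrom, start, end).
    All intervals are half-open: start is included, end is excluded.
    """
    chrom, g_start, g_end = gene
    status = "unaffected"
    for c_chrom, c_start, c_end in cnvs:
        if c_chrom != chrom:
            continue
        if c_start <= g_start and c_end >= g_end:
            return "encompassed"  # the CNV spans the entire gene
        if c_start < g_end and c_end > g_start:
            status = "partial"  # the CNV overlaps only part of the gene
    return status
```

Exon-level annotation would be needed to separate partial overlaps that disrupt coding sequence from those confined to introns, a distinction that matters for predicting frameshifts and loss of function.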

The exact contribution of CNVs to gene expression variability, and how it relates to that of SNPs, is unclear. Stranger et al. 270 interrogated the contribution of CNVs detected by Redon et al. 206 on BAC-CGH array and Affymetrix 500K array to gene expression variability in lymphoblastoid cell lines from 210 HapMap samples (within a CNV-gene distance of 2Mb), and found that 17.7% of 1,061 genes with expression variability were associated with CNVs, with over half of the associations appearing to be long-range (i.e. the CNV did not overlap the gene whose expression it appeared to impact). While 83.6% of variability was attributed to SNPs, only 1.3% of genes were associated with both CNVs and SNPs. Schlattl et al. 271 extended this analysis of CNV-expression association by comparing normalized transcriptome data for lymphoblastoid cell lines (LCLs) from 60 CEU and 69 YRI HapMap samples to CNV data published for the same samples on multiple platforms (high-resolution tiling CGH array 232, high-resolution SNP array 219, and next-generation sequencing data 247 ). By concentrating on common CNVs and restricting the effect range to 200kb or less, they found a significant association between CNVs and the expression of 110 genes. Despite an abundance of deletions in the CNV set, Schlattl et al. 271 found enrichment of duplications among CNVs associated with variable expression, suggesting purifying selection acting against deletions that impact gene expression. When comparing results from this analysis to previously published studies, the authors were able to confirm several CNV-gene expression associations, including 6/13 that were identified by Stranger et al. 270 within the same effect range. Most of the CNV associations (70%) occurred without overlap of the CNV with the respective gene, although the range of effect appeared to be <100 kb in most cases.
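The general shape of such an analysis (pair each CNV with genes inside a fixed effect range, then correlate copy number with expression across samples) can be sketched as follows. This is a minimal illustration under stated assumptions: the data structures and function names are hypothetical, a plain Pearson correlation stands in for the statistical models of the cited studies, and no significance testing or multiple-testing correction is performed.

```python
import math


def interval_gap(a_start, a_end, b_start, b_end):
    """Distance in bases between two half-open intervals; 0 if they overlap."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cnv_expression_pairs(cnvs, genes, copy_numbers, expression, max_gap=200_000):
    """Correlate copy number with expression for CNV-gene pairs within max_gap.

    cnvs, genes: {name: (chrom, start, end)}
    copy_numbers, expression: {name: [one value per sample, same sample order]}
    Returns a list of (cnv_name, gene_name, gap_in_bases, correlation).
    """
    results = []
    for cnv_name, (c_chrom, c_start, c_end) in cnvs.items():
        for gene_name, (g_chrom, g_start, g_end) in genes.items():
            if c_chrom != g_chrom:
                continue
            gap = interval_gap(c_start, c_end, g_start, g_end)
            if gap <= max_gap:
                r = pearson_r(copy_numbers[cnv_name], expression[gene_name])
                results.append((cnv_name, gene_name, gap, r))
    return results
```

A gap of 0 marks a CNV that overlaps the gene, while a positive gap marks the longer-range associations discussed above; the sign of the correlation shows whether expression moves with or against copy number.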
Interestingly, several intronic deletions were associated with gene expression, but expression was decreased in only half of the cases and increased in the other half. Such a mix of positive and negative CNV effects on expression was also observed for the CNVs that did not directly overlap genes. CNVs that overlapped exons or completely encompassed genes usually affected expression in the same direction as the copy number change. Unlike Stranger et al. 270, Schlattl et al. 271 found that most CNVs associated with gene expression (70%) overlap previously published SNP-expression associations. This discrepancy in overlap likely reflects the differences in CNV characteristics detectable by earlier platforms (more rare than common CNVs, biased away from common SNPs) relative to the platforms used by Schlattl et al. 271

Conrad et al. 232 proposed that since most common genotyped CNVs were well tagged by SNPs, SNP-based genome-wide association studies would be expected to have already screened most common CNVs for association with common diseases. Based on the finding by Conrad et al. 232 that less than 5% of trait-associated SNPs in 279 publications were in linkage disequilibrium (r² > 0.5) with a nearby CNV, and the additional finding by the Wellcome Trust Case Control Consortium that only three CNV loci were reliably associated with one or more of eight common diseases (all of which are tagged by SNPs previously detected in genome-wide association studies), the authors of those papers argued that common genotyped CNVs do not explain a significant proportion of heritability in common diseases. Nonetheless, the findings of Schlattl et al. 271 indicate that a non-negligible proportion of CNVs associated with gene expression variability do not link to SNPs, and moreover 57% of genes with expression associated with CNVs were found to have a greater correlation with their most strongly associated CNV than with any nearby SNP. This was especially true for CNVs that overlap exons (10/10). Other studies of CNVs in mice, rats, and Drosophila have observed a similar impact of CNVs on gene expression.

Many diseases have been associated with CNVs. Recurrent de novo microdeletions and microduplications are linked to many sporadic genomic disorders such as Williams-Beuren syndrome, Angelman syndrome/Prader-Willi syndrome, Charcot-Marie-Tooth disease 1A, and idiopathic mental retardation. 195 Rare CNVs (de novo or heritable) have been associated with neuropsychiatric disorders such as autism spectrum disorder and schizophrenia; neurodegenerative diseases such as Parkinson Disease 275 ; and metabolic disorders such as obesity 276, among others. Common heritable CNVs have been associated with autoimmune and infectious diseases such as Crohn's disease 277, rheumatoid arthritis 278, diabetes mellitus 278, psoriasis 279, lupus 280, and susceptibility to HIV infection 281. Both rare and common CNVs have also been associated with susceptibility to cancer, as discussed below. Determining the pathogenicity of CNVs, and delineating the responsible gene(s) or genomic elements, can be challenging. CNVs may affect phenotype in a number of ways, including: increasing or decreasing the copy number of dosage-sensitive genes; disrupting genes or producing fusion genes; position effects; unmasking recessive alleles; and affecting communication between alleles on homologous chromosomes. 264 The effect of CNVs is also moderated by variable penetrance and expressivity. 264 Some CNVs have been associated with a wide range of phenotypes (e.g.
1q21.1 has been associated with dysmorphic features, cardiac abnormalities, learning difficulties, mental retardation, autism, and schizophrenia) 282 ; this may reflect ascertainment bias due to the study design (e.g. phenotype-driven vs. genotype-driven) 264 but may also reflect variability in expressivity. Some studies have also demonstrated a buffering effect in cells, whereby the observed expression level of a given gene does not correspond linearly to the level expected from copy number. 271,272 It should be noted that in addition to copy number, the phase information and genomic context of CNVs are also important for understanding the potential effect of the variant. 264

Other challenges in CNV research include distinguishing germline from somatic alterations. Many studies used DNA from immortalized lymphoblastoid cell lines, and it has become apparent that some structural variants occur exclusively in, or may be amplified by, the Epstein-Barr virus (EBV) transformation process. 278,283 Moreover, few studies addressed the issue of somatic mosaicism or heterosomy (variants present in only a fraction of cells in the tissue/blood sample), since most platforms/algorithms are not designed to identify the partial nature of these regions, and few studies compared the genomes of different tissues from the same individual. 207,212,284 One survey of large structural variations in blood-derived DNA in 957 controls and 1,034 bladder cancer patients identified mosaic structural variations in 1.7% of all individuals with no significant difference between cases and controls. 285 The regions most commonly found to be somatic or cell-line artifact are T cell receptors or immunoglobulin genes, including loci at 2q11 200, 2p ,212, 22q ,208,212, 14q ,208,212, and 14q as well as chromosomes 9 and Interestingly, some studies identified copy-number variation within monozygotic twin pairs, both phenotypically concordant and discordant, suggesting post-twinning somatic development of CNVs. 286,287,

2.7 CNVs and cancer

Chromosomal aneuploidy, whether involving entire chromosomes, chromosomal arms, or segments of chromosomes, is a characteristic feature of most solid malignant tumors. Chromosomal instability (CIN) is a high rate of loss and gain of whole chromosomes and has been attributed to various mechanisms that interfere with correct segregation of chromosomes during mitotic division. 289 Chromosomal structure instability (CSI) is another hallmark of most solid cancers, involving multiple chromosomal segmental breakages and fusions associated with telomere shortening, inappropriate DNA repair of double-strand breaks, and chromosomal fragile sites, resulting in amplifications or deletions of the involved genomic regions. A chicken-vs-egg debate has revolved around the relationship of CIN and CSI with the development of cancer: not all aneuploid cells are unstable or tumorigenic, and many copy number alterations in tumors appear to be passenger rather than driver mutations. Nonetheless, there is evidence for a role of CIN and CSI in cancer development, such as generating LOH at loci of inactivated tumor suppressor genes or amplified oncogenes.
290 Two decades ago, comparative genomic hybridization (CGH) was developed to facilitate identifying regions of copy number gain and loss by hybridizing biotinylated DNA from paired tumor and normal samples to metaphase chromosome spreads. Several years later, array-based CGH was introduced and became a commonly used tool in the study of cancer genomes. Later, SNP microarrays also came into use, providing the added advantage of detecting regions of copy-neutral LOH and uniparental disomy. Very recently, the drop in cost of whole-genome and exome sequencing has allowed the use of these technologies to identify a wide range of variants in tumors, from single base to large structural variants. In keeping with the classical Knudson two-hit hypothesis for inactivation of tumor suppressors, a number of well-known tumor suppressor genes were first identified by analyzing focal homozygous deletions in cancer in combination with linkage and/or LOH results (e.g. CDKN2A/B, PTEN, WT1, BRCA2). Those discoveries spurred the identification of numerous candidate tumor suppressors by characterizing recurrent deletions in tumors or cancer cell lines. Mouse studies have even suggested that haploinsufficiency of some cancer genes can be sufficient to cooperate with other oncogenic alterations in

initiating tumor development (e.g. LKB1 and BRCA2 heterozygosity have been reported to accelerate pancreatic tumor development in mice with activated Kras mutations). Similarly, genomic amplifications in cancer can help identify candidate oncogenes. Moreover, some deletions and amplifications carry prognostic significance (e.g. MYCN amplification in neuroblastoma, ERBB2 amplification in breast cancer, 18q deletion in colon cancer), and whole-genome profiling of copy number alterations in tumors can be diagnostic or prognostic (e.g. distinguishing gastrointestinal stromal tumors from leiomyosarcomas 291 ; an aCGH classifier based on BRCA1-mutated breast cancer predicting sensitivity to double-strand-DNA-break-inducing chemotherapy in patients without germline BRCA1/2 mutations 292 ). Structural rearrangements of pancreatic adenocarcinoma have been described in multiple studies, ranging from cytogenetic karyotyping 293 and microsatellite genotyping 12,182,294 to CGH and SNP microarrays 37,307,308,309 to next-generation sequencing 38. Certain patterns have emerged: all chromosomal arms manifest genomic rearrangements, and the most frequently reported rearrangements are losses on 1p, 3p, 6p, 6q, 8p, 9p, 9q, 17p, 18q, 19p and gains on 8q. Some studies attempted to identify candidate tumor suppressor genes or oncogenes, and while most results were of insufficient resolution to pinpoint a target gene, certain genes were highlighted by multiple studies using a combination of genomic and expression data (e.g. SMURF1 on 7q ,303 and GATA6 on 18q ,310 were proposed as novel oncogenes). LOH is a common event across the pancreatic cancer genome, often occurring in the form of whole chromosome loss, and there was no significant difference in the pattern of LOH between sporadic and familial tumors. 
12,182 One recent study that used massively parallel sequencing technology to detect variants at fine resolution in 3 primary tumors and 10 metastases reported significant inter-patient heterogeneity in the number, type, and distribution of rearrangements. 38 Interestingly, one sixth of all rearrangements were in a pattern the authors termed fold-back inversions, whereby regions are duplicated but with the duplications facing in opposite directions. This appeared to be an early event in the development of pancreatic cancer and is associated with telomere loss. Moreover, sequence analysis of metastases indicated that this type of rearrangement did not continue occurring later in the pancreatic cancer developmental pathway, suggesting a reactivation of telomere repair function. Other interesting findings from this analysis of somatic rearrangements in pancreatic cancer metastases were: evidence of ongoing clonal evolution in the primary tumor among cells capable of initiating metastases (based on finding some rearrangements only in some metastases), evidence for driver mutations involved in metastatic spread (based on finding some rearrangements only in the metastases but not in the primary tumor), and evidence for differences in evolution of metastases within each organ. Less well studied than somatic genomic rearrangements in cancer is the relationship between germline CNVs and cancer susceptibility. It is well known that moderate-to-high-penetrance rare germline CNVs contribute to the heritability of familial cancer. Large germline genomic rearrangements that are absent

or rare in healthy populations have been reported as the cause of 15% of Familial Adenomatous Polyposis (APC) 311, 19% of Von Hippel Lindau disease (VHL) 312, 4% of Hereditary Diffuse Gastric Cancer (CDH1) 313, 2-12% of Hereditary Breast and Ovarian Cancer (BRCA1 and BRCA2) , 6-27% of Lynch Syndrome (MSH2 & MLH1 genes) 321,322, 16% of Peutz-Jeghers Syndrome (STK11) 323, and 15% of juvenile polyposis (SMAD4, BMPR1A, and PTEN) 324 cases. Deleterious germline CNVs have also been reported in non-BRCA1/2-associated familial breast cancer (PALB2 325 ; BARD1 326 ), Hereditary Leiomyomatosis and Renal Cell Cancer (FH) 327, Cowden disease (PTEN) 328, Familial Atypical Multiple Mole Melanoma (CDKN2A) 329, Neurofibromatosis Type 1 (NF1) 330, Ataxia Telangiectasia (ATM) 331, Li Fraumeni syndrome (TP53) 332, familial retinoblastoma (Rb) 333, and Multiple Endocrine Neoplasia Type 1 (MEN1) 334. Interestingly, there are examples of copy number alterations at a distance from the coding region of a gene influencing its expression, whether by affecting regulatory elements or by inducing epigenetic changes that inactivate the gene. For example, in approximately 20% of suspected Lynch syndrome cases with MSH2 loss but no detectable germline mutations or rearrangements in MSH2 335 (about 1-3% of all Lynch Syndrome patients 336 ), the causative mutation is a large heritable deletion at the 3' end of the TACSTD1 gene, which causes transcriptional read-through and epigenetic silencing of the adjacent MSH2 gene. In one juvenile polyposis kindred with 10 affected members who had no mutations or rearrangements in the coding regions of SMAD4 and BMPR1A, Calva-Cerqueira et al. 337 identified a large deletion mapping 119kb upstream of the coding region of BMPR1A and segregating with disease. The deletion affected a promoter of BMPR1A and was demonstrated to diminish expression of the gene. 
Common copy number polymorphisms at some genes linked to cancer have also been associated with modest risk. For example, the glutathione-s-transferases (GSTs) constitute a family of genes involved in drug and toxin metabolism and are thus hypothesized to protect cells against xenobiotics and oxidative stress. Two of those genes, GSTT1 and GSTM1, have polymorphic deletions shown to correlate with lowered enzyme activity. In one recent study that accurately quantified the copy number of those genes in approximately 2,000 cancer patients and 8,000 controls, a gene dosage effect was demonstrated in GSTT1 for prostate cancer in men and corpus uteri cancer in women, and in GSTM1 for bladder cancer. 338 Another interesting association between a common copy number polymorphism and cancer was identified in familial breast cancer for a deletion that eliminates exon 4 of MTUS1, a gene implicated as a tumor suppressor. Interestingly, the common deletion was found to have a protective effect against breast cancer, suggesting that the exon 4 deletion may paradoxically increase the tumor suppressor activity of the gene (although this has yet to be demonstrated in functional studies). 339 All of the aforementioned germline rearrangements were identified in targeted studies, commonly utilizing PCR-based assays, which specifically searched for and/or quantified deletions or duplications at or near known cancer genes in high-risk populations. The discovery of predisposition germline

rearrangements in cancer subjects without a priori knowledge of the region/gene of interest requires a different approach. Most studies addressing this question have adopted one of two main strategies. Genome-wide CNV surveys in large cohorts of sporadic cancer patients and controls allow the identification of statistically significant associations between common CNVs and a low-to-modest cancer risk; alternatively, genome-wide CNV surveys in familial or hereditary cancer patients should facilitate the detection of rare heritable CNVs (neither previously published in controls nor present in a concurrently studied control cohort) that potentially alter cancer genes and produce a modest-to-high risk of cancer. Genome-wide case-control CNV association studies have identified candidate risk alleles for several sporadic cancers: neuroblastoma in a Caucasian population (deletion at 1q21.1, OR=2.49, p=2.97 x ) 340, aggressive prostate cancer in Caucasian populations (deletion at 2p24.3, OR=1.31, p=0.006; deletion at 20p13, OR=1.17, p=2.75 x 10^-4) 341,342, and nasopharyngeal carcinoma in Han Chinese males (deletion at 6p21.3, OR=18.92, ). 343 Most recently, Huang et al. 344 identified a common 10,379bp deletion at 6q13 that was higher in frequency in Han Chinese patients with sporadic pancreatic cancer than in controls, and that was confirmed via a qPCR assay to have an odds ratio of 1.31 for 1-copy carriers compared to 2-copy carriers. All those studies replicated their results in a confirmation cohort and used ethnicity-matched cases and controls, and all but Diskin et al. 340 used a PCR-based assay as the confirmation assay; Diskin et al. 340 applied multiple-testing correction to verify the statistical significance of their results. 
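The odds ratios quoted for these association studies compare how often the deletion is carried by cases versus controls. As a minimal sketch only, the following Python function computes an odds ratio and an approximate 95% Wald confidence interval from a 2x2 table. The function name and the counts in the example are hypothetical and are not taken from any of the cited studies, which may have used different estimators or adjusted models.

```python
import math

def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Return (OR, lower, upper) for a 2x2 table, with a 95% Wald interval.

    The interval is computed on the log-odds scale, so all four cells
    must be positive for this simple estimator.
    """
    a, b, c, d = case_carriers, case_noncarriers, control_carriers, control_noncarriers
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, lower, upper

# Hypothetical counts, for illustration only: 120 of 1,000 cases and
# 100 of 1,000 controls carry the deletion.
estimate, lower, upper = odds_ratio(120, 880, 100, 900)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

An interval that spans 1, as a modest odds ratio from a small cohort often does, is one reason such associations require replication in an independent cohort.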
Three of the identified CNVs overlapped genes. The neuroblastoma CNV overlapped a novel transcript that demonstrated high sequence homology to the neuroblastoma breakpoint family (NBPF) genes, was shown to correlate in expression with copy number, and was highly expressed in fetal brains. The prostate cancer CNV at 20p13 differentially affects isoforms of the SIRPB1 gene, which codes for a signal regulatory protein. The CNV at 6p21.3 encompassed MICA (major histocompatibility complex (MHC) class I chain-related gene A), which functions to mediate natural killer (NK) cell activation and T-lymphocyte costimulation and which has been associated with nasopharyngeal cancer in previous studies. The pancreatic cancer CNV at 6q13 and the prostate cancer CNV at 2p24.3 are non-genic and are hypothesized to impact risk through long-range regulatory effects on an unidentified gene. Indeed, functional analysis of the non-genic deletion associated with pancreatic cancer suggested that it may be involved in long-range regulation of CDKN2B, an established tumor-suppressor gene. While these results are interesting, they remain to be validated in future studies. Some analyses may be confounded by inaccurate genotyping of the CNV of interest: for example, the Database of Genomic Variants has reports of gains as well as deletions at several of these putative cancer-associated CNVs, suggesting that they may not be simple biallelic variants. Moreover, previous studies of CNVs in Asian populations 232,239 reported higher frequencies of the deletion at 6p21 in controls than was identified in the population studied in the nasopharyngeal carcinoma study. This is particularly significant because the odds ratio

identified for the 6p21 deletion (~19) was much higher than for any other common CNV or SNP association, and it may in fact be an overestimate if the deletion was undercalled in controls. A few studies have been published surveying germline CNVs in familial solid cancer patients, and although they have proposed several candidate predisposition genes based on overlap with patient-specific CNVs, none to date has been able to show a significant contribution or segregation with disease of any one gene to those cancer syndromes. One of the earliest studies analyzed 57 predominantly Caucasian pancreatic cancer patients from 56 high-risk kindreds (each containing at least a pair of affected first-degree relatives) using an oligonucleotide-based CGH platform, filtering out losses or gains that were also identified in 607 mostly Caucasian controls (372 were analyzed in the same study, and 235 were previously reported in two other studies). 345 Twenty-five losses overlapping 81 genes and 31 gains overlapping 425 genes were identified as specific to the cancer patients, and those genes were presented as potential candidate predisposition genes. Due to a lack of sufficient related samples, the authors were unable to demonstrate heritability or segregation with disease of the patient-specific CNVs. Moreover, the resolution of the CGH array used in this study was lower than that of current platforms (approximately 30kb), which resulted in relatively large CNV calls that likely overestimated the actual breakpoint boundaries of rearrangements. Furthermore, the control data available at the time of publication was limited, so some of the supposedly familial pancreatic cancer (FPC)-specific CNVs were identified in control populations in subsequent studies. 
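The control-filtering step described for this study can be sketched as an interval-overlap test. The Python snippet below is an illustration under stated assumptions, not the cited authors' pipeline: the `CNV` record, the half-open coordinate convention, and the rule that any overlap with a control CNV of the same type (loss or gain) removes a patient CNV are all hypothetical choices made for the example. Published studies often require a minimum reciprocal overlap instead.

```python
from collections import namedtuple

# Hypothetical record: coordinates are half-open [start, end); kind is "loss" or "gain".
CNV = namedtuple("CNV", ["chrom", "start", "end", "kind"])

def overlaps(a, b):
    """True if two CNVs of the same kind on the same chromosome share at least one base."""
    return (
        a.chrom == b.chrom
        and a.kind == b.kind
        and a.start < b.end
        and b.start < a.end
    )

def patient_specific(patient_cnvs, control_cnvs):
    """Keep only patient CNVs that overlap no control CNV of the same kind."""
    return [
        p for p in patient_cnvs
        if not any(overlaps(p, c) for c in control_cnvs)
    ]

# Illustrative coordinates only.
patients = [
    CNV("chr1", 100, 200, "loss"),   # overlaps a control loss, so it is removed
    CNV("chr1", 500, 600, "gain"),   # the overlapping control CNV is a loss, so it is kept
    CNV("chr2", 100, 200, "loss"),   # no control CNV on chr2, so it is kept
]
controls = [
    CNV("chr1", 150, 250, "loss"),
    CNV("chr1", 550, 650, "loss"),
]
for cnv in patient_specific(patients, controls):
    print(cnv)
```

Because the result depends entirely on the control set, a CNV that passes this filter is only "patient-specific" relative to the controls available at the time, which is the limitation noted above.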
The abstract of the paper refers to two deletions that were observed in two different patients and one deletion that was observed in three different individuals, yet no discussion of these regions is found in the main text of the manuscript. If such regions were truly found to be recurrent in patients and absent in controls, they would be of particular interest as candidate predisposition CNVs, but we cannot draw any conclusions given the paucity of information provided. Two other studies similarly provided a list of candidate genes in familial cancer. Yoshihara et al. 346 compared 68 Japanese subjects with germline BRCA1 mutations (including 51 subjects with ovarian cancer), 34 sporadic ovarian cancer patients, and 47 healthy controls, and they identified 31 CNVs specific to the BRCA1-mutation group. All 31 CNVs overlapped genes, and three CNVs segregated with ovarian cancer in affected members of the same family (of which two CNVs were present in two different families each). No significant difference was found in the per-genome total number of CNVs between BRCA1-mutation carriers and controls, although the number of deletions was higher in the BRCA1- mutation subjects. Otherwise, they found no evidence for differential clustering of the global CNV data between groups, and no correlation of age at diagnosis with CNV frequency. Since the BRCA1 gene was already identified as the primary genetic mutation in this study, the list of genes overlapped by CNVs represented potential modifying genes that may contribute to the unique biological characteristics of

BRCA1-mutated ovarian cancer. Venkatachalam et al. 347 studied 41 young-onset and/or familial colorectal cancer patients with microsatellite-stable tumors and identified four losses and three gains in six patients (one patient had a loss and a gain) which were neither present in a large control cohort nor reported in previous control studies. Each CNV overlapped at least one gene and each was detected in a single patient only. A study by Shlien et al. 348 presented an intriguing perspective on the connection between germline CNVs and somatic tumor development in TP53 germline mutation carriers. They studied 53 Li-Fraumeni family members (20 with wildtype TP53, 23 with TP53 mutations and history of cancer, and 8 with TP53 mutations and no cancer) and 70 unrelated healthy controls, and demonstrated a significantly elevated frequency of germline CNVs in the TP53 mutation carriers relative to controls with wild-type TP53. There was also a trend for a higher frequency of germline CNVs in cancer patients carrying TP53 mutations relative to mutation carriers without a history of cancer, but this did not reach statistical significance, possibly due to the small sample size. Furthermore, not only was the number of individual CNVs elevated in mutation carriers but the number of copy-number variable bases was also higher, even when the absolute number of CNVs was not, due to a tendency toward larger CNVs in the TP53 mutation cohort. Comparison between germline and choroid plexus tumor DNA in four patients identified 15/21 loci overlapping germline CNVs that became substantially larger in the paired tumors, and three of four tumors had loci at which a germline hemizygous deletion had progressed to homozygous deletion. 
These findings suggested a model of tumor development in Li-Fraumeni syndrome in which germline genomic instability (manifested as a higher than average CNV frequency) predisposes to additional genomic rearrangements and/or expansion of germline CNVs in somatic tissue, affecting genes that drive the development of cancer. The authors also report a list of cancer-related genes overlapped by germline CNVs in the TP53-mutation carriers which may act synergistically with the TP53 mutation in promoting cancer development. Of course, the role of TP53 in maintaining the genome is well known 349, and it is not surprising to find that even non-malignant cells exhibit increased genomic instability in Li-Fraumeni patients. However, it is unclear if this phenomenon applies to other tumor suppressor genes that predispose to familial cancer. Future surveys of CNV burden in other cancer syndromes would shed more light on this question.

3. Whole-Exome Sequencing

The human genome is comprised of approximately 3 billion base pairs, of which less than 2% code for proteins. The release of the first reference build of the human genome in 2003, after a 13-year collaborative international effort, opened the door to significant advancements in understanding the genetic and genomic makeup of individuals, populations, and cancers. The Human Genome Project

expanded understanding of the identity and population frequency of SNPs, the most frequently occurring variant in the human genome, and efforts to determine haplotype structure (blocks of SNPs present in different combinations and segregating in populations) have accelerated progress in the fields of population genetics, human evolution, and disease-gene associations. The original sequencing effort was based on the technique developed by Frederick Sanger in the 1970s, utilizing labeled dideoxynucleotide triphosphates (ddNTPs) as DNA chain terminators and separating terminated chains of various lengths by gel electrophoresis to determine base order in the sequence. High-throughput requirements of the DNA sequencing effort drove the development of automated capillary electrophoresis and other laboratory process automation. The International Human Genome Sequencing Consortium (IHGSC) employed a hierarchical shotgun sequencing approach that involved fragmenting and cloning DNA (initially using yeast artificial chromosomes, then subsequently bacterial artificial chromosomes), mapping clones on the physical map of the genome with the help of established genomic markers, shotgun sequencing clones, and finally aligning sequenced fragments to the developing map. 350 In the last few years of the IHGSC project, a competing effort undertaken by Craig Venter's company CELERA utilized a whole-genome shotgun sequencing approach, which Venter considered more efficient and faster. CELERA did, however, end up incorporating publicly available data generated by the IHGSC to allow accurate mapping of sequenced fragments, because highly repetitive regions (which constitute a large portion of the human genome) are difficult to map without additional genome map information. 350,351 The approximate cost of sequencing the first reference human genome was $3 billion. 
Importantly, neither the IHGSC nor the CELERA genome was the sequence of a single diploid genome; rather, each was a haploid consensus sequence of DNA derived from several anonymous individuals of different ancestries (although the IHGSC sequence was primarily based on a single male individual, and the CELERA reference sequence may have included Craig Venter's genome). Building on the data from the reference human genome, the International HapMap Project set out to identify common SNPs (defined as minor allele frequency (MAF) >1%, although most SNPs identified by this project have a MAF >5%) and their haplotype structure in members of different populations. 352 This important source of information allowed the development of genotyping arrays for genome-wide association studies. Only four years after the release of the nearly complete human reference genome, the first diploid human genome sequence was published. It belonged to Craig Venter, was generated using the CELERA whole-genome shotgun sequencing method, cost $ million, and was completed in about 4 years. (The cost estimate incorporates costs incurred during the development of the CELERA reference genome). 209 While this sequence presented an interesting perspective on the makeup of individual genomes, it is also clear that

many more genomes need to be sequenced before the full potential of genomic analysis and comparisons among individuals can be realized. Making whole-genome sequencing possible for many genomes required a dramatic reduction in cost and increase in the speed of the process. To that end, the development of massively-parallel next-generation technologies presented a breakthrough in genomics. Since publication of the first sequencing-by-synthesis technology in , a number of different platforms have been developed. While they employ different techniques of sequencing (Illumina and Roche/454 use DNA polymerase-based sequencing-by-synthesis approaches while ABI SOLiD uses DNA ligase-based sequencing by ligation), all are based on clonal cluster amplification of target molecules to generate a sufficiently strong signal. 354 The first human genome to be fully sequenced by a massively-parallel platform belonged to James Watson, co-discoverer of the DNA double helix. 218 In a demonstration of the significantly increased power of next-generation sequencers, the Watson genome was sequenced in 4.5 months and this effort cost less than $1.5 million. 355 Since then, many other individuals of different ancestries have been sequenced. 209,218,222,223,227,228,230,239,243,356,357,358,359 The 1000 Genomes Project is an endeavour to sequence the genomes of 2,500 unidentified individuals from 29 populations to discover, genotype, and accurately identify haplotypes, with the overarching goal of characterizing 95% of variants with allele frequency of 1% or greater in genomic regions that can be sequenced by the most recently available next-generation platforms. 
246 To date, three pilot projects have been completed: (1) low-coverage sequencing (2-4x) of the whole genomes of 180 individuals, providing data on SNPs of 1% or higher frequency; (2) deep sequencing (20-60x) of the whole genomes of two mother-father-adult child trios, allowing quality control of data from pilot project (1) and inference of haplotypes; (3) targeted capture and deep sequencing (50x) of ~8,000 exons from approximately 900 randomly selected genes, to test the effectiveness of targeted capture sequencing in identifying common, low-frequency, and rare variants in protein-coding regions of the genome. The main project involves low-depth sequencing (4x) of the whole genome of 2,500 individuals as well as deeper sequencing of their exomes by the target-enrichment method (see below for more detail on exome sequencing). Whereas the Sanger-based automated sequencers generated approximately 100 kbp of data per day on a single machine, the earliest next-generation platform increased the output by two orders of magnitude. This was very quickly surpassed by further developments of other platforms with larger output, and a single sequencer in 2011 produces around 40 Gbp per day. 360,361 An important distinction between Sanger-based and next-generation sequencers is the read length: bp for capillary Sanger sequencers compared to bp in next-generation sequencers, depending on the platform. The cost of whole-genome sequencing has dropped significantly, currently as low as $5000-$ Interestingly, while the cost of generating a genome sequence has dropped dramatically, the capacity to analyze the data

has advanced less rapidly. Some challenges have included the inadequate adaptation of software originally designed for alignment and variant calling of Sanger sequencing and the need for newer validated software packages that can handle the significantly larger quantity of data that is generated with newer platforms. 362 The relatively short reads have also posed a problem for de novo genome assembly and correct alignment to repetitive or highly homologous regions. In recent years, third-generation sequencing methodologies have been introduced, characterized by the ability to directly sequence single molecules without needing to amplify the template. 363 Those newest methods of sequencing may address some of the limitations of next-generation sequencers (e.g. they appear to generate longer reads approximating the length obtainable by the Sanger capillary sequencers) but they have their own challenges, such as a higher raw read error rate from the single-molecule sequencing approach. As such, ongoing improvements in both sequencing technologies and bioinformatic tools will be necessary to achieve the most cost-effective means of sequencing large numbers of genomes for disease gene discovery and clinical diagnostic purposes. (I am not addressing other applications of next-generation sequencing such as transcriptomics, epigenomics, and chromatin immunoprecipitation sequencing (ChIP-seq) as they are outside the scope of this thesis). The cost of whole-genome sequencing has not yet reached the promised $1,000-genome level that has been identified as a goal for the genomic community, particularly if post-sequencing analysis cost is taken into consideration; moreover, much of the information identified in a whole genome remains difficult to evaluate in terms of functional impact on disease or phenotype since only 1-2% of the entire genome has been annotated as protein-coding. 
Indeed, to date, several reports of whole-genome sequencing in disease cases have been published but invariably they focus on coding region variants to identify candidate causative genes. These two current limitations of whole-genome sequencing (cost and functional annotation of the genome) have made exome sequencing an attractive alternative for researchers. Exome sequencing is based on capturing and subsequently amplifying and sequencing the coding region of the genome using massively-parallel sequencing. Since the target region in exome sequencing is less than 2% of that in whole-genome sequencing, it is possible to obtain much greater read-depth per base per run. This means that more samples can be sequenced in the same amount of time and for the same price as a single whole genome. A number of methods of target enrichment have been introduced, including both solid-phase (e.g. NimbleGen Sequence Capture Human Exome 2.1M array) and in-solution oligonucleotide arrays (e.g. Agilent SureSelect System). 372,373 The latest arrays can capture up to 44-50Mb of genomic sequence, encompassing most of the annotation of the Collaborative Consensus Coding Sequence (CCDS 2009) 374 database and flanking base pairs of target regions as well as microRNAs and other non-coding RNAs. It should be noted that, although the coverage of exome sequencing for coding

regions and adjacent regulatory sequences is excellent, it is not perfect, and the success of capture varies to some extent between arrays as well as with sequence-specific characteristics such as high GC-content. 375 The first description of a human exome was based on the coding variants identified in the previously published diploid genome of Craig Venter (HuRef). 376 The authors reported that most nonsynonymous SNPs are common (15-20% are rare and ~95% of the rare variants were heterozygous). They also identified 105 premature-terminating codons, many of which are common and do not appear to be under negative selection. They noted that many of these variants were present in duplicated genes and hypothetical genes, suggesting that their impact in this setting may be less deleterious. They also noted that half of all coding indels occurred in tandem repeats, and tended to occur at the C and N termini of genes and/or near exon boundaries (which in some cases were considered likely mapping errors in the reference genome). There was a bias toward indels composed of multiples of 3 bases (3n) in coding regions that are likely to be functionally significant, suggesting purifying selection acting on frameshift indels in those regions. Of additional importance, the authors noted that the Venter genome contained at least 680 nonsynonymous SNPs affecting 443 genes with some association with disease, including 7 that were in dbSNP and the OMIM database, which foreshadowed the challenge that would be encountered in interpreting the clinical significance of coding variants as more genomes and exomes are sequenced. The first report of target-captured exome sequencing using next-generation sequencing was published in 2009 by Ng et al. 377, describing the exomes of 8 HapMap individuals whose genomes were previously characterized by sequencing fosmid clones to identify structural variants. 
In addition, in a proof-of-concept experiment, the exomes of four unrelated individuals with a rare autosomal dominant disorder (Freeman-Sheldon Syndrome) caused by MYH3 mutations were sequenced to demonstrate a filtering strategy that would identify the causative gene. The average depth of coverage was 51x, translating into 95% of coding bases in 78% of genes being successfully called (based on a threshold of 8x depth per base required to reliably call a heterozygous variant). The estimated average number of truncating single base variants per genome was higher in African than non-African genomes (20/African vs. 10/non-African), and a similar ratio was observed for rare frameshift indels (17/African vs. 8/non-African). As was observed in the Venter exome, most indels in coding regions were non-frameshift. To identify the causative gene in the four Freeman-Sheldon Syndrome patients, the authors filtered variants to focus on non-synonymous and/or splice-site variants or indels that were not previously reported in dbSNP or found in the 8 HapMap exomes, and which were in the same gene in all four affected patients. This approach reduced the number of candidate genes to precisely one, namely MYH3. A subsequent study applied the same filtering strategy to successfully identify the unknown genetic cause of a rare autosomal recessive Mendelian disorder (Miller Syndrome), the first of approximately 90 such studies to be published in quick succession over a period of 24 months (Table 3). Currently ongoing large-scale projects employing

exome sequencing include the 1000 Genomes Project (which aims to sequence the exomes of 2,500 anonymous individuals) as well as the Exome Sequencing Project, which aims to discover variants relevant to heart, lung, and blood diseases and has to date sequenced the exomes of nearly 5,400 individuals from multiple study cohorts (the project plans to sequence approximately 7,000 exomes).

Table 3 Studies using exome-sequencing to identify genetic cause of disease

Authors | Year | Journal | Disease | Autosomal dominant or recessive (AD or AR) | Description
Vissers et al. | | Nat Genet | Mental Retardation | Sporadic | Studied 10 trios; identified de novo mutations as potential cause for unexplained mental retardation
Walsh et al. | | Am J Hum Genet | Nonsyndromic Hearing Loss | AR | Combined homozygosity mapping in consanguineous family with exome sequencing to identify DFNB82 as cause
Lalonde et al. | | Hum Mut | Fowler Syndrome | AR | Identified compound hets in FLVCR2 in two fetuses from consanguineous families
Pierce et al. | | Am J Hum Genet | Perrault Syndrome | AR | Identified compound hets in HSD17B4 in two sisters
Ng et al. | | Nat Genet | Kabuki Syndrome | AD | Studied 10 unrelated affected subjects; identified MLL2 as cause
Bilguvar et al. | | Nature | Malformation of Cortical Development | AR | Combined homozygosity mapping and exome sequencing in family with two affected members; identified WDR62 as cause
Gilissen et al. | | Am J Hum Genet | Sensenbrenner Syndrome | AR | Identified compound hets in WDR35 in two unrelated affected subjects
Krawitz et al. | | Nat Genet | Hyperphosphatasia Mental Retardation Syndrome | AR | Performed identity-by-descent filtering on exome data to identify PIGV as cause in 3 affected siblings of nonconsanguineous family
Anastasio et al. 386 | 2010 | Am J Hum Genet | Van Den Ende-Gupta Syndrome | AR | Combined homozygosity mapping with exome sequencing to identify SCARF2 as cause in 4 affecteds from 3 consanguineous families
Johnson et al. | | Am J Hum Genet | Brown-Vialetto-van Laere Syndrome | AR | Identified C20orf54 as cause in three affected siblings
Sirmaci et al. | | Am J Hum Genet | Michels Syndrome | AR | Combined homozygosity mapping with exome sequencing to identify MASP1 as cause in 3 individuals from 2 consanguineous families
Haack et al. | | Nat Genet | Isolated complex I deficiency | AR | Identified compound hets in ACAD9 in single affected individual
Wang et al. | | Brain | Spinocerebellar ataxia | AD | Combined linkage analysis with exome sequencing in a Chinese family with 4 affecteds; identified TGM6 as cause
Musunuru et al. 391 | | | hypolipidemia | AR | Identified compound hets in ANGPTL3 in 2 affected sibs
Johnson et al. | | Neuron | ALS | AD | Combined linkage analysis with exome sequencing in 2 affected relatives, identified VCP as cause
Bolze et al. | | Am J Hum Genet | Autoimmune lymphoproliferative syndrome (ALPS) | AR | Found homozygous variants in FADD
Liu et al. | | PLoS One | Moyamoya disease | AD | Combined linkage analysis with exome sequencing to identify RNF213
Zuchner et al. | | Am J Hum Genet | Retinitis pigmentosa | AR | Identified homozygous variants in DHDDS
Glazov et al. | | PLoS Genet | Anauxetic dysplasia-like condition | |
Worthey et al. | | Genet Med | Inflammatory bowel disease | |
Simpson et al. | | Nat Genet | Hajdu-Cheney Syndrome | |
Becker et al. | | Am J Hum Genet | Osteogenesis imperfecta | |
Ostergaard et al. 400 | 2011 | J Med Genet | Primary lymphoedema | |
Caliskan et al. | | Hum Mol Genet | Non-syndromic mental retardation | |
Erlich et al. | | Genome Res | Hereditary spastic paraparesis | |
Sundaram et al. | 2011 | Ann Neurol | Tourette | |
403 syndrome/chronic tic phenotype Puente et al Am J Hum Genet Vissers et al Am J Hum Genet Hereditary Progeroid Syndrome Chondrodysplasia and abnormal joint development syndrome AR Identified compound hets in POP1 AR Identified hemizygous variant on X chromosomes (XIAP) AD Exome sequencing of 3 unrelated affecteds identified NOTCH2 AR Identified homozygous variants in SERPINF1 in 2 affected sibs AD Combined linkage analysis with exome sequencing to identify GJC2 AR Combined homozygosity mapping with exome sequencing to identify TECR AR Combined homozygosity mapping with exome sequencing to identify KIF1A AD Identified OFCC1 as cause AR Identified homozygous mutations in BANF1 AR Identified homozygous variants in IMPAD1 in three affected unrelated individuals

57 44 O Sullivan et 2011 Am J Hum al. 406 Genet Gotz et al Am J Hum Genet Amelogenesis imperfecta and gingival hyperplasia syndrome Infantile hypertrophic mitochondrial cardiomyopathy AR Combined homozygosity mapping with exome sequencing to identify FAM20A AR Identified compound heterozygous mutations in mtalars Shi et al PLoS Genet Myopia AD Identified mutations in ZNF644 in 2 relatives Klein et al Nat Genet Hereditary sensory AD Combined linkage with neuropathy with exome data to identify dementia and mutations in DNMT1 hearing loss Barak et al Nat Genet Malformations of AR Identified homozygous occipital cortical mutation in single affected development child of consang parents O Roak et al Nat Genet Autism Sporadic Identified 11 de novo proteinaltering mutations, some genes previously connected to autism Alvarado et al Bone Joint Surg Am De Greef et al Am J Hum Genet Yamaguchi et 2011 J Bone Miner al. 414 Res Distal arthrogryposis type 1 Immunodeficiency, centromeric instability, and facial anomalies Primary failure of tooth eruption Zhou et al Hum Mutat Hereditary hypotrichosis simplex Le Goff et al Am J Hum Geleophysic and Genet acromicric dysplasia Hanson et al Am J Hum Genet Vilarino-Guell et al Am J Hum Genet Zimprich et al Am J Hum Genet Sergouniotis et 2011 Am J Hum al. 420 Genet Albers et al Nat Genet Gray Platelet Syndrome Sanna-Cherchi et 2011 Kidney Int Steroid-resistant al. 
422 nephrotic syndrome AD Identified MYH3 as cause AR Combined homozygosity mapping with exome sequencing to identify ZBTB24 AD Combined linkage with exome sequencing to identify PTH1R as cause AD Combined linkage with exome sequencing to identify RPL21 as cause AD Identified FBN1 as candidate gene in 5 patients 3-M syndrome AR Combined homozygosity mapping with exome sequencing to identify mutation in CCDC8 Late-onset Parkinson AD Identified mutation in VPS35 Late-onset Parkinson AD Identified VPS35 as cause (different patients from Vilarino-Guell) Leber congenital AR Combined homozygosity amaurosis mapping with exome sequencing to identify KCNJ13 as cause AR Identified NBEAL2 as cause AR Combined homozygosity mapping with exome sequencing in 3 affected sibs

58 45 Liu et al J Exp Med Chronic mucocutaneous candidiasis disease Yariz et al Fertil Seril Empty Follicle Syndrome of consang parents to identify homozygous mutations in MYO1E and NEIL1 AD Identified mutations in STAT1 as cause AR Identified homozygous mutation in LHGCR in 2 sisters Xu et al Nat Genet Schizophrenia Sporadic Identified 40 rare de novo protein altering mutations in 40 genes (in 27 cases), including DGCR2, a gene in schizophrenia-predisposing region 22q11.2 Sirmaci et al Am J Hum Genet KBG syndrome AD Identified ANKRD11 as cause Shaheen et al Am J H um Genet Noskova et al Am J Hum Genet Weedon et al Am J Hum Genet Ozgul et al Am J Hum Genet Doi et al Am J Hum Genet Adams-Oliver syndrome Adult-onset neuronal ceroid lipofuscinosis Charcot-Marie- Tooth Sloan et al Nat Genet Malonic and methylmalonic aciduria Aldahmesh et AR Combined homozygosiy mapping with exome sequencing to identify homozygous mutations in DOCK6 AD Identified 5 unrelated individuals with mutations in DNAJC5 AD Found DYNC1H1 as cause in 3 relatives Retinitis pigmentosa AR Identified homozygous mutation in MAK as cause Cerebellar ataxia AR Identified mutation in SYT14 as cause AR Identified mutation in ACSF3 as cause al. 433 cause 2011 J Med Genet Knobloch Syndrome AR Identified ADAMTS18 as Murdock et al Am J Med Genet A Recurrent polymicrogyria 2011 Blood Dendritic cell, Regalado et al Circ Res Thoracic aortic aneurysms leading to acute aortic dissection Dickinson et al. 436 monocyte, B and NK lymphoid deficiency Hor et al Am J Hum Genet Familial narcolepsy with cataplexy Marti-Masso et 2011 Hum Genet Early-onset al. 438 generalized dystonia AR Identified compound het mutations in WDR62 as cause in 2 sibs AD Identified SMAD3 as cause AD Identified GATA2 as cause in 4 unrelated affecteds AR Combined linkage with exome sequencing to identify MOG as cause AR Identified GCDH as cause in 2 affected siblings Tariq et al Genome Biol heterotaxy AR Combined homozygosity mapping with exome

59 46 Takata et al Genome Biol Progressive external ophthalmoplegia Theis et al Circ Cardiovasc Genet Dilated cardiomyopathy Pierson et al PLoS Genet Spastic ataxianeuropathy syndrome Al Badr et al J Pediatr Urol Ochoa (urofacial) syndrome Cullinane et 2011 J Invest al. 444 Dermatol Ovunc et al J Am Soc Nephrol Bowne et al Eur J Hum Genet Oculocutaneous albinism and neutropenia Intermittent nephrotic-range proteinuria Retinitis pigmentosa with choroidal involvement Kitamura et al J Clin Invest Autoinflammation and lipodystrophy Tyynismaa et 2011 Hum Mol al. 448 Genet Bjursell et al Am J Hum Genet Zangen et al Am J Hum Genet sequencing to identify SHROOM3 as candidate cause AR Combined homozygous mapping with exome sequencing to identify RRM2B as cause in patient from consang family AR Combined homozygosity mapping with exome sequencing to identify GATAD1 mutations in 2 affected sisters AR Identified AFG3L2 as cause in 2 brothers of consang family AR Combined homozygosity mapping with exome sequencing to identify HPSE2 as cause in child of consang parents AR Combined homozygosity mapping with exome sequencing to identify two candidate genes (SLC45A2 and G6PC30 AR Identified CUBN as cause in 2 sibs of consang parents AD AR Combined linkage analysis with exome sequencing to identify RPE65 as cause Identified PSMB8 as cause in patients from 2 consang families Identified TK2 as cause Progressive external AR ophthalmoplegia with multiple mitochondrial DNA deletions hypermethioninemia AR Identified ADK as cause XX female gonadal dysgenesis Galmiche et 2011 Hum Mutat Mitochondrial al. 451 cardiomyopathy AR Combined homozygosity mapping with exome sequencing to identify PSMC3IP/HOP2 as cause AR Identified compound hets in MRPL3 as cause in 4 affected sibs AR Identified compound hets in WDR19 as cause Bredrup et al Am J Hum Ciliopathies with Genet skeletal anomalies with renal insufficiency Saitsu et al Am J Hum Hypomyelinating AR Identified POLR3A and

Clayton-Smith et 2011 Am J Hum al. 454 Genet Aldahmesh et 2011 Am J Hum al. 455 Genet Genet leukoencephalopathy POLR3B as cause Say-Barber- sporadic Biesecker variant of Ohde syndrome Ichthyosis, intellectual disability, and spastic quadriplegia Chen et al Nat Genet Paroxysmal kinesigenic dyskinesia Logan et al Nat Genet Early onset myopathy, areflexia, respiratory distress and dysphagia (EMARDD) Dauber et al J Clin Severe infantile Endocrinolo hypercalcemia Metab Shamseldin et al. 459 malformation Sergouniotis et 2011 Am J Hum al. 460 Genet Berger et al Mol Genet Metabol Bhat et al Clin Genet Primary microcephaly Wang et al Hum Mutat Leber congenital amaurosis Identified KAT6B as cause in 4 individuals AR Combined homozygosity mapping with exome sequencing to identify ELOVL4 as cause in 2 individuals AD Identified PRRT2 as cause in 8 families AR AR Identified MEGF10 as cause Identified CYP24A1 as cause AR Combined homozygosity mapping with exome sequencing in consang family to identify DLX5 as cause Benign Fleck Retina AR Combined homozygosity mapping with exome sequencing to identify PLA2G5 as cause Early prenatal AD Combined linkage with ventriculomegaly exome sequencing to identify AIFM1 as cause AR AR Identified WDR62 as cause Identified ALMS1, IQCB1, CNGA3, MYO7A as candidates
To date, most successful exome-based studies have been in monogenic Mendelian disorders. The first filtering step in most studies was to exclude variants reported in dbSNP and in any other exome data available to the investigators. Depending on the version of dbSNP used and the number of available exomes, this step usually eliminates at least half of the called variants. Furthermore, only variants that potentially alter or truncate the protein are included in the analysis (i.e. nonsynonymous single nucleotide variants, splice-site variants, nonsense variants, and indels). 
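The two common filtering steps described here (dropping previously reported variants, then keeping only potentially protein-altering ones), followed by the intersection across affected individuals used in the Freeman-Sheldon example, can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline of any cited study: the `Variant` record, the effect labels, the variant keys, and the gene names are all hypothetical.

```python
from dataclasses import dataclass

# Effect classes the text lists as potentially protein-altering (labels are hypothetical).
PROTEIN_ALTERING = {"nonsynonymous", "splice_site", "nonsense", "indel"}


@dataclass(frozen=True)
class Variant:
    gene: str
    key: str      # hypothetical identifier, e.g. "chrom:pos:ref:alt"
    effect: str


def filter_variants(variants, known_keys):
    """Keep protein-altering variants that are absent from the known-variant set."""
    return [v for v in variants
            if v.effect in PROTEIN_ALTERING and v.key not in known_keys]


def shared_candidate_genes(exomes, known_keys):
    """Return genes carrying at least one surviving variant in every exome."""
    gene_sets = [{v.gene for v in filter_variants(vs, known_keys)} for vs in exomes]
    return set.intersection(*gene_sets) if gene_sets else set()


# Toy data: "known" stands in for dbSNP plus any in-house exomes.
known = {"1:100:A:G"}
exome_a = [
    Variant("GENE_A", "1:100:A:G", "nonsynonymous"),  # removed: previously reported
    Variant("GENE_B", "2:200:C:T", "nonsense"),       # kept
    Variant("GENE_C", "3:300:G:A", "synonymous"),     # removed: not protein-altering
]
exome_b = [
    Variant("GENE_B", "2:250:T:TA", "indel"),         # kept (different variant, same gene)
    Variant("GENE_D", "4:400:C:G", "nonsynonymous"),  # kept, but not shared
]
candidates = shared_candidate_genes([exome_a, exome_b], known)
print(candidates)
```

Matching at the gene level rather than the variant level reflects the point made above that unrelated patients need not share the same mutation, only the same affected gene.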
At this point, studies diverge in their strategies, depending on the nature of the condition being studied and the available samples for sequencing. A notable characteristic of most exome studies published to date is that the diseases being investigated are recessive (Table 3). This allows the application of homozygosity mapping or identity-by-descent analysis to family data, or even simply filtering out all genes except those that have homozygous variants or compound heterozygous variants in the exome samples. If multiple affected relatives and/or more than

one family are available for a rare, fairly homogeneous condition, this strategy is very successful at narrowing down the list of candidate genes to just one or at most a few genes. Even if only one sample is available, it is possible to identify the causative gene for an autosomal recessive condition using this method. For autosomal dominant conditions, where the causative variant is heterozygous, family linkage data can significantly reduce the number of candidate genes. Alternatively, for diseases caused by mutations in a single gene in most affected cases, identifying genes with novel variants in more than one subject also helps pinpoint the causal gene. Additional filtering by predicted effect of variants (using tools such as PolyPhen and SIFT 465) and/or conservation scores (using PhyloP and GERP) may help in ranking multiple candidate genes. However, these tools have their limitations and are often inconsistent in ascribing functional importance to the same variant. Some investigators have presented statistical approaches to ranking variants and genes identified in such exome studies, but their applicability and success rates are not yet known. Regardless, almost all studies provide further evidence in support of the identified gene by sequencing it in other patients with the disease and/or presenting functional analysis of the gene in the disease process.
The somatic genomes of many cancers have been sequenced, shedding light on important genes and pathways involved in driving tumorigenesis and/or metastasis. The earliest of these studies involved a laborious approach of sequencing coding regions exon-by-exon using the conventional Sanger method. 
37, The first cancer genome to be sequenced using next-generation platforms was that of a cytogenetically normal acute myeloid leukemia (AML) 473; subsequently, additional genomes were sequenced for AML; breast cancer; lung cancer; uveal melanoma 480; colorectal cancer 481; multiple myeloma 482; hepatocellular carcinoma 483; hairy cell leukemia 484; diffuse large B-cell lymphoma 485; pancreatic neuroendocrine tumor 486; and gastric cancer 487. An international collaboration under the auspices of the International Cancer Genome Consortium (ICGC) 488 is currently undertaking a large-scale integrative analysis of 50 different cancer types and/or subtypes at the genomic, epigenomic, and transcriptomic levels. Beyond investigation of the somatic genome of cancer, germline sequencing can help identify genes that predispose to Mendelian cancer syndromes and/or familial cancer clustering. The first such study used paired germline-tumor exome data to identify PALB2 as a new FPC gene in a patient who did not carry mutations in known predisposition genes. 117 The paired tumor variants allowed Jones et al. 117 to narrow the search down to genes that had a germline truncating mutation as well as a somatic second-hit deleterious mutation, thus excluding all but three genes, two of which were previously reported to have truncating mutations in healthy controls. Resequencing the full PALB2 coding region in a cohort of 96 FPC subjects identified an additional three families with protein-truncating mutations in the gene, whereas truncating mutations in PALB2 are rare in control populations, further supporting PALB2 as an

FPC predisposition gene. In addition, the function of PALB2, a partner of BRCA2, which is already implicated in pancreatic tumorigenesis, provided further weight to this discovery. Despite the success of this initial report, few familial and/or syndromic cancer exome studies have been published to date. Two studies, investigating the cause of childhood classic Kaposi Sarcoma 489 and mosaic variegated aneuploidy 490, were able to take advantage of apparently recessive inheritance to filter the exome data and identify the causative genes. In the case of Kaposi Sarcoma, variants were filtered for homozygosity, protein-altering effect, and absence from dbSNP129, 1000 Genomes, and 49 in-house exomes, leaving only 1 splice-site variant and 11 missense variants. The splice-site variant affects a gene (STIM1) that is also mutated in a recessive immunodeficiency syndrome, and given the previous link of Kaposi Sarcoma to immunodeficiency, this was considered a strong candidate. The investigators of mosaic variegated aneuploidy sequenced two siblings of non-consanguineous parents and attempted to identify a gene with two loss-of-function mutations shared by both siblings (as compound heterozygotes). Interestingly, they did not initially identify a single causal gene, but rather 12 genes with a single loss-of-function mutation common to both siblings. Focusing on a gene with a putative functional connection to the disease (CEP57; centrosomal localization), Snape et al. sequenced its full coding region in both siblings and identified a second mutation, an 11-bp deletion that was not called in the exome data. This highlights current limitations in the sensitivity and specificity of exome analysis. Two additional unrelated patients were also found to carry compound heterozygous mutations in CEP57. 
Two studies of autosomal dominant hereditary cancer were able to harness the power of sequencing multiple unrelated individuals or of linkage analysis to narrow down the list of susceptibility gene candidates. In a study of hereditary pheochromocytoma 491, three unrelated patients were sequenced and the variants filtered to include only heterozygous protein-altering mutations shared by all three subjects and absent from dbSNP and 1000 Genomes data. This reduced the list of candidates to just two genes, of which only one (MAX) segregated with disease in the respective families. By demonstrating LOH at the MAX locus and absence of MAX expression in tumors from the affected families, Comino-Mendez et al. 491 presented strong evidence for the role of MAX as a tumor suppressor gene in pheochromocytoma. Moreover, they identified five additional unrelated patients with mutations in this gene (2 truncating and 3 missense). To identify susceptibility genes for familial nodular Hodgkin's lymphoma, Saarinen et al. 492 used information from linkage analysis of a large family in conjunction with exome sequencing of one family member to narrow the list of candidates with a deleterious mutation segregating in the affected family members and not present in controls to a single gene, NPAT, which carried a 2-bp deletion. Further sequencing of this gene in other unrelated patients identified no other rare deleterious mutations in NPAT, but they did find a common amino-acid deletion that seemed to be significantly more frequent in Hodgkin's patients than in controls (4.2% vs. 1.1%, OR 4.11, p=0.018). Gene expression array demonstrated decreased NPAT

mRNA in carriers of the 2-bp deletion. These findings, in addition to the fact that NPAT shares a putative promoter with a known tumor suppressor gene (ATM) and is thought to have a role in cell cycle regulation, suggest that NPAT germline mutations predispose to nodular Hodgkin's lymphoma. One of the promises of whole-genome and exome sequencing is the power to bridge the gap in explaining disease heritability occupied by low-frequency, moderately penetrant variants, which until recently could be identified neither by family-based studies (because they usually do not segregate with disease) nor by genome-wide association studies based on common SNPs. 493 Such variants have been identified in the past through candidate gene sequencing in cases, and require relatively large case-control studies to demonstrate significant enrichment in the disease population (e.g. BRIP1 in prostate cancer 494; CHEK2 in breast cancer 495). With the increasing number of exomes or whole genomes being sequenced, it is possible to capture those functional variants on a genome-wide level. For example, a recent report describes whole-genome sequencing of approximately 450 Icelandic individuals followed by imputation of the genotypes of detected variants in a large cohort of Icelandic ovarian cancer cases and controls, which identified the most significant association to be with an intronic SNP in BRIP1. Subsequent fine-mapping of the associated region revealed a 2-bp deletion in exon 14 of BRIP1 that was in partial linkage disequilibrium with the intronic SNP, and which had an odds ratio > 8 for ovarian cancer. Alternatively, exome or whole-genome data may reveal the functional variant directly in family-based studies, although the challenge lies in determining which non-segregating rare/low-frequency variant is causally important. In a recent study by Yokoyama et al. 
496, whole-genome sequencing of a single member of a large familial melanoma kindred identified over 400 germline variants, one of which was a missense variant in a gene called MITF. Genotyping of this variant in the remaining family members demonstrated non-segregation (only three of eight affected members carried the variant). However, because of the previously reported role of MITF in the development of melanoma, the investigators genotyped this variant in two large case-control cohorts and identified a significantly elevated frequency of the MITF variant in cases, with an odds ratio of approximately 2, supporting the hypothesis that this low-frequency variant is enriched in familial cases and confers a moderate risk of melanoma. In a similar study by Park et al. 497, members of four early-onset, multiple-case breast cancer pedigrees underwent exome sequencing, and a functionally interesting gene (FAN1) was identified with two missense variants predicted to be deleterious in two families (the variant segregated with disease in one family but not in the other). However, Park et al. 497 reported no statistically significant association of the variants with breast cancer in two case-control analyses.

Chapter 2 - Loss of Heterozygosity at BRCA1 Locus in Pancreatic Adenocarcinoma
The contents of this chapter have been published in Human Genetics 2008 Oct;124(3): PMID: [ The final publication is available at (I am first author).
1. Abstract
Although the association of germline BRCA2 mutations with pancreatic adenocarcinoma is well established, the role of BRCA1 mutations is less clear. We hypothesized that loss of heterozygosity at the BRCA1 locus occurs in pancreatic cancers of germline BRCA1 mutation carriers, acting as a second-hit that contributes to tumorigenesis. Seven germline BRCA1 mutation carriers with pancreatic adenocarcinoma and 9 patients with sporadic pancreatic cancer were identified from clinic- and population-based registries. DNA was extracted from paraffin-embedded tumor and non-tumor samples. Three polymorphic microsatellite markers for the BRCA1 gene, and an internal control marker on chromosome 16p, were selected to test for loss of heterozygosity. Tumor DNA demonstrating loss of heterozygosity in BRCA1 mutation carriers was sequenced, to identify the retained allele. The loss of heterozygosity rate for the control marker was 20%, an expected baseline frequency. Loss of heterozygosity at the BRCA1 locus was 5/7 (71%) in BRCA1 mutation carriers; tumor DNA was available for sequencing in 4/5 cases, and three demonstrated loss of the wild-type allele. Only 1/9 (11%) sporadic cases demonstrated loss of heterozygosity at the BRCA1 locus. Loss of heterozygosity occurs frequently in pancreatic cancers of germline BRCA1 mutation carriers, with loss of the wild-type allele, and infrequently in sporadic cancer cases. Therefore, BRCA1 germline mutations likely predispose to the development of pancreatic cancer, and individuals with these mutations may be considered for pancreas cancer screening programs.
2. 
Introduction As discussed in the Literature Review section of the thesis, identifying genes implicated in predisposition to FPC is important for developing early-detection and prevention strategies as well as more effective therapeutic options. Several hereditary syndromes due to mutations in tumor suppressor/caretaker genes cause an elevated risk of pancreatic cancer. These syndromes contribute to a small proportion of familial cases, and it is expected that other genes play an important role 136. Both BRCA1 and BRCA2 were initially identified as highly penetrant genes in familial breast and ovarian cancer, but germline mutations of these genes are also associated with several other malignancies 498. Studies of cancer risks in BRCA2

germline carriers have reported a relative risk of for pancreatic cancer , and it is estimated that BRCA2 mutations contribute to 6-19% of FPC cases 103,121,501,502. Molecular genetic studies have confirmed the role of BRCA2 inactivation in the development of pancreatic cancer 115, As with BRCA2, clinic-based studies have suggested an increased risk of pancreatic cancer in germline BRCA1 mutation carriers 508,509. There is also evidence for downregulation of BRCA1 expression in sporadic pancreatic tumors 510. However, the evidence on each of these points is much weaker for BRCA1 than for BRCA2. Inactivation of the wild-type BRCA1 allele in breast and ovarian cancer most commonly occurs by loss of heterozygosity (LOH) 511. We hypothesized that LOH at the BRCA1 locus occurs in pancreatic cancers of germline BRCA1 mutation carriers, acting as a second-hit event contributing to pancreatic tumorigenesis. In this study, we compared the rate of LOH at BRCA1 in pancreatic tumors from mutation carriers and from patients with sporadic pancreatic cancer.
3. Materials & Methods
Ethical approval for this study was obtained from the Mount Sinai Hospital Research Ethics Board. Microdissection and DNA extraction from formalin-fixed paraffin-embedded (FFPE) tissue, primer design and optimization for sequencing, PCR amplification, and interpretation of genotyping and sequencing results were performed by W. Al-Sukhni. Microsatellite genotyping and Sanger sequencing were performed by the Analytical Genetics Technology Centre (AGTC) at Princess Margaret Hospital, Toronto.
3.1 Tissue Specimens
Germline BRCA1 mutation carriers were identified by: (1) clinic-based recruitment of incident cases of pancreatic cancer at the University of Toronto, as described in a previous report by our group 121; and (2) population-based recruitment of pancreatic cancer cases through the Ontario Pancreas Cancer Study (OPCS) 45. 
BRCA1 testing was performed at provincial labs in most cases due to a strong history of breast/ovarian cancer; in one case, a BRCA1 mutation was identified by our research group as part of a series of 102 unselected hereditary pancreatic cancer patients screened for several germline mutations. This latter mutation was subsequently confirmed by testing in an offsite provincial lab 121. All seven mutation carriers included in this study had pathologically-confirmed adenocarcinoma of the pancreas. Pancreatic tumor resection or biopsy specimens were obtained for all patients. Non-tumor tissue and/or blood samples were also obtained for each patient. Microdissected, formalin-fixed paraffin-embedded samples were prepared from each tumor ( 70% cellularity) and non-tumor specimen, and DNA was extracted using the QIAamp DNA FFPE Tissue Kit, as per the manufacturer's recommendations (QIAGEN Inc., Mississauga, Ontario, Canada). Blood lymphocyte DNA was extracted using standard Ficoll-Paque

technique, as per the manufacturer's recommendations (Amersham Biosciences, Baie d'Urfé, Quebec, Canada). Nine patients recruited through the clinic-based Familial Gastrointestinal Cancer Registry (FGICR) 121 with newly-diagnosed pancreatic cancer and no known BRCA1 germline mutations or family history of breast/ovarian cancer syndrome were selected for comparison. Tumor and non-tumor/lymphocyte DNA was similarly extracted for each patient. All patients were deceased before this study was performed; tissue specimens were previously banked for research after obtaining consent from patients or from family members.
3.2 LOH Assay
Three microsatellite markers linked to the BRCA1 locus were used for LOH analysis: D17S855, D17S1322, and D17S579. The first two markers are intragenic. (See Figure 1 for locations of microsatellite markers on chromosome 17)
Figure 1 - Location of BRCA1 microsatellite markers on chromosome 17
Figure 1 Legend: D17S1322 and D17S855 are intragenic (in introns 19 and 20, respectively), while D17S579 is distal to BRCA1. The distance in base pairs between markers is indicated.
Primer pair sequences were published in previous studies , and primers were purchased from Invitrogen Canada Inc. (Burlington, Ontario, Canada). Primer sequences are listed in Appendix Table S1. A microsatellite marker on 16p (D16S2616) was selected as an internal control. The expected allelic loss rate on this chromosomal arm in sporadic pancreatic cancer and FPC is 20-25%. 181,182 For each primer pair, a (FAM-6) 5'-labeled forward primer and an unlabeled reverse primer were used. Platinum Taq DNA Polymerase from Invitrogen was used for polymerase chain reaction amplification. For each reaction, 20-25 ng of genomic DNA was amplified in a 25 µL reaction volume containing 10X

PCR buffer (Invitrogen Canada Inc.), 2 mM MgCl2, 0.5 µL of 10 mM dNTP, 1-1.5 µL of 10 mM primers, and 0.2 µL of Invitrogen Platinum Taq DNA Polymerase. Initial denaturation was performed at 95 °C for 2 minutes, followed by 35 cycles of (a) 94 °C for 30 seconds, (b) primer-specific annealing temperature for 30 seconds, and (c) 72 °C for 30 seconds, and a final extension at 72 °C for 5 minutes. Automated DNA fragment analysis was performed using the ABI 3100 Prism sequencer (Applied Biosystems), and GeneMapper Software version 3.7 was used to measure the allelic peak intensities. A case was informative for a particular marker if two distinct alleles were amplified in the non-tumor/lymphocyte DNA. Allelic peak ratio was calculated in informative cases as (T1/T2)/(N1/N2), where T1, N1 = peak intensities for larger alleles; T2, N2 = peak intensities for smaller alleles; T = tumor DNA; N = non-tumor or lymphocyte DNA (Figure 2).
Figure 2 - Sample electropherogram of microsatellite marker fragment analysis
Figure 2 Legend: T=tumor DNA; N=non-tumor/lymphocyte DNA; T1,N1=peak intensities of larger alleles; T2,N2=peak intensities of smaller alleles; Allelic peak ratio = (T1/T2)/(N1/N2); LOH = allelic ratio < 0.70 or > 1.43
An allelic ratio of < 0.70 or > 1.43 was considered evidence of LOH in tumor DNA. Results were confirmed with at least 2 separate PCRs.
3.3 Tumor DNA Sequencing in BRCA1 Mutation Carriers
For carriers of germline BRCA1 mutations who demonstrated LOH in their pancreatic tumors, the DNA of the pancreatic cancer tissue was sequenced to determine whether the wild-type or the mutated allele was retained. Since paraffin-extracted DNA was being amplified, unique primers were designed for each BRCA1 mutation to obtain amplification products < 110 bp. Appendix Table S2 lists primer sequences. Non-tumor/lymphocyte DNA was sequenced for comparison for each case. Unlabeled primers were purchased from Invitrogen. 
The ABI Prism 3130 XL Genetic Analyzer (Applied Biosystems) was used to perform automated sequencing. The forward primer was used for sequencing, and results were confirmed by sequencing two independently amplified PCR products for each sample.
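The allelic peak ratio and LOH cut-offs defined in Section 3.2 can be sketched in a few lines of Python. This is an illustrative sketch only, not the GeneMapper-based workflow used in the study; the function names and the example peak intensities are hypothetical.

```python
def allelic_peak_ratio(t1, t2, n1, n2):
    """Return (T1/T2)/(N1/N2) for one informative marker.

    t1, n1: peak intensities of the larger allele in tumor and non-tumor DNA.
    t2, n2: peak intensities of the smaller allele in tumor and non-tumor DNA.
    """
    return (t1 / t2) / (n1 / n2)


def shows_loh(ratio, low=0.70, high=1.43):
    """Apply the stated cut-offs: LOH if the ratio is < 0.70 or > 1.43."""
    return ratio < low or ratio > high


# Hypothetical peak intensities: the larger allele is reduced in the tumor.
ratio = allelic_peak_ratio(t1=400, t2=1000, n1=1000, n2=1000)
print(ratio, shows_loh(ratio))
```

The upper cut-off of 1.43 is approximately the reciprocal of 0.70, so the same relative imbalance is required whichever of the two alleles is lost.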

4. Results
4.1 Patient Characteristics
Table 4 compares the characteristics of BRCA1 mutation carriers and sporadic pancreatic cancer patients.
Table 4 - Characteristics of BRCA1 mutation carriers and sporadic pancreatic cancer patients
Patient Characteristic BRCA1 Mutation Carriers (N=7) Sporadic Pancreatic Cancer (N=9) Gender (F:M) 0:7 4:5 Age at diagnosis with pancreatic cancer, years (mean +/- SD) / / Ethnicity: (n;(%)) Ashkenazi Jewish Caucasian Other Source of specimen: (n;(%)) Whipple resection Biopsy Autopsy BRCA1 mutation: 5382insC 185delAG 2318delG 5 (71%) 2 (29%) 0 2 (29%) 4 (57%) 1 (14%) (89%) 1 (11%) 6 (67%) 3 (33%) 0 N/A N/A N/A
Families with BRCA1 mutations demonstrated a history of breast +/- ovarian cancer, and four families also had 2 pancreatic cancer cases (one of these cases has been previously reported) 121. Most BRCA1 mutation carriers were of Ashkenazi Jewish descent, whereas we excluded patients with Jewish ancestry from the sporadic cancer group due to the elevated prevalence of BRCA1 mutations in this population. The two founder Ashkenazi Jewish BRCA1 mutations, 5382insC and 185delAG, were present in the majority of mutation carriers (6/7 families). Table 5 summarizes the pedigree information for the seven mutation carriers.
Table 5 - Pedigree summary for BRCA1 mutation carriers
BRCA1 mutation carrier ID Ethnicity Mutation Age at diagnosis of PC (years) Number of relatives with PC Number of relatives with BC and/or OC Tumors at other sites
BRC-1 AJ 5382insC 79 2 (brother, 1st cousin) 6 CRC
BRC-2 Caucasian 5382insC 57 1 (1st cousin) 5 -
BRC-3* AJ 5382insC 52 1 (son) 1 (sister; dx age 42) -
BRC-4 AJ 185delAG (daughter; dx age 39) Prostate

BRC-5 | AJ | 185delAG | | | | Prostate
BRC-6 | Caucasian | 2318delG | | | |
BRC-7 | AJ | 185delAG | 66 | 2 (sister, 1st cousin) | 3 | -
AJ = Ashkenazi Jewish; PC = pancreatic cancer; BC = breast cancer; OC = ovarian cancer; CRC = colorectal cancer
*This patient did not have molecular testing to confirm mutation; his brother and son both have confirmed 5382insC mutation

The mean age at diagnosis was similar for the two groups: 65.4 years in mutation-carriers vs years in sporadic patients. Three BRCA1 mutation carriers had a history of other malignancies: two prostate cancer and one colorectal cancer. No sporadic cancer patient had a history of multiple primary tumors.

4.2 LOH Analysis

All cases (BRCA1 mutation carriers and sporadic cancers) were informative for at least one BRCA1 marker. D17S855 was informative in 11/16 (69%) cases; D17S1322 and D17S579 were each informative in 13/16 (81%) cases. The internal control marker D16S2616 was informative in 10/16 (63%) of all cases. Two BRCA1 mutation carriers did not have enough tumor DNA to test for LOH with D16S2616; tumor DNA from one sporadic cancer patient could not be amplified when testing for LOH with D17S855. Table 6 shows the LOH results for each case with each marker.

Table 6 - LOH results for BRCA1 mutation carriers and sporadic pancreatic cancer cases
BRCA1 Mutation Carriers Sporadic Pancreatic Cancer Cases
Case ID BRC 1 BRC 2 BRC 3 BRC 4 BRC 5 BRC 6 BRC 7 SPR 1 SPR 2 SPR 3 SPR 4 SPR 5 SPR 6 SPR 7 SPR 8 SPR 9
Marker
D17S U + + U + U * - - U
D17S U - U U - -
D17S579 - U U U
D16S2616 U + * - * U U U
(+) = LOH [allelic peak ratio < 0.70 or > 1.43]
(-) = No LOH [0.70 < allelic peak ratio < 1.43]
(U) = uninformative sample (homozygous at the tested microsatellite marker in germline DNA)
(*) = DNA unavailable for amplification/DNA did not amplify

Ten cases in total were successfully tested with D16S2616, and only 2/10 (20%) demonstrated LOH. Five of seven (71%) BRCA1 mutation carriers demonstrated LOH with at least one marker, whereas only one of nine (11%) sporadic cancer cases demonstrated LOH with any BRCA1 marker (p = 0.035, 2-tailed Fisher's exact test). In four of the five BRCA1-mutated cases with LOH, the allelic peak ratio was < 0.5 or > 2.0 (see Figure 3 for representative genotyping results).

Figure 3 - Three representative matched-pair electropherograms for microsatellite LOH
Figure 3 Legend: T=tumor DNA; N=non-tumor DNA. (a) and (b) represent LOH; (c) represents no LOH

The histopathologies of pancreatic tumors from BRCA1 mutation carriers were moderately- and poorly-differentiated ductal adenocarcinoma, with no pathologic characteristics distinguishing tumors with LOH from tumors without LOH.

4.3 Sequencing to Identify Retained Allele in LOH Tumors

Four of five BRCA1-mutation carriers demonstrating LOH had sufficient tumor DNA for sequencing. Three cases (BRC-1, BRC-2, and BRC-3) had the 5382insC mutation, and one (BRC-6) the 2318delG mutation. Three of four sequenced cases (BRC-2, BRC-3, and BRC-6) demonstrated loss of or decrease in the wild-type allele, while BRC-1 was inconclusive (Figure 4 demonstrates a sample sequencing result).
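The Fisher's exact test reported in section 4.2 (LOH in 5 of 7 mutation carriers versus 1 of 9 sporadic cases) can be reproduced with a short, self-contained calculation. The sketch below is not the statistical software used in the study; it uses one common definition of the two-tailed p-value (the sum of the probabilities of all tables, with the same margins, that are no more likely than the observed one), and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k "positives" in row 1, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Small relative tolerance so that ties with the observed table are included.
    return sum(
        prob(k)
        for k in range(k_min, k_max + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )


# Rows: BRCA1 mutation carriers, sporadic cases. Columns: LOH, no LOH.
p_value = fisher_exact_two_sided(5, 2, 1, 8)
print(round(p_value, 3))  # 0.035
```

With 16 cases in total, the exact p-value is 280/8008, which rounds to the reported 0.035.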

Figure 4 - Representative sequencing result for an individual with 5382insC germline BRCA1 mutation
Figure 4 Legend: T=tumor DNA; N=non-tumor/lymphocyte DNA. The top panel demonstrates sequencing of two alleles in non-tumor DNA (mutant and wild-type allele); the bottom panel demonstrates only the mutant allele sequence in tumor DNA of the same individual.

Of note, patient BRC-3, who did not have molecular confirmation of the germline mutation, was successfully sequenced for the 5382insC mutation carried by his brother and son, confirming that he is a carrier.

5. Discussion

This analysis sheds light, at the molecular level, on the putative role of BRCA1 in pancreatic cancer tumorigenesis. The importance of LOH as a second hit in tumorigenesis is well-established in many cancers. Since BRCA1 inactivation occurs via LOH in the majority of breast and ovarian tumors in BRCA1-mutation carriers, we hypothesized that LOH also plays a primary role in inactivation of BRCA1 in mutation-positive pancreatic cancer. Indeed, we found that the majority of our mutation-positive pancreatic cancer subjects (5/7) did demonstrate LOH in tumor DNA. In comparison, we found that only 1/9 sporadic cancer patients demonstrated LOH at the BRCA1 locus in tumor DNA. It is possible that the remaining two mutation carriers, whose tumors did not show LOH, had inactivation of their wild-type allele by epigenetic methylation of the promoter; promoter hypermethylation of the wild-type allele in a minority of BRCA1 mutation-positive breast tumors has been previously reported 512. Due to the limitations of quantity and quality of our

paraffin-embedded specimens, we were not able to correlate LOH with decreased BRCA1 expression. However, our sequencing results did confirm loss of wild-type in most of the cases with LOH, suggesting that only the truncated protein product from the mutated allele would be expressed in those cases.

The link between BRCA2 mutations and pancreatic cancer is well-established, and most recommend including this gene in mutational screening for high-risk pancreatic cancer individuals and their relatives. However, the contribution of germline BRCA1 mutations to increased risk of pancreatic cancer is less clear. Both BRCA1 and BRCA2 have important roles in the repair of double-stranded DNA breaks. 513 A number of anecdotal reports have described pancreatic cancer in association with BRCA1 mutations. 514,515 Our group previously identified 38 individuals from a group of 102 pancreatic cancer patients who were considered to have intermediate/high-risk families, of whom one Ashkenazi Jewish patient screened positive for a deleterious BRCA1 mutation. 121 A study by Tonin et al. 516 screened 220 Ashkenazi Jewish breast cancer families for BRCA1 and BRCA2 mutations, and reported pancreatic cancer in 11/91 families with a BRCA1 mutation compared to 5 cases in 120 families without BRCA1 mutations. More recently, Skudra et al. 122 screened 90 consecutive Latvian patients presenting with pancreatic cancer and 640 controls for several germline BRCA1 mutations, including two Latvian founder mutations (5382insC, 4154delA) and two less common mutations (300T>G, 185delAG) in the BRCA1 gene. Four of 90 (4.4%) pancreatic cancer patients were found to carry a BRCA1 mutation compared to 1/640 (0.15%) controls. It was noted, however, that the rate of mutation in controls likely underestimates the true prevalence of the founder mutations in the general Latvian population since control subjects were relatively older, hence selecting against highly penetrant mutations.
Two large studies used family-based designs to study cancer risk at sites other than breast or ovary in families with multiple breast/ovarian cancers or with young age of onset of breast cancer. There was some overlap in the families used between the two studies, but different analytical methods were used. 508,509,517 Both studies found a statistically significant association for pancreatic cancer, albeit lower than the association with BRCA2: Brose et al. 509 reported a three-fold increase in pancreatic cancer risk among BRCA1 carriers (3.6%, compared to 1.3% estimated general population risk); Thompson et al. 508 reported a relative risk of 2.26 (95% CI ) for developing pancreatic cancer in BRCA1 mutation carriers, with a greater association in individuals diagnosed under age 65 (RR 3.10, 95% CI ). One limitation of these studies was the family-based design, which may overestimate cancer risks due to possible confounding effects of other genetic and/or environmental factors shared by members of a family. To circumvent this problem, Risch et al. 498 performed a population-based study of 1171 unselected women from Ontario, Canada who presented with new-onset ovarian carcinoma. Subjects were screened for BRCA1 and BRCA2 mutations, and information about other cancers in their first-degree relatives was used to estimate cancer risk at other sites in mutation carriers, and compared to estimated

cancer incidence rates in Ontario. Seventy-five BRCA1 mutation carriers were identified, and a relative risk of 3.1 was calculated for pancreatic cancer; however, this was not statistically significant (95% CI ). More recently (and subsequent to completion of our study), Ferrone et al. 502 published an analysis of unselected Ashkenazi Jewish patients who underwent pancreatic cancer resection and found no significant increase in BRCA1 frequency relative to the general Ashkenazi population (1.3% vs. 1.1%); however, the BRCA1 mutation rate was based on previous reports and not directly assessed in a control cohort in this study, and the authors acknowledged that the small size (145 subjects) may have resulted in insufficient power to detect a statistically significant difference. Axilbund et al. 123 did not find carriers of BRCA1 mutations in 66 FPC patients (defined as having at least two additional relatives with pancreatic cancer), but most of the subjects did not report Ashkenazi Jewish ancestry. In the non-Jewish North American population, the estimated frequency of BRCA1 mutations is 1/500-1/ ,519 ; this suggests that Axilbund et al.'s study was underpowered to identify an association of BRCA1 with FPC unless the effect size was at least 15-fold, a value exceeding the estimated risk of BRCA2. Kim et al. 520 reported a significantly lower age of onset for pancreatic cancer in BRCA1-mutation carriers than in non-carriers.

For our study, we identified seven unrelated individuals with pathologically-confirmed pancreatic adenocarcinoma whose families have BRCA1 mutations. In all but one of these cases, a molecular confirmation of the mutation was previously available. The patient without molecular confirmation had a brother and son who carried the identical 5382insC mutation; we later confirmed the presence of the same mutation in this patient when we sequenced his tumor DNA to identify the remaining allele.
The age at diagnosis of pancreatic cancer did not differ significantly between the mutation carriers and sporadic cases; this is similar to findings of other studies. 515,521 Though further studies are needed to definitively determine if BRCA1 is associated with increased pancreatic cancer risk, current data suggest that the penetrance of BRCA1 mutations for pancreatic cancer is lower than that of BRCA2. Moreover, some studies have suggested that some pancreatic cancer patients with BRCA2 mutations may not have a family history of breast or ovarian cancers. 501,522 It is not clear if the same may be true for pancreatic cancer patients with BRCA1 mutations; most studies to date have characterized families selected for breast or ovarian cancer.

Possible sources of experimental artifact include contamination of microdissected tumor cells with adjacent stromal cells and potential bias from the PCR-based microsatellite assay. Measures to reduce the impact of such bias included using microdissected tumor samples with a minimum of 70% cellularity (as identified by an experienced pathologist) and confirming PCR-based results with at least two separate PCR experiments. Since FFPE-specimens often yield DNA of variable quality as a result of nucleic acid

cross-linking by the fixation process, we minimized potential bias from degraded DNA by selecting primers for microsatellite markers that amplify small fragments ( bp). Due to the limitation of available DNA, and the amplicon size restriction in selecting microsatellite markers, we were limited to just three BRCA1 markers for our experiments. However, every sample produced informative results for at least one marker, and most generated results for two or more markers. We also attempted to include an internal control, an unrelated microsatellite marker at chromosome 16 with a previously reported LOH frequency of 20-25%. Due to technical reasons and inadequate DNA for further testing, only three of the seven familial samples successfully amplified this marker, with 1/3 demonstrating LOH. In comparison, seven of nine sporadic cases amplified this internal control marker, with 1/7 showing LOH. Overall, 2/10 (20%) of samples showed LOH at this locus, consistent with previous reports. Although the inadequate number of informative samples among the familial cases reduced the value of this control in our comparison, our results remain valid given the confirmatory Sanger sequencing that demonstrated decreased signal for the functional allele in tumors from samples that demonstrated LOH. Our small sample size (seven germline BRCA1 mutation carriers with pancreatic cancer) reflects the challenges inherent in studying a malignancy as lethal as pancreatic cancer, in which only 15% of cases are resectable.

To our knowledge, this is the first molecular genetic study investigating BRCA1 LOH in pancreatic cancer of germline BRCA1 mutation carriers. Two previous studies have investigated BRCA1 in sporadic pancreatic tumors. Beger et al.
510 used quantitative reverse-transcription PCR (qRT-PCR) and immunohistochemistry antibody staining to analyze BRCA1 and BRCA2 gene expression in 13 normal pancreas samples, 30 chronic pancreatitis samples, and 53 sporadic pancreatic adenocarcinomas. They found decreased BRCA1, but not BRCA2, mRNA and protein expression in 50% of pancreatic cancer samples, and also found decreased BRCA1 mRNA expression in chronic pancreatitis samples, whereas normal expression was observed in normal pancreatic tissue. Correlation of these findings with clinical information demonstrated worse 1-year survival in patients whose tumors had reduced BRCA1 expression, compared to patients with normal BRCA1 expression. Another study by Peng et al. 523 found that BRCA1 was frequently methylated in sporadic pancreatic adenocarcinoma as well as in ductal cells showing inflammatory background without histologic change. The authors suggested that promoter methylation of the BRCA1 gene may be the mechanism explaining the reduced gene expression reported by Beger et al. 510 in pancreatic cancer and in chronic pancreatitis. However, they noted heterogeneity of methylation in different sections of the same tumor, and they did not directly measure gene expression level, so it is not clear how promoter methylation impacted expression. Moreover, they found methylation of BRCA1 even in normal ductal cells. Our study adds to the evidence for BRCA1 in pancreatic tumorigenesis by specifically demonstrating an inactivating mechanism in the pancreatic tumor

DNA of BRCA1 mutation carriers, likely akin to the role of BRCA1 in breast and ovarian cancer tumorigenesis.

Determining the association between BRCA1 and pancreatic cancer has diagnostic and therapeutic implications. The implication of BRCA2 in pancreatic cancer has allowed incorporation of this gene in mutational screening panels and identification of kindreds at risk; the same can be done for BRCA1. As for treatment, current chemotherapeutic protocols for pancreatic cancer are based on 5-FU and gemcitabine. 524 Interestingly, in vitro and in vivo studies have found BRCA1-deficient tumors to be particularly sensitive to certain chemotherapeutic agents that take advantage of the impaired DNA repair mechanism that characterizes these tumors, such as cross-linking agents (e.g. mitomycin C), type II topoisomerase inhibitors (e.g. etoposide), and PARP1 (Poly ADP-ribose polymerase family, member 1) inhibitors. Recently, case reports and small series have shown that patients with BRCA1 or BRCA2 mutations respond to such therapies. 174,178,528,529,530

In conclusion, we demonstrate that LOH occurs at the BRCA1 locus in pancreatic cancers of BRCA1-mutation carriers, suggesting that this gene is inactivated in these tumors and may play a role in pancreatic tumorigenesis. Further research into the role of BRCA1 in pancreatic cancer is needed to assess the expression of this gene in pre-invasive and invasive pancreatic lesions. Subjects with germline BRCA1 mutations should be considered for inclusion in pancreas cancer screening programs, and they may benefit from chemotherapies that target the DNA repair pathway.

Chapter 3 - Germline Genomic Copy Number Variation in Familial Pancreatic Cancer

The contents of this chapter have been published in Human Genetics, 2012 Jun 5 (Epub ahead of print); I am first author.

1. Abstract

Adenocarcinoma of the pancreas is a significant cause of cancer mortality, and up to 10% of cases appear to be familial. Heritable genomic copy number variants (CNVs) can modulate gene expression and predispose to disease. We hypothesized that genes overlapped by rare germline genomic losses or gains identified exclusively in pancreatic cancer patients from high-risk families are candidate FPC genes. A total of 120 FPC cases and 1194 controls were genotyped on the Affymetrix 500K array, and 36 cases and 2357 controls were genotyped on the Affymetrix 6.0 array. Detection of CNVs was performed by multiple computational algorithms and partially validated by quantitative PCR. We found no significant difference in the germline CNV profiles of cases and controls. A total of 93 non-redundant FPC-specific CNVs (53 losses and 40 gains) were identified in 50 cases, each CNV present in a single individual. FPC-specific CNVs overlapped the coding region of 88 RefSeq genes. Several of these genes have been reported to be differentially expressed and/or affected by copy number alterations in pancreatic adenocarcinoma. Further investigation in high-risk subjects may elucidate the role of one or more of these genes in genetic predisposition to pancreatic cancer.

2. Introduction

As illustrated in Chapter 1 of this thesis, a small proportion of familial pancreatic cancer cases can be attributed to known cancer syndromes and their associated genes: Hereditary Breast and Ovarian Cancer (HBOC; BRCA2/BRCA1/PALB2), Peutz-Jeghers Syndrome (PJS; STK11), Familial Atypical Multiple Mole Melanoma (FAMMM; p16/CDKN2A), and Hereditary Pancreatitis (HP; PRSS1). However, most cases of Familial Pancreatic Cancer (FPC) have an unknown genetic etiology.
136 Segregation analysis of families with multiple affected members suggests that FPC is caused by heritable alterations in at least one rare major gene, likely in an autosomal dominant manner. 161 Moreover, multiple case-control and cohort studies have demonstrated that members of FPC families, particularly those with an affected first-degree relative, have a significantly elevated lifetime risk of developing the disease (up to fold). 156;158,160 However, to date traditional methods of linkage analysis for identifying predisposition genes have met with challenges in studying FPC, due in part to probable genetic heterogeneity as well as difficulty in

collecting DNA specimens on multiple affected members in a family due to the rapid mortality of the disease.

Recently, it has become clear that submicroscopic copy number variants (CNVs) are prevalent throughout all genomes, accounting for at least 1.2% of nucleotide variation between any two individuals. 238 CNVs have been linked to rare genomic disorders 531 as well as common neurodevelopmental 196, psychiatric 532, autoimmune 533 and metabolic 534 diseases. Some studies have suggested an association between common CNVs and sporadic cancers (e.g. pancreatic cancer (6q13) 344, neuroblastoma (1q21.1) 340, prostate cancer (2p24.3; 20p13; GSTT1) 338,341,342, nasopharyngeal carcinoma (6p21.3) 343, and endometrial cancer (GSTT1) 535 ). The recent paper by Huang et al. 344 is the first to describe an association of a germline CNV with pancreatic cancer risk: a common 10,379 bp deletion at 6q13 was found to be higher in frequency in sporadic pancreatic cancer patients compared to controls, with an odds ratio of 1.31 for 1-copy carriers compared to 2-copy carriers. Interestingly, functional analysis of this non-genic deletion suggested that it may be involved in long-range regulation of CDKN2B, an established tumor-suppressor gene.

In addition, it is well known that rare germline CNVs contribute to the genetic basis of familial cancer. Indeed, large germline genomic rearrangements cause 15% of Familial Adenomatous Polyposis (APC gene) 311, 2% of breast and ovarian cancer (BRCA1 gene) 536, and 5% of Lynch Syndrome (MSH2 & MLH1 genes) 321 cases. In 1-3% of Lynch Syndrome patients, the causative mutation is a large heritable deletion at the 3' end of the TACSTD1 gene, which causes transcriptional read-through and epigenetic silencing of the adjacent MSH2 gene. 336 Furthermore, a report by Shlien et al.
348 identified an elevated frequency of germline CNVs in individuals with Li Fraumeni syndrome (TP53 mutation), and suggested that the increased predisposition to cancer in this syndrome may be proportional to the frequency of germline CNVs, many of which overlap known cancer genes. Since germline CNVs implicated in familial cancers to date are rare with relatively high penetrance, we hypothesized that familial and young-onset pancreatic cancer patients have a distinctive germline genomic copy number variation (CNV) profile compared to non-cancer controls and that tumor suppressor genes or oncogenes predisposing to pancreatic cancer may be overlapped by one or more CNVs that are detected exclusively in patients. Here we present an analysis of germline CNVs detected in 120 high-risk pancreatic cancer patients and compare them to CNVs in a large cohort of unaffected controls. 3. Materials & Methods This study was approved by the Research Ethics Boards at Mount Sinai Hospital and University Health Network in Toronto, Canada; Office for Human Research Studies at Dana Farber/Harvard Cancer Centre

in Boston, Massachusetts; Institutional Review Board at Mayo Clinic in Rochester, Minnesota; Institutional Review Board at M.D. Anderson Cancer Centre in Houston, Texas; Office of Human Subjects Research at Johns Hopkins University in Baltimore, Maryland; and Human Investigation Committee at Karmanos Cancer Institute, Wayne State University in Detroit, Michigan.

DNA extraction from blood or EBV-transformed cell lines was performed by technicians at each participating site and provided to W. Al-Sukhni. Genotyping of samples and ancestry verification on STRUCTURE were performed by W. Al-Sukhni. Computational analysis of Affy 500K data on dChip, CNAG, and Partek was performed by W. Al-Sukhni, with assistance from S. Joe in script-writing for organization and filtration of data (as directed by W. Al-Sukhni). To standardize the analysis of Affy6.0 chips in the same manner used for the POPGEN and OHI controls, computational analysis of Affy6.0 data on Birdsuite and iPattern was performed by A. Lionel at TCAG. Filtration and annotation of all CNV data was performed by W. Al-Sukhni. Validation of CNVs by qPCR was performed by W. Al-Sukhni with technical assistance from N. Zwingerman, A. Gropper, and S. Moore. Breakpoint-mapping of CNV by qPCR and Sanger sequencing was performed entirely by W. Al-Sukhni. Comparison of case and control CNVs and statistical analysis were performed by W. Al-Sukhni.

3.1 DNA extraction

DNA was extracted at each centre from either whole blood (white blood cells/lymphocytes) or EBV-transformed cell lines. Cells were purified from whole blood using Ammonium Chloride-Tris lysis of red blood cells. DNA was extracted using MaXtract Low Density tubes, an adaptation of the standard organic solvent method of DNA extraction using phenol and chloroform. Purified DNA was precipitated with 95% ethanol and dissolved in low TE buffer.
3.2 FPC cases recruitment

Genomic DNA was extracted from peripheral blood or EBV-transformed cell lines of 133 pancreatic cancer patients from 131 high-risk families recruited by PACGENE (Pancreatic Cancer Genetic Epidemiology Consortium; PI, G Petersen, Mayo) 165, a six-centre consortium that recruits kindreds containing two or more blood relatives affected with pancreatic cancer for genetic studies. Inclusion criteria in the current study included: subjects with two or more affected relatives ("3+ FPC"; N=79); subjects with only one affected relative diagnosed at age 49 years or younger ("2 FPC"; N=22); and subjects without affected relatives who were diagnosed at age 49 years or younger ("single young"; N=32). (Some of the families were reassigned based on updated information after analysis; see Results section.) We included young cases with no family history of pancreatic cancer because they may have de novo mutations in the gene(s) of interest, although we acknowledge that the definition of FPC involves

more than one affected member in the family. Subjects were excluded if they carried known mutations or were in families with syndromes which predispose to pancreatic cancer (BRCA2, BRCA1, p16/FAMMM, STK11/PJS, PRSS1/HP, Lynch Syndrome). The majority of DNA samples were extracted from blood (N=97) and the remaining samples were from EBV-transformed lymphoblast cell lines. (See Appendix Table S3 (Excel sheet on attached CD) for details.)

3.3 Controls recruitment

Control samples of matched ancestry (> 95% of cases and controls reported Caucasian ancestry) were obtained from two sources: 45 samples were healthy controls recruited by the Familial Gastrointestinal Cancer Registry (FGICR) 537 at Mount Sinai Hospital, Toronto, and 1,153 samples were recruited by the Ontario Familial Colon Cancer Registry (OFCCR) 538. Almost all control DNA samples were extracted from blood (only 12 OFCCR controls were from lymphoblasts). (See Appendix Table S4 (Excel sheet on attached CD) for details.) In addition, we had access to CNV data for 1,234 controls recruited through the Ottawa Heart Institute (OHI) 539 and 1,123 controls of German descent recruited by the POPGEN project 540. Most of the OHI and POPGEN DNA samples were extracted from blood, and the platform for CNV detection was the Affymetrix 6.0 array.

3.4 SNP genotyping

For primary CNV discovery, 128 cases and all 1,198 FGICR + OFCCR controls were genotyped at approximately 500,000 genome-wide SNPs on the Affymetrix GeneChip Human Mapping 500K Array (NspI and StyI chips) according to Affymetrix standard protocol. Genotyping of the cases and the 45 FGICR controls was performed at The Centre for Applied Genomics (TCAG) in Toronto, while the 1,153 OFCCR controls were previously genotyped at Genome Quebec Innovation Centre as part of the ARCTIC case-control colorectal cancer GWAS study.
Briefly, whole genomic DNA was digested with restriction enzyme (NspI or StyI) and ligated to universal adaptors, and adaptor-ligated fragments were PCR-amplified with preference for 200bp-1,100bp size range. Subsequently, PCR amplicons were fragmented, labeled, and hybridized to NspI or StyI chips. Chips were scanned using GeneChip Scanner G, and Affymetrix GeneChip Command Console (AGCC) files were produced for further processing. Intensity files (CEL) and genotype files (CHP) were converted from AGCC files using GeneChip Operating Software (GCOS) and GeneChip Genotyping Analysis (GTYPE) software, respectively. Genotype calls were made by Affymetrix Genotyping Console (GTC 2.1), which implements the BRLMM genotype calling algorithm (Bayesian Robust Linear Model with Mahalanobis

distance classifier), using default settings (Score Threshold = 0.5, Block Size = 0, Prior Size = 10,000, DM Threshold = 0.7). GTC 2.1 performs a quality control (QC) analysis of the SNP genotype call rate, to estimate overall quality of the chip hybridization, based on the Dynamic Model genotype calling algorithm. For 500K arrays, Affymetrix considers a QC call rate < 93% to suggest poor hybridization. However, a QC call rate in the range of 88-93% can also produce useable data for CNV analysis, in the experience of collaborators at TCAG. Therefore, if we were unable to obtain rehybridized chips for some samples, we retained arrays with QC call rate > 88% in the CNV analysis but inspected the raw calls made from those arrays to check whether they appeared to be false.

A subset of the original FPC cohort (33 samples) plus five new cases (Appendix Table S3) were genotyped on the Affymetrix 6.0 array according to standard protocol to validate CNVs detected on the Affymetrix 500K array as well as detect new CNVs. Arrays meeting Affymetrix quality control guidelines of Contrast QC > 0.4 were used for further analysis. The Affymetrix Power Tools platform was used to extract normalized intensities for each array and inter-array intensity correlation was calculated; arrays with average correlation of > 0.9 were considered suitable for joint analysis.

3.5 Ancestry verification

Subject ancestry was verified using STRUCTURE software, which infers population structure using genotype data of unlinked markers 541. We used 1,089 unlinked genome-wide autosomal SNPs that map to the Affymetrix 500K array (NspI and StyI chips), with differing minor allele frequencies across three major HapMap populations (Caucasian (CEU), African (YRI), and Asian (CHB/JPT)). The observed alleles (major and minor) at each SNP in HapMap populations were obtained using UCSC genome browser Tables function.
To determine the population cluster (assuming three ancestral populations), 270 unrelated HapMap samples were used (90 CEU, 90 YRI, 90 CHB/JPT) as a reference of known ancestry. Ancestries were assigned using a coefficient-of-ancestry threshold.

3.6 CNV discovery

Figure 5 is a summary flow chart of the primary CNV discovery on the Affy500K arrays.

Figure 5 - Analysis of 500K arrays in FPC cases and controls
[Flow chart, 500K array analysis pipeline: Affymetrix 500K SNP arrays for 128 FPC cases and 45 FGICR controls (TCAG) and for 1153 OFCCR controls (Genome Quebec) were each analyzed with dChip, CNAG, and Partek Genomics Suite (HMM). Eight cases were excluded (noise, no longer FPC), leaving 120 cases; four controls were excluded (personal PC or family history suggests FPC), leaving 1194 controls. Overlapping CNVs were merged per sample and divided into low-confidence CNVs (single algorithm/chip) and high-confidence CNVs (at least 2 algorithms or chips). FPC-specific CNVs were identified from the high-confidence sets of cases vs. controls.]
Figure 5 Legend: Cases and controls were analyzed in a parallel fashion on three independent computational algorithms. A high-confidence CNV set (based on support by at least two algorithms or chips) was obtained for each of cases and controls and compared.

Copy number at each SNP position was estimated using three validated Hidden Markov Model (HMM)-based CNV-calling algorithms (dChip, CNAG, and Partek Genomics Suite v6.3). NspI and StyI chips were analyzed separately for each individual. After conducting several trials of different analysis approaches, we identified the following as the method that best addresses the noise level in our data: for dChip and Partek, samples were analyzed in batches corresponding to the grouping of samples during chip hybridization (to minimize batch effect differences in hybridization that may lead to false differences in intensity between samples): FPC cases and FGICR controls were analyzed in two batches (batch 1 contained 47 cases and 22 controls; batch 2 contained 81 cases and 23 controls); OFCCR controls were analyzed on dChip and Partek in 10 batches of approximately 100 samples each.
For CNAG, use of a maximum number of samples improves CNV detection, so the full group of FPC cases and FGICR (173 samples) were analyzed concurrently, while the ARCTIC controls were analyzed in 6 random batches of approximately 200 samples each. Default analysis settings were used for each of the computational programs: invariant-set probe normalization and hidden markov model copy inference

method for dChip; non-paired reference/test sample category and automated analysis option for CNAG; 2-probe minimum used for calling CNV on Partek Suite (HMM method). The Partek CNV coordinates were based on the hg18 genome build and were converted to hg17 to merge with dChip and/or CNAG calls. A loss was defined by two or more consecutive SNPs with estimated copy number of < 2; a gain was defined by two or more consecutive SNPs with estimated copy number of > 2. CNVs smaller than 1,100 bp were excluded to avoid false calls caused by PCR artifact (since the amplified fragments were 200-1,100 bp in size). Losses larger than 2 Mb and gains larger than 7 Mb were also excluded (the cut-off was based on the largest CNVs seen in cases, with the intention of maximizing sensitivity in detecting case CNVs while removing excessively large CNVs in controls that are likely false calls and/or represent somatic events). CNVs that crossed the centromere were removed because they were incompatible with chromosomal stability and expected to be false calls. For any given chip and algorithm, if the number of CNVs (losses + gains) called in a sample exceeded 40 (after the above filters), that sample was eliminated from the analysis for that given algorithm and chip (i.e. considered too noisy). For each sample on a given chip, CNVs identified by two or more algorithms with overlapping breakpoints (same direction on all algorithms) were merged if the length of the overlap area corresponded to at least 20% of the length of any of the overlapping CNVs (Figure 6).

Figure 6 - Criteria for merging CNVs

For each sample, CNVs identified on both chips of the 500K array with overlapping breakpoints (same direction on both chips) were merged if the length of the overlap area corresponded to at least 20% of the length of either of the overlapping CNVs (Figure 6). High-confidence calls were identified as CNVs called by at least two different algorithms and/or on both chips.
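The size filters, the per-sample noise cut-off, and the 20% overlap criterion described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the scripts used in the study: the `Cnv` record, the function names, and the centromere position argument are hypothetical, and "at least 20% of the length of any of the overlapping CNVs" is read here as 20% of the shorter of the two calls.

```python
from dataclasses import dataclass

MIN_SIZE_BP = 1_100          # smaller calls may be PCR artifacts
MAX_LOSS_BP = 2_000_000      # losses larger than 2 Mb are excluded
MAX_GAIN_BP = 7_000_000      # gains larger than 7 Mb are excluded
MAX_CALLS_PER_SAMPLE = 40    # more calls than this marks a sample as too noisy
MIN_OVERLAP_FRACTION = 0.20  # overlap needed to merge two calls


@dataclass(frozen=True)
class Cnv:
    """Hypothetical CNV record: half-open interval [start, end) in bp."""
    chrom: str
    start: int
    end: int
    kind: str  # "loss" or "gain"

    @property
    def length(self) -> int:
        return self.end - self.start


def passes_size_filter(cnv: Cnv) -> bool:
    """Apply the minimum size and the loss/gain-specific maximum size."""
    if cnv.length < MIN_SIZE_BP:
        return False
    limit = MAX_LOSS_BP if cnv.kind == "loss" else MAX_GAIN_BP
    return cnv.length <= limit


def crosses_centromere(cnv: Cnv, centromere_pos: int) -> bool:
    """True if the call spans the (caller-supplied) centromere position."""
    return cnv.start < centromere_pos < cnv.end


def sample_too_noisy(filtered_calls: list) -> bool:
    """Per chip and algorithm: drop the sample if it has more than 40 calls."""
    return len(filtered_calls) > MAX_CALLS_PER_SAMPLE


def should_merge(a: Cnv, b: Cnv) -> bool:
    """Same chromosome, same direction, and overlap >= 20% of either call."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return False
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return False
    return overlap >= MIN_OVERLAP_FRACTION * min(a.length, b.length)
```

For example, a 10 kb loss at chr1:1,000-11,000 and a loss at chr1:8,000-30,000 share 3 kb, which is 30% of the shorter call, so `should_merge` returns `True`; a gain at the same coordinates would not merge with the loss because the direction differs.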
Note that if a CNV was called by a different algorithm on each chip, it was not considered high-confidence. For the purpose of identifying CNV loci, overlapping CNVs in multiple samples were merged (using the above-described 20% threshold). CNV calling on Affy6.0 arrays was performed using the Birdsuite tools (Canary + Birdseye algorithms) 544 and the iPattern 545 algorithm, using a reference set that included the 38 FPC cases in addition to 100 other closely correlated Affy6 arrays previously analyzed at TCAG (based on correlation coefficient > 0.9). (Samples were also analyzed on GTC 4.1, but these data were only used to support calls made by Birdsuite or iPattern.) For each of these algorithms, we required CNVs to span 5 or more consecutive array probes and be at least 20 kb in length. Detection by either Birdsuite or iPattern was sufficient for the purpose of validating 500K array CNVs. Only high-confidence calls (i.e. called by at least two of Birdsuite, iPattern, and/or GTC 4.1 software; boundaries of overlapping regions were determined in the same manner as for 500K data) were included as novel FPC-specific CNVs. Samples with a number of calls greater than three standard deviations from the mean number of calls for an analysis batch were excluded from the study. The combined results of Birdsuite (Canary and Birdseye) were filtered to remove centromere jumpers, X chromosome variants, and CNVs tagged as a loss with a copy number of > 1 or tagged as a gain with a copy number of < 3. The iPattern results were filtered to remove CNVs on the X chromosome and CNVs tagged as complex.

3.7 PCR validation of CNVs

Quantitative PCR validation of a subset of CNVs was performed using Invitrogen Platinum SYBR Green qPCR SuperMix-UDG, with primers designed within the CNV of interest and MSH2 exon 2 used as the reference gene (see Appendix Table S5 for primer sequences). Standard PCR conditions were used: 50C x 2 min; 95C x 2 min; (95C x 15 sec; 60C x 32 sec) x 40 cycles. Reactions were performed in replicates of 4-8 per sample.
A standard curve was run on each plate using control DNA (from a single sample for all experiments) to ensure that primer efficiency was between 90% and 110% (slope = ) and that the correlation coefficient (R^2) of the standard curve samples was > . The dissociation curve was checked for a single peak (indicating a single product). Data were analyzed on the ABI 7500 real-time machine, with the baseline and threshold set manually to reflect the exponential phase of amplification. Finally, data from each plate were analyzed using the ddCt method 546: for each sample with at least 4 replicates, one replicate could be excluded from the calculation if it fell outside the range of mean +/- 2*SD of Ct values (range calculated after removal of the uppermost or lowermost value); a validation curve of dCt vs. log input DNA amount was generated for each primer set to confirm that the absolute slope was < 0.1, signifying that the efficiencies of the test gene and reference gene primer sets were approximately equal. The calculations for ddCt are made as follows:

\[
\begin{aligned}
\mathrm{dCt} &= \overline{\mathrm{Ct}}_{\text{test gene}} - \overline{\mathrm{Ct}}_{\text{control gene (MSH2)}} \\
\mathrm{SD}(\mathrm{dCt}) &= \sqrt{\mathrm{SD}(\mathrm{Ct}_{\text{test gene}})^{2} + \mathrm{SD}(\mathrm{Ct}_{\text{MSH2}})^{2}} \\
\mathrm{ddCt} &= \mathrm{dCt}_{\text{test sample}} - \mathrm{dCt}_{\text{control sample}} \\
\text{Fold difference in copy number} &= 2^{-\mathrm{ddCt}} \\
\mathrm{SD}(\text{fold difference in copy number}) &= \ln(2) \cdot \mathrm{SD}(\mathrm{dCt}) \cdot 2^{-\mathrm{ddCt}}
\end{aligned}
\]

3.8 Prioritization of CNVs

Figure 7 illustrates the priority order for investigating CNVs detected in cases.

Figure 7 CNV prioritization plan

Figure 7 Legend: CNVs segregating with disease in a family or de novo in a single case are highest priority, followed by recurrent CNVs in unrelated affected individuals that are not found in unaffected controls. Single-affected disease-specific CNVs are lower in priority, and CNVs found in both affected and unaffected individuals are least likely to yield candidate genes.

We defined FPC-specific CNVs as losses or gains detected in FPC cases on the 500K or Affymetrix 6.0 array that did not overlap (by 20% or more) with losses or gains in FGICR, OFCCR, OHI, or POPGEN controls, and did not overlap CNVs reported from non-BAC-based platforms in the Database of Genomic Variants (DGV) 547 (updated Nov 2010). Although we did not

control for ancestry in this analysis, we did note which FPC-specific CNVs were detected in non-Caucasian samples.

3.9 Annotation of CNVs

Affymetrix 500K and Affymetrix 6.0 array coordinates were aligned to the NCBI hg17 and NCBI hg18 human genome builds, respectively. Genes overlapped by CNVs were identified through the University of California, Santa Cruz (UCSC) genome browser using the respective human genome build. Information about CNV-overlapped genes was obtained from Entrez Gene and PubMed. The Memorial Sloan Kettering Cancer Centre (MSKCC) CancerGenes database 548 was used to identify genes with reported pathways or functions linked to cancer development. The Wellcome Trust Sanger Catalogue of Somatic Mutations in Cancer (COSMIC version 55) database 549 and the Pancreatic Expression Database version 2.0 253 were used to identify genes that had previously reported point mutations or copy number alterations in tumors or cancer cell lines, or that were reported to be differentially expressed in pancreatic cancer according to published gene expression studies. (For COSMIC, BioMart was used to identify all genes with the mutation types complex-compound substitution, complex frameshift, deletion-frameshift, insertion-frameshift, substitution-missense, substitution-nonsense, and unknown. To retrieve all COSMIC genes fitting these categories, the gene field was left empty; otherwise the desired gene lists were used.)

Comparing Affy500K CNV profile between cases and controls

Only high-confidence CNVs from non-EBV samples were included in the CNV profile comparison to minimize potential cell line artifacts and false calls. 278 As well, only controls with data available for both NspI and StyI chips were included in this comparison to minimize the bias of undercalling CNVs in single-chip samples. To minimize CNV calling errors for complex CNVs (i.e. losses and gains in different samples overlapping the same region), we performed the rare CNV analysis only on regions reported exclusively as losses or exclusively as gains. CNV loci present in fewer than 1% of the total number of samples (cases + controls) were considered rare, excluding EBV samples and complex CNVs. For losses, 32 cases and 235 controls (total 267 samples) were included in the rare loss analysis, so a rare loss was defined as one present in fewer than 3 individuals. For gains, 56 cases and 551 controls (total 607 samples) were included in the rare gain analysis, so a rare gain was defined as one present in fewer than 7 samples.
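The two rarity cut-offs follow directly from the 1% rule: a locus is rare when its carrier count is strictly below 1% of the samples in the analysis. The short sketch below reproduces that arithmetic; the function names are hypothetical and the code is only an illustration of the rule, not the analysis script used here.

```python
def rare_cutoff(n_samples: int, percent: int = 1) -> int:
    """Smallest carrier count that is no longer rare.

    A locus is rare when carriers / n_samples < percent / 100, which is
    equivalent to carriers < ceil(n_samples * percent / 100).
    Integer arithmetic is used to avoid floating-point rounding.
    """
    return -((-n_samples * percent) // 100)


def is_rare(carriers: int, n_samples: int) -> bool:
    """True if the locus is carried by fewer than 1% of the samples."""
    return carriers < rare_cutoff(n_samples)


print(rare_cutoff(32 + 235))  # loss analysis, 267 samples
print(rare_cutoff(56 + 551))  # gain analysis, 607 samples
```

For 267 samples the cut-off is 3 (1% is 2.67 samples) and for 607 samples it is 7 (1% is 6.07 samples), matching the definitions above.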

Statistical analysis

Comparison of medians was performed using the Mann-Whitney U test and comparison of means was performed using the two-tailed Student's t-test with Levene's test for equal variance. Testing for significant differences in proportions was performed with the two-tailed Fisher's exact test. A p-value < 0.05 was considered significant. Statistical testing was performed using the SPSS software package (version 17). For comparing differences in proportions of cases and controls at each CNV locus, we considered only regions containing exclusively losses or exclusively gains (in cases and/or controls) for non-EBV samples, and we excluded samples with only a single chip in the analysis. After calculating two-tailed Fisher's exact test p-values for each loss and gain locus, we performed a Bonferroni correction to account for multiple testing. The number of tests was defined as the total number of loss or gain loci in the above comparison (losses and gains were assessed separately).

Breakpoint Mapping and Sequencing

To precisely identify the CNV breakpoints, qPCR was performed at several positions near the estimated breakpoints (based on the SNP microarray results), narrowing down the estimated location of each breakpoint to a region approximately 1,000 bp in length (see Appendix Table S6 for primer sequences; standard PCR conditions were used as described previously). Primers were designed to PCR-amplify the region estimated to contain the breakpoint (see Appendix Table S6), and Sanger sequencing was used to identify the exact base pairs delineating the breakpoint. Products were cleaned up using the Qiagen MinElute PCR purification kit. Sanger sequencing was performed by the AGTC service lab.

4. Results

4.1 Affymetrix 500K results

Of the original 128 FPC cases genotyped on the Affymetrix 500K array, eight were subsequently excluded (two subjects had excessively noisy data based on CNV count > 40 per analysis run; one subject was discovered to have had chronic lymphocytic leukemia at the time of blood sample donation, making it difficult to distinguish germline from somatic CNVs detected in the sample; and five subjects no longer met inclusion criteria in light of new information that became available after the start of the study), leaving 120 cases in the final analysis with both NspI and StyI chips represented for each sample. Some of the subjects were reassigned to different inclusion criteria after updated information became available, resulting in FPC subjects, 28 2 FPC subjects, and 24 single young subjects contributing to

the final set of case CNVs detected on the Affymetrix 500K array. Two controls were discovered to have a history of sporadic pancreatic cancer (no affected relatives), and two other controls each reported having two relatives with pancreatic cancer, suggesting potential FPC kindreds. After excluding those four samples, 1,194 controls remained in the final analysis. For 236 of those controls, only one chip was included in the analysis (137 NspI only; 99 StyI only) due to inadequate hybridization of the second chip. STRUCTURE software was used to estimate the population ancestry of the 120 FPC cases and the 958 controls that had both NspI and StyI chips available for analysis: 89.2% of cases and 94.8% of controls were Caucasian; 1.7% of cases and 2.1% of controls were Asian; and 9.2% of cases and 3.1% of controls were of admixed background. Figures 8 and 9 summarize the number of gains and losses called by each algorithm on each chip in cases and controls.

Figure 8 Gains and losses identified in FPC cases by each algorithm/chip

Figure 8 Legend: Number of losses and gains identified by each algorithm and resultant number of losses and gains after merging overlapping CNVs.

Figure 9 - Gains and losses identified in controls by each algorithm/chip

Figure 9 Legend: Number of losses and gains identified by each algorithm and resultant number of losses and gains after merging overlapping CNVs.

The total number of autosomal CNVs identified in cases and controls was 873 and 10,794, respectively, of which 382 CNVs (123 losses gains) in cases and 3,115 CNVs (805 losses + 2,310 gains) in controls were considered high-confidence calls (corresponding to 66 loss loci gain loci in cases and 313 loss loci gain loci in controls). (See Appendix Tables S7 to S10 for high- and low-confidence CNVs in cases and controls, available as Excel files on the attached CD.) The proportion of losses and gains considered high-confidence was significantly larger in cases than in controls (losses: 48% cases vs. 33% controls, p<0.001; gains: 42% cases vs. 28% controls, p<0.001). As well, the percentage of cases with at least one high-confidence loss was significantly greater than that of controls (68% vs. 47%, p<0.001), but no significant difference existed between cases and controls in the percentage of samples with high-confidence gains (85% vs. 80%, p=0.227). Significance testing results were the same whether or not the 236 controls with only one chip in the analysis were included, and whether the denominator was all samples or only samples that had at least one CNV call. We note that no significant difference was observed between cases and controls when restricting the analysis to FGICR controls that were genotyped at the same centre (TCAG). (Tables 7 and 8)

89 76 Table 7 - Proportion of high-confidence losses in cases and controls % of samples with HC losses if remove controls with only 1 chip % of samples with HC losses among 2-chip samples with at least one loss call % of losses that % of HC losses if are highconfidence remove controls % of samples (HC) with only 1 chip with HC losses Cases All Controls Fisher's exact p < p < p < p=0.002 p=0.009 FGICR controls Fisher's exact (compared to cases) p=0.303 p=0.512 p=0.070 p=0.190 p=0.190 Table 8 - Proportion of high-confidence gains in cases and controls % of samples with HC gains if remove controls with only 1 chip % of samples with HC gains among 2-chip samples with at least one gain call % of gains that % of HC gains if are highconfidence remove controls % of samples (HC) with only 1 chip with HC gains Cases All Controls Fisher's exact p < p < p=0.227 p=0.782 p=0.882 FGICR controls Fisher's exact (compared to cases) p=0.109 p=0.086 p=0.227 p=0.626 p= Affymetrix 6.0 results In 36 cases genotyped on the Affymetrix 6.0 array (two of the original 38 samples were excluded due to excess noise see methods), a total of 3,364 autosomal CNVs (2,665 losses and 699 gains) were identified using Birdsuite, and 3,266 autosomal CNVs were identified using ipattern (1,975 losses and 1,291 gains). Table 9 summarizes some key parameters of CNVs identified by each algorithm. 
Table 9 - CNVs called by each of Birdsuite and ipattern in 36 samples on Affymetrix 6.0 array Birdsuite ipattern # losses 2,665 1,975 # gains 699 1,291 median size losses (bp) 7,793 10,388 median size gains (bp) 60,599 19,857 # genic losses (% of all losses) 969 (36%) 693 (35%) # genic gains (% of all gains) 512 (73%) 690 (53%) # losses called as HC losses in 500K array (in same sample) # losses called as LC losses in 500K array (in same sample) # gains called as HC gains in 500K array (in same sample) # gains called as LC gains in 500K array (in same sample) mean # losses per sample/mean # gains per sample 74/19 55/36 HC = high-confidence; LC = low-confidence on 500K array

The high-confidence set of Affy6 CNVs (incorporating GTC-supported CNVs) comprised 2,187 CNVs (1,656 losses gains). (See Appendix Tables S11 to S12 for high-confidence CNVs on the Affy6 array in FPC cases and controls, available as Excel files on the attached CD.) The median size of high-confidence losses and gains was 12.7 kb (1 kb-1.4 Mb) and 48.9 kb (1 kb-1.6 Mb), respectively, and the average number of losses and gains per genome was 46 and 15, respectively.

4.3 CNV validation

Quantitative PCR was used to attempt validation of 18 losses (13 high-confidence and 5 low-confidence) and 10 gains (all high-confidence) in FPC cases; all of the high-confidence CNVs and 4/5 low-confidence CNVs validated. (See Appendix Figures S1 to S32 for qPCR results.) Of the 33 FPC cases that were hybridized to both Affy 500K and Affy6.0 arrays, 31 yielded usable results on both arrays. For those 31 cases, 113 high-confidence CNVs and 142 low-confidence CNVs were called on the 500K array, of which 107 (95%) high-confidence CNVs and 63 (44%) low-confidence CNVs were validated on the Affy6 array. The combined results of qPCR validation and Affy6 genotyping demonstrated a validation rate of 95% (121/127) for high-confidence CNVs but only 45% (66/146) for low-confidence CNVs. Therefore, the remainder of this analysis was limited to high-confidence CNVs in cases and controls. Approximately one third (121/382) of all high-confidence case CNVs identified on the 500K array, corresponding to half (88/171) of all high-confidence CNV loci in cases, have been confirmed by the Affymetrix 6.0 array and/or qPCR.

4.4 Comparing CNV profile of cases and controls

We compared several characteristics of CNVs identified on the 500K array between FPC cases and FGICR/OFCCR controls. Table 10 compares several key CNV attributes between cases and controls (based on high-confidence CNVs and excluding EBV-derived samples and controls with only one chip in the analysis).
Table 10 - High confidence CNV profile of cases vs. controls (excluding EBV-derived samples and excluding controls with data from only one chip) FPC cases Controls p- value # Lymphocyte samples #High-confidence losses/high-confidence gains 91/ /2,059 Median CNV size (range) 219.5kb (1.2kb-6.4Mb) 219.5kb (1.2kb-6.8Mb) Median CNV SNP count (range) 42 (2-417) 40 (2-1318) #Genic CNVs/all CNVs Losses Gains 52/91 (57%) 153/190 (81%) 400/731 (55%) 1,646/2,059 (80%)

91 78 #Samples with genic CNVs/samples with any CNVs Losses Gains #CNV genes identified as Cancer Genes in MSKCC CancerGenes database/all CNV genes recognized by the MSKCC database Losses Gains #CNV loci included in rare analysis/all CNV loci Losses Gains #CNVs that are part of rare loci/all CNVs Losses Gains #Samples with CNVs included in rare analysis/samples with any CNV Losses Gains #Samples with rare CNVs/samples with any CNV Losses Gains #Genic rare CNVs/all rare CNVs Losses Gains #Samples with genic rare CNVs/samples with rare CNVs Losses Gains Mean CNVs per genome * Losses Gains Mean rare CNVs per genome * Losses 43/59 (73%) 70/75 (93%) 8/36 (22%) 53/335 (16%) 36/52 (69%) 65/83 (78%) 23/91 (25%) 47/190 (25%) 32/59 (54%) 56/75 (75%) 21/59 (36%) 37/75 (49%) 10/23 (43%) 33/47 (70%) 10/21 (48%) 27/37 (73%) /500 (65%) 765/816 (94%) 35/264 (13%) 507/2940 (17%) 203/290 (70%) 349/428 (82%) 199/731 (27%) 461/2,059 (22%) 235/500 (47%) 551/816 (68%) 169/500 (34%) 348/816 (43%) 69/199 (35%) 330/461 (72%) 63/169 (37%) 267/348 (77%) Gains * mean and t-test calculated for losses and gains based only on samples with at least one high-confidence loss or gain, respectively (to avoid the bias of samples which didn t get a high-confidence CNV call due to noise) Overall, no significant difference was observed in the CNV profile of cases and controls, including such parameters as CNV size, proportion of genic CNVs, proportion of rare CNVs, and average number of CNVs per individual genome. In both groups, gains were larger than losses (median size - cases: 228.7kb vs kb, p=0.016; controls: 224.4kb vs kb, p<0.001) and were more likely to overlap genes (cases: 153/190 gains vs. 52/91 losses are genic, p<0.001; Controls: 1,641/2,059 gains vs. 400/731 losses are genic, p<0.001).
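As an illustration of the proportion comparisons reported above, the following sketch computes a two-tailed Fisher's exact test from first principles and applies it to the genic fractions of gains and losses in cases (153/190 gains vs. 52/91 losses). The statistics in this work were computed in SPSS; this stand-alone function is a hypothetical stand-in that uses the common convention of summing the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is at least as unlikely as the observed table;
    # the small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_obs * (1 + 1e-9))


# Rows: gains, losses. Columns: genic, non-genic (case counts from the text).
p_value = fisher_exact_two_sided(153, 190 - 153, 52, 91 - 52)
print(f"p = {p_value:.2e}")
```

The resulting p-value falls below 0.001, in line with the result reported for cases.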

CNVs of interest

Figure 7 summarizes the CNV prioritization plan that we applied to our data. The highest priority is assigned to CNVs that segregate with disease status in blood relatives, or alternatively to de novo CNVs in singleton young affected subjects. Since no trios were available for analysis, we could not determine which CNVs were de novo. Only two pairs of siblings were genotyped; the remaining subjects were all unrelated. In one pair of siblings whose parents are not consanguineous, only a single gain was shared by the two siblings, and this CNV was also identified in many other cases and controls. In the second pair of siblings, whose parents are first cousins, one loss and three gains were shared by the two siblings, but all of these CNVs were also found in controls. Hence, no FPC-specific CNVs were found to segregate in either of the two pairs of siblings. Next in priority are CNVs that overlap in two or more unrelated cases and are absent in controls. We also considered CNVs present in cases and controls if they met the following conditions: (1) CNV present in two or more cases; (2) CNV overlaps gene(s) in cases; (3) the genic portion of the region is not overlapped by control CNVs or DGV CNVs. (To ensure that we were not missing anything significant, we also assessed the data for loci overlapping two or more cases and no controls even if reported in the DGV, but none fit these criteria.) A total of 64 FPC CNVs (27 losses and 37 gains) detected on the 500K array were not identified in FGICR or OFCCR controls. After further excluding regions that overlapped POPGEN or OHI controls or were reported in the DGV, the number of FPC-specific CNVs identified on the 500K array was 37 (16 losses and 21 gains).
On the Affymetrix 6.0 array, 119 FPC CNVs (71 losses and 48 gains) were not identified in POPGEN or OHI controls, and after further excluding regions that overlapped FGICR and OFCCR controls or were in the DGV, 73 FPC-specific CNVs (45 losses and 28 gains) remained. Combining results from the two arrays (including regions identified on both platforms) yielded a total of 93 non-redundant FPC-specific CNVs (53 losses and 40 gains), each present in a single individual only (a total of 50 FPC cases, including 7 EBV-derived samples); 13 losses and 8 gains were in non-Caucasian individuals. One duplication (G_97) appeared to affect the same gene (TGFBR3) in two unrelated cases, albeit with different breakpoints in each case (Figure 10). This gene codes for a receptor of TGF-beta, a signaling molecule with an important role in pancreatic cancer initiation and progression, and decreased expression of TGFBR3 has been observed in various cancers, suggesting that it behaves as a tumor suppressor. Given the potential significance of this gene for pancreatic cancer, we aimed to investigate this duplication further.

Figure 10 Duplications overlapping TGFBR3 gene

Figure 10 Legend: TGFBR3 transcripts circled; red bars represent breakpoints of CNVs identified on SNP arrays

Although an overlapping duplication was also present in one POPGEN control, the control duplication only overlapped the beginning of one of the multiple isoforms of this gene. (There was also a large low-confidence duplication called in one of our ARCTIC controls, but this appeared to be a false call as demonstrated by qPCR; see Appendix Figure S33.) The duplication in case ID-27 was validated by qPCR using two different primer sets. We validated the duplication in case ID-203 using those same primer sets, and additionally tested family members of this subject for whom DNA was available (Figure 11; Appendix Figures S33-S38).

Figure 11 Pedigree of case ID-203, indicating results of qPCR testing for duplication G_97

Figure 11 Legend: GB = gallbladder; PC = Pancreas cancer; dup = duplication identified; no dup = no duplication identified; blood = source of DNA is lymphocytes; tissue = source of DNA is FFPE resected specimen

At this point, we observed that the mother of the proband did not carry the duplication, which weakened the argument for this CNV being causative for pancreatic cancer (since the pancreatic cancer was considered matrilineal in this family, with a maternal grandmother reported to have died of the disease). However, we considered the possibility of the disease being inherited from the paternal side, particularly since the paternal grandmother was reported to have died of gallbladder cancer, which could have been a misdiagnosis of pancreatic cancer. We did not have access to DNA from the father or paternal grandmother, but as noted in the pedigree, a sister of the proband had also died of pancreatic cancer. We wished to test for segregation of the duplication with the disease, but only formalin-fixed paraffin-embedded (FFPE) tissue was available for DNA extraction from this sister. Due to the fragmented nature of FFPE-derived DNA (caused by cross-linking and degradation of nucleic acid by formalin preservation), qPCR performed on FFPE DNA can be biased and difficult to verify. Therefore, we decided to fine-map the breakpoints of the duplication to allow Sanger sequencing of the tandem duplication point. Our fine-mapping method involved designing qPCR probes at several positions falling within as well as outside the array-defined boundaries of the duplication (Figure 12; Appendix Figures S39 to S45 for qPCR results).
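The walk-along idea can be sketched in a few lines: each qPCR probe yields a fold difference in copy number via the ddCt calculation from the methods, and the breakpoint at one end of the duplication must lie between the adjacent pair of probes whose duplication status differs. The code below is a hypothetical illustration only; the Ct values, probe positions, and the 1.25 fold-difference threshold separating two copies (about 1.0) from three copies (about 1.5) are all assumptions made for the example, not values from this study.

```python
from statistics import mean


def fold_difference(test_ct, ref_ct, control_test_ct, control_ref_ct):
    """Relative copy number by the ddCt method: 2 ** -(dCt_sample - dCt_control)."""
    d_ct_sample = mean(test_ct) - mean(ref_ct)
    d_ct_control = mean(control_test_ct) - mean(control_ref_ct)
    return 2 ** -(d_ct_sample - d_ct_control)


def bracket_breakpoint(probes, threshold=1.25):
    """Return the pair of adjacent probe positions where duplication status flips.

    probes: list of (genomic_position, fold_difference) tuples for ONE end of
    the duplication, in any order. threshold is a hypothetical cut-off.
    Returns None if all probes have the same status.
    """
    ordered = sorted(probes)
    for (pos_a, fold_a), (pos_b, fold_b) in zip(ordered, ordered[1:]):
        if (fold_a >= threshold) != (fold_b >= threshold):
            return pos_a, pos_b
    return None


# Made-up replicate Ct values for one probe in a carrier and a control sample.
fold = fold_difference(
    test_ct=[25.0, 25.1, 24.9, 25.0], ref_ct=[25.6, 25.6, 25.5, 25.7],
    control_test_ct=[26.0, 26.1, 25.9, 26.0], control_ref_ct=[26.0, 26.0, 26.1, 25.9],
)
print(round(fold, 2))

# Made-up probe walk across one end of a duplication.
walk = [(1_000, 1.02), (1_500, 0.97), (2_000, 1.48), (2_500, 1.55)]
print(bracket_breakpoint(walk))
```

Adding probes inside the returned interval and repeating the call narrows the interval further, which is the sense in which the method walks toward the breakpoint.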

Figure 12 Fine-mapping the breakpoint of duplication overlapping TGFBR3 using qPCR walk-along method

Figure 12 Legend: Panel [A] depicts the array-based estimation of the duplication breakpoints; panels [B] and [C] indicate the locations of the qPCR probes at either end of the duplication (shown as small vertical black bars). In panels [B] and [C], the red arrows indicate the area between the confirmed duplicated and non-duplicated positions at either end of the CNV.

At this point, we selected two primers used for qPCR analysis (O_Out_5 and T_Out_3) to attempt PCR amplification of the region containing the duplication breakpoint. Although we did not yet know the exact size of the duplication, we were able to amplify a fragment approximately 1.5-2 kb in size (see Figure 13), whereas a control sample not containing the duplication failed to amplify anything using these primers (as would be expected).

Figure 13 PCR gel demonstrating amplification of ~1.5-2 kb fragment containing G_97 duplication breakpoint in case ID-203

Figure 13 Legend: Each well represents a separate PCR reaction (three for duplication-carrying sample and three for non-duplication control)

We submitted the fragment for Sanger sequencing from both ends; although the fragment was too large to read completely from either primer, we obtained reads of sufficient length from each primer that they overlapped at the breakpoint of the duplication, allowing us to pinpoint the exact location of the breakpoint (see Figure 14).

Figure 14 G_97 duplication breakpoint mapping by Sanger sequencing

Figure 14 Legend: Sequence [A] is located at the end of G_97 that does not transect TGFBR3; the purple-highlighted portion is seen in Sanger sequence reads from the forward primer (O_Out_5) located at that end of the duplication. Sequence [C] is located at the end of G_97 that transects TGFBR3; the yellow-highlighted portion is seen in Sanger sequence reads from the reverse primer (T_Out_3) located at that end of the duplication. The non-highlighted portion of each of those reads represents the sequence normally expected at each location if no duplication were present. The red-highlighted sequence is the region of the tandem duplication breakpoint that was observed in each of the Sanger sequence reads from the above-described primers; note the insertion of TAT at the point of duplication.

Based on this information, we designed a primer set to amplify a smaller fragment encompassing the breakpoint (~100 bp), to allow amplification of FFPE-derived DNA (obtained from a non-tumor region of the specimen block) from the affected sister of the proband. We also performed PCR amplification of several other amplicons of similar size to control for DNA degradation, and we used case ID-203 as a positive control for the duplication. As Figure 15 illustrates, although the FFPE DNA amplified the four other test amplicons well, no amplification of the duplication breakpoint region was observed in the affected sister, indicating that she did not inherit the duplication.

Figure 15 - PCR gel illustrating amplification of test regions and duplication breakpoint in case ID-203 and affected sister

Figure 15 Legend: Wells within the blue boxes belong to the sister of ID-203 (source of FFPE DNA); wells outside the blue boxes belong to case ID-203 (blood-derived DNA); every fifth column is a water control

4.6 FPC-specific CNVs

Since the TGFBR3 duplication did not segregate with pancreatic cancer in the family we studied, and no FPC-specific CNV occurred in more than one case, we proceeded to annotate the FPC-specific CNVs and to prioritize them based on gene content and their association with cancer. (Figure 16 illustrates the distribution of FPC-specific CNVs across the genome.)

98 85 Figure 16 - FPC-specific losses and gains on autosomal chromosomes Figure 16 Legend: Red box = loss; Green box = gain Twenty-three FPC-specific losses and 23 FPC-specific gains overlapped introns, exons, and/or untranslated regions of 104 RefSeq genes (Table 11). Table 11 FPC specific CNVs CNV type CNV Id Sample Id Gain Affy6.0_G_ Gain Affy500K_G_280 & Affy6_G_ Gain Affy500K_G_ Gain Affy6.0_G_ (Admixed) Coordinates (hg18) Size (kb) RefSeq Genes chr1: AGBL4 no chr18: chr3: chr19: ARHGAP28, LAMA1, LRRC30, LOC ATR, TRPC1 PLS1, BRSK1, UBE2S, SHISA7, TMEM190, COX6B2, Overlaps Pancreatic Expression Database CNVs? High-level amplification no no

99 86 Gain Affy500K_G_ (EBV) Gain Affy500K_G_615 & Affy6_G_ Gain Affy6.0_G_ Gain Affy500K_G_ Gain Affy6.0_G_ Gain Gain Affy500K_G_603/604 & Affy6_G_93 Affy6.0_G_39 Gain Affy6.0_G_ (Admixed) 123 (Admixed) Gain Affy6.0_G_ (Admixed) Gain Affy500K_G_176 & Affy6_G_ Gain Affy6.0_G_33 69 Gain Affy500K_G_88 24 Gain Affy500K_G_ Gain Affy500K_G_602 Affy6_G_50 & 123 (Admixed) Gain Affy500K_G_ (EBV) Gain Affy500K_G_ chr16: FAM71E2, HSPBP1, TMEM150B, ISOC2, IL11, RPL28, TMEM238, ZNF628, SUV420H2, NAT14, PPP6R1, SSC5D DYNLRB2, CDYL2, MIR548H4 chr7: EXOC4 no chr15: GJD2 no chr4: GRID2 no chr15: HEXA, CELF6 no chr8: IDO2 no chr3: IFT80 no chr10: LRRC20 no chr11: LTBP3, PCNXL3, MAP3K11, MIR4489, MALAT1, RELA, SIPA1, SSSCA1, FAM89B, KCNK7, MIR4690, EHBP1L1, LOC254100, SCYL1 chr18: METTL4 no chr2: none no chr4: none (mrna present) chr4: none no chr4: none no chr4: none no chr6: none no Gain Affy6.0_G_70 44 chr6: none no High-level amplification no no

100 87 Gain Affy6.0_G_95 99 Gain Affy500K_G_49 12 (Admixed) Gain Affy6.0_G_ Gain Affy6.0_G_ Gain Affy500K_G_622 & Affy6_G_ Gain Affy500K_G_ (EBV) Gain Affy6.0_G_ Gain Affy6.0_G_ Gain Affy500K_G_ Gain Affy500K_G_ Gain Affy500K_G_105 & Affy6_G_283 & Affy6_G_ Gain Affy500K_G_95 26 Gain Affy6.0_G_ Gain Gain Affy6.0_G_3 Affy500K_G_69 & Affy6_G_87 18 Gain Affy500K_G_ (Admixed) Gain Affy6.0_G_ (Asian) Gain Affy6.0_G_ Gain Affy6.0_G_ Loss Loss Affy500K_D_125 & Affy6_D_ Affy6.0_D_ (Admixed) chr8: none no chr9: none no chr10: none no chr11: chr11: chr12: none (mrna present) none (mrna present) none (mrna present) chr13: none no chr20: none no chr21: chr21: chr17: none (mrna present) none (mrna present) OR1D2, OR1G1, OR1A2, OR1A1, OR1D4, OR3A2, OR3A1, OR3A4P chr10: PLXDC2 chr8: PRKDC, MCM4 no chr1: PYHIN1 no chr8: RSPO2 chr2: SP110, SP140 no chr12: TMTC2 no chr14: ZNF410, PTGR2 no chr19: chr12: CNTN1 ZNF784, NLRP9, EPN1, CCDC106, ZNF580, U2AF2, ZNF581 chr5: CTNND2 no Loss Affy6.0_D_ (Asian) chr18: DLGAP1 no no no no no no no High-level amplification High-level amplification no High-level amplification

101 88 Loss Affy6.0_D_1127 Loss Affy6.0_D_ Loss Affy500K_D_24 Affy6_D_1342 Loss Affy6.0_D_ Loss Affy500K_D_ Loss Affy6.0_D_ Loss Affy6.0_D_ Loss Affy6.0_D_ Loss Affy6.0_D_911 & 123 (Admixed) 11 (Asian) 123 (Admixed) Loss Affy500K_D_ (EBV) Loss Affy500K_D_ (Admixed) Loss Affy6.0_D_ Loss Affy500K_D_114 & Affy6_D_74 62 Loss Affy6.0_D_ (Admixed) Loss Affy6.0_D_ Loss Affy6.0_D_ Loss Affy6.0_D_ Loss Affy6.0_D_ (Admixed) Loss Affy6.0_D_ (Admixed) Loss Affy6.0_D_ (Admixed) Loss Affy6.0_D_ Loss Affy500K_D_93 48 Loss Affy6.0_D_ Loss Affy6.0_D_ chr10: DOCK1 no chr2: EML6 no chr13: GPC6 chr4: HTN1, STATH HTN3, chr3: KALRN no chr1: KANK4 no chr19: LOC no chr4: LOC no chr6: MAN1A1 no chr8: MCPH1, ANGPT2 chr8: NAT1 no chr2: none no chr2: none no chr3: none no chr3: none no chr4: chr6: chr7: none (mrna present) none (mrna present) none (mrna present) chr8: none no chr8: none no chr8: none no chr8: chr8: none (mrna present) none (mrna present) chr8: none no High-level amplification no no no no no no no

102 89 Loss Affy500K_D_ Affy500K_D_43 & Loss Affy6_D_ Loss Affy6.0_D_ (Admixed) Loss Affy6.0_D_ Loss Affy6.0_D_ Affy500K_D_40 & Loss Affy6_D_ Loss Affy500K_D_6 2 (EBV) Loss Affy6.0_D_ Affy500K_D_83 & Loss Affy6_D_ Loss Affy6.0_D_ Loss Affy6.0_D_ Affy500K_D_121 & Loss Affy6_D_ (Admixed) Loss Affy6.0_D_ Loss Affy6.0_D_ Loss Affy6.0_D_ Loss Affy6.0_D_ (Asian) Loss Affy500K_D_ (EBV) Loss Affy6.0_D_ Loss Affy6.0_D_ Loss Affy500K_D_ (EBV) Loss Affy6.0_D_ Loss Affy6.0_D_ Loss Affy6.0_D_ Loss Affy6.0_D_ chr9: none no chr9: none no chr9: none no chr9: none no chr9: none (mrna present) chr11: none no chr11: chr11: chr12: none (mrna present) none (mrna present) none (mrna present) chr13: none no chr14: none no chr14: none no chr15: none no chr15: chr16: none (mrna present) none (mrna present) chr20: none no chr21: none no chr2: ORC4 no chr6: PARK2 no chr5: PCSK1, ERAP1, CAST chr8: RALYL no chr18: RIT2 no chr17: RNF213 no chr4: SCFD2 no no no no no no no High-level amplification

103 90 Loss Affy6.0_D_ Affy500K_D_98 & Loss Affy6_D_ chr2: SNAR-H no chr4: TTC29 no Fourteen genes (including one small nuclear RNA) had at least part of their coding regions affected by FPC-specific losses, and 74 genes (including 3 micrornas) had at least part of their coding regions affected by FPC-specific gains (Table 12). Table 12 Genes whose coding regions are affected by FPC-specific CNVs CNV type Gene Entrez Id Official full name Position (hg18) Array Sample Gain OR1A Gain OR1A Gain OR1D Gain OR1G Gain OR1D Gain OR3A Gain OR3A Gain OR3A Gain CDYL Gain DYNLRB olfactory receptor, family 1, subfamily A, member 1 olfactory receptor, family 1, subfamily A, member 2 olfactory receptor, family 1, subfamily D, member 2 olfactory receptor, family 1, subfamily G, member 1 olfactory receptor, family 1, subfamily D, member 4 (gene/pseudogene) olfactory receptor, family 3, subfamily A, member 1 olfactory receptor, family 3, subfamily A, member 2 olfactory receptor, family 3, subfamily A, member 4 chromodomain protein, Y- like 2 dynein, light chain, roadblock-type 2 Gain MIR548H microrna 548h-4 Gain METTL methyltransferase like 4 Rho GTPase activating Gain ARHGAP protein 28 Gain LAMA laminin, alpha 1 Gain LOC hypothetical LOC leucine rich repeat containing Gain LRRC Gain SP SP110 nuclear body protein Gain SP SP140 nuclear body protein glutamate receptor, Gain GRID ionotropic, delta 2 chr17: K 28 full chr17: K 28 full chr17: k & Affy6 28 full chr17: k & Affy6 28 full Extent of gene affected chr17: k & Affy6 28 full chr17: k & Affy6 28 full chr17: k & Affy6 28 full chr17: k & Affy6 28 full chr16: K 37 partial chr16: K 37 full chr16: K 37 partial chr18: k & Affy6 44 partial chr18: k & Affy6 62 partial chr18: k & Affy6 62 full chr18: k & Affy6 62 full chr18: k & Affy6 62 full chr2: K 65 partial chr2: K 65 partial chr4: K 79 partial

104 91 Gain ATR 545 Gain PLS plastin 1 Gain TRPC Gain IDO Gain EXOC Gain RSPO ataxia telangiectasia and Rad3 related transient receptor potential cation channel, subfamily C, member 1 indoleamine 2,3-dioxygenase 2 exocyst complex component 4 R-spondin 2 homolog (Xenopus laevis) Gain PLXDC plexin domain containing 2 ATP/GTP binding proteinlike Gain AGBL Gain EHBP1L Gain FAM89B Gain KCNK EH domain binding protein 1-like 1 family with sequence similarity 89, member B potassium channel, subfamily K, member 7 Gain LOC hypothetical LOC Gain LTBP Gain MALAT Gain MAP3K latent transforming growth factor beta binding protein 3 metastasis associated lung adenocarcinoma transcript 1 (non-protein coding) mitogen-activated protein kinase kinase kinase 11 Gain MIR microrna 4489 Gain MIR microrna 4690 Gain PCNXL pecanex-like 3 (Drosophila) v-rel reticuloendotheliosis viral oncogene homolog A Gain RELA (avian) Gain SCYL SCY1-like 1 (S. cerevisiae) Gain SIPA Gain SSSCA signal-induced proliferationassociated 1 Sjogren syndrome/scleroderma autoantigen 1 Gain PTGR prostaglandin reductase 2 Gain ZNF zinc finger protein 410 CUGBP, Elav-like family Gain CELF member 6 hexosaminidase A (alpha Gain HEXA 3073 polypeptide) gap junction protein, delta 2, Gain GJD kDa chr3: K 82 partial chr3: K 82 full chr3: K 82 partial chr8: chr7: chr8: K & Affy6 123 partial 500K & Affy6 125 partial 500K & Affy6 18 partial chr10: K 26 partial chr1: Affy6 127 partial chr11: Affy6 20 full chr11: Affy6 20 full chr11: Affy6 20 full chr11: Affy6 20 full chr11: Affy6 20 full chr11: Affy6 20 partial chr11: Affy6 20 full chr11: Affy6 20 full chr11: Affy6 20 full chr11: Affy6 20 full chr11: Affy6 20 partial chr11: Affy6 20 full chr11: Affy6 20 full chr11: Affy6 20 full chr14: Affy6 67 partial chr14: Affy6 67 partial chr15: Affy6 44 partial chr15: Affy6 44 partial chr15: Affy6 99 full

105 92 Gain CCDC Gain EPN epsin 1 Gain NLRP Gain U2AF coiled-coil containing 106 domain NLR family, pyrin domain containing 9 U2 small nuclear RNA auxiliary factor 2 Gain ZNF zinc finger protein 580 Gain ZNF zinc finger protein 581 Gain ZNF zinc finger protein 784 Gain BRSK BR serine/threonine kinase 1 Gain COX6B Gain FAM71E Gain HSPBP Gain IL interleukin 11 Gain ISOC Gain NAT Gain PPP6R cytochrome c oxidase subunit VIb polypeptide 2 (testis) family with sequence similarity 71, member E2 HSPA (heat shock 70kDa) binding protein, cytoplasmic cochaperone 1 isochorismatase containing 2 domain N-acetyltransferase 14 (GCN5-related, putative) protein phosphatase 6, regulatory subunit 1 Gain RPL ribosomal protein L28 Gain SHISA Gain SSC5D Gain SUV420H shisa homolog 7 (Xenopus laevis) scavenger receptor cysteine rich domain containing (5 domains) suppressor of variegation 4-20 homolog 2 (Drosophila) Gain TMEM150B transmembrane protein 150B Gain TMEM transmembrane protein 190 Gain TMEM transmembrane protein 238 Gain UBE2S ubiquitin-conjugating enzyme E2S Gain ZNF zinc finger protein 628 Gain IFT Gain PYHIN Gain MCM Gain PRKDC 5591 intraflagellar transport 80 homolog (Chlamydomonas) pyrin and HIN domain family, member 1 chr19: Affy6 62 full chr19: Affy6 62 full chr19: Affy6 62 partial chr19: Affy6 62 full chr19: Affy6 62 full chr19: Affy6 62 full chr19: Affy6 62 partial chr19: Affy6 20 full chr19: Affy6 20 full chr19: Affy6 20 full chr19: Affy6 20 full chr19: Affy6 20 full chr19: Affy6 20 full chr19: Affy6 20 full chr19: Affy6 20 partial chr19: Affy6 20 full chr19: Affy6 20 full chr19: Affy6 20 partial chr19: Affy6 20 full chr19: Affy6 20 full chr19: Affy6 20 full chr19: Affy6 20 full chr19: Affy6 20 full chr19: Affy6 20 full chr3: Affy6 123 partial chr1: Affy6 123 partial chr8: Affy6 202 partial minichromosome maintenance complex component 4 protein kinase, DNAactivated, catalytic polypeptide chr8: Affy6 202 partial

Loss NAT1 9 N-acetyltransferase 1 (arylamine N-acetyltransferase) Loss KALRN 8997 kalirin, RhoGEF kinase Loss ANGPT2 285 angiopoietin 2 Loss CAST 831 calpastatin Loss ERAP Loss PCSK Loss TTC endoplasmic aminopeptidase 1 reticulum proprotein convertase subtilisin/kexin type 1 tetratricopeptide domain 29 repeat chr8: K 77 full chr3: K 85 partial chr8: K 112 partial chr5: K 117 full chr5: K 117 full chr5: K 117 partial chr4: K & Affy6 54 full Loss RNF ring finger protein 213 Loss ORC Loss SNAR-H Loss HTN histatin 1 Loss HTN histatin 3 origin recognition complex, subunit 4 small ILF3/NF90-associated RNA H Loss STATH 6779 statherin sec1 family domain Loss SCFD containing 2 chr17: Affy6 35 partial chr2: Affy6 61 partial chr2: Affy6 99 full chr4: Affy6 69 partial chr4: Affy6 69 full chr4: Affy6 69 full chr4: Affy6 28 partial

Fifty-five percent of the genes in Table 12 (48/88) have reported non-silent mutations (missense or nonsense variants; insertions/deletions; gene fusions) in different cancers according to the COSMIC v.55 database, whereas only 37% of genes in all 500K + Affymetrix 6.0 FPC CNVs (p=0.002) and only 42% of genes in all 500K + Affymetrix 6.0 control CNVs (p=0.022) had such mutations. None of the genes overlapped by FPC-specific losses were reported to have downregulated expression in pancreatic cancer in the Pancreatic Expression Database, whereas six genes overlapped by gains had reports of upregulation in pancreatic adenocarcinoma and three genes were reported to be upregulated in intraductal papillary mucinous neoplasm, a pre-invasive lesion. Furthermore, four FPC-specific gains overlapped regions reported to have high-level amplification in pancreatic adenocarcinoma in the Pancreatic Expression Database. The four gains overlap eight genes, of which four genes (LOC400643, DYNLRB2, LRRC30, and LAMA1) are entirely encompassed by their respective gains. LOC400643 is a non-coding RNA and has no known association with cancer. 
There are no reports of differential expression in pancreatic cancer or of somatic mutations in DYNLRB2, which codes for a light-chain component of the cytoplasmic dynein 1 complex, but this gene is reported to be involved in TGF-beta/SMAD3 signaling 550 and to be downregulated in hepatocellular carcinoma 551. LRRC30, which codes for leucine-rich repeat-containing protein 30, has no reports of differential expression in pancreatic cancer or other association with tumorigenesis, but does have two reported mutations in the COSMIC database (one nonsense mutation in

ovarian serous carcinoma and one missense mutation in hepatocellular carcinoma). LAMA1 codes for laminin, an extracellular matrix component that binds to cells via high-affinity receptors and mediates attachment, migration, and organization of cells into tissues during embryogenesis. 552 The COSMIC v.55 database reports 18 protein-altering or truncating somatic mutations in this gene in tumors of the pancreas, ovary, central nervous system, large intestine, breast, upper aerodigestive tract, and skin. In comparison, for 10,849 COSMIC v.55 database genes that had at least one non-silent/non-intronic mutation, the average number of mutations per gene is 3.7. A similar average number of reported somatic mutations is observed in genes affected by CNVs in our study (determined from the compiled data of 500K and Affymetrix 6.0 arrays): 3.6 mutations per gene for FPC-specific genes (p=0.983), 3.4 mutations per gene for all FPC genes (p=0.821), and 3.7 mutations per gene for all control genes (p=0.955). There is also evidence for differential expression of LAMA1 in tumors of sites other than the pancreas: one study reported hypermethylation and under-expression of LAMA1 in colorectal cancer 553, while another study reported overexpression of this gene in glioblastoma 554. Lastly, for non-complex CNV loci (i.e. only losses or gains per locus), we performed Fisher's exact testing to determine whether any loci had a significantly different proportion in cases relative to controls. After multiple-testing correction, no loss or gain locus demonstrated a significant difference.

5. Discussion

Identifying predisposition genes associated with FPC has been challenging due to the rapid lethality of the disease, low rate of tumor resection (resulting in paucity of tissue specimens for analysis), and probable genetic heterogeneity. An estimated 20% of hereditary cases are linked to cancer syndromes caused by alterations in known genes. 
However, most families that demonstrate clustering of pancreatic cancer do not meet criteria for known cancer syndromes. 161 We performed an analysis of germline CNVs in pancreatic cancer patients suspected to have a heritable genetic cause for their disease. These primarily included members of families with three or more affected cases, but also included families with only one or two affected cases if at least one of the cases was under age 50 at diagnosis. Three different computational algorithms were used for CNV identification in each array to identify high-confidence CNVs, an approach that is commonly used in CNV studies. One advantage of utilizing different algorithms is improved sensitivity for detecting CNVs, since multiple studies have illustrated significant non-overlap between algorithms. For our purpose, the use of multiple CNV-calling algorithms identified variants with a very high likelihood of validation (the high-confidence set), as verified by qPCR and/or second-array hybridization. This allowed us to focus our downstream analysis on these high-confidence CNVs, whose expected validation rate was > 95%, rather than low-confidence CNVs (meaning those called by only a single algorithm on a single chip), of which only half appeared to be true

genomic alterations. Those results are in keeping with a recently published comparative assessment of CNV-calling algorithms and platforms. 555 While we acknowledge that our CNV list is not exhaustive, this is a logistical limitation of the field, as it is not feasible either to genotype hundreds of samples on multiple platforms or to perform qPCR validation on hundreds of CNVs. Thus, our approach at least ensured that we were working with a highly valid set of data. Interestingly, we noted a discrepancy in the proportion of high-confidence CNVs between samples genotyped at TCAG in Toronto (all the cases and a small subset of controls) and those genotyped in Quebec (most controls). We attributed this difference to an apparently higher level of noise in control arrays genotyped in Quebec. Pinto et al. 555 commented on the effect of inter-laboratory variability on CNV validation rate, finding it to be less important than reproducibility of the chosen platform or calling algorithm. However, they do note that Affymetrix arrays (the platform used in our study) are an exception to this, being highly dependent on the reference data set used for the analysis. Since we used the total number of samples within each group (i.e. those genotyped at each centre constituted a group) as reference, a noisier set of data from the Quebec samples would be expected to result in a greater proportion of noisy and/or unreliable calls. We expect that some of the control low-confidence CNVs would in fact be real calls, so we advocate that CNVs of interest that are to be investigated further should be checked against CNV calls in controls, and that those control calls should be validated before further analysis. (We performed such validation for the CNV G_97 that overlapped TGFBR3; it appeared to overlap a low-confidence duplication in an ARCTIC control, but this putative gain was demonstrated by qPCR to be a false call). 
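The distinction drawn above between high-confidence and low-confidence calls can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the algorithm labels (`algo_A`, `algo_B`, `algo_C`), the 50% reciprocal-overlap threshold, and the rule "supported by at least two algorithms" are all assumptions chosen for illustration, and support from a second array is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One CNV call from one (hypothetical) calling algorithm."""
    algorithm: str
    chrom: str
    start: int
    end: int
    kind: str  # "gain" or "loss"


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Fraction of the longer-relative overlap shared by both calls (0.0 if none)."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def classify(calls, min_overlap=0.5, min_algorithms=2):
    """Label each call "high" if enough distinct algorithms support it, else "low".

    Both thresholds are illustrative assumptions, not values from the thesis.
    """
    labelled = []
    for call in calls:
        supporters = {call.algorithm}
        for other in calls:
            if (other.algorithm != call.algorithm
                    and reciprocal_overlap(call, other) >= min_overlap):
                supporters.add(other.algorithm)
        label = "high" if len(supporters) >= min_algorithms else "low"
        labelled.append((call, label))
    return labelled


if __name__ == "__main__":
    example = [
        Call("algo_A", "chr1", 1000, 2000, "loss"),
        Call("algo_B", "chr1", 1100, 2100, "loss"),
        Call("algo_C", "chr8", 5000, 6000, "gain"),
    ]
    for call, label in classify(example):
        print(call.algorithm, call.chrom, call.start, call.end, call.kind, label)
```

In this toy input the two overlapping chr1 losses support each other, whereas the chr8 gain is seen by a single algorithm only and would fall into the low-confidence set.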
To date, this is the largest study of germline CNVs in unrelated cancer patients from high-risk families. A previous study of 57 pancreatic cancer patients from 56 high-risk kindreds (each containing at least a pair of affected first-degree relatives) used an oligonucleotide-based CGH platform to identify FPC-specific germline CNVs, filtering out losses or gains that were also identified in 607 controls (372 were analyzed in the same study, and 235 were previously reported in two other studies). 345 Twenty-five FPC-specific losses overlapping 81 genes and 31 FPC-specific gains overlapping 425 genes were identified. In our study, we investigated 133 members of 131 high-risk kindreds, of whom 17 subjects were part of the previous CGH study, and we identified 93 FPC-specific CNVs using a combination of Affymetrix 500K and Affymetrix 6.0 arrays. The median size of FPC-specific CNVs in the CGH study was larger than in our FPC-specific CNVs (losses: 151kb vs. 35.5kb; gains: 379kb vs. 73kb). This may be due, in part, to the lower resolution of the CGH platform (mean inter-marker distance = 30kb) compared to the Affymetrix 500K array (median inter-marker distance = 2.5kb) and Affymetrix 6.0 array (median inter-marker distance = 0.7kb) used in our study. It may also reflect enrichment for somatic CNVs caused by EBV-transformation, since all FPC DNA samples in the CGH study were extracted from EBV-transformed cells whereas only 29 samples in our population were EBV lymphoblasts. The size of control populations used to filter CNVs was larger in our study and the number of control CNVs from non-BAC studies currently catalogued in the DGV is greater than was available at the time of publication of the previous FPC CNV study. As a result, some of the CNVs identified as FPC-specific in the previous study overlapped CNVs in our controls and/or in the DGV. This may explain the slightly higher (FPC-specific CNVs)-to-sample ratio observed in the CGH study (approximately 1 CNV per sample) compared to our study (0.8 CNV per sample). It is difficult to estimate concordance in CNV calling between the two studies, as we do not know how many of the 56 FPC-specific CNVs reported in the CGH study were identified in samples that were also used in our study. Only 1/25 loss and 3/31 gain loci reported in the CGH study were also observed in our analysis in samples common to both studies, and all of these overlapped CNVs in our controls and/or in the DGV. Interestingly, multiple reports have demonstrated generally low concordance for CNV calling on different platforms/algorithms when analyzing the same DNA source. 259,555 In addition to CNVs identified in cases common to both studies, there was one FPC-specific loss locus which was identified in two different subjects (one in each study). The region overlapped a gene, DOCK1 (dedicator of cytokinesis 1), but in our study the loss only encompassed an intronic portion of the gene. This gene may have a role in cellular proliferation and migration 556,557, and it has been reported to be overexpressed in high-grade dysplastic lesions (PanIN3), suggesting that it may be important in advancing tumorigenesis. 558 A number of other genome-wide germline CNV analyses have been reported for various cancers, but only a few have studied familial cancers. 
In addition to the aforementioned familial pancreatic cancer study, microarray-based germline CNV studies have been reported for Li-Fraumeni syndrome 348, young-onset and/or familial colorectal cancer in families without mutations in known predisposition genes 347, and BRCA1-associated ovarian cancer. 346 Shlien et al. 348 described an increased frequency of germline CNVs in 33 Li-Fraumeni family members carrying mutations in the TP53 gene (of which 23 were affected by cancer), compared to 20 Li-Fraumeni family members with wildtype TP53 and 70 healthy controls. Since many of the CNVs overlapped or were near important cancer genes, the authors proposed a model whereby baseline genomic instability in these patients progresses over time, leading to more frequent and larger copy number alterations affecting genes that contribute to tumorigenesis. In our study, patients and controls had a similar number of alterations per genome, with similar CNV size, ratios of losses to gains, likelihood of CNVs to overlap genes, and proportion of genic CNVs that were associated with cancer. The lack of significant difference in the germline CNV profile between cases and controls suggests that causative genes for pancreatic cancer do not significantly impact genomic stability in non-tumor cells. Our results are similar to those of Yoshihara et al. 346 who compared 68 Japanese subjects with germline

BRCA1 mutations (of whom 51 had ovarian cancer), 34 sporadic ovarian cancer patients, and 47 healthy controls. They reported no significant difference in the per-genome total number of CNVs between BRCA1-mutation carriers and controls, although the number of deletions was higher in the BRCA1 subjects. Otherwise, they found no evidence for differential clustering of the global CNV data between groups, and no correlation of age at diagnosis with CNV frequency. Our proposal for CNV prioritization emphasized regions that segregate with disease in the same family and/or overlapping CNVs in multiple unrelated cases (and absent in controls). We only had access to CNV data for two sets of relatives (two sibling pairs), neither of which demonstrated evidence of FPC-specific CNVs that were co-inherited within the same family. When looking at overlapping CNVs in cases, one region that caught our interest contained two overlapping duplications in two unrelated cases, both of which intersected the TGFBR3 gene. While none of the ARCTIC controls had a validated CNV in this region, a single POPGEN control from the Affy6 dataset contained a duplication that overlapped the cases' duplications. However, the control's duplication did not intersect the gene to the same extent as those of the cases, and in fact only appeared to transect the 5′ end of one of multiple isoforms of the gene (whereas the cases' duplications intersected all isoforms). The significance of the TGF-beta pathway in cancer initiation and progression in general, and in pancreatic cancer in particular, made this duplication especially interesting to us. We successfully validated this CNV in both affected cases, and we further demonstrated that it was heritable in one of the two families for which we had access to DNA from multiple relatives. 
Furthermore, we successfully identified the exact breakpoint of the duplication, proving in the process that it is a tandem duplication, by a combined approach of qPCR walk-along and Sanger sequencing of a PCR-amplified fragment. This breakpoint contained three base pairs that do not appear to be derived from the sequence of either end of the duplication (TAT), which is a common finding at the breakpoints of duplications caused by non-homologous end-joining (NHEJ). 559 However, once we were able to design a sufficiently small fragment containing the region of the breakpoint to test its presence in FFPE-derived tissue from an affected sister of the proband, we found that this duplication does not cosegregate with pancreatic cancer in that family. This effectively refuted the implication of this duplication as a cause for familial pancreatic cancer. (We also note that the breakpoints of both case duplications fell within intronic regions of TGFBR3, further decreasing the likelihood of disrupting the gene). While this direction in our investigation ultimately proved fruitless, it confirmed the challenge of interpreting the impact of CNVs, an aspect of CNV research that has lagged behind the ability to detect CNVs or statistical methods for performing genome-wide association studies using CNVs as disease markers. As illustrated by our effort, the process of fine-mapping CNV breakpoints is painstaking but necessary to understanding the precise region that is transected by a duplication or deletion. And even that alone is not sufficient to prove that a CNV causes a particular phenotype; for that, further functional

work would be required, such as demonstration of expression correlation to copy number, and impact of altered expression on cellular function. Next in priority for our analysis were single-case FPC-specific CNVs that overlapped genic regions. We identified 88 genes whose coding regions were partially or completely encompassed by FPC-specific CNVs, and although some are unlikely to be candidate FPC genes (e.g. olfactory receptor genes), many are functionally relevant to carcinogenesis, and are differentially expressed and/or overlap regions that are reported deleted or amplified in pancreatic adenocarcinoma. Moreover, the proportion of genes that were reported in COSMIC v.55 to have protein-altering mutations in tumors or malignant cell lines was significantly higher in FPC-specific genes than in either the full population of cases or in controls. This further suggests that FPC-specific CNVs are enriched for cancer-associated genes. In the report by Yoshihara et al. 346, the primary genetic etiology for the hereditary cancer was already known (BRCA1), and the authors presented genes overlapped by BRCA1-specific CNVs as potential modifiers to the development of cancer. Alternatively, the study by Venkatachalam et al. 347 identified seven genic CNVs specific to patients with familial colorectal cancer who have no known genetic mutation, each CNV found in a single individual only. In that study, like ours, each gene is considered a potential causative gene for familial colorectal cancer. None of the genes overlapped by cancer patient CNVs reported by Shlien et al. 348, Yoshihara et al. 346, or Venkatachalam et al. 347 were part of our FPC-specific gene list. It should be noted that, in addition to the RefSeq genes we highlighted in this paper, 6 FPC-specific gains and 11 FPC-specific losses that did not overlap RefSeq genes did overlap expressed human mRNA. 
While these regions are of lower interest relative to bona fide genes, some published CNV studies have reported associations of non-genic regions with disease, demonstrating evidence for hitherto unidentified genes and/or regulatory elements. 340,344 The final stage of our CNV prioritization involved calculating the difference in proportion of cases vs. controls containing each simple loss or simple gain locus. This approach would theoretically identify CNVs that are detected in both cases and controls but at a higher frequency in cases. No locus achieved a statistically significant p-value after multiple-testing correction. This was not unexpected, since the number of cases included in our analysis was too small for the purpose of identifying a significant genome-wide association result, unless a very high effect size was associated with a CNV. Furthermore, the biases inherent in the design of our study (e.g. the Affymetrix 500K array is suboptimal for detecting recurrent CNVs relative to rare CNVs, the differences in noise level and high-confidence CNV calling between cases and controls) meant that such an analysis would be inappropriate with our dataset. A properly designed genome-wide association CNV study requires a well-validated platform for genotyping CNVs, such as the Affymetrix 6.0 array, and the necessary sample size for achieving sufficient power in the statistical analysis. Alternatively, we note that some of the loci in our study had a significant p-value

with a higher case frequency before multiple-testing correction, and those regions can be selected for further testing in an independent case-control study that directly genotypes the CNVs of interest (for example using a PCR-based approach). Such a technique was in fact utilized by Huang et al. 344 for identifying the 6q13 deletion associated with pancreatic cancer. In conclusion, we have presented a list of candidate predisposition genes for FPC overlapped by germline CNVs that are specific to the largest cohort of high-risk pancreatic cancer patients published to date. One limitation of our analysis is the coverage and resolution of the platform we used for primary CNV discovery (i.e. Affymetrix 500K array). Since the completion of our study, novel methods of CNV detection have become available, including very high resolution tiling microarrays and next-generation sequencing. We expect future studies using these methods to independently test our findings and detect additional FPC candidate genes. Some of the samples containing FPC-specific CNVs in our study differed in ancestry from the majority of controls, raising the possibility that these CNVs are specific to the respective ancestry group rather than to pancreatic cancer risk. Those CNVs should be investigated further in a larger ethnicity-matched control cohort. Despite these limitations, our list of FPC-specific genes contains several interesting candidates and further screening for mutations in other high-risk pancreatic cancer subjects, along with investigation of the functional role of these genes, would add support to the role of one or more genes in predisposition to FPC.
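The per-locus case-versus-control comparison discussed in this chapter can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the two-sided Fisher's exact test is implemented directly from the hypergeometric distribution, and Bonferroni adjustment stands in for the multiple-testing correction, whose exact method is not specified here. The locus names and counts in the usage example are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    a = cases carrying the CNV, b = cases without it,
    c = controls carrying the CNV, d = controls without it.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x carriers among the cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(p, 1.0)


def bonferroni(pvalues):
    """Bonferroni-adjusted p-values (an assumed choice of correction)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


if __name__ == "__main__":
    # Hypothetical loci: (cases with CNV, cases without, controls with, controls without).
    loci = {
        "locus_1": (5, 128, 2, 998),
        "locus_2": (3, 130, 20, 980),
    }
    raw = [fisher_exact_two_sided(*counts) for counts in loci.values()]
    for name, p, p_adj in zip(loci, raw, bonferroni(raw)):
        print(f"{name}: raw p = {p:.4g}, adjusted p = {p_adj:.4g}")
```

A locus that is nominally significant before adjustment but not after it would be the kind of region the text suggests carrying forward to an independent, directly genotyped case-control study.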

Chapter 4 - Exome Sequencing in a Familial Pancreatic Cancer Kindred

1. Abstract

In recent years, the significant drop in the cost of next-generation sequencing and target-region enrichment has enabled researchers to use whole-exome sequencing for identification of predisposition genes for a variety of Mendelian disorders, a few of which have been familial cancer syndromes. In this study, we aimed to apply this novel method to investigate the genetics of a family containing four relatives affected by pancreatic cancer. Blood-derived DNA was available from three affected relatives (two siblings and their maternal uncle), and we also included an unaffected maternal aunt as a control. Target-enrichment was performed using the Nimblegen in-solution array and sequencing was performed on the Illumina GAII parallel sequencer. We present two alternative hypotheses: (1) in this family, rare variants that are inherited by the three affected individuals and not inherited by the unaffected aunt are candidate susceptibility genes for familial pancreatic cancer; and (2) in this family, rare variants that are inherited by the three affected individuals, whether or not they are present in the unaffected aunt, are candidate susceptibility genes for familial pancreatic cancer. We present four potential variant filtration models to develop a list of candidate genes for further investigation, but we focus our downstream analysis on one model. The validation rate for heterozygous single nucleotide variants and indels was high (> 80%) but significantly lower for homozygous variants. In Model #1 of our analysis, we identify 9 candidate genes with heterozygous single nucleotide variants in the three affected family members and absent in the unaffected aunt, of which we further investigate the two top-ranked genes using Sanger sequencing in a cohort of unrelated high-risk pancreatic cancer patients. We do not identify further subjects with unreported variants in those genes. 
Further investigation of other genes in this model and the other three filtration models will be possible in future exome sequencing studies on other pancreatic cancer patients.

2. Introduction

In the previous chapter, we performed a genome-wide analysis of germline CNVs in pancreatic cancer patients from high-risk families to identify candidate susceptibility genes. As was discussed, this was based on the hypothesis that a proportion of syndromic cancer cases occur due to large rearrangements affecting the causative gene. It remains, though, that most variants which cause hereditary cancer are point mutations, most commonly occurring in coding regions or splice-sites, thus altering the encoded protein or causing premature termination. Until recently, such variants could only be identified by a candidate-gene approach and laborious Sanger sequencing. However, the development of target-capture arrays for building DNA libraries enriched for coding regions ("the exome"), in combination with

decreasing cost of massively parallel next-generation sequencing, has enabled interrogation of entire genomes for susceptibility variants. Indeed, over the past couple of years, a large number of exome-based studies have been published identifying causative genes for heretofore unexplained Mendelian diseases. (See Literature Search for more details). While the number of studies specifically addressing cancer syndromes has been small, it is evident that a similar strategy can be applied to identifying susceptibility genes in individuals or families who appear to inherit the disease in a Mendelian fashion (dominant or recessive). Therefore, we chose a family consisting of two affected siblings, their affected mother, and an affected maternal uncle to investigate using exome sequencing. The CNV profile of the two siblings was already characterized in Chapter III (CNV-case ID-89 here identified as ID-001 and CNV-case ID-30 here identified as ID-006), and all deletions and gains segregating in the two siblings were also found in controls. (Indeed, only one deletion was FPC-specific, found exclusively in sibling ID-006, and it occurred in a non-genic region. This CNV was not identified in sibling ID-001). For the study described in this chapter, blood-derived DNA was available for the two siblings and their affected maternal uncle (but not their mother). We chose to also include DNA from an unaffected maternal aunt to act as a control for filtering out candidate variants, with the hypothesis that all three affected relatives would be carriers of a high-penetrance variant and that the 80-year-old unaffected aunt is unlikely to be a carrier. However, we acknowledge that, since we do not know the penetrance of the gene in question, the aunt may also be an unaffected carrier. 
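The family-based filtering idea behind the two hypotheses of this chapter (variants shared by the three affected relatives, with the unaffected aunt either used as an exclusion filter or ignored) can be sketched as follows. This is a minimal illustration only: the representation of a variant as a `(chromosome, position, ref, alt)` tuple, the `known_common` set standing in for a rarity filter, and the function name are assumptions, and the sketch does not attempt to reproduce the four filtration models defined in the methods.

```python
# Sample identifiers follow the text: three affected relatives and one unaffected aunt.
AFFECTED = ("ID-001", "ID-006", "ID-011")
UNAFFECTED = "ID-010"


def shared_rare_variants(calls, known_common, exclude_unaffected):
    """Return variants carried by all affected relatives after rarity filtering.

    calls: dict mapping sample ID to a set of variant tuples
           (chromosome, position, ref, alt); hypothetical representation.
    known_common: set of variants treated as already reported/common
                  (an assumed stand-in for database-based filtering).
    exclude_unaffected: True for the first hypothesis (aunt is a non-carrier),
                        False for the second (aunt's status is ignored).
    """
    shared = set.intersection(*(calls[sample] for sample in AFFECTED))
    shared -= known_common
    if exclude_unaffected:
        shared -= calls[UNAFFECTED]
    return shared


if __name__ == "__main__":
    # Entirely hypothetical toy variants.
    v1 = ("chr1", 100, "A", "G")
    v2 = ("chr2", 200, "C", "T")
    v3 = ("chr3", 300, "G", "A")
    toy_calls = {
        "ID-001": {v1, v2, v3},
        "ID-006": {v1, v2, v3},
        "ID-011": {v1, v2, v3},
        "ID-010": {v2},
    }
    print(sorted(shared_rare_variants(toy_calls, {v3}, exclude_unaffected=True)))
    print(sorted(shared_rare_variants(toy_calls, {v3}, exclude_unaffected=False)))
```

Keeping the aunt's exclusion as a switch makes the trade-off explicit: excluding her variants yields a shorter candidate list but would discard the causal variant if she is an unaffected carrier.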
For that reason, we also present an alternate hypothesis that considers the aunt a possible carrier of the variant of interest, and thus identifies variants shared among the affected members whether or not they are present in the aunt. In the methods below, filtration models #1 and #3 fall under the first hypothesis: variants inherited by the three affected relatives and absent in the unaffected relative are in candidate susceptibility genes for FPC; filtration models #2 and #4 are based on the second hypothesis: variants inherited by the three affected relatives are in candidate susceptibility genes for FPC, regardless of inheritance in the unaffected family member. As described in this chapter, we only focus our downstream investigation and candidate gene screening on results from model #1, pertaining to the first hypothesis.

3. Materials & Methods

3.1 Description of Family C

We identified a consanguineous family of Maltese ancestry with a strong history of pancreatic cancer: the proband was a male (ID-001) who presented with metastatic pancreatic cancer at age 42 years; soon after, one of his sisters (ID-006) also presented with metastatic pancreatic cancer at age 34 years. Neither patient had a resectable tumor and both subjects died within one to two years of diagnosis. The two

siblings were part of a sibship of seven (in addition, their mother had two miscarriages); a brother was affected with low-grade B-cell follicular lymphoma at age 45 and remains alive and free of disease today at age 49. Their mother had previously undergone a pancreaticoduodenectomy for pancreatic cancer at age 58 but also died soon after from disease recurrence. Several years later, a maternal uncle (ID-011) developed metastatic pancreatic cancer at age 80 while enrolled in an MRI-based screening program and died of his disease. Figure 17 illustrates the pedigree of the family.

Figure 17 Pedigree of FPC kindred investigated by exome sequencing (labelled individuals: ID-011, ID-010, ID-001, ID-006)

Figure 17 Legend: Large red box indicates affected mother without available DNA for sequencing; blue circles indicate family members on whom exome sequencing was performed. Filled box = affected male; filled circle = affected female; unfilled box = unaffected male; unfilled circle = unaffected female. ("Affected" refers to pancreatic cancer.)

Blood samples were taken from all seven siblings (including the two pancreatic cancer patients before they died), as well as from the affected maternal uncle and an unaffected maternal aunt (ID-010). No blood sample was available for the mother. DNA was extracted from blood samples as per the previously described protocol (see Chapter II of this thesis).

3.2 Target-capture, next generation sequencing, and raw-data analysis

[Note: DNA library preparation and sequencing, alignment of reads, and variant calling were performed by members of Dr. John McPherson's lab at Ontario Institute for Cancer Research (Quang Trinh). Data were provided to W. Al-Sukhni for validation and downstream variant filtration and subsequent Sanger sequencing in other patients. Most PCR amplification for variant validation and screening in other

patients described in this chapter was performed by W. Al-Sukhni, with assistance from H. Kim and T. McPherson.]

DNA samples from the siblings, uncle, and aunt were enriched for exomic regions using Nimblegen SeqCap EZ Human Exome Library v2.0, as per the manufacturer's protocol. This in-solution array contains 2.1 million empirically optimized oligonucleotide probes targeting approximately 300,000 exons based on annotation of the consensus coding sequence (CCDS) project (Sep 2009) 374, RefSeq database (Jan 2010) 560 and miRBase database (v.14, Sep 2009) 561, with a total target size of approximately 35Mb. Resulting DNA libraries were sequenced on the Illumina GAII next-generation sequencer using the standard paired-end 2x101 sequencing procedure provided by Illumina, generating 101-bp reads to align against the reference genome. For ID-001 and ID-006, the data in this analysis were generated by six sequencing lanes each; for ID-010, two lanes were used; and for ID-011, three lanes were used. Raw data were processed through an empirically validated workflow. First, basic quality controls (QC) such as number of reads, average base quality per cycle, and percentage of bases with their corresponding Phred quality values were examined on each lane of raw data. Next, raw reads were aligned to the reference human genome (GRCh37) using Novoalign 562, and only uniquely aligned reads were included for downstream analysis. After documenting several QC parameters (e.g. 
% of reads aligned, % of reads aligned in correct orientation, % of reads aligned only as singletons), duplicated fragments with exactly the same start and end points were presumed to be PCR artifacts and were removed ("collapsing") using Picard command-line tools. Further QC parameters assessed at this point included the percent of reads aligned before and after collapsing, the proportion of the target region covered at least once by sequencing, the percent of bases covered at incrementally higher depth of coverage, and the average depth of coverage across the captured target region. The data were then processed through GATK 563 software for quality recalibration, local realignment, and variant/indel calling. Variants passing a minimum quality score threshold of 30 were considered reliable. A minimum read depth of 8x was considered necessary to call a variant, and the maximum allowable number of single nucleotide variants (SNVs) in a 10-base window was two. Heterozygosity/homozygosity for each variant was also estimated by GATK.

3.3 Validation of variants

Validation of exome sequencing data was performed by two approaches. First, we took advantage of the fact that the two siblings were previously genotyped on the Affymetrix 500K array for the CNV project (see Chapter III of this thesis). We identified SNPs common to both platforms for each sample and checked the concordance rate in genotype calls between the two platforms. The microarray genotype calls

were determined using the Affymetrix Genotyping Console (GTC 2.1), which uses the BRLMM 564 algorithm for assigning genotypes. This algorithm has >99% accuracy in detecting homozygous and heterozygous variant alleles. (Note that we were not able to directly identify SNPs that were homozygous for the wildtype (i.e. reference) allele in the exome data, since only variants were called and provided to us. As a result, we can only comment on the concordance of heterozygous and homozygous variant calls in exome data in relation to the microarrays.) Second, for variants that did not appear in the dbSNP database at the time of initial sequence results (identifying the variant as "novel"), we performed Sanger sequencing to validate the variants. To confirm each variant, sequencing was performed in both the sense and anti-sense directions. We calculated specificity and sensitivity of heterozygous variant calling in exome data as follows:

Specificity = TN/(TN+FP), where TN = true negative (no variant is called in either the exome data or Sanger sequencing) and FP = false positive (variant called by exome data but not validated by Sanger sequencing)

Sensitivity = TP/(TP+FN), where TP = true positive (variant called by exome data and validated by Sanger sequencing) and FN = false negative (variant not called by exome data but identified by Sanger sequencing)

For the above definitions, we excluded homozygous calls and calls where the exome data indicate that the allele is different from the reference genome but misidentify the allele (e.g. exome analysis calls a G>A variant, but Sanger proves the variant to be G>T).

3.4 Filtering strategy

All SNVs and indels within the exome-capture target regions were identified, and SIFT 565 was used to annotate the synonymous/non-synonymous/frameshift/non-frameshift nature of each SNV or indel. Synonymous variants (i.e. no alteration in amino acid) were identified and removed. In addition, variants reported in dbSNP131 were removed. 
Only coding-region and/or splice-site variants (up to +/- 3bp from exons) were included in the final list per subject. We then screened the excluded variants for any with a very low minor allele frequency (< 0.2%), or that are known somatic variants in cancer, that should be re-included in our list (since dbSNP does contain some somatic variants). To identify candidate susceptibility genes for the pancreatic cancer in this family, we adopted four filtering approaches:

Model#1 - Assuming the two siblings and the uncle are all carriers of the responsible variant, and that the unaffected aunt is not a carrier, we identified variants in common to the siblings + uncle and absent in the aunt.

Model#2 - To account for incomplete penetrance of the susceptibility gene, we assumed the aunt may or may not be a carrier and identified variants in common to the siblings and uncle, whether or not present in the aunt as well.

Model#3 - To account for the lower coverage in the uncle, we assumed the two siblings are carriers and the aunt is not a carrier and identified variants in common to the siblings and absent in the aunt, whether or not called in the uncle.

Model#4 - To account for lower coverage in the uncle and incomplete penetrance of the gene, we assumed the two siblings are carriers and identified variants in common to the siblings, whether or not present in the aunt and/or uncle.

For each model, the final list of variants was manually curated by screening in dbSNP135, which includes results from the first phase of the 1000 Genomes 246 project (low-coverage genome-wide sequencing of 180 samples, sufficient to call most variants at 1% minor allele frequency, and deep sequencing of exons captured for 1000 genes in 900 individuals, sufficient to call rare and low-frequency variants in the coding region of these exons). We also screened the variants in the Exome Sequencing Project 566, a collaborative project that is sequencing thousands of genomes from large, well-phenotyped cohorts. To date, data for approximately 5,400 samples are available online. For the purpose of this analysis, since cancer syndromes are typically caused by high-penetrance, rare variants, we removed variants that appear with a frequency >0.2% in the 1000 Genomes or Exome Sequencing Project data. 
572 For indels, we individually inspected the region of the genome near the putative variant to verify that it is indeed novel based on the latest information in dbSNP135, since in some repetitive regions the exact position of the indel can be called differently by different algorithms. For the remaining variants under each model, we identified the predicted effect of variants using SIFT 565 and Polyphen-2. We also determined whether the genes containing the variants have been reported to be differentially expressed in pancreatic adenocarcinoma or pre-invasive lesions (in the Pancreatic Expression Database) 253, as well as whether they have reported somatic mutations in cancer (as catalogued in the COSMIC database 549). We also compared the list of genes generated from this analysis with the list of genes affected by coding-region CNVs, reported in Chapter III of this thesis.
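The four filtering models reduce to set intersections and differences over per-subject variant calls. The following is a minimal sketch, assuming each subject's calls are held as a Python set of (chromosome, position, reference, alternate) tuples; the function names and the variant keys are hypothetical and are not taken from the pipeline used in this study.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def apply_models(
    sib1: Set[Variant],
    sib2: Set[Variant],
    uncle: Set[Variant],
    aunt: Set[Variant],
) -> Dict[str, Set[Variant]]:
    """Return the candidate variant set produced by each filtering model."""
    shared_sibs = sib1 & sib2             # called in both affected siblings
    shared_affected = shared_sibs & uncle  # also called in the affected uncle
    return {
        "model1": shared_affected - aunt,  # siblings + uncle, absent in aunt
        "model2": shared_affected,         # siblings + uncle, aunt ignored
        "model3": shared_sibs - aunt,      # siblings, absent in aunt, uncle ignored
        "model4": shared_sibs,             # siblings only, aunt and uncle ignored
    }


def drop_common(
    variants: Set[Variant],
    population_maf: Dict[Variant, float],
    max_maf: float = 0.002,
) -> Set[Variant]:
    """Remove variants whose population frequency exceeds max_maf (0.2%).

    Variants missing from the frequency table are treated as unreported
    and are kept (an assumption of this sketch).
    """
    return {v for v in variants if population_maf.get(v, 0.0) <= max_maf}
```

By construction, the model #1 list is a subset of both the model #2 and model #3 lists, and all three are subsets of the model #4 list, which is why model #1 is the most stringent.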

Screening candidate genes

We performed PCR amplification and Sanger sequencing to validate top candidates, and also performed Sanger sequencing to screen candidate genes in a cohort of 70 familial and young-onset pancreatic cancer cases. (Primer sequences were previously published by Jones et al. 37.)

4. Results

Table 13 summarizes the number of raw reads generated per sample and the percentage of reads that were aligned after collapsing PCR artifacts.

Table 13 Summary of raw sequence data from Illumina GAII for each subject
Columns: N raw reads; N reads aligned; N reads aligned marked as PCR; % reads aligned marked as PCR; % reads aligned after collapsing; N reads aligned in + strand; % reads aligned in + strand; N reads aligned in - strand; % reads aligned in - strand
Rows: ID-001 (sibling); ID-006 (sibling); ID-010 (aunt); ID-011 (uncle)

Although the two siblings generated approximately twice as many raw reads as the aunt and uncle, only 40-50% of the siblings' reads were ultimately aligned after excluding PCR artifacts, while nearly 80% of the reads for the aunt were aligned. This resulted in an approximately equivalent number of reads for those three samples contributing to the final alignment of each genome. Fewer than 20% of the raw reads generated for the uncle were aligned after excluding PCR artifacts, resulting in significantly lower coverage for the uncle's genome: while each of the four samples had the majority of the target region bases (~35Mb) covered by at least one read (1x), the exome-wide average read depth for the uncle was only about one-tenth the average coverage of the other three samples (~20x vs. 186x) (Figures 18 and 19).

Figure 18 Average coverage of bases in target region of exome per subject

Figure 18 Legend: ID-011 (uncle) had lower average read depth for target exome than the other 3 subjects.
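The coverage metrics discussed here (average read depth, fraction of the target covered at 1x, and fraction covered at the 8x calling threshold) can be derived from a per-base depth vector. The sketch below is illustrative only: the function name and the list-of-integers input are assumptions, and the actual QC workflow used for this chapter is not reproduced.

```python
from typing import Dict, Sequence


def coverage_summary(depths: Sequence[int], min_depth: int = 8) -> Dict[str, float]:
    """Summarize per-base read depth over a target region.

    depths holds one integer per target base; min_depth is the depth
    regarded as sufficient to call a heterozygous variant (8x in the text).
    """
    n = len(depths)
    if n == 0:
        raise ValueError("depths must contain at least one base")
    return {
        "mean_depth": sum(depths) / n,
        "frac_covered_1x": sum(d >= 1 for d in depths) / n,
        "frac_callable": sum(d >= min_depth for d in depths) / n,
    }
```

A sample can have nearly all target bases covered at 1x and still have a much smaller callable fraction, which is the situation described for the uncle.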

Figure 19 Read-depth per base in target region of exome in each subject

Of note, an accepted minimum threshold for accurate identification of a heterozygous variant (in previous papers and by the lab performing sequencing) is 8x coverage: at this threshold, the algorithm can reliably call a heterozygous variant across approximately 94-95% of the target region in the siblings and aunt but across only 82% in the uncle.

4.1 Validation

For siblings ID-001 and ID-006, a total of 1,985 and 1,995 SNPs, respectively, were identified as having a heterozygous or homozygous non-reference allele in the exome data and were also genotyped on the Affymetrix 500K array. Of those, 473 variants in ID-001 and 439 variants in ID-006 were discordant between the exome data and the microarray genotypes; 318 of those were discordant in both siblings, the majority of which were identified as wildtype on the microarray and homozygous variant in the exome data. For ID-001, 1,086/1,103 (98.5%) SNPs identified as heterozygous in the exome data were concordant with the microarray results, while only 426/882 (48.3%) SNPs identified as homozygous variant in the exome data were concordant with the microarray results (p<0.0001). The results for ID-006 were nearly identical: 1,122/1,141 (98.3%) of heterozygous SNPs and 434/854 (50.8%) of homozygous variant alleles called by the exome data were concordant with microarray genotypes (p<0.0001). We also performed Sanger sequencing on 38 SNVs that were unreported in dbSNP131, including eight putatively novel homozygous variants (Table 14).
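The concordance figures above follow directly from the reported counts. As a worked check, the short sketch below recomputes the percentages and the totals of compared and discordant SNPs; the counts are those quoted in the text, while the dictionary layout and function name are illustrative assumptions.

```python
# (concordant, total) SNP counts per sibling, as reported in the text.
COUNTS = {
    "ID-001": {"het": (1086, 1103), "hom": (426, 882)},
    "ID-006": {"het": (1122, 1141), "hom": (434, 854)},
}


def summarize(sample: str) -> dict:
    """Recompute concordance percentages and discordant totals for one sibling."""
    het_conc, het_total = COUNTS[sample]["het"]
    hom_conc, hom_total = COUNTS[sample]["hom"]
    return {
        "snps_compared": het_total + hom_total,
        "discordant": (het_total - het_conc) + (hom_total - hom_conc),
        "het_concordance_pct": round(100 * het_conc / het_total, 1),
        "hom_concordance_pct": round(100 * hom_conc / hom_total, 1),
    }


for sample_id in COUNTS:
    print(sample_id, summarize(sample_id))
```

The recomputed totals (1,985 and 1,995 SNPs compared; 473 and 439 discordant) agree with the values stated in the text, and show that nearly all discordance comes from homozygous variant calls.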

121 108 Gene ABCC12 ADAMTS20 APLF ASTN2 AZI1 C14orf102 C1orf65 CCDC141 CEP110 CREBBP MUC7 PCYOX1 RASSF6 SEZ6L2 SFRS2IP TAF5L CYP2C9 AGL ARAP1 RPA1 AKAP7 NEIL3 C9 RAPGEF3 SERPINB3 C2orf24 KDM4C Variant (hg19) chr16: G/A chr12: T/G chr2: A/T chr9: A/G chr17: T/C chr14: G/A chr1: G/A chr2: G/A chr9: A/T chr16: G/A chr4: C/T chr2: T/A chr4: A/C chr16: C/T chr12: C/T chr1: C/T chr10: C/A chr1: T/G chr11: T/C chr17: G/C chr6: A/G chr4: G/A chr5: A/G chr12: G/A chr18: A/T chr2: A/C chr9: A/C chr11: G/A Table 14 Sanger validation data for selected SNVs in each exome subject Sib ID-001 (affected) Sib ID-006 (affected) Uncle ID-011 (affected) Aunt ID-010 (unaffected) NGS Sanger NGS Sanger NGS Sanger NGS Sanger het het conc het het conc wt het disc wt wt conc het het conc het het conc wt het disc wt wt conc het het conc het het conc het het conc wt wt conc het het conc het het conc het het conc wt wt conc het het conc het didn t do n/a wt didn t do n/a wt wt conc het het conc het het conc wt het disc wt noisy n/a het het conc het didn t do n/a het didn t do n/a wt wt conc het het conc het het conc het het conc wt wt conc het het conc het het conc het het conc wt wt conc het het conc het didn t do n/a wt didn t do n/a wt wt conc het het conc het het conc het het conc wt wt conc het het conc het het conc het het conc wt wt conc het het conc het het conc het het conc wt wt conc het het conc het het conc wt het disc wt wt conc het het conc het het conc wt het disc wt wt conc het het conc het het conc het het conc wt wt conc het het conc het het conc wt het disc wt wt conc het wt disc het didn t do n/a wt didn t do n/a wt wt conc het het conc het noisy n/a wt wt conc wt didn t do n/a het het conc het het conc wt wt conc wt didn t n/a do het het conc het het conc wt wt conc wt didn t n/a do het het conc het het conc wt wt conc wt didn t n/a do het het conc het het conc wt wt conc wt didn t n/a do het het conc het het conc wt wt conc wt wt conc het het conc het het conc wt wt conc 
wt wt conc het het conc het het conc wt wt conc wt wt conc het het conc het wt disc wt wt conc wt didn t do n/a het het conc het het conc wt didn t n/a wt wt conc do EXPH5 MSH6 chr2: het het conc het het conc het het conc het het conc

PCSK9 ANKRD11 KIAA (1) KIAA (2) USP6 CHRNE TXNDC17 MYH2 G/A chr1: G/T chr16: G/T chr9: G/T homo homo homo homo (diff variant- A) homo (diff variant- A) homo (diff variant- A) chr9: C/T homo homo (diff variant- G) chr17: T/C chr17: G/T chr17: G/A chr17: disc homo homo (diff variant- A) disc homo homo (diff variant- A) disc homo homo (diff variant- A) disc homo homo (diff variant- G) disc wt het (diff variant- G/A) disc wt homo (diff variant- A) disc wt homo (diff variant- A) disc homo het (diff variant- C/G) disc het het (diff variant - G/A) disc het het (diff variant - G/A) disc het het (diff variant - G/A) disc het het (diff variant - C/G) homo homo conc homo homo conc wt wt conc wt wt conc homo homo (diff variant- A) disc homo homo (diff variant- A) disc disc disc disc disc wt wt conc wt wt conc homo homo conc homo homo conc wt wt conc wt het disc homo homo conc homo homo conc wt wt conc wt wt conc C/T

NGS = next-generation sequencing; het = heterozygous variant; homo = homozygous variant; wt = wildtype, i.e. homozygous reference allele; conc = concordant results between next-generation and Sanger sequencing; disc = discordant results between next-generation and Sanger sequencing

For heterozygous exome variants, 53/57 (93%) of calls in the siblings and aunt and 9/9 (100%) of calls in the uncle were concordant with Sanger sequencing (p=1.000); for homozygous exome variants, 6/16 (37.5%) of calls in the siblings and aunt and 0/1 (0%) of calls in the uncle were concordant with Sanger sequencing (p=1.000); for wildtype alleles in the exome data, 24/25 (96%) of calls in the siblings and aunt and 13/22 (59%) of calls in the uncle were concordant with Sanger sequencing (p=0.003). Of the eight homozygous variants called in the two siblings, only three validated as called in the exome data; the remaining five were discovered to be a different homozygous allele by Sanger sequencing. 
Notably, the three accurately called homozygous variants were all novel, whereas the five inaccurately identified variants were at positions of reported SNPs (i.e. the Sanger-sequence allele is the same as that reported in dbSNP). Based on the Sanger sequencing results, the specificity for heterozygous variant calling in our exome data in the siblings and aunt was 24/(24+2)=92% and in the uncle was 13/(13+0)=100% (p=0.544); the sensitivity in the siblings and aunt was 52/(52+1)=98% and in the uncle 9/(9+6)=60% (p<0.001). In addition, we performed Sanger sequencing on 15 indels called in the exome data (Table 15).

Table 15 Sanger validation data for selected indels in each exome subject
Each subject cell lists NGS call / Sanger call / result.
Gene | Variant | Position | Sib ID-001 (affected) | Sib ID-006 (affected) | Uncle ID-011 (affected) | Aunt ID-010 (unaffected)
TUB | ins GAGGATGAG | chr11: | y / y / conc | n / y / disc | n / n / conc | n / n / conc
C22orf40 | del T | chr22: | y / y / conc | y / y / conc | n / n / conc | n / n / conc
WDR92 | ins C | chr2: | y / n / disc | n / didn't test / n/a | n / didn't test / n/a | n / didn't test / n/a
KCNMB3 | del T | chr3: | y / y / conc | n / n / conc | n / n / conc | n / n / conc
C4orf35 | del A | chr4: | y / y / conc | n / n / conc | n / n / conc | n / n / conc
FAM53C | del CCTCAGGCCTGAGCCTGCA | chr5: | y / y / conc | n / n / conc | n / n / conc | n / n / conc
STAG3 | ins G | chr7: | y / n / disc | n / didn't test / n/a | n / didn't test / n/a | n / didn't test / n/a
ARHGAP36 | ins C | chrX: | y / n / disc | n / didn't test / n/a | n / didn't test / n/a | n / didn't test / n/a
NBPF3 | del GTCTCCCAG | chr1: | n / n / conc | y / y / conc | n / n / conc | n / n / conc
ZNF683 | del CCACCGAGCGCTGGGGTGCCCCAG | chr1: | n / n / conc | y / y / conc | n / y / disc | n / n / conc
CLSPN | del TTC | chr1: | n / n / conc | y / y / conc | n / n / conc | n / n / conc
FLVCR2 | del CCCAGCGTCTCGGTCCAT | chr14: | n / n / conc | y / y / conc | n / n / conc | n / n / conc
NUCB1 | del AGCAGC | chr19: | n / n / conc | y / y / conc | n / y / disc | n / n / conc
MNDA | del AGAA | chr1: | n / n / conc | y / y / conc | n / n / conc | n / n / conc
PCDHGA2 | del C | chr5: | n / n / conc | y / y / conc | y / y / conc | y / y / conc

NGS = next-generation sequencing; Sang = Sanger sequencing; y = indel identified; n = indel not identified; conc = concordant results between NGS and Sanger; disc = discordant results between NGS and Sanger

Thirty-nine sequencing reactions were conducted in the siblings and aunt: 14/17 (82%) of indels called in the exome data were validated, and only a single indel in one individual was missed on exome sequencing. There were too few tests in the uncle to identify a significant difference (only one indel was called in this sample set, which validated, and of the 11 indels that were not called in the uncle, 9 were also not observed on Sanger sequencing). 
The specificity of indel calling in the sibs and aunt was 21/(21+3)=88% and in the uncle was 9/(9+0)=100% (p=0.545); the sensitivity in the sibs and aunt was 14/(14+1)=93% and in the uncle was 1/(1+2)=33% (p=0.056).
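The sensitivity and specificity values above, and the comparisons between the uncle and the other three subjects, can be recomputed from the reported counts. The text does not name the statistical test used; the sketch below assumes a two-sided Fisher's exact test on a 2x2 table, implemented with the standard library only. Under that assumption the reported p-values are reproduced, but the choice of test remains an assumption and the function names are illustrative.

```python
from math import comb


def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Specificity = TN / (TN + FP)."""
    return tn / (tn + fp)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Heterozygous SNV calling: siblings + aunt versus uncle (counts from the text).
print(round(specificity(24, 2), 2), round(specificity(13, 0), 2))
print(round(fisher_exact_two_sided(24, 2, 13, 0), 3))
print(round(sensitivity(52, 1), 2), round(sensitivity(9, 6), 2))
print(fisher_exact_two_sided(52, 1, 9, 6) < 0.001)
```

The same functions applied to the indel counts (21 true negatives and 3 false positives versus 9 and 0; 14 true positives and 1 false negative versus 1 and 2) give p-values that round to the 0.545 and 0.056 reported above.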

Filtration results

Table 16 summarizes the number of variants identified in each subject.

Table 16 Number of variants identified in each exome subject
Per-subject values are listed in the order Sibling 001 (affected) | Sibling 006 (affected) | Uncle 011 (affected) | Aunt 010 (unaffected).
All SNVs in-target (autosomes + X chr): 20,665 | 20,822 | 10,815 | 21,930
In-target SNVs excluding synonymous SNVs: 13,551 | 13,413 | 7,267 | 14,328
In-target nsSNVs that are nonsense or missense or splice-site, excluding any corresponding to a position in dbSNP131 (het/homozygous; % homozygous) [% of all nsSNVs]: 298 (282/16; 5.4%) [2.2% of all nsSNVs] | 306 (289/17; 5.6%) [2.3% of all nsSNVs] | 146 (144/2; 1.4%) [2.0% of all nsSNVs] | 325 (319/6; 1.9%) [2.3% of all nsSNVs]
All indels (intronic + exonic):
Exonic and splice-site indels not in dbSNP:
Model#4 - Rare variants in common to siblings (+/- uncle +/- aunt) [truncating mutations (splice-site/nonsense/fs indels)]: 98 SNVs + 5 indels * [9 truncating (3/1/5)]
Model#3 - Rare variants in common to siblings +/- uncle (- aunt) [truncating mutations (splice-site/nonsense/fs indels)]: 68 SNVs + 1 indel * [4 truncating (3/0/1)]
Model#2 - Rare variants in common to siblings + uncle (+/- aunt) [truncating mutations (splice-site/nonsense/fs indels)]: 14 SNVs + 2 indels * [2 truncating (0/0/2)]
Model#1 - Rare variants in common to siblings + uncle (- aunt) [truncating mutations (splice-site/nonsense/fs indels)]: 9 SNVs + 0 indels * [0 truncating]
* Number of combined variants in each model given after excluding olfactory receptor genes and pseudogenes; fs = frameshift; nsSNV = non-synonymous single nucleotide variant

For each of the siblings and the aunt, approximately 20,000-21,000 SNVs in the autosomes and X chromosome were identified within the target region of the exome, at 8x depth of coverage and passing the quality thresholds of the alignment and variant-calling algorithms. For the uncle, the number of variants called under the same threshold parameters was only about half as many as in the other samples. 
For each of the four samples, approximately one-third of called variants were synonymous and were filtered out. Further filtering out of variants reported in dbSNP131 or located in untranslated regions or in introns beyond +/- 3bp from exons (i.e. not splice-site variants) reduced the number of variants per sample to approximately 300 in the siblings and aunt and approximately 150 in the uncle, which is approximately 2% of all nonsynonymous variants (nsSNVs) in each subject. We noted that each sibling had a higher proportion of unreported homozygous variants compared to the uncle and aunt (Sib 001 and Sib 006 = 5.5% vs. Aunt = 1.9%, p=0.03 and p=0.02 respectively; Sib 001 and Sib 006 = 5.5% vs. Uncle = 1.4%, p=0.07 and p=0.04 respectively). While it is possible that some of these may be false calls, this higher degree of homozygosity in the siblings is expected since their parents are first cousins. Figures 20 to 22 illustrate the distribution of SNVs across the 22 autosomes and X chromosome in each subject. For the total SNV group, the pattern of distribution is nearly identical in the siblings and aunt, and fairly similar in the uncle, and there was no significant difference in the pattern after excluding synonymous SNVs.

However, SNVs not reported in dbSNP131 took on a differing chromosomal distribution, and while the new pattern remained consistent across the siblings and aunt, the uncle displayed a visibly differing pattern of variant distribution.

Figure 20 Genome-wide distribution of all SNVs identified in each exome subject

Figure 21 Genome-wide distribution of SNVs excluding synonymous variants in each exome subject

Figure 22 Genome-wide distribution of SNVs not reported in dbSNP131 in each exome subject

Around indels were identified in each of the siblings and aunt, and about 450 in the uncle, but most of those were intronic. Combining unreported protein-altering SNVs and indels, each of the siblings and aunt had approximately 370 potentially significant variants, and the uncle had approximately

Candidate genes

Table 17 lists the genes identified by each filtering model described in the methods section.

Table 17 Genes containing variants identified by filtration model #1, 2, 3, and/or 4
Filtering Model | Variant | VariantType | GeneName | SIFT | Polyphen-2
Model#1/2/3/4 | chr4# #c#t# | nonsynonymous_snv | MUC7 | DAMAGING | unknown
Model#1/2/3/4 | chr1# #g#a# | nonsynonymous_snv | C1orf65 | TOLERATED | benign
Model#1/2/3/4 | chr2# #a#t# | nonsynonymous_snv | APLF | DAMAGING | probably damaging
Model#1/2/3/4 | chr4# #a#c# | nonsynonymous_snv | RASSF6 | DAMAGING | probably damaging
Model#1/2/3/4 | chr9# #a#g# | nonsynonymous_snv | ASTN2 | DAMAGING | possibly damaging
Model#1/2/3/4 | chr2# #g#a# | nonsynonymous_snv | CCDC141 | TOLERATED | benign
Model#1/2/3/4 | chr9# #a#t# | nonsynonymous_snv | CEP110 | DAMAGING | probably damaging
Model#1/2/3/4 | chr2# #t#a# | nonsynonymous_snv | PCYOX1 | TOLERATED | benign

Model#1/2/3/4 | chr1# #c#t# | nonsynonymous_snv | TAF5L | TOLERATED | possibly damaging
Model#2/4 | chr2# #g#a# | nonsynonymous_snv | MSH6 | TOLERATED | benign
Model#2/4 | chr9#711331#g#a# | nonsynonymous_snv | KANK1 | DAMAGING | probably damaging
Model#2/4 | chr13# #c#t# | nonsynonymous_snv | MTMR6 | TOLERATED | benign
Model#2/4 | chr21# #a#c# | nonsynonymous_snv | DOPEY2 | TOLERATED | possibly damaging
Model#2/4 | chr10# #g#a# | nonsynonymous_snv | STAM | DAMAGING | probably damaging
Model#2/4 | chr10# #(+g) | frameshift_indel | ITIH2 | FRAMESHIFT | n/a
Model#2/4 | chr3# #(-t) | frameshift_indel | MCCC1 | FRAMESHIFT | n/a
Model#3/4 | chr16# #g#a# | nonsynonymous_snv | ABCC12 | DAMAGING | probably damaging
Model#3/4 | chr12# #t#g# | nonsynonymous_snv | ADAMTS20 | DAMAGING | benign
Model#3/4 | chr4# #g#a# | nonsynonymous_snv | NEIL3 | DAMAGING | probably damaging
Model#3/4 | chr16# #c#t# | nonsynonymous_snv | SEZ6L2 | DAMAGING | probably damaging
Model#3/4 | chr16# #g#a# | nonsynonymous_snv | CREBBP | TOLERATED | unknown
Model#3/4 | chr17# #t#c# | nonsynonymous_snv | AZI1 | TOLERATED | benign
Model#3/4 | chr11# #t#c# | nonsynonymous_snv | ARAP1 | DAMAGING | possibly damaging
Model#3/4 | chr5# #a#g# | nonsynonymous_snv | C9 | DAMAGING | probably damaging
Model#3/4 | chr10# #c#a# | nonsynonymous_snv | CYP2C9 | TOLERATED | possibly damaging
Model#3/4 | chr11# #g#a# | nonsynonymous_snv | EXPH5 | DAMAGING | probably damaging
Model#3/4 | chr17# #c#t# | nonsynonymous_snv | MYH2 | DAMAGING | probably damaging
Model#3/4 | chr8# #t#c# | nonsynonymous_snv | ADAM7 | TOLERATED | possibly damaging
Model#3/4 | chr22# #c#a# | nonsynonymous_snv | C22orf13 | TOLERATED | benign
Model#3/4 | chr16# #g#a# | nonsynonymous_snv | COX4NB | TOLERATED | benign
Model#3/4 | chr2# #g#a# | nonsynonymous_snv | EFHD1 | DAMAGING | probably damaging
Model#3/4 | chr9# #g#a# | splice-site | GLIPR2 | Not_scored | not given
Model#3/4 | chr12# #g#a# | nonsynonymous_snv | IRAK4 | DAMAGING | probably damaging
Model#3/4 | chr6# #g#a# | nonsynonymous_snv | KIF6 | DAMAGING | probably damaging
Model#3/4 | chr18# #a#g# | nonsynonymous_snv | PIK3C3 | TOLERATED | possibly damaging
Model#3/4 | chr11# #a#c# | nonsynonymous_snv | SNX32 | TOLERATED | benign

Model#3/4 | chr9# #a#c# | nonsynonymous_snv | KDM4C | DAMAGING | probably damaging
Model#3/4 | chr14# #g#a# | nonsynonymous_snv | ABHD12B | TOLERATED | possibly damaging
Model#3/4 | chr9#399245#c#a# | nonsynonymous_snv | DOCK8 | TOLERATED | benign
Model#3/4 | chr8# #a#g# | nonsynonymous_snv | MTMR9 | TOLERATED | benign
Model#3/4 | chr11# #a#c# | nonsynonymous_snv | PGA5 | TOLERATED | benign
Model#3/4 | chr12# #g#a# | nonsynonymous_snv | ACAD10 | DAMAGING | probably damaging
Model#3/4 | chr12# #g#a# | nonsynonymous_snv | DDX55 | TOLERATED | benign
Model#3/4 | chr5# #g#a# | nonsynonymous_snv | FAM151B | TOLERATED | benign
Model#3/4 | chr1# #g#c# | splice-site | CAPN9 | Not_scored | not given
Model#3/4 | chr2# #c#t# | nonsynonymous_snv | SUCLG1 | DAMAGING | probably damaging
Model#3/4 | chr7# #g#a# | nonsynonymous_snv | DAGLB | TOLERATED | benign
Model#3/4 | chr14# #g#a# | nonsynonymous_snv | PIGH | DAMAGING | probably damaging
Model#3/4 | chr9# #c#t# | nonsynonymous_snv | ZNF510 | TOLERATED | benign
Model#3/4 | chr2# #a#g# | nonsynonymous_snv | CPS1 | DAMAGING | benign
Model#3/4 | chr8# #g#a# | nonsynonymous_snv | PIWIL2 | TOLERATED | benign
Model#3/4 | chr17# #g#a# | nonsynonymous_snv | CDRT15 | TOLERATED | benign
Model#3/4 | chr12# #a#g# | nonsynonymous_snv | SLCO1B3 | TOLERATED | possibly damaging
Model#3/4 | chr10# #c#t# | nonsynonymous_snv | TACC2 | DAMAGING | probably damaging
Model#3/4 | chr12# #a#g# | nonsynonymous_snv | ARNTL2 | TOLERATED | benign
Model#3/4 | chr12# #t#c# | nonsynonymous_snv | ASUN | TOLERATED | probably damaging
Model#3/4 | chr7# #g#a# | nonsynonymous_snv | CHST12 | TOLERATED | benign
Model#3/4 | chr17# #g#a# | nonsynonymous_snv | DNAH17 | TOLERATED | probably damaging
Model#3/4 | chr6# #t#c# | splice-site | F13A1 | Not_scored | not given
Model#3/4 | chr9# #g#a# | nonsynonymous_snv | FAM189A2 | DAMAGING | probably damaging
Model#3/4 | chr13# #c#t# | nonsynonymous_snv | KIAA0564 | TOLERATED | probably damaging
Model#3/4 | chr9# #c#g# | nonsynonymous_snv | KIF27 | TOLERATED | benign
Model#3/4 | chr12# #g#c# | nonsynonymous_snv | LTA4H | TOLERATED | possibly damaging

Model#3/4 | chr9# #c#g# | nonsynonymous_snv | NCBP1 | TOLERATED | benign
Model#3/4 | chr6# #g#c# | nonsynonymous_snv | NCOA7 | DAMAGING | possibly damaging
Model#3/4 | chr11# #a#g# | nonsynonymous_snv | NDUFC2-KCTD14 | DAMAGING | probably damaging
Model#3/4 | chr14# #g#t# | nonsynonymous_snv | PAPLN | DAMAGING | probably damaging
Model#3/4 | chr11# #a#t# | nonsynonymous_snv | PPFIA1 | TOLERATED | benign
Model#3/4 | chr12# #c#t# | nonsynonymous_snv | RBM19 | TOLERATED | benign
Model#3/4 | chr14# #t#a# | nonsynonymous_snv | STON2 | DAMAGING | possibly damaging
Model#3/4 | chr12# #c#t# | nonsynonymous_snv | TAS2R8 | DAMAGING | possibly damaging
Model#3/4 | chr6# #g#a# | nonsynonymous_snv | TINAG | TOLERATED | benign
Model#3/4 | chr12# #a#g# | nonsynonymous_snv | TMTC1 | TOLERATED | benign
Model#3/4 | chr14# #g#a# | nonsynonymous_snv | VRTN | DAMAGING | possibly damaging
Model#3/4 | chr16# #c#g# | nonsynonymous_snv | ZSCAN10 | DAMAGING | possibly damaging
Model#3/4 | chr22# #(-t) | frameshift_indel | C22orf40 | FRAMESHIFT | n/a
Model#4 | chr2# #t#c# | nonsynonymous_snv | APOB | TOLERATED | benign
Model#4 | chr5# #a#g# | nonsynonymous_snv | ARAP3 | DAMAGING | possibly damaging
Model#4 | chr2# #g#a# | nonsynonymous_snv | CYP27C1 | DAMAGING | probably damaging
Model#4 | chr6# #a#g# | nonsynonymous_snv | PAK1IP1 | TOLERATED | benign
Model#4 | chrx# #c#t# | nonsynonymous_snv | EMD | DAMAGING | benign
Model#4 | chr16# #g#a# | nonsynonymous_snv | CENPN | TOLERATED | probably damaging
Model#4 | chr14# #c#g# | nonsynonymous_snv | MIPOL1 | TOLERATED | probably damaging
Model#4 | chr5# #c#t# | nonsynonymous_snv | TCOF1 | TOLERATED | probably damaging
Model#4 | chr6# #t#a# | nonsynonymous_snv | AKAP12 | TOLERATED | benign
Model#4 | chr6# #g#t# | nonsynonymous_snv | UBE2CBP | DAMAGING | probably damaging
Model#4 | chr12# #g#a# | nonsynonymous_snv | PZP | DAMAGING | probably damaging
Model#4 | chr11# #a#g# | nonsynonymous_snv | ACAT1 | DAMAGING | probably damaging

Model#4 | chr8# #c#t# | nonsynonymous_snv | ASPH | TOLERATED | probably damaging
Model#4 | chr5# #c#t# | nonsynonymous_snv | IQGAP2 | DAMAGING | probably damaging
Model#4 | chr9# #g#a# | nonsynonymous_snv | KIF24 | TOLERATED | benign
Model#4 | chr19# #c#t# | nonsynonymous_snv | KRI1 | DAMAGING | probably damaging
Model#4 | chr12# #g#t# | nonsynonymous_snv | KRT6B | DAMAGING | benign
Model#4 | chr6# #t#c# | nonsynonymous_snv | MDN1 | DAMAGING | possibly damaging
Model#4 | chr11# #g#a# | nonsynonymous_snv | MMP13 | TOLERATED | benign
Model#4 | chr7# #c#t# | stopgain_snv | PKD1L1 | N/A | not given
Model#4 | chr1# #a#g# | nonsynonymous_snv | PLEKHA6 | DAMAGING | possibly damaging
Model#4 | chr16# #c#t# | nonsynonymous_snv | RFWD3 | DAMAGING | probably damaging
Model#4 | chr17# #t#a# | nonsynonymous_snv | SP2 | TOLERATED | benign
Model#4 | chr11# #t#c# | nonsynonymous_snv | TUT1 | DAMAGING | not given
Model#4 | chr5# #c#t# | nonsynonymous_snv | GPR151 | DAMAGING | probably damaging
Model#4 | chr11# #(-t) | frameshift_indel | FAM111B | FRAMESHIFT | n/a
Model#4 | chr16# #(- CACT) | frameshift_indel | NQO1 | FRAMESHIFT | n/a

Four of the missense variants and two of the indels in the final list of candidates are in olfactory receptor (OR) genes, which we automatically downgrade on our list because they are functionally unlikely to be cancer susceptibility genes, they are commonly affected by variants, and they have many homologous pseudogenes that may inadvertently be captured and sequenced. A fifth missense variant belongs to a pseudogene called RPL21P44, and it was also excluded. Model#1, comprising variants shared by all three affected relatives and absent in the unaffected aunt, generated the shortest list, with only 9 SNVs and zero indels. Model#2, including variants shared by the siblings and uncle without incorporating the aunt in the filtration, generated a final list of 16 genes (14 SNVs + 2 indels). Model#3 contained variants shared by the siblings and absent in the aunt, regardless of whether they were called in the uncle; the final list consists of 69 genes (68 SNVs + 1 indel). 
Model#4 in our filtration strategy yielded the longest list of variants, producing 98 SNVs and 5 indels shared by the two siblings irrespective of their status in the uncle and aunt, including 9 protein-truncating variants. No gene contained more than one novel/rare variant in any model.
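The four filtration models can be read as set operations on each relative's list of rare or novel variant calls. The sketch below is only an illustration of that logic, not the pipeline actually used: the function name `filter_models` and the variant identifiers are hypothetical, and each argument is assumed to already hold the quality-filtered, rare variants for one family member.

```python
def filter_models(sib1, sib2, uncle, aunt):
    """Return the candidate variant set for each of the four models.

    Each argument is a set of variant identifiers called in one relative:
    the two affected siblings, the affected uncle, and the unaffected aunt.
    """
    siblings = sib1 & sib2          # shared by both affected siblings
    affected = siblings & uncle     # shared by all three affected relatives
    return {
        "model1": affected - aunt,  # all three affecteds, absent in the aunt
        "model2": affected,         # all three affecteds, aunt not used
        "model3": siblings - aunt,  # both siblings, absent in the aunt, uncle not used
        "model4": siblings,         # both siblings, uncle and aunt not used
    }


# Toy example with made-up identifiers (not real calls from Family C).
models = filter_models(
    sib1={"v1", "v2", "v3", "v4", "v5"},
    sib2={"v1", "v2", "v3", "v4", "v6"},
    uncle={"v1", "v2", "v7"},
    aunt={"v2", "v4", "v8"},
)
```

Written this way, the nesting of the lists is explicit: every Model#1 variant also appears in Models #2, #3 and #4, and every Model#3 variant also appears in Model#4, which matches the "Model#3/4" labels in the table.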

We also reviewed the list of filtered-out variants in untranslated regions of the gene, but found that no additional genes are added to the Model#1 and #2 lists, only 6 additional variants in Model#3, and 11 additional variants in Model#4. These variants are identified separately in Table 18.

Table 18 Additional candidate variants in untranslated regions shared by exome subjects
Variant Model Gene Position
chr12# #g#a# Model#3/4 RAPGEF3 3' UTR
chr7# #t#c# Model#3/4 ZC3HAV1 3' UTR
chr14# #a#t# Model#3/4 FCF1 3' UTR
chr14# #g#c# Model#3/4 IFI27L1 5' UTR
chr6# #a#g# Model#3/4 AKAP7 5' UTR
chr19# #g#a# Model#3/4 PLAC2 predicted noncoding RNA
chr16# #g#c# Model#4 CLEC18A 3' UTR
chr9# #c#t# Model#4 ZNF658 3' UTR
chr7# #c#t# Model#4 GTF2I 3' UTR
chr7# #g#a# Model#4 TMEM195 3' UTR
chrx# #a#g# Model#4 CXorf40A 3' UTR

None of the genes with SNVs or indels in our exome data contained coding-region CNVs in the CNV study, nor were any reported to be associated with pancreatic cancer in published case-control studies (see Literature Search). Due to time and resource constraints, the remainder of this chapter focuses on the results of model#1, the most stringent and shortest list of candidate susceptibility genes. Using Sanger sequencing, we validated the missense variants in the 9 genes in the three affecteds and verified their absence in the aunt. Four genes had variants that were identified as damaging by SIFT as well as Polyphen-2. 
Moreover, three of those genes have functions that suggest potential importance in tumor development: APLF (aprataxin and PNKP like factor) has been shown to play a role in DNA single- and double-strand repair by interacting with members of the PARP (Poly-ADP-Ribose-Polymerase) family 567, and APLF undergoes ATM-dependent hyperphosphorylation following ionizing radiation 568; RASSF6 (Ras association (RalGDS/AF-6) domain family member 6) is a Ras effector and candidate tumor suppressor that is downregulated in some tumors 569; and CEP110 (centriolin) encodes a protein required for centrosome function as a microtubule organizing centre and is associated with centrosomal maturation 570. A fourth gene, MUC7 (mucin 7, secreted), is overexpressed in pancreatic adenocarcinoma 571; however, we ranked it lower than the other above-mentioned genes since (a) SIFT and Polyphen-2 did not provide a strong prediction of damaging effect for this variant, likely because it was poorly conserved, and (b) most hereditary cancer syndromes are caused by inactivating mutations in tumor suppressor genes that cause decreased expression of the encoded protein, and mucin 7 appeared to be more of a marker and potential oncogene in pancreatic cancer than a tumor suppressor. The remaining genes (ASTN2, TAF5L, CCDC141, C1orf65, and PCYOX1) were ranked lower on the list of

candidates due to lack of evidence linking them to cancer, and most of these variants were predicted to be benign. We PCR-amplified and Sanger sequenced each exon of APLF (10 exons) and RASSF6 (11 exons) in a cohort of approximately 70 pancreatic cancer cases. No novel variant was identified in either gene in the screening cohort. CEP110 was not screened in the same manner due to its very large size (42 coding exons), and instead it will be investigated for variants in other subjects using future data from planned whole-exome sequencing of 75 additional familial pancreatic cancer patients.

5. Discussion

We have presented a list of candidate susceptibility genes for FPC by performing exome sequencing in a family with a strong history of pancreatic cancer in two of seven siblings, their mother, and a maternal uncle. Initially, our plan was to filter variants shared by the three affected members (2 siblings + maternal uncle) while excluding variants present in the aunt (who was unaffected by age 80). This model is based on an autosomal dominant mode of inheritance of a relatively high-penetrance gene. However, since we do not actually know the penetrance of the gene in question, we also decided to account for the possibility that the unaffected aunt may be a carrier. Thus model#2 comprised genes with variants shared by the three affecteds irrespective of the status in the aunt. This approximately doubled the number of candidate genes (16 vs. 9), but the list size remained manageable. Interestingly, the model#1 list, while containing three functionally interesting genes, did not have any truncating mutations, whereas model#2 yielded two frameshift indels. Most familial cancer syndromes are caused by tumor suppressor genes that segregate protein-truncating mutations in the affected members of the family. 
Nonetheless, although several additional genes in the model#2 group are of potential interest, we elected to focus our investigation on top candidates in model#1 for the purpose of this thesis due to time and resource constraints. One of the most interesting genes in our list is APLF, encoding a protein that has been demonstrated to participate in DNA repair and is also thought to be a histone chaperone 567,568,573. The DNA repair genes BRCA2, PALB2, and ATM have all been linked to FPC in recent years, suggesting the importance of this pathway in pancreatic tumorigenesis. However, a Sanger-based screen of all 10 exons of APLF in ~70 unrelated pancreatic cancer subjects yielded no novel variants. Similarly, RASSF6 is appealing as a susceptibility gene in pancreatic cancer due to its regulatory effect on Ras, a protein whose activation has been demonstrated in the majority of pancreatic adenocarcinomas and is an early event in tumorigenesis. Sanger sequencing of the 11 exons of RASSF6 also failed to show novel variants in the screening cohort. We note that the variants affecting each of these genes in Family C are rare (~0.2%), and both were

predicted to be damaging by both SIFT and Polyphen-2. This emphasizes the challenge inherent in using exome sequencing in a single family, particularly with closely related relatives, for attempting to identify the genetic cause of a familial cancer syndrome. The presence of many potentially deleterious variants in the exome of any individual has been well demonstrated by multiple whole genomes and exomes published to date (see Literature Search for details). The successful studies that used exome sequencing to identify high-penetrance cancer genes in autosomal dominant syndromes did so either by accessing paired-tumor sequence to identify second hits or else by sequencing multiple unaffected individuals. Whole exome sequencing does not yield good results from formalin-fixed paraffin-embedded (FFPE) tumors, and the only resected specimen available in Family C belonged to the mother and was indeed FFPE. An alternative method of guiding exome data filtering in autosomal dominant syndromes is with linkage analysis data, as has been demonstrated in several studies in other Mendelian diseases. As described in the Literature Search, our group is part of a multi-centre consortium that has collected eligible families for linkage analysis. Unfortunately, to date no usable results have been generated to allow us to guide our exome sequencing. We also did not find any of our variants among the genes reported to be associated with pancreatic cancer in case-control studies. Our study had some technical limitations; perhaps the most significant was the lower depth of coverage in the uncle's exome compared to the other sequenced samples, which resulted in only half as many variants being called in the uncle as in each of the siblings and aunt. 
Importantly, the distribution of novel SNVs across chromosomes differed between the uncle and the other three subjects, suggesting that the uncle's decreased coverage is not evenly distributed across the genome and that some chromosomes are particularly under-represented compared to the siblings and aunt (e.g. chromosomes 7 and 12). Sanger sequencing indicated that the specificity of variant calling in the uncle was equivalent to that of the other subjects but the sensitivity was significantly lower in the uncle. For this reason, we also considered models that did not take the uncle's data into account (#3 and #4). These analyses produced a much longer list of candidate genes (75-110, depending on whether the aunt's exome was used to filter out variants). Those genes are too numerous to be individually screened in other pancreatic cancer patients using Sanger sequencing. We present those genes here as additional candidates, and anticipate that data from additional exomes will facilitate variant filtration and allow screening of interesting genes in a more cost-effective manner. Another limitation observed in our data is the low specificity of homozygous variant calls. It is not clear what is causing these erroneous calls, and it certainly underscores the importance of individual Sanger validation of any homozygous variant. However, we note that all the homozygous variants we found to be

inaccurately called were at positions reported as SNPs in dbSNP; the only two novel homozygous variants validated by Sanger were actually true calls. This suggests that homozygous calls in our final filtration models may still have a higher validation rate than observed from the comparison to SNP chips. Had our analysis been based on an autosomal recessive model of inheritance, this issue would have been of greater significance (as we would have focused on homozygous variants in the siblings). In any case, only three variants in any of our models were called as homozygotes (one of which we had successfully validated by Sanger), and they were only present in the model#3 and #4 lists. In conclusion, we present a list of candidate susceptibility genes for familial pancreatic cancer based on exome sequencing of three affected members and one unaffected member of a single family. Our screening of two top candidates in a cohort of unrelated cases failed to identify novel variants to support the role of these genes in pancreatic cancer causation. However, other potential candidates remain to be investigated and further screening of those candidates will be facilitated by large-scale exome sequencing of other families.

Chapter 5 - General Discussion, Conclusions, and Future Directions

General Discussion

The overall aim of my research has been to better understand genetic susceptibility to pancreatic cancer, a highly lethal malignancy that has a dismal outcome for the majority of affected patients. More specifically, I am interested in relatively highly-penetrant genetic variants that explain some or most of familial pancreatic cancer (FPC), the autosomal dominant syndrome that has been proposed to explain clustering of pancreatic cancer in families, often occurring at a younger age of onset than in sporadic cases. The benefits of identifying such susceptibility genes include: facilitating the development of early-detection and intervention strategies by enriching trials with subjects who carry known predisposition genes; calculating the attributable risk of a particular variant through case-control and/or cohort studies, allowing more accurate estimation of individual risk in members of FPC families and providing more informed genetic counseling to such individuals; and identifying individuals who may benefit from specific forms of therapy that target the pathways implicated in tumorigenesis, enabling the development of targeted biological therapies. To date, only a small proportion of hereditary pancreatic cancer cases is attributable to mutations in specific genes, almost all of these occurring in the context of rare cancer syndromes such as Peutz-Jeghers Syndrome or Familial Atypical Multiple Mole Melanoma. The most frequently identified mutated gene in hereditary pancreatic cancer cases is BRCA2, accounting for up to 19% 103 of pancreatic cancer families and conferring an estimated lifetime risk of up to 5%. 
502 Often, BRCA2 families demonstrate other associated cancers as well, particularly breast or ovarian cancer; however, a subset of BRCA2-associated pancreatic cancer patients have no family history of other cancers, and indeed this gene has even been implicated in apparently sporadic cases. 112,113 Given this well-established link between BRCA2 and pancreatic cancer, investigators have sought to determine if a similar association exists with BRCA1. Indeed, as discussed in detail in Chapter 1, multiple studies have suggested that BRCA1 increases risk of pancreatic cancer, albeit to a lesser extent than BRCA2. However, most previous studies have been criticized for being biased by their family-based design, and population-based studies have produced conflicting results. Notwithstanding these limitations, I felt that the role of BRCA1 in pancreatic cancer required further consideration, not only for the value of providing more complete genetic counseling to affected families and possibly including carriers in screening studies, but also because of the recent accumulation of anecdotal reports indicating that BRCA1 and BRCA2 mutation carriers respond well to certain chemotherapies (e.g. platinum-based chemotherapy, PARP-1 inhibitors) which target the

136 123 impaired DNA repair system resulting from BRCA1/2 gene inactivation in these tumors. At the time of conducting my study, our research group had collected seven FFPE-tumor specimens from pancreatic cancer patients with confirmed germline BRCA1 mutations. Therefore, for the first section of my thesis, I decided to conduct a loss-of-heterozygosity (LOH) analysis on these samples and compare with nine sporadic cases that have no known BRCA1 mutations or familial history of breast/ovarian cancer. I hypothesized that tumors with germline heterozygous inactivating mutations in BRCA1 demonstrate loss of the remaining functional allele. My analysis indeed demonstrated that LOH at the BRCA1 locus was a common event in tumors of mutation carriers, with evidence of loss of the functional allele, occurring in 5/7 BRCA1-mutation carriers while only 1/9 sporadic cases demonstrated LOH. The limitations of my study, namely small sample size and the variable quality of DNA extracted from FFPE tissue, are challenges that characterize the field of pancreatic cancer research. Due to the rapid lethality of pancreatic cancer, only a small percentage of patients undergo resection before death. Moreover, most specimens available for research exist as paraffin blocks of formalin-fixed tissue; formalin fixation causes cross-linking of nucleic acids, often resulting in degradation of DNA and RNA. For those reasons, molecular analyses of pancreatic tumors are fraught with difficulties and potential biases. In my analysis, I attempted to circumvent the potential bias of DNA degradation by selecting microsatellite markers that generate small amplicons, well below the lower limit of expected DNA fragments in FFPE tissue (180bp). To my knowledge, this is the first LOH analysis using familial pancreatic cancer cases with deleterious BRCA1 mutations. Only two molecular studies previous to mine had investigated BRCA1 in pancreatic tumors, and both assessed sporadic tumors only. Beger et al. 
510 found decreased mRNA and protein expression of BRCA1 in half of 50 pancreatic cancers, with worse 1-year survival in the group with decreased expression. Peng et al. 523 reported frequent BRCA1 methylation in sporadic pancreatic cancers. No additional studies have since been reported. Interestingly, although sporadic breast and ovarian cancers do not usually have somatic BRCA1 mutations, they have been reported to have frequent LOH events at the BRCA1 locus, prompting speculation about potential haploinsufficiency of BRCA1 in these tumors that drives further genetic alterations. 574 My findings suggest that sporadic pancreatic cancer cases do not have frequent loss at the BRCA1 locus; this would be consistent with Peng et al.'s 523 findings of methylation being a frequent event, since it would function as an alternative to LOH for gene inactivation. However, I acknowledge that my small sample size, due to the scarcity of resected tumor samples from pancreatic cancer patients and particularly those with BRCA1 germline mutations, limits the generalizability of my results. Further investigation of molecular alterations of BRCA1 in pancreatic tumors is needed on a larger scale before drawing more conclusions regarding its mechanism of action in the pancreas. Nonetheless, although my findings do not definitively implicate BRCA1 as a familial

137 124 pancreatic cancer gene, they certainly suggest such a role for this gene and indicate that larger epidemiologic studies need to be conducted to establish the risk associated with BRCA1 mutations and pancreatic cancer. While my first study contributed toward understanding the role of a specific candidate gene (BRCA1) in pancreatic tumorigenesis, the expected attributable risk of this particular gene to familial pancreatic cancer is fairly low. Several approaches can be taken to identify genetic predisposition for the majority of FPC cases not linked to a known gene. Candidate genes can be identified based either on function or connection to the pathway of another established susceptibility gene, which was the rationale for pursuing BRCA1. It is possible to screen high-risk pancreatic cancer patients for mutations in additional genes associated with BRCA1 or BRCA2, or even other genes in pathways that have been implicated in pancreatic tumorigenesis from somatic studies 37, but performing Sanger sequencing on all coding regions of each candidate gene is a costly and laborious process. Furthermore, the functional and pathway properties of many genes are incompletely understood at this time, thus biasing the investigation to the relatively small proportion of genes that have been well annotated thus far. One can also derive a candidate gene list for screening in high-risk subjects based on results of genome-wide association studies conducted on a large number of sporadic cases; as would be expected, these variants are invariably associated with low odds ratios in sporadic cases, but some may be of greater significance in smaller populations enriched for familial cases. However, most variants identified by genome-wide association studies are not within coding sequences, requiring further fine-mapping and delineation of the actual genes affected. 
Under ideal conditions, genetic linkage analysis would be a powerful approach for identifying high-risk variants segregating with a disease that is inherited in an autosomal dominant fashion in family-based studies. Indeed, much effort has been invested in collecting families with multiple cases of pancreatic cancer in closely-related members for the purpose of performing genetic linkage. One of the largest such projects has been undertaken by the PACGENE consortium (described in the Literature Search), which has been investigating FPC genetics for about 10 years. Thus far, no linkage results have been released by PACGENE, and indeed only one FPC linkage analysis has been published by any group to date, in a single high-risk family that does not resemble most FPC cases. 187 The latter found evidence of linkage to a region on chromosome 4q and proposed the gene of interest to be Palladin; however, multiple subsequent analyses of Palladin in high-risk populations refuted it as a likely FPC gene. Genetic linkage analysis is a statistics-based method that requires a sufficient number of genotyped affected and unaffected members in a family to generate power for detecting regions segregating with disease status. It is significantly weakened if there is genetic heterogeneity (i.e. multiple loci involved in causing the same phenotype) or if some of the affected subjects are phenocopies. Moreover, linkage analysis alone

cannot pinpoint the causative gene, as illustrated by the aforementioned 4q linked region and the failure to determine the responsible gene in that region. For all those reasons, I elected to approach the FPC question from two novel directions: mapping the copy-number variable portion of the genome in a cohort of probands from high-risk families and mapping the whole exome (single nucleotide variants and small indels) of members of a single high-risk family. The 2004 seminal papers demonstrating that structural variation of the human genome is detectable in all individuals, regardless of phenotype or disease status, generated a paradigm shift in the field of genomics. 197,198 After multiple reports established that CNVs are a significant source of genomic variability, attention turned to investigating their association with disease. To date, the majority of such studies have been in diseases other than cancer, particularly the neuropsychiatric disorders; however, copy number alteration is in fact a well-known characteristic of tumor genomes, often causing the inactivation or amplification of important cancer-suppressing or cancer-driving genes, respectively. Furthermore, germline genomic rearrangements represent a well-recognized mechanism of heredity in familial cancer syndromes, usually affecting a small but non-negligible portion of cases. When I embarked on this study, only two published reports of germline CNVs in familial cancer syndromes were available. The first was a survey of CNVs in 57 FPC subjects using an oligonucleotide-based CGH array. 345 This study presented several candidate regions, but was limited in array resolution and coverage, sample size, and the size of the control dataset available for data filtration. 
The second report was based on Li-Fraumeni syndrome patients who carry TP53 mutations 348 : the authors found that patients with germline TP53 mutations have a significantly more unstable genome, manifested as higher frequency of germline copy number variation than control genomes. They proposed that the increased frequency of CNVs in Li-Fraumeni genomes predisposes to somatic expansion of deletions or duplications that affect cancer-suppressing or cancer-driving genes, respectively. Since pancreatic cancer contains a high degree of somatic genome instability, I hypothesized that the genomic profile of germline CNVs in FPC patients may be distinct from that of controls. Furthermore, I hypothesized that identifying germline deletions or duplications in cases that are not observed in healthy controls would generate a list of candidate susceptibility genes for FPC. For the third chapter of my thesis, I focused on a single family that was part of my CNV study. This family contained two siblings (in a sibship of seven) who had died of pancreatic cancer at young ages (30s and 40s), and whose mother and maternal uncle also died of the disease. At the time of this study, the technology for sequencing most of the coding region of the genome (i.e. the exome) had become accessible for considerably lower expense than in the past. Many studies had been published describing the use of whole-exome analysis to pinpoint the causative variant in rare Mendelian disorders. Only one report applying whole-exome sequencing to familial cancer had been published, showing PALB2 to be a

139 126 susceptibility gene for FPC. Notably, this latter paper did not use exome-capture and next-generation sequencing as with all other reports, but was based on a large-scale Sanger-sequencing based effort to sequence pancreatic tumors and paired blood-derived DNA to identify germline variants. I hypothesized that whole-exome sequencing would reveal susceptibility genes in this high-risk family by identifying rare variants shared by affected members. My CNV results refuted the first part of my hypothesis, indicating that no discernible difference in genome stability or other CNV characteristics exist between FPC cases and healthy controls. Since conducting my study, only a couple of other such studies have been published in familial/hereditary cancer populations. 346,347 Neither offered much beyond a list of susceptibility genes, as we have done, and neither described a significant difference in the frequency of germline CNVs between cases and controls. While it is difficult to draw firm conclusions based on only a few studies, thus far there is little to suggest that the phenomenon observed by Shlien et al. 348 in Li-Fraumeni patients is replicated in other familial cancer cases. TP53 is known to act as the guardian of the genome. 575 Given our observations, we would conclude that most FPC cases are not caused by mutations in genes with a similar impact on genomic stability. Furthermore, CNVs in general do not appear to play as significant a role in susceptibility to most familial cancers as they do in other diseases like neuropsychiatric and developmental disorders. Both the CNV study and the whole-exome analysis relied on relatively novel technology and were significantly dependent on recently developed bioinformatic tools, and as such both had limitations related to the technology and/or the available resources for analyzing the data. 
In the CNV study, the Affymetrix GeneChip Human Mapping 500K SNP array used for CNV detection, consisting of two chips that together genotype approximately 500,000 genome-wide SNPs, was originally designed for the purpose of accurate SNP genotyping to enable sufficiently powered SNP-based genome-wide association studies. As such, SNPs selected for inclusion in the array underwent rigorous validation for accuracy of genotype, call rate, and linkage disequilibrium in different populations, but the probe design was not optimized for accurate copy number. The median physical distance between SNPs on the array is 2.5kb, but the density of genotyped SNPs across the genome is not uniform resulting in excellent coverage for some regions and incomplete or entirely absent coverage in others. Nonetheless, at the time of my study design, this array was one of the highest-resolution and best coverage platforms available for CNV detection. When subsequent generations of CNV detection platforms were developed, it became evident that most common CNVs tend to not be captured well by the Affy 500K array, due to this bias of SNP distribution. However, this was not a significant concern for my analysis since I was specifically interested in rare or low-frequency deletions or duplications. Since the use of the Affy500K array for CNV analysis only began shortly before the design of my study, new algorithms had to be developed to

analyze the data with variable sensitivity and specificity. Therefore, it was necessary to use multiple algorithms, and moreover I needed to demonstrate an approach that generates a well-validated set of CNVs. I utilized qPCR to validate a subset of CNVs, but given the time- and resource-consuming nature of validating individual CNVs in this manner, I also performed a secondary CNV analysis on a subset of samples using a newer array (the Affy6.0). In this way, I generated a set of high-confidence CNVs with a validation rate of 95% or higher; due to logistical constraints, I did not address any of the remaining low-confidence CNVs. Since my validation experiments suggested that approximately half of the 491 low-confidence case CNVs are likely to be real, it is likely that my approach missed some additional FPC-specific CNVs containing candidate genes. Future investigations of FPC cases using newer and higher-resolution platforms would serve to validate my results as well as fill the gaps in coverage due to limitations of the 500K array and my analysis strategies. Similarly, the technologies and algorithms used for studying the high-risk Family C were rapidly evolving even as I was conducting my study. First, no target-capture array available to me at the time of my study targeted 100% of coding regions in the genome; rather, they aimed to capture most of the well-annotated coding regions. Even then, technical problems sometimes result in incomplete capture of this target region. One of my samples, from the uncle in Family C, could not be sequenced to the same depth of coverage as the remaining samples due to technical problems, resulting in a significantly lower number of variants called in this individual. 
Since my hypothesis relied to a greater extent on filtering unshared variants between the affected cases, and since the uncle's second-degree relation to the siblings means that he is expected to share fewer variants with the siblings than they share with each other, the incomplete variant list generated in the uncle meant that I would almost certainly miss potential candidate genes if I included the uncle's exome data. To address this shortcoming, I presented alternative filtering models that did not necessarily exclude variants shared by the siblings but not called in the uncle. As expected, these models generate considerably longer variant lists and require other methods of prioritizing the results for further investigation. Furthermore, since my project was conducted as a collaboration with the laboratory that performed the whole-exome sequencing, I was not directly involved in running the analysis pipeline implemented by their group. I was able to validate the resultant variant calls and determined that the dataset generated by this pipeline was accurate for both heterozygous single nucleotide variant and indel calls; the validation rate of homozygous variant calls, however, was significantly lower. I could not directly assess sensitivity on a large scale and so it is possible that additional true variants were missed. Therefore, as with the CNV analysis, I needed to prioritize high specificity of variant calling at the expense of slightly lower sensitivity so that I could work with a reliable dataset for downstream analysis.

Another common theme to both types of analyses is the large number of variants generated for each sample, even after applying quality controls to ensure maximum validity of data. Clearly, this is a direct result of the higher resolution and genome-wide coverage of these approaches compared to older techniques that assess only one or a few genomic regions or genes at a time. While such high coverage is one of the primary attractive features of these technologies, it also creates significant challenges in interpretation and prioritization of data. One component of data prioritization in my studies was the focus on rare variants; since I am most interested in identifying variants with a relatively high effect size to explain familial inheritance of pancreatic cancer, the frequency of such variants in the general population is expected to be very low. The identification of rare variants posed some interesting challenges for the CNV and the exome analyses. To interpret the significance of CNVs, particularly in the context of my hypothesis, I needed to have a control set for comparison to the cases. Approximately 45 spousal controls were selected for genotyping alongside the cases; genotyping additional controls was not feasible at the time due to financial constraints. Instead, I took advantage of a large control cohort that was previously genotyped on the same Affy 500K array for a genome-wide association study of colorectal cancer (ARCTIC). Approximately 1,100 controls were genotyped at a different facility from the cases, but I analyzed these controls in a parallel manner to the cases, applying the same algorithm parameters and filtering rules. It became evident during analysis that there was a greater level of noise in the ARCTIC controls, manifesting as a greater proportion of control CNVs that were low-confidence. This highlights the importance of study design in facilitating CNV analysis, which is more sensitive to batch effects than SNP studies. 
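As a rough illustration of the case-versus-control comparison, the sketch below keeps case CNVs that overlap no control CNV and separately flags those whose only control overlap is a low-confidence call, so that they can be validated before further work. It is a simplified sketch under stated assumptions, not the analysis actually performed: the `CNV` record, the function names, and the any-overlap rule are hypothetical, and it ignores CNV type (deletion versus duplication) and any minimum-overlap requirement.

```python
from typing import NamedTuple


class CNV(NamedTuple):
    chrom: str
    start: int                 # first base of the region
    end: int                   # last base of the region (inclusive)
    confidence: str = "high"   # "high" or "low" (hypothetical labels)


def overlaps(a, b):
    """True if two regions on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def classify_case_cnvs(case_cnvs, control_cnvs):
    """Split case CNVs into case-specific ones and ones needing validation."""
    specific, needs_validation = [], []
    for case in case_cnvs:
        hits = [ctrl for ctrl in control_cnvs if overlaps(case, ctrl)]
        if any(ctrl.confidence == "high" for ctrl in hits):
            continue                       # region is also seen in controls
        if hits:
            needs_validation.append(case)  # only low-confidence control overlap
        else:
            specific.append(case)          # no control overlap at all
    return specific, needs_validation
```

A production analysis would normally index the controls by chromosome rather than scanning every control for every case, but the linear scan keeps the logic visible.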
The greater noise in the ARCTIC controls also suggests that some real CNVs in controls may have been missed in our analysis; if those regions overlap rare CNVs in cases, then the case CNVs would be inaccurately identified as candidate FPC-specific CNVs under our hypothesis. To address this concern, I noted the FPC-specific CNVs that overlapped a low-confidence CNV in controls and validated each such region before investigating it further. I also utilized the Database of Genomic Variants (DGV), but the quality of data in this resource is directly linked to the limitations of the platform and algorithms used in each source publication. While I was unable to determine the accuracy of each data source, I chose to exclude CNVs detected by studies that used BAC clone arrays because those were later demonstrated to greatly overestimate CNV size.

For filtering the exome variants, I turned to the dbSNP database, which is continuously updated and houses a large set of single-base as well as indel variants. Older versions of dbSNP were largely populated by data from the HapMap study, which mostly identified common variants present at a population frequency of > 1%. However, as more human genomes were sequenced in their entirety, including results from the 1000 Genomes Project and the Exome Sequencing Project, the dataset became more difficult to interpret since most variants were not adequately validated and/or their population frequencies were not calculated, and many variants had a minor allele frequency < 1%. Indeed, for my exome analysis, I decided to use a relatively strict definition of rare (< 0.2%) since variants with higher frequencies have been described as low-frequency variants and some have been demonstrated to have an intermediate effect size on disease predisposition rather than the high-penetrance effect in which I am interested. 494,495 It should be noted that indel reporting in dbSNP is significantly less accurate than reporting of single-nucleotide variants, particularly from next-generation sequencing platforms. As such, the accurate determination of population frequency of indels is even more challenging. Moreover, dbSNP has been contaminated with somatic variants found in tumors and other potentially pathogenic germline variants in cancer. Therefore, I performed a careful screen of my final dataset to ensure that I did not filter out a cancer-linked variant whose population frequency was low.

Beyond filtering by frequency of variants, I attempted to take advantage of common phenotypes. For CNVs, I attempted to identify CNVs present in multiple cases (but not in controls), but ultimately found none (except for the TGFBR3 duplication, discussed below). For the exome data, I filtered by shared variants among the three affected relatives, incorporating an unaffected family member as a negative control (i.e. to filter out variants identified in this relative). My rationale for doing so was that Family C had a very strong history of pancreatic cancer occurring at young ages in most of the affected members, and thus the unaffected 80-year-old aunt seemed significantly less likely to be a carrier of the putative high-penetrance variant responsible for the disease in this family. Indeed, I modeled our primary filtering approach on this premise, and it successfully reduced the number of eligible candidate genes to a workable size.
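The family-based filter described above (rare variants carried by all affected relatives and absent from the unaffected relative) can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this study: the function, the variant identifiers, the sample labels and the frequencies are hypothetical, and treating a variant with no recorded frequency as rare is an assumption of the sketch.

```python
RARE_MAF = 0.002  # "rare" defined in the text as population frequency < 0.2%


def filter_candidates(variants, must_carry, must_not_carry=(), max_maf=RARE_MAF):
    """Return IDs of rare variants carried by everyone in must_carry and by
    no one in must_not_carry. A frequency of None means the variant has no
    recorded population frequency and is treated as rare (an assumption)."""
    kept = []
    for var_id, info in variants.items():
        maf = info["maf"]
        if maf is not None and maf >= max_maf:
            continue  # too common to be a high-penetrance candidate
        carriers = info["carriers"]
        if set(must_carry) <= carriers and not (set(must_not_carry) & carriers):
            kept.append(var_id)
    return kept


# Hypothetical variants and sample labels for illustration only.
variants = {
    "GENE_A:c.100A>T": {"maf": None,   "carriers": {"sib1", "sib2", "uncle"}},
    "GENE_B:c.200G>C": {"maf": 0.0005, "carriers": {"sib1", "sib2", "uncle", "aunt"}},
    "GENE_C:c.300C>T": {"maf": 0.03,   "carriers": {"sib1", "sib2", "uncle"}},
    "GENE_D:c.400del": {"maf": 0.001,  "carriers": {"sib1", "sib2"}},
}

strict = filter_candidates(variants, ["sib1", "sib2", "uncle"], ["aunt"])
no_aunt_exclusion = filter_candidates(variants, ["sib1", "sib2", "uncle"])
siblings_only = filter_candidates(variants, ["sib1", "sib2"], ["aunt"])
print(strict, no_aunt_exclusion, siblings_only, sep="\n")
```

Relaxing either list (dropping the unaffected relative from `must_not_carry`, or dropping an affected relative from `must_carry`) can only lengthen the candidate list, which is the trade-off between retaining the causal variant and keeping the list workable.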
However, since I do not know the actual penetrance of the variant in question, I risked losing the actual causal FPC gene by excluding all variants found in the aunt. I offered alternative filtering models that took this ambiguity into account, and they generated significantly longer lists of candidate genes. In addition to using other cases (or family members) to filter variants, I turned to functional annotation. For my CNV data, I focused on coding region variants and turned to available databases containing somatic cancer variants (COSMIC) and pancreatic expression data (Pancreas Expression Database) to annotate involved genes. While many genes did have potential connections to pancreatic cancer or carcinogenesis in general, it was evident that none were immediately obvious candidates. This again emphasizes the limitations of available functional annotation for most genes, and the challenge in utilizing this approach to identifying susceptibility genes. Similarly, I attempted to prioritize variants from my exome analysis based on likelihood to damage protein function (using two well-known algorithms), as well as referring to the aforementioned databases for gene annotation. However, it is difficult to be certain of the accuracy of prediction for any one variant, particularly if the prediction is benign or tolerated, without adequate functional assays.

In both the CNV and exome analyses, I selected top-prioritized candidate genes for further investigation. In the CNV study, overlapping duplications in two unrelated cases were found to intersect TGFBR3, a receptor gene in the TGF-beta pathway that is of importance in the initiation and progression of pancreatic cancer. This region overlapped only one duplication in controls, and that duplication had different breakpoints from the case CNVs. Importantly, the control CNV did not appear to extend into the gene except for a small part of one isoform that was longer than most other isoforms. I conducted a series of experiments to validate the duplications in the cases, demonstrate heritability of the CNV in members of one of the subjects' families, delineate the exact location of the CNV breakpoints, and sequence the amplicon containing the tandem duplication breakpoints. However, an affected sister of the proband, for whom only FFPE tissue was available, did not harbor the duplication, indicating that it does not segregate with disease in that family. In my exome analysis, I performed Sanger sequencing of all exons in the two top-ranked genes identified by filtering Model #1 (rare variants shared by the three affected relatives and absent in the unaffected relative). Each gene had an exome variant predicted to be damaging, and both were reported to have potential tumor-suppressor roles. Yet, I did not find any other rare variants in these genes in the ~70 unrelated cases that I screened. These results raise several important issues. First, they highlight the significant challenge associated with using a limited number of samples in genome-wide analyses such as CNV surveys or exome sequencing. In the case of CNVs, since only a small percentage of all FPC cases attributed to a particular gene would be expected to have a genomic rearrangement rather than a single base mutation or indel in that gene, a small sample size reduces the likelihood of identifying multiple cases with the same affected gene.
Genetic heterogeneity makes this even more challenging. The fact that linkage analysis on the best available families to date has failed to generate strong locus-specific linkage scores strongly suggests that the families included in the analysis have different causal genes. Alternatively, there may be inaccuracy in identifying FPC families, leading to inclusion of subjects who do not carry a high-penetrance variant. For the exome analysis, it is evident that every individual genome contains a large number of low-frequency or rare variants, many of which appear to be potentially damaging. Therefore, in a family-based design, it is most helpful to sequence multiple affected subjects who have some genetic distance (i.e. not just first-degree pairs) to maximize the filtering potential of identifying shared variants. Even then, use of whole-exome data in a single family to identify a dominant-acting variant is difficult. Most successful exome analyses of dominant Mendelian diseases have used more than one family or, in the case of cancer, have at least utilized data from paired tumor genomes to identify second hits in candidate genes. Genetic heterogeneity may also pose a problem in this setting, since the accepted method of conclusively demonstrating involvement of a gene in predisposition to familial cancer is by identifying rare deleterious variants in the same gene in other unrelated cases. However, if there are many different genes that cause the disease, the possibility exists of family-specific genes (or more likely, genes acting in a small percentage of families). This makes it difficult to decide whether to discard genes that do not demonstrate variants in other samples. Finally, there always remains the possibility that a non-coding variant (whether a CNV or SNV/indel) may in fact be the causative agent. The reason for prioritizing the coding regions of the genome in these types of analyses is practical rather than dogmatic: while it is evident from a number of studies that apparent gene deserts or unexpressed regions of genes such as introns can impact gene expression (short- or long-range), there is little to no annotation of those regions to allow prioritization and interpretation of the potential variant effect. Given that genic regions alone generate sufficiently long lists of candidate genes, many studies, including mine, elect to ignore the non-genic regions. However, should extensive investigations of the exome fail to yield answers, it will become necessary to cast a wider net and characterize non-coding variants.

Conclusions

I have successfully tested and proven my first hypothesis (that LOH occurs frequently at the BRCA1 locus in pancreatic tumors from germline BRCA1-mutation carriers), thus contributing novel information to understanding the role of BRCA1 in susceptibility to pancreatic cancer. For my second hypothesis, I found no evidence of a distinct CNV profile in high-risk pancreatic cancer cases relative to controls but demonstrated that FPC-specific losses and gains overlap some genes that have the potential to be involved in pancreatic tumorigenesis. My data constitute the most comprehensive set of annotated germline CNVs in high-risk familial pancreatic cancer patients to date. Finally, for the third part of my thesis, I applied a hierarchical filtering approach to generate a list of candidate susceptibility genes responsible for FPC.
Similar to the list of genes generated by my CNV analysis, the exome candidates include many that have a potential role in tumorigenesis. The combined list of genes generated by my thesis represents an important resource for future studies of candidate FPC susceptibility genes.

Future Directions

As discussed above, a number of follow-up investigations flow naturally from the results of my studies, including: validation of detected variants using more up-to-date, higher-resolution platforms and larger sample sizes; sequencing the entire coding region of candidate genes identified by the CNV and/or exome analysis in additional cases; and performing additional exome sequencing on other families to increase the power to detect additional variants in the same gene(s). In addition, several new directions may be taken in the future for the investigation of heritable susceptibility to familial pancreatic cancer. One limitation to my studies was the focus on protein-coding genes as the causative agent for heritability of pancreatic cancer. In part, this was necessary because of the relative lack of annotation of non-coding regions of the genome and the challenge of studying such regions. Another constraint is the single-view approach of each study; only one platform was utilized at a time, and generalizing the results of different platforms used in different samples is challenging. A more valuable approach would be to integrate data from multiple profiling techniques (e.g. genomic, epigenomic, transcriptomic, immunohistochemistry) for specimens from the same individuals, thus allowing for a more comprehensive assessment of potential heritable factors in disease susceptibility. Of course, there are practical limitations to such an approach, foremost among them the challenge of obtaining pancreatic tumors from familial cancer patients due to the high mortality of the disease. However, the aforementioned ICGC consortium has been addressing this issue by prospectively collecting tumor specimens and developing xenografts and cell lines to allow further investigations on recruited subjects. An important question that arises after considering the results of my studies is whether a significant portion of familial pancreatic cancer cases can be explained by relatively highly-penetrant variants in a single gene. The fact that I did not find evidence for one gene being affected by deleterious variants in more than one family suggests the possibility of many private genes contributing to familial pancreatic cancer in different families. This would make the identification of such susceptibility genes considerably more difficult. Certainly, functional analyses of genes would become much more important in delineating the causative agents, but pathway analysis may aid in identifying genes affected in different individuals that lead to similar outcomes (i.e. pancreatic cancer development).
Another possibility that must be considered is the role of intermediate-effect variants and gene-gene interactions within the same individual. Recently, our group has found evidence of rare deleterious variants in cancer-predisposing genes that do not segregate with disease in all pancreatic cancer patients in the same family. While the non-carriers may be phenocopies, this observation also raises important questions about the extent of genotyping that should be performed in a given family before attributing familial cancer to a specific gene, and about the importance of more extensive population data in understanding the effect size of rare variants. Such data are forthcoming from large-scale exome and genome-sequencing projects (such as the 1000 Genomes Project and the Exome Sequencing Project), but understanding these effects also requires the assessment of much larger FPC cohorts.

References

1. Hruban RH, Fukushima N. Pancreatic adenocarcinoma: update on the surgical pathology of carcinomas of ductal origin and PanINs. Mod Pathol Feb;20 Suppl 1:S Howlader N, Noone AM, Krapcho M, et al. (eds). SEER Cancer Statistics Review, , National Cancer Institute. Bethesda, MD, based on November 2010 SEER data submission, posted to the SEER web site, Canadian Cancer Society's Steering Committee on Cancer Statistics. Canadian Cancer Statistics Toronto, ON: Canadian Cancer Society; Tada M, Nakai Y, Sasaki T, et al. Recent progress and limitations of chemotherapy for pancreatic and biliary tract cancers. World J Clin Oncol Mar 10;2(3): Cleary SP, Gryfe R, Guindi M, et al. Prognostic factors in resected pancreatic adenocarcinoma: analysis of actual 5-year survivors. J Am Coll Surg May;198(5): Sipos B, Frank S, Gress T, et al. Pancreatic intraepithelial neoplasia revisited and updated. Pancreatology. 2009;9(1-2): Hruban RH, Goggins M, Parsons J, et al. Progression model for pancreatic cancer. Clin Cancer Res Aug;6(8): Yamaguchi K, Yokohata K, Noshiro H, et al. Mucinous cystic neoplasm of the pancreas or intraductal papillary-mucinous tumour of the pancreas. Eur J Surg 2000;166(2): Tanaka M, Chari S, Adsay NV, et al. International consensus guidelines for management of intraductal papillary mucinous neoplasms and mucinous cystic neoplasms of the pancreas. Pancreatology 2006;6(17): Canto MI, Goggins M, Yeo CJ, et al. Screening for pancreatic neoplasia in high-risk individuals: an EUS-based approach. Clin Gastroenterol Hepatol Jul;2(7): Canto MI, Goggins M, Hruban RH, et al. Screening for early pancreatic neoplasia in high-risk individuals: a prospective controlled study. Clin Gastroenterol Hepatol Jun;4(6): Abe K, Suda K, Arakawa A, et al. Different patterns of p16ink4a and p53 protein expressions in intraductal papillary-mucinous neoplasms and pancreatic intraepithelial neoplasia. Pancreas Jan;34(1): Tanno S, Nakano Y, Nishikawa T, et al. 
Natural history of branch duct intraductal papillary-mucinous neoplasms of the pancreas without mural nodules: long-term follow-up results. Gut Mar;57(3): Al-Sukhni W, Borgida A, Rothenmund H, et al. Screening for pancreatic cancer in a high-risk cohort: an eight-year experience. J Gastrointest Surg Apr;16(4):

147 Jeurnink SM, Vleggaar FP, Siersema PD. Overview of the clinical problem: facts and current issues of mucinous cystic neoplasms of the pancreas. Dig Liver Dis Nov;40(11): Maitra A, Hruban RH. Pancreatic cancer. Annu Rev Pathol. 2008;3: Maitra A, Fukushima N, Takaori K, et al. Precursors to invasive pancreatic cancer. Adv Anat Pathol Mar;12(2): Calhoun ES, Jones JB, Ashfaq R, et al. BRAF and FBXW7 (CDC4, FBW7, AGO, SEL10) mutations in distinct subsets of pancreatic cancer: potential therapeutic targets. Am J Pathol Oct;163(4): Cheng JQ, Ruggeri B, Klein WM, et al. Amplification of AKT2 in human pancreatic cells and inhibition of AKT2 expression and tumorigenicity by antisense RNA. Proc Natl Acad Sci U S A Apr 16;93(8): Morris JP 4th, Wang SC, Hebrok M. KRAS, Hedgehog, Wnt and the twisted developmental biology of pancreatic ductal adenocarcinoma. Nat Rev Cancer Oct;10(10): Thayer SP, di Magliano MP, Heiser PW, et al. Hedgehog is an early and late mediator of pancreatic cancer tumorigenesis. Nature Oct 23;425(6960): Satoh K, Kanno A, Hamada S, et al. Expression of Sonic hedgehog signaling pathway correlates with the tumorigenesis of intraductal papillary mucinous neoplasm of the pancreas. Oncol Rep May;19(5): Morton JP, Mongeau ME, Klimstra DS, et al. Sonic hedgehog acts at multiple stages during pancreatic tumorigenesis. Proc Natl Acad Sci U S A Mar 20;104(12): Dai J, Ai K, Du Y, et al. Sonic hedgehog expression correlates with distant metastasis in pancreatic adenocarcinoma. Pancreas Mar;40(2): Feldmann G, Karikari C, dal Molin M, et al. Inactivation of Brca2 cooperates with Trp53(R172H) to induce invasive pancreatic ductal adenocarcinomas in mice: a mouse model of familial pancreatic cancer. Cancer Biol Ther Jun 1;11(11): Maitra A, Hruban RH. Pancreatic cancer. Annu Rev Pathol. 2008;3: Redston MS, Caldas C, Seymour AB, et al. p53 mutations in pancreatic carcinoma and evidence of common involvement of homocopolymer tracts in DNA microdeletions. Cancer Res. 
1994;54: Iacobuzio-Donahue CA, Klimstra DS, et al. Dpc-4 protein is expressed in virtually all human intraductal papillary mucinous neoplasms of the pancreas: comparison with conventional ductal carcinomas. Am J Pathol. 2000;157(3): Blackford A, Serrano OK, Wolfgang CL, et al. SMAD4 gene mutations are associated with poor prognosis in pancreatic cancer. Clin Cancer Res Jul 15;15(14):

148 van Heek NT, Meeker AK, Kern SE, et al. Telomere shortening is nearly universal in pancreatic intraepithelial neoplasia. Am J Pathol. 2002;161: Siveke JT, Schmid RM. Chromosomal instability in mouse metastatic pancreatic cancer--it's Kras and Tp53 after all. Cancer Cell May;7(5): Hiyama E, Kodama T, Shinbara K, et al. Telomerase activity is detected in pancreatic cancer but not in benign tumors. Cancer Res Jan 15;57(2): Sato N, Goggins M. The role of epigenetic alterations in pancreatic cancer. J Hepatobiliary Pancreat Surg. 2006;13: Sato N, Maitra A, Fukushima N, et al. Frequent hypomethylation of multiple genes overexpressed in pancreatic ductal adenocarcinoma. Cancer Res. 2003;63: Szafranska AE, Davison TS, John J, et al. MicroRNA expression alterations are linked to tumorigenesis and non-neoplastic processes in pancreatic ductal adenocarcinoma. Oncogene 2007;26: Erkan M, Reiser-Erkan C, Michalski CW, et al. Tumor microenvironment and progression of pancreatic cancer. Exp Oncol Sep;32(3): Jones S, Zhang X, Parsons DW, et al. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science Sep 26;321(5897): Campbell PJ, Yachida S, Mudie LJ, et al. The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature Oct 28;467(7319): Fuchs CS, Colditz GA, Stampfer MJ, et al. A prospective study of cigarette smoking and the risk of pancreatic cancer. Arch Intern Med Oct 28;156(19): Genkinger JM, Spiegelman D, Anderson KE, et al. Alcohol intake and pancreatic cancer risk: a pooled analysis of fourteen cohort studies. Cancer Epidemiol Biomarkers Prev Mar;18(3): Santibañez M, Vioque J, Alguacil J, et al. Occupational exposures and risk of pancreatic cancer. Eur J Epidemiol Oct;25(10): Huxley R, Ansary-Moghaddam A, Berrington de González A, et al. Type-II diabetes and pancreatic cancer: a meta-analysis of 36 studies. Br J Cancer. 2005;92: Risch HA, Yu H, Lu L, Kidd MS. 
ABO blood group, Helicobacter pylori seropositivity, and risk of pancreatic cancer: a case-control study. J Natl Cancer Inst Apr 7;102(7): Talamini G, Falconi M, Bassi C, et al. Incidence of cancer in the course of chronic pancreatitis. Am J Gastroenterol May;94(5): Eppel A, Cotterchio M, Gallinger S. Allergies are associated with reduced pancreas cancer risk: A population-based case-control study in Ontario, Canada. Int J Cancer Nov 15;121(10):

149 Bao Y, Ng K, Wolpin BM, et al. Predicted vitamin D status and pancreatic cancer risk in two prospective cohort studies. Br J Cancer Apr 27;102(9): Stolzenberg-Solomon RZ, Jacobs EJ, Arslan AA, et al. Circulating 25-hydroxyvitamin D and risk of pancreatic cancer: Cohort Consortium Vitamin D Pooling Project of Rarer Cancers. Am J Epidemiol Jul 1;172(1): Jansen RJ, Robinson DP, Stolzenberg-Solomon RZ, et al. Fruit and vegetable consumption is inversely associated with having pancreatic cancer. Cancer Causes Control Dec;22(12): Jiao L, Mitrou PN, Reedy J, et al. A combined healthy lifestyle score and risk of pancreatic cancer in a large cohort study. Arch Intern Med Apr 27;169(8): Prizment AE, Gross M, Rasmussen-Torvik L, et al. Genes related to diabetes may be associated with pancreatic cancer in a population-based case-control study in Minnesota. Pancreas Jan;41(1): Dong X, Li Y, Tang H, et al. Insulin-like growth factor axis gene polymorphisms modify risk of pancreatic cancer. Cancer Epidemiol Apr;36(2): Li D, Tanaka M, Brunicardi FC, et al. Association between somatostatin receptor 5 gene polymorphisms and pancreatic cancer risk and survival. Cancer Jul 1;117(13): Dong X, Li Y, Chang P, et al. DNA mismatch repair network gene polymorphism as a susceptibility factor for pancreatic cancer. Mol Carcinog Jun 16. doi: /mc Pierce BL, Ahsan H. Genome-wide "pleiotropy scan" identifies HNF1A region as a novel pancreatic cancer susceptibility locus. Cancer Res Jul 1;71(13): Theodoropoulos GE, Panoussopoulos GS, Michalopoulos NV, et al. Analysis of the stromal cellderived factor 1-3'A gene polymorphism in pancreatic cancer. Mol Med Report Jul- Aug;3(4): Pierce BL, Austin MA, Ahsan H. Association study of type 2 diabetes genetic susceptibility variants and risk of pancreatic cancer: an analysis of PanScan-I data. Cancer Causes Control Jun;22(6): Mazaki T, Masuda H, Takayama T. Polymorphisms and pancreatic cancer risk: a meta-analysis. 
Eur J Cancer Prev May;20(3): Dong X, Li Y, Chang P, et al. Glucose metabolism gene variants modulate the risk of pancreatic cancer. Cancer Prev Res (Phila) May;4(5): Diergaarde B, Brand R, Lamb J, et al. Pooling-based genome-wide association study implicates gamma-glutamyltransferase 1 (GGT1) gene in pancreatic carcinogenesis. Pancreatology. 2010;10(2-3):

150 Theodoropoulos GE, Michalopoulos NV, Panoussopoulos SG, et al. Effects of caspase-9 and survivin gene polymorphisms in pancreatic cancer risk and tumor characteristics. Pancreas Oct;39(7): Fong PY, Fesinmeyer MD, White E, et al. Association of diabetes susceptibility gene calpain-10 with pancreatic cancer among smokers. J Gastrointest Cancer Sep;41(3): Chen J, Amos CI, Merriman KW, et al. Genetic variants of p21 and p27 and pancreatic cancer risk in non-hispanic Whites: a case-control study. Pancreas Jan;39(1): Vrana D, Novotny J, Holcatova I, et al. CYP1B1 gene polymorphism modifies pancreatic cancer risk but not survival. Neoplasma. 2010;57(1): McWilliams RR, Petersen GM, Rabe KG, et al. Cystic fibrosis transmembrane conductance regulator (CFTR) gene mutations and risk for pancreatic adenocarcinoma. Cancer Jan 1;116(1): Vrana D, Pikhart H, Mohelnikova-Duchonova B, et al. The association between glutathione S- transferase gene polymorphisms and pancreatic cancer in a central European Slavonic population. Mutat Res Nov-Dec;680(1-2): Duell EJ, Holly EA, Kelsey KT, et al. Genetic variation in CYP17A1 and pancreatic cancer in a population-based case-control study in the San Francisco Bay Area, California. Int J Cancer Feb 1;126(3): Fesinmeyer MD, Stanford JL, Brentnall TA, et al. Association between the peroxisome proliferatoractivated receptor gamma Pro12Ala variant and haplotype and pancreatic cancer in a high-risk cohort of smokers: a pilot study. Pancreas Aug;38(6): Zhao D, Xu D, Zhang X, et al. Interaction of cyclooxygenase-2 variants and smoking in pancreatic cancer: a possible role of nucleophosmin. Gastroenterology May;136(5): McWilliams RR, Bamlet WR, de Andrade M, et al. Nucleotide excision repair pathway polymorphisms and pancreatic cancer risk: evidence for role of MMS19L. Cancer Epidemiol Biomarkers Prev Apr;18(4): Hamacher R, Diersch S, Scheibel M, et al. Interleukin 1 beta gene promoter SNPs are associated with risk of pancreatic cancer. 
Cytokine May;46(2): Li D, Suzuki H, Liu B, et al. DNA repair gene polymorphisms and risk of pancreatic cancer. Clin Cancer Res Jan 15;15(2): Suzuki H, Li Y, Dong X, et al. Effect of insulin-like growth factor gene polymorphisms alone or in interaction with diabetes on the risk of pancreatic cancer. Cancer Epidemiol Biomarkers Prev Dec;17(12): Suzuki T, Matsuo K, Sawaki A, et al. Alcohol drinking and one-carbon metabolism-related gene polymorphisms on pancreatic cancer risk. Cancer Epidemiol Biomarkers Prev Oct;17(10):

151 Ohnami S, Sato Y, Yoshimura K, et al. His595Tyr polymorphism in the methionine synthase reductase (MTRR) gene is associated with pancreatic cancer risk. Gastroenterology Aug;135(2): Yang M, Sun T, Wang L, et al. Functional variants in cell death pathway genes and risk of pancreatic cancer. Clin Cancer Res May 15;14(10): Ayaz L, Ercan B, Dirlik M, et al. The association between N-acetyltransferase 2 gene polymorphisms and pancreatic cancer. Cell Biochem Funct Apr;26(3): Jiao L, Hassan MM, Bondy ML, et al. XRCC2 and XRCC3 gene polymorphism and risk of pancreatic cancer. Am J Gastroenterol Feb;103(2): Jiao L, Hassan MM, Bondy ML, et al. The XPD Asp312Asn and Lys751Gln polymorphisms, corresponding haplotype, and pancreatic cancer risk. Cancer Lett Jan 8;245(1-2): Wang L, Miao X, Tan W, et al. Genetic polymorphisms in methylenetetrahydrofolate reductase and thymidylate synthase and risk of pancreatic cancer. Clin Gastroenterol Hepatol Aug;3(8): Li D, Jiao L, Li Y, et al. Polymorphisms of cytochrome P4501A2 and N-acetyltransferase genes, smoking, and risk of pancreatic cancer. Carcinogenesis Jan;27(1): Bartsch DK, Fendrich V, Slater EP, et al. RNASEL germline variants are associated with pancreatic cancer. Int J Cancer Dec 10;117(5): Ockenga J, Vogel A, Teich N, et al. UDP glucuronosyltransferase (UGT1A7) gene polymorphisms increase the risk of chronic pancreatitis and pancreatic cancer. Gastroenterology Jun;124(7): Duell EJ, Holly EA, Bracci PM, et al. A population-based study of the Arg399Gln polymorphism in X-ray repair cross- complementing group 1 (XRCC1) and risk of pancreatic adenocarcinoma. Cancer Res Aug 15;62(16): Amundadottir L, Kraft P, Stolzenberg-Solomon RZ, et al. Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nat Genet Sep;41(9): Petersen GM, Amundadottir L, Fuchs CS, et al. 
A genome-wide association study identifies pancreatic cancer susceptibility loci on chromosomes 13q22.1, 1q32.1 and 5p Nat Genet Mar;42(3): Low SK, Kuchiba A, Zembutsu H, et al. Genome-wide association study of pancreatic cancer in Japanese population. PLoS One Jul 29;5(7):e Wu C, Miao X, Huang L, et al. Genome-wide association study identifies five loci associated with susceptibility to pancreatic cancer in Chinese populations. Nat Genet Dec 11;44(1):62-6.

152 Wolpin BM, Kraft P, Gross M, et al. Pancreatic cancer risk and ABO blood group alleles: results from the pancreatic cancer cohort consortium. Cancer Res Feb 1;70(3): Risch HA, Yu H, Lu L, et al. ABO blood group, Helicobacter pylori seropositivity, and risk of pancreatic cancer: a case-control study. J Natl Cancer Inst Apr 7;102(7): Greer JB, Yazer MH, Raval JS, et al. Significant association between ABO blood group and pancreatic cancer. World J Gastroenterol Nov 28;16(44): Iodice S, Maisonneuve P, Botteri E, et al. ABO blood group and cancer. Eur J Cancer Dec;46(18): Wolpin BM, Kraft P, Xu M, et al. Variant ABO blood group alleles, secretor status, and risk of pancreatic cancer: results from the pancreatic cancer cohort consortium. Cancer Epidemiol Biomarkers Prev Dec;19(12): Ben Q, Wang K, Yuan Y, et al. Pancreatic cancer incidence and outcome in relation to ABO blood groups among Han Chinese patients: a case-control study. Int J Cancer Mar 1;128(5): Nakao M, Matsuo K, Hosono S, et al. ABO blood group alleles and the risk of pancreatic cancer in a Japanese population. Cancer Sci May;102(5): Wang DS, Chen DL, Ren C, et al. ABO blood group, hepatitis B viral infection and risk of pancreatic cancer. Int J Cancer Aug 19. doi: /ijc [Epub ahead of print] 96. Aird I, Lee DR, Roberts JA. ABO blood groups and cancer of oesophagus, cancerof pancreas, and pituitary adenoma. Br Med J Apr 16;1(5180): Lennon AM, Klein AP, Goggins M. ABO blood group and other genetic variants associated with pancreatic cancer. Genome Med Jun 22;2(6): Giardiello FM, Welsh SB, Hamilton SR, et al. Increased risk of cancer in the Peutz-Jeghers syndrome. N Engl J Med Jun 11;316(24): Giardiello FM, Brensinger JD, Tersmette AC, et al. Very high risk of cancer in familial Peutz-Jeghers syndrome. Gastroenterology Dec;119(6): Lowenfels AB, Maisonneuve P, Cavallini G, et al. Pancreatitis and the risk of pancreatic cancer. International Pancreatitis Study Group. 
N Engl J Med May 20;328(20): Lowenfels AB, Maisonneuve P, DiMagno EP, et al. Hereditary pancreatitis and the risk of pancreatic cancer. International Hereditary Pancreatitis Study Group. J Natl Cancer Inst Mar 19;89(6): de Snoo FA, Riedijk SR, van Mil AM, et al. Genetic testing in familial melanoma: uptake and implications. Psychooncology Aug;17(8): Hahn SA, Greenhalf B, Ellis I, et al. BRCA2 germline mutations in familial pancreatic carcinoma. J Natl Cancer Inst Feb 5;95(3):

153 Murphy KM, Brune KA, Griffin C, et al. Evaluation of candidate genes MAP2K4, MADH4, ACVR1B, and BRCA2 in familial pancreatic cancer: deleterious BRCA2 mutations in 17%. Cancer Res Jul 1;62(13): Martin ST, Matsubayashi H, Rogers CD, et al. Increased prevalence of the BRCA2 polymorphic stop codon K3326X among individuals with familial pancreatic cancer. Oncogene May 19;24(22): Stadler ZK, Salo-Mullen E, Patil SM, et al. Prevalence of BRCA1 and BRCA2 mutations in Ashkenazi Jewish families with breast and pancreatic cancer. Cancer Jan 15;118(2): Ghiorzo P, Pensotti V, Fornarini G, et al. Contribution of germline mutations in the BRCA and PALB2 genes to pancreatic cancer in Italy. Fam Cancer Mar;11(1): Schneider R, Slater EP, Sina M, et al. German national case collection for familial pancreatic cancer (FaPaCa): ten years experience. Fam Cancer Jun;10(2): Slater EP, Langer P, Fendrich V, et al. Prevalence of BRCA2 and CDKN2a mutations in German familial pancreatic cancer families. Fam Cancer Sep;9(3): Cho JH, Bang S, Park SW, et al. BRCA2 mutations as a universal risk factor for pancreatic cancer has a limited role in Korean ethnic group. Pancreas May;36(4): Real FX, Malats N, Lesca G, et al. Family history of cancer and germline BRCA2 mutations in sporadic exocrine pancreatic cancer. Gut May;50(5): Greer JB, Whitcomb DC. Role of BRCA1 and BRCA2 mutations in pancreatic cancer. Gut May;56(5): Goggins M, Schutte M, Lu J, et al. Germline BRCA2 gene mutations in patients with apparently sporadic pancreatic carcinomas. Cancer Res Dec 1;56(23): Wooster R, Neuhausen SL, Mangion J, et al. Localization of a breast cancer susceptibility gene, BRCA2, to chromosome 13q Science Sep 30;265(5181): Schutte M, da Costa LT, Hahn SA, et al. Identification by representational difference analysis of a homozygous deletion in pancreatic carcinoma that lies within the BRCA2 region. Proc Natl Acad Sci U S A Jun 20;92(13): Schutte M, Rozenblum E, Moskaluk CA, et al. 
An integrated high-resolution physical map of the DPC/BRCA2 region at chromosome 13q12. Cancer Res Oct 15;55(20): Jones S, Hruban RH, Kamiyama M, et al. Exomic sequencing identifies PALB2 as a pancreatic cancer susceptibility gene. Science Apr 10;324(5924): Tischkowitz MD, Sabbaghian N, Hamel N, et al. Analysis of the gene coding for the BRCA2- interacting protein PALB2 in familial and sporadic pancreatic cancer. Gastroenterology Sep;137(3):

154 Slater EP, Langer P, Niemczyk E, et al. PALB2 mutations in European familial pancreatic cancer families. Clin Genet Nov;78(5): Adank MA, van Mil SE, Gille JJ, et al. PALB2 analysis in BRCA2-like families. Breast Cancer Res Treat Jun;127(2): Lal G, Liu G, Schmocker B, et al. Inherited predisposition to pancreatic adenocarcinoma: role of family history and germ-line p16, BRCA1, and BRCA2 mutations. Cancer Res Jan 15;60(2): Skudra S, Staka A, Pukitis A, et al. Association of genetic variants with pancreatic cancer. Cancer Genet Cytogenet 2007;179: Axilbund JE, Argani P, Kamiyama M, et al. Absence of germline BRCA1 mutations in familial pancreatic cancer patients. Cancer Biol Ther Jan;8(2): Roberts NJ, Jiao Y, Yu J, et al. ATM mutations in patients with hereditary pancreatic cancer. Cancer Discov Jan;2: van der Heijden MS, Yeo CJ, Hruban RH, et al. Fanconi anemia gene mutations in young-onset pancreatic cancer. Cancer Res May 15;63(10): Rogers CD, van der Heijden MS, Brune K, et al. The genetics of FANCC and FANCG in familial pancreatic cancer. Cancer Biol Ther Feb;3(2): Rogers CD, Couch FJ, Brune K, et al. Genetics of the FANCA gene in familial pancreatic cancer. J Med Genet Dec;41(12):e Couch FJ, Johnson MR, Rabe K, et al. Germ line Fanconi anemia complementation group C mutations and pancreatic cancer. Cancer Res Jan 15;65(2): Gargiulo S, Torrini M, Ollila S, et al. Germline MLH1 and MSH2 mutations in Italian pancreatic cancer patients with suspected Lynch syndrome. Fam Cancer. 2009;8(4): Kastrinos F, Mukherjee B, Tayob N, et al. Risk of pancreatic cancer in families with Lynch syndrome. JAMA Oct 28;302(16): Kempers MJ, Kuiper RP, Ockeloen CW, et al. Risk of colorectal and endometrial cancers in EPCAM deletion-positive Lynch syndrome: a cohort study. Lancet Oncol Jan;12(1): Lindor NM, Petersen GM, Spurdle AB, et al. Pancreatic cancer and a novel MSH2 germline alteration. Pancreas Oct;40(7): Ruijs MW, Verhoef S, Rookus MA, et al. 
TP53 germline mutation testing in 180 families suspected of Li-Fraumeni syndrome: mutation detection rate and relative frequency of cancers in different familial phenotypes. J Med Genet Jun;47(6): Groen EJ, Roos A, Muntinghe FL, et al. Extra-intestinal manifestations of familial adenomatous polyposis. Ann Surg Oncol Sep;15(9):

155 Sheldon CD, Hodson ME, Carpenter LM, et al. A cohort study of cystic fibrosis and malignancy. Br J Cancer Nov;68(5): Hruban RH, Canto MI, Goggins M, et al. Update on familial pancreatic cancer. Adv Surg. 2010;44: MacDermott RP, Kramer P. Adenocarcinoma of the pancreas in four siblings. Gastroenterology Jul;65(1): Friedman JM, Fialkow PJ. Carcinoma of the pancreas in four brothers. Birth Defects Orig Artic Ser. 1976;12(1): Danes BS, Lynch HT. A familial aggregation of pancreatic cancer. An in vitro study. JAMA May 28;247(20): Dat NM, Sontag SJ. Pancreatic carcinoma in brothers. Ann Intern Med Aug;97(2): Grajower MM. Familial pancreatic cancer. Ann Intern Med Jan;98(1): Ehrenthal D, Haeger L, Griffin T, et al. Familial pancreatic adenocarcinoma in three generations. A case report and a review of the literature. Cancer May 1;59(9): Lynch HT, Fitzsimmons ML, Smyrk TC, et al. Familial pancreatic cancer: clinicopathologic study of 18 nuclear families. Am J Gastroenterol Jan;85(1): Ghadirian P, Boyle P, Simard A, et al. Reported family aggregation of pancreatic cancer within a population-based case-control study in the Francophone community in Montreal, Canada. Int J Pancreatol Nov-Dec;10(3-4): Fernandez E, La Vecchia C, D'Avanzo B, et al. Family history and the risk of liver, gallbladder, and pancreatic cancer. Cancer Epidemiol Biomarkers Prev Apr-May;3(3): Silverman DT, Schiffman M, Everhart J, et al. Diabetes mellitus, other medical conditions and familial history of cancer as risk factors for pancreatic cancer. Br J Cancer Aug;80(11): Schenk M, Schwartz AG, O'Neal E, et al. Familial risk of pancreatic cancer. J Natl Cancer Inst Apr 18;93(8): Ghadirian P, Liu G, Gallinger S, et al. Risk of pancreatic cancer among individuals with a family history of cancer of the pancreas. Int J Cancer Feb 20;97(6): Inoue M, Tajima K, Takezaki T, et al. 
Epidemiology of pancreatic cancer in Japan: a nested casecontrol study from the Hospital-based Epidemiologic Research Program at Aichi Cancer Center (HERPACC). Int J Epidemiol Apr;32(2): Rulyak SJ, Lowenfels AB, Maisonneuve P, et al. Risk factors for the development of pancreatic cancer in familial pancreatic cancer kindreds. Gastroenterology May;124(5): Cote ML, Schenk M, Schwartz AG, et al. Risk of other cancers in individuals with a family history of pancreas cancer. J Gastrointest Cancer. 2007;38(2-4):

156 Hassan MM, Bondy ML, Wolff RA, et al. Risk factors for pancreatic cancer: case-control study. Am J Gastroenterol Dec;102(12): Jacobs EJ, Chanock SJ, Fuchs CS, et al. Family history of cancer and risk of pancreatic cancer: a pooled analysis from the Pancreatic Cancer Cohort Consortium (PanScan). Int J Cancer Sep 1;127(6): Matsubayashi H, Maeda A, Kanemoto H, et al. Risk factors of familial pancreatic cancer in Japan: current smoking and recent onset of diabetes. Pancreas Aug;40(6): Coughlin SS, Calle EE, Patel AV, et al. Predictors of pancreatic cancer mortality among a large cohort of United States adults. Cancer Causes Control Dec;11(10): Tersmette AC, Petersen GM, Offerhaus GJ, et al. Increased risk of incident pancreatic cancer among first-degree relatives of patients with familial pancreatic cancer. Clin Cancer Res Mar;7(3): Hemminki K, Li X. Familial and second primary pancreatic cancers: a nationwide epidemiologic study from Sweden. Int J Cancer Feb 10;103(4): Klein AP, Brune KA, Petersen GM, et al. Prospective risk of pancreatic cancer in familial pancreatic cancer kindreds. Cancer Res Apr 1;64(7): Jacobs EJ, Rodriguez C, Newton CC, et al. Family history of various cancers and pancreatic cancer mortality in a large cohort. Cancer Causes Control Oct;20(8): Brune KA, Lau B, Palmisano E, et al. Importance of age of onset in pancreatic cancer kindreds. J Natl Cancer Inst Jan 20;102(2): Klein AP, Beaty TH, Bailey-Wilson JE, et al. Evidence for a major gene influencing risk of pancreatic cancer. Genet Epidemiol Aug;23(2): Lynch HT, Fusaro L, Lynch JF. Familial pancreatic cancer: a family study. Pancreas. 1992;7(5): Bartsch DK, Kress R, Sina-Frey M, et al. Prevalence of familial pancreatic cancer in Germany. Int J Cancer Jul 20;110(6): James TA, Sheldon DG, Rajput A, et al. Risk factors associated with earlier age of onset in familial pancreatic carcinoma. Cancer Dec 15;101(12): Petersen GM, de Andrade M, Goggins M, et al. 
Pancreatic cancer genetic epidemiology consortium. Cancer Epidemiol Biomarkers Prev Apr;15(4): McFaul CD, Greenhalf W, Earl J, et al. Anticipation in familial pancreatic cancer. Gut Feb;55(2): Rieder H, Sina-Frey M, Ziegler A, et al. German national case collection of familial pancreatic cancer - clinical-genetic analysis of the first 21 families. Onkologie Jun;25(3):262-6.

157 Rulyak SJ, Lowenfels AB, Maisonneuve P, et al. Risk factors for the development of pancreatic cancer in familial pancreatic cancer kindreds. Gastroenterology May;124(5): Schneider R, Slater EP, Sina M, et al. German national case collection for familial pancreatic cancer (FaPaCa): ten years experience. Fam Cancer Jun;10(2): Olson SH, Chou JF, Ludwig E, et al. Allergies, obesity, other risk factors and survival from pancreatic cancer. Int J Cancer Nov 15;127(10): Barton JG, Schnelldorfer T, Lohse CM, et al. Patterns of pancreatic resection differ between patients with familial and sporadic pancreatic cancer. J Gastrointest Surg May;15(5): Ji J, Forsti A, Sundquist J, et al. Survival in familial pancreatic cancer. Pancreatology. 2008;8(3): Yeo TP, Hruban RH, Brody J, et al. Assessment of "gene-environment" interaction in cases of familial and sporadic pancreatic cancer. J Gastrointest Surg Aug;13(8): Fogelman DR, Wolff RA, Kopetz S, et al. Evidence for the efficacy of Iniparib, a PARP-1 inhibitor, in BRCA2-associated pancreatic cancer. Anticancer Res Apr;31(4): Villarroel MC, Rajeshkumar NV, Garrido-Laguna I, et al. Personalizing cancer treatment in the age of global genomic analyses: PALB2 gene mutations and the response to DNA damaging agents in pancreatic cancer. Mol Cancer Ther Jan;10(1): James E, Waldron-Lynch MG, Saif MW. Prolonged survival in a patient with BRCA2 associated metastatic pancreatic cancer after exposure to camptothecin: a case report and review of literature. Anticancer Drugs Aug;20(7): Sonnenblick A, Kadouri L, Appelbaum L, et al. Complete remission, in BRCA2 mutation carrier with metastatic pancreatic adenocarcinoma, treated with cisplatin based therapy. Cancer Biol Ther Aug 1;12(3): Lowery MA, Kelsen DP, Stadler ZK, et al. An emerging entity: pancreatic adenocarcinoma associated with a known BRCA mutation: clinical descriptors, treatment implications, and future directions. Oncologist. 2011;16(10): Shi C, Klein AP, Goggins M, et al. 
Increased Prevalence of Precursor Lesions in Familial Pancreatic Cancer Patients. Clin Cancer Res Dec 15;15(24): Brune K, Abe T, Canto M, et al. Multifocal neoplastic precursor lesions associated with lobular atrophy of the pancreas in patients having a strong family history of pancreatic cancer. Am J Surg Pathol Sep;30(9): Abe T, Fukushima N, Brune K, et al. Genome-wide allelotypes of familial pancreatic adenocarcinomas and familial and sporadic intraductal papillary mucinous neoplasms. Clin Cancer Res Oct 15;13(20):

158 Iacobuzio-Donahue CA, van der Heijden MS, Baumgartner MR, et al. Large-scale allelotype of pancreaticobiliary carcinoma provides quantitative estimates of genome-wide allelic loss. Cancer Res Feb 1;64(3): Calhoun ES, Hucl T, Gallmeier E, et al. Identifying allelic loss and homozygous deletions in pancreatic cancer without matched normals using high-density single-nucleotide polymorphism arrays. Cancer Res Aug 15;66(16): Brune K, Hong SM, Li A, et al. Genetic and epigenetic alterations of familial pancreatic cancers. Cancer Epidemiol Biomarkers Prev Dec;17(12): Bodmer WF, Bailey CJ, Bodmer J, et al. Localization of the gene for familial adenomatous polyposis on chromosome 5. Nature Aug 13-19;328(6131): Hall JM, Lee MK, Newman B, et al. Linkage of early-onset familial breast cancer to chromosome 17q21. Science Dec 21;250(4988): Eberle MA, Pfützer R, Pogue-Geile KL, et al. A new susceptibility locus for autosomal dominant pancreatic cancer maps to chromosome 4q Am J Hum Genet Apr;70(4): Earl J, Yan L, Vitone LJ, et al. Evaluation of the 4q32-34 locus in European familial pancreatic cancer. Cancer Epidemiol Biomarkers Prev Oct;15(10): Klein AP, de Andrade M, Hruban RH, et al. Linkage analysis of chromosome 4 in families with familial pancreatic cancer. Cancer Biol Ther Mar;6(3): Pogue-Geile KL, Chen R, Bronner MP, et al. Palladin mutation causes familial pancreatic cancer and suggests a new cancer mechanism. PLoS Med Dec;3(12):e Salaria SN, Illei P, Sharma R, et al. Palladin is overexpressed in the non-neoplastic stroma of infiltrating ductal adenocarcinomas of the pancreas, but is only rarely overexpressed in neoplastic cells. Cancer Biol Ther Mar;6(3): Zogopoulos G, Rothenmund H, Eppel A, et al. The P239S palladin variant does not account for a significant fraction of hereditary or early onset pancreas cancer. Hum Genet Jun;121(5): Slater E, Amrillaeva V, Fendrich V, et al. Palladin mutation causes familial pancreatic cancer: absence in European families. 
PLoS Med Apr;4(4):e Klein AP, Borges M, Griffith M, et al. Absence of deleterious palladin mutations in patients with familial pancreatic cancer. Cancer Epidemiol Biomarkers Prev Apr;18(4): Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet May;12(5): Morrow EM. Genomic copy number variation in disorders of cognitive development. J Am Acad Child Adolesc Psychiatry Nov;49(11): Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science. 2004;305:

159 Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet. 2004;36: Sharp AJ, Locke DP, McGrath SD, et al. Segmental duplications and copy-number variation in the human genome. Am J Hum Genet. 2005;77: Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet. 2005;37: Conrad DF, Andrews TD, Carter NP, et al. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet. 2006;38: McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet. 2006;38: Hinds DA, Kloek AP, Jen M, et al. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat Genet. 2006;38: Locke DP, Sharp AJ, McCarroll SA, et al. Linkage disequilibrium and heritability of copynumber polymorphisms within duplicated regions of the human genome. Am J Hum Genet. 2006;79: Mills RE, Luttig CT, Larkins CE, et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res. 2006;16: Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature. 2006;444: Simon-Sanchez J, Scholz S, Fung HC, et al. Genome-wide SNP assay reveals structural genomic variation, extended homozygosity and cell-line induced alterations in normal individuals. Hum Mol Genet. 2007;16: Wong KK, deleeuw RJ, Dosanjh NS, et al. A comprehensive analysis of common copy-number variations in the human genome. Am J Hum Genet. 2007;80: Levy S, Sutton G, Ng PC, et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007;318: Pinto D, Marshall C, Feuk L, et al. Copy-number variation in control population cohorts. Hum Mol Genet. 2007;16 Spec No. 2:R Wang K, Li M, Hadley D, et al. 
PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007;17: Zogopoulos G, Ha KC, Naqib F, et al. Germ-line DNA copy number variation frequencies in a large North American population. Hum Genet. 2007;122:

160 desmith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet. 2007;16: Jakobsson M, Scholz SW, Scheet P, et al. Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008;451: Perry GH, Ben-Dor A, Tsalenko A, et al. The fine-scale and complex architecture of human copy-number variation. Am J Hum Genet. 2008;82: Takahashi N, Tsuyama N, Sasaki K, et al. Segmental copy-number variation observed in Japanese by array-cgh. Ann Hum Genet. 2008;72: Wheeler DA, Srinivasan M, Egholm M, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452: McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet Oct;40(10): Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet Oct;40(10): Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature. 2008;453: Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature Nov 6;456(7218): Wang J, Wang W, Li R, et al. The diploid genome sequence of an Asian individual. Nature Nov 6;456(7218): Gusev A, Lowe JK, Stoffel M, et al. Whole population, genome-wide mapping of hidden relatedness. Genome Res Feb;19(2): Itsara A, Cooper GM, Baker C, et al. Population analysis of large copy number variants and hotspots of human genetic disease. Am J Hum Genet Feb;84(2): Shaikh TH, Gai X, Perin JC, et al. High-resolution mapping and analysis of copy number variations in the human genome: a data resource for clinical and research applications. Genome Res Sep;19(9): Kim JI, Ju YS, Park H, et al. 
A highly annotated whole-genome sequence of a Korean individual. Nature Aug 20;460(7258): Ahn SM, Kim TH, Lee S, et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res Sep;19(9): Matsuzaki H, Wang PH, Hu J, et al. High resolution discovery and confirmation of copy number variants in 90 Yoruba Nigerians. Genome Biol. 2009;10(11):R125.

161 McKernan KJ, Peckham HE, Costa GL, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res Sep;19(9): McElroy JP, Nelson MR, Caillier SJ, et al. Copy number variation in African Americans. BMC Genet Mar 24;10: Conrad DF, Pinto D, Redon R, et al. Origins and functional impact of copy number variation in the human genome. Nature Apr 1;464(7289): Alkan C, Kidd JM, Marques-Bonet T, et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet Oct;41(10): Lin CH, Lin YC, Wu JY, et al. A genome-wide survey of copy number variations in Han Chinese residing in Taiwan. Genomics Oct;94(4): Li J, Yang T, Wang L, et al. Whole genome distribution and ethnic differentiation of copy number variation in Caucasian and Asian populations. PLoS One Nov 23;4(11):e International HapMap 3 Consortium, Altshuler DM, Gibbs RA, et al. Integrating common and rare genetic variation in diverse human populations. Nature Sep 2;467(7311): Ju YS, Hong D, Kim S, et al. Reference-unbiased copy number variant analysis using CGH microarrays. Nucleic Acids Res Nov;38(20):e Pang AW, MacDonald JR, Pinto D, et al. Towards a comprehensive structural variation map of an individual human genome. Genome Biol. 2010;11(5):R Park H, Kim JI, Ju YS, et al. Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nat Genet May;42(5): Teague B, Waterman MS, Goldstein S, et al. High-resolution human genome structure by singlemolecule analysis. Proc Natl Acad Sci U S A Jun 15;107(24): Kidd JM, Sampas N, Antonacci F, et al. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Methods May;7(5): Kidd JM, Graves T, Newman TL, et al. A human genome structural variation sequencing resource reveals insights into mutational mechanisms. 
Cell Nov 24;143(5): Schuster SC, Miller W, Ratan A, et al. Complete Khoisan and Bantu genomes from southern Africa. Nature Feb 18;463(7283): Yim SH, Kim TM, Hu HJ, et al. Copy number variations in East-Asian population and their evolutionary and functional implications. Hum Mol Genet Mar 15;19(6): Gayán J, Galan JJ, González-Pérez A, et al. Genetic structure of the Spanish population. BMC Genomics May 25;11:326.

162 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature Oct 28;467(7319): Mills RE, Walter K, Stewart C, et al. Mapping copy number variation by population-scale genome sequencing. Nature Feb 3;470(7332): Chen W, Hayward C, Wright AF, et al. Copy number variation across European populations. PLoS One. 2011;6(8):e Moon S, Kim YJ, Hong CB, et al. Data-driven approach to detect common copy-number variations and frequency profiles in a population-based Korean cohort. Eur J Hum Genet Nov;19(11): Helen V. Firth, Shola M. et al. DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources. Am J Hum Genet. 2009;84(4): Feenstra I, Fang J, Koolen DA, et al. European Cytogeneticists Association Register of Unbalanced Chromosome Aberrations (ECARUCA); an online database for rare chromosome abnormalities. Eur J Med Genet Jul-Aug;49(4): Futreal PA, Coin L, Marshall M, et al. A census of human cancer genes. Nat Rev Cancer Mar;4(3): Cutts RJ, Gadaleta E, Hahn SA, et al. The Pancreatic Expression database: 2011 update. Nucleic Acids Res Jan;39(Database issue):d Malcolm S. Microdeletion and microduplication syndromes. Prenat Diagn Dec;16(13): Medvedev, P., Stanciu, M. & Brudno, M. Computational methods for discovering structural variation with next-generation sequencing. Nature Methods. 2009;6:S13 S Riley MC, Kirkup BC Jr, Johnson JD, et al. Rapid whole genome optical mapping of Plasmodium falciparum. Malar J Aug 26;10: Kim Y, Kim KS, Kounovsky KL, et al. Nanochannel confinement: DNA stretch approaching full contour length. Lab Chip May 21;11(10): Xu MY, Aragon AD, Mascarenas MR, et al. Dual primer emulsion PCR for next- generation DNA sequencing. Biotechniques May;48(5): Winchester L, Yau C, Ragoussis J. Comparing CNV detection methods for SNP arrays. Brief Funct Genomic Proteomic Sep;8(5): Gautam P, Jha P, Kumar D, et al. 
Spectrum of large copy number variations in 26 diverse Indian populations: potential involvement in phenotypic diversity. Hum Genet Jan;131(1): Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet Jul;39(7 Suppl):S7-15.

163 Stankiewicz P, Pursley AN, Cheung SW. Challenges in clinical interpretation of microduplications detected by array CGH analysis. Am J Med Genet A May;152A(5): Hastings PJ, Lupski JR, Rosenberg SM, et al. Mechanisms of change in gene copy number. Nat Rev Genet Aug;10(8): Lee C, Scherer SW. The clinical context of copy number variation in the human genome. Expert Rev Mol Med Mar 9;12:e Schrider DR, Hahn MW. Gene copy-number polymorphism in nature. Proc Biol Sci Nov 7;277(1698): Nguyen DQ, Webber C, Hehir-Kwa J, et al. Reduced purifying selection prevails over positive selection in human copy number variant evolution. Genome Res Nov;18(11): Perry GH, Dominy NJ, Claw KG, et al. Diet and the evolution of human amylase gene copy number variation. Nat Genet Oct;39(10): Yim SH, Kim TM, Hu HJ, et al. Copy number variations in East-Asian population and their evolutionary and functional implications. Hum Mol Genet Mar 15;19(6): Perry GH, Yang F, Marques-Bonet T, et al. Copy number variation and evolution in humans and chimpanzees. Genome Res Nov;18(11): Stranger BE, Forrest MS, Dunning M, et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science Feb 9;315(5813): Schlattl A, Anders S, Waszak SM, et al. Relating CNVs to transcriptome data at fine resolution: assessment of the effect of variant size, type, and overlap with functional regions. Genome Res Dec;21(12): Henrichsen CN, Vinckenbosch N, Zöllner S, et al. Segmental copy number variation shapes tissue transcriptomes.nat Genet Apr;41(4): Guryev V, Saar K, Adamovic T, et al. Distribution and functional impact of DNA copy number variation in the rat. Nat Genet May;40(5): Zhou J, Lemos B, Dopman EB, et al. Copy-number variation: the balance between gene dosage and expression in Drosophila melanogaster. Genome Biol Evol. 2011;3: Nuytemans K, Meeus B, Crosiers D, et al. Relative contribution of simple mutations vs. 
copy number variations in five Parkinson disease genes in the Belgian population. Hum Mutat Jul;30(7): Walters RG, Jacquemont S, Valsesia A, et al. A new highly penetrant form of obesity due to deletions on chromosome 16p11.2. Nature Feb 4;463(7281): Prescott NJ, Dominy KM, Kubo M, et al. Independent and population-specific association of risk variants at the IRGM locus with Crohn's disease. Hum Mol Genet May 1;19(9):

164 Wellcome Trust Case Control Consortium, Craddock N, Hurles ME, et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature Apr 1;464(7289): de Cid R, Riveira-Munoz E, Zeeuwen PL, et al. Deletion of the late cornified envelope LCE3B and LCE3C genes as a susceptibility factor for psoriasis. Nat Genet Feb;41(2): Morris DL, Roberts AL, Witherden AS, et al. Evidence for both copy number and allelic (NA1/NA2) risk at the FCGR3B locus in systemic lupus erythematosus. Eur J Hum Genet Sep;18(9): Gonzalez E, Kulkarni H, Bolivar H, et al. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science Mar 4;307(5714): O'Donovan MC, Kirov G, Owen MJ. Phenotypic variations on the theme of CNVs. Nat Genet Dec;40(12): Itsara A, Wu H, Smith JD, et al. De novo rates and selection of large copy number variation. Genome Res Nov;20(11): Piotrowski A, Bruder CE, Andersson R, et al. Somatic mosaicism for copy number variation in differentiated human tissues. Hum Mutat Sep;29(9): Rodríguez-Santiago B, Malats N, Rothman N, et al. Mosaic uniparental disomies and aneuploidies as large structural variants of the human genome. Am J Hum Genet Jul 9;87(1): Bruder CE, Piotrowski A, Gijsbers AA, et al. Phenotypically concordant and discordant monozygotic twins display different DNA copy-number-variation profiles. Am J Hum Genet Mar;82(3): Sasaki H, Emi M, Iijima H, et al. Copy number loss of (src homology 2 domain containing)- transforming protein 2 (SHC2) gene: discordant loss in monozygotic twins and frequent loss in patients with multiple system atrophy. Mol Brain Jun 10;4: Pamphlett R, Morahan JM. Copy number imbalances in blood and hair in monozygotic twins discordant for amyotrophic lateral sclerosis. J Clin Neurosci Sep;18(9): Thompson SL, Bakhoum SF, Compton DA. Mechanisms of chromosomal instability. Curr Biol Mar 23;20(6):R Thompson SL, Compton DA. Chromosomes and cancer cells. 
Chromosome Res Apr;19(3): Meza-Zepeda LA, Kresse SH, Barragan-Polania AH, et al. Array comparative genomic hybridization reveals distinct DNA copy number differences between gastrointestinal stromal tumors and leiomyosarcomas. Cancer Res Sep 15;66(18):

165 Vollebergh MA, Lips EH, Nederlof PM, et al. An acgh classifier derived from BRCA1-mutated breast cancer and benefit of high-dose platinum-based chemotherapy in HER2-negative breast cancer patients. Ann Oncol Jul;22(7): Johansson B, Bardi G, Heim S, et al. Nonrandom chromosomal rearrangements in pancreatic carcinomas. Cancer Apr 1;69(7): Brat DJ, Hahn SA, Griffin CA, et al. The structural basis of molecular genetic deletions. An integration of classical cytogenetic and molecular analyses in pancreatic adenocarcinoma. Am J Pathol Feb;150(2): Heidenblad M, Schoenmakers EF, Jonson T, et al. Genome-wide array-based comparative genomic hybridization reveals multiple amplification targets and novel homozygous deletions in pancreatic carcinoma cell lines. Cancer Res May 1;64(9): Aguirre AJ, Brennan C, Bailey G, et al. High-resolution characterization of the pancreatic adenocarcinoma genome. Proc Natl Acad Sci U S A Jun 15;101(24): Holzmann K, Kohlhammer H, Schwaenen C, et al. Genomic DNA-chip hybridization reveals a higher incidence of genomic amplifications in pancreatic cancer than conventional comparative genomic hybridization and leads to the identification of novel candidate genes. Cancer Res Jul 1;64(13): Mahlamäki EH, Kauraniemi P, Monni O, et al. High-resolution genomic and expression profiling reveals 105 putative amplification target genes in pancreatic cancer. Neoplasia Sep- Oct;6(5): Bashyam MD, Bair R, Kim YH, et al. Array-based comparative genomic hybridization identifies localized DNA amplifications and homozygous deletions in pancreatic cancer. Neoplasia Jun;7(6): Nowak NJ, Gaile D, Conroy JM, et al. Genome-wide aberrations in pancreatic adenocarcinoma. Cancer Genet Cytogenet Aug;161(1): Loukopoulos P, Shibata T, Katoh H, et al. Genome-wide array-based comparative genomic hybridization analysis of pancreatic adenocarcinoma: identification of genetic indicators that predict patient outcome. Cancer Sci Mar;98(3): Harada T, Baril P, Gangeswaran R, et al. 
Identification of genetic alterations in pancreatic cancer by the combined use of tissue microdissection and array-based comparative genomic hybridisation. Br J Cancer Jan 29;96(2): Suzuki A, Shibata T, Shimada Y, et al. Identification of SMURF1 as a possible target for 7q amplification detected in a pancreatic cancer cell line by in-house array-based comparative genomic hybridization. Cancer Sci May;99(5):

166 Kwei KA, Bashyam MD, Kao J, et al. Genomic profiling identifies GATA6 as a candidate oncogene amplified in pancreatobiliary cancer. PLoS Genet May 23;4(5):e Harada T, Chelala C, Crnogorac-Jurcevic T, et al. Genome-wide analysis of pancreatic cancer using microarray-based techniques. Pancreatology. 2009;9(1-2): Birnbaum DJ, Adélaïde J, Mamessier E, et al. Genome profiling of pancreatic adenocarcinoma. Genes Chromosomes Cancer Jun;50(6): Calhoun ES, Hucl T, Gallmeier E, et al. Identifying allelic loss and homozygous deletions in pancreatic cancer without matched normals using high-density single-nucleotide polymorphism arrays. Cancer Res Aug 15;66(16): Harada T, Chelala C, Bhakta V, et al. Genome-wide DNA copy number analysis in pancreatic cancer using high-density single nucleotide polymorphism arrays. Oncogene Mar 20;27(13): Lin LJ, Asaoka Y, Tada M, et al. Integrated analysis of copy number alterations and loss of heterozygosity in human pancreatic cancer using a high-resolution, single nucleotide polymorphism array. Oncology. 2008;75(1-2): Fu B, Luo M, Lakkur S, et al. Frequent genomic copy number gain and overexpression of GATA- 6 in pancreatic carcinoma. Cancer Biol Ther Oct;7(10): Michils G, Tejpar S, Thoelen R, et al. Large deletions of the APC gene in 15% of mutationnegative patients with classical polyposis (FAP): a Belgian study. Hum Mutat Feb;25(2): Richards FM, Crossey PA, Phipps ME, et al. Detailed mapping of germline deletions of the von Hippel-Lindau disease tumour suppressor gene. Hum Mol Genet Apr;3(4): Oliveira C, Senz J, Kaurah P, et al. Germline CDH1 deletions in hereditary diffuse gastric cancer families. Hum Mol Genet May 1;18(9): Palanca Suela S, Esteban Cardeñosa E, Barragán González E, et al. Identification of a novel BRCA1 large genomic rearrangement in a Spanish breast/ovarian cancer family. Breast Cancer Res Treat Nov;112(1): Vasickova P, Machackova E, Lukesova M, et al. 
High occurrence of BRCA1 intragenic rearrangements in hereditary breast and ovarian cancer syndrome in the Czech Republic. BMC Med Genet Jun 11;8: Buffone A, Capalbo C, Ricevuto E, et al. Prevalence of BRCA1 and BRCA2 genomic rearrangements in a cohort of consecutive Italian breast and/or ovarian cancer families. Breast Cancer Res Treat Dec;106(2):

167 Smith LD, Tesoriero AA, Ramus SJ, et al. BRCA1 promoter deletions in young women with breast cancer and a strong family history: a population-based study. Eur J Cancer Mar;43(5): Casilli F, Tournier I, Sinilnikova OM, et al. The contribution of germline rearrangements to the spectrum of BRCA2 mutations. J Med Genet Sep;43(9):e Walsh T, Casadei S, Coats KH, et al. Spectrum of mutations in BRCA1, BRCA2, CHEK2, and TP53 in families at high risk of breast cancer. JAMA Mar 22;295(12): Gad S, Caux-Moncoutier V, Pagès-Berhouet S, et al. Significant contribution of large BRCA1 gene rearrangements in 120 French breast and ovarian cancer families. Oncogene Oct 3;21(44): Taylor CF, Charlton RS, Burn J, et al. Genomic deletions in MSH2 or MLH1 are a frequent cause of hereditary non-polyposis colorectal cancer: identification of novel and recurrent deletions by MLPA. Hum Mutat Dec;22(6): Gylling A, Ridanpää M, Vierimaa O, et al. Large genomic rearrangements and germline epimutations in Lynch syndrome. Int J Cancer May 15;124(10): Hearle NC, Rudd MF, Lim W, et al. Exonic STK11 deletions are not a rare cause of Peutz- Jeghers syndrome. J Med Genet Apr;43(4):e van Hattem WA, Brosens LA, de Leng WW, et al. Large genomic deletions of SMAD4, BMPR1A and PTEN in juvenile polyposis. Gut May;57(5): Blanco A, de la Hoya M, Balmaña J, et al. Detection of a large rearrangement in PALB2 in Spanish breast cancer families with male breast cancer. Breast Cancer Res Treat Feb;132(1): Sabatier R, Adélaïde J, Finetti P, et al. BARD1 homozygous deletion, a possible alternative to BRCA1 mutation in basal breast cancer. Genes Chromosomes Cancer Dec;49(12): Ahvenainen T, Lehtonen HJ, Lehtonen R, et al. Mutation screening of fumarate hydratase by multiplex ligation-dependent probe amplification: detection of exonic deletion in a patient with leiomyomatosis and renal cell cancer. Cancer Genet Cytogenet Jun;183(2): Chibon F, Primois C, Bressieux JM, et al. 
Contribution of PTEN large rearrangements in Cowden disease: a multiplex amplifiable probe hybridisation (MAPH) screening approach. J Med Genet Oct;45(10): Knappskog S, Geisler J, Arnesen T, et al. A novel type of deletion in the CDKN2A gene identified in a melanoma-prone family. Genes Chromosomes Cancer Dec;45(12): Wu R, López-Correa C, Rutkowski JL, et al. Germline mutations in NF1 patients with malignancies. Genes Chromosomes Cancer Dec;26(4):

Broeks A, de Klein A, Floore AN, et al. ATM germline mutations in classical ataxia-telangiectasia patients in the Dutch population. Hum Mutat. 1998;12(5): Plummer SJ, Santibáñez-Koref M, Kurosaki T, et al. A germline 2.35 kb deletion of p53 genomic DNA creating a specific loss of the oligomerization domain inherited in a Li-Fraumeni syndrome family. Oncogene Nov;9(11): Otterson GA, Chen W, Coxon AB, et al. Incomplete penetrance of familial retinoblastoma linked to germ-line mutations that result in partial loss of RB function. Proc Natl Acad Sci U S A Oct 28;94(22): Fukuuchi A, Nagamura Y, Yaguchi H, et al. A whole MEN1 gene deletion flanked by Alu repeats in a family with multiple endocrine neoplasia type 1. Jpn J Clin Oncol Nov;36(11): Rumilla K, Schowalter KV, Lindor NM, et al. Frequency of deletions of EPCAM (TACSTD1) in MSH2-associated Lynch syndrome cases. J Mol Diagn Jan;13(1): Kuiper RP, Vissers LE, Venkatachalam R, et al. Recurrence and variability of germline EPCAM deletions in Lynch syndrome. Hum Mutat Apr;32(4): Calva-Cerqueira D, Dahdaleh FS, Woodfield G, et al. Discovery of the BMPR1A promoter and germline mutations that cause juvenile polyposis. Hum Mol Genet Dec 1;19(23): Nørskov MS, Frikke-Schmidt R, Bojesen SE, et al. Copy number variation in glutathione-S-transferase T1 and M1 predicts incidence and 5-year survival from prostate and bladder cancer, and incidence of corpus uteri cancer in the general population. Pharmacogenomics J Aug;11(4): Frank B, Bermejo JL, Hemminki K, et al. Copy number variant in the candidate tumor suppressor gene MTUS1 and familial breast cancer risk. Carcinogenesis Jul;28(7): Diskin SJ, Hou C, Glessner JT, et al. Copy number variation at 1q21.1 associated with neuroblastoma. Nature Jun 18;459(7249): Liu W, Sun J, Li G, et al. Association of a germ-line copy number variation at 2p24.3 and risk for aggressive prostate cancer. Cancer Res Mar 15;69(6): Jin G, Sun J, Liu W, et al. 
Genome-wide copy-number variation analysis identifies common genetic variants at 20p13 associated with aggressiveness of prostate cancer. Carcinogenesis Jul;32(7): Tse KP, Su WH, Yang ML, et al. A gender-specific association of CNV at 6p21.3 with NPC susceptibility. Hum Mol Genet Jul 15;20(14): Huang L, Yu D, Wu C, et al. Copy number variation at 6q13 functions as a long-range regulator and is associated with pancreatic cancer risk. Carcinogenesis Jan;33(1):

Lucito R, Suresh S, Walter K, et al. Copy-number variants in patients with a strong family history of pancreatic cancer. Cancer Biol Ther Oct;6(10): Yoshihara K, Tajima A, Adachi S, et al. Germline copy number variations in BRCA1-associated ovarian cancer patients. Genes Chromosomes Cancer Mar;50(3): Venkatachalam R, Verwiel ET, Kamping EJ, et al. Identification of candidate predisposing copy number variants in familial and early-onset colorectal cancer patients. Int J Cancer Oct 1;129(7): Shlien A, Tabori U, Marshall CR, et al. Excessive genomic DNA copy number variation in the Li-Fraumeni cancer predisposition syndrome. Proc Natl Acad Sci U S A Aug 12;105(32): Talos F, Moll UM. Role of the p53 family in stabilizing the genome and preventing polyploidization. Adv Exp Med Biol. 2010;676: McPherson JD, Marra M, Hillier L, et al. A physical map of the human genome. Nature Feb 15;409(6822): Venter JC, Adams MD, Myers EW, et al. The sequence of the human genome. Science Feb 16;291(5507): Sachidanandam R, Weissman D, Schmidt SC, et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature Feb 15;409(6822): Margulies M, Egholm M, Altman WE, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature Sep 15;437(7057): Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet Jan;11(1): Wadman M. James Watson's genome sequenced at high speed. Nature Apr 17;452(7189): Kitzman JO, Mackenzie AP, Adey A, et al. Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat Biotechnol Jan;29(1): Cirulli ET, Singh A, Shianna KV, et al. Screening the human exome: a comparison of whole genome and whole transcriptome sequencing. Genome Biol. 2010;11(5):R Tong P, Prendergast JG, Lohan AJ, et al. Sequencing and analysis of an Irish human genome. Genome Biol. 2010;11(9):R Fujimoto A, Nakagawa H, Hosono N, et al. 
Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing. Nat Genet Nov;42(11): Mardis ER. A decade's perspective on DNA sequencing technology. Nature Feb 10;470(7333):

170 Kahn SD. On the future of genomic data. Science Feb 11;331(6018): McPherson JD. Next-generation gap. Nat Methods Nov;6(11 Suppl):S Schadt EE, Turner S, Kasarskis A. A window into third-generation sequencing. Hum Mol Genet Oct 15;19(R2):R Ashley EA, Butte AJ, Wheeler MT, et al. Clinical assessment incorporating a personal genome. Lancet May 1;375(9725): Roach JC, Glusman G, Smit AF, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science Apr 30;328(5978): Lupski JR, Reid JG, Gonzaga-Jauregui C, et al. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N Engl J Med Apr 1;362(13): Sobreira NL, Cirulli ET, Avramopoulos D, et al. Whole-genome sequencing of a single proband together with linkage analysis identifies a Mendelian disease gene. PLoS Genet Jun 17;6(6):e Bainbridge MN, Wiszniewski W, Murdock DR, et al. Whole-genome sequencing for optimized patient management. Sci Transl Med Jun 15;3(87):87re Dewey FE, Chen R, Cordero SP, et al. Phased whole-genome genetic risk in a family quartet using a major allele reference sequence. PLoS Genet Sep;7(9):e Baranzini SE, Mudge J, van Velkinburgh JC, et al. Genome, epigenome and RNA sequences of monozygotic twins discordant for multiple sclerosis. Nature Apr 29;464(7293): Rios J, Stein E, Shendure J, et al. Identification by whole-genome resequencing of gene defect responsible for severe hypercholesterolemia. Hum Mol Genet Nov 15;19(22): Hodges E, Xuan Z, Balija V, et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet Dec;39(12): Garber K. Fixing the front end. Nat Biotechnol Oct;26(10): Pruitt KD, Harrow J, Harte RA, et al. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res Jul;19(7): Asan, Xu Y, Jiang H, et al. Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol Sep 28;12(9):R Ng PC, Levy S, Huang J, et al. 
Genetic variation in an individual human exome. PLoS Genet Aug 15;4(8):e Ng SB, Turner EH, Robertson PD, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature Sep 10;461(7261): Vissers LE, de Ligt J, Gilissen C, et al. A de novo paradigm for mental retardation. Nat Genet Dec;42(12):

171 Walsh T, Shahin H, Elkan-Miller T, et al. Whole exome sequencing and homozygosity mapping identify mutation in the cell polarity protein GPSM2 as the cause of nonsyndromic hearing loss DFNB82. Am J Hum Genet Jul 9;87(1): Lalonde E, Albrecht S, Ha KC, et al. Unexpected allelic heterogeneity and spectrum of mutations in Fowler syndrome revealed by next-generation exome sequencing. Hum Mutat Aug;31(8): Pierce SB, Walsh T, Chisholm KM, et al. Mutations in the DBP-deficiency protein HSD17B4 cause ovarian dysgenesis, hearing loss, and ataxia of Perrault Syndrome. Am J Hum Genet Aug 13;87(2): Ng SB, Bigham AW, Buckingham KJ, et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet Sep;42(9): Bilgüvar K, Oztürk AK, Louvi A, et al. Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Nature Sep 9;467(7312): Gilissen C, Arts HH, Hoischen A, et al. Exome sequencing identifies WDR35 variants involved in Sensenbrenner syndrome. Am J Hum Genet Sep 10;87(3): Krawitz PM, Schweiger MR, Rödelsperger C, et al. Identity-by-descent filtering of exome sequence data identifies PIGV mutations in hyperphosphatasia mental retardation syndrome. Nat Genet Oct;42(10): Anastasio N, Ben-Omran T, Teebi A, et al. Mutations in SCARF2 are responsible for Van Den Ende-Gupta syndrome. Am J Hum Genet Oct 8;87(4): Johnson JO, Gibbs JR, Van Maldergem L, Houlden H, Singleton AB. Exome sequencing in Brown-Vialetto-van Laere syndrome. Am J Hum Genet Oct 8;87(4):567-9; author reply Sirmaci A, Walsh T, Akay H, et al. MASP1 mutations in patients with facial, umbilical, coccygeal, and auditory findings of Carnevale, Malpuech, OSA, and Michels syndromes. Am J Hum Genet Nov 12;87(5): Haack TB, Danhauser K, Haberberger B, et al. Exome sequencing identifies ACAD9 mutations as a cause of complex I deficiency. Nat Genet Dec;42(12): Wang JL, Yang X, Xia K, et al. 
TGM6 identified as a novel causative gene of spinocerebellar ataxias using exome sequencing. Brain Dec;133(Pt 12): Musunuru K, Pirruccello JP, Do R, et al. Exome sequencing, ANGPTL3 mutations, and familial combined hypolipidemia. N Engl J Med Dec 2;363(23): Johnson JO, Mandrioli J, Benatar M, et al. Exome sequencing reveals VCP mutations as a cause of familial ALS. Neuron Dec 9;68(5):

172 Bolze A, Byun M, McDonald D, et al. Whole-exome-sequencing-based discovery of human FADD deficiency. Am J Hum Genet Dec 10;87(6): Liu W, Morito D, Takashima S, et al. Identification of RNF213 as a susceptibility gene for moyamoya disease and its possible role in vascular development. PLoS One. 2011;6(7):e Züchner S, Dallman J, Wen R, et al. Whole-exome sequencing links a variant in DHDDS to retinitis pigmentosa. Am J Hum Genet Feb 11;88(2): Glazov EA, Zankl A, Donskoi M, et al. Whole-exome re-sequencing in a family quartet identifies POP1 mutations as the cause of a novel skeletal dysplasia. PLoS Genet Mar;7(3):e Worthey EA, Mayer AN, Syverson GD, et al. Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet Med Mar;13(3): Simpson MA, Irving MD, Asilmaz E, et al. Mutations in NOTCH2 cause Hajdu-Cheney syndrome, a disorder of severe and progressive bone loss. Nat Genet Mar 6;43(4): Becker J, Semler O, Gilissen C, et al. Exome sequencing identifies truncating mutations in human SERPINF1 in autosomal-recessive osteogenesis imperfecta. Am J Hum Genet Mar 11;88(3): Ostergaard P, Simpson MA, Brice G, et al. Rapid identification of mutations in GJC2 in primary lymphoedema using whole exome sequencing combined with linkage analysis with delineation of the phenotype. J Med Genet Apr;48(4): Çalışkan M, Chong JX, Uricchio L, et al. Exome sequencing reveals a novel mutation for autosomal recessive non-syndromic mental retardation in the TECR gene on chromosome 19p13. Hum Mol Genet Apr 1;20(7): Erlich Y, Edvardson S, Hodges E, et al. Exome sequencing and disease-network analysis of a single family implicate a mutation in KIF1A in hereditary spastic paraparesis. Genome Res May;21(5): Sundaram SK, Huq AM, Sun Z, et al. Exome sequencing of a pedigree with Tourette syndrome or chronic tic disorder. Ann Neurol May;69(5): Puente XS, Quesada V, Osorio FG, et al. 
Exome sequencing and functional analysis identifies BANF1 mutation as the cause of a hereditary progeroid syndrome. Am J Hum Genet May 13;88(5): Vissers LE, Lausch E, Unger S, et al. Chondrodysplasia and abnormal joint development associated with mutations in IMPAD1, encoding the Golgi-resident nucleotide phosphatase, gPAPP. Am J Hum Genet May 13;88(5):

O'Sullivan J, Bitu CC, Daly SB, et al. Whole-Exome sequencing identifies FAM20A mutations as a cause of amelogenesis imperfecta and gingival hyperplasia syndrome. Am J Hum Genet May 13;88(5): Götz A, Tyynismaa H, Euro L, et al. Exome sequencing identifies mitochondrial alanyl-tRNA synthetase mutations in infantile mitochondrial cardiomyopathy. Am J Hum Genet May 13;88(5): Shi Y, Li Y, Zhang D, et al. Exome sequencing identifies ZNF644 mutations in high myopia. PLoS Genet Jun;7(6):e Klein CJ, Botuyan MV, Wu Y, et al. Mutations in DNMT1 cause hereditary sensory neuropathy with dementia and hearing loss. Nat Genet Jun;43(6): Barak T, Kwan KY, Louvi A, et al. Recessive LAMC3 mutations cause malformations of occipital cortical development. Nat Genet Jun;43(6): O'Roak BJ, Deriziotis P, Lee C, et al. Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nat Genet Jun;43(6): Alvarado DM, Buchan JG, Gurnett CA, et al. Exome sequencing identifies an MYH3 mutation in a family with distal arthrogryposis type 1. J Bone Joint Surg Am Jun 1;93(11): de Greef JC, Wang J, Balog J, et al. Mutations in ZBTB24 are associated with immunodeficiency, centromeric instability, and facial anomalies syndrome type 2. Am J Hum Genet Jun 10;88(6): Yamaguchi T, Hosomichi K, Narita A, et al. Exome resequencing combined with linkage analysis identifies novel PTH1R variants in primary failure of tooth eruption in Japanese. J Bone Miner Res Jul;26(7): Zhou C, Zang D, Jin Y, et al. Mutation in ribosomal protein L21 underlies hereditary hypotrichosis simplex. Hum Mutat Jul;32(7): Le Goff C, Mahaut C, Wang LW, et al. Mutations in the TGFβ binding-protein-like domain 5 of FBN1 are responsible for acromicric and geleophysic dysplasias. Am J Hum Genet Jul 15;89(1): Hanson D, Murray PG, O'Sullivan J, et al. Exome sequencing identifies CCDC8 mutations in 3-M syndrome, suggesting that CCDC8 contributes in a pathway with CUL7 and OBSL1 to control human growth. 
Am J Hum Genet Jul 15;89(1): Vilariño-Güell C, Wider C, Ross OA, et al. VPS35 mutations in Parkinson disease. Am J Hum Genet Jul 15;89(1): Zimprich A, Benet-Pagès A, Struhal W, et al. A mutation in VPS35, encoding a subunit of the retromer complex, causes late-onset Parkinson disease. Am J Hum Genet Jul 15;89(1):

Sergouniotis PI, Davidson AE, Mackay DS, et al. Recessive mutations in KCNJ13, encoding an inwardly rectifying potassium channel subunit, cause Leber congenital amaurosis. Am J Hum Genet Jul 15;89(1): Albers CA, Cvejic A, Favier R, et al. Exome sequencing identifies NBEAL2 as the causative gene for gray platelet syndrome. Nat Genet Jul 17;43(8): Sanna-Cherchi S, Burgess KE, Nees SN, et al. Exome sequencing identified MYO1E and NEIL1 as candidate genes for human autosomal recessive steroid-resistant nephrotic syndrome. Kidney Int Aug;80(4): Liu L, Okada S, Kong XF, et al. Gain-of-function human STAT1 mutations impair IL-17 immunity and underlie chronic mucocutaneous candidiasis. J Exp Med Aug 1;208(8): Yariz KO, Walsh T, Uzak A, et al. Inherited mutation of the luteinizing hormone/choriogonadotropin receptor (LHCGR) in empty follicle syndrome. Fertil Steril Aug;96(2):e Xu B, Roos JL, Dexheimer P, et al. Exome sequencing supports a de novo mutational paradigm for schizophrenia. Nat Genet Aug 7;43(9): Sirmaci A, Spiliopoulos M, Brancati F, et al. Mutations in ANKRD11 cause KBG syndrome, characterized by intellectual disability, skeletal malformations, and macrodontia. Am J Hum Genet Aug 12;89(2): Shaheen R, Faqeih E, Sunker A, et al. Recessive mutations in DOCK6, encoding the guanidine nucleotide exchange factor DOCK6, lead to abnormal actin cytoskeleton organization and Adams-Oliver syndrome. Am J Hum Genet Aug 12;89(2): Nosková L, Stránecký V, Hartmannová H, et al. Mutations in DNAJC5, encoding cysteine-string protein alpha, cause autosomal-dominant adult-onset neuronal ceroid lipofuscinosis. Am J Hum Genet Aug 12;89(2): Weedon MN, Hastings R, Caswell R, et al. Exome sequencing identifies a DYNC1H1 mutation in a large pedigree with dominant axonal Charcot-Marie-Tooth disease. Am J Hum Genet Aug 12;89(2): Ozgül RK, Siemiatkowska AM, Yücel D, et al. 
Exome sequencing and cis-regulatory mapping identify mutations in MAK, a gene encoding a regulator of ciliary length, as a cause of retinitis pigmentosa. Am J Hum Genet Aug 12;89(2): Doi H, Yoshida K, Yasuda T, et al. Exome sequencing reveals a homozygous SYT14 mutation in adult-onset, autosomal-recessive spinocerebellar ataxia with psychomotor retardation. Am J Hum Genet Aug 12;89(2): Sloan JL, Johnston JJ, Manoli I, et al. Exome sequencing identifies ACSF3 as a cause of combined malonic and methylmalonic aciduria. Nat Genet Aug 14;43(9):883-6.

Aldahmesh MA, Khan AO, Mohamed JY, et al. Identification of ADAMTS18 as a gene mutated in Knobloch syndrome. J Med Genet Sep;48(9): Murdock DR, Clark GD, Bainbridge MN, et al. Whole-exome sequencing identifies compound heterozygous mutations in WDR62 in siblings with recurrent polymicrogyria. Am J Med Genet A Sep;155A(9): Regalado ES, Guo DC, Villamizar C, et al. Exome sequencing identifies SMAD3 mutations as a cause of familial thoracic aortic aneurysm and dissection with intracranial and other arterial aneurysms. Circ Res Sep 2;109(6): Dickinson RE, Griffin H, Bigley V, et al. Exome sequencing identifies GATA-2 mutation as the cause of dendritic cell, monocyte, B and NK lymphoid deficiency. Blood Sep 8;118(10): Hor H, Bartesaghi L, Kutalik Z, et al. A missense mutation in myelin oligodendrocyte glycoprotein as a cause of familial narcolepsy with cataplexy. Am J Hum Genet Sep 9;89(3): Marti-Masso JF, Ruiz-Martínez J, Makarov V, et al. Exome sequencing identifies GCDH (glutaryl-CoA dehydrogenase) mutations as a cause of a progressive form of early-onset generalized dystonia. Hum Genet Mar;131(3): Tariq M, Belmont JW, Lalani S, et al. SHROOM3 is a novel candidate for heterotaxy identified by whole exome sequencing. Genome Biol Sep 21;12(9):R Takata A, Kato M, Nakamura M, et al. Exome sequencing identifies a novel missense variant in RRM2B associated with autosomal recessive progressive external ophthalmoplegia. Genome Biol Sep 28;12(9):R Theis JL, Sharpe KM, Matsumoto ME, et al. Homozygosity mapping and exome sequencing reveal GATAD1 mutation in autosomal recessive dilated cardiomyopathy. Circ Cardiovasc Genet Dec;4(6): Pierson TM, Adams D, Bonn F, et al. Whole-exome sequencing identifies homozygous AFG3L2 mutations in a spastic ataxia-neuropathy syndrome linked to mitochondrial m-AAA proteases. PLoS Genet Oct;7(10):e Al Badr W, Al Bader S, Otto E, et al. 
Exome capture and massively parallel sequencing identifies a novel HPSE2 mutation in a Saudi Arabian child with Ochoa (urofacial) syndrome. J Pediatr Urol Oct;7(5): Cullinane AR, Vilboux T, O'Brien K, et al. Homozygosity mapping and whole-exome sequencing to detect SLC45A2 and G6PC3 mutations in a single patient with oculocutaneous albinism and neutropenia. J Invest Dermatol Oct;131(10):

176 Ovunc B, Otto EA, Vega-Warner V, et al. Exome sequencing reveals cubilin mutation as a single-gene cause of proteinuria. J Am Soc Nephrol Oct;22(10): Bowne SJ, Humphries MM, Sullivan LS, et al. A dominant mutation in RPE65 identified by whole-exome sequencing causes retinitis pigmentosa with choroidal involvement. Eur J Hum Genet Oct;19(10): Kitamura A, Maekawa Y, Uehara H, et al. A mutation in the immunoproteasome subunit PSMB8 causes autoinflammation and lipodystrophy in humans. J Clin Invest Oct;121(10): Tyynismaa H, Sun R, Ahola-Erkkilä S, et al. Thymidine kinase 2 mutations in autosomal recessive progressive external ophthalmoplegia with multiple mitochondrial DNA deletions. Hum Mol Genet Jan 1;21(1): Bjursell MK, Blom HJ, Cayuela JA, et al. Adenosine kinase deficiency disrupts the methionine cycle and causes hypermethioninemia, encephalopathy, and abnormal liver function. Am J Hum Genet Oct 7;89(4): Zangen D, Kaufman Y, Zeligson S, et al. XX ovarian dysgenesis is caused by a PSMC3IP/HOP2 mutation that abolishes coactivation of estrogen-driven transcription. Am J Hum Genet Oct 7;89(4): Galmiche L, Serre V, Beinat M, et al. Exome sequencing identifies MRPL3 mutation in mitochondrial cardiomyopathy. Hum Mutat Nov;32(11): Bredrup C, Saunier S, Oud MM, et al. Ciliopathies with skeletal anomalies and renal insufficiency due to mutations in the IFT-A gene WDR19. Am J Hum Genet Nov 11;89(5): Saitsu H, Osaka H, Sasaki M, et al. Mutations in POLR3A and POLR3B encoding RNA Polymerase III subunits cause an autosomal-recessive hypomyelinating leukoencephalopathy. Am J Hum Genet Nov 11;89(5): Clayton-Smith J, O'Sullivan J, Daly S, et al. Whole-exome-sequencing identifies mutations in histone acetyltransferase gene KAT6B in individuals with the Say-Barber-Biesecker variant of Ohdo syndrome. Am J Hum Genet Nov 11;89(5): Aldahmesh MA, Mohamed JY, Alkuraya HS, et al. Recessive mutations in ELOVL4 cause ichthyosis, intellectual disability, and spastic quadriplegia. 
Am J Hum Genet Dec 9;89(6): Chen WJ, Lin Y, Xiong ZQ, et al. Exome sequencing identifies truncating mutations in PRRT2 that cause paroxysmal kinesigenic dyskinesia. Nat Genet Nov 20;43(12): Logan CV, Lucke B, Pottinger C, et al. Mutations in MEGF10, a regulator of satellite cell myogenesis, cause early onset myopathy, areflexia, respiratory distress and dysphagia (EMARDD). Nat Genet Nov 20;43(12):

Dauber A, Nguyen TT, Sochett E, et al. Genetic defect in CYP24A1, the vitamin D 24-hydroxylase gene, in a patient with severe infantile hypercalcemia. J Clin Endocrinol Metab Feb;97(2):E Shamseldin HE, Faden MA, Alashram W, et al. Identification of a novel DLX5 mutation in a family with autosomal recessive split hand and foot malformation. J Med Genet Jan;49(1): Sergouniotis PI, Davidson AE, Mackay DS, et al. Biallelic mutations in PLA2G5, encoding group V phospholipase A2, cause benign fleck retina. Am J Hum Genet Dec 9;89(6): Berger I, Ben-Neriah Z, Dor-Wolman T, et al. Early prenatal ventriculomegaly due to an AIFM1 mutation identified by linkage analysis and whole exome sequencing. Mol Genet Metab Dec;104(4): Bhat V, Girimaji SC, Mohan G, et al. Mutations in WDR62, encoding a centrosomal and nuclear protein, in Indian primary microcephaly families with cortical malformations. Clin Genet Dec;80(6): Wang X, Wang H, Cao M, et al. Whole-exome sequencing identifies ALMS1, IQCB1, CNGA3, and MYO7A mutations in patients with Leber congenital amaurosis. Hum Mutat Dec;32(12): Adzhubei IA, Schmidt S, Peshkin L, et al. A method and server for predicting damaging missense mutations. Nat Methods Apr;7(4): Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc. 2009;4(7): Pollard KS, Hubisz MJ, Rosenbloom KR, et al. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res Jan;20(1): Cooper GM, Stone EA, Asimenos G, et al. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res Jul;15(7): Melton PE, Pankratz N. Joint analyses of disease and correlated quantitative phenotypes using next-generation sequencing data. Genet Epidemiol. 2011;35 Suppl 1:S Stitziel NO, Kiezun A, Sunyaev S. Computational and statistical approaches to analyzing variants identified by exome sequencing. 
Genome Biol Sep 14;12(9): Ionita-Laza I, Makarov V, Yoon S, et al. Finding disease variants in Mendelian disorders by using sequence data: methods and applications. Am J Hum Genet Dec 9;89(6): Sjöblom T, Jones S, Wood LD, et al. The consensus coding sequences of human breast and colorectal cancers. Science Oct 13;314(5797): Parsons DW, Jones S, Zhang X, et al. An integrated genomic analysis of human glioblastoma multiforme. Science Sep 26;321(5897):

Ley TJ, Mardis ER, Ding L, et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature Nov 6;456(7218): Mardis ER, Ding L, Dooling DJ, et al. Recurring mutations found by sequencing an acute myeloid leukemia genome. N Engl J Med Sep 10;361(11): Ley TJ, Ding L, Walter MJ, et al. DNMT3A mutations in acute myeloid leukemia. N Engl J Med Dec 16;363(25): Shah SP, Morin RD, Khattra J, et al. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature Oct 8;461(7265): Ding L, Ellis MJ, Li S, et al. Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature Apr 15;464(7291): Pleasance ED, Stephens PJ, O'Meara S, et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature Jan 14;463(7278): Lee W, Jiang Z, Liu J, et al. The mutation spectrum revealed by paired genome sequences from a lung cancer patient. Nature May 27;465(7297): Harbour JW, Onken MD, Roberson ED, et al. Frequent mutation of BAP1 in metastasizing uveal melanomas. Science Dec 3;330(6009): Timmermann B, Kerick M, Roehr C, et al. Somatic mutation profiles of MSI and MSS colorectal cancer identified by whole exome next generation sequencing and bioinformatics analysis. PLoS One Dec 22;5(12):e Chapman MA, Lawrence MS, Keats JJ, et al. Initial genome sequencing and analysis of multiple myeloma. Nature Mar 24;471(7339): Totoki Y, Tatsuno K, Yamamoto S, et al. High-resolution characterization of a hepatocellular carcinoma genome. Nat Genet May;43(5): Tiacci E, Trifonov V, Schiavoni G, et al. BRAF mutations in hairy-cell leukemia. N Engl J Med Jun 16;364(24): Pasqualucci L, Trifonov V, Fabbri G, et al. Analysis of the coding genome of diffuse large B-cell lymphoma. Nat Genet Jul 31;43(9): Jiao Y, Shi C, Edil BH, et al. DAXX/ATRX, MEN1, and mTOR pathway genes are frequently altered in pancreatic neuroendocrine tumors. Science Mar 4;331(6021): Wang K, Kan J, Yuen ST, et al. 
Exome sequencing identifies frequent mutation of ARID1A in molecular subtypes of gastric cancer. Nat Genet Oct 30;43(12): International Cancer Genome Consortium, Hudson TJ, Anderson W, et al. International network of cancer genome projects. Nature Apr 15;464(7291): Byun M, Abhyankar A, Lelarge V, et al. Whole-exome sequencing-based discovery of STIM1 deficiency in a child with fatal classic Kaposi sarcoma. J Exp Med Oct 25;207(11):

179 Snape K, Hanks S, Ruark E, et al. Mutations in CEP57 cause mosaic variegated aneuploidy syndrome. Nat Genet Jun;43(6): Comino-Méndez I, Gracia-Aznárez FJ, Schiavi F, et al. Exome sequencing identifies MAX mutations as a cause of hereditary pheochromocytoma. Nat Genet Jun 19;43(7): Saarinen S, Aavikko M, Aittomäki K, et al. Exome sequencing reveals germline NPAT mutation as a candidate risk factor for Hodgkin lymphoma. Blood Jul 21;118(3): Bodmer W, Tomlinson I. Rare genetic variants and the risk of cancer. Curr Opin Genet Dev Jun;20(3): Kote-Jarai Z, Jugurnauth S, Mulholland S, et al. A recurrent truncating germline mutation in the BRIP1/FANCJ gene and susceptibility to prostate cancer. Br J Cancer Jan 27;100(2): Zhang S, Phelan CM, Zhang P, et al. Frequency of the CHEK2 1100delC mutation among women with breast cancer: an international study. Cancer Res Apr 1;68(7): Yokoyama S, Woods SL, Boyle GM, et al. A novel recurrent mutation in MITF predisposes to familial and sporadic melanoma. Nature Nov 13;480(7375): Park DJ, Odefrey FA, Hammet F, et al. FAN1 variants identified in multiple-case early-onset breast cancer families via exome sequencing: no evidence for association with risk for breast cancer. Breast Cancer Res Treat Dec;130(3): Risch HA, McLaughlin JR, Cole DEC, et al. Population BRCA1 and BRCA2 mutation frequencies and cancer penetrances: a kin-cohort study in Ontario, Canada. J Natl Cancer Inst 2006;98: The Breast Cancer Linkage Consortium. Cancer risks in BRCA2 mutation carriers. J Natl Cancer Inst 1999;91: van Asperen CJ, Brohet RM, Meijers-Heijboer EJ, et al. Cancer risks in BRCA2 families: estimates for sites other than breast and ovary. J Med Genet 2005;42: Couch FJ, Johnson MR, Rabe KG, et al. The prevalence of BRCA2 mutations in familial pancreatic cancer. Cancer Epidemiol Biomarkers Prev Feb;16(2): Ferrone CR, Levine DA, Tang LH, et al. BRCA germline mutations in Jewish patients with pancreatic adenocarcinoma. 
J Clin Oncol Jan 20;27(3): Abbott DW, Freeman ML, Holt JT. Double-strand break repair deficiency and radiation sensitivity in BRCA2 mutant cancer cells. J Natl Cancer Inst Jul 1;90(13): Goggins M, Hruban RH, Kern SE. BRCA2 is inactivated late in the development of pancreatic intraepithelial neoplasia: evidence and implications. Am J Pathol May;156(5): Skoulidis F, Cassidy LD, Pisupati V, et al. Germline Brca2 heterozygosity promotes Kras(G12D) -driven carcinogenesis in a murine model of familial pancreatic cancer. Cancer Cell Nov 16;18(5):

180 Rowley M, Ohashi A, Mondal G, et al. Inactivation of Brca2 promotes Trp53-associated but inhibits KrasG12D-dependent pancreatic cancer development in mice. Gastroenterology Apr;140(4): e Feldmann G, Karikari C, dal Molin M, et al. Inactivation of Brca2 cooperates with Trp53(R172H) to induce invasive pancreatic ductal adenocarcinomas in mice: a mouse model of familial pancreatic cancer. Cancer Biol Ther Jun 1;11(11): Thompson D, Easton DF, the Breast Cancer Linkage Consortium. Cancer Incidence in BRCA1 mutation carriers. J Natl Cancer Inst 2002;94: Brose MS, Rebbeck TR, Calzone KA, et al. Cancer risk estimates for BRCA1 mutation carriers identified in a risk evaluation program. J Natl Cancer Inst 2002;94: Beger C, Ramadani M, Meyer S, et al. Down-regulation of BRCA1 in chronic pancreatitis and sporadic pancreatic adenocarcinoma. Clinical Cancer Res 2004;10: Honrado E, Benitez J, Palacios J. The molecular pathology of hereditary breast cancer: genetic testing and therapeutic implications. Mod Pathol 2005;18: Esteller M, Fraga MF, Guo M, et al. DNA methylation patterns in hereditary human cancers mimic sporadic tumorigenesis. Hum Mol Genet 2001;10: Gudmundsdottir K, Ashworth A. The roles of BRCA1 and BRCA2 and associated proteins in the maintenance of genomic stability. Oncogene 2006;25: Struewing JP, Hartge P, Wacholder S, et al. The risk of cancer associated with specific mutations of BRCA1 and BRCA2 among Ashkenazi Jews. N Engl J Med 1997;336: Lynch HT, Deters CA, Snyder CL, et al. BRCA1 and pancreatic cancer: pedigree findings and their causal relationships. Cancer Genetics and Cytogenetics 2005;158: Tonin P, Weber B, Offit K, et al. Frequency of recurrent BRCA1 and BRCA2 mutations in Ashkenazi Jewish breast cancer families. Nat Medicine 1996;2: Gruber SB, Petersen GM. Cancer risk in BRCA1 carriers: time for the next generation of studies. J Natl Cancer Inst 2002;94: Struewing JP, Abeliovich D, Peretz T, et al. 
The carrier frequency of the BRCA1 185delAG mutation is approximately 1 percent in Ashkenazi Jewish individuals. Nat Genet Oct;11(2): Ford D, Easton DF, Peto J. Estimates of the gene frequency of BRCA1 and its contribution to breast and ovarian cancer incidence. Am J Hum Genet Dec;57(6): Kim DH, Crawford B, Ziegler J, et al. Prevalence and characteristics of pancreatic cancer in families with BRCA1 and BRCA2 mutations. Fam Cancer. 2009;8(2): Hall M, Olopade O. Pancreatic cancer and BRCA mutation in familial breast cancer families. Journal of Clinical Oncology 2005;23(16S):9550.

Ozcelik H, Schmoker B, Di Nicola N, et al. Germline BRCA2 6174delT mutations in Ashkenazi Jewish pancreatic cancer patients. Nat Genet 1997;16: Peng DF, Kanai Y, Sawada M, et al. DNA methylation of multiple tumor-related genes in association with overexpression of DNA methyltransferase 1(DNMT1) during multistage carcinogenesis of the pancreas. Carcinogenesis 2006;27: Saif MW. Controversies in adjuvant treatment of pancreatic adenocarcinoma. JOP 2007;8: McCabe N, Turner NC, Lord CJ, et al. Deficiency in the repair of DNA damage by homologous recombination and sensitivity to poly(ADP-ribose) polymerase inhibition. Cancer Res Aug 15;66(16): Yun J, Zhong Q, Kwak JY, et al. Hypersensitivity of Brca1-deficient MEF to the DNA interstrand crosslinking agent mitomycin C is associated with defect in homologous recombination repair and aberrant S-phase arrest. Oncogene 2006;24: Treszezamsky AD, Kachnic LA, Feng Z, et al. BRCA1- and BRCA2-deficient cells are sensitive to etoposide-induced DNA double-strand breaks via topoisomerase II. Cancer Res 2007;67: James E, Waldron-Lynch MG, Saif MW. Prolonged survival in a patient with BRCA2 associated metastatic pancreatic cancer after exposure to camptothecin: a case report and review of literature. Anticancer Drugs Aug;20(7): Sonnenblick A, Kadouri L, Appelbaum L, et al. Complete remission, in BRCA2 mutation carrier with metastatic pancreatic adenocarcinoma, treated with cisplatin based therapy. Cancer Biol Ther Aug 1;12(3): Lowery M, Shah MA, Smyth E, et al. A 67-year-old woman with BRCA 1 mutation associated with pancreatic adenocarcinoma. J Gastrointest Cancer Sep;42(3): Gu W, Lupski JR. CNV and nervous system diseases--what's new? Cytogenet Genome Res. 2008;123(1-4): Alaerts M, Del-Favero J. Searching genetic risk factors for schizophrenia and bipolar disorder: learn from the past and back to the future. Hum Mutat. 2009;30: Schaschl H, Aitman TJ, Vyse TJ. 
Copy number variation in the human genome and its implication in autoimmunity. Clin Exp Immunol. 2009;156: Lanktree M, Hegele RA. Copy number variation in metabolic phenotypes. Cytogenet Genome Res. 2008;123: Karageorgi S, Prescott J, Wong JY, et al. GSTM1 and GSTT1 copy number variation in population-based studies of endometrial cancer risk. Cancer Epidemiol Biomarkers Prev Jul;20(7):

182 Engert S, Wappenschmidt B, Betz B, et al. MLPA screening in the BRCA1 gene from 1,506 German hereditary breast cancer cases: novel deletions, frequent involvement of exon 17, and occurrence in single early-onset cases. Hum Mutat. 2008;29: Madlensky L, Berk TC, Bapat BV, et al. A preventive registry for hereditary nonpolyposis colorectal cancer.can J Oncol. 1995;5: Cotterchio M, Manno M, Klar N, et al. Colorectal screening is associated with reduced colorectal cancer risk: a case-control study within the population-based Ontario Familial Colorectal Cancer Registry. Cancer Causes Control. 2005;16: Stewart AF, Dandona S, Chen L, et al. Kinesin family member 6 variant Trp719Arg does not associate with angiographically defined coronary artery disease in the Ottawa Heart Genomics Study. J Am Coll Cardiol. 2009;53: Krawczak M, Nikolaus S, von Eberstein H, et al. PopGen: population-based recruitment of patiens and controls for the analysis of complex genotype-phenotype relationships. Community Genet. 2006;9: Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155: Li C, Hung Wong W. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol. 2001;2(8):RESEARCH Nannya Y, Sanada M, Nakazaki K, et al. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res. 2005;65: Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet Oct;40(10): Pinto D, Pagnamenta AT, Klei L, et al. Functional impact of global rare copy number variation in autism spectrum disorders. Nature Jul 15;466(7304): Schmittgen TD, Livak KJ. Analyzing real-time PCR data by the comparative C(T) method. Nat Protoc. 2008;3: Zhang J, Feuk L, Duggan GE, et al. 
Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet Genome Res. 2006;115: Higgins ME, Claremont M, Major JE, et al. CancerGenes: a gene selection resource for cancer genome projects. Nucleic Acids Res. 2007;35(Database issue):d Shepherd R, Forbes SA, Beare D, et al. Data mining using the Catalogue of Somatic Mutations in Cancer BioMart. Database (Oxford) 2011:bar018. Print 2011.

183 Jin Q, Gao G, Mulder KM. Requirement of a dynein light chain in TGFbeta/Smad3 signaling. J Cell Physiol Dec;221(3): Jiang J, Yu L, Huang X, et al. Identification of two novel human dynein light chain genes, DNLC2A and DNLC2B, and their expression changes in hepatocellular carcinoma tissues from 68 Chinese patients. Gene. 2001;281: Malinda KM, Kleinman HK. The laminins. Int J Biochem Cell Biol Sep;28(9): Kim YH, Lee HC, Kim SY, et al. Epigenomic analysis of aberrantly methylated genes in colorectal cancer identifies genes commonly affected by epigenetic alterations. Ann Surg Oncol. 2011;18: Scrideli CA, Carlotti CG Jr, Okamoto OK, et al. Gene expression profile analysis of primary glioblastomas and non-neoplastic brain tissue: identification of potential target genes by oligonucleotide microarray and real-time quantitative PCR. J Neurooncol. 2008;88: Pinto D, Darvishi K, Shi X, et al. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nat Biotechnol. 2011;29: Wang H, Linghu H, Wang J, et al. The role of Crk/Dock180/Rac1 pathway in the malignant behavior of human ovarian cancer cell SKOV3. Tumour Biol. 2010;31: Sanders MA, Ampasala D, Basson MD. DOCK5 and DOCK1 regulate Caco-2 intestinal epithelial cell spreading and migration on collagen IV. J Biol Chem. 2009;284: Buchholz M, Braun M, Heidenblut A, et al. Transcriptome analysis of microdissected pancreatic intraepithelial neoplastic lesions. Oncogene. 2005;24: Gu W, Zhang F, Lupski JR. Mechanisms for human genomic rearrangements. Pathogenetics Nov 3;1(1): Pruitt KD, Tatusova T, Brown GR, et al. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res Jan;40(Database issue):d Griffiths-Jones S. mirbase: microrna sequences and annotation. Curr Protoc Bioinformatics Mar;Chapter 12:Unit Hercus C [last accessed date November, 2009] McKenna A, Hanna M, Banks E, et al. 
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res Sep;20(9): Affymetrix. BRLMM: An improved genotype calling method for the GeneChip Mapping 500K Array Set Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc. 2009;4(7): Exome Variant Server, NHLBI Exome Sequencing Project (ESP), Seattle, WA (URL: [last accessed Dec 2011].

184 Ahel I, Ahel D, Matsusaka T, et al. Poly(ADP-ribose)-binding zinc finger motifs in DNA repair/checkpoint proteins. Nature Jan 3;451(7174): Macrae CJ, McCulloch RD, Ylanko J, et al. APLF (C2orf13) facilitates nonhomologous endjoining and undergoes ATM-dependent hyperphosphorylation following ionizing radiation. DNA Repair (Amst) Feb 1;7(2): Allen NP, Donninger H, Vos MD, et al. RASSF6 is a novel member of the RASSF family of tumor suppressors. Oncogene Sep 13;26(42): Ou YY, Mack GJ, Zhang M, et al. CEP110 and ninein are located in a specific domain of the centrosome associated with centrosome maturation. J Cell Sci May 1;115(Pt 9): Carrara S, Cangi MG, Arcidiacono PG, et al. Mucin expression pattern in pancreatic diseases: findings from EUS-guided fine-needle aspiration biopsies. Am J Gastroenterol Jul;106(7): Fletcher O, Houlston RS. Architecture of inherited susceptibility to common cancer. Nat Rev Cancer May;10(5): Mehrotra PV, Ahel D, Ryan DP, et al. DNA repair factor APLF is a histone chaperone. Mol Cell Jan 7;41(1): Okada S, Tokunaga E, Kitao H, et al. Loss of Heterozygosity at BRCA1 Locus Is Significantly Associated with Aggressiveness and Poor Prognosis in Breast Cancer. Ann Surg Oncol Dec 17. [Epub ahead of print] 575. Lane DP. Cancer. p53, guardian of the genome. Nature Jul 2;358(6381): Chen XR, Zhang WZ, Lin XQ, et al. Genetic instability of BRCA1 gene at locus D17S855 is related to clinicopathological behaviors of gastric cancer from Chinese population. World J Gastroenterol Jul 14;12(26): Pestonjamasp PH, Mittra I. Analysis of BRCA1 involvement in breast cancer in Indian women. J Biosci Mar;25(1): Garcia-Patiño E, Gomendio B, Lleonart M, et al. Loss of heterozygosity in the region including the BRCA1 gene on 17q in colon cancer. Cancer Genet Cytogenet Jul 15;104(2):

Appendices

1. Appendix Tables

Table S1: Primers for BRCA1 microsatellite markers

Microsatellite marker  Repeat type    Expected average amplicon size (bp)  Primer sequences                  Annealing temp (°C)
D17S855                dinucleotide   151                                  F: GGA TGG CCT TTT AGA AAG TGG
                                                                           R: ACA CAG ACT TGT CCT ACT GCC
D17S1322               trinucleotide  130                                  F: CTA GCC TGG GCA ACA AAC GA
                                                                           R: GCA GGA AGC AGG AAT GGA AC
D17S579                dinucleotide   123                                  F: AGT CCT GTA GAC AAA ACC TG
                                                                           R: CAG TTT CAT ACC AAG TTC CT
D16S2616               trinucleotide  125                                  F: TGT GAT TCA GTA GGT CTT GGG
                                                                           R: GTG ACT AAA CCT GAC ATT GTG C

Table S2: BRCA1 mutations sequencing primers

Mutation   Expected amplicon size (bp)  Primer sequences                  Annealing temp (°C)
5382insC   109                          F: CAG AGG AGA TGT GGT CAA TG     55
                                        R: GGG GTG AGA TTT TTG TCA AC
185delAG   91                           F: CGT TGA AGA AGT ACA AAA TGT C  59
                                        R: CCC AAA TTA ATA CAC TCT TGT G
2318delG   103                          F: CTA AGT GTT CAA ATA CCA GTG    55
                                        R: GCA TTA TTA GAC ACT TTA ACT G

Table S3: FPC cases in CNV study (Table available as Excel sheet on attached CD)

Table S4: Controls (OFCCR and FGICR) in CNV study (Table available as Excel sheet on attached CD)

Table S5: Primers for qPCR validation of CNVs

CNV ID              F primer                   R primer
D_180               GGAGGACATGGAATTGATGG       CTGCAAGCAAAGATCACCAA
D_19                GTAGCAGAGTGGGCCAAAAA       GGGAAAAATTCACCCCTGAT
D_128               GCAGAATGAAATTTGGCACA       AAGCCACCACTGAGGTTCAC
D_152               CCAGAGAGGATGGTGAGAGG       GCTTTGGGACTGACTGCTTC
D_234 (primer A)    AAGGAGGCTGAGTGGCTACA       CCTTGAAGACCTGGCTTCTG
D_234 (primer B)    AGGGAAGAACACCTCCACCT       ATCCCTCTTCCTTGCTCCAT
D_143 (primer A)    TGCTCCATGGTGCTGATTTA       CACACATCACTGCCCTTCAC
D_143 (primer B)    TCTGTTCCTATTCGGCCATC       TTCTCCCAAACTCCACAAGC
D_220               GCTCCAAGATCCGTTCTGAG       TCATTTGACGCATGACCCTA
D_30 & D_36         TACAGGCAACCCCAGGTATC       CACCCAGCCATGTTTTCTTT  (same region in two samples)
D_40                AAAGAGGCCAACAGGAAACC       TCTGAGAAAGCGTAGACATTTCC
D_105 (primer A)    TTTCTAGCTGGGCTCTCCAA       CCAGCAATGGTAGGGTGAGT
D_105 (primer B)    CTGGCTTTTGTGGATGGTTT       TGCATGCTTGAATCTCCTTG
D_83                ACAGCCAAGGGTGAAACATC       CTGTGAACCTGGGTGAACCT
D_48                CACTGGATTGGAGACCAGAA       TTGGAAGAACTCGGCTTGAT
D_125               ACGGATTCCTCAACACTTGC       CTGTCCTGGCTACTGCATCA
D_134               GCATCCTTGCACTACCCATT       GGGGGAAAGTGCTGTGTAAA
D_142 (primer A)    CTACCTACTGGGCACCCAAA       TTGATGTTGAAATGGGCTGA
D_142 (primer B)    TGGTGATACCCACTGCTGAA       CCAGCTTGCTTTCTTTGTCC
D_56                GCAGATTTCAGGTGTGCTGA       AAAGACACCCTGGCAGAGAA
G_225               TGCCTTGGCTCCACTTCTAT       GTCCAGCTCCACAAGAGAGG
G_226               TGTGCCAGTGGACTCTGAAC       TTTGTTGACCACTCCCTTCC
G_365 (primer A)    TCCCAACCATATCACCCAGT       AAAACCAACCAAGGCATCAG
G_365 (primer B)    TGCCTGCTGCTTAAAAAGGT       ATATCAACGACTGCCCTTGG
G_369               GGGGCAGCTGTAAATACCAA       CCCCAGGTCATAGACCAGAA
G_380               GGCAGGTAGACATGACAGCA       CCATCTCAGCTCCAGTCACA
G_407               TGCCCCCAAAATGAATGTAT       CAAAAGTGTTGGCTGCTGAA
G_603/604           TAGGCCTTGGATGGAAATTG       GTGATGAGGGGGTGAAGAGA
G_69                TGGGAACCCCTGCTATAGTG       TGCTCGCTTTGAATTTGATG
G_88                AGGTCAGCGCTCCTCAATAA       TGCCCCTGTGCATACAAATA
G_97 (primer A)     CAGCTCTCCAGGTCATCCAT       GAGTTCACCAGGTGGGAAAA
G_97 (primer B)     AGAACCGAGTGGAAAGAGCA       TGAGGCCCAAAGATGGTAAC

Table S6: Primers for qPCR breakpoint mapping of TGFBR3-transecting duplication

Primer pair         F primer                   R primer
T_Out_1             CCAAGGCCTCTGGACTAGGT       AGACTTGGAGCCCTAGGACAA
T_Out_2             TCACTTGGCTTCATGAAAAGG      AAATAGCCCCAGATGTGTGC
T_Out_3             AGCCAAGAGCTGTGTTTGTGT      AAATGCAATCAAGGCAGCTT
T_Out_4             GGCCTCTAGCCCGAAATAAC       GACTGCAAAATGGGTGTGG
O_In_2              CTTGTGGTTTTGCCTGGAAT       ACCACTGTGCAGCTCCTGA
O_Out_1             CCAGTTTGGAATGCAATGAA       ACTCTCAGTTGTGGCTTGGAG
O_Out_5             ACAAATTGCTGTTTCTTTCTACAGC  TTACCTGCGAGCTACTGAATATAGG
Sequencing primers  CTGGTAGACAGTTGGGGTTTC      ACATCTCTGGTGCCCTTTG

Table S7: High- and low-confidence losses on Affy500K array in FPC cases (Table available as Excel sheet on attached CD)

Table S8: High- and low-confidence gains on Affy500K array in FPC cases (Table available as Excel sheet on attached CD)

Table S9: High- and low-confidence losses on Affy500K array in controls (Table available as Excel sheet on attached CD)

Table S10: High- and low-confidence gains on Affy500K array in controls (Table available as Excel sheet on attached CD)

Table S11: High-confidence CNVs on Affy 6.0 array in FPC cases (Table available as Excel sheet on attached CD)

Table S12: High-confidence CNVs on Affy 6.0 array in controls (Table available as Excel sheet on attached CD)

2. Appendix Figures

Notes applying to all figures:

One outlier was excluded from each set of sample results if its value fell outside the range of mean +/- 2 SD (for this purpose, the mean and SD were calculated after removing the value in question).

Fold difference was calculated relative to the average dCt for control samples (i.e. ddCt for each sample is dCt(sample) - dCt(average)). Error bars = 2 SD of fold difference.

For all figures, the sample with an ID_ identifier is the FPC case containing the CNV; samples with RD- identifiers are controls.

Figure S1: qPCR of region D_180
Figure S2: qPCR of region D_19
Figure S3: qPCR of region D_128
Figure S4: qPCR of region D_152
Figure S5: qPCR of region D_234 (primer A)
Figure S6: qPCR of region D_234 (primer B)
Figure S7: qPCR of region D_143 (primer A)
Figure S8: qPCR of region D_143 (primer B)
Figure S9: qPCR of region D_220
Figure S10: qPCR of region D_30 & D_36
Figure S11: qPCR of region D_40
Figure S12: qPCR of region D_105 (primer A)
Figure S13: qPCR of region D_105 (primer B)
Figure S14: qPCR of region D_83
Figure S15: qPCR of region D_48
Figure S16: qPCR of region D_125
Figure S17: qPCR of region D_134
Figure S18: qPCR of region D_142 (primer A)
Figure S19: qPCR of region D_142 (primer B)
Figure S20: qPCR of region D_56
Figure S21: qPCR of region G_225
Figure S22: qPCR of region G_226
Figure S23: qPCR of region G_365 (primer A)
Figure S24: qPCR of region G_365 (primer B)
Figure S25: qPCR of region G_369
Figure S26: qPCR of region G_380
Figure S27: qPCR of region G_407
Figure S28: qPCR of region G_603/604
Figure S29: qPCR of region G_69
Figure S30: qPCR of region G_88
Figure S31: qPCR of region G_97 (primer A), ID_27
Figure S32: qPCR of region G_97 (primer B), ID_27
Figure S33: qPCR of region G_97 (primer A) in ID_203 and family members
Figure S34: qPCR of region G_97 (primer A) in ID_203's family members
Figure S35: qPCR of region G_97 (primer A) in ID_203 and family members
Figure S36: qPCR of region G_97 (primer A) in ID_203's family members
Figure S37: qPCR of region G_97 (primer A) in ID_203's family members
Figure S38: qPCR of region G_97 (primer B) in ID_203 and family members
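
The outlier rule and fold-difference calculation stated in the notes to the appendix figures can be illustrated with a short Python sketch. It is not the analysis code used for the thesis. The function names are hypothetical, the Ct values in the usage note are made up for illustration, and three points are assumptions rather than statements from the text: that dCt is the target Ct minus a reference-amplicon Ct, that fold difference is 2^(-ddCt) as in the usual comparative Ct method, and that when more than one replicate qualifies as an outlier the one furthest beyond the 2 SD limit is the one dropped.

```python
from statistics import mean, stdev


def drop_one_outlier(values):
    """Remove at most one value lying outside mean +/- 2*SD.

    The mean and SD for each candidate are computed from the other
    values only (leave-one-out), as described in the figure notes.
    Assumption: if several values qualify, drop the one that exceeds
    the 2*SD limit by the largest margin.
    """
    values = list(values)
    worst_index, worst_excess = None, 0.0
    for i, v in enumerate(values):
        rest = values[:i] + values[i + 1:]
        if len(rest) < 2:
            continue  # SD is undefined for fewer than two values
        excess = abs(v - mean(rest)) - 2 * stdev(rest)
        if excess > worst_excess:
            worst_index, worst_excess = i, excess
    if worst_index is None:
        return values
    return [v for i, v in enumerate(values) if i != worst_index]


def fold_difference(sample_dct, control_dcts):
    """Fold difference of one sample relative to the control average.

    ddCt = dCt(sample) - mean dCt(controls); fold = 2 ** (-ddCt)
    (assumed comparative Ct formula).
    """
    ddct = sample_dct - mean(control_dcts)
    return 2 ** (-ddct)
```

For example, `drop_one_outlier([20.0, 20.1, 19.9, 25.0])` discards the 25.0 replicate, and `fold_difference(21.0, [20.0, 20.0])` gives 0.5, the value expected under these assumptions for a region present at half the control copy number. Because each candidate is compared against the remaining values only, the rule can be strict when a set holds very few replicates.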


Primary Care Approach to Genetic Cancer Syndromes Primary Care Approach to Genetic Cancer Syndromes Jason M. Goldman, MD, FACP FAU School of Medicine Syndromes Hereditary Breast and Ovarian Cancer (HBOC) Hereditary Nonpolyposis Colorectal Cancer (HNPCC)

More information

Role of genetic testing in familial breast cancer outside of BRCA1 and BRCA2

Role of genetic testing in familial breast cancer outside of BRCA1 and BRCA2 Role of genetic testing in familial breast cancer outside of BRCA1 and BRCA2 Introduction Most commonly diagnosed cancer in South African women and the second most commonly diagnosed cancer in Black women

More information

Germline mutations in pancreatic cancer and potential new therapeutic options

Germline mutations in pancreatic cancer and potential new therapeutic options /, 2017, Vol. 8, (No. 42), pp: 73240-73257 Germline mutations in pancreatic cancer and potential new therapeutic options Rille Pihlak 1,2, Juan W. Valle 1,2 and Mairéad G. McNamara 1,2 1 Division of Molecular

More information

Using the Bravo Liquid-Handling System for Next Generation Sequencing Sample Prep

Using the Bravo Liquid-Handling System for Next Generation Sequencing Sample Prep Using the Bravo Liquid-Handling System for Next Generation Sequencing Sample Prep Tom Walsh, PhD Division of Medical Genetics University of Washington Next generation sequencing Sanger sequencing gold

More information

So how much of breast and ovarian cancer is hereditary? A). 5 to 10 percent. B). 20 to 30 percent. C). 50 percent. Or D). 65 to 70 percent.

So how much of breast and ovarian cancer is hereditary? A). 5 to 10 percent. B). 20 to 30 percent. C). 50 percent. Or D). 65 to 70 percent. Welcome. My name is Amanda Brandt. I am one of the Cancer Genetic Counselors at the University of Texas MD Anderson Cancer Center. Today, we are going to be discussing how to identify patients at high

More information

Management of higher risk of colorectal cancer. Huw Thomas

Management of higher risk of colorectal cancer. Huw Thomas Management of higher risk of colorectal cancer Huw Thomas Colorectal Cancer 41,000 new cases pa in UK 16,000 deaths pa 60% 5 year survival Adenoma-carcinoma sequence (Morson) Survival vs stage (Dukes)

More information

Protein Domain-Centric Approach to Study Cancer Somatic Mutations from High-throughput Sequencing Studies

Protein Domain-Centric Approach to Study Cancer Somatic Mutations from High-throughput Sequencing Studies Protein Domain-Centric Approach to Study Cancer Somatic Mutations from High-throughput Sequencing Studies Dr. Maricel G. Kann Assistant Professor Dept of Biological Sciences UMBC 2 The term protein domain

More information

Genome 371, Autumn 2018 Quiz Section 9: Genetics of Cancer Worksheet

Genome 371, Autumn 2018 Quiz Section 9: Genetics of Cancer Worksheet Genome 371, Autumn 2018 Quiz Section 9: Genetics of Cancer Worksheet All cancer is due to genetic mutations. However, in cancer that clusters in families (familial cancer) at least one of these mutations

More information

Lynch Syndrome. Angie Strang, PGY2

Lynch Syndrome. Angie Strang, PGY2 Lynch Syndrome Angie Strang, PGY2 Background Previously hereditary nonpolyposis colorectal cancer Autosomal dominant inherited cancer susceptibility syndrome Caused by defects in the mismatch repair system

More information

Hereditary Gastric Cancer

Hereditary Gastric Cancer Hereditary Gastric Cancer Dr Bastiaan de Boer Consultant Pathologist Department of Anatomical Pathology PathWest Laboratory Medicine, QE II Medical Centre Clinical Associate Professor School of Pathology

More information

Accel-Amplicon Panels

Accel-Amplicon Panels Accel-Amplicon Panels Amplicon sequencing has emerged as a reliable, cost-effective method for ultra-deep targeted sequencing. This highly adaptable approach is especially applicable for in-depth interrogation

More information

CS2220 Introduction to Computational Biology

CS2220 Introduction to Computational Biology CS2220 Introduction to Computational Biology WEEK 8: GENOME-WIDE ASSOCIATION STUDIES (GWAS) 1 Dr. Mengling FENG Institute for Infocomm Research Massachusetts Institute of Technology mfeng@mit.edu PLANS

More information

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers Gordon Blackshields Senior Bioinformatician Source BioScience 1 To Cancer Genetics Studies

More information

Human Genetics 542 Winter 2018 Syllabus

Human Genetics 542 Winter 2018 Syllabus Human Genetics 542 Winter 2018 Syllabus Monday, Wednesday, and Friday 9 10 a.m. 5915 Buhl Course Director: Tony Antonellis Jan 3 rd Wed Mapping disease genes I: inheritance patterns and linkage analysis

More information

Genetic Testing for Familial Gastrointestinal Cancer Syndromes. C. Richard Boland, MD La Jolla, CA January 21, 2017

Genetic Testing for Familial Gastrointestinal Cancer Syndromes. C. Richard Boland, MD La Jolla, CA January 21, 2017 Genetic Testing for Familial Gastrointestinal Cancer Syndromes C. Richard Boland, MD La Jolla, CA January 21, 2017 Disclosure Information C. Richard Boland, MD I have no financial relationships to disclose.

More information

Introduction to genetic variation. He Zhang Bioinformatics Core Facility 6/22/2016

Introduction to genetic variation. He Zhang Bioinformatics Core Facility 6/22/2016 Introduction to genetic variation He Zhang Bioinformatics Core Facility 6/22/2016 Outline Basic concepts of genetic variation Genetic variation in human populations Variation and genetic disorders Databases

More information

Hepatobiliary and Pancreatic Malignancies

Hepatobiliary and Pancreatic Malignancies Hepatobiliary and Pancreatic Malignancies Gareth Eeson MD MSc FRCSC Surgical Oncologist and General Surgeon Kelowna General Hospital Interior Health Consultant, Surgical Oncology BC Cancer Agency Centre

More information

Chromothripsis: A New Mechanism For Tumorigenesis? i Fellow s Conference Cheryl Carlson 6/10/2011

Chromothripsis: A New Mechanism For Tumorigenesis? i Fellow s Conference Cheryl Carlson 6/10/2011 Chromothripsis: A New Mechanism For Tumorigenesis? i Fellow s Conference Cheryl Carlson 6/10/2011 Massive Genomic Rearrangement Acquired in a Single Catastrophic Event during Cancer Development Cell 144,

More information

Hereditary Breast and Ovarian Cancer Rebecca Sutphen, MD, FACMG

Hereditary Breast and Ovarian Cancer Rebecca Sutphen, MD, FACMG Hereditary Breast and Ovarian Cancer 2015 Rebecca Sutphen, MD, FACMG Among a consecutive series of 11,159 women requesting BRCA testing over one year, 3874 responded to a mailed survey. Most respondents

More information

HEREDITY & CANCER: Breast cancer as a model

HEREDITY & CANCER: Breast cancer as a model HEREDITY & CANCER: Breast cancer as a model Pierre O. Chappuis, MD Divisions of Oncology and Medical Genetics University Hospitals of Geneva, Switzerland Genetics, Cancer and Heredity Cancers are genetic

More information

Human Genetics 542 Winter 2017 Syllabus

Human Genetics 542 Winter 2017 Syllabus Human Genetics 542 Winter 2017 Syllabus Monday, Wednesday, and Friday 9 10 a.m. 5915 Buhl Course Director: Tony Antonellis Module I: Mapping and characterizing simple genetic diseases Jan 4 th Wed Mapping

More information

NGS for Cancer Predisposition

NGS for Cancer Predisposition NGS for Cancer Predisposition Colin Pritchard MD, PhD University of Washington Dept. of Lab Medicine AMP Companion Society Meeting USCAP Boston March 22, 2015 Disclosures I am an employee of the University

More information

Session 4 Rebecca Poulos

Session 4 Rebecca Poulos The Cancer Genome Atlas (TCGA) & International Cancer Genome Consortium (ICGC) Session 4 Rebecca Poulos Prince of Wales Clinical School Introductory bioinformatics for human genomics workshop, UNSW 20

More information

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction Optimization strategy of Copy Number Variant calling using Multiplicom solutions Michael Vyverman, PhD; Laura Standaert, PhD and Wouter Bossuyt, PhD Abstract Copy number variations (CNVs) represent a significant

More information

DNA-seq Bioinformatics Analysis: Copy Number Variation

DNA-seq Bioinformatics Analysis: Copy Number Variation DNA-seq Bioinformatics Analysis: Copy Number Variation Elodie Girard elodie.girard@curie.fr U900 institut Curie, INSERM, Mines ParisTech, PSL Research University Paris, France NGS Applications 5C HiC DNA-seq

More information

Policy Specific Section: Medical Necessity and Investigational / Experimental. October 15, 1997 October 9, 2013

Policy Specific Section: Medical Necessity and Investigational / Experimental. October 15, 1997 October 9, 2013 Medical Policy Genetic Testing for Hereditary Breast and/or Ovarian Cancer Type: Medical Necessity and Investigational / Experimental Policy Specific Section: Laboratory/Pathology Original Policy Date:

More information

CHROMOSOMAL MICROARRAY (CGH+SNP)

CHROMOSOMAL MICROARRAY (CGH+SNP) Chromosome imbalances are a significant cause of developmental delay, mental retardation, autism spectrum disorders, dysmorphic features and/or birth defects. The imbalance of genetic material may be due

More information

Molecular Testing Updates. Karen Rasmussen, PhD, FACMG Clinical Molecular Genetics Spectrum Medical Group, Pathology Division Portland, Maine

Molecular Testing Updates. Karen Rasmussen, PhD, FACMG Clinical Molecular Genetics Spectrum Medical Group, Pathology Division Portland, Maine Molecular Testing Updates Karen Rasmussen, PhD, FACMG Clinical Molecular Genetics Spectrum Medical Group, Pathology Division Portland, Maine Keeping Up with Predictive Molecular Testing in Oncology: Technical

More information

Hereditary Prostate Cancer: From Gene Discovery to Clinical Implementation

Hereditary Prostate Cancer: From Gene Discovery to Clinical Implementation Hereditary Prostate Cancer: From Gene Discovery to Clinical Implementation Kathleen A. Cooney, MD MACP Duke University School of Medicine Duke Cancer Institute (No disclosures to report) Overview Prostate

More information

The silence of the genes: clinical applications of (colorectal) cancer epigenetics

The silence of the genes: clinical applications of (colorectal) cancer epigenetics The silence of the genes: clinical applications of (colorectal) cancer epigenetics Manon van Engeland, PhD Dept. of Pathology GROW - School for Oncology & Developmental Biology Maastricht University Medical

More information

WHAT IS A GENE? CHROMOSOME DNA PROTEIN. A gene is made up of DNA. It carries instructions to make proteins.

WHAT IS A GENE? CHROMOSOME DNA PROTEIN. A gene is made up of DNA. It carries instructions to make proteins. WHAT IS A GENE? CHROMOSOME E GEN DNA A gene is made up of DNA. It carries instructions to make proteins. The proteins have specific jobs that help your body work normally. PROTEIN 1 WHAT HAPPENS WHEN THERE

More information

Neoplasias Quisticas del Páncreas

Neoplasias Quisticas del Páncreas SEAP -Aproximación Práctica a la Patología Gastrointestinal- Madrid, 26 de mayo, 2006 Neoplasias Quisticas del Páncreas Gregory Y. Lauwers, M.D. Director, Service Massachusetts General Hospital Harvard

More information

PANCREATIC CANCER RISK PERCEPTION AND WORRY IN FAMILIAL HIGH- RISK PATIENTS UNDERGOING ENDOSCOPIC ULTRASOUND FOR SURVEILLANCE.

PANCREATIC CANCER RISK PERCEPTION AND WORRY IN FAMILIAL HIGH- RISK PATIENTS UNDERGOING ENDOSCOPIC ULTRASOUND FOR SURVEILLANCE. PANCREATIC CANCER RISK PERCEPTION AND WORRY IN FAMILIAL HIGH- RISK PATIENTS UNDERGOING ENDOSCOPIC ULTRASOUND FOR SURVEILLANCE by Erica Lynn Silver BS, Cell and Molecular Biology; California State University

More information

AD (Leave blank) TITLE: Genomic Characterization of Brain Metastasis in Non-Small Cell Lung Cancer Patients

AD (Leave blank) TITLE: Genomic Characterization of Brain Metastasis in Non-Small Cell Lung Cancer Patients AD (Leave blank) Award Number: W81XWH-12-1-0444 TITLE: Genomic Characterization of Brain Metastasis in Non-Small Cell Lung Cancer Patients PRINCIPAL INVESTIGATOR: Mark A. Watson, MD PhD CONTRACTING ORGANIZATION:

More information

NGS in tissue and liquid biopsy

NGS in tissue and liquid biopsy NGS in tissue and liquid biopsy Ana Vivancos, PhD Referencias So, why NGS in the clinics? 2000 Sanger Sequencing (1977-) 2016 NGS (2006-) ABIPrism (Applied Biosystems) Up to 2304 per day (96 sequences

More information

GENETIC TESTING AND COUNSELING FOR HERITABLE DISORDERS

GENETIC TESTING AND COUNSELING FOR HERITABLE DISORDERS Status Active Medical and Behavioral Health Policy Section: Laboratory Policy Number: VI-09 Effective Date: 03/17/2014 Blue Cross and Blue Shield of Minnesota medical policies do not imply that members

More information

Reviewers' comments: Reviewer #1 (Remarks to the Author):

Reviewers' comments: Reviewer #1 (Remarks to the Author): Reviewers' comments: Reviewer #1 (Remarks to the Author): In this study the authors analysed 18 deep penetrating nevi for oncogenic genomic changes (single nucleotide variations, insertions/deletions,

More information

Cancer. Questions about cancer. What is cancer? What causes unregulated cell growth? What regulates cell growth? What causes DNA damage?

Cancer. Questions about cancer. What is cancer? What causes unregulated cell growth? What regulates cell growth? What causes DNA damage? Questions about cancer What is cancer? Cancer Gil McVean, Department of Statistics, Oxford What causes unregulated cell growth? What regulates cell growth? What causes DNA damage? What are the steps in

More information

Neoplasia 2018 lecture 11. Dr H Awad FRCPath

Neoplasia 2018 lecture 11. Dr H Awad FRCPath Neoplasia 2018 lecture 11 Dr H Awad FRCPath Clinical aspects of neoplasia Tumors affect patients by: 1. their location 2. hormonal secretions 3. paraneoplastic syndromes 4. cachexia Tumor location Even

More information

Dan Koller, Ph.D. Medical and Molecular Genetics

Dan Koller, Ph.D. Medical and Molecular Genetics Design of Genetic Studies Dan Koller, Ph.D. Research Assistant Professor Medical and Molecular Genetics Genetics and Medicine Over the past decade, advances from genetics have permeated medicine Identification

More information