Identification of novel risk variants for sarcoma and other cancers by whole exome sequencing analysis in cancer cluster families

Jones, R. M. (2017). Identification of novel risk variants for sarcoma and other cancers by whole exome sequencing analysis in cancer cluster families. DOI: 10.4225/23/59f13ee5d5573

Rights statement: This work is protected by Copyright. You may print or download ONE copy of this document for the purpose of your own non-commercial research or study. Any other use requires permission from the copyright owner. The Copyright Act requires you to attribute any copyright works you quote or paraphrase.

Identification of novel risk variants for sarcoma and other cancers by whole exome sequencing analysis in cancer cluster families

Submitted by Rachel Jones

This thesis is presented for the degree of Doctor of Philosophy
The University of Western Australia
School of Surgery
2017

Declaration

I, Rachel Jones, certify that:

This thesis has been substantially accomplished during enrolment in the degree.

This thesis does not contain material which has been accepted for the award of any other degree or diploma in my name, in any university or other tertiary institution.

No part of this work will, in the future, be used in a submission in my name, for any other degree or diploma in any university or other tertiary institution without the prior approval of The University of Western Australia and where applicable, any partner institution responsible for the joint-award of this degree.

This thesis does not contain any material previously published or written by another person, except where due reference has been made in the text.

The work(s) are not in any way a violation or infringement of any copyright, trademark, patent, or other rights whatsoever of any person.

The research involving human data reported in this thesis was assessed and approved by The University of Western Australia Human Research Ethics Committee. Approval number: RA/4/1/6434.

Third party editorial assistance was provided in the preparation of the thesis by Dr Tegan McNab.

For Gareth and Abbie

"... [A] knowledge of sequences could contribute much to our understanding of living matter."
Frederick Sanger [1980]

Abstract

Cancer is a genetic disease caused by an accumulation of genetic and epigenetic alterations. Cancers can be caused by mutations that arise in single somatic cells, resulting in sporadic tumours, or by mutations that occur in the germline, resulting in hereditary predisposition to cancer. While only a small proportion of cancers are estimated to involve an inherited genetic mutation, familial clustering of cancers is relatively common. More than 100 cancer predisposition genes have been identified using a variety of genetic strategies. However, only a small proportion of familial cancer risk can be explained by established cancer susceptibility genes. The identification of genes that predispose individuals to cancer is of high importance in human medical research, as inherited genetic variants in genes that metabolise and process drugs can influence response to treatment.

Sarcomas are a rare group of cancers that arise predominantly from the connective tissues of the body. Despite representing only 1% of all cancers, sarcomas are a high impact group of cancers that disproportionately affect the young. While it is sometimes difficult to distinguish sporadic from hereditary cancer, a rare cancer, such as sarcoma, occurring twice within one family is epidemiologically striking. The use of whole exome sequencing (WES) in families currently represents an optimal study design for the identification of rare genetic variants involved in the risk of cancer. Families in which multiple members develop a rare form of cancer, such as sarcoma, are more likely to have a mutation segregating in an inherited cancer gene compared to families affected by more common types of cancer.

In this study, three cancer cluster families (19 individuals) with a sarcoma proband were selected from the International Sarcoma Kindred Study, and WES was performed on germline DNA from both affected and unaffected family members using the Ion Proton platform at 100X coverage. WES data was annotated using Annotate Variation (ANNOVAR) and Regulome database (RegulomeDB). Putative structural and regulatory variants were filtered using genomic location and variant class or RegulomeDB score. Three different strategies were used to prioritise rare private variants, known rare variants and candidate gene variants. Association and segregation analyses of the prioritised variants were used to identify eight nominally significant germline risk variants in the ARHGAP39, C16orf96, ABCB5, ZFP69B, UVSSA, BEAN1, KIF2C and PDIA2 genes that show segregation with cancer in the families.

Matched tumour and germline analyses were performed on WES data generated using the Illumina HiSeq 4000 at 60X coverage for two myxoid liposarcoma patients from two of the cancer cluster families. A total of 13 statistically significant somatic mutations were identified using VarScan2 and Strelka (PRMT5, ASPN, LAMA2, TET2, FHOD3, GATAD2A, ADSSL1, P4HTM, ABL1, SLC6A18, PLK2 and two intergenic variants, one between SLC22A20 and POLA2 and one between SDR16C6P and PENK). A region of loss of heterozygosity on chromosome 16 was also identified in one of the myxoid liposarcoma tumours.

Whole genome sequencing (WGS) of germline DNA using the Illumina HiSeq X Ten platform was available for 561 sarcoma cases and 1,144 healthy ageing controls from the Garvan Institute for Medical Research. Using this WGS data, variant burden analyses were performed independently for summed nonsynonymous deleterious variants and putative regulatory variants to validate target regions identified in the cancer cluster families. The target regions were defined as the genes in which candidate germline and somatic mutations were identified and included 1,000 bases either side. For intergenic variants, both flanking genes were included. Of the 21 regions analysed, six (C16orf96, SLC6A18, TET2, ARHGAP39, ABL1 and a region encompassing SLC22A20 and POLA2) were found to have a significantly higher burden of variants in sarcoma cases compared to controls (p-value < 2.38 × 10⁻³).

The current study was the first to perform WES on cancer cluster families identified by a sarcoma proband. The results indicate the utility of this approach to identify novel sarcoma candidate risk genes by sequencing a small number of mixed cancer cluster families and validating the results in larger population cohorts. Genomic regions identified in this study should be prioritised for further studies to determine the role of these genes in cancer and sarcoma pathogenesis.
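
As an illustration of two details mentioned in the abstract, the following is a minimal Python sketch, not the pipeline used in the thesis. It assumes that the reported significance threshold corresponds to a Bonferroni correction of a nominal alpha of 0.05 across the 21 target regions (the abstract does not state how the threshold was derived), and it shows one way a target region could be extended by 1,000 bases either side. The `Region` type, the function names and the example coordinates are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A genomic interval (hypothetical helper type, not from the thesis)."""
    chrom: str
    start: int
    end: int


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def pad_region(region: Region, flank: int = 1000) -> Region:
    """Extend a region by `flank` bases on each side, clamping the start at 0."""
    return Region(region.chrom, max(0, region.start - flank), region.end + flank)


# Assumption: nominal alpha of 0.05 spread over the 21 regions analysed.
threshold = bonferroni_threshold(0.05, 21)
print(f"{threshold:.2e}")  # 2.38e-03

# Hypothetical coordinates, used only to show the padding step.
print(pad_region(Region("chr16", 5000, 9000)))
```

Under these assumptions the computed threshold agrees with the value quoted in the abstract. The clamp at zero is a design choice for genes close to the start of a chromosome; a full implementation would also clamp at the chromosome end.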

Contents

Declaration
Abstract
Table of contents
List of tables
List of figures
Acknowledgements
Authorship declaration
Abbreviations

1 Literature review
  1.1 Cancer
  1.2 Cancer genetics
  1.3 Familial cancers
    1.3.1 Familial cancer predisposition syndromes
    1.3.2 Familial cancer clusters
  1.4 Evidence for pleiotropic genetic risk factors
  1.5 Sarcoma
    1.5.1 Sarcoma genetics
  1.6 Methods for identifying genetic risk variants
    1.6.1 Linkage mapping
    1.6.2 Association
    1.6.3 DNA sequencing
    1.6.4 Whole exome sequencing
    1.6.5 Whole exome sequencing of cancer cluster families
  1.7 Next generation sequencing study considerations
  1.8 Known cancer predisposition genes
  1.9 Summary
  1.10 Aims

2 Aim 1: Whole exome sequencing of three cancer cluster families identified by a sarcoma proband
  2.1 Introduction
    2.1.1 Ion Proton platform
  2.2 Methods
    2.2.1 Families selected for whole exome sequencing
    2.2.2 DNA extraction
    2.2.3 Whole exome sequencing
      2.2.3.1 Library preparation
      2.2.3.2 Exome sequencing
    2.2.4 Sequence alignment and variant calling
    2.2.5 Variation to sequence alignment and variant calling
      2.2.5.1 Torrent variant caller plugin
      2.2.5.2 Genome analysis toolkit
      2.2.5.3 Intersect variant calls from Torrent Variant Caller and Genome Analysis Toolkit
    2.2.6 Recalibrate variants
    2.2.7 Genotype concordance
  2.3 Results
    2.3.1 Families selected for whole exome sequencing
    2.3.2 Whole exome sequencing
    2.3.3 Variant calling
    2.3.4 Recalibrate variants
    2.3.5 Genotype concordance
  2.4 Discussion
    2.4.1 Evaluation of families used in this study
    2.4.2 The use of whole exome sequencing to identify disease causing variants
    2.4.3 Limitations of whole exome sequencing
    2.4.4 The Ion Proton sequencing platform
    2.4.5 Base calling software
    2.4.6 Concordance

3 Aim 2: Identification of candidate germline risk variants in three cancer cluster families
  3.1 Introduction
  3.2 Bioinformatic strategies for variant filtering and prioritisation in whole exome sequencing
    3.2.1 Annotation
      3.2.1.1 Annotation of non-coding regions
    3.2.2 Variant class filtering
    3.2.3 Population frequency filtering
    3.2.4 Evolutionary conservation
    3.2.5 Functional impact prediction
    3.2.6 Association analysis in families
    3.2.7 Familial segregation
    3.2.8 Outline of chapter
  3.3 Methods
    3.3.1 Ascertainment bias correction
    3.3.2 Intersection
    3.3.3 Annotation and filtration
    3.3.4 Prioritisation strategies
      3.3.4.1 Prioritisation using a rare private variants strategy
      3.3.4.2 Prioritisation using a known rare variants strategy
      3.3.4.3 Prioritisation using a candidate gene strategy
    3.3.5 Methods for testing association of variants with cancer phenotypes
    3.3.6 Bonferroni correction
    3.3.7 Familial segregation analysis
    3.3.8 Evidence further supporting candidate risk genes
  3.4 Results
    3.4.1 Variant prioritisation
      3.4.1.1 Prioritisation using a rare private variants strategy
      3.4.1.2 Prioritisation using a known rare variants strategy
      3.4.1.3 Prioritisation using a candidate gene strategy
      3.4.1.4 Summary of annotated variants from each prioritisation strategy
    3.4.2 Rare private variants
      3.4.2.1 Association analysis in SOLAR
      3.4.2.2 Segregation analysis results
    3.4.3 Known rare variants
      3.4.3.1 Association analysis in SOLAR
      3.4.3.2 Segregation analysis results
    3.4.4 Candidate gene variants
      3.4.4.1 Association analysis in SOLAR
      3.4.4.2 Segregation analysis results
    3.4.5 Evidence further supporting germline risk genes
  3.5 Discussion
    3.5.1 Variant filtering and prioritisation strategies
    3.5.2 Association and segregation analyses of candidate risk variants in families
      3.5.2.1 The ABCB5 gene
      3.5.2.2 The KIF2C gene
      3.5.2.3 The PDIA2 gene
    3.5.3 Conclusion

4 Aim 3: A comparison of matched tumour and germline DNA from two sarcoma patients
  4.1 Introduction
    4.1.1 Myxoid liposarcoma
      4.1.1.1 Somatic variants
      4.1.1.2 Loss of heterozygosity
      4.1.1.3 Somatic copy number alteration
    4.1.2 Bioinformatic assessment of matched tumour and germline samples
    4.1.3 Somatic mutations and drug sensitivity
    4.1.4 Outline of chapter
  4.2 Methods
    4.2.1 Whole exome sequencing
    4.2.2 Pre-processing and quality control
    4.2.3 Adapter trimming
    4.2.4 Sequence alignment and calling
    4.2.5 BAM quality control
    4.2.6 Generate mpileup file
    4.2.7 Somatic variant calling using VarScan2
      4.2.7.1 Somatic variant calling using Strelka
    4.2.8 Evidence further supporting somatic risk genes
    4.2.9 Drug sensitivity
    4.2.10 Loss of heterozygosity variant calling using VarScan2
    4.2.11 Variant annotation and filtering
    4.2.12 Somatic copy number analysis using VarScan2
  4.3 Results
    4.3.1 Whole exome sequencing
    4.3.2 Sequence alignment and calling
    4.3.3 BAM quality control
    4.3.4 Somatic variant calling
      4.3.4.1 VarScan2
      4.3.4.2 Validation of somatic variants using Strelka
      4.3.4.3 Evidence further supporting somatic risk genes
      4.3.4.4 Drug sensitivity
    4.3.5 Loss of heterozygosity variants
    4.3.6 Copy number analysis
  4.4 Discussion
    4.4.1 Comparison of results in the context of published literature on myxoid liposarcoma genetics
    4.4.2 Strengths
    4.4.3 Limitations
    4.4.4 Summary

5 Aim 4: Variant burden analyses at candidate risk loci in sarcoma cases and healthy ageing controls
  5.1 Introduction
    5.1.1 Variant burden analyses in sarcoma cohorts
  5.2 Methods
    5.2.1 Study participants
    5.2.2 Whole genome sequencing
    5.2.3 Genomic regions selected for validation
    5.2.4 Statistical analyses
  5.3 Results
    5.3.1 Identification of nonsynonymous deleterious variants in the target regions
    5.3.2 Statistical analyses
      5.3.2.1 Nonsynonymous deleterious variants
      5.3.2.2 Putative regulatory variants
  5.4 Discussion
    5.4.1 Novel findings
    5.4.2 Known cancer genes
    5.4.3 Clinical implications
    5.4.4 Strengths and limitations
    5.4.5 Conclusion

6 Conclusion
  6.1 Summary of results
  6.2 Clinical utility of findings
  6.3 Review of methodology
  6.4 Recommendations for future work

Bibliography

Appendices
  A World Health Organisation classification of soft tissue tumours and bone tumours
  B Novel tumour-predisposing genes identified by whole exome sequencing
  C Familial cancer syndromes associated with sarcomas
  D Translocations associated with sarcomas
  E Genetically complex sarcomas
  F Known cancer predisposition genes
  G Candidate genes used for variant prioritisation based on a priori knowledge of cancer biology
  H Genes in which variants were also prioritised using the candidate gene prioritisation strategy
  I Patient 1-II-2: Copy number variation by chromosome
  J Patient 2-II-1: Copy number variation by chromosome
  K A list of nonsynonymous deleterious variants included in variant burden analyses
  L Gene identified by variant burden analyses by Ballinger et al. (2016) and Brohl et al. (2017)
  M A list of putative regulatory variants included in variant burden analyses

List of Tables

2.1 Parameters used to create whole exome sequencing run plans using Torrent Suite software
2.2 Parameters used to run the Torrent Variant Caller plugin to call bases
2.3 Parameters used for Genome Analysis Toolkit UnifiedGenotyper to call bases
2.4 Depth of coverage summary from Torrent Suite
2.5 Genome Analysis Toolkit VariantRecalibrator tranche results
2.6 Discordant genotype calls between the Agilent HaloPlex custom panel and whole exome sequencing for Patient 2-II-1
2.7 Discordant genotype calls between the Agilent HaloPlex custom panel and whole exome sequencing for Patient 3-III-1
3.1 Classification of Regulome database scores
3.2 Functional annotation of intersect file using ANNOVAR
3.3 Summary of variant annotation using Annotate Variation and Regulome Database for each prioritisation strategy
3.4 Summary of SOLAR association results for rare private variants
3.5 Summary of SOLAR association results for known rare variants
3.6 Summary of SOLAR association results for candidate gene variants
3.7 Summary of findings from in silico resources investigating the role of candidate germline risk variants in cancer pathogenesis
3.8 Summary of search results from PubMed for genes in which germline variants were identified
4.1 Parameters specified for VarScan2 somaticfilter to filter false positives from the high confidence somatic mutations
4.2 Raw data summary from Macrogen Inc. for Patient 1-II-2 and Patient 2-II-1 germline and tumour samples
4.3 Summary statistics generated using Samtools flagstat for Patient 1-II-2 and 2-II-1 germline and tumour samples
4.4 Results from VarScan2 somaticfilter to remove possible false positives from the high confidence somatic calls for Patient 1-II-2 and Patient 2-II-1
4.5 Somatic variants identified by VarScan2 and Strelka for Patient 1-II-2
4.6 Somatic variants identified by VarScan2 and Strelka for Patient 2-II-1
4.7 Summary of findings from in silico resources investigating the role of somatic risk variants and the genes in which they arise in cancer pathogenesis
4.8 Summary of search results from PubMed for genes in which somatic variants were identified
4.9 Statistically significant high confidence loss of heterozygosity variants for Patient 1-II-2
5.1 Genomic coordinates for target regions in which germline and somatic risk variants were identified
5.2 Classification of Regulome database scores
5.3 Annotated summary of nonsynonymous deleterious variants and putative regulatory variants in the target regions
5.4 Odds ratios, p-values and 95% confidence intervals from Fisher's exact test for target regions for nonsynonymous deleterious variants
5.5 Odds ratios and p-values from Fisher's exact test for target regions for putative regulatory variants

List of Figures

1.1 Location of known cancer predisposition genes
2.1 Pedigree of family 1
2.2 Pedigree of family 2
2.3 Pedigree of family 3
2.4 Whole exome sequencing pipeline flowchart
2.5 The number of variants called by Torrent Variant Caller and Genome Analysis Toolkit UnifiedGenotyper, and the number of variants that were called by both callers (intersect)
2.6 Genome Analysis Toolkit VariantRecalibrator tranche plot
2.7 Genome Analysis Toolkit VariantRecalibrator projection for mapping quality rank sum (MQRankSum) versus haplotype score
2.8 Concordance of genotype calls between the Agilent HaloPlex custom panel and whole exome sequencing on Ion Proton for three patients
3.1 Genotypes for the ARHGAP39 variant that shows segregation in patients with cancer in family 3
3.2 Genotypes for the C16orf96 and ABCB5 variants that show segregation in patients with cancer in family 2
3.3 Genotypes for the ZFP69B, BEAN1, UVSSA and KIF2C variants that show segregation in patients with cancer in family 3
3.4 Genotypes for the PDIA2 variant that shows segregation in patients with cancer in family 2
4.1 Pedigree of family 1 highlighting sarcoma Patient 1-II-2 for tumour-germline comparison
4.2 Pedigree of family 2 highlighting sarcoma Patient 2-II-1 for tumour-germline comparison
4.3 Genome analysis toolkit depth of coverage summary for Patient 1-II-2 and Patient 2-II-1 germline and tumour DNA
4.4 Insert size histogram plots generated by Picard for Patient 1-II-2 and Patient 2-II-1 germline and tumour samples
4.5 Pedigree of family 1 indicating genotypes for each patient at chr16:53513055 (rs8049033) in the RBL2 gene

Acknowledgements

I would like to acknowledge support from Mandy Basson and the Board of Directors of the Abbie Basson Sarcoma Foundation Ltd (Sock it to Sarcoma!).

I would like to sincerely thank David Thomas, Mandy Ballinger and Mark Pinese, for providing the DNA samples and data used in this thesis. I would also like to acknowledge the participants from the International Sarcoma Kindred Study and the Medical Genome Reference Bank.

I would like to express my gratitude to my supervisors Eric Moses, Phillip Melton, David Wood, David Thomas and Evan Ingley, for their guidance and for the opportunity to pursue this project. I would also like to acknowledge Jane Allen and Barry Iacopetta for their support.

I would like to thank all my friends at the Centre for Genetic Origins of Health and Disease for your daily guidance and support, especially Alex Rea for his assistance in the lab and Gemma Cadby for her helpful advice and for reading drafts. I would also like to thank Tegan McNab for proofreading my thesis.

I am grateful to my family and friends who have always supported my studies.

Abbreviations Abbreviation *.bam *.bed *.sam *.vcf ABC Alt ANNOVAR ASPREE ATP ATPase ATRA B BCFtools BWA BWA-MEM Chr CNV COSMIC CpG CREB Definition Binary Alignment/Map Browser Extensible Data Sequence Alignment/Map Variant Call Format ATP-binding cassette Alternate allele Annotate Variation ASPirin in Reducing Events in the Elderly Adenosine Triphosphate Adenosinetriphosphatase All-Trans-Retinoic-Acid Benign Binary Variant Call Format Tools Burrows-Wheeler Aligner Burrows-Wheeler Aligner Maximal Exact Matches Chromosome Copy Number Variation The Catalogue of Somatic Mutations in Cancer 5 C phosphate G 3 camp Response Element-binding Protein xxvii

D               Deleterious
dbSNP           Short Genetic Variations Database
DNA             Deoxyribonucleic Acid
DNase           Deoxyribonuclease
dNTP            Deoxynucleotide
E2F             E2 Factor
ECM             Extracellular Matrix
ENCODE          Encyclopedia of DNA Elements
eQTL            Expression Quantitative Trait Loci
ER              Endoplasmic Reticulum
ERbB            Erythroblastosis
ERK             Extracellular Signal-Regulated Kinase
ESC             Embryonic Stem Cells
ExAC            Exome Aggregation Consortium
FAMMM           Familial Atypical Multiple Mole Melanoma
FFPE            Formalin-Fixed and Paraffin-Embedded
GATK            Genome Analysis ToolKit
GeneRIF         Gene References into Functions
GERP            Genomic Evolutionary Rate Profiling
GO              Gene Ontology
GOHaD           Centre for Genetic Origins of Health and Disease
GPCR            G Protein Coupled Receptor
GTP             Guanosine Triphosphate
GTPase          Guanosine Triphosphatase
GWA             Genome Wide Association
HapMap          International Haplotype Project
HDI             Histone Deacetylation Inhibitor
hg19            Human Genome build 19

hMSCs           Human bone marrow-derived Mesenchymal Stromal Cells
IG              Intergenic
IGV             Integrative Genomics Viewer
INDEL           Insertions and Deletions
Int             Intronic
isec            BCFtools Intersect
ISKS            International Sarcoma Kindred Study
Kb              Kilobase
LOD             Logarithm of the Odds
LOH             Loss Of Heterozygosity
MAF             Minor Allele Frequency
MGRB            Medical Genome Reference Bank
MPNST           Malignant Peripheral Nerve Sheath Tumour
MQRankSum       Mapping Quality Rank Sum
mRNA            Messenger Ribonucleic Acid
NCBI            National Center for Biotechnology Information
NGS             Next Generation Sequencing
NS              Nonsynonymous
NTR             Neurotrophins
OMIM            Online Mendelian Inheritance in Man
P               Possibly damaging
PDI             Protein Disulphide Isomerase
PNET            Primitive Neuroectodermal Tumour
PolyPhen-2      Polymorphism Phenotyping-2
Probit          Probability Unit
Q               Base Quality Score
QC              Quality Control
Rb              Retinoblastoma

Ref             Reference allele
RegulomeDB      Regulome Database
RNA             Ribonucleic Acid
Robo            Roundabout family of proteins
rs ID           Reference SNP Identification
S               Synonymous
SCNA            Somatic Copy Number Alteration
SIFT            Sorting Intolerant from Tolerant
SLBP            Stem-Loop Binding Domain
SNP             Single Nucleotide Polymorphism
SNV             Single Nucleotide Variant
SOLAR           Sequential Oligogenic Linkage Analysis Routines
T               Tolerated
TF              Transcription Factor
TMAP            Torrent Mapping Alignment Program
TVC             Torrent Variant Caller
UCSC            University of California Santa Cruz
USA             United States of America
UTR             Untranslated Region
UTR3            3' Untranslated Region
UTR5            5' Untranslated Region
UWA             The University of Western Australia
VQSLOD          Variant Quality Score Log-Odds
VQSR            Variant Quality Score Recalibration
WES             Whole Exome Sequencing
WGS             Whole Genome Sequencing

Chapter 1
Literature review

1.1 Cancer

Collectively, cancers are a diverse spectrum of human diseases with a common progression resulting from the failure to regulate normal cell growth, proliferation and apoptosis. 1 Cancers can arise from any of the cell or tissue types in the human body and are classified accordingly. 2 The most common cancers in adults are carcinomas (approximately 90% of cancers), 2 which are derived from epithelial cells that line body cavities and glands. 3 Lymphomas and leukaemias arise in the tissue that gives rise to lymphoid and blood cells and account for approximately 8% of human malignancies. 3, 4 Melanomas, retinoblastomas, neuroblastomas and glioblastomas are derived from dividing cells in melanocytes, ocular retina, neurons and neural glia, respectively. 3 Sarcomas arise from connective tissues such as bones, tendons, cartilage and fat. 2

Cancer is one of the leading causes of death worldwide, with over 14 million people affected each year. 5 In 2012, there were 4.3 million premature deaths from cancer, with premature deaths expected to increase by 44% from 2012 to 2030. 6, 7 The lost years of life and productivity caused by cancer represent the largest cost to the global economy of any cause of death. 8

1.2 Cancer genetics

Cancer is a genetic disease arising from an accumulation of genetic and epigenetic mutations. 9 These mutations can deregulate multiple complex regulatory pathways of genes affecting cellular growth, division, migration, and survival. 10 Tumour genomes usually exhibit many mutations and can be highly unstable. 11 Mutations can range from intragenic mutations to large gains and losses of chromosomal material. 9

A genetic mutation is a permanent change in the DNA sequence. A polymorphism is a genetic variation that is common in the population. The arbitrary cut-off between a mutation and a polymorphism is 1%; that is, the less common allele of a polymorphism must have a frequency of at least 1% in the population. 12

Mutations in a cancer genome can comprise the following types of DNA change: substitutions, insertions or deletions of small or large segments of DNA, rearrangements, copy number increases, and copy number reductions. 13 Cancer cells can also acquire new DNA sequences from viruses including human papillomavirus, Epstein-Barr virus, hepatitis B virus, human T-lymphotropic virus 1, and human herpesvirus. 14 Cancer genomes can also acquire epigenetic changes which alter chromatin structure and gene expression. 15

There can be anywhere from tens to thousands of mutations per cancer genome. 16 The substantial variation in the number and pattern of mutations in individual cancers reflects exposure to different risk factors, DNA repair defects, and the cellular origin of the tumour. 17

Mutations that occur in cancers fall into two functional categories: mutations required for tumourigenesis, and mutations that merely occur during tumourigenesis and do not contribute to the process. These are called driver and passenger mutations, respectively. Drivers confer a selective advantage during clonal evolution and therefore drive the tumourigenesis process.
Passenger mutations do not appear in tumours as a result of evolutionary selection, but rather as variation that occurs by chance in a cell that harbours a driver mutation. It is likely that most cancers carry more than one driver mutation, and the number of drivers varies between cancer types. 13, 16, 18-20

Mutations can arise in three broad categories of genes: oncogenes, tumour suppressor genes, and genome stability genes. Mutations in oncogenes and tumour suppressor genes drive the tumourigenesis process by increasing proliferation or inhibiting apoptosis, respectively, whereas mutations in genome stability genes drive tumourigenesis by increasing the rate of mutations in other genes. 9 The characterisation of these genes has led to the discovery of the biochemical pathways underlying the process of tumourigenesis, and also to a better understanding of the normal homeostatic roles these pathways play in healthy cells and tissues. 21

Mutations in these three classes of genes can occur in single somatic cells, resulting in sporadic tumours, or in the germline, resulting in hereditary predisposition to cancer. Sporadic cancers develop due to mutations that arise during a person's lifetime. The majority of cancers (90-95%) develop sporadically due to genetic mutations that result from DNA damage from exposure to environmental and lifestyle factors. 22 Environmental risk factors include occupational exposures (chemicals, dust, and industrial processes), sunlight, radiation, and environmental pollution. 23 Lifestyle factors that may increase the risk of developing cancer include smoking, excessive alcohol consumption, poor diet, obesity and physical inactivity, chronic infections, sun tanning, and sunburn. 24, 25

Only a small proportion (5-10%) of cancers are estimated to involve an inherited genetic mutation. 26-30 However, familial clustering of cancers is relatively common. 31 Familial clustering is the occurrence of a disease, such as cancer, in some families more often than would be expected from its prevalence in the general population. 32 Familial clustering of cancer can be measured by familial proportion (the proportion of cases with an affected relative), which has been reported to be as high as 20% in prostate cancer. 33 Familial clustering of cancers is likely due to a combination of environmental factors, rare gene mutations with high penetrance, and more common, lower-penetrance gene variants that act together to increase cancer susceptibility. 32, 34-36
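Two quantities defined in this section lend themselves to a short illustration: the 1% allele-frequency cut-off between a mutation and a polymorphism, and the familial proportion. The following Python sketch is illustrative only; the function names, labels and example numbers are hypothetical and are not taken from the analyses in this thesis.

```python
def classify_by_frequency(minor_allele_frequency, cutoff=0.01):
    """Label a variant using the conventional (arbitrary) 1% cut-off.

    Alleles at or above the cut-off are labelled polymorphisms and rarer
    alleles are labelled mutations. The labels are illustrative only.
    """
    if not 0.0 <= minor_allele_frequency <= 0.5:
        raise ValueError("a minor allele frequency lies between 0 and 0.5")
    if minor_allele_frequency >= cutoff:
        return "polymorphism"
    return "mutation"


def familial_proportion(cases_with_affected_relative, total_cases):
    """Proportion of cases that have at least one affected relative."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    if not 0 <= cases_with_affected_relative <= total_cases:
        raise ValueError("case count must lie between 0 and total_cases")
    return cases_with_affected_relative / total_cases


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise both functions.
    print(classify_by_frequency(0.05))
    print(classify_by_frequency(0.002))
    print(familial_proportion(20, 100))
```

With these hypothetical inputs, 20 cases with an affected relative out of 100 cases corresponds to a familial proportion of 20%, the upper value quoted above for prostate cancer.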

1.3 Familial cancers

All cancers, both rare and common, show some degree of familial clustering. 37 Cancers can be two- to four-fold more common in first degree relatives of individuals with cancer. 38

1.3.1 Familial cancer predisposition syndromes

Familial clustering of cancers can sometimes represent a familial cancer predisposition syndrome. A familial cancer predisposition syndrome manifests when multiple members of a family inherit gene mutations that predispose them to one or more types of cancer. 39 These families have multiple affected individuals, and family members often show early onset of cancer, multiple primary sites of disease, and occasionally bilateral involvement of paired organs. 35, 39 Some cancer predisposition syndromes appear to confer an increased risk of adult-onset cancers, such as breast, ovarian and colorectal cancers. 40-42 Other syndromes increase the susceptibility to tumour onset in childhood, such as hereditary retinoblastoma, 43 or to early onset in both children and adults, such as von Hippel-Lindau disease. 44

Most familial cancer predisposition syndromes are transmitted in a Mendelian autosomal dominant manner. 35, 45 Dominant mutations require only one defective allele to be present for the individual to be predisposed to cancer. Individuals with one defective and one normal allele are heterozygous. An example of a Mendelian autosomal dominant cancer predisposition syndrome is hereditary breast-ovarian cancer. 46 This syndrome is caused by mutations in the BRCA1 and BRCA2 genes. 47 Women with germline mutations in BRCA1 have a 46-65% risk of developing breast cancer by age 70, while those with a BRCA2 mutation have a lower risk of 43-45% by age 70. 40, 48

Less often, familial cancer predisposition syndromes can be transmitted in an autosomal recessive manner. In the case of recessive mutations, both alleles must be mutated for the individual to have a predisposition to cancer.
Individuals who inherit a recessive germline mutation in a gene are known as carriers and carry the mutation in every cell of their body. There is a variable risk that a carrier will develop cancer. A carrier will not develop cancer unless the remaining normal allele is also mutated. The particular mutation, other genes, and dietary, lifestyle and environmental factors can influence risk. 49 The likelihood that a carrier will develop cancer is defined as the penetrance of the mutation. 3 An example of an autosomal recessive cancer predisposition syndrome is xeroderma pigmentosum complementation group A, characterised by increased sensitivity to sunlight with the development of carcinomas at an early age. 50 Xeroderma pigmentosum complementation group A has been associated with homozygous or compound heterozygous mutations in the XPA gene. 50

The study of familial cancer predisposition syndromes has led to the identification of genes critical to carcinogenesis and has also informed our understanding of the fundamental biology of human cancer. 51 Li and Fraumeni (1969) described the first familial cancer syndrome in four unrelated children with sarcoma and other affected family members. 52 They hypothesised that the occurrence of various malignancies in a family might represent a familial cancer syndrome due to the transmission of an autosomal dominant gene mutation. 52 In 1990 the TP53 gene was identified as the underlying gene responsible for Li-Fraumeni syndrome. 53 The TP53 gene encodes a tumour suppressor protein that responds to diverse cellular stresses to regulate expression of target genes, thereby inducing cell cycle arrest, apoptosis, senescence, DNA repair, or changes in metabolism. 54-56 Germline mutations in the TP53 gene were later established to be the underlying genetic cause of many other malignancies as well. 51

1.3.2 Familial cancer clusters

There are also familial cancer clusters that are not defined by known hereditary cancer syndromes. Familial cancer clusters are those that do not exhibit the features of hereditary types of cancer but occur in more individuals in the family than statistically expected. 36 In addition to familial clustering for the majority of specific cancers, aggregation of different types of cancers in families has also been observed. For example, individuals with BRCA1 and BRCA2 mutations have increased susceptibility not only to breast and ovarian cancers, but also to cancers of the colon, cervix, uterus, pancreas, and prostate. 57
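The dominant and recessive inheritance patterns described in Section 1.3.1 can be summarised in a few lines of Python. This is a deliberately simplified sketch: it considers a single gene, ignores penetrance and later somatic mutation of the remaining allele, and the function name and mode labels are hypothetical.

```python
def is_predisposed(defective_alleles, mode):
    """Return True if the germline genotype confers predisposition.

    defective_alleles: number of defective copies of the gene (0, 1 or 2).
    mode: "dominant" (one defective allele is sufficient) or
          "recessive" (both alleles must be defective).
    """
    if defective_alleles not in (0, 1, 2):
        raise ValueError("defective_alleles must be 0, 1 or 2")
    if mode == "dominant":
        return defective_alleles >= 1
    if mode == "recessive":
        return defective_alleles == 2
    raise ValueError("mode must be 'dominant' or 'recessive'")


if __name__ == "__main__":
    # A heterozygote is predisposed under a dominant model but is only a
    # carrier under a recessive model.
    print(is_predisposed(1, "dominant"))
    print(is_predisposed(1, "recessive"))
```

In practice predisposition is probabilistic rather than binary, which is what penetrance captures; the sketch only encodes the allele-count rule.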

1.4 Evidence for pleiotropic genetic risk factors

Early studies assessed the discordant clustering of cancer in families to determine if there was a general susceptibility to cancer. Case-control, registry- and population-based studies have evaluated familial clustering using risk ratio and kinship coefficient estimations. 58 Identifying shared genetic associations between diseases (pleiotropy) is a useful approach to identify new risk loci, and may elucidate common aetiologies and help in risk prediction. 59 The largest studies, using the Utah Population and Cancer Registry Database and the Swedish Family-Cancer Database, demonstrated excess familial clustering at almost every cancer site in the body. 34, 60-63 However, these studies focused on familial clustering exclusively in nuclear families; therefore, they were not able to separate the role of shared environmental and genetic factors in the familial aggregation of cancer.

A more extensive study by Cannon-Albright et al. (1994) used the Utah Population Database to evaluate familial clustering for more distant relatives. 60 As familial risk can be due to shared exposure to an environmental risk and/or a common genetic mutation, examination of familial clustering in near and distant relatives is useful. In more distant relationships, shared familial environment might be less likely, and the probability of shared genotypes can be measured. 60 This study found that there was significant clustering of cancer outside the nuclear family for cancer sites. 60 These results support the hypothesis of an inherited basis to cancer of almost all sites and support the existence of more than one susceptibility locus for some cancers. 60

In support of this finding, a study by Amundadottir et al. (2005) analysed familial aggregation of cancer in extended families from Iceland to search for genetic factors that contribute to cancer at one or more sites in the body. 58 The authors found that most cancer sites demonstrated a significantly increased risk for the same cancer beyond the nuclear family. 58 They also found significantly increased familial clustering between different cancer sites in both close and distant relatives. 58 Therefore, Amundadottir et al. concluded that genetic factors are involved in the aetiology of many cancers and that these factors are in some cases shared by different cancer sites. 58 These findings support the conclusions of Cannon-Albright et al. However, shared environment or non-random mating for certain risk factors also play a role in the familial clustering of cancer. 58 Several types of study designs can be used to identify genetic risk variants that may be involved in the aetiology of cancers.

1.5 Sarcoma

Sarcomas are a rare group of cancers that arise predominantly from the embryonic mesoderm (the connective tissues of the body), for example, bones, muscles, cartilage and fat. There are over 70 different subtypes of sarcoma that are grouped into two broad classifications of bone or soft tissue (Appendix A). 64 The majority of sarcomas arise in the soft tissue, while malignant bone tumours make up just over 10% of all sarcomas. 65 Soft tissue sarcomas are often further sub-categorised by the line of differentiation, for example, liposarcoma (fat), leiomyosarcoma (smooth muscle), rhabdomyosarcoma (skeletal muscle) and fibrosarcoma (connective tissue). 66, 67 Bone tumours are further classified into bone-forming tumours, cartilage-forming tumours, marrow tumours, or vascular tumours. 68 It can be difficult to diagnose and classify this diverse group of malignancies with overlapping histological features. However, it is important to correctly determine the specific histologic subtype for management and treatment decisions. 67, 68

Sarcomas are a high impact group of cancers that disproportionately affect the young. Although sarcomas are rare, they contribute significantly to the burden of disease as they tend to affect teenagers and young adults. 69, 70 Sarcomas represent only 1% of all cancers in adults but represent 10% of cancers in children and 8% of cancers in adolescents and young adults. 71 There are approximately 800 new sarcoma cases in Australia each year. 72

1.5.1 Sarcoma genetics

There is evidence to suggest a strong genetic basis to sarcomas. First, sarcomas disproportionately affect the young, with early age at diagnosis associated with a genetic basis for many heritable diseases, including hereditary cancers. 73, 74
Second, sarcomas are over-represented among survivors of melanoma, breast cancer, thyroid cancer, Hodgkin's lymphoma, and leukaemias. 75 Third, sarcoma survivors are at increased risk of secondary cancers. 76 Finally, several rare genetic syndromes, such as Li-Fraumeni syndrome, are associated with sarcomas. 35 Appendix C contains a summary of hereditary syndromes associated with sarcoma including genes and genomic locations.

In addition to sarcomas being associated with familial cancer predisposition syndromes, 52, 77-79 sarcomas also show evidence of familial clustering. Up to 33% of paediatric sarcomas are estimated to be associated with a significant family history of cancers. 80 The risk of sarcomas is increased six-fold in relatives of children with sarcoma compared to age-matched controls. When a causal gene mutation is identified, this risk increases to over 250-fold. 81

Whilst some sarcomas are associated with familial inherited predisposition, most sarcomas do not have a known cause. Very little is currently known about the causes of sarcoma because sarcomas are so rare. 65 Several risk factors have been associated with sarcomas including ionising radiation, 82, 83 viruses (Epstein-Barr virus 84 and Kaposi's sarcoma-associated herpes virus 85), occupation, 86-90 exposure to chemicals, 91-96 hormones, 97, 98 antibiotics, 99 medications for nausea used during pregnancy, 100 use of antibiotics in babies, 101 birth weight, 102 gestational age, 103 birth order, and maternal age. 104, 105

Sarcomas that arise due to somatic mutations are classified into two main groups based on genetics:

1. Sarcomas with specific recurrent genetic mutations on a background of relatively few other chromosomal changes
2. Sarcomas with no specific genetic mutations on a complex background of numerous chromosomal changes

Approximately one-third of all sarcomas have specific recurrent genetic mutations. 106 These tumours contain either disease-specific chromosome translocations or specific activating mutations.
Most sarcomas with specific recurrent genetic mutations are characterised by balanced or reciprocal translocations (the exchange of pieces between two chromosomes), resulting in two derivative chromosomes with no net gain or loss of chromosomal material. 107 In some cases, only one derivative chromosome is formed, and some genetic material is lost. The fusion proteins produced as a result of the translocation can contribute to oncogenesis by increasing cell proliferation, promoting anchorage-independent cell growth, overriding cell contact adhesion, inhibiting apoptosis, enhancing invasion and suppressing terminal differentiation. 107 Appendix D contains a table of known translocations that have been associated with sarcoma.

The remaining sarcomas in the specific recurrent genetic mutations group are characterised by specific activating mutations. 108 These tumours show some degree of aneuploidy, but generally have less disordered karyotypes than the complex group of sarcomas. 83 An example of a sarcoma subtype with a specific activating mutation is gastrointestinal stromal tumours, which have activating mutations in KIT or PDGFRA. 109, 110

The remaining two-thirds of sarcomas have highly complex unbalanced karyotypes lacking specific genetic translocations. 66, 111 This group is mostly composed of spindle cell or pleomorphic sarcomas including leiomyosarcoma, myxofibrosarcoma, pleomorphic liposarcoma, pleomorphic rhabdomyosarcoma, malignant peripheral nerve sheath tumour, angiosarcoma, extraskeletal osteosarcoma and spindle cell/pleomorphic unclassified sarcoma (previously known as spindle cell/pleomorphic malignant fibrous histiocytoma). 111 These neoplasms show gains and losses of many chromosomes or chromosome regions and amplifications. 111 Many of them share recurrent aberrations (such as the gain of 5p13-p15) that play a significant role in tumour progression or metastatic dissemination. 111 Appendix E lists the genomic regions identified in complex sarcomas.

1.6 Methods for identifying genetic risk variants

Several study designs can be employed to identify genetic risk variants.
Each study design is suited to identifying different types of mutations, from highly penetrant genes in rare Mendelian disorders to low-penetrance variants in more common diseases, and rare variants.

1.6.1 Linkage mapping

Linkage mapping in families has been used with success in localising highly penetrant disease-causing genes (e.g., BRCA1 and BRCA2) and, in particular, those involved in rare Mendelian human diseases (e.g., Online Mendelian Inheritance in Man (OMIM), http://www.omim.org/). Linkage analysis in families is a form of positional cloning and makes no underlying assumptions about the nature of the genes involved. In human disease studies, the aim of linkage mapping is first to determine the chromosomal location of putative risk genes by identifying polymorphic DNA markers that cosegregate with a disease of interest. The genes in such linkage regions are referred to as positional candidate risk genes. These genes are then prioritised for further genetic and molecular analyses to identify the specific causal mutations or polymorphisms.

Linkage mapping has been used to identify highly penetrant susceptibility alleles associated with Mendelian familial cancer predisposition syndromes. 112-117 However, these variants explain only a small fraction of the genetics of all cancer cases. For example, inherited mutations in the BRCA1 and BRCA2 genes account for approximately 2-3% of all breast cancer cases. 118, 119 However, more prevalent founder mutations in these genes can explain up to about 10% of the disease in some populations. 47, 120-123

1.6.2 Association

It has been postulated that more common cancers that do not show a clear pattern of inheritance are caused by many genes that each confer a small risk of disease. 124 For disease risk genes of small effect, association studies can be more powerful than linkage studies. 125 With the advent of dense panels of single nucleotide polymorphism (SNP) markers and high-throughput technology for efficiently genotyping them in thousands of individuals, the genome wide association (GWA) analysis study design was subsequently adopted widely for the genetic analysis of common human diseases.
GWA is also a form of positional cloning and relies on linkage disequilibrium, the non-random association of alleles at different loci that is a function of population history.
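As an illustration of what "non-random association of alleles" means quantitatively, the sketch below computes two standard linkage disequilibrium measures for a pair of biallelic loci: the coefficient D = p_AB - p_A * p_B and the squared correlation r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)). The function name and the example frequencies are hypothetical and are included only to show the calculation.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B.
    p_a:  frequency of allele A at the first locus.
    p_b:  frequency of allele B at the second locus.
    """
    for p in (p_ab, p_a, p_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("frequencies must lie between 0 and 1")
    # D is zero when the haplotype frequency equals the product of the
    # allele frequencies, i.e. when the alleles associate at random.
    d = p_ab - p_a * p_b
    denominator = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denominator == 0.0:
        raise ValueError("both loci must be polymorphic")
    return d, (d * d) / denominator


if __name__ == "__main__":
    # Hypothetical frequencies: complete association, then independence.
    print(linkage_disequilibrium(0.5, 0.5, 0.5))
    print(linkage_disequilibrium(0.25, 0.5, 0.5))
```

When r^2 is close to 1, a genotyped marker SNP is a good proxy for an ungenotyped risk variant nearby, which is the property that GWA studies exploit.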

GWA studies have identified thousands of lower penetrance risk variants for common human traits and diseases, typically with small effect sizes (odds ratio between 1.1 and 1.5). 126 Lower penetrance genetic variants associated with non-familial syndrome breast cancer confer slight risk alterations (odds ratio of approximately 1.2), 127 compared to the high penetrance variants in BRCA1 and BRCA2 identified by linkage, which have odds ratios between 2 and 4. 127 Most genetic cancer risk variants identified so far confer relatively small increments in risk and explain only a small proportion of familial clustering. 128 The inability of the risk variants detected by GWA studies to account for much of the heritability of most common disorders, termed missing heritability, has led to an emerging view that rare variants with larger effect sizes could be responsible for a substantial proportion of genetic risk for complex human disease. 129 Significant advances in genome sequencing have now offered the possibility of using this technology as an alternative study design to GWA studies for the detection of rare genetic risk variants.

1.6.3 DNA sequencing

DNA sequencing analysis is the process of determining the precise order of nucleotides in a given DNA sample. One aim of DNA sequencing is to identify genomic variations and to associate those changes with human disease. A breakthrough in DNA sequencing technology was the development of Sanger's chain termination method. 130 In this approach, sequencing occurs by the selective incorporation of a single chain-terminating dideoxynucleotide by DNA polymerase. 130 For approximately 40 years Sanger sequencing was the most widely used approach. Since the completion of the Human Genome Project in 2003, there have been substantive developments in Next Generation Sequencing (NGS) technologies.
Whereas the first human genome took 13 years and several billion dollars to complete, a human genome can now be sequenced in a day for US$1,000 (at 20X coverage). The speed of sequencing has increased as NGS enables the simultaneous detection of multiple mutations in multiple genes by the parallel sequencing of millions of different DNA fragments. 131 The development of affordable and efficient next generation DNA sequencing technologies has now provided a new study paradigm to search for rare risk variants involved in common, complex diseases.

The impact of NGS technology on the discovery of genetic variants in human disease has been profound. Since the introduction of NGS there have been enormous advances in speed, read length and throughput of sequencing studies. 132 The advent of NGS has allowed the inquiry of nearly every base in the genome. 133 The growth in cancer genomics discovery has been unprecedented; knowledge of genes frequently mutated in cancer has grown from four genes in 2004 to over 600 genes currently listed in the Catalogue of Somatic Mutations in Cancer (COSMIC) (v79, released 14-NOV-16). 134 Initiatives such as the Cancer Genome Atlas 135 and the International Cancer Genome Consortium 136 have employed NGS strategies to characterise tumour genomes and provide multi-platform data for thousands of tumours from a variety of cancer types and subtypes. 137

Typical NGS applications include DNA sequencing, RNA sequencing (to measure gene expression changes and to discover new transcripts), chromatin immunoprecipitation sequencing (to detect genome wide transcription factor binding sites and chromatin-associated modifications) and methylation sequencing (to profile various types of DNA methylation). 138, 139 Next generation DNA sequencing can be used for whole genome sequencing (WGS), whole exome sequencing (WES), or sequencing of a specifically targeted region of the genome. 138 The NGS workflow consists of multiple steps including library preparation and enrichment, sequencing, base calling, sequence alignment and variant calling.

1.6.4 Whole exome sequencing

WES involves sequencing only the protein-coding region of the genome. The human exome makes up approximately 1% of the human genome. However, the majority (85%) of disease-causing mutations in Mendelian disorders are expected to arise in the exome. 140 Therefore WES is a cost-effective initial strategy to identify disease-causing variants.
In the last decade, WES of unrelated individuals or families with multiple members affected by a rare disorder has identified the genetic basis of diseases such as Freeman-Sheldon syndrome, Kabuki syndrome, Miller syndrome, and autosomal dominant spinocerebellar ataxia. 141-146 WES studies have also identified more than 50 novel tumour-predisposing genes, listed in Appendix B.

1.6.5 Whole exome sequencing of cancer cluster families

While WES has been used with great success to identify novel tumour predisposing mutations, only one published study has used WES to identify pleiotropic genetic risk variants that predispose families to more than one type of cancer. The recent WES study by Thutkawkorapin et al. (2016) utilised NGS technology to investigate a family with a dominant cancer syndrome with a high risk of both rectal and gastric cancer. 147 The authors hypothesised that the mixed representation of rectal and gastric cancer among family members was due to one predisposing mutation in one gene. 147 The authors performed WES in three family members, two with rectal cancer and one with gastric cancer, and followed up with WES and Sanger sequencing in additional family members, other patients and controls. 147 Thutkawkorapin et al. identified 12 novel nonsynonymous single nucleotide variants (SNVs) shared among five affected members of this family. The authors suggested that at least five of the 12 variants may be candidates that contributed to the disease in the family. 147 These variants did not segregate in other families and are therefore unlikely to be highly penetrant variants.

1.7 Next generation sequencing study considerations

NGS technologies can be used to identify rare variants in tumour or germline DNA that increase an individual's susceptibility to developing cancer. 133 It is essential to compare tumour DNA with matched germline DNA to determine somatic and germline alterations in cancer. 148 Germline variants exist in the normal germline sequence. 149 Somatic variants are those in the tumour sequence but not in the normal germline sequence. 149 The ability of NGS to detect somatic variants depends on the variant frequency within the tumour sample, sample contamination, tumour heterogeneity, sequencing error, and the scarcity of somatic mutations within a genome. 150, 151

Recently there has been a return to family-based designs to identify rare risk variants involved in common human disease, based on the hypothesis that affected members of the same family will carry the same rare susceptibility variant. 133, 152-155 Therefore, the number of individuals needed for rare variant discovery is potentially smaller than in cohorts of unrelated individuals. 133 Families used in these types of studies to identify rare inherited variants can either be consanguineous families, or non-consanguineous, large multigenerational and multiplex pedigrees. 133 Targeted sequencing technologies have been used to successfully identify new causal genes in hereditary non-polyposis colon cancer and familial adenomatous polyposis, 156 and hereditary breast and ovarian cancers. 157

Two-phase NGS family study designs are recommended. In the first phase, family members are sequenced, and the discovered variants are ranked according to their likelihood of being associated with the trait. 158 In the second phase, the variants are tested for association in an independent population-based sample. 158

Families used to study cancer clustering should be selected carefully. Suitable families have multiple affected and unaffected individuals from two or more generations available for analysis. 49 Families in which various members develop a rare form of cancer, such as sarcoma, are more likely to have a mutation segregating in an inherited cancer gene compared to families affected by more common types of cancer, for example, adenocarcinomas of the lung, breast, prostate, and colon. 49 Therefore ideal families for genetic studies of familial clustering of cancers are those with multiple generations of affected and unaffected individuals and families with multiple cases of a rare form of cancer such as sarcoma.

1.8 Known cancer predisposition genes

Over the last 30 years, more than 100 cancer predisposition genes have been identified using a variety of strategies. 134, 159-161 Figure 1.1 shows the location of known cancer predisposition genes and a full list of known cancer predisposition genes is available in Appendix F.
However, only a small proportion of familial cancer risk can be explained by established cancer predisposition genes. 38, 162 The use of family-based NGS strategies in this field may facilitate the discovery of rare genetic mutations that explain the remaining genetic risk for cancer predisposition, provided that much of the missing heritability is due to gene variants that have relatively large effects on risk but are too rare to be detected by GWA studies.

1.9 Summary

A substantial number of studies have been performed to identify genetic risk variants associated with cancer. Linkage studies have identified highly penetrant risk alleles associated with Mendelian autosomal dominant cancer predisposition syndromes. Association studies have been used to successfully identify lower-penetrance variants associated with more common types of cancer. However, much of the heritability of cancer remains unexplained. The introduction of NGS technology has allowed the identification of rare variants that are expected to explain some of the missing heritability of cancer. Study considerations for using NGS in cancer research include sequencing both tumour and germline DNA to differentiate somatic from germline mutations, and using a family-based study design that includes multiple generations of affected and unaffected individuals and families with multiple cases of a rare form of cancer such as sarcoma.

To date, there have been few studies on shared genetic risk factors in cancer cluster families that are not defined by a known familial cancer predisposition syndrome. This study will perform WES in cancer cluster families of mixed cancer types. WES will be conducted on both affected and unaffected individuals from cancer cluster families that have been identified by a sarcoma proband, in order to identify rare cancer predisposing variants. Only one previous study has used NGS technology to investigate shared genetic risk variants across multiple cancer types. 147 This study will be the second WES study performed on cancer cluster families to identify shared genetic risk variants, and the first WES study to select cancer cluster families by a sarcoma proband.

[Chromosome ideogram of chromosomes 1-22, X and Y, with cytogenetic band labels; markers indicate the position of known cancer predisposition genes.]

Figure 1.1: Location of known cancer predisposition genes

1.10 Aims

The identification of genes that predispose individuals to cancer is a high priority in human medical research. It is anticipated that this knowledge will drive a new era of personalised human medicine, potentially allowing tailoring of specific drug treatments and interventions. The use of NGS in families currently represents an optimal study design for the identification of rare genetic variants involved in the risk of cancer and other common complex human diseases. Waves of novel genetic discoveries using this approach are now regularly appearing in the literature. While it is sometimes difficult to distinguish sporadic from hereditary cancer, a rare cancer such as sarcoma occurring twice within one family is epidemiologically striking. 163 The identification of genetic risk factors for cancer will be a significant contribution to medicine, particularly in the provision of health care to cancer patients and their families.

The aims of this study are:

1. To perform WES on three cancer cluster families identified by a sarcoma proband using peripheral blood samples.
2. To identify candidate germline risk variants by prioritising and filtering structural and regulatory variants that segregate with cancer or sarcoma in the three families.
3. To perform a matched tumour and germline analysis on two myxoid liposarcoma patients using peripheral blood genomic DNA and genomic DNA isolated from sarcoma tumour tissue to distinguish somatic mutations.
4. To validate the most significant putative germline and somatic cancer predisposing mutations in unrelated sarcoma cases and cancer-free controls.


Chapter 2 Aim 1: Whole exome sequencing of three cancer cluster families identified by a sarcoma proband

2.1 Introduction

Next Generation Sequencing (NGS) has provided tremendous insight into the genomic landscape of several tumour types, including defining tumour subtypes, identifying new druggable targets and revealing the heterogeneity of many tumours. 164, 165 Protein-coding genes constitute approximately 1% of the human genome but harbour nearly 85% of the disease-causing mutations of Mendelian diseases, although this may be due to ascertainment bias. 140, 166-169 Genetic variations discovered in coding regions of genes may inform immediate treatment choices and also further other therapeutic discoveries. 170, 171 Therefore, exome sequencing is an efficient approach for identifying actionable variants. The first aim of this study was to perform whole exome sequencing (WES) in three cancer cluster families ascertained from an index sarcoma patient.

2.1.1 Ion Proton platform

The Ion Proton platform from Thermo Fisher Scientific is a benchtop semiconductor-based sequencing system for human genome, exome or transcriptome sequencing. Semiconductor sequencing is based on the detection of hydrogen ions that are released during the polymerisation of DNA using a sequencing-by-synthesis approach. 172 The Ion Proton sequencing chemistry uses native deoxynucleotides (dNTPs) and electronic sensors to detect the release of hydrogen ions as the dNTPs are incorporated into the growing DNA strand. 173 Microwells are sequentially flooded with each dNTP to distinguish the order of each nucleotide. 173 Homopolymer runs are detected by the magnitude of the pH change, which indicates how many nucleotides were added. 173 Errors on the Ion Proton are mostly insertions and deletions in homopolymer runs, because it is difficult to evaluate the magnitude of the signal when several dNTPs are incorporated in one cycle. 174

Automated sequencing analysis is performed by the Torrent Suite software that is preinstalled on the Torrent Server. The web-based interface can be used to plan, monitor and view the results of sequencing runs. The Torrent Suite base calling algorithm converts the raw file information into a sequence of bases and writes the sequence to an unaligned Binary Alignment/Map (*.bam) file. The reads in the *.bam file are then aligned using the Torrent Mapping Alignment Program (TMAP). Variants are called using the Torrent Variant Caller (TVC). Both TMAP and TVC were developed specifically for Ion Torrent data and were used in this chapter.
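The relationship between signal magnitude and homopolymer length described above can be illustrated with a small sketch. This is not the Torrent Suite base caller: the function name `decode_flows`, the simple cyclic flow order `TACG`, and the assumption that signals are already normalised so that one incorporated base gives a value near 1.0 are all hypothetical choices made for illustration.

```python
from typing import List


def decode_flows(signals: List[float], flow_order: str = "TACG") -> str:
    """Convert per-flow signal magnitudes into a base sequence.

    Each flow introduces one nucleotide; the normalised signal is rounded
    to the nearest integer to estimate how many bases were incorporated.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        count = max(0, round(signal))  # 0 = no incorporation, 2+ = homopolymer
        bases.append(nucleotide * count)
    return "".join(bases)


# Flows T, A, C, G, T, A: one T, no A, two C, one G, three T, no A.
clean = [1.02, 0.05, 1.96, 0.98, 3.10, 0.03]
print(decode_flows(clean))  # TCCGTTT

# A true 5-base homopolymer measured as 4.4 is decoded as 4 bases,
# which would appear downstream as a one-base deletion.
print(decode_flows([4.4]))  # TTTT
```

The second call shows why longer homopolymers are error-prone under this model: the absolute measurement error needed to shift the rounded count stays the same while the signal itself grows.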

2.2 Methods

2.2.1 Families selected for whole exome sequencing

The patients were recruited from the International Sarcoma Kindred Study (ISKS). The ISKS was initiated in 2008 to investigate the prevalence and nature of heritable risk in sarcoma populations. 175 The ISKS is a global genetic, biological, epidemiological, and clinical resource for researchers to investigate the hereditary characteristics of sarcoma. Patients were recruited from several sites across Australia, France, New Zealand, India, the United States of America, the United Kingdom, and Canada. The ISKS Steering Committee granted access to the database for this study under an ethically approved protocol (the University of Western Australia (UWA) Human Research Ethics Committee RA/4/1/6434).

Patients with sarcoma (probands) were recruited from major sarcoma treatment centres, regardless of their family history of cancer. Individuals with adult-onset sarcoma (> 15 years old) were eligible for the ISKS. Family members were also invited to participate if the patient with sarcoma was < 45 years of age, or if there was a significant family history of cancer. 175 Study questionnaires containing demographic, medical, epidemiological and psychosocial information were completed, including personal history of cancer or exposure to known risk factors for sarcoma. 176 Patients were also asked to donate a venous blood sample and tumour sample, and to provide access to medical information and to information about deceased relatives (collected from cancer registries and other health organisations). Medical history and treatment records were obtained for each proband where possible. 176 All reported cancer diagnoses were independently verified by medical records, Australian and New Zealand cancer registries or death certificates.

There are now more than 1,300 families enrolled in the ISKS with detailed pedigree information and cancer incidence verified for each. More than 1,800 blood samples have been collected and approximately 2,100 questionnaires completed. The average age at onset for sarcoma in the ISKS cohort is 46.6 years (range 3-95 years), with the majority being sarcomas of soft tissue. Family members have reported over 2,000 other cancers. The average age at diagnosis for these other cancers is 57.9 years, compared to 65.6 years in the general population. 175

Since the establishment of the ISKS, several studies have focused on identifying TP53 germline mutations in Li-Fraumeni syndrome and the less stringent Li-Fraumeni-like syndrome in the cohort. 163, 176, 177 A previous study found pathogenic TP53 mutations in blood DNA of 20 of 559 sarcoma probands (3.6%) in the ISKS cohort. 176 A study of familial cancer cluster patterns in the ISKS found that 14% of the ISKS families showed patterns of familial clustering that did not conform to any known syndrome. 163 A more recent study using the ISKS discovered that more than half of the sarcoma patients had an excess of putatively pathogenic monogenic and polygenic germline variation in known and novel cancer genes using a case-control rare variant burden test. 178 Together, the finding that 14% of cancer cluster families in the ISKS do not conform to known syndromes and the excess of rare monogenic and polygenic germline mutations in more than half of the ISKS patients indicate the potential utility of this cohort for identifying novel genetic risk factors for sarcomas and cancer clustering in families.

Three ISKS families that do not conform to known cancer syndromes were targeted for selection in the current study and represented a unique opportunity to identify novel variants that may influence sarcoma or cancer development. These three families were selected based on the following criteria:

- The sarcoma proband must have blood and tumour biospecimens available
- The pedigree must contain a first degree relative with cancer who also has germline samples available
- The pedigree must contain at least one unaffected relative with germline material available, and
- The family is not defined by TP53 or other known familial cancer susceptibility genes

Family 1 (Figure 2.1) was identified by a proband (Patient 1-III-1) who developed Ewing's sarcoma at 15 years of age and who has a non-identical twin brother (Patient 1-III-2) who has not developed sarcoma. The proband's father (Patient 1-II-2) developed myxoid liposarcoma at 39 years of age. Germline DNA was available from the proband and father, and from the proband's twin brother, mother (Patient 1-II-3), an aunt (Patient 1-II-1) and grandparents (Patient 1-I-1 and Patient 1-I-2), who were all unaffected by cancer.

Family 2 (Figure 2.2) was identified by a proband (Patient 2-II-1) who developed myxoid liposarcoma at 61 years of age. The proband's father (Patient 2-I-1) developed prostate cancer at 71 years old, and two of the proband's sisters were diagnosed with skin melanomas at 44 (Patient 2-II-3) and 46 (Patient 2-II-2) years of age. Germline DNA was available for the proband, one of his unaffected children (Patient 2-III-1), three of his sisters (including an unaffected sister, Patient 2-II-4), and his parents (Patient 2-I-1 and Patient 2-I-2).

In family 3 (Figure 2.3), there are two individuals with sarcoma: the proband (Patient 3-III-1), who developed a primitive neuroectodermal tumour (PNET) at 22 years of age, and her grandmother (Patient 3-I-1), who developed a malignant peripheral nerve sheath tumour (MPNST) at 79 years old. The proband's father (Patient 3-II-1) was diagnosed with prostate cancer at 51 years of age, and the proband's aunt developed breast cancer at age 36. Germline DNA was available from the proband, her parents (Patient 3-II-1 and Patient 3-II-2), her unaffected brother (Patient 3-III-2), and her grandmother.

[Pedigree diagram of Patients 1-I-1, 1-I-2, 1-II-1, 1-II-2, 1-II-3, 1-III-1 and 1-III-2. Diagnoses shown: sarcoma (39), sarcoma (15). Key: affected male, affected female, unaffected male, unaffected female, proband.]

Figure 2.1: Pedigree of family 1

[Pedigree diagram of Patients 2-I-1, 2-I-2, 2-II-1, 2-II-2, 2-II-3, 2-II-4 and 2-III-1. Diagnoses shown: prostate (71), sarcoma (61), melanoma (46), melanoma (44). Key: affected male, affected female, unaffected male, unaffected female, proband.]

Figure 2.2: Pedigree of family 2

[Pedigree diagram of Patients 3-I-1, 3-II-1, 3-II-2, 3-III-1 and 3-III-2. Diagnoses shown: sarcoma (79), prostate (51), sarcoma (22). Key: affected male, affected female, unaffected male, unaffected female, proband.]

Figure 2.3: Pedigree of family 3

2.2.2 DNA extraction

DNA extraction was performed by researchers at the Peter MacCallum Cancer Centre in Melbourne, Australia. Anti-coagulated blood was processed using a Ficoll gradient. DNA was extracted from the nucleated cell product using the QIAamp DNA blood kit (Qiagen). 176

2.2.3 Whole exome sequencing

WES was performed by the candidate at the Curtin University - UWA Centre for Genetic Origins of Health and Disease (GOHaD). The germline samples from Patient 3-I-1 and Patient 3-III-2 were badly degraded and of poor quality. Therefore, whole genome amplification was performed on these two samples using a Qiagen REPLI-g Mini Kit (Qiagen) as per the manufacturer's instructions. Exome library preparation was performed using the Thermo Fisher Scientific Ion AmpliSeq Exome RDY Kit as per the manufacturer's instructions. Libraries were loaded onto the Ion PI v2 BC Chip (Thermo Fisher Scientific) using the Ion Chef and sequenced on the Ion Proton as per the manufacturer's instructions. An overview of the WES pipeline is shown in Figure 2.4.

2.2.3.1 Library preparation

The target regions were amplified using the Ion Ampliseq Exome RDY Library Preparation from 100 ng of genomic DNA in the Ion Ampliseq Exome RDY plates and the Ion Ampliseq HiFi Mix. The amplicons were treated with FuPa reagent to partially digest the primers and to phosphorylate the amplicons. The amplicons were then ligated to Ion Xpress Barcode Adapters, purified and dissolved in 50 µl of Low TE. Validation of enrichment and quantification of target DNA were performed on the ViiA 7 (Thermo Fisher Scientific). Three 10-fold dilutions of Escherichia coli control library were prepared at 6.8 pM, 0.68 pM and 0.068 pM. A 9 µl volume of each control library and each sample was added to wells of a 96-well qPCR plate together with 11 µl of the reaction mixture, for a total reaction volume of 20 µl. The qPCR was run for 40 cycles.

[Flowchart]
Samples: 19 germline samples
Library preparation: Ion Ampliseq Exome RDY Library Preparation
Sequencing platform: Life Technologies Ion Proton
Quality check: Torrent Suite software
Sequence alignment: Torrent Mapping Alignment Program
Variant calling: Torrent Variant Caller plugin; Genome Analysis Toolkit UnifiedGenotyper
Merge using BCFtools: BCFtools intersect

Figure 2.4: Whole exome sequencing pipeline flowchart

2.2.3.2 Exome sequencing

Run plans were created for each chip with the barcode and sample identity number on the Torrent Browser server. The plans were created using the Torrent Suite Software with the run parameters listed in Table 2.1.

Table 2.1: Parameters used to create whole exome sequencing run plans using Torrent Suite software

Parameter            Specified
Application          DNA
Kit                  Ion Ampliseq Exome Kit
Library kit type     Ion Ampliseq Exome RDY IC Kit 1x8
Template kit         Ion Chef, Ion PI IC 200 kit
Flows                520
Chip type            Ion PI chip
Barcode set          IonXpress
Reference library    Human genome build 19 (hg19)
Plug ins             variantcaller and coverageanalysis

The sample libraries were diluted to approximately 50 pM, the optimal input concentration. The Ion PI v2 BC chips were prepared for loading by performing alternate washes with 100% isopropanol, Ion PI Chip Preparation Solution, nuclease-free water, 0.1 M NaOH, and 1X Ion Chip Priming Solution as per the manufacturer's instructions. The Ion PI IC Reagents 200 cartridge was removed from the freezer and warmed to room temperature 45 min before the Ion Chef Instrument run. The Ion Chef Instrument was loaded with treated Ion chips, consumables, reagents and libraries as per the manufacturer's instructions (Thermo Fisher Scientific). The Ion Chef Instrument run completed overnight.

The following day, the Ion Proton Sequencer was initialised as per the manufacturer's instructions (Thermo Fisher Scientific). The Ion chips were unloaded from the Ion Chef Instrument, and the first chip was loaded into the Ion Proton Sequencer. The second chip was stored in a container at 4 °C until 20 min before the end of the first run. When the first run was completed, the second chip was loaded immediately for sequencing.

2.2.4 Sequence alignment and variant calling

The Torrent Suite software (Life Technologies, v4.4.3.3) was used to perform base calling and, through its Torrent Variant Caller (TVC), variant calling. The base calls were stored in an unmapped *.bam format. The Torrent Suite Torrent Mapping Alignment Program (TMAP) was used to align sequencing reads to the reference genome, human genome build 19 (hg19). TMAP takes some or all of the reads produced by the WES pipeline as input, along with the reference genome and index files. The output from TMAP is a mapped *.bam file.

2.2.5 Variation to sequence alignment and variant calling

2.2.5.1 Torrent variant caller plugin

As an additional measure, variant calling was performed a second time using the TVC Plugin (Life Technologies, version 5.0.0). The TVC Plugin software was installed on Magnus (Pawsey Centre), a Cray XC40 supercomputer. The AmpliSeq Exome capture browser extensible data (*.bed) file from Life Technologies was used as both the target region *.bed file and the primer trim *.bed file (available from https://www.ampliseq.com). The output is a variant call format (*.vcf) file containing meta-information lines, a header line and data lines for each position in the genome. 179 Each individual was called separately using TVC, generating 19 individual *.vcf files. The details used to run the TVC Plugin on Magnus are outlined in Table 2.2.

Table 2.2: Parameters used to run the Torrent Variant Caller plugin to call variants

Parameter          Specified
Input bam          All *.bam files from the Ion Proton
Reference fasta    hg19.fasta
Region bed         AmpliSeqExome.20131001.designed.bed
Primer trim bed    AmpliSeqExome.20131001.designed.bed
Error motifs       ampliseqexome_germline_p1_hiq_motifset.txt

Each of the 19 patients was called individually, and the resulting files were then merged using Binary Variant Call Format Tools (BCFtools) vcf-merge 179 to create a single *.vcf file. Because TVC calls each *.bam file individually, a position that is absent from one patient's calls may be either truly missing or homozygous reference. BCFtools missing-to-reference 179 was therefore run on the merged file to set unknown positions to homozygous reference (0/0).

2.2.5.2 Genome analysis toolkit

The Genome Analysis Toolkit (GATK, version 3.4.0) UnifiedGenotyper 180 was used in addition to the single-sample calling to check the accuracy of the variant calls. GATK can perform multi-sample calling. Therefore, all 19 patients were called together. GATK UnifiedGenotyper was run on a secure Linux server owned by GOHaD (operating system: Bio-Linux, based on Ubuntu 14.04.3). UnifiedGenotyper uses a Bayesian genotype likelihood model to simultaneously estimate the most likely genotypes and allele frequency in a population of samples, and produces a genotype for each site. First, each sample was sorted using SAMtools sort and indexed using SAMtools index. 181 Picard CreateSequenceDictionary (version 2.4.1, https://github.com/broadinstitute/picard) was used to create a sequence dictionary for the reference sequence, and Picard BedToIntervalList was then used to convert the *.bed file to Picard interval list format. The specifications used to run GATK UnifiedGenotyper on the server are outlined in Table 2.3.
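The fill step described in Section 2.2.5.1 can be illustrated with a minimal sketch. It is not the BCFtools implementation: the function name `fill_missing_genotypes` is hypothetical, the example record uses three hypothetical sample columns, and the sketch assumes GT is the first FORMAT key on each data line. As noted above, the step itself rests on the assumption that a missing call means homozygous reference rather than no coverage.

```python
def fill_missing_genotypes(vcf_line: str) -> str:
    """Replace missing genotypes (./. or .) with 0/0 on one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    # Columns 0-8 are CHROM..FORMAT; sample columns start at index 9.
    for i in range(9, len(fields)):
        subfields = fields[i].split(":")
        if subfields[0] in {"./.", "."}:
            subfields[0] = "0/0"  # treat "not called" as homozygous reference
            fields[i] = ":".join(subfields)
    return "\t".join(fields)


# One merged record with three samples; the second sample has no call.
record = "chr4\t84383810\t.\tC\tT\t50\t.\t.\tGT\t0/1\t./.\t1/1"
print(fill_missing_genotypes(record).split("\t")[9:])  # ['0/1', '0/0', '1/1']
```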

Table 2.3: Parameters used for Genome Analysis Toolkit UnifiedGenotyper to call variants

Parameter                     Specified
Reference fasta               hg19.fasta
Genotype likelihoods model    SNP
Input bam                     All sorted *.bam files from the Ion Proton
Target interval list          AmpliSeqExome.20131001.bed
Out mode                      EMIT_ALL_CONFIDENT_SITES
Metrics                       Directory for metrics
Stand-conf-call               50.0
Stand-emit-conf               10.0
Annotation                    AlleleBalance

2.2.5.3 Intersect variant calls from Torrent Variant Caller and Genome Analysis Toolkit

The resulting *.vcf files from TVC and GATK were combined using BCFtools intersect (isec) 181 with exact allele matching to identify the calls common to both callers. This tool created both the intersection and the complements of the TVC and GATK *.vcf files. The intersect data from both callers were used for the remainder of the analysis.

2.2.6 Recalibrate variants

GATK VariantRecalibrator 180 was used to assign a well-calibrated probability to each variant call in a call set. It is the first stage of a two-stage process called Variant Quality Score Recalibration (VQSR). The first pass is performed by VariantRecalibrator 180 and consists of creating a Gaussian mixture model by looking at the distribution of annotation values over a high quality subset of the input call set and then scoring all input variants according to the model. 180 The recalibrated variant quality score provides a continuous estimate of the probability that each variant is correct, allowing the call set to be partitioned into quality tranches. 182 The primary purpose of the tranches is to establish thresholds within the data that correspond to particular levels of sensitivity relative to the truth sets. The second pass is performed by the ApplyRecalibration tool 180, which applies the model parameters to each variant in the input *.vcf files to produce a recalibrated VCF file in which each variant is annotated with its variant quality score log-odds (VQSLOD) value. 182 This step also filters the calls on this new logarithm of the odds (LOD) score by writing PASS in the FILTER column for variants that meet the specified threshold, and LowQual for variants that do not. 180 The filter level selected for the ApplyRecalibration tool was 99.0.

2.2.7 Genotype concordance

To validate the genotype calls, concordance was measured in three patients who had previously been genotyped. The three patients, Patient 1-II-2, Patient 2-II-1 and Patient 3-III-1, all sarcoma cases, had been genotyped previously through the ISKS using an Agilent HaloPlex custom panel capturing the coding sequence of 85-101 genes. Genotype calls were compared across the three sarcoma cases to determine how many calls (0/0, 0/1 or 1/1) were the same between the intersect file and the previous genotyping using the Agilent HaloPlex custom panel. Any discordant variants were checked in the *.vcf files. The *.bam files were also visually examined in Integrative Genomics Viewer (IGV, version 2.3.80). 183, 184

2.3 Results

2.3.1 Families selected for whole exome sequencing

This study included 19 patients from three multigenerational mixed cancer families. Of these, 11 (58%) were female, and nine (47%) had been diagnosed with cancer. The average age of the patients at the time of blood collection was 55.3 years (range: 15 years to 90 years) and the average age of cancer (including sarcoma) onset was 47.5 years (range: 15 years to 79 years).
The average age of onset in the three families is younger than the average age of onset of all cancers in the whole ISKS cohort (57.9 years) but similar to the age of onset of sarcomas (46.6 years).

2.3.2 Whole exome sequencing

Table 2.4 shows the summary statistics generated by the Torrent Suite software. The average depth of coverage across all samples was 100.66 reads, which is a sufficient depth for detecting single nucleotide variants (SNVs). 185, 186 The average number of mapped reads was 38,484,361, and the average total genotyping rate was 98.9%.

Table 2.4: Depth of coverage summary from Torrent Suite

Patient    Mapped reads    On target    Mean depth    Number of variants
1-I-1      43,848,035      94.24%       115.80        47,625
1-I-2      28,509,630      96.37%       79.96         48,690
1-II-1     28,343,027      96.36%       80.53         47,334
1-II-2     38,178,599      93.83%       94.99         47,113
1-II-3     39,158,180      94.60%       98.83         47,915
1-III-1    37,229,527      93.93%       93.94         46,670
1-III-2    42,568,341      95.26%       108.40        47,641
2-I-1      33,480,989      94.43%       87.30         42,574
2-I-2      48,585,532      95.67%       131.80        48,220
2-II-1     35,464,936      94.21%       95.84         47,678
2-II-2     45,333,955      94.63%       119.30        48,491
2-II-3     46,884,691      95.38%       128.30        49,238
2-II-4     36,173,806      95.19%       99.70         48,517
2-III-1    30,353,951      95.56%       82.98         47,282
3-I-1      34,870,702      96.03%       79.57         41,493
3-II-1     42,063,872      95.10%       114.40        53,329
3-II-2     40,663,971      95.07%       110.60        52,846
3-III-1    47,344,500      95.68%       118.20        48,337
3-III-2    32,146,623      95.01%       72.06         41,169
Average    38,484,361      95.08%       100.66        47,482

2.3.3 Variant calling

A total of 5,099,324 unknown positions were changed to reference positions in the merged TVC *.vcf file using BCFtools missing-to-reference. In total, 109,503 variants were called by TVC and 238,530 variants were called by GATK UnifiedGenotyper. Figure 2.5 shows the number of calls made by TVC and GATK and the intersection of both callers. The intersect file from both callers contained 94,263 variants for all 19 patients.

[Venn diagram: 144,267 variants called by the Genome Analysis Toolkit only; 94,263 called by both callers (intersect); 15,240 called by the Torrent Variant Caller only.]

Figure 2.5: The number of variants called by Torrent Variant Caller and Genome Analysis Toolkit UnifiedGenotyper, and the number of variants that were called by both callers (intersect)

2.3.4 Recalibrate variants

Figure 2.6 shows the tranche plot generated by GATK VariantRecalibrator. The first tranche (90.0) has the lowest truth sensitivity but the highest novel Ti/Tv; it is very specific but less sensitive. 187 Each subsequent tranche introduces additional true positive calls along with a growing number of false positive calls. 187 Table 2.5 shows that the 99.0 tranche used in this study has 85,941 known calls and 3,097 novel calls. Of the 49,447 accessible truth sites, 48,952 were called in tranche 99.0. Following recalibration, the resulting file marks each variant as either pass or low quality.
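The caller overlap reported in Section 2.3.3 amounts to a set comparison on exact allele matches. The sketch below illustrates that idea only; it is not the BCFtools implementation. The function name `compare_callsets` is hypothetical, the chr7 and chr17 records reuse positions listed in Table 2.6, and the chr1 and chr2 records are invented for illustration.

```python
from typing import Iterable, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


def compare_callsets(first: Iterable[Variant], second: Iterable[Variant]):
    """Return (first only, shared, second only) using exact allele matching."""
    a, b = set(first), set(second)
    return a - b, a & b, b - a


tvc_calls = [
    ("chr7", 6026775, "T", "C"),
    ("chr17", 7579472, "G", "C"),
    ("chr1", 100, "A", "G"),  # hypothetical: same site as GATK, different alt
]
gatk_calls = [
    ("chr7", 6026775, "T", "C"),
    ("chr17", 7579472, "G", "C"),
    ("chr1", 100, "A", "T"),  # hypothetical
    ("chr2", 200, "C", "T"),  # hypothetical
]

tvc_only, shared, gatk_only = compare_callsets(tvc_calls, gatk_calls)
print(len(tvc_only), len(shared), len(gatk_only))  # 1 2 2

# The same bookkeeping applied to the totals reported in Section 2.3.3.
tvc_total, gatk_total, intersect = 109_503, 238_530, 94_263
print(gatk_total - intersect, tvc_total - intersect)  # 144267 15240
```

Because matching is on the full (chromosome, position, ref, alt) tuple, the two chr1 records count as caller-specific even though they share a position.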

Figure 2.6: Genome Analysis Toolkit VariantRecalibrator tranche plot
X-axis: the number of novel variants called. Y-axis: the novel transition to transversion ratio and the overall truth sensitivity. TP (true positive): exact match of non-reference genotype. FP (false positive): additional alternate allele in WES genotype.

Table 2.5: Genome Analysis Toolkit VariantRecalibrator tranche results

Tranche    minVQSLOD    Known (count at Ti/Tv)    Novel (count at Ti/Tv)    Truth sites          Called
90.0       1.14         75,136 at 2.77            2,254 at 1.89             49,447 accessible    44,502
99.0       1.01         85,941 at 2.71            3,097 at 1.50             49,447 accessible    48,952
99.90      6.32         88,789 at 2.69            4,200 at 1.20             49,447 accessible    49,397
100.00     192.99       88,975 at 2.69            4,528 at 1.14             49,447 accessible    49,447
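The relationship between a tranche and its VQSLOD cut-off can be sketched as follows. This is an illustration of the idea rather than GATK code: a tranche threshold is taken as the lowest score that still retains the requested fraction of truth-set variants, and every variant is then labelled against that threshold. The function names, variant identifiers and scores are hypothetical, and the PASS and LowQual labels follow the description in Section 2.2.6.

```python
import math
from typing import Dict, List


def tranche_threshold(truth_scores: List[float], sensitivity: float) -> float:
    """Lowest VQSLOD cut-off that still retains `sensitivity` of truth variants."""
    ranked = sorted(truth_scores, reverse=True)
    n_keep = max(1, math.ceil(sensitivity * len(ranked)))
    return ranked[n_keep - 1]


def apply_filter(scores: Dict[str, float], threshold: float) -> Dict[str, str]:
    """Label each variant PASS or LowQual against the chosen threshold."""
    return {vid: ("PASS" if s >= threshold else "LowQual") for vid, s in scores.items()}


# Hypothetical VQSLOD scores of ten truth-set variants.
truth = [5.0, 4.2, 3.1, 2.5, 1.9, 1.2, 0.4, -0.3, -2.0, -7.5]
cutoff = tranche_threshold(truth, 0.9)  # keep 9 of the 10 truth variants
print(cutoff)  # -2.0

calls = {"var_a": 3.3, "var_b": -1.0, "var_c": -4.8}  # hypothetical call set
print(apply_filter(calls, cutoff))
# {'var_a': 'PASS', 'var_b': 'PASS', 'var_c': 'LowQual'}
```

Raising the target sensitivity lowers the cut-off, which admits more true variants and more false positives, matching the trend described for successive tranches.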

Figure 2.7 shows the 2D projection of the mapping quality rank sum (MQRankSum) test versus haplotype score, obtained by marginalising over the other annotation dimensions in the model. The mapping quality rank sum test is the U-based z-approximation from the Mann-Whitney Rank Sum Test 188 for mapping qualities, that is, it compares reads with reference bases against those with the alternate allele. 187 This measure can be used to evaluate the likelihood of SNPs being real.

Figure 2.7: Genome Analysis Toolkit VariantRecalibrator projection for mapping quality rank sum (MQRankSum) versus haplotype score
The upper left panel shows the probability density function that was fitted to the data. Green: high quality. Red: lowest quality. The remaining three panels give scatter plots in which each single nucleotide polymorphism (SNP) is plotted in the two annotation dimensions (MQRankSum and HaplotypeScore) in a point cloud. In the upper right panel, SNPs are coloured black and red to show which SNPs are retained and filtered, respectively, by applying the variant quality score recalibration procedure. The lower left panel colours SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. Green SNPs: found in the training sets. Purple: given the lowest probability of being true. The lower right panel colours each SNP by its known/novel status. Blue: known SNPs. Red: novel SNPs.
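The U-based z-approximation behind the mapping quality rank sum annotation can be written out in a few lines. The sketch below is a textbook version of the Mann-Whitney statistic with the normal approximation and without a tie correction in the variance; it is not GATK's implementation. The function name `rank_sum_z` and the mapping qualities are hypothetical, and the sign convention (negative when alternate-allele reads have lower mapping quality) is an assumption of this sketch.

```python
import math
from typing import Sequence


def rank_sum_z(alt: Sequence[float], ref: Sequence[float]) -> float:
    """U-based z-approximation of the Mann-Whitney rank sum test."""
    n_alt, n_ref = len(alt), len(ref)
    # U counts, over all alt/ref pairs, how often the alt value is larger
    # (ties contribute 0.5).
    u = sum(1.0 if a > r else 0.5 if a == r else 0.0 for a in alt for r in ref)
    mean_u = n_alt * n_ref / 2
    sd_u = math.sqrt(n_alt * n_ref * (n_alt + n_ref + 1) / 12)
    return (u - mean_u) / sd_u


# Alternate-allele reads map worse than reference reads: strongly negative z.
print(round(rank_sum_z([20, 25, 30], [60, 60, 60, 60]), 3))  # -2.121

# Identical mapping qualities in both groups: z is zero.
print(rank_sum_z([60, 60], [60, 60]))  # 0.0
```

A value far from zero indicates that the reads supporting one allele map systematically worse than those supporting the other, which is one reason such a site may be a mapping artefact rather than a real SNP.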

2.3.5 Genotype concordance

A total of 212 positions across the three previously genotyped individuals were used to compare genotype calls from WES and the Agilent HaloPlex custom panel (Figure 2.8). Of those 212 positions, 77 were not called in the WES data due to low coverage or the position of the primers. Of the remaining 135 positions, 123 calls (91%) were concordant between the two data types and 12 calls (9%) were discordant. Two of the 12 discordant calls were in Patient 3-III-1; they were called as 1/1 in the Agilent HaloPlex custom panel data and as 0/1 in the WES data. The remaining ten discordant calls were all in Patient 2-II-1; they were called as 0/0 in the Agilent HaloPlex custom panel data and as either 0/1 (6 calls) or 1/1 (4 calls) in the WES data. Both concordant and discordant calls were kept in the intersect file. The genotyped positions were all located in easy-to-map regions of the genome and may not reflect the true false positive and false negative rates for all positions.
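The concordance tally above can be reproduced with a short sketch. The function name `compare_genotypes` and the dictionary layout are hypothetical; the chr4 position is taken from the discordant calls reported for Patient 2-II-1, and the other two positions are invented for illustration.

```python
from typing import Dict, Optional


def compare_genotypes(panel: Dict[str, str],
                      wes: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Count positions that are not called, concordant or discordant in WES."""
    counts = {"not_called": 0, "concordant": 0, "discordant": 0}
    for position, panel_gt in panel.items():
        wes_gt = wes.get(position)
        if wes_gt is None:
            counts["not_called"] += 1
        elif wes_gt == panel_gt:
            counts["concordant"] += 1
        else:
            counts["discordant"] += 1
    return counts


panel_calls = {"chr4:84383810": "0/0", "chr1:1000": "0/1", "chr2:2000": "1/1"}
wes_calls = {"chr4:84383810": "0/1", "chr1:1000": "0/1"}  # chr2:2000 not covered
print(compare_genotypes(panel_calls, wes_calls))
# {'not_called': 1, 'concordant': 1, 'discordant': 1}

# Rates from the reported totals: 212 positions, 77 not called, 12 discordant.
compared = 212 - 77
concordant = compared - 12
print(compared, concordant, round(100 * concordant / compared, 1))  # 135 123 91.1
```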

[Diagram summarising the 212 positions compared]
Not called in whole exome sequencing data: 77
Concordant: 123
Discordant: 12
  Called as variant by Agilent HaloPlex: 2
  Called as variant by Ion Proton: 10
TOTAL: 212

Figure 2.8: Concordance of genotype calls between the Agilent HaloPlex custom panel and whole exome sequencing on Ion Proton for three patients
Blue: called homozygous alternate (1/1) by Agilent HaloPlex custom panel but called heterozygous (0/1) by Ion Proton whole exome sequencing. Green: called variant (0/1 or 1/1) by Ion Proton whole exome sequencing but called homozygous reference (0/0) by Agilent HaloPlex custom panel.

Table 2.6 shows the ten positions in Patient 2-II-1 at which the genotype calls are discordant between the Agilent HaloPlex custom panel and the WES genotype, that is, where the variant is called 0/1 or 1/1 in the intersect file but no variant is called by the Agilent HaloPlex custom panel. The genotype calls for Patient 2-II-1 were checked in the TVC *.vcf file, the GATK *.vcf file and the intersect file. The genotype calls for the ten positions were the same across the three files. The genotype results from WES for both parents of Patient 2-II-1 (Patient 2-I-1 and Patient 2-I-2) are included in the last two columns of Table 2.6. Given the genotypes of both parents, these results indicate that the WES genotype calls for Patient 2-II-1 at these positions are likely correct.

Table 2.7 shows the two discordant variants for Patient 3-III-1, which were both called as homozygous alternate using the Agilent HaloPlex custom panel but called as heterozygous in the intersect file. The genotype calls for these two positions were checked in the TVC *.vcf file, the GATK *.vcf file and the intersect file. TVC called the first variant (chromosome 7) as 1/1, whereas GATK called the variant 0/1. Therefore, the position is called as 0/1 in the intersect file. TVC and GATK both called the second variant (chromosome 13) as 1/1; however, in the intersect file the variant is called 0/1. For both variants, the parents of Patient 3-III-1 (last two columns) have a homozygous alternate genotype call. On visual inspection of the *.bam files in IGV, Patient 3-III-1 also appears to be homozygous for the alternate allele at these positions. Therefore, it appears that the errors for these variant calls occurred when intersecting the *.vcf files.

Table 2.6: Discordant genotype calls between the Agilent HaloPlex custom panel and whole exome sequencing for Patient 2-II-1

Chr  Position    Ref  Alt  Agilent HaloPlex GT  Intersect file GT  2-I-1 (Father)   2-I-2 (Mother)
4    84383810    C    T    0/0                  0/1                0/0              1/1
7    6026775     T    C    0/0                  1/1                1/1              0/1
9    86617265    A    G    0/0                  0/1                0/1 (low reads)  1/1
11   108183167   A    G    0/0                  1/1                1/1              1/1
11   125525195   A    G    0/0                  0/1                0/1 (low reads)  1/1 (low reads)
14   75513883    T    C    0/0                  1/1                0/1 (low reads)  1/1
17   7579472     G    C    0/0                  1/1                0/1              1/1
17   59763347    A    G    0/0                  0/1                0/1              0/1
17   63554591    G    A    0/0                  0/1                0/1              0/0
18   60027241    C    T    0/0                  0/1                0/1              0/1

Chr: chromosome. Ref: reference allele. Alt: alternate allele. GT: genotype. Low reads: fewer than 10 reads at this position.

Table 2.7: Discordant genotype calls between the Agilent HaloPlex custom panel and whole exome sequencing for Patient 3-III-1

Chr  Position    Ref  Alt  Agilent HaloPlex GT  Intersect file GT  3-II-1 (Father)  3-II-2 (Mother)
7    6026775     T    C    1/1                  0/1                1/1              1/1
13   103527930   G    C    1/1                  0/1                1/1              1/1

Chr: chromosome. Ref: reference allele. Alt: alternate allele. GT: genotype.
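The reasoning that parental genotypes support or contradict a call in the child can be made explicit with a simple Mendelian consistency check. The following is a minimal sketch in Python, not part of the original analysis; the function names are hypothetical, and the check ignores de novo mutation and genotyping error in the parents (several parental calls in Table 2.6 are based on low read counts).

```python
from itertools import product


def alleles(gt: str) -> list[str]:
    """Split a diploid genotype string such as '0/1' into its two alleles."""
    return gt.replace("|", "/").split("/")


def mendelian_consistent(child: str, father: str, mother: str) -> bool:
    """True if the child genotype can be built from one paternal and one maternal allele."""
    target = sorted(alleles(child))
    return any(sorted(pair) == target
               for pair in product(alleles(father), alleles(mother)))


# (panel GT, intersect GT, father GT, mother GT), copied from Tables 2.6 and 2.7
rows = {
    ("4", 84383810): ("0/0", "0/1", "0/0", "1/1"),
    ("11", 108183167): ("0/0", "1/1", "1/1", "1/1"),
    ("17", 59763347): ("0/0", "0/1", "0/1", "0/1"),
    ("13", 103527930): ("1/1", "0/1", "1/1", "1/1"),  # Patient 3-III-1, Table 2.7
}

for (chrom, pos), (panel, wes, father, mother) in rows.items():
    print(f"chr{chrom}:{pos}",
          "panel consistent:", mendelian_consistent(panel, father, mother),
          "intersect consistent:", mendelian_consistent(wes, father, mother))
```

Under this check, parental genotypes are informative only where they exclude one of the two competing calls. At chr17:59763347, for example, both parents are 0/1, so neither the panel call (0/0) nor the intersect call (0/1) can be excluded.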

2.4 Discussion

2.4.1 Evaluation of families used in this study

It has long been recognised that cancer has a familial component. Genetic studies were traditionally performed on sets of related individuals, including Mendel's study of inheritance patterns from parents to offspring in pea plants, which proposed the underlying mechanisms of inheritance. 189 Pedigree studies have been used successfully to identify genes influencing a broad range of monogenic, highly penetrant traits. 161

There are several reasons why family studies are used for gene discovery. Firstly, pedigrees are likely to represent a more homogeneous and limited set of causal genes, which enhances the statistical power for gene discovery. 190 Secondly, clinical characteristics that are shared among family members also reduce heterogeneity for analysis. 190 Thirdly, the analysis of phenotypes among family members is controlled to some extent for both genetic background and environmental exposures. 190 Therefore, the background genetic variation is also controlled to some extent. Finally, family data allow a deeper level of genotyping quality control than is possible in studies of unrelated individuals. 190 There are also disadvantages of using families in genetic research: it can be more costly to recruit entire pedigrees than unrelated individuals. 190 However, the analysis of disease/trait segregation in pedigrees with known genetic markers has proven to be a robust approach to gene discovery.

The study of familial cancer predisposition syndromes characterised by sarcoma probands has resulted in valuable insight into cancer biology and genetic risk. For example, the study of Li-Fraumeni syndrome defined the roles of the tumour suppressor gene TP53 in the development of cancer. Since germline mutations in the TP53 gene were first identified in Li-Fraumeni syndrome families, the gene has also been implicated in the sporadic form of most cancers. 51 It is now known that the TP53 gene has a role in the regulation of the cell cycle, DNA repair, apoptosis, cellular metabolism, and senescence. 191 These findings have had a significant impact on the clinical management of familial cancer predisposition syndromes and cancers in general. 192

The ascertainment of cancer cluster families by a sarcoma proband has also been used to study the incidence and distribution of cancers in relatives of sarcoma probands in families not defined by known syndromes. 193–197 These studies found an increased cancer risk in relatives of sarcoma probands, and suggest the presence of shared underlying genetic risk variants independent of known cancer predisposition syndromes. 195, 196 The families selected for investigation in the current study were in this category, i.e. they were not defined by a known cancer predisposition syndrome, and therefore represent an opportunity to identify novel risk variants associated with both sarcoma and cancer risk.

The ISKS families selected for WES in the current study include sarcoma, prostate cancer and melanoma cases. The occurrence of these cancers in families has been previously reported in familial cancer syndromes such as Li-Fraumeni syndrome 51, 52, 198, 199 and familial atypical multiple mole melanoma (FAMMM) syndrome (characterised by mutations in the CDKN2A gene), 200–203 as well as in other non-FAMMM syndrome families that were also found to have mutations in the CDKN2A gene. 202, 204 However, the three families selected do not have mutations in the CDKN2A gene and therefore represent an opportunity to identify novel genetic variants that may lead to the development of these cancers within a family.

The number and size of pedigrees vary widely in genetic studies of familial cancer. The number of relatives can range from two family members to extended pedigrees with > 30 individuals. 205, 206 The families used in this study are similar in size to the families studied by Roach et al. (2010) to discover the causative gene for Miller syndrome and by Shi et al. (2014) to identify rare POT1 variants in familial cutaneous malignant melanoma. 207, 208

2.4.2 The use of whole exome sequencing to identify disease causing variants

WES has been a powerful approach for identifying genes that underlie Mendelian disorders and complex traits. 141, 144, 209, 210 To date, most genes discovered that underlie rare Mendelian disorders have genetic variation in protein coding sequences that is predicted to have functional consequences and be deleterious. 166, 211

WES has also been a powerful and efficient approach for the discovery of genetic mutations in various cancers, identifying more than 50 novel tumour-predisposing genes (Appendix B). The identification of clinically actionable driver mutations through WES has enabled the development of precision oncology therapies. 212–215 Many of the genes that have been implicated in hereditary sarcomas play a significant role in the cellular response to DNA damage, which has led to the development of DNA repair targeted therapies. 216, 217 WES has the advantage of increased coverage of regions of interest (exons) at lower cost and higher throughput compared with current whole genome sequencing (WGS). 148 WES was therefore chosen for this study as an appropriate, affordable and robust in-house method. 139, 210

2.4.3 Limitations of whole exome sequencing

A weakness of WES is that it largely ignores variants residing in non-coding and intergenic regions that can affect gene expression. 218 Non-coding DNA plays an important role in gene regulation and 3D chromatin folding. 219 However, the effects of non-coding variants on gene expression are not yet completely understood. 220 The effects of regulatory variation may be more subtle and may be more important in common complex diseases such as cancer than in Mendelian diseases. 221 The relevance of regulatory variation to cancer susceptibility in humans is unclear, but it is possible that polymorphisms in non-coding regions might have an important role. 221, 222 As the costs of WGS decrease and analytical tools such as the Encyclopedia of DNA Elements (ENCODE) 223 become more adept at interpreting the effects of non-coding variants, WGS will become more widespread. The use of WGS studies to investigate genetic variants in cancer cluster families may lead to the discovery of mutations in regulatory elements that add to the pool of disease-associated variants. 224

Structural variations (defined as DNA sequence alterations other than SNVs, including insertions, deletions, duplications, inversions and translocations) 225–227 were not examined using WES in this chapter. There are many challenges in somatic structural variation detection inherent in the limitations of NGS technologies, the complexities of tumour samples and the difficulties in structural variant reconstruction. 227 As WGS technologies improve, the use of paired-end reads, deeper coverage and longer sequence reads will facilitate the examination of somatic structural variants in cancer.

2.4.4 The Ion Proton sequencing platform

The Ion Proton generally shows similar performance to other high-throughput sequencing platforms. 228, 229 The Ion Proton is also known to produce high quality data at a comparable average depth and read length, with a faster turnaround time, compared to the Illumina HiSeq. 172, 229, 230 The average percentage of reads on target produced in this study was 95.08%. The percentage of reads on target is the ratio of the number of reads within a target region to the total number of reads output by the sequencer, expressed as a percentage. Off-target regions are those located 5′ and 3′ of target regions (upstream, downstream, untranslated and intronic regions). The percentage of on-target reads is dependent on the platform used, as each platform uses different target choices, bait lengths, bait density and molecules for capture. 185

2.4.5 Base calling software

The TVC software was developed specifically to call variants from Ion Proton sequencing data. However, it cannot produce multi-sample variant call files. An advantage of multi-sample calling is that, in cohort analysis, a non-variant genotype can be resolved as either a homozygous reference genotype or a missing genotype. 149, 231 Multi-sample variant calling reduces the probability of calling random sequencing errors and increases the likelihood of calling alleles of low frequency or low coverage in a single sample. 149 Therefore, the sensitivity and accuracy of base calling are improved. 149 When the samples were called individually using TVC, many positions had to be filled in as homozygous reference, and it was impossible to distinguish missing from homozygous reference positions.
GATK UnifiedGenotyper can perform multi-sample calling and can, therefore, distinguish between missing and homozygous reference positions. However, GATK is not suited to Ion Proton data, as the Ion Proton platform produces markedly different data to the Illumina platform. 232 GATK UnifiedGenotyper returned more than twice as many variants (238,530) as TVC (109,503). Anecdotally, GATK produces a higher number of false positives (up to 10 times as many, according to reports on online bioinformatics forums), which may account for the difference in the number of variants called.

An intersect file of the calls made by TVC and GATK UnifiedGenotyper was created to reduce the number of false positives in the final call set, and to overcome both the problem of single-sample calling by TVC and the platform differences affecting GATK. Previous studies have recommended using multiple callers to generate a final call set. 233, 234 A simple way to combine call sets is to take the intersection or union of calls as the final calls. 234 However, intersection was a very stringent approach that reduced the number of variants from 109,503 called by TVC and 238,530 called by GATK UnifiedGenotyper to just 94,263 in the intersect file. Some true variants may therefore have been excluded as a result of using the intersect file. Nevertheless, this may be the best approach for reducing the number of false positive calls.

2.4.6 Concordance

In this study, the concordance rate of genotype calls for 135 positions from WES and the Agilent HaloPlex custom panel was 91%. This rate falls within the range reported in previous literature on the concordance of panel and sequencing data. A previous study by Motoike et al. (2014) aimed to validate SNV calls by exome analysis. They sequenced 12 independent genomes from Japanese patients using the Ion Proton semiconductor sequencer for whole exome sequencing (average depth 109×). 235 Reads were aligned to hg19 using TMAP and genotype calling was performed on each sample using TVC. 235 Single nucleotide polymorphism (SNP) calls based on the Illumina Human Omni (version 2.5-8) SNP chip data were used as the reference. They analysed a total of 79,143 SNPs on the autosomes and found the concordance rate between the Omni 2.5-8 and Ion Proton calls to be 81.8–96.0%. 235 These figures are comparable to results reported in a previous study. 229
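The 91% concordance rate quoted in Section 2.4.6 follows directly from the counts shown in Figure 2.8. A minimal sketch of the arithmetic in Python is given below; the function name is hypothetical, and positions not called in the WES data are assumed to be excluded from the denominator, which is what the figure of 135 positions implies.

```python
def concordance_rate(concordant: int, discordant: int) -> float:
    """Fraction of positions called on both platforms that have matching genotypes."""
    compared = concordant + discordant
    if compared == 0:
        raise ValueError("no positions were called on both platforms")
    return concordant / compared


# Counts taken from Figure 2.8
concordant = 123
discordant = 2 + 10        # blue (HaloPlex 1/1 vs WES 0/1) + green (WES variant vs HaloPlex 0/0)
not_called_in_wes = 77     # excluded from the comparison

assert concordant + discordant + not_called_in_wes == 212  # figure total

rate = concordance_rate(concordant, discordant)
print(f"{rate:.1%} of {concordant + discordant} positions concordant")
```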

The intersect file described in this chapter was used in Aim 2 of this study to identify candidate risk variants. None of the discordant calls were removed from the intersect file. However, because of the findings of the concordance analysis, particularly the incorrect call that was present in the intersect file but in neither of the original *.vcf files from TVC or GATK, each variant detected in the analysis of these data was visually verified in the *.bam file using IGV.
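The kind of intersect error described above can be screened for automatically before visual verification. The following Python sketch is illustrative only and is not the procedure used to build the intersect file in this study: call sets are represented as plain dictionaries keyed by (chromosome, position, ref, alt), the function names are hypothetical, and the chromosome 1 site is an invented example of a GATK-only call.

```python
Key = tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def intersect_sites(tvc: dict[Key, str], gatk: dict[Key, str]) -> set[Key]:
    """Sites called as variant by both callers."""
    return tvc.keys() & gatk.keys()


def unsupported_genotypes(intersect: dict[Key, str],
                          tvc: dict[Key, str],
                          gatk: dict[Key, str]) -> list[Key]:
    """Sites whose intersect-file genotype matches neither original caller."""
    return sorted(site for site, gt in intersect.items()
                  if gt != tvc.get(site) and gt != gatk.get(site))


# Genotypes for Patient 3-III-1 as described for Table 2.7
tvc = {("7", 6026775, "T", "C"): "1/1", ("13", 103527930, "G", "C"): "1/1"}
gatk = {("7", 6026775, "T", "C"): "0/1", ("13", 103527930, "G", "C"): "1/1",
        ("1", 1000, "A", "G"): "0/1"}  # hypothetical GATK-only call
intersect = {("7", 6026775, "T", "C"): "0/1", ("13", 103527930, "G", "C"): "0/1"}

print(sorted(intersect_sites(tvc, gatk)))
print(unsupported_genotypes(intersect, tvc, gatk))
```

In this example only the chromosome 13 site is flagged, because its 0/1 genotype in the intersect file is supported by neither caller. The chromosome 7 site is not flagged, since GATK also called it 0/1, so a check of this kind complements rather than replaces inspection in IGV.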


Chapter 3

Aim 2: Identification of candidate germline risk variants in three cancer cluster families

3.1 Introduction

Whole exome sequencing (WES) generates data on a large number of variants, most of which are not relevant to the disease of interest as they do not have a functional effect at the protein or systemic level. 236 The second aim of this study was to use the WES data described in Chapter 2 to identify candidate germline risk variants that segregate with cancer or sarcoma in three cancer cluster families.

The analysis of WES data requires comprehensive computational approaches and strategies to identify candidate risk variants or genes for a disease of interest. 237–239 Despite advances in sequencing platform technology, reference data sets, software, and analysis pipelines, there is no gold standard for the filtering and prioritisation of variants. However, many guidelines, tools, and online resources have been developed to assist in the identification of functional variants from WES.

3.2 Bioinformatic strategies for variant filtering and prioritisation in whole exome sequencing

3.2.1 Annotation

As the sequencing of cancer genomes can reveal thousands of mutations, an essential step in the interpretation of WES data is the annotation of variants and their potential effects on genes and transcripts. 240 Variant annotation is the process of assigning functional information to DNA variants. At a basic level, annotations can be used to identify genes, transcripts and genomic regions; at a higher level, they can also predict the impact of the variant on the protein product.

There are over 80 bioinformatic tools available for genomic annotation, many of which are available as web-based applications. 241 Most tools focus on the annotation of single nucleotide variants (SNVs), as these are easily identified and analysed. 242 However, an increasing number of tools are being developed to annotate copy number alterations and other structural variations, including insertions, deletions, inversions and translocations. 241, 243–250 The most common form of annotation is the provision of links to public databases such as the National Center for Biotechnology Information (NCBI) Short Genetic Variations Database (dbSNP) or the 1000 Genomes Project. 251, 252 The functional prediction of variants can result from a simple sequence-based analysis, region-based analysis, or evaluation of the structural impact on proteins. 242 The choice of annotation tool is largely dependent on the desired selection of variant annotations.

A widely used annotation tool for identifying the functional consequence of sequence variation is Annotate Variation (ANNOVAR). 245 ANNOVAR predicts the functional effects of variants on genes, performs genomic region-based annotation and compares variants to existing databases. 245 ANNOVAR incorporates scores based on evolutionary conservation and in silico prediction of functional consequences.

3.2.1.1 Annotation of non-coding regions

A significant portion of the reads obtained in WES come from outside the designed target region. 253 In a typical WES study, approximately 40–60% of the reads are off target, and all or most of these off-target reads are usually ignored. 254–256 Three main types of off-target reads are found in WES data: reads from introns and intergenic regions, reads from the mitochondrial genome and reads from viral genomes. 218 Although WES is not designed to identify regulatory variants in intronic and intergenic regions, off-target reads should not be discarded, as many changes outside the coding regions may be responsible for disease phenotypes. 253

Annotation also plays an essential role in the interpretation of off-target variants. The Regulome Database (RegulomeDB) can be used to guide the interpretation of regulatory variants in the human genome and to identify potential regulatory changes based on experimental data sets from the Encyclopedia of DNA Elements (ENCODE) and other sources. 257 RegulomeDB also includes computational predictions and manual annotations to identify putative regulatory potential and functional variants. 257 RegulomeDB uses a heuristic scoring system based on the functional consequence of the variant. 257

3.2.2 Variant class filtering

Variant filtering can be carried out using annotations for the genomic location and the variant class. Annotations from ANNOVAR can be used to identify intronic variants, exonic variants, intergenic variants, 5′- and 3′-untranslated region (UTR) variants, splicing site variants, and upstream or downstream variants. 245 For exonic variants, ANNOVAR scans annotated messenger ribonucleic acid (mRNA) sequences to identify and report amino acid changes, as well as stop-gain or stop-loss mutations. 245 Exonic missense, nonsense, stop-loss, frameshift and splice site variants all have the potential to affect protein function and are retained during this filtering process. 211, 239 RegulomeDB scores can also be used to filter for variants that are more likely to lie in a functional location.

3.2.3 Population frequency filtering

Population frequency is one of the primary criteria for predicting whether a variant is likely to have a functional effect on the encoded protein. 258 A rare nonsense variant might be expected to have a larger functional impact than a frequently occurring one. 211, 259 The Exome Aggregation Consortium (ExAC) database is the largest catalogue of protein-coding genetic variation to date and is intended to be used as a general population resource for filtering variants by, for example, minor allele frequency (MAF). 260, 261 The ExAC database aggregates high-quality exome DNA sequence data for 60,706 individuals of diverse ancestries. 261 The ExAC database is recommended because its allele frequencies are calculated from considerably more samples than those of the Exome Variant Server and the 1000 Genomes Project. 252, 260 In disease studies, a commonly used starting point for filtering is to remove variants with a MAF > 1%. 239

3.2.4 Evolutionary conservation

Genomic Evolutionary Rate Profiling (GERP) uses a comparative genomics approach to identify putatively functional sequences, comparing similarity across divergent species to find sequences that have been maintained during evolution. 262 Pathogenic mutations tend to occur at markedly more conserved positions than benign variants. 263, 264 GERP uses maximum likelihood evolutionary rate estimation for position-specific scoring. 262 GERP scores range from a minimum of -12.36 to a maximum of 6.18. Positive scores represent a substitution deficit (expected for sites under selective constraint), while negative scores represent a substitution surplus.

3.2.5 Functional impact prediction

In silico predictions of the consequences of a variant for protein function, together with estimates of evolutionary conservation, are often used for prioritisation in genetic discovery studies. Non-synonymous variants that lead to an amino acid change in the protein product are of particular interest, as amino acid substitutions account for approximately half of the known genetic variants responsible for human inherited disease. 265

Sorting Intolerant From Tolerant (SIFT) and Polymorphism Phenotyping-2 (PolyPhen-2) are commonly used tools that predict whether an amino acid substitution will have an effect on protein function. 266, 267 SIFT uses sequence homology to predict whether an amino acid substitution will affect protein function and potentially alter phenotype. 266 A SIFT score ≤ 0.05 is predicted to be damaging, and a score > 0.05 is predicted to be tolerated. PolyPhen-2 predicts the possible impact of amino acid substitutions on the stability and function of human proteins using structural and comparative evolutionary considerations. 267 A PolyPhen-2 score between 0.0 and 0.15 is predicted to be benign, a score between 0.15 and 1.0 is predicted to be possibly damaging, and a score between 0.85 and 1.0 is more confidently predicted to be damaging. 267

An alternative strategy for filtering variants is based on a priori knowledge of the functional involvement of variants or genes. For example, association studies with candidate genes have been used to identify a number of risk genes for complex diseases. 268 A candidate gene study takes advantage of, and is limited by, knowledge of the phenotype, tissues, genes and proteins that are likely to be involved or have been previously implicated in the disease. 268, 269 Assessing candidate genes possessing functional variants in the context of existing biomedical knowledge and known biomolecular functions can be used to produce a manageable set of variants for further validation or exploration. 239 Several next generation sequencing (NGS) studies have identified rare variants associated with disease using a candidate gene approach. 270–274

In addition to variant filtering based on annotation and functional impact predictions, strong genetic support is also necessary for assigning possible causality to variants identified using WES. 239 Evidence of genetic association or familial segregation should be supplemented by functional and bioinformatics support.
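The SIFT and PolyPhen-2 score thresholds given in Section 3.2.5 can be written as two small classification functions. This is a minimal sketch in Python rather than the code used by either tool; the function names are hypothetical, the label "probably damaging" is borrowed from Section 3.3.3 for the 0.85 to 1.0 band, and the treatment of scores lying exactly on a PolyPhen-2 boundary (0.15 or 0.85) is an assumption, since the text gives overlapping ranges.

```python
def sift_prediction(score: float) -> str:
    """SIFT: scores of 0.05 or below are predicted damaging, higher scores tolerated."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores lie between 0 and 1")
    return "damaging" if score <= 0.05 else "tolerated"


def polyphen2_prediction(score: float) -> str:
    """PolyPhen-2: benign up to 0.15, possibly damaging above it, probably damaging above 0.85."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("PolyPhen-2 scores lie between 0 and 1")
    if score > 0.85:      # boundary handling is an assumption
        return "probably damaging"
    if score > 0.15:      # boundary handling is an assumption
        return "possibly damaging"
    return "benign"


for sift, pph2 in [(0.01, 0.97), (0.05, 0.40), (0.30, 0.02)]:
    print(sift, sift_prediction(sift), "|", pph2, polyphen2_prediction(pph2))
```

Note that the two scales run in opposite directions: low SIFT scores and high PolyPhen-2 scores both indicate a predicted damaging substitution.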

3.2.6 Association analysis in families

Association analysis in families can identify genes that influence complex human traits and provide protection against population stratification. 275 Variance components models are a way to assess the amount of variation in a dependent variable that is associated with one or more random-effects variables. 276 Variance components analysis is widely used in the genetic analysis of quantitative traits in family studies. 275 This approach is favoured because it can accommodate pedigrees of any size, allows both linkage and association analysis, and tends to be more robust than competing approaches. 275 Sequential Oligogenic Linkage Analysis Routines (SOLAR) is a software package that performs variance components analysis in pedigrees. 277 Almasy and Blangero (1998) 278 extended the strategy developed by Amos (1994) 279 for pedigree-based variance components analysis to estimate the genetic variance attributable to the region around a specific genetic marker using SOLAR. Maximum likelihood methods that take into account relationships among family members can be used to determine association in a polygenic model in SOLAR.

3.2.7 Familial segregation

Segregation analysis is a general method for evaluating the transmission of a disease or trait within pedigrees. Segregation analysis can be used to prioritise and filter variants by assessing the co-segregation of candidate variants with disease status. 276 This analysis identifies variants that segregate with the disease of interest and are absent in unaffected family members. Segregation analysis can be applied to any pedigree structure and works with both qualitative and quantitative traits. 280

3.2.8 Outline of chapter

This chapter describes the annotation, filtering, prioritisation and segregation analysis of WES data to identify putative germline risk variants that are associated with cancer or sarcoma in three cancer cluster families. WES data from Chapter 2 were annotated using ANNOVAR and RegulomeDB. Putative structural and regulatory variants were filtered using genomic location and variant class or RegulomeDB score. Three different strategies were used to further prioritise rare private variants, known rare variants and candidate gene variants. Prioritised variants were tested for association with sarcoma and cancer using SOLAR. Significant variants were assessed for familial segregation with disease.

3.3 Methods

3.3.1 Ascertainment bias correction

The families selected for this study were ascertained from the International Sarcoma Kindred Study (ISKS), 175 as described previously in Chapter 2. A weighted covariate was created using probability unit (probit) regression in R 281 (bias reduction in binomial-response generalised linear models (brglm) library, version 3.1.2) 281 to account for ascertainment bias in the sample. Probit regression assigns a weight to each individual based on their case status, and this weight can be used as a covariate in modelling.

3.3.2 Intersection

The intersect file created in Chapter 2 from the variant call files from the Torrent Variant Caller (TVC, version 5.0.0) and Genome Analysis Toolkit (GATK, version 3.4.0) UnifiedGenotyper was used in these analyses. This file consists of 94,263 variants.

3.3.3 Annotation and filtration

ANNOVAR (version 2015Jun16) 245 was used to annotate the intersect file using gene-based annotation. Using the ANNOVAR annotation, variants were filtered to include only putative structural variants. Variant filtering retained loci if they: (1) were exonic, (2) were predicted to be nonsynonymous or to result in a stop gain or stop loss, (3) were predicted to be deleterious or probably damaging in SIFT and PolyPhen-2, and (4) had a GERP score < 3.
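The four retention criteria listed in Section 3.3.3 can be expressed as a single predicate over annotated variants. The sketch below is written in Python under several stated assumptions and is not the pipeline used in this study: the `Variant` record and its field values are hypothetical simplifications of ANNOVAR output, criterion (3) is read as "deleterious in SIFT and probably damaging in PolyPhen-2", and the GERP cut-off is exposed as a parameter whose default simply copies the value and direction stated in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    region: str                 # e.g. "exonic", "intronic", "UTR"
    exonic_function: str        # e.g. "nonsynonymous", "stopgain", "stoploss", "synonymous"
    sift: Optional[str]         # "deleterious", "tolerated" or None if not scored
    polyphen2: Optional[str]    # "probably damaging", "possibly damaging", "benign" or None
    gerp: Optional[float]       # GERP score, or None if unavailable


RETAINED_FUNCTIONS = {"nonsynonymous", "stopgain", "stoploss"}


def is_putative_structural(v: Variant, gerp_cutoff: float = 3.0) -> bool:
    """Apply criteria (1) to (4) as a strict conjunction; missing annotations fail."""
    return (
        v.region == "exonic"                              # (1)
        and v.exonic_function in RETAINED_FUNCTIONS       # (2)
        and v.sift == "deleterious"                       # (3), assumed reading
        and v.polyphen2 == "probably damaging"            # (3), assumed reading
        and v.gerp is not None
        and v.gerp < gerp_cutoff                          # (4), as stated in the text
    )


variants = [
    Variant("exonic", "nonsynonymous", "deleterious", "probably damaging", 2.1),
    Variant("exonic", "synonymous", None, None, 1.0),
    Variant("intronic", "", None, None, 0.5),
]
retained = [v for v in variants if is_putative_structural(v)]
print(len(retained), "of", len(variants), "variants retained")
```

Whether stop-gain and stop-loss variants, which SIFT and PolyPhen-2 do not score, were exempted from criterion (3) is not stated in the text; in this sketch they would fail the strict conjunction.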

All remaining variants that were not classified as putative structural variants were annotated using RegulomeDB. 257 Putative regulatory variants that had a RegulomeDB score of 1a, 1b, 1c, 1d, 1e, 1f, 2a, 2b or 2c were retained, as these scores represent the highest confidence that a variant lies within a functional location. Table 3.1 shows the classification of scores from RegulomeDB. Known expression quantitative trait loci (eQTL) for genes are associated with expression and are most likely to result in a functional consequence. 257 Other subcategories with high confidence for regulatory variants are transcription factor (TF) binding, TF motifs, Deoxyribonuclease (DNase) footprints and DNase peaks. 257

Table 3.1: Classification of Regulome database scores

Score  Supporting data
1a     eQTL + TF binding + matched TF motif + matched DNase Footprint + DNase peak
1b     eQTL + TF binding + any motif + DNase Footprint + DNase peak
1c     eQTL + TF binding + matched TF motif + DNase peak
1d     eQTL + TF binding + any motif + DNase peak
1e     eQTL + TF binding + matched TF motif
1f     eQTL + TF binding / DNase peak
2a     TF binding + matched TF motif + matched DNase Footprint + DNase peak
2b     TF binding + any motif + DNase Footprint + DNase peak
2c     TF binding + matched TF motif + DNase peak
3a     TF binding + any motif + DNase peak
3b     TF binding + matched TF motif
4      TF binding + DNase peak
5      TF binding or DNase peak
6      Other

eQTL: Expression Quantitative Trait Loci. TF: Transcription Factor. DNase: Deoxyribonuclease.
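The RegulomeDB retention rule described above amounts to a membership test against the nine highest-confidence scores in Table 3.1. A minimal Python sketch follows; the function name and the example variant identifiers are hypothetical, and the score is assumed to be available as a short string such as "2b".

```python
# Scores retained in this study (categories 1a to 2c of Table 3.1)
RETAINED_SCORES = {"1a", "1b", "1c", "1d", "1e", "1f", "2a", "2b", "2c"}


def is_putative_regulatory(score: str) -> bool:
    """True if a RegulomeDB score falls in the retained high-confidence set."""
    return score.strip().lower() in RETAINED_SCORES


# Hypothetical variant identifiers mapped to RegulomeDB scores
scores = {"var_A": "1f", "var_B": "2b", "var_C": "3a", "var_D": "6"}
kept = sorted(name for name, s in scores.items() if is_putative_regulatory(s))
print(kept)
```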

False positive variants that arise from misalignment, inaccuracies and biases in the reference sequence can be identified and provisionally excluded during a search for disease-causing variants. Fuentes Fajardo et al. (2012) analysed WES data from 118 individuals in 29 families to create a list of 2,157 genes that are candidates for provisional exclusion from exome analysis. 282 All filtered variants in this study were cross-referenced against the exclusion list of Fuentes Fajardo et al. (available in the paper's Supplementary material: "Table S7 gene exclusion list final") to determine whether any results found in polymorphic regions should be excluded to reduce the risk of false positives.

3.3.4 Prioritisation strategies

3.3.4.1 Prioritisation using a rare private variants strategy

The first prioritisation strategy was applied to the filtered variants from the intersect file to identify rare private variants. Rare private variants are defined as those unique to individuals or families, and those that have not been previously annotated. 283 A major driving hypothesis behind WES of complex diseases is that multiple, rare variants in protein-coding genes contribute to the disease/trait of interest. 284 The focus on rare genetic variation is supported by studies that predict that numerous functional and deleterious variants segregate in the population at frequencies too low (0.5–5%) to detect by genome wide association (GWA) studies. 128 Investigators have successfully used this approach to identify rare private variants after removing from further consideration known variants with a reference SNP identification (rs ID) that are found in the International HapMap Project (HapMap), 285 the 1000 Genomes Project, 286 or dbSNP. 251 In this study, the variants from the intersect file were filtered to remove those that had been previously annotated, thereby prioritising rare private variants. 251

3.3.4.2 Prioritisation using a known rare variants strategy

The second strategy was used to prioritise known rare variants from the filtered intersect file using a population database and MAF information. Filtering the WES data for rare variants that have been documented in a large database such as ExAC makes it more likely that variants which occur at a low frequency in the population, and which may be associated with cancer, are prioritised in these cancer cluster families. The full list of variants from the ExAC browser was downloaded (version 0.3.1, 30 August 2016). Variants from this list with a MAF ≤ 0.01 (1%) that were also in the intersect file were selected.

3.3.4.3 Prioritisation using a candidate gene strategy

The third prioritisation strategy applied to the filtered intersect file was the prioritisation of candidate genes based on a priori knowledge of cancer biology. The variants from the intersect file were filtered to prioritise those detected in 119 known cancer and sarcoma genes, including 25 kb upstream and downstream of each gene to capture any potential regulatory variants in off-target reads. Candidate genes were selected from two cancer gene panels and a search of the Online Mendelian Inheritance in Man (OMIM) database. 287 Cancer genes were chosen from the HaloPlex Cancer Research Panel, 288 and Illumina's MiSeq and TruSeq Cancer Panels. 289 Both panels are NGS target enrichment panels that were designed for known cancer hotspots. The panels contain genes found in previous research to be associated with a broad range of cancer types as well as with published drug targets. Candidate genes from a search of the OMIM database for genes known to be associated with the specific sarcoma subtypes in the three families were also included. 287 The full list of cancer genes used in the prioritisation process can be found in Appendix G. The variants present in both the intersect file and the candidate genes were selected.

3.3.5 Methods for testing association of variants with cancer phenotypes

SOLAR (version 7.6.4) 277 was employed to estimate and test the significance of association under a polygenic model for quantitative phenotypes (age at onset of cancer and age at onset of sarcoma) and disease status (cancer and sarcoma).
Covariates included were the age and sex of the participant and the age × sex interactions, along with a weighting factor assigned to each individual to correct for the ascertainment bias. Analysis of disease status as discrete binary traits was performed using a liability threshold model in SOLAR. This model employs probit regression for the mean effect component and a standard random effects variance component model for the residual additive genetic component of variance. 278, 290 As variance component models are highly influenced by kurtosis (a descriptor of the shape of a probability curve), the quantitative phenotypes were inverse normalised using the SOLAR function inorm. 291

3.3.6 Bonferroni correction

Bonferroni correction was performed on each annotated variant list to correct for multiple testing. 292 Corrections were performed for each method based on the number of variants in the prioritised list. Any variants that were significant after correcting for multiple testing, or nominally significant (p-value < 0.05), were investigated for co-segregation in the families.

3.3.7 Familial segregation analysis

Three assumptions were used to determine familial segregation. First, the variant will be rare (shared only by cases in one family). Second, every carrier of a putative disease-causing variant will have the phenotype (complete penetrance). Third, every individual with the disorder will carry the putative disease-causing variant (100% probability of observing a genotype given the phenotype). 284 Because of these assumptions, it was hypothesised that variants identified by this approach would be private mutations that co-segregate with cancer or sarcoma in each family. The genotypes of any variants found to segregate with the phenotype of interest were visually confirmed by importing the Binary Alignment/Map (*.bam) files into Integrative Genomics Viewer (IGV, version 2.3.80) 183, 184 and determining the number of reads for each allele.
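The Bonferroni threshold and the within-family segregation assumptions described in Sections 3.3.6 and 3.3.7 can be illustrated with a short Python sketch. It is not the procedure implemented in SOLAR or used in this study; the function names and the example pedigree are hypothetical, individuals with missing genotypes are assumed to be skipped, and the first assumption (rarity across families) is not checked because it needs data from more than one family.

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def carries_alt(gt: str) -> bool:
    """True if a genotype such as '0/1' contains at least one non-reference allele."""
    return any(a not in ("0", ".") for a in gt.replace("|", "/").split("/"))


def co_segregates(genotypes: dict[str, str], affected: set[str]) -> bool:
    """Complete penetrance and no phenocopies: carriers are exactly the affected members."""
    called = {p: gt for p, gt in genotypes.items() if "." not in gt}  # skip missing calls
    return bool(called) and all(carries_alt(gt) == (p in affected)
                                for p, gt in called.items())


# Threshold for the rare private variants list (4,425 variants, Section 3.4.1.1)
print(bonferroni_threshold(4425))

# Hypothetical four-person family
family = {"I-1": "0/1", "I-2": "0/0", "II-1": "0/1", "II-2": "0/0"}
print(co_segregates(family, affected={"I-1", "II-1"}))
```

Because the threshold depends on the length of each prioritised list, the same p-value can pass correction under one strategy and fail under another.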
3.3.8 Evidence further supporting candidate risk genes

The candidate germline risk variants and the genes in which they arise were further examined for association with cancer pathogenesis using several in silico resources including the Catalogue of Somatic Mutations in Cancer (COSMIC), 134 the pathway unification database (PathCards), 293 gene ontology (GO) annotations, 294 PubMeth (a database of methylation in cancer), 295 and NCBI. 296 A PubMed search was performed in April 2017 using the string ("gene name") AND (cancer OR malignancy OR tumor* OR tumour* OR sarcoma). Abstracts were screened for relevance to the current study.

3.4 Results

3.4.1 Variant prioritisation

The intersect file containing 94,263 variants was annotated with ANNOVAR and RegulomeDB, and variants in known polymorphic regions were removed. Approximately 42% of variants were exonic and 51% were intronic (Table 3.2). Less than 1% of variants were intergenic. Of the exonic variants, approximately 48% were nonsynonymous and 51% were synonymous, with 0.5% classified as stop gain and loss variants.

Table 3.2: Functional annotation of intersect file using ANNOVAR

| Function | Percentage |
|---|---|
| Exonic | 42.45 |
| Exonic: nonsynonymous | 47.61 |
| Exonic: synonymous | 50.55 |
| Exonic: stop gain/loss | 0.50 |
| Exonic: unknown | 1.35 |
| Intronic | 50.74 |
| Intergenic | 0.04 |
| Upstream/downstream | 0.68 |
| UTR | 4.96 |
| Other | 1.13 |

3.4.1.1 Prioritisation using a rare private variants strategy

The first prioritisation method was employed to identify rare, novel variants not previously reported in reference data sets. Of the 94,263 variants in the intersect file, 4,425 variants had not previously been annotated with an rs ID number. Of these, 1,858 (42%) were exonic variants and 1,184 (64%) were nonsynonymous.

3.4.1.2 Prioritisation using a known rare variants strategy

The second prioritisation method was used to identify known rare variants using the ExAC public database. There were over 10 million variants in the ExAC browser (release 0.3.1, 30 March 2016). Of those 10 million variants, 3,686,062 variants had a MAF of less than 0.01. Of the ~3.7 million rare variants, 8,840 variants were also in the intersect file. Of these, 5,184 (59%) were exonic and 2,815 (54%) were nonsynonymous.

3.4.1.3 Prioritisation using a candidate gene strategy

The third prioritisation method was based on a priori knowledge of cancer and sarcoma. The results of the WES intersect file were filtered to only those variants in known cancer and sarcoma genes (1,297 variants). Of these variants, 806 were in the known cancer genes listed in Appendix G. The remaining 491 variants were located in regions upstream and downstream (25 kb) of each known cancer gene. Appendix H contains a table of variants in the upstream and downstream regions of the cancer genes that were also prioritised using this method. Of the 1,297 variants, 487 (38%) were exonic and 211 (43%) were nonsynonymous.

3.4.1.4 Summary of annotated variants from each prioritisation strategy

A summary of the annotated variants from each prioritisation strategy is presented in Table 3.3. The first section of the table shows the number of variants prioritised by each strategy, followed by the genomic location of variants, exonic function and functional prediction. The final section of the table shows the number of variants classified as putative structural and functional variants. The results of each prioritisation strategy were tested for significant associations with cancer phenotypes using SOLAR.
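The three prioritisation strategies amount to three filters over the annotated intersect file. The sketch below illustrates that logic under stated assumptions: the `Variant` and `Gene` records, their field names and the example coordinates are hypothetical and do not reflect the file formats or tools used in the study.

```python
from dataclasses import dataclass
from typing import Optional

FLANK_BP = 25_000    # 25 kb upstream and downstream of each candidate gene
MAF_CUTOFF = 0.01    # ExAC minor allele frequency threshold (1%)


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    rs_id: Optional[str] = None        # None if no rs ID has been assigned
    exac_maf: Optional[float] = None   # None if the variant is absent from ExAC


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int


def is_rare_private(v):
    """Strategy 1: variant has not been annotated with an rs ID."""
    return v.rs_id is None


def is_known_rare(v):
    """Strategy 2: variant is listed in ExAC with a MAF below the cutoff."""
    return v.exac_maf is not None and v.exac_maf < MAF_CUTOFF


def candidate_gene_hit(v, genes):
    """Strategy 3: name of the candidate gene whose 25 kb window contains v."""
    for g in genes:
        if v.chrom == g.chrom and g.start - FLANK_BP <= v.pos <= g.end + FLANK_BP:
            return g.name
    return None


# Hypothetical gene interval and variants, for illustration only.
genes = [Gene("GENE_A", "1", 1_000_000, 1_050_000)]
novel = Variant("1", 990_000)
known = Variant("2", 500, rs_id="rs0000001", exac_maf=0.004)
print(is_rare_private(novel), is_known_rare(known), candidate_gene_hit(novel, genes))
```

Because the filters are independent, a single variant can be selected by more than one strategy.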

Table 3.3: Summary of variant annotation using Annotate Variation and Regulome Database for each prioritisation strategy

| Strategy | Rare private variants | Known rare variants | Candidate gene variants |
|---|---|---|---|
| Number of variants prioritised | 4,425 | 8,840 | 1,297 |
| Location | | | |
| Exonic | 1,858 | 5,184 | 487 |
| Intronic | 2,170 | 3,209 | 724 |
| Downstream | 8 | 6 | 5 |
| Upstream | 25 | 14 | 3 |
| 5′ untranslated region | 132 | 124 | 28 |
| 3′ untranslated region | 119 | 197 | 38 |
| Splicing | 19 | 19 | 2 |
| Non-coding RNA | 91 | 84 | 10 |
| Intergenic | 1 | 3 | 0 |
| Upstream/downstream | 1 | 0 | 0 |
| 5′/3′ untranslated region | 1 | 0 | 0 |
| Exonic function | | | |
| Nonsynonymous | 1,184 | 2,815 | 211 |
| Stop gain | 40 | 34 | 1 |
| Stop loss | 1 | 4 | 0 |
| Synonymous | 601 | 2,268 | 273 |
| Unknown | 32 | 63 | 2 |
| Functional prediction | | | |
| Deleterious in SIFT and PolyPhen-2 | 254 | 449 | 22 |
| Tolerated in SIFT and PolyPhen-2 | 545 | 1,551 | 134 |
| Unknown in SIFT and PolyPhen-2 | 3,189 | 46 | 6 |
| Regulome database score < 3 | 0 | 683 | 168 |
| Classification | | | |
| Putative structural variants | 254 | 449 | 22 |
| Putative regulatory variants | 0 | 683 | 168 |

SIFT: Sorting Intolerant From Tolerant. PolyPhen-2: Polymorphism Phenotyping-2.

3.4.2 Rare private variants

3.4.2.1 Association analysis in SOLAR

The annotated rare private variants (Table 3.3) were tested for association with cancer phenotypes using SOLAR. The results from SOLAR were corrected for multiple testing using the Bonferroni method with the number of prioritised variants (4,425). The significance level after correction was α < 1.23 × 10⁻⁵. No variants were significantly associated with a cancer phenotype after correcting for multiple testing. As the variants prioritised by this strategy were rare, novel variants, all nominally significant variants (p-value < 0.05) were visually inspected in IGV to determine whether they could be due to alignment or calling error. Any variants located near an insertion or deletion, or on the edge of a gap or read block, were removed.

Table 3.4 contains a summary of the nominally significant variants (p-value < 0.05) for the age at onset of cancer, the age at onset of sarcoma, and cancer status. The results show eight variants nominally associated with age at onset of cancer, six variants nominally associated with age at onset of sarcoma, and two variants showing nominal association with cancer status. There were no variants with a p-value < 0.05 for sarcoma status. Of the total variants, eight were associated with a single cancer phenotype, and four variants were associated with more than one cancer phenotype. Two variants were associated with age at onset of cancer and cancer status, and two variants were associated with age at onset of cancer and age at onset of sarcoma. As these were rare variants without an rs ID, MAF data from the 1000 Genomes Project database and annotation using RegulomeDB were not available. Therefore, all the variants identified by this prioritisation strategy were rare risk alleles.

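Table 3.4: Summary of SOLAR association results for rare private variants

| Phenotype | Chr:Pos | Gene | p-value | Beta | SE | Exonic function | SIFT | PolyPhen-2 | GERP | Ref | Alt | MAF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age at onset of cancer | 8:145773319 | ARHGAP39 | 0.01 | 1.36 | 0.51 | NS | D | D | 5.37 | C | T | 0.08 |
| Age at onset of cancer | 2:232790160 | NPPC | 0.01 | 1.03 | 0.42 | NS | D | D | 5.29 | C | G | 0.13 |
| Age at onset of cancer | 20:57568079 | NELFCD | 0.02 | 1.42 | 0.58 | NS | D | D | 5.84 | T | C | 0.05 |
| Age at onset of cancer | 17:11726319 | DNAH9 | 0.04 | 1.87 | 0.89 | NS | D | D | 4.05 | G | A | 0.03 |
| Age at onset of cancer | 6:25517633 | LRRC16A | 0.04 | 0.93 | 0.44 | NS | D | D | 5.94 | G | A | 0.05 |
| Age at onset of cancer | 9:95772634 | FGD3 | 0.04 | -1.87 | 0.89 | NS | D | D | 4.29 | C | A | 0.47 |
| Age at onset of cancer | 19:41128570 | LTBP4 | 0.05 | 1.00 | 0.50 | Unknown | D | D | 3.77 | C | A | 0.08 |
| Age at onset of cancer | 6:72889438 | RIMS1 | 0.05 | 1.30 | 0.65 | NS | D | D | 5.65 | G | A | 0.05 |
| Age at onset of sarcoma | 17:72878728 | FADS6 | < 0.01 | 1.59 | 0.49 | NS | D | D | 5.15 | A | G | 0.03 |
| Age at onset of sarcoma | 17:11726319 | DNAH9 | 0.01 | 1.51 | 0.57 | NS | D | D | 4.05 | G | A | 0.03 |
| Age at onset of sarcoma | 6:25517633 | LRRC16A | 0.01 | 0.72 | 0.27 | NS | D | D | 5.94 | G | A | 0.05 |
| Age at onset of sarcoma | 19:51021625 | LRRC4B | 0.03 | -0.58 | 0.27 | NS | D | D | 3.45 | C | G | 0.18 |
| Age at onset of sarcoma | 16:77227362 | MON1B | 0.03 | -0.67 | 0.31 | NS | D | D | 3.61 | G | T | 0.21 |
| Age at onset of sarcoma | 6:30624000 | DHX16 | 0.06 | 0.56 | 0.30 | NS | D | D | 4.89 | A | G | 0.18 |
| Cancer status | 8:145773319 | ARHGAP39 | 0.02 | -5.14 | 2.26 | NS | D | D | 5.37 | C | T | 0.08 |
| Cancer status | 19:41128570 | LTBP4 | 0.02 | -3.26 | 1.44 | Unknown | D | D | 3.77 | C | A | 0.08 |

Chr:Pos: Chromosome:Position. SE: Standard Error. SIFT: Sorting Intolerant from Tolerant score. PolyPhen-2: Polymorphism Phenotyping-2. GERP: Genomic Evolutionary Rate Profiling score. Ref: reference allele. Alt: alternate allele. MAF: Minor Allele Frequency in the study population. NS: nonsynonymous. D: deleterious.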

3.4.2.2 Segregation analysis results

Of the 12 variants identified, seven were seen only in one family (ARHGAP39, NELFCD, LTBP4, RIMS1, DNAH9, LRRC16A and FADS6). However, using all three criteria for familial segregation, only one conserved deleterious variant in the ARHGAP39 gene showed nominal association with age at onset of cancer and cancer status and complete familial segregation in family 3 (Figure 3.1). Each family member with cancer was heterozygous at this position, whereas unaffected family members were homozygous for the reference allele at this position. None of the other prioritised rare private variants showed complete familial segregation in any of the families according to the familial segregation criteria.

[Pedigree of family 3. Cancers labelled in the pedigree: 3-I-1 sarcoma, 3-II-1 prostate, 3-III-1 sarcoma. Key: affected male, affected female, unaffected male, unaffected female, proband.]

| Patient | Genotype at position in ARHGAP39 gene | Read depth (ref, alt) |
|---|---|---|
| Patient 3-I-1 | C/T | 18,16 |
| Patient 3-II-1 | C/T | 38,30 |
| Patient 3-II-2 | C/C | 66,0 |
| Patient 3-III-1 | C/T | 37,57 |
| Patient 3-III-2 | C/C | 13,0 |

Figure 3.1: Genotypes for the ARHGAP39 variant that shows segregation in patients with cancer in family 3
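The read depths reported in Figure 3.1 can be turned into alternate allele fractions as a rough consistency check on the genotype calls. This is an illustrative sketch only: the study confirmed genotypes visually in IGV, and the 0.2 to 0.8 window used here to label a position heterozygous is a hypothetical cutoff, not a value taken from the study.

```python
# (ref reads, alt reads) at the ARHGAP39 position, copied from Figure 3.1.
FIGURE_3_1_DEPTHS = {
    "3-I-1": (18, 16),
    "3-II-1": (38, 30),
    "3-II-2": (66, 0),
    "3-III-1": (37, 57),
    "3-III-2": (13, 0),
}

# Hypothetical window of alternate allele fractions treated as heterozygous.
HET_RANGE = (0.2, 0.8)


def alt_fraction(ref_reads, alt_reads):
    """Fraction of reads at the position that support the alternate allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / total


def call_from_depth(ref_reads, alt_reads):
    """Crude genotype label derived from the alternate allele fraction."""
    low, high = HET_RANGE
    frac = alt_fraction(ref_reads, alt_reads)
    if frac < low:
        return "hom_ref"
    if frac > high:
        return "hom_alt"
    return "het"


for patient, (ref, alt) in FIGURE_3_1_DEPTHS.items():
    print(patient, f"{alt_fraction(ref, alt):.2f}", call_from_depth(ref, alt))
```

With this cutoff, the three patients reported as C/T are labelled heterozygous and the two reported as C/C are labelled homozygous reference.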

3.4.3 Known rare variants

3.4.3.1 Association analysis in SOLAR

The annotated known rare variants (Table 3.3) were tested for association with cancer phenotypes using SOLAR. The results from SOLAR were corrected for multiple testing using the Bonferroni method with the number of variants prioritised (8,840). The significance level after correction was α < 5.66 × 10⁻⁶. No variants were significant after correcting for multiple testing.

Table 3.5 contains a summary of the nominally associated variants (p-value < 0.05) for the age at onset of cancer, the age at onset of sarcoma, and cancer status. The results include ten variants that showed nominal association with age at onset of cancer (eight putative structural and two putative regulatory variants), one putative regulatory variant that showed nominal association with age at onset of sarcoma, and 15 variants showing nominal association with cancer status (12 putative structural and three putative regulatory variants). There were no variants showing association with a p-value < 0.05 for sarcoma status. Of all the variants, 12 were associated with a single cancer phenotype, and seven variants were associated with more than one cancer phenotype. All seven of the latter were associated with both cancer status and age at onset of cancer.
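The corrected significance level quoted above follows from dividing the nominal level by the number of tests. A minimal sketch of that arithmetic is given below, assuming a nominal α of 0.05 (the nominal threshold used throughout this chapter); the function name is illustrative.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# 8,840 known rare variants were tested in this strategy.
threshold = bonferroni_threshold(8840)
print(f"{threshold:.2e}")  # 5.66e-06
```

Any variant whose p-value is not below this threshold is reported only as nominally associated.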

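Table 3.5: Summary of SOLAR association results for known rare variants

| Phenotype | Chr:Pos | Gene | p-value | Beta | SE | Exonic function | SIFT | PolyPhen-2 | RegulomeDB | GERP | Ref | Alt | MAF 1000G | MAF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age at onset of cancer | 1:40929077 | ZFP69B | 0.01 | 1.36 | 0.51 | NS | D | D | 3a | 3.33 | C | G | 0.0016 | 0.08 |
| Age at onset of cancer | 16:66503705 | BEAN1 | 0.01 | 1.36 | 0.51 | NS | D | . | 5 | 4.11 | C | A | 0.0050 | 0.08 |
| Age at onset of cancer | 4:1348920 | UVSSA | 0.01 | 1.36 | 0.51 | NS | D | D | 5 | 5.26 | G | A | 0.0040 | 0.08 |
| Age at onset of cancer | 16:4606552 | C16orf96 | 0.01 | 1.27 | 0.49 | NS | D | D | 2b | 5.22 | T | C | 0.0002 | 0.11 |
| Age at onset of cancer | 6:4087949 | C6orf201 | 0.03 | -0.76 | 0.34 | NS | D | D | 6 | 3.07 | A | T | 0.0473 | 0.16 |
| Age at onset of cancer | 10:72462080 | ADAMTS14 | 0.03 | 1.11 | 0.50 | NS | D | D | 5 | 5.93 | C | T | 0.0002 | 0.08 |
| Age at onset of cancer | 10:79590510 | DLG5 | 0.03 | 1.11 | 0.50 | NS | D | D | 5 | 5.67 | C | G | 0.0058 | 0.08 |
| Age at onset of cancer | 8:128750540 | MYC | < 0.01 | 1.64 | 0.57 | NS | T | D | 2b | 3.91 | A | G | 0.0152 | 0.05 |
| Age at onset of cancer | 1:45224937 | KIF2C | 0.01 | 1.36 | 0.51 | S | . | . | 2b | . | G | A | 0.0012 | 0.08 |
| Age at onset of cancer | 7:20721130 | ABCB5 | 0.01 | 1.27 | 0.49 | S | . | . | 2c | . | G | A | 0.0008 | 0.11 |
| Age at onset of sarcoma | 16:70595515 | SF3B3 | < 0.01 | -1.30 | 0.25 | Int | . | . | 2b | . | T | G | . | 0.16 |
| Cancer status | 16:4606552 | C16orf96 | 0.01 | -4.79 | 1.79 | NS | D | D | 2b | 5.22 | T | C | 0.0002 | 0.11 |
| Cancer status | 1:40929077 | ZFP69B | 0.02 | -5.14 | 2.26 | NS | D | D | 3a | 3.33 | C | G | 0.0016 | 0.08 |
| Cancer status | 4:1348920 | UVSSA | 0.02 | -5.14 | 2.26 | NS | D | D | 5 | 5.26 | G | A | 0.0040 | 0.08 |
| Cancer status | 10:72462080 | ADAMTS14 | 0.02 | -3.48 | 1.54 | NS | D | D | 5 | 5.93 | C | T | 0.0002 | 0.08 |
| Cancer status | 10:79590510 | DLG5 | 0.02 | -3.48 | 1.54 | NS | D | D | 5 | 5.67 | C | G | 0.0058 | 0.08 |
| Cancer status | 13:33703738 | STARD13 | 0.02 | -3.29 | 1.45 | NS | D | D | 5 | 5.82 | C | T | 0.0002 | 0.08 |
| Cancer status | 8:21986479 | HR | 0.02 | -3.29 | 1.45 | NS | D | D | 5 | 2.95 | G | A | 0.0012 | 0.08 |
| Cancer status | 8:67341481 | RRS1 | 0.02 | -3.29 | 1.45 | NS | D | D | 4 | 2.18 | C | A | . | 0.08 |
| Cancer status | 3:63264392 | SYNPR | 0.02 | -3.26 | 1.44 | NS | D | D | 4 | 4.3 | C | T | 0.0012 | 0.08 |
| Cancer status | 5:52397270 | MOCS2 | 0.02 | -3.26 | 1.44 | NS | D | D | 5 | 5.92 | G | A | 0.0008 | 0.08 |
| Cancer status | 4:52926666 | SPATA18 | 0.03 | -1.47 | 0.67 | NS | D | D | 5 | 3.72 | A | T | 0.0024 | 0.16 |
| Cancer status | 6:4087949 | C6orf201 | 0.05 | 1.18 | 0.60 | NS | D | D | 6 | 3.07 | A | T | 0.0473 | 0.18 |
| Cancer status | 7:20721130 | ABCB5 | 0.01 | -4.79 | 1.79 | S | . | . | 2c | . | G | A | 0.0008 | 0.11 |
| Cancer status | 3:10255002 | IRAK2 | 0.01 | 3.01 | 1.22 | NS | T | 0.00 | 2b | -0.447 | C | G | 0.0323 | 0.11 |
| Cancer status | 17:39197601 | KRTAP1-1 | 0.02 | 1.49 | 0.64 | NS | T | 0.00 | 2b | -8.93 | T | C | . | 0.24 |

Chr:Pos: Chromosome:Position. SE: Standard Error. SIFT: Sorting Intolerant from Tolerant score. PolyPhen-2: Polymorphism Phenotyping-2. GERP: Genomic Evolutionary Rate Profiling score. Ref: reference allele. Alt: alternate allele. MAF 1000G: Minor Allele Frequency in 1000 Genomes Project. MAF: Minor Allele Frequency in study population. NS: nonsynonymous. S: synonymous. Int: intronic. UTR3: 3′ untranslated region. UTR5: 5′ untranslated region. D: deleterious. A full stop (.) indicates not annotated in database.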

3.4.3.2 Segregation analysis results

Of the 19 variants, 13 were only seen in one family (ZFP69B, BEAN1, UVSSA, C16orf96, ADAMTS14, DLG5, KIF2C, ABCB5, STARD13, HR, RRS1, SYNPR and MOCS2). Using the three criteria for familial segregation, six variants showed complete familial segregation. Two conserved, deleterious variants showed complete familial segregation in family 2 (Figure 3.2). An exonic nonsynonymous variant in the C16orf96 gene showed nominal association with age at onset of cancer and cancer status. A synonymous variant in the ABCB5 gene also showed nominal association with age at onset of cancer and cancer status. Each family member with cancer was heterozygous at these positions, whereas unaffected family members were homozygous for the reference allele at these positions.

[Pedigree of family 2. Cancers labelled in the pedigree: prostate (generation I), sarcoma (2-II-1), melanoma (2-II-2 and 2-II-3). Key: affected male, affected female, unaffected male, unaffected female, proband.]

| Patient | Genotype at position in C16orf96 gene | Read depth (ref, alt) | Genotype at position in ABCB5 gene | Read depth (ref, alt) |
|---|---|---|---|---|
| Patient 2-I-1 | T/T | 119,0 | G/G | 69,0 |
| Patient 2-I-2 | T/C | 28,38 | G/A | 6,4 |
| Patient 2-II-1 | T/C | 63,38 | G/A | 32,41 |
| Patient 2-II-2 | T/C | 36,27 | G/A | 54,65 |
| Patient 2-II-3 | T/C | 58,44 | G/A | 41,38 |
| Patient 2-II-4 | T/T | 62,1 | G/G | 65,0 |
| Patient 2-III-1 | T/T | 42,0 | G/G | 96,0 |

Figure 3.2: Genotypes for the C16orf96 and ABCB5 variants that show segregation in patients with cancer in family 2

Using the three criteria for familial segregation, four conserved, deleterious variants showed complete familial segregation in family 3 (Figure 3.3). Exonic variants in the ZFP69B and UVSSA genes showed nominal association with both age at onset of cancer and cancer status in family 3. Two exonic variants in the BEAN1 and KIF2C genes showed nominal association with age at onset of cancer in family 3. All patients with cancer in family 3 were heterozygous at these positions, and unaffected family members were homozygous for the reference allele at these positions. None of the other prioritised known rare variants showed complete familial segregation in any of the families according to the familial segregation criteria.

[Pedigree of family 3. Cancers labelled in the pedigree: 3-I-1 sarcoma, 3-II-1 prostate, 3-III-1 sarcoma. Key: affected male, affected female, unaffected male, unaffected female, proband.]

| Patient | ZFP69B genotype | Read depth (ref, alt) | UVSSA genotype | Read depth (ref, alt) | BEAN1 genotype | Read depth (ref, alt) | KIF2C genotype | Read depth (ref, alt) |
|---|---|---|---|---|---|---|---|---|
| Patient 3-I-1 | C/G | 27,19 | G/A | 58,31 | C/A | 28,26 | G/A | 59,52 |
| Patient 3-II-1 | C/G | 29,37 | G/A | 75,69 | C/A | 18,19 | G/A | 50,66 |
| Patient 3-II-2 | C/C | 118,0 | G/G | 129,0 | C/C | 62,0 | G/G | 126,0 |
| Patient 3-III-1 | C/G | 24,41 | G/A | 127,64 | C/A | 44,30 | G/A | 125,107 |
| Patient 3-III-2 | C/C | 44,0 | G/G | 85,0 | C/C | 52,0 | G/G | 118,0 |

Figure 3.3: Genotypes for the ZFP69B, BEAN1, UVSSA and KIF2C variants that show segregation in patients with cancer in family 3

3.4.4 Candidate gene variants

3.4.4.1 Association analysis in SOLAR

The annotated candidate gene variants (Table 3.3) were tested for association with cancer phenotypes using SOLAR. The results from SOLAR were corrected for multiple testing using the Bonferroni method with the number of variants prioritised (1,297). The significance level after correction was α < 3.86 × 10⁻⁵. No variants were significant after correcting for multiple testing.

Table 3.6 contains a summary of the nominally associated variants (p-value < 0.05) for the age at onset of cancer, the age at onset of sarcoma, and cancer status. The results include 14 variants that showed nominal association with age at onset of cancer (two putative structural and 12 putative regulatory variants), two putative regulatory variants that showed nominal association with age at onset of sarcoma, and 12 variants that showed nominal association with cancer status (one putative structural and 11 putative regulatory variants). There were no variants showing an association with a p-value < 0.05 for sarcoma status. Of the total variants, 18 variants were associated with a single cancer phenotype, and five variants were associated with more than one cancer phenotype. Three variants were associated with age at onset of cancer and cancer status, one variant was associated with age at onset of cancer and age at onset of sarcoma, and one variant was associated with age at onset of sarcoma and cancer status.

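Table 3.6: Summary of SOLAR association results for candidate gene variants

| Phenotype | Chr:Pos | Gene | p-value | Beta | SE | Exonic function | SIFT | PolyPhen-2 | RegulomeDB | GERP | Ref | Alt | MAF 1000G | MAF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age at onset of cancer | 16:334543 | PDIA2 | 0.01 | 1.27 | 0.49 | NS | D | D | 5 | 3.17 | C | G | 0.05 | 0.11 |
| Age at onset of cancer | 11:108098576 | ATM | 0.04 | 1.87 | 0.89 | NS | D | D | 6 | 4.22 | C | G | 0.004 | 0.03 |
| Age at onset of cancer | 11:64577620 | MEN1 | 0.01 | -0.59 | 0.22 | Int | . | . | 2b | -6.21 | G | C | 0.17 | 0.08 |
| Age at onset of cancer | 11:64564208 | MAP4K2 | 0.01 | -1.01 | 0.38 | Int | . | . | 1f | 1.94 | A | G | 0.16 | 0.11 |
| Age at onset of cancer | 19:45866972 | ERCC2 | < 0.01 | 1.42 | 0.58 | Int | . | . | 2b | 2.51 | C | T | 0.00 | 0.05 |
| Age at onset of cancer | 17:41622861 | ETV4 | 0.03 | -0.75 | 0.34 | Int | . | . | 2b | 2.49 | G | A | . | 0.18 |
| Age at onset of cancer | 17:18208544 | TOP3A | 0.03 | -0.86 | 0.40 | Int | . | . | 1f | -6.25 | G | A | 0.32 | 0.13 |
| Age at onset of cancer | 17:18226177 | SMCR8 | 0.03 | -0.86 | 0.40 | S | . | . | 2b | 5.20 | G | T | 0.25 | 0.13 |
| Age at onset of cancer | 17:18231998 | SHMT1 | 0.03 | -0.86 | 0.40 | UTR3 | . | . | 1f | -0.29 | G | A | 0.32 | 0.13 |
| Age at onset of cancer | 17:18232017 | SHMT1 | 0.03 | -0.86 | 0.40 | UTR3 | . | . | 1f | 1.74 | G | C | 0.20 | 0.13 |
| Age at onset of cancer | 17:18233810 | SHMT1 | 0.03 | -0.86 | 0.40 | Int | . | . | 1f | -6.43 | T | C | 0.20 | 0.13 |
| Age at onset of cancer | 11:47369443 | MYBPC3 | 0.04 | 1.87 | 0.89 | S | . | . | 1f | -11.10 | G | A | 0.07 | 0.03 |
| Age at onset of cancer | 11:64557132 | MAP4K2 | 0.04 | -1.07 | 0.52 | Int | . | . | 1f | 1.69 | C | T | 0.17 | 0.05 |
| Age at onset of cancer | 19:42342319 | LYPD4 | 0.04 | -0.53 | 0.26 | S | . | . | 1f | -3.55 | A | G | 0.60 | 0.55 |
| Age at onset of sarcoma | 8:145742879 | RECQL4 | 0.01 | 0.49 | 0.18 | S | . | . | 2b | 0.96 | T | C | 0.53 | 0.39 |
| Age at onset of sarcoma | 11:47369443 | MYBPC3 | 0.01 | 1.51 | 0.57 | S | . | . | 1f | -11.10 | G | A | 0.07 | 0.03 |
| Cancer status | 16:334543 | PDIA2 | 0.01 | -2.76 | 1.03 | NS | D | D | 5 | 3.17 | C | G | 0.06 | 0.11 |
| Cancer status | 11:64577620 | MEN1 | 0.01 | 4.39 | 1.73 | Int | . | . | 2b | -6.21 | G | C | 0.18 | 0.18 |
| Cancer status | 16:322934 | RGS11 | 0.02 | -1.50 | 0.65 | Int | . | . | 2b | 3.34 | C | T | 0.55 | 0.82 |
| Cancer status | 17:41622861 | ETV4 | 0.02 | 1.50 | 0.65 | Int | . | . | 2b | 2.49 | G | A | . | 0.18 |
| Cancer status | 20:43030160 | HNF4A | 0.02 | -2.50 | 1.11 | Int | . | . | 2b | 2.12 | G | A | 0.02 | 0.08 |
| Cancer status | 8:145730330 | GPT | 0.03 | -1.37 | 0.64 | Int | . | . | 2b | -0.87 | G | A | 0.39 | 0.53 |
| Cancer status | 8:145737514 | RECQL4 | 0.03 | -1.37 | 0.64 | Int | . | . | 2b | -9.84 | G | A | 0.41 | 0.53 |
| Cancer status | 8:145741765 | RECQL4 | 0.03 | -1.37 | 0.64 | S | . | . | 2b | -9.43 | G | A | 0.36 | 0.53 |
| Cancer status | 11:47371578 | MYBPC3 | 0.04 | 3.20 | 1.53 | S | . | . | 2b | 0.53 | G | A | 0.01 | 0.08 |
| Cancer status | 16:419923 | MRPL28 | 0.04 | 3.20 | 1.53 | Int | . | . | 2b | -7.23 | G | C | 0.11 | 0.08 |
| Cancer status | 16:68857289 | CDH1 | 0.04 | 2.94 | 1.41 | Int | . | . | 2b | 3.50 | T | C | 0.07 | 0.08 |
| Cancer status | 8:145742879 | RECQL4 | 0.05 | -0.95 | 0.48 | S | . | . | 2b | 0.96 | T | C | 0.53 | 0.39 |

Chr:Pos: Chromosome:Position. SE: Standard Error. SIFT: Sorting Intolerant from Tolerant score. PolyPhen-2: Polymorphism Phenotyping-2. GERP: Genomic Evolutionary Rate Profiling score. Ref: reference allele. Alt: alternate allele. MAF 1000G: Minor Allele Frequency in 1000 Genomes Project. MAF: Minor Allele Frequency in the study population. NS: nonsynonymous. S: synonymous. Int: intronic. UTR3: 3′ untranslated region. UTR5: 5′ untranslated region. D: deleterious. A full stop (.) indicates not annotated in database.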

3.4.4.2 Segregation analysis results

Of the 23 different candidate variants identified, five variants were only seen in one family (PDIA2, ERCC2, HNF4A, MYBPC3 and MRPL28). However, using all three criteria for familial segregation, only one variant in the PDIA2 gene showed nominal association with age at onset of cancer and cancer status and complete familial segregation in family 2 (Figure 3.4). Each family member with cancer was heterozygous at this position, whereas unaffected family members were homozygous for the reference allele at this position. None of the other prioritised candidate gene variants showed complete familial segregation in any of the families according to the familial segregation criteria.

[Pedigree of family 2. Cancers labelled in the pedigree: prostate (generation I), sarcoma (2-II-1), melanoma (2-II-2 and 2-II-3). Key: affected male, affected female, unaffected male, unaffected female, proband.]

| Patient | Genotype at position in PDIA2 gene | Read depth (ref, alt) |
|---|---|---|
| Patient 2-I-1 | C/C | 206,1 |
| Patient 2-I-2 | C/G | 109,135 |
| Patient 2-II-1 | C/G | 74,67 |
| Patient 2-II-2 | C/G | 63,43 |
| Patient 2-II-3 | C/G | 106,108 |
| Patient 2-II-4 | C/C | 138,0 |
| Patient 2-III-1 | C/C | 161,0 |

Figure 3.4: Genotypes for the PDIA2 variant that shows segregation in patients with cancer in family 2

3.4.5 Evidence further supporting germline risk genes

The nominally significant (p-value < 0.05) variants that showed familial segregation were researched using several in silico resources. Table 3.7 contains a combined summary of these resources for all nominally significant candidate germline risk variants identified by the three prioritisation strategies that showed familial segregation, and for the genes in which they arise. The table indicates that none of the candidate risk variants identified were reported in COSMIC (database of genes somatically mutated in cancers). However, six of the genes in which germline risk variants were identified were each reported to have mutations in the COSMIC database. None of the genes were listed in the COSMIC cancer gene census. They also were not listed in the PubMeth database, which suggests there is currently no evidence of methylation of these genes in cancer. Two of the genes were reported to have gene functions that support involvement in cancer pathogenesis in NCBI. The Gene References into Functions (GeneRIF) for ABCB5 suggests a role for this gene in chemoresistance, and the GeneRIF for KIF2C suggests this gene is involved in directional migration and invasion of tumour cells.

The PubMed searches for the eight candidate risk genes are summarised in Table 3.8. The PubMed searches revealed previously published associations between the ABCB5, KIF2C, and PDIA2 genes and cancer. However, there is no supporting evidence for the involvement of the ARHGAP39, C16orf96, ZFP69B, UVSSA and BEAN1 genes in cancer pathogenesis at this time. The single publication returned by the search strategy for the ARHGAP39 gene revealed a role for the gene as a binding partner for CNK2, which is a spatial modulator of Rac cycling during spine morphogenesis. 297 This publication did not report any association between ARHGAP39 and cancer. The PubMed search for the BEAN1 gene returned results on randomised soya trials, labelled BEAN1 and BEAN2. 298, 299 No publications were returned on the function of the BEAN1 gene or its involvement in cancer pathogenesis.

Table 3.7: Summary of findings from in silico resources investigating the role of candidate germline risk variants in cancer pathogenesis

| Gene | Genomic location | Variant in COSMIC | No. mutations in COSMIC | Cancer gene census | SuperPath | GO molecular function | Methylation | GeneRIF |
|---|---|---|---|---|---|---|---|---|
| ARHGAP39 | 8:145773319 | No | 246 | No | Developmental biology; Signalling by Robo receptor; p75 NTR receptor-mediated signalling; Signalling by GPCR; Signalling by Rho GTPases | GTPase activator activity | No | No |
| C16orf96 | 16:4606552 | No | 155 | No | . | . | No | No |
| ABCB5 | 7:20721130 | No | 332 | No | ABC-family proteins mediated transport; Transmembrane transport of small molecules | ATP binding; Xenobiotic-transporting ATPase activity; Efflux transmembrane transporter activity; ATPase activity | No | Chemoresistance |
| ZFP69B | 1:40929077 | No | 0 | No | Gene expression | DNA binding; Transcription factor activity, sequence-specific DNA binding; Protein binding; Metal ion binding | No | No |
| UVSSA | 4:1348920 | No | 0 | No | Transcription-coupled nucleotide excision repair; DNA double strand break repair | RNA polymerase II core binding; Protein binding | No | No |
| BEAN1 | 16:66503705 | No | 31 | No | . | . | No | No |
| KIF2C | 1:45224937 | No | 141 | No | Golgi-to-ER retrograde transport; Cell cycle; Mitotic metaphase and anaphase; Mitotic prometaphase; Vesicle-mediated transport | Microtubule motor activity; Protein binding; ATP binding; Microtubule binding; ATPase activity | No | Directional migration and invasion of tumour cells |
| PDIA2 | 16:334543 | No | 114 | No | Statin pathway | Protein disulfide isomerase activity; Steroid binding; Protein binding; Lipid binding; Disulfide oxidoreductase activity | No | No |

Genomic location: chromosome:position. COSMIC: Catalogue of Somatic Mutations in Cancer database (http://cancer.sanger.ac.uk/cosmic). 134 No. mutations in COSMIC: the number of mutations reported in the gene in the COSMIC database. Cancer gene census: is the gene reported in the cancer gene census in COSMIC? The cancer gene census is a catalogue of genes for which mutations have been causally implicated in cancer. SuperPath: from PathCards, an integrated database of human pathways and their annotations (http://pathcards.genecards.org/). 293 Human pathways are clustered into SuperPaths based on gene content similarity. GO molecular function: Gene Ontology molecular function. 294 Methylation: is the gene reported to be methylated in cancer by PubMeth? (http://www.pubmeth.org). 295 GeneRIF: Gene References Into Functions from National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/). 296 Are any GeneRIF associated with cancer reported for the gene? Robo: Roundabout family of proteins. NTR: Neurotrophins. GPCR: G-protein-coupled receptors. GTPase: Guanosinetriphosphatase. ABC: ATP-binding cassette. ATP: Adenosine triphosphate. ATPase: Adenosinetriphosphatase. ER: endoplasmic reticulum.

Table 3.8: Summary of search results from PubMed for genes in which germline variants were identified

| Gene | No. of publications | Role of gene | Selected references |
|---|---|---|---|
| ARHGAP39 | 1 | . | . |
| C16orf96 | 0 | . | . |
| ABCB5 | 109 | ABCB5 is a drug efflux pump associated with drug resistance in melanoma, colon cancer, Merkel cell carcinoma, oral squamous cell carcinoma, acute leukemia, colorectal cancer, hepatic cancer, breast cancer and osteosarcoma. ABCB5 has also been found to be overexpressed at the transcriptional level in a number of cancer subtypes, including breast cancer and melanoma. Alterations in ABCB5 have been reported in lung cancer. | 300–311 |
| ZFP69B | 0 | . | . |
| UVSSA | 5 | UVSSA is involved in transcription-coupled nucleotide excision repair by relieving RNA polymerase II arrest at damaged sites to permit repair of the template strand. Mutations in UVSSA associated with Cockayne syndrome group B (characterised by photosensitivity, growth failure, progressive neurodevelopmental disorder, and premature ageing but no predisposition to skin cancer). | 312–316 |
| BEAN1 | 2 | . | . |
| KIF2C | 55 | KIF2C (also known as MCAK) is critical in the regulation of microtubule dynamics during mitosis. KIF2C is also involved in the directional migration and invasion of tumour cells and plays a role in cell proliferation. KIF2C is a gene likely to be involved in carcinogenesis. | 317–323 |
| PDIA2 | 3 | Gene expression of PDIA2 found to influence the prognostic significance of TWIST (correlated with cancer invasion and metastasis in several human cancers). PDIA2 plays a role in the maintenance of endoplasmic reticulum homeostasis and endoplasmic reticulum stress-induced apoptosis. | 324, 325 |

The PubMed search was performed in April 2017 using the string ("gene name") AND (cancer OR malignancy OR tumor* OR tumour* OR sarcoma). Abstracts were screened for relevance to the current study.
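The search string in the table footnote can be generated for each candidate gene as shown in the sketch below. It only builds the query text and does not contact PubMed; the quotation marks around the gene name are an assumption about the original string, and the function name is illustrative.

```python
# The eight candidate risk genes listed in Table 3.8.
GENES = [
    "ARHGAP39", "C16orf96", "ABCB5", "ZFP69B",
    "UVSSA", "BEAN1", "KIF2C", "PDIA2",
]

# Cancer-related terms copied from the search string reported in the text.
CANCER_TERMS = "(cancer OR malignancy OR tumor* OR tumour* OR sarcoma)"


def build_query(gene):
    """Combine one gene name with the cancer-related terms."""
    # Assumption: the gene name was quoted in the original search string.
    return f'("{gene}") AND {CANCER_TERMS}'


for gene in GENES:
    print(build_query(gene))
```

Using a single template for all genes keeps the eight searches comparable, although gene symbols that coincide with other terms (as seen for BEAN1) can still return unrelated records.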

3.5 Discussion

The filtering and prioritisation of eight germline variants generated by WES in three families were described in this chapter. Eight candidate germline risk variants were found to show nominal association with cancer and age at onset of cancer in two of the three cancer cluster families.

3.5.1 Variant filtering and prioritisation strategies

The annotation results using ANNOVAR are consistent with a previous publication reporting that a substantial proportion of DNA fragments in WES capture fall outside target regions. 256 There were slightly more synonymous variants than nonsynonymous variants, which is also consistent with previous findings. 286 SIFT and PolyPhen-2 scores from ANNOVAR annotation were used to determine if the variants were likely to have a deleterious effect on protein function. A previous study reports reasonable sensitivity for SIFT and PolyPhen-2 (69% and 68%, respectively) but low specificity (13% and 16%, respectively). 326 Therefore, both programs have a high false-positive rate, and these results should be interpreted with caution and reported in the context of other available evidence. 326 In addition to variants reported as deleterious or tolerated by SIFT and PolyPhen-2, there are a number of variants that were not annotated with a score (unknown). In particular, 80% of variants prioritised by the rare variants strategy were filtered out because they were unknown in both databases. Although two of the prioritisation strategies identified more regulatory variants than structural variants, of the eight candidate risk variants that showed familial segregation, seven were structural, and only one was regulatory. In this study, exome sequencing combined with variant filtering and prioritisation was an efficient strategy for identifying risk alleles in cancer cluster families.
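The statement that both programs have a high false-positive rate follows from the reported specificities, since the false-positive rate is one minus specificity. The short sketch below works through that arithmetic using the sensitivity and specificity values quoted in Section 3.5.1; the dictionary layout and function name are illustrative.

```python
# (sensitivity, specificity) as quoted in Section 3.5.1.
REPORTED = {
    "SIFT": (0.69, 0.13),
    "PolyPhen-2": (0.68, 0.16),
}


def false_positive_rate(specificity):
    """Proportion of truly benign variants that are flagged as deleterious."""
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must lie between 0 and 1")
    return 1.0 - specificity


for tool, (sensitivity, specificity) in REPORTED.items():
    fpr = false_positive_rate(specificity)
    print(f"{tool}: sensitivity {sensitivity:.0%}, false-positive rate {fpr:.0%}")
```

On these figures, roughly 87% (SIFT) and 84% (PolyPhen-2) of benign variants would be expected to be called deleterious, which is why the predictions are treated as supporting evidence only.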

3.5.2 Association and segregation analyses of candidate risk variants in families

Family segregation studies are re-emerging as an optimal way to classify extremely rare variants. 327 In this study, three assumptions were made in determining familial segregation. These assumptions did not take into account the possibility of unaffected carriers (incomplete penetrance), later onset of disease, or risk variants that occur in cases in more than one family. Therefore, some true variants may have been excluded using these assumptions.

SOLAR was used to test for association of filtered and prioritised variants with both age at onset of disease and disease status. Despite efforts to filter and prioritise variants, no variants reached statistical significance after correcting for multiple testing, and a nominal p-value of < 0.05 was therefore used to select variants for familial segregation analyses. The large number of variants identified and the relatively small sample size are the likely reasons that no variants reached statistical significance after correcting for multiple testing in this study. Despite these limitations, by treating each family as a separate discovery unit, it was hoped that some insight might be gained into genetic contributions to the risk of cancer in each family.

Eight variants nominally associated with age at onset of cancer and cancer status were identified in two of the three cancer cluster families. The candidate risk variants identified in this study were all private variants, seen only in one family. There has been increasing awareness that rare variants of modest to large effect contribute to complex diseases and may explain a substantial proportion of missing heritability. 129 There has been, therefore, a return to family-based studies to identify rare risk variants involved in common human disease. 133, 152–155 Recent sequencing studies have shown that the rate of private mutations in individuals is larger than previously expected. 328–330 Rare, private mutations found in families could be due to the explosion of human populations and the slowing of negative selection by improved food supplies, sanitation, vaccines and routine health care. 329–331 Rare variants that are private to families could constitute a proportion of disease risk variants. 328

It is plausible that the variants found in the ABCB5, KIF2C and PDIA2 genes may be involved in the pathogenesis of cancer based on previous publications and the proposed function of the protein. Each of these genes is discussed in more detail below.

3.5.2.1 The ABCB5 gene

The ABCB5 gene encodes an ATP-binding cassette (ABC) drug efflux transporter present in a number of stem cells. 332, 333 ABCB5 functions as a determinant of membrane potential and regulator of cell fusion in physiologic skin cells. 334 This gene is also expressed in clinical malignant melanoma tumours and preferentially marks tumour cells expressing the CD133+ stem cell phenotype. 334 ABCB5 is a rhodamine-123 efflux transporter and marks CD133-expressing progenitor cells. ABCB5 regulates membrane potential in these progenitor cells and determines the propensity to undergo cell fusion. 334 Membrane hyperpolarisation is associated with the multidrug resistance phenotype of human cancer cells. 335 ABCB5 plays a role in multidrug resistance of multiple malignancies including human malignant melanoma, 333, 336, 337 colon cancer, 304, 338 Merkel cell carcinoma, 305 oral squamous cell carcinoma, 306 acute leukaemia, 307 colorectal cancer, 309 hepatocellular carcinoma, 310 breast cancer, 311 and osteosarcoma. 303 Melanoma is resistant to the effects of doxorubicin, 333 a chemotherapy drug used to treat many different types of cancer. It has been proposed that the ABCB5 drug efflux function may be involved in doxorubicin resistance. 334 The variant identified in the ABCB5 gene may be phenotypically relevant to family 2 as this family has two family members affected by melanoma (Patient 2-II-2 and Patient 2-II-3), in addition to a prostate cancer case (Patient 2-I-1), and a sarcoma case (Patient 2-II-1).

3.5.2.2 The KIF2C gene

KIF2C is a kinesin-like protein that functions as a microtubule-dependent molecular motor. 339 The KIF2C gene (also known as MCAK) is one of the best characterised members of the kinesin-13 family and plays an important role in microtubule dynamics during mitosis. 320 The deregulation of KIF2C induces defects in spindle assembly, chromosome congression and segregation, leading to chromosome instability, 340–344 one of the hallmarks of cancer. 320 KIF2C is important for the migration and invasion of tumour cells via the modulation of microtubule dynamics in the cytoskeleton. 320, 345, 346 KIF2C has been identified as a tumour antigen in patients with colorectal cancer. 347 Overexpression of KIF2C is associated with a more invasive and metastatic phenotype and poor prognosis for breast, gastric and colorectal cancer patients. 347–350 KIF2C may represent an attractive target for antigen-specific immunotherapies in colorectal cancer and other malignancies. 347, 348

3.5.2.3 The PDIA2 gene

The PDIA2 gene encodes the pancreas-specific member of the protein disulphide isomerase (PDI) family of proteins. PDIA2, as with other PDIs, has a central role as a reductase, an oxidase, an isomerase and a molecular chaperone in the endoplasmic reticulum. 351 It has been proposed that PDIA2 plays a role in the production and secretion of digestive enzymes in vivo 352 and in the binding and regulation of oestrogen synthesis. 353 A higher level of PDIA2 expression was found to be associated with shorter survival time in patients whose prostate cancer expressed a high level of TWIST, but not in patients whose prostate cancer expressed a low level of TWIST. 324 TWIST is an oncogene that is correlated with cancer invasion and metastasis in human cancers including breast cancer, rhabdomyosarcoma, gastric carcinomas, bladder and prostate cancer. 354–357 Little is known about the role of PDIA2 in prostate cancer, although lower levels of PDIA2 expression were associated with better survival. 324 Therefore, PDIA2 may promote cancer progression. 324 However, PDIA2 alone was a poor prognostic marker for prostate cancer. 324

3.5.3 Conclusion

In conclusion, WES data was annotated, filtered and prioritised in an attempt to identify candidate germline risk variants that may be involved in cancer or sarcoma pathogenesis in three cancer cluster families. As there is no gold standard for the filtering and prioritisation of WES data, these results represent the current state of tools, databases and knowledge of cancer biology. With the data obtained, it is not possible to determine whether the variants in the ARHGAP39, C16orf96, ZFP69B, UVSSA and BEAN1 genes are pathogenic mutations. These genes, however, become candidates that can be further tested for association with cancer in independent families and study populations. With further genetic evidence of involvement in risk of cancer, functional studies including assays of patient-derived tissue or well-established cell or animal models of gene function could be undertaken to determine the causal effect of all candidate risk variants on the cancer phenotype. 236 Due to time and budget limitations, these types of functional studies are beyond the scope of this thesis.


Chapter 4
Aim 3: A comparison of matched tumour and germline DNA from two sarcoma patients

4.1 Introduction

Next Generation Sequencing (NGS) of tumour samples and matched germline samples is a powerful strategy for studying the genetic basis of cancer initiation, development, and growth. 133 The third aim of this study was to perform a matched tumour and germline analysis on two myxoid liposarcoma patients using peripheral blood genomic DNA and genomic DNA isolated from tumour tissue to identify somatic mutations.

4.1.1 Myxoid liposarcoma

Myxoid liposarcomas are the second most common group of adipocytic/lipogenic sarcomas. 64 Myxoid liposarcomas are malignant tumours composed of uniform round to oval shaped primitive non-lipogenic cells and a variable number of small signet-ring cell lipoblasts. 64 The tumours typically exhibit a FUS-DDIT3 or EWSR1-DDIT3 rearrangement. 64 Myxoid liposarcomas occur most commonly in the deep soft tissue of the extremities and very rarely in the retroperitoneum.

4.1.1.1 Somatic variants

A comparison of matched tumour and germline samples from a patient allows researchers to distinguish between somatic variation (< 0.01% of variants) and inherited germline variation (> 99.99% of variants). 133 Germline variants are those that exist in the germline DNA, which is the source of DNA for all cells in the body. 149 A variant contained within the germline can be passed from parent to offspring. The identification of putative germline variants was the focus of Aim 2 (Chapter 3). Therefore, germline variants will not be reported in this chapter. In contrast, somatic variants are those found in the tumour DNA but not in the germline DNA. 149

Most cancers arise and evolve as a consequence of somatic mutations. 358 The characterisation of somatic mutations in cancer genomes is essential for understanding the disease and for the development of targeted therapeutics. 359 Over the last three decades, more than 600 genes have been shown to be somatically mutated in cancers. 134, 358 Molecular characterisation of somatic driver mutations allows greater understanding of biological abnormalities within cancer cells and provides information on the function of gene products, and relationships between genes and biochemical pathways. 134 Development of new therapeutic and preventative agents is dependent on the identification and modulation of these molecular targets. 134, 360 Targeted therapies for advanced lung cancer, 361 melanoma, 362 colorectal cancer, 363 and gastrointestinal stromal tumour 364 are examples that have resulted from the translation of knowledge gained from genomics. In addition to somatic variants, a comparison of matched tumour and germline DNA can also identify the absence of heterozygosity at loci in tumour DNA compared to germline DNA.

4.1.1.2 Loss of heterozygosity

Loss of heterozygosity (LOH) is a common genetic event in cancer development.
365 LOH is a change in the polymorphic markers from a heterozygous state in the germline DNA to a homozygous state in the tumour DNA. 366 In cancers, the absence of one functional copy of a tumour suppressor gene does not affect the phenotype. However, if LOH occurs and the remaining normal copy of the tumour suppressor gene is lost, this will result in the complete loss of the protective function of the tumour suppressor gene. LOH is known to be involved in the somatic loss of wild-type alleles in many inherited cancer syndromes such as retinoblastoma and hereditary breast and ovarian cancer syndromes. 366, 367

4.1.1.3 Somatic copy number alteration

In addition to distinguishing somatic and LOH variants, somatic copy number alterations (SCNA) can be identified in a tumour sample relative to the matched germline sample by comparing the normalised read depth. 368, 369 The DNA sequence copy number is the number of copies of DNA in a region of a genome. 370 Cancer progression often involves alterations in DNA copy number. 370 In humans, the normal copy number is two for all the autosomes. A copy number variation (CNV) is defined as a structurally variant region, larger than one kilobase (kb) in size, in which copy number differences have been observed between two or more genomes. 371 CNVs can alter transcription of genes by changing the dosage or by disrupting proximal or distant regulatory regions. 372 SCNA, distinguished from germline CNV, play a role in activating oncogenes and inactivating tumour suppressor genes. 13 Identification of SCNA can provide valuable insights into the cellular defects that cause cancer and suggest potential therapeutic strategies. 373 SCNA and CNVs have a significant role in tumourigenesis in many cancers including gastric cancer, 374 ovarian cancer, 375 hepatocellular carcinoma, 376 testicular germ cell tumours, 377 colorectal carcinoma, 378 and bladder cancer. 379 The characterisation of focal SCNAs has led to the identification of novel cancer genes such as MYB, PAX5 and DUSP4. 380–387

4.1.2 Bioinformatic assessment of matched tumour and germline samples

A number of bioinformatic tools have been developed to analyse matched tumour and germline samples.
Initially, these tools used algorithms that involved calling variants in the tumour and germline samples separately followed by classification using a statistical significance test or simple subtraction. 388 More recently, tools have been developed that compare the tumour and germline directly at each locus. VarScan2 and Strelka are two calling algorithms that were specifically designed for the joint analysis of matched tumour and germline samples. 368, 369, 389 VarScan2 uses tumour and germline samples to heuristically detect sequence variants and classify them by somatic status (germline, somatic or LOH). 368, 369 Strelka utilises a novel Bayesian approach to represent continuous allele frequencies for both tumour and normal samples to efficiently identify somatic variants. 389 Using Strelka, the normal sample is represented as a mixture of diploid germline variation with noise, and the tumour sample is represented as a combination of the normal sample with somatic variation. 389 It is important to identify somatic mutations in cancer studies as these variants often play important roles in tumour development and treatment decisions. 149

4.1.3 Somatic mutations and drug sensitivity

The identification of somatic driver mutations that arise in tumours is important in developing new cancer therapeutic targets, as genetic variation influences the response of an individual to drug treatments. 390 The current treatment for most cancers includes using cytotoxic chemotherapy, which is not precisely targeted to the somatic mutations that drive malignant transformation. 390 Somatic mutations can influence tumour behaviour and clinical outcome. Therefore, therapies should be targeted to the patient's tumour genotype rather than a generic treatment. An increased understanding of somatic mutations in individual patients has the potential to make therapies safer and more effective by assisting treatment selection and dosage based on driver mutations in the tumour.

The Genomics of Drug Sensitivity in Cancer database (http://www.cancerrxgene.org/) 391 is a large dataset on drug sensitivity in cancer cells linked to genomic information to facilitate the discovery of new biomarkers of drug response.
391 The database contains information on over 250 anticancer drugs across > 1,000 cell lines. 391 Molecular markers are identified by integrating data from the Catalogue of Somatic Mutations in Cancer (COSMIC) database 134 and cell line drug sensitivity data.

4.1.4 Outline of chapter

Whole exome sequencing (WES) was performed on matched tumour and germline DNA from two myxoid liposarcoma patients from the families described in Chapter 2. VarScan2 was used to identify candidate somatic variants that were confirmed using Strelka. VarScan2 was also used to identify LOH variants and SCNA events to determine regions of interest in both patients.

4.2 Methods

4.2.1 Whole exome sequencing

Tumour DNA from formalin-fixed and paraffin-embedded (FFPE) tumour samples and germline DNA from Patient 1-II-2 and Patient 2-II-1 were available to perform a matched tumour-germline analysis. DNA was extracted at the Peter MacCallum Cancer Centre in Melbourne, Australia. After microdissection of tumour material from FFPE tissue, DNA was extracted using a DNeasy Tissue kit (Qiagen) as previously described. 392 Anti-coagulated blood was processed using a Ficoll gradient. DNA was extracted from the nucleated cell product using QIAamp DNA blood kit (Qiagen).

Patient 1-II-2 (Figure 4.1) is a male patient who was diagnosed with a myxoid liposarcoma at 39 years of age. Patient 2-II-1 (Figure 4.2) is a male patient who was diagnosed with a myxoid liposarcoma at 61 years of age.

Figure 4.1: Pedigree of family 1 highlighting sarcoma Patient 1-II-2 for tumour-germline comparison. Key: affected male, affected female, unaffected male, unaffected female, proband; * patient selected for tumour-normal analysis.

Figure 4.2: Pedigree of family 2 highlighting sarcoma Patient 2-II-1 for tumour-germline comparison. Key: affected male, affected female, unaffected male, unaffected female, proband; * patient selected for tumour-normal analysis.

Due to difficulties performing WES on older FFPE samples, 393 DNA extracted from these samples was sent to an external sequencing facility. The four samples were sequenced using Agilent SureSelect V5 Capture on the Illumina HiSeq 4000 at 60X coverage.

4.2.2 Pre-processing and quality control

FASTQ files were received from Macrogen, Inc. Initial quality control (QC) reports were generated using FastQC (version 0.11.3), a quality control application for high throughput sequence data. 394 FastQC reads FASTQ files and can either provide an interactive application to review the results of several different checks or create an HTML based report which can be integrated into a pipeline. 394 QC reports were generated on sequence quality, GC content, duplication levels and adapter content.

4.2.3 Adapter trimming

The presence of technical sequences such as adapters in WES data can result in suboptimal downstream analyses. 395 The Illumina-specific adapter sequences were trimmed from the FASTQ files using Trimmomatic (version 0.36). 395 As Illumina sequences are paired-end, the palindrome mode was used. This mode is specifically aimed at detecting typical adapter read-through situations, in which the DNA fragment is shorter than the read length and adapter sequence therefore contaminates the end of the reads. 395 After the Illumina-specific adapters had been trimmed from the FASTQ files, a second round of QC reports was generated on the adapter trimmed data using FastQC.

4.2.4 Sequence alignment and calling

The raw sequencing data was then aligned to the human genome using the Burrows-Wheeler Aligner (BWA, version 0.7.2). 396 BWA alignment was performed in two steps. In the first step, the human genome build 19 (hg19) reference sequence was indexed. In the second step, BWA Maximal Exact Matches (BWA-MEM) was used to align the sequence reads to hg19.
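The QC, trimming and alignment steps described above can be sketched as the command lines they would typically correspond to. This is only an illustration under stated assumptions, not the commands actually run in this study: the file names, the Trimmomatic jar path, the adapter file (TruSeq3-PE.fa) and the ILLUMINACLIP thresholds (2:30:10) are hypothetical, because the text does not report them. The sketch builds the commands without executing them.

```python
import shlex
from typing import List, Optional, Tuple

# A step is (argv, stdout_path); stdout_path is None when the tool
# writes its own output files rather than printing to standard output.
Step = Tuple[List[str], Optional[str]]


def build_steps(sample: str,
                reference: str = "hg19.fa",                    # assumed file name
                trimmomatic_jar: str = "trimmomatic-0.36.jar",  # assumed path
                adapters: str = "TruSeq3-PE.fa") -> List[Step]:  # assumed adapter file
    """Return the QC, trimming and alignment commands for one paired-end sample."""
    r1, r2 = f"{sample}_1.fastq.gz", f"{sample}_2.fastq.gz"
    p1, u1 = f"{sample}_1.paired.fq.gz", f"{sample}_1.unpaired.fq.gz"
    p2, u2 = f"{sample}_2.paired.fq.gz", f"{sample}_2.unpaired.fq.gz"
    return [
        # First round of QC on the FASTQ files as received.
        (["fastqc", r1, r2], None),
        # Paired-end trimming; palindrome clipping applies when the adapter
        # file contains paired "Prefix" entries. Thresholds are illustrative.
        (["java", "-jar", trimmomatic_jar, "PE", r1, r2, p1, u1, p2, u2,
          f"ILLUMINACLIP:{adapters}:2:30:10"], None),
        # Second round of QC on the trimmed reads.
        (["fastqc", p1, p2], None),
        # Step 1: index the reference (needed once per reference, not per sample).
        (["bwa", "index", reference], None),
        # Step 2: BWA-MEM alignment; SAM records are written to standard output.
        (["bwa", "mem", reference, p1, p2], f"{sample}.sam"),
    ]


def as_shell(steps: List[Step]) -> List[str]:
    """Render the steps as shell command lines for inspection."""
    lines = []
    for argv, stdout_path in steps:
        line = shlex.join(argv)
        if stdout_path is not None:
            line += " > " + shlex.quote(stdout_path)
        lines.append(line)
    return lines


if __name__ == "__main__":
    for line in as_shell(build_steps("patient_tumour")):
        print(line)
```

Keeping each command as an argument list with a separate output path avoids shell quoting problems if the steps are later executed with a process launcher rather than printed.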

The alignment step creates the alignment in Sequence Alignment/Map (*.sam) format. SAMtools view (version 1.3.1) 397 was used to convert the *.sam files to *.bam format to reduce the size of the data. Summary statistics were created for the *.bam files using SAMtools flagstat. 397 Index files were created for each *.bam file using SAMtools index. 397 Local realignment was performed on the *.bam files in two stages using Genome Analysis Toolkit (GATK) RealignerTargetCreator and IndelRealigner (version 3.4.0). 180 The Picard (version 2.4.1) FixMateInformation tool was used to ensure that all read entries had their mate information written correctly. The Picard MarkDuplicates tool was then used to identify duplicate reads.

4.2.5 BAM quality control

A final round of QC was performed on the *.bam files using GATK DepthOfCoverage, 180 and Picard CollectInsertSizeMetrics and CollectAlignmentSummaryMetrics to determine coverage, insert size (the library portion between the adapter sequences) and alignment metrics, respectively.

4.2.6 Generate mpileup file

The germline and tumour *.bam files for each patient were grouped using SAMtools mpileup. 181 Alignment records were consolidated by sample identifiers in read group header lines.

4.2.7 Somatic variant calling using VarScan2

The genotype for each sample was determined from mpileup files using VarScan2 (version 2.3.9). 368 The algorithm read the data from both tumour and germline samples simultaneously. VarScan2 employed a heuristic approach to call variants that met the thresholds for read depth, base quality, variant allele frequency, and statistical significance. 368, 369 If the genotypes did not match, the read counts were evaluated by a one-tailed Fisher's exact test in a two-by-two table, comparing the number of reference-supporting reads and variant-supporting reads observed in the tumour to the numbers that were observed in the germline. 368 If the resulting p-value met the significance threshold (default 0.10), then the variant was called somatic (if the germline matched the reference genome at that position). 368

The VarScan2 subcommand, processsomatic, was then used to create output files of somatic variants based on confidence (low confidence and high confidence). High confidence variants are classed as those with a tumour variant allele frequency > 15%, normal variant allele frequency < 5%, and a somatic p-value of < 0.03. The remaining variants are classed as low confidence. VarScan2 somaticfilter was used to filter the possible false positives from the high confidence somatic mutations. Table 4.1 shows the settings used to run the somaticfilter command.

Table 4.1: Parameters specified for VarScan2 somaticfilter to filter false positives from the high confidence somatic mutations

Parameter | Specified
Minimum read depth | 10
Minimum supporting reads for a variant | 2
Minimum number of strands on which variant observed | 1
Minimum average base quality for variant-supporting reads | 20
Minimum variant allele frequency threshold | 0.2
Default p-value threshold for calling variants | 1 x 10^-1

Bonferroni adjustments were made to the somatic p-values from VarScan2 to correct for multiple testing. 292 The total number of variants in the mpileup files for each patient was used for the correction. The genotypes of any significant variants were visually confirmed by importing *.bam files into Integrative Genomics Viewer (IGV, version 2.3.80) 183, 184 and determining the number of reads for each allele.

4.2.7.1 Somatic variant calling using Strelka

A second somatic variant caller, Strelka (version 1.0.15), 389 was used to confirm the statistically significant somatic variants called by VarScan2. The first step of somatic variant analysis using Strelka is to run preliminary configuration validation (ensure that the chromosome names match in the *.bam header and

reference genome). Template configuration files from Strelka were used in this analysis. The configuration generates a makefile that controls the analysis step. The second step is to run the analysis using the makefile. The sorted tumour and germline *.bam files and hg19 reference sequence were used in the analysis.

4.2.8 Evidence further supporting somatic risk genes

The significant somatic variants and the genes in which they arise were further examined for evidence in cancer pathogenesis using several in silico resources including COSMIC (catalogue of somatic mutations), 134 the pathway unification database (PathCards), 293 gene ontology (GO) annotations, 294 PubMeth (a database of methylation in cancer), 295 and National Center for Biotechnology Information (NCBI). 296 A PubMed search was performed in April 2017 using the string ("gene name") AND (cancer OR malignancy OR tumor* OR tumour* OR sarcoma). Abstracts were screened for relevance to the current study.

4.2.9 Drug sensitivity

The genes in which somatic mutations were identified in the two sarcoma patients were searched in the Genomics of Drug Sensitivity in Cancer database (http://www.cancerrxgene.org/) 391 to determine whether they were known molecular targets.

4.2.10 Loss of heterozygosity variant calling using VarScan2

VarScan2 was used to call LOH variants. Similar to the somatic variant calling process, if the genotype between tumour and germline DNA did not match, the read counts were evaluated by a one-tailed Fisher's exact test. If the resulting p-value met the significance threshold (default 0.10), then the variant was called LOH (if the germline was heterozygous). Bonferroni adjustments were made to the LOH p-values from VarScan2 to correct for multiple testing. 292 The total number of variants in the mpileup files for each patient was used for the correction. The genotypes of any significant variants were visually confirmed by importing *.bam files into IGV.
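The read-count comparison used for both somatic and LOH calls can be illustrated with a small sketch of a one-tailed Fisher's exact test on a two-by-two table of reference-supporting and variant-supporting reads, followed by the high confidence criteria quoted in Section 4.2.7 (tumour variant allele frequency > 15%, normal variant allele frequency < 5%, p < 0.03). This is not VarScan2's own code, and the read counts in the example are hypothetical; the function names are introduced here for illustration only.

```python
from math import comb


def fisher_one_tailed(tumour_ref: int, tumour_var: int,
                      normal_ref: int, normal_var: int) -> float:
    """One-tailed Fisher's exact p-value for enrichment of variant reads in the tumour.

    Sums the hypergeometric probabilities of observing at least
    `tumour_var` variant-supporting reads in the tumour, with all
    row and column totals of the two-by-two table held fixed.
    """
    n_tumour = tumour_ref + tumour_var
    n_total = n_tumour + normal_ref + normal_var
    n_var = tumour_var + normal_var
    upper = min(n_var, n_tumour)
    tail = sum(comb(n_var, k) * comb(n_total - n_var, n_tumour - k)
               for k in range(tumour_var, upper + 1))
    return tail / comb(n_total, n_tumour)


def classify_somatic(tumour_ref: int, tumour_var: int,
                     normal_ref: int, normal_var: int,
                     min_tumour_vaf: float = 0.15,
                     max_normal_vaf: float = 0.05,
                     max_p: float = 0.03):
    """Return (p-value, confidence label) using the thresholds quoted in the text."""
    if tumour_ref + tumour_var == 0 or normal_ref + normal_var == 0:
        raise ValueError("both samples need non-zero read depth")
    p = fisher_one_tailed(tumour_ref, tumour_var, normal_ref, normal_var)
    tumour_vaf = tumour_var / (tumour_ref + tumour_var)
    normal_vaf = normal_var / (normal_ref + normal_var)
    high = tumour_vaf > min_tumour_vaf and normal_vaf < max_normal_vaf and p < max_p
    return p, ("high confidence" if high else "low confidence")


if __name__ == "__main__":
    # Hypothetical read counts: 60 ref / 40 variant reads in the tumour,
    # 100 ref / 0 variant reads in the germline sample.
    p_value, label = classify_somatic(60, 40, 100, 0)
    print(f"p = {p_value:.3g}, {label}")
```

Exact integer binomial coefficients are used so that very small p-values, such as those that survive Bonferroni correction, are not lost to floating-point underflow in intermediate terms.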

4.2.11 Variant annotation and filtering

Statistically significant somatic and LOH variants were annotated using Annotate Variation (ANNOVAR, version 2015Jun16) 245 and Regulome database (RegulomeDB). 257 The somatic and LOH variants that reached statistical significance after Bonferroni correction were cross-referenced to the exclusion list of Fuentes Fajardo et al. (2012) (available in the paper's Supplementary material: "Table S7 gene exclusion list final") to determine if any variants in highly polymorphic regions should be excluded. 282

4.2.12 Somatic copy number analysis using VarScan2

VarScan2 copynumber was applied to the tumour-germline mpileup files to create a single output file of raw SCNAs. VarScan2 copycaller was then used to adjust for GC content and make preliminary calls. The adjusted calls files were imported into R, 281 and the package DNAcopy (version 1.48.0) 398 was used to perform circular binary segmentation on a per-chromosome basis to smooth and segment the raw output from VarScan2 copycaller. 370 The results of DNAcopy were plotted in R to visualise SCNA.

4.3 Results

4.3.1 Whole exome sequencing

Raw data reports from Macrogen, Inc. are summarised in Table 4.2. The GC content for an exome typically falls within the range of 49-51%. 399 Therefore, the samples show just below average %GC content, with the tumour samples showing lower %GC content than the germline samples. All four samples have over 94% of bases with a base quality (Q) score above 20 on the Phred scale (call accuracy of 99%). Three of the samples also have over 90% of bases with a Q score above 30 (call accuracy of 99.9%); the exception is the Patient 2-II-1 tumour sample, in which 88.6% of bases have a Q score above 30. Pre-processing QC and adapter trimming did not result in any sequences being flagged or trimmed.
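The relationship between a Phred quality score and base call accuracy follows from the standard definition Q = -10 log10(P), where P is the probability that the base call is wrong. The short sketch below shows the conversion in both directions; the function names are illustrative and are not taken from any of the tools used in this chapter.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Base call accuracy (1 - error probability) for a Phred score q."""
    return 1.0 - 10.0 ** (-q / 10.0)


def error_to_phred(p_error: float) -> float:
    """Phred score for a given base call error probability."""
    return -10.0 * math.log10(p_error)


if __name__ == "__main__":
    for q in (20, 30):
        print(f"Q{q}: accuracy = {phred_to_accuracy(q):.4%}")
```

Under this definition Q20 corresponds to one expected error in 100 base calls and Q30 to one in 1,000, which is why the Q30 column in a raw data summary is always the lower of the two percentages.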

Table 4.2: Raw data summary from Macrogen Inc. for Patient 1-II-2 and Patient 2-II-1 germline and tumour samples

Sample ID | Total read bases (base pairs) | Total reads | GC(%) | AT(%) | Q20(%) | Q30(%)
Patient 1-II-2 germline | 7,720,761,786 | 76,443,186 | 48.9 | 51.1 | 98.6 | 96.09
Patient 2-II-1 germline | 6,079,509,766 | 60,193,166 | 48.88 | 51.12 | 98.24 | 95.19
Patient 1-II-2 tumour | 6,711,209,620 | 66,447,620 | 47.13 | 52.87 | 96.5 | 91.47
Patient 2-II-1 tumour | 5,022,136,524 | 49,724,124 | 47.47 | 52.53 | 94.7 | 88.58

Sample ID: sample name. Total read bases: total number of bases sequenced. Total reads: total number of reads. GC(%): GC content. AT(%): AT content. Q20(%): Ratio of reads that have Phred quality score of over 20. Q30(%): Ratio of reads that have Phred quality score of over 30.

4.3.2 Sequence alignment and calling

Summary statistics on the trimmed *.bam files were computed using Samtools flagstat 181 and are presented in Table 4.3. The results show that both germline samples had over 99% of reads mapped, and both tumour samples had over 98% of reads mapped. Both germline samples had almost all of the mapped reads properly paired (> 98.8%). However, the tumour samples had slightly lower proportions of properly paired reads (93.68% for Patient 1-II-2 and 95.86% for Patient 2-II-1).

Table 4.3: Summary statistics generated using Samtools flagstat for Patient 1-II-2 and 2-II-1 germline and tumour samples

Statistic | Patient 1-II-2 germline | Patient 2-II-1 germline | Patient 1-II-2 tumour | Patient 2-II-1 tumour
Total (QC-passed reads + QC-failed reads) | 76,335,490 | 60,113,342 | 63,327,174 | 49,372,458
Duplicates | 0 | 0 | 0 | 0
Mapped (%) | 75,948,149 (99.49%) | 59,803,630 (99.48%) | 62,074,975 (98.02%) | 48,478,673 (98.19%)
Paired in sequencing | 76,335,490 | 60,113,342 | 63,327,174 | 49,372,458
Read 1 | 38,167,745 | 30,056,671 | 31,663,587 | 24,686,229
Read 2 | 38,167,745 | 30,056,671 | 31,663,587 | 24,686,229
Properly paired | 75,457,268 (98.85%) | 59,445,380 (98.89%) | 59,321,774 (93.68%) | 47,327,518 (95.86%)
With itself and mate mapped | 75,833,239 | 59,704,772 | 61,514,992 | 47,943,636
Singletons (%) | 114,910 (0.15%) | 98,858 (0.16%) | 559,983 (0.88%) | 535,037 (1.08%)
Mate mapped to a different chromosome | 174,551 | 105,540 | 109,018 | 39,240
Mate mapped to a different chromosome (mapQ ≥ 5) | 156,396 | 93,541 | 65,912 | 27,128

QC: quality control.

Local realignment was performed using GATK RealignerTargetCreator. For Patient 1-II-2 there were 3,793,051 (2.72%) reads filtered out during the traversal. Of these, 224,019 reads failed the bad mate filter, 3,569,018 reads failed the mapping quality zero filter, and 14 reads failed the unmapped read filter. For Patient 2-II-1 there were 3,060,049 (2.79%) reads filtered out during the traversal. Of these, 121,551 reads failed the bad mate filter, 2,938,469 reads failed the mapping quality zero filter, and 29 reads failed the unmapped read filter. For Patient 1-II-2, no reads were filtered out of 76,335,490 total reads in the germline sample, and no reads were filtered out of 63,327,174 total reads in the tumour sample. For Patient 2-II-1, no reads were filtered out of 60,113,342 total reads in the germline sample, and no reads were filtered out of 49,372,458 total reads in the tumour sample.

4.3.3 BAM quality control

GATK depth of coverage results are presented in Figure 4.3. As expected, the majority of bases were covered at a depth of 100X or less in each sample. Germline samples (blue) for both patients show slightly higher coverage compared to the tumour samples (orange).

Figure 4.3: Genome analysis toolkit depth of coverage summary for Patient 1-II-2 and Patient 2-II-1 germline and tumour DNA. Panels: (a) Patient 1-II-2, (b) Patient 2-II-1. Horizontal axis: depth (>=0 to >=500); vertical axis: number of bases. Series: germline and tumour.

The average insert size for the germline samples of both Patient 1-II-2 and Patient 2-II-1 is approximately 150 base pairs. The tumour samples have slightly smaller insert sizes of approximately 125 base pairs and 140 base pairs for Patient 1-II-2 and Patient 2-II-1, respectively. Figure 4.4 shows histogram plots of the insert size distribution for both patients' germline and tumour samples generated by Picard (Patient 1-II-2: top panels, Patient 2-II-1: bottom panels).

Figure 4.4: Insert size histogram plots generated by Picard for Patient 1-II-2 and Patient 2-II-1 germline and tumour samples. Panels: (a) Patient 1-II-2 germline, (b) Patient 1-II-2 tumour, (c) Patient 2-II-1 germline, (d) Patient 2-II-1 tumour.

High level metrics about the alignment of reads within a *.bam file were produced by the CollectAlignmentSummaryMetrics tool from Picard. All the reads from both patients' germline and tumour samples passed the filter criteria, and the percentage of reads aligned was above 98% for all samples.

4.3.4 Somatic variant calling

4.3.4.1 VarScan2

VarScan2 identified 4,888 somatic variants in Patient 1-II-2, of which 702 were classed as high confidence. Patient 2-II-1 had 2,667 somatic variants with 595 classed as high confidence. The results of the somaticfilter command (to remove possible false positives) for the SNV somatic high confidence files are presented in Table 4.4. Most of the variants that were removed from both patients failed the Reads2 requirement (minimum supporting reads for a variant).

Table 4.4: Results from VarScan2 somaticfilter to remove possible false positives from the high confidence somatic calls for Patient 1-II-2 and Patient 2-II-1

Filter | Patient 1-II-2 | Patient 2-II-1
Total variants in input | 702 | 595
Coverage requirement (10) | 3 | 7
Reads2 requirement (2) | 652 | 459
VarFreq requirement (0.2) | 0 | 0
p-value requirement (1 x 10^-1) | 2 | 14
SNP clusters requirement | 0 | 4
Near INDELs | 0 | 0
Passed | 45 | 111

Reads2: minimum supporting reads for a variant filter. VarFreq: Minimum variant allele frequency filter. SNP: single nucleotide polymorphism. INDEL: insertion or deletion.

Bonferroni adjustment was performed on the p-values from VarScan2 to correct for multiple testing. As Patient 1-II-2 had 66,265,606 positions for comparison, the significance level after Bonferroni correction was α < 7.55 x 10^-10. After correcting for multiple testing, Patient 1-II-2 had 11 statistically significant somatic variants. Patient 2-II-1 had 67,054,165 positions for comparison, therefore the significance level after Bonferroni correction was α < 7.46 x 10^-10. After correcting for multiple testing, Patient 2-II-1 had three statistically significant somatic variants.

4.3.4.2 Validation of somatic variants using Strelka

Of the 11 somatic variants identified by VarScan2 in Patient 1-II-2, ten were also reported as somatic variants by Strelka (Table 4.5). A variant in the CCDC66 gene (position 19:47768072) was reported by VarScan2 but not reported by Strelka. All three somatic variants identified by VarScan2 in Patient 2-II-1 were confirmed by Strelka (Table 4.6). The variants were annotated and cross-referenced against a provisional gene exclusion list, but no variants were removed. 282
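The Bonferroni thresholds quoted above can be reproduced by dividing a family-wise significance level by the number of positions compared. The sketch below assumes a family-wise level of 0.05; that value is an inference from the reported thresholds rather than something stated in this section, and the function name is illustrative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Number of positions compared per patient, as reported in the text.
POSITIONS = {
    "Patient 1-II-2": 66_265_606,
    "Patient 2-II-1": 67_054_165,
}

if __name__ == "__main__":
    for patient, n_positions in POSITIONS.items():
        # alpha = 0.05 is an assumed family-wise level (see lead-in).
        threshold = bonferroni_threshold(0.05, n_positions)
        print(f"{patient}: p < {threshold:.2e}")
```

With roughly 66 to 67 million tests per patient, only variants with p-values on the order of 10^-10 or smaller remain significant, which explains why a few thousand raw somatic calls reduce to 11 and three variants respectively.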

Table 4.5: Somatic variants identified by VarScan2 and Strelka for Patient 1-II-2

Chr:Pos | Gene | p-value | Function | SIFT | PolyPhen-2 | RegulomeDB | GERP | Ref | Alt | Read depth (Ref, alt)
chr14:23397376 | PRMT5 | 1.70 x 10^-26 | NS | D | B | . | 4.64 | G | A | 93,76
chr9:95237133 | ASPN | 5.55 x 10^-22 | NS | T | D | . | 5.12 | T | A | 60,51
chr6:129704330 | LAMA2 | 2.86 x 10^-18 | NS | T | D | 6 | 5.74 | G | A | 67,49
chr4:106156358 | TET2 | 1.09 x 10^-16 | NS | T | B | 5 | 2.32 | C | T | 61,55
chr18:34322702 | FHOD3 | 9.73 x 10^-16 | NS | T | D | 6 | 4.17 | A | T | 54,54
chr19:19607014 | GATAD2A | 3.78 x 10^-14 | NS | T | D | 5 | 5.65 | G | A | 9,17
chr14:105212646 | ADSSL1 | 3.05 x 10^-12 | S | . | . | 2b | . | C | T | 41,34
chr3:49042587 | P4HTM | 5.26 x 10^-12 | NS | . | B | 2b | -4.39 | A | C | 54,26
chr11:65000621 | SLC22A20,POLA2 | 4.24 x 10^-11 | IG | . | . | 5 | . | G | T | 21,15
chr9:133760443 | ABL1 | 1.03 x 10^-10 | S | . | . | 5 | . | G | C | 53,32

Chr:pos: Chromosome:position. p-value: Fisher's p-value. SIFT: Sorting Intolerant From Tolerant. PolyPhen-2: Polymorphism Phenotyping-2. GERP: Genomic Evolutionary Rate Profiling score (a positive GERP score represents a substitution deficit, while a negative GERP score represents a substitution surplus). Ref: reference allele. Alt: alternate allele. NS: nonsynonymous. S: synonymous. IG: intergenic. D: deleterious. B: benign. T: tolerated.

Table 4.6: Somatic variants identified by VarScan2 and Strelka for Patient 2-II-1

| Chr:Pos | Gene | p-value | Function | SIFT | PolyPhen-2 | RegulomeDB | GERP | Ref | Alt | Read depth (Ref, alt) |
|---|---|---|---|---|---|---|---|---|---|---|
| chr8:57307625 | SDR16C6P, PENK | 9.39 × 10^-17 | Intergenic | . | . | . | . | C | T | 44,32 |
| chr5:1244883 | SLC6A18 | 1.33 × 10^-12 | Splicing | . | . | 5 | 3.29 | G | A | 10,12 |
| chr5:57754947 | PLK2 | 4.34 × 10^-11 | Intronic | . | . | 4 | . | T | C | 31,23 |

Chr:Pos: Chromosome:position. p-value: Fisher's p-value. SIFT: Sorting Intolerant From Tolerant. PolyPhen-2: Polymorphism Phenotyping-2. GERP: Genomic Evolutionary Rate Profiling score (a positive GERP score represents a substitution deficit, while a negative GERP score represents a substitution surplus). Ref: reference allele. Alt: alternate allele. D: deleterious. B: benign.

4.3.4.3 Evidence further supporting somatic risk genes

Table 4.7 summarises several in silico resources for the somatic risk variants, and the genes in which they arise, for both patients. None of the somatic risk variants were reported in COSMIC. However, all but one of the genes (SDR16C6P) were reported to have mutations in the COSMIC database. Two genes (TET2 and ABL1) were listed in the COSMIC cancer gene census. The ABL1 and PENK genes were reported in the PubMeth database. None of the other genes were reported in PubMeth, which suggests there is currently no evidence of methylation of these genes in cancer. Evidence from NCBI suggests that ten genes have functions that support involvement in cancer. Of these ten genes, six (PRMT5, LAMA2, TET2, FHOD3, ABL1 and PLK2) had Gene References into Functions (GeneRIF) evidence for involvement in cancer pathogenesis. Two genes (POLA2 and PENK) had GeneRIF indicating that they might be biomarkers for cancer, and two genes (ASPN and P4HTM) had GeneRIF suggesting that they may be targets for cancer therapeutics.

The PubMed searches for the candidate somatic risk genes are summarised in Table 4.8. The searches revealed previously published associations with cancer for all genes except ADSSL1 and SDR16C6P. Therefore, there is no evidence supporting the involvement of the ADSSL1 and SDR16C6P genes in cancer pathogenesis at this time.

Table 4.7: Summary of findings from in silico resources investigating the role of somatic risk variants and the genes in which they arise in cancer pathogenesis

| Gene | Genomic location | Variant in COSMIC | No. mutations in COSMIC | Cancer gene census | SuperPath | GO molecular function | Methylation | GeneRIF |
|---|---|---|---|---|---|---|---|---|
| PRMT5 | chr14:23397376 | No | 95 | No | Regulation of TP53 activity; Chromatin organisation; Gene expression; Transport of the SLBP independent mature mRNA; RNA transport | Core promoter sequence-specific DNA binding; Transcription corepressor activity; Protein binding; Methyltransferase activity; Methyl-CpG binding | No | Colorectal cancer pathogenesis; Acute myeloid leukemia growth; Marker of poor prognosis in nasopharyngeal carcinoma; Marker for early colorectal carcinomas |
| ASPN | chr9:95237133 | No | 96 | No | ECM proteoglycans; Degradation of the extracellular matrix | Protein kinase inhibitor activity; Calcium ion binding; Collagen binding | No | Role in gastric cancer; Therapeutic target molecule |
| LAMA2 | chr6:129704330 | No | 854 | No | Integrin pathway; ERK signalling; Arrhythmogenic right ventricular cardiomyopathy; Dilated cardiomyopathy; Focal adhesion | Receptor binding; Structural molecule activity | No | Mutations in hepatocellular carcinoma patients |
| TET2 | chr4:106156358 | No | 2,726 | Yes | Activated PKN1 stimulates transcription of androgen receptor regulated genes; Chromatin regulation / Acetylation; Gene expression | Sulfonate dioxygenase activity; DNA binding; Protein binding; Ferrous iron binding; Zinc ion binding | No | Involved in leukemogenesis; Oncogenic role in myeloid tumour |
| FHOD3 | chr18:34322702 | No | 393 | No | . | Actin binding; Protein binding | No | Glioma linear migration; Associated with acute lymphoblastic leukemia; Promotes invasive migration and local invasion in vivo |
| GATAD2A | chr19:19607014 | No | 95 | No | Activated PKN1 stimulates transcription of androgen receptor regulated genes; Chromatin organisation; Gene expression; Regulation of TP53 activity | Contributes to RNA polymerase II regulatory region sequence-specific DNA binding; Transcription factor activity, sequence-specific DNA binding; Protein binding; Zinc ion binding; Protein binding, bridging | No | . |
| ADSSL1 | chr14:105212646 | No | 114 | No | Purine metabolism; Purine nucleotides de novo biosynthesis; Metabolism; Purine metabolism; Alanine, aspartate and glutamate metabolism | Magnesium ion binding; GTPase activity; Adenylosuccinate synthase activity; GTP binding; Ligase activity | No | . |
| P4HTM | chr3:49042587 | No | 75 | No | . | Iron ion binding; Calcium ion binding; Oxidoreductase activity | No | May aid the design of novel therapies for inhibiting bone tumours |
| SLC22A20 | chr11:65000621 | No | 83 | No | . | Inorganic anion exchanger activity; Sodium-independent organic anion transmembrane transporter activity | No | . |
| POLA2 | chr11:65000621 | No | 108 | No | Telomere C-strand synthesis; E2F mediated regulation of DNA replication; Regulation of activated PAK-2p34 by proteasome mediated degradation; Purine metabolism; Cell cycle, Mitotic | DNA binding; DNA-directed DNA polymerase activity; Protein heterodimerisation activity | No | Prognostic biomarker in non small cell lung cancer pathogenesis |
| ABL1 | chr9:133760443 | No | 1,684 | Yes | DNA double-strand break repair; Development Slit-Robo signalling; Regulation of actin dynamics for phagocytic cup formation; Cell cycle; ErbB signalling pathway | Magnesium ion binding; DNA binding; Actin monomer binding; Nicotinate-nucleotide adenylyltransferase activity; Protein kinase activity | Yes | BCR/ABL oncogene in leukaemia; Promote breast cancer osteolytic metastasis; Progression of gastric cancer |
| SDR16C6P | chr8:57307625 | No | . | No | . | . | No | . |
| PENK | chr8:57307625 | No | 163 | No | Apoptotic pathways in synovial fibroblasts; GPCR pathway; ERK signalling; Nanog in Mammalian ESC Pluripotency; CREB Pathway | Opioid peptide activity; Neuropeptide hormone activity; Opioid receptor binding | Yes | Promoter methylation associated with colorectal adenocarcinoma diagnosis |
| SLC6A18 | chr5:1244883 | No | 186 | No | Transport of glucose and other sugars, bile salts and organic acids, metal ions and amine compounds; Amino acid transport across the plasma membrane | Neurotransmitter:sodium symporter activity; Amino acid transmembrane transporter activity; Symporter activity | No | . |
| PLK2 | chr5:57754947 | No | 153 | No | FoxO signalling pathway; Gene expression; TP53 Regulates transcription of cell cycle genes; DNA damage; Regulation of TP53 activity | Nucleotide binding; Protein kinase activity; Protein serine/threonine kinase activity; Signal transducer activity; Protein binding | No | Promoting tumour progression; Increases cell proliferation and decreases apoptosis in gastric cancer cells |

Genomic location: chromosome:position. COSMIC: Catalogue of Somatic Mutations in Cancer database [134]. No. mutations in COSMIC: the number of mutations reported in the gene in the COSMIC database. Cancer gene census: is the gene reported in the cancer gene census in COSMIC? The cancer gene census is a catalogue of genes for which mutations have been causally implicated in cancer. SuperPath: from Pathcards, an integrated database of human pathways and their annotations (http://pathcards.genecards.org/). Human pathways were clustered into SuperPaths based on gene content similarity. GO molecular function: Gene Ontology molecular function [294]. Methylation: is the gene reported to be methylated in cancer by PubMeth? (http://www.pubmeth.org) [295]. GeneRIF: Gene References Into Functions from the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/). Are any GeneRIF associated with cancer reported for the gene? SLBP: stem-loop binding protein. ECM: extracellular matrix. ERK: extracellular signal-regulated kinase. GTP: guanosine triphosphate. E2F: E2 factor. ErbB: erythroblastosis oncogene B. GPCR: G-protein-coupled receptors. ESC: embryonic stem cells. CREB: cAMP response element-binding protein.

Table 4.8: Summary of search results from PubMed for genes in which somatic variants were identified

| Gene | No. of publications | Role of gene | Selected references |
|---|---|---|---|
| PRMT5 | 156 | PRMT5 is a regulator of homologous recombination-mediated double-strand break repair. PRMT5 methyltransferase activity is necessary for tumour cell proliferation and plays an important role in cancer progression by repressing the expression of key tumour suppressor genes. Mutations in PRMT5 associated with gastric cancer, oropharyngeal squamous cell carcinoma, hepatocellular carcinoma, prostate cancer, lung adenocarcinoma, lung squamous cell carcinoma, endometrial carcinoma and breast carcinoma. | † |
| ASPN | 33 | ASPN is a secreted small leucine rich proteoglycan with known roles in ligament regulation and chondrogenesis. It is a potential mediator of metastatic progression found within the tumour microenvironment. ASPN has been shown to play a role in breast cancer, scirrhous gastric cancer, pancreatic cancer and prostate cancer. | † |
| LAMA2 | 22 | LAMA2 is functionally involved in the formation of extracellular matrix and is found to be upregulated in metastatic renal cell carcinoma and during serum-induced differentiation of glioma initiating cells. Downregulation of LAMA2 reported in oesophageal cancer, extracellular matrix in a drug-resistant ovarian cancer cell line, hepatocellular carcinoma and laryngeal cancer. Abnormal methylation reported in breast carcinoma and colorectal cancer. LAMA2 is a candidate marker and indicator of poor prognosis for posterior fossa subgroup A ependymal tumours. | 412–424 |
| TET2 | 664 | TET2 is an epigenetic regulator which is frequently mutated or inactivated in cancer, and it has been suggested that the TET proteins may protect against abnormal DNA methylation at promoters. TET2 mutations frequently observed in myeloid, lymphoid and hematological malignancies. | 425–430 |
| FHOD3 | 11 | FHOD3 is involved in cancer cell migration and invasion via regulation of dynamic actin spike assembly in cells invading in vitro and in vivo. FHOD3 plays a role in glioma linear migration motility. FHOD3 was hypomethylated, overexpressed and involved in major deletions and may play a role in thyroid cancer. FHOD3 mutations in leukaemia associated with methotrexate polyglutamates accumulation. | 431–435 |
| GATAD2A | 5 | GATAD2A is a subunit of the nucleosome remodeling and histone deacetylase complex, a chromatin-level regulator of transcription with a number of important and emerging roles in cancer biology. Knockdown of GATAD2A decreased cell proliferation and colony formation and promoted cell apoptosis in thyroid cancer cells. A variant in GATAD2A associated with susceptibility to three cancers (breast, ovarian and prostate). | 436–440 |
| ADSSL1 | 0 | . | . |
| P4HTM | 1 | P4HTM found to be hypermethylated in rhabdomyosarcoma. P4HTM silencing by promoter DNA methylation is a potential mechanism for HIF-1α stabilisation in rhabdomyosarcomas. | 441 |
| SLC22A20 | 3 | SLC22A20 (OAT6) acts as an uptake carrier of sorafenib. SLC22A20 is differentially methylated in hepatocellular carcinoma. | 321, 442, 443 |
| POLA2 | 8 | POLA2 has been reported to be involved in cell proliferation by mediating DNA replication, recombination, and repair. A variant in POLA2 improves differential survivability and mortality in non-small cell lung cancer patients and could be used as a prognostic biomarker. The knockdown of POLA2 increases gemcitabine resistance in human lung cancer cells. Low mRNA expression of POLA2 was prognostic of poor outcome in ovarian carcinomas. POLA2-CDC42EP2 read-through fusion transcript identified in gastrointestinal stromal tumours. POLA2 found to be overexpressed in mesothelioma. | 322, 444–449 |
| ABL1 | 1,368 | The product of the ABL1 gene is a tyrosine kinase which plays a role in cellular growth control and response to DNA damage. The BCR-ABL1 (Philadelphia chromosome) gene fusion is responsible for > 95% of chronic myeloid leukemia. Mutations in the BCR-ABL1 gene have been found to be a major cause of disease progression and resistance to tyrosine kinase inhibitors in chronic myeloid leukemia patients. Methylation of the proximal promoter of the ABL1 oncogene is a common epigenetic alteration associated with clinical progression of chronic myeloid leukemia. ABL1 first identified as an oncogene in leukaemia but mutations also reported in lung cancer. | 450–453 |
| SDR16C6P | 0 | . | . |
| PENK | 99 | PENK is a candidate tumour suppressor gene that is hypermethylated in various cancers. PENK is also a potential biomarker for prostate, colorectal and bladder cancer. Hypermethylation of PENK contributes to cell motility and adhesion. | 454–462 |
| SLC6A18 | 1 | Gain of 5p15.33 (harbouring the SLC6A18 gene) reported in non-small cell lung cancer cases. | 463 |
| PLK2 | 99 | PLK2 plays a critical role in cell cycle and response to DNA damage. PLK2 plays a tumour suppressor role in cervical cancer, ovarian cancer, gastric cancer and hematopoietic diseases. PLK2 is involved in paclitaxel resistance in solid tumours. PLK2 phosphorylates TAp73 resulting in inhibited cell proliferation, increased apoptosis, G1 phase arrest, and decreased cell invasion. Protein kinases represent the most effective class of therapeutic targets in cancer. | 464–474 |

† Listed jointly for PRMT5 and ASPN: 400–404, 404, 405, 406–411.

PubMed search was performed using the string ("gene name") AND (cancer OR malignancy OR tumor* OR tumour* OR sarcoma) in April 2017. Abstracts were screened for relevance to the current study.
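The per-gene PubMed search described in the table footnote can be generated programmatically. The following is a minimal sketch that only builds the query strings; it does not contact PubMed. The function name is illustrative, and quoting the gene name is an assumption, because the quotation marks in the original search string are not recoverable from the text.

```python
# Build the PubMed search string used for each candidate gene.
# ASSUMPTION: the gene name is wrapped in double quotes; the cancer-term
# clause is copied from the search string given in the Table 4.8 footnote.
CANCER_TERMS = "(cancer OR malignancy OR tumor* OR tumour* OR sarcoma)"


def build_pubmed_query(gene: str, terms: str = CANCER_TERMS) -> str:
    """Return the boolean query for one gene symbol."""
    return f'("{gene}") AND {terms}'


# Genes in which somatic variants were identified (Tables 4.5 and 4.6).
genes = [
    "PRMT5", "ASPN", "LAMA2", "TET2", "FHOD3", "GATAD2A", "ADSSL1", "P4HTM",
    "SLC22A20", "POLA2", "ABL1", "SDR16C6P", "PENK", "SLC6A18", "PLK2",
]
queries = {gene: build_pubmed_query(gene) for gene in genes}

for gene, query in queries.items():
    print(f"{gene}: {query}")
```

Generating the strings from a single template keeps the search terms identical across genes, so differences in publication counts reflect the literature and not variation in the query.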

4.3.4.4 Drug sensitivity

Two of the genes of interest (TET2 and LAMA2) were reported in the Genomics of Drug Sensitivity in Cancer database [391]. The TET2 gene showed a statistically significant association (p-value < 10^-3) with VNLG/124 and Bexarotene in pan-cancer analysis (drug sensitivity for cell lines from all cancer types with genomic features identified from the analysis of patient tumours across multiple different cancer types) [391]. While the LAMA2 gene was reported in the Genomics of Drug Sensitivity in Cancer database, none of its associations reached statistical significance.

VNLG/124 is a novel mutual prodrug of all-trans-retinoic-acid (ATRA) and histone deacetylase inhibitors (HDIs) [475]. TET2 mutations determine sensitivity to ATRA. ATRA has previously been shown to induce the interaction and chromatin recruitment of a novel RARβ-TET2 complex to epigenetically activate a specific cohort of target genes [476]. Wu et al. (2017) reported a novel RARβ-TET2-miR-200c-PKCζ signalling pathway that directs cancer cell state changes and may have therapeutic implications [476]. Bexarotene is a selective retinoid X receptor (RXR) agonist with properties overlapping those of ATRA [477]. Bexarotene blocks cell cycle progression, induces apoptosis and differentiation, prevents multidrug resistance, and inhibits angiogenesis and metastasis [477]. Therefore, it is a promising chemopreventive agent against cancer [477].

None of the remaining genes harbouring significant somatic mutations were listed in the Genomics of Drug Sensitivity in Cancer database as of April 2017. However, as the understanding of genes and pathways that are causally implicated in cancer grows, more therapeutics are likely to be added to the database.

4.3.5 Loss of heterozygosity variants

A total of 2,075 LOH variants were identified in Patient 1-II-2 using VarScan2. Of these, 507 were high confidence. After correcting for multiple testing and removing variants in polygenic regions, 18 LOH variants were statistically significant (Table 4.9). Of these LOH variants, 16 were located on chromosome 16, and the remaining two were located on chromosome 19.

There were 1,344 LOH variants identified in Patient 2-II-1 using VarScan2, with 785 categorised as high confidence. After correcting for multiple testing, no LOH variants reached statistical significance for Patient 2-II-1.

4.3.6 Copy number analysis

The results of DNAcopy were visualised as SCNA graphs per chromosome for Patient 1-II-2 (Appendix I) and Patient 2-II-1 (Appendix J). The SCNA graphs for Patient 1-II-2 show a considerable disruption on chromosome 16. The SCNA graphs for Patient 2-II-1 do not show any large regions of disruption.

Table 4.9: Statistically significant high confidence loss of heterozygosity variants for Patient 1-II-2

| Chr:Pos | Gene | Somatic p-value | Function | SIFT | PolyPhen-2 | RegulomeDB | GERP | Ref | Alt | Read depth (Ref, alt) | MAF 1000G |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chr16:87678441 | JPH3 | 8.11 × 10^-19 | S | . | . | 5 | . | C | T | 132,18 | 0.13 |
| chr16:57503213 | POLR2C | 1.22 × 10^-18 | Int | . | . | . | . | C | T | 11,77 | 0.52 |
| chr16:84208335 | DNAAF1 | 2.55 × 10^-15 | Int | . | . | 1f | . | G | T | 6,96 | 0.33 |
| chr16:89300014 | ZNF778, ANKRD11 | 1.73 × 10^-14 | IG | T | P | 3a | -2 | C | T | 9,132 | 0.009 |
| chr16:84691044 | KLHL36 | 4.31 × 10^-13 | S | . | . | 1f | . | C | T | 44,0 | 0.34 |
| chr16:71319539 | CMTR2 | 4.92 × 10^-12 | S | . | . | . | . | C | T | 10,57 | 0.68 |
| chr19:33444588 | CEP89 | 7.78 × 10^-12 | NS | T | P | 2b | 2.25 | T | G | 52,0 | . |
| chr16:68598007 | ZFP90 | 1.63 × 10^-11 | S | . | . | 3a | . | A | G | 82,10 | 0.57 |
| chr16:81929488 | PLCG2 | 1.99 × 10^-11 | S | . | . | 3a | . | C | T | 82,7 | 0.35 |
| chr16:53326860 | CHD9 | 3.39 × 10^-11 | S | . | . | . | . | G | A | 88,10 | 0.30 |
| chr16:88884466 | GALNS | 5.52 × 10^-11 | S | . | . | 4 | . | C | T | 1,39 | 0.38 |
| chr16:87678144 | JPH3 | 7.31 × 10^-11 | S | . | . | . | . | T | C | 6,61 | 0.50 |
| chr19:33444576 | CEP89 | 1.93 × 10^-10 | NS | T | B | 4 | -6.38 | C | T | 34,0 | . |
| chr16:53513055 | RBL2 | 2.44 × 10^-10 | Int | . | . | 6 | . | T | C | 3,60 | 0.46 |
| chr16:69973297 | WWP2 | 2.60 × 10^-10 | S | . | . | 4 | . | C | T | 84,11 | 0.03 |
| chr16:55562466 | LPCAT2 | 4.82 × 10^-10 | NS | T | B | 6 | -3.8 | G | A | 68,2 | 0.65 |
| chr16:67316600 | PLEKHG4 | 5.82 × 10^-10 | Int | . | . | 1f | . | G | A | 7,63 | 0.44 |
| chr16:89805261 | ZNF276 | 7.18 × 10^-10 | UTR3 | . | . | 4 | . | A | G | 70,3 | 0.61 |

Chr:Pos: Chromosome:position. SIFT: Sorting Intolerant From Tolerant score. PolyPhen-2: Polymorphism Phenotyping-2. GERP: Genomic Evolutionary Rate Profiling score (a positive GERP score represents a substitution deficit, while a negative GERP score represents a substitution surplus). Ref: reference allele. Alt: alternate allele. MAF 1000G: Minor Allele Frequency in 1000 Genomes Project. S: synonymous. NS: nonsynonymous. Int: intronic. IG: intergenic. UTR3: 3′ untranslated region. D: deleterious. T: tolerated. B: benign. P: possibly damaging.

4.4 Discussion

In summary, ten somatic variants in Patient 1-II-2 and three somatic variants in Patient 2-II-1 were identified by VarScan2 and confirmed by Strelka. VarScan2 also identified a large region of LOH on chromosome 16 in Patient 1-II-2. This LOH region was supported by the SCNA results, which also indicated a region of SCNA on chromosome 16. Two of the genes harbouring somatic mutations were listed in the Genomics of Drug Sensitivity in Cancer database, indicating the potential clinical utility of these findings. Of the 13 genes in which somatic mutations were identified, 11 genes have been previously associated with cancer in published literature.

4.4.1 Comparison of results in the context of published literature on myxoid liposarcoma genetics

The majority of myxoid liposarcomas are characterised by the presence of the reciprocal chromosomal translocation t(12;16)(q13;p11). This translocation creates the FUS-DDIT3 chimeric gene [64]. A smaller fraction of myxoid liposarcoma cases harbour a similar variant translocation and gene fusion, t(12;22)(q13;q12), which fuses the EWSR1 gene to the DDIT3 gene [478]. It is likely that these translocations are the primary genetic event essential for tumour formation [479]. However, in solid tumours, single base substitutions outnumber chromosomal translocations by at least one order of magnitude [16]. Therefore, it is possible that sarcomas with fusion gene drivers may also harbour other driver gene mutations [479]. Myxoid liposarcoma can contain several additional molecular genetic alterations, including TP53, PIK3CA, and TERT mutations, which directly influence tumour cell biology and may be involved in round cell transformation, migration capacity, and differential response to drugs [480–485]. Alterations of the TP53 pathway have also been described in myxoid liposarcoma [480, 486, 487].

One study has previously performed a matched tumour and germline analysis on myxoid liposarcoma tumours. Joseph et al. (2014) performed WES on eight fresh frozen surgically resected myxoid liposarcomas and matched blood samples [488]. A median of 10.8 (range 3–15) somatic mutations per tumour was reported, consistent with the findings of this study (ten somatic mutations reported in Patient 1-II-2 and three in Patient 2-II-1). One somatic variant was reported by Joseph et al. in the FHOD3 gene (g.chr18:32552101G>T) [488]. However, this is a different FHOD3 variant from the one reported in this study.

A PubMed search was performed (May 2017) using the string ("gene name") AND ("myxoid liposarcoma") for each of the genes in which somatic mutations were identified. No results were returned for any gene except ABL1. It has previously been suggested that ABL1 may play a role in pre- and post-transcriptional regulatory networks that contribute to sensitivity to trabectedin treatment in myxoid liposarcoma patients [489]. The other genes in which somatic mutations were identified in Patient 1-II-2 and Patient 2-II-1 have not been previously reported in myxoid liposarcomas.

A cluster of 16 LOH variants on chromosome 16q was also identified in Patient 1-II-2. The SCNA plots for Patient 1-II-2 also highlight a region of SCNA on chromosome 16, which suggests that this may be the site of a significant genomic disruption in this patient. Of the 1,015 genes in this chromosomal region, 66 genes have previously been associated with cancer. Patient 1-II-2 had an LOH mutation in one of the cancer genes located in the region of LOH on chromosome 16, RBL2, at position chr16:53513055 (rs8049033). The minor allele frequency (MAF) in the 1000 Genomes Project European population is 0.4602. Therefore, this is a common variant in the general population. The RegulomeDB score for rs8049033 is 6, which indicates there is minimal binding evidence at this position. As this is an intronic variant, the effect of LOH at this position on the phenotype is unknown. All other patients in family 1 are also heterozygous at this position, except Patient 1-I-2 (unaffected), who is homozygous for the reference allele (Figure 4.5). Patient 1-III-1 (Ewing's sarcoma) is also heterozygous at this position in the germline DNA. However, without a tumour sample for Patient 1-III-1, it is not possible to determine whether this position also becomes homozygous for the alternate allele in the tumour DNA.

[Pedigree of family 1: individuals 1-I-1, 1-I-2, 1-II-1, 1-II-2, 1-II-3, 1-III-1 and 1-III-2. Key: affected male, affected female, unaffected male, unaffected female, proband; * patient selected for tumour-normal analysis. Patients 1-II-2 and 1-III-1 are labelled "Sarcoma".]

Genotype at position in RBL2 gene:

| Patient | Sample | Genotype | Read depth (Ref, alt) |
|---|---|---|---|
| 1-I-1 | germline | T/C | 52,71 |
| 1-I-2 | germline | T/T | 115,0 |
| 1-II-1 | germline | T/C | 42,47 |
| 1-II-2 | germline | T/C | 77,60 |
| 1-II-2 | tumour | C/C | 3,60 |
| 1-II-3 | germline | T/C | 70,46 |
| 1-III-1 | germline | T/C | 55,39 |
| 1-III-2 | germline | T/C | 34,55 |

Figure 4.5: Pedigree of family 1 indicating genotypes for each patient at chr16:53513055 (rs8049033) in the RBL2 gene
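The read depths in Figure 4.5 illustrate how a tumour-normal comparison flags LOH: the germline sample of Patient 1-II-2 is balanced (77 reference, 60 alternate reads), whereas the tumour is almost entirely alternate (3 reference, 60 alternate reads). The sketch below applies a generic one-sided Fisher's exact test to these counts using only the Python standard library. It is an illustration of the principle, not VarScan2's implementation; the function name and the choice of a one-sided test are assumptions, so the result is not expected to match the somatic p-value in Table 4.9 exactly.

```python
from math import comb


def fisher_one_sided(normal_ref: int, normal_alt: int,
                     tumour_ref: int, tumour_alt: int) -> float:
    """One-sided Fisher's exact test on a 2x2 table of read counts.

    Returns the probability, with all table margins fixed, of observing at
    least `tumour_alt` alternate reads in the tumour sample (hypergeometric
    upper tail).
    """
    n_tumour = tumour_ref + tumour_alt
    total_alt = normal_alt + tumour_alt
    total_ref = normal_ref + tumour_ref
    total = total_ref + total_alt

    denominator = comb(total, n_tumour)
    max_alt = min(n_tumour, total_alt)
    numerator = sum(
        comb(total_alt, k) * comb(total_ref, n_tumour - k)
        for k in range(tumour_alt, max_alt + 1)
    )
    return numerator / denominator


# Read depths (ref, alt) for Patient 1-II-2 at chr16:53513055 from Figure 4.5.
p_rbl2 = fisher_one_sided(normal_ref=77, normal_alt=60, tumour_ref=3, tumour_alt=60)
print(f"one-sided Fisher p-value: {p_rbl2:.3g}")
```

A very small p-value here reflects the shift from a heterozygous germline genotype to a near-homozygous tumour genotype, which is the pattern reported as LOH at this position.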

The Retinoblastoma-Like 2 (RBL2) gene, also known as RB2 or p130, is a tumour suppressor gene that has been implicated in endometrial cancer [490–495], intraocular melanoma [496, 497], lung cancer [498–506], nasopharyngeal cancer [507, 508], neuroblastoma [509–511] and retinoblastoma [512–527]. The Retinoblastoma (Rb) protein family plays an important role in regulating other cellular processes, such as terminal differentiation and senescence [528]. Previous studies have also shown that Rb proteins are differentially regulated during adipogenic differentiation of pre-adipocyte cell lines [529, 530], suggesting that an absence of RB1 or RB2 may promote adipogenesis [531]. Human bone marrow-derived mesenchymal stromal cells (hMSCs) are multipotent cells that, under defined conditions, can differentiate into multiple connective tissue cell types, such as adipocytes, osteoblasts, chondrocytes, and myoblasts [532]. Differentiation of hMSCs into different lineages involves complex regulation and transcriptional activation or repression of a vast number of genes, and disruption of this regulation can have severe pathological consequences, such as cancer development [533, 534].

A second cancer gene of interest in the region of LOH on chromosome 16 in Patient 1-II-2 is the fused in sarcoma (FUS) gene, although no significant variants were reported by VarScan2 in this gene. The FUS gene is involved in the specific translocation of myxoid liposarcomas, t(12;16)(q13;p11) [535]. This translocation fuses exon 5, 7, or 8 of the FUS gene with exon 2 of the DDIT3 gene. The FUS gene, also known as translocated in liposarcomas (TLS), is involved in pre-messenger ribonucleic acid (mRNA) splicing and the export of fully processed mRNA to the cytoplasm [536]. The FUS protein belongs to the FET family of RNA-binding proteins (consisting of FUS, EWS and TAF15), which have been implicated in cellular processes that include regulation of gene expression, maintenance of genomic integrity and mRNA/microRNA processing [537]. FET genes are directly involved in deleterious genomic rearrangements, primarily in sarcomas and leukaemia [538].

Given that Patient 1-II-2 was diagnosed with a myxoid liposarcoma (a tumour derived from primitive cells that undergo adipose differentiation), the region identified on chromosome 16 may be significant. Chromosome 16 shows a vast region of LOH that encompasses both the RBL2 and FUS genes, as well as 64 other known cancer genes and numerous SCNA events that may contribute towards tumour pathogenesis in this patient.

4.4.2 Strengths

A strength of the current study is the confirmation of statistically significant somatic variants using a second, independent variant caller. Many cancer sequencing studies have relied on a single calling pipeline to generate candidates. However, consensus between different callers is imperfect; therefore, the results from a single caller should not be over-interpreted [539]. Each calling algorithm has different weaknesses, and VarScan2 has a tendency to return a very high total number of reported calls, which indicates a low specificity [540]. Using more than one algorithm with different biases may reduce the number of false positives [539]. Therefore, statistically significant somatic variants called by VarScan2 were validated using a second somatic variant caller, Strelka. Of the 14 statistically significant somatic variants called by VarScan2 (11 in Patient 1-II-2 and three in Patient 2-II-1), 13 were also called by Strelka (93%). As these somatic variants have been called by two independent callers, it is less likely that they are false positive results. Nevertheless, these variants should be validated using Sanger sequencing, which was beyond the scope of the current project.

4.4.3 Limitations

The analysis of matched tumour and germline data has several unique challenges, including accounting for heterogeneity from subclonal variation and sample impurity [148, 541, 542]. The nature of cancer tissue makes somatic variant calling a challenging task [540]. The tumour DNA for this analysis was extracted from FFPE samples collected more than ten years earlier. It is hard to determine the tumour purity and heterogeneity from DNA extracted in this manner, as it is impossible to verify whether the block contained a mixture of tumour and adjacent normal tissue, or whether the tumour contained heterogeneous cell populations. Therefore, the tumour purity and effects of heterogeneity could not be taken into account in this analysis but should be considered in future studies.

Other issues that arise from using FFPE samples for WES are artefacts such as fragmentation and artificial base alterations [543–547]. FFPE samples can be a good resource for discovery of biomarkers in cancer using WES, but fresh frozen tissue is preferred as it minimises the damage to nucleotides [543]. There are also sources of error from mapping and sequencing processes. In general, data generated on the Illumina platform have increased error rates at the end of reads, a tendency towards transversion base call errors, a low INDEL error rate, and systematic sequence-specific errors following inverted repeat sequences and GG motifs [548–550].

The matched tumour and germline comparisons were only performed on two of the five sarcoma cases from the three cancer cluster families described in Chapter 2. A clearer picture of the full somatic mutation burden in the three cancer cluster families could be achieved by performing a matched tumour and germline analysis for all sarcomas and other cancers in these families. However, tumour DNA was not available for these patients.

4.4.4 Summary

In summary, 13 novel somatic mutations were identified in two myxoid liposarcoma patients. Two of the genes in which somatic mutations were identified (FHOD3 and ABL1) have been previously associated with myxoid liposarcoma in the literature. A large region of LOH and SCNA on chromosome 16q that includes the genes FUS and RBL2 was reported in Patient 1-II-2, which suggests that this chromosomal region may contribute towards tumour pathogenesis in this patient. The genes in which somatic and LOH variants were identified are candidates for further investigation. Independent experimental validation should be performed to screen additional myxoid liposarcomas for variants in these candidate genes. Further functional studies could be carried out to determine the role of these variants or genes in myxoid liposarcoma pathogenesis. Due to time and budget limitations, these types of functional studies are beyond the scope of this thesis.

Chapter 5

Aim 4: Variant burden analyses at candidate risk loci in sarcoma cases and healthy ageing controls

5.1 Introduction

In genetic studies of complex human disease, like cancers, the validation of candidate risk variants is an important and often rate-limiting step [551–553]. Existing single variant association tests are underpowered for validating rare risk variants unless sample or effect sizes are large [554, 555]. A more robust approach involves combining information across variants in a target region, such as a gene [556]. Burden tests use methods that combine rare and common variants across a gene/target region and compare an aggregate statistic between cases and controls [272, 557, 558]. A simple approach is to summarise the genotype information by counting the number of minor alleles across all variants in the target region [556]. In this chapter, the candidate risk loci identified in Chapter 3 and Chapter 4 will be assessed by a case-control variant burden analysis to evaluate the full mutational burden of these regions.
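The simple burden summary described above, counting minor alleles across all variants in a target region, can be sketched as follows. The genotype matrices are hypothetical toy data chosen only to show the calculation; they are not taken from the ISKS or MGRB cohorts, and the function name is illustrative. A real analysis would follow this aggregation with a formal test of the difference between cases and controls.

```python
from statistics import mean


def burden_scores(genotypes: list[list[int]]) -> list[int]:
    """Sum minor-allele counts (0, 1 or 2 per variant) for each individual."""
    return [sum(individual) for individual in genotypes]


# HYPOTHETICAL genotypes: one row per individual, one column per variant
# in the target region; each entry is the number of minor alleles carried.
cases = [
    [0, 1, 0, 2],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
controls = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

case_burden = burden_scores(cases)
control_burden = burden_scores(controls)

# Aggregate statistic compared between the two groups.
print(f"mean burden in cases:    {mean(case_burden):.2f}")
print(f"mean burden in controls: {mean(control_burden):.2f}")
```

Collapsing the variants into one score per individual is what gives burden tests their power for rare variants, at the cost of assuming that the variants in the region act in the same direction.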

5.1.1 Variant burden analyses in sarcoma cohorts

A case-control rare variant burden analysis has previously been performed using sarcoma cases from the International Sarcoma Kindred Study (ISKS) cohort [178]. Targeted exon sequencing was performed on 72 genes associated with increased cancer risk in 1,162 sarcoma cases (including 966 from the ISKS) and 6,545 Caucasian controls. Ballinger et al. found an excess of pathogenic germline variants (combined odds ratio (OR) = 1.43, 95% confidence interval = 1.24–1.64, p-value < 0.0001), with approximately half of the sarcoma cases found to have putatively pathogenic monogenic and polygenic variation in known and novel cancer genes [178]. This study found a measurable contribution of polygenic effects to sarcoma risk by rare variant burden analysis of cases and controls [178].

A variant burden analysis was also performed using 175 Ewing's sarcoma patients from the International Cancer Genome Consortium (100 patients) and Pediatric Cancer Genome Project (19 patients) [559]. Pathogenic and likely pathogenic mutations were found in 13.1% of Ewing's sarcoma cases, which is significantly higher than the frequency for the same genes in the Exome Aggregation Consortium (ExAC) database (53,105 subjects) [559]. Brohl et al. found that pathogenic mutations were highly enriched in genes involved in DNA damage repair and cancer predisposition syndromes [559]. A table of genes identified by Ballinger et al. and Brohl et al. can be found in Appendix L.

5.2 Methods

5.2.1 Study participants

Sarcoma cases (n = 561) were selected from the ISKS [175], described previously in Chapter 2. Briefly, the ISKS was initiated in 2008 and is a global resource for researchers to investigate the hereditary characteristics of sarcoma [175]. Patients with sarcoma were recruited from major sarcoma treatment centres across Australia, France, New Zealand, India, the United States of America (USA), the United Kingdom, and Canada, regardless of their family history of cancer [175]. Individuals with adult-onset sarcoma (> 15 years old) were eligible for the ISKS.

A total of 1,144 healthy ageing, cancer-free controls were selected from the Medical Genome Reference Bank (MGRB) program [560]. The MGRB program is a collaborative project between the New South Wales State Government and the Garvan Institute of Medical Research to sequence healthy, older individuals and create a high-quality database that is depleted of damaging genetic variants [560]. The MGRB program utilises participants from an existing cohort, the ASPirin in Reducing Events in the Elderly (ASPREE) Study [561]. The ASPREE Study is an international clinical trial to determine whether daily low-dose aspirin improves the quality of life for 19,000 older people in Australia and the USA [561].

5.2.2 Whole genome sequencing

Whole genome sequencing (WGS) for ISKS cases and MGRB controls was performed by collaborators at the Garvan Institute of Medical Research. Cases and controls were sequenced at one lane per sample on the Illumina HiSeq X Ten platform using TruSeq Nano chemistry (2 x 150 base pair paired-end reads, > 30X mean depth for all samples). Samples passing FastQC [394] and verifyBamID [562] contamination filters were mapped to the 1000 Genomes Project hs37d5 reference [563] with an additional PhiX decoy, and small variants were called using the Genome Analysis Toolkit (GATK) 3.7 best practices pipeline [564]. The hs37d5 reference is the hg19-based reference genome employed by the 1000 Genomes Project for Phase 3 analysis. It differs from the hg19 genome by the inclusion of 35 Mb of human sequence as an additional contig (hs37d5). Variants passing variant quality score recalibration (VQSR) tranche thresholds of 99.5% (single nucleotide polymorphisms) and 99.0% (insertions and deletions) were retained to summarise frequencies [182].

5.2.3 Genomic regions selected for validation

Table 5.1 lists the eight target regions that were identified in Chapter 3 (ABCB5, ARHGAP39, BEAN1, C16orf96, KIF2C, PDIA2, UVSSA and ZFP69B). These target regions are genes in which germline risk variants segregating with cancer and age at onset of cancer were identified in three cancer cluster families using whole exome sequencing (WES).

Table 5.1: Genomic coordinates for target regions in which germline and somatic risk variants were identified

Target region     Chromosome   Start coordinate   End coordinate
Identified in Chapter 3
KIF2C             1            45,204,490         45,234,438
ZFP69B            1            40,915,337         40,930,390
UVSSA             4            1,340,104          1,382,837
ABCB5             7            20,654,245         20,797,637
ARHGAP39          8            145,753,563        145,839,888
BEAN1             16           66,460,200         66,517,745
C16orf96          16           4,605,491          4,651,318
PDIA2             16           331,615            338,209
Identified in Chapter 4
P4HTM             3            49,026,341         49,045,581
TET2              4            106,066,842        106,201,960
PLK2              5            57,748,810         57,756,966
SLC6A18           5            1,224,470          1,247,304
LAMA2             6            129,203,286        129,838,710
SDR16C6P,PENK     8            57,286,277         57,359,593
ABL1              9            133,588,268        133,764,062
ASPN              9            95,217,489         95,245,844
SLC22A20,POLA2    11           64,980,311         65,066,088
ADSSL1            14           105,195,228        105,214,647
PRMT5             14           23,388,733         23,399,661
FHOD3             18           33,876,702         34,361,018
GATAD2A           19           19,495,642         19,620,741

Genomic coordinates for each target region (± 1,000 bases) based on human genome 19 (hg19) were obtained from the University of California Santa Cruz (UCSC) Genome Browser (https://genome.ucsc.edu/) [565].
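The ± 1,000 base padding mentioned in the table note, and the test of whether a variant falls inside a padded target region, can be sketched as below. The `Region` type, the function names and the gene coordinates are hypothetical and are not taken from Table 5.1; this is an illustration of the idea only.

```python
from typing import NamedTuple

PAD = 1_000  # bases added on either side of each region, as stated in the table note


class Region(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def pad_region(region: Region, pad: int = PAD) -> Region:
    """Extend a region by `pad` bases on each side, never going below position 1."""
    return region._replace(start=max(1, region.start - pad), end=region.end + pad)


def contains(region: Region, chrom: str, pos: int) -> bool:
    """True if a variant at chrom:pos lies inside the (inclusive) region."""
    return region.chrom == chrom and region.start <= pos <= region.end


# Hypothetical gene and variant positions, invented for illustration.
gene = Region("GENE_X", "1", 10_000, 20_000)
target = pad_region(gene)

print(target)
print(contains(target, "1", 9_500), contains(target, "1", 8_999))
```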

The additional 13 target regions listed in Table 5.1 were identified in Chapter 4 (ABL1, ADSSL1, ASPN, FHOD3, GATAD2A, LAMA2, P4HTM, PLK2, PRMT5, SLC6A18, TET2, and two target regions encompassing SDR16C6P and PENK, and SLC22A20 and POLA2, respectively). These target regions are genes in which candidate somatic risk variants were identified by a matched tumour and germline analysis in two myxoid liposarcoma patients. For intergenic variants, both flanking genes were included. Genomic coordinates for each target region were obtained from the University of California Santa Cruz (UCSC) Genome Browser (https://genome.ucsc.edu/) [565] using human genome build 19 (hg19) and included 1,000 bases either side of each target region. Frequency summary files for the target regions for both cases and controls (in variant call format (*.vcf)) were received and annotated using Annotate Variation (ANNOVAR, version 2015Jun16) [245, 257] and the Regulome database (RegulomeDB).

5.2.4 Statistical analyses

Using the annotation from ANNOVAR, the numbers of nonsynonymous and deleterious alleles (defined as deleterious in both Sorting Intolerant from Tolerant (SIFT) and Polymorphism Phenotyping-2 (PolyPhen-2)) [266, 267] and normal alleles in each target region were summed in cases and controls. Requiring a variant to be classified as deleterious by both SIFT and PolyPhen-2 was the more conservative approach [566]. Similarly, the numbers of putative regulatory alleles (defined as those with a RegulomeDB score of 1a, 1b, 1c, 1d, 1e, 1f, 2a, 2b or 2c) and normal alleles in each target region were summed in cases and controls. Table 5.2 shows the classification of RegulomeDB scores. Odds ratios (ORs) and p-values reported for the variant burden analysis were obtained from one-sided Fisher's exact tests performed in R [281], which compared the total burden of deleterious variants and of putative regulatory variants, separately, between cases and controls. This method was used previously by Ballinger et al. (2016) [178]. Bonferroni adjustment was performed to correct for multiple testing [292].
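The comparison of summed variant and normal alleles between cases and controls can be illustrated with a small, self-contained sketch of a one-sided Fisher's exact test. The analysis in this chapter was run in R; the Python code below is only an assumed equivalent for illustration, the function names are hypothetical, and the counts are invented. It reports the simple cross-product odds ratio, whereas R's `fisher.test` reports a conditional maximum likelihood estimate, so the two odds ratios will generally differ slightly.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test p-value (alternative: odds ratio > 1).

    2 x 2 table of allele counts:
                  variant  normal
        cases        a       b
        controls     c       d
    """
    cases_total, controls_total, variant_total = a + b, c + d, a + c
    denominator = comb(cases_total + controls_total, variant_total)
    # Sum the hypergeometric probabilities of all tables with at least `a`
    # variant alleles in cases, holding every margin fixed.
    upper = min(cases_total, variant_total)
    numerator = sum(
        comb(cases_total, k) * comb(controls_total, variant_total - k)
        for k in range(a, upper + 1)
    )
    return numerator / denominator


def sample_odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Cross-product odds ratio; infinite when b * c is zero."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)


# Invented counts for illustration only (not data from this study).
a, b, c, d = 3, 1, 1, 3
print(sample_odds_ratio(a, b, c, d))
print(fisher_exact_greater(a, b, c, d))
```

Exact integer arithmetic keeps the sketch dependency-free; for allele counts in the thousands a statistics library would be the more practical choice.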

Table 5.2: Classification of Regulome database scores

Score   Supporting data
1a      eQTL + TF binding + matched TF motif + matched DNase Footprint + DNase peak
1b      eQTL + TF binding + any motif + DNase Footprint + DNase peak
1c      eQTL + TF binding + matched TF motif + DNase peak
1d      eQTL + TF binding + any motif + DNase peak
1e      eQTL + TF binding + matched TF motif
1f      eQTL + TF binding / DNase peak
2a      TF binding + matched TF motif + matched DNase Footprint + DNase peak
2b      TF binding + any motif + DNase Footprint + DNase peak
2c      TF binding + matched TF motif + DNase peak
3a      TF binding + any motif + DNase peak
3b      TF binding + matched TF motif
4       TF binding + DNase peak
5       TF binding or DNase peak
6       Other

eQTL: Expression Quantitative Trait Loci. TF: Transcription Factor. DNase: Deoxyribonuclease.
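The definition of a putative regulatory variant used in this chapter (a RegulomeDB score of 1a to 2c) amounts to a simple membership test, sketched below. The function name and the score normalisation are assumptions made for the example.

```python
# RegulomeDB scores treated as putative regulatory in this chapter (1a-2c).
PUTATIVE_REGULATORY_SCORES = {"1a", "1b", "1c", "1d", "1e", "1f", "2a", "2b", "2c"}


def is_putative_regulatory(score: str) -> bool:
    """True if a RegulomeDB score falls in categories 1a to 2c."""
    return score.strip().lower() in PUTATIVE_REGULATORY_SCORES


for score in ["1a", "2c", "3a", "4", "6"]:
    print(score, is_putative_regulatory(score))
```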

5.3 Results

5.3.1 Identification of nonsynonymous deleterious variants in the target regions

The results of the annotation of the frequency summary files using ANNOVAR and RegulomeDB are summarised in Table 5.3. On average, 1,128 variants were identified in each gene in the ISKS cohort and 2,282 in the MGRB cohort. Each gene had an average of five nonsynonymous deleterious variants in the ISKS cohort and six nonsynonymous deleterious variants in the MGRB cohort. The ISKS cohort had an average of 12 putative regulatory variants per gene compared to 11 per gene in the MGRB cohort.

Table 5.3: Annotated summary of nonsynonymous deleterious variants and putative regulatory variants in the target regions

                  Total variants     Deleterious variants   Regulatory variants
Target region     ISKS     MGRB      ISKS     MGRB          ISKS     MGRB
KIF2C             345      396       0        1             2        2
ZFP69B            197      224       2        4             0        0
P4HTM             115      163       2        2             7        7
TET2              328      386       5        9             6        5
UVSSA             667      841       8        6             23       22
PLK2              89       106       0        1             2        2
SLC6A18           314      425       5        8             4        4
LAMA2             6,431    7,864     7        14            18       16
ABCB5             2,117    22,112    17       16            10       8
ARHGAP39          1,085    1,313     1        3             4        5
SDR16C6P,PENK     842      1,100     2        1             3        2
ABL1              2,173    2,563     3        2             24       26
ASPN              23       31        1        2             0        0
SLC22A20,POLA2    811      954       1        2             22       19
ADSSL1            249      302       5        6             13       13
PRMT5             51       57        1        1             0        0
BEAN1             506      582       1        0             7        7
C16orf96          552      737       4        9             8        7
PDIA2             128      124       7        3             1        1
FHOD3             5,215    5,892     9        9             62       49
GATAD2A           1,456    1,753     1        2             32       31

ISKS: International Sarcoma Kindred Study. MGRB: Medical Genome Reference Bank. Deleterious variants: defined as nonsynonymous variants that are deleterious in both Sorting Intolerant from Tolerant (SIFT) and Polymorphism Phenotyping-2 (PolyPhen-2). Regulatory variants: defined as variants with a Regulome database score < 3. Number of variants corresponds to the number of deleterious or regulatory variants within each target region.

5.3.2 Statistical analyses

5.3.2.1 Nonsynonymous deleterious variants

Table 5.4 shows the number of nonsynonymous deleterious alleles and normal alleles for each target region for cases and controls and the results of Fisher's exact test. The significance level after Bonferroni correction was α < 2.38 × 10⁻³. A table containing each nonsynonymous deleterious variant for each target region that was included in the variant burden test is located in Appendix K.

Table 5.4: Odds ratios, p-values and 95% confidence intervals from Fisher's exact test for target regions for nonsynonymous deleterious variants

Target region     Chr.   Identified as   Odds ratio   p-value          95% CI
KIF2C             1      Germline        0            1                .
ZFP69B            1      Germline        0.51         0.55             0.06-2.17
P4HTM             3      Somatic         3.08         0.34             0.35-36.89
UVSSA             4      Germline        1.29         0.45             0.69-2.41
TET2              4      Somatic         2.24         2.29 × 10⁻³      1.31-3.78
PLK2              5      Somatic         0            1                .
SLC6A18           5      Somatic         2.12         7.271 × 10⁻⁷     1.57-2.85
LAMA2             6      Somatic         0.88         0.58             0.59-1.30
ABCB5             7      Germline        0.99         0.82             0.89-1.09
ARHGAP39          8      Germline        2.07         0.45             0.04-25.76
SDR16C6P,PENK     8      Somatic         2.04         0.62             0.11-120.33
ABL1              9      Somatic         1.36         0.70             0.18-10.19
ASPN              9      Somatic         2.05         0.03             0.99-4.04
SLC22A20,POLA2    11     Somatic         2.04         0.48             0.03-39.19
ADSSL1            14     Somatic         1.36         0.56             0.36-4.52
PRMT5             14     Somatic         2.04         0.55             0.03-160.01
BEAN1             16     Germline        0            1                .
C16orf96          16     Germline        2.78         9.95 × 10⁻⁵      1.64-4.64
PDIA2             16     Germline        0.42         2.2 × 10⁻¹⁶      0.35-0.51
FHOD3             18     Somatic         1.11         0.19             0.94-1.30
GATAD2A           19     Somatic         2.06         0.48             0.03-39.55

Chr: Chromosome. ISKS: International Sarcoma Kindred Study. MGRB: Medical Genome Reference Bank. CI: Confidence interval. Odds ratios, p-values and 95% CI obtained from Fisher's exact test performed in R.
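As a quick check of how Table 5.4 is read, the sketch below screens the tabulated odds ratios and p-values against the Bonferroni-corrected significance level of 2.38 × 10⁻³ and separates the significant regions by the direction of the odds ratio. The values are transcribed from the table; the variable names are hypothetical.

```python
ALPHA_BONFERRONI = 2.38e-3  # significance level reported for this analysis

# (odds ratio, p-value) per target region, transcribed from Table 5.4.
table_5_4 = {
    "KIF2C": (0, 1), "ZFP69B": (0.51, 0.55), "P4HTM": (3.08, 0.34),
    "UVSSA": (1.29, 0.45), "TET2": (2.24, 2.29e-3), "PLK2": (0, 1),
    "SLC6A18": (2.12, 7.271e-7), "LAMA2": (0.88, 0.58), "ABCB5": (0.99, 0.82),
    "ARHGAP39": (2.07, 0.45), "SDR16C6P,PENK": (2.04, 0.62), "ABL1": (1.36, 0.70),
    "ASPN": (2.05, 0.03), "SLC22A20,POLA2": (2.04, 0.48), "ADSSL1": (1.36, 0.56),
    "PRMT5": (2.04, 0.55), "BEAN1": (0, 1), "C16orf96": (2.78, 9.95e-5),
    "PDIA2": (0.42, 2.2e-16), "FHOD3": (1.11, 0.19), "GATAD2A": (2.06, 0.48),
}

# Keep regions whose p-value is below the corrected threshold.
significant = {r: v for r, v in table_5_4.items() if v[1] < ALPHA_BONFERRONI}

# Split by direction: OR > 1 means a higher burden in cases, OR < 1 in controls.
higher_in_cases = sorted(r for r, (odds, _) in significant.items() if odds > 1)
higher_in_controls = sorted(r for r, (odds, _) in significant.items() if odds < 1)

print(higher_in_cases)
print(higher_in_controls)
```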

Four target regions reached statistical significance after correction for multiple testing (C16orf96, PDIA2, SLC6A18 and TET2). Of these, C16orf96 and PDIA2 were initially identified through germline variants in three cancer cluster families, and SLC6A18 and TET2 were identified through somatic variants from a matched tumour-germline analysis in two myxoid liposarcoma cases. The odds ratios in Table 5.4 indicate a higher burden of nonsynonymous deleterious alleles in sarcoma cases compared to controls for C16orf96, SLC6A18 and TET2. However, the odds ratio for PDIA2 suggests that controls have a higher burden of variant alleles compared to the sarcoma cases.

5.3.2.2 Putative regulatory variants

Table 5.5 shows the number of putative regulatory alleles and normal alleles for each target region for cases and controls and the results of Fisher's exact test. The significance level after Bonferroni correction was α < 2.78 × 10⁻³. A table containing each putative regulatory variant for each target region that was included in the variant burden test is located in Appendix M.
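The two Bonferroni-corrected significance levels reported in this section are consistent with a family-wise α of 0.05 divided by the number of regions tested. Both the value 0.05 and the test counts are assumptions inferred from the reported thresholds, not stated in the text: 21 regions for the deleterious analysis, and 18 for the regulatory analysis, which would correspond to excluding the three regions with no putative regulatory variants in Table 5.3 (ZFP69B, ASPN and PRMT5).

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that controls the family-wise error rate."""
    return alpha / n_tests


FAMILY_WISE_ALPHA = 0.05  # assumed; not stated explicitly in the text

deleterious_threshold = bonferroni_threshold(FAMILY_WISE_ALPHA, 21)
regulatory_threshold = bonferroni_threshold(FAMILY_WISE_ALPHA, 18)

print(f"{deleterious_threshold:.2e}")
print(f"{regulatory_threshold:.2e}")
```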

Table 5.5: Odds ratios and p-values from Fisher's exact test for target regions for putative regulatory variants

Target region     Chr.   Identified as   Odds ratio   p-value          95% CI
KIF2C             1      Germline        1            0.98             0.89-1.13
ZFP69B            1      Germline        .            .                .
P4HTM             3      Somatic         1            0.85             0.95-1.04
UVSSA             4      Germline        0.86         2.2 × 10⁻¹⁶      0.83-0.88
TET2              4      Somatic         0.78         5.27 × 10⁻⁴      0.68-0.90
PLK2              5      Somatic         0.89         0.19             0.75-1.06
SLC6A18           5      Somatic         1.14         0.21             0.93-1.39
LAMA2             6      Somatic         0.89         1.11 × 10⁻⁶      0.85-0.93
ABCB5             7      Germline        0.72         2.2 × 10⁻¹⁶      0.68-0.75
ARHGAP39          8      Germline        1.29         4.91 × 10⁻⁶      1.16-1.45
SDR16C6P,PENK     8      Somatic         2.04         0.34             0.48-9.84
ABL1              9      Somatic         1.18         5.109 × 10⁻¹²    1.12-1.23
ASPN              9      Somatic         .            .                .
SLC22A20,POLA2    11     Somatic         1.27         2.2 × 10⁻¹⁶      1.22-1.31
ADSSL1            14     Somatic         1.01         0.64             0.97-1.05
PRMT5             14     Somatic         .            .                .
BEAN1             16     Germline        1            0.99             0.93-1.08
C16orf96          16     Germline        0.84         4.38 × 10⁻¹⁰     0.79-0.89
PDIA2             16     Germline        0.85         0.85             0.36-1.86
FHOD3             18     Somatic         0.7          2.2 × 10⁻¹⁶      0.68-0.72
GATAD2A           19     Somatic         0.97         0.08             0.94-1.00

Chr: Chromosome. ISKS: International Sarcoma Kindred Study. MGRB: Medical Genome Reference Bank. CI: Confidence interval. Odds ratios, p-values and 95% CI obtained from Fisher's exact test performed in R.

Nine target regions reached statistical significance after correction for multiple testing (ABCB5, ARHGAP39, C16orf96, UVSSA, ABL1, FHOD3, LAMA2, TET2 and a region encompassing SLC22A20 and POLA2). Of these, ABCB5, ARHGAP39, C16orf96 and UVSSA were identified through germline variants in three cancer cluster families, and ABL1, FHOD3, LAMA2, TET2 and the region encompassing SLC22A20 and POLA2 were identified through somatic variants from a matched tumour-germline analysis in two myxoid liposarcoma cases. The odds ratios indicate a higher burden of putative regulatory variants in sarcoma cases compared to controls for ARHGAP39, ABL1 and the region encompassing SLC22A20 and POLA2. However, the odds ratios for ABCB5, C16orf96, UVSSA, FHOD3, LAMA2 and TET2 indicate that controls have a higher burden of variant alleles compared to the sarcoma cases.

5.4 Discussion

A total of six target regions of interest (C16orf96, SLC6A18, TET2, ARHGAP39, ABL1 and a region encompassing SLC22A20 and POLA2) were found to have a higher burden of nonsynonymous deleterious variants or putative regulatory variants in 561 sarcoma cases compared to 1,144 healthy ageing controls.

5.4.1 Novel findings

This is the first study to report associations between the C16orf96, SLC6A18, ARHGAP39, POLA2 and SLC22A20 genes and sarcoma. None of these genes were reported by Ballinger et al. or Brohl et al. in their variant burden analyses in sarcoma cohorts [178, 559].

C16orf96 is an uncharacterised protein-coding open reading frame gene on chromosome 16. The function of C16orf96 is currently unknown, and expression is generally low in cells. In situ hybridisation experiments have shown that C16orf96 RNA expression is low, is detected only in testis and skin, and is not present in other tissue types [567]. No potential role for this gene in cancer pathogenesis has been established.

The SLC6A18 gene is a member of the SLC6 specific transporter family. SLC6A18 is involved in the transport of glucose and other sugars, bile salts and organic acids, metal ions and amine compounds. A previous study reported a gain of region 5p15.33 containing SLC6A18 in small cell lung cancers [463]. Copy number variations in SLC6A18 have also been reported in lung adenocarcinoma [568].

The protein encoded by ARHGAP39 is a binding partner for CNK2 that is a spatial modulator of Rac cycling during spine morphogenesis and signalling by G protein coupled receptors (GPCRs) [297]. There is no supporting evidence for a role for ARHGAP39 in cancer pathogenesis at this time.

SLC22A20 is a member of the solute carrier family that plays a role in inorganic anion exchanger activity. SLC22A20 is differentially methylated in hepatocellular carcinoma and may be used as a biomarker for early detection [321, 442, 443]. The POLA2 gene has been reported to be involved in cell proliferation by mediating DNA replication, recombination and repair [444]. A variant in POLA2 has been associated with differences in survival and mortality in non-small cell lung cancer patients and could be used as a prognostic biomarker [445, 448]. Low mRNA expression of POLA2 was found to be prognostic of poor outcome in ovarian carcinomas [446]. Additionally, POLA2 was found to be overexpressed in mesothelioma [449].

The role of C16orf96, SLC6A18, ARHGAP39, SLC22A20 and POLA2 in sarcoma pathogenesis remains to be elucidated. The results of this study suggest that these genes should be prioritised for further research in sarcomas.

5.4.2 Known cancer genes

Both the TET2 and ABL1 genes are known cancer genes listed in the Catalogue of Somatic Mutations in Cancer (COSMIC) cancer gene census [134]. TET2 is reported to be frequently mutated or inactivated in cancer, and mutations are commonly observed in myeloid, lymphoid and haematological malignancies [425–430]. TET2 has previously been associated with sarcomas. The loss of TET2 is a characteristic of myeloid sarcomas and may be used as a novel marker [569, 570].

ABL1 is a proto-oncogene that encodes a protein tyrosine kinase involved in a variety of cellular processes, including cell division, adhesion, differentiation and response to stress [571]. This gene is known to be fused to a variety of translocation partner genes in various leukaemias, for example, chronic myelogenous leukaemia (BCR-ABL1) [572]. ABL kinases may also play a role in solid tumours including breast, colon, lung and kidney carcinomas, and melanoma [573–582]. ABL1 variants have previously been reported in sarcomas. Two patients with chronic myeloid leukaemia and secondary sarcomas (histiocytic sarcoma and segregated extramedullary (nodal) myeloid sarcoma) were found to be positive for the t(9;22) BCR/ABL1 translocation in the sarcoma tumours [583, 584]. This evidence suggests that the lineages may be clonally related [583, 584]. However, there is no evidence of ABL1 variants in sarcoma cases without chronic myeloid leukaemia.

5.4.3 Clinical implications

Three of the regions of interest identified in this study may have clinical implications for the treatment of sarcomas. TET2 is listed in the Genomics of Drug Sensitivity in Cancer database and shows a statistically significant association (p-value < 10⁻³) with VNLG/124 and Bexarotene [391]. There may be myeloid sarcomas among the ISKS cases sequenced in this study that harbour TET2 mutations and may respond to VNLG/124 or Bexarotene. However, there may also be other sarcoma subtypes harbouring TET2 mutations. The role of TET2 in sarcoma subtypes other than myeloid sarcomas, and the treatment of sarcomas with TET2 variants using VNLG/124 and Bexarotene, should be further investigated.

ABL1 is associated with trabectedin sensitivity in myxoid liposarcomas [489]. Therefore, there may be an opportunity to treat other sarcoma subtypes that exhibit ABL1 variants with trabectedin.
An expanded access program tested trabectedin in patients with incurable soft tissue sarcoma following disease progression on standard therapy [585]. The study demonstrated disease control, despite a low incidence of objective responses, in advanced soft tissue sarcoma patients after failure of standard chemotherapy [585]. It also found a greater clinical benefit rate and longer median overall survival in patients with leiomyosarcoma and liposarcoma than in patients with other histopathologic subtypes of sarcoma [585]. A second study that evaluated the effectiveness of trabectedin for patients with soft tissue sarcoma also found that there may be a benefit in using trabectedin in patients with leiomyosarcoma or liposarcoma who failed standard of care agents [586].

The SLC22A20 gene is of interest and has potential clinical utility as an uptake carrier of sorafenib, a multikinase inhibitor [442]. Sorafenib has been shown to have activity in metastatic soft tissue sarcoma, specifically in leiomyosarcoma [587, 588].

5.4.4 Strengths and limitations

Classic single-marker association analyses for rare variants are underpowered unless the sample size is extremely large or the variants have a large effect size [558, 589]. Consequently, burden tests have been developed for the analysis of rare genetic variants that consider their joint effects on complex traits within the same functional unit or genomic region. The burden test assumes that all variants in a region are causal and are associated with a trait in the same direction and with the same magnitude of effect [590]. Violation of these assumptions can reduce the power of the test [591–593]. Among the variants identified in a genomic region by WES and WGS, as in this study, some will have little or no effect on the phenotype, some may be protective, and some may be deleterious. The magnitude of the effect of each variant may also vary. For example, rare variants may have a larger effect compared to common variants. Some region-based tests, for example, sequence kernel association tests, take violations of these assumptions into consideration [592]. However, as only frequency summary files for each cohort were available, the breach of these assumptions could not be addressed at this time.
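The loss of power that follows when variants in a region act in opposite directions can be seen in a toy calculation. All counts below are invented and the helper names are hypothetical; the sketch assumes the same number of alleles genotyped per group at each site. One variant is enriched in cases and the other in controls, and once their alleles are collapsed the aggregate odds ratio is exactly 1.

```python
def sample_odds_ratio(case_variant, case_normal, control_variant, control_normal):
    """Cross-product odds ratio for a 2 x 2 table of allele counts."""
    return (case_variant * control_normal) / (case_normal * control_variant)


ALLELES_PER_SITE = 1_000  # hypothetical alleles genotyped per group at each site

# (variant alleles in cases, variant alleles in controls); invented counts.
risk_variant = (30, 10)
protective_variant = (10, 30)


def per_variant_or(variant):
    """Odds ratio for a single variant site."""
    case_v, control_v = variant
    return sample_odds_ratio(
        case_v, ALLELES_PER_SITE - case_v, control_v, ALLELES_PER_SITE - control_v
    )


# Collapse both variants into a single burden count per group.
case_burden = risk_variant[0] + protective_variant[0]
control_burden = risk_variant[1] + protective_variant[1]
total_alleles = 2 * ALLELES_PER_SITE
burden_or = sample_odds_ratio(
    case_burden, total_alleles - case_burden,
    control_burden, total_alleles - control_burden,
)

print(round(per_variant_or(risk_variant), 2))
print(round(per_variant_or(protective_variant), 2))
print(burden_or)
```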

Several regions of interest were also found to have a higher variant burden in controls compared to cases. PDIA2 had a higher burden of nonsynonymous deleterious variants in controls compared to cases. ABCB5, C16orf96, UVSSA, FHOD3, LAMA2 and TET2 had a higher burden of putative regulatory variants in controls compared to cases. This may be due to the presence of common minor alleles in the general population (see Appendix K for minor allele frequencies (MAF) for each variant) or the presence of variants that are phenotypically neutral.

Two of the regions of interest (TET2 and C16orf96) had a higher burden of nonsynonymous deleterious variants in cases compared to controls, but a higher burden of putative regulatory variants in controls. This may also be due to the presence of common minor alleles classified as putative regulatory variants (see Appendix M for MAF for each variant). For example, two putative regulatory variants in C16orf96 have a MAF of 0.61 and 1.00 in the general population and may therefore be phenotypically neutral, whereas the nonsynonymous deleterious variants in C16orf96 had MAF < 2%. Likewise, TET2 nonsynonymous variants had MAF < 2%, whereas one putative regulatory variant had a MAF of 0.21. Given the higher variant burden in controls for some regions, and the contradictory findings between nonsynonymous deleterious variants and putative regulatory variants for C16orf96 and TET2, further studies are required to confirm these gene-level associations.

5.4.5 Conclusion

In conclusion, six target regions that were identified by WES in cancer cluster families and by matched tumour and germline analysis of two myxoid liposarcomas have been validated using a large independent case and control cohort.
C16orf96, SLC6A18 and TET2 were found to have a higher mutational burden of nonsynonymous deleterious variants in sarcoma cases compared to healthy ageing controls. A higher mutational burden of putative regulatory variants in cases was found in ARHGAP39, ABL1 and a region encompassing SLC22A20 and POLA2. This study reported five novel associations between C16orf96, SLC6A18, ARHGAP39, POLA2 and SLC22A20 and sarcoma. The other two genes, TET2 and ABL1, are known cancer genes and have potential clinical utility as they have been identified to contribute to drug sensitivity in cancers. This study has identified novel risk genes that appear to have a higher mutational burden in sarcoma cases compared to healthy ageing controls and should be prioritised for further research.


Chapter 6 Conclusion

6.1 Summary of results

Whole exome sequencing (WES) was performed on three mixed cancer cluster families identified by a sarcoma proband from the International Sarcoma Kindred Study (ISKS). The cancer cluster families selected were not defined by known cancer predisposition syndromes and therefore represented an opportunity to identify novel risk variants associated with both sarcoma and cancer risk. The WES data were annotated, filtered and prioritised using three different strategies to identify rare private variants, known rare variants and candidate gene variants. The prioritised variants were then tested for association with cancer phenotypes using Sequential Oligogenic Linkage Analysis Routines (SOLAR). Nominally significant variants were assessed for familial segregation in each cancer cluster family. Eight novel putative germline risk variants were found to segregate with cancer in the families. Each variant was private to a single family and showed segregation with mixed cancer types. These findings suggest the presence of inherited cancer mutations that may increase the risk of cancer within families.

Matched tumour and germline analyses were performed on two myxoid liposarcoma cases from the cancer cluster families. VarScan2 and Strelka were used to identify 13 novel statistically significant somatic mutations. A large region of loss of heterozygosity and somatic copy number alterations on chromosome 16, encompassing the RBL2 and FUS genes, was also identified in one of the tumours and may contribute towards tumour pathogenesis.

Target regions in which germline and somatic mutations were identified in the cancer cluster families were validated using variant burden analyses in 561 sarcoma cases and 1,144 healthy ageing controls. Six target regions showed an increased mutational burden of nonsynonymous deleterious variants (C16orf96, SLC6A18 and TET2) or putative regulatory variants (ARHGAP39, ABL1 and a region encompassing SLC22A20 and POLA2) in sarcoma cases compared to controls.

6.2 Clinical utility of findings

Two target regions that were found to have a higher mutational burden in sarcoma cases (TET2 and ABL1) are known cancer genes and have potential clinical utility in the treatment of sarcomas, as both have been identified to contribute to drug sensitivity in cancers. The SLC22A20 gene also offers potential clinical utility as an uptake carrier of sorafenib. TET2 and ABL1 have been reported to be associated with myeloid sarcomas and with secondary sarcomas in patients with chronic myeloid leukaemia, respectively. However, there is no evidence of association with other sarcoma subtypes.

The remaining genes identified in this study represent novel candidate risk genes for sarcoma. The POLA2 gene has been reported to be involved in cell proliferation by mediating DNA replication, recombination and repair [444]. The role of the remaining genes of interest (C16orf96, ARHGAP39 and SLC6A18) in cancer pathogenesis remains to be elucidated.

As previously observed by Ballinger et al., and consistent with the findings of the current study, there is a burden of clinically relevant genetic variation in sarcoma patients and their families [178]. The results from this study will be returned to the ISKS coordinators and submitted to a central database. The database contains molecular and biological information that has been collected over time on the ISKS families and specimens. It is critical to catalogue genetic variants, as future studies of these candidates may provide a further understanding of the aetiology of sarcoma, or new therapies that target these candidates may be developed.

6.3 Review of methodology

The current study was the first to perform WES in mixed cancer cluster families identified by a sarcoma proband. This study is an example of a successful two-phase next generation sequencing family study approach: the application of WES to cancer cluster families with rare cancers, followed by larger replication in independent population cohorts. The results of the current study show the utility of this approach in small cancer cluster families to identify novel risk genes for a rare disease, such as sarcoma.

The current study was limited by the size of the initial study sample (19 people in three families), the assumptions used for variant filtering, prioritisation and segregation analysis, and the availability of tumour DNA. The validation using variant burden analysis was also limited by the inability to account for risk, neutral and protective alleles. The current state of bioinformatic tools, databases and knowledge of cancer biology underpinned the study design and analyses performed. The WES data generated in this study may be re-analysed in the future as new tools are developed, and the results may become clinically relevant as knowledge in this field progresses.

The validation of findings from WES (both germline in families and tumour-germline comparison in myxoid liposarcomas) does not provide conclusive evidence of an involvement of these genes in sarcoma pathogenesis. Rather, the results of this study should be seen as hypothesis-generating for novel candidate risk genes that should be prioritised for future research.

6.4 Recommendations for future work

The current study has identified novel candidate risk genes for sarcoma by performing WES in a small number of cancer cluster families. The role of these genes in sarcoma pathogenesis has not been elucidated in this study and was beyond the scope of this thesis. These genes, however, are candidates that can be further tested for association in other sarcoma and cancer cohorts, and in functional validation studies such as molecular assays to determine expression or interactions, or biological assays in animal models.

The two-phase NGS family study approach is gaining momentum in the genomics literature as researchers return to family-based study designs to identify rare genetic variants. The current study adds to the growing evidence that this approach can be successfully used to identify novel risk genes for a rare complex disease such as sarcoma, and may be extended to identify novel risk genes for other complex diseases.

Bibliography

[1] Gerard I. Evan and Karen H. Vousden. Proliferation, cell cycle and apoptosis in cancer. Nature, 411(6835):342–348, 2001.
[2] SEER Training Modules. Cancer classification. Technical report, U.S. National Institutes of Health, National Cancer Institute, 2016.
[3] Fred Bunz. Principles of cancer genetics. Springer, Netherlands, 1st edition, 2008.
[4] Geoffrey M. Cooper and Robert E. Hausman. The development and causes of cancer. In The Cell: A Molecular Approach, pages 725–766. Sinauer Associates Sunderland, 2nd edition, 2000.
[5] Jacques Ferlay, Isabelle Soerjomataram, Rajesh Dikshit, Sultan Eser, Colin Mathers, and Marise Rebelo et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. International Journal of Cancer, 136(5):E359–E386, 2015.
[6] World Health Organization. Global health observatory: the data repository; URL: http://www.who.int/gho/en/, 2016.
[7] World Health Organization. Health in 2015: from MDGs to SDGs. World Health Organization, Geneva, 2015.
[8] Rijo John and Hana Ross. The global economic cost of cancer. Technical report, American Cancer Society, 2010.
[9] Bert Vogelstein and Kenneth W. Kinzler. Cancer genes and the pathways they control. Nature Medicine, 10(8):789–799, 2004.

[10] Douglas Hanahan and Robert A. Weinberg. The hallmarks of cancer. Cell, 100(1):57–70, 2000.
[11] Keith R. Loeb and Lawrence A. Loeb. Significance of multiple mutations in cancer. Carcinogenesis, 21(3):379–385, 2000.
[12] Roshan Karki, Deep Pandya, Robert C. Elston, and Cristiano Ferlini. Defining mutation and polymorphism in the era of personal genomics. BMC Medical Genomics, 8(1):1, 2015.
[13] Michael R. Stratton, Peter J. Campbell, and P. Andrew Futreal. The cancer genome. Nature, 458(7239):719–724, 2009.
[14] Simon J. Talbot and Dorothy H. Crawford. Viruses and tumours - an update. European Journal of Cancer, 40(13):1998–2005, 2004.
[15] Peter A. Jones and Stephen B. Baylin. The fundamental role of epigenetic events in cancer. Nature Reviews Genetics, 3(6):415–428, 2002.
[16] Bert Vogelstein, Nickolas Papadopoulos, Victor E. Velculescu, Shibin Zhou, Luis A. Diaz, and Kenneth W. Kinzler. Cancer genome landscapes. Science, 339(6127):1546–1558, 2013.
[17] Christopher Greenman, Philip Stephens, Raffaella Smith, Gillian L. Dalgliesh, Christopher Hunter, and Graham Bignell et al. Patterns of somatic mutation in human cancer genomes. Nature, 446(7132):153–158, 2007.
[18] Daniel G. Miller. On the nature of susceptibility to cancer. The presidential address. Cancer, 46(6):1307–1318, 1980.
[19] Anna C. Schinzel and William C. Hahn. Oncogenic transformation and experimental models of human cancer. Frontiers in Bioscience, 13:71–84, 2007.
[20] Niko Beerenwinkel, Tibor Antal, David Dingli, Arne Traulsen, Kenneth W. Kinzler, and Victor E. Velculescu et al. Genetic progression and the waiting time to cancer. PLOS Computational Biology, 3(11):e225, 2007.

[21] Pawan Upadhyay, Renu Dwivedi, and Amit Dutt. Applications of next-generation sequencing in cancer. Current Science, 107(5):795, 2014.
[22] International Agency for Research on Cancer. World cancer report 2014. Technical report, World Health Organization, 2014.
[23] Australian Institute of Health and Welfare & Australasian Association of Cancer Registries. Cancer in Australia: an overview, 2012. Technical report, AIHW, 2012.
[24] Julian Peto. Cancer epidemiology in the last century and the next decade. Nature, 411(6835):390–395, 2001.
[25] Tracey DiSipio, Carla Rogers, Beth Newman, David Whiteman, Elizabeth Eakin, Lin Fritschi, and Joanne Aitken. The Queensland cancer risk study: behavioural risk factor results. Australian and New Zealand Journal of Public Health, 30(4):375–382, 2006.
[26] Elizabeth B. Claus, Joellen M. Schildkraut, Douglas W. Thompson, and Neil J. Risch. The genetic attributable risk of breast and ovarian cancer. Cancer, 77(11):2318–2324, 1996.
[27] Lauri A. Aaltonen, Reijo Salovaara, Paula Kristo, Federico Canzian, Akseli Hemminki, and Paivi Peltomaki et al. Incidence of hereditary nonpolyposis colorectal cancer and the feasibility of molecular screening for the disease. New England Journal of Medicine, 338(21):1481–1487, 1998.
[28] Agnes Chompret, Laurence Brugieres, Muriel Ronsin, Maryvonne Gardes, Francoise Dessarps-Freichey, and Anne Abel et al. P53 germline mutations in childhood cancers and cancer risk for carrier individuals. British Journal of Cancer, 82(12):1932, 2000.
[29] Carlo La Vecchia, Eva Negri, Antonella Gentile, and Silvia Franceschi. Family history and the risk of stomach and colorectal cancer. Cancer, 70(1):50–55, 1992.
[30] Gianni Zanghieri, Carmela Di Gregorio, Carla Sacchetti, Rossella Fante, Romano Sassatelli, and Giacomo Cannizzo et al. Familial occurrence of gastric cancer in the 2-year experience of a population-based registry. Cancer, 66(9):2047–2051, 1990.
[31] Shirley Hodgson. Mechanisms of inherited cancer susceptibility. Journal of Zhejiang University. Science. B, 9(1):1–4, 2008.
[32] Knut Borch-Johnsen, Jorgen H. Olsen, and Thorkild I.A. Sorensen. Genes and family environment in familial clustering of cancer. Theoretical Medicine, 15(4):377–386, 1994.
[33] Kari Hemminki, Jan Sundquist, and Justo L. Bermejo. How common is familial cancer? Annals of Oncology, 19(1):163–167, 2008.
[34] David E. Goldgar, Douglas F. Easton, Lisa A. Cannon-Albright, and Mark H. Skolnick. Systematic population-based assessment of cancer risk in first-degree relatives of cancer probands. Journal of the National Cancer Institute, 86(21):1600–1608, 1994.
[35] Frederick P. Li, Joseph F. Fraumeni, John J. Mulvihill, William A. Blattner, Margaret G. Dreyfus, Margaret A. Tucker, and Robert W. Miller. A cancer family syndrome in twenty-four kindreds. Cancer Research, 48(18):5358–5362, 1988.
[36] Janice L. Berliner and Angela Musial Fay. Risk assessment and genetic counseling for hereditary breast and ovarian cancer: recommendations of the national society of genetic counselors. Journal of Genetic Counseling, 16(3):241–260, 2007.
[37] Kari Hemminki, Mahdi Fallah, and Akseli Hemminki. Collection and use of family history in oncology clinics. Journal of Clinical Oncology, 32(29):3344–3345, 2014.
[38] Paul Lichtenstein, Niels V. Holm, Pia K. Verkasalo, Anastasia Iliadou, Jaakko Kaprio, and Markku Koskenvuo et al. Environmental and heritable factors in the causation of cancer - analyses of cohorts of twins from Sweden, Denmark, and Finland. New England Journal of Medicine, 343(2):78–85, 2000.

39 Frederick P. Li and Joseph F. Fraumeni. Prospective study of a family cancer syndrome. The Journal of the American Medical Association, 247(19):2692 2694, 1982. 40 Anthony Antoniou, Paul D.P. Pharoah, Steven Narod, Harvey A. Risch, Jorunn E. Eyfjord, and John L. Hopper et al. Average risks of breast and ovarian cancer associated with BRCA1 or BRCA2 mutations detected in case series unselected for family history: a combined analysis of 22 studies. The American Journal of Human Genetics, 72(5):1117 1130, 2003. 41 Harvey A. Risch, John R. McLaughlin, David E.C. Cole, Barry Rosen, Linda Bradley, and Elaine Kwan et al. Prevalence and penetrance of germline BRCA1 and BRCA2 mutations in a population series of 649 women with ovarian cancer. The American Journal of Human Genetics, 68(3):700 710, 2001. 42 Henry T. Lynch and Albert de la Chapelle. Hereditary colorectal cancer. New England Journal of Medicine, 348(10):919 932, 2003. 43 Alfred G. Knudson. Mutation and cancer: statistical study of retinoblastoma. Proceedings of the National Academy of Sciences, 68(4):820 823, 1971. 44 Abha Gupta and David Malkin. Sarcomas and cancer predisposition syndromes; URL: http://sarcomahelp.org/articles/sarcoma-predisposition-syndromes.html, 2008. 45 Judy E. Garber and Kenneth Offit. Hereditary cancer predisposition syndromes. Journal of Clinical Oncology, 23(2):276 292, 2005. 46 Csilla I. Szabo and Mary-Claire King. Inherited breast and ovarian cancer. Human Molecular Genetics, 4(suppl 1):1811 1817, 1995. 47 Mary-Claire King, Joan H. Marks, and Jessica B. Mandell. Breast and ovarian cancer risks due to inherited mutations in BRCA1 and BRCA2. Science, 302(5645):643 646, 2003. 153

48 Sining Chen, Edwin S. Iversen, Tara Friebel, Dianne Finkelstein, Barbara L. Weber, and Andrea Eisen et al. Characterization of BRCA1 and BRCA2 mutations in a large United States sample. Journal of Clinical Oncology, 24(6):863 871, 2006. 49 Eric R. Fearon. Human cancer syndromes: clues to the origin and nature of cancer. Science, 278(5340):1043, 1997. 50 Ichiro Satokata, Kiyoji Tanaka, Naoyuki Miura, Michiko Narita, Takashi Mimaki, and Yoshiaki Satoh et al. Three nonsense mutations responsible for group A xeroderma pigmentosum. Mutation Research/DNA Repair, 273(2):193 202, 1992. 51 David Malkin, Frederick P. Li, Louise C. Strong, Joseph F. Fraumeni, Camille E. Nelson, and David H. Kim et al. Germline p53 mutations in a familial syndrome of breast cancer, sarcomas, and other neoplasms. Science, 250(4985):1233 1238, 1990. 52 Frederick P. Li and Joseph F. Jr Fraumeni. Soft-tissue sarcomas, breast cancer, and other neoplasms: a familial syndrome? Medicine, 71(4):747 752, 1969. Annals of Internal 53 David Malkin, Kent W. Jolly, Noele Barbier, A. Thomas Look, Stephen H. Friend, and Mark C. Gebhardt et al. Germline mutations of the p53 tumor-suppressor gene in children and young adults with second malignant neoplasms. New England Journal of Medicine, 326(20):1309 1315, 1992. 54 Arnold J. Levine. P53, the cellular gatekeeper for growth and division. Cell, 88(3):323 331, 1997. 55 Amato J. Giaccia and Michael B. Kastan. The complexity of p53 modulation: emerging patterns from divergent signals. Development, 12(19):2973 2983, 1998. Genes & 56 Charles J. Sherr and Frank McCormick. The RB and p53 pathways in cancer. Cancer Cell, 2(2):103 112, 2002. 57 Fattaneh A. Tavassoli, Peter Devilee, and World Health Organization. Tumours of the breast and female genital organs - pathology and genetics. 154

World Health Organization Classification of Tumours. Lyon, France: IARC Press, 2003. 58 Laufey T. Amundadottir, Sverrir Thorvaldsson, Daniel F. Gudbjartsson, Patrick Sulem, Kristleifur Kristjansson, and Sigurdur Arnason et al. Cancer as a complex phenotype: pattern of cancer distribution within and beyond the nuclear family. PLOS Medicine, 1(3):e65, 2005. 59 Iona Cheng, Jonathan M. Kocarnik, Logan Dumitrescu, Noralane M. Lindor, Jenny Chang-Claude, and Christy L. Avery et al. Pleiotropic effects of genetic risk variants for other cancers on colorectal cancer risk: PAGE, GECCO and CCFR consortia. Gut, 63(5):800 807, 2014. 60 Lisa A. Cannon-Albright, Alun Thomas, David E. Goldgar, Khosrow Gholami, Kerry Rowe, and Matt Jacobsen et al. Utah. Cancer Research, 54(9):2378 2385, 1994. Familiality of cancer in 61 Pauli Vaittinen and Kari Hemminki. Familial cancer risks in offspring from discordant parental cancers. International Journal of Cancer, 81(1):12 19, 1999. 62 Chuanhui Dong and Kari Hemminki. Modification of cancer risks in offspring by sibling and parental cancers from 2,112,616 nuclear families. International Journal of Cancer, 92(1):144 150, 2001. 63 Kamila Czene, Paul Lichtenstein, and Kari Hemminki. Environmental and heritable causes of cancer among 9.6 million individuals in the Swedish family-cancer database. International Journal of Cancer, 99(2):260 266, 2002. 64 Christopher D.M. Fletcher and World Health Organization. WHO classification of tumours of soft tissue and bone. International Agency for Research on Cancer, 2013. 65 Zachary Burningham, Mia Hashibe, Logan Spector, and Joshua Schiffman. The epidemiology of sarcoma. Clinical Sarcoma Research, 2(1):14, 2012. 66 Guy Lahat, Alexander Lazar, and Dina Lev. Sarcoma epidemiology and etiology: potential environmental and genetic factors. Surgical Clinics of North America, 88(3):451 481, 2008. 155

67 John R. Goldblum, Sharon W. Weiss, and Andrew L. Folpe. Enzinger and Weiss s soft tissue tumors. Elsevier Health Sciences, 2013. 68 Fritz Schajowicz. Histological typing of bone tumours. Springer Science & Business Media, 2012. 69 W. Archie Bleyer. Cancer in older adolescents and young adults: epidemiology, diagnosis, treatment, survival, and importance of clinical trials. Medical and Pediatric Oncology, 38(1):1 10, 2002. 70 W. Archie Bleyer, Troy Budd, and Michael Montello. Adolescents and young adults with cancer. Cancer, 107(S7):1645 1655, 2006. 71 Ernest K. Amankwah, Anthony P. Conley, and Damon R. Reed. Epidemiology and therapies for metastatic sarcoma. Clinical Epidemiology, 5:147 162, 2013. 72 Australasian Association of Cancer Registries. Cancer in Australia 1998: incidence and mortality data for 1998. Technical report, Australian Institute of Health and Welfare, 2001. 73 Kasmintan A. Schrader, Donavan T. Cheng, Vijai Joseph, Meera Prasad, Michael Walsh, and Ahmet Zehir et al. Germline variants in targeted tumor sequencing using matched normal DNA. JAMA Oncology, 2(1):104 111, 2016. 74 Jinghui Zhang, Michael F. Walsh, Gang Wu, Michael N. Edmonson, Tanja A. Gruber, and John Easton et al. Germline mutations in predisposition genes in pediatric cancer. New England Journal of Medicine, 373(24):2336 2346, 2015. 75 Fabio Levi, Lalao Randimbison, Manuela Maspoli-Conconi, Rafael Blanc-Moya, and Carlo La Vecchia. Incidence of second sarcomas: a cancer registry-based study. Cancer Causes & Control, 25(4):473 477, 2014. 76 Josefin Fernebro, Anna Bladstrom, Anders Rydholm, Pelle Gustafson, Hakan Olsson, Jacob Engellau, and Mef Nilbert. Increased risk of malignancies in a population-based study of 818 soft-tissue sarcoma patients. British Journal of Cancer, 95(8):986 990, 2006. 156

77 Ruth A. Kleinerman, Sara J. Schonfeld, and Margaret A. Tucker. Sarcomas in hereditary retinoblastoma. Clinical Sarcoma Research, 2, 2012. 78 Michael A. Postow and Mark E. Robson. Inherited gastrointestinal stromal tumor syndromes: mutations, clinical features, and therapeutic implications. Clinical Sarcoma Research, 2, 2012. 79 D. Gareth R. Evans, Susan M. Huson, and Jillian M. Birch. Malignant peripheral nerve sheath tumours in inherited disease. Research, 2, 2012. Clinical Sarcoma 80 Junya Toguchida, Toshikazu Yamaguchi, Siri H. Dayton, Roberta L. Beaughamp, Guillermo E. Herrera, and Kanji Ishizaki at al. Prevalence and spectrum of germline mutations of the p53 gene among patients with sarcoma. New England Journal of Medicine, 326(20):1301 1308, 1992. 81 Shih-Jen Hwang, Guillermina Lozano, Christopher I. Amos, and Louise C. Strong. Germline p53 mutations in a cohort with childhood sarcoma: sex differences in cancer risk. The American Journal of Human Genetics, 72(4):975 983, 2003. 82 Amy Berrington de Gonzalez, Alina Kutsenko, and Preetha Rajaraman. Sarcoma risk after radiation exposure. Clinical Sarcoma Research, 2(1):1, 2012. 83 Lee J. Helman and Paul Meltzer. Mechanisms of sarcoma development. Nature Reviews Cancer, 3(9):685 694, 2003. 84 Kishor Bhatia, Meredith S. Shiels, Alexandra Berg, and Eric A. Engels. Sarcomas other than Kaposi sarcoma occurring in immunodeficiency: interpretations from a systematic literature review. Current Opinion in Oncology, 24(5):537, 2012. 85 Denise Whitby, Chris Boshoff, T. Hatzioannou, Robert A. Weiss, Thomas F. Schulz, and Mark R. Howard et al. Detection of Kaposi sarcoma associated herpesvirus in peripheral blood of HIV-infected individuals and progression to Kaposi s sarcoma. The Lancet, 346(8978):799 802, 1995. 157

86 R. Balarajan and Ernest D. Acheson. Soft tissue sarcomas in agriculture and forestry workers. 38(2):113 116, 1984. Journal of Epidemiology and Community Health, 87 Diego Serraino, Silvia Franceschi, Carlo La Vecchia, and Antonino Carbone. Occupation and soft-tissue sarcoma in northeastern Italy. Cancer Causes & Control, 3(1):25 30, 1992. 88 Gun Wingren, Mats Fredrikson, H. Noorlind Brage, Bo Nordenskjold, and Olav Axelson. Soft tissue sarcoma and occupational exposures. Cancer, 66(4):806 811, 1990. 89 Franco Merletti, Lorenzo Richiardi, Franco Bertoni, Wolfgang Ahrens, Antoine Buemi, and Cristina Costa-Santos et al. Occupational factors and risk of adult bone sarcomas: A multicentric case-control study in Europe. International Journal of Cancer, 118(3):721 727, 2006. 90 Eero Pukkala, Jan Ivar Martinsen, Elsebeth Lynge, Holmfridur Kolbrun Gunnarsdottir, Par Sparen, and Laufey Tryggvadottir et al. Occupation and cancer-follow-up of 15 million people in five Nordic countries. Acta Oncologica, 48(5):646 790, 2009. 91 Mikael Eriksson, Lennart Hardell, and Hans-Olov Adami. Exposure to dioxins as a risk factor for soft tissue sarcoma: A population-based case-control study. Journal of the National Cancer Institute, 82(6):486 490, 1990. 92 Jane A. Hoppin, Paige E. Tolbert, W. Dana Flanders, Rebecca H. Zhang, Danni S. Daniels, Bruce D. Ragsdale, and Edward A. Brann. Occupational risk factors for sarcoma subtypes. Epidemiology, 10(3):300 306, 1999. 93 Manolis Kogevinas, Timo Kauppinen, Regina Winkelmann, Heiko Becher, Pier Alberto Bertazzi, and H. Bas Bueno-de-Mesquita et al. Soft tissue sarcoma and non-hodgkin s lymphoma in workers exposed to phenoxy herbicides, chlorophenols, and dioxins: two nested case-control studies. Epidemiology, 6(4):396 402, 1995. 94 Lennart Hardell and Mikael Eriksson. The association between soft tissue sarcomas and exposure to phenoxyacetic acids. Cancer, 62(3):652 656, 1988. 158

95 J. Gustav Smith and Allen J. Christophers. Phenoxy herbicides and chlorophenols: a case control study on soft tissue sarcoma and malignant lymphoma. British Journal of Cancer, 65(3):442, 1992. 96 James S. Woods, Lincoln Polissar, Richard K. Severson, LS. Heuser, and Bruce G. Kulander. Soft tissue sarcoma and non-hodgkin s lymphoma in relation to phenoxyherbicide and chlorinated phenol exposure in western Washington. Journal of the National Cancer Institute, 78(5):899 910, 1987. 97 Francesca Fioretti, Alessandra Tavani, Silvano Gallus, Eva Negri, Silvia Franceschi, and Carlo La Vecchia. Menstrual and reproductive factors and risk of soft tissue sarcomas. Cancer, 88(4):786 789, 2000. 98 Kristin P. Anfinsen, Susan S. Devesa, Freddie Bray, Rebecca Troisi, Thora J. Jonasdottir, Oyvind S. Bruland, and Tom Grotmol. Age-period-cohort analysis of primary bone cancer incidence rates in the United States (1976-2005). Cancer Epidemiology Biomarkers & Prevention, 20(8):1770 1777, 2011. 99 Deborah M. Winn, Frederick P. Li, Leslie L. Robison, John J. Mulvihill, Ann E. Daigle, and Joseph F. Fraumeni. A case-control study of the etiology of Ewing s sarcoma. Cancer Epidemiology Biomarkers & Prevention, 1(7):525 532, 1992. 100 Seymour Grufferman, Helen H. Wang, Elizabeth R. DeLong, Sue Y.S. Kimm, Elizabeth S. Delzell, and John M. Falletta. Environmental factors in the etiology of rhabdomyosarcoma in childhood. Journal of the National Cancer Institute, 68(1):107 113, 1982. 101 Ann L. Hartley, Jillian M. Birch, Henry B. Marsden, Martin Harris, and Val Blair. Neurofibromatosis in children with soft tissue sarcoma. Pediatric Hematology and Oncology, 5(1):7 16, 1988. 102 Lisa Mirabello, Ruth Pfeiffer, Gwen Murphy, Najat C. Daw, Ana Patino-Garcia, and Rebecca J. Troisi et al. Height at diagnosis and birth-weight as risk factors for osteosarcoma. Cancer Causes & Control, 22(6):899 908, 2011. 159

103 Logan G. Spector, Susan E. Puumala, Susan E. Carozza, Eric J. Chow, Erin E. Fox, and Scott Horel et al. Cancer risk among children with very low birth weights. Pediatrics, 124(1):96 104, 2009. 104 Simona Ognjanovic, Susan E. Carozza, Eric J. Chow, Erin E. Fox, Scott Horel, and Colleen C. McLaughlin et al. Birth characteristics and the risk of childhood rhabdomyosarcoma based on histological subtype. British Journal of Cancer, 102(1):227 231, 2010. 105 Julie Von Behren, Logan G. Spector, Beth A. Mueller, Susan E. Carozza, Eric J. Chow, and Erin E. Fox et al. Birth order and risk of childhood cancer: a pooled analysis from five US States. International Journal of Cancer, 128(11):2709 2716, 2011. 106 Felix Mitelman, Bertil Johansson, and Fredrik Mertens. Mitelman database of chromosome aberrations and gene fusions in cancer; URL: http://cgap.nci.nih.gov/chromosomes/mitelman, 2016. 107 Shujuan J. Xia and Frederic G. Barr. Chromosome translocations in sarcomas and the emergence of oncogenic transcription factors. European Journal of Cancer, 41(16):2513 2527, 2005. 108 Surbhi Jain, Lori W. McGinnes, and Trudy G. Morrison. Thiol/disulfide exchange is required for membrane fusion directed by the Newcastle disease virus fusion protein. Journal of Virology, 81(5):2328 2339, 2007. 109 Brian P. Rubin, Samuel Singer, Connie Tsao, Anette Duensing, Marcia L. Lux, and Robert Ruiz et al. KIT activation is a ubiquitous feature of gastrointestinal stromal tumors. Cancer Research, 61(22):8118 8121, 2001. 110 Michael C. Heinrich, Christopher L. Corless, Anette Duensing, Laura McGreevey, Chang-Jie Chen, and Nora Joseph et al. PDGFRA activating mutations in gastrointestinal stromal tumors. Science, 299(5607):708 710, 2003. 111 Louis Guillou and Alain Aurias. Soft tissue sarcomas with complex genomic profiles. Virchows Archiv, 456(2):201 217, 2009. 160

112 Jeff M. Hall, Ming K. Lee, Beth Newman, Jan E. Morrow, Lee A. Anderson, Bing Huey, and Marie-Claire King. Linkage of early-onset familial breast cancer to chromosome 17q21. Science, 250(4988):1684, 1990. 113 Richard Wooster, Susan L. Neuhausen, Jonathan Mangion, Yvette Quirk, Deborah Ford, and Nadine Collins et al. Localization of a breast cancer susceptibility gene, BRCA2, to chromosome 13q12-13. Science, 265(5181):2088 2091, 1994. 114 Walter F. Bodmer, Carolyn J. Bailey, Julia G. Bodmer, H.J.R. Bussey, Anthony Ellis, and Patricia Gorman et al. Localization of the gene for familial adenomatous polyposis on chromosome 5. Nature, 328(6131):614 616, 1987. 115 Paivi Peltomaki, Lauri A. Aaltonen, Pertti Sistonen, Lea Pylkkanen, Jukka-Pekka Mecklin, and Heikki Jarvinen et al. Genetic mapping of a locus predisposing to human colorectal cancer. Science, 260(5109):810 812, 1993. 116 Annika Lindblom, Pia Tannergard, Barbro Werelius, and Magnus Nordenskjold. Genetic mapping of a second locus predisposing to hereditary non-polyposis colon cancer. Nature Genetics, 5(3):279 282, 1993. 117 Lisa A. Cannon-Albright, David E. Goldgar, Laurence J. Meyer, Cathryn M. Lewis, David E. Anderson, and J.W. Fountain et al. Assignment of a locus for familial melanoma, MLM, to chromosome 9p13-p22. Science, 258(5085):1148, 1992. 118 Group Anglian Breast Cancer Study. Prevalence and penetrance of BRCA1 and BRCA2 mutations in a population-based series of breast cancer cases. British Journal of Cancer, 83(10):1301, 2000. 119 Kirsi Syrjakoski, Pia Vahteristo, Hannaleena Eerola, Anitta Tamminen, Kati Kivinummi, and Laura Sarantaus et al. Population-based study of BRCA1 and BRCA2 mutations in 1035 unselected Finnish breast cancer patients. Journal of the National Cancer Institute, 92(18):1529 1531, 2000. 120 Gudrun Johannesdottir, Julius Gudmundsson, Jon T. Bergthorsson, Adalgeir Arason, Bjarni A. Agnarsson, and Gudny Eiriksdottir et al. High prevalence 161

of the 999del5 mutation in Icelandic breast and ovarian cancer patients. Cancer Research, 56(16):3663 3665, 1996. 121 Steinunn Thorlacius, Stefan Sigurdsson, Helga Bjarnadottir, Gudridur Olafsdottir, Jon Gunnlaugur Jonasson, and Laufey Tryggvadottir et al. Study of a single BRCA2 mutation with high carrier frequency in a small population. American Journal of Human Genetics, 60(5):1079, 1997. 122 Patricia Hartge, Jeffery P. Struewing, Sholom Wacholder, Lawrence C. Brody, and Margaret A. Tucker. The prevalence of common BRCA1 and BRCA2 mutations among Ashkenazi Jews. The American Journal of Human Genetics, 64(4):963 970, 1999. 123 Steinunn Thorlacius, Jeffery P. Struewing, Patricia Hartage, Gudridur H. Olafsdottir, Helgi Sigvaldason, and Laufey Tryggvadottir et al. Population-based study of risk of breast cancer in carriers of BRCA2 mutation. The Lancet, 352(9137):1337 1339, 1998. 124 Bruce A.J. Ponder. Cancer genetics. Nature, 411(6835):336 341, 2001. 125 Joel N. Hirschhorn and Mark J. Daly. Genome-wide association studies for common diseases and complex traits. Nature Review Genetics, 6(2):95 108, 2005. 126 Tony Burdett, Peggy N. Hall, Emma Hastings, Lucia A. Hindorff, and Heather A. Junkins. The NHGRI-EBI Catalog of published genome-wide association studies. Available at: www.ebiacuk/gwas, 2015. 127 Andrew D. Beggs and Shirley V. Hodgson. Genomics and breast cancer: the different levels of inherited susceptibility. European Journal of Human Genetics, 17(7):855 856, 2009. 128 Teri A. Manolio, Francis S. Collins, Nancy J. Cox, David B. Goldstein, Lucia A. Hindorff, and David J. Hunter et al. Finding the missing heritability of complex diseases. Nature, 461(7265):747 753, 2009. 129 Jon McClellan and Mary-Claire King. Genetic heterogeneity in human disease. Cell, 141(2):210 217, 2010. 162

130 Frederick Sanger, Steven Nicklen, and Alan R. Coulson. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463 5467, 1977. 131 Marcel Margulies, Michael Egholm, William E. Altman, Said Attiya, Joel S. Bader, and Lisa A. Bemben et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376 380, 2005. 132 Erwin L. van Dijk, Helene Auger, Yan Jaszczyszyn, and Claude Thermes. Ten years of next-generation sequencing technology. 30(9):418 426, 2014. Trends in Genetics, 133 Daniel C. Koboldt, Karyn Meltz Steinberg, David E. Larson, Richard K. Wilson, and Elaine R. Mardis. The next-generation sequencing revolution and its impact on genomics. Cell, 155(1):27 38, 2013. 134 Sally Bamford, Emily Dawson, Simon Forbes, Jody Clements, Roger Pettett, and Ahmet Dogan et al. The COSMIC (catalogue of somatic mutations in cancer) database and website. British Journal of Cancer, 91(2):355 358, 2004. 135 The Cancer Genome Atlas Research Network, John N. Weinstein, Eric A. Collisson, Gordon B. Mills, Kenna R. Mills Shaw, Brad A. Ozenberger, and Kyle Ellrott et al. The cancer genome atlas pan-cancer analysis project. Nature Genetics, 45(10):1113 1120, 2013. 136 Thomas J. Hudson, Warwick Anderson, Axel Aretz, Anna D. Barker, Cindy Bell, and Rosa R. Bernabe et al. International network of cancer genome projects. Nature, 464(7291):993 998, 2010. 137 Veronique G. LeBlanc and Marco A. Marra. Next-generation sequencing approaches in cancer: Where have they brought us and where will they take us? Cancers, 7(3):1925 1958, 2015. 138 Elaine R. Mardis. Next-generation DNA sequencing methods. Annual Review of Genomics and Human Genetics, 9(1):387 402, 2008. 139 Michael L. Metzker. Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1):31 46, 2010. 163

140 David N. Cooper. The nature and mechanisms of human gene mutation. The Metabolic and Molecular Bases of Inherited Disease, pages 259 291, 1995. 141 Sarah B. Ng, Emily H. Turner, Peggy D. Robertson, Steven D. Flygare, Abigail W. Bigham, and Choli Lee et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461(7261):272 276, 2009. 142 Sarah B. Ng, Abigail W. Bigham, Kati J. Buckingham, Mark C. Hannibal, Margaret J. McMillin, and Heidi I. Gildersleeve et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nature Genetics, 42(9):790 793, 2010. 143 Alexander Hoischen, Bregje W.M. van Bon, Christian Gilissen, Peer Arts, Bart van Lier, and Marloes Steehouwer et al. De novo mutations of SETBP1 cause Schinzel-Giedion syndrome. Nature Genetics, 42(6):483 485, 2010. 144 Sarah B. Ng, Kati J. Buckingham, Choli Lee, Abigail W. Bigham, Holly K. Tabor, and Karin M. Dent et al. Exome sequencing identifies the cause of a Mendelian disorder. Nature Genetics, 42(1):30 35, 2010. 145 Jun Ling Wang, Xu Yang, Kun Xia, Zheng Mao Hu, Ling Weng, and Xin Jin et al. TGM6 identified as a novel causative gene of spinocerebellar ataxias using exome sequencing. Brain, 133(12):3510 3518, 2010. 146 Chee-Seng Ku, Nasheen Naidoo, and Yudi Pawitan. Revisiting Mendelian disorders through exome sequencing. Human Genetics, 129(4):351 370, 2011. 147 Jessada Thutkawkorapin, Simone Picelli, Vinaykumar Kontham, Tao Liu, Daniel Nilsson, and Annika Lindblom. Exome sequencing in one family with gastric- and rectal cancer. BMC Genetics, 17:41, 2016. 148 Matthew Meyerson, Stacey Gabriel, and Gad Getz. Advances in understanding cancer genomes through second-generation sequencing. Nature Review Genetics, 11(10):685 696, 2010. 149 Riyue Bao, Lei Huang, Jorge Andrade, Wei Tan, Warren A. Kibbe, Hongmei Jiang, and Gang Feng. Review of current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing. 
Cancer Informatics, 13(Suppl 2):67 82, 2014. 164

150 Kristian Cibulskis, Michael S. Lawrence, Scott L. Carter, Andrey Sivachenko, David Jaffe, and Carrie Sougnez et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology, 31(3):213 219, 2013. 151 Qingguo Wang, Peilin Jia, Fei Li, Haiquan Chen, Hongbin Ji, and Donald Hucks et al. Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Medicine, 5(10):1 8, 2013. 152 Xiaofeng Zhu, Tao Feng, Yali Li, Qing Lu, and Robert C. Elston. Detecting rare variants for complex traits using family and unrelated data. Genetic Epidemiology, 34(2):171 187, 2010. 153 Tao Feng, Robert C. Elston, and Xiaofeng Zhu. Detecting rare and common variants for complex traits: sibpair and odds ratio weighted sum statistics (SPWSS, ORWSS). Genetic Epidemiology, 35(5):398 409, 2011. 154 Iuliana Ionita-Laza and Ruth Ottman. Study designs for identification of rare disease variants in complex diseases: the utility of family-based designs. Genetics, 189(3):1061 1068, 2011. 155 Gang Shi and D.C. Rao. Optimum designs for next-generation sequencing to discover rare variants for common complex disease. Genetic Epidemiology, 35(6):572 579, 2011. 156 Colin C. Pritchard, Christina Smith, Stephen J. Salipante, Ming K. Lee, Anne M. Thornton, and Alex S. Nord et al. ColoSeq provides comprehensive Lynch and polyposis syndrome mutational analysis using massively parallel sequencing. The Journal of Molecular Diagnostics, 14(4):357 366, 2012. 157 Tom Walsh, Ming K. Lee, Silvia Casadei, Anne M. Thornton, Sunday M. Stray, and Christopher Pennil et al. Detection of inherited mutations for breast and ovarian cancer using genomic capture and massively parallel sequencing. Proceedings of the National Academy of Sciences, 107(28):12629 12633, 2010. 158 Duncan Thomas, Zhao Yang, and Fan Yang. Two-phase and family-based designs for next-generation sequencing studies. Frontiers in Genetics, 4(276), 2013. 165

159 Nazneen Rahman. Realizing the promise of cancer predisposition genes. Nature, 505(7483):302 308, 2014. 160 Nazneen Rahman. Mainstreaming genetic testing of cancer predisposition genes. Clinical Medicine, 14(4):436 439, 2014. 161 Victor A. McKusick. Mendelian Inheritance in Man and Its Online Version, OMIM. American Journal of Human Genetics, 80(4):588 604, 2007. 162 Olivia Fletcher and Richard S. Houlston. Architecture of inherited susceptibility to common cancer. Nature Reviews Cancer, 10(5):353 361, 2010. 163 David M. Thomas and Mandy L. Ballinger. Inherited and de novo germline TP53 mutations in adult-onset sarcoma. Practice, 10(2):A26, 2012. Hereditary Cancer in Clinical 164 Levi A. Garraway and Eric S. Lander. Lessons from the cancer genome. Cell, 153(1):17 37, 2013. 165 Himisha Beltran, Davide Prandi, Juan Miguel Mosquera, Matteo Benelli, Loredana Puca, and Joanna Cyrta et al. Divergent clonal evolution of castration-resistant neuroendocrine prostate cancer. Nature Medicine, 22(3):298 305, 2016. 166 Peter D. Stenson, Edward V. Ball, Katy Howells, Andrew D. Phillips, Matthew Mort, and David N. Cooper. The human gene mutation database: providing a comprehensive central mutation database for molecular diagnostics and personalised genomics. Human Genomics, 4(2):69, 2009. 167 Murim Choi, Ute I. Scholl, Weizhen Ji, Tiewen Liu, Irina R. Tikhonova, and Paul Zumbo et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proceedings of the National Academy of Sciences, 106(45):19096 19101, 2009. 168 Dale Hedges, Dan Burges, Eric Powell, Cherylyn Almonte, Jia Huang, and Stuart Young et al. Exome sequencing of a multigenerational human pedigree. PLOS ONE, 4(12):e8232, 2009. 166

169 David Botstein and Neil Risch. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nature Genetics, 33(3s):228, 2003. 170 Urs A. Meyer. Pharmacogenetics and adverse drug reactions. The Lancet, 356(9242):1667 1671, 2000. 171 Urs A. Meyer, Ulrich M. Zanger, and Matthias Schwab. Omics and drug response. Annual Review of Pharmacology and Toxicology, 53(1):475 502, 2013. 172 Barry Merriman, Ion Torrent Development Team, and Jonathan M. Rothberg. Progress in Ion Torrent semiconductor chip based sequencing. Electrophoresis, 33(23):3397 3417, 2012. 173 Martin Mascher, Shuangye Wu, Paul St Amand, Nils Stein, and Jesse Poland. Application of genotyping-by-sequencing on semiconductor sequencing platforms: a comparison of genetic and reference-based marker ordering in barley. PLOS ONE, 8(10):e76925, 2013. 174 Nicholas J. Loman, Raju V. Misra, Timothy J. Dallman, Chrystala Constantinidou, Saheer E. Gharbia, John Wain, and Mark J. Pallen. Performance comparison of benchtop high-throughput sequencing platforms. Nature Biotechnology, 30(5):434 439, 2012. 175 Australasian Sarcoma Study Group. International sarcoma kindred study, URL: http://www.australiansarcomagroup.org/sarcomakindredstudy, 2013. 176 Gillian Mitchell, Mandy L. Ballinger, Stephen Wong, Chelsee Hewitt, Paul James, and Mary-Anne Young et al. High frequency of germline TP53 mutations in a prospective adult-onset sarcoma cohort. PLOS ONE, 8(7):1 7, 2013. 177 Gang Peng, Jasmina Bojadzieva, Mandy L. Ballinger, Jialu Li, Amanda L. Blackford, and Phuong L. Mai et al. Estimating TP53 mutation carrier probability in families with Li-Fraumeni syndrome using LFSPRO. Cancer Epidemiology and Prevention Biomarkers, pages cebp 0695.2016, 2017. 167


mutation patterns and pathway alterations in human cancers. 466(7308):869 873, 2010. Nature, 360 Bing Yu, Sandra A. O Toole, and Ronald J. Trent. Somatic DNA mutation analysis in targeted therapy of solid tumours. 4(2):125 138, 2015. Translational Pediatrics, 361 J. Guillermo Paez, Pasi A. Janne, Jeffrey C. Lee, Sean Tracy, Heidi Greulich, and Stacey Gabriel et al. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science, 304(5676):1497 1500, 2004. 362 Keith T. Flaherty, Igor Puzanov, Kevin B. Kim, Antoni Ribas, Grant A. McArthur, and Jeffrey A. Sosman et al. Inhibition of mutated, activated BRAF in metastatic melanoma. New England Journal of Medicine, 363(9):809 819, 2010. 363 Astrid Lievre, Jean-Baptiste Bachet, Delphine Le Corre, Valerie Boige, Bruno Landi, and Emile Jean-Francois et al. KRAS mutation status is predictive of response to cetuximab therapy in colorectal cancer. Cancer Research, 66(8):3992 3995, 2006. 364 Martin H. Cohen, Ann Farrell, Robert Justice, and Richard Pazdur. Approval summary: imatinib mesylate in the treatment of metastatic and/or unresectable malignant gastrointestinal stromal tumors. The Oncologist, 14(2):174 180, 2009. 365 Georgina L. Ryland, Maria A. Doyle, David Goode, Samantha E. Boyle, David Y. H. Choong, and Simone M. Rowley et al. Loss of heterozygosity: what is it good for? BMC Medical Genomics, 8(1):45, 2015. 366 Brenda L. Gallie, A. Linn Murphree, Louise C. Strong, and Rhiannon L. White. Expression of recessive alleles by chromosomal mechanisms in retinoblastoma. Nature, 305(779784):3134, 1983. 367 Sofia D. Merajver, Thomas S. Frank, Junzhe Xu, Trinh M. Pham, Kathleen A. Calzone, and Pamela Bennett-Baker et al. Germline BRCA1 mutations and loss of the wild-type allele in tumors from families with early onset breast and ovarian cancer. Clinical Cancer Research, 1(5):539 544, 1995. 189

368 Daniel C. Koboldt, Qunyuan Zhang, David E. Larson, Dong Shen, Michael D. McLellan, and Ling Lin et al. VarScan2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research, 22(3):568 576, 2012. 369 Daniel C. Koboldt, David E. Larson, and Richard K. Wilson. Using VarScan2 for germline variant calling and somatic mutation detection. Current Protocols in Bioinformatics, 44:15.4.1 15.4.17, 2013. 370 Adam B. Olshen, Venkatraman E. Seshan, Robert Lucito, and Michael Wigler. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557 572, 2004. 371 Richard Redon, Shumpei Ishikawa, Karen R. Fitch, Lars Feuk, George H. Perry, and T. Daniel Andrews et al. Global variation in copy number in the human genome. Nature, 444(7118):444 454, 2006. 372 Adam Shlien and David Malkin. Copy number variations and cancer. Genome Medicine, 1(6):62 62, 2009. 373 Darrin Stuart and William R. Sellers. Linking somatic genetic alterations in cancer to therapeutics. Current Opinion in Cell Biology, 21(2):304 310, 2009. 374 Rebecca J. Leary, Jimmy C. Lin, Jordan Cummins, Simina Boca, Laura D. Wood, and D. Williams Parsons et al. Integrated analysis of homozygous deletions, focal amplifications, and sequence alterations in breast and colorectal cancers. Proceedings of the National Academy of Sciences, 105(42):16224 16229, 2008. 375 Evelyn Despierre, Matthieu Moisse, Betul Yesilyurt, Jalid Sehouli, Ioana Braicu, and Sven Mahner et al. Somatic copy number alterations predict response to platinum therapy in epithelial ovarian cancer. Gynecologic Oncology, 135(3):415 422, 2014. 376 Hongtao Xu, Xia Zhu, Zulong Xu, Yue Hu, Shiping Bo, Tongjing Xing, and Kuichun Zhu. Non-invasive analysis of genomic copy number variation in patients with hepatocellular carcinoma by next generation DNA sequencing. Journal of Cancer, 6(3):247, 2015. 190

377 Sara Martoreli Silveira, Isabela Werneck da Cunha, Fabio Albuquerque Marchi, Ariane Fidelis Busso, Ademar Lopes, and Silvia Regina Rogatto. Genomic screening of testicular germ cell tumors from monozygotic twins. Orphanet Journal of Rare Diseases, 9(1):181, 2014. 378 Sukanya Horpaopan, Isabel Spier, Alexander M. Zink, Janine Altmuller, Stefanie Holzapfel, and Andreas Laner et al. Genome-wide CNV analysis in 221 unrelated patients and targeted high-throughput sequencing reveal novel causative candidate genes for colorectal adenomatous polyposis. International Journal of Cancer, 136(6):E578 E589, 2015. 379 Nadine Bonberg, Beate Pesch, Thomas Behrens, Georg Johnen, Dirk Taeger, and Katarzyna Gawrych et al. Chromosomal alterations in exfoliated urothelial cells from bladder cancer cases and healthy men: a prospective screening study. BMC Cancer, 14(1):854, 2014. 380 Barbara A. Weir, Michele S. Woo, Gad Getz, Sven Perner, Li Ding, and Rameen Beroukhim et al. Characterizing the cancer genome in lung adenocarcinoma. Nature, 450(7168), 2007. 381 Astrid M. Eder, Xiaomei Sui, Daniel G. Rosen, Laura K. Nolden, Kwai Wa Cheng, and John P. Lahad et al. Atypical PKCI contributes to poor prognosis through loss of apical-basal polarity and cyclin E overexpression in ovarian cancer. Proceedings of the National Academy of Sciences of the United States of America, 102(35):12519 12524, 2005. 382 Idoya Lahortiga, Kim De Keersmaecker, Pieter Van Vlierberghe, Carlos Graux, Barbara Cauwelier, and Frederic Lambert et al. Duplication of the MYB oncogene in T cell acute lymphoblastic leukemia. Nature Genetics, 39(5):593 595, 2007. 383 Lars Zender, Mona S. Spector, Wen Xue, Peer Flemming, Carlos Cordon-Cardo, and John Silke et al. Identification and validation of oncogenes in liver cancer using an integrative oncogenomic approach. Cell, 125(7):1253 1267, 2006. 384 Charles G. Mullighan, Salil Goorha, Ina Radtke, Christopher B. Miller, Elaine Coustan-Smith, and James D. Dalton et al. 
Genome-wide 191

analysis of genetic alterations in acute lymphoblastic leukaemia. Nature, 446(7137):758 764, 2007. 385 Ruprecht Wiedemeyer, Cameron Brennan, Timothy P. Heffernan, Yonghong Xiao, John Mahoney, and Alexei Protopopov et al. Feedback circuit among INK4 tumor suppressors constrains human glioblastoma development. Cancer Cell, 13(4):355 364, 2008. 386 The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061 1068, 2008. 387 Dhananjay Chitale, Yixuan Gong, Barry S. Taylor, Stephen Broderick, Cameron Brennan, and Romel Somwar et al. An integrated genomic analysis of lung cancer reveals loss of DUSP4 in EGFR-mutant tumors. Oncogene, 28(31):2773 2783, 2009. 388 Erin D. Pleasance, R. Keira Cheetham, Philip J. Stephens, David J. McBride, Sean J. Humphray, and Chris D. Greenman et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature, 463(7278):191 196, 2010. 389 Christopher T. Saunders, Wendy S.W. Wong, Sajani Swamy, Jennifer Becq, Lisa J. Murray, and R. Keira Cheetham. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics, 28(14):1811 1817, 2012. 390 Heather E. Wheeler, Michael L. Maitland, M. Eileen Dolan, Nancy J. Cox, and Mark J. Ratain. Cancer pharmacogenomics: strategies and challenges. Nature Reviews Genetics, 14(1):23 34, 2013. 391 Wanjuan Yang, Jorge Soares, Patricia Greninger, Elena J. Edelman, Howard Lightfoot, and Simon Forbes et al. Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Research, 41(D1):D955 D961, 2013. 392 Lin Wu, Nancy Patten, Carl T. Yamashiro, and Buena Chui. Extraction and amplification of DNA from formalin-fixed, paraffin-embedded tissues. Applied Immunohistochemistry & Molecular Morphology, 10(3):269 274, 2002. 192

393 Sarah Munchel, Yen Hoang, Yue Zhao, Joseph Cottrell, Brandy Klotzle, and Andrew K. Godwin et al. Targeted or whole genome sequencing of formalin fixed tissue samples: potential applications in cancer genomics. Oncotarget, 6(28):25943 25961, 2015. 394 Simon Andrews. Fastqc: a quality control tool for high throughput sequence data, 2010. 395 Anthony M. Bolger, Marc Lohse, and Bjoern Usadel. Trimmomatic: a flexible trimmer for Illumina sequence data. 30(15):2114 2120, 2014. Bioinformatics, 396 Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Broad Institute of Harvard and MIT, pages 1 3, 2013. 397 Genome Research Limited. Samtools- utilities for the sequence alignment/map (SAM) format; URL: http://www.htslib.org/doc/samtools.html, 2016. 398 E. Seshan Venkatraman and Adam B. Olshen. Dnacopy: a package for analyzing DNA copy data. Department of Epidemiology and Biostatistics. Memorial Sloan-Kettering Cancer Center, 2007. 399 Yan Guo, Fei Ye, Quanghu Sheng, Travis Clark, and David C. Samuels. Three-stage quality control strategies for DNA re-sequencing data. Briefings in Bioinformatics, 15(6):879 889, 2014. 400 Thomas L. Clarke, Maria Pilar Sanchez-Bailon, Kelly Chiang, John J. Reynolds, Joaquin Herrero-Ruiz, and Tiago M. Bandeiras et al. PRMT5-dependent methylation of the TIP60 coactivator RUVBL1 is a key regulator of homologous recombination. Molecular Cell, 65(5):900 916, 2017. 401 Bhavna Kumar, Arti Yadav, Nicole V. Brown, Songzhu Zhao, Michael J. Cipolla, and Paul E. Wakely et al. Nuclear PRMT5, cyclin D1 and IL-6 are associated with poor outcome in oropharyngeal squamous cell carcinoma patients and is inversely associated with p16-status. Oncotarget, 8(9):14847 14859, 2017. 193

402 Hao Yang, Xiaoping Zhao, Li Zhao, Liu Liu, Jiajin Li, and Wenzhi Jia et al. PRMT5 competitively binds to CDK4 to promote G1-S transition upon glucose induction in hepatocellular carcinoma. Oncotarget, 7(44):72131, 2016. 403 Xiaxin Deng, Guoqiang Shao, Hong-Tao Zhang, Chunyan Li, Dajie Zhang, and Li Cheng et al. Protein arginine methyltransferase 5 functions as an epigenetic activator of the androgen receptor to promote prostate cancer cell growth. Oncogene, 36(9):1223 1231, 2017. 404 Yan Sheng, Hongtao Wang, Dongchen Liu, Cheng Zhang, Yupeng Deng, and Fan Yang et al. Methylation of tumor suppressor gene CDH13 and SHP1 promoters and their epigenetic regulation by the UHRF1/PRMT5 complex in endometrial carcinoma. Gynecologic Oncology, 140(1):145 151, 2016. 405 H. Chen, Benjamin Lorton, Vijayalaxmi Gupta, and David Shechter. A TGFB-PRMT5-MEP50 axis regulates cancer cell invasion through histone H3 and H4 arginine methylation coupled transcriptional activation and repression. Oncogene, 36(3):373 386, 2017. 406 Annie Rochette, Nadia Boufaied, Eleonora Scarlata, Lucie Hamel, Fadi Brimo, and Hayley C. Whitaker et al. Asporin is a stromally expressed marker associated with prostate cancer progression. British Journal of Cancer, 116(6):775 784, 2017. 407 Paula J. Hurley, Debasish Sundi, Brian Shinder, Brian W. Simons, Robert M. Hughes, and Rebecca M. Miller et al. Germline variants in asporin vary by race, modulate the tumor microenvironment, and are differentially associated with metastatic prostate cancer. Clinical Cancer Research, 22(2):448, 2016. 408 Pamela Maris, Arnaud Blomme, Ana Perez Palacios, Brunella Costanza, Akeila Bellahcene, and Elettra Bianchi et al. Asporin is a fibroblast-derived TGF-B1 inhibitor and a tumor suppressor associated with good prognosis in breast cancer. PLOS Medicine, 12(9):e1001871, 2015. 409 Qian Ding, Mei Zhang, and Can Liu. Asporin participates in gastric cancer cell growth and migration by influencing EGF receptor signaling. 
Oncology Reports, 33(4):1783 1790, 2015. 194

410 Rika Satoyoshi, Sei Kuriyama, Namiko Aiba, Masakazu Yashiro, and Masamitsu Tanaka. Asporin activates coordinated invasion of scirrhous gastric cancer and cancer-associated fibroblasts. Oncogene, 34(5):650 660, 2015. 411 Andrei Turtoi, Davide Musmeci, Yinghong Wang, Bruno Dumont, Joan Somja, and Generoso Bevilacqua et al. Identification of novel accessible proteins bearing diagnostic and therapeutic potential in human pancreatic ductal adenocarcinoma. Journal of Proteome Research, 10(9):4302 4313, 2011. 412 Thai H. Ho, Daniel J. Serie, Mansi Parasramka, John C. Cheville, Brian M. Bot, and Weihong Tan et al. Differential gene expression profiling of matched primary renal cell carcinoma and metastases reveals upregulation of extracellular matrix genes. Annals of Oncology, 28(3):604 610, 2017. 413 Magdalena Zakrzewska, Wojciech Fendler, Krzysztof Zakrzewski, Beata Sikorska, Wieslawa Grajkowska, and Bozenna Dembowska-Baginska et al. Altered microrna expression is associated with tumor grade, molecular background and outcome in childhood infratentorial ependymoma. PLOS ONE, 11(7):e0158464, 2016. 414 John Richard McPherson, Choon-Kiat Ong, Cedric Chuan-Young Ng, Vikneswari Rajasegaran, Hong-Lee Heng, and Willie Shun-Shing Yu et al. Whole-exome sequencing of breast cancer, malignant peripheral nerve sheath tumor and neurofibroma from a patient with neurofibromatosis type 1. Cancer Medicine, 4(12):1871 1878, 2015. 415 Pooja Ganguly and Niladri Ganguly. Transcriptomic analyses of genes differentially expressed by high-risk and low-risk human papilloma virus E6 oncoproteins. VirusDisease, 26(3):105 116, 2015. 416 O.A. Simonova, Ekaterina B. Kuznetsova, Elena V. Poddubskaya, Tatiana V. Kekeeva, R.A. Kerimov, and I.D. Trotsenko et al. DNA methylation in the promoter regions of the laminin family genes in normal and breast carcinoma tissues. Molecular Biology, 49(4):598 607, 2015. 195

417 Anbarasu Lourdusamy, Ruman Rahman, Stuart Smith, and Richard Grundy. microrna network analysis identifies mir-29 cluster as key regulator of LAMA2 in ependymoma. Acta Neuropathologica Communications, 3(1):26 30, 2015. 418 Radoslaw Januchowski, Piotr Zawierucha, Marcin Rucinski, and Maciej Zabel. Microarray-based detection and expression analysis of extracellular matrix proteins in drug-resistant ovarian cancer cell lines. Oncology Reports, 32(5):1981 1990, 2014. 419 Suchit Jhunjhunwala, Zhaoshi Jiang, Eric W. Stawiski, Florian Gnad, Jinfeng Liu, and Oleg Mayba et al. Diverse modes of genomic alteration in hepatocellular carcinoma. Genome Biology, 15(8):436 450, 2014. 420 Radoslaw Januchowski, Piotr Zawierucha, Marcin Rucinski, Michal Nowicki, and Maciej Zabel. Extracellular matrix proteins expression profiling in chemoresistant variants of the A2780 ovarian cancer cell line. BioMed Research International, 2014:1 9, 2014. 421 Akiko Niibori-Nambu, Uichi Midorikawa, Souhei Mizuguchi, Takuichiro Hide, Minako Nagai, and Yoshihiro Komohara et al. Glioma initiating cells form a differentiation niche via the induction of extracellular matrices and integrin av. PLOS ONE, 8(5):e59558, 2013. 422 Rong Sheng Ni, Xiaohui Shen, Xiaoyun Qian, Chenjie Yu, Haiyan Wu, and X.I.A. Gao. Detection of differentially expressed genes and association with clinicopathological features in laryngeal squamous cell carcinoma. Oncology Letters, 4(6):1354 1360, 2012. 423 Dwain Mefford and Joel Mefford. Stromal genes add prognostic information to proliferation and histoclinical markers: a basis for the next generation of breast cancer gene signatures. PLOS ONE, 7(6):e37646, 2012. 424 Sunwoo Lee, Taejeong Oh, Hyuncheol Chung, Sunyoung Rha, Changjin Kim, and Youngho Moon et al. Identification of GABRA1 and LAMA2 as new DNA methylation markers in colorectal cancer. International Journal of Oncology, 40(3):889 898, 2012. 196

425 Yizhu Lyu, Jiacheng Lou, Yan Yang, Jiuxing Feng, Yuchao Hao, and Shuyu Huang et al. Dysfunction of the WT1-MEG3 signaling promotes AML leukemogenesis via p53 dependent and independent pathways. Leukemia, 2017. 426 Piotr Ciesielski, Pawel Jozwiak, Katarzyna Wojcik-Krowiranda, Ewa Forma, Lukasz Cwonda, and Sylwia Szczepaniec et al. Differential expression of ten-eleven translocation genes in endometrial cancers. Tumor Biology, 39(3):1 8, 2017. 427 Yoko Kubuki, Takumi Yamaji, Tomonori Hidaka, Takuro Kameda, Kotaro Shide, and Masaaki Sekine et al. TET2 mutation in diffuse large B-cell lymphoma. Journal of Clinical and Experimental Hematopathology, 56(3):145 149, 2017. 428 Lars Bullinger, Konstanze Dohner, and Hartmut Dohner. Genomics of acute myeloid leukemia diagnosis and pathways. 35(9):934 946, 2017. Journal of Clinical Oncology, 429 Gholamreza Bahari, Mohammad Hashemi, Majid Naderi, and Mohsen Taheri. TET2 promoter DNA methylation and expression in childhood acute lymphoblastic leukemia. Asian Pacific Journal of Cancer Prevention, 17(8):3959 3962, 2016. 430 Satoshi Chiba. Significance of TET2 mutations in myeloid and lymphoid neoplasms. [Rinshoo Ketsueki] The Japanese Journal of Clinical Hematology, 57(6):715 722, 2016. 431 Joseph H.R. Hetmanski, Egor Zindy, Jean-Marc Schwartz, and Patrick T. Caswell. A MAPK-driven feedback loop suppresses Rac activity to promote RhoA-driven cancer cell invasion. PLOS Computational Biology, 12(5):e1004909, 2016. 432 Pascale Monzo, Yuk Kien Chong, Charlotte Guetta-Terrier, Anitha Krishnasamy, Sharvari R. Sathe, and Evelyn K.F. Yim et al. Mechanical confinement triggers glioma linear migration dependent on formin FHOD3. Molecular Biology of the Cell, 27(8):1246 1261, 2016. 197

433 Li Chai, Jia Li, and Zhongwei Lv. An integrated analysis of cancer genes in thyroid cancer. Oncology Reports, 35(2):962 970, 2016. 434 Nikki R. Paul, Jennifer L. Allen, Anna Chapman, Maria Morlan-Mairal, Egor Zindy, and Guillaume Jacquem et al. a5b1 integrin recycling promotes Arp2/3-independent cancer cell invasion via the formin FHOD3. The Journal of Cell Biology, 210(6):1013 1031, 2015. 435 Deborah French, Wenjian Yang, Cheng Cheng, Susana C. Raimondi, Charles G. Mullighan, and James R. Downing et al. Acquired variation outweighs inherited variation in whole genome analysis of methotrexate polyglutamate accumulation in leukemia. Blood, 113(19):4512 4520, 2009. 436 Zongping Wang, Jie Kang, Xianzhao Deng, Bomin Guo, Bo Wu, and Youben Fan. Knockdown of GATAD2A suppresses cell proliferation in thyroid cancer in vitro. Oncology Reports, 37(4):2147 2152, 2017. 437 Cornelia G. Spruijt, Martijn S. Luijsterburg, Roberta Menafra, Rik G.H. Lindeboom, Pascal W.T.C. Jansen, and Raghu Ram Edupuganti et al. ZMYND8 co-localizes with NuRD on target genes and regulates poly(adp-ribose)-dependent recruitment of GATAD2A/NuRD to sites of DNA damage. Cell Reports, 17(3):783 798, 2016. 438 Siddhartha P. Kar, Jonathan Beesley, Ali Amin Al Olama, Kyriaki Michailidou, Jonathan Tyrer, and ZSofia Kote-Jarai et al. Genome-wide meta-analyses of breast, ovarian, and prostate cancer association studies identify multiple new susceptibility loci shared by at least two cancer types. Cancer Discovery, 6(9):1052 1067, 2016. 439 Venkatadri Kolla, Koumudi Naraparaju, Tiangang Zhuang, Mayumi Higashi, Sriharsha Kolla, Gerd A. Blobel, and Garrett M. Brodeur. The tumour suppressor CHD5 forms a NuRD-type chromatin remodelling complex. Biochemical Journal, 468(2):345 352, 2015. 440 Morgan P. Torchy, Ali Hamiche, and Bruno P. Klaholz. Structure and function insights into the NuRD chromatin remodeling complex. Cellular and Molecular Life Sciences, 72(13):2491 2507, 2015. 198

441 Sarah E. Mahoney, Zizhen Yao, C. Chip Keyes, Stephen J. Tapscott, and Scott J. Diede. Genome-wide DNA methylation studies suggest distinct DNA methylation patterns in pediatric embryonal and alveolar rhabdomyosarcomas. Epigenetics, 7(4):400 408, 2012. 442 Eric I. Zimmerman, Alice A. Gibson, Shuiying Hu, Aksana Vasilyeva, Shelley J. Orwick, and Guoqing Du et al. Multikinase inhibitors induce cutaneous toxicity through OAT6-mediated uptake and MAP3K7-driven cell death. Cancer Research, 76(1):117, 2016. 443 Fanfan Zhou and Guofeng You. Molecular insights into the structure-function relationship of organic anion transporters OATs. Pharmaceutical Research, 24(1):28 36, 2007. 444 Wei Cao, Enguang Ma, Li Zhou, Tan Yuan, and Chunying Zhang. Exploring the FGFR3-related oncogenic mechanism in bladder cancer using bioinformatics strategy. World Journal of Surgical Oncology, 15(1):66 73, 2017. 445 Vivien Koh, Hsueh Yin Kwan, Woei Loon Tan, Tzia Liang Mah, and Wei Peng Yong. Knockdown of POLA2 increases gemcitabine resistance in lung cancer cells. BMC Genomics, 17(13):1029 138, 2016. 446 Scooter Willis, Victor M. Villalobos, Olivier Gevaert, Mark Abramovitz, Casey Williams, Branimir I. Sikic, and Brian Leyland-Jones. Single gene prognostic biomarkers in ovarian cancer: a meta-analysis. PLOS ONE, 11(2):e0149183, 2016. 447 Guhyun Kang, Hongseok Yun, Choong-Hyun Sun, Inho Park, Seungmook Lee, and Jekeun Kwon et al. Integrated genomic analyses identify frequent gene fusion events and VHL inactivation in gastrointestinal stromal tumors. Oncotarget, 7(6):6538 6551, 2016. 448 Tzia Liang Mah, Xin Ning Adeline Yap, Vachiranee Limviphuvadh, Nanpu Li, Srinath Sridharan, and Vellaisemy Kuralmani et al. Novel SNP improves differential survivability and mortality in non-small cell lung cancer patients. BMC Genomics, 15(9):S20 S27, 2014. 199

449 Oluf Dimitri Roe, Adam Szulkin, Endre Anderssen, Arnar Flatberg, Helmut Sandeck, and Tore Amundsen et al. Molecular resistance fingerprint of pemetrexed and platinum in a long-term survivor of mesothelioma. PLOS ONE, 7(8):e40521, 2012. 450 Fotis A. Asimakopoulos, Pesach J. Shteper, Svetlana Krichevsky, Eitan Fibach, Aaron Polliack, and Eliezer Rachmilewitz et al. ABL1 methylation is a distinct molecular event associated with clonal evolution of chronic myeloid leukemia. Blood, 94(7):2452 2460, 1999. 451 Adina Aviram, Bruria Witenberg, Mati Shaklai, and Dorit Blickstein. Detection of methylated ABL1 promoter in philadelphia-negative myeloproliferative disorders. Blood Cells, Molecules, and Diseases, 30(1):100 106, 2003. 452 Baodong Sun, Guanchao Jiang, Muhammad-Ali A. Zaydan, Vincent F. La Russa, Hana Safah, and Melanie Ehrlich. ABL1 promoter methylation can exist independently of BCR-ABL transcription in chronic myeloid leukemia hematopoietic progenitors. Cancer Research, 61(18):6931 6937, 2001. 453 Jing Jin Gu, Clay Rouse, Xia Xu, Jun Wang, Mark W. Onaitis, and Ann Marie Pendergast. Inactivation of ABL kinases suppresses non-small cell lung cancer metastasis. JCI Insight, 1(21):1 16, 2016. 454 Jean-Philippe Foy, Curtis R. Pickering, Vassiliki A. Papadimitrakopoulou, Jaroslav Jelinek, Steven H. Lin, and William N. William et al. New DNA methylation markers and global DNA hypomethylation are associated with oral cancer development. Cancer Prevention Research, 8(11):1027 1035, 2015. 455 Eun-Joon Lee, Prakash Rath, Jimei Liu, Dungsung Ryu, Lirong Pei, and Satish K. Noonepalle et al. Identification of global DNA methylation signatures in glioblastoma-derived cancer stem cells. Journal of Genetics and Genomics, 42(7):355 371, 2015. 456 Jean-Pierre Roperch, Karim Benzekri, Hicham Mansour, and Roberto Incitti. Improved amplification efficiency on stool samples by addition of 200

spermidine and its use for non-invasive detection of colorectal cancer. BMC Biotechnology, 15(1):41 49, 2015. 457 Nadia Ashour, Javier C. Angulo, Guillermo Andres, Raul Alelu, Ana Gonzalez-Corpas, and Maria V. Toledo et al. A DNA hypermethylation profile reveals new potential biomarkers for prostate cancer diagnosis and prognosis. The Prostate, 74(12):1171 1182, 2014. 458 Bodour Salhia, Jeff Kiefer, Julianna T.D. Ross, Raghu Metapally, Rae Anne Martinez, and Kyle N. Johnson et al. Integrated genomic and epigenomic analysis of breast cancer brain metastasis. PLOS ONE, 9(1):e85448, 2014. 459 Jean-Pierre Roperch, Roberto Incitti, Solene Forbin, Floriane Bard, Hicham Mansour, and Farida Mesli et al. Aberrant methylation of NPY, PENK, and WIF1 as a promising marker for blood-based diagnosis of colorectal cancer. BMC Cancer, 13(1):566 576, 2013. 460 Masahiro Shitani, Shigeru Sasaki, Noriyuki Akutsu, Hideyasu Takagi, Hiromu Suzuki, and Masanori Nojima et al. Genome-wide analysis of DNA methylation identifies novel cancer-related genes in hepatocellular carcinoma. Tumor Biology, 33(5):1307 1317, 2012. 461 Yugo Kishida, Atsushi Natsume, Yutaka Kondo, Ichiro Takeuchi, Byonggu An, and Yasuyuki Okamoto et al. Epigenetic subclassification of meningiomas based on genome-wide DNA methylation analyses. Carcinogenesis, 33(2):436 441, 2012. 462 Woonbok Chung, Jolanta Bondaruk, Jaroslav Jelinek, Yair Lotan, Shoudan Liang, Bogdan Czerniak, and Jean-Pierre J. Issa. Detection of bladder cancer using novel DNA methylation biomarkers in urine sediments. Cancer Epidemiology Biomarkers & Prevention, 20(7):1483 1491, 2011. 463 Ji Un Kang, Sun Hoe Koo, Kye Chul Kwon, Jong Woo Park, and Jin Man Kim. Gain at chromosomal region 5p15.33, containing TERT, is the most frequent genetic event in early stages of non-small cell lung cancer. Cancer Genetics and Cytogenetics, 182(1):1 11, 2008. 464 Yunyu Chen, Jing Zhang, Dongsheng Li, Jiandong Jiang, Yanchang Wang, and Shuyi Si. 
Identification of a novel Polo-like kinase 1 inhibitor 201

that specifically blocks the functions of Polo-Box domain. 8(1):1234 1246, 2016. Oncotarget, 465 Baochi Ou, Jingkun Zhao, Shaopei Guan, Xiongzhi Wangpu, Congcong Zhu, and Yaping Zong et al. PLK2 promotes tumor growth and inhibits apoptosis by targeting Fbxw7/Cyclin E in colorectal cancer. Cancer Letters, 380(2):457 466, 2016. 466 Fei Liu, Shimeng Zhang, Zhen Zhao, Xinru Mao, Jinlan Huang, and Zixian Wu et al. MicroRNA-27b up-regulated by human papillomavirus 16 E7 promotes proliferation and suppresses apoptosis by targeting polo-like kinase2 in cervical cancer. Oncotarget, 7(15):19666 19679, 2016. 467 Jia-Hui Xu, Shi-Lian Hu, Guo-Dong Shen, and Gan Shen. Tumor suppressor genes and their underlying interactions in paclitaxel resistance in cancer therapy. Cancer Cell International, 16(1):13 23, 2016. 468 M.V. Ramana Reddy, Balireddy Akula, Shashidhar Jatiani, Rodrigo Vasquez-Del Carpio, Vinay K. Billa, and Muralidhar R. Mallireddigari et al. Discovery of 2-(1H-indol-5-ylamino)-6-(2,4-difluorophenylsulfonyl)-8-methylpyrido [2,3-d]pyrimidin-7(8H)-one (7ao) as a potent selective inhibitor of Polo like kinase 2 (PLK2). Bioorganic & Medicinal Chemistry, 24(4):521 544, 2016. 469 Zheng Bo Hu, Xiao Hong Liao, Zun Ying Xu, Xiao Yang, Chao Dong, An Min Jin, and Hai Lu. PLK2 phosphorylates and inhibits enriched TAp73 in human osteosarcoma cells. Cancer Medicine, 5(1):74 87, 2016. 470 Li Ying Liu, Wei Wang, Ling Yu Zhao, Bo Guo, Juan Yang, and Xiao Ge Zhao et al. Silencing of polo-like kinase 2 increases cell proliferation and decreases apoptosis in SGC-7901 gastric cancer cells. Molecular Medicine Reports, 11(4):3033 3038, 2015. 471 Cheng-Wei Li and Bor-Sen Chen. Investigating core genetic-and-epigenetic cell cycle networks for stemness and carcinogenic mechanisms, and cancer drug design using big database mining and genome-wide next-generation sequencing data. Cell Cycle, 15(19):2593 2607, 2016. 202

472 Vishal Kothari, Iris Wei, Sunita Shankar, Shanker Kalyana-Sundaram, Lidong Wang, and Linda W. Ma et al. Outlier kinase expression by RNA sequencing as targets for precision therapy. Cancer Discovery, 3(3):280 293, 2013. 473 Tobias Berg, Gesine Bug, Oliver G. Ottmann, and Klaus Strebhardt. Polo-like kinases in AML. 21(8):1069 1074, 2012. Expert Opinion on Investigational Drugs, 474 Helen M. Coley, Eleftheria Hatzimichael, Sarah Blagden, Iain McNeish, Alastair Thompson, Tim Crook, and Nelofer Syed. Polo like kinase 2 tumour suppressor and cancer biomarker: new perspectives on drug sensitivity/resistance in ovarian cancer. Oncotarget, 3(1):78 83, 2012. 475 Lalji K. Gediya, Aakanksha Khandelwal, Jyoti Patel, Aashvini Belosay, Gauri Sabnis, and Jhalak et al. Mehta. Design, synthesis, and evaluation of novel mutual prodrugs (hybrid drugs) of all-trans-retinoic acid and histone deacetylase inhibitors with enhanced anticancer activities in breast and prostate cancer cells in vitro. Journal of Medicinal Chemistry, 51(13):3895 3904, 2008. 476 Mon-Ju Wu, Mi Ra Kim, Yu-Shan Chen, Jun-Yi Yang, and Chia-Jung Chang. Retinoic acid directs breast cancer cell state changes through regulation of TET2-PKC pathway. Oncogene, 36(22):3193 3206, 2017. 477 Liyan Qu and Xiuwen Tang. Bexarotene: a promising anticancer agent. Cancer Chemotherapy and Pharmacology, 65(2):201 205, 2010. 478 Martin P. Powers, Wei-Lien Wang, Vivian S. Hernandez, Kayuri S. Patel, Dina C. Lev, Alexander J. Lazar, and Dolores H. Lopez-Terrada. Detection of myxoid liposarcoma-associated FUS-DDIT3 rearrangement variants including a newly identified breakpoint using an optimized RT-PCR assay. Modern Pathology, 23(10):1307 1315, 2010. 479 Carola Andersson, Henrik Fagman, Magnus Hansson, and Fredrik Enlund. Profiling of potential driver mutations in sarcomas by targeted next generation sequencing. Cancer Genetics, 209(4):154 160, 2016. 203

480 Yoshinao Oda, Hidetaka Yamamoto, Tomonari Takahira, Chikashi Kobayashi, Kenichi Kawaguchi, and Naomi Tateishi et al. Frequent alteration of p16ink4a/p14arf and p53 pathways in the round cell component of myxoid/round cell liposarcoma: p53 gene alterations and reduced p14arf expression both correlate with poor prognosis. The Journal of Pathology, 207(4):410-421, 2005.

481 Jordi Barretina, Barry S. Taylor, Shantanu Banerji, Alexis H. Ramos, Mariana Lagos-Quintana, and Penelope L. DeCarolis et al. Subtype-specific genomic alterations define new targets for soft-tissue sarcoma therapy. Nature Genetics, 42(8):715-721, 2010.

482 Elizabeth G. Demicco, Keila E. Torres, Markus P. Ghadimi, Chiara Colombo, Svetlana Bolshakov, and Aviad Hoffman et al. Involvement of the PI3K/Akt pathway in myxoid/round cell liposarcoma. Modern Pathology, 25(2):212-221, 2012.

483 Tsuyoshi Saito, Keisuke Akaike, Aiko Kurisaki-Arakawa, Midori Toda-Ishii, Kenta Mukaihara, and Yoshiyuki Suehara et al. TERT promoter mutations are rare in bone and soft tissue sarcomas of Japanese patients. Molecular and Clinical Oncology, 4(1):61-64, 2016.

484 Christian Koelsche, Marcus Renner, Wolfgang Hartmann, Regine Brandt, Burkhard Lehner, and Nina Waldburger et al. TERT promoter hotspot mutations are recurrent in myxoid liposarcomas but rare in other soft tissue sarcoma entities. Journal of Experimental & Clinical Cancer Research, 33(1):33-40, 2014.

485 Marieke A. de Graaff, Jamie S.E. Yu, Hannah C. Beird, Davis R. Ingram, Theresa Nguyen, and Jeffrey Juehui Liu et al. Establishment and characterization of a new human myxoid liposarcoma cell line (DL-221) with the FUS-DDIT3 translocation. Laboratory Investigation, 96(8):885-894, 2016.

486 Cristina R. Antonescu, Sylvia J. Tschernyavsky, Ramona Decuseara, Denis H. Leung, James M. Woodruff, and Murray F. Brennan et al. Prognostic impact of P53 status, TLS-CHOP fusion transcript structure, and histological grade in myxoid liposarcoma. Clinical Cancer Research, 7(12):3977-3987, 2001.

487 Aviad Hoffman, Markus P.H. Ghadimi, Elizabeth G. Demicco, Chad J. Creighton, Keila Torres, and Chiara Colombo et al. Localized and metastatic myxoid/round cell liposarcoma. Cancer, 119(10):1868-1877, 2013.

488 Christine G. Joseph, Heejung Hwang, Yuchen Jiao, Laura D. Wood, Isaac Kinde, and Jian Wu et al. Exomic analysis of myxoid liposarcomas, synovial sarcomas, and osteosarcomas. Genes, Chromosomes and Cancer, 53(1):15-24, 2014.

489 Sarah Uboldi, Enrica Calura, Luca Beltrame, Ilaria Fuso Nerini, Sergio Marchini, and Duccio Cavalieri et al. A systems biology approach to characterize the regulatory networks leading to trabectedin resistance in an in vitro model of myxoid liposarcoma. PLOS ONE, 7(4):e35423, 2012.

490 Walter Pavicic, Esa Perkio, Sippy Kaur, and Paivi Peltomaki. Altered methylation at microrna-associated CpG islands in hereditary and sporadic carcinomas: a methylation-specific multiplex ligation-dependent probe amplification (MS-MLPA)-based approach. Molecular Medicine, 17(7-8):726-735, 2011.

491 Lina Albitar, Gavin Pickett, Marilee Morgan, Suzy Davies, and Kimberly K. Leslie. Models representing type I and type II human endometrial cancers: Ishikawa H and Hec50co cells. Gynecologic Oncology, 106(1):52-64, 2007.

492 Karin Milde-Langosch, Christoph Goemann, Carola Methner, Gabriele Rieck, Ana-Maria Bamberger, and Thomas Loning. Expression of Rb2/p130 in breast and endometrial cancer: correlations with hormone receptor status. British Journal of Cancer, 85(4):546-551, 2001.

493 Amit Nahum, Keren Hirsch, Michael Danilenko, Colin K.W. Watts, Owen W.J. Prall, Joseph Levy, and Yoav Sharoni. Lycopene inhibition of cell cycle progression in breast and endometrial cancer cells is associated with reduction in cyclin D levels and retention of p27kip1 in the cyclin E-cdk2 complexes. Oncogene, 20(26):3428, 2001.

494 Tommaso Susini, Daniela Massi, Milena Paglierani, Valeria Masciullo, Giovanni Scambia, and Antonio Giordano et al. Expression of the retinoblastoma-related gene Rb2/p130 is downregulated in atypical endometrial hyperplasia and adenocarcinoma. Human Pathology, 32(4):360-367, 2001.

495 Tommaso Susini, Feliciano Baldi, Candace M. Howard, Alfonso Baldi, Gianluigi Taddei, and Daniela Massi et al. Expression of the retinoblastoma-related gene Rb2/p130 correlates with clinical outcome in endometrial cancer. Journal of Clinical Oncology, 16(3):1085-1093, 1998.

496 Mina Massaro-Giordano, Gianluca Baldi, Antonio De Luca, Alfonso Baldi, and Antonio Giordano. Differential expression of the retinoblastoma gene family members in choroidal melanoma: prognostic significance. Clinical Cancer Research, 5(6):1455, 1999.

497 Maria Pardo, Antonio Pineiro, Maria de la Fuente, Angel Garcia, Sripadi Prabhakar, and Nicole Zitzmann et al. Abnormal cell cycle regulation in primary human uveal melanoma cultures. Journal of Cellular Biochemistry, 93(4):708-720, 2004.

498 Vasily A. Yakovlev. Nitric oxide-dependent downregulation of BRCA1 expression promotes genetic instability. Cancer Research, 73(2):706, 2013.

499 Cinti Caterina, Macaluso Marcella, and Antonio Giordano. Tumor-specific exon 1 mutations could be the hit event predisposing Rb2/p130 gene to epigenetic silencing in lung cancer. Oncogene, 24(38):5821-5826, 2005.

500 Hu Xue Jun, Akihiko Gemma, Yoko Hosoya, Kuniko Matsuda, Michiya Nara, and Yukio Hosomi et al. Reduced transcription of the RB2/p130 gene in human lung cancer. Molecular Carcinogenesis, 38(3):124-129, 2003.

501 Giuseppe Russo, Pier Paolo Claudio, Yan Fu, Peter Stiegler, Zailin Yu, Marcella Macaluso, and Antonio Giordano. prb2/p130 target genes in non-small lung cancer cells identified by microarray analysis. Oncogene, 22(44):6959-6969, 2003.

502 Sanjay Modi, Akihito Kubo, Herbert Oie, Amy B. Coxon, Ahad Rehmatulla, and Frederic J. Kaye. Protein expression of the RB-related gene family and SV40 large T antigen in mesothelioma and lung cancer. Oncogene, 19(40):4632, 2000.

503 Pier Paolo Claudio, Mario Caputi, and Antonio Giordano. The RB2/p130 gene: the latest weapon in the war against lung cancer? Clinical Cancer Research, 6(3):754, 2000.

504 Alfonso Baldi, Vincenzo Esposito, Antonio De Luca, Yan Fu, Ilernando Meoli, and Giovan G. Giordano et al. Differential expression of Rb2/p130 and p107 in normal human tissues and in primary lung cancer. Clinical Cancer Research, 3(10):1691, 1997.

505 Luciano Mutti, Antonio De Luca, Pier Paolo Claudio, Giuseppe Convertino, Michele Carbone, and Antonio Giordano. Simian virus 40-like DNA sequences and large-t antigen-retinoblastoma family protein prb2/p130 interaction in human mesothelioma. Developments in Biological Standardization, 94:47-53, 1997.

506 Kristian Helin, Karin Holm, Anita Niebuhr, Hans Eiberg, Niels Tommerup, and Susanne Hougaard et al. Loss of the retinoblastoma protein-related p130 protein in small cell lung carcinoma. Proceedings of the National Academy of Sciences of the United States of America, 94(13):6933-6938, 1997.

507 Steven G. Gray, Xiang Guo, Darek Kedra, Bin T. Teh, and Hua-Qing Min. Correspondence re: P.P. Claudio et al., Mutations in the Retinoblastoma-related Gene RB2/p130 in Primary Nasopharyngeal Carcinoma. Cancer Res., 60: 8-12, 2000. Cancer Research, 61(15):5950-5951, 2001.

508 Pier Paolo Claudio, Candace M. Howard, Alfonso Baldi, Antonio De Luca, Yan Fu, and Gianluigi Condorelli et al. p130/prb2 has growth suppressive properties similar to yet distinctive from those of retinoblastoma family members prb and p107. Cancer Research, 54(21):5556, 1994.

509 Francesco P. Jori, Umberto Galderisi, Elena Piegari, Gianfranco Peluso, Marilena Cipollaro, and Antonio Cascino et al. RB2/p130 ectopic gene expression in neuroblastoma stem cells: evidence of cell-fate restriction and induction of differentiation. Biochemical Journal, 360(3):569, 2001.

510 Giuseppe Raschella, Barbara Tanno, Francesco Bonetto, Roberto Amendola, Tullio Battista, and Antonio De Luca et al. Retinoblastoma-related protein prb2/p130 and its binding to the B-myb promoter increase during human neuroblastoma differentiation. Journal of Cellular Biochemistry, 67(3):297-303, 1997.

511 Giuseppe Raschella, Barbara Tanno, Francesco Bonetto, Anna Negroni, Pier Paolo Claudio, and Alfonso Baldi et al. The RB-related gene Rb2/p130 in neuroblastoma differentiation and in B-myb promoter down-regulation. Cell Death and Differentiation, 5(5):401-407, 1998.

512 Riccardo Di Fiore, Antonella D'Anneo, Giovanni Tesoriere, and Renza Vento. RB1 in cancer: different mechanisms of RB1 inactivation and alterations of prb pathway in tumorigenesis. Journal of Cellular Physiology, 228(8):1676-1687, 2013.

513 Iva Simeonova, Vincent Lejour, Boris Bardot, Rachida Bouarich-Bourimi, Aurelie Morin, and Ming Fang et al. Fuzzy tandem repeats containing p53 response elements may define species-specific p53 target genes. PLOS Genetics, 8(6):e1002731, 2012.

514 Zena Lim and Boon Long Quah. Unilateral retinoblastoma in an eye with Peters anomaly. Journal of American Association for Pediatric Ophthalmology and Strabismus, 14(2):184-186, 2010.

515 Paola Indovina, Antonio Acquaviva, Giulia De Falco, Valeria Rizzo, Anna Onnis, and Anna Luzzi et al. Downregulation and aberrant promoter methylation of p16ink4a: a possible novel heritable susceptibility marker to retinoblastoma. Journal of Cellular Physiology, 223(1):143-150, 2010.

516 Peh-Yean Cheah. The emerging role of RBL2/p130 in multi-step retinoblastoma tumorigenesis. Cancer Biology & Therapy, 8(8):718-719, 2009.

517 Kadam Priya, Srinivasa Rao Jada, Boon Long Quah, Thuan Chong Quah, and Poh San Lai. High incidence of allelic loss at 16q12.2 region spanning RB2/p130 gene in retinoblastoma. Cancer Biology & Therapy, 8(8):714-717, 2009.

518 David MacPherson, Karina Conkrite, Mandy Tam, Shizuo Mukai, David Mu, and Tyler Jacks. Murine bilateral retinoblastoma exhibiting rapid-onset, metastatic progression and N-myc gene amplification. The EMBO Journal, 26(3):784-794, 2007.

519 Gian Marco Tosi, Carmela Trimarchi, Marcella Macaluso, Dario La Sala, Alfredo Ciccodicola, and Stefano Lazzi et al. Genetic and epigenetic alterations of RB2/p130 tumor suppressor gene in human sporadic retinoblastoma: implications for pathogenesis and therapeutic approach. Oncogene, 24(38):5827-5836, 2005.

520 David MacPherson, Julien Sage, Teresa Kim, Dennis Ho, Margaret E. McLaughlin, and Tyler Jacks. Cell type-specific effects of Rb deletion in the murine retina. Genes & Development, 18(14):1681-1694, 2004.

521 Marie Classon and Ed Harlow. The retinoblastoma tumour suppressor in development and cancer. Nature Reviews Cancer, 2(12):910-917, 2002.

522 Cristiana Bellan, Giulia De Falco, Gian Marco Tosi, Stefano Lazzi, Filomena Ferrari, and Giovanna Morbini et al. Missing expression of prb2/p130 in human retinoblastomas is associated with reduced apoptosis and lesser differentiation. Investigative Ophthalmology & Visual Science, 43(12):3602-3608, 2002.

523 William R. Sellers, Bennett G. Novitch, Satoshi Miyake, Agnieszka Heith, Gregory A. Otterson, and Frederic J. Kaye et al. Stable binding to E2F is not required for the retinoblastoma protein to activate transcription, promote differentiation, and suppress tumor cell growth. Genes & Development, 12(1):95-106, 1998.

524 Yukiharu Sawada, Hajime Nomura, Yuichi Endo, Kazumi Umeki, Teizo Fujita, Sachiya Ohtaki, and Kei Fujinaga. Cloning and characterization of the rat p130, a member of the retinoblastoma gene family. Biochimica et Biophysica Acta (BBA)-Molecular Basis of Disease, 1361(1):20-27, 1997.

525 Alfonso Baldi, Vincenzo Boccia, Pier Paolo Claudio, Antonio De Luca, and Antonio Giordano. Genomic structure of the human retinoblastoma-related Rb2/p130 gene. Proceedings of the National Academy of Sciences of the United States of America, 93(10):4629-4632, 1996.

526 Jacqueline M. Sterner, Yunxia Tao, Sarah B. Kennett, Hyung G. Kim, and Jonathan M. Horowitz. The amino terminus of the retinoblastoma (rb) protein associates with a cyclin-dependent kinase-like kinase via rb amino acids required for growth suppression. Cell Growth & Differentiation, 7(1):53-64, 1996.

527 Peter Whyte. The retinoblastoma protein and its relatives. Seminars in Cancer Biology, 6(2):83-90, 1995.

528 Hugh Cam and Brian David Dynlacht. Emerging roles for E2F: beyond the G1/S transition and DNA replication. Cancer Cell, 3(4):311-316, 2003.

529 Jacob B. Hansen, Hein te Riele, and Karsten Kristiansen. Novel function of the retinoblastoma protein in fat: regulation of white versus brown adipocyte differentiation. Cell Cycle, 3(6):772-776, 2004.

530 Victoria M. Richon, Robert E. Lyle, and Robert E. McGehee. Regulation and expression of retinoblastoma proteins p107 and p130 during 3t3-l1 adipocyte differentiation. Journal of Biological Chemistry, 272(15):10117-10124, 1997.

531 Stefania Capasso, Nicola Alessio, Giovanni Di Bernardo, Marilena Cipollaro, Mariarosa Melone, and Gianfranco Peluso et al. Silencing of RB1 and RB2/P130 during adipogenesis of bone marrow stromal cells results in dysregulated differentiation. Cell Cycle, 13(3):482-490, 2014.

532 Mark F. Pittenger, Alastair M. Mackay, Stephen C. Beck, Rama K. Jaiswal, Robin Douglas, and Joseph D. Mosca et al. Multilineage potential of adult human mesenchymal stem cells. Science, 284(5411):143-147, 1999.

533 Alexander B. Mohseny, Karoly Szuhai, Salvatore Romeo, Emilie P. Buddingh, Inge Briaire-de Bruijn, and Danielle de Jong et al. Osteosarcoma originates from mesenchymal stem cells in consequence of aneuploidization and genomic loss of Cdkn2. The Journal of Pathology, 219(3):294-305, 2009.

534 Nedime Serakinci, Per Guldberg, Jorge S. Burns, Basem Abdallah, Henrik Schrodder, Thomas Jensen, and Moustapha Kassem. Adult human mesenchymal stem cell as a target for neoplastic transformation. Oncogene, 23(29):5095-5098, 2004.

535 Ioannis Panagopoulos, M. Hoglund, Fredrik Mertens, Nils Mandahl, Felix Mitelman, and Pierre Aman. Fusion of the EWS and CHOP genes in myxoid liposarcoma. Oncogene, 12(3):489-494, 1996.

536 Helene Zinszner, John Sok, David Immanuel, Yin Yin, and David Ron. TLS (FUS) binds RNA in vivo and engages in nucleo-cytoplasmic shuttling. Journal of Cell Science, 110(15):1741, 1997.

537 Jessica I. Hoell, Erik Larsson, Simon Runge, Jeffrey D. Nusbaum, Sujitha Duggimpudi, and Thalia A. Farazi et al. RNA targets of wild-type and mutant FET family proteins. Nature Structural & Molecular Biology, 18(12):1428-1431, 2011.

538 Adelene Y. Tan and James L. Manley. The TET family of proteins: Functions and roles in disease. Journal of Molecular Cell Biology, 1(2):82-92, 2009.

539 Nicola D. Roberts, R. Daniel Kortschak, Wendy T. Parker, Andreas W. Schreiber, Susan Branford, and Hamish S. Scott et al. A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics, 29(18):2223-2230, 2013.

540 Anne Bruun Kroigard, Mads Thomassen, Anne-Vibeke Laenkholm, Torben A. Kruse, and Martin Jakob Larsen. Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. PLOS ONE, 11(3):e0151664, 2016.

541 Li Ding, Michael C. Wendl, Daniel C. Koboldt, and Elaine R. Mardis. Analysis of next generation genomic data in cancer: accomplishments and challenges. Human Molecular Genetics, 19(R2):188-196, 2010.

542 Michael Gundry and Jan Vijg. Direct mutation analysis by high-throughput sequencing: from germline to low-abundant, somatic variants. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 729(1):1-15, 2012.

543 Ensel Oh, Yoon-La Choi, Mi Jeong Kwon, Ryong Nam Kim, Yu Jin Kim, and Ji-Young Song et al. Comparison of accuracy of whole-exome sequencing with formalin-fixed paraffin-embedded and fresh frozen tissue samples. PLOS ONE, 10(12):e0144162, 2015.

544 Jan A. Sikorsky, Donald A. Primerano, Terry W. Fenger, and James Denvir. DNA damage reduces Taq DNA polymerase fidelity and PCR amplification efficiency. Biochemical and Biophysical Research Communications, 355(2):431-437, 2007.

545 Hongdo Do and Alexander Dobrovic. Dramatic reduction of sequence artefacts from DNA isolated from formalin-fixed cancer biopsies by treatment with uracil-dna glycosylase. Oncotarget, 3(5):546-558, 2012.

546 Hongdo Do, Stephen Q. Wong, Jason Li, and Alexander Dobrovic. Reducing sequence artifacts in amplicon-based massively parallel sequencing of formalin-fixed paraffin-embedded DNA by enzymatic depletion of uracil-containing templates. Clinical Chemistry, 59(9):1376-1383, 2013.

547 Michael Hofreiter, Viviane Jaenicke, David Serre, Arndt von Haeseler, and Svante Paabo. DNA sequences from multiple amplifications reveal artifacts induced by cytosine deamination in ancient DNA. Nucleic Acids Research, 29(23):4793-4799, 2001.

548 Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina, and Heinz Himmelbauer. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research, 36(16):e105-e105, 2008.

549 Frazer Meacham, Dario Boffelli, Joseph Dhahbi, David I.K. Martin, Meromit Singer, and Lior Pachter. Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics, 12(1):451, 2011.

550 Kensuke Nakamura, Taku Oshima, Takuya Morimoto, Shun Ikeda, Hirofumi Yoshikawa, and Yuh Shiwa et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Research, pages 1-13, 2011.

551 Kym M. Boycott, Megan R. Vanstone, Dennis E. Bulman, and Alex E. MacKenzie. Rare-disease genetics in the era of next-generation sequencing: discovery to translation. Nature Reviews Genetics, 14(10):681-691, 2013.

552 Gregory M. Cooper and Jay Shendure. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nature Reviews Genetics, 12(9):628-640, 2011.

553 Shamil R. Sunyaev. Inferring causality and functional significance of human coding DNA variants. Human Molecular Genetics, 21(R1):R10-R17, 2012.

554 Matthew Zawistowski, Shyam Gopalakrishnan, Jun Ding, Yun Li, Sara Grimm, and Sebastian Zollner. Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. American Journal of Human Genetics, 87(5):604-617, 2010.

555 Martin Ladouceur, Zari Dastani, Yurii S. Aulchenko, Celia M.T. Greenwood, and J. Brent Richards. The empirical power of rare variant association methods: results from sanger sequencing in 1,998 individuals. PLOS Genetics, 8(2):e1002496, 2012.

556 Seunggeun Lee, Goncalo R. Abecasis, Michael Boehnke, and Xihong Lin. Rare-variant association analysis: study designs and statistical tests. American Journal of Human Genetics, 95(1):5-23, 2014.

557 Stephan Morgenthaler and William G. Thilly. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 615(1-2):28-56, 2007.

558 Bingshan Li and Suzanne M. Leal. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics, 83(3):311-321, 2008.

559 Andrew S. Brohl, Rajesh Patidar, Clesson E. Turner, Xinyu Wen, Young K. Song, and Jun S. Wei et al. Frequent inactivating germline mutations in DNA repair genes in patients with Ewing sarcoma. Genetics in Medicine, 2017.

560 Garvan Institute of Medical Research. Medical genome reference bank; URL: https://www.garvan.org.au/research/kinghorn-centre-for-clinical-genomics/clinical-genomics/sydney-genomics-collaborative/mgrb, 2017.

561 John J. McNeil, Robyn L. Woods, Mark R. Nelson, Anne M. Murray, Christopher M. Reid, and Brenda Kirpach et al. Baseline characteristics of participants in the ASPREE (ASPirin in Reducing Events in the Elderly) study. The Journals of Gerontology, 2017.

562 Goo Jun, Matthew Flickinger, Kurt N. Hetrick, Jane M. Romm, Kimberly F. Doheny, and Goncalo R. Abecasis et al. Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. The American Journal of Human Genetics, 91(5):839-848, 2012.

563 The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68-74, 2015.

564 GATK Documentation. Best practices for germline SNP & Indel discovery in whole genome and exome sequence; URL: https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS, 2016.

565 Donna Karolchik, Robert Baertsch, Mark Diekhans, Terrence S. Furey, Angie Hinrichs, and Y.T. Lu et al. The UCSC genome browser database. Nucleic Acids Research, 31(1):51-54, 2003.

566 Florian Gnad, Albion Baucom, Kiran Mukhyala, Gerard Manning, and Zemin Zhang. Assessment of computational methods for predicting the effects of missense mutations in human cancers. BMC Genomics, 14(3):S7, 2013.

567 NCBI. EST Profile Hs.684212 - C16orf96: Chromosome 16 open reading frame 96; URL: https://www.ncbi.nlm.nih.gov/unigene/estprofileviewer.cgi?uglist=hs.684212, 2017.

568 Li Liu, Jiao Huang, Ke Wang, Li Li, Yangkai Li, Jingsong Yuan, and Sheng Wei. Identification of hallmarks of lung adenocarcinoma prognosis using whole genome sequencing. Oncotarget, 6(35):38016-38028, 2015.

569 Desheng Xiao, Ying Shi, Chunyan Fu, Jiantao Jia, Yu Pan, and Yiqun Jiang et al. Decrease of TET2 expression and increase of 5-hmC levels in myeloid sarcomas. Leukemia Research, 42:75-79, 2016.

570 Yu Pan, Yongguang Tao, Chunyan Fu, Jiantao Jia, Shuang Liu, and Desheng Xiao. Assessment of PET/CT in multifocal myeloid sarcomas with loss of TET2: a case report and literature review. International Journal of Clinical and Experimental Pathology, 8(10):13630-13634, 2015.

571 Pamela J. Woodring, Tony Hunter, and Jean Y.J. Wang. Regulation of F-actin-dependent processes by the Abl family of tyrosine kinases. Journal of Cell Science, 116(13):2613-2626, 2003.

572 Emma Shtivelman, Batia Lifshitz, Robert P. Gale, and Eli Canaani. Fused transcript of ABL and BCR genes in chronic myelogenous leukaemia. Nature, 315:550-554, 1985.

573 Richard B. Jones, Andrew Gordus, Jordan A. Krall, and Gavin MacBeath. A quantitative protein interaction network for the ErbB receptors using protein microarrays. Nature, 439(7073):168-174, 2006.

574 Divyamani Srinivasan and Rina Plattner. Activation of Abl tyrosine kinases promotes invasion of aggressive breast cancer cells. Cancer Research, 66(11):5648-5655, 2006.

575 Liuqing Yang, Chunru Lin, and Zhi-Ren Liu. P68 RNA helicase mediates PDGF-induced epithelial mesenchymal transition by displacing Axin from β-catenin. Cell, 127(1):139-155, 2006.

576 Klarisa Rikova, Ailan Guo, Qingfu Zeng, Anthony Possemato, Jian Yu, and Herbert Haack et al. Global survey of phosphotyrosine signaling identifies oncogenic kinases in lung cancer. Cell, 131(6):1190-1203, 2007.

577 Jeffrey Lin, Tong Sun, Lin Ji, Wei Deng, Jack Roth, John D. Minna, and Ralph Arlinghaus. Oncogenic activation of c-abl in non-small cell lung cancer cells lacking FUS1 expression: inhibition of c-abl by the tumor suppressor gene product Fus1. Oncogene, 26(49):6989-6996, 2007.

578 Chang-Jiun Wu, Tianxi Cai, Klarisa Rikova, David Merberg, Simon Kasif, and Martin Steffen. A predictive phosphorylation signature of lung cancer. PLOS ONE, 4(11):e7994, 2009.

579 Julian Carretero, Takeshi Shimamura, Klarisa Rikova, Autumn L. Jackson, Matthew D. Wilkerson, and Christa L. Borgman et al. Integrative genomic and proteomic analyses identify targets for Lkb1-deficient metastatic lung tumors. Cancer Cell, 17(6):547-559, 2010.

580 Justin M. Drake, Nicholas A. Graham, Tanya Stoyanova, Amir Sedghi, Andrew S. Goldstein, and Houjian Cai et al. Oncogene-specific activation of tyrosine kinase networks during prostate cancer progression. Proceedings of the National Academy of Sciences, 109(5):1643-1648, 2012.

581 Alessandro Furlan, Venturina Stagni, Azeemudeen Hussain, Sylvie Richelme, Filippo Conti, and Andrea Prodosmo et al. Abl interconnects oncogenic Met and p53 core pathways in cancer cells. Cell Death & Differentiation, 18(10):1608-1616, 2011.

582 Sourik S. Ganguly, Leann S. Fiore, Jonathan T. Sims, J. Woodrow Friend, Divyamani Srinivasan, and Matthew A. Thacker et al. c-abl and Arg are activated in human primary melanomas, promote melanoma cell invasion via distinct pathways, and drive metastatic progression. Oncogene, 31(14):1804-1816, 2012.

583 Junaid Ansari, Abdul Rafeh Naqash, Reinhold Munker, Hazem El-Osta, Samip Master, and James D. Cotelingam et al. Histiocytic sarcoma as a secondary malignancy: pathobiology, diagnosis, and treatment. European Journal of Haematology, 97(1):9-16, 2016.

584 Xueyan Chen, Joe C. Rutledge, David Wu, Min Fang, Kent E. Opheim, and Min Xu. Chronic myelogenous leukemia presenting in blast phase with nodal, bilineal myeloid sarcoma and t-lymphoblastic lymphoma in a child. Pediatric and Developmental Pathology, 16(2):91-96, 2013.

585 Brian L. Samuels, Sant Chawla, Shreyaskumar Patel, Margaret von Mehren, Jeremy Hamm, and Pamela E. Kaiser et al. Clinical outcomes and safety with trabectedin therapy in patients with advanced soft tissue sarcomas following failure of prior chemotherapy: results of a worldwide expanded access program study. Annals of Oncology, 24(6):1703-1709, 2013.

586 Fernando A. Angarita, Amanda J. Cannell, Albiruni R. Abdul Razak, Brendan C. Dickson, and Martin E. Blackstein. Trabectedin for inoperable or recurrent soft tissue sarcoma in adult patients: a retrospective cohort study. BMC Cancer, 16(1):30-41, 2016.

587 Kira Bramswig, Ferdinand Ploner, Alexandra Martel, Thomas Bauernhofer, Wolfgang Hilbe, and Thomas Kuhr et al. Sorafenib in advanced, heavily pretreated patients with soft tissue sarcomas. Anti-Cancer Drugs, 25(7):848-853, 2014.

588 Armando Santoro, Alessandro Comandone, Umberto Basso, Hector Soto Parra, Rita De Sanctis, and Elisa Stroppa et al. Phase II prospective study with sorafenib in advanced soft tissue sarcomas after anthracycline-based therapy. Annals of Oncology, 24(4):1093-1098, 2013.

589 Bo Eskerod Madsen and Sharon R. Browning. A groupwise association test for rare mutations using a weighted sum statistic. PLOS Genetics, 5(2):e1000384, 2009.

590 Ya-Jing Zhou, Yong Wang, and Li-Li Chen. Detecting the common and individual effects of rare variants on quantitative traits by using extreme phenotype sampling. Genes, 7(1):2-14, 2016.

591 Benjamin M. Neale, Manuel A. Rivas, Benjamin F. Voight, David Altshuler, Bernie Devlin, and Marju Orho-Melander et al. Testing for an unusual distribution of rare variants. PLOS Genetics, 7(3):e1001322, 2011.

592 Seunggeun Lee, Michael C. Wu, and Xihong Lin. Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13(4):762-775, 2012.

593 Andriy Derkach, Jerry F. Lawless, and Lei Sun. Robust and powerful tests for rare variants using Fisher's method to combine evidence of association from two or more complementary tests. Genetic Epidemiology, 37(1):110-121, 2013.

594 Satu Maki-Nevala, Virinder Kaur Sarhadi, Aija Knuuttila, Ilari Scheinin, Pekka Ellonen, and Sonja Lagstrom et al. Driver gene and novel mutations in asbestos-exposed lung adenocarcinoma and malignant mesothelioma detected by exome sequencing. Lung, 194(1):125-135, 2016.

595 Robbert D.A. Weren, Marjolijn J.L. Ligtenberg, C. Marleen Kets, Richarda M. de Voer, Eugene T.P. Verwiel, and Liesbeth Spruijt et al. A germline homozygous mutation in the base-excision repair gene NTHL1 causes adenomatous polyposis and colorectal cancer. Nature Genetics, 47(6):668-671, 2015.

596 Oriol Calvete, Jose Reyes, Sheila Zuniga, Beatriz Paumard-Hernandez, Victoria Fernandez, and Luis Bujanda et al. Exome sequencing identifies ATP4A gene as responsible of an atypical familial type I gastric neuroendocrine tumour. Human Molecular Genetics, 24(10):2914-2922, 2015.

597 Michael W. Ronellenfitsch, Oh Ji Eun, Kaishi Satomi, Koichiro Sumi, Patrick N. Harter, and Joachim P. Steinbach et al. CASP9 germline mutation in a family with multiple brain tumors. Brain Pathology, pages 1-22, 2016.

598 Cezary Cybulski, Jian Carrot-Zhang, Wojciech Kluzniak, Barbara Rivera, Aniruddh Kashyap, and Dominika Wokolorczyk et al. Germline RECQL mutations are associated with breast cancer susceptibility. Nature Genetics, 47(6):643-646, 2015.

599 Jie Sun, Yuxia Wang, Yisui Xia, Ye Xu, Tao Ouyang, and Jinfeng Li et al. Mutations in RECQL gene are associated with predisposition to breast cancer. PLOS Genetics, 11(5):e1005228, 2015.

600 Johanna I. Kiiski, Liisa M. Pelttari, Sofia Khan, Edda S. Freysteinsdottir, Inga Reynisdottir, and Steven N. Hart et al. Exome sequencing identifies FANCM as a susceptibility gene for triple-negative breast cancer. Proceedings of the National Academy of Sciences, 111(42):15172-15177, 2014.

601 Francisco Javier Gracia-Aznarez, Victoria Fernandez, Guillermo Pita, Paolo Peterlongo, Orlando Dominguez, and Miguel de la Hoya et al. Whole exome sequencing suggests much of non-brca1/brca2 familial breast cancer is due to moderate and low penetrance susceptibility alleles. PLOS ONE, 8(2):e55681, 2013.

602 Paolo Peterlongo, Irene Catucci, Mara Colombo, Laura Caleca, Eliseos Mucaki, and Massimo Bogliolo et al. FANCM c.5791c>t nonsense mutation (rs144567652) induces exon skipping, affects DNA repair activity and is a familial breast cancer risk factor. Human Molecular Genetics, 24(18):5345-5355, 2015.

603 Daniel J. Park, Kayoko Tao, Florence Le Calvez-Kelm, Tu Nguyen-Dumont, Nivonirina Robinot, and Fleur Hammet et al. Rare mutations in RINT1 predispose carriers to breast and Lynch syndrome-spectrum cancers. Cancer Discovery, 4(7):804-815, 2014.

604 Ella R. Thompson, Maria A. Doyle, Georgina L. Ryland, Simone M. Rowley, David Y. H. Choong, and Richard W. Tothill et al. Exome sequencing identifies rare deleterious mutations in DNA repair genes FANCC and BLM as potential breast cancer susceptibility alleles. PLOS Genetics, 8(9):e1002894, 2012.

605 Anna P. Sokolenko, Aglaya G. Iyevleva, Elena V. Preobrazhenskaya, Nathalia V. Mitiushkina, Svetlana N. Abysheva, and Evgeny N. Suspitsin et al. High prevalence and breast cancer predisposing role of the BLM c.1642 C>T (Q548X) mutation in Russia. International Journal of Cancer, 130(12):2867-2873, 2012.

606 Darya Prokofyeva, Natalia Bogdanova, Natalia Dubrowinskaja, Marina Bermisheva, Zalina Takhirova, and Natalia Antonenkova et al. Nonsense mutation p.q548x in BLM, the gene mutated in Bloom's syndrome, is associated with breast cancer in Slavic populations. Breast Cancer Research and Treatment, 137(2):533-539, 2013.

607 Marianne Berwick, Jaya M. Satagopan, Leah Ben-Porat, Ann Carlson, Katherine Mah, and Rashida Henry et al. Genetic heterogeneity among fanconi anemia heterozygotes and risk of cancer. Cancer Research, 67(19):9591-9596, 2007.

608 Daniel J. Park, Fabienne Lesueur, Tu Nguyen-Dumont, Maroulio Pertesi, Fabrice Odefrey, and F. Hammet et al. Rare mutations in XRCC2 increase the risk of breast cancer. The American Journal of Human Genetics, 90(4):734-739, 2012.

609 Kirsi Maatta, Tommi Rantapero, Anna Lindstrom, Matti Nykter, Minna Kankuri-Tammilehto, Satu-Leena Laasanen, and Johanna Schleutker. Whole-exome sequencing of Finnish hereditary breast cancer families. European Journal of Human Genetics, 2016.

610 Abdelkader Heddar, Pierre Fermey, Sophie Coutant, Emilie Angot, Jean-Christophe Sabourin, and Paul Michelin et al. Familial solitary chondrosarcoma resulting from germline EXT2 mutation. Genes, Chromosomes and Cancer, 2016.

611 Lynn R. Goldin, Mary L. McMaster, Melissa Rotunno, Sarah E.M. Herman, Kristine Jones, and Bin Zhu et al. Whole exome sequencing in families with CLL detects a variant in Integrin β 2 associated with disease susceptibility. Blood, 128(18):2261-2263, 2016.

612 Helen E. Speedy, Ben Kinnersley, Daniel Chubb, Peter Broderick, Philip J. Law, and Kevin Litchfield et al. Germline mutations in shelterin complex genes are associated with familial chronic lymphocytic leukemia. Blood, 2016.

613 Jun-Xiao Zhang, Lei Fu, Richarda M. de Voer, Marc-Manuel Hahn, Peng Jin, and Chen-Xi Lv et al. Candidate colorectal cancer predisposing gene variants in Chinese early-onset and familial cases. World Journal of Gastroenterology, 21(14):4136-4149, 2015.

614 Nuria Segui, Leonardo B. Mina, Conxi Lazaro, Rebeca Sanz-Pamplona, Tirso Pons, and Matilde Navarro et al. Germline mutations in FAN1 cause hereditary colorectal cancer by impairing DNA repair. Gastroenterology, 149(3):563-566, 2015.

615 Clara Esteban-Jurado, Maria Vila-Casadesus, Pilar Garre, Juan Jose Lozano, Anna Pristoupilova, and Sergi Beltran et al. Whole-exome sequencing identifies rare pathogenic variants in new predisposition genes for familial colorectal cancer. Genetics in Medicine, 2014.

616 Taina T. Nieminen, Marie-Francoise O'Donohue, Yunpeng Wu, Hannes Lohi, Stephen W. Scherer, and Andrew D. Paterson et al. Germline mutation of RPS20, encoding a ribosomal protein, causes predisposition to hereditary nonpolyposis colorectal carcinoma without DNA mismatch repair deficiency. Gastroenterology, 147(3):595-598.e5, 2014.
617 Alexandra E. Gylfe, Riku Katainen, Johanna Kondelin, Tomas Tanskanen, Tatiana Cajuso, and Ulrika Hanninen et al. Eleven candidate susceptibility genes for common familial colorectal cancer. PLOS Genetics, 9(10):e1003876, 2013.
618 Pi-Yueh Chang, Jinn-Shiun Chen, Nai-Chung Chang, Shih-Cheng Chang, Mei-Chia Wang, and Shu-Hui Tsai et al. NRAS germline variant G138R and multiple rare somatic mutations on APC in colorectal cancer patients in Taiwan by next generation sequencing. Oncotarget, 7(25):37566-37580, 2016.
619 Daniel Chubb, Peter Broderick, Sara E. Dobbins, Matthew Frampton, Ben Kinnersley, and Steven Penegar et al. Rare disruptive mutations and their contribution to the heritable risk of colorectal cancer. Nature Communications, 2016.
620 Richarda M. de Voer, Marc-Manuel Hahn, Robbert D.A. Weren, Arjen R. Mensenkamp, Christian Gilissen, and Wendy A. van Zelst-Stams et al. Identification of novel candidate genes for early-onset colorectal cancer susceptibility. PLOS Genetics, 12(2):e1005880, 2016.
621 Claire Palles, Jean-Baptiste Cazier, Kimberley M. Howarth, Enric Domingo, Angela M. Jones, and Peter Broderick et al. Germline mutations affecting the proofreading domains of POLE and POLD1 predispose to colorectal adenomas and carcinomas. Nature Genetics, 45(2):136-144, 2013.
622 Christopher G. Smith, Marc Naven, Rebecca Harris, James Colley, Hannah West, and Ning Li et al. Exome resequencing identifies potential tumor-suppressor genes that predispose to colorectal cancer. Human Mutation, 34(7):1026-1034, 2013.

623 Anna Rohlin, Theofanis Zagoras, Staffan Nilsson, Ulf Lundstam, Jan Wahlstrom, and Leif Hulten et al. A mutation in POLE predisposing to a multi-tumour phenotype. International Journal of Oncology, 45(1):77-81, 2014.
624 Laura Valle, Eva Hernandez-Illan, Fernando Bellido, Gemma Aiza, Adela Castillejo, and Maria-Isabel Castillejo et al. New insights into POLE and POLD1 germline mutations in familial colorectal cancer and polyposis. Human Molecular Genetics, 23(13):3506-3512, 2014.
625 Fernando Bellido, Marta Pineda, Gemma Aiza, Rafael Valdes-Mas, Matilde Navarro, and Diana A. Puente et al. POLE and POLD1 mutations in 529 kindred with familial colorectal cancer and/or polyposis: review of reported cases and recommendations for genetic testing and surveillance. Genetics in Medicine, 18(4):325-332, 2015.
626 Daniel Chubb, Peter Broderick, Matthew Frampton, Ben Kinnersley, Amy Sherborne, and Steven Penegar et al. Genetic diagnosis of high-penetrance susceptibility for colorectal cancer (CRC) is achievable for a high proportion of familial CRC by exome sequencing. Journal of Clinical Oncology, 33(5):426-432, 2015.
627 Fadwa A. Elsayed, C. Marleen Kets, Dina Ruano, Brendy van den Akker, Arjen R. Mensenkamp, and Melanie Schrumpf et al. Germline variants in POLE are associated with early onset mismatch repair deficient colorectal cancer. European Journal of Human Genetics, 23(8):1080-1084, 2015.
628 Maren F. Hansen, Jostein Johansen, Inga Bjornevoll, Anna E. Sylvander, Kristin S. Steinsbekk, and Pal Saetrom et al. A novel POLE mutation associated with cancers of colon, pancreas, ovaries and small intestine. Familial Cancer, 14(3):437-448, 2015.
629 Isabel Spier, Stefanie Holzapfel, Janine Altmuller, Bixiao Zhao, Sukanya Horpaopan, and Stefanie Vogt et al. Frequency and phenotypic spectrum of germline mutations in POLE and seven other polymerase genes in 266 patients with colorectal adenomas and carcinomas. International Journal of Cancer, 137(2):320-331, 2015.

630 Yael Goldberg, Naama Halpern, Ayala Hubert, Samuel N. Adler, Sherri Cohen, and Morasha Plesser-Duvdevani et al. Mutated MCM9 is associated with predisposition to hereditary mixed polyposis and colorectal cancer in addition to primary ovarian failure. Cancer Genetics, 208(12):621-624, 2015.
631 Ronja Adam, Isabel Spier, Bixiao Zhao, Michael Kloth, Jonathan Marquez, and Inga Hinrichsen et al. Exome sequencing identifies biallelic MSH3 germline mutations as a recessive subtype of colorectal adenomatous polyposis. The American Journal of Human Genetics, 99(2):337-351, 2016.
632 Isabel Spier, Martin Kerick, Dmitriy Drichel, Sukanya Horpaopan, Janine Altmuller, and Andreas Laner et al. Exome sequencing identifies potential novel candidate genes in patients with unexplained colorectal adenomatous polyposis. Familial Cancer, 15(2):281-288, 2016.
633 Ryan E. Fecteau, Jianping Kong, Adam Kresak, Wendy Brock, Yeunjoo Song, and Hisashi Fujioka et al. Association between germline mutation in VSIG10L and familial Barrett neoplasia. JAMA Oncology, 2(10):1333-1339, 2016.
634 Caixia Cheng, Heyang Cui, Ling Zhang, Zhiwu Jia, Bin Song, and Fang Wang et al. Genomic analyses reveal FAM84B and the NOTCH pathway are associated with the progression of esophageal squamous cell carcinoma. GigaScience, 5(1):1, 2016.
635 Keqiang Zhang, Jia-Wei Lin, Jinhui Wang, Xiwei Wu, Hanlin Gao, and Yi-Chen Hsieh et al. A germline missense mutation in COQ6 is associated with susceptibility to familial schwannomatosis. Genetics in Medicine, 16(10):787-792, 2014.
636 Iikki Donner, Tuula Kiviluoto, Ari Ristimaki, Lauri A. Aaltonen, and Pia Vahteristo. Exome sequencing reveals three novel candidate predisposition genes for diffuse gastric cancer. Familial Cancer, 14(2):241-246, 2015.
637 Ian J. Majewski, Irma Kluijt, Annemieke Cats, Thomas S. Scerri, Daphne de Jong, and Roelof J.C. Kluin et al. An α-E-catenin (CTNNA1) mutation in hereditary diffuse gastric cancer. The Journal of Pathology, 229(4):621-629, 2013.

638 Samantha Hansford, Pardeep Kaurah, Hector Li-Chang, Michelle Woo, Janine Senz, and Hugo Pinheiro et al. Hereditary diffuse gastric cancer syndrome: CDH1 mutations and beyond. JAMA Oncology, 1(1):23-32, 2015.
639 Matthew N. Bainbridge, Georgina N. Armstrong, M. Monica Gramatges, Alison A. Bertuch, Shalini N. Jhangiani, and Harsha Doddapaneni et al. Germline mutations in shelterin complex genes are associated with familial glioma. Journal of the National Cancer Institute, 107(1):dju384, 2015.
640 Heikki Ristolainen, Outi Kilpivaara, Peter Kamper, Minna Taskinen, Silva Saarinen, and Sirpa Leppa et al. Identification of homozygous deletion in ACAN and other candidate variants in familial classical Hodgkin lymphoma by exome sequencing. British Journal of Haematology, 170(3):428-431, 2015.
641 Silva Saarinen, Mervi Aavikko, Kristiina Aittomaki, Virpi Launonen, Rainer Lehtonen, and Kaarle Franssila et al. Exome sequencing reveals germline NPAT mutation as a candidate risk factor for Hodgkin lymphoma. Blood, 118(3):493-498, 2011.
642 Melissa Rotunno, Mary L. McMaster, Joseph Boland, Sara Bass, Xijun Zhang, and Laurie Burdett et al. Whole exome sequencing in families at high risk for Hodgkin lymphoma: identification of a predisposing mutation in the KDR gene. Haematologica, 101(7):853, 2016.
643 Natalia D. Linhares, Maira C.M. Freire, Raony G.C.C.L. Cardenas, Heloisa B. Pena, Magda Bahia, and Sergio D.J. Pena. Exome sequencing identifies a novel homozygous variant in NDRG4 in a family with infantile myofibromatosis. European Journal of Medical Genetics, 57(11-12):643-648, 2014.
644 Yee Him Cheung, Tenzin Gayden, Philippe M. Campeau, Charles A. LeDuc, Donna Russo, and Van-Hung Nguyen et al. A recurrent PDGFRB mutation causes familial infantile myofibromatosis. The American Journal of Human Genetics, 92(6):996-1000, 2013.
645 John A. Martignetti, Lifeng Tian, Dong Li, Maria Celeste M. Ramirez, Olga Camacho-Vanegas, and Sandra Catalina Camacho et al. Mutations in PDGFRB cause autosomal-dominant infantile myofibromatosis. The American Journal of Human Genetics, 92(6):1001-1007, 2013.
646 Xiaolei Lan, Hua Gao, Fei Wang, Jie Feng, Jiwei Bai, and Peng Zhao et al. Whole-exome sequencing identifies variants in invasive pituitary adenomas. Oncology Letters, 12(4):2319-2328, 2016.
647 Joanne Ngeow, Wanfeng Yu, Lamis Yehia, Farshad Niazi, Jinlian Chen, and Xuhua Tang et al. Exome sequencing reveals germline SMAD9 mutation that reduces phosphatase and tensin homolog expression and is associated with hamartomatous polyposis and gastrointestinal ganglioneuromas. Gastroenterology, 149(4):886-889, 2015.
648 Mervi Aavikko, Eevi Kaasinen, Janne K. Nieminen, Minji Byun, Iikki Donner, and Roberta Mancuso et al. Whole-genome sequencing identifies STAT4 as a putative susceptibility gene in classic Kaposi sarcoma. Journal of Infectious Diseases, 211(11):1842-1851, 2015.
649 Sho Egashira, Masatoshi Jinnin, Miho Harada, Shinichi Masuguchi, Satoshi Fukushima, and Hironobu Ihn. Exome sequence analysis of Kaposiform hemangioendothelioma: identification of putative driver mutations. Anais Brasileiros de Dermatologia, 91(6):748-753, 2016.
650 Stefano Caruso, Julien Calderaro, Eric Letouze, Jean-Charles Nault, Gabrielle Couchy, and Anais Boulais et al. Germline and somatic DICER1 mutations in familial and sporadic liver tumors. Journal of Hepatology, 66(4):734-742, 2016.
651 Donghai Xiong, Yian Wang, Elena Kupert, Claire Simpson, Susan M. Pinney, and Colette R. Gaba et al. A recurrent mutation in PARK2 is associated with familial lung cancer. The American Journal of Human Genetics, 96(2):301-308, 2015.
652 Hsuan-Yu Chen, Sung-Liang Yu, Bing-Ching Ho, Kang-Yi Su, Yi-Chiung Hsu, and Chi-Sheng Chang et al. R331W missense mutation of oncogene YAP1 is a germline risk allele for lung adenocarcinoma with medical actionability. Journal of Clinical Oncology, 33(20):2303-2310, 2015.

653 Makia J. Marafie, Mohammed Dashti, and Fahd Al-Mulla. Identification of a rare germline NBN gene mutation by whole exome sequencing in a lung-cancer survivor from a large family with various types of cancer. Familial Cancer, pages 1-6, 2016.
654 Leila Noetzli, Richard W. Lo, Alisa B. Lee-Sherick, Michael Callaghan, Patrizia Noris, and Anna Savoia et al. Germline mutations in ETV6 are associated with thrombocytopenia, red cell macrocytosis and predisposition to lymphoblastic leukemia. Nature Genetics, 47(5):535-538, 2015.
655 Sabine Topka, Joseph Vijai, Michael F. Walsh, Lauren Jacobs, Ann Maria, and Danylo Villano et al. Germline ETV6 mutations confer susceptibility to acute lymphoblastic leukemia and thrombocytopenia. PLOS Genetics, 11(6):e1005262, 2015.
656 Valentina Silvestri, Veronica Zelli, Virginia Valentini, Piera Rizzolo, Anna Sara Navazio, and Anna Coppa et al. Whole-exome sequencing and targeted gene sequencing provide insights into the role of PALB2 as a male breast cancer susceptibility gene. Cancer, 123(2):210-218, 2016.
657 Carla Daniela Robles-Espinoza, Mark Harland, Andrew J. Ramsay, Lauren G. Aoude, Victor Quesada, and Zhihao Ding et al. POT1 loss-of-function variants predispose to familial melanoma. Nature Genetics, 46(5):478-481, 2014.
658 Satoru Yokoyama, Susan L. Woods, Glen M. Boyle, Lauren G. Aoude, Stuart MacGregor, and Victoria Zismann et al. A novel recurrent mutation in MITF predisposes to familial and sporadic melanoma. Nature, 480(7375):99-103, 2011.
659 Paola Ghiorzo, Lorenza Pastorino, Paola Queirolo, William Bruno, Maria G. Tibiletti, and Sabina Nasti et al. Prevalence of the E318K MITF germline mutation in Italian melanoma patients: associations with histological subtypes and family cancer history. Pigment Cell & Melanoma Research, 26(2):259-262, 2013.
660 Marianne Berwick, Jamie MacArthur, Irene Orlow, Peter Kanetsky, Colin B. Begg, and Li Luo et al. MITF E318K's effect on melanoma risk independent of, but modified by, other risk factors. Pigment Cell & Melanoma Research, 27(3):485-488, 2014.
661 J. William Harbour, Michael D. Onken, Elisha D.O. Roberson, Shenghui Duan, Li Cao, and Lori A. Worley et al. Frequent mutation of BAP1 in metastasizing uveal melanomas. Science, 330(6009):1410-1413, 2010.
662 Joseph R. Testa, Mitchell Cheung, Jianming Pei, Jennifer E. Below, Yinfei Tan, and Eleonora Sementino et al. Germline BAP1 mutations predispose to malignant mesothelioma. Nature Genetics, 43(10):1022-1025, 2011.
663 Thomas Wiesner, Anna C. Obenauf, Rajmohan Murali, Isabella Fried, Klaus G. Griewank, and Peter Ulz et al. Germline mutations in BAP1 predispose to melanocytic tumors. Nature Genetics, 43(10):1018-1021, 2011.
664 Mohamed H. Abdel-Rahman, Robert Pilarski, Colleen M. Cebulla, James B. Massengill, Benjamin N. Christopher, and Getachew Boru et al. Germline BAP1 mutation predisposes to uveal melanoma, lung adenocarcinoma, meningioma, and other cancers. Journal of Medical Genetics, 48(12):856-859, 2011.
665 Lauren G. Aoude, Karin Wadt, Anders Bojesen, Dorthe Cruger, Ake Borg, and Jeffrey M. Trent et al. A BAP1 mutation in a Danish family predisposes to uveal melanoma and other cancers. PLOS ONE, 8(8):e72144, 2013.
666 Mitchell Cheung, Jacqueline Talarchek, Karen Schindeler, Eduardo Saraiva, Lynette S. Penney, Mark Ludman, and Joseph R. Testa. Further evidence for germline BAP1 mutations predisposing to melanoma and malignant mesothelioma. Cancer Genetics, 206(5):206-210, 2013.
667 Megan N. Farley, Laura S. Schmidt, Jessica L. Mester, Samuel Pena-Llopis, Andrea Pavia-Jimenez, and Alana Christie et al. A novel germline mutation in BAP1 predisposes to familial clear-cell renal cell carcinoma. Molecular Cancer Research, 11(9):1061-1071, 2013.
668 Tatiana Popova, Lucie Hebert, Virginie Jacquemin, Sophie Gad, Virginie Caux-Moncoutier, and Catherine Dubois-d'Enghien et al. Germline BAP1 mutations predispose to renal cell carcinomas. The American Journal of Human Genetics, 92(6):974-980, 2013.

669 David A. Maerker, Michael Zeschnigk, Jasmin Nelles, Dietmar R. Lohmann, Karl Worm, and Anja K. Bosserhoff et al. BAP1 germline mutation in two first grade family members with uveal melanoma. British Journal of Ophthalmology, 98(2):224-227, 2014.
670 Robert Pilarski, Colleen M. Cebulla, James B. Massengill, Karan Rai, Thereasa Rich, and Louise Strong et al. Expanding the clinical phenotype of hereditary BAP1 cancer predisposition syndrome, reporting three new cases. Genes, Chromosomes and Cancer, 53(2):177-182, 2014.
671 Colleen M. Cebulla, Elaine M. Binkley, Robert Pilarski, James B. Massengill, Karan Rai, and David A. Liebner et al. Analysis of BAP1 germline gene mutation in young uveal melanoma patients. Ophthalmic Genetics, 36(2):126-131, 2015.
672 Arnaud de la Fouchardiere, Odile Cabaret, Liliana Savin, Patrick Combemale, Hubert Schvartz, and Clotilde Penet et al. Germline BAP1 mutations predispose also to multiple basal cell carcinomas. Clinical Genetics, 88(3):273-277, 2015.
673 Sonja Klebe, Jack Driml, Masaki Nasu, Sandra Pastorino, Amirmasoud Zangiabadi, Douglas Henderson, and Michele Carbone. BAP1 hereditary cancer predisposition syndrome: a case report and review of literature. Biomarker Research, 3(1):14, 2015.
674 Pedram Gerami, Oriol Yelamos, Christina Y. Lee, Roxana Obregon, Pedram Yazdan, and Lauren M. Sholl et al. Multiple cutaneous melanomas and clinically atypical moles in a patient with a novel germline BAP1 mutation. JAMA Dermatology, 151(11):1235-1239, 2015.
675 Karan Rai, Robert Pilarski, Colleen M. Cebulla, and Mohamed H. Abdel-Rahman. Comprehensive review of BAP1 tumor predisposition syndrome with report of two new cases. Clinical Genetics, 89(3):285-294, 2016.
676 Karin A.W. Wadt, Lauren G. Aoude, Peter Johansson, Annalisa Solinas, Antonia L. Pritchard, and Oana Crainic et al. A recurrent germline BAP1 mutation and extension of the BAP1 tumor predisposition spectrum to include basal cell carcinoma. Clinical Genetics, 88(3):267-272, 2015.
677 David J. Barnes, Edward Hookway, Nick Athanasou, Takeshi Kashima, Udo Oppermann, and Simon Hughes et al. A germline mutation of CDKN2A and a novel RPLP1-C19MC fusion detected in a rare melanotic neuroectodermal tumor of infancy: a case report. BMC Cancer, 16(1):629, 2016.
678 Miriam J. Smith, James O'Sullivan, Sanjeev S. Bhaskar, Kristen D. Hadfield, Gemma Poke, and John Caird et al. Loss-of-function mutations in SMARCE1 cause an inherited disorder of multiple spinal meningiomas. Nature Genetics, 45(3):295-298, 2013.
679 Miriam J. Smith, Andrew J. Wallace, Chris Bennett, Martin Hasselblatt, Ewelina Elert-Dobkowska, and Linton T. Evans et al. Germline SMARCE1 mutations predispose to both spinal and cranial clear cell meningiomas. The Journal of Pathology, 234(4):436-440, 2014.
680 Helen Raffalli-Ebezant, Scott A. Rutherford, Stavros Stivaros, Anna Kelsey, Miriam Smith, D. Gareth Evans, and John-Paul Kilday. Pediatric intracranial clear cell meningioma associated with a germline mutation of SMARCE1: a novel case. Child's Nervous System, 31(3):441-447, 2015.
681 Linton T. Evans, Jack Van Hoff, William F. Hickey, Miriam J. Smith, D. Gareth Evans, William G. Newman, and David F. Bauer. SMARCE1 mutations in pediatric clear cell meningioma: case report. Journal of Neurosurgery: Pediatrics, 16(3):296-300, 2015.
682 Wei Dai, Hong Zheng, Arthur Kwok Leung Cheung, Clara Sze-man Tang, Josephine Mun Yee Ko, and Bonnie Wing Yan Wong et al. Whole-exome sequencing identifies MST1R as a genetic susceptibility gene in nasopharyngeal carcinoma. Proceedings of the National Academy of Sciences, 113(12):3317-3322, 2016.
683 Sudheer Kumar Gara, Li Jia, Maria J. Merino, Sunita K. Agarwal, Lisa Zhang, and Maggie Cam et al. Germline HABP2 mutation causing familial nonmedullary thyroid cancer. New England Journal of Medicine, 373(5):448-455, 2015.

684 Chang Liu, Yang Yu, Guangliang Yin, Junxia Zhang, Wei Wen, and Xianhui Ruan et al. C14orf93 (RTFC) is identified as a novel susceptibility gene for familial nonmedullary thyroid cancer. Biochemical and Biophysical Research Communications, pages 1-7, 2016.
685 Ed Dicks, Honglin Song, Susan J. Ramus, Elke Van Oudenhove, Jonathan P. Tyrer, and Maria P. Intermaggio et al. Germline whole exome sequencing and large-scale replication identifies FANCM as a likely high grade serous ovarian cancer susceptibility gene. Oncotarget, 2017.
686 Silvia Vilarinho, E. Zeynep Erson-Omay, Akdes Serin Harmanci, Raffaella Morotti, Geneive Carrion-Grant, and Jacob Baranoski et al. Paediatric hepatocellular carcinoma due to somatic CTNNB1 and NFE2L2 mutations in the setting of inherited bi-allelic ABCB11 mutations. Journal of Hepatology, 61(5):1178-1183, 2014.
687 Filemon S. Dela Cruz, Daniel Diolaiti, Andrew T. Turk, Allison R. Rainey, Alberto Ambesi-Impiombato, and Stuart J. Andrews et al. A case study of an integrative genomic and experimental therapeutic approach for rare tumors: identification of vulnerabilities in a pediatric poorly differentiated carcinoma. Genome Medicine, 8(1):116, 2016.
688 Yoko Shimada, Takashi Kohno, Hideki Ueno, Yoshinori Ino, Hideyuki Hayashi, and Takashi Nakaoku et al. An oncogenic ALK fusion and an RRAS mutation in KRAS mutation-negative pancreatic ductal adenocarcinoma. The Oncologist, 22(2):158-164, 2017.
689 Jerneja Tomsic, Huiling He, Keiko Akagi, Sandya Liyanarachchi, Qun Pan, and Blake Bertani et al. A germline mutation in SRRM2, a splicing factor gene, is implicated in papillary thyroid carcinoma predisposition. Scientific Reports, 5, 2015.
690 Alberto Cascon, Inaki Comino-Mendez, Maria Curras-Freixes, Aguirre A. de Cubas, Laura Contreras, and Susan Richter et al. Whole-exome sequencing identifies MDH2 as a new familial paraganglioma gene. Journal of the National Cancer Institute, 107(5):djv053, 2015.

691 Andrew Feber, Daniel C. Worth, Ankur Chakravarthy, Patricia de Winter, Kunal Shah, and Manit Arya et al. Somatic mutations in penile squamous cell carcinoma. Cancer Research, 76(16):4720-4727, 2016.
692 Inaki Comino-Mendez, Francisco J. Gracia-Aznarez, Francesca Schiavi, Inigo Landa, Luis J. Leandro-Garcia, and Rocio Leton et al. Exome sequencing identifies MAX mutations as a cause of hereditary pheochromocytoma. Nature Genetics, 43(7):663-667, 2011.
693 Nelly Burnichon, Alberto Cascon, Francesca Schiavi, Nicole Paes Morales, Inaki Comino-Mendez, and Nassera Abermil et al. MAX mutations cause hereditary and sporadic pheochromocytoma and paraganglioma. Clinical Cancer Research, 18(10):2828-2837, 2012.
694 Mariola Peczkowska, Aldona Kowalska, Jacek Sygut, Dariusz Waligorski, Angelica Malinoc, and Hanna Janaszek-Sitkowska et al. Testing new susceptibility genes in the cohort of apparently sporadic phaeochromocytoma/paraganglioma patients with clinical characteristics of hereditary syndromes. Clinical Endocrinology, 79(6):817-823, 2013.
695 Sohela Shah, Kasmintan A. Schrader, Esme Waanders, Andrew E. Timms, Joseph Vijai, and Cornelius Miething et al. A recurrent germline PAX5 mutation confers susceptibility to pre-B cell acute lymphoblastic leukemia. Nature Genetics, 45(10):1226-1231, 2013.
696 Yuji Ikeda, Kazuma Kiyotani, Poh Yin Yew, Taigo Kato, Kenji Tamura, and Kai Lee Yap et al. Germline PARP4 mutations in patients with primary thyroid and breast cancers. Endocrine-Related Cancer, 23(3):171-179, 2016.
697 Pia Ostergaard, Michael A. Simpson, Fiona C. Connell, Colin G. Steward, Glen Brice, and Wesley J. Woollard et al. Mutations in GATA2 cause primary lymphedema associated with a predisposition to acute myeloid leukemia (Emberger syndrome). Nature Genetics, 43(10):929-931, 2011.
698 Christopher N. Hahn, Chan-Eng Chong, Catherine L. Carmichael, Ella J. Wilkins, Peter J. Brautigan, and Xiao-Chun Li et al. Heritable GATA2 mutations associated with familial myelodysplastic syndrome and acute myeloid leukemia. Nature Genetics, 43(10):1012-1017, 2011.

699 Harriet Holme, Upal Hossain, Michael Kirwan, Amanda Walne, Tom Vulliamy, and Inderjeet Dokal. Marked genetic heterogeneity in familial myelodysplasia/acute myeloid leukaemia. British Journal of Haematology, 158(2):242-248, 2012.
700 Jan Kazenwadel, Genevieve A. Secker, Yajuan J. Liu, Jill A. Rosenfeld, Robert S. Wildin, and Jennifer Cuellar-Rodriguez et al. Loss-of-function germline GATA2 mutations in patients with MDS/AML or MonoMAC syndrome and primary lymphedema reveal a key role for GATA2 in the lymphatic vasculature. Blood, 119(5):1283-1291, 2012.
701 Marlene Pasquet, Christine Bellanne-Chantelot, Suzanne Tavitian, Nais Prade, Blandine Beaupain, and Olivier LaRochelle et al. High frequency of GATA2 mutations in patients with mild chronic neutropenia evolving to MonoMac syndrome, myelodysplasia, and acute myeloid leukemia. Blood, 121(5):822, 2013.
702 Juehua Gao, Ryan D. Gentzler, Andrew E. Timms, Marshall S. Horwitz, Olga Frankfurt, Jessica K. Altman, and LoAnn C. Peterson. Heritable GATA2 mutations associated with familial AML-MDS: a case report and review of literature. Journal of Hematology & Oncology, 7(1):36, 2014.
703 Christopher N. Hahn, Peter J. Brautigan, Chan-Eng Chong, Alex Janssan, Parameswaran Venugopal, and Young Lee et al. Characterisation of a compound in-cis GATA2 germline mutation in a pedigree presenting with myelodysplastic syndrome/acute myeloid leukemia with concurrent thrombocytopenia. Leukemia, 29(8):1795-1797, 2015.
704 Xinan Wang, Hideki Muramatsu, Yusuke Okuno, Hirotoshi Sakaguchi, Kenichi Yoshida, and Nozomu Kawashima et al. GATA2 and secondary mutations in familial myelodysplastic syndromes and pediatric myeloid malignancies. Haematologica, 100(10):e398, 2015.
705 Liesel M. FitzGerald, Akash Kumar, Evan A. Boyle, Yuzheng Zhang, Laura M. McIntosh, and Suzanne Kolb et al. Germline missense variants in the BTNL2 gene are associated with prostate cancer susceptibility. Cancer Epidemiology Biomarkers & Prevention, 22(9):1520-1528, 2013.

706 Daniel C. Koboldt, Krishna L. Kanchi, Bin Gui, David E. Larson, Robert S. Fulton, and William B. Isaacs et al. Rare variation in TET2 is associated with clinically relevant prostate carcinoma in African Americans. Cancer Epidemiology Biomarkers & Prevention, 25(11):1456-1463, 2016.
707 Takahide Hayano, Hiroshi Matsui, Hirofumi Nakaoka, Nobuaki Ohtake, Kazuyoshi Hosomichi, Kazuhiro Suzuki, and Ituro Inoue. Germline variants of prostate cancer in Japanese families. PLOS ONE, 11(10):e0164233, 2016.
708 Danielle M. Karyadi, Milan S. Geybels, Eric Karlins, Brennan Decker, Laura McIntosh, and Amy Hutchinson et al. Whole exome sequencing in 75 high-risk families with validation and replication in independent case-control studies identifies TANGO2, OR5H14, and CHAD as new prostate cancer susceptibility genes. Oncotarget, 8(1):1495-1507, 2016.
709 Sally M. Hunter, Simone M. Rowley, David Clouston, Jason Li, Richard Lupat, and Nishanth Krishnananthan et al. Searching for candidate genes in familial BRCAX mutation carriers with prostate cancer. Urologic Oncology: Seminars and Original Investigations, 34(3):120.e9-120.e16, 2016.
710 Krinio Giannikou, Izabela A. Malinowska, Trevor J. Pugh, Rachel Yan, Yuen-Yi Tseng, and Coyin Oh et al. Whole exome sequencing identifies TSC1/TSC2 biallelic loss as the primary and sufficient driver event for renal angiomyolipoma development. PLOS Genetics, 12(8):e1006242, 2016.
711 Jinwoo Ahn, Kyung Seok Han, Jun Hyeok Heo, Duhee Bang, You Hyun Kang, and Hyun A. Jin et al. FOXC2 and CLIP4: a potential biomarker for synchronous metastasis of < 7-cm clear cell renal cell carcinomas. Oncotarget, 7(32):51423-51434, 2016.
712 Frank Y. Lin, Katie Bergstrom, Richard Person, Abhishek Bavle, Leomar Y. Ballester, and Sarah Scollon et al. Integrated tumor and germline whole-exome sequencing identifies mutations in MAPK and PI3K pathway genes in an adolescent with rosette-forming glioneuronal tumor of the fourth ventricle. Cold Spring Harbor Molecular Case Studies, 2(5):a001057, 2016.
713 Leora Witkowski, Jian Carrot-Zhang, Steffen Albrecht, Somayyeh Fahiminiya, Nancy Hamel, and Eva Tomiak et al. Germline and somatic SMARCA4 mutations characterize small cell carcinoma of the ovary, hypercalcemic type. Nature Genetics, 46(5):438-443, 2014.
714 Pilar Ramos, Anthony N. Karnezis, David W. Craig, Aleksandar Sekulic, Megan L. Russell, and William P. D. Hendricks et al. Small cell carcinoma of the ovary, hypercalcemic type, displays frequent inactivating germline and somatic mutations in SMARCA4. Nature Genetics, 46(5):427-429, 2014.
715 Pierre-Marie Lavrut, Francois Le Loarer, Charline Normand, Celine Grosos, Remi Dubois, and Anni Buenerd et al. Small cell carcinoma of the ovary, hypercalcemic type/ovarian malignant rhabdoid tumor: report of a bilateral case in a teenager associated with SMARCA4 germline mutation. Pediatric Developmental Pathology, 19:56-60, 2015.
716 Joanna Moes-Sosnowska, Lukasz Szafron, Dorota Nowakowska, Agnieszka Dansonka-Mieszkowska, Agnieszka Budzilowska, and Bozena Konopka et al. Germline SMARCA4 mutations in patients with ovarian small cell carcinoma of hypercalcemic type. Orphanet Journal of Rare Diseases, 10(1):32, 2015.
717 Yoshitatsu Sei, Xilin Zhao, Joanne Forbes, Silke Szymczak, Qing Li, and Apurva Trivedi et al. A hereditary form of small intestinal carcinoid associated with a germline mutation in inositol polyphosphate multikinase. Gastroenterology, 149(1):67-78, 2015.
718 Kie Kyon Huang, Kang Won Jang, Sangwoo Kim, Han Sang Kim, Sung-Moo Kim, and Hyeong Ju Kwon et al. Exome sequencing reveals recurrent REV3L mutations in cisplatin-resistant squamous cell carcinoma of head and neck. Scientific Reports, 6:19552, 2016.
719 Huixing Pan, Xiaojian Xu, Deyao Wu, Qiaocheng Qiu, Shoujun Zhou, and Xuefeng He et al. Novel somatic mutations identified by whole-exome sequencing in muscle-invasive transitional cell carcinoma of the bladder. Oncology Letters, 11(2):1486-1492, 2016.
720 Sandra Hanks, Elizabeth R. Perdeaux, Sheila Seal, Elise Ruark, Shazia S. Mahamdallie, and Anne Murray et al. Germline mutations in the PAF1 complex gene CTR9 predispose to Wilms tumour. Nature Communications, 5, 2014.

721 Cristina R. Antonescu and Paola Dal Cin. Promiscuous genes involved in recurrent chromosomal translocations in soft tissue tumours. Pathology, 46(2):105-112, 2014.
722 Jungho Kim and Jerry Pelletier. Molecular genetics of chromosome translocations involving EWS and related family members. Physiological Genomics, 1(3):127-138, 1999.
723 Fredrik Mertens, Cristina R. Antonescu, and Felix Mitelman. Gene fusions in soft tissue tumors: recurrent and overlapping pathogenetic themes. Genes, Chromosomes and Cancer, 55(4):291-310, 2016.
724 Felix Mitelman, Bertil Johansson, and Fredrik Mertens. The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer, 7(4):233-245, 2007.
725 Johanna Manner, Bernhard Radlwimmer, Peter Hohenberger, Katharina Mossinger, Stefan Kuffer, and Christian Sauer et al. MYC high level gene amplification is a distinctive feature of angiosarcomas after irradiation or chronic lymphedema. The American Journal of Pathology, 176(1):34-39, 2010.
726 Patrick S. Tarpey, Sam Behjati, Susanna L. Cooke, Peter Van Loo, David C. Wedge, and Nischalan Pillay et al. Frequent mutation of the major cartilage collagen gene COL2A1 in chondrosarcoma. Nature Genetics, 45(8):923-926, 2013.
727 Janusz Limon, Anna Szadowska, Mariola Iliszko, Malgorzata Babinska, Krzysztof Mrozek, and Janusz Jaskiewicz et al. Recurrent chromosome changes in two adult fibrosarcomas. Genes, Chromosomes and Cancer, 21(2):119-123, 1998.
728 Eva Van den Berg, Willemina M. Molenaar, Harald J. Hoekstra, Willem A. Kamps, and Bauke De Jong. DNA ploidy and karyotype in recurrent and metastatic soft tissue sarcomas. Modern Pathology, 5(5):505-514, 1992.
729 Paola Dal Cin, Patrick Pauwels, Raf Sciot, and Herman Van Den Berghe. Multiple chromosome rearrangements in a fibrosarcoma. Cancer Genetics and Cytogenetics, 87(2):176-178, 1996.

730 Jilong Yang, Xiaoling Du, Kexin Chen, Antti Ylipaa, Alexander J.F. Lazar, and Jonathan Trent et al. Genetic aberrations in soft tissue leiomyosarcoma. Cancer Letters, 275(1):1-8, 2009.
731 Avery A. Sandberg. Updates on the cytogenetics and molecular genetics of bone and soft tissue tumors: leiomyosarcoma. Cancer Genetics and Cytogenetics, 161(1):1-19, 2005.
732 Ahmed Idbaih, Jean-Michel Coindre, Josette Derre, Odette Mariani, Philippe Terrier, and Dominique Ranchere et al. Myxoid malignant fibrous histiocytoma and pleomorphic liposarcoma share very similar genomic imbalances. Laboratory Investigation, 85(2):176-181, 2005.
733 Hannelore Schmidt, Frank Bartel, Matthias Kappler, Peter Wurl, Heidemarie Lange, and Matthias Bache et al. Gains of 13q are correlated with a poor prognosis in liposarcoma. Modern Pathology, 18(5):638-644, 2005.
734 Barry S. Taylor, Jordi Barretina, Nicholas D. Socci, Penelope DeCarolis, Marc Ladanyi, and Matthew Meyerson et al. Functional copy-number alterations in cancer. PLOS ONE, 3(9):e3179, 2008.
735 Christopher D.M. Fletcher, Paola Dal Cin, Ivo De Wever, Nils Mandahl, Fredrik Mertens, and Felix Mitelman et al. Correlation between clinicopathological features and karyotype in spindle cell sarcomas: a report of 130 cases from the CHAMP study group. The American Journal of Pathology, 154(6):1841-1847, 1999.
736 Fredrik Mertens, Paola Dal Cin, Ivo De Wever, Christopher D.M. Fletcher, Nils Mandahl, and Felix Mitelman et al. Cytogenetic characterization of peripheral nerve sheath tumours: a report of the CHAMP study group. The Journal of Pathology, 190(1):31-38, 2000.
737 R. Stuart Bridge, Julia Ann Bridge, James R. Neff, Sabine Naumann, Pamela A. Althof, and Leslie A. Bruch. Recurrent chromosomal imbalances and structurally abnormal breakpoints within complex karyotypes of malignant peripheral nerve sheath tumour and malignant triton tumour: a cytogenetic and molecular cytogenetic study. Journal of Clinical Pathology, 57(11):1172-1178, 2004.

738 Fredrik Mertens, Christopher D.M. Fletcher, Paola Dal Cin, Ivo De Wever, Nils Mandahl, and Felix Mitelman et al. Cytogenetic analysis of 46 pleomorphic soft tissue sarcomas and correlation with morphologic and clinical features: a report of the CHAMP study group. Genes, Chromosomes and Cancer, 22(1):16-25, 1998.
739 Anwar N. Mohamed, Mark M. Zalupski, James R. Ryan, Fred Koppitch, Stanley Balcerzak, Raymond Kempf, and Sandra R. Wolman. Cytogenetic aberrations and DNA ploidy in soft tissue sarcoma: a Southwest Oncology Group Study. Cancer Genetics and Cytogenetics, 99(1):45-53, 1997.
740 Guidong Li, Akira Ogose, Hiroyuki Kawashima, Hajime Umezu, Tetsuo Hotta, and Tsuyoshi Tohyama et al. Cytogenetic and real-time quantitative reverse-transcriptase polymerase chain reaction analyses in pleomorphic rhabdomyosarcoma. Cancer Genetics and Cytogenetics, 192(1):1-9, 2009.
741 Anthony Gordon, Aidan McManus, John Anderson, Cyril Fisher, Syuiti Abe, and Takayuki Nojima et al. Chromosomal imbalances in pleomorphic rhabdomyosarcomas and identification of the alveolar rhabdomyosarcoma-associated PAX3-FOXO1A fusion gene in one case. Cancer Genetics and Cytogenetics, 140(1):73-77, 2003.
742 Josette Derre, Real Lagace, Andre Nicolas, Aline Mairal, Frederic Chibon, and Jean-Michel Coindre et al. Leiomyosarcomas and most malignant fibrous histiocytomas share very similar comparative genomic hybridization imbalances: an analysis of a series of 27 leiomyosarcomas. Laboratory Investigation, 81(2):211-215, 2000.
743 Marcelo L. Larramendy, Massimiliano Gentile, Sonia Soloneski, Sakari Knuutila, and Tom Bohling. Does comparative genomic hybridization reveal distinct differences in DNA copy number sequence patterns between leiomyosarcoma and malignant fibrous histiocytoma? Cancer Genetics and Cytogenetics, 187(1):1-11, 2008.
744 Ana Carneiro, Princy Francis, Par-Ola Bendahl, Josefin Fernebro, Mans Akerman, and Christopher Fletcher et al. Indistinguishable genomic profiles and shared prognostic markers in undifferentiated pleomorphic sarcoma and leiomyosarcoma: different sides of a single coin? Laboratory Investigation, 89(6):668-675, 2009.
745 Ching C. Lau, Charles P. Harris, Xin-Yan Lu, Laszlo Perlaky, Sheila Gogineni, and Murali Chintagumpala et al. Frequent amplification and rearrangement of chromosomal bands 6p12-p21 and 17p11.2 in osteosarcoma. Genes, Chromosomes and Cancer, 39(1):11-21, 2004.
746 Shamini Selvarajah, Maisa Yoshimoto, Olga Ludkovski, Paul C. Park, Jane Bayani, and Paul Thorner et al. Genomic signatures of chromosomal instability and osteosarcoma progression detected by high resolution array CGH and interphase FISH. Cytogenetic and Genome Research, 122(1):5-15, 2008.
747 Jane Bayani, Maria Zielenska, Ajay Pandita, Khaldoun Al-Romaih, Jana Karaskova, and Karen Harrison et al. Spectral karyotyping identifies recurrent complex rearrangements of chromosomes 8, 17, and 20 in osteosarcomas. Genes, Chromosomes and Cancer, 36(1):7-16, 2003.
748 Julia A. Bridge, Marilu Nelson, Erin McComb, Michael H. McGuire, Howard Rosenthal, and Gerardo Vergara et al. Cytogenetic findings in 73 osteosarcoma specimens and a review of the literature. Cancer Genetics and Cytogenetics, 95(1):74-87, 1997.

Appendices

Appendix A World Health Organisation classification of soft tissue tumours and bone tumours

SOFT TISSUE TUMOURS

Adipocytic tumours
Benign: Lipoma; Lipomatosis; Lipomatosis of nerve; Lipoblastoma / lipoblastomatosis; Angiolipoma; Myolipoma of soft tissue; Chondroid lipoma; Extra-renal angiomyolipoma; Extra-adrenal myelolipoma; Spindle cell / pleomorphic lipoma; Hibernoma
Intermediate (locally aggressive): Atypical lipomatous tumour / well differentiated liposarcoma
Malignant: Dedifferentiated liposarcoma; Myxoid liposarcoma; Pleomorphic liposarcoma; Liposarcoma, not otherwise specified
Atypical lipomatous tumour (ALT): adipocytic (lipoma-like), sclerosing, inflammatory types; Dedifferentiated liposarcoma

Fibroblastic / myofibroblastic tumours
Benign: Nodular fasciitis; Proliferative fasciitis; Proliferative myositis; Myositis ossificans; Fibro-osseous pseudotumour of digits; Ischemic fasciitis; Elastofibroma; Fibrous hamartoma of infancy; Fibromatosis colli; Juvenile hyaline fibromatosis; Inclusion body fibromatosis; Fibroma of tendon sheath; Desmoplastic fibroblastoma; Mammary-type myofibroblastoma; Calcifying aponeurotic fibroma; Angiomyofibroblastoma; Cellular angiofibroma; Nuchal-type fibroma; Gardner fibroma; Calcifying fibrous tumour
Intermediate (locally aggressive): Palmar / plantar fibromatosis; Desmoid-type fibromatosis; Lipofibromatosis; Giant cell fibroblastoma
Intermediate (rarely metastasizing): Dermatofibrosarcoma protuberans; Fibrosarcomatous dermatofibrosarcoma protuberans; Pigmented dermatofibrosarcoma protuberans; Solitary fibrous tumour; Solitary fibrous tumour, malignant; Inflammatory myofibroblastic tumour; Low grade myofibroblastic sarcoma; Myxoinflammatory fibroblastic sarcoma; Atypical myxoinflammatory fibroblastic tumour; Infantile fibrosarcoma
Malignant: Adult fibrosarcoma; Myxofibrosarcoma; Low-grade fibromyxoid sarcoma; Sclerosing epithelioid fibrosarcoma
Nodular fasciitis; Extrapleural solitary fibrous tumour; Low grade fibromyxoid sarcoma (LGFMS); Sclerosing epithelioid fibrosarcoma (SEF)

So-called fibrohistiocytic tumours
Benign: Tenosynovial giant cell tumour (localized type; diffuse type; malignant); Deep benign fibrous histiocytoma
Intermediate (rarely metastasizing): Plexiform fibrohistiocytic tumour; Giant cell tumour of soft tissue
Tenosynovial giant cell tumour

Smooth-muscle tumours
Benign: Leiomyoma of deep soft tissue
Malignant: Leiomyosarcoma (excluding skin)
Leiomyosarcoma

Pericytic (perivascular) tumours
Glomus tumour (and variants); Glomangiomatosis; Malignant glomus tumour; Myopericytoma; Myofibroma; Myofibromatosis; Angioleiomyoma

Skeletal-muscle tumours
Rhabdomyoma; Embryonal rhabdomyosarcoma; Alveolar rhabdomyosarcoma; Pleomorphic rhabdomyosarcoma; Spindle cell / Sclerosing rhabdomyosarcoma
Alveolar rhabdomyosarcoma (ARMS)

Vascular tumours
Benign: Haemangioma (synovial; venous; arteriovenous haemangioma / malformation); Epithelioid haemangioma; Angiomatosis; Lymphangioma
Intermediate (locally aggressive): Kaposiform haemangioendothelioma
Intermediate (rarely metastasizing): Retiform haemangioendothelioma; Papillary intralymphatic angioendothelioma; Composite haemangioendothelioma; Pseudomyogenic (epithelioid sarcoma-like) haemangioendothelioma; Kaposi sarcoma
Malignant: Epithelioid haemangioendothelioma; Angiosarcoma of soft tissue

Gastrointestinal stromal tumours
Benign gastrointestinal stromal tumour; Gastrointestinal stromal tumour; Gastrointestinal stromal tumour

Nerve sheath tumours
Benign: Schwannoma (including variants); Melanotic schwannoma; Neurofibroma (including variants); Plexiform neurofibroma; Perineurioma; Malignant perineurioma; Granular cell tumour; Dermal nerve sheath myxoma; Solitary circumscribed neuroma; Ectopic meningioma; Nasal glial heterotopia; Benign Triton tumour; Hybrid nerve sheath tumours
Malignant: Malignant peripheral nerve sheath tumour; Epithelioid malignant nerve sheath tumour; Malignant Triton tumour; Malignant granular cell tumour; Ectomesenchymoma

Tumours of uncertain differentiation
Benign: Acral fibromyxoma; Intramuscular myxoma (including cellular variant); Juxta-articular myxoma; Deep ("aggressive") angiomyxoma; Pleomorphic hyalinizing angiectatic tumour; Ectopic hamartomatous thymoma
Intermediate (locally aggressive): Haemosiderotic fibrolipomatous tumour
Intermediate (rarely metastasizing): Atypical fibroxanthoma; Angiomatoid fibrous histiocytoma; Ossifying fibromyxoid tumour; Ossifying fibromyxoid tumour, malignant; Mixed tumour NOS; Mixed tumour NOS, malignant; Myoepithelioma; Myoepithelial carcinoma; Phosphaturic mesenchymal tumour; Phosphaturic mesenchymal tumour
Malignant: Synovial sarcoma NOS; Synovial sarcoma, spindle cell; Synovial sarcoma, biphasic; Epithelioid sarcoma; Alveolar soft-part sarcoma; Clear cell sarcoma of soft tissue; Extraskeletal myxoid chondrosarcoma; Extraskeletal Ewing sarcoma; Desmoplastic small round cell tumour; Extra-renal rhabdoid tumour; Neoplasms with perivascular epithelioid cell differentiation (PEComa); PEComa NOS, benign; PEComa NOS, malignant; Intimal sarcoma

Undifferentiated / unclassified sarcomas
Undifferentiated spindle cell sarcoma; Undifferentiated pleomorphic sarcoma; Undifferentiated round cell sarcoma; Undifferentiated epithelioid sarcoma; Undifferentiated sarcoma NOS
Undifferentiated round cell and spindle cell sarcoma; Undifferentiated pleomorphic sarcoma (UPS)

TUMOURS OF BONE

Chondrogenic tumours
Benign: Osteochondroma; Chondroma; Enchondroma; Periosteal chondroma; Osteochondromyxoma; Subungual exostosis; Bizarre parosteal osteochondromatous proliferation; Synovial chondromatosis
Intermediate (locally aggressive): Chondromyxoid fibroma; Atypical cartilaginous tumour / chondrosarcoma grade I
Intermediate (rarely metastasizing): Chondroblastoma
Malignant: Chondrosarcoma Grade II, Grade III; Dedifferentiated chondrosarcoma; Mesenchymal chondrosarcoma; Clear cell chondrosarcoma
Osteochondromyxoma; Bizarre parosteal osteochondromatous proliferation; Chondrosarcoma (grades I-III)

Osteogenic tumours
Benign: Osteoma; Osteoid osteoma
Intermediate (locally aggressive): Osteoblastoma
Malignant: Low-grade central osteosarcoma; Conventional osteosarcoma; Chondroblastic osteosarcoma; Fibroblastic osteosarcoma; Osteoblastic osteosarcoma; Telangiectatic osteosarcoma; Small cell osteosarcoma; Secondary osteosarcoma; Parosteal osteosarcoma; Periosteal osteosarcoma; High-grade surface osteosarcoma

Osteoclastic giant cell rich tumours
Benign: Giant cell lesion of the small bones
Intermediate (locally aggressive): Giant cell tumour of bone
Malignant: Malignancy in giant cell tumour of bone

Fibrohistiocytic tumours
Benign: Benign fibrous histiocytoma / non-ossifying fibroma

Notochordal tumours
Benign: Benign notochordal tumour
Malignant: Chordoma

Vascular tumours
Benign: Haemangioma
Intermediate (locally aggressive, rarely metastasizing): Epithelioid hemangioma
Malignant: Epithelioid hemangioendothelioma; Angiosarcoma

Reference: Bridge, J. A., et al. WHO classification of tumours of soft tissue and bone. International Agency for Research on Cancer, 2013.

Appendix B Novel tumour-predisposing genes identified by whole exome sequencing

| Cancer | Population | Patients | Genes | Citation | Additional studies |
| Asbestos-exposed lung adenocarcinoma | Finland | 26 cases | MRPL1, SDK1, SEMA5B, INPP4A | 594 |  |
| Adenomatous polyposis and colorectal carcinomas | Netherlands, USA | 51 patients from 48 families, negative for APC and MUTYH mutations | NTHL1 | 595 |  |
| Atypical gastric neuroendocrine tumour, type 1 | Spain | Large family, with consanguineous parents and 5/10 affected children | ATP4A | 596 |  |
| Brain | Germany | 1 family | CASP9 | 597 |  |
| Breast | Poland, Canada | 144 Polish and 51 French-Canadian patients with family history and/or early onset, negative for founder mutations in BRCA1, BRCA2, CHEK2, NBN and PALB2 | RECQL | 598 | 599 |
|  | China | 9 early-onset patients with family history, negative for BRCA1 and BRCA2 mutations | RECQL | 599 | 598 |
|  | Finland | 24 breast cancer patients from 11 families, negative for BRCA1 and BRCA2 mutations | FANCM | 600 | 601, 602 |
|  | Multiple | 89 early-onset breast cancer patients from 47 families | RINT1 | 603 |  |
|  | Australia | 33 breast cancer patients from 15 families, negative for BRCA1 and BRCA2 mutations | FANCC, BLM | 604 | 605-607 |
|  | Multiple | 13 families | XRCC2 | 608 |  |
|  | Finland | 129 female hereditary breast and/or ovarian cancer patients, up to 989 female controls | ATM, MYC, PLAU, RAD1, and RRM2B | 609 |  |
| Chondrosarcoma | France | 2 third-degree affected relatives in a single family | EXT2 | 610 |  |
| Chronic lymphocytic leukaemia | European | 59 chronic lymphocytic leukaemia-prone families and 173 unrelated chronic lymphocytic leukaemia patients | ITGB2 | 611 |  |
|  | UK | 66 chronic lymphocytic leukaemia families | POT1 | 612 |  |
| Colorectal | China | 23 early onset colorectal cancer patients from 21 families | EIF2AK4 | 613 |  |
|  | Spain | 3 patients from a large family | FAN1 | 614 |  |
|  | Spain | Patients from 29 families, negative for mutations in known colorectal cancer genes | CDKN1B, XRCC4, EPHX1, NFKBIZ, SMARCA4, BARD1 | 615 |  |
|  | Finland | 4 patients from a large family, negative for mutations in known colorectal cancer genes | RPS20 | 616 |  |
|  | Finland | 96 patients with family history of colorectal cancer, negative for mutations in known colorectal cancer genes | UACA, SFXN4, TWSG1, PSPH, NUDT7, ZNF490, PRSS37, CCDC18, PRADC1, MRPL3, AKR1C4 | 617 |  |
|  | Taiwan | 50 colorectal cancer cases | NRAS | 618 |  |
|  | UK | 1,006 early-onset familial CRC cases and 1,609 healthy controls | MRE11, POLE2 and POT1 | 619 |  |
|  | Netherlands | 55 colorectal cancer cases with a disease onset before 45 years of age | PTPN12 and LRP6 | 620 |  |
| Colorectal adenomas and carcinomas | UK | Probands from 15 colorectal adenoma families, negative for mutations in APC and MUTYH | POLE, POLD1 | 621 | 622-629 |
|  | Ashkenazi | 2 sisters | MCM9 | 630 |  |
| Colorectal adenomatous polyposis | Germany | 102 unrelated individuals | MSH2 | 631 |  |
|  | Germany | 12 colorectal adenomas from seven unrelated patients | DSC2, PIEZO1, ZSWIM7 | 629 |  |
| Colorectal adenomatous polyposis | Germany | 12 colorectal adenomas from 7 unrelated patients with unexplained sporadic adenomatous polyposis | DSC2, PIEZO1, ZSWIM7 | 632 |  |
| Esophageal adenocarcinoma and Barrett esophagus | USA | Large family | VSIG10L | 633 |  |
| Esophageal squamous cell carcinoma | China | 51 stage I and 53 stage III esophageal squamous cell carcinomas | FAM84B | 634 |  |
| Familial schwannomatosis | China | Large family, negative for mutations in known disease-causing genes | COQ6 | 635 |  |
| Gastric | Finland | Large family with the diffuse type of gastric cancer, negative for mutations in CDH1 | INSR, FBXO24, DOT1L | 636 |  |
|  | Netherlands | Large family with the diffuse type of gastric cancer, negative for mutations in CDH1 | CTNNA1 | 637 | 638 |
| Glioma | Multiple | 90 patients from 55 families | POT1 | 639 |  |
| Hodgkin lymphoma | Middle East | Family with 3 out of 5 affected children and healthy parents | ACAN | 640 |  |
|  | Finland | Large family with nodular lymphocyte predominant Hodgkin lymphoma | NPAT | 641 |  |
|  | USA | 17 Hodgkin lymphoma prone families with three or more affected cases or obligate carriers (69 individuals) | KDR | 642 |  |
| Infantile myofibromatosis | Brazil | 2 affected brothers and their healthy consanguineous parents | NDRG4 | 643 |  |
|  | Multiple | 11 patients from 4 families, and 5 simplex cases | PDGFRB | 644 | 645 |
|  | USA | 11 patients from 9 families | PDGFRB | 645 |  |
|  | USA | Large family, negative for PDGFRB mutations | NOTCH3 | 645 | 644 |
| Invasive pituitary adenomas | China | 6 invasive pituitary adenomas and 6 non-invasive pituitary adenomas | DPCR1, EGFL7, the PRDM family and LRRC50 | 646 |  |
| Juvenile hamartomatous polyposis syndrome | USA | Single patient, with extensive family history, negative for known disease-causing mutations | SMAD9 | 647 |  |
| Kaposi sarcoma | Finland | Large family | STAT4 | 648 |  |
| Kaposiform hemangioendothelioma | Japan | Matched tumour and normal sample from an individual | ITGB2, IL-32 and DIDO1 | 649 |  |
| Liver | France | 2 individuals from a family with recurrent well-differentiated hepatocellular tumours | DICER1 | 650 |  |
| Lung | USA | Large family | PARK | 651 |  |
|  | Taiwan | Large family | YAP1 | 652 |  |
|  | Arab | An individual with lung cancer from an extended family segregating different types of hereditary cancer | NBN | 653 |  |
| Lymphoblastic leukaemia | Multiple | Large family | ETV6 | 654 | 655 |
| Male breast | Italy | 1 male and 2 female BRCA1/2 mutation-negative breast cancer cases from a family | PALB2 | 656 |  |
| Melanoma | Multiple | 101 patients from 56 melanoma families, negative for CDKN2A and CDK4 mutations | POT1 | 208 | 657 |
|  | Multiple | 184 patients from 105 melanoma families, negative for CDKN2A and CDK4 mutations | POT1 | 657 | 208 |
|  | USA, Australia, UK | Patient from large melanoma family | MITF | 658 | 659, 660 |
|  | USA | Uveal melanoma patients | BAP1 | 661 | 662-676 |
|  | Finland | 21 cases | BAP1 | 594 |  |
| Melanotic neuroectodermal tumour of infancy | UK | Single patient | CDKN2A | 677 |  |
| Multiple spinal meningiomas | UK | 3 unrelated individuals with familial multiple spinal meningiomas, negative for mutations in NF2 and SMARCB1 | SMARCE1 | 678 | 679-681 |
| Nasopharyngeal carcinoma | China | 161 NPC cases and 895 controls of Southern Chinese descent | MST1R | 682 |  |
| Nonmedullary thyroid cancer | USA | Large family | HABP2 | 683 |  |
|  | China | 5 subjects from a large family | RTFC | 684 |  |
| Ovarian | UK, USA, Australia, Germany | 412 high grade serous ovarian cancer | FANCM | 685 |  |
| Paediatric hepatocellular carcinoma | USA | Single patient | ABCB11 | 686 |  |
| Paediatric poorly differentiated carcinoma | USA | Patient with pediatric poorly differentiated carcinoma | APC | 687 |  |
| Pancreatic ductal adenocarcinoma | Japan | 4 cases of KRAS mutation-negative pancreatic ductal adenocarcinoma | DCTN1-ALK fusion | 688 |  |
| Papillary thyroid carcinoma | USA, Canada | Large family | SRRM2 | 689 |  |
| Paraganglioma | Spain | Patient with multiple paragangliomas and family history of the disease | MDH2 | 690 |  |
| Penile squamous cell carcinoma | UK | 27 patients | CSN1 | 691 |  |
| Pheochromocytoma | Spain | 3 patients with familial pheochromocytoma, negative for mutations in known disease causing genes | MAX | 692 | 693, 694 |
| Pre-B cell acute lymphoblastic leukemia | Puerto Rican, African American ancestry | 2 families | PAX2 | 695 |  |
| Primary thyroid and breast | USA | 14 female research participants with primary thyroid and breast cancers without mutations in PTEN | PARP4 | 696 |  |
| Primary lymphedema associated with a predisposition to acute myeloid leukemia (Emberger syndrome) | European and Chinese descent | 2 unrelated patients with family history and 1 sporadic case | GATA2 | 697 | 698-704 |
| Prostate | USA | 91 patients from 19 families | BTNL2 | 705 |  |
|  | African American | 652 aggressive prostate cancer patients and 752 disease-free controls | TET2 | 706 |  |
|  | Japan | 140 patients with PC from 66 families | TRRAP | 707 |  |
|  | USA | 75 high risk families | TANGO2, OR5H14, and CHAD | 708 |  |
|  | Australia | 5 prostate cancer-affected men from 3 families | PCTP, MCRS1, ATRIP, PARP2, CYP3A43, DOK3, PLEKHH3, HEATR5B, GPR124, and HKR1 | 709 |  |
| Renal angiomyolipoma | USA | 15 patients | TSC1 and TSC2 | 710 |  |
| Renal cell carcinoma | Korea | 10 patients | FOXC2 and CLIP4 | 711 |  |
| Rosette-forming glioneuronal tumour | African American | A patient with rosette-forming glioneuronal tumour of the fourth ventricle | FGFR1, PIK3CA, PTPN11 | 712 |  |
| Small cell carcinoma of the ovary, hypercalcemic type | USA, Canada, UK | 6 patients from 3 families | SMARCA4 | 713 | 714-716 |
|  | Multiple | 7 patients | SMARCA4 | 714 | 713, 715, 716 |
| Small intestinal carcinoids | USA | Large family | IPMK | 717 |  |
| Squamous cell carcinoma of head and neck | Korea | 18 cisplatin-resistant metastatic tumours and matched germline | REV3L | 718 |  |
| Transitional cell carcinoma | China | 2 patients | HECW1 | 719 |  |
| Wilms tumour | UK | 35 families | CTR9 | 720 |  |

The PubMed search was performed using the string (exome OR exom* OR NGS OR whole genome OR next-generation OR next generation OR WES) AND (familial OR hereditary OR susceptib* OR risk OR germline OR germline) AND (sequencing OR analysis) AND (cancer OR malignancy OR tumor* OR tumour*) AND English [lang]. Only studies that reported the identification of novel genes by exome sequencing were included. Search results were included up to March 2017.
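The Boolean structure of the search string can be made explicit with a short sketch: terms are joined with OR inside each group, and the groups are joined with AND. The grouping into Python lists and the helper name `build_query` are illustrative assumptions, not part of the original search protocol. The term "germline", which appears twice in the quoted string, is listed once here, and no quoting of multi-word terms is assumed.

```python
# Term groups copied from the search string quoted in the table note.
# The list layout and the helper name are hypothetical, for illustration only.
TERM_GROUPS = [
    ["exome", "exom*", "NGS", "whole genome", "next-generation",
     "next generation", "WES"],
    ["familial", "hereditary", "susceptib*", "risk", "germline"],
    ["sequencing", "analysis"],
    ["cancer", "malignancy", "tumor*", "tumour*"],
]
LANGUAGE_FILTER = "English [lang]"


def build_query(groups, language_filter):
    """OR the terms inside each group, then AND the groups together."""
    clauses = ["(" + " OR ".join(terms) + ")" for terms in groups]
    return " AND ".join(clauses + [language_filter])


if __name__ == "__main__":
    print(build_query(TERM_GROUPS, LANGUAGE_FILTER))
```

Keeping the groups as data makes it easy to see which concept each clause covers (sequencing technology, heritability, method, disease) and to rerun the search with an adjusted term list.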


Appendix C Familial cancer syndromes associated with sarcomas

| Syndrome | Sarcoma | Inheritance | Gene (location) | Features |
| Beckwith-Wiedemann syndrome | RMS | AD | NSD1 (5q35.3), KIP2 (11p15.4), CDKN1C (11p15.4), H19 (11p15.5), KCNQ1OT1 (11p15.5), ICR1 (11p15.5) | Overgrowth syndrome: macroglossia, omphalocele, hemihypertrophy, gigantism, and associated tumour predisposition |
| Bloom syndrome | RMS | AR | BLM (15q26.1) | Progerioid syndrome: growth retardation, sun sensitivity, telangiectasias and other skin changes, and associated tumour predisposition |
| Costello syndrome | RMS | AD | HRAS (11p15.5) | Rasopathy: coarse facies, short stature, distinctive hand posture and appearance, cardiac anomalies, developmental delay, congenital myopathy |
| Familial adenomatous polyposis | Gardner fibroma, desmoid, RMS | AD | APC (5q21-q22) | Individuals develop hundreds to thousands of polyps of the colon and rectum that can progress to colorectal carcinoma if not treated |
| Familial gastrointestinal stromal tumour | Gastrointestinal stromal tumour | AD | KIT (4q12), PDGFRA (4q12) | Multiple gastrointestinal stromal tumours |
| Glomus tumours | Glomus tumour | AD | GLMN (1p22.1) | Glomuvenous malformations, glomangioma |
| Gorlin-Goltz nevoid basal cell carcinoma | RMS, fetal rhabdomyoma | AD | PTCH (Xp11.23) | Multiple basal cell carcinomas, odontogenic keratocysts, palmar/plantar pits, calcification of the falx cerebri, rib abnormalities |
| Hereditary leiomyomatosis and renal cell carcinoma syndrome | Leiomyosarcoma (uterus) | AD | FH (1q43) | Tumour predisposition syndrome: cutaneous piloleiomyomas, uterine leiomyomas, type 2 papillary renal cell carcinomas |
| Hereditary Retinoblastoma | Sarcomas as second malignant neoplasm, lipoma | AD | RB1 (13q14.2) | Retinoblastoma, often bilateral and typically in very early childhood |
| Leiomyomatosis-Alport syndrome | Leiomyoma | XLD | COL4A6 (Xq22.3) | Alport syndrome plus multiple, diffuse leiomyomas |
| Li-Fraumeni syndrome | RMS, undifferentiated pleomorphic sarcoma, pleomorphic liposarcoma | AD | TP53 (17p13.1) | Inherited cancer syndrome: early onset of tumours, multiple tumours within individual; most commonly sarcomas, others include breast cancer, central nervous system tumours, leukaemia and adrenocortical carcinoma |
| Maffucci syndrome | Spindle cell hemangiomas |  | IDH1 (2q34), IDH2 (15q26.1) | Multiple enchondromas (increased risk of chondrosarcoma) and hemangiomas |
| Mazabraud syndrome | Myxomas |  | GNAS1 (20q13.32) | Myxomas and fibrous dysplasia |
| Mosaic variegated aneuploidy | RMS | AR | BUB1B (15q15) | Intrauterine growth restriction, microcephaly, spectrum of other anomalies, and a high risk of malignancy including RMS, Wilms, and hematologic malignancy |
| Neurofibromatosis type 1 | Malignant peripheral nerve sheath tumour, RMS, neurofibroma, gastrointestinal stromal tumour | AD | NF1 (17q11.2) | Cafe-au-lait spots, Lisch nodules in the eye, increased susceptibility to benign and malignant tumours |
| Neurofibromatosis type 2 | Schwannoma, RMS, malignant rhabdoid tumour | AD | NF2 (22q12.2) | Tumours of the eighth cranial nerve (usually bilateral) and other schwannomas, meningiomas of the brain, and schwannomas of the dorsal roots of the spinal cord |
| Nijmegen breakage syndrome | RMS | AR | NBS1 (8q21.3) | Chromosomal instability syndrome: microcephaly, growth retardation, immunodeficiency, and tumour predisposition |
| Noonan syndrome | RMS, lymphangioma | AD | PTPN11 (12q24) | Rasopathy: dysmorphic facies, short stature, neck webbing, cardiac anomalies, deafness, bleeding diathesis |
| Roberts syndrome | RMS | AR | ESCO2 (8p21.1) | Range of mild to severe malformation of bones, arms, legs, skull, and face; features similar to those seen in thalidomide exposure |
| Rothmund-Thomson syndrome | Osteosarcoma | AR | RTS (18q24.3) | Skin atrophy, telangiectasia, hyper- and hypopigmentation, congenital skeletal abnormalities, short stature, premature ageing, and increased risk of malignant disease |
| Rubinstein-Taybi syndrome | RMS | AD | CREBBP (16p13.1) | Multiple congenital anomalies, developmental delay, microcephaly, dysmorphic features, and tumour predisposition |
| Simpson-Golabi-Behmel syndrome | Embryonal tumours | XLR | GPC3 | Overgrowth syndrome: coarse facies, congenital heart defects, overgrowth, and other anomalies |
| Tuberous sclerosis | RMS, cardiac rhabdomyoma, chordoma, renal angiomyolipoma, perivascular epithelioid cell tumours | AD | TSC1 (9q34), TSC2 (16p13.3), TSC3 (12q22-24.1) | Hamartomas of multiple organs, angiomyolipomas, other renal tumours (cysts and renal cell carcinomas), lymphangioleiomyomatosis, angiofibromas and other skin lesions |
| Werner syndrome | RMS | AR | WRN (8p12-p11.2) | Progerioid syndrome: scleroderma-like skin changes, early onset atherosclerosis and diabetes |

AD: autosomal dominant. AR: autosomal recessive. RMS: rhabdomyosarcoma. XLD: X linked dominant. XLR: X linked recessive.


Appendix D Translocations associated with sarcomas

| Translocation | Genes |

Alveolar rhabdomyosarcoma
| t(2;13)(q36;q14) | PAX3-FOXO1 |
| t(1;13)(p36;q14) | PAX7-FOXO1 |
| t(8;13;9)(p11;q14;q32) | FOXO1-FGFR1 |
| t(X;2)(q13;q36) | PAX3-FOXO4 |
| t(2;2)(p23;q36) | PAX3-NCOA1 |
| t(2;8)(q36;q13) | PAX3-NCOA2 |

Alveolar soft-part sarcoma
| t(X;17)(p11.2;q25) | ASPL-TFE3 |

Angiomatoid fibrous histiocytoma
| t(2;22)(q33;q12) | EWSR1-CREB1 |
| t(12;16)(q13;p11) | FUS-ATF1 |
| t(12;22)(q13;q12) | EWSR1-ATF1 |

Chondroid lipoma
| t(11;16)(q13;p13) | C11orf95-MKL2 |

Clear-cell sarcoma
| t(2;22)(q33;q12) | EWSR1-CREB1 |
| t(12;22)(q13;q12) | EWSR1-ATF1 |

Congenital fibrosarcoma
| t(12;15)(p13;q25) | ETV6-NTRK3 |

Dedifferentiated liposarcoma
| t(5;5)(p15;p15) | TRIO-TERT |
| t(9;12)(q33;q15) | CNOT2-ASTN2 |
| ?t(12)(q14q14) | CTDSP2-FAM19A2 |
| t(9;12)(q33;q21) | NR6A1-TRHDE |
| ?t(12)(q15q21) | NUP107-LGR5 |
| t(9;12)(q33;q15) | NUP107-PAPPA |
| t(5;14)(p13;q32) | RCOR1-WDR70 |

Dermatofibrosarcoma protuberans
| t(17;22)(q22;q13) | COL1A1-PDGFB |

Desmoplastic small round-cell tumour
| t(11;22)(p13;q12) | EWSR1-WT1 |
| t(21;22)(q22;q12) | EWSR1-ERG |

Endometrial stromal sarcoma
| t(6;10)(p21;p11) | EPC1-PHF1 |
| t(6;7)(p21;p15) | JAZF1-PHF1 |
| t(7;17)(p15;q11) | JAZF1-SUZ12 |
| t(1;6)(p34;p21) | MEAF6-PHF1 |
| t(10;17)(q23;p13) | YWHAE-FAM22A |
| t(10;17)(q22;p13) | YWHAE-FAM22B |
| t(X;22)(p11;q13) | ZC3H7B-BCOR |

Epithelioid hemangioendothelioma
| t(1;3)(p36;q25) | WWTR1-CAMTA1 |
| t(X;11)(p11;q22) | YAP1-TFE3 |

Epithelioid sarcoma of the ovary
| t(12;12)(q23;q24) | CMKLR1-HNF1A |
| t(12;12)(q13;q22) | ERBB3-CRADD |
| t(1;22)(p36;q11) | SMARCB1-WASF2 |

Ewing's sarcoma
| t(11;22)(q24;q12) | EWSR1-FLI1 |
| t(21;22)(q22;q12) | EWSR1-ERG |
| t(7;22)(p22;q12) | EWSR1-ER81 |
| t(17;22)(q21;q12) | EWSR1-ETV4 |
| t(2;22)(q33;q12) | EWSR1-FEV |
| t(21;22)(q22;q12) | EWSR1-ERG |
| t(16;21)(p11;q24) | FUS-ERG |
| t(2;16)(q35;p11) | FUS-FEV |
| t(20;22)(q13;q12) | EWSR1-NFATC1 |
| t(6;22)(p21;q12) | EWSR1-POU5F1 |
| t(4;22)(q31;q12) | EWSR1-SMARCA5 |
| t(7;22)(p21;q12) | EWSR1-ETV1 |

Fibromyxoid sarcoma
| t(7;16)(q34;p11) | FUS-CREB3L2 |
| t(11;16)(p11;p11) | FUS-CREB3L1 |
| t(11;22)(p11;q12) | EWSR1-CREB3L1 |

Inflammatory myofibroblastic tumour
| 2p23 rearrangements | TPM3-ALK; TPM4-ALK |
| inv(2)(p23q35) | ATIC-ALK |
| t(2;11)(p23;p15) | CARS-ALK |
| t(2;17)(p23;q23) | CLTC-ALK |
| t(2;12)(p23;p11) | PPFIBP1-ALK |
| t(2;2)(p23;q13) | RANBP2-ALK |
| t(X;6)(p11;p24) | RREB1-TFE3 |
| t(2;4)(p23;q21) | SEC31A-ALK |
| t(1;2)(q21;p23) | TPM3-ALK |
| t(2;19)(p23;p13) | TPM4-ALK |
| t(2;2)(p21;p23) | EML4-ALK |

Kaposi's sarcoma
|  | EZH2, SIRT1 |

Leiomyoma of the uterus
| inv(7)(p21q22) | CUX1-AGR3 |
| t(12;14)(q14;q11) | HMGA2-CCNB1IP1 |
| t(7;12)(q31;q14) | HMGA2-COG5 |
| t(8;12)(q22;q14) | HMGA2-COX6C |
| t(12;14)(q14;q24) | HMGA2-RAD51L1 |

Leiomyosarcoma
|  | SIRT1 |

Lipoblastoma
| t(7;8)(q21;q12) | COL1A2-PLAG1 |
| t(2;8)(q31;q12.1) | COL3A1-PLAG1 |
| del(8)(q12q24) | HAS2-PLAG1 |

Lipoma
| t(5;12)(q33;q14) | EBF1-LOC204010 |
| t(2;12)(q37;q14) | HMGA2-CXCR7 |
| t(5;12)(q33;q14) | HMGA2-EBF1 |
| t(12;13)(q14;q13) | HMGA2-LHFP |
| t(3;12)(q28;q14) | HMGA2-LPP |
| t(9;12)(p22;q14) | HMGA2-NFIB |
| t(1;12)(p32;q14) | HMGA2-PPAP2B |
| t(3;12)(q28;q14) | LPP-C12orf9 |

Mesenchymal chondrosarcoma
| t(8;8)(q12;q21) | HEY1-NCOA2 |
| t(1;5)(q42;q32) | IRFBP2-CDX1 |

Myoepithelioma
| t(12;22)(q13;q12) | EWSR1-ATF1 |
| t(1;22)(q23;q12) | EWSR1-PBX1 |
| t(6;22)(p21;q12) | EWSR1-POU5F1 |
| t(19;22)(q13;q12) | EWSR1-ZNF444 |

Myxoid chondrosarcoma
| t(9;17)(q31;q12) | TAF15-NR4A3 |
| t(3;9)(q12;q31) | TFG-NR4A3 |
| t(9;15)(q31;q21) | TCF12-NR4A3 |
| t(9;22)(q22-31;q11-12) | EWSR1-NR4A3 |

Myxoid liposarcoma
| t(12;16)(q13;p11) | FUS-DDIT3 |
| t(12;22)(q13;q12) | EWSR1-DDIT3 |

Ossifying fibromyxoid tumour
| t(6;12)(p21;q24) | EP400-PHF1 |

PEComa
| t(X;1)(p11;p34) | SFPQ-TFE3 |
| t(14;X)(q24;q12) | RAD51B-OPHN1 |
| t(14;X)(q24;p11) | RAD51B-RRAGB |

Pericytoma
| t(7;12)(p22;q13) | ACTB-GLI1 |

Primary pulmonary myxoid sarcoma
| t(2;22)(q33;q12) | EWSR1-CREB1 |

Sclerosing epithelioid fibrosarcoma
| t(7;16)(q34;p11) | FUS-CREB3L2 |
| t(11;22)(p11;q12) | EWSR1-CREB3L1 |
| t(7;22)(q3;q12) | EWSR1-CREB3L2 |

Soft tissue angiofibroma
| t(5;8)(p15;q13) | AHRR-NCOA2 |
| t(7;8;14)(q11;q13;q31) | GTF2I-NCOA2 |

Soft tissue chondroma
| t(3;12)(q28;q14) | HMGA2-LPP |

Solitary fibrous tumour
| inv(12)(q13q13) | NAB2-STAT6 |

Spindle cell rhabdomyosarcoma
| t(6;8)(p21;q13) | SRF-NCOA2 |
| t(8;11)(q13;p15) | TEAD1-NCOA2 |
| t(6;6)(q22;q24) | VGLL2-CITED2 |
| t(6;8)(q22;q13) | VGLL2-NCOA2 |

Synovial sarcoma
| t(X;18)(p11;q11) | SS18-SSX1, SS18-SSX2, SS18-SSX4 |

Tenosynovial giant cell tumour
| t(1;2)(p13;q37) | COL6A3-CSF1 |

Undifferentiated sarcomas
| inv(X)(p11p11) | BCOR-CCNB3 |
| t(4;19)(q35;q13) | CIC-DUX4 |
| t(10;19)(q26;q13) | CIC-DUX4L10 |
| t(6;22)(p21;q12) | EWSR1-POU5F1 |
| t(2;22)(q31;q12) | EWSR1-SP3 |

Citations: 106, 107, 721-724.

Appendix E Genetically complex sarcomas

| Sarcoma | Genes | References |
| Angiosarcoma | Amplification: 8q24.21 (MYC), 10p12.33, 5q35.3 | 725 |
| Chondrosarcoma (types other than extraskeletal myxoid) | COL2A1, IDH1, IDH2, TP53, RB1 pathway | 726 |
| Embryonal rhabdomyosarcoma | Polysomy: 8, 2, 11, 12, 13 and 20. Monosomy: 10 and 15. LOH: 11p15.5 (IGF2, H19, CDKN1C, HOTS). Gains: 12q13 | 64 |
| Fibrosarcoma (other than congenital) | Multiple non-specific numerical and structural chromosomal abnormalities. Gain: 22q (PDGF-B) | 727-729 |
| Leiomyosarcoma | Gains: 1, 5, 6, 8, 15, 16, 17, 19, 20, 22, X. Losses: 1p, 2, 3, 4, 6q, 8, 9, 10p, 11p, 12q, 11q, 13, 16, 17p, 18, 19, 22q. Amplifications: 1, 5, 8, 12, 13, 17, 19, 20 | 730, 731 |
| Liposarcoma (types other than myxoid) | Gains: 1p, 1q21-q32, 2q, 3p, 3q, 5p12-p15, 5q, 6p21, 7p, 7q22, 8q, 10q, 12q12-q24, 13q, 14q, 15q, 17p, 17q, 18p, 18q12, 19p12, 19q13, 20q, 22q, and Xq21-q27. Losses: 1q, 2q, 3p, 4q, 10q, 11q, 12p13, 13q14, 13q21-qter, 14q23-24, 16q22, 17p13, 17q11.2, and 22q13 | 732-734 |
| Malignant peripheral nerve-sheath tumour | Gains: 7p21-q36, 7p22, 7q, 8, 8q11-23, 1q25-44, and 5q13-35. Losses: 1p12-13, 1p21, 1p36, 3p21-pter, 9p13-21, 9p22-24, 10, 10p11-15, 11p, 11q21-25, 13q14, 15p, 16/16q24, 17/17p, 17q11-12, 17q21-25, 22, 22p, 22q13, and 22q11-12. Ring chromosomes, trisomy 7, and rearrangements of 11p and 12q13-15. Breakpoints: 1p, 7p22 (ETV1), 11q13-23, 20q13 (SRC), and 22q11-13 (NF2) | 735-737 |
| Myxofibrosarcoma | Gains: 19p, 19q. Losses: 1q, 2q, 3p, 4q, 10q, 11q, and 13q (RB1). Amplification: 1, 5p, and 20q | 732 |
| Extraskeletal osteosarcoma | Gains: 1q, 2, 8, and 17p11. Losses: 1q, 2, 5, 6, 12, 13, 14, 15, 16, 18, 19, 20, 21, and Y | 64, 738, 739 |
| Pleomorphic rhabdomyosarcoma | Gains: 1p22-23, 5, 7p, 8, 14, 18/18, 20p, and 22. Losses: 2, 3p, 5q32-qter, 6, 10q23 (PTEN), 11, 13, 14, 15q21-q22, 16, 17, 18, 19, and Y | 740, 741 |
| Spindle cell/pleomorphic unclassified sarcoma | Gains: 1p36-p31, 1q21-q24, 2p, 4p16, 5p, 5q34, 6q, 7p15-p22, 7q21-qter, 17q, 9q, 14q, 16p13, 17q, 19p13, 19q13.11-q13.2, 20q, and 21q. Losses: 1q32.1, 2p25.3, 2q36-q37, 8p23, 9p, 10q21-q23, 11q22, 13q14-q21, 16q11, and 16q23. Amplifications: 1p33-p34, 12q13-q15, 17cen-p11.2, and 17p13-pter | 742-744 |
| Skeletal osteosarcoma | Gains and regional amplifications: 1q, 6p21-p12, 8q23-q24, and 17p13-p11.2 (TP53). Partial or complete loss: 6q. Rearrangements of chromosomes 20 | 745-748 |

Appendix F Known cancer predisposition genes Gene Genomic Cancer predisposition location SDHB 1p36.13 Gastrointestinal stromal tumour, paraganglioma, gastric stromal sarcoma, pheochromocytoma MUTYH 1p34.1 Colorectal adenomas, colorectal adenomatous polyposis, gastric cancer (somatic) UROD 1p34.1 Hepatocellular carcinoma MPL 1p34 Familial essential thrombocythemia GBA 1q22 Gaucher disease SDHC 1q23.3 Gastrointestinal stromal tumour, paraganglioma and gastric stromal sarcoma CDC73 1q31.2 Parathyroid carcinoma and adenoma FH 1q43 Leiomyomatosis and renal cell cancer ALK 2p23.2-p23.1 Familial neuroblastoma SOS1 2p22.1 Noonan syndrome 281

Gene Genomic Cancer predisposition location MSH2 2p21-p16 Colorectal cancer, hereditary nonpolyposis, type 1 MSH6 2p16.3 Colorectal cancer, hereditary nonpolyposis type 5, endometrial cancer (familial), mismatch repair cancer syndrome TMEM127 2q11.2 Pheochromocytoma ERCC3 2q14.3 Xeroderma pigmentosum, group B ABCB11 2q31.1 Hepatocellular carcinoma DIS3L2 2q37.1 Perlman syndrome VHL 3p25.3 Hemangioblastoma, pheochromocytoma, renal cell carcinoma, von Hippel-Lindau syndrome XPC 3p25.1 Xeroderma pigmentosum, group C BAP1 3p21.1 Tumour predisposition syndrome COL7A1 3p21.31 Dystrophic epidermolysis bullosa MLH1 3p22.2 Colorectal cancer, hereditary nonpolyposis type 2, mismatch repair cancer syndrome ATR 3q23 Familial cutaneous telangiectasia and cancer syndrome GATA2 3q21.3 Acute myeloid leukemia, myelodysplastic syndrome PHOX2B 4p13 Neuroblastoma KIT 4q12 Gastrointestinal stromal tumour, germ cell tumours, acute myeloid leukemia PDGFRA 4q12 Gastrointestinal stromal tumour SDHA 5p15.33 Paragangliomas TERT 5p15.33 Acute myeloid leukemia, melanoma 282

Gene Genomic Cancer predisposition location APC 5q22.2 Adenomatous polyposis coli, Brain tumour-polyposis syndrome 2, Colorectal cancer (somatic), Gardner syndrome, gastric cancer (somatic), hepatoblastoma (somatic) ITK 5q33.3 Lymphoproliferative syndrome 1 HFE 6p22.2 Hemochromatosis FANCE 6p21-p22 Acute myeloid leukaemia POLH 6p21.1 Xeroderma pigmentosum, variant type PMS2 7p22.1 Colorectal cancer, hereditary nonpolyposis type 4, mismatch repair cancer syndrome EGFR 7p11.2 Adenocarcinoma of lung, non-small cell lung cancer SBDS 7q11.21 Shwachman-Diamond syndrome SLC25A13 7q21.3 Hepatocellular carcinoma MET 7q31.2 Hepatocellular carcinoma, renal cell carcinoma, osteofibrous dysplasia PRSS1 7q34 Pancreatic cancer WRN 8p12 Werner syndrome NBN 8q21.3 Acute lymphoblastic leukemia, Nijmegen breakage syndrome EXT1 8q24.11 Chondrosarcoma RECQL4 8q24.3 Rothmund-Thomson syndrome DOCK8 9p24.3 Hyper-IgE recurrent infection syndrome, autosomal recessive MTAP 9p21.3 Malignant fibrous histiocytoma CDKN2A 9p21.3 Melanoma, neural system tumour syndrome, orolaryngeal cancer, pancreatic cancer/melanoma syndrome 283

RMRP | 9p13.3 | Metaphyseal dysplasia without hypotrichosis
FANCG | 9p13 | Acute myeloid leukaemia
XPA | 9q22.33 | Xeroderma pigmentosum, group A
FANCC | 9q22.32 | Fanconi anemia, complementation group C
PTCH1 | 9q22.32 | Basal cell carcinoma
TGFBR1 | 9q22.33 | Multiple self-healing squamous epithelioma
TSC1 | 9q34.13 | Lymphangioleiomyomatosis, tuberous sclerosis-1
RET | 10q11.21 | Medullary thyroid carcinoma, multiple endocrine neoplasia, pheochromocytoma
BMPR1A | 10q23.2 | Polyposis syndrome
PTEN | 10q23.31 | Cowden syndrome 1, Endometrial carcinoma, malignant melanoma, PTEN hamartoma tumour syndrome, squamous cell carcinoma, head and neck, glioma susceptibility, prostate cancer
TNFRSF6 | 10q23.31 | Autoimmune lymphoproliferative syndrome, squamous cell carcinoma, autoimmune lymphoproliferative syndrome
SUFU | 10q24.32 | Medulloblastoma
HRAS | 11p15.5 | Costello syndrome, bladder cancer, thyroid carcinoma
FANCF | 11p15 | Acute myeloid leukaemia
WT1 | 11p13 | Mesothelioma, Wilms tumor, type 1
DDB2 | 11p11.2 | Xeroderma pigmentosum, group E, DDB-negative subtype

EXT2 | 11p11.2 | Exostoses, multiple, type 2
SDHAF2 | 11q12.2 | Familial paraganglioma
MEN1 | 11q13.1 | Adrenal adenoma, angiofibroma, carcinoid tumour of lung, lipoma, multiple endocrine neoplasia 1, parathyroid adenoma
ATM | 11q22.3 | Lymphoma, T-cell prolymphocytic leukemia, breast cancer
CBL | 11q23.3 | Noonan syndrome-like disorder
SDHD | 11q23.1 | Intestinal carcinoid tumours, Cowden syndrome, merkel cell carcinoma, paraganglioma, gastric stromal sarcoma and pheochromocytoma
HMBS | 11q23.3 | Hepatocellular carcinoma
CDKN1B | 12p13.1 | Multiple endocrine neoplasia, type IV
CDK4 | 12q14.1 | Melanoma
PTPN11 | 12q24.13 | Juvenile myelomonocytic leukemia, Noonan syndrome 1
HNF1A | 12q24.2 | Familial hepatic adenoma
POLE | 12q24.33 | Colorectal cancer
BRCA2 | 13q13.1 | Fanconi anemia, complementation group D1, Wilms tumour, breast cancer (male), breast-ovarian cancer, glioblastoma, medulloblastoma, pancreatic cancer, prostate cancer
GJB2 | 13q12.11 | Vohwinkel syndrome
RB1 | 13q14.2 | Bladder cancer, osteosarcoma, retinoblastoma, small cell cancer of the lung
ERCC5 | 13q33.1 | Xeroderma pigmentosum, group G
MAX | 14q23.3 | Pheochromocytoma

SERPINA1 | 14q32.13 | Thyroid cancer
DICER1 | 14q32.13 | Pleuropulmonary blastoma, rhabdomyosarcoma
BUB1B | 15q15.1 | Colorectal cancer
FAH | 15q25.1 | Hepatocellular carcinoma
BLM | 15q26.1 | Bloom Syndrome
TSC2 | 16p13.3 | Lymphangioleiomyomatosis, tuberous sclerosis-2
ERCC4 | 16p13.12 | Fanconi anemia, complementation group Q, Xeroderma pigmentosum, group F, Cockayne syndrome
PALB2 | 16p12.2 | Fanconi anemia, complementation group N, breast cancer, pancreatic cancer
CYLD | 16q12.1 | Brooke-Spiegler syndrome, cylindromatosis, trichoepithelioma
CDH1 | 16q22.1 | Endometrial carcinoma, gastric cancer, ovarian carcinoma, breast cancer, prostate cancer
FANCA | 16q24.3 | Fanconi anemia, complementation group A
TP53 | 17p13.1 | Adrenal cortical carcinoma, breast cancer, choroid plexus papilloma, colorectal cancer, hepatocellular carcinoma, Li-Fraumeni syndrome, nasopharyngeal carcinoma, osteosarcoma, pancreatic cancer, basal cell carcinoma 7, glioma susceptibility
FLCN | 17p11.2 | Colorectal cancer, renal carcinoma
RAD51D | 17q12 | Familial breast-ovarian cancer
NF1 | 17q11.2 | Neurofibromatosis, type 1

BRCA1 | 17q21.31 | Familial breast-ovarian cancer, pancreatic cancer
STAT3 | 17q21.2 | Autoimmune disease, Hyper-IgE recurrent infection syndrome
SMARCE1 | 17q21.2 | Familial meningioma
TRIM37 | 17q22 | Breast cancer
BRIP1 | 17q23.2 | Breast cancer, Fanconi anemia, complementation group J
PRKAR1A | 17q24.2 | Adrenocortical tumour, Carney complex type 1, pigmented nodular adrenocortical disease, primary
AXIN2 | 17q24.1 | Colorectal cancer, oligodontia-colorectal cancer syndrome
RAD51C | 17q22 | Fanconi anemia, complementation group O, familial breast-ovarian cancer
RHBDF2 | 17q25.1 | Tylosis with esophageal cancer
SMAD4 | 18q21.2 | Juvenile polyposis/hereditary hemorrhagic telangiectasia syndrome, pancreatic cancer (somatic), juvenile intestinal polyposis
SETBP1 | 18q21.1 | Schinzel-Giedion syndrome
ELANE | 19p13.3 | Cyclic neutropenia, severe congenital neutropenia
STK11 | 19p13.3 | Melanoma, pancreatic cancer, Peutz-Jeghers syndrome, testicular cancer
SMARCA4 | 19p13.2 | Coffin-Siris syndrome 4, rhabdoid tumour predisposition syndrome 2
CEBPA | 19q13.11 | Acute myeloid leukaemia

ERCC2 | 19q13.32 | Cerebrooculofacioskeletal syndrome 2, trichothiodystrophy 1 photosensitive, xeroderma pigmentosum group D
POLD1 | 19q13.33 | Colorectal cancer
RUNX1 | 21q22.12 | Acute myeloid leukaemia, platelet disorder, familial, with associated myeloid malignancy
SMARCB1 | 22q11.23 | Coffin-Siris syndrome, Rhabdoid tumours, Schwannomatosis-1
LZTR1 | 22q11.21 | Schwannomatosis-2
CHEK2 | 22q12.1 | Li-Fraumeni syndrome, osteosarcoma, breast cancer, colorectal cancer, prostate cancer
NF2 | 22q12.2 | Neurofibromatosis type 2, schwannomatosis, meningioma
WAS | Xp11.23 | Neutropenia, thrombocytopenia, Wiskott-Aldrich syndrome
SH2D1A | Xq25 | Lymphoproliferative syndrome
GPC3 | Xq26.2 | Wilms tumour
DKC1 | Xq28 | Dyskeratosis congenita
SRY | Yp11.2 | Hepatocellular carcinoma

The information in this table was sourced from the Online Mendelian Inheritance in Man database and the Catalogue of Somatic Mutations in Cancer database [134, 159, 161].

Appendix G Candidate genes used for variant prioritisation based on a priori knowledge of cancer biology

Gene name | Chromosome | Start | End | No. variants
APC | 5 | 112018202 | 112206936 | 13
ARID1A | 1 | 26997522 | 27133601 | 4
ATM | 11 | 108068559 | 108264826 | 12
ATR | 3 | 142143077 | 142322668 | 31
AXIN1 | 16 | 312440 | 427676 | 10
AXIN2 | 17 | 63499683 | 63582740 | 6
BARD1 | 2 | 215568275 | 215699428 | 10
BLM | 15 | 91235579 | 91383686 | 9
BRCA1 | 17 | 41171312 | 41302500 | 13
BRCA2 | 13 | 32864617 | 32998809 | 20
BRIP1 | 17 | 59731547 | 59965920 | 9
BUB1B | 2 | 111370409 | 111460684 | 1
C17orf85 | 17 | 3685045 | 3774545 | 3
CD99 | Y | 2534228 | 2634350 | 0
CDH1 | 16 | 68746195 | 68894444 | 6
CDKN2A | 9 | 21942751 | 22000132 | 2
CHEK1 | 11 | 125471251 | 125552042 | 4
CHEK2 | 22 | 29058731 | 29162822 | 2
DDB2 | 11 | 47211493 | 47285769 | 4
DICER1 | 14 | 95527565 | 95633085 | 3

DKC1 | X | 153966031 | 154030964 | 0
DNA2 | 10 | 70148821 | 70256730 | 6
ELF3 | 1 | 201954690 | 202011315 | 3
ELF5 | 11 | 34475342 | 34560347 | 6
ERCC2 | 19 | 45829649 | 45898845 | 12
ERCC3 | 2 | 127989866 | 128076752 | 4
ERCC4 | 16 | 13989014 | 14071205 | 8
ERCC5 | 13 | 103479468 | 103549748 | 14
ERF | 19 | 42726717 | 42784309 | 1
ERG | 21 | 39726950 | 39895428 | 7
ETS1 | 11 | 128303656 | 128482453 | 4
ETS2 | 21 | 40152231 | 40221878 | 4
ETV1 | 7 | 13905856 | 14054642 | 7
ETV2 | 19 | 36107647 | 36160773 | 4
ETV4 | 17 | 41580211 | 41648305 | 4
ETV6 | 12 | 11777788 | 12073325 | 2
EWSR1 | 22 | 29638998 | 29721515 | 8
EXT1 | 8 | 118786602 | 119149058 | 3
EXT2 | 11 | 44092747 | 44291980 | 6
FAM175A | 4 | 84357094 | 84431290 | 3
FANCA | 16 | 89778959 | 89908065 | 45
FANCB | X | 14836529 | 14916184 | 0
FANCC | 9 | 97836336 | 98104991 | 3
FANCD2 | 3 | 10043113 | 10166344 | 12
FANCE | 6 | 35395138 | 35459881 | 7
FANCF | 11 | 22619079 | 22672387 | 2
FANCG | 9 | 35048835 | 35105013 | 3
FANCI | 15 | 89762194 | 89885362 | 13
FANCL | 2 | 58361378 | 58493515 | 4
FANCM | 14 | 45580136 | 45695093 | 9
FH | 1 | 241635857 | 241708085 | 3
FLI1 | 11 | 128538811 | 128708162 | 3
HNF4A | 20 | 42959441 | 43061115 | 7
IDH1 | 2 | 209075953 | 209144806 | 3
IDH2 | 15 | 90602212 | 90670708 | 4
KIF1B | 1 | 10245764 | 10466661 | 27
KIT | 4 | 55499095 | 55631881 | 4
LIG1 | 19 | 48593703 | 48698560 | 24
LIG4 | 13 | 108834792 | 108892882 | 2
MDM2 | 12 | 69176971 | 69264320 | 4
MEN1 | 11 | 64545986 | 64603188 | 8

MET | 7 | 116287459 | 116463440 | 7
MLH1 | 3 | 37009841 | 37117337 | 6
MLH3 | 14 | 75455467 | 75543235 | 8
MRE11A | 11 | 94125469 | 94252040 | 5
MSH2 | 2 | 47605206 | 47735367 | 11
MSH3 | 5 | 79925467 | 80197634 | 13
MSH6 | 2 | 47985221 | 48059092 | 11
MUTYH | 1 | 45769914 | 45831142 | 6
NBN | 8 | 90920564 | 91021899 | 10
NEIL2 | 8 | 11602172 | 11669854 | 5
NF1 | 17 | 29396945 | 29729695 | 8
NF2 | 22 | 29974545 | 30119589 | 2
PALB2 | 16 | 23589483 | 23677678 | 3
PMS1 | 2 | 190623811 | 190767355 | 3
PMS2 | 7 | 5987870 | 6073737 | 5
POLH | 6 | 43518878 | 43613260 | 4
PPARG | 3 | 12368001 | 12500855 | 3
PRKAR1A | 17 | 66482921 | 66554570 | 6
PTCH1 | 9 | 98180264 | 98295831 | 9
PTEN | 10 | 89598195 | 89753532 | 2
PTPN11 | 12 | 112831536 | 112972717 | 2
RAD50 | 5 | 131867616 | 132005313 | 6
RAD51C | 17 | 56744963 | 56836692 | 2
RAD51D | 17 | 33401811 | 33458500 | 3
RB1 | 13 | 48852883 | 49081026 | 9
RECQL4 | 8 | 145711667 | 145768210 | 10
RET | 10 | 43547517 | 43650797 | 12
RMI1 | 9 | 86570637 | 86643987 | 2
RMI2 | 16 | 11414311 | 11470617 | 8
RPA1 | 17 | 1708273 | 1827848 | 8
RPA3 | 7 | 7651575 | 7783238 | 1
RPS19 | 19 | 42338988 | 42400484 | 3
SDHA | 5 | 193356 | 281814 | 12
SDHB | 1 | 17320225 | 17405665 | 3
SDHC | 1 | 161259166 | 161359535 | 3
SDHD | 11 | 111932548 | 111991525 | 3
SMARCA4 | 19 | 11046598 | 11197958 | 11
SMARCB1 | 22 | 24104150 | 24201705 | 3
SPDEF | 6 | 34480579 | 34549110 | 1
SPI1 | 11 | 47351409 | 47425127 | 4
SQSTM1 | 5 | 179222842 | 179290077 | 3

STK11 | 19 | 1180798 | 1253434 | 7
TAF15 | 17 | 34111459 | 34199246 | 4
TGFBR2 | 3 | 30622994 | 30760633 | 2
TNFRSF11A | 18 | 59967520 | 60079943 | 8
TOP1 | 20 | 39632462 | 39778126 | 0
TOP3A | 17 | 18152235 | 18243321 | 8
TP53 | 17 | 7546720 | 7615868 | 5
TP53BP1 | 15 | 43674412 | 43810354 | 5
TSC1 | 9 | 135741735 | 135845020 | 2
TSC2 | 16 | 2072990 | 2163713 | 22
VHL | 3 | 10158319 | 10220354 | 1
WRN | 8 | 30865778 | 31056277 | 25
WT1 | 11 | 32384322 | 32482081 | 9
XPA | 9 | 100412191 | 100484691 | 2
XPC | 3 | 14161648 | 14245172 | 14
XRCC2 | 7 | 152318587 | 152398250 | 2

Start and End: the chromosome locations of the start and end of the gene (including ± 25 kb).
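The ± 25 kb flank described in the note on Start and End can be illustrated with a minimal Python sketch. It assumes 1-based inclusive coordinates; the names `pad_interval` and `genes_hit` and the queried positions are hypothetical and are not taken from the thesis. The three rows are copied from the Appendix G table, whose Start and End values already include the flank.

```python
# Minimal sketch: pad a gene interval by a flank and test whether a
# variant position falls inside a candidate region.
# Assumptions (not from the source): 1-based inclusive coordinates;
# pad_interval and genes_hit are hypothetical helper names.

FLANK = 25_000  # the 25 kb flank on each side described in the table note


def pad_interval(start, end, flank=FLANK):
    """Extend an interval by `flank` bases on each side, clamped at position 1."""
    return max(1, start - flank), end + flank


# Rows copied from Appendix G; Start and End here already include the flank.
CANDIDATE_REGIONS = {
    "TP53": ("17", 7546720, 7615868),
    "VHL": ("3", 10158319, 10220354),
    "FANCD2": ("3", 10043113, 10166344),
}


def genes_hit(chrom, pos, regions=CANDIDATE_REGIONS):
    """Return, sorted by name, the genes whose region contains the position."""
    return sorted(
        gene
        for gene, (c, start, end) in regions.items()
        if c == chrom and start <= pos <= end
    )


if __name__ == "__main__":
    # The padded VHL and FANCD2 regions overlap, so one position can match both.
    print(genes_hit("3", 10160000))
    print(genes_hit("17", 7570000))
```

Because the padded regions of neighbouring genes can overlap, as for VHL and FANCD2 in the table, a lookup of this kind has to be able to return more than one gene for a single variant.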

Appendix H Genes in which variants were also prioritised using the candidate gene prioritisation strategy

Gene | Chromosome | No. variants
ACCS | 11 | 8
ACP2 | 11 | 4
ACRV1 | 11 | 1
ACYP1 | 14 | 1
AIMP2 | 7 | 3
ALX4 | 11 | 2
ANKRD49 | 11 | 1
ARHGAP39 | 8 | 5
ARHGDIG | 16 | 2
ARHGEF1 | 19 | 2
ATP13A2 | 1 | 4
ATP1B2 | 17 | 3
ATP5D | 19 | 3
BRK1 | 3 | 3
C11orf57 | 11 | 2
C11orf97 | 11 | 3
C19orf26 | 19 | 5
C22orf15 | 22 | 2
C5orf45 | 5 | 7
C9orf9 | 9 | 1

CAMKK1 | 17 | 5
CAT | 11 | 2
CCDC127 | 5 | 2
CDC42BPG | 11 | 12
CFAP126 | 1 | 3
CHCHD10 | 22 | 2
COX6B1 | 19 | 2
CPM | 12 | 4
DCTN5 | 16 | 5
DERL3 | 22 | 4
DHFR | 5 | 3
DHX8 | 17 | 5
DLAT | 11 | 1
DMRTC2 | 19 | 1
EIF2AK1 | 7 | 1
EIF2B2 | 14 | 2
EPCAM | 2 | 2
EPM2AIP1 | 3 | 1
EVI2A | 17 | 2
FAM20A | 17 | 3
FAM20A, PRKAR1A | 17 | 1
FBXO11 | 2 | 1
FDFT1 | 8 | 4
FLII | 17 | 7
FNDC8 | 17 | 2
FRY | 13 | 1
GAS2L1 | 22 | 5
GATA4 | 8 | 3
GPT | 8 | 2
GSK3A | 19 | 3
GTPBP2 | 6 | 2
HAUS5 | 19 | 3
HEATR9 | 17 | 1
HELQ | 4 | 11
HNRNPK | 9 | 4
HSCB | 22 | 2
IL13 | 5 | 2
INTS2 | 17 | 2
IRAK2 | 3 | 4
ITFG3 | 16 | 2
ITGAE | 17 | 1

KLC3 | 19 | 8
LOC100507346 | 9 | 3
LOC401052 | 3 | 4
LPAR6 | 13 | 1
LRRC14 | 8 | 3
LRRC14B | 5 | 2
LRRFIP2 | 3 | 2
LYPD4 | 19 | 3
MAD2L1BP | 6 | 1
MAP3K2 | 2 | 1
MAP4K2 | 11 | 5
MFSD3 | 8 | 1
MIDN | 19 | 3
MIEF2 | 17 | 9
MIS18BP1 | 14 | 2
MMP11 | 22 | 4
MPZ | 1 | 1
MRPL28 | 16 | 11
MRPS18C | 4 | 1
MYBPC3 | 11 | 13
NDUFAB1 | 16 | 2
NPAT | 11 | 1
NR1H3 | 11 | 2
ORMDL1 | 2 | 3
OSGIN2 | 8 | 2
PACSIN1 | 6 | 1
PADI2 | 1 | 9
PDIA2 | 16 | 10
PGD | 1 | 3
PIGO | 9 | 10
PIGV | 1 | 2
PIH1D2 | 11 | 1
PKD1 | 16 | 17
PLA2G4C | 19 | 10
PLCG1-AS1 | 19 | 1
POLG | 15 | 11
PPP1R16A | 8 | 2
R3HDML | 20 | 6
RBM42 | 19 | 4
RCBTB2 | 13 | 1
RGS11 | 16 | 5

RHBDD3 | 22 | 2
RNPEP | 1 | 5
RNU6-28P | 15 | 5
RPL10A | 6 | 1
RUFY2 | 10 | 2
SHMT1 | 17 | 5
SLC2A11 | 22 | 1
SLC9A3R2 | 16 | 6
SMCR8 | 17 | 6
SMYD4 | 17 | 3
SRP19 | 5 | 2
STOML2 | 9 | 3
STT3A | 11 | 3
TANGO6 | 16 | 2
TEAD3 | 6 | 4
TESK2 | 1 | 3
TMEM43 | 3 | 10
TMEM8A | 16 | 13
TOE1 | 1 | 1
UPK1A | 19 | 3
VAT1 | 17 | 1
VCP | 9 | 5
VPS9D1 | 16 | 2
VRK2 | 2 | 1
WRAP53 | 17 | 5
XPO5 | 6 | 2
XRN1 | 3 | 1
ZAR1L | 13 | 4
ZC2HC1C | 14 | 2
ZNF276 | 16 | 13
ZNF526 | 19 | 1
ZNF710 | 15 | 6

Appendix I Patient 1-II-2: Copy number variation by chromosome

[Figure: per-chromosome copy number plots; y-axis: Log2 (T/R), x-axis: Index.] Black: normalised log ratios. Red: mean values among points in segment obtained by circular binary segmentation.
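The two quantities named in the figure legend, normalised log ratios and per-segment means, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: T and R are taken to be test and reference read depths per target, normalisation is done by median centring, and the segment boundaries are supplied by hand. It does not implement circular binary segmentation, and the function names and depth values are hypothetical.

```python
import math
from statistics import mean, median


def normalised_log_ratios(test_depth, ref_depth):
    """Return log2(T/R) per target, centred so that the median is 0.

    Assumes all depths are positive; median centring is an assumed
    normalisation, not necessarily the one used for the plots.
    """
    raw = [math.log2(t / r) for t, r in zip(test_depth, ref_depth)]
    centre = median(raw)
    return [x - centre for x in raw]


def segment_means(log_ratios, breakpoints):
    """Return the mean log ratio of each segment.

    `breakpoints` lists the indices at which a new segment starts. Here
    they are given directly; a segmentation algorithm would estimate them.
    """
    edges = [0, *breakpoints, len(log_ratios)]
    return [mean(log_ratios[a:b]) for a, b in zip(edges, edges[1:])]


if __name__ == "__main__":
    # Hypothetical depths: the last two targets have doubled coverage.
    test = [100, 100, 100, 100, 200, 200]
    ref = [100, 100, 100, 100, 100, 100]
    ratios = normalised_log_ratios(test, ref)
    print(segment_means(ratios, [4]))
```

In the plots, the black points correspond to the per-target values and the red lines to the per-segment means.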

Appendix J Patient 2-II-1: Copy number variation by chromosome

[Figure: per-chromosome copy number plots; y-axis: Log2 (T/R), x-axis: Index.] Black: normalised log ratios. Red: mean values among points in segment obtained by circular binary segmentation.