Studying recurrent somatic variations in high-grade glioma patients


Matthew Osmond
Department of Human Genetics
McGill University, Montreal
June 2016

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Human Genetics

© Matthew Osmond 2016

Acknowledgements

First and foremost, I would like to thank my supervisor Dr. Jacek Majewski for giving me the opportunity to pursue this thesis project. His mentorship and feedback throughout my time as a Master's student were integral to this project's completion. I would also especially like to thank Dr. Hamid Nikbakht, both for constantly answering questions and giving advice on everything related to glioblastomas, and for his encouragement along the way. My thanks also to my thesis advisory committee members Dr. Rima Slim and Dr. Guillaume Bourque for their valuable feedback.

My gratitude goes out to everyone in the Majewski lab for both their assistance and support throughout this project. In particular, Dr. Eric Bareke provided many hours of support in running our in-house pipelines and providing feedback on software design. I also extend thanks to Simon Papillon for his advice on variant filtration and histone epigenetics, and for reviewing this thesis.

I also wish to extend my thanks to Dr. Nada Jabado and everyone in her lab for their help with our in-house GBM data. Dr. Tenzin Gayden, Dr. Leonie Mikael, and Dr. Nikoleta Juretic in particular were instrumental in this process.

Finally, I would like to thank my family and friends for their tireless support during my studies, and both Dr. Alex Mackenzie and Dr. Kym Boycott for helping me discover my passion for genetics and research.

This research was financially supported by the George G. Harris Fellowship in Cancer Research, and by Dr. Majewski's CIHR grant.

Abstract

The genetic basis of high-grade gliomas (HGGs) remains a topic of great interest due to their prevalence and resistance to common cancer therapies. Recent advances in DNA sequencing technology have helped uncover a number of intriguing somatic mutations in glioblastoma (GBM), an astrocytic subtype of HGGs. It is unclear, however, how these mutations interact with one another and how they are related to tumour phenotype. In this study, we collected a large number of whole-genome and whole-exome sequences from pediatric and adult GBM tumours to examine the associations between somatic variation and tumour phenotype. Due to the size of this database, we developed the SQLProfiler software suite to quickly analyze patterns between common somatic variations and tumour characteristics.

Database queries performed by SQLProfiler revealed a number of striking mutation patterns. Recurrent somatic mutations in a number of histone genes and chromatin remodelling genes were found throughout the GBM dataset. These mutations were not only mutually exclusive, but were also observed to associate significantly with location and age-based tumour phenotypes. Furthermore, many histone and chromatin remodelling variants were found to co-occur with different sets of accompanying mutations, with these correlations also associating with tumour location and diagnosis age.

These location and age-based mutation profiles provide further evidence that GBM is a complex, heterogeneous form of cancer. These unique molecular subtypes will serve as a valuable reference in the future diagnosis of GBMs, as well as in the assessment of tumour prognosis. Additional studies on the epigenetic effects of histone and chromatin remodeller gene variants could provide further insight into the mechanisms behind GBM tumourigenesis, and help in the development of novel targeted molecular therapies.

Extrait

The prevalence of high-grade gliomas (HGGs) and their resistance to anticancer therapies remain a subject of primary interest and call for determining the genetic basis underlying the disease. Recent advances in DNA sequencing technologies have revealed several unsuspected somatic mutations in one subtype of HGG: glioblastoma (GBM). These mutations could interact with one another while also being correlated with tumour phenotype. In our study, we collected a number of whole-genome and whole-exome sequences from pediatric and adult GBMs in order to examine the associations between somatic variations and tumour phenotype. Given the large size of this body of data, we developed a program called SQLProfiler that allows us to rapidly analyze the patterns of interaction between common somatic variations and tumour characteristics. Queries performed with SQLProfiler on the database revealed several surprising mutation patterns. In particular, somatic mutations present across the GBM set were recurrently found within genes encoding histone proteins or proteins involved in chromatin remodelling. These mutations prove to be mutually exclusive, and are significantly associated with tumour location as well as with phenotype and patient age. In addition, many variants in histone or chromatin remodelling genes co-occurred with other types of mutations, and this co-occurrence was also correlated with tumour location and patient age at diagnosis.

These mutation profiles based on age and tumour location support the view that glioblastoma is a complex and heterogeneous type of cancer. Identifying the different molecular profiles underlying tumour phenotypes could serve as a reference for improving the diagnosis of GBMs and determining patient prognosis. In the longer term, it would be of interest to study the epigenetic effect of variants in histone genes or chromatin remodelling genes in order to better understand the mechanisms supporting GBM tumourigenicity and to enable the development of targeted molecular therapies.

Table of Contents

Acknowledgements
Abstract
Extrait
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Part 1: Background
  Background on Whole Exome Sequencing (WES)
  Applications of WES
  Computational Analysis of Genomic Data
  High Grade Gliomas (HGGs)
  Types of HGGs
  GBMs: High grade astrocytomas/glioblastomas
  DIPGs: Diffuse Intrinsic Pontine Gliomas
  Driver and Accompanying Mutations in GBMs
  Tumour Suppressor (p53/RB) Pathways
  Cell Proliferation (RTK/PI3K) Pathway
  Histone Pathway
  Isocitrate Dehydrogenase (IDH) Genes
  Histone Encoding Genes
  Histone Regulating and Chromatin Remodelling Genes
  ACVR1: Potential Novel Pathway in DIPGs
Part 2: Objectives
Part 3: Methodology
  Overview of Patient Cohort
  Jabado Dataset
  Baker Dataset
  Jones Dataset
  Hawkins Dataset
  In-House DNA Sequence Processing
  DNA Read Mapping
  SNV/INDEL Calling
  Variant Annotation
  Variant Correlation Database and Software (SQLProfiler)
  SQLProfiler Design Goals
  3.3.2 Query and Filter Selection
  Database Structure
  Query Generation and Execution
  Statistical Analysis
Part 4: Results
  Most Commonly Mutated Genes/Variants
  Adult Samples
  Pediatric Samples
  Location and Age-based Mutation Profiles
  Gene Correlations
  Adult Samples
  Pediatric Samples
  Location and Age-based Gene Correlations
Part 5: Discussion
  Histone and chromatin remodeling heterogeneity in GBMs
  GBMs lacking histone or chromatin remodeling variants remain unclear
  Correlations between histone and typical cancer pathways
  Future applications of SQLProfiler
  Future directions
Supplementary Data
References

List of Figures

Figure 1-1: Methods of next-generation sequencing
Figure 1-2: Basic overview of whole-exome sequencing
Figure 1-3: Relationship between variant frequency and effect size
Figure 1-4: Somatic p53/RB pathway mutations in GBMs
Figure 1-5: Somatic RTK/PI3K pathway mutations in GBMs
Figure 1-6: Somatic histone and chromatin remodeler mutations in GBMs
Figure 1-7: Overview of nucleosome structure and histone H3 tail modifications
Figure 3-1: Catalogue for samples within GBM dataset
Figure 3-2: Overview of DNA sequencing analysis workflow
Figure 3-3: Overview of SQLProfiler analysis
Figure 3-4: SQLProfiler query selection interface
Figure 3-5: Relational database structure
Figure 3-6: MySQL construction of gene queries
Figure 3-7: Correlation graph explanation
Figure 4-1: Gene correlation plot for adult GBM samples
Figure 4-2: Gene correlation plot for pediatric GBM samples
Figure 4-3: Gene correlation plot for pediatric midline GBM samples
Figure 4-4: Gene correlation plot for pediatric hemispheric GBM samples
Figure 4-5: Gene correlation plot for early (0-10 years) midline GBM samples
Figure 4-6: Gene correlation plot for early (0-10 years) hemispheric GBM samples
Figure 4-7: Gene correlation plot for late (11-20 years) midline GBM samples
Figure 4-8: Gene correlation plot for late (11-20 years) hemispheric GBM samples

List of Tables

Table 1: List of variants/genes selected for correlation queries
Supplementary Table 1: Full sample information and mutation catalogue
Supplementary Table 2: KDM and KMT mutation frequency in paired pediatric GBMs
Supplementary Table 3: KDM and KMT mutation frequency in all pediatric GBMs
Supplementary Table 4: Raw correlation data for adult GBMs
Supplementary Table 5: Raw correlation data for pediatric GBMs
Supplementary Table 6: Raw correlation data for pediatric midline GBMs
Supplementary Table 7: Raw correlation data for pediatric hemispheric GBMs
Supplementary Table 8: Raw correlation data for early (0-10 years) midline GBMs
Supplementary Table 9: Raw correlation data for early (0-10 years) hemispheric GBMs
Supplementary Table 10: Raw correlation data for late (11-20 years) midline GBMs
Supplementary Table 11: Raw correlation data for late (11-20 years) hemispheric GBMs

List of Abbreviations

α-KG      Alpha-Ketoglutarate
ACVR1     Activin A Receptor, Type I
ALT       Alternative Lengthening of Telomeres
ATRX      Alpha Thalassemia/Mental Retardation Syndrome X-Linked
BAM       Binary Alignment Map
BCOR      BCL6 Corepressor
BCORL1    BCL6 Corepressor-Like 1
BMP       Bone Morphogenic Protein
BRAF      v-raf Murine Sarcoma Viral Oncogene Homolog B1
BWA       Burrows-Wheeler Alignment
C4R       Care 4 Rare
CDK       Cyclin-Dependent Kinase
CDKN      Cyclin-Dependent Kinase Inhibitor
CHEK2     Checkpoint Kinase 2
DAC       Data Access Committee
DIPG      Diffuse Intrinsic Pontine Glioma
DNA       Deoxyribonucleic Acid
DNMT3A    DNA (Cytosine-5) Methyltransferase 3 Alpha
EGA       European Genome-Phenome Archive
EGFR      Epidermal Growth Factor Receptor
EVS       Exome Variant Server
FGFR      Fibroblast Growth Factor Receptor
FOP       Fibrodysplasia Ossificans Progressiva
FORGE     Finding of Rare Disease Genes in Canada
GATK      Genome Analysis Toolkit
GBM       Glioblastoma
GC-GBM    Giant Cell Glioblastoma
GERP      Genomic Evolutionary Rate Profiling
GUI       Graphical User Interface
GWA       Genome-Wide Association
H3F3A     H3 Histone Family 3A
H3K27me3  Histone 3 Lysine 27 Trimethylation
H3K36me3  Histone 3 Lysine 36 Trimethylation
HIST1H3B  H3 Histone Family 3B
HGG       High-Grade Glioma
IDH       Isocitrate Dehydrogenase
INDEL     Insertion/Deletion
JDBC      Java Database Connectivity
KDM       Lysine-Dependent Demethylase
KMT       Lysine-Dependent Methyltransferase
MAX       MYC Associated Factor X
M:N       Many-to-many
MUHC      McGill University Health Center
MySQL     My Structured Query Language
NF        Neurofibromatosis
NGS       Next Generation Sequencing
p53/TP53  Tumour Protein 53
PDGFRA    Platelet Derived Growth Factor Receptor Alpha
PI3K      Phosphoinositide 3-Kinase
PIK3CA    PI3K Catalytic Subunit
PIK3R1    PI3K Regulatory Subunit
PNET      Primitive Neuronal
PPM1D     Protein Phosphatase, Mg2+/Mn2+ Dependent 1D
PTEN      Phosphatase and Tensin Homolog
RB/pRB    Retinoblastoma Protein
PCR       Polymerase Chain Reaction
PRC2      Polycomb Repressive Complex 2
RTK       Receptor Tyrosine Kinase
SCA       Small Cell Astrocytoma
SETD2     SET Domain Containing 2
SIFT      Sorting Tolerant From Intolerant
SNV       Single Nucleotide Variant
TCGA      The Cancer Genome Atlas
UCSC      University of California Santa Cruz
VCF       Variant Call Format
VEP       Variant Effect Predictor
WES       Whole Exome Sequencing
WGS       Whole Genome Sequencing
WHO       World Health Organization

Part 1: Background

1.1 Background on Whole Exome Sequencing (WES)

The first sequencing of the human genome, completed through the Human Genome Project in 2003, was by all accounts a massive undertaking. The project was a multi-national initiative, cost millions of dollars each year, and took 13 years to finish [1]. While this project revealed the enormous potential of genomic information in research, it was clear that the DNA sequencing methods of the time were too slow and too costly to be practical for more focussed projects. In the last decade, the development of high-throughput methods of DNA sequencing, broadly referred to as next generation sequencing (NGS) techniques, has been essential in reducing the cost and time required to sequence genomic data. Many of the NGS techniques developed by biotech companies such as 454, Illumina, and Applied Biosystems focussed on high amounts of parallelization during sequencing and used advanced computational methods to efficiently assemble DNA fragments into one contiguous sequence (Figure 1-1). Even with these advances in DNA sequencing technology, the cost of sequencing an entire human genome remained a limiting factor for many research initiatives [2,3].

Figure 1-1: Methods of next-generation sequencing. This illustration depicts some of the methods developed to both amplify DNA fragments and parallelize the sequencing process. Reproduced with permission from Metzker, M (2010) [4].

In 2009, Ng et al. published a new DNA sequencing technique with the goal of making genome-wide analysis of large cohorts practical. This technique, known as whole exome sequencing (WES), isolates only the protein-coding regions of each gene in the genome (known as exons, and collectively as the exome) for sequencing (Figure 1-2). These protein-coding regions comprise only approximately 1% of the human genome, making this technique significantly more cost-efficient than sequencing the entire genome (known as whole genome sequencing, WGS) [2]. Since the exome is thought to contain approximately 85% of all known disease-causing variations, WES is widely used for studying the genetic basis of disease [5].

Figure 1-2: Basic overview of whole-exome sequencing. The critical step in this process involves extracting exonic reads (dark blue fragments) from the remaining genomic DNA. This is typically accomplished by hybridizing these reads to a set of predefined baits which code for exonic regions (orange fragments), and removing the unhybridized fragments in a wash step. Reproduced with permission from Bamshad et al (2011) [6].

1.1.1 Applications of WES

Currently, one of the most common applications of WES in genetics studies is the identification of disease-causing genes or variants in monogenic disorders. While many monogenic disorders are relatively rare, collectively they can have significant impacts on public health [7]. Traditional methods of gene mapping for these studies, such as linkage analysis and karyotyping, are often insightful but can be time-consuming or limited by the design of the study. For example, linkage analysis typically requires a large patient cohort in order to highlight loci of interest, which can be an impractical requirement when studying rare monogenic disorders [8]. Furthermore, the loci of interest identified by such tests tend to be large, potentially encompassing hundreds of genes. Detecting specific causal mutations within these loci requires more detailed sequencing, further lengthening the process [8].

WES provides information on specific variants directly, and as a result does not require large cohorts to achieve sufficient statistical power. The technique was first used to identify the previously unknown cause of a Mendelian disorder in 2009, when Ng et al. identified novel mutations in the candidate gene MYH3 in patients affected by Freeman-Sheldon Syndrome, a rare disorder characterized by multiple malformations [2]. In the years since, WES has been used to identify around 800 novel disease-causing mutations, and has been an integral technology in large-scale rare disease initiatives such as Finding of Rare Disease Genes in Canada (FORGE) and Care 4 Rare (C4R) [5,9].

In addition to the discovery of variants related to monogenic disorders, WES has also been used to investigate the underlying genetic causes of complex diseases and traits.
While other techniques such as genome-wide association (GWA) studies have previously been used to identify associations with common variants, such variants typically have a small effect size and often do not account for most of the trait heritability [10]. In order to explain this unaccounted heritability, the focus of complex disease studies shifted to the discovery of rare variants with potentially larger effect sizes (Figure 1-3). Genome-wide association studies were not well suited to the discovery of rare variants because they rely on large sample sizes to achieve sufficient statistical power. Due to its ability to identify rare variants even in small patient cohorts, WES has been instrumental in uncovering the effects of rare variants on more common diseases. For example, WES has been used in multiple studies to describe rare variants associated with common neurological disorders such as Alzheimer's disease, Parkinson's disease, and multiple sclerosis [11].

Figure 1-3: Relationship between variant frequency and effect size. The most interesting variants from a research standpoint lie within the dotted diagonal lines. While GWA studies have been effective at identifying common variants, they lack the statistical power to identify high-effect rare alleles (top left) typically seen in Mendelian complex disorders. Reproduced with permission from Manolio et al (2009) [12].

Finally, exome sequencing has been important in continuing to explore the genetic basis of cancer. The link between cancer, a wide subset of human disease resulting in abnormal cell growth, and genetic aberrations has been known since the 1960s, when the now-famous Philadelphia chromosome was first discovered in patients with myeloid leukemia [13]. Recently, however, the emergence of WES has resulted in significant progress in the field of cancer genetics. WES has not only aided in the discovery of novel cancer-susceptibility genes, but has also allowed us to uncover driver mutations seen exclusively in the tumour tissue (known as somatic mutations) of sporadic cancers [7]. This is particularly useful in cancers in which no molecular pathways have previously been identified, and has even led to the identification of novel oncogenic pathways. For example, mutations in the DNA methyltransferase gene DNMT3A identified by WES have indicated that DNA methylation is a potential oncogenic pathway in acute monocytic leukemia [14].

Another form of cancer whose study has benefitted greatly from the emergence of exome sequencing is high grade glioma (HGG). In the past five years, WES has helped identify numerous driver and accompanying somatic mutations in HGGs, which will be described in detail later in this section. The groundwork laid by WES for uncovering the genetic basis of HGGs is critical for this project, which aims to use exome sequencing to create a high-resolution catalogue of somatic variants in HGG patients.

1.2 Computational Analysis of Genomic Data

In addition to WES, another development that has greatly aided genomic cancer research is the creation of both advanced computing platforms and bioinformatics software suites. Following the invention of faster and cheaper NGS techniques, the amount of genomic data which required storage and analysis increased exponentially [15]. In order to meet these storage and analysis demands, massive computational systems known as computing clusters were developed. These systems consisted of multiple single-processor computing units running in parallel, allowing differential allocation of resources depending on the magnitude of a given task [16]. Furthermore, many open-source bioinformatics programs were developed to process DNA sequencing data within these parallelized systems. Such software packages generally focussed on specific steps of DNA processing, including:

1. DNA sequence assembly/alignment: These programs produce contiguous genomes from millions of sequenced DNA fragments. Some programs align the fragments against a reference genome to produce a full sequence (Bowtie, BWA, GSNAP), while others are capable of de novo assembly, which does not require a reference (DISCOVAR).
2. Variant calling: These packages highlight differences (known as variants) between the assembled genome and the reference genome (GATK, SAMtools). Due to the complex nature of DNA variations, each program uses different algorithms to identify variants.
3. Variant annotation: These programs provide additional information on variants identified by variant callers. This information can range from basic gene identification and amino acid changes (ANNOVAR, SNPEff) to predictions of evolutionary conservation and pathogenicity (GERP, PolyPhen2).
4. Additional tasks: Additional pieces of software were developed to achieve a number of auxiliary tasks, including sequence visualization (IGV), workflow creation (Galaxy), and variant database creation (GEMINI).
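To make the variant calling and annotation steps above concrete, the following is a deliberately simplified Python sketch. It assumes the sample sequence has already been aligned base-for-base to the reference, considers only single nucleotide substitutions, and annotates by a plain coordinate lookup. The names `Variant`, `call_snvs`, `annotate`, and the gene intervals are hypothetical illustrations; they are not the algorithms or interfaces of GATK, SAMtools, ANNOVAR, or any other tool named in the list.

```python
from typing import Dict, List, NamedTuple, Tuple


class Variant(NamedTuple):
    position: int  # 1-based position on the reference sequence
    ref: str       # reference base
    alt: str       # base observed in the sample


def call_snvs(reference: str, aligned_sample: str) -> List[Variant]:
    """Report single-base differences between a reference and an aligned sample.

    Toy assumption: both sequences have equal length and no gaps, so
    insertions and deletions are out of scope. 'N' (unknown base) is skipped.
    """
    if len(reference) != len(aligned_sample):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    pairs = zip(reference.upper(), aligned_sample.upper())
    for position, (ref_base, sample_base) in enumerate(pairs, start=1):
        if sample_base != "N" and sample_base != ref_base:
            variants.append(Variant(position, ref_base, sample_base))
    return variants


def annotate(variant: Variant, gene_intervals: Dict[str, Tuple[int, int]]) -> str:
    """Return the first gene whose inclusive 1-based interval contains the variant."""
    for gene, (start, end) in gene_intervals.items():
        if start <= variant.position <= end:
            return gene
    return "intergenic"


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    genes = {"GENE_A": (1, 5), "GENE_B": (6, 10)}
    for v in call_snvs("ACGTACGT", "ACGAACGT"):
        print(v, annotate(v, genes))
```

Real variant callers additionally weigh read depth, base quality, and mapping quality before reporting a variant, which is one reason different programs can disagree on the same data.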

Many of these tools were utilized in the processing of this project's data, and are explained in greater detail in subsequent sections.

1.3 High Grade Gliomas (HGGs)

According to the Central Brain Tumor Registry of the United States, gliomas make up 80% of malignant brain tumours, and approximately 27% of all brain tumours [17]. The term glioma is used to describe a subset of brain tumours in which glial cells are the presumed cells of origin. These gliomas can arise from a specific subtype of glial cell such as astrocytes or oligodendrocytes, or from a mixture of different glial cell types. Besides cell type, gliomas are also categorized according to a histological grading scale conceived by the World Health Organization (WHO) [18]. These grades range from grade I (best prognosis) to grade IV (worst prognosis), and are assigned according to clinical features such as the degrees of proliferation and mitotic activity. For this project, I have chosen to examine a subset of gliomas known as high grade gliomas.

1.3.1 Types of HGGs

High grade gliomas (HGGs) are gliomas classified as grade III or IV tumours by the WHO grading scale [19]. These gliomas are characterized by a high degree of proliferation throughout the brain, extensive necrosis, and high amounts of mitotic activity [18]. HGGs have a particularly poor prognosis, with most patients succumbing within a year of diagnosis. Specifically, this project focusses on two types of HGGs for which genetic factors have been identified: glioblastomas and diffuse intrinsic pontine gliomas (DIPGs).

1.3.2 GBMs: High grade astrocytomas/glioblastomas

Glioblastomas (GBM; WHO grade IV) have the highest incidence rates amongst malignant brain tumours, and generally occur more frequently with age [17]. They are also the most common type of astrocytoma, a category of glioma in which the presumed cells of origin are astrocytic progenitor cells. Histologically, GBMs exhibit numerous features that are characteristic of HGGs [20]. They are highly mobile, and are able to easily invade surrounding brain tissue through protease-mediated degradation of the extracellular environment. GBMs also tend to have central hypoxic regions that arise from excessive metabolic demands, which in turn result in necrosis of the hypoxic tissue. Finally, extensive angiogenesis is typically seen in GBMs, including the formation of microvascular bundles similar in structure to glomeruli.

GBMs can either arise without any prior history, known as primary GBMs, or develop from a pre-existing lower grade glioma, known as secondary GBMs [21]. Primary GBMs are more common than secondary GBMs, and typically occur in older patients [22]. Histologically, primary and secondary GBMs are largely identical; however, differences in genetic and epigenetic profiles have been observed. Furthermore, variants of the conventional GBM morphology have been characterised over the past decade [23]. These include giant cell GBMs (GC-GBMs), small cell astrocytomas (SCAs), gliosarcomas, GBMs with oligodendroglial features (AO), and GBMs with primitive neuronal (PNET) features. In addition to being morphologically distinct, these variants appear to have unique molecular genetic profiles [23].

As indicated by their histological features, GBMs are highly aggressive tumours and frequently invade surrounding normal brain tissue, making surgical resection very difficult [21]. Combined with a high resistance to standard chemotherapeutic drugs, this means there is currently no effective treatment for GBM tumours [24]. As a result, the survival estimates for patients with GBMs are dismal, with most not surviving beyond one year.

1.3.3 DIPGs: Diffuse Intrinsic Pontine Gliomas

Diffuse intrinsic pontine gliomas (DIPG; WHO grade IV) are a subtype of glioblastoma. In contrast to most GBMs, DIPGs are predominantly found in pediatric patients and are also the most common type of brainstem tumour found in children [25]. As the name suggests, DIPGs are primarily characterized by a diffuse pattern of infiltration throughout the brainstem, specifically the pons. In addition to swelling of the brainstem and increased intracranial pressure, they also display numerous features typical of high grade astrocytomas, including high mitotic activity, tumour necrosis, and angiogenesis [26].

Due to their diffuse proliferation throughout the brainstem, DIPGs typically cannot be surgically resected. Other treatments such as radiation therapy and chemotherapy have been attempted; however, there is currently no effective treatment for DIPGs [26]. This lack of effective treatments, along with the presence of high grade clinical features, results in a poor prognosis that is almost always fatal. Recent discoveries regarding the genetic etiology of DIPGs, however, have indicated that targeted therapies are an alternative treatment that should be explored for patients with DIPG.

Driver and Accompanying Mutations in GBMs

In order to gain a better understanding of the underlying biology of GBMs, initial studies focussed on examining DNA copy number and gene expression profiles in both pediatric and adult tumours [27,28]. Differences in these copy number and expression profiles prompted researchers to search for the specific genetic mutations responsible for them. To date, multiple somatic driver mutations have been described in GBM tumours, as well as an array of accompanying mutations spanning multiple pathways.

The comprehensive analysis of adult GBM tumours by The Cancer Genome Atlas Research Network (TCGA) in 2008 was one of the pioneer studies in identifying somatic variants in GBMs [29]. The findings by TCGA not only helped validate twenty years of molecular studies on the genetic cause of GBMs, but also paved the way for future analyses of somatic mutations. The major pathways in which somatic variations were found and confirmed by later studies are outlined below.

Tumour Suppressor (p53/RB) Pathways

Tumour suppressor genes have been extensively studied with regard to their role in tumourigenesis. As their name implies, tumour suppressor genes encode products involved in pathways necessary to prevent tumour growth. Many such pathways exist; however, tumour suppressor genes can be loosely categorized into those related to inhibiting cell growth ("gatekeepers"), those responsible for DNA stability and damage repair ("caretakers"), and those which control abnormal growth through alterations to the environment ("landscapers") [30].

Somatic mutations in one such tumour suppressor pathway, known as the p53 pathway, were one of the key findings in the TCGA study (Figure 1-4). The p53 pathway is largely known for its role in responding to cell stress and DNA damage, typically through halting the cell cycle or inducing apoptosis [31]. Targeted sequencing of the tumour protein 53 (TP53) gene by TCGA revealed numerous mutations in about 40% of the GBM tumours. Later studies confirmed this link between TP53 and GBM, with a mutation frequency of about 55% in adult tumours [29]. These variants were all located in the DNA binding domain of the protein product p53, a domain which has been strongly linked to other human cancers in the past [34]. Disruption of this DNA binding domain is thought to inhibit the surveillance of expressed oncoproteins and apoptotic responses in the cell, allowing oncogenes to promote cell division uncontrollably [35].

Another tumour suppressor pathway linked to GBMs in the TCGA study was the RB pathway (Figure 1-4). The tumour suppressor retinoblastoma (pRB) protein is central to this pathway, as it has essential roles in cell cycle arrest (before S phase entry) through the inhibition of downstream transcription factors [31]. Loss-of-function somatic mutations in RB1, the gene encoding pRB, were initially reported in the TCGA study. Furthermore, variants were also identified in genes encoding cyclin-dependent kinases (CDK4 and CDK6) upstream of RB1, as well as in genes encoding inhibitors of these kinases (CDKN2A and CDKN2B), providing additional evidence that RB1 expression is linked to tumourigenesis in GBM tumours [29]. Strikingly, subsequent studies observed mutual exclusivity between variants in the p53 and RB pathways, suggesting that the disruption of one tumour suppressor pathway is sufficient for GBM tumours to arise.

26 Figure 1-4: Somatic p53/rb pathway mutations in GBMs. Typically, these tumour suppressor pathways provide a response to DNA damage either through apoptosis or cell cycle arrest. Previous studies have identified somatic mutations (colour coded by functional consequence) which either result in a loss-of-function in tumour suppressor gene products or a gain-offunction in upstream inhibitors, resulting in cell survival and proliferation even after DNA damage has occurred. Note that many intermediate genes in the p53/rb pathway have been excluded in this figure due to the lack of somatic mutation evidence in GBMs Cell Proliferation (RKT/PI3K) Pathway The PI3K pathway is centered around a group of lipid kinases known as phosphoinositide 3-kinases (PI3Ks). Following activation by various growth factors (referred to as receptor tyrosine kinases or RTKs), these lipid kinases facilitate the phosphorylation of phosphoinositides (lipids containing an inositol functional group). One such phosphoionositide known as PIP2 is phosphorylated to PIP3, resulting in the recruitment of multiple signalling proteins related to cell survival and proliferation 37. Somatic mutations in genes encoding for components of the PI3K complex have been well characterized in GBM tumours, typically resulting in upregulation of the PI3K pathway (Figure 13

1-5). In the TCGA study, mutations in the gene encoding the catalytic subunit p110α (PIK3CA) and the gene encoding the regulatory subunit p85α (PIK3R1) were prevalent amongst GBMs 29. Interestingly, recurrent PIK3CA mutations were typically located in just two exons; however, each mutated exon had a separate mechanism by which PI3K activity was amplified. Mutations in exon 9 (typically E545K and E542K) interfere with the binding of the regulatory subunit p85α, while mutations in exon 20 (typically H1047R) result in a gain-of-function through interaction with upstream signalling molecules 38. Furthermore, activating variants have been observed in multiple genes encoding RTKs. Somatic alterations in EGFR, PDGFRA, MET, and FGFR1/2/3/4 were detected in both the TCGA dataset and other studies 29,36,39.

Finally, somatic mutations in genes encoding other regulatory proteins in the PI3K pathway have been described. For example, frequent inactivating mutations in the phosphatase and tensin homolog gene (PTEN) leave the PTEN protein unable to dephosphorylate PIP3 back to its original PIP2 state, leading to constant activation of downstream cell proliferation signalling molecules 29,40. Inactivating variants have also been identified in the gene encoding the neurofibromatosis-related protein (NF1), a tumour suppressor responsible for the downregulation of the PI3K pathway in the intermediate steps between RTKs and the PI3K complex 36,39. Finally, v-raf murine sarcoma viral oncogene homolog B1 (BRAF), a gene previously linked to numerous types of cancer, has been observed to be mutated in some GBM tumours at a specific amino acid position (V600E). This specific variant results in a constitutively active gene product which increases the activity of a parallel cell proliferation pathway (known as the RAF-MEK-ERK pathway) controlled by the same set of RTKs.

Figure 1-5: Somatic RTK/PI3K pathway mutations in GBMs. The PI3K complex is central to this pathway, as its activity results in the recruitment of signalling proteins which promote cell survival and proliferation. Excessive PI3K activity, through either a gain-of-function in the catalytic subunit (PIK3CA) or a loss-of-function in the regulatory subunit (PIK3R1), is the most common change noted in GBMs. Activating mutations have also been observed in upstream growth factor receptors (RTKs) and parallel proliferation pathways (BRAF), as well as inhibiting mutations in PI3K antagonists (NF1 and PTEN). Note that many intermediate genes in the RTK/PI3K pathway have been excluded in this figure due to the lack of somatic mutation evidence in GBMs.

Histone Pathway

While the TCGA study was instrumental in confirming the presence of somatic variations in pathways previously linked to cancer, the gene-specific sequencing approach limited the discovery of novel pathways involved in GBM tumourigenesis. Recently, exome sequencing studies have uncovered a wide range of mutations in both the histone genes themselves and genes encoding histone-interacting proteins, in both pediatric and adult GBM tumours (Figure 1-6).

In eukaryotic organisms, the DNA within cells is organized into basic structural units known as nucleosomes. Nucleosomes are complexes which consist of 146 bp of DNA wrapped around an octamer of proteins called histones (Figure 1-7). While basic in terms of structure, these histones are highly conserved between species, and have essential roles in the regulation of gene expression 41. Furthermore, three of these histone proteins (H2A, H2B, and H3) are further divided into a number of histone variants (for example, histone 3 has variants known as H3.1, H3.2, and H3.3), allowing for an even finer control of gene expression and establishment of specific cell identities during development 41.

Figure 1-6: Somatic histone and chromatin remodeler mutations in GBMs. Specific amino acid changes to the tail of histone H3 variants (H3.3 and H3.1) are thought to result in the inhibition of methylation along the histone tail. Furthermore, loss-of-function mutations in many chromatin remodelling genes result in reduced trimethylation (SETD2), demethylation (IDH1), or histone recruitment (ATRX). These histone and chromatin remodeler variants are hypothesized to have widespread epigenetic effects on gene expression; however, further work is required to understand the full consequences of these changes. Note that many intermediate genes in the histone/chromatin remodelling pathway have been excluded in this figure due to the lack of somatic mutation evidence in GBMs.

Figure 1-7: Overview of nucleosome structure and histone H3 tail modifications. The nucleosome is a basic structural unit comprised of 146 bp of DNA wrapped around an octamer of proteins known as histones (A). This octamer consists of two molecules of each histone protein: H2A, H2B, H3, and H4. Furthermore, these histone sequences include N-terminal tails which can be modified post-translationally to allow fine control of DNA transcription. For example, the tail of histone H3 (B) can undergo a number of modifications at many positions, including methylation (m), acetylation (a), and phosphorylation (p). Figure adapted under the Creative Commons Attribution License from Xu et al. (2013).

Isocitrate Dehydrogenase (IDH) Genes

In the same year that the TCGA GBM study was released, another genomic analysis of GBM tumours was published. Using unbiased sequencing of over 20,000 genes across 22 human tumours, recurrent missense mutations were observed in two isocitrate dehydrogenase genes (IDH1 and IDH2) 43. While these mutations were initially seen in only 12% of GBM patients, later studies observed them in approximately 80% of lower grade gliomas and adult GBMs 44. Interestingly,

these mutations were frequently seen in secondary GBMs but rarely in primary GBMs, suggesting that IDH genes are responsible for cancer progression in lower grade tumours 33. Somatic mutations in IDH1 and IDH2 typically occur at a specific residue (R132H in IDH1, R172H in IDH2) which has been observed to disrupt the active site of the isocitrate dehydrogenase enzyme 44. Interestingly, this active site alteration is hypothesized to result in increased binding affinity for the final product of isocitrate dehydrogenation, α-ketoglutarate (α-KG). Consequently, α-KG is converted into an alternate product known as 2-hydroxyglutarate, resulting in α-KG depletion. This depletion is thought to inhibit α-KG-dependent dioxygenase pathways, leading to expression of hypoxia-induced pathways and an inhibition of histone demethylases 44,45. Despite alterations to the expression of multiple downstream pathways, the specific mechanisms by which IDH mutations promote tumourigenesis are not currently understood.

Histone Encoding Genes

The discovery of recurrent IDH mutations provided initial evidence that histone modification pathways play a pathogenic role in GBM tumours. The development of whole exome sequencing techniques allowed for the unbiased sequencing of genes not previously linked to cancer. In 2012, Schwartzentruber et al. first identified somatic mutations in H3F3A, a gene which encodes a variant of histone 3 (H3) known as H3.3. These mutations were nonsynonymous substitutions located at specific positions in the histone tail (K27M and G34R/G34V), and were seen in approximately 30% of their patient cohort. Additional studies validated the presence of these H3F3A mutations in pediatric GBMs and DIPGs, as well as

somatic K27M mutations in HIST1H3B and HIST1H3C, genes responsible for the production of the histone 3 variant H3.1. Strikingly, these two variants are spatially distinct from one another, with K27M mutations occurring predominantly in midline GBMs and G34R/G34V variations found in hemispheric tumours 32.

Functionally, these recurrent mutations have been observed to have widespread effects on gene expression in affected cells. The histone tails of H3 variants are typically modified post-translationally to allow for more precise control over multiple gene expression pathways. Tumours containing K27M mutations have a reduced ability to undergo trimethylation at the K27 locus (K27me3), a chromatin mark which ordinarily interacts with polycomb repressive complex 2 (PRC2) 47. Disruption of PRC2 interaction in turn leads to altered downstream expression of multiple pathways, a mechanism which has now also been observed in other forms of cancer 48,49. Conversely, mutations at the G34 locus appear to disrupt trimethylation at the K36 locus, a chromatin mark associated with the expression of genes related to cortical brain development and DNA repair.

Histone Regulating and Chromatin Remodelling Genes

Histone marks are regulated by a number of writer and eraser proteins, with the histones themselves being recruited and structurally altered by chaperone and remodeler proteins, respectively. Mutations to such proteins can have widespread effects on epigenetic regulation of gene expression, and have been linked to multiple types of cancer 51. Recently, numerous mutations in genes with a role in chromatin remodelling have also been discovered in both pediatric and adult GBMs. A wide variety of mutations in the chromatin remodelling genes

ATRX and DAXX have been observed in approximately 30% of pediatric GBMs, and about 43% of adult gliomas 32,33. The ATRX-DAXX pathway is central to the recruitment of H3.3, and it is thought that a loss of function could result not only in altered H3.3-mediated expression of oncogenes, but also in the alternative lengthening of telomeres (ALT), a feature seen in other cancers 52. In addition to ATRX-DAXX mutations, multiple somatic variants have been observed in the gene SETD2. This gene encodes a methyltransferase specifically responsible for the methylation of the K36 locus on the histone tail; however, it is not known if the loss of this enzyme results in the same downstream effects as H3F3A G34R/G34V mutations 47. Finally, variations have also been seen in a variety of histone regulating gene families including lysine-dependent methyltransferases (KMT/MLL family) and lysine-dependent demethylases (KDM family); however, to date no highly recurrent genes or mutated residues have been reported.

ACVR1: Potential Novel Pathway in DIPGs

Finally, recent exome sequencing of pediatric GBMs has provided strong evidence for novel pathways being involved in tumourigenesis exclusively in DIPGs. In their analysis of 39 high-grade astrocytomas, Fontebasso et al. identified recurrent gain-of-function mutations in the activin A receptor type I gene (ACVR1) 53. These somatic variations were mainly observed at three specific residues: R206, G328, and G356. Interestingly, some of these amino acid positions have been previously observed as germline mutations in patients with the musculoskeletal disorder fibrodysplasia ossificans progressiva (FOP) 54. In FOP, mutations in ACVR1 are known to activate the bone morphogenetic protein (BMP) pathway, ultimately leading to bone formation

and differentiation of osteocytes. The specific role of somatic ACVR1 mutations in DIPGs is not currently understood; however, evidence that ACVR1 has a role in cell proliferation and that BMP signalling is related to astrocytic differentiation both provide promising theories to explore.

Part 2: Objectives

As outlined in Part 1, previous studies have identified a wide array of somatic mutations within GBM tumours. While there have been some characterizations of the co-occurrences between mutations and the associations between mutations and tumour phenotype, there is a great need to establish a more comprehensive variant catalogue and to quantify these associations. We hypothesize that many of the recurrent variants that have been previously described will be prevalent in this study, and that the presence of some of these mutations and co-occurrences will be related to both tumour diagnosis age and location within the brain.

Part 3: Methodology

3.1 Overview of Patient Cohort

In order to create a high-resolution catalogue of somatic mutations in HGG patients, it was essential that a large and comprehensive cohort of HGG patients be collected. The cohort generated for this study was a combination of HGG samples previously analyzed by our laboratory and samples published online from other studies. Each of these datasets is outlined below, along with the dataset-specific steps taken to sequence the DNA prior to bioinformatics analysis. Figure 3-1 provides a breakdown of the dataset according to tumour diagnosis, patient age, and tumour location in the brain. See Supplementary Table 1 for a complete listing of each sample and associated phenotypic information.

Figure 3-1: Catalogue for samples within GBM dataset (n = 482). All samples are initially categorized by (A) diagnosis, with samples in the Other category being excluded from gene queries. Samples classified as either GBM or DIPG (n = 250) are then broken down by (B) age at diagnosis, and (C) tumour location.

3.1.1 Jabado Dataset

A large portion of the samples (n = 167, with 27 matched germline samples) analyzed in this project were collected from datasets used in previous collaborations with Dr. Nada Jabado's group at the McGill University Health Center (MUHC) 32,53,55. These samples were originally obtained from multiple sources, including the Montreal Children's Hospital, Boston Children's Hospital, the Brain Tumor Toronto Bank, the London/Ontario Tumor Bank, the Pediatric Cooperative Health Tissue Network, the Children's Oncology Group, and from additional collaborators in Germany, Hungary, and Poland. The diagnosis of each sample was verified by two senior neuropathologists in accordance with the guidelines published by the World Health Organization (WHO) 18. The tumour tissues were extracted using either needle biopsies or partial surgical resections, and genomic DNA was isolated using the Qiagen DNeasy Blood and Tissue kit. Since all of these samples were subject to whole-exome sequencing (WES) as opposed to whole-genome sequencing (WGS), the Nextera Rapid Capture Exome kit was used to prepare a paired-end genomic library consisting solely of exonic regions. Finally, DNA sequencing was performed using 100-bp paired-end reads on an Illumina HiSeq2000 platform.

Baker Dataset

The additional datasets that were added to this study's cohort were downloaded from the European Genome-Phenome Archive (EGA) with permission from each study's corresponding Data Access Committee (DAC). The first publicly available dataset to be downloaded was published by Dr. Suzanne Baker (Accession: EGAS ) 56. HGG samples from this dataset (n = 104, with 93 matched germline samples) were initially obtained with consent and

approval from the St. Jude Children's Research Hospital and the Royal Marsden Hospital. The pathology of these samples was verified by an experienced clinical neuropathologist. Following genomic DNA library preparation, these samples were sequenced using the Illumina Genome Analyzer IIx or the Illumina HiSeq platform at a 100-bp read length (79 samples were WES, 25 samples were WGS).

Jones Dataset

Another dataset included for this project was published by Dr. Chris Jones (Accession: EGAS ) 57. This cohort was comprised of 26 DIPG tumours (20 WGS and 6 WES), with 25 tumours also having a matched peripheral blood sample. The criteria for DIPG diagnosis included diffuse intrinsic tumour tissue occupying at least 50% of the pons, as well as a clinical history of less than 3 months. Each tumour sample was collected from either the Necker Sick Children's Hospital or Hospital Saint Joan de Déu, and was extracted via stereotactic biopsy. Library preparation was performed using the Illumina FastTrack service, and exome capture for WES samples was achieved with the Agilent SureSelect platform. Finally, paired-end sequencing with a 100-bp read length was carried out with the Illumina HiSeq2000 platform.

Hawkins Dataset

Finally, a dataset of 20 DIPG tumours published by Dr. Cynthia Hawkins was added to this project (Accession: EGAS ) 25. The patients for this dataset were either diagnosed at or later referred to the Hospital for Sick Children. Similar to the Jones dataset, the diagnosis criteria for DIPG involved diffuse tumour tissue throughout at least 50% of the pons,

and was confirmed by a neuroradiologist using MRI. All 20 tumours had a matching germline sample from either adjacent normal brain or blood tissue. As with other datasets, WGS was performed on all samples with a 100-bp read length using the Illumina HiSeq2000 platform.

3.2 In-House DNA Sequence Processing

Following either WGS or WES, the raw DNA sequences produced needed to undergo multiple processing steps to allow for mutations to be identified in a given sample. In order to streamline this process for multiple NGS samples, an in-house pipeline consisting of multiple bioinformatics software packages was created. A general overview of this pipeline and the processing steps involved is outlined in Figure 3-2.

Figure 3-2: Overview of DNA sequencing analysis workflow. The in-house pipeline used to process the raw reads produced from DNA sequencing is outlined in steps 2-4 in this figure. Reproduced with permission from Tetreault et al. (2015).

DNA Read Mapping

The first step in processing raw DNA sequence files (in this case, FastQ files) involved organizing millions of 100-bp reads in order to reconstruct a sample's genome or exome. This was accomplished by aligning (also known as mapping) these reads to a reference genome sequence. Prior to DNA alignment, all reads were trimmed with the Trimmomatic software package to remove low-quality base pairs located at the ends of reads 58. Following this, the Burrows-Wheeler Alignment tool (BWA) was used to align WGS and WES FastQ files to the

human reference genome hg19 produced by the University of California Santa Cruz (UCSC) 59. The flexible read matching algorithms in BWA ensure that reads will still be properly matched to the corresponding reference sequence even when variants are present. Finally, two additional processing steps were performed prior to analyzing genetic variants. Using the Picard software suite, information between each read and its mate pair was synced (Fixmate), and duplicate reads produced by PCR were identified and removed (MarkDuplicates). The resulting aligned sequences had redundancy across most of the genome/exome, with multiple reads typically overlapping a given nucleotide. In order to reduce the file size, this aligned output was stored in a compressed binary format known as a BAM file.

SNV/INDEL Calling

Once the sequence reads were aligned to the appropriate reference, the process of identifying nucleotides in the sample that differ from the reference genome (known as variant calling) was performed. Multiple types of variants are identified during this process, including single nucleotide variants (SNVs) and insertions and deletions (INDELs). For this project, the Genome Analysis Toolkit (GATK) software package was used to call variants across each aligned DNA sequence 60. GATK accomplished this by traversing along the given sequence and applying a Bayesian probabilistic model to generate the most likely genotype for each locus. Information regarding the variants called in a sample was stored in a text file format known as variant call format (VCF), including the chromosomal position, the reference and alternate alleles, and information regarding the read number and quality.
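To make the VCF description concrete, the following is a minimal sketch of how one VCF data line could be split into the fields mentioned above. It is not part of the thesis pipeline (which relied on GATK, ANNOVAR, and VEP); the names `VcfRecord` and `parse_vcf_line` and the example line are illustrative assumptions, and only the eight fixed VCF columns are handled.

```python
from typing import NamedTuple, Optional


class VcfRecord(NamedTuple):
    chrom: str
    pos: int                # 1-based position on the chromosome
    ref: str                # reference allele
    alts: list              # one entry per alternate allele
    qual: Optional[float]   # None when the caller wrote "."
    info: dict              # INFO key -> string value, or True for bare flags


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the eight fixed, tab-separated columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]
    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            info_dict[key] = value if sep else True
    return VcfRecord(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alts=alt.split(","),
        qual=None if qual == "." else float(qual),
        info=info_dict,
    )


# Made-up example line, for illustration only (not a variant from the study).
example = "chr1\t12345\t.\tA\tT\t812.77\tPASS\tDP=54;DB"
record = parse_vcf_line(example)
print(record.chrom, record.pos, record.ref, record.alts, record.info)
```

Per-sample read counts live in the optional FORMAT and sample columns, which this sketch deliberately ignores.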

3.2.3 Variant Annotation

The final step of the DNA sequence processing performed for this project was the functional annotation of each variant that was called. For this step, both the ANNOVAR tool and the Variant Effect Predictor (VEP) developed by Ensembl were used 61,62. These tools first provided more information on the variants themselves, including the gene associated with the variant, the type of variant (e.g. nonsynonymous, synonymous, stop gain, or splicing), and the expected amino acid change caused by this variant if applicable. Next, both tools assessed the deleteriousness of nonsynonymous variants using multiple predictive algorithms (SIFT, PolyPhen2, GERP). These algorithms used a wide variety of metrics including sequence-based features, protein structure, and evolutionary conservation in order to predict the degree to which a given variant will affect protein function. Finally, the minor allele frequency of each variant was reported across two large public datasets, the 1000 Genomes Project and the Exome Variant Server (EVS). These frequencies allowed us to estimate how frequent variants were in the general population. The information produced from variant annotation was then appended to existing VCF files for future analysis.

3.3 Variant Correlation Database and Software (SQLProfiler)

As a result of the variant annotation step of our in-house pipeline, a tabular text file containing detailed information on all exonic variants was generated from the VCF file for each sample. While these text files are normally sufficient when examining recurrent mutations in rare disease studies with small patient cohorts, manual analysis of each of these text files was too impractical to be considered for this study. To put this into context, across the 482 DNA sequences analysed in this study there were over 800,000 unique mutations and over 4 million

mutations in total. Consequently, a new bioinformatics software package (named SQLProfiler) was developed in order to efficiently catalogue recurrent somatic mutations and correlations between mutated genes across this large dataset.

SQLProfiler Design Goals

The design goals that were established prior to the development of SQLProfiler were to create a software package that could:

1. Efficiently determine the most recurrent mutations across a large dataset
2. Efficiently compute the correlations between recurrent mutations or mutated genes
3. Apply numerous filters to both patient and variant information during these queries
4. Allow user interaction through a simple graphical user interface (GUI), and
5. Be utilized in additional studies involving large datasets of DNA sequences

In order to accomplish these design goals, SQLProfiler was primarily coded and compiled in Java Standard Edition 8. One of the major reasons Java was selected as the development language was its ability to create GUIs that are easy to understand and compatible with multiple operating systems, making the software accessible to users who are unfamiliar with computer programming or terminal-based software. Additionally, Java can use multiple open source libraries to communicate with other programming languages. This allowed SQLProfiler to efficiently query datasets through languages designed to work with large records of data.

Query execution in SQLProfiler is broken down into three general steps (Figure 3-3). In the first step, the user selects both the query they wish to perform and the filters to apply to this query. In the second step, SQLProfiler constructs a database query based on the input provided by the user, runs this query on the variant database, and interprets and stores the query results. Finally, SQLProfiler sends these query results to a series of custom R scripts, which perform statistical analysis and generate figures based on the results. The following sections will outline each of these steps in greater detail.

Figure 3-3: Overview of SQLProfiler analysis. User selected queries and filters (A) are converted into SQL queries and executed on the dataset (B). The results are then imported into a series of custom R scripts, which return computed statistics and figures to the user (C).

3.3.2 Query and Filter Selection

As mentioned above, one of the reasons Java was selected as the language of choice for SQLProfiler was in order to produce a user-friendly visual interface for query and filter selection (Figure 3-4). After a connection to the specified database is established, SQLProfiler provides the user with a choice from four queries:

1. Single Gene: Given a gene or specific chromosomal position, returns a list of patients containing a mutation in the specified gene or position.
2. Two Genes: Given two genes and/or chromosomal positions, returns a list of patients containing mutations in both. Also returns the significance (p-value) of this correlation, and the odds ratio.
3. Many Genes: Given a list of more than two genes and/or chromosomal positions, performs Two Genes queries for each combination of mutations specified. Returns matrices containing the p-value, odds ratio, and raw counts for each Two Genes query performed.
4. Most Common: Given a user-selected number N, returns the N most commonly seen mutated genes, or N most commonly seen variants (see Section 4.1 for an overview).

Once the user has selected a query, they are given the option to apply two major sets of filters to this query. First, filters can be applied to specify the subgroup of samples to be queried. These filters can be simple flags such as specifying patient phenotype or an age range, or more advanced filters which include/exclude samples based on their mutation profile (e.g. selecting a subgroup of samples which do not contain certain driver mutations). Second, filters can be applied to the variants themselves. As with the sample filters, these filters can consist of

simple qualitative flags or quantitative ranges (e.g. only selecting nonsynonymous mutations, or only selecting variants with more than 3 alternate reads). However, variants can also be filtered according to their presence in control samples in the dataset. The user can choose to either remove variants that are seen in a given tumour's matched control sample, or to remove variants that are seen in any tumour's control sample. Once all the filters have been specified, the query is ready to be generated and sent to the database.

Figure 3-4: SQLProfiler query selection interface. The user can select from four types of queries, along with the option to specify which genes/variants to query (A). The user can then specify a number of sample-based (e.g. diagnosis, age) or variant-based (e.g. variation type) filters (B) to apply. The query results are then displayed at the bottom of the screen (C).
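As a rough illustration of the Most Common query described above, the sketch below counts how many distinct samples carry at least one variant in each gene and returns the top N. SQLProfiler performs this inside MySQL; the function name `most_common_genes`, the input shape, and the placeholder gene and sample names are assumptions made for this example.

```python
from collections import Counter


def most_common_genes(calls, n):
    """Return the n genes mutated in the most distinct samples.

    calls: iterable of (sample_name, gene) pairs, one per variant call.
    A sample with several variants in the same gene is counted once.
    """
    distinct_pairs = set(calls)
    counts = Counter(gene for _sample, gene in distinct_pairs)
    # Sort by descending count, then gene name, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Placeholder data, not taken from the study cohort.
demo_calls = [
    ("S1", "GENE_A"), ("S1", "GENE_A"),  # two variants, same gene and sample
    ("S2", "GENE_A"), ("S3", "GENE_A"),
    ("S1", "GENE_B"), ("S2", "GENE_B"),
    ("S3", "GENE_C"),
]
print(most_common_genes(demo_calls, 2))
```

Counting distinct samples rather than raw variant calls keeps a single heavily mutated tumour from dominating the ranking.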

3.3.3 Database Structure

In order to explain how the database queries are constructed and executed, it is important to first provide an overview of the structure of the database used by SQLProfiler. Given a series of DNA sequences, we assume that each sample can contain multiple variants, and that each variant can be seen in multiple samples. This is known as a many-to-many (M:N) relationship, and is central to the design of the database. At its core, this database (coded in MySQL) is comprised of two major tables:

1. Sample Table: Contains one record for each sample in the provided dataset, which consists of the sample name and any other pieces of information specified by the user (e.g. age, diagnosis).
2. Variant Table: Contains one record for each unique variation in the provided dataset. Each record is primarily comprised of the variant's chromosome, position, reference allele, alternate allele, gene, and protein change (if applicable). Additional fields were also added for more flexible filtering, such as the type of variant or the variant frequency in public databases.

In order to accommodate this M:N relationship, a third table known as an intersection table is needed (referred to in the database as the relation table). This table contains one record for each instance of mutation in each sample, and is used to link sample and variant information together. This linking is accomplished by storing the unique record IDs for both the sample and the variant in the relation record (Figure 3-5). These relation records also include variant information that is unique for each sample, such as the reference and alternate read counts.
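The three-table layout can be sketched as follows. This is an approximation only: the thesis database is MySQL, whereas this example uses Python's built-in sqlite3 module so that it runs without a server. The ID column names follow Figure 3-6; the remaining column names, the coordinates, and the read counts are placeholder assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sample (
    idsample   INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    diagnosis  TEXT,
    age        REAL
);
CREATE TABLE variant (
    idvariant       INTEGER PRIMARY KEY,
    chrom           TEXT,
    pos             INTEGER,
    ref             TEXT,
    alt             TEXT,
    gene            TEXT,
    protein_change  TEXT,
    UNIQUE (chrom, pos, ref, alt)        -- one record per unique variant
);
-- Intersection table: one record per (sample, variant) observation.
CREATE TABLE relation (
    idrelation  INTEGER PRIMARY KEY,
    idsample    INTEGER NOT NULL REFERENCES sample(idsample),
    idvariant   INTEGER NOT NULL REFERENCES variant(idvariant),
    ref_reads   INTEGER,                 -- per-sample read counts live here
    alt_reads   INTEGER,
    UNIQUE (idsample, idvariant)
);
"""


def build_demo_db():
    """Create an in-memory database holding a tiny placeholder dataset."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?, ?)",
        [(1, "Patient 1", "DIPG", 15), (2, "Patient 2", "DIPG", 11)],
    )
    # Coordinates below are placeholders, not real genomic positions.
    conn.execute(
        "INSERT INTO variant VALUES (101, 'chr1', 1000, 'A', 'T', 'H3F3A', 'p.K28M')"
    )
    conn.executemany(
        "INSERT INTO relation (idsample, idvariant, ref_reads, alt_reads) "
        "VALUES (?, ?, ?, ?)",
        [(1, 101, 30, 12), (2, 101, 25, 20)],
    )
    return conn


conn = build_demo_db()
# One variant record is shared by two samples through the relation table.
print(conn.execute(
    "SELECT COUNT(DISTINCT idsample) FROM relation WHERE idvariant = 101"
).fetchone()[0])
```

Storing the read counts on the relation record, rather than on the variant record, is what lets the same variant carry different evidence in each sample.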

Figure 3-5: Relational database structure. In this database, one patient can have many variants, and one variant can be seen in many patients, resulting in a many-to-many (M:N) relationship. In order to effectively link patient records to variant records, a third table (intersection table) is created. Each record in this table contains the unique IDs for both the patient record and the variant record.

Finally, the SQLProfiler database contains two auxiliary tables. The first is a table which stores the variant record IDs for every variant seen in a control sample (known as the germline table in the database). The second is a table which links tumour samples with their matched control samples (known as the patient relation, or patrelation table). While neither table provides additional information to query, they are much quicker to query than the entire sample or variant tables and are used to make queries more efficient.
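The two control-based filtering modes could use these auxiliary tables roughly as sketched below. The table names germline and patrelation come from the text; the column names `idtumour` and `idcontrol`, the function `somatic_calls`, and all IDs are assumptions for illustration, and sqlite3 again stands in for MySQL.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE relation (idsample INTEGER, idvariant INTEGER);
        CREATE TABLE germline (idvariant INTEGER PRIMARY KEY);
        CREATE TABLE patrelation (idtumour INTEGER, idcontrol INTEGER);
    """)
    # Tumour 1 has matched control 5; tumour 2 has no matched control.
    conn.executemany("INSERT INTO relation VALUES (?, ?)", [
        (1, 100), (1, 101), (2, 100), (2, 102), (5, 100),
    ])
    conn.execute("INSERT INTO germline VALUES (100)")  # seen in a control
    conn.execute("INSERT INTO patrelation VALUES (1, 5)")
    return conn


def somatic_calls(conn, tumour_ids, mode):
    """Return (idsample, idvariant) pairs left after control filtering.

    mode="any":     drop variants seen in any control sample.
    mode="matched": drop variants seen in the tumour's own matched control.
    """
    marks = ",".join("?" * len(tumour_ids))
    if mode == "any":
        condition = "r.idvariant NOT IN (SELECT idvariant FROM germline)"
    elif mode == "matched":
        condition = """NOT EXISTS (
            SELECT 1 FROM patrelation p
            JOIN relation c ON c.idsample = p.idcontrol
            WHERE p.idtumour = r.idsample AND c.idvariant = r.idvariant)"""
    else:
        raise ValueError("mode must be 'any' or 'matched'")
    sql = (f"SELECT r.idsample, r.idvariant FROM relation r "
           f"WHERE r.idsample IN ({marks}) AND {condition} "
           f"ORDER BY r.idsample, r.idvariant")
    return conn.execute(sql, list(tumour_ids)).fetchall()


db = make_db()
print(somatic_calls(db, [1, 2], "any"))
print(somatic_calls(db, [1, 2], "matched"))
```

In this toy data the "any" mode also cleans up tumour 2, which has no matched control, while the "matched" mode leaves its variant 100 in place.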

3.3.4 Query Generation and Execution

Following query and filter selection, SQLProfiler formats the Java user input into a query that can be executed on the MySQL database. This is achieved through the use of MySQL Connector/J, a driver which uses the Java Database Connectivity (JDBC) API to allow communication between Java applications and MySQL databases. When a query is submitted by the user, a PreparedStatement object is created by SQLProfiler. This object stores a string of the MySQL code to be executed on the database. This string is generated in a modular fashion by SQLProfiler according to the filters provided by the user. In general, the query starts with the records from the relation table and joins these records with the sample and variant tables, applying all the previously specified filters in the process. An overview of the query construction is provided in Figure 3-6.

Sample Table

IDSAMPLE  NAME       DIAGNOSIS  AGE
1         Patient 1  DIPG       15
2         Patient 2  DIPG       11
3         Patient 3  DIPG       5
4         Patient 4  GBM        12
5         Patient 5  Control    16

Variant Table

IDVARIANT  GENE      PROTEIN_CHANGE
100        HIST1H3B  p.K28M
101        H3F3A     p.K28M
102        H3F3A     p.G35R
103        TP53      p.R141C
104        IDH1      p.R132H

Query fragments: INNER JOIN patient_table WHERE diagnosis = DIPG AND age > ... ; INNER JOIN variant_table WHERE gene = H3F3A AND protein_change = p.K28M ...

Relation (Intersection) Table: columns IDRELATION, IDSAMPLE, IDVARIANT.

Figure 3-6: MySQL construction of gene queries. All gene queries are initially based around the relation table records. Next, sample-based and variant-based filters are applied to their respective tables, and the filtered records (highlighted in blue and red, respectively) are joined to the relation records which satisfy both the sample and variant filters (highlighted in green). These joined relation records are then the results that are returned to the user.
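The modular query construction shown in Figure 3-6 might look roughly like the sketch below. SQLProfiler does this in Java with a JDBC PreparedStatement; here Python's sqlite3 module and its `?` placeholders play the same role. The sample and variant rows follow Figure 3-6, but the relation rows, the filter names, the age threshold, and the function `build_query` are assumptions for illustration.

```python
import sqlite3

# Whitelist of supported filters; user values are only ever bound as
# parameters, never concatenated into the SQL string.
FILTER_SQL = {
    "diagnosis": "s.diagnosis = ?",
    "age_above": "s.age > ?",
    "gene": "v.gene = ?",
    "protein_change": "v.protein_change = ?",
}


def build_query(filters):
    """Assemble the SQL string and its parameter list from chosen filters."""
    sql = ("SELECT s.name, v.gene, v.protein_change FROM relation r "
           "INNER JOIN sample s ON s.idsample = r.idsample "
           "INNER JOIN variant v ON v.idvariant = r.idvariant")
    clauses = [FILTER_SQL[name] for name in filters]
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY s.idsample, v.idvariant", list(filters.values())


def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sample (idsample INTEGER PRIMARY KEY, name TEXT,
                             diagnosis TEXT, age INTEGER);
        CREATE TABLE variant (idvariant INTEGER PRIMARY KEY, gene TEXT,
                              protein_change TEXT);
        CREATE TABLE relation (idrelation INTEGER PRIMARY KEY,
                               idsample INTEGER, idvariant INTEGER);
    """)
    conn.executemany("INSERT INTO sample VALUES (?, ?, ?, ?)", [
        (1, "Patient 1", "DIPG", 15), (2, "Patient 2", "DIPG", 11),
        (3, "Patient 3", "DIPG", 5), (4, "Patient 4", "GBM", 12),
        (5, "Patient 5", "Control", 16),
    ])
    conn.executemany("INSERT INTO variant VALUES (?, ?, ?)", [
        (100, "HIST1H3B", "p.K28M"), (101, "H3F3A", "p.K28M"),
        (102, "H3F3A", "p.G35R"), (103, "TP53", "p.R141C"),
        (104, "IDH1", "p.R132H"),
    ])
    # Hypothetical sample-variant links; the figure's rows were not recoverable.
    conn.executemany(
        "INSERT INTO relation (idsample, idvariant) VALUES (?, ?)",
        [(1, 101), (2, 101), (3, 101), (4, 102), (2, 103), (3, 100)],
    )
    return conn


conn = make_demo_db()
sql, params = build_query({"diagnosis": "DIPG", "age_above": 10,
                           "gene": "H3F3A", "protein_change": "p.K28M"})
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Keeping the filter fragments in a fixed lookup table means an unsupported filter name fails with a KeyError instead of reaching the database.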

3.3.5 Statistical Analysis

While the results from the single gene and most common queries are simply sent to the Java program to display to the user, the two genes and many genes queries require statistical analysis in order to assess the significance of correlations between genes/variants. For these queries, the null hypothesis is that two genes/variants do not co-occur more or less frequently than would be expected by random chance, while the alternate hypothesis is that there is a positive or negative correlation between them. In order to test these hypotheses, Fisher's exact test is applied to the sample counts in the correlation. Fisher's exact test was selected over other statistical tests such as a chi-square test due to its increased accuracy when expected correlation counts are relatively low 66. Finally, in a many genes query the p-values and odds ratios from multiple Fisher's exact tests are compiled into separate tables, and also combined into a graph which visually represents all the pairwise correlations in the query. An example of this graph, along with an explanation of how to interpret it, is shown in Figure 3-7.

Figure 3-7: Correlation graph explanation (colour scale: log odds ratio). Each square in this graph represents a single pairwise comparison. In this figure, one square has been expanded into the 2x2 contingency table that it represents. This contingency table is provided as input to Fisher's exact test, which yields a p-value and odds ratio for the given correlation. The p-value threshold that this correlation surpasses is represented by symbols, and the odds ratio is represented by the colour of the square.
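The contingency table behind each square can be sketched as below. This is an illustrative reconstruction, not the SQLProfiler code: the function names are hypothetical, the natural logarithm is assumed because the figure does not state a base, and adding 0.5 to every cell when a cell is zero (the Haldane-Anscombe correction) is a swapped-in convention to keep the logarithm finite.

```python
import math


def contingency_table(samples_a, samples_b, all_samples):
    """Return (both, only_a, only_b, neither) counts for two sets of mutated samples.

    samples_a and samples_b are assumed to be subsets of all_samples.
    """
    a, b, everyone = set(samples_a), set(samples_b), set(all_samples)
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = len(everyone - a - b)
    return both, only_a, only_b, neither


def log_odds_ratio(both, only_a, only_b, neither):
    """Natural log of the odds ratio; 0.5 is added to all cells if any cell is zero."""
    cells = [both, only_a, only_b, neither]
    if 0 in cells:
        cells = [count + 0.5 for count in cells]
    both, only_a, only_b, neither = cells
    return math.log((both * neither) / (only_a * only_b))


if __name__ == "__main__":
    # Hypothetical sample IDs for two genes in a ten-sample cohort.
    table = contingency_table({1, 2, 3, 4}, {3, 4, 5}, range(1, 11))
    print(table, round(log_odds_ratio(*table), 3))
```

A positive value would map to the co-occurrence end of the colour scale and a negative value to the mutual-exclusivity end.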

Part 4: Results

4.1 Most Commonly Mutated Genes/Variants

In order to assess the most common somatic variants and mutated genes in this high-grade glioma dataset, a number of variant filters were added to the SQLProfiler most common query. First, only nonsynonymous SNVs, frameshift insertions and deletions, and stop gain variants (mutations resulting in a premature stop codon) were targeted in these queries. This not only isolated variants that are properly called and annotated in both WES and WGS samples (intergenic variants would not be captured in WES), but also restricted the analysis to the mutations most likely to affect protein structure or function. Furthermore, given that approximately 95% of gliomas are thought to be sporadic cases, we expected variants associated with GBM to be the result of rare de novo events [67]. Accordingly, genes were ranked by the mutation frequency observed in previous non-cancer studies, and the top 1000 genes from this ranking were excluded. Variants previously reported in either the 1000 Genomes or EVS public databases were also removed. Additionally, variants had to be observed in at least 5% of reads at that position and in a minimum of 4 reads total. These low thresholds allowed for the inclusion of potentially interesting subclonal mutations, while attempting to exclude systematic errors that resemble low-frequency SNPs [68,69]. Variants were also filtered out if they were seen more than once in a separate in-house database of approximately 3000 DNA sequences not associated with any form of cancer. Finally, any variants that were also seen in any control sample from GBM patients were discarded. This allowed somatic variants to be approximated even in tumour samples lacking a matched control sample.
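The filters listed above could be combined roughly as in the sketch below. Everything here is illustrative: the `VariantCall` fields, the variant-type labels, and the function name are hypothetical, and "a minimum of 4 reads total" is read as four variant-supporting reads, which is only one possible interpretation of the text.

```python
from dataclasses import dataclass

# Variant classes retained by the query (labels are hypothetical).
KEPT_TYPES = {"nonsynonymous SNV", "frameshift insertion",
              "frameshift deletion", "stopgain"}


@dataclass
class VariantCall:
    variant_type: str
    alt_reads: int              # reads supporting the variant
    total_reads: int            # read depth at the position
    in_public_db: bool          # reported in 1000 Genomes or EVS
    inhouse_count: int          # occurrences in the in-house non-cancer database
    in_gbm_control: bool        # seen in any control sample from GBM patients
    gene_in_excluded_set: bool  # gene is among the 1000 excluded genes


def passes_filters(call, min_fraction=0.05, min_alt_reads=4):
    """Return True if the call survives every filter described in Section 4.1."""
    if call.variant_type not in KEPT_TYPES:
        return False
    if call.gene_in_excluded_set or call.in_public_db or call.in_gbm_control:
        return False
    if call.inhouse_count > 1:  # "seen more than once" is filtered out
        return False
    if call.total_reads == 0 or call.alt_reads < min_alt_reads:
        return False
    return call.alt_reads / call.total_reads >= min_fraction


if __name__ == "__main__":
    subclonal = VariantCall("nonsynonymous SNV", 6, 100, False, 0, False, False)
    print(passes_filters(subclonal))
```

Under this reading, a subclonal variant seen in 6 of 100 reads passes, while one seen in 3 of 20 reads fails on read support alone.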

Following filtration based on the above parameters, the database was scanned to count the occurrences of each remaining variant. The variants were then ordered by number of occurrences to produce a list of the 100 most common somatic variants in the dataset. When examining the 100 most commonly mutated genes, the genes were instead sorted by the number of patients carrying at least one of the remaining variants in that gene.

4.1.1 Adult Samples

While there were far fewer adult HGG samples in this dataset (n = 27) than pediatric HGG samples (n = 223), some distinct patterns of somatic variation could still be observed. The p.r132h hotspot mutation in IDH1 was the most frequent somatic variant in adult GBMs (8/27, 29.6%). This was followed by the p.r141c mutation in TP53 (7/27, 25.9%). Beyond these two recurrent variants, no other somatic mutation was observed more than twice within the adult samples. It is worth noting, however, that the mutations seen twice included variants in ATRX (p.l2074fs and p.r770x) and EGFR (p.a289v).

Similar mutation patterns were observed when variants were combined by gene to catalogue the most commonly mutated genes. TP53 was by far the most commonly mutated gene in adult samples, with a mutation frequency (15/27, 55.6%) almost twice that of the next highest genes, ATRX and IDH1 (both 8/27, 29.6%). EGFR, NF1, and PTEN were all within the top 100 mutated genes, albeit at lower frequencies (all 3/27, 11.1%).
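The difference between the two rankings can be sketched as follows: variants are ranked by how many samples carry them, while genes are ranked by how many patients carry at least one remaining variant in that gene. The record layout and function name are hypothetical, and one sample per patient is assumed for simplicity.

```python
from collections import Counter


def count_variants_and_genes(records):
    """Count samples per variant and patients per gene.

    records is an iterable of (sample_id, gene, protein_change) tuples that
    have already passed the variant filters.
    """
    unique_records = set(records)  # ignore duplicate rows for the same sample
    variant_counts = Counter((gene, change) for _, gene, change in unique_records)
    # A patient with several variants in one gene is counted once for that gene.
    sample_gene_pairs = {(sample, gene) for sample, gene, _ in unique_records}
    gene_counts = Counter(gene for _, gene in sample_gene_pairs)
    return variant_counts, gene_counts


if __name__ == "__main__":
    # Hypothetical records for three samples.
    example = [
        ("S1", "TP53", "p.r141c"), ("S1", "TP53", "p.r43h"),
        ("S2", "TP53", "p.r141c"), ("S2", "IDH1", "p.r132h"),
        ("S3", "IDH1", "p.r132h"),
    ]
    variants, genes = count_variants_and_genes(example)
    print(variants.most_common(100))
    print(genes.most_common(100))
```

Here sample S1 has two TP53 variants but contributes only once to the TP53 gene count, which is why gene-level frequencies can exceed any single variant's frequency without double counting patients.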

4.1.2 Pediatric Samples

In pediatric GBM samples (n = 223), we observed a mutation catalogue that was strikingly different from that of adult GBMs. Consistent with previous studies, the most commonly observed somatic variants were the p.k27m alterations (listed as p.k28m due to recent annotation changes) in the histone genes H3F3A (91/223, 40.8%) and HIST1H3B (28/223, 12.6%). These were followed by two hotspot mutations in TP53, p.r141c (17/223, 7.6%) and p.r43h (11/223, 4.9%), as well as the known p.g34r variant (listed as p.g35r due to recent annotation changes) in H3F3A (9/223, 4.0%). Four other variants were observed at a frequency greater than 3%: two previously described variants (ACVR1 p.g328e in 8/223, and PIK3CA p.e545k in 7/223), and two unknown variants (ORT2T35 p.c203delinsx in 9/223, and ZNF2 p.r79fs in 7/223). Upon further examination, both unknown variants resulted in amino acid deletions in repetitive regions of the exon, making it difficult to assess any potential functional significance. Furthermore, these variants were found only in samples sequenced on older platforms, which could indicate that they result from a sequencer-specific technical error. Recurrent variants were also observed at lower frequencies across multiple previously described genes, including PIK3CA, IDH1, MAX, TP53, ACVR1, ATRX, and BRAF.

When grouping variants by gene, H3F3A and TP53 were the most commonly mutated genes in pediatric samples (103/223 and 94/223, respectively). Following these, four genes were mutated in over 10% of the pediatric dataset: ATRX (33/223), HIST1H3B (28/223), PIK3CA (27/223), and ACVR1 (26/223). Finally, a number of previously reported genes (PPM1D, PDGFRA, SETD2, BCOR, IDH1, BRAF, and EGFR) were within the top 100 mutated genes.

4.1.3 Location and Age-based Mutation Profiles

Given the relatively large number of pediatric GBM sequences, samples were further stratified by general tumour location (midline vs hemispheric) and/or age at diagnosis (0-10 years vs older). This stratification revealed numerous location- and age-dependent mutation patterns.

First, we observed significant differences between pediatric midline (n = 146) and hemispheric (n = 49) tumours. These differences were most pronounced in histone gene variants. The p.k27m alterations in H3F3A and HIST1H3B were the most common variants in midline tumours (86/146 and 27/146, respectively), but were almost never seen in hemispheric tumours (1/49 and 0/49, respectively). Conversely, the p.g34r alteration in H3F3A was the most common variant in hemispheric tumours (8/49), but was not seen in any midline tumour. Outside of histone gene variants, the p.v600e BRAF mutation was the second most common variant in hemispheric tumours (5/49), but was seen only once in midline tumours. Finally, when grouping variants by gene, we observed significantly more IDH1 and ATRX mutations in hemispheric samples than in midline samples (4/49 vs 0/146 and 15/49 vs 14/146, respectively).

Next, grouping samples into a 0-10 years age category (n = 116) and an older age category (n = 74) also revealed striking mutation profiles, especially with regard to histone gene variants. In both age groups, the most common variant was the p.k27m alteration in H3F3A, followed by p.k27m in HIST1H3B for patients aged 0-10 years and p.g34r in H3F3A for older patients. While there was no significant difference in H3F3A p.k27m frequency between patients aged 0-10 years (43/116) and older patients (27/74), the HIST1H3B p.k27m variant was overwhelmingly seen in patients aged 0-10 years (23/116 vs 1/74), and the p.g34r variant was seen exclusively in older patients (8/74 vs 0/116). Multiple hotspot mutations in ACVR1 were also seen significantly more often in the 0-10 age group, a trend which extended to ACVR1 mutations as a whole (21/116 vs 1/74). Finally, the recurrent variant IDH1 p.r132h was observed exclusively in the older age group (5/74 vs 0/116), and ATRX mutations were also significantly more common in this age group (25/74 vs 7/116).

Lastly, pediatric samples were stratified according to both tumour location and diagnosis age. This helped uncover which of the significant associations above depended on both tumour location and age. For example, the HIST1H3B p.k27m variant was almost exclusively seen in younger midline tumours (23/88) compared to older midline tumours (1/34) and both younger and older hemispheric tumours (0/20 and 0/29, respectively). This subset of tumours also had high frequencies of H3F3A, TP53, ACVR1, and PIK3CA mutations; however, none of these could be significantly associated with the phenotype. Conversely, no variants were observed more than twice in hemispheric tumours diagnosed between 0-10 years of age (n = 20), and no gene was mutated more than three times in total (NF1 and TP53 each had three mutations). In the older subset of midline tumours (n = 34), H3F3A p.k27m was the only variant seen more than twice, and had a mutation frequency significantly higher than in any other sample grouping (27/34). Mutations in TP53 and ATRX were also seen in over 25% of this sample group. Finally, the H3F3A p.g34r alteration was the most common variant in hemispheric tumours from the older age group (8/29), and was not observed in any other sample group. Hotspot TP53 mutations (p.r141c and p.r210x, both 4/29) and BRAF p.v600e (4/29) were the next most common variants; however, no significant associations were found. TP53 (16/29), ATRX (14/29), and PDGFRA (5/29) were the most frequently mutated non-histone genes in older hemispheric tumours, but these genes could not be significantly linked to this tumour phenotype.

In this analysis, it is evident not only that the common somatic mutation profiles differ between tumour phenotypes, but also that these differences involve multiple genes/variants. These observations prompted a second series of database queries with the goal of quantifying how often these recurrent mutations occur together in the same tumours and assessing the significance of these correlations.

4.2 Gene Correlations

The second set of queries performed on this dataset assessed the correlations between previously described recurrent variants and mutated genes using the SQLProfiler many genes query explained earlier. These queries applied the same variant filters described in the previous section in order to focus on mutations that are protein-changing, rare, and somatic. The variants/genes for these queries were selected based both on their prevalence in the current dataset and on somatic mutation evidence from previous GBM studies. These variants/genes and their associated pathways are listed in Table 1.

4.2.1 Adult Samples

The pairwise variant/gene correlations in adult samples (n = 27) are depicted in Figure 4-1. While no specific correlation surpassed the p-value threshold after correction for multiple tests, a few correlations remain noteworthy because of the proportion of cases involved. Almost all samples with IDH1 or IDH2 variants contained at least one TP53 mutation (9/10, uncorrected p-value: 0.01), with every IDH1 p.r132h sample co-occurring with a TP53 mutation (8/8, uncorrected p-value: 0.003). Samples with IDH variants were also frequently seen to contain ATRX variants (6/10, uncorrected p-value: 0.02). Strikingly, there was a large overlap between these groups, as 6/10 samples with IDH variants also had both TP53 and ATRX variants. These mutation patterns were similar to those observed in the adult lower grade gliomas (n = 11) contained within the dataset (Supplementary Table 1); however, due to the small sample sizes it is difficult to assess the full significance of these correlations.
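As an illustrative check of the IDH1 p.r132h and TP53 co-occurrence reported above, the corresponding 2x2 table can be worked by hand. This sketch assumes that the TP53 count from the adult most-common query (15/27) carries over unchanged to the correlation query, and that the reported value is a two-sided Fisher's exact p-value; neither is stated explicitly in the text. Let X be the number of the 8 IDH1 p.r132h samples that also carry a TP53 mutation.

```latex
% Assumed table: both = 8, IDH1 only = 0, TP53 only = 7, neither = 12 (N = 27)
\begin{align*}
P(X = k) &= \frac{\binom{15}{k}\binom{12}{8-k}}{\binom{27}{8}} \\
P(X = 8) &= \frac{\binom{15}{8}\binom{12}{0}}{\binom{27}{8}}
          = \frac{6435}{2220075} \approx 0.0029 \\
P(X = 0) &= \frac{\binom{15}{0}\binom{12}{8}}{\binom{27}{8}}
          = \frac{495}{2220075} \approx 0.0002 \\
p_{\text{two-sided}} &= P(X = 8) + P(X = 0) \approx 0.0031
\end{align*}
```

Only the k = 0 table is as unlikely as the observed one (for example, P(X = 1) is approximately 0.0054), so under these assumptions the result is in line with the reported uncorrected p-value of 0.003.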

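Table 1: List of variants/genes selected for correlation queries. Each variant/gene is categorized according to its relevant molecular pathway.

Pathway                                Gene(s)             Variant(s)
Histone/Chromatin Remodeling Pathway   H3F3A               p.k27m, p.g34r/v
Histone/Chromatin Remodeling Pathway   HIST1H3B/HIST1H3C   p.k27m
Histone/Chromatin Remodeling Pathway   IDH1/IDH2           Any
Histone/Chromatin Remodeling Pathway   ATRX                Any
Histone/Chromatin Remodeling Pathway   SETD2               Any
Histone/Chromatin Remodeling Pathway   BCOR                Any
Histone/Chromatin Remodeling Pathway   BCORL1              Any
BMP Pathway                            ACVR1               Any
p53 Pathway                            TP53                Any
p53 Pathway                            PPM1D               Any
p53 Pathway                            CHEK2               Any
p53/Rb Pathway                         CDKN2A/CDKN2B       Any
Rb Pathway                             RB1                 Any
Rb Pathway                             MYC                 Any
Rb Pathway                             MAX                 Any
RTK/PI3K Pathway                       PIK3CA              Any
RTK/PI3K Pathway                       PIK3R1              Any
RTK/PI3K Pathway                       NF1                 Any
RTK/PI3K Pathway                       PDGFRA              Any
RTK/PI3K Pathway                       MET                 Any
RTK/PI3K Pathway                       FGFR1/FGFR2/FGFR3   Any
RTK/PI3K Pathway                       EGFR                Any
RTK/PI3K Pathway                       PTEN                Any
RAS/MAPK Pathway                       BRAF                Any
RAS/MAPK Pathway                       PTPN11              Any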

Figure 4-1: Gene correlation plot for adult GBM samples (n = 27). For each pairwise comparison (square), the p-value thresholds are symbolically depicted, and the odds ratio is represented through the colour (colour scale: log odds ratio). P-value thresholds: # = 0.05 (corrected), ** = 0.01 (uncorrected), * = 0.05 (uncorrected).
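The symbol legend above can be expressed as a small lookup. This sketch assumes a Bonferroni correction, which is a swapped-in choice because the text does not name the correction method; the function name, the strict inequalities, and the example number of tests are likewise hypothetical.

```python
def significance_symbol(p_value, n_tests):
    """Map an uncorrected p-value to the legend symbol used in the correlation plot.

    Assumes a Bonferroni correction: the corrected threshold is 0.05 / n_tests.
    """
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    if p_value < 0.05 / n_tests:
        return "#"   # significant after correction
    if p_value < 0.01:
        return "**"  # uncorrected p < 0.01
    if p_value < 0.05:
        return "*"   # uncorrected p < 0.05
    return ""


if __name__ == "__main__":
    # Hypothetical: 300 pairwise comparisons in one many genes query.
    for p in (0.0001, 0.003, 0.02, 0.2):
        print(p, repr(significance_symbol(p, 300)))
```

With a few hundred pairwise comparisons, an uncorrected p-value of 0.003 would receive `**` but not `#` under this assumption, matching the pattern described for the adult samples.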
