A Versatile Algorithm for Finding Patterns in Large Cancer Cell Line Data Sets

Similar documents
An Analysis of MDM4 Alternative Splicing and Effects Across Cancer Cell Lines

SUPPLEMENTARY FIGURES: Supplementary Figure 1

SUPPLEMENTARY APPENDIX

Cancer Informatics Lecture

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics

HALLA KABAT * Outreach Program, mircore, 2929 Plymouth Rd. Ann Arbor, MI 48105, USA LEO TUNKLE *

Gene-microRNA network module analysis for ovarian cancer

Nature Genetics: doi: /ng Supplementary Figure 1. SEER data for male and female cancer incidence from

Supplementary Figures

Reliability of Ordination Analyses

Chapter 1. Introduction

Detection of copy number variations in PCR-enriched targeted sequencing data

A quick review. The clustering problem: Hierarchical clustering algorithm: Many possible distance metrics K-mean clustering algorithm:

Computer Science, Biology, and Biomedical Informatics (CoSBBI) Outline. Molecular Biology of Cancer AND. Goals/Expectations. David Boone 7/1/2015

T. R. Golub, D. K. Slonim & Others 1999

Characterisation of structural variation in breast. cancer genomes using paired-end sequencing on. the Illumina Genome Analyser

MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION TO BREAST CANCER DATA

MethylMix An R package for identifying DNA methylation driven genes

Unsupervised MRI Brain Tumor Detection Techniques with Morphological Operations

Predication-based Bayesian network analysis of gene sets and knowledge-based SNP abstractions

Statistical Issues in Translational Cancer Research

Evaluation on liver fibrosis stages using the k-means clustering algorithm

Identifying Thyroid Carcinoma Subtypes and Outcomes through Gene Expression Data Kun-Hsing Yu, Wei Wang, Chung-Yu Wang

Multi-omics data integration colon cancer using proteogenomics approach

BWA alignment to reference transcriptome and genome. Convert transcriptome mappings back to genome space

Comparative Study of K-means, Gaussian Mixture Model, Fuzzy C-means algorithms for Brain Tumor Segmentation

What can we contribute to cancer research and treatment from Computer Science or Mathematics? How do we adapt our expertise for them

SubLasso:a feature selection and classification R package with a. fixed feature subset

Nature Immunology: doi: /ni Supplementary Figure 1. RNA-Seq analysis of CD8 + TILs and N-TILs.

COMPUTATIONAL OPTIMISATION OF TARGETED DNA SEQUENCING FOR CANCER DETECTION

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION

PSSV User Manual (V1.0)

Comparison of volume estimation methods for pancreatic islet cells

What Math Can Tell You About Cancer

The 16th KJC Bioinformatics Symposium Integrative analysis identifies potential DNA methylation biomarkers for pan-cancer diagnosis and prognosis

Nature Methods: doi: /nmeth.3115

Identification of Tissue Independent Cancer Driver Genes

Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 2010

Introduction to Gene Sets Analysis

Paternal exposure and effects on microrna and mrna expression in developing embryo. Department of Chemical and Radiation Nur Duale

Segmentation and Analysis of Cancer Cells in Blood Samples

PSSV User Manual (V2.1)

StratomeX Visual Analysis of Large-Scale Heterogeneous Genomics Data for Cancer Subtype Characterization

Outlier Analysis. Lijun Zhang

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc.

Exploring TCGA Pan-Cancer Data at the UCSC Cancer Genomics Browser

Machine Learning! Robert Stengel! Robotics and Intelligent Systems MAE 345,! Princeton University, 2017

Feature Vector Denoising with Prior Network Structures. (with Y. Fan, L. Raphael) NESS 2015, University of Connecticut

Introduction to LOH and Allele Specific Copy Number User Forum

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Cancer genetics

Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes

Research Strategy: 1. Background and Significance

Human Cancer Genome Project. Bioinformatics/Genomics of Cancer:

Hierarchical clustering analysis of blood plasma lipidomics profiles from mono- and dizygotic twin families

Risk-prediction modelling in cancer with multiple genomic data sets: a Bayesian variable selection approach

Gene expression correlates of clinical prostate cancer behavior

STATISTICS BY EXAMPLE Weighing Chances

Introduction to Systems Biology of Cancer Lecture 2

Session 4 Rebecca Poulos

Discovery of Novel Human Gene Regulatory Modules from Gene Co-expression and

MRI Image Processing Operations for Brain Tumor Detection

Nature Medicine: doi: /nm.3967

Relationship between genomic features and distributions of RS1 and RS3 rearrangements in breast cancer genomes.

Reveal Relationships in Categorical Data

OncoPPi Portal A Cancer Protein Interaction Network to Inform Therapeutic Strategies

Study the Evolution of the Avian Influenza Virus

Unsupervised Discovery of Emphysema Subtypes in a Large Clinical Cohort

Lessons in biostatistics

Detection and Classification of Lung Cancer Using Artificial Neural Network

Clustering analysis of cancerous microarray data

Bayesian Prediction Tree Models

Ch. 18 Regulation of Gene Expression

Package xseq. R topics documented: September 11, 2015

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Predicting Diabetes and Heart Disease Using Features Resulting from KMeans and GMM Clustering

Lecture Outline. Biost 590: Statistical Consulting. Stages of Scientific Studies. Scientific Method

Contents. 1.5 GOPredict is robust to changes in study sets... 5

Nature Genetics: doi: /ng Supplementary Figure 1. Mutational signatures in BCC compared to melanoma.

TCGA. The Cancer Genome Atlas

Cancer troublemakers: a tale of usual suspects and novel villains

Session 6: Integration of epigenetic data. Peter J Park Department of Biomedical Informatics Harvard Medical School July 18-19, 2016

Chapter 2 Planning Experiments

Chapter 4 Cellular Oncogenes ~ 4.6 -

OVARIAN CANCER POSSIBLE TUMOR-SUPPRESSIVE ROLE OF BATF2 IN OVARIAN CANCER

Pseudogenes transcribed in breast invasive carcinoma show subtype-specific expression and cerna potential

CS 453X: Class 18. Jacob Whitehill

HHS Public Access Author manuscript Mach Learn Med Imaging. Author manuscript; available in PMC 2017 October 01.

Section D. Genes whose Mutation can lead to Initiation

BREAST CANCER EARLY DETECTION USING X RAY IMAGES

Comparing Ecological Communities Part Two: Ordination. Ordination vs. classification. Species responses to gradients. Read: Ch.

25. Two-way ANOVA. 25. Two-way ANOVA 371

Comparison of discrimination methods for the classification of tumors using gene expression data

Presence of AVA in High Frequency Oscillations of the Perfusion fmri Resting State Signal

Supplementary Information

Clustered mutations of oncogenes and tumor suppressors.

Introduction. Introduction

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

Cancer Genetics. What is Cancer? Cancer Classification. Medical Genetics. Uncontrolled growth of cells. Not all tumors are cancerous

Transcription:

A Versatile Algorithm for Finding Patterns in Large Cancer Cell Line Data Sets James Jusuf, Phillips Academy Andover May 21, 2017 MIT PRIMES The Broad Institute of MIT and Harvard

Introduction A quest to understand the cancer genome Discover pathways that can be targeted in cancer treatment Predict valuable information about cancer patients based on known genetic indicators An explosion in the amount of available data Human Genome Project (1990-2003) Databases: COSMIC (2004), TCGA (2005), CCLE (2012) The more data the better: larger sample sizes allow us to detect patterns in with more reliability

Introduction Our study focuses on two widely studied phenomena in cancer genomics/epigenomics: Fig. 1A: Mutations: variations in DNA sequence Fig. 1B: Methylation: amount and location of methyl ( CH 3 ) groups attached to DNA, which regulate gene expression

Goals Find genes and cancer types in which a specific mutation affects a cell s methylation profile Create an algorithm to quantify the correlation between methylation and the mutation of a given gene in a given cancer type In the future: Apply the algorithm to other variables, e.g. exon splicing and mutations Further investigate any potentially significant patterns we discover in the laboratory

What is unsupervised clustering? Fig. 2A: Some arbitrary data

What is unsupervised clustering? Fig. 2B: Data partitioned into 2 clusters

What is unsupervised clustering? Fig. 2C: Data partitioned into 3 clusters

What is unsupervised clustering? Fig. 3: Clustering can be generalized into any number of dimensions

How does clustering work mathematically? Fig. 4A: Sample table of 3-dimensional data showing x, y, and z-coordinates of 19 points Calculate the Euclidean distance between every pair of points: dd iiii = Δxx iiii 2 + Δyy iiii 2 + Δzz iiii 2

How does clustering work mathematically? Fig. 4B: A distance matrix of the 19 data points

Back to our project Create an algorithm to quantify the correlation between methylation and the mutation of a given gene in a given cancer type Methylation Cluster 1 All have mutation in Gene X Methylation Cluster 2 No mutations in Gene X Fig. 5: Sample cell lines clustered by methylation Null hypothesis: There is no relationship between methylation and mutation in [gene] among cells of [cancer type]

Cancer Cell Line Data All Data (CCLE) 697 cell lines 30394 genes 23 cancer types Colorectal (52 cell lines) Methylation Data Mutation Data Algorithm p-values Fig. 6: Data pipeline for analyzing the correlation between methylation and mutation in a given cancer type

Cancer Cell Line Data All Data (CCLE) 697 cell lines 30394 genes 23 cancer types Colorectal (52 cell lines) Methylation Data Mutation Data Algorithm p-values Fig. 6: Data pipeline for analyzing the correlation between methylation and mutation in a given cancer type

Methylation Data Gene T Fig. 9A: Methylation data Calculate the Euclidean distance between every pair of cell lines, now in T dimensions: dd iiii = Δmm 1iiii 2 + Δmm 2iiii 2 + + Δmm TTTTTT 2

Methylation Data Gene Name Cell Line (30387 more columns) Fig. 9B: Section of methylation data for bladder cancer cell lines (9 more rows) Calculate the Euclidean distance between every pair of cell lines, now in T dimensions: dd iiii = Δmm 1iiii 2 + Δmm 2iiii 2 + + Δmm TTTTTT 2

Methylation Distance Matrix Fig. 9C: N x N Methylation distance matrix

Cancer Cell Line Data All Data (CCLE) 697 cell lines 30394 genes 23 cancer types Colorectal (52 cell lines) Methylation Data Mutation Data Algorithm p-values Fig. 6: Data pipeline for analyzing the correlation between methylation and mutation in a given cancer type

Cancer Cell Line Data All Data (CCLE) 697 cell lines 30394 genes 23 cancer types Colorectal (52 cell lines) Methylation Data Mutation Data Algorithm p-values Fig. 6: Data pipeline for analyzing the correlation between methylation and mutation in a given cancer type

Incorporating Mutation Data Gene A Fig. 10A: Cell lines 2, 4, and N 1 have mutations in Gene A

Incorporating Mutation Data Gene B Fig. 10B: Cell lines 3 and 5 have mutations in Gene B

Incorporating Mutation Data Gene C Fig. 10C: Cell lines 1, 2, and N have mutations in Gene C

The Algorithm Gene X Suppose that k out of N cell lines have a mutation in gene X There are N(N 1)/2 distances in the entire distance matrix There are k(k 1)/2 distances between the cell lines with mutations

The Algorithm Gene X Call the distribution of all values in the matrix {A} Call the sample of k(k 1)/2 distances between mutated cell lines {K}

The Algorithm What is the probability that the sample {K} occurs entirely by random chance? Randomly select 10,000 samples of size k(k 1)/2 within the N x N half-matrix, and call these samples {R 1 } thru {R 10,000 } Use the Kolmogorov-Smirnov (KS) test, which returns the likelihood that a given sample is derived from some reference distribution (similarity score) The p-value will be the percentage of the time that KS({K}, {A}) exceeds KS({R i }, {A}) as i ranges from 1 to 10,000 Repeat process for each gene; p-values corrected for multiple hypothesis testing using Benjamini Hochberg (BH)

Improving the Algorithm Instead of using the distance matrix to calculate p-values, we can use the cophenetic distance matrix based on hierarchical clustering instead Takes into account the proximity of other data points; amplifies biologically relevant signal and removes noise Fig. 11: Example of cophenetic distance in 2D sample data

Visualizing the output Fig. 12: Heatmap of deletion mutations in top-scoring genes in kidney cancer cell lines

Sample Results Ly Fig. 13: MYC is a known oncogene in B-cell lymphomas and other cancers

Generalizing the algorithm Clustered cell lines based on exon splicing instead of methylation Find correlation between exon splicing patterns and mutations Could use algorithm for other applications in the future

Sample Results Fig. 14: SF3B1 is known to affect splicing patterns in pancreatic cancer

Acknowledgements I would like to thank: Dr. Mahmoud Ghandi The Broad-Novartis Cancer Cell Line Encyclopedia (CCLE) for providing the data used in this study The PRIMES program coordinators My parents