Bayesian Inference for Single-cell ClUstering and ImpuTing (BISCUIT) Elham Azizi

Bayesian Inference for Single-cell ClUstering and ImpuTing (BISCUIT) Elham Azizi BioC 2017: Where Software and Biology Connect

Profiling the Tumor-Immune Ecosystem in Breast Cancer. Immunotherapy treatments are successful only in a subset of patients and cancer types. The underlying biology determining success is not known, and the variability in responses suggests a complex immune environment. Goal: unsupervised characterization of tumor-infiltrating immune subpopulations across subtypes of breast cancer, and identification of the impact of environmental cues. Understanding the tumor-immune ecosystem can guide the development of treatments that activate immune cells against the tumor. Pilot data: single-cell RNA-seq of 9000 CD45+ immune cells from 4 tumors (patients). Collaboration with Alexander Rudensky, MSKCC. Figure adapted from Kroemer, Nat Med 2015. 2

Single-cell RNA-seq reveals heterogeneity in expression. It measures gene expression at the resolution of single cells, which allows characterizing novel cell types and functions based on heterogeneity. [Figure: count matrix; inDrop (Klein et al., Cell 2015); PBMC single-cell data (Zheng et al., bioRxiv 2016)] 3

Single-cell RNA-seq data for immune cells from 4 breast cancer tumors (MSK1, MSK2, MSK3, MSK4): 9000 CD45+ cells (lymphoid cells and monocytic cells), normalized by library size. The result shows unclear structure of cell types and large patient biases. Problems of scRNA-seq data: sampling sparse amounts of mRNA leads to dropouts; amplification differences; cell-type-specific capture rates. 4

Goal: Characterizing cell subpopulations using single-cell RNA-seq data. [Figure: count matrix (genes × cells) and 2D projection of cells (t-SNE) showing Clusters 1 to 3] 5

Problems: Single-cell RNA-seq data involves significant dropouts and library size variation. We need to impute dropouts and normalize the data. [Figure: observed count matrix (genes × cells) and 2D projection of cells (t-SNE) showing Clusters 1 to 3] 6

Common Approach: Normalizing independent of cell types. Pipeline: observed count matrix (genes × cells), then normalization (to mean/median library size, downsampling, or BASiCS with spike-ins/ERCCs), then clustering cells, then downstream analysis. Problems: dropouts are not resolved (zeros remain zero!); normalization removes biological stochasticity specific to each cell type; this leads to improper clustering and biased downstream analysis. 7
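To make the "zeros remain zero" point concrete, here is a minimal sketch of global normalization to the median library size. The toy count matrix and the function name are hypothetical, chosen only for illustration; this is not code from the BISCUIT repository.

```python
from statistics import median

def normalize_by_library_size(counts):
    """Scale each cell (row) so that its total count equals the median library size."""
    lib_sizes = [sum(cell) for cell in counts]
    target = median(lib_sizes)
    return [
        [c * target / size for c in cell] if size > 0 else list(cell)
        for cell, size in zip(counts, lib_sizes)
    ]

# Hypothetical toy matrix: rows are cells, columns are genes.
counts = [
    [10, 0, 30],    # small cell; the second gene dropped out
    [40, 20, 140],  # large cell of the same type
    [20, 10, 70],
]
normalized = normalize_by_library_size(counts)

# Every cell now has the same library size, but the dropout is still a zero.
print(normalized[0])
```

Scaling is multiplicative, so a zero count stays zero whatever the target library size is; the dropout has to be handled by a separate imputation step.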

Main Concepts behind Biscuit for Normalization and Imputing 8

Problem #1: Imputing. Two ideas for imputing expression in single-cell RNA-seq data. [Figure: 2D projection of cells (t-SNE), Gene A] 9

Problem #1: Imputing. Idea 1: Impute dropouts based on cell type. A cell shows no expression of Gene A, but we observe that cells of the same type mostly have high expression of Gene A. So we impute the dropout in Gene A based on similar cells. 11

Problem #1: Imputing. Idea 2: Impute dropouts based on co-expression patterns. Similar cells alone give no significant inference. However, Gene A is always co-expressed with Gene B in cells of the same type, so we impute the dropout in Gene A based on Gene B. 14
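One standard way to turn co-expression into an imputed value is the conditional mean of a bivariate Gaussian: E[A | B = b] = mu_A + (cov_AB / var_B) (b - mu_B). The sketch below illustrates that formula only; the parameter values are hypothetical, and the slides do not state that BISCUIT imputes with exactly this expression.

```python
def conditional_mean(mu_a, mu_b, cov_ab, var_b, observed_b):
    """Expected (log) expression of Gene A given the observed value of Gene B,
    under a bivariate Gaussian for (A, B) within one cell type."""
    return mu_a + (cov_ab / var_b) * (observed_b - mu_b)

# Hypothetical cluster-level parameters in log-expression space.
mu_a, mu_b = 2.0, 3.0
cov_ab, var_b = 0.8, 1.0

# Gene A dropped out in this cell, but Gene B was observed above its cluster mean.
imputed_a = conditional_mean(mu_a, mu_b, cov_ab, var_b, observed_b=4.0)
print(imputed_a)
```

With a positive covariance, a cell whose Gene B is above the cluster mean gets an imputed Gene A above the cluster mean as well; with zero covariance the estimate falls back to the cluster mean, which is Idea 1.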

Problem #2: Normalization of single-cell RNA-seq data. In addition to imputing dropouts, we need to normalize data by library size. [Figure: histogram of library size in an example single-cell dataset, from Zeisel, Science 2014] 15

Problem #2: Normalization. Problem with global normalization: cells of different sizes have very different total numbers of transcripts, and there is a high chance of dropouts in smaller cells. [Figure: example housekeeping gene, with a 0 count marked] 16

Problem #2: Normalization. Problem with global normalization: dropouts are not resolved (zeros stay at 0), which leads to spurious differential expression. 17

Problem #2: Normalization. Key: a different normalization for each cell type. Chicken-and-egg problem: we want to normalize based on cell types, but we do not know the cell types! 18

Approach: Simultaneous inference of clusters and imputing parameters. Iterative learning alternates between clustering cells and imputing & normalization. [Figure labels: ×1, ×3, ×0.75, ×1] 19

Approach: Simultaneous inference of clusters and imputing parameters. Iterative learning alternates between the assignment of cells to clusters (with parameters characterizing each cluster) and the cell-specific parameters for imputing & normalization. [Figure labels: ×1, ×3, ×0.75, ×1] 20

Modeling Single-cell data using a Bayesian Mixture Model 21

Modeling Clusters of Cells using a Bayesian Mixture Model. [Figure: ideal (normalized) count matrix, genes × cells, grouped into Clusters 1 to 3] 22

Modeling Clusters of Cells using a Bayesian Mixture Model. Each gene: a mixture of log-normal models. [Figure: ideal (normalized) count matrix, genes × cells, and the distribution of one gene across Clusters 1 to 3] 23

Modeling Clusters of Cells using a Bayesian Mixture Model. Modeling all genes together: a mixture of multivariate log-normals, which also takes advantage of co-expression patterns in learning clusters. [Figure: ideal (normalized) count matrix, genes × cells, and the joint distribution of two genes across Clusters 1 to 3] 24
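As a rough illustration of a mixture of multivariate log-normals, the sketch below draws two-gene expression values for cells from three clusters: log-expression is Gaussian within a cluster, and the exponent gives strictly positive expression. All weights, means, and covariances are hypothetical toy values, and the two-gene restriction is only there to keep the Cholesky step short.

```python
import math
import random

def sample_bivariate_normal(rng, mean, cov):
    """Draw one 2D Gaussian sample using a hand-written 2x2 Cholesky factor."""
    (a, b), (_, d) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return (mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2)

def sample_mixture(rng, weights, means, covs, n_cells):
    """Return (cluster label, expression) pairs; expression is log-normal."""
    cells = []
    for _ in range(n_cells):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        log_expr = sample_bivariate_normal(rng, means[k], covs[k])
        cells.append((k, tuple(math.exp(v) for v in log_expr)))
    return cells

# Hypothetical parameters: three clusters that differ in mean and in covariance.
weights = [0.5, 0.3, 0.2]
means = [(0.0, 0.0), (2.0, 2.0), (4.0, 0.0)]
covs = [
    ((1.0, 0.8), (0.8, 1.0)),    # genes positively co-expressed
    ((1.0, -0.5), (-0.5, 1.0)),  # genes negatively co-expressed
    ((0.5, 0.0), (0.0, 0.5)),    # genes independent
]
cells = sample_mixture(random.Random(0), weights, means, covs, n_cells=100)
```

The off-diagonal covariance terms are what a per-gene (univariate) mixture cannot represent, which is the co-expression information the slide refers to.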

Generative Model Without Technical Variation 25

Generative Model with Technical Variation. Without technical variation: the latent counts, which we want to recover. With technical variation: the observation. 26
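The equations on this slide did not survive transcription. One hedged way to write the two cases is given below. The notation is assumed here for illustration: y_j is the latent log-expression vector of cell j, x_j the observed one, z_j the cluster label, and alpha_j, beta_j cell-specific scalings of the cluster mean and covariance, matching the per-cell scaling of mean and covariance mentioned on the inference slide. See Prabhakaran*, Azizi* ICML 2016 for the exact specification.

```latex
\begin{align}
y_j \mid z_j = k &\sim \mathcal{N}\left(\mu_k,\ \Sigma_k\right)
  && \text{(without technical variation: latent)} \\
x_j \mid z_j = k &\sim \mathcal{N}\left(\alpha_j \mu_k,\ \beta_j \Sigma_k\right)
  && \text{(with technical variation: observed)}
\end{align}
```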

BISCUIT (Bayesian Inference for Single-cell ClUstering and ImpuTing). Data-driven priors adapt to different datasets. [Figure: graphical model with global distributions of observed data, hyper-priors, hyper-parameters, cluster-specific parameters, cell-specific parameters (priors set based on the observed library size distribution), and the observed gene expression per cell j] Prabhakaran*, Azizi* ICML 2016 27

Model Specification Prabhakaran*, Azizi* ICML 2016 28

Inference Algorithm. Parallel sampling from derived conditional posterior distributions: P(parameter | data, other parameters). Steps: estimate hyper-priors based on the data; sample hyper-parameters; sample technical variation parameters (scaling of mean and covariance per cell); sample cluster-specific parameters; sample the assignment of cells to clusters using the Chinese Restaurant Process (CRP). The sampling steps repeat as Gibbs iterations. The CRP also allows estimating the number of clusters. Prabhakaran*, Azizi* ICML 2016 29
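The CRP step can be sketched as follows. Under the standard CRP prior, a cell joins an existing cluster with probability proportional to that cluster's current size and opens a new cluster with probability proportional to a concentration parameter. The function name and the toy numbers are hypothetical; in a full Gibbs sampler these prior weights would also be multiplied by each cluster's likelihood for the cell, which is omitted here.

```python
def crp_prior_probabilities(cluster_sizes, concentration):
    """Prior probabilities of joining each existing cluster, followed by the
    probability of opening a new cluster (last entry)."""
    total = sum(cluster_sizes) + concentration
    probs = [n / total for n in cluster_sizes]
    probs.append(concentration / total)
    return probs

# Hypothetical state: 10 other cells already assigned to 3 clusters.
probs = crp_prior_probabilities([5, 3, 2], concentration=1.0)
print(probs)
```

Because the last entry is never zero, the sampler can always create a new cluster, which is why the number of clusters does not have to be fixed in advance.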

Performance on Simulated Data. Data were simulated from the model for 100 cells and 50 genes in 3 clusters. [Figures: confusion matrices showing true labels and those from MCMC-based methods; boxplots of F-scores obtained in 15 experiments with randomly generated data] Prabhakaran*, Azizi* ICML 2016 30

Model Mismatch: Robustness when counts are not log-normal. [Figure panels: noncentral Student's t; (log) negative binomial] Prabhakaran*, Azizi* ICML 2016 31

Performance on Single-cell Data. Zeisel et al., 2015: 3005 mouse cortex cells, with UMIs. Deep coverage gives good ground truth for 7 cell types. Selected the 558 genes with the largest standard deviation across cells and fit the model to log(counts+1). F-score: 0.91. Prabhakaran*, Azizi* ICML 2016 32

Comparison to other methods. [Figure: one panel per method, with F-scores 0.21, 0.91, 0.79, 0.74, and 0.61; one panel is labeled DBSCAN] 33

Cluster-dependent Imputing & Normalizing. The observed data (genes × cells) are combined with the BISCUIT clusters and parameters, then imputed and normalized with a linear transformation. 34
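A minimal sketch of what such a linear transformation can look like, under a stated assumption: if the observed log-expression x of a cell in cluster k is distributed as N(alpha * mu_k, beta * Sigma_k) for cell-specific scalars alpha and beta, then y = mu_k + (x - alpha * mu_k) / sqrt(beta) has mean mu_k and covariance Sigma_k. The names alpha, beta, and the function are assumptions for illustration; the slides do not give the transformation BISCUIT actually uses.

```python
import math

def impute_and_normalize(x, alpha, beta, cluster_mean):
    """Map an observed log-expression vector back to the cluster's scale:
    y = mu + (x - alpha * mu) / sqrt(beta), applied gene by gene."""
    scale = 1.0 / math.sqrt(beta)
    return [m + scale * (v - alpha * m) for v, m in zip(x, cluster_mean)]

# Hypothetical cluster mean and cell-specific scalings.
cluster_mean = [2.0, 4.0]
alpha, beta = 0.5, 1.0

# A cell observed exactly at its scaled mean maps back to the cluster mean.
typical = impute_and_normalize([1.0, 2.0], alpha, beta, cluster_mean)

# A cell with a dropout (0.0) in the first gene gets a non-zero corrected value.
with_dropout = impute_and_normalize([0.0, 2.0], alpha, beta, cluster_mean)
print(typical, with_dropout)
```

Unlike global library-size scaling, the shift term depends on the cluster mean, so a zero in the observed data does not have to remain zero after correction.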

Cluster-dependent Imputing & Normalizing. [Figure: observed data (genes × cells) plus BISCUIT clusters & parameters yield imputed data (genes × cells) organized into Clusters 1 to 3] 35

Characterizing tumor-infiltrating immune cells in breast cancer 36

Single-cell RNA-seq data for immune cells from 4 breast cancer tumors (MSK1, MSK2, MSK3, MSK4). Unclear structure of cell types; large patient biases. 37

Single-cell RNA-seq data for immune cells from 4 breast cancer tumors (MSK1, MSK2, MSK3, MSK4). Unclear structure of cell types and large patient biases, versus defined structure and corrected patient biases. 38

Impact of environment on immune cell heterogeneity. Little is known about how the tissue microenvironment modulates anti-tumor immunity. Collected 50K CD45+ leukocytes from 8 patients, with different ranges of tumor grade, ER, PR, Her2, and age, and two cases of metastases. Different environments (tissues): tumor, peripheral blood, lymph node, and normal tissue (prophylactic mastectomies). Largest single-cell immune map based on tissue residence. 39

Map of immune cells from 8 breast cancer patients, normalized by BISCUIT: 50K cells, 82 clusters. Higher entropy shows more mixing of patients with BISCUIT. 40
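A minimal sketch of how entropy can quantify patient mixing: take the patient labels of the cells in a group (a cluster, or a neighborhood in the 2D map) and compute the Shannon entropy of the label distribution. The grouping and the label names are hypothetical; the slides do not specify how the entropy was computed.

```python
import math
from collections import Counter

def patient_entropy(patient_labels):
    """Shannon entropy (in bits) of the patient labels within one group of cells."""
    counts = Counter(patient_labels)
    total = len(patient_labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A group made of a single patient's cells: no mixing.
unmixed = patient_entropy(["P1", "P1", "P1", "P1"])

# A group with equal contributions from four patients: maximal mixing.
mixed = patient_entropy(["P1", "P2", "P3", "P4"])
print(unmixed, mixed)
```

Entropy is zero when a group contains one patient only and reaches log2 of the number of patients when all patients contribute equally, so higher values indicate that patient-specific batch effects no longer drive the grouping.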

Map of immune cells from 8 breast cancer patients, normalized by BISCUIT (82 clusters). [Figure labels: Tregs, macrophages, B cells, DCs, cytotoxic T, central memory T, monocytes, neutrophils, T cells, NK cells] 41

Impact of Environments (Tissues). [Figure labels: Tregs, monocytic cells, B cells, T cells, NK cells] 42

What are the differences across inferred cell subpopulations? [Figure: 82 clusters; labels: Tregs, monocytic cells, B cells, T cells, NK cells] 43

T cell activation: First component of variation. Correlated genes are enriched for cytokine production & signaling, lymphocyte activation, leukocyte differentiation, and ligand-receptor interaction. 44

T cell activation: First component of variation. Comparing the distribution of cells along the activation component shows that tumor is more activated. [Figure: cell density along the T cell activation component for tumor, lymph node, normal, and blood] 45

Activation state of each cluster. [Figure: cell density along the T cell activation component] 46

Activation of monocytic cells: first components of variation 47

Covariance patterns identify Treg clusters Markers differentially expressed in mean but differ in covariance patterns 48
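To illustrate why covariance can separate clusters that a comparison of means cannot, the sketch below builds two toy clusters whose two markers have identical means but opposite covariance. The marker values are hypothetical and are not taken from the Treg data.

```python
def mean(values):
    return sum(values) / len(values)

def sample_covariance(xs, ys):
    """Unbiased sample covariance between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical log-expression of two markers in two clusters of three cells each.
cluster_a = {"marker_1": [1.0, 2.0, 3.0], "marker_2": [1.0, 2.0, 3.0]}
cluster_b = {"marker_1": [1.0, 2.0, 3.0], "marker_2": [3.0, 2.0, 1.0]}

means_a = (mean(cluster_a["marker_1"]), mean(cluster_a["marker_2"]))
means_b = (mean(cluster_b["marker_1"]), mean(cluster_b["marker_2"]))
cov_a = sample_covariance(cluster_a["marker_1"], cluster_a["marker_2"])
cov_b = sample_covariance(cluster_b["marker_1"], cluster_b["marker_2"])
print(means_a, means_b, cov_a, cov_b)
```

A test on means alone would treat these two clusters as identical, while the sign of the covariance distinguishes them.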

Different covariance patterns of immunotherapy targets in Tregs across patients. Incorporating personalized co-expression of drug targets can broaden the scope of immunotherapy. 49

Macrophage clusters differ in covariance between M1/M2 markers 50

Summary. Analyzing single-cell data involves computational challenges: dropouts and technical variation dependent on cell types. BISCUIT: a Bayesian approach for simultaneous clustering and imputing. Clusters are identified with both mean and gene-gene covariance patterns, and incorporating covariance information improves normalization and imputing. Map of the tumor-immune ecosystem in breast cancer: single-cell data for 50K CD45+ cells from 8 patients, analyzed with BISCUIT. There is substantial diversity of immune cell types, driven by environment. Activation of T cells and monocytic cell types explains most of the variation. Covariance patterns can be informative in the characterization of cell types and the development of personalized treatments. 51

Acknowledgments. Dana Pe'er, Alexander Rudensky, Barbara Engelhardt (Princeton), David Blei (Columbia). Pe'er Lab: Sandhya Prabhakaran, Ambrose Carr, Andrew Cornish, Linas Mazutis, Vaidotas Kiseliovas, Juozas Nainys. Rudensky Lab: George Plitas, Kasia Konopacki. R code: https://github.com/sandhya212/biscuit_singlecell_imm_icml_2016 Email: elham.azizi@gmail.com 52