Low Abundance Genes. DNA Microarrays. Log Transformation? Basic Idea. Mining for Low-abundance Transcripts in Microarray Data. Exploratory Methods

Similar documents
Cancer outlier differential gene expression detection

Experimental Design For Microarray Experiments. Robert Gentleman, Denise Scholtens Arden Miller, Sandrine Dudoit

Data analysis and binary regression for predictive discrimination. using DNA microarray data. (Breast cancer) discrimination. Expression array data

Nature Genetics: doi: /ng Supplementary Figure 1. Assessment of sample purity and quality.

islet insulin production in diabetes model: spatial comparison of mouse strains

Statistical Assessment of the Global Regulatory Role of Histone. Acetylation in Saccharomyces cerevisiae. (Support Information)

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Supplement to SCnorm: robust normalization of single-cell RNA-seq data

MOST: detecting cancer differential gene expression

An Introduction to Bayesian Statistics

SUPPLEMENTARY FIGURES

Gene expression analysis. Roadmap. Microarray technology: how it work Applications: what can we do with it Preprocessing: Classification Clustering

Aspects of Statistical Modelling & Data Analysis in Gene Expression Genomics. Mike West Duke University

Genetic Analysis of Anxiety Related Behaviors by Gene Chip and In situ Hybridization of the Hippocampus and Amygdala of C57BL/6J and AJ Mice Brains

Sample Size Estimation for Microarray Experiments

Biol220 Cell Signalling Cyclic AMP the classical secondary messenger

Ecological Statistics

Comparison of discrimination methods for the classification of tumors using gene expression data

Complex Trait Genetics in Animal Models. Will Valdar Oxford University

Chapter 1: Exploring Data

Bayesian hierarchical modelling

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

Application of Resampling Methods in Microarray Data Analysis

Investigating the robustness of the nonparametric Levene test with more than two groups

Nature Getetics: doi: /ng.3471

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection

n Outline final paper, add to outline as research progresses n Update literature review periodically (check citeseer)

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays

AP Statistics. Semester One Review Part 1 Chapters 1-5

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

chapter 1 - fig. 2 Mechanism of transcriptional control by ppar agonists.

7SK ChIRP-seq is specifically RNA dependent and conserved between mice and humans.

A Case Study: Two-sample categorical data

What you should know before you collect data. BAE 815 (Fall 2017) Dr. Zifei Liu

SUPPLEMENTAL MATERIAL

Supplemental Figure S1. Expression of Cirbp mrna in mouse tissues and NIH3T3 cells.

Modelling Spatially Correlated Survival Data for Individuals with Multiple Cancers

KEY CONCEPT QUESTIONS IN SIGNAL TRANSDUCTION

Biostatistics for Med Students. Lecture 1

2402 : Anatomy/Physiology

Receptor mediated Signal Transduction

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach

Dr. Kelly Bradley Final Exam Summer {2 points} Name

Understandable Statistics

Supplementary Figures

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

Index. Springer International Publishing Switzerland 2017 T.J. Cleophas, A.H. Zwinderman, Modern Meta-Analysis, DOI /

HS Exam 1 -- March 9, 2006

Analysis of Unbalanced Microarray Data

Chapter 1: Introduction to Statistics

The Association Design and a Continuous Phenotype

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

General Principles of Endocrine Physiology

Classification of cancer profiles. ABDBM Ron Shamir

Unit 1 Exploring and Understanding Data

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Research Analysis MICHAEL BERNSTEIN CS 376

Unsupervised Identification of Isotope-Labeled Peptides

Identification of Tissue Independent Cancer Driver Genes

IPA Advanced Training Course

Still important ideas

Data analysis in microarray experiment

Signal Transduction Cascades

Boosted PRIM with Application to Searching for Oncogenic Pathway of Lung Cancer

RNA-seq. Design of experiments

SUPPLEMENTARY INFORMATION. Rett Syndrome Mutation MeCP2 T158A Disrupts DNA Binding, Protein Stability and ERP Responses

MBios 478: Systems Biology and Bayesian Networks, 27 [Dr. Wyrick] Slide #1. Lecture 27: Systems Biology and Bayesian Networks

Supplementary Materials for

Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit

V. Gathering and Exploring Data

Experimental Studies. Statistical techniques for Experimental Data. Experimental Designs can be grouped. Experimental Designs can be grouped

Biostatistics. Donna Kritz-Silverstein, Ph.D. Professor Department of Family & Preventive Medicine University of California, San Diego

Lecture 15. Signal Transduction Pathways - Introduction

Knowledge Discovery and Data Mining. Testing. Performance Measures. Notes. Lecture 15 - ROC, AUC & Lift. Tom Kelsey. Notes

Using Bayesian Networks to Analyze Expression Data. Xu Siwei, s Muhammad Ali Faisal, s Tejal Joshi, s

Normal Distribution. Many variables are nearly normal, but none are exactly normal Not perfect, but still useful for a variety of problems.

Research Methods 1 Handouts, Graham Hole,COGS - version 1.0, September 2000: Page 1:

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.

Bayesian Prediction Tree Models

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

MBA 605 Business Analytics Don Conant, PhD. GETTING TO THE STANDARD NORMAL DISTRIBUTION

Quantitative genetics: traits controlled by alleles at many loci

Problem 1) Match the terms to their definitions. Every term is used exactly once. (In the real midterm, there are fewer terms).

Memorial Sloan-Kettering Cancer Center

Hormones and Signal Transduction. Dr. Kevin Ahern

Multi-Channel Film Dosimetry and Gamma Map Verification

Supplementary Figures

Chapter 1: Explaining Behavior

Business Statistics Probability

Six Sigma Glossary Lean 6 Society

On the purpose of testing:

T. R. Golub, D. K. Slonim & Others 1999

Membrane associated receptor transfers the information. Second messengers relay information

Theta sequences are essential for internally generated hippocampal firing fields.

Supplementary Materials for

Small-area estimation of mental illness prevalence for schools

Chapter 4 DESIGN OF EXPERIMENTS

Transcription:

Mining for Low-abundance Transcripts in Microarray Data Yi Lin 1, Samuel T. Nadler 2, Alan D. Attie 2, Brian S. Yandell 1,3 1 Statistics, 2 Biochemistry, 3 Horticulture, University of Wisconsin-Madison September 2 Basic Idea DNA microarrays present tremendous opportunities to understand complex processes Transcription factors and receptors expressed at low levels, missed by most other methods Analyze low expression genes with normal scores Robustly adapt to changing variability across average expression levels Basis for clustering and other exploratory methods Data-driven p-value sensitive to low abundance & changes in variability with gene expression DNA Microarrays 1-1, items per chip cdna, oligonucleotides, proteins dozens of technologies, changing rapidly expose live tissue from organism mrna cdna hybridize with chip read expression signal as intensity (fluorescence) design considerations compare conditions across or within chips worry about image capture of signal Low Abundance Genes background adjustment remove local geography comparing within and between chips negative values after adjustment low abundance genes virtually absent in one condition could be important genes: transcription factors, receptors large measurement variability early technology (bleeding edge) prevalence across genes on a chip -2% per chip 1-5% across multiple conditions Log Transformation? tremendous scale range in mean intensities 1-1 fold common concentrations of chemicals (ph) fold changes have intuitive appeal looks pretty good in practice want transformed data to be roughly normal easy to test if no difference across conditions looking for genes that are outliers beyond edge of bell shaped curve provide formal or informal thresholds Exploratory Methods Clustering methods (Eisen 1998, Golub 1999) Self-organizing maps (Tamayo 1999) search for genes with similar changes across conditions do not determine significance of changes in expression require extensive pre-filtering to eliminate low intensity modest fold changes may detect patterns unrelated to fold change comparison of discrimination methods (Dudoit 2) 1

Confirmatory Methods ratio-based decisions for 2 conditions (Chen 1997) constant variance of ratio on log scale, use normality anova (Kerr 2, Dudoit 2) handles multiple conditions in anova model constant variance on log scale, use normality Bayesian inference (Newton 2, Tsodikov 2) Gamma-Gamma model variance proportional to squared intensity error model (Roberts 2, Hughes 2) variance proportional to squared intensity transform to log scale, use normality. acquire data Q, B 1. adjust for background A=Q B 2. rank order genes R=rank(A)/(n+1) 3. normal scores N=qnorm(R) 4. contrast conditions Y=N 1 N 2 Y = contrast 7. standardize Z= Y center spread 6. center & spread X = mean 5. mean intensity X=mean(N) Normal Scores Procedure adjusted expression A = Q B rank order R = rank(a) / (n+1) normal scores N = qnorm( R ) average intensity X = (N 1 +N 2 )/2 difference Y = N 1 N 2 variance Var(Y X) σ 2 (X) standardization S = [Y µ(x)]/σ(x) Motivate Normal Scores natural transformation to normality is log(a) background intensity B = bd +ω B measured with error ω attenuation d may depend on condition gene measurement Q = [αexp(g+h+ε)+b]d+ω Q gene signal g degree of hybridization h: intrinsic noise ε (variance may depend on g) attenuation α (depends thickness of sample, etc.) subtract background: A = Q B adjusted measurements A = dαexp(g+h+ε)+δ symmetric measurement error δ=ω B ω Q Motivate Normal Scores (cont.) adjusted measurements: A = Q B = dαg+δ log expression level: log(g) = g+h+ε gene signal g confounded with hybridization h unless hybridization h independent of condition G is observed if no measurement error δ no dye or array effect (no attenuation dα) no background intensity natural under this model to consider N=log(G) normal scores almost as good Robust Center & Spread genes sorted based on X partitioned into many (about 4) slices containing roughly the same number of genes slices summarized by median and MAD MAD = median absolute deviation robust to outliers (e.g. changing genes) MAD ~ same distribution across X up to scale MAD i = σ i Z i, Z i ~ Z, i = 1,,4 log(mad i ) = log(σ i ) + log( Z i ) median ~ same idea 2

Robust Center & Spread MAD ~ same distribution across X up to scale log(mad i ) = log(σ i ) + log( Z i ), I = 1,,4 regress log(mad i ) on X i with smoothing splines smoothing parameter tuned automatically generalized cross validation (Wahba 199) globally rescale anti-log of smooth curve Var(Y X) σ 2 (X) can force σ 2 (X) to be decreasing similar idea for median E(Y X) µ(x) Motivation for Spread log expression level: log(g) = g+h+ε hybridization h negligible or same across conditions intrinsic noise ε may depend on gene signal g compare two conditions 1 and 2 Y = N 1 N 2 log(g 1 ) log(g 2 ) = g 1 g 2 + ε 1 -ε 2 no differential expression: g 1 =g 2 =g Var(Y g) = σ 2 (g) g X suggests condition on X instead, but: Var(Y X) not exactly σ 2 (X) cannot be determined without further assumptions Simulation of Spread Recovery 1, genes: log expression Normal(4,2) 5% altered genes: add Normal(,2) no measurement error, attenuation estimate robust spread Robust Spread Estimation Estimate of Spread Q-Q Plot spread.2.4.6.8 Sample Quantiles -5 5-4 -2 2 4 mean expression -4-2 2 4 Bonferroni-corrected p-values standardized normal scores S = [Y µ(x)]/σ(x) ~ Normal(,1)? genes with differential expression more dispersed Zidak version of Bonferroni correction p = 1 (1 p 1 ) n 13, genes with an overall level p =.5 each gene should be tested at level 1.95*1-6 differential expression if S > 4.62 differential expression if Y µ(x) > 4.62σ(X) too conservative? weight by X? Dudoit (2) Simulation Study simulations with two conditions 1, genes g 1,g 2 ~ Normal(4,2) for nonchanging genes 5% with differential expression g c ~ Normal(4,2) + Normal(3-rank(X)/(n+1),1/2) up- or down-regulated with probability 1/2 intrinsic noise ε ~ Normal(,.5) attenuation α = 1 measurement error variance =, 1, 2, 5, 1, 2 3

Comparison of Methods 5 differential expression for two conditions Newton (2) J Comp Biol Gamma-Gamma-Bernouli model Bayesian odds of differential expression Chen (1997) 3 4 1 2 5 1 2 2 constant ratio of expressions underlying log-normality 1 normal scores (Lin 2) some (unknown) transformation to normality robust, smooth estimate of spread & center Bonferroni-style p-values 1 2 3 4 5 Capturing Changed Genes: Noise.2 Fold Change 1. Fold Change.5 1. 2. 5. 5. Capturing Changed Genes: No Noise.1.2.5.2.1.5 2. 1. 5. No Noise: Average Intensity.5 E. coli with IPTG-b Comparison with E. coli Data Normal Scores IPTG-b Cy3/Cy5 ratio 5. 5..5.1 ~15 genes at low abundance including key genes Cy3/Cy5 ratio 1 e-3 1 e-1 1 e+3 Bayesian odds of differential expression IPTG-b known to affect only a few genes Raw IPTG-b 4,+ genes (whole genome) Newton (2) J Comp Biol.2 1. 5. 2. Noise 1: Average Intensity 5. Number of Changed Genes Captured Success in Capture 1 e-2 1 e+ Mean Intensity 1 e+2.5.2 1. 5. Mean Intensity 2. 4

Diabetes & Obesity Study 13,+ gene fragments (11,+ genes) oligonuleotides, Affymetrix gene chips mean(pm) - mean(nm) adjusted expression levels six conditions in 2x3 factorial lean vs. obese B6, F1, BTBR mouse genotype adipose tissue influence whole-body fuel partitioning might be aberrant in obese and/or diabetic subjects Nadler (2) PNAS Low Abundance Obesity Genes low mean expression on at least 1 of 6 conditions negative adjusted values ignored by clustering routines transcription factors I-κB modulates transcription - inflammatory processes RXR nuclear hormone receptor - forms heterodimers with several nuclear hormone receptors regulation proteins protein kinase A glycogen synthase kinase-3 roughly 1 genes 9 new since Nadler (2) PNAS Table 1. Genes Differentially Expressed with Obesity as Identified by the Normal Scores Procedure Accesion # Description Obese/Lean Lean/Obese Average p-value aa68251 Mus musculus Dp1l1 mrna for polyposis locus protein 1-like 1 (TB2 protein-like 94.1.535. 1) x93999 M.musculus Gal beta-1,3-galnac-specific GalNAc alpha-2,6-sialyltransferase.1 74.78.513. gene w45955 No significant homologies.1 8.52.356.1 U16297 Mus musculus cytochrome B561 (mcyt).1 95.95.56. X72862* M.musculus gene for beta-3-adrenergic receptor.1 14.54 1.575. X12521 Mouse mrna for transition protein 1 TP1.1 15.3.751. aa544897 No significant homologies.1 138.69.346. AA62269 Mus musculus protamine.2 41.58.877.9 AA144629 Mus musculus transition protein 1 (Tnp1).2 42.71.71.8 X63349 M.musculus tyrosinase-related protein-2.2 43.42 1.599. AA11671 Mus musculus serum response factor (Srf).2 57.2.847.1 AA15229 Mus musculus hepsin (Hpn).2 59.34.318.47 aa26714 No significant homologies.2 62.72 1.154. aa48365 Mus musculus mrna for SRp25 nuclear protein.2 65.53.38.1 j5663 Mouse vas deferens androgen related protein (MVDP).3 3.75.682.15 w49331 No significant homologies.3 31.1.849.93 J2935 Mouse camp-dependent protein kinase type II regulatory subunit.3 32.2.399.148 aa195 No significant homologies.3 34.86 1.56.15 m3181 Mouse 2,3 -cyclic-nucleotide 3 -phosphodiesterase.3 35.21.512.37 AA33381 No significant homologies.3 36.63.56.28 aa616759 No significant homologies.3 36.67.522.27 u2883 Mus musculus Balb/c mammary-derived growth inhibitor (MDGI).3 38.1 1.2.1 AA198324 No significant homologies.3 39.31.428.16 D14883 Mouse C33/R2/IA4.4 25.1 1.134.119 W36455 Mus musculus adipsin (Adn).4 25.63 12.89. m2751 Mus musculus protamine 2.4 27.47.99.143 X8796 M.musculus brevican 51.31.2.34.38 AA48974 No significant homologies 51.32.2.565.1 ET63455 Mus musculus serum amyloid A-4 protein (Saa4) 51.57.2.31.18 X6526 M.musculus mrna for GTP-binding protein 54.5.2.81.1 AA138292 Similar to Rat S6 kinase 54.25.2.419.1 ET6236 CALCIUM-DEPENDENT PHOSPHOLIPASE A2 PRECURSOR 56.31.2.417.1 w64688 No significant homologies 57.6.2.943. AA3587 No significant homologies 57.27.2 1.91. AA111277 Mus musculus hippocalcin-like 1 (Hpcal1) 61.46.2.348.8 w1922 No significant homologies 62.54.2.694. w48392 similar X7518 Id4 helix-loop-helix protein 66.48.2.573. AA168767 Sequence withdrawn 7.1.1.9.72 X99143 M.musculus hair keratin, mhb6 71.31.1.23.64 AA137962 similar to Human ras-related protein rab-14 81.78.1 1.76. W1926 Mus musculus mrna for ubiquitin like protein 82.62.1.762. AA13862 Similar to Human hqp256 84.63.1.268.13 J4179 Mouse chromatin nonhistone high mobility group protein (HGM-I(Y)) 88.29.1 1.138. X355 Mouse gene exon 2 for serum amyloid A (SAA) 3 protein 88.76.1.693. U957 Mus musculus p21 (Waf1) 92.52.1.724. AA2399 Mus musculus dutpase 122.37.1.221.2 u7746 Mus musculus anaphylatoxin C3a receptor 13.36.1 1.191. x66224 M.musculus retinoid X receptor-beta 197.77.1.256. AA154294 Mus musculus protein tyrosine phosphatase (Ptpn16) 233.62.36. W85163 Mus musculus migration inhibitory factor 272.49.168. Low Abundance Genes for Obesity Obese vs. Lean.5.2.5 2. 5..2.5.2.5 2. 5. 2. Microarray ANOVAs Kerr (2) gene by condition interaction N ijk = gene i + condition j + gene*condition ij + rep error ijk conditions organized in factorial design experimental units may be whole or part of array genes are random effects focus on outliers (BLUPs), not variance components gene*condition ij = differential expression allow variance to depend on gene i main effect replication to improve precision, catch gross errors Average Intensity for Obesity 5

Microarray Random Effects variance component for non-changing genes robust estimate of MS(G*C) using smoothed MAD rescale normal score response N by spread σ(x) look for differential expression or use clustering methods variance component for replication robust estimate of MSE using smoothed MAD look for outliers = gross errors Obesity Genotype Main Effects Obesity Dominance -5 5 1-1 -5 5 Obesity Additive Microarray QTLs condition may be genotype whole organism or pattern of genes genotype may be inferred rather than known exactly N ijk = gene i + QTL j + gene*qtl ij + individual ijk QTL genotype depends on flanking markers mixture model across possible QTL genotypes single vs. multiple QTL single QTL may influence numerous genes epistasis = inter-genic interaction modification of biochemical pathway(s) 6