Lecture #4: Overabundance Analysis and Class Discovery
Topics in Microarray Data Analysis, Winter
November 15, 2004
Lecturer: Doron Lipson
Scribes: Itai Sharon & Tomer Shiran

Contents
1 Differentially Expressed Genes
2 Overabundance
2.1 What is Overabundance?
2.2 Statistical Significance of Overabundance
2.2.1 False Detection Rate
2.2.2 Binomial Surprise Score
3 Class Discovery
3.1 What is Class Discovery?
3.2 Class Discovery Algorithms
3.3 Discovering Additional Classes

1 Differentially Expressed Genes

In the previous lecture we talked about the concept of differentially expressed genes. A gene is given a score based on its ability to differentiate two different groups of samples. For example, the expression of a gene in normal lung tissue might differ from its expression in tumor lung tissue.

The threshold number of misclassifications (TNoM) measures how successful we are in separating the two groups of samples by a simple threshold over the expression values. That is, we search for the threshold value of the gene's expression that best distinguishes the experimental conditions. A gene is scored by the number of misclassifications made by the best threshold that we can find for it. If the expression value of the gene allows us to perfectly separate the groups, the gene has a TNoM score of 0. On the other hand, if the two groups are interspersed, the gene has a score that may be close to the size of the smaller group of samples.

2 Overabundance

2.1 What is Overabundance?

When analyzing a data set we must ask: how surprising is the data set? We typically examine the number of genes at different P values (i.e., significance levels) and compare them with the number expected under the null hypothesis (the assumption that the separation of the samples is random). The difference between the expected and observed number of genes at each significant P value is an estimate of the overabundance of information in the analyzed data set. Let's take a look at the following example:

                                                 Scenario A        Scenario B
Number of genes                                  1000              1000
Number of samples                                [value missing]   [value missing]
Number of genes with TNoM ≤ 2 (P value = 0.03)   1                 100

In this example the P value is 0.03 and there are 1000 genes. The expected number of genes with TNoM of 2 or less is the P value multiplied by the number of genes in the data set (1000 * 0.03 = 30 in this example). In Scenario A we got one gene with TNoM of 2 or less. Under the null hypothesis we expected 30 such genes, so we cannot conclude that this gene has biological significance. In Scenario B we got 100 genes with TNoM of 2 or less, so we can conclude that approximately 100 - 30 = 70 of them might have biological significance.

When examining data sets with biologically meaningful classifications, we usually find an overabundance of significantly informative genes: the number of genes with small scores is much higher than expected. For example, in the Leukemia data set (Golub et al. 1999), there are 3 genes with TNoM score 3 (P value 7.8 * …) while the expected number is 5.5 * …. Moreover, there are 294 genes with TNoM score 15 or less, while the expected number is roughly 1. This is an overabundance of informative genes, meaning that the expression profiles carry information relevant to the biological classification.

An overabundance graph can be used to visualize the significance of an experiment. For different P values (or TNoM scores) it shows the difference between the expected number of genes and the actual number. Figure 1 illustrates the use of an overabundance graph.

[Figure 1: Overabundance graph. Breast cancer BRCA1/BRCA2 data, actual and expected TNoM scores. Horizontal axis: TNoM; vertical axis: number of genes; series: expected, actual.]
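The TNoM score from Section 1 and the expected-versus-observed comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`tnom`, `expected_count`, `overabundance`) and the boolean encoding of the two sample groups are choices made for this example, not part of the lecture.

```python
def tnom(values, labels):
    """Threshold number of misclassifications for one gene.

    values: expression values, one per sample.
    labels: booleans, True for one group of samples and False for the other
            (an encoding assumed for this sketch).
    Returns the fewest errors made by any single threshold, trying both
    orientations (either group may be the one placed below the threshold).
    """
    pairs = sorted(zip(values, labels))
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = len(pairs) - n_pos
    # A threshold outside the data range puts every sample on one side.
    best = min(n_pos, n_neg)
    pos_left = neg_left = 0
    for i, (v, y) in enumerate(pairs):
        if y:
            pos_left += 1
        else:
            neg_left += 1
        # Only cut between two distinct expression values.
        if i + 1 < len(pairs) and pairs[i + 1][0] == v:
            continue
        # Orientation 1: below threshold = False group, above = True group.
        err_a = pos_left + (n_neg - neg_left)
        # Orientation 2: below threshold = True group, above = False group.
        err_b = neg_left + (n_pos - pos_left)
        best = min(best, err_a, err_b)
    return best


def expected_count(n_genes, p_value):
    """Expected number of genes at this P value under the null hypothesis."""
    return n_genes * p_value


def overabundance(observed, n_genes, p_value):
    """Observed minus expected number of genes at this P value."""
    return observed - expected_count(n_genes, p_value)
```

With the numbers from the two scenarios, `overabundance(1, 1000, 0.03)` is about -29 (no overabundance) and `overabundance(100, 1000, 0.03)` is about 70.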
2.2 Statistical Significance of Overabundance

Our next step is to quantify the statistical significance of overabundance. This quantification is important in two types of situations:

1) Consider a biologically meaningful classification (e.g., two subtypes of cancer, as in the case of the Leukemia data set). We want to ascertain whether gene expression patterns reflect that classification. The Leukemia data set shows that this is the case without doubt. In other classifications, when there are fewer tissue samples or a more subtle signal, the situation might not be obvious. Using standard methods (e.g., Bonferroni bounds), we can determine whether a single gene is significant for the classification. Our aim, however, is to take into account the global pattern, that is, the behavior of all the genes. An overabundance of informative genes is an indication of statistical significance, even if no single gene is Bonferroni significant.

2) Consider a putative classification, as in Bittner et al. ("Molecular classification of cutaneous malignant melanoma by gene expression profiling", 2000), that might correspond to a real biological distinction. Clearly, the ultimate test for a putative classification is a biological validation test (as described therein). However, statistics is a tool for evaluating classifications before planning further experiments. Thus, we want to develop statistical scores that measure the significance of suggested partitions.

2.2.1 False Detection Rate

For each TNoM score s we define the False Detection Rate (FDR):

$$\mathrm{FDR}(s) = \frac{\text{expected number of genes with TNoM} \le s}{\text{actual number of genes with TNoM} \le s}$$

The FDR function can be used to select a subset of differentially expressed genes with a low expected number of false positives. We typically select a threshold (e.g., a TNoM score) that minimizes the FDR.
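The FDR ratio defined above is easy to compute once the expected and observed gene counts per threshold are known. The following Python sketch is illustrative only: the function names and the convention of returning infinity when no genes are observed are assumptions of this example, and the numbers in the usage note are made up.

```python
def fdr(expected, observed):
    """Expected / observed number of genes at or below a threshold.

    Returns infinity when no gene is observed (a convention assumed here so
    that such thresholds are never selected).
    """
    if observed <= 0:
        return float("inf")
    return expected / observed


def fdr_by_threshold(n_genes, p_values, observed_counts):
    """FDR for each threshold s.

    p_values[s]: P value matching threshold s (so n_genes * p_values[s] is
                 the expected count under the null hypothesis).
    observed_counts[s]: observed number of genes with TNoM <= s.
    """
    return [fdr(n_genes * p, k) for p, k in zip(p_values, observed_counts)]


def threshold_minimizing_fdr(n_genes, p_values, observed_counts):
    """Index of the threshold with the smallest FDR (first one on ties)."""
    rates = fdr_by_threshold(n_genes, p_values, observed_counts)
    return min(range(len(rates)), key=rates.__getitem__)
```

For instance, with 1000 genes, hypothetical P values `[0.001, 0.01, 0.03]` and hypothetical observed counts `[5, 40, 100]`, the expected counts are 1, 10 and 30, the FDR values are roughly 0.2, 0.25 and 0.3, and the first threshold minimizes the FDR.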
Another option is to select a threshold that achieves a given FDR (e.g., 5%).

2.2.2 Binomial Surprise Score

An alternative to FDR is the Binomial Surprise Score. The basic idea behind this approach is to find the score at which we are most surprised by the number of observed genes.

Let X(s) be the number of genes with TNoM ≤ s, for a given threshold s, that we expect to observe for uniformly and independently drawn labeling vectors. Let p_s be the matching P value and let n be the total number of genes in the data set. Assuming the n vectors are drawn independently, and that a vector with TNoM score ≤ s is drawn with probability p_s, we conclude that:

$$X(s) \sim \mathrm{Binom}(n, p_s)$$
Let n(s) be the observed number of genes in the data set with TNoM ≤ s. The surprise score is defined as $-\log \sigma(s)$, where:

$$\sigma(s) = \mathrm{Prob}\big(X(s) \ge n(s)\big) = \sum_{i=n(s)}^{n} \binom{n}{i} \, p_s^{\,i} \, (1 - p_s)^{\,n-i}$$

We are, of course, interested in the threshold s at which the surprise score is maximal. The maximum surprise score is defined as $\max_s \big[-\log \sigma(s)\big]$.

In the general case we would expect the Binomial Surprise Score to be:

- 0 for the highest TNoM score s_max, since $p_{s_{\max}} = 1 \Rightarrow \sigma(s_{\max}) = 1 \Rightarrow -\log \sigma(s_{\max}) = 0$.
- Low for TNoM score s = 0, because there are usually very few genes, if any, with a perfect score; when there are none, $n(0) = 0 \Rightarrow \sigma(0) = \mathrm{Prob}(X(0) \ge 0) = 1 \Rightarrow -\log \sigma(0) = 0$.
- Strictly positive for scores between the two extremes. There is usually a single maximum, attained between the two extremes.

Figure 2 presents the Binomial Surprise Score for a data set of 30 samples from normal and tumor lung tissues (taken from Naftali Kaminski's lab, Sheba Medical Center). As can be seen, $\arg\max_s \big[-\log \sigma(s)\big] = 6$.

[Figure 2: Lung cancer data. Two panels: (a) actual and expected distributions of TNoM scores (horizontal axis: TNoM score; vertical axis: number of genes); (b) -log(binomial surprise) as a function of TNoM score.]

One obvious flaw in this approach is its independence assumption: genes are not really independent. A biological phenomenon is almost always associated with many genes.
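The binomial tail probability σ(s) and the resulting surprise score can be computed directly from the definition. The sketch below is one possible way to do it in plain Python, working in log space so that large gene counts do not overflow. The function names, the use of the natural logarithm, and clamping tiny negative rounding errors to zero are assumptions of this example, not something the lecture specifies.

```python
import math


def log_binom_pmf(i, n, p):
    """log of Prob(X = i) for X ~ Binom(n, p), with the edge cases p = 0, 1."""
    if p <= 0.0:
        return 0.0 if i == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if i == n else -math.inf
    log_choose = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
    return log_choose + i * math.log(p) + (n - i) * math.log1p(-p)


def surprise(n_obs, n, p):
    """-log sigma, where sigma = Prob(X >= n_obs) and X ~ Binom(n, p).

    Uses the natural logarithm (an assumption; the base is not fixed above).
    """
    if n_obs <= 0:
        return 0.0  # Prob(X >= 0) = 1, so there is no surprise
    logs = [log_binom_pmf(i, n, p) for i in range(n_obs, n + 1)]
    top = max(logs)
    if top == -math.inf:
        return math.inf  # the observation is impossible under the null
    # log-sum-exp: log of the sum of the tail probabilities
    log_sigma = top + math.log(sum(math.exp(v - top) for v in logs))
    return max(0.0, -log_sigma)  # clamp rounding noise around zero


def max_surprise(observed_counts, p_values, n):
    """Return (s, score) for the threshold s with the largest surprise score.

    observed_counts[s] plays the role of n(s), p_values[s] the role of p_s.
    """
    scores = [surprise(k, n, p) for k, p in zip(observed_counts, p_values)]
    best_s = max(range(len(scores)), key=scores.__getitem__)
    return best_s, scores[best_s]
```

Applied to the earlier two-scenario example (1000 genes, P value 0.03), observing a single gene gives a surprise score of essentially zero, while observing 100 genes gives a large positive score.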
3 Class Discovery

3.1 What is Class Discovery?

In many experimental designs it is useful to find tissue classifications in gene expression data. Such classifications might be due to biological phenomena (e.g., disease subtypes), or due to mechanical or protocol noise. Identifying classifications can lead to biological discovery or can uncover experimental or data handling errors.

Biologically meaningful classifications are often characterized by an overabundance of informative genes. This overabundance might be due to a small set of genes that are highly informative about the classification, or due to a larger set of genes, none of which is as surprising individually but which are surprising as a collection. This suggests the following approach: consider candidate partitions of the samples into two groups, evaluate these partitions, and measure to what degree each has an overabundance of informative genes. The partitions that display high overabundance are proposed as putative classifications. To carry out this intuition we need to choose a score for overabundance and then search for high-scoring partitions.

3.2 Class Discovery Algorithms

One approach to class discovery combines the maximum binomial surprise score with local search techniques. The surprise metric assigns a score to each partition, and a local search technique is used to seek partitions with a statistically significant overabundance of informative genes. The following local search algorithms are commonly used (the size of the search space is 2^m for bipartitions, or k^m for partitions into k groups, where m is the number of samples):

1) Steepest ascent: Move to the next candidate partition if and only if s_new > s_current.

2) Simulated annealing: Move to the next candidate partition with probability min(1, exp((s_new - s_current) / T)), where T is the temperature parameter. Simulated annealing allows occasional moves that worsen the current solution (called "uphill" moves in the usual minimization setting). The advantage of simulated annealing is that it can escape local maxima, unlike the steepest ascent algorithm.

3) Genetic algorithms: There are various genetic heuristics, which are beyond the scope of this course.

If we are interested in partitioning the samples into two groups, we can represent the assignment as a binary vector of length m (the number of samples), in which each bit indicates the group of a specific sample. A simple successor function for the local search algorithms is to flip a single bit in this vector. Figure 3 illustrates the graph induced by this representation.
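The local search over bit vectors described above can be sketched as follows. This is a minimal illustration, not the lecturer's implementation: the scoring function is passed in as a parameter (in practice it would be the maximum surprise score of the partition), the steepest-ascent variant shown moves to the best single-bit-flip neighbor, and the geometric cooling schedule, default parameters and function names are all assumptions of this example.

```python
import math
import random


def neighbors(partition):
    """Yield every partition that differs from `partition` in exactly one bit."""
    for i in range(len(partition)):
        yield partition[:i] + (1 - partition[i],) + partition[i + 1:]


def steepest_ascent(start, score):
    """Repeatedly move to the best neighbor while it strictly improves the score."""
    current = tuple(start)
    s_current = score(current)
    while True:
        best = max(neighbors(current), key=score)
        s_best = score(best)
        if s_best <= s_current:
            return current, s_current  # local maximum reached
        current, s_current = best, s_best


def acceptance_probability(s_new, s_current, temperature):
    """min(1, exp((s_new - s_current) / T)) for a maximization problem."""
    if s_new >= s_current:
        return 1.0  # improving moves are always accepted
    return math.exp((s_new - s_current) / temperature)


def simulated_annealing(start, score, temperature=1.0, cooling=0.95,
                        steps=200, seed=0):
    """Random single-bit flips, accepted with the probability above.

    The geometric cooling schedule (T <- cooling * T) is an assumption of
    this sketch. Returns the best partition seen and its score.
    """
    rng = random.Random(seed)
    current = tuple(start)
    s_current = score(current)
    best, s_best = current, s_current
    for _ in range(steps):
        i = rng.randrange(len(current))
        candidate = current[:i] + (1 - current[i],) + current[i + 1:]
        s_new = score(candidate)
        if rng.random() < acceptance_probability(s_new, s_current, temperature):
            current, s_current = candidate, s_new
            if s_current > s_best:
                best, s_best = current, s_current
        temperature *= cooling
    return best, s_best
```

Note that a bit vector and its complement describe the same bipartition, so a real scoring function would assign them the same score; the search space is then effectively half as large.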
[Figure 3: Graph of candidate partitions connected by single-bit flips, each node labeled with its score (e.g., 88.2, 73.2).]

3.3 Discovering Additional Classes

We can use a technique called peeling in order to fine-tune the classification:

1) Discover a significant partition via one of the preceding algorithms.

2) Remove all the genes that support the discovered partition (the maximum surprise threshold can be used to decide which genes these are).

3) Repeat the previous steps with the remaining genes.

The peeling technique is effective because different sets of genes typically induce different partitions. Therefore, finding a fine-tuned set of classes is difficult (or impossible) when the expression values of all the genes in the data set are considered together. Sometimes specific sets of genes are already known, so standard clustering can be used to partition the samples. Usually, however, there is no such prior knowledge, so peeling can be an effective way of discovering classes based solely on the input data set.
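The peeling loop can be written as a small driver around two steps that are left abstract here. In this sketch, `find_partition` and `supporting_genes` are hypothetical callables standing in for the partition search and for the rule that decides which genes support a partition; the names, the `max_rounds` cap and the stopping conditions are assumptions made for illustration.

```python
def peel(genes, find_partition, supporting_genes, max_rounds=5):
    """Repeatedly discover a partition and remove the genes that support it.

    genes: iterable of gene identifiers.
    find_partition(remaining): hypothetical callable returning a partition
        found on the remaining genes, or None if nothing significant is found.
    supporting_genes(partition, remaining): hypothetical callable returning
        the genes among `remaining` that support the partition.
    Returns a list of (partition, supporting gene set) pairs, in discovery order.
    """
    remaining = set(genes)
    found = []
    for _ in range(max_rounds):
        if not remaining:
            break
        partition = find_partition(remaining)
        if partition is None:
            break
        support = set(supporting_genes(partition, remaining))
        if not support:
            break  # nothing to remove, so another round would repeat itself
        found.append((partition, support))
        remaining.difference_update(support)
    return found
```

Each round works only on the genes left over from the previous rounds, so later partitions are driven by gene sets that the earlier partitions did not explain.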
More informationEfficacy of the Extended Principal Orthogonal Decomposition Method on DNA Microarray Data in Cancer Detection
202 4th International onference on Bioinformatics and Biomedical Technology IPBEE vol.29 (202) (202) IASIT Press, Singapore Efficacy of the Extended Principal Orthogonal Decomposition on DA Microarray
More informationAutomatic Classification of Breast Masses for Diagnosis of Breast Cancer in Digital Mammograms using Neural Network
IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 11 May 2015 ISSN (online): 2349-784X Automatic Classification of Breast Masses for Diagnosis of Breast Cancer in Digital
More informationExploration and Exploitation in Reinforcement Learning
Exploration and Exploitation in Reinforcement Learning Melanie Coggan Research supervised by Prof. Doina Precup CRA-W DMP Project at McGill University (2004) 1/18 Introduction A common problem in reinforcement
More informationIntroduction. We can make a prediction about Y i based on X i by setting a threshold value T, and predicting Y i = 1 when X i > T.
Diagnostic Tests 1 Introduction Suppose we have a quantitative measurement X i on experimental or observed units i = 1,..., n, and a characteristic Y i = 0 or Y i = 1 (e.g. case/control status). The measurement
More information15.301/310, Managerial Psychology Prof. Dan Ariely Recitation 8: T test and ANOVA
15.301/310, Managerial Psychology Prof. Dan Ariely Recitation 8: T test and ANOVA Statistics does all kinds of stuff to describe data Talk about baseball, other useful stuff We can calculate the probability.
More informationLearning with Rare Cases and Small Disjuncts
Appears in Proceedings of the 12 th International Conference on Machine Learning, Morgan Kaufmann, 1995, 558-565. Learning with Rare Cases and Small Disjuncts Gary M. Weiss Rutgers University/AT&T Bell
More informationBinary Diagnostic Tests Paired Samples
Chapter 536 Binary Diagnostic Tests Paired Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary measures
More informationStepwise method Modern Model Selection Methods Quantile-Quantile plot and tests for normality
Week 9 Hour 3 Stepwise method Modern Model Selection Methods Quantile-Quantile plot and tests for normality Stat 302 Notes. Week 9, Hour 3, Page 1 / 39 Stepwise Now that we've introduced interactions,
More informationA Learning Method of Directly Optimizing Classifier Performance at Local Operating Range
A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range Lae-Jeong Park and Jung-Ho Moon Department of Electrical Engineering, Kangnung National University Kangnung, Gangwon-Do,
More informationBivariate variable selection for classification problem
Bivariate variable selection for classification problem Vivian W. Ng Leo Breiman Abstract In recent years, large amount of attention has been placed on variable or feature selection in various domains.
More informationSheila Barron Statistics Outreach Center 2/8/2011
Sheila Barron Statistics Outreach Center 2/8/2011 What is Power? When conducting a research study using a statistical hypothesis test, power is the probability of getting statistical significance when
More informationMarriage Matching with Correlated Preferences
Marriage Matching with Correlated Preferences Onur B. Celik Department of Economics University of Connecticut and Vicki Knoblauch Department of Economics University of Connecticut Abstract Authors of experimental,
More informationCognitive Modeling. Lecture 12: Bayesian Inference. Sharon Goldwater. School of Informatics University of Edinburgh
Cognitive Modeling Lecture 12: Bayesian Inference Sharon Goldwater School of Informatics University of Edinburgh sgwater@inf.ed.ac.uk February 18, 20 Sharon Goldwater Cognitive Modeling 1 1 Prediction
More informationConceptual and Empirical Arguments for Including or Excluding Ego from Structural Analyses of Personal Networks
CONNECTIONS 26(2): 82-88 2005 INSNA http://www.insna.org/connections-web/volume26-2/8.mccartywutich.pdf Conceptual and Empirical Arguments for Including or Excluding Ego from Structural Analyses of Personal
More informationHandout 16: Opinion Polls, Sampling, and Margin of Error
Opinion polls involve conducting a survey to gauge public opinion on a particular issue (or issues). In this handout, we will discuss some ideas that should be considered both when conducting a poll and
More informationEdinburgh Imaging Academy online distance learning courses. Functional Imaging
Functional Imaging Semester 2 / Commences January 10 Credits Each Course is composed of Modules & Activities. Modules: BOLD Signal IMSc NI4R Experimental Design IMSc NI4R Pre-processing IMSc NI4R GLM IMSc
More informationEvolutionary Programming
Evolutionary Programming Searching Problem Spaces William Power April 24, 2016 1 Evolutionary Programming Can we solve problems by mi:micing the evolutionary process? Evolutionary programming is a methodology
More informationBayesian Joint Modelling of Benefit and Risk in Drug Development
Bayesian Joint Modelling of Benefit and Risk in Drug Development EFSPI/PSDM Safety Statistics Meeting Leiden 2017 Disclosure is an employee and shareholder of GSK Data presented is based on human research
More information19th AWCBR (Australian Winter Conference on Brain Research), 2001, Queenstown, AU
19th AWCBR (Australian Winter Conference on Brain Research), 21, Queenstown, AU https://www.otago.ac.nz/awcbr/proceedings/otago394614.pdf Do local modification rules allow efficient learning about distributed
More informationMathematical Modeling of Infectious Disease
Mathematical Modeling of Infectious Disease DAIDD 2013 Travis C. Porco FI Proctor Foundation for Research in Ophthalmology UCSF Scope and role of modeling In the most general sense, we may consider modeling
More informationSTATISTICS & PROBABILITY
STATISTICS & PROBABILITY LAWRENCE HIGH SCHOOL STATISTICS & PROBABILITY CURRICULUM MAP 2015-2016 Quarter 1 Unit 1 Collecting Data and Drawing Conclusions Unit 2 Summarizing Data Quarter 2 Unit 3 Randomness
More informationMeasuring Focused Attention Using Fixation Inner-Density
Measuring Focused Attention Using Fixation Inner-Density Wen Liu, Mina Shojaeizadeh, Soussan Djamasbi, Andrew C. Trapp User Experience & Decision Making Research Laboratory, Worcester Polytechnic Institute
More information