Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics
1 Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics Karsten Borgwardt February 21 to March 4, 2011 Machine Learning & Computational Biology Research Group MPIs Tübingen Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
2 Clustering in bioinformatics: Microarrays
Clustering is a widely used tool in microarray analysis. Class discovery is an important problem in microarray studies for two reasons: either the classes are completely unknown beforehand, or it is unknown whether a known class contains interesting subclasses.
3 Clustering in bioinformatics: Examples
Classes unknown:
- Does a disease affect gene expression in a particular tissue?
- Does gene expression differ between two groups in a particular condition?
Subclasses unknown:
- Are there subtypes of a disease?
- Is there even a hierarchy of subclasses within one disease?
4 Clustering in bioinformatics: Popularity
- Clustering tools are available in the large microarray database NCBI Gene Expression Omnibus (GEO).
- PubMed hits for "microarray clustering".
- Recent editorial in OUP Bioinformatics.
5 Distance metrics
Euclidean distance of gene x and y over n samples, or of sample x and y over n genes:
$$d_{xy} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$
Pearson's correlation of gene x and y over n samples, or of sample x and y over n genes, where $\bar{x}$ is the mean of x and $\bar{y}$ is the mean of y:
$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (2)$$
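As a minimal sketch of equations (1) and (2), the two measures can be computed with NumPy as follows. The function names and the two example profiles are hypothetical and serve only as illustration.

```python
import numpy as np

def euclidean_distance(x, y):
    """Equation (1): square root of the summed squared differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def pearson_correlation(x, y):
    """Equation (2): centered cross-product over the product of centered norms."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Two hypothetical expression profiles over n = 4 samples.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [2.0, 4.0, 6.0, 8.0]

d = euclidean_distance(gene_x, gene_y)   # sensitive to the difference in scale
r = pearson_correlation(gene_x, gene_y)  # insensitive to scale and offset
```

The example illustrates why the choice matters for expression data: the two profiles have the same shape, so their correlation is maximal, yet their Euclidean distance is far from zero.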
6 Distance metrics
Un-centered correlation coefficient of gene x and y over n samples, or of sample x and y over n genes:
$$r^{u}_{xy} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (3)$$
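The difference from Pearson's correlation is that the means are not subtracted, so a constant offset between two profiles lowers the un-centered coefficient. A small sketch under that reading of equation (3), with hypothetical names and data:

```python
import numpy as np

def uncentered_correlation(x, y):
    """Equation (3): cosine of the angle between x and y (no mean subtraction)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))

profile = np.array([1.0, 2.0, 3.0])
shifted = profile + 10.0   # same shape, constant offset
scaled = 2.0 * profile     # same shape, constant factor

r_u_shifted = uncentered_correlation(profile, shifted)   # below 1: offset matters
r_u_scaled = uncentered_correlation(profile, scaled)     # scaling does not matter
r_pearson = float(np.corrcoef(profile, shifted)[0, 1])   # Pearson ignores the offset
```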
7 Clustering algorithms
Hierarchical clustering:
- Single linkage: The linking distance is the minimum distance between two clusters.
- Complete linkage: The linking distance is the maximum distance between two clusters.
- Average linkage (UPGMA): The linking distance is the average of all pairwise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
Flat clustering:
- k-means (k from 2 to 15, 3 runs)
- k-median (k-medoid)
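The three linkage rules differ only in how they summarize the pairwise distances between the members of two clusters. The sketch below assumes Euclidean distance between members; the function name and the toy clusters are hypothetical.

```python
import numpy as np

def linkage_distances(cluster_a, cluster_b):
    """Return single, complete and average linkage distances between two clusters."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    pairwise = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))
    return {
        "single": float(pairwise.min()),     # closest pair of members
        "complete": float(pairwise.max()),   # farthest pair of members
        "average": float(pairwise.mean()),   # UPGMA: mean over all pairs
    }

cluster_a = [[0.0, 0.0], [0.0, 1.0]]
cluster_b = [[3.0, 0.0], [4.0, 0.0]]
links = linkage_distances(cluster_a, cluster_b)
```

An agglomerative procedure would repeatedly merge the two clusters with the smallest linking distance; only the summary used in that step changes between the three variants.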
8 The two-sample problem: Interpretation of clusters
Clustering introduces structure into microarray datasets. But do these classes have a statistical or biomedical meaning?
- Biomedical meaning has to be established in experiments.
- Statistical meaning can be measured using a statistical test, the so-called two-sample test.
A two-sample test decides whether or not two samples were drawn from the same probability distribution.
9 The two-sample problem: Data diversity
Molecular biology produces a wealth of information. The problem is that these data are generated on different platforms and by different protocols under different levels of noise. Hence data from different labs show
- different scales,
- different ranges,
- different distributions.
Main problem: Joint data analysis may detect differences in distributions, not biological phenomena!
10 The two-sample problem
The two-sample problem: Given two samples X and Y, were they generated by the same distribution?
Previous approaches: Two-sample tests exist for univariate and multivariate data.
11 The two-sample problem: t-test
- A test of the null hypothesis that the means of two normally distributed populations are equal
- unpaired/independent (versus paired)
For equal sample sizes and equal variances, the t statistic to test whether the means are different can be calculated as follows:
$$t = \frac{\bar{x} - \bar{y}}{\sigma_{xy}\sqrt{\frac{2}{n}}} \qquad (4)$$
where
$$\sigma_{xy} = \sqrt{\frac{\sigma_x^2 + \sigma_y^2}{2}}.$$
The degrees of freedom for this test are 2n - 2, where n is the size of each sample.
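Equation (4) can be evaluated directly. The sketch below assumes that the variances are estimated with the unbiased sample variance (divisor n - 1), which the slide does not state; the function name and the data are hypothetical.

```python
import numpy as np

def equal_n_t_statistic(x, y):
    """Equation (4): unpaired t statistic for equal sample sizes and equal variances."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if x.size != y.size:
        raise ValueError("this form of the test assumes equal sample sizes")
    n = x.size
    # Assumption: unbiased sample variances (ddof=1).
    sigma_xy = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2.0)
    t = (x.mean() - y.mean()) / (sigma_xy * np.sqrt(2.0 / n))
    degrees_of_freedom = 2 * n - 2
    return float(t), degrees_of_freedom

t, dof = equal_n_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

The resulting t would then be compared against a t distribution with `dof` degrees of freedom to obtain a p-value.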
12 The two-sample problem: New challenges in bioinformatics
- high-dimensional
- structured (strings and graphs)
- low sample size
Novel distribution test: Maximum Mean Discrepancy (MMD)
13 MMD key idea
14 MMD key idea
Key idea: Avoid density estimators; use means in feature spaces.
Maximum Mean Discrepancy (Fortet and Mourier, 1953):
$$D(p, q, F) := \sup_{f \in F} \left( E_p[f(x)] - E_q[f(y)] \right)$$
Theorem: D(p, q, F) = 0 iff p = q, when $F = C^0(X)$. This follows directly, e.g., from Dudley.
Theorem: D(p, q, F) = 0 iff p = q, when $F = \{f : \|f\|_H \le 1\}$, provided that H is a universal RKHS (follows via Steinwart, 2001; Smola et al., 2006).
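The second theorem is what makes the discrepancy computable with kernels. As a sketch of the standard argument: the mean embeddings $\mu_p$ and $\mu_q$ are notation introduced here and do not appear on the slide, and k denotes the reproducing kernel of H.

```latex
\begin{align*}
\mu_p &:= E_p[k(x, \cdot)], \qquad
E_p[f(x)] = \langle f, \mu_p \rangle_H
  && \text{(reproducing property)} \\
D(p, q, F) &= \sup_{\|f\|_H \le 1} \langle f, \mu_p - \mu_q \rangle_H
  = \|\mu_p - \mu_q\|_H
  && \text{(Cauchy-Schwarz)} \\
D^2(p, q, F) &= \langle \mu_p - \mu_q, \mu_p - \mu_q \rangle_H
  = E_{p,p}\,k(x, x') - 2\,E_{p,q}\,k(x, y) + E_{q,q}\,k(y, y')
\end{align*}
```

The supremum over an infinite function class thus collapses to three kernel expectations, which can be estimated from samples.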
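15 MMD statistic
Goal: Estimate the squared discrepancy, which for the unit ball F of an RKHS with kernel k can be written in terms of kernel expectations:
$$D^2(p, q, F) = E_{p,p}\,k(x, x') - 2\,E_{p,q}\,k(x, y) + E_{q,q}\,k(y, y')$$
U-statistic: empirical estimate from samples X = {x_1, ..., x_m} and Y = {y_1, ..., y_m}:
$$D^2(X, Y, F) = \frac{1}{m(m-1)} \sum_{i \neq j} \left[ k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j) \right]$$
Theorem: $D^2(X, Y, F)$ is an unbiased estimator of $D^2(p, q, F)$.
Test: Estimate $\sigma^2$ from the data. Reject the null hypothesis that p = q if $D^2(X, Y, F)$ exceeds the acceptance threshold.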
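The U-statistic on this slide can be sketched in a few lines of NumPy. The Gaussian kernel, its bandwidth, and the synthetic data are assumptions made for illustration; the slide leaves the kernel open, and the acceptance threshold is not implemented here.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix (an assumed kernel choice) between rows of a and b."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """U-statistic estimate of the squared MMD for two samples of equal size m."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    m = x.shape[0]
    if y.shape[0] != m or m < 2:
        raise ValueError("both samples need the same size m >= 2")
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    # h[i, j] = k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j)
    h = k_xx + k_yy - k_xy - k_xy.T
    np.fill_diagonal(h, 0.0)  # the sum runs over i != j only
    return float(h.sum() / (m * (m - 1)))

rng = np.random.default_rng(0)
sample_a = rng.normal(size=(50, 5))
sample_b = rng.normal(size=(50, 5))            # same distribution as sample_a
sample_c = rng.normal(loc=3.0, size=(50, 5))   # shifted distribution

mmd_same = mmd2_unbiased(sample_a, sample_b, bandwidth=2.0)
mmd_diff = mmd2_unbiased(sample_a, sample_c, bandwidth=2.0)
```

Because the estimator is unbiased, it can be slightly negative when both samples come from the same distribution. Evaluating it takes time quadratic in m, since every pair of points enters the sum.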
16 Attractive for bioinformatics
MMD: a two-sample test in terms of kernels
Computationally attractive:
- search an infinite space of functions by evaluating one expression
- no optimization problem has to be solved
All thanks to kernels!
17 Attractive for bioinformatics
Wide applicability: for one- and higher-dimensional vectorial data, but also for structured data!
Two-sample problems can now be tackled on
- strings: protein and DNA sequences
- graphs: molecules, protein interaction networks
- time series: time series of microarray data
- and sets, trees, ...
18 Cross-platform comparability
Data: microarray data from two breast cancer studies
- one on a cDNA platform (Gruvberger et al., 2001)
- the other on an oligonucleotide microarray platform (West et al., 2001)
Task: Can MMD help to find out whether two sets of observations were generated by
- the same study (both from Gruvberger or both from West)?
- different studies (one from Gruvberger, one from West)?
19 Cross-platform comparability: Experiment
- sample size of each sample: 25
- dimension of each data point: 2,116
- significance level α (value not preserved in this transcript)
- repeated trials with one sample from Gruvberger and one from West (number of repetitions not preserved in this transcript)
- 100 times: both samples from Gruvberger or both from West
- report the percentage of correct decisions
- compare to the t-test and to the Friedman-Rafsky Wald-Wolfowitz and Smirnov tests
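The protocol above can be sketched generically. Everything in the block is hypothetical: `reject_null` stands for any two-sample test that returns True when it rejects p = q, the stand-in test is a crude threshold on the distance between sample means (not MMD and not one of the tests compared in the lecture), and the two pools are synthetic rather than the breast cancer data.

```python
import numpy as np

def fraction_correct(reject_null, pool_a, pool_b, sample_size, trials, rng):
    """Return the fractions of correct decisions for different-pool and same-pool trials."""
    correct_diff = 0
    correct_same = 0
    for _ in range(trials):
        # Different-study trial: one sample per pool; rejecting the null is correct.
        x = pool_a[rng.choice(len(pool_a), sample_size, replace=False)]
        y = pool_b[rng.choice(len(pool_b), sample_size, replace=False)]
        correct_diff += bool(reject_null(x, y))
        # Same-study trial: two disjoint samples from one pool; accepting is correct.
        pool = pool_a if rng.random() < 0.5 else pool_b
        idx = rng.permutation(len(pool))[: 2 * sample_size]
        x, y = pool[idx[:sample_size]], pool[idx[sample_size:]]
        correct_same += not reject_null(x, y)
    return correct_diff / trials, correct_same / trials

def mean_gap_test(x, y, threshold=3.0):
    """Stand-in test: reject when the sample means are far apart (arbitrary threshold)."""
    return np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)) > threshold

rng = np.random.default_rng(1)
pool_a = rng.normal(loc=0.0, size=(60, 10))   # synthetic "study A"
pool_b = rng.normal(loc=2.0, size=(60, 10))   # synthetic "study B"
acc_diff, acc_same = fraction_correct(mean_gap_test, pool_a, pool_b, 25, 100, rng)
```

Reporting the two fractions separately keeps apart the two kinds of error: failing to detect different studies, and wrongly separating samples from the same study.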
20 Cross-platform comparability
21 Kernel-based statistical test
Novel statistical test for the two-sample problem:
- easy to implement
- non-parametric
- first for structured data
- best on high-dimensional data
- quadratic runtime w.r.t. the number of data points
- impressive accuracy in our experiments
Kernel method for the two-sample problem:
- all kernels recently defined in molecular biology can be re-used for data integration
- applicable to vectors, strings, sets, trees, graphs and time series
22 Biclustering
Clustering in two dimensions
- Alternative names: co-clustering, two-mode clustering
- A bicluster is a subset of genes that show similar activity patterns under a subset of conditions.
- Cluster patients and conditions
- Earliest work by Hartigan, 1972: Divide a matrix into submatrices with minimum variance.
- Most interesting cases are NP-complete.
- Many extensions in bioinformatics (e.g. Cheng and Church, 2002)
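To make the minimum-variance idea concrete, the sketch below scores a candidate bicluster by the sum of squared deviations from its own mean. This is a simplified criterion in the spirit of the minimum-variance formulation, not a reproduction of any published algorithm; the function name and the small matrix are hypothetical.

```python
import numpy as np

def block_sse(matrix, rows, cols):
    """Sum of squared deviations from the mean within the submatrix rows x cols."""
    block = np.asarray(matrix, dtype=float)[np.ix_(rows, cols)]
    return float(((block - block.mean()) ** 2).sum())

# Hypothetical expression matrix: 3 genes (rows) under 4 conditions (columns).
expression = np.array([
    [5.0, 5.0, 1.0, 9.0],
    [5.0, 5.0, 7.0, 2.0],
    [0.0, 8.0, 3.0, 6.0],
])

coherent = block_sse(expression, rows=[0, 1], cols=[0, 1])  # constant block
mixed = block_sse(expression, rows=[0, 2], cols=[1, 3])     # heterogeneous block
```

A biclustering method searches over row and column subsets for blocks with a low score. The number of candidate subsets grows exponentially with the matrix dimensions, which gives a feel for why exhaustive search is impractical.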
23 References and further reading
[1] Gretton, Borgwardt, Rasch, Schölkopf, Smola: A kernel method for the two-sample problem. NIPS 2006.
24 The end
See you tomorrow! Next topic: Feature Selection in Bioinformatics
More informationComputational Approach for Deriving Cancer Progression Roadmaps from Static Sample Data
Computational Approach for Deriving Cancer Progression Roadmaps from Static Sample Data Yijun Sun,2,3,5,, Jin Yao, Le Yang 2, Runpu Chen 2, Norma J. Nowak 4, Steve Goodison 6, Department of Microbiology
More informationClassification with microarray data
Classification with microarray data Aron Charles Eklund eklund@cbs.dtu.dk DNA Microarray Analysis - #27612 January 8, 2010 The rest of today Now: What is classification, and why do we do it? How to develop
More informationSUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing
Categorical Speech Representation in the Human Superior Temporal Gyrus Edward F. Chang, Jochem W. Rieger, Keith D. Johnson, Mitchel S. Berger, Nicholas M. Barbaro, Robert T. Knight SUPPLEMENTARY INFORMATION
More informationA COMBINATORY ALGORITHM OF UNIVARIATE AND MULTIVARIATE GENE SELECTION
5-9 JATIT. All rights reserved. A COMBINATORY ALGORITHM OF UNIVARIATE AND MULTIVARIATE GENE SELECTION 1 H. Mahmoodian, M. Hamiruce Marhaban, 3 R. A. Rahim, R. Rosli, 5 M. Iqbal Saripan 1 PhD student, Department
More informationThe use of random projections for the analysis of mass spectrometry imaging data Palmer, Andrew; Bunch, Josephine; Styles, Iain
The use of random projections for the analysis of mass spectrometry imaging data Palmer, Andrew; Bunch, Josephine; Styles, Iain DOI: 10.1007/s13361-014-1024-7 Citation for published version (Harvard):
More informationAUTOMATING NEUROLOGICAL DISEASE DIAGNOSIS USING STRUCTURAL MR BRAIN SCAN FEATURES
AUTOMATING NEUROLOGICAL DISEASE DIAGNOSIS USING STRUCTURAL MR BRAIN SCAN FEATURES ALLAN RAVENTÓS AND MOOSA ZAIDI Stanford University I. INTRODUCTION Nine percent of those aged 65 or older and about one
More informationBasic Biostatistics. Chapter 1. Content
Chapter 1 Basic Biostatistics Jamalludin Ab Rahman MD MPH Department of Community Medicine Kulliyyah of Medicine Content 2 Basic premises variables, level of measurements, probability distribution Descriptive
More informationA Strategy for Identifying Putative Causes of Gene Expression Variation in Human Cancer
A Strategy for Identifying Putative Causes of Gene Expression Variation in Human Cancer Hautaniemi, Sampsa; Ringnér, Markus; Kauraniemi, Päivikki; Kallioniemi, Anne; Edgren, Henrik; Yli-Harja, Olli; Astola,
More information3. Model evaluation & selection
Foundations of Machine Learning CentraleSupélec Fall 2016 3. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr
More informationCS 453X: Class 18. Jacob Whitehill
CS 453X: Class 18 Jacob Whitehill More on k-means Exercise: Empty clusters (1) Assume that a set of distinct data points { x (i) } are initially assigned so that none of the k clusters is empty. How can
More informationUsing CART to Mine SELDI ProteinChip Data for Biomarkers and Disease Stratification
Using CART to Mine SELDI ProteinChip Data for Biomarkers and Disease Stratification Kenna Mawk, D.V.M. Informatics Product Manager Ciphergen Biosystems, Inc. Outline Introduction to ProteinChip Technology
More informationMODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION TO BREAST CANCER DATA
International Journal of Software Engineering and Knowledge Engineering Vol. 13, No. 6 (2003) 579 592 c World Scientific Publishing Company MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION
More information