Learning from data when all models are wrong

Learning from data when all models are wrong
Peter Grünwald, CWI / Leiden

Menu ("Two Pictures")
1. Introduction
2. Learning when Models are Seriously Wrong
3. Learning when Models are Almost True
Joint work with John Langford, Tim van Erven, Steven de Rooij, Wouter Koolen, Wojciech Kotlowski.

Wrong yet useful models
Example 1: Regression (Curve Fitting). Scientists routinely use models that are wrong (nonlinear relations modeled as linear, dependent variables modeled as independent), yet useful: they lead to reasonable predictions. Examples: speech and digit recognition, DNA sequence analysis, ...

Regression
[Figures: polynomial curve fitting on training data; a high-degree fit illustrates overfitting!]

Standard Setting
Training data $(x_1, y_1), \ldots, (x_n, y_n)$. Usual assumption: the pairs are independent and identically distributed according to some true distribution $P^*$.
Model (or model class) $\mathcal{M}$: a set of distributions; the true distribution $P^*$ may or may not lie in $\mathcal{M}$.

Polynomial Regression Example
The model class consists of parametric sub-models $\mathcal{M}_1, \mathcal{M}_2, \ldots$, where each $P_\theta \in \mathcal{M}_k$ is a conditional distribution of $Y$ given $X$, expressing $Y$ as a degree-$k$ polynomial in $X$ plus Gaussian noise, i.e. it has a density $p_\theta(y \mid x)$.

Learning
Based on the training data, hypothesize an estimate of $P^*$. A learning algorithm ("estimator") is a function $\hat P : \bigcup_{n \ge 1} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{M}$.
Example learning algorithms: (penalized) least squares, (penalized) maximum likelihood.
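A minimal sketch of such a learning algorithm, not taken from the slides: fitting the polynomial sub-model $\mathcal{M}_k$ by (penalized) least squares, which for Gaussian noise coincides with (penalized) maximum likelihood. The function names and the choice of a square-root "truth" (echoing the slides' "almost true" example) are illustrative assumptions.

```python
# Illustrative sketch: (penalized) least-squares fit of the polynomial sub-model M_k.
import numpy as np

def fit_polynomial(x, y, degree, ridge=0.0):
    """Least-squares fit of a degree-`degree` polynomial; `ridge` > 0 adds
    an L2 penalty on the coefficients (penalized least squares)."""
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, ...
    A = X.T @ X + ridge * np.eye(degree + 1)
    theta = np.linalg.solve(A, X.T @ y)
    return theta

def predict(theta, x):
    X = np.vander(x, len(theta), increasing=True)
    return X @ theta

# Data from a "square root" truth, fitted with a (wrong) polynomial model.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = np.sqrt(x) + rng.normal(scale=0.1, size=50)     # true relation is not polynomial
theta3 = fit_polynomial(x, y, degree=3)
print("degree-3 coefficients:", theta3)
```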

Prediction
Based on the training data and a new input $x_{n+1}$, predict $y_{n+1}$.

Four Situations
1. When the Model is True: e.g. $P^*$ is a 3rd-degree polynomial with Gaussian noise and the model class is the set of all polynomials; $P^* \in \mathcal{M}$.
2. When the Model is Almost True: e.g. the model class $\mathcal{M}$ is the set of all polynomials, but $P^*$ is a square root; $P^* \notin \mathcal{M}$, yet $\inf_{P \in \mathcal{M}} d(P^*, P) = 0$.
3. When the Model is Wrong: e.g. the model class is the set of all polynomials with Gaussian noise, but the real noise is not Gaussian; $P^* \notin \mathcal{M}$ and $\inf_{P \in \mathcal{M}} d(P^*, P) > 0$.
4. When there is No Truth!

4 Situations, 3 Goals
Goals: estimation, model selection, prediction ("universal prediction").
- Model true: estimation, model selection, prediction: Bayes is usually o.k.
- Model almost true: Bayes factor model selection and model averaging are suboptimal; for prediction, Bayes is o.k. for log-score, but not for other scores.
- Model wrong (a truth exists, but lies outside the model): Bayes is not o.k. in general unless the model is convex; Bayes can be inconsistent! For prediction, Bayes is o.k. for log-score, but not for other scores.
- No truth: universal prediction; Bayes is o.k. for log-score, not for other scores.
Theme 1: Bayes!
Theme 2: when the model is wrong, the distance measure becomes much more important.
Theme 3: most existing results are from the theoretical machine learning community (NIPS, COLT).

Statistical Consistency
Let $\hat P : \bigcup_{n \ge 1} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{M}$ be an estimator/learner. $\hat P$ is called consistent (relative to a distance $d$) if for all $P^* \in \mathcal{M}$, with $P^*$-probability 1,
$$ d\bigl(P^*, \hat P(X^n, Y^n)\bigr) \to 0. $$
[Figure: truth $P^*$ lying inside the model $\mathcal{M}$.]
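A small illustrative sketch, not from the slides, of what consistency looks like when the model is true: the truth is Bernoulli(0.3), the model is the full Bernoulli family, the estimator is maximum likelihood, and the KL divergence $d(P^*, \hat P)$ shrinks as $n$ grows. All names and the Bernoulli setup are assumptions made for the example.

```python
# Illustrative consistency check for a model that is TRUE.
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence D(Ber(p) || Ber(q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

rng = np.random.default_rng(1)
p_true = 0.3
for n in [10, 100, 1000, 10000]:
    sample = rng.random(n) < p_true
    p_hat = np.clip(sample.mean(), 1e-6, 1 - 1e-6)   # ML estimate, kept away from 0/1
    print(f"n = {n:6d}   d(P*, P_hat) = {kl_bernoulli(p_true, p_hat):.6f}")
```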

Consistency when the Model is Wrong
[Figures: truth $P^*$ lying outside the model $\mathcal{M}$; which element of $\mathcal{M}$ counts as the best approximation depends on the distance measure, which therefore matters much more when the model is wrong.]

Bayesian Inconsistency when the Model is Wrong
(Grünwald and Langford, Machine Learning 66(2-3), 2007.) There exist
- a countable set of distributions $\mathcal{M} = \{P_0, P_1, P_2, \ldots\}$,
- a true distribution $P^*$ and a best approximation $\tilde P = P_0$, with $\tilde P = \arg\min_{P \in \mathcal{M}} d(P^*, P)$ and $d(P^*, \tilde P) = \epsilon > 0$,
- a prior distribution $W$ on $\mathcal{M}$ with $W(\tilde P) = 1/2$,
such that, $P^*$-almost surely, the standard Bayesian estimators never converge to $\tilde P$.
Here $d$ is the KL divergence (so this is really bad), and the construction is a logistic regression setting.

Bayesian inference is based on the posterior $W(P \mid X^n, Y^n)$:
- prediction: average over the posterior (the Bayes predictive distribution);
- estimation: output a subset of $\mathcal{M}$ with large posterior probability.

The Geometry of Inconsistency
For all $K > 0$: $W\bigl(\{P : d(P^*, P) \le K\} \mid X^n, Y^n\bigr) \to 0$, $P^*$-a.s.
Note that each individual $P$'s posterior weight $W(P \mid X^n, Y^n)$ does tend to 0, eventually, a.s.
It is essential that we take a nonparametric model (or do model selection with an infinite number of models): if the model is parametric then we do get consistency, but we need too much data.
Prediction is o.k. for log loss (which relates to KL divergence, as shown in the picture), but not for other loss functions.
[Figure: geometric picture of $P^*$, the model $\mathcal{M}$, and the Bayes predictive distribution, which is a mixture of elements of $\mathcal{M}$ and may lie outside the model.]
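A minimal sketch of the two Bayesian operations named above (posterior over a set of candidate distributions, posterior-averaged prediction), on a toy finite Bernoulli family rather than the paper's construction. The candidate grid, prior, and true parameter are assumptions made for the example.

```python
# Illustrative Bayesian inference over a finite set of candidate distributions.
import numpy as np

candidates = np.array([0.1, 0.3, 0.5, 0.7, 0.9])        # M = {P_0, ..., P_4}
prior = np.full(len(candidates), 1 / len(candidates))    # prior W on M

rng = np.random.default_rng(2)
data = (rng.random(200) < 0.42).astype(int)               # true P* not in M

log_post = np.log(prior)
for y in data:
    log_post += np.where(y == 1, np.log(candidates), np.log(1 - candidates))
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()                              # W(P | data)

predictive = float(posterior @ candidates)                # Bayes predictive P(Y=1 | data)
print("posterior:", np.round(posterior, 3))
print("predictive P(Y=1):", round(predictive, 3))
```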

The Role of Convexity
The Bayes predictive distribution is a mixture of $P \in \mathcal{M}$. Indeed, Bayes is consistent with respect to KL divergence, for a countable set $\mathcal{M}$ with $P^* \notin \mathcal{M}$, if the model is convex (closed under taking mixtures).
Bayes implicitly punishes complex sub-models. To get consistency when the model is wrong, one needs to punish them significantly more: if the prior $W(P)$ is changed to $W(P)^n$, then we do get consistency even for nonconvex models.

Menu ("Two Pictures")
1. Introduction
2. Learning when Models are Seriously Wrong
3. Model Selection when Models are Almost True

Model Selection Methods
Suppose we observe data $y^n = (y_1, \ldots, y_n)$. We want to know which model $\mathcal{M}_k$ in our list of candidate models best explains the data. $\hat\theta_k$ is the maximum likelihood estimator (best fit) within model $\mathcal{M}_k$. A model selection method is a function mapping data sequences of arbitrary length to model indices; $\hat k(y^n)$ is the model chosen for data $y^n$.

The AIC-BIC Dilemma
Two main types of model selection methods:
1. AIC-type: Akaike Information Criterion (AIC, 1973): $\hat k(y^n)$ is the $k$ minimizing $-\log p_{\hat\theta_k}(y^n) + k$. Related: leave-one-out cross-validation, DIC, $C_p$. These overfit asymptotically: they are inconsistent.
2. BIC-type: Bayesian Information Criterion (BIC, 1978): $\hat k(y^n)$ is the $k$ minimizing $-\log p_{\hat\theta_k}(y^n) + \frac{k}{2}\log n$. Related: prequential validation, Bayes factor model selection, standard Minimum Description Length (MDL). These are consistent.
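An illustrative sketch of the two criteria as written on the slides, applied to polynomial regression with Gaussian noise. The parameter count (coefficients plus noise variance), the square-root truth, and all function names are assumptions made for the example, not part of the slides.

```python
# Illustrative AIC-type versus BIC-type selection over polynomial models.
import numpy as np

def neg_log_likelihood(x, y, degree):
    """Minus maximized log-likelihood of a degree-`degree` polynomial model
    with Gaussian noise (variance set to its ML estimate)."""
    X = np.vander(x, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ theta
    sigma2 = np.mean(resid ** 2)
    n = len(y)
    return 0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 1, n)
y = np.sqrt(x) + rng.normal(scale=0.1, size=n)       # truth outside every M_k

for degree in range(1, 8):
    k = degree + 2                                    # free parameters: coefficients + noise variance
    nll = neg_log_likelihood(x, y, degree)
    aic = nll + k                                     # slides' AIC-type penalty
    bic = nll + 0.5 * k * np.log(n)                   # slides' BIC-type penalty
    print(f"degree {degree}: -logL = {nll:7.2f}   AIC-type = {aic:7.2f}   BIC-type = {bic:7.2f}")
```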

The AIC-BIC Dilemma (continued)
1. AIC-type methods (AIC, leave-one-out cross-validation, DIC, $C_p$): inconsistent, but achieve the optimal rate of convergence.
2. BIC-type methods (BIC, prequential validation, Bayes factor model selection, standard MDL): consistent, but converge at a slower rate (asymptotic underfitting).

The Best of Both Worlds
We give a novel analysis of the slower convergence rate of BIC-type methods: the catch-up phenomenon. This allows us to define a model selection/averaging method that, in a wide variety of circumstances,
1. is provably consistent, and
2. provably achieves optimal convergence rates (often ...),
even though it had been suggested that this is impossible (Yang 2005, Forster 2001, Sober 2004). For many model classes, the method is computationally feasible.

Sub-Menu
1. Bayes Factor Model Selection: predictive interpretation
2. The Catch-Up Phenomenon, as exhibited by the Bayes factor method
3. Conclusion

Bayes Factor Model Selection
With a prior $W$ on the model indices and priors $w_k$ on the parameters within each $\mathcal{M}_k$, $\hat k(y^n)$ is the $k$ maximizing the a posteriori probability $W(\mathcal{M}_k \mid y^n)$; equivalently, it is the $k$ minimizing $-\log P(y^n \mid \mathcal{M}_k) - \log W(\mathcal{M}_k)$.

[Figure: Bayes factor model selection between a 1st-order and a 2nd-order Markov model for the text of The Picture of Dorian Gray; beyond a certain sample size the complex model is selected.]

The Catch-Up Phenomenon
Suppose we select between a simple model $\mathcal{M}_1$ and a complex model $\mathcal{M}_2$. Common phenomenon: for some sample size $n^*$,
- the simple model predicts better if $n < n^*$,
- the complex model predicts better if $n > n^*$.
This seems to be the very reason why it makes sense to prefer a simple model even if the complex one is true. We would expect the Bayes factor method to switch at about $n^*$... but is this really where Bayes switches!?

Bayesian Prediction and Logarithmic Loss
Given model $\mathcal{M}_k$ with prior $w_k$, the Bayesian marginal likelihood is
$$ P(y^n \mid \mathcal{M}_k) = \int p_\theta(y^n)\, w_k(\theta)\, d\theta. $$
If we measure prediction quality by log loss and, given the model, predict by the predictive distribution $P(\cdot \mid y^i, \mathcal{M}_k)$, then the accumulated loss satisfies
$$ \sum_{i=0}^{n-1} -\log P(y_{i+1} \mid y^i, \mathcal{M}_k) = -\log P(y^n \mid \mathcal{M}_k), $$
so that accumulated log loss = minus log (marginal) likelihood.
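A small sketch verifying the identity above in a case where both sides have closed forms: a Bernoulli model with a uniform Beta(1,1) prior, whose Bayes predictive distribution is Laplace's rule of succession. The Beta-Bernoulli choice is an assumption made for the example, not from the slides.

```python
# Illustrative check: accumulated log loss of sequential Bayes prediction
# equals minus the log marginal likelihood (Bernoulli model, uniform prior).
import numpy as np
from math import lgamma, log

rng = np.random.default_rng(4)
y = (rng.random(100) < 0.7).astype(int)

# Sequential prediction with the Bayes predictive distribution.
accumulated_log_loss = 0.0
ones = 0
for i, yi in enumerate(y):
    p_one = (ones + 1) / (i + 2)                 # Laplace rule: P(Y_{i+1}=1 | y^i)
    accumulated_log_loss -= log(p_one if yi == 1 else 1 - p_one)
    ones += yi

# Closed-form minus log marginal likelihood: -log [Gamma(k+1)Gamma(n-k+1)/Gamma(n+2)].
n, k = len(y), int(y.sum())
neg_log_marginal = -(lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2))
print(accumulated_log_loss, neg_log_marginal)    # the two numbers coincide
```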

The Most Important Slide
Bayes picks the $k$ minimizing the accumulated sequential prediction error $\sum_{i=0}^{n-1} -\log P(y_{i+1} \mid y^i, \mathcal{M}_k)$ (up to the prior term $-\log W(\mathcal{M}_k)$).
Prequential interpretation of Bayes factor model selection: select the model that, when used as a sequential prediction strategy, minimizes the accumulated sequential prediction error (Dawid 1984, Rissanen 1984).

The Catch-Up Phenomenon
[Figure: the green curve depicts the difference in accumulated prediction error between predicting with $\mathcal{M}_1$ and predicting with $\mathcal{M}_2$.]
$\mathcal{M}_2$ starts to outperform $\mathcal{M}_1$ at $n^*$, so we would like to switch to $\mathcal{M}_2$ at $n^*$, but switching only happens much later. Bayes exhibits inertia: the complex model first has to catch up, so the Bayes factor prefers the simpler model for a while even after $n^*$.

Model averaging does not help! Can we modify Bayes so as to do as well as the black curve (switching at the right moment)? Almost!

The Switch Distribution
Suppose we switch from $\mathcal{M}_1$ to $\mathcal{M}_2$ at sample size $s$. Our total prediction error is then
$$ \sum_{i=0}^{s-1} -\log P(y_{i+1} \mid y^i, \mathcal{M}_1) \;+\; \sum_{i=s}^{n-1} -\log P(y_{i+1} \mid y^i, \mathcal{M}_2). $$
If we define the corresponding switch strategy $Q_s$, then the total prediction error is $-\log Q_s(y^n)$; $Q_s$ may be viewed both as a prediction strategy and as a distribution over infinite sequences.
We want to predict using some distribution whose accumulated loss, no matter what data are observed, is close to the best achievable by switching, i.e. for all $y^n$ it is close to $-\log Q_{\hat s}(y^n)$, where $\hat s$ maximizes $Q_s(y^n)$ over $s$. We achieve this by treating $s$ as a parameter, putting a prior on it, and then integrating $s$ out (adopt a "Bayesian solution to a Bayesian problem").

Switch Distribution (very briefly)
The switch distribution is obtained by putting a (cleverly constructed) prior on sequences of models rather than on individual models: some prior mass is put, for instance, on the meta-hypothesis that $\mathcal{M}_1$ is best at sample sizes 1, 2, 6, 7 and $\mathcal{M}_2$ is best at sample sizes 3, 4, 5, 8, 9, ... The construction is readily extended to more than two models. We then apply Bayes using this new prior; prediction and model selection can be done efficiently by dynamic programming.

Results
Theorem 1: In all situations in which Bayes factor model selection is consistent, model selection based on the switching prior is consistent (like BIC).
Theorem 2: In many "model almost true" situations, prediction based on the switching prior achieves the optimal rate of convergence (like AIC).
Practical experiments confirm this. The method is formally Bayes, but the Bayesian interpretation is very tenuous: it's prequential!

Discussion
Model(s) almost true: standard Bayes is too slow. We can interpret the switch distribution as a "reparation" of Bayes, but also as a frequentist hypothesis test that checks whether the Bayesian model and prior assumptions are justified, and steps out of the model if necessary.
Model(s) seriously wrong: standard Bayes can be inconsistent unless the model is convex. In recent work we provide a frequentist "empirical convexity test" (does the convex closure of the model fit the data better?) to check whether the Bayesian model and prior assumptions are justified, and to step out of the model if necessary.
We are effectively combining Bayes + Popper + (..new principle..).
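A toy sketch of the dynamic-programming idea: a switch distribution over two sequential prediction strategies, using a geometric prior on a single forward-only switch time. This is a simplified assumed variant for illustration, not the exact prior of Van Erven, Grünwald and De Rooij (NIPS 2007); the Bernoulli strategies, the hazard rate `alpha`, and all names are assumptions.

```python
# Illustrative switch distribution over two prediction strategies, computed by
# a forward (HMM-style) dynamic program with a geometric switch-time prior.
import numpy as np

rng = np.random.default_rng(5)
y = (rng.random(2000) < 0.65).astype(int)

def simple_probs(y):
    """M_1: Bernoulli(1/2), no free parameters; probability assigned to each outcome."""
    return np.full(len(y), 0.5)

def complex_probs(y):
    """M_2: Bernoulli with unknown parameter, Bayes (Laplace-rule) predictive."""
    probs = np.empty(len(y))
    ones = 0
    for i, yi in enumerate(y):
        p_one = (ones + 1) / (i + 2)
        probs[i] = p_one if yi == 1 else 1 - p_one
        ones += yi
    return probs

p1, p2 = simple_probs(y), complex_probs(y)

def switch_log_loss(p1, p2, alpha=0.01):
    """Accumulated log loss of the switch distribution (single M_1 -> M_2 switch)."""
    w_stay, w_switched = 1.0, 0.0         # weights of "not yet switched" / "switched"
    total = 0.0
    for a, b in zip(p1, p2):
        w_switched += alpha * w_stay      # part of the remaining mass switches to M_2
        w_stay *= 1 - alpha
        pred = (w_stay * a + w_switched * b) / (w_stay + w_switched)
        total -= np.log(pred)
        w_stay *= a                       # Bayesian update of both states
        w_switched *= b
        s = w_stay + w_switched           # renormalize for numerical stability
        w_stay /= s
        w_switched /= s
    return total

print("M_1 accumulated log loss:   ", -np.log(p1).sum())
print("M_2 accumulated log loss:   ", -np.log(p2).sum())
print("switch accumulated log loss:", switch_log_loss(p1, p2))
```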

Conclusion
We are effectively combining Bayes + Popper + (..new principle..). Bayes + Popper has been proposed before, on an informal level, by eminent Bayesians and others: A.P. Dawid (1982), "the well-calibrated Bayesian"; A. Gelman (various papers, blog). The new principle: our ongoing work!

Literature
- Grünwald and Langford, Machine Learning, 2007.
- Van Erven, Grünwald and De Rooij, NIPS 2007.
- Van Erven's PhD thesis (2010).
(Notice: machine learning is important!)