Bayesian methods in system identification: equivalences, differences, and misunderstandings

Johan Schoukens (Vrije Universiteit Brussel) and Carl Edward Rasmussen (University of Cambridge). ERNSI 2017 Workshop on System Identification, Lyon, September 24-27, 2017.

Outline
- Introduction: personal story and goal of the presentation
- What is prior knowledge?
- Setting the ideas: impulse response example
- The nature of the modeling problem
- Two approaches: cross-validation vs. Bayesian inference
- Bird's-eye diagram
- Priors: a robust concept
- Conclusions

Prior Knowledge
A data-driven model relies not only on data. The additional information can take many forms: knowledge of the field, users' beliefs, experience, assumptions, etc.
Example: y = f(u) = Σ_i θ_i φ_i(u), subject to C(θ), where C(θ) is some form of constraint; we will call it the prior.
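As a concrete, purely illustrative sketch of a constraint acting as a prior (the basis, data, and penalty strength below are assumptions, not the slides' example), the same basis-function model can be fitted by plain least squares and by penalized least squares, where C(θ) is a Gaussian (ridge) penalty on the coefficients:

```python
import numpy as np

# Illustrative sketch: y = sum_i theta_i * phi_i(u), where the constraint C(theta)
# is expressed as a Gaussian prior on the coefficients (ridge / Tikhonov penalty).

rng = np.random.default_rng(0)
u = np.linspace(-1.0, 1.0, 50)
Phi = np.vander(u, 8, increasing=True)            # polynomial basis phi_i(u)
theta_true = np.array([0.5, -1.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0])
y = Phi @ theta_true + 0.1 * rng.standard_normal(u.size)

lam = 1e-2                                        # strength of the constraint C(theta)
theta_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]                        # data only
theta_prior = np.linalg.solve(Phi.T @ Phi + lam * np.eye(8), Phi.T @ y)  # data + prior
```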

Example: Impulse Response
- Number of data points (4096), (almost) white-noise excitation, and a given noise level.
- Figures: g(t) estimated by LS and by regularized LS (LS + reg).
- The prior encodes an exponentially decaying envelope and smooth weights.
- Second plot: RMS error as a function of the weight delay (RMS values).
- Akaike (AIC) turns off coefficients above about t = 40.
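A minimal numpy sketch of this kind of experiment, with assumed data, noise level, and kernel hyperparameters (not the slides' exact settings): g(t) is estimated by plain least squares and by regularized least squares with a prior covariance encoding an exponentially decaying, smooth impulse response.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 4096, 100                                  # data points, FIR length (assumed)
u = rng.standard_normal(N)                        # (almost) white-noise excitation
t = np.arange(n)
g_true = 0.8**t * np.sin(0.3 * t)                 # hypothetical true impulse response
y = np.convolve(u, g_true)[:N] + 0.05 * rng.standard_normal(N)

# Regression matrix of lagged inputs: y ≈ U @ g
U = np.column_stack([np.concatenate([np.zeros(k), u[:N - k]]) for k in range(n)])

g_ls = np.linalg.lstsq(U, y, rcond=None)[0]       # least squares: noisy tail

# Prior covariance: decaying diagonal and correlated neighbours (assumed values)
lam, rho, sigma2 = 0.9, 0.95, 0.05**2
P = np.array([[lam**((i + j) / 2) * rho**abs(i - j) for j in t] for i in t])
g_reg = np.linalg.solve(U.T @ U + sigma2 * np.linalg.inv(P), U.T @ y)  # regularized LS
```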

Example: Impulse Response (continued)
[Figure: three panels: the estimated LS FIR and the estimated regularized LS (RLS) FIR (magnitude vs. t), and the RMS error on g(t) (legend: g, LS, LSreg).]

The nature of the modeling problem
Prior information is often qualitative:
- stable
- smooth
- stationarity (invariances)
- positivity, monotonicity
The quantification (strength, or scale) of this information is inferred from the data through hyperparameters.
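A small sketch of how such a scale can be inferred from the data (model, basis, and grid below are assumptions): for a linear-Gaussian model the marginal likelihood is available in closed form, and the prior scale alpha is chosen by maximizing it.

```python
import numpy as np

# Assumed model: y = Phi @ theta + e, theta ~ N(0, alpha*I), e ~ N(0, sigma2*I).
# The prior only says "coefficients are moderate"; its scale alpha (a hyperparameter)
# is picked by maximizing log N(y; 0, alpha*Phi@Phi.T + sigma2*I).

def log_marginal_likelihood(y, Phi, alpha, sigma2):
    N = y.size
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(N)   # marginal covariance of y
    L = np.linalg.cholesky(K)
    v = np.linalg.solve(L, y)
    return -0.5 * v @ v - np.log(np.diag(L)).sum() - 0.5 * N * np.log(2 * np.pi)

rng = np.random.default_rng(2)
u = np.linspace(-1, 1, 40)
Phi = np.vander(u, 6, increasing=True)
y = np.sin(2 * u) + 0.1 * rng.standard_normal(u.size)

alphas = np.logspace(-4, 2, 25)
best = max(alphas, key=lambda a: log_marginal_likelihood(y, Phi, a, 0.1**2))
print(f"prior scale selected by the data: alpha = {best:.3g}")
```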

Diagram of methodologies

Maximum likelihood (classical: SI + AIC, BIC, CV)
- Cost: negative log likelihood + discrete penalty
- Procedure: double optimisation: parameters (continuous), model (discrete)
- Results: parameter point estimate; parameter covariance (data driven)

Regularization framework (regularized SI)
- Cost: negative log likelihood + lambda times regulariser
- Procedure: optimize CV or marginal likelihood for the hyperparameters; optimisation for the parameters (conditioned on the hypers)
- Results: parameter point estimate; parameter covariance (data + regularizer)

Bayes
- Cost: negative log likelihood + negative log prior
- Procedure: two levels: parameters + hyperparameters; optimize the marginal likelihood for the hypers; parameter posterior
- Results: parameter posterior; predictive distribution (MCMC or approximations)
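Schematically, and using λ for the regularisation weight and η for the hyperparameters (symbols introduced here, not on the slide), the three columns can be written as:

```latex
\begin{align*}
\text{ML + AIC/BIC/CV:}\quad & \hat\theta = \arg\min_\theta\, -\log p(D\mid\theta, M),
   \quad M \text{ chosen by a discrete penalty or validation}\\[4pt]
\text{Regularization:}\quad & \hat\theta = \arg\min_\theta\, -\log p(D\mid\theta) + \lambda\, R(\theta),
   \quad \lambda \text{ by CV or marginal likelihood}\\[4pt]
\text{Bayes:}\quad & p(\theta\mid D, \eta) \propto p(D\mid\theta)\, p(\theta\mid\eta),
   \quad \hat\eta = \arg\max_\eta \int p(D\mid\theta)\, p(\theta\mid\eta)\, d\theta
\end{align*}
```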

Regression comparison
[Figure: two panels showing the data points, the generating function, the mean prediction, and 95% model confidence bands for the function and for the data, for the BIC model and the Gaussian process.]
Comparing BIC with a Gaussian process. Model uncertainty (95% confidence) in light grey, data uncertainty in dark grey. BIC uses a small number of basis functions, leading to under-fitting and overconfidence. The GP uses infinitely many basis functions.
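A bare-bones numpy sketch of the Gaussian-process side of this comparison, with a squared-exponential kernel and assumed hyperparameters and data (not the slides' exact setup), producing the mean prediction and 95% bands for the function and for new data:

```python
import numpy as np

# GP regression with a squared-exponential kernel (assumed hyperparameters).
def sqexp(a, b, ell=2.0, sf=10.0):
    return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-15, 15, 30))
y = 0.1 * x**2 + rng.standard_normal(x.size)      # hypothetical data
sn2 = 1.0                                         # noise variance (assumed)

xs = np.linspace(-15, 15, 200)                    # prediction grid
K = sqexp(x, x) + sn2 * np.eye(x.size)
Ks = sqexp(xs, x)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

mean = Ks @ alpha                                                  # mean prediction
v = np.linalg.solve(L, Ks.T)
var_f = np.diag(sqexp(xs, xs)) - np.sum(v**2, axis=0)              # function uncertainty
lo_f, hi_f = mean - 1.96 * np.sqrt(var_f), mean + 1.96 * np.sqrt(var_f)          # function band
lo_y, hi_y = mean - 1.96 * np.sqrt(var_f + sn2), mean + 1.96 * np.sqrt(var_f + sn2)  # data band
```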

Robustness of solutions to the prior: step example
[Figure: two panels showing the data points, the generating function, the mean prediction, and 95% model confidence bands (function and data) for the BIC model and the Gaussian process.]
Comparing BIC with a Gaussian process. The data come from a step function while the prior is for a stationary function, so the prior is not well matched.

Robustness of solutions to the prior: exp example
[Figure: two panels showing the data points, the generating function, the mean prediction, and 95% model confidence bands (function and data) for the BIC model and the Gaussian process.]
The data come from an exponential function, again with a prior for a stationary function. BIC selects a single basis function and is overconfident for negative arguments; the GP extrapolates poorly.

Marginalize or optimise?
When optimising, overfitting becomes a problem. The optimiser will find solutions which agree well with the particular training set observed but do not generalize well. This motivates regularisation and working with small models (Akaike, etc.). Often external information (validation sets) is used to control complexity.
When marginalising, overfitting does not happen. Instead, in large models with vague priors the large uncertainties will remain: the predictive error bars will be large. Internal measures (the marginal likelihood) will show the problem; no external information is required.
Unfortunately, whereas non-linear optimisation is hard, marginalisation is REALLY hard. Bayesian methods generally require 1) MCMC techniques for inference, or 2) specific model classes, such as Gaussian processes, or 3) analytic approximations (e.g. variational).
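Since MCMC is mentioned as one general route to marginalisation, here is a bare-bones random-walk Metropolis sketch for an assumed toy model and prior (purely illustrative; the step size, data, and priors are not from the talk):

```python
import numpy as np

# Random-walk Metropolis over theta for an assumed toy model:
# y = theta[0] + theta[1]*x + noise, with a N(0, 10^2) prior on each parameter.

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(x.size)

def log_post(theta, sigma=0.1, prior_sd=10.0):
    resid = y - (theta[0] + theta[1] * x)
    return -0.5 * np.sum(resid**2) / sigma**2 - 0.5 * np.sum(theta**2) / prior_sd**2

theta = np.zeros(2)
samples, step = [], 0.05
for _ in range(20000):
    prop = theta + step * rng.standard_normal(2)   # symmetric proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop                               # accept
    samples.append(theta.copy())
samples = np.array(samples[5000:])                 # drop burn-in
print("approximate posterior mean:", samples.mean(axis=0))
```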

Marginalize or optimize
Although a regulariser and a prior can look very similar (sometimes even with identical expressions), they are in fact quite different: in optimisation only the properties of the regulariser around the optimum matter, whereas in Bayes the whole prior distribution matters. This fact is typically overlooked.
Marginalisation is mostly harder and leads to a less convenient result, but may provide better uncertainty estimates. The tricky part may become understanding the prior distribution.
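To make this concrete (standard identities, not taken from the slide): the MAP / regularized estimate depends on the prior only through its shape near the optimum, while the Bayesian predictive distribution integrates over the whole prior:

```latex
\hat\theta_{\mathrm{MAP}} \;=\; \arg\max_\theta \big[\log p(D\mid\theta) + \log p(\theta)\big]
\qquad\text{vs.}\qquad
p(y^*\mid D) \;=\; \int p(y^*\mid\theta)\,
\frac{p(D\mid\theta)\,p(\theta)}{\int p(D\mid\theta')\,p(\theta')\,d\theta'}\, d\theta .
```

Changing the prior far away from the maximizer leaves the point estimate untouched, but it can change the integrals, and hence the predictions and error bars, substantially.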

Possible discussion points
- Priors can be useful for interpretation.
- For generative models, the prior specification is invariant to how much data will be available.

Conclusions
- Main message: data-driven modeling uses more information than is in the data.
- The Bayesian framework is a systematic way of dealing with non-data information; so: maximum likelihood gives a systematic treatment of noisy data, Bayes gives a systematic treatment of noisy data and priors.
- The regularization framework may be interpreted as a Bayesian procedure in the mono-modal (Gaussian) case.
- A posterior interpretation of the prior can help interpretation.

Bayesian complexity control
[Figure: the marginal likelihood P(Y | M_i) plotted over Y, the space of all possible data sets, for a "too simple", a "just right", and a "too complex" model.]
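A small sketch of the effect depicted above, using assumed polynomial models and data (not from the slides): the evidence p(Y | M_i) of a linear-Gaussian model is computed in closed form for increasing polynomial order, and an intermediate order typically attains the highest evidence, while models that are too simple or too complex score lower.

```python
import numpy as np

# Evidence p(Y | M_i) for polynomial models of increasing order, each with a
# Gaussian prior theta ~ N(0, alpha*I) and noise N(0, sigma2):
# log p(y | M) = log N(y; 0, alpha*Phi@Phi.T + sigma2*I).  Settings are assumptions.

def log_evidence(y, Phi, alpha=1.0, sigma2=0.1**2):
    N = y.size
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(N)
    L = np.linalg.cholesky(K)
    v = np.linalg.solve(L, y)
    return -0.5 * v @ v - np.log(np.diag(L)).sum() - 0.5 * N * np.log(2 * np.pi)

rng = np.random.default_rng(5)
u = np.linspace(-1, 1, 30)
y = 1.0 - u + 0.5 * u**2 + 0.1 * rng.standard_normal(u.size)   # quadratic "truth"

for order in range(7):                                         # too simple ... too complex
    Phi = np.vander(u, order + 1, increasing=True)
    print(order, round(log_evidence(y, Phi), 2))
```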