Bayesian methods in system identification: equivalences, differences, and misunderstandings

Johan Schoukens and Carl Edward Rasmussen
ERNSI 2017 Workshop on System Identification, Lyon, September 24-27, 2017
Schoukens, Rasmussen (Vrije Universiteit Brussel, Cambridge), Bayesian SYSID, 1 / 14
Outline

- Introduction: personal story and goal of the presentation
- What is prior knowledge?
- Setting the ideas: impulse response example
- Nature of the modeling problem
- Two approaches: cross-validation vs Bayesian inference
- Bird's-eye diagram
- Priors: a robust concept
- Conclusions
Prior Knowledge

A data-driven model relies not only on data: the additional bits can take many forms: knowledge of the field, the user's beliefs, experience, assumptions, etc.

Example: y = f(u) = Σ_i θ_i φ_i(u), subject to θ ∈ C(θ), where C(θ) is some form of constraint; we will call it the prior.
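The constrained basis expansion above can be sketched numerically. A minimal illustration (not from the talk): the constraint C(θ) is encoded as a quadratic penalty on the weights, which turns the fit into ridge regression. The basis choice, data, and penalty strength are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = f(u) + noise, to be fit as y = sum_i theta_i * phi_i(u)
u = rng.uniform(-1, 1, 50)
y = np.sin(2 * u) + 0.1 * rng.standard_normal(50)

# Polynomial basis phi_i(u) = u**i, i = 0..7
Phi = np.vander(u, 8, increasing=True)

# "Prior" C(theta) encoded as the quadratic penalty lam * ||theta||^2:
# minimise ||y - Phi theta||^2 + lam * ||theta||^2
lam = 1e-2
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(8), Phi.T @ y)
```

The closed-form solve is the penalised least-squares estimate; a hard constraint set C would instead require a constrained optimiser, but the soft-penalty form is the one that reappears later as the regularization framework.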
Example: Impulse Response

Setup: 4096 data points, (almost) white-noise excitation, moderate noise level.

Figures: g(t) estimated by LS and by regularized LS (LS + reg); the prior encodes an exponentially decaying envelope and smooth weights.

Second plot: RMS error as a function of the delay (RMS values). Akaike turns off coefficients above about t = 40.
Example: Impulse Response, continued

[Figures: estimated LS FIR and regularized LS (RLS) FIR impulse responses, magnitude vs t; below, the RMS error on g(t) for LS and LS+reg.]
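The LS vs regularized-LS comparison of this example can be reproduced in a few lines. This is a minimal sketch, not the authors' code: the true impulse response, noise level, and the decay/smoothness hyperparameters of the kernel are all assumed for illustration; the prior follows the DC-kernel idea (exponentially decaying envelope plus correlated neighbouring coefficients).

```python
import numpy as np

rng = np.random.default_rng(1)

n, N = 50, 1000                              # FIR length, number of data points
t = np.arange(n)
g_true = 0.85 ** t * np.sin(0.3 * t)         # assumed decaying impulse response

u = rng.standard_normal(N)                   # (almost) white excitation
U = np.column_stack([np.concatenate([np.zeros(k), u[:N - k]]) for k in range(n)])
y = U @ g_true + 0.1 * rng.standard_normal(N)

# Plain least squares
g_ls = np.linalg.lstsq(U, y, rcond=None)[0]

# Regularized LS: prior covariance P with exponentially decaying envelope
# lam_d**((k+l)/2) and smoothness via correlation rho**|k-l| (DC-type kernel)
lam_d, rho, sigma2 = 0.9, 0.95, 0.1 ** 2
P = np.outer(lam_d ** t, lam_d ** t) ** 0.5 * rho ** np.abs(t[:, None] - t[None, :])
g_rls = np.linalg.solve(U.T @ U + sigma2 * np.linalg.inv(P), U.T @ y)
```

With the prior well matched to the decaying response, the regularized estimate typically has a lower RMS error on g(t) at the larger delays, where plain LS only fits noise.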
The nature of the modeling problem

Prior information is often qualitative:
- stable
- smooth
- stationarity (invariances)
- positivity, monotonicity

The quantification (strength, or scale) of this information is inferred from the data through hyperparameters.
Diagram of methodologies

Maximum likelihood (classical: SI + AIC, BIC, CV)
  Cost: negative log likelihood + discrete penalty
  Procedure: double optimisation: parameters (continuous) and model (discrete)
  Results: parameter point estimate; parameter covariance (data driven)

Regularization framework (regularized SI)
  Cost: negative log likelihood + lambda times regulariser
  Procedure: optimize CV or marginal likelihood for hypers; optimisation for parameters (conditioned on hypers)
  Results: parameter point estimate; parameter covariance (data + regularizer)

Bayes
  Cost: negative log likelihood + negative log prior
  Procedure: two levels: parameters + hyperparameters; optimize marginal likelihood for hypers; parameter posterior
  Results: parameter posterior; predictive distribution (MCMC or approx.)
Regression comparison

[Figure: two regression fits on the same data, showing 95% model confidence (data), 95% model confidence (function), data points, the generating function, and the mean prediction.]

Comparing BIC with a Gaussian process. Model uncertainty (95% confidence) in light grey, data uncertainty in dark grey. BIC uses a small number of basis functions, which leads to under-fitting and overconfidence. The GP uses infinitely many basis functions.
Robustness of solutions to prior, step example

[Figure: BIC fit vs GP fit to step-function data, with 95% confidence bands for data and function, data points, generating function, and mean prediction.]

Comparing BIC with a Gaussian process. Data from a step function, prior for a stationary function: not well matched.
Robustness of solutions to prior, exp example

[Figure: BIC fit vs GP fit to exponential data, with 95% confidence bands for data and function, data points, generating function, and mean prediction.]

Data from an exp function, prior for a stationary function. BIC selects a single basis function and is overconfident for negative arguments. The GP extrapolates poorly.
Marginalize or optimise?

When optimising, overfitting becomes a problem. The optimiser will find solutions which agree well with the particular training set observed but do not generalize well. This motivates regularisation and working with small models (Akaike, etc.). Often external information (validation sets) is used to control complexity.

When marginalising, overfitting does not happen. Instead, in large models with vague priors the large uncertainties will remain: the predictive error bars will be large. Internal measures (the marginal likelihood) will show the problem; no external information is required.

Unfortunately, whereas non-linear optimisation is hard, marginalisation is REALLY hard. Bayesian methods generally require 1) MCMC techniques for inference, 2) specific model classes, such as Gaussian processes, or 3) analytic approximations (e.g. variational).
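The "internal measure" point can be made concrete with the one model class where marginalisation is easy: the Bayesian linear model, whose marginal likelihood is available in closed form. A minimal sketch with assumed dimensions and noise level, setting the prior scale by maximising the marginal likelihood over a grid, with no validation set involved:

```python
import numpy as np

rng = np.random.default_rng(3)

# Bayesian linear model: y = Phi theta + e,
# theta ~ N(0, s2_p I), e ~ N(0, s2_n I)
N, d = 40, 5
Phi = rng.standard_normal((N, d))
theta_true = rng.standard_normal(d)
s2_n = 0.2 ** 2
y = Phi @ theta_true + np.sqrt(s2_n) * rng.standard_normal(N)

def log_marginal(s2_p):
    """log p(y | s2_p): theta integrated out, y ~ N(0, s2_p Phi Phi' + s2_n I)."""
    C = s2_p * Phi @ Phi.T + s2_n * np.eye(N)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + N * np.log(2 * np.pi))

grid = np.logspace(-3, 2, 50)
s2_best = grid[np.argmax([log_marginal(s) for s in grid])]
```

A prior scale that is far too small or too large lowers the marginal likelihood on the training data itself; that is the internal signal the slide refers to.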
Marginalize or optimize?

Although the use of a regulariser and of a prior look very similar (sometimes even identical expressions), they are in fact quite different: in optimisation only the properties of the regulariser around the optimum matter, whereas in Bayes the whole prior distribution matters. This fact is typically overlooked.

Marginalisation is mostly harder, and leads to a less convenient result, but may provide better uncertainty estimates. The tricky part may become understanding the prior distribution.
Possible discussion points

- priors can be useful for interpretation of generative models
- prior specification is invariant to how much data will be available
Conclusions

- Main message: data-driven modeling uses more information than is in the data
- The Bayesian framework is a systematic way of dealing with non-data information
- So: maximum likelihood: systematic treatment of noisy data; Bayes: systematic treatment of noisy data and priors
- The regularization framework may be interpreted as a Bayesian procedure in the mono-modal (Gaussian) case
- Posterior interpretation of the prior can help interpretation
Bayesian complexity control

[Figure: evidence P(Y | M_i) plotted over all possible data sets Y for three models: "too simple" (narrow, tall), "just right", and "too complex" (broad, flat).]
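The figure's point, that the evidence P(Y | M_i) automatically prefers the "just right" model, can be checked numerically with the closed-form evidence of Bayesian polynomial models. A minimal sketch with assumed data (quadratic plus noise) and assumed prior scales:

```python
import numpy as np

rng = np.random.default_rng(4)

# Data generated from a quadratic; the "just right" model has order 2
x = np.linspace(-1, 1, 25)
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + 0.05 * rng.standard_normal(25)

def log_evidence(order, s2_p=1.0, s2_n=0.05 ** 2):
    """log P(Y | M_order) for a polynomial model with Gaussian weight prior."""
    Phi = np.vander(x, order + 1, increasing=True)
    C = s2_p * Phi @ Phi.T + s2_n * np.eye(len(x))
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + len(x) * np.log(2 * np.pi))

ev = [log_evidence(m) for m in range(8)]
best = int(np.argmax(ev))
```

Models that are too simple cannot explain the observed Y; models that are too complex spread their probability mass over many possible data sets, so the evidence peaks in between: the Bayesian Occam's razor the figure depicts.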