BayesOpt: Extensions and applications


BayesOpt: Extensions and applications Javier González Masterclass, 7 February 2017 @Lancaster University

Agenda of the day 9:00-11:00, Introduction to Bayesian Optimization: What is BayesOpt and why does it work? Relevant things to know. 11:30-13:00, Connections, extensions and applications: Extensions to multi-task problems, constrained domains, early stopping, high dimensions. Connections to armed bandits and ABC. An application in genetics. 14:00-16:00, GPyOpt LAB!: Bring your own problem! 16:30-17:30, Hot topics and current challenges: Parallelization. Non-myopic methods. Interactive Bayesian optimization.

Section II: Connections, extensions and applications Extensions to multi-task problems, constrained domains, early stopping, high dimensions. Connections to armed bandits and ABC. An application in genetics.

Multi-task Bayesian Optimization [Swersky et al., 2013] Two types of problems: 1. Multiple, conflicting objectives: design an engine that is both more powerful and more efficient. 2. The objective is very expensive, but we have access to another, cheaper and correlated one.

Multi-task Bayesian Optimization [Swersky et al., 2013] We want to optimize an objective that is very expensive to evaluate, but we have access to another function, correlated with the objective, that is cheaper to evaluate. The idea is to use the correlation between the functions to improve the optimization. Multi-output Gaussian process: K(x, x') = B k(x, x'), where B is a positive semi-definite coregionalization matrix over the tasks.
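A minimal sketch of the coregionalized GP behind this idea, written with GPy; the toy functions, sample sizes and noise levels are made up for illustration:

```python
import numpy as np
import GPy

# Task 0: the expensive objective (few evaluations).
# Task 1: a cheap, correlated proxy (many evaluations).
X0 = np.random.rand(5, 1);  Y0 = np.sin(6 * X0) + 0.05 * np.random.randn(5, 1)
X1 = np.random.rand(40, 1); Y1 = np.sin(6 * X1) + 0.3 + 0.05 * np.random.randn(40, 1)

# Intrinsic coregionalization model: K(x, x') = B k_RBF(x, x').
icm = GPy.util.multioutput.ICM(input_dim=1, num_outputs=2,
                               kernel=GPy.kern.RBF(input_dim=1))
model = GPy.models.GPCoregionalizedRegression([X0, X1], [Y0, Y1], kernel=icm)
model.optimize()
print(model)  # the learned B encodes the between-task correlation; the many
              # cheap observations tighten the posterior on the expensive task
```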

Multi-task Bayesian Optimization [Swersky et al., 2013] Correlation among tasks reduces the global uncertainty. The choice of the next evaluation (the acquisition) changes accordingly.

Multi-task Bayesian Optimization [Swersky et al., 2013] In other cases we want to optimize several tasks at the same time. We then need to optimize a combination of them (the mean, for instance) or look at the Pareto frontier of the problem. Averaged expected improvement.

Multi-task Bayesian Optimization [Swersky et al., 2013]

Non-stationary Bayesian Optimization [Snoek et al., 2014] The Beta distribution (through its CDF) allows for a rich family of transformations.

Non-stationary Bayesian Optimization [Snoek et al., 2014] Idea: warp the inputs so that the transformed function is stationary.

Non-stationary Bayesian Optimization [Snoek et al., 2014] Results improve in many experiments by warping the inputs. Extensions to multi-task warping.
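A toy sketch of the warping: each input dimension is mapped through a Beta CDF before a stationary kernel is applied. In Snoek et al. [2014] the shape parameters are inferred from data; here they are fixed by hand for illustration:

```python
import numpy as np
from scipy import stats

def warp(x, a=0.5, b=2.0):
    # w(x) = BetaCDF(x; a, b); x is assumed rescaled to [0, 1].
    # Different (a, b) bend the input space differently, so the model can
    # place more resolution where the function changes quickly.
    return stats.beta.cdf(x, a, b)

x = np.linspace(0.0, 1.0, 6)
print(np.round(warp(x), 3))  # monotone, strongly stretched near x = 0
```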

Inequality Constraints [Gardner et al., 2014] An option is to penalize the EI with an indicator function that makes the acquisition vanish outside the domain of interest.

Inequality Constraints [Gardner et al., 2014] Much more efficient than standard approaches.
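A minimal sketch of this acquisition, with the indicator relaxed to the probability of feasibility as in Gardner et al. [2014]; the posterior moments are assumed to come from two fitted GPs (one for the objective, one for a constraint c(x) <= 0):

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(mu, sigma, mu_c, sigma_c, y_best):
    # mu, sigma: GP posterior mean/std of the objective at x (minimization).
    # mu_c, sigma_c: GP posterior mean/std of the constraint c(x).
    u = (y_best - mu) / sigma
    ei = sigma * (u * norm.cdf(u) + norm.pdf(u))  # standard expected improvement
    p_feasible = norm.cdf(-mu_c / sigma_c)        # P(c(x) <= 0) under the GP
    return ei * p_feasible                        # vanishes where infeasible
```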

High-dimensional BO: REMBO [Wang et al., 2013]

High-dimensional BO: REMBO [Wang et al., 2013] A function f : X → R is said to have effective dimensionality d, with d ≤ D, if there exists a linear subspace T of dimension d such that for all x_T ∈ T and x_⊥ ∈ T^⊥ we have f(x_T) = f(x_T + x_⊥), where T^⊥ is the orthogonal complement of T.

High-dimensional BO: REMBO [Wang et al., 2013] Better in cases in which the intrinsic dimensionality of the function is low. Hard to implement (one needs to define the bounds of the optimization after the embedding).
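A sketch of the random-embedding trick; the dimensions, box bounds and toy objective below are invented:

```python
import numpy as np

D, d = 1000, 2
A = np.random.randn(D, d)              # random embedding matrix

def f(x):                              # expensive objective on R^D (toy here:
    return ((x[:2] - 0.5) ** 2).sum()  # only the first two coordinates matter)

def g(y):                              # low-dimensional proxy: g(y) = f(Ay)
    x = np.clip(A @ y, -1.0, 1.0)      # project back into the original box
    return f(x)

# Any BO routine now runs on g over a small box in R^d, e.g. [-sqrt(d), sqrt(d)]^d;
# choosing this box after the embedding is the delicate part mentioned above.
```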

High-dimensional BO: Additive models Use the Sobol-Hoeffding decomposition f(x) = f_0 + Σ_{i=1}^D f_i(x_i) + Σ_{i<j} f_{ij}(x_i, x_j) + ... + f_{1,...,D}(x), where f_0 = ∫_X f(x) dx, f_i(x_i) = ∫ f(x) dx_{-i} − f_0, etc., and assume that the effects of order higher than q are null.
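When the decomposition is truncated at low order, the GP kernel itself can be built additively. A GPy sketch for a 4-dimensional problem split into two groups (the grouping is an assumption made for illustration):

```python
import GPy

# Sum of low-dimensional kernels, one per group of inputs.
k = (GPy.kern.RBF(input_dim=2, active_dims=[0, 1])
     + GPy.kern.RBF(input_dim=2, active_dims=[2, 3]))
# With such a kernel the acquisition (approximately) decomposes as well, so it
# can be maximized group by group: this is what makes additive models useful
# in high dimensions.
```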

High-dimensional BO: Additive models

Armed bandits - Bayesian Optimization Shahriari et al. [2016] Beta-Bernoulli Bayesian optimization: Beta prior on each arm.

Armed bandits - Bayesian Optimization Shahriari et al. [2016] Beta posterior: after n_a pulls of arm a with s_a observed successes, p_a | data ~ Beta(α + s_a, β + n_a − s_a). Thompson sampling: draw one sample from each arm's posterior and play the arm with the largest sample.

Armed bandits - Bayesian Optimization Shahriari et al. [2016] Beta-Bernoulli Bayesian optimization:
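A self-contained sketch of Beta-Bernoulli Thompson sampling; the arm probabilities are invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.5, 0.7])      # unknown Bernoulli mean of each arm
alpha = np.ones(3); beta = np.ones(3)   # Beta(1, 1) prior on each arm

for t in range(1000):
    theta = rng.beta(alpha, beta)       # one posterior sample per arm
    a = int(np.argmax(theta))           # play the arm with the best sample
    reward = rng.random() < true_p[a]   # Bernoulli reward
    alpha[a] += reward                  # conjugate update:
    beta[a] += 1 - reward               # Beta(alpha + successes, beta + failures)

print(alpha / (alpha + beta))           # posterior means concentrate on the best arm
```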

Armed bandits - Bayesian Optimization Shahriari et al. [2016] Linear bandits: We introduce correlations among the arms. Normal-inverse-Gamma prior on the regression weights and the noise variance.

Armed bandits - Bayesian Optimization Shahriari et al. [2016] Linear bandits: Now we can extract the posterior mean and variance analytically, and do Thompson sampling again: draw (w̃, σ̃²) from the posterior and play the arm x with the largest x^T w̃.
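A sketch of Thompson sampling for a linear bandit; to keep it short the noise variance is treated as known (a simplification of the Normal-inverse-Gamma model above), and the arm features are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
arms = rng.normal(size=(10, 3))                  # one feature vector per arm
w_true = np.array([1.0, -0.5, 0.2]); sigma2 = 0.1

Lam = np.eye(3); b = np.zeros(3)                 # prior w ~ N(0, sigma2 * I)
for t in range(500):
    cov = np.linalg.inv(Lam)
    w_tilde = rng.multivariate_normal(cov @ b, sigma2 * cov)  # posterior draw
    a = int(np.argmax(arms @ w_tilde))           # best arm under the draw
    x = arms[a]
    y = x @ w_true + np.sqrt(sigma2) * rng.normal()
    Lam += np.outer(x, x); b += y * x            # conjugate linear-Gaussian update

print(np.linalg.solve(Lam, b))                   # posterior mean of w
```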

Armed bandits - Bayesian Optimization Shahriari et al. [2016] From linear bandits to Bayesian optimization: Replace X by a basis of functions Φ. Bayesian optimization generalizes linear bandits in the same way that Gaussian processes generalize Bayesian linear regression. Infinitely many + linear + correlated bandits = Bayesian optimization.

Early-stopping Bayesian optimization Swersky et al. [2014] Considerations: When looking for a good parameter set for a model, each evaluation often requires an inner optimization loop (training the model). Learning curves have a similar (monotonically decreasing) shape. Fit a meta-model to the learning curves to predict the expected performance of a parameter set. Main benefit: allows for early stopping.

Early-stopping Bayesian optimization Swersky et al. [2014] Kernel for learning curves: k(t, t') = ∫_0^∞ e^{−λt} e^{−λt'} ϕ(dλ), where ϕ is a Gamma distribution.
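With ϕ = Gamma(α, β) the integral has the closed form k(t, t') = β^α / (t + t' + β)^α (Swersky et al., 2014), which is straightforward to compute:

```python
import numpy as np

def freeze_thaw_kernel(t, t_prime, alpha=1.0, beta=0.5):
    # k(t, t') = beta^alpha / (t + t' + beta)^alpha: covariance between two
    # points of a training curve, decaying as training progresses.
    return beta ** alpha / (t + t_prime + beta) ** alpha

t = np.arange(1.0, 6.0)
K = freeze_thaw_kernel(t[:, None], t[None, :])   # Gram matrix over epochs
print(np.round(K, 3))
```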

Early-stopping Bayesian optimization Swersky et al. [2014] Non-stationary kernel as an infinite mixture of exponentially decaying basis functions. A hierarchical model is used to model the learning curves. Early stopping is possible for bad parameter sets.

Early-stopping Bayesian optimization Swersky et al. [2014] Good results compared to standard approaches. What to do if the exponential-decay assumption does not hold?

Conditional dependencies Swersky et al. [2014] Often, we search over structures with differing numbers of parameters: find the best neural network architecture. The input space then has a conditional dependency structure: X = X_1 × ... × X_d, and the value of x_j ∈ X_j depends on the value of x_i ∈ X_i.

Conditional dependencies Swersky et al. [2014]

Robotics Video

Approximate Bayesian Computation - BayesOpt Gutmann et al. [2015] Bayesian inference: p(θ | y) ∝ L(θ) p(θ). Focus on cases where: The likelihood function L(θ) is too costly to compute. It is still possible to simulate from the model.

Approximate Bayesian Computation - BayesOpt Gutmann et al. [2015] ABC idea: Identify the values of θ for which the simulated data resemble the observed data y_0. 1. Sample θ from the prior p(θ). 2. Simulate y_θ from the model. 3. Compute some distance d(y_θ, y_0) between the observed and simulated data (using sufficient statistics). 4. Retain θ if d(y_θ, y_0) ≤ ɛ.
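The four steps fit in a few lines; a toy rejection-ABC sketch in which the model, the prior and the tolerance are all invented:

```python
import numpy as np

rng = np.random.default_rng(1)
y0 = rng.normal(2.0, 1.0, size=100)          # observed data, true theta = 2

def simulate(theta, n=100):                  # the model we can sample from
    return rng.normal(theta, 1.0, size=n)

eps, accepted = 0.1, []
for _ in range(20000):
    theta = rng.normal(0.0, 5.0)             # 1. sample from the prior
    y = simulate(theta)                      # 2. simulate from the model
    d = abs(y.mean() - y0.mean())            # 3. distance via a summary statistic
    if d <= eps:                             # 4. retain if close enough
        accepted.append(theta)

print(len(accepted), np.mean(accepted))      # approximate posterior sample
```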

Approximate Bayesian Computation - BayesOpt Gutmann et al. [2015] Produce samples from the approximate posterior p(θ | y). Small ɛ: accurate samples but very inefficient (a lot of rejection). Large ɛ: less rejection but inaccurate samples. Idea: Model the discrepancy d(y_θ, y_0) with a (log) Gaussian process and use Bayesian optimization to find the regions of the parameter space where it is small. Meta-model for (θ_i, d_i), where d_i = d(y_θ^{(i)}, y_0).

Approximate Bayesian Computation - BayesOpt Gutmann et al. [2015] BayesOpt applied to minimize the discrepancy. Stochastic acquisition to encourage diversity in the points (GP-UCB + a jitter term). ABC-BO vs. a Monte Carlo (PMC) ABC approach: roughly equal results using 1000 times fewer simulations.

Synthetic gene design with Bayesian optimization Use mammalian cells to make protein products. Control the ability of the cell-factory to use synthetic DNA. Optimize genes (ATTGGTUGA...) to enable the cell-factory to operate as efficiently as possible [González et al. 2014].

Central dogma of molecular biology

Big question Remark: Natural gene sequences are not necessarily optimized to maximize protein production. ATGCTGCAGATGTGGGGGTTTGTTCTCTATCTCTTCCTGAC TTTGTTCTCTATCTCTTCCTGACTTTGTTCTCTATCTCTTC... Considerations: Different gene sequences can produce the same protein. The sequence affects the synthesis efficiency. Which is the most efficient sequence to produce a given protein?

Redundancy of the genetic code Codon: three consecutive bases (AAT, ACG, etc.). Protein: a sequence of amino acids. Different codons may encode the same amino acid: ACA = ACU, both encoding Threonine. AT UUUGACA = AT UUUGACU: synonymous sequences, same protein but different efficiency.

Redundancy of the genetic code

How to design a synthetic gene? A good model is crucial: gene sequence features → protein production efficiency. Bayesian optimization principles for gene design: do: 1. Build a GP model as an emulator of the cell behavior. 2. Obtain a set of gene design rules (features optimization). 3. Design one or many new genes coherent with the design rules. 4. Test the genes in the lab (get new data). until the gene is optimized (or the budget is over...).

Model as an emulator of the cell behavior Model inputs: features (x_i) extracted from the gene sequences (s_i): codon frequency, CAI, gene length, folding energy, etc. Model outputs: transcription and translation rates f := (f_α, f_β). Model type: multi-output Gaussian process, f ~ GP(m, K), where K is a coregionalization covariance for the two-output model (+ SE kernel with ARD). The correlation in the outputs helps!

Model as an emulator of the cell behavior

Obtaining optimal gene design rules Maximize the averaged EI [Swersky et al. 2013]: α(x) = σ̄(x)(−u Φ(−u) + φ(u)), where u = (y_max − m̄(x))/σ̄(x), m̄(x) = (1/2) Σ_{l=α,β} f_l(x), and σ̄²(x) = (1/2²) Σ_{l,l'=α,β} (K(x, x))_{l,l'}. A batch method is used when several experiments can be run in parallel.
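A direct transcription of this acquisition into code; the posterior moments at x are assumed to come from the fitted two-output GP:

```python
import numpy as np
from scipy.stats import norm

def averaged_ei(mu, cov, y_max):
    # mu: the two posterior means (f_alpha, f_beta) at x; cov: the 2x2
    # posterior covariance K(x, x) of the multi-output GP.
    m_bar = mu.mean()                   # (1/2) * sum_l f_l(x)
    s_bar = np.sqrt(cov.sum()) / 2.0    # sqrt of (1/4) * sum_{l,l'} K_{l,l'}
    u = (y_max - m_bar) / s_bar
    return s_bar * (-u * norm.cdf(-u) + norm.pdf(u))
```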

Designing new genes Simulate-and-match approach: 1. Simulate genes coherent with the target (same amino acids). 2. Extract their features. 3. Rank the synthetic genes according to their similarity to the optimal design rules. Ranking criterion: eval(s_x) = Σ_{j=1}^p w_j |x_j − x*_j|, where x*: optimal gene design rules; s, x_j: a generated synonymous sequence and its features; w_j: weights of the p features (inverse length-scales of the model covariance).
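A minimal sketch of the ranking step; the array shapes are assumptions:

```python
import numpy as np

def rank_candidates(features, x_star, w):
    # features: (n_candidates, p) feature matrix of the simulated synonymous
    # sequences; x_star: (p,) optimal design rules; w: (p,) feature weights,
    # e.g. the inverse length-scales of the ARD kernel.
    scores = (w * np.abs(features - x_star)).sum(axis=1)
    return np.argsort(scores)           # best-matching candidates first
```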

Results

Available software Spearmint (https://github.com/hips/spearmint). BayesOpt (http://rmcantin.bitbucket.org/html/). pybo (https://github.com/mwhoffman/pybo). robo (https://github.com/automl/robo).

GPyOpt Python code framework for Bayesian optimization. Developed by the group, with external contributions. Builds on top of GPy, a framework for Gaussian process modelling (any model in GPy can be imported as a surrogate model to do optimization in GPyOpt). We started developing it in June 2014.

Main features
- Models: GPs, warped GPs, RF, etc.
- Acquisitions: EI, MPI, GP-UCB
- Internal optimizers: BFGS, DIRECT, CMA-ES
- Integration over model hyperparameters
- Discrete/continuous/categorical variables
- Bandits optimization
- Parallel/batch optimization
- Arbitrary constraints
- Spearmint compatibility
- Cost functions (including evaluation time)
- Modular optimization
- Structured inputs (conditional dependencies)
- Context variables

Code sample
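The original slide shows a screenshot; a minimal GPyOpt run along those lines (the objective and domain here are illustrative) looks like:

```python
import numpy as np
import GPyOpt

def f(x):
    # GPyOpt passes a 2-D array (one row per evaluation point) and expects
    # a column of objective values back.
    return (np.sin(3 * x) + x ** 2 - 0.7 * x).sum(axis=1, keepdims=True)

domain = [{'name': 'x', 'type': 'continuous', 'domain': (-1.0, 1.5)}]

opt = GPyOpt.methods.BayesianOptimization(
    f=f, domain=domain, acquisition_type='EI')  # GP surrogate + expected improvement
opt.run_optimization(max_iter=15)

print(opt.x_opt, opt.fx_opt)                    # best location and value found
```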