What is Regularization? Example by Sean Owen

1 What is Regularization? Example by Sean Owen

2 What is Regularization?

Name      Species  Size   Threat
Bo        snake    small  friendly
Miley     dog      small  friendly
Fifi      cat      small  enemy
Muffy     cat      small  friendly
Rufus     dog      large  friendly
Jebediah  snail    small  friendly
Allison   dog      large  enemy
Tomi      cat      large  enemy

Pets with four-letter names are enemies, as are large dogs with names that begin with A; snakes are not enemies.

3 What is Regularization?

4 What is Regularization? Pets with four-letter names are enemies. Large dogs with names that begin with A are enemies. Snakes are not enemies.

5 What is Regularization? Large dogs are enemies. Cats are enemies.

6 What is Regularization? Regularization discourages complexity in the prediction logic that is learned from the training data.
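
The two rule sets from the pet example can be compared directly on the training table. The sketch below is only an illustration: the function names `complex_rule` and `simple_rule` are hypothetical, and the rules are a literal reading of slides 4 and 5. The complex rules fit the eight training pets perfectly, while the simpler rules make two mistakes on the training data; the point of regularization is that the simpler rules may still generalize better.

```r
# Training data from slide 2.
pets <- data.frame(
  name    = c("Bo", "Miley", "Fifi", "Muffy", "Rufus", "Jebediah", "Allison", "Tomi"),
  species = c("snake", "dog", "cat", "cat", "dog", "snail", "dog", "cat"),
  size    = c("small", "small", "small", "small", "large", "small", "large", "large"),
  threat  = c("friendly", "friendly", "enemy", "friendly", "friendly",
              "friendly", "enemy", "enemy"),
  stringsAsFactors = FALSE
)

# Slide 4: the complex, overfit rule set (hypothetical encoding).
complex_rule <- function(p) {
  is_enemy <- (nchar(p$name) == 4 |
                 (p$species == "dog" & p$size == "large" & startsWith(p$name, "A"))) &
    p$species != "snake"
  ifelse(is_enemy, "enemy", "friendly")
}

# Slide 5: the simpler rule set (hypothetical encoding).
simple_rule <- function(p) {
  is_enemy <- (p$species == "dog" & p$size == "large") | p$species == "cat"
  ifelse(is_enemy, "enemy", "friendly")
}

# Fraction of training pets each rule set labels correctly.
complex_acc <- mean(complex_rule(pets) == pets$threat)  # 1.00: fits every training row
simple_acc  <- mean(simple_rule(pets) == pets$threat)   # 0.75: misses Muffy and Rufus
print(c(complex = complex_acc, simple = simple_acc))
```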

7 Model Complexity $\text{Model Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$, where $\text{Bias}^2 + \text{Variance} = \text{Estimation Error}$

8 Optimum Model Complexity [Figure: error versus model complexity, with curves for $\text{Bias}^2$, Variance, and Total Error] Bias can reduce variance and lower MSE overall

9 ? Bias Variation Quiz Select the description that best suits each illustration. Put its letter in the textbox. Answers shown in the textboxes: C, A, B.
A. High bias, low variance
B. Low bias, high variance
C. Middle bias and variance, overall low estimation error

10 ? Bias Variation Quiz High bias, low variance: underfitting the training data.

11 ? Bias Variation Quiz Low bias, high variance: overfitting the training data.

12 ? Bias Variation Quiz Neither overfitting nor underfitting the training data. Small mistakes are made in predicting training data labels, but the error in predicting future data is low.

13 ? Target Quiz Determine the bias and variance case for each target. Use LB for low bias, HB for high bias, LV for low variance, and HV for high variance. For example: LBLV for low bias, low variance. Answers shown: LVLB, HVLB, LVHB, HVHB.

14 Goldilocks Principle [Figure: prediction error versus model complexity (low to high) for the training sample and the test sample, illustrating the bias trade-off. Low complexity: high bias, low variance (underfitting). High complexity: low bias, high variance (overfitting).]
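
The training and test curves described above can be sketched with a small simulation. Everything here is an assumption made for illustration: the sine-plus-noise data, the sample sizes, and the polynomial degrees 1, 3, and 9 are not from the slides. Training error can only go down as the degree grows; the test error column is where underfitting and overfitting would show up.

```r
set.seed(1)

# Hypothetical data: a smooth signal plus Gaussian noise.
make_data <- function(n) {
  x <- runif(n, 0, 1)
  data.frame(x = x, y = sin(2 * pi * x) + rnorm(n, sd = 0.3))
}
train <- make_data(30)
test  <- make_data(1000)

# Fit polynomials of increasing complexity to the training sample.
degrees <- c(1, 3, 9)
fits <- lapply(degrees, function(k) {
  lm(as.formula(sprintf("y ~ poly(x, %d)", k)), data = train)
})

# Mean squared prediction error of a fitted model on a data set.
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

train_mse <- sapply(fits, mse, d = train)
test_mse  <- sapply(fits, mse, d = test)
print(data.frame(degree = degrees, train_mse = train_mse, test_mse = test_mse))
```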

15 ? Overfitting Quiz Select the best method for addressing overfitting when there are a lot of features:
Reduce the number of model parameters (often implying fewer features and feature transformations)
Keep all features but reduce the magnitude of some features

16 MLE Maximum Likelihood Estimate (MLE) is: A technique for estimating model parameters. It answers the question: Which parameters most likely characterize the dataset?
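
As a minimal sketch of that question, assume ten hypothetical coin-flip outcomes and a Bernoulli model with unknown success probability p. Maximizing the likelihood numerically recovers the closed-form answer, the sample proportion. The data vector and the name `neg_log_lik` are made up for this illustration.

```r
# Hypothetical data: 1 = success, 0 = failure.
x <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)

# Negative log likelihood of a Bernoulli(p) model for the data.
neg_log_lik <- function(p) -sum(dbinom(x, size = 1, prob = p, log = TRUE))

# Numerical MLE: the p that minimizes the negative log likelihood.
fit <- optimize(neg_log_lik, interval = c(0.001, 0.999))
p_hat <- fit$minimum

# The closed-form Bernoulli MLE is the sample proportion.
print(c(numerical = p_hat, closed_form = mean(x)))
```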

17 MLE Traditional statistics: number of parameters << train set size (d << n). MLE performs well; asymptotic optimality kicks in. *Asymptotic optimality: as the training set size approaches infinity, the model converges to the ground truth (the nature that generates the data), and the convergence happens at the fastest possible rate.

18 MLE Traditional statistics: number of parameters << train set size (d << n). MLE performs well; asymptotic optimality kicks in. The asymptotic variance of the MLE is the best possible.

19 MLE Traditional statistics: number of parameters << train set size (d << n). MLE performs well; asymptotic optimality kicks in. The asymptotic variance of the MLE is the best possible: the inverse Fisher information.
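
The statement about the inverse Fisher information can be written out as the standard asymptotic normality result. This is a sketch under the usual regularity conditions; the symbols $\theta_0$ (true parameter), $p(x;\theta)$ (model density), and $I(\theta)$ (Fisher information of one observation) are notation assumed here, not taken from the slides.

```latex
\sqrt{n}\,\bigl(\hat\theta_n^{mle} - \theta_0\bigr)
  \;\xrightarrow{\;d\;}\; N\!\bigl(0,\; I(\theta_0)^{-1}\bigr),
\qquad
I(\theta) = -\,E_{\theta}\!\left[\nabla^2_{\theta} \log p(X;\theta)\right]
```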

20 MLE Traditional statistics: number of parameters << train set size (d << n): MLE performs well; asymptotic optimality kicks in. When d is not much smaller than n: MLE performs poorly and overfits the training data.

21 Mean Squared Error

22 Mean Squared Error $\text{MSE} = \text{Bias}^2 + \text{Variance}$. The MLE of linear regression is unbiased. The MLE may be biased in general.
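
The decomposition can be derived in one step. The notation below is assumed for this sketch: $\theta$ is the true parameter vector, $\hat\theta$ its estimator, and "Variance" means the total variance $E\|\hat\theta - E\hat\theta\|^2$.

```latex
\begin{aligned}
\mathrm{MSE}(\hat\theta)
  &= E\,\|\hat\theta - \theta\|^2
   = E\,\bigl\|(\hat\theta - E\hat\theta) + (E\hat\theta - \theta)\bigr\|^2 \\
  &= \underbrace{E\,\|\hat\theta - E\hat\theta\|^2}_{\text{Variance}}
   + \underbrace{\|E\hat\theta - \theta\|^2}_{\text{Bias}^2}
   + 2\,\underbrace{E\bigl[\hat\theta - E\hat\theta\bigr]^{T}}_{=\,0}\,(E\hat\theta - \theta)
\end{aligned}
```

The cross term vanishes because $E[\hat\theta - E\hat\theta] = 0$, which leaves $\text{MSE} = \text{Bias}^2 + \text{Variance}$.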

23 Mean Squared Error Bias can reduce variance and lower MSE overall!
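
A small worked case makes this concrete. Assume, purely for illustration, that we estimate a mean mu = 1 from n = 5 draws with standard deviation sigma = 3, and compare the sample mean with a shrunk version `a * xbar` with a = 0.8. All of these values are hypothetical. The shrunk estimator is biased, yet its MSE is lower because the variance drops by more than the squared bias rises.

```r
# Hypothetical setting: true mean, noise level, sample size, shrinkage factor.
mu <- 1; sigma <- 3; n <- 5
a <- 0.8

# Sample mean (the MLE): unbiased, so MSE = variance = sigma^2 / n.
mse_mle <- sigma^2 / n                  # 1.8

# Shrunk estimator a * xbar: bias (a - 1) * mu, variance a^2 * sigma^2 / n.
bias2      <- ((a - 1) * mu)^2          # 0.04
variance   <- a^2 * sigma^2 / n         # 1.152
mse_shrunk <- bias2 + variance          # 1.192

# Monte Carlo check of the two analytic values.
set.seed(42)
xbar <- replicate(20000, mean(rnorm(n, mean = mu, sd = sigma)))
print(c(mle_analytic = mse_mle, mle_simulated = mean((xbar - mu)^2),
        shrunk_analytic = mse_shrunk, shrunk_simulated = mean((a * xbar - mu)^2)))
```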

24 Mean Squared Error Bias toward simplicity can reduce variance and lower MSE overall.

25 Mean Squared Error Simplicity bias => lower estimation variance => better overall estimation accuracy and less overfitting of training data

26 Regularization Methods: James-Stein Shrinkage, Lasso Estimator, Breiman's Garrote, Elastic Net Estimator, Ridge Estimator

27 James-Stein Shrinkage Traditional statistical theory: no other estimation rule for means is uniformly better than the observed average. James-Stein estimator: when estimating multiple means, shrink all individual averages toward a grand average.

28 James-Stein Shrinkage The James-Stein estimator shrinks all the dimensions of the estimate uniformly.
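
A minimal sketch of shrinkage toward the grand average follows. It uses the positive-part variant of the form commonly attributed to Efron and Morris, and it assumes each observed average has the same known variance `sigma2`; that assumption, the function name `james_stein`, and the data are hypothetical and not taken from the slides. Note that a single shrink factor is applied to every dimension, which is the uniform shrinkage described above.

```r
# Positive-part James-Stein shrinkage toward the grand average (a sketch).
# x:      vector of observed averages, one per quantity being estimated
# sigma2: assumed known, common variance of each observed average
james_stein <- function(x, sigma2) {
  d <- length(x)
  stopifnot(d >= 4)            # this form needs at least four means
  grand <- mean(x)
  dev <- x - grand
  # One shrink factor for all dimensions, truncated at 0 (positive part).
  shrink <- max(0, 1 - (d - 3) * sigma2 / sum(dev^2))
  grand + shrink * dev
}

# Hypothetical observed averages for six unrelated quantities.
x <- c(0.40, 0.35, 0.22, 0.31, 0.27, 0.18)
js <- james_stein(x, sigma2 = 0.005)
print(rbind(observed = x, james_stein = js))
```

Every estimate moves toward the grand average by the same proportion, so the grand average itself is unchanged and the spread around it shrinks.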

29 James-Stein Shrinkage??? What is the proportion of:
people that will vote for Hillary Clinton?
babies in China that are girls?
Americans that have light-colored eyes?
What kind of study is this?

30 James-Stein Shrinkage!!! The James-Stein estimate of the proportion of voters for Hillary Clinton depends on the Chinese baby data and the eye color data!

31 James-Stein Shrinkage [Figure labels: Estimation Establishment; James-Stein Estimator]

32 James-Stein Shrinkage [Figure labels: Estimation; James-Stein Estimator]

33 James-Stein Shrinkage [Figure label: Establishment]

34 James-Stein Shrinkage The James-Stein estimator shrinks all the dimensions of the estimate uniformly.

35 ? James-Stein Estimator Quiz

36 ? James-Stein Weakness Quiz What is a weakness of the James-Stein Estimator?

37 ? James-Stein Weakness Quiz

38 ? James-Stein Weakness Quiz

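39 Breiman's Garrote $\hat\theta_j^{garrote} = \tau_j \hat\theta_j^{mle}$, where $\tau = \arg\max_\tau \ell(\tau \odot \hat\theta^{mle})$ subject to $\sum_{j=1}^d |\tau_j| \le c$ ($\ell$ is the log likelihood and $\odot$ is the componentwise product). When $c < d$, some or all of the components of $\hat\theta^{mle}$ are reduced in absolute value toward 0. When we constrain $\tau_j \ge 0$, it is called the non-negative garrote. The parameter c is a tuning parameter, and several different values are typically considered.

40 Breiman's Garrote Breiman's Garrote: Can shrink some dimensions more than others as needed. Computation can be expensive in high dimensions.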

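41 Ridge Regression $\hat\theta^{ridge} = \arg\max_\theta \ell(\theta)$ subject to $\|\theta\|_2^2 \le c$, where $\|\theta\|_2^2 = \sum_{j=1}^d \theta_j^2$ and $\ell$ is the log likelihood

42 Ridge Regression Equivalent to maximizing the penalized log likelihood $\ell(\theta) - \lambda\|\theta\|_2^2$

43 Ridge Regression Equivalence can be seen using Lagrange multipliers: maximize $\ell(\theta)$ subject to $\|\theta\|_2^2 \le c$ => maximize $\ell(\theta) - \lambda(\|\theta\|_2^2 - c)$

44 Ridge Regression Maximize $\ell(\theta)$ subject to $\|\theta\|_2^2 \le c$ => maximize $\ell(\theta) - \lambda(\|\theta\|_2^2 - c)$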

45 Ridge Regression Solve for different values of lambda, evaluate each solution on a hold-out set, and select the model that performs best
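
The selection procedure can be sketched in base R. This is an illustration under stated assumptions, not the slides' code: the simulated data, the 80/40 train and hold-out split, the lambda grid, and the helper `ridge_fit` are all hypothetical, and the model has no intercept. The ridge solution used is the closed form $(X^T X + \lambda I)^{-1} X^T Y$.

```r
set.seed(7)

# Hypothetical data: 8 predictors, only the first three matter.
n <- 120; d <- 8
X <- matrix(rnorm(n * d), n, d)
theta_true <- c(2, -1, 0.5, rep(0, d - 3))
y <- as.vector(X %*% theta_true + rnorm(n))

# Split into a training set and a hold-out set.
train_idx <- 1:80
X_tr <- X[train_idx, ];  y_tr <- y[train_idx]
X_ho <- X[-train_idx, ]; y_ho <- y[-train_idx]

# Closed-form ridge estimate for a given lambda (no intercept).
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# Solve for each lambda on the training set, score on the hold-out set.
lambdas <- c(0, 0.01, 0.1, 1, 10, 100)
holdout_mse <- sapply(lambdas, function(l) {
  theta <- ridge_fit(X_tr, y_tr, l)
  mean((y_ho - X_ho %*% theta)^2)
})

# Select the lambda whose solution performs best on the hold-out set.
best_lambda <- lambdas[which.min(holdout_mse)]
print(data.frame(lambda = lambdas, holdout_mse = holdout_mse))
print(best_lambda)
```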

46 Ridge Trace [Figure: ridge trace for x4]

47 ? Estimator Comparison Quiz
Model: Y = 0 + X1 + X2 + e; e ~ N(0,1)
The measured variables are: x1, x2, x3
x1 and x2 are U(0,1)
x3 = 10 * x1 + unif(0,1)
corr(x1, x3) = sqrt(100/101) = 0.995
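
The quiz setup can be simulated as follows. This is a sketch: the sample size `n = 100` and the seed are assumptions, since the slides do not state them, and the correlated third variable is named `x3c` to match the `lm` call used in the quiz.

```r
set.seed(123)
n <- 100                      # assumed sample size (not given in the slides)

x1 <- runif(n)                # x1 ~ U(0,1)
x2 <- runif(n)                # x2 ~ U(0,1)
x3c <- 10 * x1 + runif(n)     # x3 is almost a multiple of x1
y <- 0 + x1 + x2 + rnorm(n)   # true model uses only x1 and x2; e ~ N(0,1)

# Sample correlation; the theoretical value is sqrt(100/101), about 0.995.
r13 <- cor(x1, x3c)
print(c(sample = r13, theory = sqrt(100 / 101)))
```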

48 ? Estimator Comparison Quiz
    # OLS fit of 3-variable model using correlated x3.
    olsc <- lm(y ~ x1 + x2 + x3c)
    summary(olsc)

49 ? Estimator Comparison Quiz

50 ? Ridge Regression Quiz Perform ridge regression for both independent and correlated variables.

51 ? Ridge Regression Quiz
    library(MASS)  # provides lm.ridge
    ridgec <- lm.ridge(y ~ x1 + x2 + x3c, lambda = seq(0, .1, .001))

52 ? Ridge Regression Quiz

53 ? Ridge Regression Quiz

54 Lasso Estimator LASSO = least absolute shrinkage and selection operator

Model  Characteristic
Ridge  Must include all or none of the coefficients
Lasso  Does parameter estimation and variable selection

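55 Lasso Estimator
Ridge: $\hat\theta^{ridge} = \arg\max_\theta \ell(\theta)$ subject to $\|\theta\|_2^2 \le c$
Lasso: $\hat\theta^{lasso} = \arg\max_\theta \ell(\theta)$ subject to $\|\theta\|_1 \le c$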

56 Lasso Estimator

Penalty  Model  Result
l1       lasso  Encourages a sparse solution. For a given penalty level, some of the components of the lasso estimate will be zero.
l2       ridge  Does not encourage sparsity. The coefficients of the ridge estimate may be close to zero but not precisely zero.

57 ? Lasso Ridge Quiz In each illustration the colored lines are paths of regression coefficients shrinking to zero. Label each as either lasso or ridge estimation: Lasso Ridge Regression

58 Linear Regression The methods in this lesson apply generally to maximum likelihood problems. The linear regression setting enables derivation of closed forms which provide insight into the differences between different regularizers. However, ridge and lasso can also be applied for other models like logistic regression (and they are very successfully applied).

59 Linear Regression Assume: X: a matrix whose rows are the training data X(1),..., X(n). Y: a vector whose entries are the training labels Y(1),..., Y(n). Then the conditional log likelihood is: $-(Y - X\theta)^T (Y - X\theta) + c$. Setting its gradient to zero (and dividing by 2): $-X^T X\theta + X^T Y = 0$. Implies: $\hat\theta^{mle} = (X^T X)^{-1} X^T Y$

60 Linear Regression Special case of orthonormal training examples: $X^T X = I$, so $\hat\theta^{mle} = X^T Y$ (the coefficients of the orthogonal projection of Y onto the columns of X).
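
The closed form $(X^T X)^{-1} X^T Y$ can be checked numerically against R's own least-squares fit, and the orthonormal special case can be checked the same way. The simulated data below is hypothetical; `qr.Q(qr(X))` is used only as a convenient way to obtain a matrix with orthonormal columns.

```r
set.seed(1)

# Hypothetical design matrix (with an intercept column) and labels.
n <- 50
X <- cbind(1, runif(n), runif(n))
Y <- as.vector(X %*% c(0, 1, 1) + rnorm(n))

# Closed-form MLE: solve the normal equations X'X theta = X'Y.
theta_mle <- as.vector(solve(crossprod(X), crossprod(X, Y)))

# Same fit through lm(); "- 1" because X already holds the intercept column.
theta_lm <- unname(coef(lm(Y ~ X - 1)))

# Orthonormal case: Q'Q = I, so the MLE reduces to Q'Y.
Q <- qr.Q(qr(X))
theta_orth_general <- as.vector(solve(crossprod(Q), crossprod(Q, Y)))
theta_orth_short   <- as.vector(crossprod(Q, Y))

print(rbind(closed_form = theta_mle, lm = theta_lm))
```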

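61 Linear Regression In the case of ridge regularization, the penalized log likelihood is $-(Y - X\theta)^T (Y - X\theta) - \lambda\theta^T\theta$. Setting the penalized log likelihood gradient to zero (and dividing by 2): $-X^T X\theta + X^T Y - \lambda\theta = 0$, which implies $\hat\theta^{ridge} = (X^T X + \lambda I)^{-1} X^T Y$

62 Linear Regression In the orthonormal case ($X^T X = I$): $\hat\theta^{ridge} = \hat\theta^{mle} / (1 + \lambda)$. Each component is shrunk by the same constant factor.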
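
The constant shrink factor can be verified numerically. The sketch assumes the ridge penalized log likelihood $-(Y - X\theta)^T (Y - X\theta) - \lambda\theta^T\theta$, whose maximizer is $(X^T X + \lambda I)^{-1} X^T Y$; with orthonormal columns this should equal the MLE divided by $1 + \lambda$. The data, dimensions, and `lambda = 0.5` are hypothetical.

```r
set.seed(2)

# Hypothetical design with orthonormal columns (Q'Q = I) and labels.
Q <- qr.Q(qr(matrix(rnorm(40 * 4), 40, 4)))
Y <- rnorm(40)
lambda <- 0.5

# MLE in the orthonormal case: Q'Y.
theta_mle <- as.vector(crossprod(Q, Y))

# Ridge estimate from the general closed form.
theta_ridge <- as.vector(solve(crossprod(Q) + lambda * diag(4), crossprod(Q, Y)))

# Every component is the MLE component divided by (1 + lambda).
print(rbind(mle = theta_mle, ridge = theta_ridge, ratio = theta_ridge / theta_mle))
```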

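63 Linear Regression In the case of lasso regression: $\hat\theta^{lasso} = \arg\max_\theta \ell(\theta)$ subject to $\|\theta\|_1 \le c$, where $\|\theta\|_1 = \sum_{j=1}^d |\theta_j|$ and $\ell$ is the log likelihood

64 Linear Regression In the case of lasso regression, orthonormal case: The penalized log likelihood is: $-(Y - X\theta)^T (Y - X\theta) - \lambda\|\theta\|_1$

65 Linear Regression In the case of lasso regression, orthonormal case: The penalized log likelihood is: $-(Y - X\theta)^T (Y - X\theta) - \lambda\|\theta\|_1 = -Y^T Y + 2\theta^T X^T Y - \theta^T\theta - \lambda\|\theta\|_1$ (using $X^T X = I$)

66 Linear Regression In the case of lasso regression, orthonormal case: The penalized log likelihood is: $-Y^T Y + \sum_{j=1}^d \left( 2\hat\theta_j^{mle}\theta_j - \theta_j^2 - \lambda|\theta_j| \right)$, with $\hat\theta_j^{mle} = (X^T Y)_j$. Which decomposes into a sum of d maximization problems, $\max_{\theta_j} \; 2\hat\theta_j^{mle}\theta_j - \theta_j^2 - \lambda|\theta_j|$, that can be solved independently. For each j, if $\hat\theta_j^{mle} > 0$ the objective function will be negative unless $\theta_j \ge 0$ (and symmetrically for $\hat\theta_j^{mle} < 0$); if $\theta_j = 0$ then the objective function = 0

67 Linear Regression In the case of lasso regression: If $|\hat\theta_j^{mle}| \le \lambda/2$, setting $\theta_j = 0$ results in a 0 objective function, which is the maximum, because every other value makes the objective negative

68 Linear Regression In the case of lasso regression: If $|\hat\theta_j^{mle}| > \lambda/2$, then: The maximizer is nonzero, so the objective function is differentiable there. Set the derivative to 0: $2\hat\theta_j^{mle} - 2\theta_j - \lambda\,\mathrm{sign}(\theta_j) = 0$. To obtain: $\theta_j = \hat\theta_j^{mle} - \frac{\lambda}{2}\,\mathrm{sign}(\hat\theta_j^{mle})$

69 Linear Regression In the case of lasso regression: Combining the two cases we get: $\hat\theta_j^{lasso} = \mathrm{sign}(\hat\theta_j^{mle}) \left( |\hat\theta_j^{mle}| - \lambda/2 \right)_+$, where $(x)_+ = \max(x, 0)$. Therefore: the lasso has a sparsity-encouraging threshold nature, as it zeros out small coefficients. Recall: ridge will make them smaller, but non-zero.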
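
The thresholding behavior can be sketched and checked numerically. The code assumes the orthonormal case with penalized log likelihood $-(Y - X\theta)^T (Y - X\theta) - \lambda\|\theta\|_1$, for which each component maximizes $2\hat\theta_j^{mle}\theta_j - \theta_j^2 - \lambda|\theta_j|$ and the threshold works out to $\lambda/2$; a different scaling of the likelihood would change that constant. The function names and the example coefficients are hypothetical.

```r
# Closed-form lasso solution in the orthonormal case (soft thresholding).
soft_threshold <- function(theta_mle, lambda) {
  sign(theta_mle) * pmax(abs(theta_mle) - lambda / 2, 0)
}

# Numerical check: maximize the per-component objective directly.
numeric_argmax <- function(m, lambda) {
  obj <- function(t) 2 * m * t - t^2 - lambda * abs(t)
  optimize(obj, interval = c(-10, 10), maximum = TRUE, tol = 1e-8)$maximum
}

theta_mle <- c(-2, -0.3, 0.1, 0.4, 1.5)   # hypothetical MLE components
lambda <- 1

theta_lasso   <- soft_threshold(theta_mle, lambda)   # -1.5  0  0  0  1.0
theta_numeric <- sapply(theta_mle, numeric_argmax, lambda = lambda)

# Ridge with penalty lambda * theta'theta shrinks but never reaches zero.
theta_ridge <- theta_mle / (1 + lambda)

print(rbind(mle = theta_mle, lasso = theta_lasso,
            lasso_numeric = round(theta_numeric, 4), ridge = theta_ridge))
```

The three small coefficients are set exactly to zero by the lasso, while ridge only halves them.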

70 Lasso Plots
    install.packages("glmnet")
    library(glmnet)

71 ? Uncorrelated Data Plot The predictors go into the model in the order of their magnitude of the true linear regression coefficient.

72 ? Mtcars and lasso Quiz In the programming node, write the commands needed to use the lasso model on the mtcars dataset. Your plot should be similar to mine.

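73 ? Mtcars and lasso Quiz
    install.packages("glmnet")  # only needed once
    library(glmnet)
    # Lasso path: the response is mpg (column 1); the other 10 columns are predictors.
    mtcarslasso = glmnet(as.matrix(mtcars[-1]), mtcars[, 1])
    plot(mtcarslasso, col = 1:10, lwd = 2)
    legend('bottomleft', legend = names(mtcars)[-1], col = 1:10, lty = 1, lwd = 1)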

74 MLE Comparisons

75 Elastic Net Estimator Linear regression model trained with both L1 and L2 penalties. Very common in high-dimensional modeling in industry.

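76 Elastic Net Estimator $\hat\theta^{elastic} = \arg\max_\theta \ell(\theta)$ subject to $\alpha\|\theta\|_1 + (1 - \alpha)\|\theta\|_2^2 \le c$, where $0 \le \alpha \le 1$ mixes the two penalties and $\ell$ is the log likelihood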
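
An elastic net fit can be sketched with the same `glmnet` call used for the lasso quiz. In `glmnet`, the `alpha` argument mixes the penalties: `alpha = 1` is the lasso and `alpha = 0` is ridge. The value `alpha = 0.5` below is an arbitrary choice for illustration, and the package's internal scaling of the penalty may differ from the constrained form written on the slide.

```r
library(glmnet)

# Same data as the lasso quiz: mpg is the response, the rest are predictors.
x <- as.matrix(mtcars[-1])
y <- mtcars[, 1]

lasso_fit <- glmnet(x, y, alpha = 1)     # pure L1 penalty
enet_fit  <- glmnet(x, y, alpha = 0.5)   # equal mix of L1 and L2 (arbitrary choice)

# Coefficient paths side by side.
par(mfrow = c(1, 2))
plot(lasso_fit, lwd = 2); title("lasso", line = 2.5)
plot(enet_fit, lwd = 2);  title("elastic net", line = 2.5)
```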

77 ? Elastic Net Graph [Figures: the lasso plot; the elastic net plot]
