UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014


Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e., 4 sides); no other materials are allowed. Time: 2 hours. Be sure to write your name and Penn student ID (the 8 bigger digits on your ID card) on the bubble form and fill in the associated bubbles in pencil. If you are taking this as a WPE, then enter only your WPE exam number. If you think a question is ambiguous, mark what you think is the best answer. The questions seek to test your general understanding; they are not intentionally trick questions. As always, we will consider written regrade requests if your interpretation of a question differed from what we intended. We will only grade the scantron forms. For the TRUE or FALSE questions, note that TRUE is (a) and FALSE is (b). For the multiple choice questions, select exactly one answer. The exam is 9 pages long and has 78 questions.

Name:

1. [0 points] This is version A of the exam. Please fill in the bubble for that letter.

2. [1 point] True or False? Under the usual assumptions, ridge regression is consistent.

3. [1 point] True or False? Under the usual assumptions, ridge regression is unbiased.

4. [1 point] True or False? Stepwise regression finds the global optimum minimizing its loss function (squared error plus the usual L0 penalty).

5. [1 point] True or False? k-means clustering finds the global optimum minimizing its loss function.

6. [2 points] When doing linear regression with n = 10,000 observations and p = 1,000,000 features, if one expects around 500 or 1,000 features to enter the model, the best penalty to use is
    (a) AIC penalty  (b) BIC penalty  (c) RIC penalty  (d) This problem is hopeless; you couldn't possibly find a model that reliably beats just using a constant.

7. [2 points] When doing linear regression, if we expect a very small fraction of the features to enter the model, we should use an
    (a) L0 penalty  (b) L1 penalty  (c) L2 penalty

8. [1 point] True or False? In general, in machine learning, we prefer to use unbiased algorithms due to their better accuracy.

9. [1 point] True or False? L1-penalized regression (Lasso) solves a convex optimization problem.

10. [2 points] Which of the following loss functions is least sensitive to outliers?
    (a) Hinge loss  (b) L1 loss  (c) Squared (L2) loss  (d) Exponential loss

11. [1 point] True or False? For small training sets, Naive Bayes is generally more accurate than logistic regression.

12. [1 point] True or False? Naive Bayes, as used in practice, is generally a MAP algorithm.

13. [1 point] True or False? One can make a good argument that minimizing an L1 loss in regression gives better results than the more traditional L2 loss minimized by ordinary least squares.
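For reference on questions 7 and 9 above, here is a minimal scikit-learn sketch (not part of the original exam; the synthetic data and penalty strengths are arbitrary choices) illustrating the qualitative difference between the penalties: the L1 (Lasso) penalty solves a convex problem and drives many coefficients exactly to zero, while the L2 (ridge) penalty only shrinks them.

```python
# Illustrative sketch only; data and penalty strengths are arbitrary.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:5] = 3.0                      # only 5 of the 50 features matter
y = X @ true_w + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: convex, sets many weights exactly to 0
ridge = Ridge(alpha=0.5).fit(X, y)    # L2 penalty: convex, shrinks weights toward 0

print("nonzero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
print("nonzero Ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
```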

14. [1 point] True or False? Linear SVMs tend to be slower, but more accurate, than logistic regression.

15. [2 points] You estimate a ridge regression model with some data taken from your robot, and find (using cross-validation) an optimal ridge penalty λ1. You then buy a new sensor which has noise with 1/4 the variance (half the standard deviation) of the old one. Using the same number of observations as before, you collect new data and find a new optimal ridge penalty λ2. Which of the following will be closest to true?
    (a) λ1/λ2 = 1/4  (b) λ1/λ2 = 1/2  (c) λ1/λ2 = 1  (d) λ1/λ2 = 2  (e) λ1/λ2 = 4

16. [1 point] True or False? BIC can be viewed as an MDL method.

17. [1 point] True or False? If you expect half of the features to enter a model, and have n >> p, BIC is a better penalty to use than RIC.

18. [1 point] True or False? The elastic net tends to select fewer features than well-optimized L0 penalty methods.

19. [1 point] True or False? The elastic net generally gives at least as good a model (in terms of test error) as Lasso.

20. [1 point] True or False? The appropriate penalty in L0-penalized linear regression can be determined by theory, e.g. using an MDL approach.

21. [1 point] True or False? Ridge regression is scale invariant in the sense that test set prediction accuracy is unchanged if one rescales the features x.

22. [1 point] True or False? The fact that a coefficient in a penalized linear regression is kept or killed (removed from the model) is generally a good indicator of the importance of the corresponding feature; features that are highly correlated with y will be kept and those with low correlation will be dropped.

Consider an L0-penalized linear regression with three different penalties. The resulting models have the following properties:

              bits to code residual   bits to code the model
    model 1   400                     300
    model 2   300                     400
    model 3   320                     350

23. [1 point] True or False? Model 1 is overfitting.

24. [1 point] True or False? Model 2 is overfitting.
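Questions 23 and 24 compare models by total description length. As a quick arithmetic check (an illustrative sketch, not part of the exam): total cost = bits to code the residual + bits to code the model.

```python
# Total MDL cost = bits to code the residual + bits to code the model.
models = {"model 1": (400, 300), "model 2": (300, 400), "model 3": (320, 350)}
for name, (residual_bits, model_bits) in models.items():
    print(name, "total bits:", residual_bits + model_bits)
# model 1: 700, model 2: 700, model 3: 670
```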

25. [1 point] True or False? The KL divergence between a true distribution (p(a) = 0.5, p(b) = 0.5, p(c) = 0) and an approximating distribution (q(a) = 0.3, q(b) = 0.3, q(c) = 0.4) will be infinite.

26. [1 point] True or False? Decision trees select features to add based on their expected information gain. This has the effect that features which take on many possible values tend to be preferentially added, since they tend to have higher information gain.

27. [1 point] True or False? Boosting can be shown to optimize weights for an exponential loss function.

28. [1 point] True or False? Key to the boosting algorithm is the fact that at each iteration more weight is given to points that were misclassified. This often enables test set accuracy to continue to improve even after training set error goes to zero.

29. [1 point] True or False? Perceptrons are a form of stagewise regression.

30. [1 point] True or False? Voted perceptrons are generally more accurate than regular ("simple") perceptrons.

31. [1 point] True or False? Voted perceptrons are generally faster (at test time) than averaged perceptrons.

32. [1 point] True or False? Perceptrons (approximately) optimize a hinge loss.

33. [1 point] True or False? For a linearly separable problem, standard perceptrons are guaranteed to find a linearly separating hyperplane if there are no repeated x's with inconsistent labels.

34. [1 point] Which of the following classifiers has the lowest 0-1 error (L0 loss) given a training set with an infinite number of observations?
    (a) Logistic regression  (b) Naive Bayes

35. [1 point] True or False? If we consider a linear SVM as a kernel SVM, then the kernel function is the inner product between the x's.

36. [1 point] True or False? Radial Basis Functions (RBFs) can be used either to reduce or to increase the effective dimensionality p of a regression problem.

37. [1 point] True or False? Any SVM problem can be made linearly separable with the right selection of a kernel function.

38. [1 point] True or False? The number of support vectors found by an SVM depends upon the size of the penalty on the slack variables.

39. [0 points] True or False? Because SVMs already seek large margin solutions, they do not, in general, require inclusion of a separate regularization penalty. (This was badly phrased and was thrown out.)

40. [1 point] True or False? An SVM with a Gaussian kernel, exp(-||x - y||^2 / C), will have a lower expected variance when C = 1 than when C = 10.
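Question 25 above hinges on when KL(p || q) blows up: it is infinite only if p puts mass somewhere that q does not. A minimal worked sketch (not part of the exam; the usual convention that terms with p(x) = 0 contribute nothing is assumed):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x)/q(x)); terms with p(x) = 0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf                      # p puts mass where q has none
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5, 0.0]         # true distribution over {a, b, c}
q = [0.3, 0.3, 0.4]         # approximating distribution
print(kl_divergence(p, q))  # finite, since q > 0 wherever p > 0
```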

41. [2 points] In the figure below, which points are support vectors? A, B, and D are in class 1; C and E are in class 2.

    [Figure: five labeled points A, B, C, D, E plotted in the (x1, x2) plane.]

    (a) A, C  (b) A, C, D  (c) Not enough information was provided to tell.

42. [2 points] The primal problem for non-negative weighted regression is:

    min_w  sum_i (y_i - w^T x_i)^2   s.t.  w_j >= 0 for j = 1...p

    The dual problem is to solve:
    (a) max_λ  sum_i (y_i - w^T x_i)^2 + sum_j λ_j w_j   s.t. λ_j >= 0
    (b) max_λ  sum_i (y_i - w^T x_i)^2 + λ sum_j w_j     s.t. λ >= 0
    (c) max_λ  sum_i (y_i - w^T x_i)^2 - sum_j λ_j w_j   s.t. λ_j >= 0
    (d) min_λ  sum_i (y_i - w^T x_i)^2 + sum_j λ_j w_j   s.t. λ_j >= 0
    (e) min_λ  sum_i (y_i - w^T x_i)^2 - sum_j λ_j w_j   s.t. λ_j >= 0

43. [1 point] True or False? For the above optimization problem, the constraint corresponding to each weight is binding if and only if the weight is zero.

44. [2 points] Which of the following methods cannot be kernelized?
    (a) k-NN  (b) linear regression  (c) perceptrons  (d) PCA  (e) All of the above methods can be kernelized.

45. [1 point] True or False? Any function φ(x) can be used to generate a kernel using k(x, y) = φ(x)^T φ(y).

46. [1 point] True or False? If there exists a pair of points x and y such that k(x, y) < 0, then k() cannot be a kernel.
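Questions 45-48 concern what makes a valid kernel. A small sketch (not part of the exam; the feature map is an arbitrary example): any feature map φ yields a Gram matrix K[i, j] = φ(x_i)^T φ(x_j) that is symmetric and positive semidefinite, even though individual entries of K may be negative.

```python
import numpy as np

def gram_matrix(phi, xs):
    """Gram matrix K[i, j] = phi(x_i)^T phi(x_j) for a feature map phi."""
    feats = np.array([phi(x) for x in xs])
    return feats @ feats.T

phi = lambda x: np.array([x, x ** 2])           # an arbitrary feature map
xs = np.array([-2.0, -0.5, 1.0, 3.0])
K = gram_matrix(phi, xs)

print("symmetric:", np.allclose(K, K.T))
print("eigenvalues >= 0:", np.all(np.linalg.eigvalsh(K) >= -1e-9))
print("has negative entries:", np.any(K < 0))   # negative entries can occur in a valid kernel matrix
```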

47. [1 point] True or False? All entries in a kernel matrix must be non-negative.

48. [1 point] True or False? A kernel matrix must be symmetric.

49. [1 point] True or False? Any norm ||x|| can be used to define a distance by defining d(x, y) = ||x - y||.

50. [1 point] True or False? k(x, y) = exp(-||x - y||_2^2) is a legitimate kernel function.

51. [2 points] The number of parameters needed to specify a Gaussian Mixture Model with 4 clusters, data of dimension 3, and a single (full) covariance matrix shared across all 4 clusters is:
    (a) fewer than 16  (b) between 16 and 20 (inclusive)  (c) 21 or 22  (d) 23  (e) 24 or more

52. [1 point] True or False? EM is a search algorithm for finding maximum likelihood (or sometimes MAP) estimates. Thus, it can, in theory, be replaced by another search algorithm that maximizes the same likelihood function.

53. [1 point] True or False? The L2 reconstruction error from using k-component PCA to approximate a set of observations X can be characterized in terms of the k largest eigenvalues of X^T X.

54. [1 point] True or False? A positive definite symmetric real square matrix has only positive eigenvalues.

55. [1 point] True or False? For real-world data sets, the principal components of a matrix X are preferably found using the SVD of X rather than by actually finding the eigenvectors of X^T X.

56. [1 point] True or False? The singular values of X are equal to the eigenvalues of X^T X.

57. [1 point] True or False? For an N x P matrix X (N > P), the k largest right singular vectors will be the same as the loadings of X.

58. [1 point] True or False? Principal component regression (PCR) in effect does regularization, and thus offers a partial, if not exact, replacement for ridge regression.

59. [1 point] True or False? Principal component regression (PCR), like linear regression, is scale invariant.

60. [2 points] The dominant cost of linear regression, when n >> p, scales as
    (a) np  (b) np^2  (c) n^2 p  (d) p^3
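Question 51 above asks for a parameter count. Under the usual convention (one d-dimensional mean per cluster, one shared full covariance, and mixing weights that sum to 1), a quick counting sketch (not part of the exam):

```python
# Parameter count for a GMM with k clusters in d dimensions.
def gmm_param_count(k, d, shared_covariance=True):
    means = k * d                          # one d-dimensional mean per cluster
    cov = d * (d + 1) // 2                 # a symmetric full d x d covariance
    covs = cov if shared_covariance else k * cov
    weights = k - 1                        # mixing weights constrained to sum to 1
    return means + covs + weights

print(gmm_param_count(k=4, d=3))           # 12 + 6 + 3 = 21
```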

61. [1 point] True or False? Deep neural networks are almost always supervised learning methods.

62. [1 point] True or False? Deep neural networks most often use an architecture in which all nodes on each layer are connected to all nodes on the following layer, but have no connections to other, deeper layers.

63. [1 point] True or False? Deep neural networks currently hold the records for best machine learning performance in problems ranging from speech and vision to natural language processing and brain image modeling (MRI).

Consider the following confusion matrix:

                              correct answer
                              True      False
    predicted     True        8         2
    answer        False       12        11

64. [1 point] For the above confusion matrix the precision is
    (a) 2/10  (b) 8/20  (c) 19/33  (d) none of the above

65. [1 point] For the above confusion matrix the recall is
    (a) 2/10  (b) 8/20  (c) 19/33  (d) none of the above

66. [1 point] True or False? L2 loss (sometimes with a regularization penalty) is widely used because it usually reflects the actual loss function for applications in business and science.

67. [1 point] True or False? One can compute a description length (for the model plus residual) for a belief net and the data it represents, and a causal belief net is likely to have a shorter description length than one that just captures the conditional independence structure of the same data.
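Questions 64 and 65 apply the standard definitions of precision and recall to the confusion matrix above. A minimal worked sketch (not part of the exam):

```python
# Precision and recall from the confusion matrix above (standard definitions).
tp, fp = 8, 2     # predicted True:  8 actually True, 2 actually False
fn, tn = 12, 11   # predicted False: 12 actually True, 11 actually False

precision = tp / (tp + fp)   # fraction of predicted-True cases that are correct
recall    = tp / (tp + fn)   # fraction of actually-True cases that are found
print(f"precision = {tp}/{tp + fp} = {precision:.2f}")
print(f"recall    = {tp}/{tp + fn} = {recall:.2f}")
```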

The following questions refer to the figure below; (X ⊥ Y | Z) means "X is conditionally independent of Y given Z."

[Figure: a belief network over the variables B, C, D, E, F, G, I, J, K, L; the edge structure did not survive transcription.]

68. [1 point] True or False? (B ⊥ C | D)

69. [1 point] True or False? (J ⊥ K | G)

70. [1 point] True or False? (J ⊥ K | L)

71. [1 point] True or False? (B ⊥ J | F, G)

72. [1 point] True or False? (F ⊥ I | G, L)

73. [1 point] True or False? G d-separates D and J.

74. [2 points] What is the minimum number of parameters needed to represent the full joint distribution P(B, C, D, E, F, G, I, J, K, L) in the network, given that all variables are binary? Hint: "parameters" here refers to the value of each probability. For example, we need 1 parameter for P(X) and 3 parameters for P(X, Y) if X and Y are binary.
    (a) fewer than 20  (b) 20-30  (c) 31-99  (d) more than 100

75. [1 point] True or False? When building a belief net from a set of observations where A, B are Boolean variables, if P(A = True | B = True) = P(A = True), then we know that there should not be a link from A to B.

76. [1 point] True or False? When building a belief net from a set of observations where A, B are Boolean variables, if P(A = True | B = True) = P(A = True), then we know that there should not be a link from B to A.

77. [1 point] True or False? The emission matrix in an HMM represents the probability of the state, given an observation.

For the next question, consider the Bayes net below, with parameter values labeled. This is an instance of an HMM (similar to homework 8).

P(X1 = a) = 0.3
P(X2 = a | X1 = a) = 0.5
P(X2 = a | X1 = b) = 0.2
P(Oi = 0 | Xi = a) = 0.3
P(Oi = 0 | Xi = b) = 0.6

[Figure: the chain X1 -> X2 with observations O1 and O2, i.e. edges X1 -> O1 and X2 -> O2.]

78. [3 points] Suppose you have the observation sequence O1 = 1, O2 = 0. What is the prediction of Viterbi decoding? (Maximize P(X1, X2 | O1 = 1, O2 = 0).)
    (a) X1 = a, X2 = a  (b) X1 = a, X2 = b  (c) X1 = b, X2 = a  (d) X1 = b, X2 = b
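With only two hidden states and two time steps, Viterbi decoding for question 78 reduces to comparing the joint probabilities of the four possible state paths. A brute-force sketch (not part of the exam) using the parameters above; the complementary probabilities such as P(X1 = b) = 0.7 and P(X2 = b | X1 = a) = 0.5 are derived from the given ones.

```python
from itertools import product

# Parameters from the question (complements filled in).
p_x1 = {"a": 0.3, "b": 0.7}
p_x2_given_x1 = {("a", "a"): 0.5, ("b", "a"): 0.5,   # key: (x2, x1)
                 ("a", "b"): 0.2, ("b", "b"): 0.8}
p_o_given_x = {(0, "a"): 0.3, (1, "a"): 0.7,         # key: (o, x)
               (0, "b"): 0.6, (1, "b"): 0.4}

o1, o2 = 1, 0

def joint(path):
    """P(X1, X2, O1 = o1, O2 = o2) for a candidate state path (X1, X2)."""
    x1, x2 = path
    return (p_x1[x1] * p_o_given_x[(o1, x1)] *
            p_x2_given_x1[(x2, x1)] * p_o_given_x[(o2, x2)])

best = max(product("ab", repeat=2), key=joint)
print("most likely (X1, X2):", best, "with joint probability", joint(best))
```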