Ward Headstrom Institutional Research Humboldt State University CAIR


Transcription:

Using a Random Forest model to predict enrollment
Ward Headstrom, Institutional Research, Humboldt State University
CAIR 2013

Overview
- Forecasting enrollment to assist University planning
- The R language
- The Random Forest model
- Binary Logistic Regression model
- Cautions and Conclusions

The example I am going to use is projecting new enrollment. These techniques can easily be applied to predicting:
- Retention
- Graduation
- Other future events

Simple enrollment projections
1) How many students enrolled last year?
2) Enhance by breaking it down into subgroups
3) Possibly use linear regressions (trends)
4) Enhance further by looking at to-date information

2014 projection = 2013 to-date yield * 2014 apps = 1,369/3,692 * 4,361 ≈ 1,617
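The arithmetic on this slide can be reproduced in a few lines of R. This is only a sketch: the variable names are assumptions chosen for illustration, and the reading of 1,369 as last year's enrolled count and 3,692 as last year's applications to date is inferred from the formula.

```r
# Figures taken from the slide; the variable names are hypothetical.
enrolled_2013 <- 1369   # assumed: 2013 applicants to date who enrolled
apps_2013     <- 3692   # assumed: 2013 applications to date
apps_2014     <- 4361   # 2014 applications to date

yield_2013      <- enrolled_2013 / apps_2013   # to-date yield
projection_2014 <- yield_2013 * apps_2014      # simple projection

round(yield_2013, 3)
round(projection_2014)
```

The same two lines can be repeated per subgroup (for example by major or region) and summed, which is the enhancement step 2 in the list of simple projections.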


But what about
Why applicant yield might not be the best predictor:
- Admits are more likely to enroll
- Confirms are more likely to enroll
- Denied or withdrawn applicants will not enroll
- Housing deposits may be a good indicator of intent
- Local applicants are more likely to enroll than distant applicants
- Certain majors or ethnicities may be more likely to enroll
- Do this year's applicants look like last year's?

Ideally, we would like to use all the data we have about applicants to predict how likely they are to enroll.
Variables: demographics, academics, actions to-date
Model 1: Random Forest
Model 2: Binary logistic regression

The language R
CAIR comment: an emphasis on R would be limiting to institutions that used other software.
- The first (only?) implementation of Random Forest models
- R is open source and free to use
  http://cran.us.r-project.org/
  http://www.rstudio.com/ide/download/desktop
- Many online tutorials:
  http://cran.r-project.org/doc/contrib/paradis-rdebuts_en.pdf
  http://bioinformatics.knowledgeblog.org/2011/06/21/using-r-aguide-for-complete-beginners/
  https://www.coursera.org/course/compdata
  www.researchgate.net/post/which_is_better_r_or_spss

R and RStudio overview
- Function-based: function(data, options)
- Case-sensitive language
- 4 panes: help, history, import dataset, packages
- Object types: data.frame, vector, scalars, factor, models
Useful commands:
- command line console can be used as a calculator
- assignment: -> or <-
- functions: na.omit(), summary(), table(), tolower()
- subsets: dataframe[row select, column select]
- graphics: hist(), plot()
- library(), especially library(randomForest)
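A short console session can illustrate the commands listed on this slide. The data frame and its column names below are made up for illustration and are not from the presentation.

```r
# A tiny hypothetical applicant table; column names are assumptions.
apps <- data.frame(
  region = c("Local", "local", "Distant", NA),
  gpa    = c(3.2, 3.8, NA, 2.9),
  stringsAsFactors = FALSE
)

2 + 3 * 4                 # the console works as a calculator
n <- nrow(apps)           # assignment with <-
nrow(apps) -> n2          # assignment with -> also works

apps$region <- tolower(apps$region)   # R is case-sensitive, so normalize text
complete <- na.omit(apps)             # drop rows with any missing value
summary(apps$gpa)                     # quick numeric summary
table(apps$region)                    # counts per category

# subset: dataframe[row select, column select]
local_gpa <- apps[!is.na(apps$region) & apps$region == "local", "gpa"]
local_gpa
```

Because R is case-sensitive, "Local" and "local" would be counted as different regions without the tolower() step.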

RStudio

Data files
All the data fields you think might help predict yields:
- Major discipline
- Region of origin
- Sex
- Ethnicity
- Academic preparation
- Actions to-date
  - Accepted SUG
  - Confirmed intent to enroll
  - Paid housing deposit
- Institutional actions
  - Admit
  - Deny/cancel

Import data into R
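The slide content for this step did not survive transcription, so the following is only a sketch of one common way to load a delimited file with base R. The file contents and column names (other than cenreg, which appears later in the presentation) are hypothetical; the example writes its own small file so that it runs as-is.

```r
# Write a small made-up file so the example is self-contained;
# in practice the path would point at your own applicant extract.
path <- tempfile(fileext = ".csv")
writeLines(c("id,region,admit,cenreg",
             "1,Local,Y,Y",
             "2,Distant,Y,N",
             "3,Local,N,N"), path)

# stringsAsFactors = TRUE turns text columns into factors,
# which is what classification models expect for categorical fields.
apps <- read.csv(path, stringsAsFactors = TRUE)
str(apps)
```

RStudio's "import dataset" pane generates a similar read call behind the scenes, so either route ends with a data.frame in the workspace.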

Decision Trees

Random Forest Model
- Developed by Leo Breiman and Adele Cutler
- Plan: grow a random forest of 500 decision trees
  randomForest(cenreg ~ variable1 + variable2 + ..., data = train)
- Randomly picks a subset of fields to try at each split of each tree
- Randomly selects rows to exclude from each tree
- Measure of variable importance
- Out-of-bag (OOB) estimate of error rate and confusion matrix
- Run new data through all 500 trees and let them vote
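A minimal sketch of this plan, assuming the randomForest package is installed. The data are simulated and every column except the response name cenreg is hypothetical; the enrollment probabilities are invented purely so the model has something to learn.

```r
library(randomForest)

set.seed(1)
n <- 400
# Simulated applicants; all predictors here are made up for illustration.
apps <- data.frame(
  admit   = factor(sample(c("Y", "N"), n, replace = TRUE, prob = c(0.7, 0.3))),
  deposit = factor(sample(c("Y", "N"), n, replace = TRUE, prob = c(0.3, 0.7))),
  region  = factor(sample(c("Local", "Distant"), n, replace = TRUE))
)
p <- ifelse(apps$admit == "N", 0.02,
            ifelse(apps$deposit == "Y", 0.85, 0.25))
apps$cenreg <- factor(ifelse(runif(n) < p, "Y", "N"), levels = c("N", "Y"))

train <- apps[1:300, ]     # stands in for prior years
new   <- apps[301:400, ]   # stands in for current applicants

# Grow 500 trees; importance = TRUE stores the variable importance measures.
rf <- randomForest(cenreg ~ admit + deposit + region, data = train,
                   ntree = 500, importance = TRUE)

rf$confusion      # out-of-bag confusion matrix with per-class error
importance(rf)    # variable importance measures

# Run new data through all trees; each case gets the majority vote.
votes <- predict(rf, newdata = new, type = "response")
mean(votes == "Y")   # projected share of new applicants enrolling
```

The out-of-bag estimate comes from the rows each tree did not see, so it gives an error rate without a separate hold-out set.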

Random Forest model of applicant yield

varImpPlot(rf)

1st tree in Random Forest
For categorical predictors, the splitting point is represented by an integer whose binary expansion gives the identities of the categories that go to the left or right. For example, if a predictor has four categories and the split point is 13, the binary expansion of 13 is (1, 0, 1, 1) (because 13 = 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3), so cases with categories 1, 3, or 4 in this predictor get sent to the left, and the rest to the right.
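The binary expansion in this example can be checked with plain arithmetic. The sketch below uses the split point 13 and four categories from the text; the variable names are assumptions.

```r
# Decode a categorical split point into the categories sent left.
split_value  <- 13   # split point from the example
n_categories <- 4    # number of categories in the predictor

# Bit k (counting from 0) is the remainder of split_value %/% 2^k divided by 2.
bits <- (split_value %/% 2^(0:(n_categories - 1))) %% 2
bits

left  <- which(bits == 1)   # categories sent to the left
right <- which(bits == 0)   # categories sent to the right
left
right
```

The positions refer to the order of the factor's levels, so levels() on the predictor is needed to turn them back into category names.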

Testing and making a Projection
Random Forest projects that 42% of current Spring apps will enroll, compared to 45% of last year's apps to-date and 34% of apps in the training years.

Binary Logistic Regression
p(x) is the probability that x will occur, where x is a binary outcome (Y/N, 1/0, true/false)

$$\log\frac{p(x)}{1 - p(x)} = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3 + \dots$$

- B_n represents calculated coefficients
- X_n represents the value of independent (predictor) variables
- Break up factor variables into many terms where X_n is 1 or 0
- Can manipulate the result to return the probability (between 0 and 1) that x will occur, given the state of a particular set of predictor variables
- Difficult to predict outcome of a single individual
- Can sum probabilities to estimate total
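These points can be sketched with base R's glm(). The data are simulated and the predictors and coefficients are hypothetical; only the response name cenreg follows the presentation.

```r
set.seed(2)
n <- 500
# Simulated applicants; columns and true coefficients are made up.
apps <- data.frame(
  deposit = factor(sample(c("N", "Y"), n, replace = TRUE)),
  gpa     = round(runif(n, 2, 4), 2)
)
true_logit  <- -3 + 2 * (apps$deposit == "Y") + 0.5 * apps$gpa
apps$cenreg <- rbinom(n, 1, plogis(true_logit))   # 1 = enrolled, 0 = did not

train <- apps[1:400, ]
new   <- apps[401:500, ]

# family = binomial fits log(p / (1 - p)) as a linear function of the predictors.
blr <- glm(cenreg ~ deposit + gpa, data = train, family = binomial)
coef(blr)   # intercept and coefficients on the log-odds scale; depositY is a 0/1 term

log_odds <- predict(blr, newdata = new)   # linear predictor for each new applicant
prob     <- plogis(log_odds)              # invert the logit: 1 / (1 + exp(-log_odds))
sum(prob)                                 # expected number of new applicants enrolling
```

Summing the probabilities gives an expected total without having to classify any single applicant as a yes or a no.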

Binary logistic regression model of applicant yield


BLR model testing and projecting
Binary Logistic Regression predicted that 324 of current Spring applicants will enroll, compared to 340 projected by the Random Forest model.

Cautions and Conclusions
- Null or new values in variables will cause problems
- Beware of to-date variables (e.g. intent_td). Make sure that procedures have not changed in a way that will affect behavior.
- R is a very powerful tool which can be very useful if you are willing to invest some time learning it.
- Multivariate models may improve the accuracy of your predictions.
- Corroborate with simple models and consultation with involved staff.
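The first caution can be checked before running a projection. The sketch below is one possible check, not the presenter's procedure; the tables and column names are hypothetical.

```r
# Hypothetical training and current-year tables.
train <- data.frame(major = c("BIOL", "ART", "BIOL"),
                    gpa   = c(3.1, 3.5, 2.8),
                    stringsAsFactors = FALSE)
new   <- data.frame(major = c("ART", "KINS", NA),
                    gpa   = c(3.0, NA, 3.9),
                    stringsAsFactors = FALSE)

# Null values: count missing entries per column in the data to be scored.
missing_counts <- colSums(is.na(new))
missing_counts

# New values: categories in the current data that the model never saw.
new_majors <- unique(new$major[!is.na(new$major)])
unseen     <- setdiff(new_majors, unique(train$major))
unseen
```

Rows flagged by either check need to be recoded, imputed, or handled outside the model before predictions are trusted.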

Questions? Comments?
This presentation: www.humboldt.edu/irp/presentations.html
My email: Ward.Headstrom@humboldt.edu