Introduction of Empirical Analysis using Stata: For Beginners

1 WBS seminar ('17/12/23). Introduction of Empirical Analysis using Stata: For Beginners. Lecturer: Tohru Yoshioka-Kobayashi, Project Research Associate, Department of Technology Management for Innovation, Graduate School of Engineering, the University of Tokyo (t-koba@tmi.t.u-tokyo.ac.jp). Acknowledgements: Mr. Kisa Sugihara and Mr. Akihiro Kawamura made a great contribution to the English translation. This lecture material may be reused under the Creative Commons Attribution terms. Please note that some parts are not statistically rigorous.

2 0. Introduction: Introduction of the lecturer
Researcher in MOT: '15 Ph.D. in Engineering from UTokyo. Studying organizational management in technology and design development.
Researcher in IP policy: '07 Master in Law from Osaka-U. Seeking policy implications in intellectual property law.
Career: assistant in legal affairs at a university start-up (Signpost, Corp.); policy analyst at a private think tank (Mitsubishi Res. Inst.); Hitotsubashi Univ. & Univ. of Tokyo.

3 0. Introduction: Goal
We will learn the basic knowledge and skills needed to reveal (or prove) a causal relationship. Even those who are not strong in mathematics will be able to run an analysis by themselves after the seminar. The contents of the lecture are based on statistics, but no formulas are used. The lecture focuses on general-purpose analytical methods. We use Stata.

4 0. Introduction: Agenda
I. Preparation for the Analysis: How to Load Data
II. Descriptive Statistics and Graphs
III. Data Processing
IV. Regression Analysis
V. Reporting of Regression Results

5 0. Introduction: Empirical analysis procedures
1 Setting research questions
2 Literature review
3 Causal model design
4 Search for statistics and other data sources
5 Perform simple verification
6 Collecting data
7 Cleaning data (data cleansing)
8 Creating a data set
9 Analysis
10 Discussion (interpretation)
(Diagram labels: Carried out in the head / With data / Embodiment)

6 I. Preparation for the Analysis: How to Load Data

7 I. Preparation for the analysis: 1) Characteristics of statistical analysis software
Stata: Features: High. User experience: Good. Price: High. Support: official support + a couple of books. Characteristics: strong in social science analysis.
SPSS: Features: High. User experience: Good. Price: High. Support: official support + books. Characteristics: somewhat strong in natural science analysis.
R: Features: Medium (High with add-ins). User experience: Bad. Price: Free. Support: a variety of information online + books. Characteristics: strong in data processing.
GRETL: Features: Medium (High with add-ins). User experience: Good. Price: Free. Support: information online. Characteristics: strong in economics analysis.

8 I. Preparation for the analysis: 2) Data to use
SampleData_OECD.txt: created from OECD, Main Science and Technology Indicators. Tab-separated data.
It records the following values for 2008 and 2013, and their growth in 2013 (compared with 2008):
Workforce population (thousands)
PCT patent applications (number of patents), i.e. the number of patent applications intended to be filed in foreign countries
Industry value added (US $ million)
Technology trade receipts (US $ million)
Technology trade payments (US $ million)
Technical trade balance (US $ million), i.e. receipts minus payments

9 I. Preparation for the analysis: 2) Data to use. Data items (variable name: content)
Country: Country
Region_Narrow: Region name
Region_Broad: Continent name
Laborforce_2008_thousands: Workforce population (thousands) (2008)
Laborforce_2013_thousands: Workforce population (thousands) (2013)
Laborforce_growthrate: Workforce population, growth rate
pctpatentapplication_2008: Number of international patent applications (2008)
pctpatentapplication_2013: Number of international patent applications (2013)
Pct_growthrate: Number of international patent applications, growth rate
Valueadded_2008_m_usd: Industry value added (US $ million) (2008)
Valueadded_2013_m_usd: Industry value added (US $ million) (2013)
Valueadded_growthrate: Industry value added, growth rate
ValueAdded_Growth_M_USD: Industry value added, growth value
Techreceipts_2008_m_usd: Technology trade receipts (US $ million) (2008)
Techreceipts_2013_m_usd: Technology trade receipts (US $ million) (2013)
Techreceipts_growthrate: Technology trade receipts, growth rate
Techpayments_2008_m_usd: Technology trade payments (US $ million) (2008)
Techpayments_2013_m_usd: Technology trade payments (US $ million) (2013)
Techpayments_growthrate: Technology trade payments, growth rate
Techbalance_2008_m_usd: Technical trade balance of payments (US $ million) (2008)
Techbalance_2013_m_usd: Technical trade balance of payments (US $ million) (2013)
Techbalance_growth_m_usd: Technical trade balance of payments, growth value (US $ million)
Laborforce_growth_dummy: Dummy variable that takes 1 if the labor force population growth rate > 0
Techbalance_growth_dummy: Dummy variable that takes 1 if the technology trade balance growth rate > 0
Asiapacific_dummy: Dummy variable that takes 1 if the country is in Asia or the Pacific (including North America)
Europe_dummy: Dummy variable that takes 1 if the country is in Europe
Eu_dummy: Dummy variable that takes 1 if the country is an EU member

10 I. Preparation for the analysis: 2) Data to use. Questions to be solved
What factors increase industry value added? What factors increase the technology balance of payments? Important limitation: we examine these questions only within the available data.

11 I. Preparation for the analysis: 3) Data for the experienced. Overview
Files: IMPP_Eng_DATA.txt or IMPP_EnglishEdu_En.xlsx
Sources: Ministry of Education (MEXT) English Skill Survey in 2016 (surveys of public high schools and junior high schools); other government statistics.
Observed year: FY2016

12 I. Preparation for the analysis: 3) Data for the experienced. Items (item: variable name)
Basic information
Prefecture ID: ID
Prefecture: Pref_Str
High School English Teachers' English Skill (MEXT English Skill Survey in 2016)
Number of English teachers in public HS ...(a): HS_T_ALL
Those among (a) who took an English examination ...(b): HS_T_EXAM
Those among (b) who attained Eiken Grade Pre-1 or higher, or an equivalent ...(c): HS_T_E1
(c)/(a): HS_T_E1_R
High School Students' English Skill (MEXT English Skill Survey in 2016)
Seniors in public HS ...(d): HS_S_ALL
Those among (d) who took an English examination ...(e): HS_S_EXAM
Those among (e) who attained Eiken Grade Pre-2 or higher ...(f): HS_S_E2
Those regarded as equivalent to Eiken Grade Pre-2 or higher, excluding (f) ...(g): HS_S_OT
(f)+(g): HS_S_E2OT
((f)+(g))/(d): HS_S_E2OT_R

13 I. Preparation for the analysis: 3) Data for the experienced. Items (item: variable name)
Junior High School English Teachers' English Skill (MEXT English Skill Survey in 2016)
Number of English teachers in public JHS ...(h): JH_T_ALL
Those among (h) who took an English examination ...(i): JH_T_EXAM
Those among (i) who attained Eiken Grade Pre-1 or higher, or an equivalent ...(j): JH_T_E1
(j)/(h): JH_T_E1_R
Junior High School Students' English Skill (MEXT English Skill Survey in 2016)
Seniors in public JHS ...(k): JH_S_ALL
Those among (k) who took an English examination ...(l): JH_S_EXAM
Those among (l) who attained Eiken Grade Pre-2 or higher ...(m): JH_S_E2
Those regarded as equivalent to Eiken Grade Pre-2 or higher, excluding (m) ...(n): JH_S_OT
(m)+(n): JH_S_E2OT
((m)+(n))/(k): JH_S_E2OT_R

14 I. Preparation for the analysis: 3) Data for the experienced. Items (item: variable name)
Number of high schools (MEXT Educational Institution Basic Survey)
Num. of high schools ...(o): HS_I_ALL
Num. of private high schools ...(p): HS_I_PRIV
Num. of public high schools ...(q): HS_I_PUBL
Number of students who newly attend college, university, or junior college (MEXT Educational Institution Basic Survey)
Num. of students who newly attend colleges and universities (by HS location): HS_S_UNIV_ENT
Num. of students who newly attend junior colleges (by HS location): HS_S_JC_ENT
Num. of graduates from JHS in 2013 (by JHS location): JH_S_PREVALL
Percentage of students who go on to colleges, universities, and junior colleges: HS_S_UNJC_R
Percentage of students who go on to colleges and universities: HS_S_UNIV_R

15 I. Preparation for the analysis: 3) Data for the experienced. Questions
What factors influence the English skills of high school students?

16 I. Preparation for the analysis: 4) Load data
Statistical software expects a fixed data layout. The data must be structured as follows:
Individual observations run in the vertical direction (one observation per row).
Variables (indicators) for each observation run in the horizontal direction (one variable per column).
The top row should contain the variable names.
Do not put a line break in a variable name.
(Example table: the variables No, Name, Gender, Age, and Height are columns; observations such as M.Y. and K.K. are rows.)

17 I. Preparation for the analysis: 4) Load data. Variable name guidelines
How to name variables: to be safe, use only English letters and _ (underscore). Avoid other symbols and Japanese characters. Do not include blanks. It is better not to use a number as the first character. Note: the data values themselves may contain Japanese characters and symbols.

18 I. Preparation for the analysis: 4) Load data. File format
It is best to read the Excel file directly. Stata can do this (older versions cannot). If that is not possible, tab-delimited text is better than CSV. CSV separates variables with "," (comma). In numeric data, Excel and other database software may insert "," as a thousands separator. To keep such a value from being split into separate variables, the software wraps it in double quotation marks, like "333,231,298", when the file is saved. When the file is loaded, R and Stata may then treat these numeric variables as strings. If the file is separated by tabs, you can prevent this.

19 I. Preparation for the analysis: 4) Load data
File menu > Import > choose "Text data created by a spreadsheet".

20 I. Preparation for the analysis: 4) Load data
[i] Make sure that tab-delimited data is checked in advance. [ii] Click Browse.

21 I. Preparation for the analysis: 4) Load data
In the file-open window, change the file type to Text Files (*.txt), and then open the data file.

22 I. Preparation for the analysis: 4) Load data
If the variables appear in the list at the top right of the Stata window, the import succeeded.
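The menu steps above also have a command-line equivalent. The following is a minimal sketch, not the lecturer's own script: it first writes a tiny stand-in file (toy_oecd.txt, a hypothetical name, with invented values and only three of the variables) so that it runs on its own, and then imports it with import delimited. With the real data you would point the import at SampleData_OECD.txt, assuming the file sits in the working directory.

```stata
* Build a tiny tab-delimited stand-in file (values are invented for illustration)
clear
input str10 Country double ValueAdded_GrowthRate byte EU_Dummy
"A"  0.12 1
"B" -0.03 0
"C"  0.25 1
end
export delimited using "toy_oecd.txt", delimiter(tab) replace

* Command-line equivalent of the import dialog: tab-delimited, names in row 1
import delimited using "toy_oecd.txt", delimiters("\t") varnames(1) clear

* List variable names and storage types (byte/int/long/double/str)
describe
```

By default import delimited converts variable names to lowercase, which is presumably why the commands in this material refer to valueadded_growthrate rather than ValueAdded_GrowthRate.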

23 I. Preparation for the analysis: 4) Load data
Note the type of each variable in the imported data. You can see it in the variable properties.
int, long, double: number (can be used in calculations)
byte: 0/1 (can be used in calculations)
str: string (cannot be used in calculations). A variable becomes a string when there is garbage in the data or when the file was output to a tab-delimited text format with ...

24 I. Preparation for the analysis: 4) Load data
The type of each variable can be checked in [Variables Manager].

25 I. Preparation for the analysis: 4) Load data. How to correct the type
A variable that is incorrectly treated as a string can be fixed via Data menu > Create or change data > Other variable-transformation commands > Convert variables from string to numeric.
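As an illustration of this fix, the destring command performs the same conversion from the command line. The sketch below uses an invented string variable (valueadded_str) whose values carry thousands separators; the variable names and values are assumptions for illustration only and do not come from the lecture data.

```stata
* Toy data: numbers that were read as strings because of thousands separators
clear
input str12 valueadded_str
"333,231,298"
"1,250"
"98"
end

* Strip the commas and store the result in a new numeric variable
destring valueadded_str, generate(valueadded_num) ignore(",")

* Compare the string original with the numeric copy
describe valueadded_str valueadded_num
list
```

Using generate() keeps the original string variable, so the conversion can be checked before the old variable is dropped.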

26 II. Descriptive Statistics and Graphs

27 II. Descriptive Statistics and Graphs: 1) Descriptive statistics. View descriptive statistics
Statistics menu > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics

28 II. Descriptive Statistics and Graphs: 1) Descriptive statistics. View descriptive statistics
[i] In Variables, click and choose the variables you want to summarize. [ii] Click OK.

29 II. Descriptive Statistics and Graphs: 1) Descriptive statistics. View descriptive statistics: results
. summarize laborforce_growthrate pct_growthrate valueadded_growthrate techbalance_growth_m_usd
The output table has the columns Variable, Obs, Mean, Std. Dev. (standard deviation), Min, and Max, with one row per variable. Long variable names are abbreviated in the output (laborforce~e, pct_growth~e, valueadded~e, tech~h_m_usd).
#Command line for descriptive statistics
summarize laborforce_growthrate pct_growthrate

30 II. Descriptive Statistics and Graphs: 1) Descriptive statistics. View descriptive statistics by group
In the by/if/in tab, the sample can be narrowed down and summarized by group. [i] Check the by-groups box. [ii] Select the variable that defines the groups (for example, Europe_dummy).

31 II. Descriptive Statistics and Graphs: 1) Descriptive statistics. View descriptive statistics (results by group)
The output shows one table for "-> europe_dummy = 0" and one for "-> europe_dummy = 1". Each table has the columns Variable, Obs, Mean, Std. Dev., Min, and Max, with rows laborforce~e, pct_growth~e, valueadded~e, and tech~h_m_usd.
#Descriptive statistics by groups
by europe_dummy, sort : summarize laborforce_growthrate pct_growthrate

32 II. Descriptive Statistics and Graphs: 1) Descriptive statistics. Correlations between variables
Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Correlations and covariances
#Correlations
correlate valueadded_growthrate techbalance_growth_m_usd

33 II. Descriptive Statistics and Graphs: 1) Descriptive statistics. Correlations between variables (cont.)

34 II. Descriptive Statistics and Graphs: 1) Descriptive statistics. Correlations between variables (cont.): results
. correlate valueadded_growthrate techbalance_growth_m_usd techbalance_growth_dummy laborforce_growthrate pct_growthrate eu_dummy
(obs=29)
The output is the correlation matrix of the six variables, with abbreviated names as row and column labels (valueadded~e, tech~h_m_usd, techbalanc~y, laborforce~e, pct_growth~e, eu_dummy).

35 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a histogram
Graphics menu > Histogram

36 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a histogram (cont.)
Select a variable.

37 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a histogram (cont.): results
(Figure: histogram with Density on the vertical axis and ValueAdded_GrowthRate on the horizontal axis.)
#Drawing a histogram
histogram valueadded_growthrate

38 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a histogram by groups
You can create a histogram for each group in the By tab. [i] Click By. [ii] Select the variables to use for grouping.
(Figure: Density of ValueAdded_GrowthRate, graphs by Europe_Dummy.)

39 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a histogram by groups: command lines
#Drawing a histogram by groups
histogram valueadded_growthrate, by(europe_dummy)
Increase or decrease the number of bins:
#Change the number of bins
histogram valueadded_growthrate, bin(12)
(Figure: Density of ValueAdded_GrowthRate.)

40 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a scatter chart
Graphics menu > Twoway graph (scatter, line, etc.)

41 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a scatter chart
[i] Click Create.

42 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a scatter chart (cont.)
[i] Select Scatter under Basic plots. [ii] Select the variable for each axis. [iii] Press Accept to return to the previous screen, then press OK.
#Drawing a scatter chart
twoway (scatter valueadded_growthrate pct_growthrate)

43 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a scatter chart (cont.): results
(Figure: scatter plot with axes PCT_GrowthRate and LaborForce_GrowthRate.)

44 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a scatter plot matrix
Graphics > Scatterplot matrix

45 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a scatter plot matrix
Select variables.
(Figure: scatter plot matrix of ValueAdded_GrowthRate, LaborForce_GrowthRate, PCT_GrowthRate, and EU_Dummy.)
#Drawing a scatter plot matrix
graph matrix valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy

46 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a box plot
Graphics > Box plot

47 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a box plot
(Figure: box plot of PCT_GrowthRate.)
#Drawing a box plot
graph box pct_growthrate

48 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a box plot by groups
[i] Click the Categories tab. [ii] Check Group 1. [iii] Select a variable for grouping.
#Drawing a box plot by groups
graph box pct_growthrate, over(region_broad)

49 II. Descriptive Statistics and Graphs: 2) Graphs. Drawing a box plot by groups: results
(Figure: box plots of PCT_GrowthRate for the groups Asia-Pacific, Europe, and Other.)

50 II. Descriptive Statistics and Graphs: 3) Exercise
Our dataset (SampleData_OECD) includes one variable that contains errors. Hint: they are obvious errors. Hint: the errors are in one specific variable among the labor force, PCT, and value added related variables. Find the variable by using summary statistics, histograms, and scatter plots.

51 II. Descriptive Statistics and Graphs: 3) Exercise. Answer
ValueAdded_Growth_M_USD. It was calculated as the value in 2008 minus the value in 2013, the reverse of the intended order. Thus there are too many negative growth values!
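One way to confirm and repair this kind of error is to recompute the growth value from the two level variables. The sketch below is an illustration with three invented observations; valueadded_growth_fixed is a hypothetical name for the corrected variable, and the level variable names follow the data item list of this material.

```stata
* Toy data: industry value added in 2008 and 2013 (values invented)
clear
input double(valueadded_2008_m_usd valueadded_2013_m_usd)
1000 1200
 500  650
 800  760
end

* Growth value should be the 2013 level minus the 2008 level
generate double valueadded_growth_fixed = valueadded_2013_m_usd - valueadded_2008_m_usd
list

* Count negative growth values; a very high count hints at a reversed subtraction
count if valueadded_growth_fixed < 0
```

With the real data, comparing the recomputed variable with the stored ValueAdded_Growth_M_USD (for example in a scatter plot) would make the sign flip visible.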

52 III. Data Processing

53 III. Data Processing: 1) Create a new variable. How to compute a new variable
Data menu > Create or change data > Create new variable

54 III. Data Processing: 1) Create a new variable. How to compute a new variable
[i] Fill in the name of the new variable. [ii] Click Create.

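55 III. Data Processing: 1) Create a new variable. How to compute a new variable (cont.)
[i] A mathematical function such as log() can be chosen from Function > Mathematical. [ii] You can choose a variable from Variables. Example expression: log(techbalance_growth_m_usd)
The generate command needs a name for the new variable before the expression; log_techbalance_growth below is an example name. Note that log() returns a missing value for zero or negative inputs.
#Create a new variable
generate log_techbalance_growth = log(techbalance_growth_m_usd)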

56 III. Data Processing: 2) Save the dataset. Save the modified dataset [1]
File > Export > Text data (delimited, *.csv)

57 III. Data Processing: 2) Save the dataset. Save the modified dataset [1]
Input a file name and check Tab-delimited.
#Save the dataset in a tab-delimited text file
export delimited using "OECD_data_v02.txt", delimiter(tab) replace

58 III. Data Processing: 2) Save the dataset. Save the modified dataset [2]
File > Save as...
#Save the dataset in a Stata data file (.dta)
save "OECD_data.dta"

59 IV. Regression Analysis

60 IV. Regression Analysis: 1) Estimating correlations with multiple variables: Basics
Collect a large amount of data and estimate the influence of each factor. Regression analysis fits:
Performance = a * Factor 1 + b * Factor 2 + c
(Figure: three-dimensional plot with axes Factor 1, Factor 2, and Performance. The green plane is the plane that lies closest to all the data points (dots).)
(Note) In general the green plane is not a triangle; it looks like one in this example because Factor 1 and Factor 2 are restricted to > 0 and Performance to < p.

61 IV. Regression Analysis: 1) Estimating correlations with multiple variables: Basics. Key terms
Dependent variable: the variable to be estimated. In many cases, a performance indicator.
Explanatory variables (independent variables): variables that affect, or are thought to be strongly correlated with, the dependent variable.
Control variables: variables other than the explanatory variables that affect, or are thought to be strongly correlated with, the dependent variable. In many cases, these are the variables used in prior research.

62 IV. Regression Analysis: 1) Estimating correlations with multiple variables: Basics. What can be used as an explanatory variable?
i. Squared term
Estimate it together with the original (first-order) term and look at the influence of both to find a quadratic effect. Multicollinearity between the first-order term (x) and the squared term (x^2) is often tolerated.
Interpretation:
1 Coefficient of x significantly (+), coefficient of x^2 significantly (-): inverse-U shaped
2 Coefficient of x significantly (-), coefficient of x^2 significantly (+): U-shaped
3 Coefficient of x not significant, coefficient of x^2 significantly (+): the positive impact is non-linear
4 Coefficient of x significantly (+), coefficient of x^2 not significant: a linear positive impact
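A squared term can be added in Stata with factor-variable notation. The following is a minimal sketch on simulated data (x and y are invented, with an inverse-U relationship built in), not an analysis of the lecture dataset.

```stata
* Simulated data with an inverse-U relationship (all values invented)
clear
set seed 1
set obs 200
generate x = runiform() * 10
generate y = 2 + 3*x - 0.3*x^2 + rnormal(0, 1)

* c.x##c.x enters both the first-order term x and the squared term x^2
regress y c.x##c.x

* Turning point of the fitted parabola: -b1 / (2*b2)
display -_b[x] / (2 * _b[c.x#c.x])
```

A positive coefficient on x together with a negative coefficient on c.x#c.x corresponds to case 1 in the interpretation list (inverse-U shaped).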

63 IV. Regression Analysis: 1) Estimating correlations with multiple variables: Basics. What can be used as an explanatory variable? (cont.)
ii. Interaction term (cross term)
Use it when the way an explanatory variable works differs depending on a condition (to check a moderating effect). Estimate it together with each of the original explanatory variables and look at the influence of all of them.
(Figure: Factor 1 affects Performance, and Factor 2 acts on that path: the influence of Factor 1 depends on Factor 2.)

64 IV. Regression Analysis: 1) Estimating correlations with multiple variables: Basics. What can be used as an explanatory variable? (cont.)
ii. Interaction term (cross term) (cont.)
Notes:
An interaction term often causes multicollinearity with the original explanatory variables, so centering or standardization is needed.
Centering: original value - mean value
Standardization: (original value - mean) / standard deviation
If the two explanatory variables are on very different scales, the interaction term will be dominated by one of them, so standardization or alignment of the number of digits is needed.
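The centering and standardization formulas above can be written in Stata as follows. This is a sketch on simulated data; factor1, factor2, and the derived variable names are invented for illustration.

```stata
* Simulated explanatory variables on very different scales (values invented)
clear
set seed 2
set obs 100
generate factor1 = rnormal(50, 10)
generate factor2 = rnormal(1000, 200)

* Centering: original value - mean value
summarize factor1, meanonly
generate factor1_c = factor1 - r(mean)

* Standardization: (original value - mean) / standard deviation
egen factor1_z = std(factor1)
egen factor2_z = std(factor2)

* Interaction term built from the standardized variables
generate f1_x_f2 = factor1_z * factor2_z

* Centered and standardized variables should have mean (almost) 0
summarize factor1_c factor1_z factor2_z f1_x_f2
```

The interaction term would then enter the regression together with factor1_z and factor2_z.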

65 IV. Regression Analysis: 1) Estimating correlations with multiple variables: Basics. What can be used as an explanatory variable? (cont.)
iii. Dummy variable
The variable takes 1 if a specific condition is fulfilled, and 0 otherwise. It is useful for controlling for differences in conditions or affiliations.
(Example) Previous-race-win dummy: takes 1 if the horse won its previous race.
(Source) JRA. Bolton, R. N., & Chapman, R. G. (1986). Searching for positive returns at the track: A multinomial logit model for handicapping horse races. Management Science, 32(8).
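A dummy variable of this kind can be created with generate and a logical expression. The sketch below rebuilds a variable like Laborforce_growth_dummy from invented growth rates; the data values are assumptions for illustration.

```stata
* Toy growth rates, including one missing value (values invented)
clear
input double laborforce_growthrate
 0.03
-0.01
 .
 0.00
end

* 1 if the growth rate is > 0, otherwise 0.
* Stata treats missing as larger than any number, so without the
* if-condition the missing observation would wrongly be coded as 1.
generate byte laborforce_growth_dummy = (laborforce_growthrate > 0) if !missing(laborforce_growthrate)

tabulate laborforce_growth_dummy, missing
```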

66 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). Conditions under which OLS can be used
The number of samples does not have to be large if conditions i to v are met.
i. All explanatory variables are data derived from an experiment (that is, they are not random variables, which are uncertain values that take a certain range).
ii. The expected value of the error is 0.
iii. No heteroscedasticity: the error term is not unevenly distributed (see next page).
Then the coefficients estimated for each explanatory variable are mathematically optimal solutions.
iv. No correlation between the explanatory variables and the error: no variable that explains the explained variable is missing, and there is no variable that affects both the explained variable and an explanatory variable. This is also expressed as "there is no endogeneity" or "the error term is uncorrelated".
v. The error follows a normal distribution.
Then it becomes possible to judge appropriately whether the coefficients estimated for each explanatory variable are statistically significant.
vi. There is no strong correlation between explanatory variables.
Then bias is not included in the coefficients estimated for each explanatory variable.

67 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). Conditions under which OLS can be used (modified from material provided by Dr. Koichi Hasegawa)
iii) No heteroscedasticity
Heteroscedasticity: under the influence of a certain factor, the errors are widely scattered in one region and narrowly scattered in another. The result is not reliable in the widely scattered region (the estimate is only a value somewhere in between).
Check with the Breusch-Pagan test or an LM test.
(Figure: errors around the estimated line when the dispersion is uneven.)
If there is heteroscedasticity, possible solutions are:
1. Add missing variables to the model.
2. Log-transform the explanatory variables and the explained variable.
3. Use robust standard errors.
4. Estimate by the weighted least squares method (details and practice omitted) or the maximum likelihood method.

68 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). Conditions under which OLS can be used
iv) No correlation between the error and the explanatory variables = no endogeneity (here: no omitted variable bias)
(Figure: the amount of knowledge (number of papers read) and research time (number of hours spent) explain highly rated research papers (evaluation from instructors / awards / number of citations). Luck and the students' smartness cannot be measured, so they appear in the error term, and they are correlated with the amount of knowledge.)
Example: the seminar instructor's influence works on both the number of accessible articles and the evaluation.
The pure effect of the amount of knowledge cannot be estimated as long as the person's ability cannot be measured. This must be considered before the analysis. The Durbin-Wu-Hausman test detects endogeneity.
If there is endogeneity, possible solutions are:
Fixed-effects model estimation on panel data
Adding control variables
Adopting the instrumental variables (IV) method

69 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). Conditions under which OLS can be used
iv) No correlation between the error and the explanatory variables = no endogeneity (no omitted variable bias)
Phenomena observed when omitted variable bias exists:
R2 is low (the model's explanatory power is weak).
Explanatory or control variables that previous studies using the same explained variable found to have a significant influence have not been added. (Such a variable may not be important in the causal model, but it still affects the explained variable.)
Solution: check the previous research carefully!

70 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). Conditions under which OLS can be used
iv) No correlation between the error and the explanatory variables = no endogeneity (simultaneity bias or reverse causality)
(Figure: the amount of knowledge (number of papers read) and research time (number of hours spent) explain highly rated research papers (evaluation from instructors / awards / number of citations). Already published highly rated research papers in turn let the researcher devote more time to research, and so are correlated with the explanatory variables.)
Example: a researcher who is known for writing good papers can concentrate on research.
If this factor is not among the explanatory variables, its effect will appear in the error term. Correct estimation is impossible when the causality is circular. This must be considered before the analysis. It is detectable by the Durbin-Wu-Hausman test.
Possible solutions are:
Add the one-period-lagged value of the explanatory variable
Adopt the instrumental variables (IV) method

71 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). Conditions under which OLS can be used
v) Normal distribution of errors
However, if the sample is large enough (about a few hundred), no verification is required.
If the error is not normally distributed, the estimated line does not have the correct slope. Check whether the residuals are normally distributed with the skewness/kurtosis test or the Shapiro-Wilk normality test.
(Figure: frequency of the error values with actual samples when the distribution is not normal.)
If the error is not normally distributed, possible solutions are:
1. Log-transform or square the dependent variable and the explanatory variables.
2. Estimate by the maximum likelihood method, for example with a Poisson, probit, or Tobit model.

72 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). Conditions under which OLS can be used
vi) No strong correlation between explanatory variables: absence of multicollinearity
Multicollinearity: among highly correlated explanatory variables, it cannot be determined which variable has the influence, and the estimated coefficients become inaccurate.
Observed phenomena:
Although the coefficient of determination is high, the t value of each explanatory variable is low (not significant).
The standard errors are abnormally high.
The sign (+ or -) of a coefficient does not coincide with the sign estimated by a model that includes only one of the correlated explanatory variables.
To check, compute the VIF (variance inflation factor) and confirm whether any variable shows a value of 4 or more (or 10 or more).
If there is multicollinearity, possible solutions are:
1. Eliminate unnecessary explanatory variables.
2. Convert the explanatory variables to differences or ratios.
3. Apply factor analysis or principal component analysis to the explanatory variables to create uncorrelated synthetic variables.

73 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). 7 steps in regression analysis
1 Design the causal relationship model and translate it into indicators.
Make a model without endogeneity (omitted variable bias, simultaneity bias).
Samples should be large (at least explanatory variable 2 or ...).
2 Create descriptive statistics and a correlation matrix.
Be sure to create a histogram to verify the distribution. If the dependent variable is not normally distributed, consider estimation methods other than OLS.
If the explanatory variables differ greatly in their number of digits, rescale them (multiply by 1,000, by 1/1,000, etc.).
For explanatory variables whose correlation is too strong, either drop one of them or check for multicollinearity later.

74 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). 7 steps in regression analysis
3 Make two models: one with only the control variables (no explanatory variables) and one that also includes the explanatory variables.
Compare the R2 of both models and see the contribution of the explanatory variables.
4 If the model contains strongly correlated variables, check whether there is multicollinearity.
Check the VIF: is it 4 or more (or 10 or more)?
In the case of multicollinearity, drop one of the variables, convert a variable, aggregate the variables by principal component analysis, etc.
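Steps 3 and 4 can be sketched in Stata as follows. The data are simulated stand-ins that reuse the variable names of the OECD sample (all values are invented), and treating pct_growthrate as the explanatory variable and the other two as controls is an assumption made only for this illustration.

```stata
* Simulated stand-in for the OECD sample (all values invented)
clear
set seed 20171223
set obs 40
generate pct_growthrate = rnormal(0.10, 0.20)
generate laborforce_growthrate = rnormal(0.02, 0.05)
generate byte eu_dummy = runiform() < 0.5
generate valueadded_growthrate = 0.05 + 0.3*pct_growthrate + 0.5*laborforce_growthrate + rnormal(0, 0.1)

* Step 3, model 1: control variables only
regress valueadded_growthrate laborforce_growthrate eu_dummy
estimates store m_controls

* Step 3, model 2: controls plus the explanatory variable
regress valueadded_growthrate laborforce_growthrate eu_dummy pct_growthrate
estimates store m_full

* Side-by-side comparison of coefficients, standard errors, and R2
estimates table m_controls m_full, b(%9.3f) se stats(N r2 r2_a)

* Step 4: variance inflation factors for the full model
estat vif
```

The difference in R2 between the two columns shows how much the explanatory variable adds beyond the controls.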

75 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). 7 steps in regression analysis
5 If the estimation model includes strongly correlated variables, run multiple estimations in which the correlated variables are included or excluded.
If the sign (positive or negative) of a variable's estimated coefficient changes depending on the model, multicollinearity is having a strong effect.
If a pair of explanatory variables is highly correlated in the correlation matrix but does not cause multicollinearity, these estimations can show that there is no problem.
Example with strongly correlated explanatory variables A and B:
Model 1: A included, B not included
Model 2: A not included, B included
Model 3: A included, B included

76 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). 7 steps in regression analysis
6 After running the multiple regression, obtain the errors (residuals) and verify that they are not unevenly dispersed and that they are normally distributed.
Heteroscedasticity of the errors is checked with the Breusch-Pagan test or an LM test. If the errors are unevenly dispersed, use robust standard errors, etc.
Whether the errors follow a normal distribution is checked with the skewness/kurtosis test or the Shapiro-Wilk normality test. If they do not, log-transform the variables, use the maximum likelihood method, etc. However, this is not necessary if the number of samples is large.
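The checks in step 6 map onto a short sequence of Stata commands. The sketch below runs them on simulated stand-in data that reuse the variable names of the OECD sample (all values are invented); resid is a hypothetical name for the residual variable.

```stata
* Simulated stand-in for the OECD sample (all values invented)
clear
set seed 20171223
set obs 40
generate pct_growthrate = rnormal(0.10, 0.20)
generate laborforce_growthrate = rnormal(0.02, 0.05)
generate byte eu_dummy = runiform() < 0.5
generate valueadded_growthrate = 0.05 + 0.3*pct_growthrate + 0.5*laborforce_growthrate + rnormal(0, 0.1)

regress valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy

* Breusch-Pagan / Cook-Weisberg test (Ho: constant variance)
estat hettest

* Obtain the residuals and test them for normality
predict double resid, residuals
sktest resid
swilk resid
```

If estat hettest rejects constant variance, the same regress command can be rerun with the vce(robust) option.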

77 IV. Regression Analysis: 1) Estimating correlations with multiple variables. Regression by ordinary least-squares method (OLS). 7 steps in regression analysis
7 Verify the robustness of the estimated results.
Exclude data that may be outliers.
Estimate separately for data that may differ in nature.
Since OLS estimates the mean of the explained variable, observations with outlying values of the explained variable have a large influence. A countermeasure is quantile regression (estimating the median, the 25th percentile, or the 75th percentile).
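Quantile regression is available in Stata through the qreg command. The following is a minimal sketch on simulated stand-in data that reuse the variable names of the OECD sample (all values are invented).

```stata
* Simulated stand-in for the OECD sample (all values invented)
clear
set seed 20171223
set obs 40
generate pct_growthrate = rnormal(0.10, 0.20)
generate laborforce_growthrate = rnormal(0.02, 0.05)
generate byte eu_dummy = runiform() < 0.5
generate valueadded_growthrate = 0.05 + 0.3*pct_growthrate + 0.5*laborforce_growthrate + rnormal(0, 0.1)

* Median regression (the 50th percentile)
qreg valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy, quantile(0.5)

* 25th and 75th percentile regressions
qreg valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy, quantile(0.25)
qreg valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy, quantile(0.75)
```

If the signs and rough magnitudes of the coefficients agree with the OLS results, the findings are less likely to be driven by a few outlying observations.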

78 IV. Regression Analysis: 2) Exercise
Verify whether the following models are correct by using the OECD data.
(Figure: causal model with hypothesized signs such as (+) on the paths. Concepts and their indicators: Activate technology development: increase ratio of PCT applications. Increase in income: increase in technical trade balance. Being a European country: European dummy. Increase in added value of industry: increase ratio of added value.)

79 IV. Regression Analysis: 3) Run OLS
Statistics > Linear models and related > Linear regression
In the command, the first variable is the explained variable and all the rest are explanatory variables.
#Regression analysis
regress valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy

80 IV. Regression Analysis: 3) Run OLS (cont.)
Set the dependent variable and the explanatory variables (including control variables).

81 IV. Regression Analysis: 3) Run OLS. How to read the output results
. regress valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy
Header of the output:
Number of obs: number of observations
F(3, 37) and Prob > F: F statistic (whether there is a statistically significant difference between this model and a model that does not include any explanatory variables)
R-squared, Adj R-squared, Root MSE
Coefficient table (rows: laborforce_growthrate, pct_growthrate, eu_dummy, _cons):
Coef.: estimated coefficient
Std. Err.: standard error
t and P>|t|: t value and significance probability
[95% Conf. Interval]: confidence interval (the true coefficient may actually lie within this range)

82 IV. Regression Analysis 82 3) Run OLS
Check for multicollinearity
After running the regression: Statistics > Linear models and related > Regression diagnostics > Specification tests, etc.
    * Compute VIF
    estat vif

83 IV. Regression Analysis 83 3) Run OLS Check for multicollinearity (cont.) Select Variance inflation factors

84 IV. Regression Analysis 84 3) Run OLS
Check for multicollinearity (cont.): Result
    estat vif
Output columns: Variable, VIF, 1/VIF. Rows: eu_dummy, laborforce~e, pct_growth~e. Mean VIF: 1.27.
If a VIF is 4 or more, multicollinearity is suspected. (Some analysts use a threshold of 10 or more.)
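To show where a VIF comes from, the sketch below reproduces one by hand: regress one explanatory variable on the others and compute 1/(1 - R-squared). It assumes Stata's bundled auto dataset (price, mpg, weight, foreign) as a stand-in for the seminar data.

```stata
* Assumption: the bundled auto dataset stands in for the seminar data
sysuse auto, clear

* Main regression, followed by the built-in VIF report
regress price mpg weight foreign
estat vif

* VIF for mpg by hand: regress mpg on the other explanatory variables
regress mpg weight foreign
display "VIF(mpg) = " 1/(1 - e(r2))
```

The displayed value should agree with the mpg row reported by estat vif, because the VIF measures how well the other explanatory variables predict each one.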

85 IV. Regression Analysis 85 3) Run OLS
Check for heteroscedasticity of the error variance
After running the regression: Statistics > Linear models and related > Regression diagnostics > Specification tests, etc.
    * Heteroscedasticity test
    estat hettest

86 IV. Regression Analysis 86 3) Run OLS
Check for heteroscedasticity of the error variance (cont.)
Select Tests for heteroscedasticity

87 IV. Regression Analysis 87 3) Run OLS
Check for heteroscedasticity of the error variance: Results
    estat hettest
The null hypothesis is "the variance of the errors is constant".
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of valueadded_growthrate
chi2(1) = 1.15
In this example, the p-value (Prob > chi2) is about 28%: if the error variance were constant, a test statistic at least this large would occur about 28% of the time (not very rare). We therefore do not reject the null hypothesis and treat the variance as constant.
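The reported p-value is the upper-tail probability of the chi-squared distribution, which can be checked directly. The sketch below first uses the slide's own statistic (chi2(1) = 1.15, which works out to roughly 0.28), then repeats the calculation from stored results. The second part assumes Stata's bundled auto dataset as a stand-in for the seminar data.

```stata
* Upper-tail probability of chi2 = 1.15 with 1 degree of freedom
display chi2tail(1, 1.15)

* Same calculation from the results stored by estat hettest
* Assumption: the bundled auto dataset stands in for the seminar data
sysuse auto, clear
regress price mpg weight
estat hettest
display "p-value = " chi2tail(r(df), r(chi2))
```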

88 IV. Regression Analysis 88 3) Run OLS
If heteroscedasticity is found: Robust standard error
Menu: Statistics > Linear models and related > Linear regression <Same as OLS>
  [i] Click the tab SE/Robust
  [ii] Select Robust
    * Regression with robust standard errors
    regress valueadded_growthrate pct_growthrate laborforce_growthrate eu_dummy, vce(robust)

89 IV. Regression Analysis 89 3) Run OLS
If heteroscedasticity is found: Robust standard error: Results
    regress valueadded_growthrate pct_growthrate laborforce_growthrate eu_dummy, vce(robust)
Output header: Linear regression; Number of obs = 41; F(3, 37), Prob > F, R-squared, and Root MSE are also reported.
Coefficient table columns: Coef., Robust Std. Err., t, P>|t|, [95% Conf. Interval]. Rows: pct_growthrate, laborforce_growthrate, eu_dummy, _cons.
Robust standard errors are shown instead of the usual standard errors.

90 IV. Regression Analysis 90 3) Run OLS
Check the normality of the errors
First, save the residuals to a new variable.
    * Save the residuals to a new variable
    predict resd, residual

91 IV. Regression Analysis 91 3) Run OLS
Check the normality of the errors (cont.)
Menu: Statistics > Summaries > Distributional plots and tests > Skewness/Kurtosis tests for normality
    * Skewness/kurtosis test for normality
    sktest resd

92 IV. Regression Analysis 92 3) Run OLS
Check the normality of the errors (cont.)
Select the variable you just created (where the residuals are stored)

93 IV. Regression Analysis 93 3) Run OLS
Check the normality of the errors (cont.)
The null hypothesis is "the errors are normally distributed".
    sktest resd
Output: Skewness/Kurtosis tests for Normality. Columns: Variable, Obs, Pr(Skewness), Pr(Kurtosis), adj chi2(2), Prob>chi2. Row: resd.
[Figure: distribution of the residuals; axes: Residuals, Density]
In this example, the p-value is 65%: if the errors were normally distributed, a test statistic at least this large would occur about 65% of the time. We therefore do not reject the null hypothesis and treat the errors as normally distributed.
If the errors are not normally distributed, consider log-transforming the dependent variable or estimating with another method, such as maximum likelihood.
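A minimal sketch of the whole normality check, including the log-transformation remedy, might look like the following. It assumes Stata's bundled auto dataset; the variable names ln_price and resd_ln are illustrative and do not come from the seminar data.

```stata
* Assumption: the bundled auto dataset stands in for the seminar data
sysuse auto, clear

* Regression, residuals, and the skewness/kurtosis test
regress price mpg weight
predict resd, residuals
sktest resd

* Visual checks: histogram with a normal curve, and a normal quantile plot
histogram resd, normal
qnorm resd

* Remedy sketched in the text: log-transform the dependent variable and re-test
generate ln_price = ln(price)
regress ln_price mpg weight
predict resd_ln, residuals
sktest resd_ln
```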

94 IV. Regression Analysis 94 3) Run OLS
Plot estimated results
    * Run immediately after the regression: store the estimated results in a new variable
    predict p_va_gr
In this example, the X-axis is pct_growthrate.
    * Plot estimates and actual values
    twoway (scatter valueadded_growthrate pct_growthrate, mcolor(gray)) (scatter p_va_gr pct_growthrate, mcolor(red))
The estimates are red and the actual data are gray.
[Figure: scatter plot; x-axis: PCT_GrowthRate; series: ValueAdded_GrowthRate, Fitted values]

95 IV. Regression Analysis 95 3) Run OLS
Robustness check: (Example) Drop top/bottom data
    * Compute percentiles and identify the data within a given percentile range
    summarize valueadded_growthrate, detail
    gen isinuse = inrange(valueadded_growthrate, r(p5), r(p95))
In this example, we create a new variable isinuse, which takes 1 if the value added growth of the observation lies between the 5th and 95th percentiles.
To change an existing variable, use replace.
    * Change the percentile range to 3%-97% (summarize does not return these percentiles, so compute them with _pctile)
    _pctile valueadded_growthrate, percentiles(3 97)
    replace isinuse = inrange(valueadded_growthrate, r(r1), r(r2))
    * Regression with the selected data
    regress valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy if isinuse == 1
if specifies the condition that selects the data to use. Write the equality test with two equals signs (==).
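One way to judge robustness is to put the full-sample and trimmed-sample estimates side by side. The sketch below assumes Stata's bundled auto dataset; the stored-estimate names full and trimmed are illustrative.

```stata
* Assumption: the bundled auto dataset stands in for the seminar data
sysuse auto, clear

* Full-sample regression
regress price mpg weight
estimates store full

* Flag observations between the 5th and 95th percentiles of price
_pctile price, percentiles(5 95)
generate byte isinuse = inrange(price, r(r1), r(r2))

* Regression on the trimmed sample
regress price mpg weight if isinuse == 1
estimates store trimmed

* Compare coefficients and standard errors across the two samples
estimates table full trimmed, se stats(N r2)
```

If signs and rough magnitudes hold in both columns, the result is less likely to be driven by extreme observations.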

96 V. Reporting of Regression Results 96

97 V. Reporting of Regression Results 97 1) Reporting of Regression Results
Common practice
We usually report:
  descriptive statistics
  correlation matrix
  regression results
You can integrate them into one table.

98 V. Reporting of Regression Results 98 1) Reporting of Regression Results Common practice Example of descriptive statistics, and correlation matrix Keller, R. T. (2001). Cross-functional project groups in research and new product development: Diversity, communications, job stress, and outcomes. Academy of Management Journal, 44(3),

99 V. Reporting of Regression Results 99 1) Reporting of Regression Results Common practice Examples of regression results Keller, R. T. (2001). Cross-functional project groups in research and new product development: Diversity, communications, job stress, and outcomes. Academy of Management Journal, 44(3),

100 V. Reporting of Regression Results 100 1) Reporting of Regression Results
Set up add-ins: outreg2, mkcorr
    * Install outreg2 (you need to do this only once)
    ssc install outreg2
    * Install mkcorr (you need to do this only once)
    ssc install mkcorr

101 V. Reporting of Regression Results 101 1) Reporting of Regression Results
Export descriptive statistics
You can export in MS Word format.
    * Create a new desc_stat.doc file and export descriptive statistics
    outreg2 using desc_stat.doc, replace sum(log) keep(valueadded_growthrate pct_growthrate laborforce_growthrate eu_dummy)
Select the variables to export in keep().
The file (desc_stat.doc) will be saved in the folder indicated in the status bar.

102 V. Reporting of Regression Results 102 1) Reporting of Regression Results
Export correlation matrix
    * Export the correlation matrix to a text file
    mkcorr valueadded_growthrate pct_growthrate laborforce_growthrate eu_dummy, log(corr_matrix.txt)

103 V. Reporting of Regression Results 103 1) Reporting of Regression Results
Export regression results
    * Regression analysis
    regress valueadded_growthrate laborforce_growthrate eu_dummy
    * Create a new file regress_res.doc and export the results to it
    outreg2 using regress_res.doc, replace ctitle(model 1)
    * Another regression analysis
    regress valueadded_growthrate pct_growthrate laborforce_growthrate eu_dummy
    * Append the results to the file
    outreg2 using regress_res.doc, append ctitle(model 2)

104 V. Reporting of Regression Results 104 1) Reporting of Regression Results Export regression results: Results

105 V. Reporting of Regression Results 105 2) Visualization of Regression Results
Plot estimated marginal effect
Graphs showing marginal effects with confidence intervals
    * Plot marginal effect with confidence intervals
    graph twoway lfitci valueadded_growthrate pct_growthrate
    * Plot marginal effect with confidence intervals and original data
    graph twoway (lfitci valueadded_growthrate pct_growthrate) (scatter valueadded_growthrate pct_growthrate)

106 V. Reporting of Regression Results 106 2) Visualization of Regression Results
Plot estimated marginal effect
[Figure: fitted line with 95% CI; x-axis: PCT_GrowthRate; series: 95% CI, Fitted values, ValueAdded_GrowthRate]

107 V. Reporting of Regression Results 107 2) Visualization of Regression Results
Plot estimated results
Estimates are plotted separately for European and non-European countries, with the other variables assumed to be at their mean values.
    * Run immediately after the regression: store the estimated results in a variable
    adjust laborforce_growthrate, by(eu_dummy) gen(p2_va_gr)
Here, we use the mean value of laborforce_growthrate.
    * Show estimates
    twoway (scatter p2_va_gr pct_growthrate if eu_dummy==1, mcolor(blue)) (scatter p2_va_gr pct_growthrate if eu_dummy==0, mcolor(red)), legend(order(1 "EU" 2 "Non-EU")) ytitle("Value Added Growth")
Blue marks countries in the EU; red marks countries outside the EU.
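In Stata 11 and later, the margins command covers similar ground to adjust, and marginsplot (Stata 12 and later) draws the result. The sketch below is one possible equivalent, not the lecturer's method. It assumes Stata's bundled auto dataset, with foreign playing the role of the group dummy and mpg the role of the X-axis variable.

```stata
* Assumption: the bundled auto dataset stands in for the seminar data
sysuse auto, clear

* i.foreign marks the dummy as a factor variable so margins can group by it
regress price mpg weight i.foreign

* Predicted price by group over a grid of mpg, other covariates at their means
margins foreign, at(mpg = (15(5)35)) atmeans

* Plot the predictions with confidence intervals, one line per group
marginsplot
```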

108 V. Reporting of Regression Results 108 2) Visualization of Regression Results
[Figure: estimates by group; x-axis: PCT_GrowthRate; y-axis title: Value Added Growth (you can change it in ytitle()); legend: EU, Non-EU (you can change it in legend(order()))]

109 109 Appendix For further improvement

110 Appendix 110 Variations of regressions for causality analysis
Variations of estimation models corresponding to the characteristics of the dependent variable
Dependent variable = dummy variable
  Example: Surplus in the technology balance of payments
  Logistic regression (logit model), probit model regression
Dependent variable has a cut-off point
  Example: Longitudinal performance of engineers (suddenly decreases due to retirement, job rotation, and other life events)
  Tobit model
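As a rough illustration of these model families, the commands below fit a logit, a probit, and a Tobit model. They assume Stata's bundled auto dataset. The censoring of mpg at 17 is artificial and exists only to give the Tobit model something to work with, and mpg_c is a hypothetical variable name.

```stata
* Assumption: the bundled auto dataset stands in for the seminar data
sysuse auto, clear

* Dummy dependent variable: logit and probit
logit foreign mpg weight
probit foreign mpg weight

* Dependent variable with a cut-off: censor mpg from below at 17 (artificial)
generate mpg_c = max(mpg, 17)
tobit mpg_c weight, ll(17)
```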

111 Appendix 111 Variations of regressions for causality analysis
Variations of estimation models (cont.)
Dependent variable = count (natural number)
  Example: Number of inventions in an organization (the number of inventors who generate n inventions is 1/n^2 of all inventors (Narin & Breitzman, 1995))
  Poisson model
  Negative binomial model
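A small simulated example can show how the two count models are called. Everything here is an assumption made for illustration: the data are generated, not real, and the variable names x, nu, and y are hypothetical. The counts are deliberately overdispersed so that the negative binomial model has something to detect.

```stata
* Simulated data (assumption: no real invention counts are used here)
clear
set seed 12345
set obs 500
generate x = rnormal()

* Gamma-distributed heterogeneity with mean 1 makes the counts overdispersed
generate nu = rgamma(2, 0.5)
generate y = rpoisson(exp(0.5 + 0.3*x) * nu)

* Poisson model, then negative binomial model
poisson y x
nbreg y x
```

The nbreg output includes a likelihood-ratio test of alpha = 0. Rejecting it suggests overdispersion, in which case the negative binomial model is usually preferred to Poisson.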

112 Appendix 112 Variations of estimation models to reveal causality
Omitted variable bias prevention
  Panel data analysis: use time series data and exclude unobservable effects of individuals
    Fixed effect model
    Random effect model
  Difference-in-difference
  Regression discontinuity
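The sketch below shows how fixed effect and random effect models are typically run on panel data, followed by a Hausman test to compare them. The panel is simulated purely for illustration, and all variable names (id, year, u, x, y) are hypothetical. The individual effect u is built to be correlated with x, which is the situation where a fixed effect model is needed.

```stata
* Simulated panel: 50 individuals observed for 6 years (illustrative only)
clear
set seed 12345
set obs 50
generate id = _n
generate u = rnormal()          // unobserved individual effect
expand 6
bysort id: generate year = _n
generate x = rnormal() + 0.5*u  // x is correlated with the individual effect
generate y = 1 + 2*x + u + rnormal()

* Declare the panel structure
xtset id year

* Fixed effect and random effect models
xtreg y x, fe
estimates store fe
xtreg y x, re
estimates store re

* Hausman test: a small p-value favors the fixed effect model
hausman fe re
```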

113 Appendix 113 Variations of estimation models to reveal causality
Estimating quantities other than the mean value
  Quantile regression
