Preliminary Report on Simple Statistical Tests (t-tests and bivariate correlations)


Pierce Casey
5 years ago

After receiving my comments on the preliminary reports of your datasets, the next step for the groups is to complete all data gathering, move everything to SPSS, STATA, or some other statistical software, conduct t-tests and bivariate correlations, and report these findings. You will have until Thursday, February 28, 5 p.m. to upload these Preliminary Reports on Simple Statistical Tests to your group's common folder. These must be in a SINGLE, clearly labeled PDF file.

Once you have finished filling in the missing data in your datasets and have converted your datasets into a form that is readable by a statistical analysis program, your group will be set to do some preliminary tests. The report will reflect the following steps:

(1) Explain the logic of the core model(s) that your group has decided to test under the rubric of the chosen hypothesis.

(2) Analyze the differences in means between two groups of cases on the variables of interest using independent-samples t-tests. For example, if you are evaluating to what extent SMEs and LMEs have different standards of living, you might take per capita GDP and, using a dummy variable to select SMEs or LMEs only, have SPSS compare the difference in means of per capita GDP between these two cohorts. You may do as many t-tests as you wish. You need only report (a) the t-statistic and (b) the significance of the figure. Something like (-1.23, p<.001) would be fine. This means that the difference in means is significant at the .001 level. If the statistic is not significant, just say so. You can still report the t-statistic and the p value (for example: (2.55, p=.78)).

(3) For each of your major DVs, evaluate its correlation with selected IVs. This may be done by running separate bivariate regressions for each IV or by running a correlation matrix. The latter can be done by running a multiple regression and looking at only the correlation matrix.

In SPSS, bivariate correlations are done using the Regression, Linear function under the Analyze drop-down menu. Once you have selected a single DV and a single IV, select the Statistics button and check three things: (1) confidence intervals, (2) descriptives, and (3) collinearity diagnostics. After pressing Continue, select the Plots button. Place ZPRED in the X box and ZRESID in the Y box. Check Histogram. Then run the regression.

In the Output box you will see the Pearson's R correlation statistic, which will include a coefficient of correlation and a significance statistic. Any significance value under .05 is statistically significant. (N is the number of cases.) You will then note the R-square and the adjusted R-square. In the Coefficients box you will note the unstandardized coefficient and the standardized coefficient (which, in a bivariate regression, is the same as the Pearson's R).
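To make the quantities in steps (2) and (3) concrete, here is a minimal Python sketch of what the statistical package computes. It is an illustration only, not a substitute for the SPSS or STATA output: the function names and all data values are made up, the t-test assumes equal variances (the pooled form), and p values are left to the statistical package because they require the t distribution.

```python
import math
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Pooled (equal-variance) independent-samples t statistic and its df."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_error, df

def bivariate_regression(x, y):
    """Regress y on a single x: constant, unstandardized slope, Pearson's r, R-square."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)  # equals the standardized coefficient here
    return {"constant": my - slope * mx, "slope": slope,
            "r": r, "r_square": r * r, "n": len(x)}

# Hypothetical per capita GDP values (in thousands) for two cohorts.
gdp_sme = [30.0, 32.0, 34.0]
gdp_lme = [20.0, 22.0, 24.0]
t_stat, df = independent_t(gdp_sme, gdp_lme)

# Hypothetical IV and DV values for one bivariate test.
fit = bivariate_regression([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
print(round(t_stat, 2), df, fit)
```

In the bivariate case the standardized coefficient and Pearson's r are the same number, which is why the sketch returns only `r`.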

I would like you to report for each of your bivariate tests, in a box, these statistics in the following form (### simply substitutes for the numbers you will have):

Table 1: Test 1 Linear Regression of DV in question

Variable    Unstandardized   Standard   Standardized   T      Significance
            Coefficient      Error      Coefficient
Constant    ###              ###                       ###    ###**
IV          ###              ###        ###            ###    ###

R-square    Adjusted R-square    N
###         ###                  ###

For all variables that are significant at the .05 or .01 levels you will note their significance in the box by making the entry bold (see the example of the Constant above, which is usually significant). You will indicate the level of significance by reporting the p value like so, below the box:

* p < .05   ** p < .01

You will place a bold asterisk next to the significance statistic as demonstrated above. Please convert any unstandardized coefficients that SPSS reports in scientific notation into ordinary decimal notation. You may run as many bivariate tests as you believe necessary, but label each test with a title and a number as shown above.

What to Look For

At this stage, you are testing your variables to see how they affect the DV(s). If you get back an insignificant Pearson's R for a variable, then it may also be statistically insignificant when you run it in a multivariate model. This assumption is not always true, but you can use it as a rule of thumb. You are also judging the correlation by the R-square (or adjusted R-square) to assess the goodness of fit of the regression line. If the R-square is low, you might have a nonlinear relationship or lurking variables. The latter is very likely in bivariate analysis since you are allowing other variables to vary (i.e., they are not being controlled). In these cases, look at the standardized scatterplots which are run as part of the SPSS output. If you see something other than a random distribution around the center (e.g., a curved pattern), then the relationship may be nonlinear.
There may also be heteroskedasticity, especially if you see funnel or curved patterns in the distribution of the data points. You may also run a simple scatterplot from the Graphs drop-down menu in SPSS. If you see evidence of a nonlinear relationship, then report that in prose underneath the table of the test in question.

Another reason to look at the residuals, even if the R-square looks pretty good, is to check the pattern of outliers. In the standardized residual plot generated by SPSS as part of your output, check to see if there are many data points located outside of the -3 to 3 range. If there are, then these are probably outliers. You can click the chart in SPSS and label the cases so that you can identify which are the outliers. Look at those cases in the dataset and try to figure out why they are statistical outliers. Remember, you want to get a distribution in the simple scatterplot that is as linear as possible and a distribution in the standardized scatterplot that is random.

If in the simple scatterplot you think you are looking at an exponential relationship and not a linear one (and the IV is a socio-economic variable), then you might have to calculate the natural logarithm of the IV in question and run that instead as the IV in another bivariate test. In SPSS this is done with the Compute function in the Transform drop-down menu. Put in your source variable and put in the command for the natural log. (You can also easily do this in Excel by cutting and pasting your IV data into a worksheet, inputting the formula =LN(A1) in the adjacent column (where A1 is the first cell of your data), then selecting all of the cells underneath that first cell, selecting Fill from the Edit drop-down menu, and then Down. Once you have the natural log of the variable, cut and paste that into SPSS.) Run the regression using the natural log of the variable. That will give it a more even distribution. Again, look at the residuals in the simple scatterplot and see if the relationship looks more linear.

In the histogram, if you see a certain skew in the data to one side or the other, and perhaps a less than perfect bell-shaped curve, then report what you see. We are looking for a normal distribution in the data around zero, that is, a bell-shaped curve.

These are just preliminary tests of the data to see what kinds of relationships might exist. At the same time, you need to be thinking: What makes sense in the abstract? What should be affecting what? What would I expect to see when I run this bivariate test? Then you need to make decisions about which IVs seem to be the most robust predictors of the DV. Then, in the next step, you will specify your multivariate model.
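The log transformation and the outlier check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names and the data are invented, and each residual is standardized by dividing it by the standard error of the estimate (with n - 2 degrees of freedom), which is assumed here to correspond to what SPSS plots as ZRESID.

```python
import math
from statistics import mean

def natural_log(values):
    """Natural-log transform of an IV; every value must be strictly positive."""
    return [math.log(v) for v in values]

def standardized_residuals(x, y):
    """Residuals from regressing y on x, divided by the standard error of the estimate."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    constant = my - slope * mx
    residuals = [b - (constant + slope * a) for a, b in zip(x, y)]
    std_error = math.sqrt(sum(e * e for e in residuals) / (len(x) - 2))
    return [e / std_error for e in residuals]

def flag_outliers(z_scores, cutoff=3.0):
    """Indices of cases whose standardized residual falls outside -cutoff..cutoff."""
    return [i for i, z in enumerate(z_scores) if abs(z) > cutoff]

# Hypothetical IV spanning several orders of magnitude, then its natural log.
log_iv = natural_log([1.0, 10.0, 100.0, 1000.0])

# Hypothetical data: a perfectly linear pattern with one case pushed far off the line.
x = [float(v) for v in range(1, 21)]
y = list(x)
y[9] += 30.0
z = standardized_residuals(x, y)
print(flag_outliers(z))  # indices of the cases to look up in the dataset
```

The returned indices point at the cases to inspect in the dataset, which mirrors labeling the points in the SPSS chart.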

Preliminary Report on Multivariate Regressions

Now that you have finished your first bivariate tests, it is time to specify your multiple regression model(s). Your group must decide (1) which DV(s) (if there are multiple choices) the model will attempt to explain; (2) which main IV(s) the group will build its argument around; and (3) which control variables will be run in the model(s). Once you have specified the model, lay it out using the standard regression equation:

Y = a + b_1*x_1 + b_2*x_2 + e

Here is an example:

Standard of Living (GDP per capita) = a + b_1(Union Density) + b_2(EPL) + b_3(Unemployment) ... + b_p*X_p ... + e

In this example, the model will regress a DV called standard of living (operationalized as GDP per capita) on three variables: Union Density, the EPL index, and Unemployment. The a represents the constant (the Y intercept). This is where the regression line intercepts the Y axis, that is, the value of the DV when all the IVs are 0. The b_p's are the regression coefficients. The e is the error term, the part of the DV that the IVs do not explain; its estimated variance is used to calculate statistical significance. (You must use the same formula format but place the variable labels for your IVs and DVs in place of those listed above.) You may omit the + b_p*X_p, as this merely represents the fact that a regression equation may have more than the three IVs listed above. (NOTE: For further reference regarding multiple regression, see the document titled Multiple Regression Instructions, available in the common folder for the class.)

The expectation of this argument is that Union Density and EPL, which are higher for SMEs than for LMEs, will predict change in GDP per capita controlling for the level of unemployment. Your group will no doubt have more than three variables, but you need to specify the model as I did above, and you must list the variable names so they may be understood, as I did above.
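For groups who want to see how the a and b values in such an equation are estimated, the following is a rough Python sketch of ordinary least squares using the normal equations. It is not how SPSS or STATA should be replaced: the helper names are hypothetical, the data are invented so that the true coefficients are known (a = 10, b_1 = 0.5, b_2 = 2, b_3 = -1), and standard errors and significance tests are left to the statistical package.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(iv_rows, dv):
    """Return [a, b_1, ..., b_p] for Y = a + b_1*x_1 + ... + b_p*x_p + e."""
    design = [[1.0, *row] for row in iv_rows]  # leading 1.0 estimates the constant
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, dv)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical cases: (Union Density, EPL, Unemployment) and GDP per capita.
iv_rows = [(20.0, 1.0, 5.0), (30.0, 2.0, 7.0), (40.0, 1.5, 4.0),
           (50.0, 3.0, 9.0), (60.0, 2.5, 6.0)]
gdp_per_capita = [17.0, 22.0, 29.0, 32.0, 39.0]
coefficients = fit_ols(iv_rows, gdp_per_capita)
print([round(c, 4) for c in coefficients])
```

Because the invented DV values were generated from the equation with no error term, the estimates recover the chosen coefficients; real data will not fit this cleanly.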
The statistical results of your model will be reported using the same format you used for the bivariate analyses, with some minor modifications (i.e., I wish you to report the number of cases this time):

Table 1: Linear Regression of Standard of Living

Variable        Unstandardized   Standard   Standardized   T      Significance
                Coefficient      Error      Coefficient
Constant        ###              ###                       ###    ###**
Union Density   ###              ###        ###            ###    ###
EPL Index       ###              ###        ###            ###    ###
Unemployment    ###              ###        ###            ###    ###

R-square    Adjusted R-square    N
###         ###                  ##

For all variables that are significant at the .05 or .01 levels you will note their significance in the box by making the entry bold (see the example of the Constant above, which is usually significant). You will indicate the level of significance by reporting the p value like so, below the box:

* p < .05   ** p < .01

You will place a bold asterisk next to the significance statistic as demonstrated above. Please convert any unstandardized coefficients that SPSS reports in scientific notation into ordinary decimal notation.

Note: Variable names in your dataset may be shortened (e.g., UNIDEN, EPLI, UNEMP, etc.). But when you report the output of your regressions, I do not wish to see a cut and paste job of the SPSS or STATA output!!! I will ONLY accept a proper table with the full names of the variables written out as illustrated above. If you wish to present more than one multiple regression model, then you must do so in a single table that nests the models, one after the other. Please see me about this.

The above will constitute the final preliminary report of your group's data analysis prior to the final presentation. This report will be uploaded as a SINGLE PDF FILE by Monday, 9 a.m., March 3.

Additional Considerations About Your Data

Groups must evaluate several diagnostics in their multiple regression output. Two in particular will be of some concern to us: (1) heteroskedasticity and (2) multicollinearity.

Heteroskedasticity

As in the bivariate correlations, you should plot the standardized residuals. In SPSS this is the ZRESID vs. ZPRED plot referred to above. The plot should show a random pattern, with no nonlinearity or heteroskedasticity. If you get a funnel shape or a cone, then there is a heteroskedasticity problem. See the professor.

Multicollinearity

There are many kinds of multicollinearity, but we will be concerned mostly with one: multicollinearity among the IVs. This inflates the standard errors, which makes assessment of the significance of any given IV unreliable. You can detect multicollinearity in a number of ways. First, look at the correlation matrix in a multiple regression output. Which independent variables have reasonably high Pearson's coefficients with other IVs? Scores > .60 might be a cause of concern. Second, generate tolerance and variance inflation factor diagnostics when you run the regression. These statistics are based on regressing each IV on all of the others. The following, which is taken from the document Multiple Regression Instructions, summarizes how to use tolerance and VIF statistics:

o Tolerance is 1 - R^2 for the regression of that independent variable on all the other independents, ignoring the dependent. There will be as many tolerance coefficients as there are independents. The higher the intercorrelation of the independents, the more the tolerance will approach zero. As a rule of thumb, if tolerance is less than .20, a problem with multicollinearity is indicated. In SPSS 13, select Analyze, Regression, Linear; click Statistics; check Collinearity diagnostics to get tolerance. When tolerance is close to 0 there is high multicollinearity of that variable with other independents and the b and beta coefficients will be unstable. The more the multicollinearity, the lower the tolerance and the larger the standard error of the regression coefficients. Tolerance is part of the denominator in the formula for calculating the confidence limits on the b (partial regression) coefficient.

o Variance-inflation factor (VIF): VIF is the variance inflation factor, which is simply the reciprocal of tolerance.
Therefore, when VIF is high there is high multicollinearity and instability of the b and beta coefficients. VIF and tolerance are found in the SPSS output section on collinearity statistics. The table below shows the inflationary impact on the standard error of the regression coefficient (b) of the jth independent variable for various levels of multiple correlation (R_j), tolerance, and VIF (adapted from Fox, 1991: 12). In SPSS 13, select Analyze, Regression, Linear; click Statistics; check Collinearity diagnostics to get VIF. Note that in the "Impact on SE" column, 1.0 corresponds to no impact, 2.0 to doubling the standard error, etc.:

[Table columns: R_j, Tolerance, VIF, Impact on SE_b]

o Standard error is doubled when VIF is 4.0 and tolerance is .25, corresponding to R_j = .87. Therefore VIF >= 4 is an arbitrary but common cut-off criterion for deciding when a given independent variable displays "too much" multicollinearity: values above 4 suggest a multicollinearity problem. Some researchers use the more lenient cutoff of 5.0: if VIF >= 5, then multicollinearity is a problem.

Groups ought to discuss their findings with the professor prior to the presentation. Issues involving diagnostic results can be evaluated at that time.

The Oral Presentation

Your group will have a specified time limit (8-10 minutes) to present your findings to the class during one of the two classroom sessions dedicated to the data analysis group presentations. You may set up your presentation using whatever techniques you would like, but it must include the following content:

(1) an explanation of your main argument, including its logic and its importance,
(2) an overview of how you gathered your data,
(3) a statement of what expectations you had prior to conducting your analysis,
(4) an explanation of how you did your bivariate tests on the data and how you specified your multiple regression model,
(5) a discussion of the results of your model using a graphic display (PowerPoint or MS Word are fine tools for this),
(6) conclusions, including suggested areas for further research.

Immediately following your presentation, the group members are to be peppered with insightful and keen questions from the audience. After a few minutes of cross-examination, your presentation will be over.
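As a supplement to the tolerance and VIF discussion under Multicollinearity, the relationships among R_j, tolerance, VIF, and the impact on the standard error can be sketched in a few lines of Python. The function name is hypothetical, and the impact on the standard error is taken to be the square root of VIF, which is an assumption consistent with the statement that the standard error doubles when VIF is 4.0.

```python
import math

def collinearity_stats(r_j):
    """Tolerance, VIF, and SE inflation for an IV whose multiple correlation
    with the other IVs is r_j (must be strictly less than 1 in magnitude)."""
    tolerance = 1 - r_j ** 2        # 1 - R^2 from regressing this IV on the others
    vif = 1 / tolerance             # VIF is the reciprocal of tolerance
    impact_on_se = math.sqrt(vif)   # assumed: SE of b is inflated by sqrt(VIF)
    return tolerance, vif, impact_on_se

# R_j = .87 is the level the handout associates with a doubled standard error.
tolerance, vif, impact = collinearity_stats(0.87)
print(round(tolerance, 2), round(vif, 2), round(impact, 2))
```

Running the function over a range of R_j values shows how slowly the standard error grows at moderate correlations and how quickly it grows as R_j approaches 1.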
