SPSS GUIDE FOR THE DATA ANALYSIS TARGIL
Michael Shalev
January 2005

GETTING TO KNOW YOUR DATA

Before doing anything else you need to become familiar with all of the variables that you intend to analyze. For this purpose, use the procedures available from Descriptive Statistics (accessed from the Analyze menu). Start by requesting Frequencies for the variables of interest and study the results carefully. For example, you may find that a variable has one or more categories that are nearly empty; decide whether you want to recode them as Missing, combine them with other categories, or leave them as they are.

PAY ATTENTION TO THE SCALE OF MEASUREMENT (סולם מדידה)!

Statistical procedures like correlation and regression require continuous variables (משתנים רציפים), measured on either an interval scale (סולם רווחים) or a ratio scale (סולם מנה). In practice, researchers may treat ordinal data (סולם סדר) as if they were interval, especially questions that ask people to express their attitudes, e.g. on a scale from 1 ("Strongly favor") to 5 ("Strongly oppose"). These types of data must be distinguished from categorical variables (סולם שמי) like sex or ethnic origin.

Categorical variables can only be included in correlations or regressions if they are converted to dichotomous variables, which have only two possible values: 1 or 0. (These are often called dummy variables.) For the variable SEX the solution is easy: recode females 1 and males 0 (or vice versa). For variables with more than two categories, each of the categories except for one reference category (קבוצת התיחסות) must be converted to a dummy variable.

It is not possible to convert a categorical variable to multiple dichotomous variables if it is going to be the dependent variable in a regression. (Why? Because there can only be one dependent variable in a regression, whereas we can have as many independent variables as we like.)
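The 0/1 coding described above has a convenient property that can be checked by hand. The following is a minimal Python sketch, not part of the SPSS workflow: the function name simple_ols and the eight data values are hypothetical, invented only for illustration. It fits a one-variable regression on a dummy and compares the results with the two group means.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: SEX recoded so that females = 1 and males = 0.
female = [1, 1, 1, 1, 0, 0, 0, 0]
score = [3.0, 4.0, 5.0, 4.0, 2.0, 3.0, 2.0, 1.0]

intercept, slope = simple_ols(female, score)

# Group means computed directly, for comparison with the regression.
mean_f = sum(s for f, s in zip(female, score) if f == 1) / female.count(1)
mean_m = sum(s for f, s in zip(female, score) if f == 0) / female.count(0)

print("constant:", round(intercept, 3), " mean of group coded 0:", round(mean_m, 3))
print("slope:   ", round(slope, 3), " difference in means:  ", round(mean_f - mean_m, 3))
```

With a single dummy as the only independent variable, the constant equals the mean of the group coded 0 and the slope equals the difference between the two group means. Once other variables are added to the regression, the slope becomes the difference that remains after controlling for them.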
The regression coefficient of a dummy variable is interpreted as follows: it shows the average value of the dependent variable for the category coded 1 in comparison with the average for the reference category, which is coded 0.

DATA ANALYSIS: AN EXAMPLE

Our example is based on Arian and Shamir's article about the 1981 elections. One of their main arguments was that ethnic origin does not cause differences in voting. The most important causal variable is hawkishness: voters who oppose territorial compromise prefer the Likud. The correlation between ethnicity and voting is spurious (כוזב/מדומה) and is due to the fact that Mizrachim tend to be hawks and Ashkenazim doves.

The false model is:

    ETHNICITY --> VOTE

The true model is:

    ETHNICITY --> HAWKISHNESS --> VOTE

Testing this model requires the following four steps:

1. Decide exactly how to measure each variable in the model.
2. Use tables and charts to explore the relationships between the variables in more detail.
3. Decide whether to change the model in light of the results so far.
4. Use multiple regression to summarize the relationships between the variables and see if they fit the model(s).

1. DECIDE HOW TO MEASURE THE VARIABLES

The dependent variable is vote. Like Arian and Shamir, we will use "Vote if Knesset elections were held today" (v122) to create a new dummy variable called LIKUD, which is coded 1 if a person said they would vote for the Likud and 0 if they said they would vote for the Maarach. (Of course this is not the only solution. For example, we could have used the Left-Right scale, V116, which gives meaningful values for respondents who supported parties other than Likud and Maarach.)

The first independent variable (the original cause) is ethnic origin (עדה). There are a variety of ways that ethnic origin could be measured. (We will not do it here, but in this kind of situation it is best to repeat the analysis for each different measure. You can then see if the results depend on how the variable is measured.) We will try to improve on the way Arian and Shamir measured ethnic origin. They did not distinguish between first generation and second generation. Also, they defined all Sabras (Israeli-born whose father was also born in Israel) as Ashkenazim. We will define immigrant Ashkenazim as the reference category (קבוצת התיחסות) and we will create 4 dummy variables for the other 4 categories of the variable Country of origin (V137).
V137                                  ASH_ISR   MIZ_OLEH  MIZ_ISR   SABRA
1 Israel - Israel                     0         0         0         1
2 Israel - Asia-Africa                0         0         1         0
3 Israel - Europe-America             1         0         0         0
4 Asia-Africa - Asia-Africa           0         1         0         0
5 Europe-America - Europe-America*    0         0         0         0
6 other combination                   Missing   Missing   Missing   Missing
System Missing                        Missing   Missing   Missing   Missing

* No dummy variable is created for Ashkenazi immigrants because they serve as the reference category.

The second independent variable (the real cause) is hawkishness. The questionnaire included three different questions concerning attitude to annexation: V7, V8 and V105. Are these three different ways of measuring the same thing? Or does hawkishness have more than one dimension? To find out, we could perform a Factor Analysis (a procedure which is explained at the end of this Guide). For now we will use only one question, V8, which ranges from a value of 1 ("no territorial concessions") to 5 ("concede all of the territories"). This measures dovishness (יוניות).
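The dummy coding of V137 laid out in the table above can be expressed compactly outside SPSS. The following is a minimal Python sketch: the function name recode_v137 and the use of None to stand for Missing are assumptions made for illustration, while the variable names and codes are taken from the table.

```python
DUMMY_NAMES = ("ASH_ISR", "MIZ_OLEH", "MIZ_ISR", "SABRA")

# Which dummy (if any) is switched on for each valid V137 code.
# Code 5 (immigrant Ashkenazim) is the reference category: all dummies are 0.
CODE_TO_DUMMY = {1: "SABRA", 2: "MIZ_ISR", 3: "ASH_ISR", 4: "MIZ_OLEH", 5: None}


def recode_v137(code):
    """Return the four ethnic dummies for one respondent's V137 code.

    Code 6 ("other combination") and any other value, including a
    system-missing value represented here by None, give Missing (None)
    on all four dummies.
    """
    if code not in CODE_TO_DUMMY:
        return {name: None for name in DUMMY_NAMES}
    switched_on = CODE_TO_DUMMY[code]
    return {name: int(name == switched_on) for name in DUMMY_NAMES}


for code in (1, 2, 3, 4, 5, 6, None):
    print(code, recode_v137(code))
```

Keeping the mapping in a single dictionary makes it easy to try a different reference category or a different grouping of codes and then repeat the analysis.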

2. GENERATE TABLES

Use the Basic Tables procedure (from the Analyze menu, choose Custom Tables). This will help us to find out how different types of respondents differ in their voting (the dependent variable). In the Basic Tables dialog box, the variable to be summarized is LIKUD. (The summary we will want is the mean [ממוצע] Likud vote.) The types of respondents are combinations of ethnicity (V137) and dovishness (V8).

We want to know what happens to the effect of ethnicity when dovishness is controlled. So for each category of dovishness, we need to know the mean Likud vote of each ethnic group (עדה). When creating a table like this, it is a good idea to put the control variable (or variables) in the columns. In this case we request ethnic origin (V137) Down (in the rows), and dovishness (V8) Across (in the columns).

To get tables that look good, the first time you make a table you will need to select a variety of options. (They will be remembered for later tables in the same SPSS session.)

(1) Click on Statistics, select Mean, click on the arrow next to "Format", scroll up and select "ddd.dd", change "Decimals" to 2, click on "Add", click on "Continue". [This will generate means with two decimal places. You should choose a value of "Decimals" that suits your dependent variable.]
(2) Click on Totals, select "Totals over each group variable" and click on "Continue".
(3) Click on Layout and choose "In separate tables" for both "Summary variable labels" and "Statistics labels".
(4) Add a Title.
(5) Now click OK to create the table!
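The logic of a table of means by two grouping variables can be sketched in a few lines of Python. This is an illustration only, not what SPSS does internally: the function name mean_table and the eight cases are hypothetical. The sketch also returns the number of cases behind each mean, because a mean based on very few cases is unreliable.

```python
from collections import defaultdict


def mean_table(rows, row_var, col_var, summary_var):
    """Mean of summary_var, and number of cases, for each (row, column) cell."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        values = (r.get(row_var), r.get(col_var), r.get(summary_var))
        if None in values:
            continue  # skip cases with a missing value on any of the three
        key = (values[0], values[1])
        sums[key] += values[2]
        counts[key] += 1
    return {key: (sums[key] / counts[key], counts[key]) for key in sums}


# Hypothetical cases: V137 = country of origin, V8 = dovishness, LIKUD = 0/1.
cases = [
    {"V137": 4, "V8": 1, "LIKUD": 1},
    {"V137": 4, "V8": 1, "LIKUD": 0},
    {"V137": 4, "V8": 3, "LIKUD": 0},
    {"V137": 5, "V8": 1, "LIKUD": 1},
    {"V137": 5, "V8": 1, "LIKUD": 0},
    {"V137": 5, "V8": 1, "LIKUD": 0},
    {"V137": 5, "V8": 1, "LIKUD": 0},
    {"V137": 5, "V8": None, "LIKUD": 1},  # missing V8: left out of the table
]

table = mean_table(cases, "V137", "V8", "LIKUD")
for (origin, dove), (mean, n) in sorted(table.items()):
    print(f"V137={origin} V8={dove} mean={mean:.2f} n={n}")
```

Because LIKUD is coded 0/1, each cell mean is simply the proportion of respondents in that cell who said they would vote Likud.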

Mean Likud Vote in 1981 (cell entries are means)

                                     Territorial concessions
Country of origin                    1 None   2 Very little   3 Some   4 Nearly all   5 All   Group Total
1 Israel - Israel                    .62      .69             .36      .33            .00     .55
2 Israel - Asia-Africa               .66      .35             .35      .67            .00     .53
3 Israel - Europe-America            .48      .22             .24      .00            .00     .32
4 Asia-Africa - Asia-Africa          .60      .48             .40      .00            .33     .52
5 Europe-America - Europe-America    .37      .34             .17      .00            .00     .28
6 other combination                  .00      .               .00      .00            .       .00
Group Total                          .55      .38             .27      .17            .05     .43

Study the table carefully and you will see several interesting things. First, look at the Group Totals in the last column, which show the overall effect of ethnicity: this effect is large. Both Mizrachim and Sabras were much more likely than Ashkenazim to vote Likud (more than 50%, compared with about 30%). We can also see that the difference between the foreign-born and Israeli-born generations is small.

Second, look at the Group Totals in the bottom row, which show the effect of dovishness on voting. This effect is very strong, and it also appears to be linear. We can make a bar chart to make sure. Double-click the table, select the first 5 numbers in the bottom row, then click the right mouse button and choose Create Graph and Bar. In the resulting chart, the tallest bar is for people ready to give up no territory (שטחים); the shortest bar is for those ready to give up all of the occupied territories in return for peace.

What about the inner cells of the table? If the ethnic effect on voting really is spurious, then within categories of dovishness there will be no ethnic vote. A quick look shows that inside each column (level of dovishness) there are still differences in the Likud vote between ethnic groups. However, before going any further we need to answer two questions.

First, are there enough cases for us to have confidence in the means? If there are very few
people in a cell, its mean is not worth much. We therefore generate the table again, but this time requesting Count instead of Mean as the desired Statistic. The results (not shown) reveal that there were very few people in the two most extreme categories of the dovishness variable.

Second, are there any categories that should be dropped or combined? The Sabras (Israeli-born whose fathers were also born here) should be dropped. We already saw from the previous table that there is little difference in the vote of foreign-born and Israeli-born respondents.

We conclude that it would be a good idea to simplify both of our variables. First we recode V8 into a new variable, V8new, and give it the label Dovishness. The new variable combines the three most dovish categories (3, 4 and 5). Then we recode V137 into V137new, which leaves out the Sabras and compares all Ashkenazim (coded 1) with all Mizrachim (coded 0). (It's important to add Value Labels so your results will show that 1 is Ashkenazim and 0 is Mizrachim.)

We will use the two new variables to make a chart showing the difference in Likud vote for Mizrachim and Ashkenazim at different levels of dovishness. Each level of dovishness will be represented by a different line. For this to work we must specify dovishness (V8new) as the row variable ("Down"). After creating the table, double-click it, then select all cells except Group Totals and click with the right mouse button to request a Line chart. The result is the chart on the left.

The question we asked is: do ethnic differences in voting disappear within categories of dovishness? The answer is NO! Whether they were hawks or doves, many more Mizrachim than Ashkenazim planned to vote Likud.

Now let's ask a different question: can we see any conditional relationships (קשרים מותנים)? In other words, is there any difference between the slopes of the three lines? If so, it would mean that the effect of ethnicity depends on the level of dovishness.
But there is actually not much difference. For a good example of a conditional relationship, look at the chart on the right. Here the effect of ethnicity is examined for different categories of religiousness (שמירת מסורת) instead of for
differences in dovishness. We see that among people who are very religious (the top line), the slope is unusual. In this group, Ashkenazim were actually more likely than Mizrachim to vote Likud.

3. SHOULD THE MODEL BE CHANGED?

We try to learn from the Basic Tables how best to set up our regression model. In the present example one result (already mentioned) was the discovery that generation does not matter: the Likud vote is very similar for immigrant and Israeli Ashkenazim, and for immigrant and Israeli Mizrachim. Therefore, instead of using 4 dummy variables for the 5 categories of ethnic origin (עדה) in our regressions, we could use only two: Mizrachim and Sabras, with all Ashkenazim serving as the reference category. (We do not actually make this change in the example below.)

More important would be additions or changes to the causal relationships that our regression is designed to test. We have used Basic Tables to check two issues: (1) whether the effect of dovishness on voting is linear; and (2) whether dovishness conditions the effect of ethnicity. In reality, neither was a problem, but what if they had been?

(1) Suppose we had found that dovishness made no difference to voting except for the most dovish group. Then, rather than continuing to measure dovishness as a continuous variable, it would have been better to turn it into a dummy variable (1=very dovish, 0=everyone else).
(2) What if the effect of ethnicity had been conditional on dovishness? (For example, all hawks support Likud, but among other voters there is a difference between Ashkenazim and Mizrachim.) That would have required testing for interaction, which is explained below under Testing for Interaction.

4. REGRESSIONS

The tables showed that both ethnicity and attitude to annexation affect whether people vote for the Likud. Multiple regression (רגרסיה מרובה) will provide a test of whether the effect of ethnicity on voting is spurious. The purpose of the regression is to see what happens to the effect of ethnicity after we control for dovishness.
If the coefficients (מקדמים) of the ethnic dummy variables get a lot smaller, this would support the Arian-Shamir hypothesis of a spurious relationship (קשר כוזב). But smaller coefficients may also be consistent with the hypothesis that the effect of ethnicity is mediated by (מתווך ע"י) dovishness, or that both ethnicity and dovishness affect voting (complementary effects - קשרים משלימים). If on the other hand the coefficients remain the same, then the two effects are independent as well as complementary (הפיקוח אינו יוצר תיקון).

SPSS makes it easy to run "before" and "after" regressions. The original model is defined as the first block. The next model (the second block) adds variables that were not in the first model.

Use the Linear Regression procedure (from the Analyze menu, choose Regression). Select LIKUD as your Dependent Variable and the four ethnic dummy variables as your Independent Variables. Press Next and add V8 to your second Block. Click on Statistics, select Confidence intervals and R squared change, then click Continue. Now click OK to run the regression.
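The before-and-after logic of the two blocks can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the function ols is a bare-bones least-squares solver (not SPSS's implementation), a single hypothetical dummy miz stands in for the four ethnic dummies, dove stands in for V8, and the twelve cases are invented.

```python
def ols(X, y):
    """Least-squares coefficients [constant, b1, b2, ...] for rows of predictors X."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend the constant
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for idx in range(k):
            if idx != col:
                factor = A[idx][col] / A[col][col]
                A[idx] = [a - factor * b for a, b in zip(A[idx], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical cases: miz = 1 for Mizrachim, dove = dovishness (1 or 2),
# likud = 1 if the respondent would vote Likud.
miz   = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
dove  = [1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2]
likud = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0]

model1 = ols([[m] for m in miz], likud)       # Block 1: ethnicity only
model2 = ols(list(zip(miz, dove)), likud)     # Block 2: adds dovishness

print("Model 1 B for miz:", round(model1[1], 3))
print("Model 2 B for miz:", round(model2[1], 3))
print("Model 2 B for dove:", round(model2[2], 3))
```

In these invented data the coefficient of miz shrinks once dove is controlled, but it does not vanish. That is the pattern to look for when comparing the two models: a coefficient that falls to near zero points toward a spurious or fully mediated effect, while one that stays substantial does not.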

Understanding Regression Output

The first table of output is the Model Summary. It shows the percentage of variance explained by each model. (Remember, Model 1 is what we defined in Block 1; Model 2 includes Block 2 as well.) We see that ethnicity alone explains 4.7% (.047) of the variance in LIKUD. Adding V8 to the regression more than doubles the Adjusted R-squared, which rises to 10.6% (.106).

As discussed in class, we are more interested in the effects (regression coefficients, or "slopes") than in the ability of the model to explain variance. However, we may be interested in the F-test of whether the change in R-squared between models is significant. In this case it definitely is statistically significant (see the arrow pointing to .000). This F-test can be very useful when a new block adds more than one independent variable. It tests whether these variables as a group add anything significant to the previous model.

Next we need to look at the regression coefficients (the table labeled Coefficients). First let's examine the unstandardized coefficients (B) that are highlighted in yellow. Do Mizrachim vote differently from Ashkenazim, and what difference does it make if we control for dovishness?

The B coefficient for MIZ in Model 1 is .232. An unstandardized regression coefficient represents the expected effect on Y of a 1-unit increase in X. In the special case of dummy variables, the coefficient represents the average difference between the category coded 1 and the reference category. The coefficient for MIZ shows that, on average, the proportion voting Likud among foreign-born Mizrachim is 23.2 percentage points higher than among foreign-born Ashkenazim.

What happens when V8 is added to the regression (Model 2)? The coefficients of all the ethnic dummy variables decline slightly, but remain substantial. Comparing the coefficients for MIZ, we see that the gap between foreign-born Mizrachim and Ashkenazim falls from 23 to 19 percentage points.

Thus, the effect of ethnicity is not spurious. Only a small part of ethnicity's effect is actually due to dovishness, or is mediated by dovishness. Our results therefore do not support Arian and Shamir's main claim.

Several other features of the regression output are worth noting:

1. The constant (הקבוע): it is interpreted as the expected value of Y when all X's (independent variables) are zero. Usually this is not very informative. But in Model 1 the constant is .283, which means that 28.3% of the reference category (immigrant Ashkenazim) voted Likud. (Take a look at the Group Total column of the Mean Likud Vote table in section 2, and you will see the identical result!)

2. The column headed Sig. shows the significance level of each coefficient. The previous column shows the value of the t-statistic, from which significance is computed. If the absolute value of t is at least 2 then the coefficient will usually be significant at the 5% level or better. If the value of Sig. is greater than .05 (5%), this means that a coefficient this large could turn up by chance in more than 5% of samples even if the true coefficient were zero (i.e. we cannot be confident that the independent variable has any effect).

Significance levels provide a very rough guide to whether a regression coefficient means anything. More helpful are the Confidence Intervals (רווחי בר סמך) shown in the last two columns of the regression table. Recall that statistical significance is based on the idea that our data come from a sample which is only one of many possible samples that might have been drawn. The question is, how typical are the coefficients in the sample we used compared with what would have been found if we could have run the regression for all possible samples? What SPSS computes for us is the range of coefficients that would have been expected in 95% of these samples. If the values in this range are all meaningful to us, this is a good indication that a coefficient is solid. In Model 2 the effect of MIZ in 95% of samples is expected to be somewhere between 10 and 29 percentage points.
This is encouraging, because even the lowest expected effect is quite large.

3. An effect can be statistically significant without being important. Importance is something only you can judge. It often helps to use the results of the regression to estimate what difference it makes. Consider the following simple comparison showing the effect of dovishness, after controlling for ethnicity (Model 2). The figures apply to the reference category, for which all the ethnic dummy variables are zero. Among people who were opposed to any territorial compromise (V8=1), the expected vote for Likud is 40.7% (.527 + [-.120*1]). Among people who were ready to concede some territory (V8=3), only 16.7% are expected to vote Likud (.527 + [-.120*3]). That's a difference of 24 percentage points!

4. The Beta coefficients (β) in the third column of numbers indicate the relative importance of different independent variables, measured in units of standard deviations. (This usually cannot be judged by comparing ordinary B coefficients, because different independent variables are not measured on the same scale.) Knowing the relative importance of the effects may or may not be of interest: it depends on the question you are asking. In the present case, Arian and Shamir might have been interested in knowing whether dovishness affects voting more than ethnicity. However, because ethnicity is measured by 4 different dummy variables, we cannot directly compare the effects of ethnicity and dovishness.

Testing for Interaction

The analysis so far has illustrated how to test for all types of causal relationship except a conditional relationship (קשר מותנה). For this purpose we shall go back to the example that was shown earlier, in the chart of the Likud vote by ethnicity and religiousness. This showed that while Mizrachim are normally more favorable to the Likud, among very religious voters more Ashkenazim than Mizrachim favored the Likud. In the model below, the red line represents the conditional relationship that we would like to test.

[Diagram: ETHNICITY --> VOTE, with a line from RELIGIOUSNESS pointing at the arrow between them]

This is often referred to as "interaction" between the effects of two independent variables. There are two ways of testing for interaction. In the present example, the simplest way would be to estimate the effect of ethnicity twice: once for very religious people and once for everyone else. If the coefficients of the ethnic origin (עדה) variables in these two regressions are different, that would support the idea of a conditional relationship.

SPSS has a procedure called Split File which makes it easy to use this method. In the present example, we recode V143 to create a new variable called DATI with two categories (very religious = 1, everyone else = 0). From the Data menu click on Split File, select Compare groups, use the arrow to select the DATI variable, and press OK. From now on, unless you choose the option Analyze all cases, any analysis you do will show separate results for very religious people and for everyone else.

Here's what happens when we run a regression to test the effect of ethnicity and religious observance on LIKUD with Split File turned on. The area highlighted in blue shows that the regression results are repeated twice: once for respondents coded 0 on DATI, and again for those coded 1. (Why doesn't the second block, Model 2, appear for DATI=1? Because all of the respondents in this group are very religious, there is no variation in V143, and therefore SPSS cannot test its effect on voting.)

What interests us is whether the coefficients for the ethnic dummy variables are different for DATI=0 and DATI=1. They certainly are. We highlight in yellow one example, Mizrachim born abroad in comparison with Ashkenazim born abroad. The difference in expected support for the Likud is much higher among the very religious (40.4 percentage points) than among other respondents (15.7 percentage points after controlling for variations in religiosity).

This method of testing conditional relationships is limited.
If the conditioning variable has many categories, or if it is a continuous variable, then the analysis will have to be split into too many parts. The solution is to add one more variable, called an interaction term, to the regression. This variable is the product (מכפיל) of the two variables of interest. It tests whether combining
the two variables has a different effect than the effects of each one on its own.

We will illustrate how it's done with a simplified version of the conditional model we have been using so far. Instead of using 4 dummy variables to test the effect of ethnicity, we will contrast all Mizrachim and all Ashkenazim by using a single dummy variable called MIZ_ASH. (Sabras are treated here as Missing.) We already have a dummy variable DATI contrasting very religious people with everyone else. Now we create an interaction variable called INTERAC, which is equal to MIZ_ASH * DATI (this is easily done using Compute, available from the Transform menu).

We now run the regression in 3 blocks. Model 1 shows that the Mizrachi vote for Likud is 23.2 percentage points higher than for Ashkenazim, and we see from Model 2 that this gap falls very slightly to 22.4 points after the effect of strong religious observance is taken into account. Model 3 includes the variable INTERAC to test whether the effect of ethnicity (MIZ_ASH) depends on whether people are very religious or not (DATI). We can use the result to help us calculate the expected effect of MIZ_ASH for each value of DATI, the conditioning variable. The calculation can easily be done by making a table like the following one.

(1) Value of             (2) B of conditioned   (3) B of interaction   (4) Conditional B
conditioning variable    variable (MIZ_ASH)     term (INTERAC)         (4) = (2) + [(1)*(3)]
(DATI)
1                        0.233                  -0.284                 -0.051
0                        0.233                  -0.284                  0.233

The fourth column shows the effect of MIZ_ASH on voting, conditional on DATI. We see that the effect of ethnicity is expected to be much smaller for very religious people than for other people. In fact, among very religious voters the difference between Mizrachim and Ashkenazim is slightly negative: the Likud vote among Ashkenazim is 5.1 percentage points higher than among Mizrachim.

Testing the statistical significance of conditional relationships is complicated.
For our purposes, the important thing is whether the results are substantively meaningful.
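The conditional-effect calculation described above is simple enough to script. The following is a minimal Python sketch: the function names interaction_term and conditional_b are hypothetical, None stands for a Missing value, and the two coefficients are the Model 3 values reported above for MIZ_ASH and INTERAC.

```python
def interaction_term(miz_ash, dati):
    """INTERAC = MIZ_ASH * DATI; Missing (None) if either component is Missing."""
    if miz_ash is None or dati is None:
        return None
    return miz_ash * dati


def conditional_b(b_conditioned, b_interaction, conditioning_value):
    """Effect of the conditioned variable at a given value of the conditioning variable."""
    return b_conditioned + conditioning_value * b_interaction


B_MIZ_ASH = 0.233   # B of MIZ_ASH in Model 3, as reported above
B_INTERAC = -0.284  # B of INTERAC in Model 3, as reported above

for dati in (1, 0):
    effect = conditional_b(B_MIZ_ASH, B_INTERAC, dati)
    print("DATI =", dati, " conditional B of MIZ_ASH =", round(effect, 3))
# prints -0.051 for DATI = 1 and 0.233 for DATI = 0
```

The same function works unchanged when the conditioning variable is continuous: pass each value of interest (for example, each point on a 1 to 5 scale) to see how the effect of the conditioned variable changes along it.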

FACTOR ANALYSIS

Factor Analysis is a useful way of finding out whether a group of variables share a single common denominator (מכנה משותף), or whether they are best summarized using several different dimensions (מימדים). Even if we already know the answer to that question, Factor Analysis can simplify the construction of a new variable that summarizes the values of existing variables. In creating scores for the factors that it uncovers, the procedure can combine the values of variables measured on different scales. The resulting factor scores are easy to interpret, since they always have a mean of zero and a standard deviation of 1.

Suppose we were interested in measuring what people like or dislike about different political parties. Variables V51-V60 cover a variety of reasons why voters might be attracted to the Labour Party. Four of them are: experience in government (V51), good leaders (V52), peace policy (V54) and social policy (V55). Do people who favor Labour rank the party high on all of these aspects, while people who dislike Labour dislike it in all respects? Or does like/dislike of Labour divide among several different dimensions?

To find out we use these menus: Analyze, Data reduction, Factor. The 4 variables which interest us have been selected. Now we should change some of the details. Click on Rotation and select Varimax, then click Continue. Click on Options. Select "Sorted by size" and "Suppress absolute values less than .10" and then click OK.

Total Variance Explained

             Initial Eigenvalues                         Extraction Sums of Squared Loadings
Component    Total    % of Variance    Cumulative %      Total    % of Variance    Cumulative %
1            2.821    70.519           70.519            2.821    70.519           70.519
2            .501     12.517           83.037
3            .372     9.305            92.342
4            .306     7.658            100.000
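The "% of Variance" and "Cumulative %" columns of a Total Variance Explained table follow directly from the eigenvalues, as the minimal Python sketch below shows. The function name variance_explained is hypothetical. The cutoff used to flag components for extraction (an eigenvalue greater than 1) is an assumption: it is a common rule of thumb, but this Guide does not say which rule was applied.

```python
def variance_explained(eigenvalues):
    """Return (component, eigenvalue, % of variance, cumulative %, extracted?) rows."""
    # With standardized variables each variable contributes a variance of 1,
    # so the total variance equals the number of variables.
    total = len(eigenvalues)
    rows = []
    cumulative = 0.0
    for component, ev in enumerate(eigenvalues, start=1):
        pct = 100.0 * ev / total
        cumulative += pct
        extracted = ev > 1.0  # assumed rule: keep eigenvalues above 1
        rows.append((component, ev, pct, cumulative, extracted))
    return rows


eigenvalues = [2.821, 0.501, 0.372, 0.306]  # Total column of the table above
rows = variance_explained(eigenvalues)
for component, ev, pct, cumulative, extracted in rows:
    flag = "extracted" if extracted else ""
    print(f"{component}  {ev:.3f}  {pct:6.2f}  {cumulative:6.2f}  {flag}")
```

The percentages computed this way differ from the printed table in the second decimal place, because the eigenvalues shown in the table are themselves rounded to three decimals.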

The Total Variance Explained table above shows the variance explained by the factors. In this example only the first factor (called Component 1) is important enough to be extracted. We see that it alone explains 70% of the collective variance in the four variables that were analyzed.

We turn now to a different example. The Arian-Shamir survey included a variety of questions concerning involvement in election-related activities that could influence a person's vote. Is there one underlying factor here? For example, do people who participate in party rallies also tend to be influenced by the media campaign (or have they already made up their minds)? Are people who admit to being influenced by the television campaign also influenced by the radio and newspaper campaigns? A factor analysis of 7 relevant variables, V17 through V23, shows that there are two main factors.

Total Variance Explained

             Initial Eigenvalues                      Extraction Sums of Squared Loadings      Rotation Sums of Squared Loadings
Component    Total    % of Variance   Cumulative %    Total    % of Variance   Cumulative %    Total    % of Variance   Cumulative %
1            3.219    45.992          45.992          3.219    45.992          45.992          3.041    43.450          43.450
2            1.758    25.121          71.113          1.758    25.121          71.113          1.936    27.663          71.113
3            .846     12.091          83.203
4            .483     6.906           90.109
5            .271     3.870           93.979
6            .246     3.516           97.495
7            .175     2.505           100.000

Extraction Method: Principal Component Analysis.

The first factor is more important than the second one. Together they explain the majority (71%) of total variance. Note that the final factors have been rotated to make them as dissimilar from one another as possible.
Rotated Component Matrix

                                                   Component
                                                   1        2
V23 extent TV broadcasts help decide               .929
V21 extent radio broadcasts help decide            .907
V22 extent newspapers help decide                  .897
V20 extent the election campaign help decide       .732     .273
V17 participate parties' rallies                            .892
V18 participate parties' home-circles                       .888
V19 extent talk about political issues                      .523

There is a very clear division between the two factors ("Components"). The table above shows the extent to which variables contribute to each factor (the numbers represent correlations between each variable and a factor). By studying these "loadings", as they are called, we can learn how to interpret the factors. The first factor represents passive participation in the election campaign, indicated by the extent to which voters are influenced by the media and the parties' campaign efforts. The second factor covers actions initiated by voters themselves.

In order to create summary scores for each factor, we simply need to run the Factor Analysis again but with one change: click on Scores and press Save as variables. In the present instance two new variables would be added to the dataset, one for each factor. These variables could be used in regressions or any other type of analysis.

Comments or suggestions? Please send them to shalev@vms.huji.ac.il