Announcement: Homework #2 is due next Friday at 5pm. The midterm is in 2 weeks. It will cover everything through the end of next week (week 5).
Political Science 15 Lecture 8: Descriptive Statistics (Part 1)
Data Coding
Coding is the process of assigning numerical values to the values of your variable. The meaning of these codes will depend on the level of measurement of the variable:
Nominal: codes are just indications of the category
Ordinal: codes are indications of ordering
Interval/Ratio: codes are the actual numerical value
Preparing Data for Hypothesis Testing Gather measurements on all of the concepts important for your hypothesis (dependent, independent, and control variables). Enter them into a spreadsheet. We will use SPSS in this class. Each row is an observation (unit), each column is a variable.
Example of Data Ready for Hypothesis Testing
Interview #   Religion   Income   Ideology
1             1          35000    4
2             1          46000    3
3             3          82000    5
4             2          19000    2
5             1          67000    6
We use a codebook to find out what these numbers mean.
Descriptive Statistics
Descriptive statistics can be used for descriptive inference: using data to learn something about the state of the world. These descriptive statistics will also be the building blocks we use for causal inference: testing our hypotheses with data to learn something about how the world works. We begin with descriptive statistics for a single variable.
Understanding Our Data Before undertaking any data analysis you should examine your data carefully. Watch for unusual distributions of variable values and outliers in the data. An outlier is an extreme value on a variable. Try to determine why you have observed this value. An unusual case? A coding error?
Example of an outlier affecting a relationship
Exploring Data: Frequency Distributions
Divide the variable into a set of exhaustive, mutually exclusive categories. Example:
Ideology       # of people   Percent   Cumulative Percent
Conservative   300           30%       30%
Moderate       500           50%       80%
Liberal        200           20%       100%
Total          1000          100%      100%
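A minimal sketch of how the percent and cumulative percent columns in the ideology example could be computed from the raw counts. The function name `frequency_table` is hypothetical, chosen for illustration; the class itself uses SPSS for this.

```python
# Counts from the ideology example, listed in the order shown on the slide.
counts = {"Conservative": 300, "Moderate": 500, "Liberal": 200}

def frequency_table(counts):
    """Return rows of (category, count, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    running = 0  # observations counted so far, for the cumulative column
    for category, n in counts.items():
        running += n
        rows.append((category, n, 100 * n / total, 100 * running / total))
    return rows

table = frequency_table(counts)
for category, n, pct, cum_pct in table:
    print(f"{category:<14}{n:>6}{pct:>8.0f}%{cum_pct:>8.0f}%")
```

The cumulative column is only meaningful when the categories have a natural order, as they do here.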
Exploring Data: Graphical Methods For nominal and ordinal level data bar graphs work well:
Exploring Data: Graphical Methods For interval level data a histogram is useful (note detection of outlier):
Central Tendency: Mode
The mode is the category of a variable with the greatest frequency of observations. The mode is most commonly used on variables with a nominal level of measurement. There can be more than one modal value for a variable. Variables with more than one mode are referred to as bimodal or multimodal. Example: In a party ID variable with 40 Democrats, 60 Republicans, and 20 Independents, the mode is Republican.
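As a rough illustration, the mode can be found by picking out every category tied for the highest count, which also handles the bimodal and multimodal cases. The helper name `modes` is an assumption for this sketch.

```python
from collections import Counter

# Party ID counts from the example.
party_counts = Counter({"Democrat": 40, "Republican": 60, "Independent": 20})

def modes(counter):
    """Return every category tied for the highest frequency."""
    top = max(counter.values())
    return [category for category, n in counter.items() if n == top]

print(modes(party_counts))
```

Returning a list rather than a single value keeps the result honest when two or more categories tie.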
Central Tendency: Median
The median is the value of a variable that divides the observations on that variable in half. If we ordered our observations on a variable from lowest to highest, the median observation is the one in the middle. With an even number of observations there is no single middle observation; by convention the median is then the average of the two middle values. The median is most commonly used on variables with an ordinal level of measurement, but is sometimes used on interval/ratio data because it is resistant to outliers.
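Example of Calculating Median
We have a 7-point scale on ideology in a survey:
Category:      1    2    3    4     5    6    7
# responses:   32   54   97   103   44   21   12
There are N = 363 responses, so the median observation is observation (N+1)/2 = 182. Counting up from the lowest value, categories 1 and 2 hold the first 86 observations and category 3 brings the cumulative count to 183, so observation 182 falls in category 3. The median is 3.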
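The counting-up procedure can be sketched as follows. The function name `median_category` is hypothetical, and the sketch assumes an odd number of observations so that a single middle observation exists.

```python
# Responses on the 7-point ideology scale from the example.
responses = {1: 32, 2: 54, 3: 97, 4: 103, 5: 44, 6: 21, 7: 12}

def median_category(freq):
    """Return the category containing observation (N+1)/2 (assumes odd N)."""
    n = sum(freq.values())
    middle = (n + 1) // 2
    running = 0  # cumulative count of observations
    for category in sorted(freq):
        running += freq[category]
        if running >= middle:
            return category

n_total = sum(responses.values())
print(n_total, (n_total + 1) // 2, median_category(responses))
```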
Quartiles
If we arrange a variable from lowest to highest value, the median is the observation at the 50% mark.
Quartiles: the 25%, 50% and 75% marks
Quintiles: 20%, 40%, 60%, 80%
Deciles: every 10%
Percentiles: every 1%
We can use these to get a more detailed picture of the distribution of a variable.
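A small sketch of computing quartiles with Python's standard library. The data are the seven campaign spending figures used in the later examples, borrowed here purely for illustration. Software packages use slightly different interpolation rules, so quartile values from another package may differ; this sketch relies on the default ("exclusive") method of `statistics.quantiles`.

```python
import statistics

# Campaign spending in 7 districts (borrowed from the later examples).
spending = [1000, 5000, 3500, 2000, 0, 800, 6000]

# n=4 asks for the three cut points that split the data into quarters.
quartiles = statistics.quantiles(spending, n=4)
print(quartiles)

# The middle cut point is the median.
print(statistics.median(spending))
```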
Central Tendency: Mean
The mean is the sum of the values of a variable divided by the number of observations on that variable. This is usually what people mean by average. The formula for the mean is written as:
$$ \bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} $$
The mean is most commonly used on variables with an interval level of measurement.
Example of Calculating Mean
We have campaign spending in 7 districts:
District:   1      2      3      4      5   6     7
$ spent:    1000   5000   3500   2000   0   800   6000
ΣX = 1000 + 5000 + 3500 + 2000 + 0 + 800 + 6000 = 18300. N = 7.
The mean is 18300/7 ≈ 2614.
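The same arithmetic as a short sketch; the function name `mean` is just an illustrative choice.

```python
# Campaign spending in the 7 districts from the example.
spending = [1000, 5000, 3500, 2000, 0, 800, 6000]

def mean(values):
    """Sum of the values divided by the number of observations."""
    return sum(values) / len(values)

print(sum(spending), len(spending), round(mean(spending)))
```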
Central Tendencies in Global Income Distribution
Dispersion: Standard Deviation
The variance of a variable is the sum of the squared differences between each value of that variable and the mean, divided by N − 1. We square the differences so that positive and negative differences don't cancel out. We divide by N − 1 to get a (conservative) estimate of the mean dispersion of the variable. The square root of the variance is the standard deviation:
$$ s = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}} $$
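A minimal sketch of the definition above, applied to the seven campaign spending figures from the mean example. The function name `sample_sd` is hypothetical. For comparison, the sketch also calls `statistics.pstdev`, which divides by N instead of N − 1: with these data, dividing by N − 1 gives roughly 2275, while dividing by N gives roughly 2106.

```python
import math
import statistics

# Campaign spending in the 7 districts from the mean example.
spending = [1000, 5000, 3500, 2000, 0, 800, 6000]

def sample_sd(values):
    """Standard deviation with N - 1 in the denominator."""
    m = sum(values) / len(values)
    squared_diffs = sum((x - m) ** 2 for x in values)
    return math.sqrt(squared_diffs / (len(values) - 1))

print(round(sample_sd(spending)))           # divides by N - 1
print(round(statistics.stdev(spending)))    # library version, also N - 1
print(round(statistics.pstdev(spending)))   # divides by N instead
```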
Example of Calculating Standard Deviation
We have campaign spending in 7 districts:
District:   1      2      3      4      5   6     7
$ spent:    1000   5000   3500   2000   0   800   6000
Mean of variable is 2614.
$$ s = \sqrt{\frac{1}{6}\left[(1000 - 2614)^2 + (5000 - 2614)^2 + \cdots\right]} $$
The standard deviation is 2275.
z scores
A z score is a measure of how many standard deviations a particular observation is above or below the mean. We subtract the mean from the observation and divide by the standard deviation:
$$ z_i = \frac{X_i - \bar{X}}{s} $$
Example of Calculating z scores
We have campaign spending in 7 districts:
District:   1      2      3      4      5   6     7
$ spent:    1000   5000   3500   2000   0   800   6000
Mean of variable is 2614. Standard deviation of variable is 2275.
z score for district 1 is (1000 − 2614)/2275 = −0.71
z score for district 2 is (5000 − 2614)/2275 = 1.05
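A sketch of the z score calculation for all seven districts. The function name `z_scores` is hypothetical. The result depends on which standard deviation is plugged in: this sketch uses the N − 1 version (about 2275 for these data), which gives about −0.71 for district 1, whereas the divide-by-N value (about 2106) would give about −0.77.

```python
import statistics

# Campaign spending in the 7 districts from the example.
spending = [1000, 5000, 3500, 2000, 0, 800, 6000]

def z_scores(values):
    """Subtract the mean from each value and divide by the N - 1 standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # N - 1 in the denominator
    return [(x - m) / s for x in values]

z = z_scores(spending)
for district, score in enumerate(z, start=1):
    print(district, round(score, 2))
```

Because each z score is a deviation from the mean, the scores for a variable always sum to zero (up to rounding).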
Descriptive Statistics for Relationships Between Variables These are the more interesting descriptive statistics from our perspective, since we are interested in testing causal relationships between variables. Our hypothesis tests later in the class will usually be based on these calculations. As with a single variable, we begin by exploring our data to be sure we understand it.
Exploring Data: Bivariate Frequency Distributions
Divide the variables into a set of exhaustive, mutually exclusive categories. Example:
                       Favors Gas Tax   Opposes Gas Tax
Party ID: Democrat     50% (N=500)      10% (N=100)
Party ID: Republican   10% (N=100)      30% (N=300)
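A rough sketch of how the cell percentages in the gas tax example follow from the counts. The function names are hypothetical. The second function computes percentages within each party, which is an addition beyond the slide: it is often the easier way to compare groups of different sizes.

```python
# Cell counts from the gas tax example (1000 respondents in total).
counts = {
    "Democrat": {"Favors": 500, "Opposes": 100},
    "Republican": {"Favors": 100, "Opposes": 300},
}

def total_percentages(table):
    """Each cell as a percent of all observations (as on the slide)."""
    total = sum(n for row in table.values() for n in row.values())
    return {g: {k: 100 * n / total for k, n in row.items()} for g, row in table.items()}

def row_percentages(table):
    """Each cell as a percent of its own row (within-party percentages)."""
    return {
        g: {k: 100 * n / sum(row.values()) for k, n in row.items()}
        for g, row in table.items()
    }

total_pct = total_percentages(counts)
row_pct = row_percentages(counts)
print(total_pct["Democrat"]["Favors"], round(row_pct["Democrat"]["Favors"], 1))
```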
Examples of Relationships in Crosstabs
Our hypothesis is that Democrats are more supportive of a gas tax. Do our data support this?
       Dem   Rep
Yes    25%   25%
No     25%   25%
Support for the hypothesis: No
       Dem   Rep
Yes    40%   10%
No     10%   40%
Support for the hypothesis: Yes
       Dem   Rep
Yes    10%   40%
No     40%   10%
Support for the hypothesis: No
Exploring Data: Graphical Methods For interval level data scatterplots are a good way to examine relationships between variables:
Correlations
Correlations measure the relationship between two interval level variables. Correlations always fall between −1 and 1. Positive correlations indicate a positive relationship, negative correlations indicate a negative relationship. No relationship gives a 0 correlation, but 0 correlation does not necessarily mean no relationship. Correlations only capture linear relationships: y = a + b*x
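A minimal sketch of the (Pearson) correlation coefficient, written out by hand so the calculation is visible. The function name `pearson_r` and the small data sets are made up for illustration. The last case shows a perfect but non-linear relationship (y = x squared) that nonetheless has a correlation of 0.

```python
import math

def pearson_r(xs, ys):
    """Correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

xs = [-2, -1, 0, 1, 2]
rising = [1 + 2 * x for x in xs]    # y = a + b*x with b > 0
falling = [1 - 2 * x for x in xs]   # y = a + b*x with b < 0
curved = [x ** 2 for x in xs]       # related to x, but not linearly

print(pearson_r(xs, rising), pearson_r(xs, falling), pearson_r(xs, curved))
```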
Positive Correlations Stronger Weaker
Negative Correlations Stronger Weaker
Examples of Correlations