Dashboard Analysis for TRaC Studies Series One: Pre-Analysis Data Preparation



BUILDING RESEARCH CAPACITY

PSI's Core Values: Bottom Line Health Impact * Private Sector Speed and Efficiency * Decentralization, Innovation, and Entrepreneurship * Long-term Commitment to the People We Serve

Research Division
Population Services International
1120 Nineteenth Street NW, Suite 600
Washington, DC

PSI Research Division, 2006
Population Services International, 2006

Contact Information
Kathryn O'Connell, Hongmei Yang, Hibist Astatke, or Varja Lipovsek
Population Services International
kate.oconnell@psi.org.kh
hyang@psi.org
hastatke@psi.org

DASHBOARD ANALYSIS SERIES ONE: PRE-ANALYSIS DATA PREPARATION

LEARNING OBJECTIVES

By the end of this chapter, the reader will be able to:
1. Understand the importance of documenting the process of analysis in the SPSS syntax file.
2. Know how to do data cleaning.
3. Know how to do factor analysis and reliability analysis.
4. Understand the process of creating new variables such as a socioeconomic index, scale constructs, and a knowledge index.

BACKGROUND

The Population Services International (PSI) Dashboard is an evidence-based decision-making tool for social marketing (Patel & Chapman, 2005). Through a series of tables for segmentation, monitoring, and evaluation, the dashboard can answer questions that cannot be answered without evidence. It does so with timely, accurate, objective, and easy-to-read instruments and measures that complement a social marketer's experience or feeling (Balch & Sutton, 1997).

The dashboard tables are generated with data collected and analyzed from three different types of study designs known as TRaC (Tracking Results Continuously), MAP (Measuring Access and Performance), and FoQus (Framework for Qualitative Research in Social Marketing). TRaC is a quantitative tracking instrument that collects data from populations about their behaviors, risk/need, behavioral determinants, source of supply of products/service delivery, and exposure to social marketing activities. MAP is a quantitative mapping tool that objectively measures the system's coverage, quality of coverage, access, and equity of access of PSI products. FoQus is a framework for qualitative research for segmentation, scales, concept development, and concept testing.

To support effective analysis of data collected from TRaC, the Dashboard Analysis Series provides a step-by-step guide to conducting segmentation, monitoring, and evaluation analysis for one or more rounds of data collection. The series is organized as follows:

Series One: Pre-Analysis Data Preparation
Series Two: Monitoring Analysis
Series Three: Segmentation Analysis
Series Four: Evaluation Analysis

Within each series, the reader is provided with (a) an explanation of the data analysis, (b) SPSS syntax and instructions for conducting the analysis using the drop-down menus, (c) a sample of SPSS output and its explanation, and (d) one or two examples. Within each sub-section, definitions are provided for new terminology. Where further detail or rationale is required (e.g., the scale development section), the appropriate chapters are cross-referenced.

HOW-TO STEPS

Keep in mind that you should document all your data cleaning, recoding, and analysis in a syntax file. This will not only help you to replicate and re-run analyses, but it will also allow your regional researcher to check your analyses and understand what you have done.

It is suggested that you develop a syntax file from the moment you start working on your data. You do not need to save the original data set under a new file name, as long as you never save changes to the data set itself; save your changes in the syntax file instead. You can then simply rerun the syntax each time, which ensures that you always work from the original data set. When you send data to your regional researcher, simply attach the original data file and the syntax. If there are any problems with the raw data or the coding, he or she can identify whether they result from the way the data were entered.

Syntax files can be created in three ways. You can type in the commands by hand, which is time intensive and requires knowledge of the command names. You can paste directly from this document into a syntax file. Or you can set up your analysis through the drop-down menus and, before running it, click Paste; this automatically generates the syntax you need.
CREATING LABELED SYNTAX FILES

Make sure that the steps followed in the analysis are clearly described by titles, subtitles, and notes. For example, when removing variables from the analysis, record this information in the syntax with a title such as the following:

**DROP VARIABLE XXX DUE TO NON-SIGNIFICANCE**.
or
**DROP VARIABLE XXX TO INCREASE CRONBACH'S ALPHA**.

When recoding items or creating composite scores, record this information in the syntax with a title such as the following:

**RECODING ALL NEGATIVELY PHRASED ITEMS**.
or
**CREATING COMPOSITE VARIABLES**.

Likewise, record any other information you think is necessary. Guide the reader through the steps you have taken so that the analysis process can be followed with ease. This will also help you to remember what you have done and why, and it will help your regional researcher or other researchers to check your work.

Note that stars (i.e., ******) are used to write notes and subtitles into the syntax. To prevent SPSS from reading a title as part of a command, you must either put a period after the stars that close the title, subtitle, or note, or leave at least one blank line (i.e., press return twice) between the title and the syntax that follows. Below are some examples:

*****running frequencies for DV: condom use at last sex with sweetheart****.
FREQ q102 q123 q123.

or

*****running frequencies for DV: condom use at last sex with sweetheart****

FREQ q102 q123 q123.

DATA CLEANING AND SCREENING

Even if data come in clean, it is important to run the following analysis and include it in your syntax file. Run frequencies, means, standard deviations, and minimum and maximum values.

FREQUENCIES VARIABLES=q101 q102 q103 q104 q105
  /STATISTICS=STDDEV MINIMUM MAXIMUM MEAN
  /ORDER=ANALYSIS.

Drop Down Menus for Frequencies
Analyze > Descriptive statistics > Frequencies > Select variables to the variable box > Click statistics > Check boxes next to std. deviation, minimum, maximum, and mean > Click on continue > Click on paste > Switch to syntax file, highlight the syntax, and run it

This will help determine the following key diagnostics of your data set (explained in further detail below): if there are any out-of-range values,

if the means (i.e., the average value of each variable) and standard deviations (i.e., measures of how spread out the values of a continuous variable are around the mean) are reasonable and logical, and if the questionnaire codes match those in the data set (especially for dichotomous variables).

1. Are any values out of range?

It is important to check the range of values. If a Likert scale ranges from 1 to 6 but there are values between 0 and 11, what should you do? Ideally, you should go back to the original questionnaire and check that the scale really does range from 1 to 6. If only a few cases fall outside this range, go back to the original questionnaires and check the responses, since most likely these out-of-range values were entered incorrectly into the data set. If that is not possible, you need to recode the out-of-range values as missing. When recoding into the same variable, values that are not listed are left unchanged.

RECODE qxxx (0 11=SYSMIS).

Drop Down Menus for Recoding into Same Values
Transform > Recode > Into same variables > Select variable to recode > Click on old and new values > Indicate old and new values and click on add > Click on continue > Click on paste > Switch to syntax file, highlight the syntax, and run it

2. Are the means and standard deviations reasonable and logical?

Look for outliers in continuous variables, such as number of sexual partners, age, and income. A case with an extreme value on one variable is a univariate outlier. Operationally, univariate outliers are cases with very large standardized scores (z scores) on one or more variables. Cases with standardized scores in excess of 3.29 in absolute value (p < .001, two-tailed test) are potential outliers. Z scores are available through SPSS DESCRIPTIVES, which saves them in the data file.

DESCRIPTIVES VARIABLES=age income
  /SAVE
  /STATISTICS=MEAN STDDEV MIN MAX.
FREQ zage zincome.

(Note: zage and zincome are the variable names of the z scores for age and income, respectively.)
Drop Down Menus for Descriptives
Analyze > Descriptive statistics > Descriptives > Select variable to variable box > Check the box next to save standardized values as variables > Click on options > Check boxes next to std. deviation, minimum, maximum, and mean > Click on continue > Click on paste > Switch to syntax file, highlight the syntax, and run it
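The z-score screening rule above can also be checked by hand. The following is a minimal Python sketch, not SPSS syntax, that standardizes a variable and flags cases beyond the 3.29 cutoff. The function name and the age values are hypothetical, and the sketch assumes the sample standard deviation (n - 1 denominator) is the appropriate one.

```python
from statistics import mean, stdev


def flag_univariate_outliers(values, cutoff=3.29):
    """Return (index, value, z) for cases whose |z| exceeds the cutoff.

    Missing values are represented as None and are skipped.
    """
    observed = [v for v in values if v is not None]
    m = mean(observed)
    s = stdev(observed)  # sample standard deviation (n - 1 denominator)
    flagged = []
    for i, v in enumerate(values):
        if v is None:
            continue
        z = (v - m) / s
        if abs(z) > cutoff:
            flagged.append((i, v, round(z, 2)))
    return flagged


# Hypothetical ages: 19 plausible values plus one likely data-entry error (250).
ages = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
        28, 29, 30, 31, 32, 33, 34, 35, 36, 250]
outliers = flag_univariate_outliers(ages)
print(outliers)
```

A flagged case is only a candidate: as the text explains next, check missing-value codes and data entry before deciding what to do with it.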

There are four reasons for the presence of an outlier:
i. incorrect data entry;
ii. the outlier is not a member of the population from which you intended to sample;
iii. the case is from the intended population, but the distribution for the variable in the population has more extreme values than a normal distribution; and
iv. failure to specify missing value codes in the computer syntax, so that missing-value indicators are read as real data. This is explained further later in this chapter.

Once outliers have been identified, check carefully that missing value codes are specified as missing (not numeric). If that is not the problem, check whether the data were entered correctly. If the data were entered accurately, it may be advisable to exclude the variable responsible for most of the outliers, provided that variable is not critical to the analysis. Deleting the outlying cases is an alternative, but the generalizability of your results may be affected if the outliers belong to the target population.

3. Do the questionnaire codes match those in the data set?

Ensure that dummy variables are coded 0 and 1. All respondents with the behavior or attribute, who say yes to yes/no questions, are coded 1. We call these "cases." All respondents without the behavior or attribute, who say no to yes/no questions, are coded 0. We call these "reference groups." It is important to follow this coding pattern because, for dependent variables in logistic regression, SPSS assumes that cases are 1 and the reference group is 0.

RECODE qxxx (1=1) (2=0) INTO qxxxn.
Drop Down Menu for Recoding
Transform > Recode > Into new variables > Select variable to recode > Name and label output variable > Click change > Click on old and new values > Indicate old and new values and click on add > Click on continue > Click on paste > Switch to syntax file, highlight the syntax, and run it

For variables that are dichotomous, such as gender (i.e., male or female), geographical area (e.g., East vs. West), or marital status (e.g., married, single), we recommend that you recode them into 1 and 0.

MISSING DATA

Missing data is one of the most pervasive problems in data analysis. Its seriousness depends on the pattern of missing data, how much is missing, and why it is missing. Missing data can be problematic because respondents with missing data cannot be included in your analysis, which reduces your Ns in subsequent analyses. If a question has a large amount of missing data and you include it in a logistic regression, it will reduce your sample size and influence the power and significance testing results. Running a FREQUENCIES analysis can help you identify variables with excessive missing values.

1. What do you do if there are missing values?

For numeric variables (i.e., variables that describe data as numbers, such as categorical, dichotomous, or interval variables), missing values are labeled "Missing System" in the FREQUENCIES output. For string variables (i.e., variables that use letters and are often used with open-ended questions), missing values appear as blanks in the output.

Once you identify missing values in certain variables, you need to identify what type of missing data they are. Typically, there are two types of missing data. Data may be missing because there was no response from the interviewee, or because of a skip pattern in the questionnaire. Missing data due to no response is usually treated as missing and is not included in the subsequent analysis. Missing data due to a skip pattern, however, may be treated differently, as the following example shows.

Example 1

Q201: Have you ever had sexual intercourse?
  Yes, has had sex: 1
  Never had sex: 0 (skip to q301)

Q224: Including all casual, regular, and marital partners, how many people overall have you had sex with in the last 12 months?
  Number:
  Can't estimate: 88
  No response/refuse: 99

In the above example, respondents were first asked "Have you ever had sexual intercourse?" (Q201). Those who answered "never had sex" skipped question Q224. In the data set, their values for Q224 are set as missing because they were not asked this question, since they were not sexually active.
However, if you want to know the percentage of respondents who have multiple sexual partners, you need to change the missing values for Q224 to 0 for these respondents (since they have 0 sexual partners). To do this, use the command:

IF (q201=0) q224=0.

2. What if missing data is coded as 99 or another number?

Sometimes researchers code missing data, no response, or can't estimate with a number such as 99, as in the above example. If a value should be treated as missing but is stored as a number, such as 99 for "no response/refuse" in the above example, it should be defined as missing so that those respondents are not included in the analysis. This is done with the following command:

MISSING VALUES q224 (99).

Note: the code 88 for "can't estimate" in the above example, however, may be interpreted as having too many partners to remember. In this case, when calculating the percentage with multiple partners, these respondents should be counted among those who have multiple partners.

3. What if there is excessive missing data?

Excessive missing data can be problematic. If less than 5% of the data are missing in a random pattern from a large data set, the problem is not so serious. There is as yet no firm guideline for how much missing data can be tolerated for a sample of a given size. In your analysis, keep a record of variables with more than 15% of cases missing. If these variables are not critical to the analysis or are highly correlated with other complete variables, it is preferable not to include them in the analysis. Keep note of them as well, since the missing data may be due to the complexity of the question, its translation, or its sensitivity. You may want to consider rewording the item or excluding it from subsequent rounds of analysis.

DATA MINING

After data cleaning, you can move ahead to data mining: creating appropriate analysis variables or recoding variable values to the desired levels.

1. Renaming new or recoded variables

When you create new variables as a result of cleaning and preparing the data for analysis, it is recommended that you name them in the following ways so that it is easy for regional researchers to check your data.

All continuous variables that are recoded into dichotomous dummy variables should be renamed with a D (e.g., q118d).
All reverse-scored variables should be renamed with an R (e.g., q117r).
All new variables should be renamed with an N (e.g., q116n).
Scale names should contain the bubble term (e.g., socnorm for social norms).

Based on the frequency results, recode variables to create appropriate analysis variables. In doing this, you are essentially creating a new variable.

2. Recoding data into dummy variables

Sometimes you may want to recode existing variables into new ones. For example, you may have a variable with several values, and you want to create a dummy variable of yes and no. The following syntax can be used; the same approach can also be used to recode negatively phrased items or to create new variables.

RECODE var (1=1) (2 thru 4=0) (ELSE=SYSMIS) INTO vard.

Drop Down Menu for Recoding
Transform > Recode > Into different variables > Select variable to be recoded > Name and label output variable > Click change > Click on old and new values > Indicate old and new values and click on add > Click on continue > Click on paste > Switch to syntax file, highlight the syntax, and run it

Example of recoding data

In Example 1, imagine you want to create an indicator, "percentage of respondents not being faithful," defined as the percentage of respondents aged 15 to 49 who reported having two or more sexual partners during the past year, regardless of their sexual experience. In this case, you need to convert Q224 (i.e., number of sexual partners) into a two-level categorical variable:

1 = having multiple sexual partners (i.e., >=2 sexual partners)
0 = not having multiple sexual partners

There are several ways to do this:

IF (q201=0) q224=0.
RECODE q224 (0 1=0) (2 THRU 36=1) (88=1) (99=SYSMIS) INTO q224d.

Remember that those who have never had sex actually have 0 sexual partners (hence IF (q201=0) q224=0.). Here 99, missing, becomes system missing, and 88, can't estimate, becomes 1, since it implies that the respondent has had so many partners that he or she cannot remember how many.

Alternatively, you can use the following syntax:

IF (q201=0) q224=0.
IF (q224=0 OR q224=1) q224d=0.
IF (q224 >= 2 AND q224 <= 88) q224d=1.
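The decision logic behind the q224d recode can be written out step by step. The following is a minimal Python sketch, not SPSS syntax, assuming the codes from Example 1 (0 on Q201 for never had sex, 88 for can't estimate, 99 for no response). The function name and the six respondents are hypothetical, and unlike the RECODE above, the sketch applies no upper bound of 36 partners.

```python
def multiple_partners_flag(q201, q224):
    """Return 1 (two or more partners), 0 (none or one), or None (missing)."""
    if q201 == 0:
        # Never had sex: Q224 was skipped, so the respondent has 0 partners.
        q224 = 0
    if q224 is None or q224 == 99:
        # No response / refused: excluded from the indicator.
        return None
    if q224 == 88:
        # "Can't estimate" is interpreted as too many partners to remember.
        return 1
    return 1 if q224 >= 2 else 0


# Hypothetical respondents as (q201, q224) pairs.
respondents = [(0, None), (1, 1), (1, 3), (1, 88), (1, 99), (1, 0)]
flags = [multiple_partners_flag(q201, q224) for q201, q224 in respondents]

# Percentage with multiple partners among respondents with a valid value.
valid = [f for f in flags if f is not None]
percent_multiple = 100 * sum(valid) / len(valid)
print(flags, percent_multiple)
```

Note that the never-had-sex respondent is counted in the denominator with a value of 0, which is what "regardless of their sexual experience" requires.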

3. Developing reverse-coded scale items

Negatively worded items need to be recoded so that a higher value indicates a more positive attitude. For example, a negatively phrased item such as "Condoms are inappropriate to use with a spouse" (1 = strongly disagree, 6 = strongly agree) should be recoded. To do this, use the following syntax:

RECODE qxxx (6=1) (5=2) (4=3) (3=4) (2=5) (1=6) INTO qxxxr.

Drop Down Menu for Recoding
Transform > Recode > Into different variables > Select variable to be recoded > Name and label output variable > Click change > Click on old and new values > Indicate old and new values and click on add > Click on continue > Click on paste > Switch to syntax file, highlight the syntax, and run it

4. Creating new variables

Creating new variables may be necessary. For example, you may want to create a composite score for a reliable scale. To do this, create a new variable by adding up the items in the construct and dividing by the total number of items in the scaled construct.

COMPUTE socnorm = (qxxx+qxxx+qxxx+qxxx)/4.

Drop Down Menu for Creating Composite Scores
Transform > Compute > Create a name for the new variable > Select the variables from which the new variable will be created and write the numeric expression necessary to create it

CREATING AN SES INDEX

For more detailed information on the rationale behind creating an SES index, please refer to the toolkit chapter on PCA.

1. Running frequencies

Run frequencies to check for missing data and out-of-range values.
Then identify possessions and/or amenities owned by less than 20% or more than 80% of respondents.
Recode: dummy code the variables for each indicator.
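The screening rule in step 1 (set aside possessions owned by less than 20% or more than 80% of respondents) can be sketched as follows. This is a minimal Python illustration, not SPSS syntax; the function name, indicator names, and data are hypothetical, and treating exactly 20% or 80% as acceptable is an assumption, since the text does not say how to handle the boundaries.

```python
def select_ses_indicators(data, low=0.20, high=0.80):
    """Return the indicators whose ownership share lies between low and high.

    data maps an indicator name to a list of dummy codes
    (1 = owns, 0 = does not own, None = missing).
    """
    keep = []
    for name, values in data.items():
        observed = [v for v in values if v is not None]
        if not observed:
            continue  # nothing to evaluate for this indicator
        share = sum(observed) / len(observed)
        # Assumption: shares of exactly 20% or 80% are kept.
        if low <= share <= high:
            keep.append(name)
    return keep


# Hypothetical dummy-coded possessions for ten households.
households = {
    "radio": [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],   # owned by 50%
    "car": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],     # owned by 10%
    "mobile": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # owned by 90%
}
selected = select_ses_indicators(households)
print(selected)
```

Items that nearly everyone or almost no one owns do little to distinguish households from one another, which is why only the middle range is carried forward into the PCA.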

2. Conducting PCA analysis

Choose variables (items owned by more than 20% and less than 80% of respondents)
Request a correlation matrix
Choose an extraction method (PCA)
Choose a rotation method (VARIMAX)
Save factor scores (SES index)

FACTOR
  /VARIABLES q201a q201d q201e q201f q203a pittoil q202b q202c rivwat
  /MISSING PAIRWISE
  /ANALYSIS q201a q201d q201e q201f q203a pittoil q202b q202c rivwat
  /PRINT INITIAL CORRELATION EXTRACTION ROTATION
  /FORMAT SORT
  /PLOT EIGEN
  /CRITERIA MINEIGEN(1) ITERATE(25)
  /EXTRACTION PC
  /CRITERIA ITERATE(25)
  /ROTATION VARIMAX
  /SAVE REG(ALL)
  /METHOD=CORRELATION.

Drop Down Menu for Conducting PCA Analysis
Analyze > Data reduction > Factor > Select items in scale and move them over to the box on the right side of the screen > Click on rotation > Select varimax and ensure maximum iterations is 25 and rotated solution is checked > Click continue > Click on extraction and, under method, select principal components; under display, select scree plot > Click on options and ensure that exclude cases listwise is selected, suppress absolute values is checked, and that the values are less than .25 > Click continue > Click paste > Highlight syntax and run it

3. Grouping respondents into five SES groups

FREQUENCIES VARIABLES=fac1_1
  /NTILES=5
  /ORDER=ANALYSIS
  /FORMAT NOTABLE.

COMPUTE fac1new = RND(fac1_1* )/
RECODE fac1new (Lowest thru =1) ( thru =2) ( thru =3) ( thru =4) ( thru highest=5) (SYSMIS=SYSMIS) INTO SES

The blanks in the COMPUTE and RECODE commands stand for numeric values: the rounding factor and the four cut points between SES groups, which correspond to the percentile values reported by the NTILES subcommand.

CONSTRUCTING SCALES

1. Dealing with missing data

Use the FREQUENCIES command in SPSS to see if values are within range, if means and standard deviations are plausible, and if there are missing data for each item. If there are missing data for some items, use a t-test to check whether data are missing randomly: test whether other variables (e.g., the DV and demographic characteristics) differ significantly between cases with and without missing data on the variable of interest.

Options for dealing with missing data:

If missing data is concentrated in items that are not important for the analysis or that are highly correlated with other items, these items can be deleted from the analysis.
If the pattern of missing data is random and few cases have missing data, these cases can be dropped.
If less than 1% of data are missing, use mean substitution to estimate missing data.
If the pattern of missing data is random and there is more than 5% missing data, mean substitution can be used.
If 5% to 10% of the data are missing in a random pattern, use mean substitution or regression to estimate missing data.
If data are missing in a non-random pattern, substitute the group mean or use regression to estimate missing data.

Once your data have been properly cleaned, screened, and recoded, the next step is to determine whether or not you need to construct any scales. This section walks through the specific processes for determining whether or not you have a scaled construct.
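Returning to the SES grouping step in the CREATING AN SES INDEX section, the idea of cutting saved factor scores into five equal-sized groups can be sketched in Python (not SPSS syntax). The function name and the factor scores are hypothetical, and the cut points come from Python's `statistics.quantiles`, whose interpolation rule may give slightly different values from the SPSS NTILES output.

```python
from bisect import bisect_left
from statistics import quantiles


def ses_groups(scores):
    """Assign each factor score to an SES group from 1 (lowest) to 5 (highest)."""
    # Four cut points: the 20th, 40th, 60th, and 80th percentiles.
    cuts = quantiles(scores, n=5)
    # A score equal to a cut point stays in the lower group, which matches
    # a RECODE of the form (Lowest thru c1=1) (c1 thru c2=2) and so on,
    # where the first matching range wins.
    groups = [bisect_left(cuts, s) + 1 for s in scores]
    return groups, cuts


# Hypothetical first-component factor scores for ten households.
factor_scores = [-1.8, -1.2, -0.9, -0.4, -0.1, 0.2, 0.5, 0.9, 1.3, 1.7]
groups, cuts = ses_groups(factor_scores)
print(cuts)
print(groups)
```

Group 1 then corresponds to the lowest SES quintile and group 5 to the highest.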

2. Factor analysis to determine subscales

To explore the dimensions in a scaled construct, you typically use exploratory factor analysis (EFA). The purpose of EFA in PSI Dashboard analysis is to identify the subscales, or number of dimensions, in a scale. EFA will tell you whether you have developed a scale with two or more subscales rather than one general scale. EFA could reveal a one-factor solution, suggesting that your items all tap into the same concept and that you have a unidimensional scale. If you find more than one factor, this suggests that your scale captures two or more related but different aspects of a larger construct, or several different constructs, and that you have a multidimensional scale. Please see the toolkit chapter on scales for more detailed information.

i. Sample size

The reliability of factors emerging from EFA depends on the size of the sample, although there is no consensus on what that size should be. There is agreement, however, that there should be more participants than variables. Gorsuch (1983), for example, has proposed an absolute minimum of five participants per variable and not less than 100 individuals per analysis. Although factor analysis can be carried out on smaller samples to describe the relationships between the variables, not much confidence can be placed in the same factors emerging in a second sample.

ii. Principal Components Analysis (PCA) and Principal Axis Factoring (PAF)

The two most widely used forms of factor analysis are principal components analysis and principal axis factoring; these are two types of extraction methods. The usual convention is to refer to them collectively as factor analysis.
The difference between PCA and PAF lies essentially in how they handle unique variance. PCA analyzes all the variance of a variable, while PAF analyzes only the variance it shares with other variables. PAF is the extraction method most appropriate for PSI scales. PCA is the most appropriate extraction method for creating indices. You will be provided with two examples of when to use PAF or PCA.

iii. Understanding output

Step 1
The SPSS output will show the initial factors produced by PCA or PAF and the amount of variance each factor accounts for (i.e., its eigenvalue). The first component or axis extracted accounts for the largest amount of variance shared by the items. The second factor accounts for the next largest amount of variance that is not related to or explained by the first one. The third factor accounts for the next largest amount of variance, and so on. In other words, the first few factors are the most important ones.

Step 2
Since one of the objectives of factor analysis is to reduce the number of variables you have to handle, this would not be achieved if you kept all of the factors. Consequently, the next step is to decide how many factors to keep. There are two main criteria for deciding which factors to exclude. The first is known as Kaiser's criterion, which selects those factors that have an eigenvalue greater than one. SPSS does this by default unless it receives instructions to do otherwise. The second is the scree test proposed by Cattell (1966). In this method, a graph is drawn of the descending variance accounted for by the factors initially extracted. Scree is a geological term for the debris found at the bottom of a rocky slope, and it implies that these factors are not very important. The factors to be retained are those that lie before the point at which the eigenvalues seem to level off.

Step 3
Last, rotation methods (Varimax) are used to transform the data and make it easier to examine the factor loading of individual items on each factor.
a. Factor loading indicates the strength of the association between each item and a factor. The relationship between each item and a factor is expressed as a correlation, or loading.
b. Rotation groups items onto the factors with which they have a strong association and sharpens the pattern by making small loadings smaller and large loadings larger, without changing the results.

A factor loading greater than .30 is typically used as the cut-off point for identifying items strongly associated with a particular factor.

In a nutshell, refer to the Rotated Component Matrix, look at the factor loadings, and identify potential subscales. Also refer to the Total Variance Explained table and determine how much variance is explained by each of the subscales. Remove items that do not load with other items on a factor. Re-run the EFA until a set of subscales is identified.

FACTOR
  /VARIABLES q504r q505r q506r q507r q510 q512r q515 q516
  /MISSING LISTWISE
  /ANALYSIS q504r q505r q506r q507r q510 q512r q515 q516
  /PRINT INITIAL EXTRACTION ROTATION
  /PLOT EIGEN
  /FORMAT BLANK(.25)
  /CRITERIA MINEIGEN(1) ITERATE(25)
  /EXTRACTION PAF
  /CRITERIA ITERATE(25)
  /ROTATION VARIMAX
  /METHOD=CORRELATION.
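Kaiser's criterion and the percentage-of-variance columns described in Steps 1 and 2 involve only simple arithmetic once the eigenvalues are known. The following is a minimal Python sketch, not SPSS syntax; the function name and the eigenvalues are hypothetical, and it assumes initial eigenvalues from a correlation matrix, where the total variance equals the number of items.

```python
def kaiser_summary(eigenvalues):
    """Return one row per factor:
    (factor number, eigenvalue, % of variance, cumulative %, retained).
    """
    # Assumption: for a correlation matrix, each standardized item
    # contributes 1 unit of variance, so the total equals the item count.
    total_variance = len(eigenvalues)
    rows = []
    cumulative = 0.0
    for number, eigenvalue in enumerate(eigenvalues, start=1):
        percent = 100 * eigenvalue / total_variance
        cumulative += percent
        retained = eigenvalue > 1  # Kaiser's criterion
        rows.append((number, eigenvalue, round(percent, 2),
                     round(cumulative, 2), retained))
    return rows


# Hypothetical initial eigenvalues for an 8-item scale (they sum to 8).
eigenvalues = [3.2, 1.6, 1.1, 0.7, 0.5, 0.4, 0.3, 0.2]
summary = kaiser_summary(eigenvalues)
n_retained = sum(1 for row in summary if row[4])
for row in summary:
    print(row)
```

The scree test uses the same eigenvalues but relies on a visual judgment of where the plotted values level off, so the two criteria do not always suggest the same number of factors.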
KEY OUTPUT
Total Variance Explained
Scree Plot
Rotated Component Matrix

Drop Down Menu for EFA
Analyze > Data reduction > Factor > Select items in scale and move them over to the box on the right side of the screen > Click on rotation > Select varimax and ensure maximum iterations is 25 and rotated solution is checked > Click continue > Click on extraction and, under method, select principal axis factoring; under display, select scree plot > Click on options and ensure that exclude cases listwise is selected, suppress absolute values is checked, and that the values are less than .25 > Click continue > Click paste > Highlight syntax and run it

iv. Example: developing subscales of a construct using PAF

In the social norms example from Cambodia, the scale contains 8 items that ask about different social norms related to condom use with sweethearts (S/H) and spouses. It is possible that, while the scale as a whole is about social norms related to condom use with a partner, there are subscales that address social norms related to sweethearts and subscales related to spouses.

The social norms scale contains the following items:

Social Norm
Condom use is normal in S/H relationships these days
Condom use is normal in spousal relationships these days
It's acceptable for a woman to propose condom use to her spouse
It's acceptable for a woman to propose condom use to her S/H
It's acceptable for a man to propose condom use to his spouse
It's acceptable for a man to propose condom use to his S/H
It would be strange to use condoms with your S/H nowadays
It would be strange to use condoms with your spouse nowadays

Conceptually, you can see that some items ask about relationships with spouses, while others ask about sweetheart relationships. Using EFA will help you "see" which items load onto different factors (i.e., which ones hang together statistically). This is important for determining whether or not there are appropriate subscales within the social norms scaled construct.

FACTOR
  /VARIABLES q143 q146 q148 q149r q144 q145 q147 q150r
  /MISSING LISTWISE
  /ANALYSIS q143 q146 q148 q149r q144 q145 q147 q150r
  /PRINT INITIAL EXTRACTION ROTATION
  /FORMAT BLANK(.25)
  /CRITERIA MINEIGEN(1) ITERATE(25)
  /EXTRACTION PAF
  /CRITERIA ITERATE(25)
  /ROTATION VARIMAX
  /METHOD=CORRELATION.
Q143 Condom use is normal in S/H relationships these days
Q146 It's acceptable for a woman to propose condom use to her S/H
Q148 It's acceptable for a man to propose condom use to his S/H
Q149R It would be strange to use condoms with your sweetheart nowadays
Q144 Condom use is normal in spousal relationships these days
Q145 It's acceptable for a woman to propose condom use to her spouse
Q147 It's acceptable for a man to propose condom use to his spouse
Q150R It would be strange to use condoms with your spouse nowadays

In the first instance, you test all of the items and conduct EFA. Use varimax rotation in order to simplify the results. Note that SPSS has an option to suppress loadings below a certain value. Using this suppression option will not change your EFA; it only limits the output to figures higher than the value you specify. In this case, choose to display values that are greater than .25. Doing so will help you focus on the main findings.

The key SPSS outputs are:
Rotated component matrix
Total variance explained

The Rotated Component Matrix is the simplest output to understand and the most important for you to use; use it to interpret your data. The table below helps you interpret subscales by showing how items load onto each component, which helps you understand what the items have in common.

TABLE ONE: ROTATED COMPONENT MATRIX(a)

Item: Component loading
Q143 Condom use is normal in S/H relationships these days: .728
Q146 It's acceptable for a woman to propose condom use to her S/H: .783
Q148 It's acceptable for a man to propose condom use to his S/H: .705
Q149R It would be strange to use condoms with your sweetheart nowadays:
Q144 Condom use is normal in spousal relationships these days:
Q145 It's acceptable for a woman to propose condom use to her spouse:
Q147 It's acceptable for a man to propose condom use to his spouse:
Q150R It would be strange to use condoms with your spouse nowadays:

Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.

Here you can see that there are three factors (in the columns). The first and second factors have four items each that load together. The third factor has three items that load together: q149r, q150r, and q144. Q149r also loads onto the second factor, but has a smaller association with that factor (.236). Q150r also loads onto the first factor, but has a smaller association with that factor (.424). Q144 also loads onto the first factor and has a higher association with that factor (.699). Q144 should therefore be kept with the first factor since it clearly contributes to its variance.

At this point, you should also look at the Total Variance Explained table. You can see that the first factor explains the largest proportion of the variance (24.58%) while the third factor explains the least (16.25%). There is no decision rule for the percentage of variance that must be explained.

TABLE TWO: TOTAL VARIANCE EXPLAINED
Columns: Component; Initial Eigenvalues (Total, % of Variance, Cumulative %); Extraction Sums of Squared Loadings (Total, % of Variance, Cumulative %); Rotation Sums of Squared Loadings (Total, % of Variance, Cumulative %)
Extraction Method: Principal Component Analysis.

The next step is to look at what the social norms items under factor #3 are asking. Q149r states that it would be strange to use condoms with your sweetheart nowadays, and q150r states that it would be strange to use condoms with your spouse nowadays. These items may be loading onto the same factor because the statements use the same phrasing. Since q149r appears to stand alone under factor #3 (.832), and q150r also loads onto factor #1, you decide to remove q149r and rerun the model. Q149r seems to be adding noise to the scaled construct.

Typically, decisions to remove or keep an item should be based on a number of considerations, such as its loadings on the different factors (rotated component matrix), its conceptual meaning (your judgment), and the amount of variance explained (total variance explained table).

Rerun the EFA syntax, but remove q149r:

    * Rerun the EFA without q149r using principal components extraction and varimax rotation.
    FACTOR
      /VARIABLES q143 q146 q148 q144 q145 q147 q150r
      /MISSING LISTWISE
      /ANALYSIS q143 q146 q148 q144 q145 q147 q150r
      /PRINT INITIAL EXTRACTION ROTATION
      /FORMAT BLANK(.20)
      /CRITERIA MINEIGEN(1) ITERATE(25)
      /EXTRACTION PC
      /CRITERIA ITERATE(25)
      /ROTATION VARIMAX
      /METHOD=CORRELATION.

Now the output looks like that in Table 3.

TABLE THREE: ROTATED COMPONENT MATRIX(a)
Component: 1, 2
Q143 Condom use is normal in S/H relationships these days .746
Q146 It's acceptable for a man to propose condom use to her S/H .792
Q148 It's acceptable for a man to propose condom use to his S/H .703
Q144 Condom use is normal in spousal relationships these days .767
Q145 It's acceptable for a woman to propose condom use to her spouse .746
Q147 It's acceptable for a man to propose condom use to his spouse .725
Q150R It would be strange to use condoms with your spouse nowadays .649
Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 3 iterations.

You can see that there are two factors whose items clearly load together. Conceptually, the items in each factor also hang together and make sense. Q143, Q146, and Q148 ask about condom use with sweethearts, while the remaining items ask about condom use with spouses. Therefore you can create two scales about social norms, one for sweethearts and one for spouses. The next step is to run reliability tests on these two scales (as explained later in the reliability analysis subsection).
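As a cross-check outside SPSS, the extraction, rotation, and item-assignment steps above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the SPSS procedure: the data are simulated, the function names `pca_loadings` and `varimax` and the .40 cross-loading cutoff are hypothetical, and the rotation omits the Kaiser normalization that SPSS reports, so the loadings would not match SPSS output exactly.

```python
import numpy as np

def pca_loadings(data, min_eigen=1.0):
    """Unrotated principal component loadings for components with eigenvalue > min_eigen."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > min_eigen
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation (without Kaiser normalization)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        if s.sum() < criterion * (1 + tol):      # stop when the criterion no longer improves
            break
        criterion = s.sum()
    return loadings @ rotation, rotation

# Simulated data (an assumption): 3 items driven by one latent factor, 4 by another.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 2))
pattern = np.array([[0.8, 0.0]] * 3 + [[0.0, 0.8]] * 4)
data = latent @ pattern.T + 0.6 * rng.normal(size=(2000, 7))

unrotated = pca_loadings(data)
rotated, rotation = varimax(unrotated)

# Component on which each item has its largest absolute loading.
primary = np.abs(rotated).argmax(axis=1)
# Items with two or more loadings at or above the (hypothetical) .40 cutoff.
cross = (np.abs(rotated) >= 0.40).sum(axis=1) > 1
```

Items flagged in `cross` are the ones that call for the kind of judgment described above, where the size of each loading is weighed against the conceptual meaning of the item.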

3. Reliability

Assessing internal reliability is important for scales. It addresses the question of whether a scale measures a single idea; that is, whether the items that make up the scale are internally consistent. Use Cronbach's alpha to calculate reliability.

i. Criteria for calculating internal consistency reliability

The following criteria are used to determine whether or not a scale is reliable:
< 0.60 = unacceptable
[...] = undesirable
[...] = minimally acceptable
[...] = acceptable (was respectable)
[...] = very good

For the PSI data analysis framework, you will use the criterion of equal to [...].

Reliability analyses will also produce inter-item correlations: the strength of the relationship between the item of interest and all the other items in the hypothesized construct. Consider removing an item from a scale if it has particularly poor inter-item correlations and deleting it will improve the alpha. The rule of thumb is that inter-item correlations should be around [...].

Reliability analyses will also provide you with the "alpha if item deleted". This tells you how much the alpha value will increase or decrease if an item is excluded.

Things to remember before you start:
Make sure your items are coded in the same direction.
Make sure you have a minimum of three items.
Note that dichotomous scaled items (agree/disagree) are analyzed using the same reliability syntax. Keep in mind that more scaled items are usually needed for dichotomous variables since there is less variability than in Likert scale responses (range = 0 to 1, rather than 1 to 4 or 1 to 6).

If you decide to remove items, do so one at a time, starting with the item whose removal will increase the alpha value the most. Rerun the internal consistency analysis after dropping each item.

ii. Should you always drop items to improve reliability?

Items whose removal would increase the alpha score are usually dropped; however, this is not always the case. For example, suppose a scale of four items has a coefficient alpha of .80. The items are:
Q12 It is acceptable to use condoms with my sweetheart
Q13 It is important to use condoms with my sweetheart
Q14 Using condoms with a sweetheart is responsible
Q15 Condoms are appropriate to use in loving relationships

You check the "alpha if item deleted" values and observe that dropping item Q15 would increase the alpha coefficient to .90. Should you drop the fourth item to create a scale of three items? In such a case, it may be advisable to keep all four items for the following reasons:
.80 is still a very good alpha value;
more items will help to ensure the reliability of the scale on a different sample;
removing the fourth item may create a scale that is conceptually redundant (i.e., all the items are asking the same thing); in such a case, it may be more beneficial to have greater conceptual diversity with four items and a slightly lower alpha value.

iii. Example of reliability analysis

Continue with the example used in the factor analysis. You need to test the reliability of each of the two factors (i.e., one for Q143, Q146, and Q148; and one for Q144, Q145, Q147, and Q150R). Below are the syntax and outputs for the internal consistency reliability analysis conducted using SPSS.

    RELIABILITY
      /VARIABLES=q143 q146 q148
      /FORMAT=LABELS
      /SCALE(ALPHA)=ALL
      /MODEL=ALPHA
      /STATISTICS=DESCRIPTIVE SCALE
      /SUMMARY=TOTAL.

Q143 Condom use is normal in S/H relationships these days
Q146 It's acceptable for a man to propose condom use to her S/H
Q148 It's acceptable for a man to propose condom use to his S/H

The last column of the item-total statistics table indicates the effect of dropping each item on the reliability of the entire scale and is used to decide which items to drop. Also look at the inter-item correlations and observe the values.

Observing the output below, alpha is acceptable at 0.78 for the three items related to perceptions about condom use with a sweetheart. In this case, you do not consider removing Q148, since removing this item would reduce the scale to two items, which may make the scale unsuitable for other populations.

TABLE FOUR: ITEM-TOTAL STATISTICS
Columns: Scale Mean if Item Deleted; Scale Variance if Item Deleted; Corrected Item-Total Correlation; Cronbach's Alpha if Item Deleted
Rows: Q143, Q146, Q148
Reliability Coefficients: 3 items; Alpha = .7950; Standardized item alpha = .7849

Following is the reliability analysis for the other four items, which relate to perceptions about condom use with a spouse. The internal consistency coefficient is 0.72, which is acceptable.

    RELIABILITY
      /VARIABLES=q144 q145 q147 q150r
      /FORMAT=LABELS
      /SCALE(ALPHA)=ALL
      /MODEL=ALPHA
      /STATISTICS=DESCRIPTIVE SCALE
      /SUMMARY=TOTAL.

Q144 Condom use is normal in spousal relationships these days
Q145 It's acceptable for a woman to propose condom use to her spouse
Q147 It's acceptable for a man to propose condom use to his spouse
Q150R It would be strange to use condoms with your spouse nowadays

TABLE FIVE: ITEM-TOTAL STATISTICS
Columns: Scale Mean if Item Deleted; Scale Variance if Item Deleted; Corrected Item-Total Correlation; Cronbach's Alpha if Item Deleted
Rows: Q144, Q145, Q147, Q150R
Reliability Coefficients: 4 items; Alpha = .7350; Standardized item alpha = [...]
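To make the "alpha" and "alpha if item deleted" figures less of a black box, the calculation can be sketched in Python. Cronbach's alpha for k items is k/(k-1) multiplied by (1 minus the sum of the item variances divided by the variance of the total score). This is a minimal sketch, not the SPSS RELIABILITY procedure: the function names are hypothetical, the response matrix is made up for illustration, and the code assumes complete cases (no missing values).

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items array of complete cases."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

def alpha_if_item_deleted(items):
    """Alpha recomputed with each item left out in turn, in column order."""
    items = np.asarray(items, dtype=float)
    return [cronbach_alpha(np.delete(items, j, axis=1)) for j in range(items.shape[1])]

# Made-up responses: four respondents (rows) by four items (columns).
# The first three items agree perfectly; the fourth is less consistent.
scores = np.array([
    [1, 1, 1, 2],
    [2, 2, 2, 1],
    [3, 3, 3, 4],
    [4, 4, 4, 3],
])

alpha = cronbach_alpha(scores)
deleted = alpha_if_item_deleted(scores)
# Candidate to drop first: the item whose removal gives the highest alpha.
candidate = int(np.argmax(deleted))
```

In this made-up example the fourth item is the candidate, because removing it raises alpha. Whether to actually drop it remains a judgment call, as discussed in subsection ii.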

4. Creating composite scores

Create composite scores for each scaled construct using the final items from the reliability analysis in Step 3. Create each composite score by adding up the items in the construct and dividing by the total number of items in the scaled construct.

    compute threatr=(qxxx+qxxx+qxxx+qxxx)/4.

Drop-down window for creating composite scores: Transform > Compute
Create a name for the new variable to be created.
Select the variables from which the new variable will be created and write the numeric expression necessary to create the new variable.

CREATING KNOWLEDGE INDEX

For knowledge items that are scored as true/false, different steps need to be followed. True/false items should not be analyzed as a scale. The first step is to check whether any of the variables must be reverse coded. The next step is to run a correlation matrix among the items. If some items are negatively correlated with other items, you may consider dropping them (assuming that doing so will not compromise validity). You may also consider dropping items that address the same knowledge concept. Make a composite variable by adding similar knowledge items together. Keep in mind that if you combine knowledge items into an index, they should measure the same idea (e.g., the proper way to clean needles). Do not combine all knowledge items into one general index.

    Compute NEWINDEX = q122 + q123 + [...] + q126.
    Freq NEWINDEX.

This composite knowledge variable is expressed as the number of correct answers and can then be used in logistic regression models as an independent variable. Once the composite knowledge variable is created, you can further create a new dichotomous variable, if necessary, to measure whether respondents have a given level of knowledge (high vs. low).

CASE EXAMPLES AND LESSONS LEARNED

This section is not applicable to the pre-analysis data preparation toolkit chapter.
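Returning to Step 4 and the knowledge index: the arithmetic behind a composite score, a count of correct answers, and a high/low split can be sketched in Python. This is an illustrative sketch only. The function names, the knowledge item names `k1` to `k3`, the response values, and the cutoff of 2 are all hypothetical. Returning `None` when any item is missing mirrors how an arithmetic COMPUTE expression in SPSS yields a system-missing result when any operand is missing.

```python
def composite_score(row, items):
    """Mean of the listed items; None if any of them is missing."""
    values = [row[name] for name in items]
    if any(value is None for value in values):
        return None
    return sum(values) / len(values)

def knowledge_index(row, items):
    """Number of correct answers, with each item coded 1 = correct, 0 = incorrect."""
    values = [row[name] for name in items]
    if any(value is None for value in values):
        return None
    return sum(values)

def high_knowledge(index, cutoff):
    """1 if the index reaches the cutoff, 0 if not, None if the index is missing."""
    return None if index is None else int(index >= cutoff)

# One made-up respondent: three scale items and three knowledge items.
respondent = {"q143": 3, "q146": 4, "q148": 2, "k1": 1, "k2": 0, "k3": 1}

social_norm_sh = composite_score(respondent, ["q143", "q146", "q148"])
index = knowledge_index(respondent, ["k1", "k2", "k3"])
is_high = high_knowledge(index, cutoff=2)
```

Dividing by the number of items keeps the composite on the same response range as the original items, which makes scales with different numbers of items comparable.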

QUALITY IMPROVEMENT CHECKLIST

CHECKLIST ONE: FILE MANAGEMENT CHECKLIST
Record all the syntax for data cleaning and preparation in a syntax file.
Name the syntax file properly (i.e., country-year-population-research area-type of analysis, etc.).
Save the syntax file every time you make revisions and/or corrections to the syntax.
Do not save the data set after you make changes to it. Be sure to keep the data set the same as the original one.
Describe analysis steps briefly in notes that start with an asterisk (*) and end with a period.

CHECKLIST TWO: DATA CLEANING CHECKLIST
Check for out-of-range values for variables of interest.
Check that the questionnaire codes match the values in the data set.
Check for missing values for variables of interest.
Identify the type of missing data: no-response missing or skip-pattern missing.
Deal with missing values correctly: decide whether they are truly missing or need to be recoded to other values.

CHECKLIST THREE: DATA MINING CHECKLIST
Rename new or recoded variables in the manner suggested below:
Rename new variables with a D if continuous variables are recoded into dichotomous variables.
Rename reverse-coded variables with an R.
Rename new variables with an N.
Name OAM scales in such a way that the bubble term is easy to identify from the variable name.
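The range and missing-value checks in Checklist Two amount to sorting each recorded value into one of three bins. The Python sketch below illustrates that logic under stated assumptions: the function name `check_variable`, the valid codes 1 to 4, and the missing code 99 are hypothetical and should be replaced with the codes from your own questionnaire.

```python
def check_variable(values, valid_codes, missing_codes=()):
    """Count valid and missing values and list out-of-range values with their positions."""
    valid_codes = set(valid_codes)
    missing_codes = set(missing_codes)
    report = {"valid": 0, "missing": 0, "out_of_range": []}
    for position, value in enumerate(values):
        if value is None or value in missing_codes:
            report["missing"] += 1
        elif value in valid_codes:
            report["valid"] += 1
        else:
            # Neither a questionnaire code nor a declared missing code.
            report["out_of_range"].append((position, value))
    return report

# Made-up responses to a four-point item; 99 stands for "no response" (an assumption).
responses = [1, 4, 2, 99, 7, None, 3]
report = check_variable(responses, valid_codes=range(1, 5), missing_codes={99})
```

Each entry in `report["out_of_range"]` points to a record to look up against the original questionnaire before any recoding is done.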

CHECKLIST FOUR: CHECKLIST FOR CREATING SOCIOECONOMIC STATUS (SES) INDEX IF THE QUESTIONNAIRE CONTAINS POSSESSIONS OR AMENITIES
Identify relevant possession or amenity questions to be included in the SES index creation.
Recode them into dummy variables.
Run frequency distributions for these variables and identify and remove items that fewer than 20% or more than 80% of the respondents possess.
Run factor analysis using principal component analysis (PCA) as the extraction method and Varimax as the rotation method.
Save the factor scores.
Group respondents into 5, 3, or 2 SES groups depending on the distribution of the factor score and your interest.

CHECKLIST FIVE: SCALE CONSTRUCTS CHECKLIST
For each scale construct, identify negatively phrased items and reverse code them.
Run factor analysis to determine subscales using PCA for extraction and Varimax for rotation.
Decide the number of factors and the items loaded on each factor on the basis of the combined results in the rotated component matrix, total variance explained, and scree test, and your understanding of the statement of each item.
Run reliability analysis for each scale construct as suggested below:
Check if all items are coded in the same direction.
Check if you have a minimum of 3 items.
Drop items one by one and rerun the internal consistency analysis after dropping each item.
Drop first the item that will result in the greatest increase in scale reliability.
Create a composite variable by adding relevant items together and then dividing the sum by the number of items.
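The SES index steps in Checklist Four can be sketched in Python with NumPy. This is a simplified illustration, not the SPSS procedure: it keeps only the first principal component (for which rotation makes no difference), it returns unstandardized component scores rather than SPSS factor scores, and the data are simulated. The function name `ses_index`, the thresholds, and the sample size are assumptions made for the example.

```python
import numpy as np

def ses_index(possessions, n_groups=5, low=0.20, high=0.80):
    """Score respondents on the first principal component of 0/1 possession items."""
    x = np.asarray(possessions, dtype=float)
    share = x.mean(axis=0)                       # proportion owning each item
    keep = (share >= low) & (share <= high)      # drop very rare and very common items
    x = x[:, keep]
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(x, rowvar=False))
    weights = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
    if weights.sum() < 0:                        # orient so higher score = more possessions
        weights = -weights
    score = z @ weights
    cuts = np.quantile(score, np.linspace(0, 1, n_groups + 1)[1:-1])
    group = np.searchsorted(cuts, score, side="right") + 1   # 1 = lowest SES group
    return keep, score, group

# Simulated ownership data (an assumption): richer households own more items.
rng = np.random.default_rng(1)
wealth = rng.normal(size=1000)
thresholds = np.array([-2.4, -0.5, 0.0, 0.3, 0.6, 2.4])
owned = (wealth[:, None] + rng.normal(size=(1000, 6)) > thresholds).astype(int)

keep, score, group = ses_index(owned)
```

With only a few yes/no items the score takes a limited number of distinct values, so tied scores can make the groups uneven or leave a group empty. That is one reason the checklist leaves the choice of 5, 3, or 2 groups to the distribution of the score.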

CHECKLIST SIX: KNOWLEDGE INDEX CHECKLIST
Check if all the similar knowledge items are in the same direction. Reverse code variables if needed.
Run a correlation matrix among all the items.
Remove items which are negatively correlated with other items.
Create a composite variable by adding similar knowledge items together.

ANNEX

This section is not applicable to the pre-analysis data preparation toolkit chapter.

REFERENCES

Balch, G.I., & Sutton, S.M. (1997). Keep Me Posted: A Plea for Practical Evaluation. In M.E. Goldberg, M. Fishbein, & S.E. Middlestadt (Eds.), Social Marketing: Theoretical and Practical Perspectives. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Cattell, R.B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.
Gorsuch, R.L. (1983). Factor Analysis. Hillsdale, NJ: Lawrence Erlbaum.
Patel, D.S., & Chapman, S. (2005). The dashboard: a tool for social marketing decision making. PSI Research Division: Concept Paper.
