Associate Prof. Dr Anne Yee
Dr Mahmoud Danaee
What does this resemble?
Rorschach test: at the end of the test, the tester says you need therapy or you can't work for this company.
Psychological Testing
Occurs widely in personnel selection, in clinical settings, and in education.
What constitutes a good test?
Validity and Reliability
Validity: How well does the measure or design do what it purports to do?
Reliability: How consistent or stable is the instrument? Is the instrument dependable?
Validity (diagram): Logical (Face, Content); Criterion, AKA Statistical (Concurrent, Predictive); Construct (Convergent, Divergent/Discriminant). Reliability: Consistency, Objectivity.
Face Validity
Infers that a test is valid at face value: it appears clear that the test measures what it is supposed to measure.
As a check on face validity, test or survey items are sent to experts to obtain suggestions for modification.
Because of its vagueness and subjectivity, psychometricians have long since abandoned this concept.
Content Validity
Infers that the test measures all aspects contributing to the variable of interest.
Face validity vs. content validity: face validity can be established by one person, whereas content validity should be checked by a panel, so it usually goes hand in hand with inter-rater reliability (Kappa).
Example: Computer literacy includes skills in operating systems, word processing, spreadsheets, databases, graphics, the internet, and many others. It is difficult to administer a test covering all aspects of computing, so only several tasks are sampled from the universe of computer skills. A test of computer literacy should be written or reviewed by computer science professors or senior programmers in the IT industry, because it is assumed that computer scientists know what is important in their own discipline.
Overall: a logically valid test simply appears to measure the right variable in its entirety. This judgment is subjective.
The Content Validity Index
Content validity has been defined as follows:
(1) ...the degree to which an instrument has an appropriate sample of items for the construct being measured (Polit & Beck, 2004, p. 423);
(2) ...whether or not the items sampled for inclusion on the tool adequately represent the domain of content addressed by the instrument (Waltz, Strickland, & Lenz, 2005, p. 155);
(3) ...the extent to which an instrument adequately samples the research domain of interest when attempting to measure phenomena (Wynd, Schmidt, & Schaefer, 2003, p. 509).
Two types of CVI
Content validity of individual items (I-CVI).
Content validity of the overall scale (S-CVI).
Researchers use I-CVI information to guide them in revising, deleting, or substituting items.
I-CVIs tend to be reported only in methodological studies that focus on describing the content validation process.
The CVI most often reported in scale development studies is the scale-level CVI (S-CVI).
CVI: degree to which an instrument has an appropriate sample of items for the construct being measured.
S-CVI: content validity of the overall scale.
I-CVI: content validity of individual items.
S-CVI/UA: proportion of items on a scale that achieve a relevance rating of 3 or 4 from all the experts.
S-CVI/Ave: average of the I-CVIs for all items on the scale.
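Expert rating form. Each question is rated from 1 to 4 on four criteria, with space for comments.
Consistency: Is each item in the instrument consistent?
Representativeness: Are the items representative of concepts related to the dissertation topic?
Relevance: Are the items relevant to concepts related to the dissertation topic?
Clarity: Are the items clear in terms of wording?
Question | Consistency | Representativeness | Relevance | Clarity | Comments
Q1 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 |
Q2 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 |
Q3 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 |
Q4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 |
Q5 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 |
Ratings: 1 = not relevant, 2 = somewhat relevant, 3 = quite relevant, 4 = highly relevant.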
I-CVI: item-level content validity index. S-CVI: content validity index for the scale.
Acceptable standard for the S-CVI: a minimum S-CVI of .80 is recommended. For individual items: if the I-CVI is higher than 79%, the item is appropriate; if it is between 70% and 79%, it needs revision; if it is less than 70%, it is eliminated.
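The definitions and cut-offs above can be sketched in a few lines of Python. This is a minimal illustration, not a validated tool: the ratings matrix is hypothetical, and the function names (`item_cvi`, `scale_cvi`, `item_decision`) are invented for this example.

```python
def item_cvi(ratings):
    """I-CVI: proportion of experts who rated the item 3 or 4."""
    return sum(1 for r in ratings if r >= 3) / len(ratings)


def scale_cvi(rating_matrix):
    """Return (list of I-CVIs, S-CVI/UA, S-CVI/Ave).

    rating_matrix holds one list of expert ratings (1 to 4) per item.
    """
    i_cvis = [item_cvi(item) for item in rating_matrix]
    # S-CVI/UA: share of items rated 3 or 4 by ALL experts (I-CVI == 1)
    s_cvi_ua = sum(1 for v in i_cvis if v == 1.0) / len(i_cvis)
    # S-CVI/Ave: mean of the item-level indices
    s_cvi_ave = sum(i_cvis) / len(i_cvis)
    return i_cvis, s_cvi_ua, s_cvi_ave


def item_decision(i_cvi):
    """Cut-offs quoted in the text: >79% keep, 70-79% revise, <70% drop."""
    if i_cvi > 0.79:
        return "appropriate"
    if i_cvi >= 0.70:
        return "needs revision"
    return "eliminate"


# Hypothetical data: 5 items (rows), each rated by 5 experts (columns)
ratings = [
    [4, 4, 3, 4, 4],
    [4, 3, 3, 4, 2],
    [3, 4, 4, 3, 3],
    [2, 3, 1, 4, 2],
    [4, 4, 4, 3, 4],
]
i_cvis, s_cvi_ua, s_cvi_ave = scale_cvi(ratings)
for number, value in enumerate(i_cvis, start=1):
    print(f"Q{number}: I-CVI = {value:.2f} ({item_decision(value)})")
print(f"S-CVI/UA = {s_cvi_ua:.2f}, S-CVI/Ave = {s_cvi_ave:.2f}")
```

In this made-up panel, S-CVI/UA is much lower than S-CVI/Ave because a single dissenting expert is enough to break universal agreement on an item.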
The Kappa statistic is a consensus index of inter-rater agreement that adjusts for chance agreement. It is an important supplement to the CVI because it provides information about the degree of agreement beyond chance. Evaluation criteria for Kappa: values above 0.74 = excellent; between 0.60 and 0.74 = good; between 0.40 and 0.59 = fair.
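One common way to attach a kappa to each I-CVI is the modified kappa described by Polit, Beck, and Owen (2007). The slides do not say which kappa variant is meant, so treat the choice as an assumption of this sketch. The probability of chance agreement is the binomial probability that exactly A of N experts rate the item as relevant when each rating is a coin flip. The function names are illustrative.

```python
from math import comb


def modified_kappa(n_experts, n_agree):
    """Modified kappa for one item (assumed variant, see lead-in).

    n_agree is the number of experts who rated the item 3 or 4.
    """
    i_cvi = n_agree / n_experts
    # Chance agreement: P(exactly n_agree "relevant" ratings) with p = 0.5
    p_chance = comb(n_experts, n_agree) * 0.5 ** n_experts
    return (i_cvi - p_chance) / (1 - p_chance)


def kappa_label(kappa):
    """Bands quoted in the text; values under 0.40 are not rated there."""
    if kappa > 0.74:
        return "excellent"
    if kappa >= 0.60:
        return "good"
    if kappa >= 0.40:
        return "fair"
    return "below 0.40 (not rated in the text)"


# Hypothetical item: 4 of 5 experts rated it 3 or 4
k = modified_kappa(5, 4)
print(f"kappa = {k:.2f} ({kappa_label(k)})")
```

With few experts the chance correction is substantial: an I-CVI of 0.80 from five experts corresponds to a kappa of about 0.76.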
Criterion Validity
This type of validity is used to measure the ability of an instrument to predict future outcomes. Validity is usually determined by comparing two instruments' ability to predict a similar outcome, with a single variable being measured. There are two major types of criterion validity: predictive and concurrent.
Criterion validity
Example: the Warwick spider phobia questionnaire shows a positive correlation with the SPQ.
A test has high criterion validity if it correlates highly with some external benchmark (concurrent) or with outcome criteria (predictive).
E.g., if your scale reported that you lost 30 pounds, you would expect that your clothes would also feel looser.
Concurrent Criterion Validity Concurrent criterion validity is used when the two instruments are used to measure the same event at the same time. Example:
Predictive Criterion Validity Predictive validity is used when the instrument is administered, time is allowed to pass, and the result is then measured against another outcome. Example:
Criterion validity
When the focus of the test is on criterion validity, we draw an inference from test scores to performance. A high score on a valid test indicates that the test taker has met the performance criteria. Regression analysis can be applied to establish criterion validity: the independent variable serves as the predictor variable and the dependent variable as the criterion variable. The correlation coefficient between them is called the validity coefficient.
How is Criterion Validity Measured?
A correlation coefficient expresses the degree to which the instrument is valid with respect to the measured criterion. What does it look like in an equation? The symbol r denotes the correlation coefficient. A positive r shows a positive relationship between the instruments (high scores on one go with high scores on the other); a negative r shows a negative relationship (high scores on one go with low scores on the other). The closer the absolute value of r is to 1, the stronger the relationship.
Predictive Validity Concurrent Validity
As a rule of thumb, for absolute value of r: 0.00-0.19: very weak 0.20-0.39: weak 0.40-0.59: moderate 0.60-0.79: strong 0.80-1.00: very strong.
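A validity coefficient and its rule-of-thumb label can be computed as follows. This is a minimal sketch using the Pearson correlation. The two score lists are hypothetical, and band edges such as 0.195 are assigned to the next band by assumption, since the rule of thumb leaves small gaps between bands.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def strength(r):
    """Rule-of-thumb label for the absolute value of r."""
    size = abs(r)
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"


# Hypothetical scores: a new test and an established benchmark,
# completed by the same six people (concurrent design)
new_test = [12, 15, 9, 20, 17, 11]
benchmark = [30, 34, 25, 45, 40, 27]
validity_coefficient = pearson_r(new_test, benchmark)
print(f"r = {validity_coefficient:.2f} ({strength(validity_coefficient)})")
```

For a predictive design the same calculation applies, with the criterion scores collected after time has passed.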
Construct validity
Measuring things that are in our theory of a domain. The construct is sometimes called a latent variable: you can't directly observe the construct; you can only measure its surface manifestations. Because it is concerned with abstract and theoretical constructs, construct validity is also known as theoretical validity.
What are Latent Variables?
Most, if not all, variables in the social world are not directly observable. This makes them latent or hypothetical constructs. We measure latent variables with observable indicators, e.g. questionnaire items. We can think of the variance of an observable indicator as being partially caused by the latent construct in question and partially by other factors (error).
Math anxiety I cringe when I have to go to math class. I am uneasy about going to the board in a math class. I am afraid to ask questions in math class. I am always worried about being called on in math class. I understand math now, but I worry that it's going to get really difficult soon.
Specifying formative versus reflective constructs is a critical preliminary step prior to further statistical analysis. Specification follows these guidelines:
Formative: direction of causality is from measure to construct; there is no reason to expect the measures to be correlated; indicators are not interchangeable.
Reflective: direction of causality is from construct to measure; measures are expected to be correlated; indicators are interchangeable.
An example of formative versus reflective constructs is given in the figure below.
Factor model
A factor model identifies the relationship between observed items and latent factors. For example, when a psychologist wants to study the causal relationship between math anxiety and job performance, he or she first has to define the constructs math anxiety and job performance. To accomplish this step, the psychologist needs to develop items that measure each defined construct.
Construct, dimension, subscale, factor, component
This construct has eight dimensions (e.g. intelligence has eight aspects).
This scale has eight subscales (e.g. the survey measures different but weakly related things).
The factor structure has eight factors/components (e.g. in factor analysis/PCA).
Exploratory Factor Analysis (EFA) is a statistical approach to determining the correlation among the variables in a dataset. This type of analysis provides a factor structure (a grouping of variables based on strong correlations). EFA is good for detecting "misfit" variables. In general, an EFA prepares the variables to be used for cleaner structural equation modeling. An EFA should always be conducted for new datasets.
The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy tests whether the partial correlations among variables are small.
KMO statistics: Marvelous: .90s; Meritorious: .80s; Middling: .70s; Mediocre: .60s; Miserable: .50s; Unacceptable: < .50.
Bartlett's Test of Sphericity
Tests the hypothesis that the correlation matrix is an identity matrix (diagonals are ones, off-diagonals are zeros). A significant result (Sig. < 0.05) indicates that the matrix is not an identity matrix; i.e., the variables do relate to one another enough to run a meaningful EFA.
Anti-image
The anti-image correlation matrix contains the negatives of the partial correlation coefficients, and the anti-image covariance matrix contains the negatives of the partial covariances. In a good factor model, most of the off-diagonal elements will be small. The measure of sampling adequacy for a variable is displayed on the diagonal of the anti-image correlation matrix.
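Both checks can be computed directly from a correlation matrix. The sketch below assumes NumPy is available and uses the standard textbook formulas: the overall KMO is the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations, and Bartlett's statistic is a scaled log-determinant. The data are simulated and the function names are illustrative. A p-value can be read from the chi-square distribution with the returned degrees of freedom (for example with `scipy.stats.chi2.sf`, if SciPy is installed).

```python
import numpy as np


def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure from a correlation matrix."""
    inv = np.linalg.inv(corr)
    scale = np.sqrt(np.diag(inv))
    # Partial correlations, controlling for all other variables
    partial = -inv / np.outer(scale, scale)
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diagonal] ** 2)
    p2 = np.sum(partial[off_diagonal] ** 2)
    return r2 / (r2 + p2)


def bartlett_sphericity(corr, n_obs):
    """Chi-square statistic and degrees of freedom for Bartlett's test."""
    p = corr.shape[0]
    chi2 = -(n_obs - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(corr))
    df = p * (p - 1) // 2
    return chi2, df


# Simulated data (hypothetical): 300 respondents, 5 items, one factor
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
items = 0.8 * latent + 0.6 * rng.normal(size=(300, 5))
corr = np.corrcoef(items, rowvar=False)

kmo_value = kmo(corr)
chi2, df = bartlett_sphericity(corr, n_obs=items.shape[0])
print(f"KMO = {kmo_value:.2f}")
# The 0.05 critical value of chi-square with 10 df is about 18.31
print(f"Bartlett chi-square = {chi2:.1f} on {df} df")
```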
Communalities A communality is the extent to which an item correlates with all other items. Higher communalities are better. If communalities for a particular variable are low (between 0.0-0.4), then that variable will struggle to load significantly on any factor. In the table below, you should identify low values in the "Extraction" column. Low values indicate candidates for removal after you examine the pattern matrix.
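In factor-analytic terms, an item's extraction communality is the sum of its squared loadings across the retained factors (assuming orthogonal factors), which is what the "Extraction" column reports. A minimal sketch, with a hypothetical loading matrix and the 0.4 cut-off mentioned above:

```python
def communalities(loading_matrix):
    """Sum of squared loadings per item (one row of loadings per item)."""
    return [sum(loading ** 2 for loading in row) for row in loading_matrix]


# Hypothetical loadings of four items on two orthogonal factors
loadings = [
    [0.78, 0.10],
    [0.72, 0.15],
    [0.12, 0.81],
    [0.30, 0.25],
]
values = communalities(loadings)
# Items below 0.4 are candidates for removal, per the text
low_items = [index for index, h2 in enumerate(values) if h2 < 0.4]
for index, h2 in enumerate(values):
    print(f"item {index + 1}: communality = {h2:.2f}")
print("candidates for removal (0-based index):", low_items)
```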
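Parallel analysis is a method for determining the number of components or factors to retain from PCA or factor analysis. Essentially, the program works by creating a random dataset with the same numbers of observations and variables as the original data. The eigenvalues of the random data are then compared with those of the original data, and a factor is retained when its observed eigenvalue exceeds the corresponding random one.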
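The procedure can be sketched with NumPy. This version is one common variant, not necessarily what any particular program does: it repeats the random-data step many times and compares each observed eigenvalue with the 95th percentile of the random ones. The number of simulations, the percentile, and the simulated two-factor data are all assumptions made for illustration.

```python
import numpy as np


def parallel_analysis(data, n_sims=200, percentile=95, seed=0):
    """Return (number of components to retain, observed eigenvalues,
    random-data thresholds), using PCA eigenvalues of the correlation
    matrix."""
    rng = np.random.default_rng(seed)
    n_obs, n_vars = data.shape
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

    random_eigs = np.empty((n_sims, n_vars))
    for sim in range(n_sims):
        # Random data with the same shape as the original dataset
        noise = rng.normal(size=(n_obs, n_vars))
        random_eigs[sim] = np.linalg.eigvalsh(
            np.corrcoef(noise, rowvar=False))[::-1]
    thresholds = np.percentile(random_eigs, percentile, axis=0)

    # Retain leading components until one fails to beat its threshold
    n_retain = 0
    for obs, thr in zip(observed, thresholds):
        if obs <= thr:
            break
        n_retain += 1
    return n_retain, observed, thresholds


# Simulated data (hypothetical): 6 items, 3 loading on each of 2 factors
rng = np.random.default_rng(42)
factors = rng.normal(size=(400, 2))
loadings = np.array([[0.8, 0], [0.8, 0], [0.8, 0],
                     [0, 0.8], [0, 0.8], [0, 0.8]])
data = factors @ loadings.T + 0.6 * rng.normal(size=(400, 6))

n_retain, observed, thresholds = parallel_analysis(data)
print("components to retain:", n_retain)
```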
https://www.statstodo.com
Factor analysis for dichotomous variables
Using Factor software and simultaneously Parallel analysis for binary data
Establishing construct validity
Convergent validity: agrees with other measures of the same thing.
Divergent/Discriminant validity: different tests measure different things. Does the test have the ability to discriminate? (Campbell & Fiske, 1959)
Construct validity
Construct validity is the extent to which a set of measured items actually reflects the theoretical latent construct those items are designed to measure. Thus, it deals with the accuracy of measurement. Construct validity is made up of two important components:
1) Convergent validity: the items that are indicators of a specific construct should converge, or share a high proportion of variance in common. There are several ways to estimate the relative amount of convergent validity among item measures:
Factor loading: at a minimum, all factor loadings should be statistically significant. A good rule of thumb is that standardized loading estimates should be .5 or higher, and ideally .7 or higher.
Average Variance Extracted (AVE): the average squared factor loading. An AVE of .5 or higher is a good rule of thumb suggesting adequate convergence. An AVE of less than .5 indicates that, on average, more error remains in the items than variance explained by the latent factor structure imposed on the measures (Hair et al., 2006, p. 777).
Construct reliability: construct reliability should be .7 or higher to indicate adequate convergence or internal consistency.
2) Discriminant validity: the extent to which a construct is truly distinct from other constructs. Discriminant validity can be tested by comparing the AVE for each construct with the squared correlations (shared variance) between that construct and all other constructs in the model. A construct has adequate discriminant validity if its AVE exceeds the squared correlation with each of the other constructs; for any two factors, both AVEs should be greater than the square of the correlation between them (Fornell & Larcker, 1981; Hair et al., 2006).
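These criteria reduce to simple arithmetic on standardized loadings. The sketch below uses hypothetical loadings for two constructs. Construct reliability is computed with the usual composite reliability formula, (sum of loadings) squared divided by that quantity plus the summed error variances, with each error variance taken as 1 minus the squared loading. That formula is an assumption, since the text gives only the .7 threshold.

```python
def ave(loadings):
    """Average Variance Extracted: mean of the squared loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)


def construct_reliability(loadings):
    """Composite reliability from standardized loadings (assumed formula)."""
    total = sum(loadings) ** 2
    error = sum(1 - l ** 2 for l in loadings)
    return total / (total + error)


def fornell_larcker(ave_a, ave_b, correlation):
    """True if both AVEs exceed the squared inter-construct correlation."""
    shared_variance = correlation ** 2
    return ave_a > shared_variance and ave_b > shared_variance


# Hypothetical standardized loadings for two constructs
anxiety = [0.82, 0.76, 0.71, 0.68]
performance = [0.79, 0.74, 0.70]

ave_anxiety, ave_performance = ave(anxiety), ave(performance)
print(f"AVE (anxiety) = {ave_anxiety:.3f}")
print(f"CR (anxiety) = {construct_reliability(anxiety):.3f}")
print(f"AVE (performance) = {ave_performance:.3f}")
# Hypothetical correlation of 0.55 between the two constructs
print("discriminant validity:",
      fornell_larcker(ave_anxiety, ave_performance, 0.55))
```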
Individual model (First order CFA)
Measurement Model
Structural Equation Modeling (SEM) Individual Model Measurement Model Structural Model
Developing Assessments
Purpose: What are you trying to measure?
Review: What assessments do you already have that purport to measure this?
Purchase or Develop: If necessary, consider commercial assessments or create a new assessment.
Considerations
Using what you already have: Is it carefully aligned to your purpose? Is it carefully matched to your purpose? Do you have the funds (for assessment, equipment, training)?
Developing a new assessment: Do you have the in-house content knowledge? Do you have the in-house assessment knowledge? Does your team have time for development? Does your team have the knowledge and time needed for proper scoring?
Identify the goal of your questionnaire: What kind of information do you want to gather with your questionnaire? What is your main objective? Is a questionnaire the best way to go about collecting this information?
Adopting an Instrument
Adopting an instrument is quite simple and requires very little effort. Even when an instrument is adopted, though, a few modifications might still be necessary.
Adapting an Instrument
Adapting an instrument requires more substantial changes than adopting an instrument. In this situation, the researcher follows the general design of another instrument but adds items, removes items, and/or substantially changes the content of each item.
Whenever possible, it is best for an instrument to be adopted. When this is not possible, the next best option is to adapt an instrument. However, if there are no other instruments available, then the last option is to develop an instrument.
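Step | Type of Validity | Subtype | Development | Adaptation | Adoption
Pretest | Logical | Face | + | +/- | +/-
Pretest | Logical | Content | + | + | +/-
Pilot / main study | Criterion | Concurrent | + | + | -
Pilot / main study | Criterion | Predictive | + | + | -
Pilot / main study | Construct | Convergent | + | + | +
Pilot / main study | Construct | Divergent | + | + | +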
Reliability
Types of Reliability
Test-Retest Reliability: degree of temporal stability of the instrument. Assessed by having the instrument completed by the same people during two different time periods.
Alternate-Forms Reliability: degree of relatedness of different forms of the test. Used to minimize inflated reliability correlations due to familiarity with test items.
Types of Reliability (cont.)
Internal-Consistency Reliability: overall degree of relatedness of all test items or raters. Also called reliability of components.
Item-to-Item Reliability: the reliability of any single item on average.
Judge-to-Judge Reliability: the reliability of any single judge on average.
Cronbach's alpha is used to evaluate the internal consistency of observed items; factor analysis is then applied to extract latent constructs from these consistent observed variables. A value above 0.90 suggests that the questions are asking the same thing; 0.7 to 0.9 is the acceptable range.
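Cronbach's alpha can be computed from item variances and the variance of the total score: alpha = k / (k - 1) * (1 - sum of item variances / variance of totals), where k is the number of items. A minimal sketch with hypothetical responses from five people to three items; the interpretation bands follow the ranges quoted above.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per item, respondents in the same order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def interpret_alpha(alpha):
    """Ranges quoted in the text."""
    if alpha > 0.90:
        return "questions may be asking the same thing"
    if alpha >= 0.70:
        return "acceptable"
    return "below the acceptable range"


# Hypothetical responses: 3 items answered by the same 5 respondents
item_scores = [
    [2, 4, 3, 5, 1],
    [3, 4, 3, 5, 2],
    [2, 5, 4, 4, 1],
]
alpha = cronbach_alpha(item_scores)
print(f"alpha = {alpha:.3f} ({interpret_alpha(alpha)})")
```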
Remember! An assessment that is highly reliable is not necessarily valid. However, for an assessment to be valid, it must also be reliable.
Improving Validity & Reliability
Ensure questions are based on standards.
Ask purposeful questions.
Ask concrete questions.
Use time periods based on the importance of the questions.
Use conventional language.
Use complete sentences.
Avoid abbreviations.
Use shorter questions.
Overall Cronbach Coefficient Alpha One may argue that when a high Cronbach Alpha indicates a high degree of internal consistency, the test or the survey must be uni-dimensional rather than multidimensional. Thus, there is no need to further investigate its subscales. This is a common misconception.
Performing the Pilot Test
A pilot test involves conducting the survey on a small, representative set of respondents in order to reveal questionnaire errors before the survey is launched. It is important to run the pilot test on respondents who are representative of the target population to be studied.
Cronbach's alpha measures the intercorrelations among test items and is thus known as an internal consistency estimate of the reliability of test scores.
Test-retest reliability refers to the degree to which test results are consistent over time. To measure test-retest reliability, we must first give the same test to the same individuals on two occasions and then correlate the scores.
Thank you
Dr Mahmoud Danaee, mdanaee@um.edu.my
Associate Prof. Dr Anne Yee, annyee17@um.edu.my
Senior Visiting Research Fellow, Academic Enhancement and Leadership Development Center (ADeC)
Addiction psychiatrist, Department of Psychological Medicine, University of Malaya Center of Addiction Science (UMCAS)