Making a psychometric. Dr Benjamin Cowan, Lecture 9.
What this lecture will cover: what a questionnaire is; development of questionnaires; item development; scale options; scale reliability and validity; factor analysis.
What is a questionnaire? Some concepts, such as attitudes, emotions and opinions, are difficult to measure directly with measurements like time or accuracy. We need to design psychometrics for these if we are to research them.
Why would we want to make a psychometric? We may be looking at a new concept that hasn't been measured before. This happens a lot in HCI as new technologies develop. Because a metric needs to measure something specific to have value, we need to design new measures, or tweak existing ones, for new technologies. Tweaking means adding items and re-testing.
Example: anxiety towards Facebook posting. Let's say we wanted to make a measure of how anxious people are about posting to Facebook. This measure (our questionnaire) is made up of attitude phrases (or items).
Stages of item development. Literature review: what are the key concepts in studying anxiety? Measure review: what is available, and how is anxiety currently measured? Focus groups and interviews: what is important in Facebook anxiety? Ask questions about Facebook and negative emotions. This gives an indication of how people describe the concepts, which improves item wording.
Generating items: interviews. An interview is a conversation with a purpose. There are 4 main types: unstructured, semi-structured, structured and group.
Unstructured interviews. Exploratory: you talk around an area, planning the areas for discussion rather than specific questions, and can explore topics as they come up.
Structured interviews. Predetermined questions, standardised for all interviewees.
Semi-structured interviews. A basic script is used with all participants. This is a mix of the structured and unstructured interview: some questions are covered with everyone and the rest is a free-flowing conversation.
What interview type to use? It depends on how specific you need to get and on the purpose of the interview.
Stages of item development (continued). The interviews will give you an idea of potential items and of potential categories that need to be covered (factors). Pilot study: give participants a large number of items and have them rate the clarity of the wording and the clarity of the concept in each item. Ask experts in the area to review the items.
The good, the bad, the ugly. Good item: clear, well worded, one concept, to the point. "I feel stressed when using Facebook." Bad item: can be clearly worded but covers more than one concept. "I feel stressed because of so many people on Facebook and it is hard to use." Ugly item: poorly worded and covers more than one concept. "Stress is something I feel all of the time when using Facebook because people on it are plentiful and it's difficult." This can happen when questionnaires are mistranslated.
Common scales used. Likert scales (Likert, 1932): 3-point, 5-point, 7-point or 9-point. The more points, the larger the variance of responses on an item. There are arguments over which is best, but 5-point is most common. The use of a neutral point is also debated. Semantic differential: uses two polar-opposite adjectives at the ends of a scale. Which adjectives to use? Strong/Not strong (bad); Strong/Weak (good).
Important concepts in item response. Acquiescence response set: a propensity for participants to answer positively to items. Counter it by balancing the psychometric as much as possible (positively and negatively worded items) and by randomising item order. Social desirability: responding with what you feel is socially appropriate.
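A balanced questionnaire mixes positively and negatively worded items, so the negatively worded ones are usually reverse-scored before totals are computed. The slide does not describe this step; the sketch below is an illustration that assumes a 5-point scale scored 1 to 5, and the responses and the choice of reversed item are hypothetical.

```python
import numpy as np

SCALE_MAX = 5  # assumption: 5-point Likert scale scored 1..5

# Hypothetical responses: rows = participants, columns = items.
responses = np.array([
    [5, 1, 4],
    [2, 4, 2],
    [4, 2, 5],
])
negatively_worded = [1]  # hypothetical: item at index 1 is worded in the opposite direction

# Reverse-score so that a high number always means "more anxious".
scored = responses.copy()
scored[:, negatively_worded] = (SCALE_MAX + 1) - scored[:, negatively_worded]

totals = scored.sum(axis=1)
print(totals)
```

Reverse-scoring keeps every item pointing the same way, which matters for the reliability statistics later in the lecture.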
So: we have our items and we have piloted them with participants. We now need to assess how good our questionnaire (or psychometric) is. Good psychometrics have high reliability, have high validity, and possess a set of norms (baselines/guides).
Reliability. Test-retest reliability: the stability of the test score over time. Internal consistency reliability: the internal consistency of the test, i.e. the extent to which the items are measuring the same underlying concept.
Test-retest reliability. Test at Time 1, 6-month gap, test at Time 2. The same participants are tested on the measure on two occasions. The scores are then correlated to see the strength of the relationship. A correlation over 0.7 indicates good test-retest reliability.
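As a minimal sketch of the test-retest calculation, the example below correlates two sets of total scores with a Pearson correlation. The scores are made up for illustration, and the slide does not say which correlation coefficient is used, so Pearson's r is an assumption.

```python
import numpy as np

# Hypothetical total scores for the same 8 participants on two occasions.
time1 = np.array([12, 18, 25, 30, 22, 15, 28, 20], dtype=float)
time2 = np.array([14, 17, 27, 29, 20, 16, 30, 19], dtype=float)

def pearson_r(x, y):
    """Pearson correlation between two equal-length score vectors."""
    xd = x - x.mean()
    yd = y - y.mean()
    return float((xd * yd).sum() / np.sqrt((xd**2).sum() * (yd**2).sum()))

r = pearson_r(time1, time2)
verdict = "good" if r > 0.7 else "questionable"
print(f"test-retest r = {r:.2f} ({verdict} by the 0.7 rule of thumb)")
```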
Why would the correlation not be perfect? Between times there may be real changes on the variable: some people may have become less anxious over time. There is also test error: participants feeling ill, bored or tired.
Internal consistency reliability. The extent to which each item measures the same underlying concept. In our Facebook posting anxiety scale we would expect all the items to be measuring elements of anxiety, not the usability of Facebook.
Internal consistency measures. Split-half method: divide the measure in two at random and correlate the scores on the two halves. Cronbach's alpha (most commonly used): equivalent to the average of all possible split-half reliabilities. An alpha of 0.7 or above is seen as good.
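Both internal consistency measures can be sketched in a few lines. The data here are simulated (an assumption for illustration: six items that each reflect one shared latent score plus noise), and alpha is computed with the usual variance formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total score), rather than by enumerating every split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumption): 200 respondents, 6 items,
# each item = shared latent "anxiety" score + item-specific noise.
latent = rng.normal(size=(200, 1))
items = latent + 0.5 * rng.normal(size=(200, 6))

def split_half_r(data, rng):
    """Randomly split the items into two halves and correlate the half totals."""
    order = rng.permutation(data.shape[1])
    half = data.shape[1] // 2
    a = data[:, order[:half]].sum(axis=1)
    b = data[:, order[half:]].sum(axis=1)
    return float(np.corrcoef(a, b)[0, 1])

def cronbach_alpha(data):
    """Cronbach's alpha from item variances and total-score variance."""
    k = data.shape[1]
    item_vars = data.var(axis=0, ddof=1).sum()
    total_var = data.sum(axis=1).var(ddof=1)
    return float(k / (k - 1) * (1 - item_vars / total_var))

r_half = split_half_r(items, rng)
alpha = cronbach_alpha(items)
print(f"split-half r = {r_half:.2f}, Cronbach's alpha = {alpha:.2f}")
```

A single random split depends on which items happen to land in each half, which is one reason alpha is the more commonly reported figure.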
What can impact on this reliability? The number of items: more items mean more of the concept can be covered, but this must be weighed against participant boredom. 10 items is considered the minimum for a reliable test. Can a measure be too internally consistent (Cattell, 1957)? Yes, if it uses items which effectively measure the same thing, e.g. "I like Facebook" and "Facebook is something I like". They are the same item, just with different wording. This leads to a bloated specific.
Cronbach's alpha analysis. The analysis looks at the correlation of each item's scores with the total questionnaire score (item-total correlations). Items with item-total correlations lower than 0.3 should be removed, as they do not correlate well with the rest of the scale. The test output also shows what alpha would be without each item, which is great for deciding on item removal.
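The item analysis can be sketched as follows. The data are simulated (an assumption: five items driven by one latent score plus a sixth item that is pure noise), and the item-total correlation is computed in its "corrected" form, against the total of the remaining items, which is a design choice of this sketch rather than something the slide specifies.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Simulated data (assumption): items 0-4 share a latent score, item 5 is unrelated noise.
latent = rng.normal(size=(n, 1))
good = latent + 0.5 * rng.normal(size=(n, 5))
noise = rng.normal(size=(n, 1))
items = np.hstack([good, noise])

def cronbach_alpha(data):
    """Cronbach's alpha from item variances and total-score variance."""
    k = data.shape[1]
    item_vars = data.var(axis=0, ddof=1).sum()
    total_var = data.sum(axis=1).var(ddof=1)
    return float(k / (k - 1) * (1 - item_vars / total_var))

alpha_all = cronbach_alpha(items)
item_total = []
alpha_if_deleted = []
for i in range(items.shape[1]):
    rest = np.delete(items, i, axis=1)
    # Corrected item-total correlation: item vs. total of the other items.
    item_total.append(float(np.corrcoef(items[:, i], rest.sum(axis=1))[0, 1]))
    alpha_if_deleted.append(cronbach_alpha(rest))

print(f"alpha with all items = {alpha_all:.2f}")
for i, (r_it, a_del) in enumerate(zip(item_total, alpha_if_deleted)):
    flag = "  <- candidate for removal" if r_it < 0.3 else ""
    print(f"item {i}: item-total r = {r_it:.2f}, alpha if deleted = {a_del:.2f}{flag}")
```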
Validity of a test. A test can be reliable but not valid: it could be high in reliability but not measuring what it claims to measure. It is not as simple as looking at the item wordings to deduce this. We need to identify whether our measure behaves as predicted.
Validity assessment. Face validity: the items seem to be worded right for the concept being measured. This is a poor test of validity, e.g. "I am quite easily distracted" looks fine but can be interpreted differently by participants. Concurrent validity: correlation of the test with another benchmark test given at the same time. Dubious when there is no clear benchmark.
Validity assessment (continued). Predictive validity: the measure is able to predict some criterion, e.g. Facebook anxiety relates to posting behaviour. Be aware that modest relationships are likely, since many other factors are important to posting behaviour (closeness of Facebook friends, drunken messaging?). Sometimes clear criteria are not available. Beware of the difference between statistical significance and psychological significance.
Construct validity (Cronbach & Meehl, 1955). Allows a collection of results, rather than just one, to lead us to conclusions about validity. It is usually the case that not all hypotheses are confirmed. Validity is therefore not as unequivocal as reliability: it is interpretive and subjective.
Construct validity (continued). A bank of hypotheses based on our knowledge of the concept. Our hypotheses for Facebook anxiety: it should correlate positively and highly with other measures of anxiety (concurrent validity); it should correlate positively with someone's fear of negative evaluation (concurrent validity); it should not correlate with personality tests that don't measure anxiety; high scorers, compared with low scorers, should show less activity on Facebook and more leaving of Facebook (predictive validity).
Norms. We need to test our measures on a significant, representative proportion of the population (1000s of respondents) and on a sample of people we'd expect to be high or low on the measure (for discriminatory markers). This is built up over years of use.
Now we have gathered our items, assessed their reliability and assessed their validity. We are assuming at present that Facebook anxiety is uni-dimensional. This might not be true: there may be many factors to it, which we have picked up in our measure.
What are factors? Each questionnaire item gives a score. There will be items that correlate heavily together. Factor analysis is fundamentally used to reduce the data into the smallest number of explanatory concepts. A factor is a combination of variables, the grouping of which indicates a relationship.
What are factors? (continued) Each item has a factor loading: the correlation of that item with the factor. Some items will have high loadings, and some low or no loading at all, on a specific factor. Loadings of 0.4 or above are seen as helpful in defining a factor. Items should load heavily on only one factor; if they don't, they are candidates for rewording.
Shared variance. The correlation coefficient reflects the amount of agreement between two sets of scores. Square the correlation coefficient to get the proportion of shared variance (multiply by 100 for the percentage agreement). [Figure: variable x variance, shared (common) variance, variable y variance.]
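A one-line worked example of squaring a correlation, using a hypothetical r of 0.7 (the value is chosen only for illustration):

```python
r = 0.7  # hypothetical correlation between two sets of scores

shared_variance = r**2  # proportion of variance the two sets of scores share
print(f"r = {r}: shared variance = {shared_variance:.0%}")
```

So even a correlation that passes the 0.7 rule of thumb leaves roughly half of the variance unshared.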
Shared variance and communality. By squaring the factor loading we can identify how much shared variance there is between the item and the factor. This can be thought of as the contribution that the item makes to the factor. If we do this for each factor loading an item has and sum the results, we get the item's communality: the amount of variance shared between the item and all the factors.
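The communality calculation can be sketched on a small loading matrix. The loadings below are hypothetical, and summing squared loadings in this way assumes the factors are uncorrelated (orthogonal).

```python
import numpy as np

# Hypothetical loadings: rows = items, columns = factors.
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.6],
])

shared = loadings**2               # variance each item shares with each factor
communality = shared.sum(axis=1)   # variance each item shares with all factors

for i, h2 in enumerate(communality):
    print(f"item {i}: communality = {h2:.2f}")
```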
Factor extraction. Eigenvalues indicate the importance of each extracted factor in explaining the variance in the data. There will be a few factors with high eigenvalues and lots with low ones. It makes sense to keep the most important factors. A rule of thumb is to keep factors with eigenvalues > 1 (as an eigenvalue of 1 represents a significant amount of variation). The number to extract is identified using a scree plot (Cattell, 1966): the y axis is the eigenvalue and the x axis is the number of factors.
Scree plot. [Figure: eigenvalues (y axis) against number of factors (x axis), with the point of inflexion marked.]
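A rough sketch of the eigenvalue rule of thumb, taking eigenvalues directly from the item correlation matrix. The data are simulated (an assumption: six items, three driven by each of two independent latent scores), and a statistics package would normally do this as part of the factor extraction itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated data (assumption): items 0-2 reflect factor 1, items 3-5 reflect factor 2.
f1 = rng.normal(size=(n, 1))
f2 = rng.normal(size=(n, 1))
items = np.hstack([
    f1 + 0.5 * rng.normal(size=(n, 3)),
    f2 + 0.5 * rng.normal(size=(n, 3)),
])

corr = np.corrcoef(items, rowvar=False)       # 6 x 6 item correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted largest first
n_keep = int((eigenvalues > 1).sum())         # eigenvalue > 1 rule of thumb

# These are the values a scree plot would show on its y axis.
for i, ev in enumerate(eigenvalues, start=1):
    print(f"factor {i}: eigenvalue = {ev:.2f}")
print(f"factors with eigenvalue > 1: {n_keep}")
```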
Factor rotation. We are looking for the best fit: the factor structure with the clearest interpretation. Sometimes this involves rotation to get the clearest, simplest factor structure. A simple factor structure is one in which each factor has a few high-loading items and the rest are near 0 (Cattell, 1978).
Methods of rotation. The method you choose depends on how correlated you feel the factor scores should be, based on theoretical reasoning. We would expect our questionnaire to have three factors: 1) anxiety about social posting, 2) anxiety about interface interaction, 3) social confidence. We would also expect the scores on these factors to be correlated.
Methods of rotation (continued). We would therefore use a method that takes this correlation into consideration: Direct Oblimin. This is an oblique method of rotation (it allows the factors to correlate).
Methods of rotation (continued). If we felt the factors should not be correlated, we could have used the Varimax method. This is an example of orthogonal rotation, which ensures the extracted factors are not correlated.
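To make orthogonal rotation concrete, here is a minimal sketch of a commonly used iterative varimax update written with NumPy. It is not taken from the lecture, the unrotated loadings are hypothetical, and a statistics package would normally be used instead. Direct Oblimin is not shown.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loading matrix towards simple structure."""
    n_items, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Matrix whose SVD yields the next orthogonal rotation (standard varimax update).
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / n_items
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break  # no meaningful improvement, stop iterating
        criterion = new_criterion
    return loadings @ rotation, rotation

# Hypothetical unrotated loadings: 6 items on 2 factors.
unrotated = np.array([
    [0.7, 0.3],
    [0.6, 0.4],
    [0.7, 0.2],
    [0.5, -0.5],
    [0.6, -0.6],
    [0.4, -0.5],
])

rotated, rotation = varimax(unrotated)
print(np.round(rotated, 2))
```

Because the rotation matrix is orthogonal, each item's communality is the same before and after rotation; only the way that variance is divided between the factors changes.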
Considerations. Sample size: the number of people needed in the sample is debated; 100 for stable factors (Kline, 1999).
Using factor analysis in questionnaire construction. Give participants the questionnaire. Conduct a factor analysis. For any items that load highly on more than one factor, check for concept clarity. Check that the items with loadings > 0.3 cover most of what we need in the scale; if not, write more items. Replicate this on each new sample. Validate the scale factors and calculate their reliability.
Making a psychometric takes a lot of time: to develop the items, to test on a wide range of samples, and to test a large bank of hypotheses on relationships to ensure its validity. Sometimes it cannot be avoided.
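The item checks in this procedure can be sketched as a simple screen over a loading matrix. The loadings are hypothetical, the 0.3 cut-off is the one quoted on this slide, and treating "loads highly" as "absolute loading above the cut-off" is an assumption of the sketch.

```python
import numpy as np

THRESHOLD = 0.3  # cut-off quoted on the slide

# Hypothetical rotated loadings: rows = items, columns = factors.
loadings = np.array([
    [0.75, 0.10, 0.05],
    [0.45, 0.50, 0.02],
    [0.05, 0.68, 0.12],
    [0.10, 0.15, 0.20],
    [0.08, 0.05, 0.71],
])

salient = np.abs(loadings) > THRESHOLD
n_salient = salient.sum(axis=1)

cross_loading = np.flatnonzero(n_salient > 1)   # check these for concept clarity
no_loading = np.flatnonzero(n_salient == 0)     # these define no factor
covered = np.flatnonzero(salient.any(axis=0))   # factors with at least one salient item

print("items loading on more than one factor:", cross_loading.tolist())
print("items loading on no factor:", no_loading.tolist())
print("factors covered by at least one item:", covered.tolist())
```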
Readings. Kline, P. (2000). A Psychometrics Primer, Chapter 3. Free Association Books (14.95 from Amazon). Kline, P. (1994). An Easy Guide to Factor Analysis (available in library). Field, A. (2007). Chapter 15: Exploratory Factor Analysis.