Making a psychometric
Dr Benjamin Cowan - Lecture 9

What this lecture will cover
- What is a questionnaire?
- Development of questionnaires
- Item development
- Scale options
- Scale reliability & validity
- Factor analysis

What is a questionnaire?
- Some concepts are difficult to measure directly using measurements like time, accuracy, etc.
  - Attitudes, emotions, opinions
- We need to design psychometrics for these if we are to research them

Why would we want to make a psychometric?
- If we are looking at a new concept that hasn't been measured before
  - Happens a lot in HCI with the development of new technologies
- Because a metric needs to measure something specific to have value, we need to design new measures or tweak existing ones for new technologies
  - Need to add items and re-test

Example - Anxiety towards Facebook posting
- Let's say we wanted to make a measure of how anxious people are about posting to Facebook
- This measure (our questionnaire) is made up of attitude phrases (or items)

Stages of item development
- Literature review
  - What are the key concepts in studying anxiety?
- Measure review
  - What is available? How is anxiety currently measured?
- Focus groups/interviews
  - What is important in Facebook anxiety?
  - Questions about Facebook and negative emotions
  - Gives an indication of how people describe the concepts, thus improving item wording

Generating items - Interviews
- Conversation with a purpose
- 4 main types
  - Unstructured
  - Semi-structured
  - Structured
  - Group

Unstructured interviews
- Exploratory
- Talk around an area
- Plan the areas for discussion rather than specific questions
- Can explore topics as they come up

Structured interviews
- Predetermined questions
- Standardised for all interviewees

Semi-structured interviews
- Basic script used with all participants
- Mix of structured and unstructured interview
- Some questions are covered with everyone; the rest is a free-flowing conversation

What interview type to use?
- Depends on:
  - How specific you need to get
  - The purpose of the interview

Stages of item development
- The interviews will allow you to get an idea of:
  - Potential items
  - Potential categories that need to be covered (factors)
- Pilot study
  - Large number of items
  - Participants rate:
    - Clarity of wording
    - Clarity of concept in the item
- Experts in the area review the items

The good, the bad, the ugly
- Good item
  - Clear, well worded, one concept, to the point
  - "I feel stressed when using Facebook"
- Bad item
  - Can be clearly worded but does not stick to one concept
  - "I feel stressed because of so many people on Facebook and it is hard to use"
- Ugly item
  - Poorly worded and doesn't stick to one concept
  - "Stress is something I feel all of the time when using Facebook because people on it are plentiful and it's difficult"
  - This can happen when questionnaires are mistranslated

Common scales used
- Likert scales (Likert, 1932)
  - 3 point, 5 point, 7 point, 9 point
  - The more points, the larger the variance of responses on an item
  - Arguments over which is best, but 5 point is most common
  - The use of a neutral point is also debated
- Semantic differential
  - Uses two polar opposite adjectives at the ends of a scale
  - Which to use?
    - Strong - Not strong (bad)
    - Strong - Weak (good)

Important concepts in item response
- Response acquiescence set
  - A propensity for participants to answer positively to items
  - Counter it by balancing the psychometric as much as possible (positively and negatively worded items)
  - Item randomisation
- Social desirability
  - Responding with what you feel is socially appropriate
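
A balanced scale only works if negatively worded items are reverse-scored before totalling, and randomisation means each participant sees the items in a different order. The sketch below illustrates both steps; the item wordings, the 5-point scale and the function names are all hypothetical, not part of the lecture's questionnaire.

```python
import random

SCALE_MIN, SCALE_MAX = 1, 5  # assumption: a 5-point Likert scale

# Hypothetical items; True marks a negatively worded (reverse-scored) item.
items = [
    ("I feel stressed when posting to Facebook", False),
    ("I feel relaxed when posting to Facebook", True),
    ("I worry about how my posts will be judged", False),
    ("I post to Facebook without a second thought", True),
]

def reverse_score(score):
    """Flip a response so that a high score always means high anxiety."""
    return SCALE_MAX + SCALE_MIN - score

def scale_total(responses):
    """Sum responses (given in the order of `items`), reversing negative items."""
    return sum(
        reverse_score(r) if negatively_worded else r
        for r, (_, negatively_worded) in zip(responses, items)
    )

def presentation_order(seed):
    """Return a shuffled list of item indices for one participant."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    return order

yea_sayer = [5, 5, 5, 5]  # a participant who agrees with everything
print(scale_total(yea_sayer))      # agreement and reversal cancel out
print(presentation_order(seed=1))  # item order for one participant
```

With half the items reversed, a participant who simply agrees with everything ends up at the middle of the possible range rather than at the top, which is the point of balancing.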

So...
- We have our items
- We have piloted them with participants
- We now need to assess how good our questionnaire (or psychometric) is
- Good psychometrics have:
  - High reliability
  - High validity
  - A set of norms (baselines/guides)

Reliability
- Stability of the test score over time
  - Test-retest reliability
- Internal consistency of the test
  - Internal consistency reliability
  - The extent to which the items are measuring the same underlying concept

Test-retest reliability
- Test at Time 1 -> 6 month gap -> Test at Time 2
- Testing the same participants on the measure on two occasions
- Scores are then correlated to see the strength of the relationship
- A correlation over 0.7 indicates good test-retest reliability
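
As a minimal sketch of the calculation, the two sets of scores can be correlated with Pearson's r and compared against the 0.7 guideline. The scores below are made up for illustration, and `pearson_r` is a helper written for this example rather than a function from any statistics package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical total scores for the same six participants, six months apart.
time_1 = [12, 18, 25, 30, 22, 15]
time_2 = [14, 17, 27, 28, 20, 18]

r = pearson_r(time_1, time_2)
print(f"test-retest r = {r:.2f}")
if r > 0.7:
    print("good test-retest reliability")
else:
    print("below the 0.7 guideline")
```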

Why would the correlation not be perfect?
- Between times there may be changes on the variables
  - Some people may have become less anxious over time
- Test error
  - Feeling ill, bored, tired

Internal consistency reliability
- The extent to which each item measures the same underlying concept
- In our Facebook posting anxiety scale we would expect all the items to be measuring elements of anxiety, not the usability of Facebook

Internal consistency measures
- Split-half method
  - Divide the measure in two at random and correlate the scores on the two halves
- Cronbach's alpha (most commonly used)
  - The average of all possible split-half correlations
  - 0.7 is seen as a good alpha
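
In practice alpha is usually computed from variances rather than by averaging every split: alpha = k/(k-1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The sketch below uses that standard formula on made-up 5-point responses; the data and the function name are illustrative assumptions.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for rows of respondents and columns of items."""
    k = len(responses[0])                  # number of items
    item_columns = list(zip(*responses))   # one tuple of scores per item
    sum_item_variances = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical responses: 6 respondents (rows) x 4 items (columns).
responses = [
    [1, 2, 1, 3],
    [2, 2, 3, 5],
    [3, 3, 3, 1],
    [4, 4, 5, 4],
    [5, 4, 4, 2],
    [5, 5, 5, 3],
]

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
print("acceptable" if alpha >= 0.7 else "below the 0.7 guideline")
```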

What can impact on this reliability?
- The number of items
  - More items mean more of the concept can be covered
  - Weigh up the number of items against boredom
  - 10 items considered the minimum for a reliable test
- Can a measure be too internally consistent? (Cattell, 1957)
  - Using items which effectively measure the same thing
  - E.g. "I like Facebook" and "Facebook is something I like"
  - They are the same item, just with different wording
  - Leads to a "bloated specific"

Cronbach alpha analysis
- The analysis looks at the correlation of each item's score with the total questionnaire score (item-total correlations)
- Items with item-total correlations lower than 0.3 should be removed, as they do not correlate well with the rest of the scale
- The test output also gives us an idea of what alpha would be without each item - great for item removal
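
A rough sketch of this item analysis follows. It uses the corrected item-total correlation (each item against the total of the other items, so the item is not correlated with itself), which is an assumption about the variant intended here. The responses and all function names are hypothetical.

```python
from math import sqrt
from statistics import variance

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

def cronbach_alpha(responses):
    """Cronbach's alpha for rows of respondents and columns of items."""
    k = len(responses[0])
    sum_item_variances = sum(variance(col) for col in zip(*responses))
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

def item_analysis(responses):
    """Per item: (corrected item-total correlation, alpha if item deleted)."""
    results = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        rest = [[v for j, v in enumerate(row) if j != i] for row in responses]
        rest_totals = [sum(row) for row in rest]
        results.append((pearson_r(item, rest_totals), cronbach_alpha(rest)))
    return results

# Hypothetical responses: 6 respondents (rows) x 4 items (columns).
responses = [
    [1, 2, 1, 3],
    [2, 2, 3, 5],
    [3, 3, 3, 1],
    [4, 4, 5, 4],
    [5, 4, 4, 2],
    [5, 5, 5, 3],
]

overall_alpha = cronbach_alpha(responses)
results = item_analysis(responses)
flagged = [i for i, (r, _) in enumerate(results) if r < 0.3]

print(f"overall alpha = {overall_alpha:.2f}")
for i, (r, alpha_without) in enumerate(results):
    print(f"item {i + 1}: item-total r = {r:.2f}, alpha if deleted = {alpha_without:.2f}")
print("candidates for removal:", [i + 1 for i in flagged])
```

In this made-up data the last item barely relates to the others, so it falls below 0.3 and alpha rises once it is dropped.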

Validity of a test
- A test can be reliable but not valid
  - It could be high in reliability but not measuring what it claims to measure
- It is not as simple as looking at the item wordings to deduce this
- We need to identify whether our measure behaves as predicted

Validity Assessment
- Face validity
  - The items seem to be worded right for the concept being measured
  - This is a poor test of validity
  - E.g. "I am quite easily distracted" looks fine but can be interpreted differently by participants
- Concurrent validity
  - Correlation of the test with another benchmark test that was given at the same time
  - Dubious when there is no clear benchmark

Validity Assessment
- Predictive validity
  - The measure is able to predict some criterion
  - E.g. Facebook anxiety relates to posting behaviour
- Need to be aware that modest relationships are likely
  - Many other factors are important to posting behaviour: closeness of Facebook friends, drunken messaging?
- Sometimes clear criteria are not available
- Beware of the difference between statistical significance and psychological significance

Construct Validity (Cronbach & Meehl, 1955)
- Allows a collection of results, rather than just one, to lead us to validity conclusions
- It is usually the case that not all hypotheses are confirmed
- Validity is therefore not as unequivocal as reliability
  - Interpretive and subjective

Construct Validity (Cronbach & Meehl, 1955)
- Construct validity
  - A bank of hypotheses based on the knowledge of our concept
- Our hypotheses for Facebook anxiety
  - Should correlate positively and highly with other measures of anxiety (concurrent validity)
  - Should correlate positively with someone's fear of negative evaluation (concurrent validity)
  - Should not correlate with personality tests that don't measure anxiety
  - High scorers, compared with low scorers, should show less activity on Facebook and be more likely to leave Facebook (predictive validity)

Norms
- We need to test our measures on
  - A significant, representative proportion of the population (1000s of respondents)
  - A sample of people we'd expect to be high or low on the measure (for discriminatory markers)
- This is built up over years of use

Now we have
- Gathered our items
- Assessed their reliability
- Assessed their validity
- We are assuming at present that Facebook anxiety is uni-dimensional
  - This might not be true; there may be many factors to it, which we have picked up in our measure

What are factors?
- Each questionnaire item gives a score
- There will be items that correlate heavily together
- Factor analysis is fundamentally used to reduce the data into the smallest number of explanatory concepts
- A factor is a combination of variables, the grouping of which indicates a relationship

What are factors?
- Each item has a factor loading: the correlation of that item with the factor
- Some items will have high loadings, some low or no loading at all on a specific factor
- Loadings of 0.4 or above are seen as helpful in defining a factor
- Items should only load heavily on one factor
  - If they don't, they are candidates for rewording
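
The loading rules above can be applied mechanically once a table of loadings is available. The sketch below assumes such a table has already been produced by a factor analysis; the loadings, item names and function name are invented for illustration.

```python
THRESHOLD = 0.4  # loadings at or above this help define a factor

# Hypothetical loadings of five items on three factors.
loadings = {
    "item_1": [0.72, 0.10, 0.05],
    "item_2": [0.65, 0.48, 0.02],  # loads heavily on two factors
    "item_3": [0.08, 0.81, 0.12],
    "item_4": [0.15, 0.22, 0.18],  # does not load on any factor
    "item_5": [0.03, 0.09, 0.66],
}

def salient_factors(item_loadings, threshold=THRESHOLD):
    """Indices of the factors on which an item loads at or above the threshold."""
    return [f for f, value in enumerate(item_loadings) if abs(value) >= threshold]

# Items loading heavily on more than one factor: candidates for rewording.
cross_loading = [name for name, row in loadings.items() if len(salient_factors(row)) > 1]
# Items that do not help define any factor.
no_loading = [name for name, row in loadings.items() if not salient_factors(row)]

print("reword:", cross_loading)
print("no salient loading:", no_loading)
```

The absolute value is used because a strong negative loading defines a factor just as much as a strong positive one.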

Shared Variance
- The correlation coefficient represents
  - The amount of agreement (or shared variance) between two sets of scores
- Square the correlation coefficient to get the proportion of agreement (multiply by 100 for %)
- [Figure: variable x variance, shared (common) variance, variable y variance]

Shared Variance & Communality
- By squaring the factor loading we can:
  - Identify how much shared variance there is between the item and the factor
  - Think of this as the contribution that the item makes to the factor
- If we do this for each factor loading an item has and add the results, we get the item's communality
  - The amount of variance shared between the item and all the factors
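
A small worked example of both calculations, assuming the factors are uncorrelated (with correlated factors the squared loadings no longer simply add up). The correlation, the loadings and the function names are made up for illustration.

```python
def shared_variance(r):
    """Proportion of variance two sets of scores share, given their correlation."""
    return r ** 2

def communality(item_loadings):
    """Sum of an item's squared loadings across all (uncorrelated) factors."""
    return sum(value ** 2 for value in item_loadings)

# A correlation of 0.5 means the two sets of scores share 25% of their variance.
print(f"{shared_variance(0.5):.0%}")

# Hypothetical loadings of one item on three factors.
item_loadings = [0.7, 0.2, 0.1]
for factor, value in enumerate(item_loadings, start=1):
    print(f"factor {factor}: loading {value}, shared variance {shared_variance(value):.2f}")
print(f"communality = {communality(item_loadings):.2f}")
```

Here the first factor accounts for most of the item's communality, which is what we want from an item that loads heavily on one factor only.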

Factor Extraction
- Eigenvalues
  - Indicate the importance of the extracted factor in explaining the variance in the data
  - There will be a few factors with high eigenvalues and lots with low ones
  - Makes sense to keep the most important factors
- Rule of thumb is to keep factors with eigenvalues > 1 (as an eigenvalue of 1 represents a significant amount of variation)
- The number to extract is identified using a scree plot (Cattell, 1966)
  - Y axis is eigenvalues
  - X axis is the number of factors
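
One way to read the eigenvalue > 1 rule: when the analysis is run on the correlation matrix of standardised items (as in principal components analysis), the eigenvalues sum to the number of items, so an eigenvalue of 1 is one item's worth of variance. The sketch below applies the rule to a hypothetical set of eigenvalues for a 6-item scale and prints a rough text scree plot; the numbers are invented, not output from a real analysis.

```python
# Hypothetical eigenvalues for a 6-item questionnaire, largest first.
eigenvalues = [2.8, 1.4, 0.7, 0.5, 0.4, 0.2]
n_items = len(eigenvalues)

# Rule of thumb: keep factors with eigenvalues greater than 1.
retained = [ev for ev in eigenvalues if ev > 1]
proportion_explained = sum(retained) / n_items

print(f"factors retained: {len(retained)}")
print(f"variance explained by retained factors: {proportion_explained:.0%}")

# Text scree plot: factor number on the left, bar length shows the eigenvalue.
for number, ev in enumerate(eigenvalues, start=1):
    print(f"{number} | {'#' * round(ev * 10)} {ev}")
```

In the printed plot the bars drop sharply over the first factors and then level off; the point where the curve flattens is the point of inflexion used to decide how many factors to extract.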

Scree Plot
[Figure: scree plot of eigenvalues (y axis) against number of factors (x axis), with the point of inflexion marked]

Factor Rotation
- Looking for best fit: the factor structure with the clearest interpretation
- Sometimes this involves rotation to get the clearest, simplest factor structure
- A simple factor structure is one that has a few high-loading items, with the rest near 0 (Cattell, 1978)

Methods of Rotation
- The method you choose depends on how correlated you feel the factor scores should be
  - Based on theoretical reasoning
- We would expect our questionnaire
  - To have three factors: 1) anxiety about social posting, 2) anxiety about interface interaction, 3) social confidence
  - For the scores on these factors to be correlated

Methods of Rotation
- We would therefore use a method that takes this correlation into consideration: Direct Oblimin
- This is an oblique method of rotation (allows the factors to correlate)

Methods of Rotation
- If we felt the factors should not be correlated, we could have used the Varimax method
- This is an example of orthogonal rotation: it ensures the extracted factors are not correlated

Considerations
- Sample size
  - The number of people needed in the sample is debated
  - 100 for stable factors (Kline, 1999)

Using factor analysis in questionnaire construction
- Give participants the questionnaire
- Conduct factor analysis
- For any items that load highly on more than one factor, check for concept clarity
- Check that the items with loadings > 0.3 cover most of what we need in the scale; if not, write more items
- Replicate this on each new sample
- Validate the scale factors and calculate their reliability

Making a psychometric
- Takes a lot of time
  - To develop the items
  - To test on a wide range of samples
  - To test a large bank of hypotheses on relationships to ensure its validity
- Sometimes it cannot be avoided

Readings
- Kline, P. (2000). A Psychometrics Primer, Chapter 3. Free Association Books (14.95 from Amazon)
- Kline, P. (1994). An Easy Guide to Factor Analysis (available in library)
- Field, A. (2007). Chapter 15 - Exploratory Factor Analysis