Developing and Testing Survey Items
William Riley, Ph.D.
Chief, Science of Research and Technology Branch
National Cancer Institute
With thanks to Gordon Willis
Contributions to Self-Report Errors
- Self-report is an estimate of the true score (latent trait)
  - Observed score = True score + Measurement error
- Sources of self-report error
  - Transient states (e.g., mood, fatigue)
  - Forgetting and cognitive heuristics
    - Self-reports of cigarettes per day (CPD) cluster around multiples of 10
  - Context of reporting
    - Pain ratings differ between an anonymous survey, a clinical context, and an application for disability
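The clustering of CPD reports at round numbers can be checked directly in data. The following is a minimal sketch, assuming a hypothetical list of self-reported values; the numbers are invented for illustration and `heaping_share` is a name made up here, not a library function.

```python
def heaping_share(reports, base=10):
    """Return the fraction of nonzero reports that are exact multiples of `base`."""
    nonzero = [r for r in reports if r > 0]
    if not nonzero:
        return 0.0
    heaped = sum(1 for r in nonzero if r % base == 0)
    return heaped / len(nonzero)


# Hypothetical self-reported cigarettes per day (illustrative only).
cpd_reports = [10, 20, 20, 15, 10, 30, 7, 20, 12, 40]
share = heaping_share(cpd_reports)
print(f"{share:.0%} of reports fall on a multiple of 10")
```

If true consumption were spread evenly across values, only about one report in ten would land on a multiple of 10, so a much larger share suggests respondents are rounding rather than counting.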
Why Ask at All?
- Some phenomena of scientific interest are known only to the individual
  - Current affective state
  - Attitudes and beliefs
- Others are observable, but the effort to observe them (especially in a large sample) is not justified by the precision the study question requires
  - Physical functioning
  - Sexual activity
Questionnaire Development Approach
(See Aday, L., & Cornelius, L. (2006). Designing and Conducting Health Surveys. Wiley.)
I. Determine analytic objectives
   - What types of data will answer the research question?
II. Develop general concepts to be covered
   - List areas to be covered by questions
III. Translate concepts into questions
IV. Appraise questions for common pitfalls
V. Evaluate questions empirically
Item Development Process Is Iterative Between Researcher and Respondent
[Figure: cycle between Researcher and Respondent]
... But Don't Recreate the Wheel
Draw from other questionnaire sources
- National field surveys (e.g., NHIS)
- PhenX (https://www.phenxtoolkit.org/)
- GEM, Grid-Enabled Measures (http://cancercontrol.cancer.gov/brp/gem.html)
- PROMIS (www.nihpromis.org)
Benefits of using existing items
- Items already rigorously evaluated (hopefully)
- Comparison of results across studies
- Ability to do integrative data analyses
Questionnaire Development Approach
I. Determine analytic objectives
   - What types of data will answer the research question?

% of respondents, in the past 12 months:    Dental office    Physician office
  With a preventive care visit                   X%                X%
  Asked about smoking status                     X%                X%
  Checked for oral cancer                        X%                X%
Questionnaire Development Approach
I. Determine analytic objectives
   - What types of data will answer the research question?
II. Develop general concepts to be covered
   - List areas to be covered by questions
     - Whether visit in past 12 months to dentist, doctor
     - Whether smoking status was asked at any visit
     - Whether oral cancer check was done at any visit
     - (Smokers) Whether advice to stop smoking was given at any visit
Questionnaire Development Approach
I. Determine analytic objectives
   - What types of data will answer the research question?
II. Develop general concepts to be covered
   - List areas to be covered by questions
III. Translate concepts into questions
Importance of Respondent Input
- Do my items cover what's important to patients with x?
- Patient input is an initial requisite for content validity
  - FDA requires it in its guidance on patient-reported outcomes (PROs)
  - Typically established by multiple focus groups of patients who describe their experiences with the phenomenon, volunteer items, and respond to your items
Questionnaire Development Approach
I. Determine analytic objectives
   - What types of data will answer the research question?
II. Develop general concepts to be covered
   - List areas to be covered by questions
III. Translate concepts into questions
IV. Appraise questions for common pitfalls
Appraise Questions for Common Pitfalls: Administration Mode
- Interviewer administration
  - Telephone
  - In-person
- Self-administration
  - Mailed paper
  - Internet
  - Smartphone
- Items developed for in-person administration may be problematic when self-administered, and vice versa
Sources of Response Error: Tourangeau (1984) Cognitive Model
- Encoding of question (Do they understand it?)
  - Have you had a Fecal Occult Blood Test (FOBT) screening in the past 12 months?
- Retrieval of information (Can they remember it?)
  - How many times in the past month have you texted while driving?
- Judgment processes (Are they willing to tell the truth?)
  - How many sex partners have you had in the past 12 months?
- Response options (Can they answer accurately based on the options provided?)
  - Would you say your health is excellent, very good, good, fair, or poor?
Question Appraisal System (Willis & Lessler, 1999)
http://appliedresearch.cancer.gov/areas/cognitive/qas99.pdf
Lack of Clarity: Difficult-to-Understand Questions
- Long/convoluted phrasing:
  - "The last time that you were seen by a doctor, nurse, or other health professional, as part of a regular medical check-up, did you receive any tests specifically designed to diagnose the presence of certain types of cancer?"
- Typical response: "What?"
- DECOMPOSE the question into concepts: ask more, but simpler, questions, with use of skips
Lack of Clarity: Difficult-to-Understand Questions
- Decomposition into simpler phrasing:
  - When did you last see a doctor, nurse, or other health professional, to get a regular medical check-up?
  - During that visit, did you receive any tests that check for cancer?
  - What types of cancer were you checked for?
- Doesn't solve the problem of respondents not knowing the answer, but makes the question more understandable.
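The decomposed questions rely on skips, so that follow-ups are asked only when they apply. A minimal sketch of that routing follows; the question keys, the answer codes ("never", "yes"), and the specific skip rules are assumptions made for illustration, not part of the original instrument.

```python
def administer(answers):
    """Return the question keys asked, skipping follow-ups that do not apply.

    `answers` maps a hypothetical question key to the respondent's reply.
    """
    asked = ["last_checkup"]
    # Assumed rule: skip the cancer questions if there was never a check-up.
    if answers.get("last_checkup") == "never":
        return asked
    asked.append("any_cancer_test")
    # Assumed rule: ask which cancers were checked only if a test was reported.
    if answers.get("any_cancer_test") == "yes":
        asked.append("cancer_types")
    return asked


print(administer({"last_checkup": "2 years ago", "any_cancer_test": "yes"}))
print(administer({"last_checkup": "never"}))
```

Each respondent answers only short questions that apply to them, which is the trade the decomposition makes: more items in the instrument, but less to parse per item.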
Lack of Clarity: Terms/Phrases Are Difficult to Understand
- Complex/unfamiliar terminology:
  - Were you seen on an inpatient or outpatient basis?
  - Have you ever had a colonoscopy or sigmoidoscopy?
- Better to use simple language:
  - Did you stay overnight at the hospital?
  - (Use an explanation of what the medical test entails)
Question Clarity/Vagueness
- Many questions that use simple language are variably interpreted:
  - Have you ever been a regular smoker?
  - Does anyone in your family have a mental problem?
Retrieval Problem: Respondent Doesn't Know the Answer
"Estimate the number of your women patients with whom you discussed enrollment in a cancer TREATMENT trial in the LAST 12 MONTHS:"

                                    ALL WOMEN    ASIAN AMERICAN WOMEN
  All cancer treatment trials         ____              ____
  Breast cancer treatment trials      ____              ____
Cultural Problems
- Questions that simply don't make sense for some ethnic or cultural subgroups:
  - Have you ever switched from a stronger to a lighter cigarette?
  - How often do you use sunscreen?
Excessive Questionnaire Length
- Respondent burden is a serious problem
  - Increases costs
  - Reduces response rate
  - Results in less careful responding
  - Leads to more missing data
- General rule of thumb
  - < 30 minutes for an in-person interview
  - < 15 minutes for a self-administered questionnaire
- If more time than this is needed, break the administration into segments.
How Do We Find Questionnaire Problems?
- Cognitive interviewing
  - Manual available at: http://appliedresearch.cancer.gov/areas/cognitive/interview.pdf
  - Book: Willis, G. (2005). Cognitive Interviewing: A Tool for Improving Questionnaire Design. Thousand Oaks: Sage.
The Cognitive Testing Process
- Develop initial set of items to be evaluated
- Recruit participants from the targeted population
  - How many? Enough for saturation. Usually 5-7 per item, but plan for multiple iterations
  - Breadth of population
    - Sociodemographics (gender, race, ethnicity)
    - Education (be sure to include < high school education)
- Conduct one-on-one interviews
  - Use both think-aloud and verbal probing techniques
Classic Verbal Probes
- Comprehension probe: What does the term "dental sealant" mean to you?
- Paraphrase: Can you repeat the question in your own words?
- Confidence judgment: How sure are you that your health insurance covers ...
- Recall probe: How do you know that you went to the dentist 3 times?
- General probe: How did you arrive at that answer?
Components of Items to Test
- Item stem: Do they understand the item to mean what you want it to mean?
- Reporting or recall period
  - Is the period of sufficient length for the phenomenon to occur, but not so long that they can't remember?
  - Is the time frame clear (e.g., "last month" vs. "past 30 days")?
- Response options
  - Clear and match the stem (e.g., a "how often" stem paired with a "very much" option is a mismatch)
  - Cover what they would have said without options
  - Early test of monotonicity (is "OK" better or worse than "fair"?)
Pain in the Abdomen Example
- "In the last year have you been bothered by pain in the abdomen?"
- What probes make sense here?
  - What time period are you thinking about?
  - What does "bothered by pain" mean to you?
  - Where is your abdomen?
Using Cognitive Interviews to Detect Question Wording Problems
VERSION 1 (No filter)
  On a typical day, how much time do you spend doing strenuous physical activities such as lifting, pushing, or pulling?
  - None
  - Less than 1 hour
  - 1-4 hours
  - 5+ hours
VERSION 2 (Filtered)
  On a typical day, do you spend any time doing strenuous physical activities such as lifting, pushing, or pulling?
  IF YES: Administer Version 1
Willis, G.B. and S. Schechter (1997). Evaluation of Cognitive Interviewing Techniques: Do the Results Generalize to the Field? Bulletin de Methodologie Sociologique, Vol. 55, pp. 40-66.
Survey Experiment Results: Reporting of Strenuous Physical Activity
"On a typical day, how much time do you spend doing strenuous physical activities such as lifting, pushing, or pulling?"

Hours per day:                 0      <1     1-4     5+
FIELD PRETEST (n=78)
  No-filter version           32%    32%    35%     0%
  Filtered version            72%    18%    10%     0%
WOMEN'S HEALTH (n=191)
  No-filter version            4%    42%    50%     4%
  Filtered version            49%    16%    27%     8%
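One way to read these results is to compare the share reporting any strenuous activity under each version. The sketch below does that arithmetic using only the "0 hours" percentages reported above; the function names and the choice of summary are illustrative assumptions, not part of the original analysis.

```python
# Percent answering "0" (no strenuous activity), copied from the results above.
pct_none = {
    ("Field pretest", "No-filter"): 32,
    ("Field pretest", "Filtered"): 72,
    ("Women's health", "No-filter"): 4,
    ("Women's health", "Filtered"): 49,
}


def pct_any_activity(pct_reporting_none):
    """Percent reporting any strenuous activity, as the complement of "0"."""
    return 100 - pct_reporting_none


def filter_effect(study):
    """Percentage-point drop in 'any activity' when the filter question is used."""
    no_filter = pct_any_activity(pct_none[(study, "No-filter")])
    filtered = pct_any_activity(pct_none[(study, "Filtered")])
    return no_filter - filtered


for study in ("Field pretest", "Women's health"):
    print(study, filter_effect(study), "percentage points")
```

In both studies the filtered version lowers the share reporting any strenuous activity by 40 percentage points or more, which is the kind of wording effect the experiment was designed to surface.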
Psychometric Approaches
Can I summarize items in a scale?
- Classical psychometrics
  - Factor analysis (unidimensional or multidimensional)
  - Internal consistency reliability (Cronbach's alpha)
  - Test-retest reliability
- Modern psychometric approaches
  - Item Response Theory (IRT): To what extent does each item measure the level of the underlying construct (concept)?
  - Differential Item Functioning (DIF): Does an item reflect the level of the construct differently for different subgroups (e.g., gender, race)?
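Of the classical statistics listed, Cronbach's alpha is simple enough to compute by hand. The following is a minimal sketch using the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores); the response data are hypothetical and the function name is made up for this example.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's summed scale score.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 5-point responses: 4 respondents x 3 items (illustrative only).
scores = [
    [1, 2, 1],
    [2, 2, 3],
    [4, 3, 4],
    [5, 5, 4],
]
alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.2f}")
```

The sketch assumes complete data and total scores that vary across respondents; real analyses would also need a rule for missing responses.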
Item Response Theory (IRT): Category Response Curves
[Figure: two panels of category response curves, one per item. Each panel plots the probability (P, 0.0 to 1.0) of choosing Never, Rarely, Sometimes, Often, or Always against Depressive Symptoms (-3.00 to 3.00, from very mild to severe).]
- Panel 1: "In the past 7 days, I felt unhappy."
- Panel 2: "In the past 7 days, I felt I had no reason for living."
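Category response curves of this kind can be generated from an IRT model for ordered responses. The slide does not say which model was used; the sketch below assumes Samejima's graded response model, and the discrimination and threshold values are hypothetical, not estimates for either item shown.

```python
import math

CATEGORIES = ["Never", "Rarely", "Sometimes", "Often", "Always"]


def category_probabilities(theta, discrimination, thresholds):
    """Graded response model: probability of each ordered category at trait level theta.

    `thresholds` must be sorted ascending, one fewer than the number of categories.
    """
    # P(response at or above category k) for each threshold, via a logistic curve.
    at_or_above = [
        1.0 / (1.0 + math.exp(-discrimination * (theta - b))) for b in thresholds
    ]
    # Pad with 1 (at or above the lowest category) and 0 (above the highest).
    cumulative = [1.0] + at_or_above + [0.0]
    # Each category's probability is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Hypothetical item parameters (illustrative only).
probs = category_probabilities(theta=-2.0, discrimination=2.0,
                               thresholds=[-1.0, 0.0, 1.0, 2.0])
for label, p in zip(CATEGORIES, probs):
    print(f"{label}: {p:.2f}")
```

Evaluating the function over a grid of theta values from -3 to 3 traces one curve per category. Shifting all thresholds to the right gives an item that is endorsed only at higher symptom levels.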
QUESTIONS
wiriley@mail.nih.gov