PAPER. Behavior and Analysis of 36-Item Short-Form Health Survey Data for Surgical Quality-of-Life Research

Similar documents
Examining differences between two sets of scores

Supplementary Appendix

PEER REVIEW HISTORY ARTICLE DETAILS VERSION 1 - REVIEW. Ball State University

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

BEST PRACTICES FOR IMPLEMENTATION AND ANALYSIS OF PAIN SCALE PATIENT REPORTED OUTCOMES IN CLINICAL TRIALS

Magellan Health Services: Using the SF-BH assessment to measure success and prove value

Color naming and color matching: A reply to Kuehni and Hardin

Psychology Research Process

Chapter 7: Descriptive Statistics

Things you need to know about the Normal Distribution. How to use your statistical calculator to calculate The mean The SD of a set of data points.

Undertaking statistical analysis of

Measurement and Descriptive Statistics. Katie Rommel-Esham Education 604

Standard Scores. Richard S. Balkin, Ph.D., LPC-S, NCC

Alexandra Savova, Guenka Petrova. Medical University Sofia Faculty of Pharmacy

Applied Statistical Analysis EDUC 6050 Week 4

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

Business Statistics Probability

STATISTICS AND RESEARCH DESIGN

Analysis and Interpretation of Data Part 1

Reliability and validity of the International Spinal Cord Injury Basic Pain Data Set items as self-report measures

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

Child Outcomes Research Consortium. Recommendations for using outcome measures

7 Statistical Issues that Researchers Shouldn t Worry (So Much) About

Introduction to statistics Dr Alvin Vista, ACER Bangkok, 14-18, Sept. 2015

Still important ideas

Chapter 20: Test Administration and Interpretation

Quantitative Methods in Computing Education Research (A brief overview tips and techniques)

Module 28 - Estimating a Population Mean (1 of 3)

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models

Designing Psychology Experiments: Data Analysis and Presentation

The EuroQol and Medical Outcome Survey 36-item shortform

Sheila Barron Statistics Outreach Center 2/8/2011

Outcome Measure Considerations for Clinical Trials Reporting on ClinicalTrials.gov

Final Report. HOS/VA Comparison Project

The Cochrane Collaboration

CHAPTER III METHODOLOGY

The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016

Here are the various choices. All of them are found in the Analyze menu in SPSS, under the sub-menu for Descriptive Statistics :

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

VIEW: An Assessment of Problem Solving Style

Unit 7 Comparisons and Relationships

Editorial. From the Editors: Perspectives on Turnaround Time

Lecturer: Rob van der Willigen 11/9/08

CURVE is the Institutional Repository for Coventry University

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )

Psychology Research Process

Lecturer: Rob van der Willigen 11/9/08

Answers to Common Physician Questions and Objections

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN

Distributions and Samples. Clicker Question. Review

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

Still important ideas

AP Psych - Stat 1 Name Period Date. MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Finalised Patient Reported Outcome Measures (PROMs) in England

Chapter 23. Inference About Means. Copyright 2010 Pearson Education, Inc.

AP Psych - Stat 2 Name Period Date. MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Descriptive Statistics Lecture

This is an edited transcript of a telephone interview recorded in March 2010.


Assessment of the SF-36 version 2 in the United Kingdom

Appendix B Statistical Methods

People have used random sampling for a long time

C-1: Variables which are measured on a continuous scale are described in terms of three key characteristics central tendency, variability, and shape.

EXPERT INTERVIEW Diabetes Distress:

Statistical probability was first discussed in the

Measures. David Black, Ph.D. Pediatric and Developmental. Introduction to the Principles and Practice of Clinical Research

Can Angioplasty Improve Quality of Life for CAD Patients?

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when.

An InTROduCTIOn TO MEASuRInG THInGS LEvELS OF MEASuREMEnT

Introduction: Statistics, Data and Statistical Thinking Part II

Quality Digest Daily, March 3, 2014 Manuscript 266. Statistics and SPC. Two things sharing a common name can still be different. Donald J.

2016 Children and young people s inpatient and day case survey

PTHP 7101 Research 1 Chapter Assignments

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

ANATOMY OF A RESEARCH ARTICLE

Statistics: Interpreting Data and Making Predictions. Interpreting Data 1/50

European Association for Cardiovascular Prevention & Rehabilitation (EACPR) A Registered Branch of the ESC

Chapter 11. Experimental Design: One-Way Independent Samples Design

STATISTICS 8 CHAPTERS 1 TO 6, SAMPLE MULTIPLE CHOICE QUESTIONS

Pooling Subjective Confidence Intervals

ANXIETY A brief guide to the PROMIS Anxiety instruments:

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Practice-Based Research for the Psychotherapist: Research Instruments & Strategies Robert Elliott University of Strathclyde

Endogeneity is a fancy word for a simple problem. So fancy, in fact, that the Microsoft Word spell-checker does not recognize it.

IAPT for SMI: Findings from the evaluation of service user experiences. Julie Billsborough & Lisa Couperthwaite, Researchers at the McPin Foundation

Cómo se dice...? An analysis of patient-provider communication through bilingual providers, blue phones and live translators

PATHOLOGICAL VIEW FUNCTIONAL VIEW

Elementary Statistics:

Subliminal Programming

Biostatistics. Donna Kritz-Silverstein, Ph.D. Professor Department of Family & Preventive Medicine University of California, San Diego

Chapter 5. Doing Tools: Increasing Your Pleasant Events

Research Designs and Potential Interpretation of Data: Introduction to Statistics. Let s Take it Step by Step... Confused by Statistics?

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14

LEVEL ONE MODULE EXAM PART TWO [Reliability Coefficients CAPs & CATs Patient Reported Outcomes Assessments Disablement Model]

Methodological skills

Examining the Psychometric Properties of The McQuaig Occupational Test

On the purpose of testing:

Summarizing Data. 1.1 Report all numbers with the appropriate degree of precision. CHAPTER 1. Reporting Numbers and Descriptive Statistics

Transcription:

PAPER ehavior and Analysis of 36-Item Short-Form Health Survey Data for Surgical Quality-of-Life Research Vic Velanovich, MD Hypothesis: Data from the 36-Item Short-Form Health Survey (SF-36) do not follow a normal distribution and should not be analyzed using parametric techniques. A novel type of analysis, top-box analysis, may add to the interpretation of these data. Design: Review of SF-36 data from preoperative and postoperative patients. Setting: Tertiary care hospital and clinic. Patients: One thousand randomly selected preoperative and postoperative patients with a variety of surgical diseases completed the SF-36 (8 domains: physical functioning, role physical, role emotional, bodily pain, vitality, mental health, social functioning, and general health). The best possible score was ; the worst possible score,. One item assessed health transition. The best score was 1; the worst score, 5. The health transition item and each domain were analyzed for mean with standard deviation, median, mode skewness, kurtosis, and normality. A top-box assessment was done by determining the frequency of patients scoring in each domain or 1 in the health transition item. In addition, preoperative and postoperative scores were compared. Results: The results for all questionnaires demonstrated that none of the domains had data that followed a normal distribution. The means, medians, and modes were different. Five domains had the mode and median at the top box. Conclusions: The SF-36 data did not follow a normal distribution in any of the domains. Data were always skewed to the left, with means, medians, and modes different. These data need to be statistically analyzed using nonparametric techniques. Of the 8 domains, 5 had a significant frequency of top-box scores, which also were the domains in which the mode was at, implying that change in top-box score may be an informative method of presenting change in SF-36 data. Arch Surg. 27;142:473-478 Author Affiliation: Division of General Surgery, Henry Ford Hospital, Detroit, Mich. QUALITY-OF-LIFE (QOL) measurement has become an increasingly used end point in assessing surgical outcomes. Many surgical studies have used QoL instruments to assess changes in QoL. A MEDLINE search of 12 leading US surgical journals, using the keywords quality of life, from 1996 to October 26 yielded 466 articles. Although there are a variety of QoL instruments available, one of the most commonly used is the 36-Item Short-Form Health Survey (SF-36). Once again, a MEDLINE search of these journals using the keyword SF-36 yielded 92 articles. When looking at the keyword SF-36, more than 4 articles are found. The SF-36 is a generic QoL instrument that measures 8 domains of QoL; these 8 domains can be reduced to 2 summary scores a physical component and a mental health component. 1 ecause of its popularity, the SF-36 is considered one of the leading QoL instruments worldwide, and has been translated into dozens of languages. A review 2 of surgical QoL studies has found that there were several deficiencies in the conduct of these studies. One of the most common problems was inappropriate statistical analysis. The proper statistical analysis of data is essential in interpreting the results of any study. 3 Commonly, data from the SF-36 have been presented as means with standard deviations or standard errors of the mean. The basic assumption of these studies is that the data follow a normal (gaussian) distribution, having a bell-shaped curve. However, many of these studies did not perform the statistical tests 4 needed to determine if, indeed, the data follow the normal distribution necessary to use this type of statistical analysis. The purpose of this study was to analyze the behavior of SF-36 data in a group of preoperativeandpostoperativesurgicalpatients to determine the most appropriate statistical methods of analysis. (REPRINTED) ARCH SURG/ VOL 142, MAY 27 473 Downloaded From: on 11/7/218 27 American Medical Association. All rights reserved.

Table 1. Descriptive Statistics of Each Domain of the SF-36* Domain Mean (SD) Median Mode Skewness Kurtosis Top-ox, % Physical functioning 7.3 (28.) 8..75 2.42 19.1 Role physical 58.7 (43.8) 75..35 1.35 46.2 Role emotional 74.5 (39.1). 1.8 2.43 66.8 odily pain 62.4 (28.) 62..26 2.8 2.2 Vitality 53.6 (23.3) 55. 5.28 2.45 1.1 Mental health 73.9 (19.4) 8. 92.98 3.64 5.1 Social functioning 73.9 (28.2) 87.5.84 2.68 39.9 General health 64.3 (22.1) 67. 77.48 2.55 4. Abbreviation: SF-36, 36-Item Short-Form Health Survey. *The Wilk-Shapiro P value was.1 for all domains. METHODS THE SF-36 GENERIC QoL INSTRUMENT The SF-36 measures 8 domains of QoL: physical functioning, limitations to physical activities because of health, such as selfcare, walking, and climbing stairs; role physical, interference with work or daily activities because of physical health; role emotional (RE), limitations to work or daily activities because of emotional health; bodily pain, pain intensity and how this affects work in and out of the home; vitality, how full of energy the patient feels; mental health, overall emotional and psychological status; social functioning, how much health interferes with social interactions; and general health, overall evaluation of health. The scores are standardized so that the worst possible score is and the best possible score is. In addition, question 2 addresses health transition : Compared to one year ago, how would you rate your health in general now? Possible answers are as follows: 1, much better now than 1 year ago; 2, somewhat better now than 1 year ago; 3, about the same as 1 year ago; 4, somewhat worse than 1 year ago; and 5, much worse than 1 year ago. PATIENTS AND INSTRUMENT ADMINISTRATION It is part of my practice to administer the SF-36, and other QoL instruments, depending on disease process, to all patients seen inconsultationintheoutpatientsettingofmygeneralsurgicalpractice. The patients are given the questionnaire and are merely instructed to complete the questionnaire before the clinical encounter. No one helps the patient complete the questionnaire to avoid any physician-related bias in the answers. Patients may or may not undergo a surgical procedure. Generally, postoperative patients complete the questionnaire 6 to 12 weeks postoperatively, duringafollow-upvisit.itismypracticetoprovidelong-termfollowup of patients with cancer and some patients with nonmalignant disease who require long-term follow-up. In these patients, the questionnaire is administered yearly. I score the instrument, recording the scores of each domain, without determining the physical or mental health component summary scores. SELECTION OF QUESTIONNAIRES INCLUDED IN THE STUDY A random sample of answered questionnaires was selected. Randomization was based on selection of the last digit of the patient s medical record number from a random number table. 5 This process was repeated until questionnaires were selected. One thousand was selected as an adequate sample size to leave no doubt as to the presence of skewness and kurtosis. 4 ecause of the random nature of selection, the sample included preoperative and postoperative questionnaires, and questionnaires from the same patient at different times could have been included. ecause each questionnaire is linked to a patient, the patient s diagnosis is known and was recorded. STATISTICAL ANALYSIS All statistical analysis was done using a statistical computer program. 6 Initially, each questionnaire was reviewed for missing items (ie, questions not answered). Questionnaires with missing data were still included in the analysis, with domains that could be scored analyzed. The denominators changed according to how many domains had scores that could be analyzed. Data from each domain of the SF-36 were analyzed for the descriptive statistics of mean with standard deviation, median with complete range and interquartile range, mode, skewness, and kurtosis. The data were also analyzed to determine fit to a normal distribution using the Wilk-Shapiro test. P.5 was considered statistically significant for the data not fitting the normal distribution. Last, because the highest score achievable on the SF-36 is for each domain and 1 for the health transition item, this score was considered the top box. The frequency of this top-box score was determined for each domain. The top-box score was inspired by its use in patient satisfaction measurements from Press Ganey Associates, Inc, South end, Ind (http://www.pressganey.com). Press Ganey Associates, Inc, provides a survey to measure patient satisfaction. These surveys are sent to the patient and returned to the company. The company reports include a summary score plus the distribution of the frequency of each answer selected from poor to very good. The very good selection is the top box by this method. Press Ganey Associates, Inc, then compares the frequency of this top-box score with other institutions using the survey to determine the percentile rank of the institution or individual practitioner. It is hoped that this information will help identify areas in need of improvement. RESULTS Of the questionnaires evaluated, 541 were preoperative and 459 were postoperative. The percentage of missing items ranges from a high of 5.7% (in the social functioning and general health domains) to a low of 2.1% (in the physical functioning domain). Table 1 shows the summary statistics for each domain for the entire sample of questions. Using the Wilk-Shapiro test, none of the domains, or answer distribution for question 2, follow a normal distribution. The results of the Wilk-Shapiro test are statistically significant, proving a (REPRINTED) ARCH SURG/ VOL 142, MAY 27 474 Downloaded From: on 11/7/218 27 American Medical Association. All rights reserved.

A 25 15 2 15 5 5 1 2 3 4 5 1 2 3 4 5 Preoperative Question 2 Answer Postoperative Question 2 Answer Figure 1. Distribution of preoperative (A) and postoperative () answers for question 2, the health transition question: Compared with 1 year ago, how would you rate your health in general now? In A, the mean (SD) is 3.2 (.9), the median (range) is 3 (1-5), and the top-box frequency is 6.5%; in, the mean (SD) is 2.1 (1.), the median (range) is 2 (1-5), and the top-box frequency is 38.6%. A 8 8 6 4 6 4 2 2 2 4 6 8 Preoperative Physical Functioning Score 2 4 6 8 Postoperative Physical Functioning Score Figure 2. Distribution of preoperative (A) and postoperative () scores for the physical functioning domain. In A, the mean (SD) is 64.3 (29.4), the median (range) is 7 (-), and the top-box frequency is 16.3%; in, the mean (SD) is 77.5 (24.6), the median (range) is 85 (-), and the top-box frequency is 22.4%. nonnormal distribution. The coefficients of skewness are all negative, implying a long tail toward smaller values than the mean. The coefficients of kurtosis are less than 3 (the value for a normal distribution) in 7 of 8 domains and question 2 and greater than 3 in 1 domain. Values less than 3 imply that the shoulders of the curve are broader than what would be expected from a normal distribution, while values greater than 3 imply a tall peak of the curve with narrow shoulders. The value of 3 is expected for data following a normal distribution. Figures 1, 2, 3, 4, and 5 show illustrative comparisons of preoperative with postoperative histograms of the number of each answer of question 2 (health transition) and the scores of physical functioning, role physical, RE, and vitality domains, respectively. Visually, none of the shown or not shown histograms follow a normal distribution, although the vitality domain seems the most like a bell-shaped curve. Similar descriptive statistics were done on each of these data sets, as with the overall data sets. The results also demonstrated different means, medians, and modes, negatively skewed data, with kurtosis different than 3, and significant Wilk-Shapiro tests, proving that none of the data sets follow a normal distribution. Table 1 also presents the frequency of top-box scores. In 5 of the 8 domains, many patients scored (the top box). In only 3 domains was the frequency of a top-box score less than 1%. Also, the domains with a higher frequency of topbox scores were also the domains in which the mode was at. However, when comparing preoperative with postoperative top-box scores, the frequencies are different (Figures 1-5). Once again, these differences are more apparent in those domains in which the top box was the mode. COMMENT This study demonstrated that, in a general surgical population, the behavior of SF-36 data does not follow a normal distribution. In fact, the role physical and RE domains have more of a U shape than a bell shape (Figure 3 and Figure 4, respectively). This finding is significant for surgical QoL research. Data that do not follow a normal distribution should not be analyzed with parametric statistical techniques. 4,7,8 The presentation of nonnormal data as means with standard deviations or standard errors of the mean misrepresents the data; analysis of such data with parametric techniques, such as a t test, is inappro- (REPRINTED) ARCH SURG/ VOL 142, MAY 27 475 Downloaded From: on 11/7/218 27 American Medical Association. All rights reserved.

A 2 25 15 2 15 5 5 5 Preoperative Role Physical Score 5 Postoperative Role Physical Score Figure 3. Distribution of preoperative (A) and postoperative () scores for the role physical domain. In A, the mean (SD) is 53.4 (44.5), the median (range) is 5 (-), and the top-box frequency is 41.%; in, the mean (SD) is 64.9 (42.2), the median (range) is (-), and the top-box frequency is 52.3%. A 3 4 2 3 2 5 5 Preoperative Role Emotional Score 5 5 Postoperative Role Emotional Score Figure 4. Distribution of preoperative (A) and postoperative () scores for the role emotional domain. In A, the mean (SD) is 69.1 (41.2), the median (range) is (-), and the top-box frequency is 59.8%; in, the mean (SD) is 8.9 (35.6), the median (range) is (-), and the top-box frequency is 75.1%. A 5 4 4 3 3 2 2 1 1 2 4 6 Preoperative Vitality Score 8 2 4 6 Postoperative Vitality Score 8 Figure 5. Distribution of preoperative (A) and postoperative () scores for the vitality domain. In A, the mean (SD) is 49.1 (23.7), the median (range) is 5 (-), and the top-box frequency is.2%; in, the mean (SD) is 58.8 (21.8), the median (range) is 6 (-), and the top-box frequency is 2.2%. priate. These types of data should be analyzed using nonparametric techniques. 4,7-9 Ultimately, the main point of this study is that it is essential to assess that continuous data follow a normal distribution before using parametric statistical analysis. The natural question to ask is the following: Are the results of this study unique to my patient population or to patients in general? The developers of the SF-36 have done similar studies in the general population. 1 Their results are similar. Table 2 presents their data from a sample of 2474 (REPRINTED) ARCH SURG/ VOL 142, MAY 27 476 Downloaded From: on 11/7/218 27 American Medical Association. All rights reserved.

Table 2. SF-36 Descriptive Statistics for the General US Population* Domain Mean (SD) Median (Range) Ceiling (ie, Top ox), % Physical functioning 84.15 (23.28) 9 (-) 38.7 Role physical 8.96 (34.) (-) 7.85 Role emotional 81.26 (33.4) (-) 71.1 odily pain 75.15 (23.69) 74 (-) 31.85 Vitality 6.86 (2.96) 65 (-) 1.5 Mental health 74.74 (18.5) 8 (-) 3.91 Social functioning 83.28 (22.69) (-) 52.32 General health 71.95 (2.34) 72 (5-) 7.4 Abbreviation: See Table 1. *Adapted from Table 1.1 in the book by Ware et al. 1 individuals. Although the distribution of the curves is shifted to the right (ie, means, medians, and the frequency of topbox scores are higher), the general shape of the curves is the same. Specifically, there are means and medians greater than 5, with a long tail to the left (ie, lower scores). Therefore, rather than an aberration, the distribution of SF-36 scores in the surgical population is consistent with the general population, although somewhat lower, as would be expected in patients requiring an operation. The authors of the SF-36 recognized limitations in the to scoring scale. Their specific concern was related to how the scores between domains would be compared. 1 Therefore, they developed a norm-based scoring system in which the mean, by definition, would be 5 and the standard deviation would be 1. Using a formula, the individual patient s score can be transformed into this norm-based score. The authors make clear that the purpose of norm-based scoring is to provide a basis for meaningful comparison of scores between each domain and to compare the score of a single patient with that of the general population. This can beg the following question: ecause these scores are normalized, does that imply that data using these scores will follow a normal distribution? In the study by Ware and Kosinski, 1 their Table 1.2 provides the norms for the general US population. y definition, the mean is 5 and the standard deviation is 1. However, in none of the 8 domains is the median 5, although the median is within 1 point in the general health domain. In all domains, the top of the range (ceiling) is closer to 5 than the bottom of the range (floor). This implies that the histograms of these normalized scores are skewed, with a long tail to the left, just as we see in the standard scores. Therefore, even if the researcher wants to use the population-based norms, the researcher cannot assume that the data will follow a normal distribution. Othersubtletiesofthedataalsoshouldbenoted. Although presented as a scale from to, in fact, not all numbers between and are available as scores. This is a matter of precision. 11 The more values available, the easier it is to differentiate between levels of QoL. For example, consider the RE domain. There are only 4 possible scores (, 33, 67, and ). In fact, the mean of 74.5 is not available as a choice. Therefore, one can argue that the scores of the SF-36 are not continuous at all, but rather ordinal, much like cancer staging. We know, by definition, that stage 3 disease is worse than stage 2 disease, but is there a definition of stage 2.5 disease? Most would say that stage 2.5 disease in cancer has no meaning. y analogy, one can argue that scores in between the scores that can be actually obtained have no meaning. Noncontinuous (ie, ordinal) data should be analyzed using nonparametric techniques. 8,9 This study also demonstrates that there would be value in a top-box analysis. Top-box analysis is the separate statistical analysis of the highest possible scores achievable by a survey (ie, the best score). These data then become nominal. Top-box analysis would be of most value when the distribution of scores bunches toward the highest value. For example, in the health transition question (Figure 1) and the role physical (Figure 3), RE (Figure 4), and social functioning domains, the difference in the frequency of the topbox score was more dramatic than the difference in the median scores. In fact, in the RE domain, the preoperative and postoperative medians are the same, yet it is the top-box frequency that is substantially different. It is probably not coincidental that these domains also had the fewest possiblescores(ie, theleastprecision). Thosedomainsforwhich the top-box score was most insightful were the domains in which the mode was also at the top box. As seen in the RE domain, if the mode and median are both at the top box, then the frequency of the top-box score may be the only way to determine if there is a change in QoL score. In conclusion, the SF-36 is a valuable instrument in measuring QoL in a variety of patient populations, including surgical patients. Nevertheless, researchers and readers of these studies need to be aware of the nature of these data and need to use appropriate statistical techniques. Accepted for Publication: January 5, 27. Correspondence: Vic Velanovich, MD, Division of General Surgery, Mailstop K-8, Henry Ford Hospital, 2799 W Grand lvd, Detroit, MI 4822 (vvelano1@hfhs.org). Financial Disclosure: None reported. Previous Presentation: This paper was presented at the 114th Annual Scientific Session of the Western Surgical Association; November 14, 26; Los Cabos, Mexico; and is published after peer review and revision. The discussions that follow this article are based on the originally submitted manuscript and not the revised manuscript. REFERENCES 1. Ware JE Jr, Kosinski M, Gandek. SF-36 Health Survey: Manual and Interpretation Guide. Lincoln, RI: Quality Metric Inc; 25. 2. Velanovich V. The quality of quality of life studies in general surgical journals. J Am Coll Surg. 21;193:288-296. 3. Greenhalgh T. How to Read a Paper: The asics of Evidence ased Medicine. London, England: MJ ooks; 21. 4. Snedecor GW, Cochran WG. Statistical Methods. 6th ed. Ames: Iowa State University Press; 1967. 5. eyer WH. Handbook of Tables for Probability and Statistics. 2nd ed. oca Raton, Fla: CRC Press; 2. 6. Stata Statistical Software: Release 8.. College Station, Tex: StataCorp; 23. 7. Mandel J. The Statistical Analysis of Experimental Data. New York, NY: Dover Publications; 1964. 8. Colton T. Statistics in Medicine. oston, Mass: Little rown & Co Inc; 1974. 9. Velanovich V. A review of data analysis for surgical research. Curr Surg. 199;47: 41-418. 1. Ware JE Jr, Kosinski M. SF-36 Physical & Mental Health Summary Scales: A Manual for Users of Version 1. 2nd ed. Lincoln, RI: Quality Metric Inc; 21. 11. Scientific Advisory Committee of the Medical Outcomes Trust. Assessing health status and quality-of-life instruments: attributes and review criteria. Qual Life Res. 22;11:193-25. (REPRINTED) ARCH SURG/ VOL 142, MAY 27 477 Downloaded From: on 11/7/218 27 American Medical Association. All rights reserved.

DISCUSSION John A. Weigelt, MD, Milwaukee, Wis: Quality measures have been a topic of our meeting as they are now a part of our health care system today. The SF-36 is one of many tools used to measure health status. It has been shown multiple times to be a reliable and valid measurement. It is used to measure health status before and after an intervention such as a surgical procedure. It is also associated with a growing body of literature trying to explain its use and how to apply it to various investigations. Currently, the simple fact is if we wish to assess quality outcomes, we will eventually be forced to use a tool such as the SF-36. As you do this, please think about this presentation. This paper must be consumed with its figures. It is a visual thing. And unless you visually digest the graphs, the manuscript will fall a little flat. Once the graphs are interpreted, the whole premise of the article becomes clear. We have heard a statistical challenge to how we are analyzing our SF-36 output. Most studies use parametric tests which assume a normal distribution. Dr Velanovich clearly demonstrates that the data we collect with the SF-36 are not always normal in their distribution but have many nonparametric characteristics. It is this observation that will make this report memorable after today. Now, whether this top-box approach is the correct approach will need to be assessed by health care researchers and their statisticians. I have 3 questions. Since many report SF-36 data using parametric analyses, what made you look at the data this way? The second question, nonparametric options are many and each have their champions. What was your reasoning for selecting the topbox methodology? And finally, since the top box was not ideal for all the domains, are you using any other nonparametric analyses to assess other parts of the SF-36? Dr Velanovich: I have read many articles and heard many talks with regard to the SF-36 and the data have frequently been presented as parametric in nature (ie, following a normal distribution with means and standard deviations). Every time I assessed my data, the distribution was never normal. I was becoming concerned that I was wrong. My goal was to definitively answer the question about normality, at least in my mind. The second question was about nonparametric tests and the reason for using the top-box methodology. At our institution, we use Press Ganey scores for measuring patient satisfaction. One of the ways that the scores are presented is in this top-box fashion. As I was reviewing these reports, it occurred to me that many of the SF-36 scores are at the level and, perhaps, this could be a method to assess change in the data. So, this was the inspiration of the top-box method. I also agree that I don t know whether long-term this is going to be helpful or not, but I believe it is worth exploring. The last question was about the appropriateness of the top box for all domains. Dr Weigelt is correct. It is not ideal for all the domains. So I suspect it is just going to be 1 of many ways to assess this type of data, and it is certainly not going to be the only way. Donald E. Low, MD, Seattle, Wash: The issue regarding application of the SF-36 for many of us relates to the specifics of individual patient populations. For some of us who deal with cancer patients, assessing quality of life before and after surgery is unlikely to be productive. Patients often have just learned they have cancer before they get to our office. Asking them to assess quality of life at that moment and then comparing to the postoperative parameters is open to misinterpretation. The real advantage of the SF-36 in many of our minds is the ability to assess postoperative parameters compared to general and sex-matched populations, to gauge whether these patients can undergo large operations and return to a quality of life comparative to the general population. Is this not 1 of the advantages of the SF-36 in its current form when assessing quality of life, especially in cancer patients? Dr Velanovich: The summary scores attempt to standardize the scores to population controls. The reason why the Medical Outcomes Trust went that way was because of the problem they had with the distribution of the data. If one looks at the original articles, they had the same or similar distribution of data that I show here. So I don t think there is a difference specifically with surgical patients compared to the general population. The summary statistics are used in order to try to avoid some of the statistical issues. For a cross-sectional study, which is what you are recommending, summary scores are a very good way of presenting these types of data. The problem is in determining the change of scores for individual patients. I think this is more problematic. I agree with you, there are potential problems in a cancer patient, who clearly may have anxiety and other issues. Nevertheless, it may not reflect what their precancer baseline is, but it is a valuable pretreatment baseline. Karen J. rasel, MD, Milwaukee: As you have shown, there are so many nonparametric statistical tests available; one might work for 1 of the domains but another is going to be required for a separate domain. Since now norm-based scores for the SF-36 are available for all of the domains, might an easier approach be to change, rather than using the to, use the norm-based scores and simplify the subsequent statistical analysis? Dr Velanovich: I don t have an answer for you. I have not used the norm-based scores in my research. So I don t know how much of that will change. One of the issues, though, is this whole point about what is a minimally important difference? And the minimally important difference that the SF-36 looks at is generally changes that are between 5 to 1 points. And that is based on the original parameters. asil A. Pruitt, Jr, MD, San Antonio, Tex: I wonder if your random selection of tests didn t ensure skewness and kurtosis because you included test results representing heterogeneity of age, heterogeneity of disease, and both preoperative and postoperative assessments. So how do we interpret that? I think that if you had a specific disease, specific age group, and postoperative, and maybe even sequential, postoperative assessments, because now there are techniques to look at trajectory of recovery across time, that you would have reduced the variability that you are decrying. Dr Velanovich: Ultimately, the point is that a researcher needs to test that his or her data follow a normal distribution prior to the use of parametric statistical tests. And notice there actually is not that much difference between the shape of the curves between preoperative and postoperative. I haven t shown here, but when you separate out cancer patients from noncancer patients, although the means and medians and the curves shift, the shapes of the curves look pretty much the same. Walter J. McCarthy, MD, Chicago, Ill: ecause of the heterogeneity of the patients, you are going to have a wide spectrum of results. We reviewed a large number, more than 5, patients with intermittent claudication and found that, particularly with the physical function aspect of the SF-36, the distribution was quite normal. We were also able to compare pretreatment and posttreatment physical function scores in a meaningful way, showing an improvement. My point is, in a homogeneous group, the SF-36 data will be more normal. Dr Velanovich: I applaud you for at least looking at it and taking that into consideration. Most of the time, though, even with homogeneous groups, such as with reflux patients, it still comes out this way. Financial Disclosure: None reported. (REPRINTED) ARCH SURG/ VOL 142, MAY 27 478 Downloaded From: on 11/7/218 27 American Medical Association. All rights reserved.