Statistical Analysis of Method Comparison Data

Testing Normality

GEORGE S. CEMBROWSKI, PH.D., JAMES O. WESTGARD, PH.D., WILLIAM J. CONOVER, PH.D., AND ERIC C. TOREN, JR., PH.D.

Cembrowski, George S., Westgard, James O., Conover, William J., and Toren, Eric C., Jr.: Statistical analysis of method comparison data. Testing normality. Am J Clin Pathol 72: 21-26, 1979. A Lilliefors test of normality has been applied to data from precision and accuracy studies. Most data sets tested as non-normal. Simulation studies showed that the test is extremely sensitive to the rounded, narrowly distributed data that are typical of method performance studies in clinical chemistry. The Lilliefors test can be modified to be applicable to rounded data so that it gives fewer indications of non-normality. The authors conclude that the selection of a test of normality requires careful study of the properties of the test. Otherwise, the subsequent choice between parametric and nonparametric statistics may not be meaningful. (Key words: Method comparison studies; Statistics; Tests of normality; Lilliefors test; Kolmogorov-Smirnov test; Nonparametric statistics.)

THE ACCEPTABILITY of a laboratory method depends in part on the method's precision and accuracy. These measures are derived primarily from the statistical analysis of replication data and method-comparison data. The appropriateness of using either a parametric or a nonparametric statistical approach depends on the frequency distribution of the data. Parametric tests such as the t test, the F test, the Pearson product moment correlation coefficient, and regression analysis assume that distributions are normal, or gaussian. Nonparametric tests such as the sign test, Wilcoxon's test, and Spearman's rank correlation coefficient make no assumption of normality and can be applied to normal and non-normal data alike.

Received December 27, 1977; received revised manuscript and accepted for publication May 25, 1978. Supported by Grants
GM 978 and GN 24453 from the National Institutes of Health, Grant GP-43625X from the National Science Foundation, by a General Research Support Sub-Grant from the University of Wisconsin Medical School, a computing grant from the Graduate School, and by the Clinical Laboratories, University of Wisconsin Hospitals, Madison, Wisconsin.

Presented in part at the Twenty-eighth National Meeting of the American Association of Clinical Chemists, Houston, Texas, August -6, 1976.

Address reprint requests to Dr. Cembrowski: Department of Pathology, University of Wisconsin Hospitals, Clinical Science Center, 600 Highland Avenue, Madison, Wisconsin 53792.

Departments of Pathology and Medicine and the Clinical Laboratories, University of Wisconsin, Center for Health Sciences, Madison, Wisconsin

Wu and associates (11) recommended a form of the Lilliefors (5,7) statistical test for testing whether clinical laboratory data have a gaussian, or normal, frequency distribution. This test of normality, a specialized Kolmogorov test (8), was used to determine whether parametric or nonparametric techniques should be employed for the analysis of method comparison data. An example illustrating the calculation of the Lilliefors test was included by Wu and associates (11). Recently the use of this test has been recommended in a continuing education publication of ASCP (1). Further applications are expected to appear in the clinical pathology literature.

We have studied the use of the recommended method for testing normality and have reservations about its general application to clinical chemical data. We have confirmed the observation of Wu and colleagues that the differences between paired results of patient specimens analyzed by two methods generally demonstrate non-normality when evaluated by the recommended test.
We tested method-comparison differences for 2 commonly measured constituents, using data from studies comparing the Technicon SMAC and SMA 12/60 analyzers* and the DuPont ACA.† In addition, we tested precision data from lyophilized pools, which were analyzed by these instruments, all operating in routine service. These data also tested predominantly non-normal.

These observations about the non-normal distribution of data from clinical chemical performance studies assume that the chosen test of normality has appropriate sensitivity for the type of data being tested. There are many tests for normality available, each having its own properties (9). We present here some investigations which evaluate the sensitivity of the Lilliefors test to rounded, narrowly distributed data that are typical of the performance studies for automated instruments and well-correlated chemical methodologies.

* Technicon Corporation, Tarrytown, New York 59.
† E. I. DuPont de Nemours and Co., Wilmington, Delaware 9898.

0002-973/79/0700/002 $00.80 American Society of Clinical Pathologists

FIG. 1. Frequency histogram of simulated blood urea nitrogen values. X = 46. mg/dl, SD =.02 mg/dl. (Abscissa: BUN, mg/dl.)

FIG. 2. Empirical distribution function of data of Figure 1 (see text). (Abscissa: BUN, mg/dl.)

Sensitivity of the Lilliefors Test

For study purposes, we assumed the simplest case, where the differences between the test method and the reference method are the result of only the random error in the test method. If the test method had normally distributed errors, then the between-method differences would also be normally distributed. We therefore studied precision data, even though Wu and associates proposed the test for examining the differences between methods in a patient comparison study, rather than for testing the precision of an individual method.

A computer was used to generate normally distributed data whose means and standard deviations were similar to those of control observations produced by each test channel of the SMA 12/60. Twelve series of 800 simulated normally distributed control results (one series for each test of SMA 12/60) were produced by the scaling and subsequent rounding of computer-generated random normally distributed numbers. Table 1 shows the tests, their averages and standard deviations, and the significant figures to which the test results were rounded. Sample sizes of, 40,, 0, and 800 were tested for normality at the α = 5 significance level. As shown in Table 1, most of these normally distributed data sets tested as non-normal, especially data sets that had a large sample size or narrow distribution with few concentration intervals. This suggested that the Lilliefors test was either overly sensitive or inappropriate for rounded, narrowly distributed data that are typical of method performance studies in clinical chemistry.

Table 1. Computer-simulated Normally Distributed SMA 12/60 Control Data, Means, and Standard Deviations for a Test Sample of 800, Normality α = 5, as Tested by the Test of Wu and Associates (11)

Column headings: Test*; Mean; Standard Deviation; Figure to Which Test Results Were Rounded; Accepted as Normal (n =, n = 40, n =, n = 0, n = 800)

Tests: Calcium (mg/dl); Phosphorus (mg/dl); Glucose (mg/dl); BUN (mg/dl); Uric acid (mg/dl); Cholesterol (mg/dl); Total protein (g/dl); Albumin (g/dl); Total bilirubin (mg/dl); ALP (U/dl); LD (U/l); AST (U/l)

Entries: 6.43 3.73 65.9 46. 5.70 30.9 4.3 2.6 .44 8.2 52. 68.6 0.8 95 3.32.5 88 3.3 7 86 0.406 9.0 3.7

* Abbreviations used: BUN = blood urea nitrogen; LD = lactate dehydrogenase (L-lactate: NAD oxidoreductase, EC 1.1.1.27); ALP = alkaline phosphatase (orthophosphoric acid monoester phosphohydrolase, EC 3.1.3.1); AST = serum glutamic oxaloacetic transaminase (L-aspartate: 2-oxoglutarate aminotransferase, EC 2.6.1.1).

Rationale for a Modified Lilliefors Test of Normality

In the application of the Lilliefors test, the data's empirical cumulative distribution is compared with the normal cumulative distribution function. The distribution is classified as non-normal when the maximum vertical distance between the two functions exceeds the Lilliefors test statistic, which is tabulated by sample size (n) and confidence coefficient (p), where 1 - p = α, the significance level for the test (5).

As an example, Figure 1 shows a histogram of the first values of the simulated blood urea nitrogen (BUN) test data. The empirical distribution function of these data is presented in Figure 2. The empirical distribution function and the theoretical cumulative normal distribution function are compared in Figure 3. For each test value, the empirical distribution function is defined as the fraction of test results that are less than or equal to that test value. Because the BUN data have been rounded to the nearest integer, the fraction of test results that are less than a specific integral test result cannot be determined. For example, any test result ranging from 45.5 to 46.5 mg/dl will be rounded to 46. Therefore, the fraction of test results below 46 cannot be ascertained. However, the fraction of test results less than 45.5 or 46.5 mg/dl can easily be calculated. As shown in Figure 2, the value of the empirical distribution function for a BUN value of 46.5 mg/dl is 0.65, i.e., 65% of the BUN data are lower than 46.5 mg/dl.
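The simulation and the boundary argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the seed, the sample size of 100, and the SD of 1.0 mg/dl are hypothetical choices, not the authors' program or parameters. It generates normally distributed values, rounds them to the least significant digit, and evaluates the empirical distribution function only at the upper rounding boundary of each distinct value (for example 46.5 mg/dl for a reported 46), where the fraction of results below that point is well defined.

```python
import random
from collections import Counter


def simulate_rounded_normal(n, mean, sd, lsd=1.0, seed=0):
    """Draw n normal values and round each to the nearest multiple of lsd."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [round(rng.gauss(mean, sd) / lsd) * lsd for _ in range(n)]


def edf_at_upper_boundaries(values, lsd=1.0):
    """Return (boundary, fraction of results below boundary) pairs.

    The boundary for a reported value v is v + lsd/2, the largest
    unrounded result that would still have been reported as v.
    """
    n = len(values)
    counts = Counter(values)
    cumulative = 0
    points = []
    for v in sorted(counts):
        cumulative += counts[v]
        points.append((v + 0.5 * lsd, cumulative / n))
    return points


# Hypothetical BUN-like control data (mean and SD chosen for illustration).
bun = simulate_rounded_normal(n=100, mean=46.0, sd=1.0)
for boundary, fraction in edf_at_upper_boundaries(bun):
    print(f"{boundary:5.1f} mg/dl: {fraction:.2f}")
```

With only a handful of distinct reported values, the empirical distribution function is known at correspondingly few points, which is the property the modified test relies on.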
In order to compare the normal cumulative distribution function and the empirical distribution function, the test values are normalized by subtracting their mean and dividing by their standard deviation. This transformation does not change the shape of the empirical distribution curve; the test results are merely centered around zero. The position of each test value on the abscissa corresponds to the difference in standard deviations between the mean and that test value. In Figure 3, where the empirical distribution function (with normalized BUN values) and the normal cumulative distribution function (obtained from standard statistical tables (8)) are superimposed, the maximum vertical distance between the functions indicates the degree of non-normality. With the use of the algorithm of Wu and associates (11), the vertical distances between the two functions are measured at each of the test values. Thus in Figure 3 the maximum vertical distance is D1, which has a value of 0.. This must be compared with the test statistic, which is 5 for n = and α =. Since D1 exceeds the test statistic, the distribution is classified as non-normal, with only a % probability that this could occur due to chance.

FIG. 3. Normalized empirical distribution function with the normal cumulative distribution function superimposed. The vertical line at the normalized value of 0 on the X axis corresponds to the mean value of approximately 46, as shown in the above figures. Using the algorithm of Wu and associates, D1 is the maximum vertical distance between the empirical distribution function and the normal cumulative distribution function. D2 is the maximum vertical distance when measured at midpoints between consecutive test values, rather than at the test values. (Abscissa: normalized BUN values.)
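As a rough illustration of a distance measured at the test values themselves, the sketch below computes the usual two-sided Kolmogorov-type distance with the mean and SD estimated from the sample. It is an assumption-laden sketch, not a reproduction of the exact bookkeeping in the algorithm of Wu and associates: the function names and the 30 rounded BUN-like values are hypothetical, and the normal cumulative distribution is computed from `math.erf` rather than read from tables.

```python
import math
import statistics


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def max_distance_at_test_values(values):
    """Largest vertical distance between EDF and NCDF, taken at each data point."""
    xs = sorted(values)
    n = len(xs)
    mean = statistics.fmean(xs)
    sd = statistics.stdev(xs)
    d_max = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal_cdf((x - mean) / sd)  # NCDF at the normalized test value
        # Compare with the EDF just after (i/n) and just before ((i-1)/n) point i.
        d_max = max(d_max, abs(i / n - f), abs(f - (i - 1) / n))
    return d_max


# Hypothetical rounded data: many ties at a few integer values.
sample = [44] * 2 + [45] * 7 + [46] * 11 + [47] * 7 + [48] * 2 + [49] * 1
print(round(max_distance_at_test_values(sample), 3))
```

Because a large tied group makes the empirical distribution function jump by a large step exactly where the smooth normal curve is evaluated, a distance of roughly half the step height arises from rounding alone.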
Because the empirical distribution function is not rigorously defined at each of the test values, the comparison of the empirical distribution function and the normal cumulative distribution function at each of these test values is not appropriate. The empirical distribution function can be determined only at points on the abscissa corresponding to midpoints between two successive test values. The maximum vertical distance measured at the midpoints is 2 (D2), which is considerably smaller than 5, the test statistic for α =.

The Appendix describes an algorithm for a modified Lilliefors test that takes into consideration the rounding of data. There are two differences between the modified algorithm and that of Wu and associates (11). First, distances between the empirical distribution function and the normal cumulative distribution function are measured at points midway between successive test values and not at the test values. Second, only one distance is measured for each set of identical test points, whereas the algorithm cited by Wu and associates measured as many distances as there were points.

Table 2. Probabilities that the Lilliefors Test Statistic is Exceeded for Certain Significance Levels (α)*

Column headings: Test; Number in Group (n); Probability Test Statistic Exceeded for α = 0., α = 5, α = 0., α = 5, α =

Calcium: 0 5 32 2 8
Phosphorus: 0 07 5 38 07
Glucose: 0 5 28 38 48 40 2 38 30 3 22 09
Blood urea nitrogen: 0
Uric acid: 0 2 8 07
Cholesterol: 0 46 68 62 88 98 3 52 48 68 86 23 39 22 46 56 3 8
Total protein: 0
Albumin: 0
Total bilirubin: 0
Alkaline phosphatase: 0 3 48 38 22 28 28
Lactate dehydrogenase: 0 32 40 74 86 96 2 23 52 60 76 2 36 36 52 2 0.028
Serum glutamic oxaloacetic transaminase: 0 32 40 46 7 22 30 40 2 8 28 09

* For n = and n =,,000 sets of normally distributed data were generated at each n. For n =,, and 0, 0 sets were generated.

Characteristics of the Modified Lilliefors Test

Tests of the Kolmogorov type, when properly applied to discrete data, are conservative. This means that when the test classifies a method as non-normal, α is actually lower than what is being specified. The true probability that the maximum vertical distance will exceed the test statistic will be lower than the significance level of that test statistic (3,4). For example, the Lilliefors test statistic is 5 for n =. and α =. Given a set of normal discrete data, the probability that the maximum vertical distance will exceed 5 will actually be less than. Conover (4) has proposed a Kolmogorov test for discontinuous distributions in which the exact test statistic can be calculated. The calculations, however, become difficult with a sample size greater than 30.

To determine the degree of conservativeness of the proposed modified Lilliefors test, normally distributed SMA 12/60 data were simulated, grouped in various sample sizes, and tested. At least 0 different groups of data were simulated at each sample size. Estimates of the probability that a normal set of data would test as non-normal were obtained from the fractions of non-normal groups. The simulation results are presented in Table 2. The empirically derived probabilities are considerably lower than those expected with a continuous distribution, especially for those tests that produce very few different test values (cf. standard deviations and least significant digits, Table 1). These tests include total protein, albumin, and total bilirubin.
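A small Monte Carlo sketch shows how such rejection fractions can be estimated. Everything specific in it is an assumption made for illustration: the sample size of 100, the 200 trials, the mean of 46.0 and SD of 1.0, and in particular the critical value, which here uses the large-sample approximation 0.886/√n for the 5% level from Lilliefors's published table (applicable for n greater than 30) instead of a tabulated entry. It is not the simulation program used for Table 2.

```python
import math
import random
import statistics
from collections import Counter


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def distance_at_values(values):
    """Conventional statistic: distances measured at the test values."""
    xs = sorted(values)
    n = len(xs)
    mean, sd = statistics.fmean(xs), statistics.stdev(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal_cdf((x - mean) / sd)
        d = max(d, abs(i / n - f), abs(f - (i - 1) / n))
    return d


def distance_at_midpoints(values, lsd):
    """Modified statistic: one distance per distinct value, at value + lsd/2."""
    n = len(values)
    mean, sd = statistics.fmean(values), statistics.stdev(values)
    counts = Counter(values)
    cumulative, d = 0, 0.0
    for v in sorted(counts):
        cumulative += counts[v]
        f = normal_cdf((v + 0.5 * lsd - mean) / sd)
        d = max(d, abs(f - cumulative / n))
    return d


def rejection_rates(n, mean, sd, lsd, trials, seed=1):
    """Fraction of simulated normal, rounded samples classified as non-normal."""
    rng = random.Random(seed)
    critical = 0.886 / math.sqrt(n)  # ASSUMPTION: large-sample 5% approximation
    conventional = modified = 0
    for _ in range(trials):
        sample = [round(rng.gauss(mean, sd) / lsd) * lsd for _ in range(n)]
        conventional += distance_at_values(sample) > critical
        modified += distance_at_midpoints(sample, lsd) > critical
    return conventional / trials, modified / trials


conv_rate, mod_rate = rejection_rates(n=100, mean=46.0, sd=1.0, lsd=1.0, trials=200)
print(f"conventional: {conv_rate:.3f}  modified: {mod_rate:.3f}")
```

Comparing the two fractions for the same simulated samples gives a direct, if crude, picture of how much the choice of measurement points changes the verdict on data that are normal by construction.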
The probability that a set of data is found to be non-normal is greater for tests that produce more intervals in a histogram, e.g., cholesterol, the enzymes, and glucose. These results indicate that the modified Lilliefors test is extremely conservative.

To investigate the sensitivity of the method to outliers, groups of normal simulated SMA 12/60 data with single outliers were tested for normality. Outliers at 3, 4, and 5 standard deviations (SD) from the mean had no effect on the result of normality for groups as small as n =. Normal distributions of SD = and n = tested as non-normal with outliers more than 6 SD away from the mean. Outliers at 8 SD resulted in groups as large as n = testing as non-normal.

The computer-simulated normally distributed data of Table 1 all tested as normal with the use of the modified Lilliefors test. Application of the test to the somatotropin difference data of Figure of Wu and associates (11) showed that the data previously classified as non-normal were also normally distributed. The data cited in Table 4 of Wu and associates (11) were reanalyzed using the modified algorithm. Two of the three previously non-normal sets were shown to be non-normal at lesser significance levels, α = 5 instead of α =. The data from Volume 7, page 79 (97), contained an extreme outlier; when it was removed, the data could not be shown to be non-normal (α = 5). The other sets of data were normal.

Discussion

These investigations show that the choice of a test of normality requires careful study. One test may be overly sensitive and another very conservative. The choice of the test essentially determines whether the performance data test as normal or non-normal. For precision and accuracy studies of clinical chemical methods, the test proposed by Wu and associates will indicate a high frequency of non-normality, even with normally distributed data. The test as modified here will give a lower frequency of non-normality. The selection at present depends primarily on the point of view of the investigator.
One who favors nonparametric statistics may select the more sensitive test, and one who favors parametric statistics, the less sensitive test. An objective choice of a test of normality requires first that studies be made to determine the nature of error distributions for clinical analytic methods and how the distribution affects the interpretation of the performance study data.

One other major consideration in the statistical approach is whether the judgment of performance should be based on statistical significance or clinical significance. Barnett (2) has long advocated the need for consideration of the clinical significance of laboratory results. The present application of nonparametric tests disregards this and considers only whether the observed differences are statistically significant. Small differences between methods may be tolerable, even when they are statistically significant. It is obvious that a systematic error or bias of mg/dl for a glucose method is of no concern, even though it may be statistically significant. Judgments of the acceptability of a method's performance should require that estimates of errors be statistically reliable but that the acceptability of the error be judged relative to the clinical demands on the test. The present recommendations for the use of nonparametric tests do not take such elements into account. Guidelines for proper application are lacking, and decisions about a method's performance are likely to be more confusing, rather than more objective.

References

1. Arthur GL, Rawnsley HM: Statistical Analysis of Method Comparison Studies. Advanced Clinical Chemistry Check Sample. ACC-23. CCE Council on Clinical Chemistry, American Society of Clinical Pathologists, 1977
2. Barnett RN: Medical significance of laboratory results. Am J Clin Pathol :67-676, 1968
3. Bradley JV: Distribution-Free Statistical Tests. Englewood Cliffs, N. J., Prentice-Hall, Inc., 1968, pp 302-303
4. Conover WJ: A Kolmogorov goodness-of-fit test for discontinuous distributions. J Am Statist Assoc 67:59-596, 1972
5. Conover WJ: Practical Nonparametric Statistics. New York, John Wiley and Sons, 1971, pp 30-306, p 398 (Table 5)
6. Kolmogorov A: Confidence limits for an unknown distribution function. Ann Math Statist 2:46-463, 1941
7. Lilliefors HW: On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J Am Statist Assoc 62:399-402, 1967
8. Natrella MG: Experimental Statistics. National Bureau of Standards Handbook 9. Washington, D. C., U. S. Government Printing Office, 1963, Table A-1, Cumulative normal distribution values of P, p T2
9. Shapiro SS, Wilk MB, Chen HJ: A comparative study of various tests for normality. J Am Statist Assoc 63:343-372, 1968
10. Westgard JO, de Vos DJ, Hunt MR, et al: Concepts and practices in the evaluation of clinical chemistry methods. Part. Statistics. Part IV. Decisions on acceptability. Am J Med Technol 44:552-570, 727-742, 1978
11. Wu GT, Twomey SL, Thiers RE: Statistical evaluation of method comparison data. Clin Chem 2:35-3, 1975

APPENDIX

Implementation of the Modified Lilliefors Test

Arrange the data in ascending order. Calculate the mean ($\bar{X}$) and the standard deviation (SD). Let the value of the maximum vertical distance be zero. Start at the beginning of the test results and move sequentially, stopping just before each different test result to do steps 1-4:

1. Normalize the present test value ($X_i$) by subtracting the mean and dividing by the SD. The result is:

$$\frac{X_i - \bar{X}}{SD}$$

2. Add to this quantity half of the interval between two consecutive normalized test results. This step corresponds to adding half of the least significant digit (LSD) divided by the SD and gives:

$$\frac{X_i - \bar{X}}{SD} + \frac{0.5\,LSD}{SD}$$

With the use of statistical tables, look up the normal cumulative distribution for this quantity.

3. Calculate the value of the empirical distribution function for the present test result. This is simply the position of the current test value divided by the total number of points.

4. Calculate the absolute difference between the normal cumulative distribution and the empirical distribution functions. If the maximum vertical distance is less than the new absolute value, reassign its value to the value of the new absolute distance.

When all the results have been processed, compare the derived maximum absolute distance with the appropriate Lilliefors test statistic. Whenever the statistic is exceeded, the population can be said to be non-normally distributed.

Use of the modified Lilliefors test is illustrated in Table 3, the initial data being differences of patient comparison results measured by two different analytic methods. The least significant digit is the number to which the data have been rounded, which is for these data. For n = 7 and α =.5 the Lilliefors test statistic is 0.25. The maximum absolute distance is 0.2, found at $X_i$ = 0.5, i = 5. Because this exceeds the Lilliefors test statistic, the data may be classified as non-normal at the α = 5 significance level.

Table 3. Example of Modified Lilliefors Test Calculation

Column headings: $X_i$; $(X_i - \bar{X})/SD$; $(X_i - \bar{X})/SD + (0.5\,LSD^{*})/SD$; NCDF†; EDF‡; |NCDF - EDF|

Entries: -2.4 -.2 -.0 -0.9 -0.7 -0.2 0.3 0.3 0.5 .0 4.0 -.853 -0.929 -0.775 -0.698 -0.544 -82 - 72 49 0.226 0.380 0.765 3.075 -.84 -0.890 -0.736 -0.660 -0.6 -43 88 0.264 0.48 0.804 3.4 35 87 0.23 0.255 0.307 0.483 0.54 0.544 0.574 0.604 0.662 0.789 0.999 59 8 76 0.235 0.294 0.353 0.588 0.647 0.706 0.824 0.882 0.94 .000 69 54 3 30 74 0.3 3 0.29 0.2 52

n = 7, X =. SD =.299.
* Least significant digit.
† NCDF = normal cumulative distribution function.
‡ EDF = empirical distribution function.
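The four steps of the Appendix can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' program: the function names and the eight example differences (rounded to an LSD of 0.1) are hypothetical, the normal cumulative distribution is computed with `math.erf` in place of statistical tables, and the Lilliefors critical value is left for the reader to take from a published table for the sample size and significance level in use.

```python
import math
import statistics


def normal_cdf(z):
    """Standard normal cumulative distribution function (replaces table lookup)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def modified_lilliefors(values, lsd):
    """Return (maximum distance, rows) following Appendix steps 1-4.

    Each row holds: x, normalized x, normalized x + 0.5*LSD/SD,
    NCDF, EDF, |NCDF - EDF|, one row per distinct test value.
    """
    xs = sorted(values)
    n = len(xs)
    mean = statistics.fmean(xs)
    sd = statistics.stdev(xs)
    d_max = 0.0
    rows = []
    for i, x in enumerate(xs, start=1):
        if i < n and xs[i] == x:
            continue  # stop only just before the next different result
        z = (x - mean) / sd             # step 1: normalize
        z_shifted = z + 0.5 * lsd / sd  # step 2: move to the midpoint
        ncdf = normal_cdf(z_shifted)
        edf = i / n                     # step 3: position / number of points
        distance = abs(ncdf - edf)      # step 4: absolute difference
        d_max = max(d_max, distance)
        rows.append((x, z, z_shifted, ncdf, edf, distance))
    return d_max, rows


# Hypothetical between-method differences, rounded to 0.1.
differences = [-0.2, 0.0, 0.0, 0.1, 0.1, 0.1, 0.3, 0.4]
d_max, rows = modified_lilliefors(differences, lsd=0.1)
for row in rows:
    print("  ".join(f"{value:7.3f}" for value in row))
print(f"maximum distance: {d_max:.3f}")
```

The returned maximum distance is then compared with the tabulated Lilliefors statistic; returning the rows as well makes it easy to lay the calculation out in the same column order as Table 3.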