                                             Diagnostic test investigated      Diagnostic test investigated
                                             indicates the patient has         indicates the patient does
                                             the condition                     not have the condition
Gold/reference standard indicates
the patient has the condition                True positive (TP)                False negative (FN)
Gold/reference standard indicates
the patient does not have the condition      False positive (FP)               True negative (TN)

Figure 1: Design and outcomes of an independent blind study with gold/reference standard comparison. Adapted from DCEB (1981b).
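To make the four outcomes in Figure 1 concrete, the following sketch (Python; the function name, variable names and example data are illustrative assumptions, not from the cited sources) tallies TP, FN, FP and TN from paired gold/reference standard and diagnostic test results.

```python
from collections import Counter

def tally_outcomes(gold, test):
    """Count TP, FN, FP and TN from paired boolean results.

    gold[i] -- True if the gold/reference standard says subject i has the condition
    test[i] -- True if the diagnostic test under investigation is positive for subject i
    """
    counts = Counter()
    for g, t in zip(gold, test):
        if g and t:
            counts["TP"] += 1    # condition present, test positive
        elif g and not t:
            counts["FN"] += 1    # condition present, test negative
        elif not g and t:
            counts["FP"] += 1    # condition absent, test positive
        else:
            counts["TN"] += 1    # condition absent, test negative
    return counts

# Hypothetical example: 8 subjects
gold = [True, True, True, False, False, False, True, False]
test = [True, True, False, False, True, False, True, False]
print(tally_outcomes(gold, test))    # Counts: TP=3, FN=1, FP=1, TN=3
```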
Table 15: Results recommended for the evaluation of diagnostic test validity using an independent blind study with gold/reference standard comparison

Sensitivity
Description: The ratio of true positive results from the diagnostic test evaluated to the total number of positive results obtained with the gold/reference standard (TP/[TP+FN] using Figure 1) (DCEB, 1981b). Numerically, this is expressed as a value between 0 and 1. This is an index of the test's ability to detect the presence of the condition (DCEB, 1981b; Lalkhen and McCluskey, 2008; Altman and Bland, 1994a).
Desired values: Values closer to 1 indicate greater sensitivity.

Specificity
Description: The ratio of true negative results from the diagnostic test evaluated to the total number of negative results obtained with the gold/reference standard (TN/[FP+TN] using Figure 1) (DCEB, 1981b). Numerically, this is expressed as a value between 0 and 1. This is an index of the test's ability to detect the absence of the condition (DCEB, 1981b; Lalkhen and McCluskey, 2008; Altman and Bland, 1994a).
Desired values: Values closer to 1 indicate greater specificity.

Positive predictive value
Description: The ratio of true positive tests to the total positive tests obtained with the diagnostic test examined (TP/[TP+FP] using Figure 1) (DCEB, 1981b). Numerically, this is expressed as a value between 0 and 1. Indicates the proportion of positive tests that correctly diagnosed the presence of the condition (DCEB, 1981b; Lalkhen and McCluskey, 2008). This value can also be calculated from the sensitivity and specificity of the test and the prevalence of the condition in the population being tested ([sensitivity × prevalence]/[sensitivity × prevalence + (1 − specificity) × (1 − prevalence)]) (Altman and Bland, 1994b).
Desired values: Values closer to 1 indicate greater capability of correctly diagnosing the presence of the condition.

Negative predictive value
Description: The ratio of true negative tests to the total negative tests obtained with the diagnostic test examined (TN/[FN+TN] using Figure 1) (DCEB, 1981b). Numerically, this is expressed as a value between 0 and 1. Indicates the proportion of negative tests that correctly diagnosed the absence of the condition (DCEB, 1981b; Lalkhen and McCluskey, 2008). This value can also be calculated from the sensitivity and specificity of the test and the prevalence of the condition in the population being tested ([specificity × (1 − prevalence)]/[(1 − sensitivity) × prevalence + specificity × (1 − prevalence)]) (Altman and Bland, 1994b).
Desired values: Values closer to 1 indicate greater capability of correctly diagnosing the absence of the condition.

Accuracy
Description: The ratio of the total number of true positive and true negative tests from the diagnostic test evaluated to the total number of tests conducted ([TP+TN]/n using Figure 1) (DCEB, 1981b). Numerically, this is expressed as a value between 0 and 1. It is the overall rate of agreement between the diagnostic test examined and the gold/reference standard (DCEB, 1981b; Bossuyt et al., 2003).
Desired values: Values closer to 1 indicate greater accuracy of the diagnostic test.
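The formulas in Table 15 can be gathered into a small helper. The sketch below is a minimal illustration (Python; the names and example counts are hypothetical) of the count-based statistics and of the prevalence-based forms of the predictive values given by Altman and Bland (1994b).

```python
def validity_statistics(tp, fp, fn, tn):
    """Validity statistics of Table 15 from the 2x2 counts in Figure 1."""
    n = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),    # TP / all gold-standard positives
        "specificity": tn / (tn + fp),    # TN / all gold-standard negatives
        "ppv":         tp / (tp + fp),    # TP / all positive diagnostic tests
        "npv":         tn / (tn + fn),    # TN / all negative diagnostic tests
        "accuracy":    (tp + tn) / n,     # overall agreement with the gold standard
    }

def predictive_values_from_prevalence(sensitivity, specificity, prevalence):
    """PPV and NPV from sensitivity, specificity and prevalence (Altman and Bland, 1994b)."""
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        (1 - sensitivity) * prevalence + specificity * (1 - prevalence))
    return ppv, npv

# Hypothetical example: TP=45, FP=5, FN=15, TN=35
print(validity_statistics(45, 5, 15, 35))
print(predictive_values_from_prevalence(0.75, 0.875, 0.10))
```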
Table 16: Results used for the evaluation of diagnostic test reliability

Test-Retest Reliability

Pearson's r correlation coefficient
Description: The ratio of the sum of the products of the paired results to the square root of the product of the sums of squares of each set of results. This provides a measure of the strength of the linear relationship between the results obtained on the different test administrations (Osborn, 2006; Blaisdell, 1998).
Desired values: Values range from -1 to 1 (Blaisdell, 1998). An r-value of 1 or -1 indicates a perfect linear relationship (Blaisdell, 1998). An r-value of 0 indicates no linear relationship (Blaisdell, 1998).

Intra-class correlation coefficient (ICC)
Description: The ratio of the variance between subjects to the total variance in all the results collected, where the total variance also includes the variance between the results obtained on different occasions for a single subject (Streiner and Norman, 2008; Cleophas, Zwinderman and Cleophas, 2006). Numerically, this is expressed as a value between 0 and 1, with 1 being perfect reliability/reproducibility and 0 being no reliability/reproducibility (Streiner and Norman, 2008; Cleophas, Zwinderman and Cleophas, 2006). Also known as test-retest stability (Streiner and Norman, 2008), proportion of variance, or correlation ratio (Cleophas, Zwinderman and Cleophas, 2006).
Desired values: It is reasonable to demand measures greater than 0.5 (Streiner and Norman, 2008).

Kuder-Richardson 20 (KR-20) α test
Description: An index of the homogeneity of measurements for a given set of results, used to assess the internal consistency reliability of a measurement instrument (Thompson, 2010; Ramsey et al., 1991). The squared KR-20 α value represents the proportion of score variance not resulting from error (Thompson, 2010).
Desired values: Values range from 0 to 1, with a value of 1 representing perfect internal consistency (Thompson, 2010). A KR-20 value >0.7 is acceptable; tests that use 50 or more items in their assessment should accept values >0.8 (Thompson, 2010). KR-20 values <0.7 indicate that the majority of score variance results from error (Thompson, 2010).

(A worked sketch of these test-retest statistics is given after Table 16.)
Inter-Rater Reliability

Percent agreement
Description: The percentage of tests for which the test administrators obtained the same results.
Desired values: Cadogan et al. (2011) indicated that a percent agreement greater than 80% is required for a test to be considered appropriate for inclusion in a clinical examination.

Intra-class correlation coefficient (ICC)
Description: Similar to the test-retest description, except that the variance among the results of different test administrators on a single subject is used instead of the variability between several tests administered on different occasions (Streiner and Norman, 2008; Gulliford, 2005; Tzannes and Murrell, 2002; Cleophas, Zwinderman and Cleophas, 2006).
Desired values: Same as the test-retest description, with the addition that Tzannes and Murrell (2002) determined an intra-class coefficient of 0.5-0.7 to be reasonable inter-rater reliability, and Tzannes et al. (2004) determined a good intra-class coefficient to be >0.65 and <0.31 to be poor inter-rater reliability.

Cohen's kappa
Description: The ratio of the sum of observed agreements minus the sum of expected agreements to the total number of observations minus the sum of expected agreements (Sheskin, 2004). Calculations indicate the degree of agreement between the results obtained by different (or the same) test administrators after correcting for the chance that observers agree at random (Sheskin, 2004). Calculations are based on results organized into contingency tables specific to the number of test administrators and outcomes measured (Sheskin, 2004).
Desired values: Landis and Koch (1977) suggest that kappa values of 0.00-0.20 indicate slight agreement; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; and 0.81-1.00 almost perfect agreement. Walsworth et al. (2008) indicated that a kappa value greater than 0.8 indicates strong agreement. Cadogan et al. (2011) indicated that a kappa value greater than 0.6 is required for a test to be considered appropriate for inclusion in a clinical examination.

Coefficient of inter-observer variability
Description: The ratio of inter-rater variability to the total observer-related variability (Haber et al., 2005). Total observer-related variability is the sum of intra- and inter-rater variability (Haber et al., 2005). Inter-rater variability is the variance in replicated measurements made on the same subject with all methods by all observers (Barnhart, Song and Haber, 2005; Haber et al., 2005).
Desired values: A higher coefficient of inter-observer variability indicates a lower level of inter-rater agreement (Haber et al., 2005).

(Sketches of the agreement and observer-variability statistics are given after Table 16.)
Coefficient of inter-observer agreement
Description: One minus the coefficient of inter-observer variability (Haber et al., 2005).
Desired values: A higher coefficient of inter-observer agreement indicates a greater level of inter-rater agreement (Haber et al., 2005). See above.

Intra-Rater Reliability

Intra-class correlation coefficient (ICC)
Description: Similar to the test-retest description, except that the variability among the results of a single test administrator is used instead of the variability between several tests administered on different occasions (Streiner and Norman, 2008; Cleophas, Zwinderman and Cleophas, 2006).
Desired values: Same as the test-retest description.

Cohen's kappa
Description: Similar to the inter-rater description, except that the variability among the results of a single test administrator is used instead of the variability between test administrators.
Desired values: Same as the inter-rater description.

Intra-observer variability
Description: Intra-observer variability is the variance in replicated measurements made on the same subject with the same method by the same observer (Barnhart, Song and Haber, 2005; Haber et al., 2005).
Desired values: See above.
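As a worked illustration of the test-retest statistics in Table 16, the sketch below computes Pearson's r from deviation scores, a one-way intra-class correlation coefficient (ICC(1,1)), and KR-20 for dichotomous items. It is a minimal sketch using common textbook formulations; the cited sources may use different estimators (several ICC forms exist), and all names and data structures are illustrative.

```python
from statistics import mean, pvariance

def pearson_r(x, y):
    """Sum of products of deviation scores over the root of the product of sums of squares."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def icc_oneway(scores):
    """One-way random-effects ICC(1,1): (MSB - MSW) / (MSB + (k - 1) * MSW).

    scores -- list of per-subject lists, each holding that subject's k repeated results.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    subject_means = [mean(row) for row in scores]
    msb = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)   # between-subject mean square
    msw = sum((v - m) ** 2 for row, m in zip(scores, subject_means)
              for v in row) / (n * (k - 1))                            # within-subject mean square
    return (msb - msw) / (msb + (k - 1) * msw)

def kr20(item_matrix):
    """KR-20 for 0/1 items: (k/(k-1)) * (1 - sum(p*q) / variance of total scores)."""
    k = len(item_matrix[0])                                 # number of items
    totals = [sum(row) for row in item_matrix]              # total score per respondent
    p = [mean(row[i] for row in item_matrix) for i in range(k)]   # proportion scoring 1 on item i
    pq = sum(pi * (1 - pi) for pi in p)
    return (k / (k - 1)) * (1 - pq / pvariance(totals))
```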
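A similarly minimal sketch of the agreement statistics for two raters follows: percent agreement and Cohen's kappa for categorical ratings. Published implementations additionally handle weighting and more than two raters; the names and example ratings here are hypothetical.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Percentage of subjects on which the two test administrators gave the same result."""
    agree = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * agree / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(rater_a) | set(rater_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical example: two raters classifying 10 subjects as positive/negative
a = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]
print(percent_agreement(a, b))          # 80.0
print(round(cohens_kappa(a, b), 2))     # approximately 0.58
```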
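Finally, the relationship between the observer-variability coefficients can be illustrated with a deliberately simplified sketch: intra-observer variability is taken as the average within-observer variance of replicates on a subject, inter-observer variability as the variance of the observer means, and the two coefficients are the ratio and its complement as defined in Table 16. These variance estimators are simplifying assumptions for illustration only; Haber et al. (2005) and Barnhart, Song and Haber (2005) define the components more formally.

```python
from statistics import mean, pvariance

def observer_variability_coefficients(replicates_by_observer):
    """Coefficients of inter-observer variability and agreement for one subject.

    replicates_by_observer -- list of lists; each inner list holds one observer's
    replicated measurements on the same subject with the same method.
    Simplified estimators (illustrative only):
      intra -- mean within-observer variance of the replicates
      inter -- variance of the observer means
    """
    intra = mean(pvariance(reps) for reps in replicates_by_observer)
    inter = pvariance([mean(reps) for reps in replicates_by_observer])
    variability = inter / (intra + inter)   # coefficient of inter-observer variability
    agreement = 1 - variability             # coefficient of inter-observer agreement
    return variability, agreement

# Hypothetical example: three observers, three replicate measurements each
print(observer_variability_coefficients([[10.1, 10.3, 10.2],
                                          [10.6, 10.5, 10.7],
                                          [10.0, 10.2, 10.1]]))
```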