SROC Curve. S. D. Walter McMaster University, Hamilton, Ontario, Canada. Petra Macaskill University of Sydney, NSW, Australia INTRODUCTION

SROC Curve S. D. Wlter McMster University, Hmilton, Ontrio, Cnd Petr Mcskill University of Sydney, NSW, Austrli INTRODUCTION The summry receiver operting chrcteristic (SROC) curve hs been recommended to represent the performnce of dignostic test, bsed on dt from metnlysis. Under fied-effect, logit-threshold model, the position of the SROC curve cn be chrcterized in terms of the overll dignostic odds rtio nd the mgnitude of interstudy heterogeneity. The Are Under the Curve (AUC) nd n inde Q* re potentilly useful summry mesures. It is shown tht AUC is mimized when the study odds rtios re homogeneous, nd tht it is quite robust to heterogeneity. An eplicit upper bound is derived for AUC in the homogeneous sitution, nd lower bound bsed on the limit cse Q*, defined by the point where sensitivity equls specificity: Q* is invrint to heterogeneity. The stndrd error of AUC is derived for homogeneous studies, nd is shown to be resonble pproimtion with heterogeneous studies. AUC nd its stndrd error re esily computed in the homogeneous cse, nd void the need for numericl integrtion in the more generl cse. SE(AUC) nd SE(Q*) re numericlly close, with SE(Q*) being lrger if the odds rtio is very lrge. Motivtion for the use of the AUC nd Q* mesures s summries of the SROC curve re discussed. A multilevel mied model is lso described, in which test ccurcy nd threshold re llowed to vry between studies. This provides more generl frmework, within which the fied effect model is specil cse. The mied model provides for the direct estimtion of summry vlues of sensitivity, specificity, nd likelihood rtios. Byesin Mrkov Chin Monte Crlo (MCMC) methods re required to fit the mied model, nd pproprite softwre is becoming more vilble. META-ANALYSIS OF DIAGNOSTIC TEST DATA Systemtic reviews of primry studies re becoming incresingly importnt for summrizing evidence bout the ccurcy of dignostic tests. Guidelines for the conduct of such reviews [1 3] include defining the objectives, retrievl of the relevnt literture, dt etrction, met-nlytic methods for obtining summry estimtes of test ccurcy, nd investigting resons for vrition in test ccurcy cross studies. The receiver operting chrcteristic (ROC) curve is well estblished s method of summrizing the performnce of dignostic test within single study. [4 6] It indictes the reltionship between the true positive rte (TPR) nd the flse positive rte (FPR) of the test, s the threshold used to distinguish disese cses from noncses vries. For instnce, the threshold might be defined level of serum cholesterol s mrker of crdiovsculr disese, or prticulr level of cellulr bnormlity s mrker of mlignncy. The summry receiver operting chrcteristic (SROC) curve nd the re under the curve (AUC) hve been proposed s wy to describe dignostic dt in the contet of metnlysis. [1,2,7 10] A schemtic ROC curve is shown in Fig. 1. All of its dt points come from single study, nd re defined by the results rising when one of severl lterntive thresholds (or cut points) on the test results is used to discriminte between disese cses nd noncses. Liberl thresholds tht give high vlues of TPR will tend to incorrectly lbel reltively high frction of the noncses s cses, so the flse positive rte (FPR) will be high. Conversely, conservtive thresholds yield incorrect lbels for noncses rther infrequently, but t the cost of detecting smller frction of the true cses; in this sitution, the vlue of TPR nd FPR re both reltively low. Thus one generlly epects TPR nd FPR to be positively ssocited. In the ROC spce, points ner the lower-left corner correspond to conservtive thresholds (low TPR nd low FPR vlues), nd points ner the upperright corner correspond to liberl thresholds (high TPR nd high FPR vlues). In single study, chnging the threshold necessrily results in monotonic chnges in TPR nd FPR. Accordingly, the ROC curve cn lwys be empiriclly represented, in its simplest form by connecting the dt points s shown in Fig. 1. Alterntively, smoothed curves Encyclopedi of Biophrmceuticl Sttistics 1 DOI: 10.1081/E-EBS 120024189 Copyright D 2004 by Mrcel Dekker, Inc. All rights reserved.

2 SROC Curve discriminting cses from noncses. S is function of the dignostic threshold in study, with high vlues corresponding to liberl inclusion criteri for cses. S =0 when TPR=1 FPR, i.e., on the ntidigonl from the top-left to bottom-right corners of the SROC spce. The regression eqution D ¼ þ bs ð3þ Fig. 1 Schemtic ROC curve. cn lso be fitted using ltent vrible pproch, bsed on model tht the cse nd noncse test vlues follow norml or logistic distributions. [11 19] In met-nlysis, the units of nlysis re seprte studies. In the simplest cse, ech study contributes n estimte of TPR nd FPR. The SROC curve represents the reltionship between TPR nd FPR cross studies, recognizing they my hve used different thresholds. In contrst to the ROC nlysis, the set of (FPR, TPR) points does not necessrily yield unique, monotonic curve. Severl methods hve been proposed for fitting n SROC curve; [7,10,20,21] this pper will focus on the properties of the populr method developed by Moses et l. [10] An lterntive but more comple method proposed by Rutter nd Gtsonis [20,21] is lso outlined nd its potentil dvntges re discussed. LOGIT-THRESHOLD FIXED EFFECT MODEL FOR SUMMARY RECEIVER OPERATING CHARACTERISTIC CURVE The most populr method to obtin smoothed fit of the SROC curve is to use the regression model proposed by Moses et l. [10] The dependent nd independent vribles re TPR FPR D ¼ ln ln ð1þ 1 TPR 1 FPR nd TPR FPR S ¼ ln þ ln 1 TPR 1 FPR respectively. D is equivlent to the dignostic log odds rtio, ln(or), which conveys the test s ccurcy in ð2þ cn be fitted by stndrd lest squres methods, ssuming tht D is pproimtely normlly distributed for given vlue of S. Optionlly, weights cn be employed to reflect interstudy heterogeneity with respect to the smple vrince of D, or robust fitting procedure my lso be used. [10] The coefficient b in Eq. 3 represents the dependence of the test ccurcy on threshold. If b0, then the studies will be referred to s homogeneous; they cn then be summrized by n overll OR, noting tht =ln(or). If b 6¼ 0, then the studies re referred to s heterogeneous with respect to OR. In this cse, cn be thought of s the vlue of ln(or) when S=0. This model ssumes logistic distributions for the test vlues in the cses nd noncses, but norml distributions my lso constitute n dequte pproimtion. However, the model is generlly fitted without directly ppeling to the logistic distribution ssumption. If the underlying test results ctully hve logistic distributions, then S for given study cn be epressed s function of the cut point tht gives rise to the sensitivity nd specificity for tht study. [10] If the vrinces of the distributions of test results for the cses nd noncses re equl, then the SROC is symmetric bout the digonl line where TPR=1 FPR. If the vrinces re unequl, the resulting SROC will be symmetric. No ssumptions re mde bout the distribution of S. Covrites my be dded to the model to ssess whether test ccurcy vries systemticlly with other study-relted fctors. [22,23] If the estimte of test ccurcy for ech study is weighted by the inverse of its vrince, one is ssuming tht smpling error is the only source of vribility between studies. However, this ssumption my be unrelistic, given tht studies re likely to vry in number of respects, including the spectrum of disese, conditions under which the test ws dministered, nd other study nd ptient chrcteristics tht could ffect test ccurcy. [24 26] Although the prmeters nd b re both fied becuse they re ssumed to be constnt cross studies, pplying equl weights to studies (i.e., fitting n unweighted regression) hs been empiriclly shown to give results tht re consistent with ssuming rndom effect. [8,10,20] This is becuse both the within- nd between-study vrinces re tken into ccount, thereby giving reltively more weight to smller studies, s would occur in rndom effect model. Irwig et l. [8] recommend the

SROC Curve 3 unweighted nlysis, becuse weights bsed on smple vrinces my give too much emphsis to inccurte studies, nd hence give bised SROC curve. Moses et l. [10] lso recommend the unweighted pproch. Once the regression hs been fitted, one cn reverse the trnsformtions (Eqs. 1 nd 2) nd hence deduce the reltionship between TPR nd FPR s FPR ð1 þ bþ=ð1 bþ ep 1 b 1 FPR TPR ¼ 1 þ ep 1 b FPR 1 FPR ð1 þ bþ=ð1 bþ Epression 4 gives TPR t ny given vlue of FPR, nd hence defines the entire SROC curve. While there my be interest in identifying prticulr points on the curve, it is lso useful to hve n overll summry mesure of the curve s behvior. One pproprite mesure is the Are Under the Curve (AUC), which cn be clculted s ð1 þ bþ=ð1 bþ ep AUC ¼ Z 1 0 1 b 1 þ ep 1 b 1 1 ð4þ ð1 þ bþ=ð1 bþ d In generl, AUC must be numericlly obtined, becuse there is no closed form for epression 5. Adoption of summry inde is helpful for succinct reporting of given dt set, especilly when limited dt preclude the relible identifiction of prticulr points on the curve. The AUC mesure is widely used in ROC nlysis, where it cn be interpreted s the probbility tht the test vlues for rndom pir of disesed nd nondisesed individuls would be correctly rnked; it lso represents the (unweighted) verge of TPR over ll possible vlues of FPR. The AUC is lso nturl cndidte summry for n SROC nlysis. We will consider the lterntive inde Q*, which will be defined s point of indifference on the SROC curve, where the probbilities of n incorrect test result re equl for disese cses nd noncses. The prtil AUC hs lso been proposed s summry mesure, this being the re under some restricted portion of the curve corresponding to FPR vlues of clinicl interest, or in which study dt re locted. EMPIRICAL BEHAVIOR OF THE SUMMARY RECEIVER OPERATING CHARACTERISTIC CURVE Figure 2 shows set of three symmetric SROC curves with b=0, which occurs when the studies re homogeneous nd thus ehibit no reltionship between OR nd ð5þ Fig. 2 Summry receiver operting chrcteristic with vrious vlues of (b=0). (View this rt in color t www.dekker.com.) threshold (S). The curves were obtined by numericl evlution of Eq. 4 for =1.5, 2, nd 3, which correspond to OR=4.5, 7.4, nd 20.1, respectively. As increses, the SROC curve moves closer to its idel position ner the upper-left corner. If!1, then AUC!1, which would indicte perfect test hving 100% sensitivity nd specificity, nd no errors in distinguishing cses from noncses. In contrst, if 0 (or OR1), the curve lies close to the digonl TPR=FPR in the SROC spce; then, AUC=1/2 nd the test performs no better thn chnce. For completeness, we mention situtions where <0. These correspond to OR < 1, when the test discrimintes cses nd noncses in the wrong direction, nd worse thn t rndom. As OR!0, AUC!0, nd the curve lies close to the lower-right corner of the SROC spce. Such situtions re unlikely to occur in prctice. We now consider the cse of heterogeneous studies under model 3, where dignostic ccurcy depends on threshold (b6¼0). Fig. 3 shows three SROC curves derived from Eq. 4, ll with = 2 but different vlues of b. Nonzero vlues of b give symmetric curves. If b>0, the curve initilly rises less steeply thn the symmetric curve (with b=0), but then it rises more steeply nd crosses the symmetric curve to chieve reltively high vlues of TPR for high vlues of FPR. The SROC curves with b <0 ehibit the opposite behvior see, for emple, the curve with b= 0.5 in Fig. 3. Interestingly, s suggested by Fig. 3 nd s shown by Moses et l., [10] the fmily of curves defined by fied vlue of ll pss through common point, locted on the ntidigonl, where TPR=1 FPR, or sensitivity equls specificity. Tht point hs coordintes TPR ¼ epð=2þ 1 þ epð=2þ ð6þ

4 SROC Curve Fig. 3 Summry receiver operting chrcteristic with vrious vlues of b (=2). (View this rt in color t www.dekker.com.) nd FPR ¼ 1 1 þ epð=2þ Moses et l. [10] denote the vlue of TPR t this point by Q*. In ROC nlysis, Q* hs been suggested s further summry mesure. It corresponds to the point closest to the idel top-left corner of the SROC spce in symmetric curves. The fct tht ll SROC curves with given vlue of pss through the sme Q* point mens tht Q* conveys no dditionl sttisticl informtion beyond the odds rtio. However, use of Q* cn be motivted by its simple interpretbility, given tht Q* is the point on the SROC curve where TPR=1 FPR; thus it represents the dignostic threshold t which the probbility of correct dignosis is constnt for ll subjects. Also, s will be shown lter, the eistence of common vlue of Q* permits the derivtion of useful lower bound for AUC. As consequence of ssuming model 3, when b6¼0, the SROC curve hs region where TPR<FPR, which lies below the min digonl. This region my be seen, for emple, ner the lower-left corner in the curve for b=0.5 in Fig. 3. In this region, the test would theoreticlly be performing worse thn t rndom. However, in prctice, this region is very smll. Even if reltively strong heterogeneity (with b> 0) is present, the improper prt of the SROC curve (where TPR<FPR) includes very low vlues of TPR, nd these vlues correspond to test thresholds tht re unlikely to be cceptble in clinicl prctice. On the other hnd, if heterogeneity with b<0 occurs, the improper prt of the curve corresponds to ð7þ very high vlues of FPR, which would gin hve no clinicl relevnce. See Ref. [27] for numericl detils of this effect. We now briefly discuss the behvior of model 3 for etreme vlues of b. As b!1, the fitted SROC curve becomes progressively steeper, nd in the limit cse t b =1 it degenertes to verticl line. This could occur, for emple, if there ws no vrition in FPR between studies. However, the line still psses through the common Q* point defined by Eqs. 6 nd 7. Once b>1, the curve inverts nd shows negtive reltionship of TPR to FPR. This is implusible in prctice, ecept perhps by chnce in smll smples. Similr behvior is seen if b is ner or below 1. Ner b= 1, the curve is horizontl, indicting no reltionship of TPR to FPR. This could occur if there ws no vrition in TPR between studies. If b< 1, the curve gin inverts (to different shpe) nd suggests n implusible negtive reltionship between TPR nd FPR. Despite this, ll the curves for jbj>1 possess the common vlue of Q* given by Eqs. 6 nd 7. In prctice, dt yielding jbj >1 re unlikely. Empiricl eperience suggests tht prcticl met-nlysis dt sets often hve b close to 0, nd re rrely lrger thn 0.5 in bsolute vlue. SUMMARY MEASURES: AREA UNDER THE CURVE AND Q* The AUC is populr inde of the overll performnce of test. [4 6,18,19,28] As mentioned erlier, AUC rnges from 1 for perfect test tht lwys correctly dignoses, to 0 for test tht never correctly dignoses. In single studies, AUC cn be interpreted s the probbility tht the test will correctly rnk rndomly chosen cse/noncse pir with respect to their test vlues. [29] The AUC is intended to fulfill the sme function in met-nlyses, nd in effect one ssumes tht the SROC curve ccurtely conveys test performnce t the individul subject level. Although numericl integrtion is required in generl to obtin AUC, some specil cses re of interest becuse they yield ect nlytic epressions nd comprtive results, s we now discuss. As epected, AUC increses with, for fied b. By emining Eq. 5, one cn prove tht for given vlue of, AUC is mimized when b=0, implying tht AUC is optimlly lrge in homogeneous studies. [27] Furthermore, one my show tht AUC is symmetric in b, so tht negtive vlues of b yield the sme vlue of AUC s the equivlent positive vlue, this despite the very different shpes of their ssocited SROC curves. If =b=0, then from Eq. 5 AUC ¼ Z 1 0 d ¼ 1 2

SROC Curve 5 Tble 1 AUC hom, Q*, nd their difference for vrious vlues of the dignostic odds rtio: homogeneous cse Odds rtio AUC hom Q* AUC hom Q* 0.5 0.386 0.414 0.028 1 0.500 0.500 0.000 1.5 0.567 0.551 0.017 2 0.614 0.586 0.028 3 0.676 0.634 0.042 4 0.717 0.667 0.051 5 0.747 0.691 0.056 10 0.827 0.760 0.067 20 0.887 0.817 0.069 30 0.913 0.846 0.068 40 0.929 0.863 0.065 50 0.939 0.876 0.063 By symmetry, it is evident tht AUC=1/2 when =0 even if b6¼0, so AUC=1/2 indictes rndom overll performnce for ny set of studies. In the homogeneous cse b= 0, the generl epression 5 becomes epðþ AUC ¼ Z 1 0 1 1 þ epðþ 1 d In this cse, we cn obtin n ect solution OR AUC hom ¼ ½ðOR 1Þ lnðorþš 2 ðor 1Þ ð8þ where AUC hom indictes the AUC for homogeneous studies, nd OR = ep(). If =0 (or OR =1), then the specil vlue AUC hom =1/2 should be used in plce of Eq. 8, which is then degenerte. Epression 8 cn be used to evlute AUC for homogeneous studies, without the need for numericl integrtion. As demonstrted lter, epression 8 is lso useful upper bound nd close pproimtion for AUC in heterogeneous studies. For reference, Tble 1 shows the vlue of AUC, Q*, nd their difference for rnge of vlues of OR in the homogeneous cse. By noting tht AUC declines with incresing b, nd tht the limit curve with b!1 psses through the common Q* point, from Eqs. 6 nd 7, we my deduce tht lower bound for AUC in curves with given vlue of is pffiffiffiffiffiffiffi Q epð=2þ ¼ 1 þ epð=2þ ¼ OR p 1 þ ffiffiffiffiffiffiffi ð9þ OR which is equivlent to the TPR vlue given in Eq. 6. Also, Q* from Eq. 9 nd the mimum vlue AUC hom from Eq. 8 provide esily computble lower nd upper bounds, respectively, for AUC with ny given vlue of >0. Numericl Tbultions Numericl evlutions of epressions 8 nd 9 demonstrte tht in the homogeneous cse, the difference between AUC hom nd Q* increses for moderte vlues of OR, but is never more thn bout 7%. [27] The mimum difference occurs t = 2.85 (or OR = 17.3), which vlue is the solution to trnscendentl eqution. For lrger vlues of, the difference declines very slowly, with limit vlue AUC hom Q*=0 t =1. The numericl integrtion of Eq. 5 for the heterogeneous cse llows one to ssess the impct of heterogeneity on the vlue of AUC. For fied vlue of OR, AUC declines slowly s b increses from 0 (homogeneous studies) to lrger vlues (incresing heterogeneity). However, the dependence of AUC on b is wek, nd the dominnt effect on AUC is the vlue of OR (or ). For jbj<0.4, the percentge chnge in AUC compred to the homogeneous cse is less thn 2%. Accordingly, AUC hom provides good pproimtion to AUC even in heterogeneous studies. [27] STANDARD ERRORS OF AREA UNDER THE CURVE AND Q* We first consider the smple vrition in dauc. From Eq. 5, we see tht AUC is function of the regression prmeters nd b, nd hence the vribility in dauc is function of the smple vrition in â nd bˆ. Using the delt method, n pproimte vrince for dauc is vrð daucþ ¼ @AUC 2 vrð^þ @ þ @AUC 2 vrð^bþ þ2 @AUC @b @ @AUC covð^; ^bþ ð10þ @b where, from Eq. 5 @AUC 1 ¼ ep @ 1 b 1 b p Z 1 1 h 0 pep i 2 d 1 þ 1 1 b @AUC @b ¼ 1 2 ep 1 b 1 b p h i Z 1 þ 2ln 1 1 h 0 pep i 2 d 1 þ 1 1 b ð11þ ð12þ

6 SROC Curve nd p=(1+b)/(1 b). The vrinces nd covrince of â nd ^b cn be directly obtined from the stndrd regression softwre used to fit model 3 to the dt. In generl, evlution of vr( dauc) gin requires numericl integrtion. However, for the specil cse of homogeneous studies where b = 0, n pproimte, lrgesmple epression is possible. Using the delt method, we my then obtin SEð dauc hom Þ ¼ OR ½ðOR þ 1Þ ln OR 3 ðor 1Þ 2ðOR 1ÞŠSEð^Þ ð13þ Eq. 13 implies SE( dauc hom ) is symmetric in ln(or), lthough we would usully be concerned with vlues OR>1. When OR=1, epression 13 is degenerte, but by using L Hôpitl s rule, one cn show tht in the neighborhood of OR=1, SEð daucþ SEð^Þ=6 ð14þ The delt method lso yields n pproimte stndrd error for ^Q* s pffiffiffiffiffiffiffi SEð ^Q OR Þ ¼ pffiffiffiffiffiffiffi 2ð OR þ 1Þ 2 SEð^Þ ð15þ Moses et l. [10] give this result in n lterntive form involving cosh(/4). Numericl evlutions show tht the stndrd errors of Q* nd AUC re both mimized when OR=1, so the lest precise sitution for either inde is for tests tht hve close to rndom performnce. In the region of OR=1, SE( dauc)1/6 nd SE( ^Q*)1/8. Comprisons show SE( dauc)>se( ^Q*) for OR vlues between 1 nd 17.3, the sme vlue ssocited with the vlues of AUC nd Q*; for OR > 17.3, AUC hs the smller stndrd error. Both stndrd errors pproch 0 s!1. For heterogeneous studies, SE( dauc) is not necessrily mimized when = 0, becuse it involves vr(^b) nd cov(â, ^b) s well s vr(â). However, it is the lst of these terms tht domintes, with the other two mking reltively smll numericl contributions. Thus in prctice SE( dauc) is mimized pproimtely when OR=1, even for heterogeneous studies. For most vlues of OR, the stndrd error declines by up to bout 10% t the most etreme level of b. OTHER ISSUES WITH THE FIXED-EFFECT MODEL So fr, we hve estblished some bsic properties of the SROC curve under model 3, nd we found tht the vlue AUC hom ssocited with homogeneous studies is resonble upper-bound pproimtion even for the generl AUC with heterogeneous studies. The corresponding stndrd error lso provides good pproimtion for heterogeneous studies, nd is conservtively lrge ecept in some cses of etreme heterogeneity. In the presence of strong heterogeneity, the AUC would be n indequte summry of the dt nywy, nd it would then be preferble to emine the SROC curve in more detil, including specific TPR vlues for given FPR. Q* lso provides n esily computed lower bound for AUC, but empiriclly ppers to be not quite s good n pproimtion s AUC hom. The motivtion for Q* sn inde in its own right is tht it is locted where the SROC curve crosses the ntidigonl from (0,1) to (1,0) of the SROC spce. Hence TPR=1 FPR t Q*, nd so the probbility of n incorrect result from the test is the sme for cses nd noncses. Q* is therefore point of indifference between flse-positive nd flse-negtive dignostic errors. In homogeneous studies, Q* is the point on the SROC curve lying closest to the optiml upper-left corner, but this is not true with heterogeneous studies. Use of Q* s the summry mesure ssumes implicitly tht flse-negtive nd flse-positive test results re of equl vlue. In prctice, there my be different costs ssocited with these two types of error: One wishes to minimize flse-positive results becuse of the dditionl testing required to estblish the correct dignosis (noncse), nd becuse the dditionl tests tend to be more costly, invsive, or risky. On the other hnd, flsenegtive results led to disese cses being missed, with possible deteriortion in their prognosis. In generl, one must weigh the flse-positive nd flse-negtive errors to blnce the overll performnce of the test in popultion; the optiml dignostic threshold does not then correspond in generl to the Q* point. One cn motivte the use of AUC s n inde tht represents the probbility tht the test will correctly rnk cse/noncse pir of subjects. [4,29] It cn lso be thought of s the verge TPR over the entire rnge of FPR vlues. Becuse it summrizes the whole SROC curve, AUC hs symmetric interprettion with respect to either TPR or FPR. The AUC is ffected by the whole SROC curve, including regions with limited or no dt, or by sectors corresponding to TPR nd FPR vlues tht re unlikely to occur in prctice. Accordingly, it hs been suggested tht prtil SROC curves be dopted, by limiting ttention to those portions of the SROC curve of clinicl interest, or where dt re ctully observed. [30 33] There re some unresolved issues on the use of prtil SROC curves nd the corresponding prtil AUC. First, the prtil AUC my be thought of s the verge TPR within restricted rnge of FPR, but not vice vers. Hence the prtil AUC hs n symmetric interprettion with respect to TPR nd FPR; on the other hnd, the complete

SROC Curve 7 AUC enjoys symmetric interprettion with respect to both types of test error. Second, there my be some rbitrriness in which regions of the curve to select. One might emine the SROC curve within prespecified rnge of TPR vlues, for instnce by defining mimum cceptble level for clinicl prctice. Another pproch would be to choose the region ccording to the observed rnge of dt in the met-nlysis; the choice itself would then be ffected by smpling vrition, which might be substntil in met-nlyses involving smll studies. Furthermore, resonble choice for one test my be unresonble for nother test, thus complicting their comprison. The nlytic methods for AUC presented in this pper cn be etended to cover the prtil AUC nd its stndrd error. We my epect the effect of interstudy heterogeneity to be greter for the prtil AUC thn the wek effect seen for the full AUC. For instnce, if ttention is limited to vlues of FPR<0.2 when =2 (Fig. 3), the corresponding prtil AUC will be greter when b>0 thn when b<0. Hence the prtil AUC lcks symmetry with respect to b. On the other hnd, the complete AUC hs some compensting decreses in contributions to the re t higher vlues of FPR, so there re only modest chnges in the totl AUC s b vries (Fig. 3). We my lso recll tht AUC is symmetric with respect to b, so tht the sme summry vlue will be obtined for given strength of dependence of dignostic ccurcy on the test threshold. The prtil AUC does not possess these properties, nd hence it will show greter dependence on the degree of interstudy heterogeneity. Further work is needed to eplore the properties of the prtil AUC in more detil. Similr investigtion is needed for other summry indices, such s the ASC (Are Swept out by the Curve), PLC (Projected Length of the Curve), [34] nd the Gini/Lorenz coefficients. [35] Smpling vribility nd dependence between test ccurcy nd test threshold re unlikely to fully ccount for the observed heterogeneity in test ccurcy between studies. Additionl sources of heterogeneity cn be eplored by emining the ssocition between test ccurcy nd study level covrite informtion. [3,8,36] However, given the limited dt on ptient nd study design chrcteristics tht re typiclly vilble, it is unlikely tht study level vribles will fully eplin the ecess vribility in test ccurcy. [26] Inppropritely ssuming fied effect cn led to spurious precision nd result in covrites being incorrectly identified s significntly ssocited with test ccurcy. [26] A mied model tht includes rndom effects for test ccurcy cn tke into ccount uneplined vribility between studies. This pproch is used in the hierrchicl (mied) model outlined briefly below, which, unlike the Moses pproch, directly models the TPR nd FPR for ech study. HIERARCHICAL (MIXED) EFFECTS MODEL An lterntive pproch to fitting SROC curves hs been proposed by Rutter nd Gtsonis. [20,21] As with the Moses model, ech study (i) contributes n estimte of TPR nd FPR to the met-nlysis. For ech study, the number in the disesed group who hve positive test result (y i1 ) nd the number in the nondisesed group who hve positive test result (y i2 ) re both ssumed to follow binomil distribution such tht y ij b(n ij, p ij ), where p ij represents the probbility of positive test for group j ( j=1,2) in study i, nd n ij represents the totl number of positive nd negtive test results in group j. The model is bsed on n ordinl logistic regression model [37] nd tkes the form logit(p ij )=(y i + i dis ij )ep( b dis ij ) where dis ij represents the true disese sttus (coded s 0.5 for the nondisesed nd 0.5 for the disesed). The model prmeters estimte the implicit threshold (y i ) for positive test result in study i nd the dignostic ccurcy ( i ) for study i. The prmeter b llows for n ssocition between test ccurcy nd test threshold. The estimted SROC is symmetric when b =0. This hierrchicl (multilevel) model tkes into ccount both the within- nd between-study vribility. The within-study vribility is modeled t the first level. For the ith study, logit(tpr i ) nd logit(fpr i ) re modeled to estimte the implicit threshold (y i ) nd dignostic ccurcy ( i ) for tht study. Hence rndom effects re ssumed for both threshold nd ccurcy s these my vry cross studies. When the SROC is symmetric (i.e., b=0), the level 1 nlysis for ech study reduces to ordinry logistic regression model where i is estimted by logit(tpr i ) logit(fpr i ) (which is equivlent to the observed vlue D i in study i for the dependent vrible in model 3) nd y i is estimted by (logit(tpr i )+ logit(fpr i ))/2 (which is equivlent to S i /2, where S i is the observed vlue in study i for the independent vrible in model 3). The smpling vribility for both TPR i nd FPR i is tken into ccount. The vribility between studies is modeled t level 2. The rndom effects for test threshold nd ccurcy re ssumed to be independent (uncorrelted) nd normlly distributed with y i N(Y, t y 2 ) nd i N(L, t 2 ). The hyperprmeters Y nd L represent the epected threshold nd ccurcy, respectively, cross ll studies in the nlysis. If t y 2 =0 nd t 2 =0, the model reduces to fied-effect model. Becuse ech study contributes only one point in ROC spce to the nlysis, single study does not provide informtion on the shpe of the SROC. Hence the shpe prmeter (b) is ssumed to be fied effect, which is estimted from the study points jointly considered. The hierrchicl model is fitted using Mrkov Chin Monte Crlo (MCMC) Byesin methods. This requires third level to specify prior distributions for ll model prmeters.

8 SROC Curve A summry ROC curve cn be constructed by choosing rnge of vlues of FPR nd using the estimted model prmeters to compute the predicted vlues for sensitivity. The epected TPR t chosen FPR vlue is given by TPR(FPR)=1/(1+ep[ {Lep( 0.5b)+logit(FPR) ep( b)}). The epected operting point on the curve is estimted using E(TPR)=1/(1+ep[ {(Y+L/2)ep( b/2)}]) nd E(FPR)=1/(1+ep[ {(Y L/2)ep( b/2)}]). Covrites cn be dded to the model to ssess whether threshold, ccurcy, nd/or the shpe of the SROC vry with ptient or study chrcteristics. Such terms re generlly fitted s fied effects, but could lso be included s rndom effects. Advntges of the Hierrchicl Model The Rutter nd Gtsonis method provides generl frmework for the met-nlysis of dignostic test performnce. The Moses method does not directly model the TPR nd FPR vlues for ech study. Becuse D nd S re functions of both TPR nd FPR, the prmeters for the Moses model cnnot be used to obtin the epected operting point on the SROC, the corresponding likelihood rtios t tht point, or the stndrd errors of these estimtes. The hierrchicl model cn be used to model vrition in test threshold s well s test ccurcy through the inclusion of study level covrites, wheres the Moses pproch cn only be used to model vrition in test ccurcy. However, the Moses model does hve the dvntge tht no ssumption is mde bout the distribution of S. Lstly, the hierrchicl model incorportes both the within- nd between-study vribility, nd tkes ccount of uneplined heterogeneity between studies through the inclusion of rndom effects for test threshold nd ccurcy. Fitting the Model Despite the potentil dvntges of the hierrchicl model, it hs not been widely dopted. This is most likely becuse of the necessity to use Mrkov Chin Monte Crlo (MCMC) methods to fit it. [20] Rutter nd Gtsonis [21] hve demonstrted how the model cn be fitted using BUGS (Byesin inference Using Gibbs Smpling) softwre. However, they comment tht the process involves MCMC simultion, nd is still reltively comple. The recent vilbility of softwre for fitting nonliner mied models in SAS [38] provides n lterntive nd potentilly more strightforwrd pproch tht does not require specifiction of prior distributions. PROC NLMIXED in SAS llows for nonliner function of the model prmeters, nd for non-norml error distribution, but the rndom effects re restricted to be normlly distributed. Summry estimtes of TPR, FPR, likelihood rtios cn be computed nd their symptotic confidence intervls estimted using the delt method. The distribution of the rndom effects my be checked by emining histogrm nd norml probbility plot of their Empiricl Byes (EB) estimtes. This follows the sme pproch dopted for checking distributionl ssumptions for liner mied models. [39,40] CONCLUSION The mied model provides rigorous method of nlysis tht tkes proper ccount of the sources of vribility in test performnce between studies. The resulting stndrd errors (nd confidence intervls) reflect the uneplined heterogeneity tht is likely to be present in mny metnlyses. The mied model lso llows estimtion of cliniclly relevnt indictors of test performnce such s the epected TPR, FPR nd likelihood rtios. A rnge of lterntive fied-effect nd rndom effects methods re vilble for obtining these summry mesures. [22] However, they cn led to results tht re potentilly inconsistent within the sme met-nlysis. For instnce, for the sme dt, summry estimtes of TPR nd FPR could be computed tht re not consistent with summry estimtes of likelihood rtios or n SROC estimted using the Moses model. A useful feture of the mied model is tht other pproches cn be regrded s specil cses. The compleity of the Byesin (MCMC) method for fitting the mied effects model is likely to discourge its use. The NLMIXED procedure is more ccessible, but t present this softwre cn only del with two levels in the response vrible. For met-nlyses where studies contribute more thn one (FPR, TPR) pir to the nlysis, n dditionl level would be required. A Byesin pproch potentilly provides greter fleibility for fitting the mied effects model in tht it llows the met-nlyst to specify lterntive distributions for the rndom effects nd lso llows prior informtion to be tken into ccount. However, in prctice, norml distribution is often ssumed for the rndom effects nd noninformtive priors re commonly used. Likelihood-bsed methods for estimting mied model prmeters re lso consistent with current pproches used in the met-nlysis of clinicl trils. [39,41,42] Model checks re required to ssess the dequcy of the distributionl ssumption for the rndom effects, but ssessing normlity will clerly be difficult in smll met-nlyses. [39] The Moses fied-effect SROC model hs the dvntge of simplicity, nd it mkes no ssumptions bout the distribution of S. It is useful s preliminry step before

SROC Curve 9 fitting the mied model. Mjor differences in the vribles found to be ssocited with test ccurcy or the shpe of the SROC would wrrnt investigtion s this my be indictive of convergence problems. An issue tht potentilly ffects ll of the methods discussed here is tht reference test (or gold stndrd) hs been ssumed to be error-free. In fct, the reference stndrd is often itself subject to mesurement error, s illustrted, for emple, in the observble differences between pthologists nd rdiologists in their ssessments of the sme smple mteril. Severl methods hve been proposed to correct for referent errors, but they primrily pertin to dt from single studies. Rther little hs been done on this problem in the contet of combining studies in met-nlysis, but one proposed pproch [43] hs been to use ltent clss frmework to etend the logitthreshold model (Eq. 3), nd to recognize the fct tht both the cndidte nd referent tests re potentilly subject to error. The true disese stte of n individul cnnot be directly observed, but n estimte of the probbility of n individul hving disese cn be obtined from the ltent clss model. This then permits dettenution of the errors in the dt, nd the resulting djustment to the SROC curve tends to led to n improved estimte of test performnce. Further development of the SROC methods re required to llow comprison of SROC curves when not ll component studies in met-nlyses re independent of one nother. For emple, some of the component studies my mke direct comprisons of two tests while other studies only evlute one. Methods to tke such dependencies into ccount hve been proposed for therpeutic studies, [44] nd etensions to dignostic test comprisons would be useful. The techniques discussed here pply to situtions where one hs only summry mesures (TPR nd FPR) from ech study. Sometimes one hs ccess to the individul level dt in ech study; one would then be ble to crry out multilevel nlysis, tking both internd intrstudy vrition into ccount, s well s the effects of subject-specific covrites. However, in prctice, current reporting of dignostic studies nd met-nlyses is methodologiclly poor, nd this level of detil in the dt would often be difficult to chieve. [1,36,45,46] Accordingly, further work to understnd the properties of the SROC curve bsed on summry dt from ech study seems wrrnted. REFERENCES 1. Irwig, L.; Tosteson, A.N.A.; Gtsonis, G.; Lu, J.; Colditz, G.; Chlmers, T.C.; Mosteller, F. Guidelines for metnlyses evluting dignostic tests. Ann. Intern. Med. 1994, 120 (8), 667 676. 2. Vmvks, E.C. Met-nlyses of studies of the dignostic ccurcy of lbortory tests. Arch. Pthol. Lb. Med. 1998, 122 (8), 675 686. 3. Deville, W.L.; Buntin, F. Guidelines for Conducting Systemtic Reviews of Studies Evluting the Accurcy of Dignostic Tests. In The Evidence Bse of Clinicl Dignosis; Knottnerus, J.A., Ed.; BMJ Books: London, 2002; 145 165. 4. Hnley, J.A. Receiver Operting Chrcteristic (ROC) Curves. In Encyclopedi of Biosttistics; Armitge, P., Colton, T., Eds.; Wiley: Chichester, 1998; 3738 3745. 5. Beck, J.R.; Shultz, E.K. The use of reltive operting chrcteristic (ROC) curves in test performnce evlution. Arch. Pthol. Lb. Med. 1986, 110 (1), 13 19. 6. Metz, C.E.; Hermn, B.A.; Shen, J.H. Mimum likelihood estimtion of receiver operting chrcteristic (ROC) curves from continuously-distributed dt. Stt. Med. 1998, 17 (9), 1033 1053. 7. Krdun, J.W.P.F.; Krdun, O.J.W.F. Comprtive dignostic performnce of three rdiologicl procedures for the detection of lumbr disk hernition. Methods Inf. Med. 1990, 29 (1), 12 22. 8. Irwig, L.; Mcskill, P.; Glsziou, P.; Fhey, M. Metnlytic methods for dignostic test ccurcy. J. Clin. Epidemiol. 1993, 48 (1), 119 130. 9. Midgette, A.S.; Stukel, T.A.; Littenberg, B. A metnlytic method for summrizing dignostic test performnces: Receiver-operting-chrcteristic-summry point estimtes. Med. Decis. Mk. 1993, 13 (3), 253 256. 10. Moses, L.E.; Shpiro, D.E.; Littenberg, B. Combining independent studies of dignostic tests into summry ROC curve: Dt-nlytic pproches nd some dditionl considertions. Stt. Med. 1993, 12 (14), 1293 1316. 11. Dorfmn, D.D.; Alf, E. Mimum likelihood estimtion of prmeters of signl detection theory nd determintion of confidence intervls Rting method dt. J. Mth. Psychol. 1969, 6 (3), 487 496. 12. Hnley, J.A. The use of the binorml model for prmetric ROC nlysis of quntittive dignostic tests. Stt. Med. 1996, 15 (14), 1575 1585. 13. McCullgh, P. Regression models for ordinl dt (with discussion). J. R. Stt. Soc., Ser. B 1980, 42 (2), 109 142. 14. M, G.; Hll, W.J. Confidence bnds for receiver operting chrcteristics curves. Med. Decis. Mk. 1993, 13 (3), 191 197. 15. Green, D.M.; Swets, J. Signl Detection Theory nd Psychophysics; Wiley: New York, 1966. 16. Hnley, J.A. Receiver operting chrcteristic (ROC) methodology: The stte of the rt. Crit. Rev. Dign. Imging 1989, 29 (3), 307 335. 17. Ogilvie, J.C.; Creelmn, C.D. Mimum likelihood estimtion of ROC curve prmeters. J. Mth. Psychol. 1968, 5 (3), 377 391. 18. Hnley, J.A.; McNeil, B.J. The mening nd use of the re under receiver operting chrcteristic (ROC) curve. Rdiology 1982, 143 (1), 29 36.

10 SROC Curve 19. Hilden, J. The re under the ROC curve nd its competitors. Med. Decis. Mk. 1991, 11 (2), 95 101. 20. Rutter, C.M.; Gtsonis, G. Regression methods for metnlysis of dignostic test dt. Acd. Rdiol. 1995, 2 (Suppl. 1), S48 S56. 21. Rutter, C.M.; Gtsonis, G. A hierrchicl regression pproch to met-nlysis of dignostic test ccurcy evlutions. Stt. Med. 2001, 20 (19), 2865 2884. 22. Deeks, J.J. Systemtic Reviews of Evlutions of Dignostic nd Screening Tests. In Systemtic Reviews in Helth Cre: Met-Anlysis in Contet; Egger, M., Dvey- Smith, G., Altmn, D.G., Eds.; BMJ Publishing Group: London, 2001. 23. Fhey, M.T.; Irwig, L.; Mcskill, P. Met-nlysis of Pp smer ccurcy. Am. J. Epidemiol. 1995, 141 (7), 680 689. 24. Irwig, L.; Bossuyt, P.; Glsziou, P.; Gtsonis, G.; Lijmer, J. Designing Studies to Ensure tht Estimtes of Test Accurcy Will Trvel. In Designing Studies on Dignostic Tests; Knottnerus, A., Ed.; BMJ Books: London, 2001. 25. Rnsohoff, D.F.; Feinstein, A.R. Problems of spectrum nd bis in evluting the efficcy of dignostic tests. N. Engl. J. Med. 1978, 299 (17), 926 930. 26. Lijmer, J.G.; Bossuyt, P.M.M.; Heisterkmp, S.H. Eploring sources of heterogeneity in systemtic reviews of dignostic tests. Stt. Med. 2002, 21 (11), 1525 1537. 27. Wlter, S.D. Properties of the summry receiver operting chrcteristic (SROC) curve for dignostic test dt. Stt. Med. 2002, 21 (9), 1237 1256. 28. DeLong, E.R.; DeLong, D.M.; Clrke-Person, D.L. Compring the res under two or more correlted receiver-operting chrcteristic curves: A non-prmetric pproch. Biometrics 1988, 44 (3), 837 845. 29. Bmber, D. The re bove the ordinl dominnce grph nd the re below the receiver operting grph. J. Mth. Psychol. 1975, 12 (4), 387 415. 30. McClish, D.K. Anlyzing portion of the ROC curve. Med. Decis. Mk. 1989, 9 (3), 190 195. 31. Thompson, M.L.; Zucchini, W. On the sttisticl nlysis of ROC curves. Stt. Med. 1989, 8 (10), 1277 1290. 32. Wiend, S.; Gil, M.H.; Jmes, B.R. A fmily of nonprmetric sttistics for compring dignostic tests with pired or unpired dt. Biometrik 1988, 76 (3), 585 592. 33. Jing, Y.; Metz, C.E.; Nishikw, R.M. A receiver operting chrcteristic prtil re inde for highly sensitive dignostic tests. Rdiology 1996, 201 (3), 745 750. 34. Lee, W.C.; Hsio, C.K. Alternte summry indices for the receiver operting chrcteristic curve. Epidemiology 1996, 7 (6), 605 611. 35. Lee, W.C. Probbilistic nlysis of globl performnces of dignostic tests: Interpreting the Lorenz curve-bsed summry mesures. Stt. Med. 1999, 18 (4), 455 471. 36. Lijmer, J.G.; Mol, B.W.; Heisterkmp, S. Empiricl evidence of design relted bis in studies of dignostic tests. JAMA 1999, 282 (11), 1061 1066. 37. McCullgh, P.; Nelder, J.A. Generlized Liner Models, 2nd Ed.; Chpmn nd Hll: London, 1989. 38. SAS Institute. SAS/STAT User s Guide, Version 8; SAS Institute: Cry, 1999. 39. Turner, R.M.; Omr, R.Z.; Yng, M.; Goldstein, H.; Thompson, S.G. A multilevel model frmework for metnlysis of clinicl trils with binry outcomes. Stt. Med. 2000, 19 (24), 3417 3432. 40. Verbeke, G.; Molenberghs, G. Liner Mied Models for Longitudinl Dt; Springer: New York, 2000. 41. Normnd, S.T. Met-nlysis: Formulting, evluting, combining, nd reporting. Stt. Med. 1999, 18 (3), 321 359. 42. Vn Houwelingen, H.C.; Arends, L.R.; Stijnen, T. Advnced methods in met-nlysis: Multivrite pproch nd met-regression. Stt. Med. 2002, 21 (4), 589 624. 43. Wlter, S.D.; Irwig, L.; Glsziou, P.P. Met-nlysis of dignostic tests with imperfect reference stndrds. J. Clin. Epidemiol. 1999, 52 (10), 943 951. 44. Bucher, H.C.; Guytt, G.H.; Wlter, S.D.; Griffith, L. The results of direct nd indirect tretment comprisons in met-nlysis of rndomized clinicl trils. J. Clin. Epidemiol. 1997, 50 (6), 683 691. 45. Wlter, S.D.; Jdd, A.R. Met-nlysis of screening dt: A survey of the literture. Stt. Med. 1999, 18 (24), 3409 3424. 46. Reid, M.C.; Lchs, S.; Feinstein, A.R. Use of methodologicl stndrds in dignostic test reserch: getting better but still not good. JAMA 1995, 274 (8), 645 651.

Request Permission or Order Reprints Instntly! Interested in copying nd shring this rticle? In most cses, U.S. Copyright Lw requires tht you get permission from the rticle s rightsholder before using copyrighted content. All informtion nd mterils found in this rticle, including but not limited to tet, trdemrks, ptents, logos, grphics nd imges (the "Mterils"), re the copyrighted works nd other forms of intellectul property of Mrcel Dekker, Inc., or its licensors. All rights not epressly grnted re reserved. Get permission to lwfully reproduce nd distribute the Mterils or order reprints quickly nd pinlessly. Simply click on the "Request Permission/ Order Reprints" link below nd follow the instructions. Visit the U.S. Copyright Office for informtion on Fir Use limittions of U.S. copyright lw. Plese refer to The Assocition of Americn Publishers (AAP) website for guidelines on Fir Use in the Clssroom. The Mterils re for your personl use only nd cnnot be reformtted, reposted, resold or distributed by electronic mens or otherwise without permission from Mrcel Dekker, Inc. Mrcel Dekker, Inc. grnts you the limited right to disply the Mterils only on your personl computer or personl wireless device, nd to copy nd downlod single copies of such Mterils provided tht ny copyright, trdemrk or other notice ppering on such Mterils is lso retined by, displyed, copied or downloded s prt of the Mterils nd is not removed or obscured, nd provided you do not edit, modify, lter or enhnce the Mterils. Plese refer to our Website User Agreement for more detils. Request Permission/Order Reprints Reprints of this rticle cn lso be ordered t http://www.dekker.com/servlet/product/doi/101081eebs120024189