Tolerance of Effectiveness Measures to Relevance Judging Errors


Le Li¹ and Mark D. Smucker²

¹ David R. Cheriton School of Computer Science, University of Waterloo, Canada
² Department of Management Sciences, University of Waterloo, Canada

Abstract. Crowdsourcing relevance judgments for test collection construction is attractive because the practice has the possibility of being more affordable than hiring high quality assessors. A problem faced by all crowdsourced judgments, even judgments formed from the consensus of multiple workers, is that there will be differences in the judgments compared to the judgments produced by high quality assessors. For two TREC test collections, we simulated errors in sets of judgments and then measured the effect of these errors on effectiveness measures. We found that some measures appear to be more tolerant of errors than others. We also found that achieving high rank correlation in the ranking of retrieval systems requires conservative judging for average precision (AP) and ndcg, while precision at rank 10 requires neutral judging behavior. Conservative judging avoids mistakenly judging non-relevant documents as relevant at the cost of judging some relevant documents as non-relevant. In addition, we found that while conservative judging behavior maximizes rank correlation for AP and ndcg, minimizing the error in the measures' values requires more liberal behavior. Depending on the nature of a set of crowdsourced judgments, the judgments may be more suitable for some effectiveness measures than others, and the use of some effectiveness measures will require higher levels of judgment quality than others.

1 Introduction

Information retrieval (IR) test collection construction can require 10 to 20 thousand or more relevance judgments.
The best way to obtain relevance judgments is to hire and train assessors who both originate their own search topics and can carefully and consistently judge hundreds of potentially complex documents. There is considerable interest in utilizing crowdsourcing platforms such as Amazon Mechanical Turk to obtain relevance judgments in an affordable manner [1–3]. Crowdsourced assessors are usually secondary assessors, i.e., assessors who did not originate the search topics. It is well known that secondary assessors produce relevance judgments that differ from those that are or would be produced by primary assessors [4]. Whether there is a single secondary assessor, or a group of secondary assessors whose judgments are combined using sophisticated algorithms [5, 6], there will be differences.

M. de Rijke et al. (Eds.): ECIR 2014, LNCS 8416, pp. 148–159, 2014. © Springer International Publishing Switzerland 2014

In this paper, we address this question: What effect do differences in judgments between primary and secondary assessors have on our ability to rank and score retrieval systems? Equivalently, what differences in judgments can various evaluation measures tolerate and still match the evaluation quality produced using primary judgments? To investigate this question, we used two sets of runs submitted to two TREC tracks. For each set of runs, we took the appropriate NIST relevance judgments (also known as qrels) and then simulated a secondary assessor to produce a set of secondary qrels that differed from the primary NIST qrels. For each set of qrels, we produced scores for the runs using precision at 10 (P@10), mean average precision (MAP), and normalized discounted cumulated gain (ndcg). With a given effectiveness measure, e.g. MAP, we can rank the systems as per the primary and secondary qrels and then measure their rank correlation. We measured rank correlation with Yilmaz et al.'s AP Correlation (APCorr) [7]. Likewise, we can measure the accuracy of the scores produced by the secondary qrels by measuring the root mean square error (RMSE) between the two sets of scores. To simulate the secondary assessors, we treated the NIST qrels as truth and the secondary assessor as a classifier. A classifier's performance can be understood in terms of its true positive rate (TPR) and its false positive rate (FPR). A given TPR and FPR determine both a classifier's discrimination ability and how conservative or liberal it is in its judging. For example, a conservative classifier avoids judging non-relevant documents as relevant at the cost of mistakenly judging some relevant documents as non-relevant. We used d′ to measure discrimination ability and the criterion c to measure how conservative or liberal the judging behavior is [8].
We systematically varied the discrimination ability, d′, and the criterion, c, to produce different sets of qrels. We then evaluated the system runs submitted to the TREC 8 ad-hoc and 2005 TREC Robust tracks with these qrels and compared the results to those we obtained using the official NIST qrels. After analyzing the results, we found that:

1. In terms of rank correlation (APCorr), mean average precision (MAP) is more tolerant of errors than ndcg and P@10. In other words, MAP can obtain the same APCorr as ndcg and P@10 with assessors of a lower discrimination ability.

2. To maximize rank correlation, ndcg, MAP, and P@10 require conservative judging. Of the three measures, P@10 requires the least conservative judging and works best with judging close to neutral. The lower the discrimination ability of the judging, the more conservative judging is required by MAP and ndcg to maximize rank correlation. MAP and ndcg appear to be sensitive to false positives.

3. Depending on the discrimination ability of the judging, it can be hard to jointly optimize APCorr and RMSE for MAP and ndcg.

The impact of these findings is that optimizing rank correlation requires attention not only to the discrimination ability of the assessors, but also to how conservative, liberal, or neutral those assessors are in their judgments. Judging schemes or consensus algorithms may need to be devised that help produce more conservative judgments when MAP and ndcg are the targeted effectiveness measures. If P@10 is to be used as the effectiveness measure, efforts must be taken to maintain neutral judging. From a crowdsourcing point of view, it is likely that some set of high quality, primary assessor relevance judgments will need to be acquired so that the lower quality, crowdsourced, secondary assessor relevance judgments can be calibrated to maximize rank correlation by controlling the relevance criterion used, i.e., by controlling how liberal or conservative the resulting relevance judgments are.

2 Methods and Materials

To conduct our experiments, we used the sets of runs submitted to two TREC tracks. For each TREC track, we took the NIST qrels and simulated assessors of different abilities and biases as compared to the NIST qrels to produce alternative qrels. We then used these alternative qrels to evaluate the sets of runs and measure the effect that the differences in judgments had on our evaluation of the runs submitted to the tracks.

2.1 Runs Submitted to TREC Tracks and QRels

We used the runs submitted to the TREC 8 ad-hoc and 2005 TREC Robust tracks as well as the NIST qrels for each track [9, 10]. For convenience, we refer to the two data sets as Robust2005 and TREC8. Both data sets contain 50 topics. The TREC8 qrels contain 86,830 judgments of which 4,728 are relevant (5.4%). The Robust2005 qrels contain 37,798 judgments of which 6,561 are relevant (17%). TREC8 has 129 submitted runs and Robust2005 has 74 submitted runs.

2.2 Simulation of Judgments

We took the NIST qrels as truth and then simulated assessors of different abilities and biases as measured against the NIST qrels.
We can describe the judging behavior of our simulated assessors in terms of their true positive rates (TPR) and false positive rates (FPR), where TPR = TP/(TP + FN) and FPR = FP/(FP + TN) (as shown in Table 1). Signal detection theory allows us to separately describe the discrimination ability and the decision criterion, or bias, of the assessor [8]. Discrimination ability is measured as d′:

    d′ = z(TPR) − z(FPR),                    (1)

and the bias of the assessor towards either liberal or conservative judging is described by the criterion c:

    c = −(1/2) (z(TPR) + z(FPR)),            (2)

Table 1. Confusion matrix. Pos. and Neg. stand for Positive and Negative, respectively.

                                    NIST (Primary) Assessor
    Simulated Secondary Assessor    Relevant (Pos.)     Non-Relevant (Neg.)
    Relevant                        TP = True Pos.      FP = False Pos.
    Non-Relevant                    FN = False Neg.     TN = True Neg.

where TPR and FPR are the true positive rate and false positive rate of this assessor, respectively. The function z, the inverse of the normal distribution function, converts the TPR or FPR to a z-score [8]. If an assessor tends to label incoming documents as relevant to avoid missing relevant documents (but at the risk of a high false positive rate), then this assessor is liberal, with a negative criterion. If c = 0, the assessor is neutral. A conservative assessor has a positive criterion. One advantage of using this model is that the measurement of an assessor's ability to discriminate is independent of the assessor's criterion. At a given discrimination ability d′, there are many possible values for the TPR and FPR. In other words, two assessors can have the same ability to discriminate between relevant and non-relevant documents, but one may have a much higher relevance criterion than the other. The higher the relevance criterion, the more conservative the assessor. Figure 1 shows example d′ curves. All of the points along a curve have the same discrimination ability. Table 2 gives the TPR and FPR for a selection of d′ and c values.

Table 2. The TPR and FPR for various d′ and c, with c = −1 (liberal), c = 0 (neutral), and c = 1 (conservative).

If an assessor's d′ and c are given, we can use them to calculate the TPR and FPR of this assessor using Equations 3 and 4, which are derived from Equations 1 and 2: adding and subtracting the two equations gives z(TPR) = d′/2 − c and z(FPR) = −d′/2 − c. TPR is computed as

    TPR = CDF(d′/2 − c),                     (3)

and FPR is computed as

    FPR = CDF(−d′/2 − c),                    (4)

where CDF is the cumulative distribution function of the standard normal distribution N(0, 1).
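As a concrete sketch of Equations 1–4, the conversions between (TPR, FPR) and (d′, c) can be written with Python's standard-library statistics.NormalDist. The function names below are our own illustration, not code from the paper.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # the standard normal distribution N(0, 1)

def rates_from_dprime_c(d_prime, c):
    """Equations 3 and 4: TPR = CDF(d'/2 - c), FPR = CDF(-d'/2 - c)."""
    tpr = _STD_NORMAL.cdf(d_prime / 2.0 - c)
    fpr = _STD_NORMAL.cdf(-d_prime / 2.0 - c)
    return tpr, fpr

def dprime_c_from_rates(tpr, fpr):
    """Equations 1 and 2: d' = z(TPR) - z(FPR), c = -(z(TPR) + z(FPR)) / 2."""
    z_tpr = _STD_NORMAL.inv_cdf(tpr)  # z() is the inverse normal CDF
    z_fpr = _STD_NORMAL.inv_cdf(fpr)
    return z_tpr - z_fpr, -(z_tpr + z_fpr) / 2.0
```

For example, a neutral assessor (c = 0) with d′ = 2 has TPR ≈ 0.84 and FPR ≈ 0.16, while a conservative assessor (c = 1) at the same d′ trades TPR ≈ 0.50 for FPR ≈ 0.02.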
Assuming a document's true label is given, we can generate the simulated judgment by tossing a biased coin. The probability of the assessor making an error is calculated using the assessor's TPR and FPR. If the true label is relevant, the assessor makes an error with probability equal to 1 − TPR. If the true label is non-relevant, the assessor makes an error with probability equal to FPR.

Fig. 1. Curves of equal d′ (d′ = 0, 1, and 2), plotted as True Positive Rate against False Positive Rate; points with criterion c > 0 are conservative and points with c < 0 are liberal. This figure is based on Figures 1.1 and 2.1 of [8].

2.3 Experiment Settings

We simulated the noisy judgments of assessors by varying two variables, d′ and c, as shown in Algorithm 1 and described in Sec. 2.2. What are the candidate values of d′ and c for the simulation? Smucker and Jethani [11] estimated the average d′ and c of NIST assessors to be 2.3 and 0.7, respectively, across 10 topics in the 2005 TREC Robust Track. In [12], the reported d′ and c of crowdsourced assessors are 1.9 and 0.4, respectively, with the same experiment settings as in [11]. If we think of the judgments from NIST assessors as an upper bound on what a consensus algorithm or hired assessors could achieve, then the results from these two papers indicate that an assessor should have d′ and c values close to NIST's. Meanwhile, as shown in Fig. 1, d′ = 0 means the assessor labels documents by tossing an unbiased coin, i.e., random guessing. So, we set the range of d′ to [0, 3] with a fixed step size. For the criterion c, the reported values suggest that both NIST and crowdsourced assessors are conservative, with NIST assessors being more conservative than the crowdsourced workers [11, 12]. At the same time, the behavior of liberal assessors is also worthy of investigation. So, we set c to the range [−3, 3] with a fixed step size. In total, we simulated 366 different types of assessors who make random errors based on each pair of d′ and c values. While we consider c varying between −3 and 3, the likely range for c is probably at most −1 to 1. We show the range from −3 to 3 to allow trends to be better seen.

Algorithm 1. Simulate the judgments from one assessor

    INPUT: d′, c, trueLabels
    TPR ← CDF(d′/2 − c)              ▷ CDF of N(0, 1)
    FPR ← CDF(−d′/2 − c)
    for i = 1 : size(trueLabels) do
        judge_i ← trueLabels_i
        flip ← rand(0, 1)            ▷ a random number in [0, 1)
        if trueLabels_i == 0 then
            if flip ≤ FPR then
                judge_i ← 1
            end if
        else
            if flip > TPR then
                judge_i ← 0
            end if
        end if
    end for
    RETURN judge

For each simulated assessor, we repeated the simulation 10 times to generate 10 independent simulated qrels, and then averaged the performance of the simulated qrels for the assessor.

2.4 Measures

To measure the degree of correlation between the simulated and NIST assessors in IR evaluation, we evaluated the submitted runs of the two TREC tracks against the qrels from the simulated and NIST assessors, respectively. Three evaluation metrics were used: Precision at 10 (P@10), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (ndcg). To be more precise, each assessor's simulated judgments were used as pseudo qrels to evaluate all test runs, so that each test run corresponded to one evaluation result, averaged across all topics. For example, Robust2005 has 74 test runs, so we can get 74 MAPs against one qrels file. Hence, we can measure the correlation of two qrels files based on the association between two lists of MAPs derived from identical test runs. The higher the correlation between the rankings produced by a set of simulated qrels and the rankings produced by the NIST qrels, the less effect the simulated errors have on the effectiveness measure. We can compare the tolerance of effectiveness measures to judging errors by measuring the correlation for each measure on a given set of simulated qrels: the higher the correlation, the more fault tolerant the effectiveness measure is. We used the Average Precision Correlation Coefficient (APCorr) [7] to measure the correlation.
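Algorithm 1 can be sketched in Python using only the standard library. The function name and the optional seed parameter are our own additions for reproducibility, not part of the paper.

```python
import random
from statistics import NormalDist

def simulate_judgments(d_prime, c, true_labels, seed=None):
    """Algorithm 1: corrupt true (NIST) labels with a biased coin.

    A true relevant label (1) is kept with probability TPR; a true
    non-relevant label (0) is flipped to relevant with probability FPR.
    """
    norm = NormalDist()  # standard normal N(0, 1)
    tpr = norm.cdf(d_prime / 2.0 - c)
    fpr = norm.cdf(-d_prime / 2.0 - c)
    rng = random.Random(seed)
    judged = []
    for label in true_labels:
        flip = rng.random()  # uniform draw in [0, 1)
        if label == 0:
            judged.append(1 if flip < fpr else 0)
        else:
            judged.append(1 if flip < tpr else 0)
    return judged
```

With a very large d′ the simulated assessor reproduces the true labels exactly; lowering d′ or moving c away from 0 injects the random errors described above.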
APCorr is very similar to another commonly-used correlation measure, Kendall's τ [13], but APCorr gives more weight to errors nearer to the top of the rankings. If two rankings perfectly agree with each other, APCorr is 1. Moreover, the Root Mean Square Error (RMSE) was also adopted to calculate the error between two lists of scores. The smaller the RMSE, the closer the two sets of measurements are in terms of their values.

Fig. 2. The effects of the pseudo assessor's errors on evaluation, P@10. Panels: (a) APCorr, Robust2005; (b) APCorr, TREC8; (c) RMSE, Robust2005; (d) RMSE, TREC8.

3 Results and Discussions

Results are shown in Figures 2, 3, and 4 and Tables 3 and 4. Each figure shows a different effectiveness measure on both TREC8 and Robust2005. The tables show the maximum APCorr achieved by each effectiveness measure for each of the d′ values used in the experiment. As is to be expected, the larger the d′ value, the better the rank correlation (APCorr) at a given criterion c. Recall that APCorr measures the degree to which the simulated qrels rank the retrieval systems in the same order as do the NIST qrels, and that criterion values greater than 0 are conservative while those less than zero are liberal.
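The two comparison measures, APCorr and RMSE, can be sketched as follows; τ_AP is implemented per the definition in [7], and the function names are our own.

```python
import math

def ap_correlation(true_order, est_order):
    """AP correlation (tau_AP) of an estimated ranking against a true one.

    For each item at rank i > 1 in the estimated ranking, count the
    fraction of items ranked above it that the true ranking also places
    above it; tau_AP is twice the average of these fractions, minus 1.
    """
    true_rank = {item: r for r, item in enumerate(true_order)}
    n = len(est_order)
    total = 0.0
    for i in range(1, n):
        item = est_order[i]
        concordant = sum(1 for a in est_order[:i]
                         if true_rank[a] < true_rank[item])
        total += concordant / i
    return 2.0 * total / (n - 1) - 1.0

def rmse(scores_a, scores_b):
    """Root mean square error between two equal-length score lists."""
    n = len(scores_a)
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(scores_a, scores_b)) / n)
```

Two identical rankings give τ_AP = 1, and a fully reversed ranking gives −1; unlike Kendall's τ, a swap near the top of the ranking costs more than a swap near the bottom.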

Fig. 3. The effects of the pseudo assessor's errors on evaluation, MAP. Panels: (a) APCorr, Robust2005; (b) APCorr, TREC8; (c) RMSE, Robust2005; (d) RMSE, TREC8.

As can be seen in Tables 3 and 4, except for the two lowest d′ values on TREC8, mean average precision (MAP) achieves the best rank correlation at a given level of discrimination ability. Indeed, MAP often has an APCorr that is near the APCorr achieved by ndcg and P@10 at the next higher d′, i.e., MAP can achieve the same APCorr as the other metrics but with lower quality assessors. MAP is more fault tolerant than P@10 and ndcg on these test collections. Evident most clearly in Figures 2, 3, and 4, but also in Tables 3 and 4, MAP and ndcg require conservative judgments to maximize their rank correlation (APCorr). P@10 also requires conservative judgments, but the degree of conservativeness is close to neutral. As the discrimination ability decreases, MAP and ndcg require even more conservative judgments to maximize APCorr. It appears that both MAP and ndcg are sensitive to false positives. Our results reinforce those of Carterette and Soboroff [14], who also found, via a different simulation methodology, that false positives are to be avoided.

Fig. 4. The effects of the pseudo assessor's errors on evaluation, ndcg. Panels: (a) APCorr, Robust2005; (b) APCorr, TREC8; (c) RMSE, Robust2005; (d) RMSE, TREC8.

A consequence of the need for conservative judgments to maximize rank correlation is that it is hard for secondary assessors, such as crowdsourced assessors, to produce a set of qrels that can produce the same scores for MAP and ndcg as with the NIST qrels. The reason for this is that conservative judging requires missing relevant documents and judging them to be non-relevant, to avoid being liberal and mistakenly judging non-relevant documents to be relevant. Both MAP and ndcg are measures over the set of known relevant documents. Thus, conservative judging results in a lower estimate of the total number of relevant documents and changes the scores of MAP and ndcg.

At high levels of discrimination ability d′, the maximum APCorr is obtained with a criterion c that also produces a near-minimum RMSE for all of the effectiveness measures. For MAP and ndcg, as d′ decreases, the best criterion c for APCorr and the best criterion for RMSE move apart, and it becomes increasingly hard to jointly optimize for both measures. We can also see in the figures that assessors with greater discrimination ability d′ tend to be more robust to changes in the criterion c, with high values of APCorr obtained over wider ranges of c. Meanwhile, we notice that the correlation results on TREC8 tend to be worse than those on Robust2005. Our hypothesis is that since TREC8 has a deeper pool with more non-relevant documents than Robust2005, the number of false positives is higher for TREC8 when judged with the same FPR. Another possibility is that the unique nature of the manual runs present in TREC8, which are some of the best scoring runs, makes TREC8 harder to judge than Robust2005.

Table 3. The criterion c, TPR, and FPR at which APCorr is maximal for each d′, Robust2005 (columns c, TPR, FPR, and APCorr for each of P@10, MAP, and ndcg).

Table 4. The criterion c, TPR, and FPR at which APCorr is maximal for each d′, TREC8 (columns c, TPR, FPR, and APCorr for each of P@10, MAP, and ndcg).

A somewhat surprising result occurs with MAP on TREC8 and its RMSE. As Fig. 3 shows, the highly discriminative d′ = 3.0 qrels actually have a higher RMSE than the lower d′ qrels at liberal c values less than −0.5 or so. As far as we can understand, this inversion of expected behavior results from the lower d′ qrels having higher false positive rates that, while producing noisier judgments, result in MAP values that on average are closer to the NIST scores.

3.1 Limitations of Our Methods

Our existing simulation method only captures the random errors made by the assessors. Webber et al. [15] have shown that the lower a document is ranked by retrieval engines, the less likely assessors are to make false positive errors. In our simulation, the true and false positive rates do not depend on the document being judged. Likewise, we do not attempt to model crowdsourcing-specific error [16]. As such, our results cannot be used to show the discrimination ability required of assessors to obtain a desired rank correlation.

4 Related Work

Voorhees [17] conducted experiments with obtaining secondary relevance judgments using high quality NIST assessors. In these experiments, Voorhees found that even with disagreements, the rank correlation of the runs was high. Subsequent work by others has found that differing levels of assessor expertise can negatively affect the ability of secondary assessors to produce qrels that evaluate systems in the same manner as qrels produced by high-quality primary assessors [18, 19]. Most similar to our work, Carterette and Soboroff [14] hypothesized several different models of assessor behavior that could produce judging errors compared to NIST qrels. They found that their pessimistic models resulted in the best rank correlation. These findings are in line with our results showing that conservative assessors are required for maximizing rank correlation. Carterette and Soboroff examined the statMAP measure, while we have looked at additional measures and discovered that P@10 does best with slightly conservative, almost neutral judging.

5 Conclusion

We simulated assessor errors by varying both their discrimination ability and their relevance criterion. We examined the effect of these errors on three effectiveness measures: P@10, MAP, and ndcg. We found that MAP is more tolerant of judging errors than P@10 and ndcg: MAP can achieve the same rank correlation with lower quality assessors. We also found that conservative assessors are preferable for achieving high correlation. In other words, it is important that assessors avoid mistakenly judging non-relevant documents as relevant. We also found that different effectiveness measures respond differently to errors in judging. For example, P@10 requires a more liberal judging behavior than do MAP and ndcg.
Crowdsourced relevance judging will likely require a sample of documents judged by high quality, primary assessors to allow for the calibration of the judgments produced by crowdsourcing. Future work could involve the design of effectiveness measures specifically designed to better handle relevance judging errors.

Acknowledgments. We thank the reviewers for their helpful reviews. In particular, we thank the meta-reviewer for the helpful set of references to related work. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), in part by the facilities of SHARCNET, and in part by the University of Waterloo. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

References

1. Alonso, O., Mizzaro, S.: Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In: Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation (July 2009)
2. McCreadie, R., Macdonald, C., Ounis, I.: Crowdsourcing blog track top news judgments at TREC. In: WSDM 2011 Workshop on Crowdsourcing for Search and Data Mining (2011)
3. Smucker, M.D., Kazai, G., Lease, M.: Overview of the TREC 2012 crowdsourcing track (2012)
4. Voorhees, E.M.: Variations in relevance judgments and the measurement of retrieval effectiveness. IPM 36 (2000)
5. Hosseini, M., Cox, I., Milić-Frayling, N., Kazai, G., Vinay, V.: On aggregating labels from multiple crowd workers to infer relevance of documents. In: Baeza-Yates, R., de Vries, A.P., Zaragoza, H., Cambazoglu, B.B., Murdock, V., Lempel, R., Silvestri, F. (eds.) ECIR 2012. LNCS, vol. 7224. Springer, Heidelberg (2012)
6. Raykar, V.C., Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., Moy, L.: Learning from crowds. The Journal of Machine Learning Research 99 (2010)
7. Yilmaz, E., Aslam, J.A., Robertson, S.: A new rank correlation coefficient for information retrieval. In: SIGIR (2008)
8. Macmillan, N.A., Creelman, C.D.: Detection Theory: A User's Guide. Psychology Press (2004)
9. Voorhees, E.M., Harman, D.: Overview of the Eighth Text REtrieval Conference (TREC-8). In: Proceedings of TREC, vol. 8 (1999)
10. Voorhees, E.M.: Overview of TREC 2005. In: Proceedings of TREC (2005)
11. Smucker, M.D., Jethani, C.P.: Measuring assessor accuracy: A comparison of NIST assessors and user study participants. In: SIGIR (2011)
12. Smucker, M., Jethani, C.: The crowd vs. the lab: A comparison of crowd-sourced and university laboratory participant behavior. In: Proceedings of the SIGIR 2011 Workshop on Crowdsourcing for Information Retrieval (2011)
13. Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2) (1938)
14. Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. In: SIGIR (2010)
15. Webber, W., Chandar, P., Carterette, B.: Alternative assessor disagreement and retrieval depth. In: CIKM (2012)
16. Vuurens, J., de Vries, A.P., Eickhoff, C.: How much spam can you take? In: SIGIR 2011 Workshop on Crowdsourcing for Information Retrieval, CIR (2011)
17. Voorhees, E.: Variations in relevance judgments and the measurement of retrieval effectiveness. IPM 36(5) (2000)
18. Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A., Yilmaz, E.: Relevance assessment: Are judges exchangeable and does it matter? In: SIGIR (2008)
19. Kinney, K., Huffman, S., Zhai, J.: How evaluator domain expertise affects search result relevance judgments. In: CIKM (2008)


More information

Investigating the robustness of the nonparametric Levene test with more than two groups

Investigating the robustness of the nonparametric Levene test with more than two groups Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

More information

Analysis of Diabetic Dataset and Developing Prediction Model by using Hive and R

Analysis of Diabetic Dataset and Developing Prediction Model by using Hive and R Indian Journal of Science and Technology, Vol 9(47), DOI: 10.17485/ijst/2016/v9i47/106496, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Analysis of Diabetic Dataset and Developing Prediction

More information

INTRODUCTION TO MACHINE LEARNING. Decision tree learning

INTRODUCTION TO MACHINE LEARNING. Decision tree learning INTRODUCTION TO MACHINE LEARNING Decision tree learning Task of classification Automatically assign class to observations with features Observation: vector of features, with a class Automatically assign

More information

Rank Aggregation and Belief Revision Dynamics

Rank Aggregation and Belief Revision Dynamics Rank Aggregation and Belief Revision Dynamics Igor Volzhanin (ivolzh01@mail.bbk.ac.uk), Ulrike Hahn (u.hahn@bbk.ac.uk), Dell Zhang (dell.z@ieee.org) Birkbeck, University of London London, WC1E 7HX UK Stephan

More information

Efficient AUC Optimization for Information Ranking Applications

Efficient AUC Optimization for Information Ranking Applications Efficient AUC Optimization for Information Ranking Applications Sean J. Welleck IBM, USA swelleck@us.ibm.com Abstract. Adequate evaluation of an information retrieval system to estimate future performance

More information

Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior

Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior 1 Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior Gregory Francis Department of Psychological Sciences Purdue University gfrancis@purdue.edu

More information

Improved Intelligent Classification Technique Based On Support Vector Machines

Improved Intelligent Classification Technique Based On Support Vector Machines Improved Intelligent Classification Technique Based On Support Vector Machines V.Vani Asst.Professor,Department of Computer Science,JJ College of Arts and Science,Pudukkottai. Abstract:An abnormal growth

More information

3. Model evaluation & selection

3. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2016 3. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

Part II II Evaluating performance in recommender systems

Part II II Evaluating performance in recommender systems Part II II Evaluating performance in recommender systems If you cannot measure it, you cannot improve it. William Thomson (Lord Kelvin) Chapter 3 3 Evaluation of recommender systems The evaluation of

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

Introduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015

Introduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015 Introduction to diagnostic accuracy meta-analysis Yemisi Takwoingi October 2015 Learning objectives To appreciate the concept underlying DTA meta-analytic approaches To know the Moses-Littenberg SROC method

More information

Predicting Breast Cancer Survivability Rates

Predicting Breast Cancer Survivability Rates Predicting Breast Cancer Survivability Rates For data collected from Saudi Arabia Registries Ghofran Othoum 1 and Wadee Al-Halabi 2 1 Computer Science, Effat University, Jeddah, Saudi Arabia 2 Computer

More information

Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis

Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis EFSA/EBTC Colloquium, 25 October 2017 Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis Julian Higgins University of Bristol 1 Introduction to concepts Standard

More information

AUTOMATIC ACNE QUANTIFICATION AND LOCALISATION FOR MEDICAL TREATMENT

AUTOMATIC ACNE QUANTIFICATION AND LOCALISATION FOR MEDICAL TREATMENT AUTOMATIC ACNE QUANTIFICATION AND LOCALISATION FOR MEDICAL TREATMENT Watcharaporn Sitsawangsopon (#1), Maetawee Juladash (#2), Bunyarit Uyyanonvara (#3) (#) School of ICT, Sirindhorn International Institute

More information

4. Model evaluation & selection

4. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2017 4. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

Modeling Sentiment with Ridge Regression

Modeling Sentiment with Ridge Regression Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,

More information

Improving Individual and Team Decisions Using Iconic Abstractions of Subjective Knowledge

Improving Individual and Team Decisions Using Iconic Abstractions of Subjective Knowledge 2004 Command and Control Research and Technology Symposium Improving Individual and Team Decisions Using Iconic Abstractions of Subjective Knowledge Robert A. Fleming SPAWAR Systems Center Code 24402 53560

More information

Analysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information

Analysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information Analysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, S. Narayanan Emotion

More information

Fixed Effect Combining

Fixed Effect Combining Meta-Analysis Workshop (part 2) Michael LaValley December 12 th 2014 Villanova University Fixed Effect Combining Each study i provides an effect size estimate d i of the population value For the inverse

More information

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Vs. 2 Background 3 There are different types of research methods to study behaviour: Descriptive: observations,

More information

Bayes Linear Statistics. Theory and Methods

Bayes Linear Statistics. Theory and Methods Bayes Linear Statistics Theory and Methods Michael Goldstein and David Wooff Durham University, UK BICENTENNI AL BICENTENNIAL Contents r Preface xvii 1 The Bayes linear approach 1 1.1 Combining beliefs

More information

Information Retrieval from Electronic Health Records for Patient Cohort Discovery

Information Retrieval from Electronic Health Records for Patient Cohort Discovery Information Retrieval from Electronic Health Records for Patient Cohort Discovery References William Hersh, MD Professor and Chair Department of Medical Informatics & Clinical Epidemiology Oregon Health

More information

This is the author s version of a work that was submitted/accepted for publication in the following source:

This is the author s version of a work that was submitted/accepted for publication in the following source: This is the author s version of a work that was submitted/accepted for publication in the following source: Moshfeghi, Yashar, Zuccon, Guido, & Jose, Joemon M. (2011) Using emotion to diversify document

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? Richards J. Heuer, Jr. Version 1.2, October 16, 2005 This document is from a collection of works by Richards J. Heuer, Jr.

More information

arxiv: v2 [cs.ai] 26 Sep 2018

arxiv: v2 [cs.ai] 26 Sep 2018 Manipulating and Measuring Model Interpretability arxiv:1802.07810v2 [cs.ai] 26 Sep 2018 Forough Poursabzi-Sangdeh forough.poursabzi@microsoft.com Microsoft Research Jennifer Wortman Vaughan jenn@microsoft.com

More information

7 Grip aperture and target shape

7 Grip aperture and target shape 7 Grip aperture and target shape Based on: Verheij R, Brenner E, Smeets JBJ. The influence of target object shape on maximum grip aperture in human grasping movements. Exp Brain Res, In revision 103 Introduction

More information

The recommended method for diagnosing sleep

The recommended method for diagnosing sleep reviews Measuring Agreement Between Diagnostic Devices* W. Ward Flemons, MD; and Michael R. Littner, MD, FCCP There is growing interest in using portable monitoring for investigating patients with suspected

More information

Meta-Analysis. Zifei Liu. Biological and Agricultural Engineering

Meta-Analysis. Zifei Liu. Biological and Agricultural Engineering Meta-Analysis Zifei Liu What is a meta-analysis; why perform a metaanalysis? How a meta-analysis work some basic concepts and principles Steps of Meta-analysis Cautions on meta-analysis 2 What is Meta-analysis

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Phone Number:

Phone Number: International Journal of Scientific & Engineering Research, Volume 6, Issue 5, May-2015 1589 Multi-Agent based Diagnostic Model for Diabetes 1 A. A. Obiniyi and 2 M. K. Ahmed 1 Department of Mathematic,

More information

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1 SLEEP DISTURBANCE A brief guide to the PROMIS Sleep Disturbance instruments: ADULT PROMIS Item Bank v1.0 Sleep Disturbance PROMIS Short Form v1.0 Sleep Disturbance 4a PROMIS Short Form v1.0 Sleep Disturbance

More information

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 127 CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 6.1 INTRODUCTION Analyzing the human behavior in video sequences is an active field of research for the past few years. The vital applications of this field

More information

Predicting Task Difficulty for Different Task Types

Predicting Task Difficulty for Different Task Types Predicting Task Difficulty for Different Task Types Jingjing Liu, Jacek Gwizdka, Chang Liu, Nicholas J. Belkin School of Communication and Information, Rutgers University 4 Huntington Street, New Brunswick,

More information

Mining Human-Place Interaction Patterns from Location-Based Social Networks to Enrich Place Categorization Systems

Mining Human-Place Interaction Patterns from Location-Based Social Networks to Enrich Place Categorization Systems Mining Human-Place Interaction Patterns from Location-Based Social Networks to Enrich Place Categorization Systems Yingjie Hu, Grant McKenzie, Krzysztof Janowicz, Song Gao STKO Lab, Department of Geography,

More information

Impact and adjustment of selection bias. in the assessment of measurement equivalence

Impact and adjustment of selection bias. in the assessment of measurement equivalence Impact and adjustment of selection bias in the assessment of measurement equivalence Thomas Klausch, Joop Hox,& Barry Schouten Working Paper, Utrecht, December 2012 Corresponding author: Thomas Klausch,

More information

Multi-modal Patient Cohort Identification from EEG Report and Signal Data

Multi-modal Patient Cohort Identification from EEG Report and Signal Data Multi-modal Patient Cohort Identification from EEG Report and Signal Data Travis R. Goodwin and Sanda M. Harabagiu The University of Texas at Dallas Human Language Technology Research Institute http://www.hlt.utdallas.edu

More information

Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE

Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE 1. When you assert that it is improbable that the mean intelligence test score of a particular group is 100, you are using. a. descriptive

More information

FINAL. Recommendations for Update to Arsenic Soil CTL Computation. Methodology Focus Group. Contaminated Soils Forum. Prepared by:

FINAL. Recommendations for Update to Arsenic Soil CTL Computation. Methodology Focus Group. Contaminated Soils Forum. Prepared by: A stakeholder body advising the Florida Department of Environmental Protection FINAL Recommendations for Update to Arsenic Soil CTL Computation Prepared by: Methodology Focus Group Contaminated Soils Forum

More information

Predictive Models for Healthcare Analytics

Predictive Models for Healthcare Analytics Predictive Models for Healthcare Analytics A Case on Retrospective Clinical Study Mengling Mornin Feng mfeng@mit.edu mornin@gmail.com 1 Learning Objectives After the lecture, students should be able to:

More information

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models Int J Comput Math Learning (2009) 14:51 60 DOI 10.1007/s10758-008-9142-6 COMPUTER MATH SNAPHSHOTS - COLUMN EDITOR: URI WILENSKY* Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based

More information

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance Machine Learning to Inform Breast Cancer Post-Recovery Surveillance Final Project Report CS 229 Autumn 2017 Category: Life Sciences Maxwell Allman (mallman) Lin Fan (linfan) Jamie Kang (kangjh) 1 Introduction

More information

Minimum Feature Selection for Epileptic Seizure Classification using Wavelet-based Feature Extraction and a Fuzzy Neural Network

Minimum Feature Selection for Epileptic Seizure Classification using Wavelet-based Feature Extraction and a Fuzzy Neural Network Appl. Math. Inf. Sci. 8, No. 3, 129-1300 (201) 129 Applied Mathematics & Information Sciences An International Journal http://dx.doi.org/10.1278/amis/0803 Minimum Feature Selection for Epileptic Seizure

More information

Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985)

Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985) Confirmations and Contradictions Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985) Estimates of the Deterrent Effect of Capital Punishment: The Importance of the Researcher's Prior Beliefs Walter

More information

Essentials in Bioassay Design and Relative Potency Determination

Essentials in Bioassay Design and Relative Potency Determination BioAssay SCIENCES A Division of Thomas A. Little Consulting Essentials in Bioassay Design and Relative Potency Determination Thomas A. Little Ph.D. 2/29/2016 President/CEO BioAssay Sciences 12401 N Wildflower

More information

Marriage Matching with Correlated Preferences

Marriage Matching with Correlated Preferences Marriage Matching with Correlated Preferences Onur B. Celik Department of Economics University of Connecticut and Vicki Knoblauch Department of Economics University of Connecticut Abstract Authors of experimental,

More information

Testing the robustness of anonymization techniques: acceptable versus unacceptable inferences - Draft Version

Testing the robustness of anonymization techniques: acceptable versus unacceptable inferences - Draft Version Testing the robustness of anonymization techniques: acceptable versus unacceptable inferences - Draft Version Gergely Acs, Claude Castelluccia, Daniel Le étayer 1 Introduction Anonymization is a critical

More information

Reliability, validity, and all that jazz

Reliability, validity, and all that jazz Reliability, validity, and all that jazz Dylan Wiliam King s College London Published in Education 3-13, 29 (3) pp. 17-21 (2001) Introduction No measuring instrument is perfect. If we use a thermometer

More information

Running Head: AUTOMATED SCORING OF CONSTRUCTED RESPONSE ITEMS. Contract grant sponsor: National Science Foundation; Contract grant number:

Running Head: AUTOMATED SCORING OF CONSTRUCTED RESPONSE ITEMS. Contract grant sponsor: National Science Foundation; Contract grant number: Running Head: AUTOMATED SCORING OF CONSTRUCTED RESPONSE ITEMS Rutstein, D. W., Niekrasz, J., & Snow, E. (2016, April). Automated scoring of constructed response items measuring computational thinking.

More information

Evidence-Based Medicine and Publication Bias Desmond Thompson Merck & Co.

Evidence-Based Medicine and Publication Bias Desmond Thompson Merck & Co. Evidence-Based Medicine and Publication Bias Desmond Thompson Merck & Co. Meta-Analysis Defined A meta-analysis is: the statistical combination of two or more separate studies In other words: overview,

More information

Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk

Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk Michael Denkowski and Alon Lavie Language Technologies Institute School of

More information

International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use

International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use Final Concept Paper E9(R1): Addendum to Statistical Principles for Clinical Trials on Choosing Appropriate Estimands and Defining Sensitivity Analyses in Clinical Trials dated 22 October 2014 Endorsed

More information

ABSTRACT I. INTRODUCTION. Mohd Thousif Ahemad TSKC Faculty Nagarjuna Govt. College(A) Nalgonda, Telangana, India

ABSTRACT I. INTRODUCTION. Mohd Thousif Ahemad TSKC Faculty Nagarjuna Govt. College(A) Nalgonda, Telangana, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 1 ISSN : 2456-3307 Data Mining Techniques to Predict Cancer Diseases

More information

A Predictive Chronological Model of Multiple Clinical Observations T R A V I S G O O D W I N A N D S A N D A M. H A R A B A G I U

A Predictive Chronological Model of Multiple Clinical Observations T R A V I S G O O D W I N A N D S A N D A M. H A R A B A G I U A Predictive Chronological Model of Multiple Clinical Observations T R A V I S G O O D W I N A N D S A N D A M. H A R A B A G I U T H E U N I V E R S I T Y O F T E X A S A T D A L L A S H U M A N L A N

More information

Investigating the Exhaustivity Dimension in Content-Oriented XML Element Retrieval Evaluation

Investigating the Exhaustivity Dimension in Content-Oriented XML Element Retrieval Evaluation Investigating the Exhaustivity Dimension in Content-Oriented XML Element Retrieval Evaluation Paul Ogilvie Language Technologies Institute Carnegie Mellon University Pittsburgh PA, USA pto@cs.cmu.edu Mounia

More information

Emotion Recognition using a Cauchy Naive Bayes Classifier

Emotion Recognition using a Cauchy Naive Bayes Classifier Emotion Recognition using a Cauchy Naive Bayes Classifier Abstract Recognizing human facial expression and emotion by computer is an interesting and challenging problem. In this paper we propose a method

More information

Evaluation of CBT for increasing threat detection performance in X-ray screening

Evaluation of CBT for increasing threat detection performance in X-ray screening Evaluation of CBT for increasing threat detection performance in X-ray screening A. Schwaninger & F. Hofer Department of Psychology, University of Zurich, Switzerland Abstract The relevance of aviation

More information

Evaluation of CBT for increasing threat detection performance in X-ray screening

Evaluation of CBT for increasing threat detection performance in X-ray screening Evaluation of CBT for increasing threat detection performance in X-ray screening A. Schwaninger & F. Hofer Department of Psychology, University of Zurich, Switzerland Abstract The relevance of aviation

More information

Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA

Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA The uncertain nature of property casualty loss reserves Property Casualty loss reserves are inherently uncertain.

More information

Knowledge Discovery and Data Mining. Testing. Performance Measures. Notes. Lecture 15 - ROC, AUC & Lift. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Testing. Performance Measures. Notes. Lecture 15 - ROC, AUC & Lift. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-17-AUC

More information

Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments

Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments Tyler McDonnell Dept. of Computer Science University of Texas at Austin tyler@cs.utexas.edu Matthew Lease School of Information

More information

A framework for predicting item difficulty in reading tests

A framework for predicting item difficulty in reading tests Australian Council for Educational Research ACEReSearch OECD Programme for International Student Assessment (PISA) National and International Surveys 4-2012 A framework for predicting item difficulty in

More information

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments:

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments: PROMIS Bank v1.0 - Physical Function* PROMIS Short Form v1.0 Physical Function 4a* PROMIS Short Form v1.0-physical Function 6a* PROMIS Short Form v1.0-physical Function 8a* PROMIS Short Form v1.0 Physical

More information

Receiver operating characteristic

Receiver operating characteristic Receiver operating characteristic From Wikipedia, the free encyclopedia In signal detection theory, a receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot of the sensitivity,

More information

Framework for Comparative Research on Relational Information Displays

Framework for Comparative Research on Relational Information Displays Framework for Comparative Research on Relational Information Displays Sung Park and Richard Catrambone 2 School of Psychology & Graphics, Visualization, and Usability Center (GVU) Georgia Institute of

More information

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES Amit Teller 1, David M. Steinberg 2, Lina Teper 1, Rotem Rozenblum 2, Liran Mendel 2, and Mordechai Jaeger 2 1 RAFAEL, POB 2250, Haifa, 3102102, Israel

More information

The Effects of Automated Risk Assessment on Reliability, Validity and Return on Investment (ROI)

The Effects of Automated Risk Assessment on Reliability, Validity and Return on Investment (ROI) The Effects of Automated Risk Assessment on Reliability, Validity and Return on Investment (ROI) Grant Duwe, Ph.D. Director, Research and Evaluation May 2016 Email: grant.duwe@state.mn.us Overview Recently

More information

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when.

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. INTRO TO RESEARCH METHODS: Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. Experimental research: treatments are given for the purpose of research. Experimental group

More information

Measuring Focused Attention Using Fixation Inner-Density

Measuring Focused Attention Using Fixation Inner-Density Measuring Focused Attention Using Fixation Inner-Density Wen Liu, Mina Shojaeizadeh, Soussan Djamasbi, Andrew C. Trapp User Experience & Decision Making Research Laboratory, Worcester Polytechnic Institute

More information

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range Lae-Jeong Park and Jung-Ho Moon Department of Electrical Engineering, Kangnung National University Kangnung, Gangwon-Do,

More information

STATISTICS AND RESEARCH DESIGN

STATISTICS AND RESEARCH DESIGN Statistics 1 STATISTICS AND RESEARCH DESIGN These are subjects that are frequently confused. Both subjects often evoke student anxiety and avoidance. To further complicate matters, both areas appear have

More information

JSM Survey Research Methods Section

JSM Survey Research Methods Section Methods and Issues in Trimming Extreme Weights in Sample Surveys Frank Potter and Yuhong Zheng Mathematica Policy Research, P.O. Box 393, Princeton, NJ 08543 Abstract In survey sampling practice, unequal

More information

Identifying the Zygosity Status of Twins Using Bayes Network and Estimation- Maximization Methodology

Identifying the Zygosity Status of Twins Using Bayes Network and Estimation- Maximization Methodology Identifying the Zygosity Status of Twins Using Bayes Network and Estimation- Maximization Methodology Yicun Ni (ID#: 9064804041), Jin Ruan (ID#: 9070059457), Ying Zhang (ID#: 9070063723) Abstract As the

More information

Query Refinement: Negation Detection and Proximity Learning Georgetown at TREC 2014 Clinical Decision Support Track

Query Refinement: Negation Detection and Proximity Learning Georgetown at TREC 2014 Clinical Decision Support Track Query Refinement: Negation Detection and Proximity Learning Georgetown at TREC 2014 Clinical Decision Support Track Christopher Wing and Hui Yang Department of Computer Science, Georgetown University,

More information

The Impact of Relative Standards on the Propensity to Disclose. Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX

The Impact of Relative Standards on the Propensity to Disclose. Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX The Impact of Relative Standards on the Propensity to Disclose Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX 2 Web Appendix A: Panel data estimation approach As noted in the main

More information

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p ) Rasch Measurementt iin Language Educattiion Partt 2:: Measurementt Scalles and Invariiance by James Sick, Ed.D. (J. F. Oberlin University, Tokyo) Part 1 of this series presented an overview of Rasch measurement

More information

Session 1: Dealing with Endogeneity

Session 1: Dealing with Endogeneity Niehaus Center, Princeton University GEM, Sciences Po ARTNeT Capacity Building Workshop for Trade Research: Behind the Border Gravity Modeling Thursday, December 18, 2008 Outline Introduction 1 Introduction

More information