Tolerance of Effectiveness Measures to Relevance Judging Errors


Le Li¹ and Mark D. Smucker²

¹ David R. Cheriton School of Computer Science, University of Waterloo, Canada
² Department of Management Sciences, University of Waterloo, Canada

Abstract. Crowdsourcing relevance judgments for test collection construction is attractive because the practice has the possibility of being more affordable than hiring high quality assessors. A problem faced by all crowdsourced judgments, even judgments formed from the consensus of multiple workers, is that there will be differences in the judgments compared to the judgments produced by high quality assessors. For two TREC test collections, we simulated errors in sets of judgments and then measured the effect of these errors on effectiveness measures. We found that some measures appear to be more tolerant of errors than others. We also found that achieving high rank correlation in the ranking of retrieval systems requires conservative judging for average precision (AP) and ndcg, while precision at rank 10 requires neutral judging behavior. Conservative judging avoids mistakenly judging non-relevant documents as relevant at the cost of judging some relevant documents as non-relevant. In addition, we found that while conservative judging behavior maximizes rank correlation for AP and ndcg, minimizing the error in the measures' values requires more liberal behavior. Depending on the nature of a set of crowdsourced judgments, the judgments may be more suitable for some effectiveness measures than others, and the use of some effectiveness measures will require higher levels of judgment quality than others.

1 Introduction

Information retrieval (IR) test collection construction can require 10 to 20 thousand or more relevance judgments.
The best way to obtain relevance judgments is to hire and train assessors who both originate their own search topics and can carefully and consistently judge hundreds of potentially complex documents. There is considerable interest in utilizing crowdsourcing platforms such as Amazon Mechanical Turk to obtain relevance judgments in an affordable manner [1–3]. Crowdsourced assessors are usually secondary assessors, i.e., assessors who did not originate the search topics. It is well known that secondary assessors produce relevance judgments that differ from those that are or would be produced by primary assessors [4]. Whether there is a single secondary assessor, or a group of secondary assessors whose judgments are combined using sophisticated algorithms [5, 6], there will be differences.

M. de Rijke et al. (Eds.): ECIR 2014, LNCS 8416, pp. 148–159, 2014. © Springer International Publishing Switzerland 2014

In this paper, we address this question: What effect do differences in judgments between primary and secondary assessors have on our ability to rank and score retrieval systems? Equivalently, what differences in judgments can various evaluation measures tolerate and still match the evaluation quality produced using primary judgments? To investigate this question, we used two sets of runs submitted to two TREC tracks. For each set of runs, we took the appropriate NIST relevance judgments (also known as qrels) and then simulated a secondary assessor to produce a set of secondary qrels that differed from the primary NIST qrels. For each set of qrels, we produced scores for the runs using precision at 10 (P@10), mean average precision (MAP), and normalized discounted cumulated gain (ndcg). With a given effectiveness measure, e.g. MAP, we can rank the systems as per the primary and secondary qrels and then measure their rank correlation. We measured rank correlation with Yilmaz et al.'s AP Correlation (APCorr) [7]. Likewise, we can measure the accuracy of the scores produced by the secondary qrels by measuring the root mean square error (RMSE) between the two sets of scores. To simulate the secondary assessors, we treated the NIST qrels as truth and the secondary assessor as a classifier. A classifier's performance can be understood in terms of its true positive rate (TPR) and its false positive rate (FPR). A given TPR and FPR determine both a classifier's discrimination ability and how conservative or liberal it is in its judging. For example, a conservative classifier avoids judging non-relevant documents as relevant at the cost of mistakenly judging some relevant documents as non-relevant. We used d′ to measure discrimination ability and the criterion c to measure how conservative or liberal the judging behavior is [8].
We systematically varied the discrimination ability, d′, and the criterion, c, to produce different sets of qrels. We then evaluated the system runs submitted to the TREC 8 ad-hoc and 2005 TREC Robust tracks with these qrels and compared the results to those we obtained using the official NIST qrels. After analyzing the results, we found that:

1. In terms of rank correlation (APCorr), mean average precision (MAP) is more tolerant of errors than ndcg and P@10. In other words, MAP can obtain the same APCorr as ndcg and P@10 with assessors of a lower discrimination ability.

2. To maximize rank correlation, ndcg, MAP, and P@10 require conservative judging. Of the three measures, P@10 requires the least conservative judging and works best with judging close to neutral. The lower the discrimination ability of the judging, the more conservative judging is required by MAP and ndcg to maximize rank correlation. MAP and ndcg appear to be sensitive to false positives.

3. Depending on the discrimination ability of the judging, it can be hard to jointly optimize APCorr and RMSE for MAP and ndcg.

The impact of these findings is that optimizing rank correlation requires attention not only to the discrimination ability of the assessors, but also to how conservative, liberal, or neutral those assessors are in their judgments. Judging schemes or consensus algorithms may need to be devised that help produce more conservative judgments when MAP and ndcg are the targeted effectiveness measures. If P@10 is to be used as the effectiveness measure, efforts must be taken to maintain neutral judging. From a crowdsourcing point of view, it is likely that some set of high quality, primary assessor relevance judgments will need to be acquired so that the lower quality, crowdsourced, secondary assessor relevance judgments can be calibrated to maximize rank correlation by controlling the relevance criterion used, i.e., by controlling how liberal or conservative the resulting relevance judgments are.

2 Methods and Materials

To conduct our experiments, we used the sets of runs submitted to two TREC tracks. For each TREC track, we took the NIST qrels and simulated assessors of different abilities and biases as compared to the NIST qrels to produce alternative qrels. We then used these alternative qrels to evaluate the sets of runs and measure the effect that the differences in judgments had on our evaluation of the runs submitted to the tracks.

2.1 Runs Submitted to TREC Tracks and QRels

We used the runs submitted to the TREC 8 ad-hoc and 2005 TREC Robust tracks as well as the NIST qrels for each track [9, 10]. For convenience, we refer to the two data sets as Robust2005 and TREC8. Both data sets contain 50 topics. The TREC8 qrels contain 86,830 judgments of which 4,728 are relevant (5.4%). The Robust2005 qrels contain 37,798 judgments of which 6,561 are relevant (17%). TREC8 has 129 submitted runs and Robust2005 has 74 submitted runs.

2.2 Simulation of Judgments

We took the NIST qrels as truth and then simulated assessors of different abilities and biases as measured against the NIST qrels.
We can describe the judging behavior of our simulated assessors in terms of their true positive rates (TPR) and false positive rates (FPR), where TPR = TP/(TP + FN) and FPR = FP/(FP + TN) (as shown in Table 1). Signal detection theory allows us to separately describe the discrimination ability and the decision criterion, or bias, of the assessor [8]. Discrimination ability is measured as d′:

    d′ = z(TPR) − z(FPR),                    (1)

and the bias of the assessor towards either liberal or conservative judging is described by the criterion c:

    c = −(1/2) (z(TPR) + z(FPR)),            (2)

Table 1. Confusion matrix. Pos. and Neg. stand for Positive and Negative, respectively.

                                    NIST (Primary) Assessor
    Simulated Secondary Assessor    Relevant (Pos.)     Non-Relevant (Neg.)
    Relevant                        TP = True Pos.      FP = False Pos.
    Non-Relevant                    FN = False Neg.     TN = True Neg.

where TPR and FPR are the true positive rate and false positive rate of this assessor, respectively. The function z, the inverse of the normal distribution function, converts the TPR or FPR to a z-score [8]. If an assessor tends to label incoming documents as relevant to avoid missing relevant documents (but at the risk of a high false positive rate), then this assessor is liberal, with a negative criterion. If c = 0, the assessor is neutral. A conservative assessor has a positive criterion. One advantage of using this model is that the measurement of an assessor's ability to discriminate is independent of the assessor's criterion. At a given discrimination ability d′, there are many possible values for the TPR and FPR. In other words, two assessors can have the same ability to discriminate between relevant and non-relevant documents, but one may have a much higher relevance criterion than the other. The higher the relevance criterion, the more conservative the assessor. Figure 1 shows example d′ curves. All of the points along a curve have the same discrimination ability. Table 2 gives the TPR and FPR for a selection of d′ and c values.

Table 2. The TPR and FPR for various d′ and c, with c = −1 (liberal), c = 0 (neutral), and c = 1 (conservative).

If an assessor's d′ and c are given, we can use them to calculate the TPR and FPR of this assessor using Equations 3 and 4, which are derived from Equations 1 and 2: adding and subtracting the two equations gives z(TPR) = d′/2 − c and z(FPR) = −d′/2 − c. TPR is computed as

    TPR = CDF(d′/2 − c),                     (3)

and FPR is computed as

    FPR = CDF(−d′/2 − c),                    (4)

where CDF is the cumulative distribution function of the standard normal distribution N(0, 1).
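As a concrete sketch of Equations 1–4, the conversions between (TPR, FPR) and (d′, c) can be written with Python's standard-library statistics.NormalDist. The function names below are our own illustration, not code from the paper.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # the standard normal distribution N(0, 1)

def rates_from_dprime_c(d_prime, c):
    """Equations 3 and 4: TPR = CDF(d'/2 - c), FPR = CDF(-d'/2 - c)."""
    tpr = _STD_NORMAL.cdf(d_prime / 2.0 - c)
    fpr = _STD_NORMAL.cdf(-d_prime / 2.0 - c)
    return tpr, fpr

def dprime_c_from_rates(tpr, fpr):
    """Equations 1 and 2: d' = z(TPR) - z(FPR), c = -(z(TPR) + z(FPR)) / 2."""
    z_tpr = _STD_NORMAL.inv_cdf(tpr)  # z() is the inverse normal CDF
    z_fpr = _STD_NORMAL.inv_cdf(fpr)
    return z_tpr - z_fpr, -(z_tpr + z_fpr) / 2.0
```

For example, a neutral assessor (c = 0) with d′ = 2 has TPR ≈ 0.84 and FPR ≈ 0.16, while a conservative assessor (c = 1) at the same d′ trades TPR ≈ 0.50 for FPR ≈ 0.02.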
Assuming a document's true label is given, we can generate the simulated judgment by tossing a biased coin. The probability of the assessor making an error is calculated using the assessor's TPR and FPR. If the true label is relevant, the assessor makes an error with probability equal to 1 − TPR. If the true label is non-relevant, the assessor makes an error with probability equal to FPR.

Fig. 1. Curves of equal d′ (d′ = 0, 1, and 2), plotted as True Positive Rate against False Positive Rate; points with criterion c > 0 are conservative and points with c < 0 are liberal. This figure is based on Figures 1.1 and 2.1 of [8].

2.3 Experiment Settings

We simulated the noisy judgments of assessors by varying two variables, d′ and c, as shown in Algorithm 1 and described in Sec. 2.2. What are the candidate values of d′ and c for the simulation? Smucker and Jethani [11] estimated the average d′ and c of NIST assessors to be 2.3 and 0.7, respectively, across 10 topics in the 2005 TREC Robust Track. In [12], the reported d′ and c of crowdsourced assessors are 1.9 and 0.4, respectively, with the same experiment settings as in [11]. If we think of the judgments from NIST assessors as an upper bound on what a consensus algorithm or hired assessors could achieve, then the results from these two papers indicate that an assessor should have d′ and c values close to NIST's. Meanwhile, as shown in Fig. 1, d′ = 0 means the assessor labels documents by tossing an unbiased coin, i.e., random guessing. So, we set the range of d′ to [0, 3] with a fixed step size. For the criterion c, the reported values suggest that both NIST and crowdsourced assessors are conservative, with NIST assessors being more conservative than the crowdsourced workers [11, 12]. At the same time, the behavior of liberal assessors is also worthy of investigation. So, we set c to the range [−3, 3] with a fixed step size. In total, we simulated 366 different types of assessors who make random errors based on each pair of d′ and c values. While we consider c varying between −3 and 3, the likely range for c is probably at most −1 to 1. We show the range from −3 to 3 to allow trends to be better seen.

Algorithm 1. Simulate the judgments from one assessor

    INPUT: d′, c, trueLabels
    TPR ← CDF(d′/2 − c)              ▷ CDF of N(0, 1)
    FPR ← CDF(−d′/2 − c)
    for i = 1 : size(trueLabels) do
        judge_i ← trueLabels_i
        flip ← rand(0, 1)            ▷ a random number in [0, 1)
        if trueLabels_i == 0 then
            if flip ≤ FPR then
                judge_i ← 1
            end if
        else
            if flip > TPR then
                judge_i ← 0
            end if
        end if
    end for
    RETURN judge

For each simulated assessor, we repeated the simulation 10 times to generate 10 independent simulated qrels, and then averaged the performance of the simulated qrels for the assessor.

2.4 Measures

To measure the degree of correlation between the simulated and NIST assessors in IR evaluation, we evaluated the submitted runs of the two TREC tracks against the qrels from the simulated and NIST assessors, respectively. Three evaluation metrics were used: Precision at 10 (P@10), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (ndcg). To be more precise, each assessor's simulated judgments were used as pseudo qrels to evaluate all test runs, so that each test run corresponded to one evaluation result, averaged across all topics. For example, Robust2005 has 74 test runs, so we can get 74 MAPs against one qrels file. Hence, we can measure the correlation of two qrels files based on the association between two lists of MAPs derived from identical test runs. The higher the correlation between the rankings produced by a set of simulated qrels and the rankings produced by the NIST qrels, the less effect the simulated errors have on the effectiveness measure. We can compare the tolerance of effectiveness measures to judging errors by measuring the correlation for each measure on a given set of simulated qrels: the higher the correlation, the more fault tolerant the effectiveness measure is. We used the Average Precision Correlation Coefficient (APCorr) [7] to measure the correlation.
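Algorithm 1 can be sketched in Python using only the standard library. The function name and the optional seed parameter are our own additions for reproducibility, not part of the paper.

```python
import random
from statistics import NormalDist

def simulate_judgments(d_prime, c, true_labels, seed=None):
    """Algorithm 1: corrupt true (NIST) labels with a biased coin.

    A true relevant label (1) is kept with probability TPR; a true
    non-relevant label (0) is flipped to relevant with probability FPR.
    """
    norm = NormalDist()  # standard normal N(0, 1)
    tpr = norm.cdf(d_prime / 2.0 - c)
    fpr = norm.cdf(-d_prime / 2.0 - c)
    rng = random.Random(seed)
    judged = []
    for label in true_labels:
        flip = rng.random()  # uniform draw in [0, 1)
        if label == 0:
            judged.append(1 if flip < fpr else 0)
        else:
            judged.append(1 if flip < tpr else 0)
    return judged
```

With a very large d′ the simulated assessor reproduces the true labels exactly; lowering d′ or moving c away from 0 injects the random errors described above.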
APCorr is very similar to another commonly-used correlation measure, Kendall's τ [13], but APCorr gives more weight to errors nearer to the top of the rankings. If two rankings perfectly agree with each other, APCorr is 1. Moreover, the Root Mean Square Error (RMSE) was also adopted to calculate the error between two lists of scores. The smaller the RMSE, the closer the two sets of measurements are in terms of their values.

Fig. 2. The effects of the pseudo assessor's errors on evaluation, P@10. Panels: (a) APCorr, Robust2005; (b) APCorr, TREC8; (c) RMSE, Robust2005; (d) RMSE, TREC8.

3 Results and Discussions

Results are shown in Figures 2, 3, and 4 and Tables 3 and 4. Each figure shows a different effectiveness measure on both TREC8 and Robust2005. The tables show the maximum APCorr achieved by each effectiveness measure for each of the d′ values used in the experiment. As is to be expected, the larger the d′ value, the better the rank correlation (APCorr) at a given criterion c. Recall that APCorr measures the degree to which the simulated qrels rank the retrieval systems in the same order as do the NIST qrels, and that criterion values greater than 0 are conservative while those less than zero are liberal.
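The two comparison measures, APCorr and RMSE, can be sketched as follows; τ_AP is implemented per the definition in [7], and the function names are our own.

```python
import math

def ap_correlation(true_order, est_order):
    """AP correlation (tau_AP) of an estimated ranking against a true one.

    For each item at rank i > 1 in the estimated ranking, count the
    fraction of items ranked above it that the true ranking also places
    above it; tau_AP is twice the average of these fractions, minus 1.
    """
    true_rank = {item: r for r, item in enumerate(true_order)}
    n = len(est_order)
    total = 0.0
    for i in range(1, n):
        item = est_order[i]
        concordant = sum(1 for a in est_order[:i]
                         if true_rank[a] < true_rank[item])
        total += concordant / i
    return 2.0 * total / (n - 1) - 1.0

def rmse(scores_a, scores_b):
    """Root mean square error between two equal-length score lists."""
    n = len(scores_a)
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(scores_a, scores_b)) / n)
```

Two identical rankings give τ_AP = 1, and a fully reversed ranking gives −1; unlike Kendall's τ, a swap near the top of the ranking costs more than a swap near the bottom.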

Fig. 3. The effects of the pseudo assessor's errors on evaluation, MAP. Panels: (a) APCorr, Robust2005; (b) APCorr, TREC8; (c) RMSE, Robust2005; (d) RMSE, TREC8.

As can be seen in Tables 3 and 4, except for the two lowest d′ values on TREC8, mean average precision (MAP) achieves the best rank correlation at a given level of discrimination ability. Indeed, MAP often has an APCorr that is near the APCorr achieved by ndcg and P@10 at the next higher d′, i.e., MAP can achieve the same APCorr as the other metrics but with lower quality assessors. MAP is more fault tolerant than P@10 and ndcg on these test collections. Evident most clearly in Figures 2, 3, and 4, but also in Tables 3 and 4, MAP and ndcg require conservative judgments to maximize their rank correlation (APCorr). P@10 also requires conservative judgments, but the degree of conservativeness is close to neutral. As the discrimination ability decreases, MAP and ndcg require even more conservative judgments to maximize APCorr. It appears that both MAP and ndcg are sensitive to false positives. Our results reinforce those of Carterette and Soboroff [14], who also found, via a different simulation methodology, that false positives are to be avoided.

Fig. 4. The effects of the pseudo assessor's errors on evaluation, ndcg. Panels: (a) APCorr, Robust2005; (b) APCorr, TREC8; (c) RMSE, Robust2005; (d) RMSE, TREC8.

A consequence of the need for conservative judgments to maximize rank correlation is that it is hard for secondary assessors, such as crowdsourced assessors, to produce a set of qrels that can produce the same scores for MAP and ndcg as with the NIST qrels. The reason for this is that conservative judging requires missing relevant documents and judging them to be non-relevant, to avoid being liberal and mistakenly judging non-relevant documents to be relevant. Both MAP and ndcg are measures over the set of known relevant documents. Thus, conservative judging results in a lower estimate of the total number of relevant documents and changes the scores of MAP and ndcg.

At high levels of discrimination ability d′, the maximum APCorr is obtained with a criterion c that also produces a near-minimum RMSE for all of the effectiveness measures. For MAP and ndcg, as d′ decreases, the best criterion c for APCorr and the best criterion for RMSE move apart, and it becomes increasingly hard to jointly optimize for both measures. We can also see in the figures that assessors with greater discrimination ability d′ tend to be more robust to changes in the criterion c, with high values of APCorr obtained over wider ranges of c. Meanwhile, we notice that the correlation results on TREC8 tend to be worse than those on Robust2005. Our hypothesis is that since TREC8 has a deeper pool with more non-relevant documents than Robust2005, the number of false positives is higher for TREC8 when judged with the same FPR. Another possibility is that the unique nature of the manual runs present in TREC8, which are some of the best scoring runs, makes TREC8 harder to judge than Robust2005.

Table 3. The criterion c, TPR, and FPR at which APCorr is maximal for each d′, Robust2005 (columns c, TPR, FPR, and APCorr for each of P@10, MAP, and ndcg).

Table 4. The criterion c, TPR, and FPR at which APCorr is maximal for each d′, TREC8 (columns c, TPR, FPR, and APCorr for each of P@10, MAP, and ndcg).

A somewhat surprising result occurs with MAP on TREC8 and its RMSE. As Fig. 3 shows, the highly discriminative d′ = 3.0 qrels actually have a higher RMSE than the lower d′ qrels at liberal c values less than −0.5 or so. As far as we can understand, this inversion of expected behavior results from the lower d′ qrels having higher false positive rates that, while producing noisier judgments, result in MAP values that on average are closer to the NIST scores.

3.1 Limitations of Our Methods

Our existing simulation method only captures the random errors made by the assessors. Webber et al. [15] have shown that the lower a document is ranked by retrieval engines, the less likely assessors are to make false positive errors. In our simulation, the true and false positive rates do not depend on the document being judged. Likewise, we do not attempt to model crowdsourcing-specific error [16]. As such, our results cannot be used to show the discrimination ability required of assessors to obtain a desired rank correlation.

4 Related Work

Voorhees [17] conducted experiments with obtaining secondary relevance judgments using high quality NIST assessors. In these experiments, Voorhees found that even with disagreements, the rank correlation of the runs was high. Subsequent work by others has found that differing levels of assessor expertise can negatively affect the ability of secondary assessors to produce qrels that evaluate systems in the same manner as qrels produced by high-quality primary assessors [18, 19]. Most similar to our work, Carterette and Soboroff [14] hypothesized several different models of assessor behavior that could produce judging errors compared to NIST qrels. They found that their pessimistic models resulted in the best rank correlation. These findings are in line with our results showing that conservative assessors are required for maximizing rank correlation. Carterette and Soboroff examined the statMAP measure, while we have looked at additional measures and discovered that P@10 does best with slightly conservative, almost neutral judging.

5 Conclusion

We simulated assessor errors by varying both their discrimination ability and their relevance criterion. We examined the effect of these errors on three effectiveness measures: P@10, MAP, and ndcg. We found that MAP is more tolerant of judging errors than P@10 and ndcg: MAP can achieve the same rank correlation with lower quality assessors. We also found that conservative assessors are preferable for achieving high correlation. In other words, it is important that assessors avoid mistakenly judging non-relevant documents as relevant. We also found that different effectiveness measures respond differently to errors in judging. For example, P@10 requires a more liberal judging behavior than do MAP and ndcg.
Crowdsourced relevance judging will likely require a sample of documents judged by high quality, primary assessors to allow for the calibration of the judgments produced by crowdsourcing. Future work could involve the design of effectiveness measures specifically designed to better handle relevance judging errors.

Acknowledgments. We thank the reviewers for their helpful reviews. In particular, we thank the meta-reviewer for the helpful set of references to related work. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), in part by the facilities of SHARCNET, and in part by the University of Waterloo. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

References

1. Alonso, O., Mizzaro, S.: Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In: Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation (July 2009)
2. McCreadie, R., Macdonald, C., Ounis, I.: Crowdsourcing blog track top news judgments at TREC. In: WSDM 2011 Workshop on Crowdsourcing for Search and Data Mining (2011)
3. Smucker, M.D., Kazai, G., Lease, M.: Overview of the TREC 2012 crowdsourcing track (2012)
4. Voorhees, E.M.: Variations in relevance judgments and the measurement of retrieval effectiveness. IPM 36 (2000)
5. Hosseini, M., Cox, I., Milić-Frayling, N., Kazai, G., Vinay, V.: On aggregating labels from multiple crowd workers to infer relevance of documents. In: Baeza-Yates, R., de Vries, A.P., Zaragoza, H., Cambazoglu, B.B., Murdock, V., Lempel, R., Silvestri, F. (eds.) ECIR 2012. LNCS, vol. 7224. Springer, Heidelberg (2012)
6. Raykar, V.C., Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., Moy, L.: Learning from crowds. The Journal of Machine Learning Research 99 (2010)
7. Yilmaz, E., Aslam, J.A., Robertson, S.: A new rank correlation coefficient for information retrieval. In: SIGIR (2008)
8. Macmillan, N.A., Creelman, C.D.: Detection Theory: A User's Guide. Psychology Press (2004)
9. Voorhees, E.M., Harman, D.: Overview of the Eighth Text REtrieval Conference (TREC-8). In: Proceedings of TREC, vol. 8 (1999)
10. Voorhees, E.M.: Overview of TREC 2005. In: Proceedings of TREC (2005)
11. Smucker, M.D., Jethani, C.P.: Measuring assessor accuracy: A comparison of NIST assessors and user study participants. In: SIGIR (2011)
12. Smucker, M., Jethani, C.: The crowd vs. the lab: A comparison of crowd-sourced and university laboratory participant behavior. In: Proceedings of the SIGIR 2011 Workshop on Crowdsourcing for Information Retrieval (2011)
13. Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2) (1938)
14. Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. In: SIGIR (2010)
15. Webber, W., Chandar, P., Carterette, B.: Alternative assessor disagreement and retrieval depth. In: CIKM (2012)
16. Vuurens, J., de Vries, A.P., Eickhoff, C.: How much spam can you take? In: SIGIR 2011 Workshop on Crowdsourcing for Information Retrieval, CIR (2011)
17. Voorhees, E.: Variations in relevance judgments and the measurement of retrieval effectiveness. IPM 36(5) (2000)
18. Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A., Yilmaz, E.: Relevance assessment: Are judges exchangeable and does it matter? In: SIGIR (2008)
19. Kinney, K., Huffman, S., Zhai, J.: How evaluator domain expertise affects search result relevance judgments. In: CIKM (2008)


More information

Investigating the robustness of the nonparametric Levene test with more than two groups

Investigating the robustness of the nonparametric Levene test with more than two groups Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

More information

Analysis of Diabetic Dataset and Developing Prediction Model by using Hive and R

Analysis of Diabetic Dataset and Developing Prediction Model by using Hive and R Indian Journal of Science and Technology, Vol 9(47), DOI: 10.17485/ijst/2016/v9i47/106496, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Analysis of Diabetic Dataset and Developing Prediction

More information

INTRODUCTION TO MACHINE LEARNING. Decision tree learning

INTRODUCTION TO MACHINE LEARNING. Decision tree learning INTRODUCTION TO MACHINE LEARNING Decision tree learning Task of classification Automatically assign class to observations with features Observation: vector of features, with a class Automatically assign

More information

Rank Aggregation and Belief Revision Dynamics

Rank Aggregation and Belief Revision Dynamics Rank Aggregation and Belief Revision Dynamics Igor Volzhanin (ivolzh01@mail.bbk.ac.uk), Ulrike Hahn (u.hahn@bbk.ac.uk), Dell Zhang (dell.z@ieee.org) Birkbeck, University of London London, WC1E 7HX UK Stephan

More information

Efficient AUC Optimization for Information Ranking Applications

Efficient AUC Optimization for Information Ranking Applications Efficient AUC Optimization for Information Ranking Applications Sean J. Welleck IBM, USA swelleck@us.ibm.com Abstract. Adequate evaluation of an information retrieval system to estimate future performance

More information

Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior

Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior 1 Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior Gregory Francis Department of Psychological Sciences Purdue University gfrancis@purdue.edu

More information

Improved Intelligent Classification Technique Based On Support Vector Machines

Improved Intelligent Classification Technique Based On Support Vector Machines Improved Intelligent Classification Technique Based On Support Vector Machines V.Vani Asst.Professor,Department of Computer Science,JJ College of Arts and Science,Pudukkottai. Abstract:An abnormal growth

More information

3. Model evaluation & selection

3. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2016 3. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

Part II II Evaluating performance in recommender systems

Part II II Evaluating performance in recommender systems Part II II Evaluating performance in recommender systems If you cannot measure it, you cannot improve it. William Thomson (Lord Kelvin) Chapter 3 3 Evaluation of recommender systems The evaluation of

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

Introduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015

Introduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015 Introduction to diagnostic accuracy meta-analysis Yemisi Takwoingi October 2015 Learning objectives To appreciate the concept underlying DTA meta-analytic approaches To know the Moses-Littenberg SROC method

More information

Predicting Breast Cancer Survivability Rates

Predicting Breast Cancer Survivability Rates Predicting Breast Cancer Survivability Rates For data collected from Saudi Arabia Registries Ghofran Othoum 1 and Wadee Al-Halabi 2 1 Computer Science, Effat University, Jeddah, Saudi Arabia 2 Computer

More information

Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis

Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis EFSA/EBTC Colloquium, 25 October 2017 Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis Julian Higgins University of Bristol 1 Introduction to concepts Standard

More information

AUTOMATIC ACNE QUANTIFICATION AND LOCALISATION FOR MEDICAL TREATMENT

AUTOMATIC ACNE QUANTIFICATION AND LOCALISATION FOR MEDICAL TREATMENT AUTOMATIC ACNE QUANTIFICATION AND LOCALISATION FOR MEDICAL TREATMENT Watcharaporn Sitsawangsopon (#1), Maetawee Juladash (#2), Bunyarit Uyyanonvara (#3) (#) School of ICT, Sirindhorn International Institute

More information

4. Model evaluation & selection

4. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2017 4. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

Modeling Sentiment with Ridge Regression

Modeling Sentiment with Ridge Regression Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,

More information

Improving Individual and Team Decisions Using Iconic Abstractions of Subjective Knowledge

Improving Individual and Team Decisions Using Iconic Abstractions of Subjective Knowledge 2004 Command and Control Research and Technology Symposium Improving Individual and Team Decisions Using Iconic Abstractions of Subjective Knowledge Robert A. Fleming SPAWAR Systems Center Code 24402 53560

More information

Analysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information

Analysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information Analysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, S. Narayanan Emotion

More information

Fixed Effect Combining

Fixed Effect Combining Meta-Analysis Workshop (part 2) Michael LaValley December 12 th 2014 Villanova University Fixed Effect Combining Each study i provides an effect size estimate d i of the population value For the inverse

More information

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Vs. 2 Background 3 There are different types of research methods to study behaviour: Descriptive: observations,

More information

Bayes Linear Statistics. Theory and Methods

Bayes Linear Statistics. Theory and Methods Bayes Linear Statistics Theory and Methods Michael Goldstein and David Wooff Durham University, UK BICENTENNI AL BICENTENNIAL Contents r Preface xvii 1 The Bayes linear approach 1 1.1 Combining beliefs

More information

Information Retrieval from Electronic Health Records for Patient Cohort Discovery

Information Retrieval from Electronic Health Records for Patient Cohort Discovery Information Retrieval from Electronic Health Records for Patient Cohort Discovery References William Hersh, MD Professor and Chair Department of Medical Informatics & Clinical Epidemiology Oregon Health

More information

This is the author s version of a work that was submitted/accepted for publication in the following source:

This is the author s version of a work that was submitted/accepted for publication in the following source: This is the author s version of a work that was submitted/accepted for publication in the following source: Moshfeghi, Yashar, Zuccon, Guido, & Jose, Joemon M. (2011) Using emotion to diversify document

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? Richards J. Heuer, Jr. Version 1.2, October 16, 2005 This document is from a collection of works by Richards J. Heuer, Jr.

More information

arxiv: v2 [cs.ai] 26 Sep 2018

arxiv: v2 [cs.ai] 26 Sep 2018 Manipulating and Measuring Model Interpretability arxiv:1802.07810v2 [cs.ai] 26 Sep 2018 Forough Poursabzi-Sangdeh forough.poursabzi@microsoft.com Microsoft Research Jennifer Wortman Vaughan jenn@microsoft.com

More information

7 Grip aperture and target shape

7 Grip aperture and target shape 7 Grip aperture and target shape Based on: Verheij R, Brenner E, Smeets JBJ. The influence of target object shape on maximum grip aperture in human grasping movements. Exp Brain Res, In revision 103 Introduction

More information

The recommended method for diagnosing sleep

The recommended method for diagnosing sleep reviews Measuring Agreement Between Diagnostic Devices* W. Ward Flemons, MD; and Michael R. Littner, MD, FCCP There is growing interest in using portable monitoring for investigating patients with suspected

More information

Meta-Analysis. Zifei Liu. Biological and Agricultural Engineering

Meta-Analysis. Zifei Liu. Biological and Agricultural Engineering Meta-Analysis Zifei Liu What is a meta-analysis; why perform a metaanalysis? How a meta-analysis work some basic concepts and principles Steps of Meta-analysis Cautions on meta-analysis 2 What is Meta-analysis

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Phone Number:

Phone Number: International Journal of Scientific & Engineering Research, Volume 6, Issue 5, May-2015 1589 Multi-Agent based Diagnostic Model for Diabetes 1 A. A. Obiniyi and 2 M. K. Ahmed 1 Department of Mathematic,

More information

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1 SLEEP DISTURBANCE A brief guide to the PROMIS Sleep Disturbance instruments: ADULT PROMIS Item Bank v1.0 Sleep Disturbance PROMIS Short Form v1.0 Sleep Disturbance 4a PROMIS Short Form v1.0 Sleep Disturbance

More information

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 127 CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 6.1 INTRODUCTION Analyzing the human behavior in video sequences is an active field of research for the past few years. The vital applications of this field

More information

Predicting Task Difficulty for Different Task Types

Predicting Task Difficulty for Different Task Types Predicting Task Difficulty for Different Task Types Jingjing Liu, Jacek Gwizdka, Chang Liu, Nicholas J. Belkin School of Communication and Information, Rutgers University 4 Huntington Street, New Brunswick,

More information

Mining Human-Place Interaction Patterns from Location-Based Social Networks to Enrich Place Categorization Systems

Mining Human-Place Interaction Patterns from Location-Based Social Networks to Enrich Place Categorization Systems Mining Human-Place Interaction Patterns from Location-Based Social Networks to Enrich Place Categorization Systems Yingjie Hu, Grant McKenzie, Krzysztof Janowicz, Song Gao STKO Lab, Department of Geography,

More information

Impact and adjustment of selection bias. in the assessment of measurement equivalence

Impact and adjustment of selection bias. in the assessment of measurement equivalence Impact and adjustment of selection bias in the assessment of measurement equivalence Thomas Klausch, Joop Hox,& Barry Schouten Working Paper, Utrecht, December 2012 Corresponding author: Thomas Klausch,

More information

Multi-modal Patient Cohort Identification from EEG Report and Signal Data

Multi-modal Patient Cohort Identification from EEG Report and Signal Data Multi-modal Patient Cohort Identification from EEG Report and Signal Data Travis R. Goodwin and Sanda M. Harabagiu The University of Texas at Dallas Human Language Technology Research Institute http://www.hlt.utdallas.edu

More information

Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE

Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE 1. When you assert that it is improbable that the mean intelligence test score of a particular group is 100, you are using. a. descriptive

More information

FINAL. Recommendations for Update to Arsenic Soil CTL Computation. Methodology Focus Group. Contaminated Soils Forum. Prepared by:

FINAL. Recommendations for Update to Arsenic Soil CTL Computation. Methodology Focus Group. Contaminated Soils Forum. Prepared by: A stakeholder body advising the Florida Department of Environmental Protection FINAL Recommendations for Update to Arsenic Soil CTL Computation Prepared by: Methodology Focus Group Contaminated Soils Forum

More information

Predictive Models for Healthcare Analytics

Predictive Models for Healthcare Analytics Predictive Models for Healthcare Analytics A Case on Retrospective Clinical Study Mengling Mornin Feng mfeng@mit.edu mornin@gmail.com 1 Learning Objectives After the lecture, students should be able to:

More information

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models Int J Comput Math Learning (2009) 14:51 60 DOI 10.1007/s10758-008-9142-6 COMPUTER MATH SNAPHSHOTS - COLUMN EDITOR: URI WILENSKY* Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based

More information

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance Machine Learning to Inform Breast Cancer Post-Recovery Surveillance Final Project Report CS 229 Autumn 2017 Category: Life Sciences Maxwell Allman (mallman) Lin Fan (linfan) Jamie Kang (kangjh) 1 Introduction

More information

Minimum Feature Selection for Epileptic Seizure Classification using Wavelet-based Feature Extraction and a Fuzzy Neural Network

Minimum Feature Selection for Epileptic Seizure Classification using Wavelet-based Feature Extraction and a Fuzzy Neural Network Appl. Math. Inf. Sci. 8, No. 3, 129-1300 (201) 129 Applied Mathematics & Information Sciences An International Journal http://dx.doi.org/10.1278/amis/0803 Minimum Feature Selection for Epileptic Seizure

More information

Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985)

Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985) Confirmations and Contradictions Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985) Estimates of the Deterrent Effect of Capital Punishment: The Importance of the Researcher's Prior Beliefs Walter

More information

Essentials in Bioassay Design and Relative Potency Determination

Essentials in Bioassay Design and Relative Potency Determination BioAssay SCIENCES A Division of Thomas A. Little Consulting Essentials in Bioassay Design and Relative Potency Determination Thomas A. Little Ph.D. 2/29/2016 President/CEO BioAssay Sciences 12401 N Wildflower

More information

Marriage Matching with Correlated Preferences

Marriage Matching with Correlated Preferences Marriage Matching with Correlated Preferences Onur B. Celik Department of Economics University of Connecticut and Vicki Knoblauch Department of Economics University of Connecticut Abstract Authors of experimental,

More information

Testing the robustness of anonymization techniques: acceptable versus unacceptable inferences - Draft Version

Testing the robustness of anonymization techniques: acceptable versus unacceptable inferences - Draft Version Testing the robustness of anonymization techniques: acceptable versus unacceptable inferences - Draft Version Gergely Acs, Claude Castelluccia, Daniel Le étayer 1 Introduction Anonymization is a critical

More information

Reliability, validity, and all that jazz

Reliability, validity, and all that jazz Reliability, validity, and all that jazz Dylan Wiliam King s College London Published in Education 3-13, 29 (3) pp. 17-21 (2001) Introduction No measuring instrument is perfect. If we use a thermometer

More information

Running Head: AUTOMATED SCORING OF CONSTRUCTED RESPONSE ITEMS. Contract grant sponsor: National Science Foundation; Contract grant number:

Running Head: AUTOMATED SCORING OF CONSTRUCTED RESPONSE ITEMS. Contract grant sponsor: National Science Foundation; Contract grant number: Running Head: AUTOMATED SCORING OF CONSTRUCTED RESPONSE ITEMS Rutstein, D. W., Niekrasz, J., & Snow, E. (2016, April). Automated scoring of constructed response items measuring computational thinking.

More information

Evidence-Based Medicine and Publication Bias Desmond Thompson Merck & Co.

Evidence-Based Medicine and Publication Bias Desmond Thompson Merck & Co. Evidence-Based Medicine and Publication Bias Desmond Thompson Merck & Co. Meta-Analysis Defined A meta-analysis is: the statistical combination of two or more separate studies In other words: overview,

More information

Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk

Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk Michael Denkowski and Alon Lavie Language Technologies Institute School of

More information

International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use

International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use Final Concept Paper E9(R1): Addendum to Statistical Principles for Clinical Trials on Choosing Appropriate Estimands and Defining Sensitivity Analyses in Clinical Trials dated 22 October 2014 Endorsed

More information

ABSTRACT I. INTRODUCTION. Mohd Thousif Ahemad TSKC Faculty Nagarjuna Govt. College(A) Nalgonda, Telangana, India

ABSTRACT I. INTRODUCTION. Mohd Thousif Ahemad TSKC Faculty Nagarjuna Govt. College(A) Nalgonda, Telangana, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 1 ISSN : 2456-3307 Data Mining Techniques to Predict Cancer Diseases

More information

A Predictive Chronological Model of Multiple Clinical Observations T R A V I S G O O D W I N A N D S A N D A M. H A R A B A G I U

A Predictive Chronological Model of Multiple Clinical Observations T R A V I S G O O D W I N A N D S A N D A M. H A R A B A G I U A Predictive Chronological Model of Multiple Clinical Observations T R A V I S G O O D W I N A N D S A N D A M. H A R A B A G I U T H E U N I V E R S I T Y O F T E X A S A T D A L L A S H U M A N L A N

More information

Investigating the Exhaustivity Dimension in Content-Oriented XML Element Retrieval Evaluation

Investigating the Exhaustivity Dimension in Content-Oriented XML Element Retrieval Evaluation Investigating the Exhaustivity Dimension in Content-Oriented XML Element Retrieval Evaluation Paul Ogilvie Language Technologies Institute Carnegie Mellon University Pittsburgh PA, USA pto@cs.cmu.edu Mounia

More information

Emotion Recognition using a Cauchy Naive Bayes Classifier

Emotion Recognition using a Cauchy Naive Bayes Classifier Emotion Recognition using a Cauchy Naive Bayes Classifier Abstract Recognizing human facial expression and emotion by computer is an interesting and challenging problem. In this paper we propose a method

More information

Evaluation of CBT for increasing threat detection performance in X-ray screening

Evaluation of CBT for increasing threat detection performance in X-ray screening Evaluation of CBT for increasing threat detection performance in X-ray screening A. Schwaninger & F. Hofer Department of Psychology, University of Zurich, Switzerland Abstract The relevance of aviation

More information

Evaluation of CBT for increasing threat detection performance in X-ray screening

Evaluation of CBT for increasing threat detection performance in X-ray screening Evaluation of CBT for increasing threat detection performance in X-ray screening A. Schwaninger & F. Hofer Department of Psychology, University of Zurich, Switzerland Abstract The relevance of aviation

More information

Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA

Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA The uncertain nature of property casualty loss reserves Property Casualty loss reserves are inherently uncertain.

More information

Knowledge Discovery and Data Mining. Testing. Performance Measures. Notes. Lecture 15 - ROC, AUC & Lift. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Testing. Performance Measures. Notes. Lecture 15 - ROC, AUC & Lift. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-17-AUC

More information

Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments

Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments Tyler McDonnell Dept. of Computer Science University of Texas at Austin tyler@cs.utexas.edu Matthew Lease School of Information

More information

A framework for predicting item difficulty in reading tests

A framework for predicting item difficulty in reading tests Australian Council for Educational Research ACEReSearch OECD Programme for International Student Assessment (PISA) National and International Surveys 4-2012 A framework for predicting item difficulty in

More information

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments:

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments: PROMIS Bank v1.0 - Physical Function* PROMIS Short Form v1.0 Physical Function 4a* PROMIS Short Form v1.0-physical Function 6a* PROMIS Short Form v1.0-physical Function 8a* PROMIS Short Form v1.0 Physical

More information

Receiver operating characteristic

Receiver operating characteristic Receiver operating characteristic From Wikipedia, the free encyclopedia In signal detection theory, a receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot of the sensitivity,

More information

Framework for Comparative Research on Relational Information Displays

Framework for Comparative Research on Relational Information Displays Framework for Comparative Research on Relational Information Displays Sung Park and Richard Catrambone 2 School of Psychology & Graphics, Visualization, and Usability Center (GVU) Georgia Institute of

More information

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES Amit Teller 1, David M. Steinberg 2, Lina Teper 1, Rotem Rozenblum 2, Liran Mendel 2, and Mordechai Jaeger 2 1 RAFAEL, POB 2250, Haifa, 3102102, Israel

More information

The Effects of Automated Risk Assessment on Reliability, Validity and Return on Investment (ROI)

The Effects of Automated Risk Assessment on Reliability, Validity and Return on Investment (ROI) The Effects of Automated Risk Assessment on Reliability, Validity and Return on Investment (ROI) Grant Duwe, Ph.D. Director, Research and Evaluation May 2016 Email: grant.duwe@state.mn.us Overview Recently

More information

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when.

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. INTRO TO RESEARCH METHODS: Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. Experimental research: treatments are given for the purpose of research. Experimental group

More information

Measuring Focused Attention Using Fixation Inner-Density

Measuring Focused Attention Using Fixation Inner-Density Measuring Focused Attention Using Fixation Inner-Density Wen Liu, Mina Shojaeizadeh, Soussan Djamasbi, Andrew C. Trapp User Experience & Decision Making Research Laboratory, Worcester Polytechnic Institute

More information

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range Lae-Jeong Park and Jung-Ho Moon Department of Electrical Engineering, Kangnung National University Kangnung, Gangwon-Do,

More information

STATISTICS AND RESEARCH DESIGN

STATISTICS AND RESEARCH DESIGN Statistics 1 STATISTICS AND RESEARCH DESIGN These are subjects that are frequently confused. Both subjects often evoke student anxiety and avoidance. To further complicate matters, both areas appear have

More information

JSM Survey Research Methods Section

JSM Survey Research Methods Section Methods and Issues in Trimming Extreme Weights in Sample Surveys Frank Potter and Yuhong Zheng Mathematica Policy Research, P.O. Box 393, Princeton, NJ 08543 Abstract In survey sampling practice, unequal

More information

Identifying the Zygosity Status of Twins Using Bayes Network and Estimation- Maximization Methodology

Identifying the Zygosity Status of Twins Using Bayes Network and Estimation- Maximization Methodology Identifying the Zygosity Status of Twins Using Bayes Network and Estimation- Maximization Methodology Yicun Ni (ID#: 9064804041), Jin Ruan (ID#: 9070059457), Ying Zhang (ID#: 9070063723) Abstract As the

More information

Query Refinement: Negation Detection and Proximity Learning Georgetown at TREC 2014 Clinical Decision Support Track

Query Refinement: Negation Detection and Proximity Learning Georgetown at TREC 2014 Clinical Decision Support Track Query Refinement: Negation Detection and Proximity Learning Georgetown at TREC 2014 Clinical Decision Support Track Christopher Wing and Hui Yang Department of Computer Science, Georgetown University,

More information

The Impact of Relative Standards on the Propensity to Disclose. Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX

The Impact of Relative Standards on the Propensity to Disclose. Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX The Impact of Relative Standards on the Propensity to Disclose Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX 2 Web Appendix A: Panel data estimation approach As noted in the main

More information

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p ) Rasch Measurementt iin Language Educattiion Partt 2:: Measurementt Scalles and Invariiance by James Sick, Ed.D. (J. F. Oberlin University, Tokyo) Part 1 of this series presented an overview of Rasch measurement

More information

Session 1: Dealing with Endogeneity

Session 1: Dealing with Endogeneity Niehaus Center, Princeton University GEM, Sciences Po ARTNeT Capacity Building Workshop for Trade Research: Behind the Border Gravity Modeling Thursday, December 18, 2008 Outline Introduction 1 Introduction

More information