Tolerance of Effectiveness Measures to Relevance Judging Errors
Le Li 1 and Mark D. Smucker 2

1 David R. Cheriton School of Computer Science, University of Waterloo, Canada
2 Department of Management Sciences, University of Waterloo, Canada

Abstract. Crowdsourcing relevance judgments for test collection construction is attractive because the practice has the possibility of being more affordable than hiring high quality assessors. A problem faced by all crowdsourced judgments, even judgments formed from the consensus of multiple workers, is that there will be differences in the judgments compared to the judgments produced by high quality assessors. For two TREC test collections, we simulated errors in sets of judgments and then measured the effect of these errors on effectiveness measures. We found that some measures appear to be more tolerant of errors than others. We also found that achieving high rank correlation in the ranking of retrieval systems requires conservative judgments for average precision (AP) and ndcg, while precision at rank 10 requires neutral judging behavior. Conservative judging avoids mistakenly judging non-relevant documents as relevant at the cost of judging some relevant documents as non-relevant. In addition, we found that while conservative judging behavior maximizes rank correlation for AP and ndcg, minimizing the error in the measures' values requires more liberal behavior. Depending on the nature of a set of crowdsourced judgments, the judgments may be more suitable for some effectiveness measures than others, and some effectiveness measures will require higher levels of judgment quality than others.

1 Introduction

Information retrieval (IR) test collection construction can require 10 to 20 thousand or more relevance judgments.
The best way to obtain relevance judgments is to hire and train assessors who both originate their own search topics and have the ability to carefully and consistently judge hundreds of potentially complex documents. There is considerable interest in utilizing crowdsourcing platforms such as Amazon Mechanical Turk to obtain relevance judgments in an affordable manner [1-3]. Crowdsourced assessors are usually secondary assessors, i.e. assessors who did not originate the search topics. It is well known that secondary assessors produce relevance judgments that differ from those that are or would be produced by primary assessors [4]. Whether there is a single secondary assessor, or a group of secondary assessors whose labels are combined using sophisticated algorithms [5, 6], there will be differences.

M. de Rijke et al. (Eds.): ECIR 2014, LNCS 8416. © Springer International Publishing Switzerland 2014
In this paper, we address this question: What effect do differences in judgments between primary and secondary assessors have on our ability to rank and score retrieval systems? Equivalently, what differences in judgments can various evaluation measures tolerate and still match the evaluation quality produced using primary judgments? To investigate this question, we used two sets of runs submitted to two TREC tracks. For each set of runs, we took the appropriate NIST relevance judgments (also known as qrels) and then simulated a secondary assessor to produce a set of secondary qrels that differed from the NIST, primary qrels. For each set of qrels, we produced scores for the runs using precision at 10 (P@10), mean average precision (MAP), and normalized discounted cumulative gain (ndcg). With a given effectiveness measure, e.g. MAP, we can rank the systems as per the primary and secondary qrels and then measure their rank correlation. We measured rank correlation with Yilmaz et al.'s AP Correlation (APCorr) [7]. Likewise, we can measure the accuracy of the scores produced by the secondary qrels by measuring the root mean square error (RMSE) between the two sets of scores.

To simulate the secondary assessors, we treated the NIST qrels as truth and the secondary assessor as a classifier. A classifier's performance can be understood in terms of its true positive rate (TPR) and its false positive rate (FPR). A given TPR and FPR determine both a classifier's discrimination ability and how conservative or liberal it is in its judging. For example, a conservative classifier avoids judging non-relevant documents as relevant at the cost of mistakenly judging some relevant documents as non-relevant. We used d' to measure discrimination ability and the criterion c to measure how conservative or liberal the judging behavior is [8].
We systematically varied the discrimination ability, d', and the criterion, c, to produce different sets of qrels. We then evaluated the system runs submitted to the TREC 8 ad-hoc and TREC 2005 Robust tracks with these qrels and compared the results to those we obtained using the official NIST qrels. After analyzing the results, we found that:

1. In terms of rank correlation (APCorr), mean average precision (MAP) is more tolerant of errors than ndcg and P@10. In other words, MAP can obtain the same APCorr as ndcg and P@10 with assessors of a lower discrimination ability.

2. To maximize rank correlation, ndcg, MAP, and P@10 require conservative judging. Of the three measures, P@10 requires the least conservative judging and works best with judging close to neutral. The lower the discrimination ability of the judging, the more conservative the judging required by MAP and ndcg to maximize rank correlation. MAP and ndcg appear to be sensitive to false positives.

3. Depending on the discrimination ability of the judging, it can be hard to jointly optimize APCorr and RMSE for MAP and ndcg.

The impact of these findings is that optimizing rank correlation requires attention not only to the discrimination ability of the assessors, but also to how
conservative, liberal, or neutral those assessors are in their judgments. Judging schemes or consensus algorithms may need to be devised that help produce more conservative judgments when MAP and ndcg are the targeted effectiveness measures. If P@10 is to be used as the effectiveness measure, efforts must be taken to maintain neutral judging. From a crowdsourcing point of view, it is likely that some set of high quality, primary assessor relevance judgments will need to be acquired so that the lower quality, crowdsourced, secondary assessor relevance judgments can be calibrated to maximize rank correlation by controlling the relevance criterion used, i.e. by controlling how liberal or conservative the resulting relevance judgments are.

2 Methods and Materials

To conduct our experiments, we used the sets of runs submitted to two TREC tracks. For each TREC track, we took the NIST qrels and simulated assessors of different abilities and biases as compared to the NIST qrels to produce alternative qrels. We then used these alternative qrels to evaluate the sets of runs and measured the effect that the differences in judgments had on our evaluation of the runs submitted to the tracks.

2.1 Runs Submitted to TREC Tracks and QRels

We used the runs submitted to the TREC 8 ad-hoc and TREC 2005 Robust tracks as well as the NIST qrels for each track [9, 10]. For convenience, we refer to the two data sets as Robust2005 and TREC8. Both data sets contain 50 topics. The TREC8 qrels contain 86,830 judgments of which 4,728 are relevant (5.4%). The Robust2005 qrels contain 37,798 judgments of which 6,561 are relevant (17%). TREC8 has 129 submitted runs and Robust2005 has 74 submitted runs.

2.2 Simulation of Judgments

We took the NIST qrels as truth and then simulated assessors of different abilities and biases as measured against the NIST qrels.
We can describe the judging behavior of our simulated assessors in terms of their true positive rates (TPR) and false positive rates (FPR), where TPR = TP / (TP + FN) and FPR = FP / (FP + TN) (as shown in Table 1). Signal detection theory allows us to separately describe the discrimination ability and the decision criterion, or bias, of the assessor [8]. Discrimination ability is measured as d':

    d' = z(TPR) - z(FPR),                  (1)

and the bias of the assessor towards either liberal or conservative judging is described by the criterion c:

    c = -(1/2) (z(TPR) + z(FPR)),          (2)
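As an illustrative sketch (not from the paper; the function names are ours), Equations 1 and 2 can be computed with the Python standard library, whose `statistics.NormalDist.inv_cdf` provides the z transform:

```python
from statistics import NormalDist

def z(p):
    # z score: inverse CDF of the standard normal distribution
    return NormalDist().inv_cdf(p)

def d_prime(tpr, fpr):
    # Equation 1: discrimination ability
    return z(tpr) - z(fpr)

def criterion(tpr, fpr):
    # Equation 2: judging bias (negative = liberal, positive = conservative)
    return -0.5 * (z(tpr) + z(fpr))
```

For a symmetric assessor with TPR = 0.84 and FPR = 0.16, d' is roughly 2 and the criterion is 0 (neutral), matching the interpretation in the text.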
where TPR and FPR are the true positive rate and false positive rate of this assessor, respectively. Function z, the inverse of the normal distribution function, converts the TPR or FPR to a z score [8]. If an assessor tends to label incoming documents as relevant in order to avoid missing relevant documents (but at the risk of a high false positive rate), then this assessor is liberal, with a negative criterion. If c = 0, the assessor is neutral. A conservative assessor has a positive criterion.

Table 1. Confusion matrix. Pos. and Neg. stand for Positive and Negative, respectively.

                               NIST (Primary) Assessor
Simulated Secondary Assessor   Relevant (Pos.)    Non-Relevant (Neg.)
Relevant                       TP = True Pos.     FP = False Pos.
Non-Relevant                   FN = False Neg.    TN = True Neg.

One advantage of using this model is that the measurement of an assessor's ability to discriminate is independent of the assessor's criterion. At a given discrimination ability d', there are many possible values for the TPR and FPR. In other words, two assessors can have the same ability to discriminate between relevant and non-relevant documents, but one may have a much higher relevance criterion than the other. The higher the relevance criterion, the more conservative the assessor. Figure 1 shows example d' curves. All of the points along a curve have the same discrimination ability. Table 2 gives the TPR and FPR for a selection of d' and c values.

Table 2. The TPR and FPR for various d' and c, at c = -1 (liberal), c = 0 (neutral), and c = 1 (conservative).

If an assessor's d' and c are given, we can use them to calculate the TPR and FPR of this assessor using Equations 3 and 4, which are derived from Equations 1 and 2. TPR is computed as

    TPR = CDF(d'/2 - c),                   (3)

and FPR is computed as

    FPR = CDF(-d'/2 - c),                  (4)

where CDF is the cumulative distribution function of the standard normal distribution N(0, 1).
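Equations 3 and 4 can likewise be sketched in Python (the function name is ours, and the d'/2 term follows from solving Equations 1 and 2 for z(TPR) and z(FPR)):

```python
from statistics import NormalDist

def rates(d_prime, c):
    # Equations 3 and 4: the TPR and FPR implied by a given d' and criterion c
    cdf = NormalDist().cdf  # CDF of the standard normal N(0, 1)
    tpr = cdf(d_prime / 2 - c)
    fpr = cdf(-d_prime / 2 - c)
    return tpr, fpr
```

At a fixed d', raising c lowers both rates: a conservative assessor buys a lower FPR at the cost of missing more relevant documents.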
Assuming a document's true label is given, we can generate the simulated judgment by tossing a biased coin. The probability of the assessor making an
error is calculated using the assessor's TPR and FPR. If the true label is relevant, the assessor makes an error with probability equal to 1 - TPR. If the true label is non-relevant, the assessor makes an error with probability equal to FPR.

Fig. 1. Curves of equal d' in ROC space (TPR versus FPR), shown for several values of d' (e.g. d' = 1 and d' = 2), with the directions of criterion c > 0 (conservative) and c < 0 (liberal) marked. This figure is based on Figures 1.1 and 2.1 of [8].

2.3 Experiment Settings

We simulated the noisy judgments of assessors by varying two variables, d' and c, as shown in Algorithm 1 and described in Sec. 2.2. What are the candidate values of d' and c for simulation? Smucker and Jethani [11] estimated the average d' and c of NIST assessors to be 2.3 and 0.7, respectively, across 10 topics of the TREC 2005 Robust Track. In [12], the reported d' and c of crowdsourced assessors are 1.9 and 0.4, respectively, under the same experiment settings as [11]. If we think of the judgments from NIST assessors as the upper bound that a consensus algorithm or hired assessors could achieve, then the results from these two papers indicate that an assessor should have d' and c values close to those of the NIST assessors. Meanwhile, as shown in Fig. 1, d' = 0 means the assessor labels documents by tossing an unbiased coin, i.e. random guessing. So, we set the range of d' to [0, 3], sampled at a fixed step size. For the criterion c, the reported values suggest that both NIST and crowdsourced assessors are conservative, with NIST assessors being more conservative than the crowdsourced workers [11, 12]. At the same time, the behavior of liberal assessors is also worthy of investigation. So, we set c to the range [-3, 3], again sampled at a fixed step size. In total, we simulated 366 different types of assessors who make random errors based on each pair of d' and c values. While we let c vary between -3 and 3, the likely range for c in practice is probably at most -1 to 1. We show the range from -3 to 3 to allow trends to be better seen.
Algorithm 1. Simulate the judgments from one assessor

INPUT: d', c, trueLabels
TPR <- CDF(d'/2 - c)          // CDF of N(0, 1)
FPR <- CDF(-d'/2 - c)
for i = 1 : size(trueLabels) do
    judge_i <- trueLabels_i
    flip <- rand(0, 1)        // a random number in [0, 1)
    if trueLabels_i == 0 then
        if flip <= FPR then
            judge_i <- 1
        end if
    else
        if flip > TPR then
            judge_i <- 0
        end if
    end if
end for
RETURN judge

For each simulated assessor, we repeated the simulation 10 times to generate 10 independent simulated qrels, and then averaged the performance of the simulated qrels for the assessor.

2.4 Measures

To measure the degree of correlation between the simulated and NIST assessors in IR evaluation, we evaluated the submitted runs of the two TREC tracks against the qrels from the simulated and NIST assessors, respectively. Three evaluation metrics were used: Precision at 10 (P@10), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (ndcg). To be more precise, each simulated assessor's judgments were used as pseudo qrels to evaluate all test runs, so that each test run corresponded to one evaluation result, averaged across all topics. For example, Robust2005 has 74 test runs, so we get 74 MAP values against one qrels file. Hence, we can measure the correlation of two qrels files based on the association between the two lists of MAP values derived from identical test runs. The higher the correlation between the rankings produced by a set of simulated qrels and the rankings produced by the NIST qrels, the less effect the simulated errors have on the effectiveness measure. We can compare the tolerance of effectiveness measures to judging errors by measuring the correlation for each measure on a given set of simulated qrels: the higher the correlation, the more fault tolerant the effectiveness measure. We used the Average Precision Correlation Coefficient (APCorr) [7] to measure the correlation.
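A runnable rendering of Algorithm 1 might look as follows (a sketch under our own naming; the seeded `rng` argument is our addition for reproducibility):

```python
import random
from statistics import NormalDist

def simulate_judgments(d_prime, c, true_labels, rng=None):
    # Simulate one assessor's judgments over the true (NIST) labels.
    rng = rng or random.Random()
    cdf = NormalDist().cdf
    tpr = cdf(d_prime / 2 - c)   # Equation 3
    fpr = cdf(-d_prime / 2 - c)  # Equation 4
    judged = []
    for label in true_labels:
        flip = rng.random()  # a random number in [0, 1)
        if label == 0:
            # non-relevant: mistakenly judged relevant with probability FPR
            judged.append(1 if flip <= fpr else 0)
        else:
            # relevant: correctly judged relevant with probability TPR
            judged.append(1 if flip <= tpr else 0)
    return judged
```

An extremely conservative assessor (large positive c) judges nearly everything non-relevant, and an extremely liberal one (large negative c) judges nearly everything relevant, regardless of the true labels.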
APCorr is very similar to another commonly-used correlation measure, Kendall's τ [13], but APCorr gives more weight to errors
nearer to the top of rankings. If two ranking systems perfectly agree with each other, APCorr is 1. Moreover, the Root Mean Square Error (RMSE) was also adopted to calculate the error between two lists of scores. The smaller the RMSE value, the closer the two sets of measurements are in terms of quantity.

Fig. 2. The effects of the pseudo assessor's errors on evaluation with P@10: (a) APCorr, Robust2005; (b) APCorr, TREC8; (c) RMSE, Robust2005; (d) RMSE, TREC8.

3 Results and Discussions

Results are shown in Figures 2, 3, and 4 and Tables 3 and 4. Each figure shows a different effectiveness measure on both TREC8 and Robust2005. The tables show the maximum APCorr achieved by each effectiveness measure for each of the d' values used in the experiment. As is to be expected, the larger the d' value, the better the rank correlation (APCorr) at a given criterion c. Recall that APCorr measures the degree to which the simulated qrels rank the retrieval systems in the same order as the NIST qrels, and that criterion values greater than 0 are conservative while those less than 0 are liberal.
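The APCorr computation itself can be sketched as follows (our own implementation of Yilmaz et al.'s definition [7]; `true_scores` and `est_scores` are hypothetical per-system effectiveness scores): for each position i > 1 in the estimated ranking, take the fraction of the i items ranked above it that the true scores also place above it, average these fractions over all positions, and rescale to [-1, 1].

```python
def ap_corr(true_scores, est_scores):
    # Rank systems by the estimated scores, best first.
    order = sorted(range(len(est_scores)), key=lambda i: -est_scores[i])
    n = len(order)
    total = 0.0
    for i in range(1, n):
        # C(i): systems ranked above position i that the true scores
        # also place above system order[i]
        c = sum(1 for j in range(i)
                if true_scores[order[j]] > true_scores[order[i]])
        total += c / i
    return 2.0 * total / (n - 1) - 1.0
```

Identical rankings give 1 and fully reversed rankings give -1; unlike Kendall's τ, a swap near the top of the ranking costs more than the same swap near the bottom.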
Fig. 3. The effects of the pseudo assessor's errors on evaluation with MAP: (a) APCorr, Robust2005; (b) APCorr, TREC8; (c) RMSE, Robust2005; (d) RMSE, TREC8.

As can be seen in Tables 3 and 4, except for the two lowest d' values on TREC8, mean average precision (MAP) achieves the best rank correlation at a given level of discrimination ability. Indeed, MAP often has an APCorr near the APCorr achieved by ndcg and P@10 at the next higher d', i.e. MAP can achieve the same APCorr as the other metrics but with lower quality assessors. MAP is more fault tolerant than P@10 and ndcg on these test collections.

Evident most clearly in Figures 2, 3, and 4, but also in Tables 3 and 4, MAP and ndcg require conservative judgments to maximize their rank correlation (APCorr). P@10 also requires conservative judgments, but the degree of conservativeness is close to neutral. As the discrimination ability decreases, MAP and ndcg require even more conservative judgments to maximize APCorr. It appears that both MAP and ndcg are sensitive to false positives. Our results reinforce those of Carterette and Soboroff [14], who also found, via a different simulation methodology, that false positives are to be avoided.

A consequence of the need for conservative judgments to maximize rank correlation is that it is hard for secondary assessors, such as crowdsourced assessors, to produce a set of qrels that yields the same scores for MAP and ndcg
as with the NIST qrels. The reason for this is that conservative judging requires missing relevant documents, judging them to be non-relevant, so as to avoid being liberal and mistakenly judging non-relevant documents to be relevant. Both MAP and ndcg are measures over the set of known relevant documents. Thus, conservative judging results in a lower estimate of the total number of relevant documents and changes the scores of MAP and ndcg.

Fig. 4. The effects of the pseudo assessor's errors on evaluation with ndcg: (a) APCorr, Robust2005; (b) APCorr, TREC8; (c) RMSE, Robust2005; (d) RMSE, TREC8.

At high levels of discrimination ability d', the maximum APCorr is obtained with a criterion c that also produces a near-minimum RMSE for all of the effectiveness measures. For MAP and ndcg, as d' decreases, the best criterion c for APCorr and the best criterion for RMSE move apart, and it becomes increasingly hard to jointly optimize for both measures. We can also see in the figures that assessors with greater discrimination ability d' tend to be more robust to changes in the criterion c, with high values of APCorr obtained over wider ranges of c. Meanwhile, we notice that the correlation results on TREC8 tend to be worse than those on Robust2005. Our hypothesis is that since TREC8 has a deeper pool with more non-relevant documents than Robust2005, the number of false
positives is higher with TREC8 when judged at the same FPR. Another possibility is that the unique nature of the manual runs present in TREC8, which are some of the best scoring runs, makes TREC8 harder to judge than Robust2005.

Table 3. The criterion c, TPR, and FPR at which APCorr is maximal for each d' on Robust2005, reported for P@10, MAP, and ndcg.

Table 4. The criterion c, TPR, and FPR at which APCorr is maximal for each d' on TREC8, reported for P@10, MAP, and ndcg.

A somewhat surprising result occurs with MAP on TREC8 and its RMSE. As Fig. 3 shows, the highly discriminative d' = 3.0 qrels actually have a higher RMSE than the lower d' qrels at liberal c values less than about -0.5. As far as we can understand, this inversion of expected behavior results from the lower d' qrels having higher false positive rates which, while producing noisier judgments, result in MAP values that on average are closer to the NIST scores.

3.1 Limitations of Our Methods

Our existing simulation method only captures random errors made by the assessors. Webber et al. [15] have shown that the lower a document is ranked by retrieval engines, the less likely assessors are to make false positive errors on it. In our simulation, the true and false positive rates do not depend on the document being judged. Likewise, we do not attempt to model crowdsourcing-specific error [16]. As such, our results cannot be used to show the discrimination ability required of assessors to obtain a desired rank correlation.
4 Related Work

Voorhees [17] conducted experiments obtaining secondary relevance judgments from high quality NIST assessors. In these experiments, Voorhees found that even with disagreements, the rank correlation of the runs was high. Subsequent work by others has found that differing levels of assessor expertise can negatively affect the ability of secondary assessors to produce qrels that evaluate systems in the same manner as qrels produced by high-quality primary assessors [18, 19].

Most similar to our work, Carterette and Soboroff [14] hypothesized several different models of assessor behavior that could produce judging errors compared to NIST qrels. They found that their pessimistic models resulted in the best rank correlation. These findings are in line with our results showing that conservative assessors are required to maximize rank correlation. Carterette and Soboroff examined the statMAP measure, while we have looked at additional measures and discovered that P@10 does best with slightly conservative, almost neutral judging.

5 Conclusion

We simulated assessor errors by varying both their discrimination ability and their relevance criterion. We examined the effect of these errors on three effectiveness measures: P@10, MAP, and ndcg. We found that MAP is more tolerant of judging errors than P@10 and ndcg: MAP can achieve the same rank correlation with lower quality assessors. We also found that conservative assessors are preferable for achieving high correlation. In other words, it is important that assessors avoid mistakenly judging non-relevant documents as relevant. We also found that different effectiveness measures respond differently to errors in judging. For example, P@10 requires more liberal judging behavior than do MAP and ndcg.
Crowdsourced relevance judging will likely require a sample of documents judged by high quality, primary assessors to allow for the calibration of the judgments produced by crowdsourcing. Future work could involve the design of effectiveness measures specifically designed to better handle relevance judging errors.

Acknowledgments. We thank the reviewers for their helpful reviews. In particular, we thank the meta-reviewer for the helpful set of references to related work. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), in part by the facilities of SHARCNET, and in part by the University of Waterloo. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.
References

1. Alonso, O., Mizzaro, S.: Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In: Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation (July 2009)
2. McCreadie, R., Macdonald, C., Ounis, I.: Crowdsourcing blog track top news judgments at TREC. In: WSDM 2011 Workshop on Crowdsourcing for Search and Data Mining (2011)
3. Smucker, M.D., Kazai, G., Lease, M.: Overview of the TREC 2012 crowdsourcing track (2012)
4. Voorhees, E.M.: Variations in relevance judgments and the measurement of retrieval effectiveness. IPM 36 (2000)
5. Hosseini, M., Cox, I., Milić-Frayling, N., Kazai, G., Vinay, V.: On aggregating labels from multiple crowd workers to infer relevance of documents. In: Baeza-Yates, R., de Vries, A.P., Zaragoza, H., Cambazoglu, B.B., Murdock, V., Lempel, R., Silvestri, F. (eds.) ECIR 2012. LNCS, vol. 7224. Springer, Heidelberg (2012)
6. Raykar, V.C., Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., Moy, L.: Learning from crowds. The Journal of Machine Learning Research 11 (2010)
7. Yilmaz, E., Aslam, J.A., Robertson, S.: A new rank correlation coefficient for information retrieval. In: SIGIR (2008)
8. Macmillan, N.A., Creelman, C.D.: Detection Theory: A User's Guide. Psychology Press (2004)
9. Voorhees, E.M., Harman, D.: Overview of the Eighth Text REtrieval Conference (TREC-8). In: Proceedings of TREC, vol. 8 (1999)
10. Voorhees, E.M.: Overview of TREC 2005. In: Proceedings of TREC (2005)
11. Smucker, M.D., Jethani, C.P.: Measuring assessor accuracy: A comparison of NIST assessors and user study participants. In: SIGIR (2011)
12. Smucker, M., Jethani, C.: The crowd vs. the lab: A comparison of crowd-sourced and university laboratory participant behavior. In: Proceedings of the SIGIR 2011 Workshop on Crowdsourcing for Information Retrieval (2011)
13. Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2) (1938)
14.
Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. In: SIGIR (2010)
15. Webber, W., Chandar, P., Carterette, B.: Alternative assessor disagreement and retrieval depth. In: CIKM (2012)
16. Vuurens, J., de Vries, A.P., Eickhoff, C.: How much spam can you take? In: SIGIR 2011 Workshop on Crowdsourcing for Information Retrieval (CIR) (2011)
17. Voorhees, E.: Variations in relevance judgments and the measurement of retrieval effectiveness. IPM 36(5) (2000)
18. Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A., Yilmaz, E.: Relevance assessment: Are judges exchangeable and does it matter? In: SIGIR (2008)
19. Kinney, K., Huffman, S., Zhai, J.: How evaluator domain expertise affects search result relevance judgments. In: CIKM (2008)
1 Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior Gregory Francis Department of Psychological Sciences Purdue University gfrancis@purdue.edu
More informationImproved Intelligent Classification Technique Based On Support Vector Machines
Improved Intelligent Classification Technique Based On Support Vector Machines V.Vani Asst.Professor,Department of Computer Science,JJ College of Arts and Science,Pudukkottai. Abstract:An abnormal growth
More information3. Model evaluation & selection
Foundations of Machine Learning CentraleSupélec Fall 2016 3. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr
More informationPart II II Evaluating performance in recommender systems
Part II II Evaluating performance in recommender systems If you cannot measure it, you cannot improve it. William Thomson (Lord Kelvin) Chapter 3 3 Evaluation of recommender systems The evaluation of
More informationA Brief Introduction to Bayesian Statistics
A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon
More informationIntroduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015
Introduction to diagnostic accuracy meta-analysis Yemisi Takwoingi October 2015 Learning objectives To appreciate the concept underlying DTA meta-analytic approaches To know the Moses-Littenberg SROC method
More informationPredicting Breast Cancer Survivability Rates
Predicting Breast Cancer Survivability Rates For data collected from Saudi Arabia Registries Ghofran Othoum 1 and Wadee Al-Halabi 2 1 Computer Science, Effat University, Jeddah, Saudi Arabia 2 Computer
More informationRecent developments for combining evidence within evidence streams: bias-adjusted meta-analysis
EFSA/EBTC Colloquium, 25 October 2017 Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis Julian Higgins University of Bristol 1 Introduction to concepts Standard
More informationAUTOMATIC ACNE QUANTIFICATION AND LOCALISATION FOR MEDICAL TREATMENT
AUTOMATIC ACNE QUANTIFICATION AND LOCALISATION FOR MEDICAL TREATMENT Watcharaporn Sitsawangsopon (#1), Maetawee Juladash (#2), Bunyarit Uyyanonvara (#3) (#) School of ICT, Sirindhorn International Institute
More information4. Model evaluation & selection
Foundations of Machine Learning CentraleSupélec Fall 2017 4. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr
More informationModeling Sentiment with Ridge Regression
Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,
More informationImproving Individual and Team Decisions Using Iconic Abstractions of Subjective Knowledge
2004 Command and Control Research and Technology Symposium Improving Individual and Team Decisions Using Iconic Abstractions of Subjective Knowledge Robert A. Fleming SPAWAR Systems Center Code 24402 53560
More informationAnalysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information
Analysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, S. Narayanan Emotion
More informationFixed Effect Combining
Meta-Analysis Workshop (part 2) Michael LaValley December 12 th 2014 Villanova University Fixed Effect Combining Each study i provides an effect size estimate d i of the population value For the inverse
More informationStatistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN
Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Vs. 2 Background 3 There are different types of research methods to study behaviour: Descriptive: observations,
More informationBayes Linear Statistics. Theory and Methods
Bayes Linear Statistics Theory and Methods Michael Goldstein and David Wooff Durham University, UK BICENTENNI AL BICENTENNIAL Contents r Preface xvii 1 The Bayes linear approach 1 1.1 Combining beliefs
More informationInformation Retrieval from Electronic Health Records for Patient Cohort Discovery
Information Retrieval from Electronic Health Records for Patient Cohort Discovery References William Hersh, MD Professor and Chair Department of Medical Informatics & Clinical Epidemiology Oregon Health
More informationThis is the author s version of a work that was submitted/accepted for publication in the following source:
This is the author s version of a work that was submitted/accepted for publication in the following source: Moshfeghi, Yashar, Zuccon, Guido, & Jose, Joemon M. (2011) Using emotion to diversify document
More informationSawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES
Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,
More informationHow Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?
How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? Richards J. Heuer, Jr. Version 1.2, October 16, 2005 This document is from a collection of works by Richards J. Heuer, Jr.
More informationarxiv: v2 [cs.ai] 26 Sep 2018
Manipulating and Measuring Model Interpretability arxiv:1802.07810v2 [cs.ai] 26 Sep 2018 Forough Poursabzi-Sangdeh forough.poursabzi@microsoft.com Microsoft Research Jennifer Wortman Vaughan jenn@microsoft.com
More information7 Grip aperture and target shape
7 Grip aperture and target shape Based on: Verheij R, Brenner E, Smeets JBJ. The influence of target object shape on maximum grip aperture in human grasping movements. Exp Brain Res, In revision 103 Introduction
More informationThe recommended method for diagnosing sleep
reviews Measuring Agreement Between Diagnostic Devices* W. Ward Flemons, MD; and Michael R. Littner, MD, FCCP There is growing interest in using portable monitoring for investigating patients with suspected
More informationMeta-Analysis. Zifei Liu. Biological and Agricultural Engineering
Meta-Analysis Zifei Liu What is a meta-analysis; why perform a metaanalysis? How a meta-analysis work some basic concepts and principles Steps of Meta-analysis Cautions on meta-analysis 2 What is Meta-analysis
More informationTechnical Specifications
Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically
More informationPhone Number:
International Journal of Scientific & Engineering Research, Volume 6, Issue 5, May-2015 1589 Multi-Agent based Diagnostic Model for Diabetes 1 A. A. Obiniyi and 2 M. K. Ahmed 1 Department of Mathematic,
More informationSLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1
SLEEP DISTURBANCE A brief guide to the PROMIS Sleep Disturbance instruments: ADULT PROMIS Item Bank v1.0 Sleep Disturbance PROMIS Short Form v1.0 Sleep Disturbance 4a PROMIS Short Form v1.0 Sleep Disturbance
More informationCHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL
127 CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 6.1 INTRODUCTION Analyzing the human behavior in video sequences is an active field of research for the past few years. The vital applications of this field
More informationPredicting Task Difficulty for Different Task Types
Predicting Task Difficulty for Different Task Types Jingjing Liu, Jacek Gwizdka, Chang Liu, Nicholas J. Belkin School of Communication and Information, Rutgers University 4 Huntington Street, New Brunswick,
More informationMining Human-Place Interaction Patterns from Location-Based Social Networks to Enrich Place Categorization Systems
Mining Human-Place Interaction Patterns from Location-Based Social Networks to Enrich Place Categorization Systems Yingjie Hu, Grant McKenzie, Krzysztof Janowicz, Song Gao STKO Lab, Department of Geography,
More informationImpact and adjustment of selection bias. in the assessment of measurement equivalence
Impact and adjustment of selection bias in the assessment of measurement equivalence Thomas Klausch, Joop Hox,& Barry Schouten Working Paper, Utrecht, December 2012 Corresponding author: Thomas Klausch,
More informationMulti-modal Patient Cohort Identification from EEG Report and Signal Data
Multi-modal Patient Cohort Identification from EEG Report and Signal Data Travis R. Goodwin and Sanda M. Harabagiu The University of Texas at Dallas Human Language Technology Research Institute http://www.hlt.utdallas.edu
More informationChapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE
Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE 1. When you assert that it is improbable that the mean intelligence test score of a particular group is 100, you are using. a. descriptive
More informationFINAL. Recommendations for Update to Arsenic Soil CTL Computation. Methodology Focus Group. Contaminated Soils Forum. Prepared by:
A stakeholder body advising the Florida Department of Environmental Protection FINAL Recommendations for Update to Arsenic Soil CTL Computation Prepared by: Methodology Focus Group Contaminated Soils Forum
More informationPredictive Models for Healthcare Analytics
Predictive Models for Healthcare Analytics A Case on Retrospective Clinical Study Mengling Mornin Feng mfeng@mit.edu mornin@gmail.com 1 Learning Objectives After the lecture, students should be able to:
More informationAgents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models
Int J Comput Math Learning (2009) 14:51 60 DOI 10.1007/s10758-008-9142-6 COMPUTER MATH SNAPHSHOTS - COLUMN EDITOR: URI WILENSKY* Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based
More informationMachine Learning to Inform Breast Cancer Post-Recovery Surveillance
Machine Learning to Inform Breast Cancer Post-Recovery Surveillance Final Project Report CS 229 Autumn 2017 Category: Life Sciences Maxwell Allman (mallman) Lin Fan (linfan) Jamie Kang (kangjh) 1 Introduction
More informationMinimum Feature Selection for Epileptic Seizure Classification using Wavelet-based Feature Extraction and a Fuzzy Neural Network
Appl. Math. Inf. Sci. 8, No. 3, 129-1300 (201) 129 Applied Mathematics & Information Sciences An International Journal http://dx.doi.org/10.1278/amis/0803 Minimum Feature Selection for Epileptic Seizure
More informationJournal of Political Economy, Vol. 93, No. 2 (Apr., 1985)
Confirmations and Contradictions Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985) Estimates of the Deterrent Effect of Capital Punishment: The Importance of the Researcher's Prior Beliefs Walter
More informationEssentials in Bioassay Design and Relative Potency Determination
BioAssay SCIENCES A Division of Thomas A. Little Consulting Essentials in Bioassay Design and Relative Potency Determination Thomas A. Little Ph.D. 2/29/2016 President/CEO BioAssay Sciences 12401 N Wildflower
More informationMarriage Matching with Correlated Preferences
Marriage Matching with Correlated Preferences Onur B. Celik Department of Economics University of Connecticut and Vicki Knoblauch Department of Economics University of Connecticut Abstract Authors of experimental,
More informationTesting the robustness of anonymization techniques: acceptable versus unacceptable inferences - Draft Version
Testing the robustness of anonymization techniques: acceptable versus unacceptable inferences - Draft Version Gergely Acs, Claude Castelluccia, Daniel Le étayer 1 Introduction Anonymization is a critical
More informationReliability, validity, and all that jazz
Reliability, validity, and all that jazz Dylan Wiliam King s College London Published in Education 3-13, 29 (3) pp. 17-21 (2001) Introduction No measuring instrument is perfect. If we use a thermometer
More informationRunning Head: AUTOMATED SCORING OF CONSTRUCTED RESPONSE ITEMS. Contract grant sponsor: National Science Foundation; Contract grant number:
Running Head: AUTOMATED SCORING OF CONSTRUCTED RESPONSE ITEMS Rutstein, D. W., Niekrasz, J., & Snow, E. (2016, April). Automated scoring of constructed response items measuring computational thinking.
More informationEvidence-Based Medicine and Publication Bias Desmond Thompson Merck & Co.
Evidence-Based Medicine and Publication Bias Desmond Thompson Merck & Co. Meta-Analysis Defined A meta-analysis is: the statistical combination of two or more separate studies In other words: overview,
More informationExploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk
Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk Michael Denkowski and Alon Lavie Language Technologies Institute School of
More informationInternational Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use
Final Concept Paper E9(R1): Addendum to Statistical Principles for Clinical Trials on Choosing Appropriate Estimands and Defining Sensitivity Analyses in Clinical Trials dated 22 October 2014 Endorsed
More informationABSTRACT I. INTRODUCTION. Mohd Thousif Ahemad TSKC Faculty Nagarjuna Govt. College(A) Nalgonda, Telangana, India
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 1 ISSN : 2456-3307 Data Mining Techniques to Predict Cancer Diseases
More informationA Predictive Chronological Model of Multiple Clinical Observations T R A V I S G O O D W I N A N D S A N D A M. H A R A B A G I U
A Predictive Chronological Model of Multiple Clinical Observations T R A V I S G O O D W I N A N D S A N D A M. H A R A B A G I U T H E U N I V E R S I T Y O F T E X A S A T D A L L A S H U M A N L A N
More informationInvestigating the Exhaustivity Dimension in Content-Oriented XML Element Retrieval Evaluation
Investigating the Exhaustivity Dimension in Content-Oriented XML Element Retrieval Evaluation Paul Ogilvie Language Technologies Institute Carnegie Mellon University Pittsburgh PA, USA pto@cs.cmu.edu Mounia
More informationEmotion Recognition using a Cauchy Naive Bayes Classifier
Emotion Recognition using a Cauchy Naive Bayes Classifier Abstract Recognizing human facial expression and emotion by computer is an interesting and challenging problem. In this paper we propose a method
More informationEvaluation of CBT for increasing threat detection performance in X-ray screening
Evaluation of CBT for increasing threat detection performance in X-ray screening A. Schwaninger & F. Hofer Department of Psychology, University of Zurich, Switzerland Abstract The relevance of aviation
More informationEvaluation of CBT for increasing threat detection performance in X-ray screening
Evaluation of CBT for increasing threat detection performance in X-ray screening A. Schwaninger & F. Hofer Department of Psychology, University of Zurich, Switzerland Abstract The relevance of aviation
More informationMinimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA
Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA The uncertain nature of property casualty loss reserves Property Casualty loss reserves are inherently uncertain.
More informationKnowledge Discovery and Data Mining. Testing. Performance Measures. Notes. Lecture 15 - ROC, AUC & Lift. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-17-AUC
More informationWhy Is That Relevant? Collecting Annotator Rationales for Relevance Judgments
Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments Tyler McDonnell Dept. of Computer Science University of Texas at Austin tyler@cs.utexas.edu Matthew Lease School of Information
More informationA framework for predicting item difficulty in reading tests
Australian Council for Educational Research ACEReSearch OECD Programme for International Student Assessment (PISA) National and International Surveys 4-2012 A framework for predicting item difficulty in
More informationPHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments:
PROMIS Bank v1.0 - Physical Function* PROMIS Short Form v1.0 Physical Function 4a* PROMIS Short Form v1.0-physical Function 6a* PROMIS Short Form v1.0-physical Function 8a* PROMIS Short Form v1.0 Physical
More informationReceiver operating characteristic
Receiver operating characteristic From Wikipedia, the free encyclopedia In signal detection theory, a receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot of the sensitivity,
More informationFramework for Comparative Research on Relational Information Displays
Framework for Comparative Research on Relational Information Displays Sung Park and Richard Catrambone 2 School of Psychology & Graphics, Visualization, and Usability Center (GVU) Georgia Institute of
More informationNEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES
NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES Amit Teller 1, David M. Steinberg 2, Lina Teper 1, Rotem Rozenblum 2, Liran Mendel 2, and Mordechai Jaeger 2 1 RAFAEL, POB 2250, Haifa, 3102102, Israel
More informationThe Effects of Automated Risk Assessment on Reliability, Validity and Return on Investment (ROI)
The Effects of Automated Risk Assessment on Reliability, Validity and Return on Investment (ROI) Grant Duwe, Ph.D. Director, Research and Evaluation May 2016 Email: grant.duwe@state.mn.us Overview Recently
More informationEmpirical Knowledge: based on observations. Answer questions why, whom, how, and when.
INTRO TO RESEARCH METHODS: Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. Experimental research: treatments are given for the purpose of research. Experimental group
More informationMeasuring Focused Attention Using Fixation Inner-Density
Measuring Focused Attention Using Fixation Inner-Density Wen Liu, Mina Shojaeizadeh, Soussan Djamasbi, Andrew C. Trapp User Experience & Decision Making Research Laboratory, Worcester Polytechnic Institute
More informationA Learning Method of Directly Optimizing Classifier Performance at Local Operating Range
A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range Lae-Jeong Park and Jung-Ho Moon Department of Electrical Engineering, Kangnung National University Kangnung, Gangwon-Do,
More informationSTATISTICS AND RESEARCH DESIGN
Statistics 1 STATISTICS AND RESEARCH DESIGN These are subjects that are frequently confused. Both subjects often evoke student anxiety and avoidance. To further complicate matters, both areas appear have
More informationJSM Survey Research Methods Section
Methods and Issues in Trimming Extreme Weights in Sample Surveys Frank Potter and Yuhong Zheng Mathematica Policy Research, P.O. Box 393, Princeton, NJ 08543 Abstract In survey sampling practice, unequal
More informationIdentifying the Zygosity Status of Twins Using Bayes Network and Estimation- Maximization Methodology
Identifying the Zygosity Status of Twins Using Bayes Network and Estimation- Maximization Methodology Yicun Ni (ID#: 9064804041), Jin Ruan (ID#: 9070059457), Ying Zhang (ID#: 9070063723) Abstract As the
More informationQuery Refinement: Negation Detection and Proximity Learning Georgetown at TREC 2014 Clinical Decision Support Track
Query Refinement: Negation Detection and Proximity Learning Georgetown at TREC 2014 Clinical Decision Support Track Christopher Wing and Hui Yang Department of Computer Science, Georgetown University,
More informationThe Impact of Relative Standards on the Propensity to Disclose. Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX
The Impact of Relative Standards on the Propensity to Disclose Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX 2 Web Appendix A: Panel data estimation approach As noted in the main
More informationShiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )
Rasch Measurementt iin Language Educattiion Partt 2:: Measurementt Scalles and Invariiance by James Sick, Ed.D. (J. F. Oberlin University, Tokyo) Part 1 of this series presented an overview of Rasch measurement
More informationSession 1: Dealing with Endogeneity
Niehaus Center, Princeton University GEM, Sciences Po ARTNeT Capacity Building Workshop for Trade Research: Behind the Border Gravity Modeling Thursday, December 18, 2008 Outline Introduction 1 Introduction
More information