
Figure 1: Design and outcomes of an independent blind study with gold/reference standard comparison. Adapted from DCEB (1981b)

True positive (TP): the diagnostic test investigated indicates the patient has the target condition and the gold/reference standard confirms that the patient has it.
False positive (FP): the diagnostic test investigated indicates the patient has the target condition but the gold/reference standard indicates the patient does not have it.
False negative (FN): the diagnostic test investigated indicates the patient does not have the target condition but the gold/reference standard indicates the patient has it.
True negative (TN): the diagnostic test investigated indicates the patient does not have the target condition and the gold/reference standard confirms that the patient does not have it.
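To make the layout of Figure 1 concrete, the short sketch below tallies the four outcome cells from paired gold/reference standard and index test results. It is a minimal illustration in Python; the data and variable names are hypothetical, not taken from the source.

    gold = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = condition present according to the gold/reference standard
    test = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 = diagnostic test investigated is positive

    # Tally each of the four outcome cells defined in Figure 1
    tp = sum(1 for g, t in zip(gold, test) if g == 1 and t == 1)  # true positives
    fn = sum(1 for g, t in zip(gold, test) if g == 1 and t == 0)  # false negatives
    fp = sum(1 for g, t in zip(gold, test) if g == 0 and t == 1)  # false positives
    tn = sum(1 for g, t in zip(gold, test) if g == 0 and t == 0)  # true negatives

    print(tp, fp, fn, tn)  # -> 3 1 1 3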

Table 15: Statistics recommended for the evaluation of diagnostic test validity using an independent blind study with gold/reference standard comparison

Sensitivity
Description: The ratio of true positive results from the diagnostic test evaluated to the total number of positive results obtained with the gold/reference standard (TP/[TP+FN] using Figure 1) (DCEB, 1981b). Numerically, this is expressed as a value between 0 and 1. It is an index of the test's ability to detect the presence of the target condition (DCEB, 1981b; Lalkhen and McCluskey, 2008; Altman and Bland, 1994a).
Desired values: Values closer to 1 indicate greater sensitivity.

Specificity
Description: The ratio of true negative results from the diagnostic test evaluated to the total number of negative results obtained with the gold/reference standard (TN/[FP+TN] using Figure 1) (DCEB, 1981b). Numerically, this is expressed as a value between 0 and 1. It is an index of the test's ability to detect the absence of the target condition (DCEB, 1981b; Lalkhen and McCluskey, 2008; Altman and Bland, 1994a).
Desired values: Values closer to 1 indicate greater specificity.

Positive predictive value
Description: The ratio of true positive results to the total number of positive results obtained with the diagnostic test examined (TP/[TP+FP] using Figure 1) (DCEB, 1981b). Numerically, this is expressed as a value between 0 and 1. It indicates the proportion of positive tests that correctly diagnosed the presence of the target condition (DCEB, 1981b; Lalkhen and McCluskey, 2008). This value can also be calculated from the sensitivity and specificity of the test and the prevalence of the condition in the population being tested: (sensitivity × prevalence) / [sensitivity × prevalence + (1 − specificity) × (1 − prevalence)] (Altman and Bland, 1994b).
Desired values: Values closer to 1 indicate greater capability of correctly diagnosing the presence of the target condition.

Negative predictive value
Description: The ratio of true negative results to the total number of negative results obtained with the diagnostic test examined (TN/[FN+TN] using Figure 1) (DCEB, 1981b). Numerically, this is expressed as a value between 0 and 1. It indicates the proportion of negative tests that correctly diagnosed the absence of the target condition (DCEB, 1981b; Lalkhen and McCluskey, 2008). This value can also be calculated from the sensitivity and specificity of the test and the prevalence of the condition in the population being tested: [specificity × (1 − prevalence)] / [(1 − sensitivity) × prevalence + specificity × (1 − prevalence)] (Altman and Bland, 1994b).
Desired values: Values closer to 1 indicate greater capability of correctly diagnosing the absence of the target condition.

Accuracy
Description: The ratio of the total number of true positive and true negative results from the diagnostic test evaluated to the total number of tests conducted ([TP+TN]/n using Figure 1) (DCEB, 1981b). Numerically, this is expressed as a value between 0 and 1. It is the overall rate of agreement between the diagnostic test examined and the gold/reference standard (DCEB, 1981b; Bossuyt et al., 2003).
Desired values: Values closer to 1 indicate greater accuracy of the diagnostic test.
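As a worked illustration of the formulas in Table 15, the sketch below computes each statistic from hypothetical cell counts (the counts are assumptions chosen for the example, not data from the source). The last two expressions recompute the predictive values from sensitivity, specificity and prevalence using the Altman and Bland (1994b) formulas quoted above.

    tp, fp, fn, tn = 90, 20, 10, 80              # hypothetical Figure 1 cell counts
    n = tp + fp + fn + tn                        # total number of tests conducted

    sensitivity = tp / (tp + fn)                 # proportion of gold-standard positives detected
    specificity = tn / (tn + fp)                 # proportion of gold-standard negatives detected
    ppv = tp / (tp + fp)                         # positive predictive value
    npv = tn / (tn + fn)                         # negative predictive value
    accuracy = (tp + tn) / n                     # overall agreement with the gold/reference standard

    # Predictive values from sensitivity, specificity and prevalence (Altman and Bland, 1994b)
    prevalence = (tp + fn) / n
    ppv_from_prev = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv_from_prev = (specificity * (1 - prevalence)) / (
        (1 - sensitivity) * prevalence + specificity * (1 - prevalence))

    print(round(sensitivity, 3), round(specificity, 3), round(ppv, 3),
          round(npv, 3), round(accuracy, 3))     # -> 0.9 0.8 0.818 0.889 0.85

With these counts the direct and prevalence-based predictive values agree (0.818 and 0.889), as they should whenever prevalence is taken from the same study sample.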

Table 16: Statistics used for the evaluation of diagnostic test reliability

Test-retest reliability

Pearson's r correlation coefficient
Description: The ratio of the sum of the products of the paired results from the two test administrations to the square root of the product of the sums of squares for each administration. It provides a measure of the strength of the linear relationship between the two sets of results (Osborn, 2006; Blaisdell, 1998).
Desired values: Values range from -1 to 1 (Blaisdell, 1998). An r value of 1 or -1 indicates a perfect linear relationship; an r value of 0 indicates no linear relationship (Blaisdell, 1998).

Intraclass correlation coefficient (ICC)
Description: The ratio of the variance between the results obtained on different occasions for a single subject to the total variance in all the results collected for that subject (Streiner and Norman, 2008; Cleophas, Zwinderman and Cleophas, 2006). Numerically, this is expressed as a value between 0 and 1, with 1 representing perfect reliability/reproducibility and 0 representing no reliability/reproducibility (Streiner and Norman, 2008; Cleophas, Zwinderman and Cleophas, 2005). Also known as test-retest stability (Streiner and Norman, 2008), the proportion of variance, or the correlation ratio (Cleophas, Zwinderman and Cleophas, 2006).
Desired values: It is reasonable to demand values greater than 0.5 (Streiner and Norman, 2008).

Kuder-Richardson 20 (KR-20)
Description: An index of the homogeneity of measurements for a given set of results, used to assess the internal consistency reliability of a measurement instrument (Thompson, 2010; Ramsey et al., 1991). The squared KR-20 value represents the proportion of score variance not resulting from error (Thompson, 2010).
Desired values: Values range from 0 to 1, with 1 representing perfect internal consistency (Thompson, 2010). A KR-20 value greater than 0.7 is acceptable; tests that use 50 or more items in their assessment should require values greater than 0.8 (Thompson, 2010). KR-20 values below 0.7 indicate that the majority of score variance results from error (Thompson, 2010).
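The sketch below illustrates how the three test-retest statistics can be computed: Pearson's r between two administrations, an intraclass correlation from a simple one-way random-effects model (one common ICC formulation; the cited sources may define the coefficient under a different model), and KR-20 for dichotomously scored items. All scores are hypothetical, and the plain-Python estimators are simplified assumptions rather than the exact procedures of the cited references.

    import statistics as st

    # Pearson's r between two administrations of the same test (hypothetical scores)
    x = [4, 7, 6, 9, 3, 8]
    y = [5, 7, 5, 9, 4, 7]
    mx, my = st.mean(x), st.mean(y)
    r = (sum((a - mx) * (b - my) for a, b in zip(x, y)) /
         (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5)

    # One-way intraclass correlation: between-subject variance relative to total variance,
    # estimated from a random-effects ANOVA with k repeated measurements per subject
    scores = [[4, 5], [7, 7], [6, 5], [9, 9], [3, 4], [8, 7]]   # rows = subjects, columns = occasions
    n, k = len(scores), len(scores[0])
    grand = st.mean(v for row in scores for v in row)
    ms_between = k * sum((st.mean(row) - grand) ** 2 for row in scores) / (n - 1)
    ms_within = sum(sum((v - st.mean(row)) ** 2 for v in row) for row in scores) / (n * (k - 1))
    icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

    # Kuder-Richardson 20 for dichotomously scored items (rows = examinees, columns = items),
    # using the population variance of the total scores
    items = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
    m = len(items[0])
    totals = [sum(row) for row in items]
    p = [sum(row[j] for row in items) / len(items) for j in range(m)]
    kr20 = (m / (m - 1)) * (1 - sum(pi * (1 - pi) for pi in p) / st.pvariance(totals))

    print(round(r, 3), round(icc, 3), round(kr20, 3))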

Inter-rater reliability

Percent agreement
Description: The percentage of tests for which the test administrators obtained the same results.
Desired values: Cadogan et al. (2011) indicated that a percent agreement greater than 80% is required for a test to be considered appropriate for inclusion in a clinical examination.

Intraclass correlation coefficient (ICC)
Description: Similar to the test-retest description, except that the variance among the results of different test administrators on a single subject is used instead of the variance between several tests administered on different occasions (Streiner and Norman, 2008; Gulliford, 2005; Tzannes and Murrell, 2002; Cleophas, Zwinderman and Cleophas, 2006).
Desired values: Same as the test-retest description, with the addition that Tzannes and Murrell (2002) considered an intraclass coefficient of 0.5-0.7 to represent reasonable inter-rater reliability, while Tzannes et al. (2004) considered an intraclass coefficient greater than 0.65 to be good and one below 0.31 to be poor.

Cohen's kappa
Description: The ratio of the sum of observed agreements minus the sum of expected agreements to the total number of observations minus the sum of expected agreements (Sheskin, 2004). The calculation indicates the degree of agreement between the results obtained by different (or the same) test administrators after correcting for the agreement expected by chance (Sheskin, 2004). Calculations are based on results organized into contingency tables specific to the number of test administrators and outcomes measured (Sheskin, 2004).
Desired values: Landis and Koch (1977) suggest that kappa values of 0.00-0.20 indicate slight agreement; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; and 0.81-1.00 almost perfect agreement. Walsworth et al. (2008) indicated that a kappa value greater than 0.8 indicates strong agreement. Cadogan et al. (2011) indicated that a kappa value greater than 0.6 is required for a test to be considered appropriate for inclusion in a clinical examination.

Coefficient of inter-observer variability
Description: The ratio of inter-rater variability to the total observer-related variability (Haber et al., 2005). Total observer-related variability is the sum of intra- and inter-rater variability (Haber et al., 2005). Inter-rater variability is the variance in replicated measurements made on the same subject with all methods by all observers (Barnhart, Song and Haber, 2005; Haber et al., 2005).
Desired values: A higher coefficient of inter-observer variability indicates a lower level of inter-rater agreement (Haber et al., 2005).
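To show how the chance correction in Cohen's kappa changes a raw agreement figure, the following sketch computes percent agreement and kappa for two hypothetical raters classifying the same ten subjects; the ratings are made up for the example.

    from collections import Counter

    rater_a = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'pos']
    rater_b = ['pos', 'neg', 'neg', 'neg', 'pos', 'neg', 'pos', 'pos', 'neg', 'pos']
    n = len(rater_a)

    # Observed agreement: proportion of subjects both raters scored identically
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement from each rater's marginal category frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_expected = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

    kappa = (p_observed - p_expected) / (1 - p_expected)
    print(round(p_observed, 3), round(kappa, 3))   # -> 0.8 0.6

Here the raters agree on 80% of subjects, but kappa falls to 0.6 once the 50% agreement expected by chance with these marginal frequencies is removed.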

Coefficient of inter-observer agreement
Description: One minus the coefficient of inter-observer variability (Haber et al., 2005). A higher coefficient of inter-observer agreement indicates a greater level of inter-rater agreement (Haber et al., 2005).
Desired values: See above.

Intra-rater reliability

Intraclass correlation coefficient (ICC)
Description: Similar to the test-retest description, except that the variability among the results of a single test administrator is used instead of the variability between several tests administered on different occasions (Streiner and Norman, 2008; Cleophas, Zwinderman and Cleophas, 2006).
Desired values: Same as the test-retest description.

Cohen's kappa
Description: Similar to the inter-rater description, except that the variability among the results of a single test administrator is used instead of the variability between test administrators.
Desired values: Same as the inter-rater description.

Intra-observer variability
Description: The variance in replicated measurements made on the same subject with the same method by the same observer (Barnhart, Song and Haber, 2005; Haber et al., 2005).
Desired values: See above.
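The observer-variability coefficients above are only loosely specified here, so the sketch below should be read as a rough illustration of the ratio structure the table describes (inter-rater variability over the sum of intra- and inter-rater variability), not as the exact estimators of Haber et al. (2005); the replicate data and the simple variance estimates are assumptions.

    import statistics as st

    # Each observer measures the same subject several times (hypothetical replicates)
    replicates = {
        'observer_1': [10.2, 10.4, 10.3],
        'observer_2': [11.0, 10.9, 11.2],
        'observer_3': [10.6, 10.5, 10.7],
    }

    # Intra-rater variability: average variance of the replicates within each observer
    intra_var = st.mean(st.pvariance(vals) for vals in replicates.values())

    # Inter-rater variability: variance of the observers' mean measurements
    inter_var = st.pvariance([st.mean(vals) for vals in replicates.values()])

    cov = inter_var / (intra_var + inter_var)   # coefficient of inter-observer variability
    coa = 1 - cov                               # coefficient of inter-observer agreement
    print(round(cov, 3), round(coa, 3))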