Assessing Human Visual Inspection for Acceptance Testing: An Attribute Agreement Analysis Case Study


DATAWorks 2018 - March 21, 2018
Assessing Human Visual Inspection for Acceptance Testing: An Attribute Agreement Analysis Case Study
Christopher Drake, Lead Statistician, Small Caliber Munitions
QE&SA Statistical Methods & Analysis Group
U.S. Army Armament Research, Development & Engineering Center

AGENDA
- Background and Motivation
- Methodology and Rules of Thumb
- Metrics
- Case Study: Short Range Training Ammunition (SRTA) Trace Study
  - Test Objective
  - Test Setup & Execution
  - Data Analysis
- Conclusions & Recommendations

BACKGROUND AND MOTIVATION
In the military there are many applications that require visual assessment. Owing to advances in technology, automated computer vision inspection systems have been replacing traditional human visual inspection wherever possible because of their superior accuracy, reliability, repeatability, and reproducibility. There are still scenarios where, because of constraints and limitations, human visual inspection is the only or best option, so it is important to have a method for assessing the adequacy and effectiveness of these inspections. Traditional Gauge R&R is a tool for quantifying the inspection error and uncertainty in a measurement system when continuous measurements are available. For human observations with categorical responses, traditional Gauge R&R is not applicable, and Attribute Agreement Analyses are used instead.
Figure 1: Conceptualizing Precision vs. Accuracy

METHODOLOGY & RULES OF THUMB
Quantifying agreement and effectiveness are two of the ways in which an Attribute Agreement Analysis assesses categorical visual inspections in order to understand inter-rater reliability. The analysis considers Kappa, a metric for quantifying agreement, as well as Prevalence and Bias indices when applicable. Contingency analyses are also used to quantify misclassification rates.
For sample size, traditional power and sample size tools can be leveraged based on signal-to-noise style assessments; as a rule of thumb, more than 20 observations per observer are required (the proportion defective must also be considered). For the number of observers, 2 is often considered the minimum, with diminishing returns in statistical power beyond 3 for most applications. For proportion defective, a balance of 50% good and 50% bad parts is generally desired, though a 30-70% split is broadly acceptable (this proportion can affect the required sample size).
When interpreting the magnitude of Kappa, the following general guidelines have been suggested (Landis and Koch, 1977), though the cutoffs are ultimately somewhat arbitrary:
  κ ≤ 0            poor
  0.01 ≤ κ ≤ 0.20  slight
  0.21 ≤ κ ≤ 0.40  fair
  0.41 ≤ κ ≤ 0.60  moderate
  0.61 ≤ κ ≤ 0.80  substantial
  0.81 ≤ κ ≤ 1.00  almost perfect
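To illustrate why the rule of thumb calls for more than 20 observations per observer, the following is a minimal simulation sketch (not part of the presentation): two hypothetical observers classify the same parts against a known standard, and the spread of the resulting kappa estimates is examined at several sample sizes. The rater accuracy, proportion defective, and function names are illustrative assumptions.

    # Hypothetical sketch: how tightly is kappa estimated for a given number of parts?
    import numpy as np

    def simulate_kappa(n_parts=20, p_defective=0.5, rater_accuracy=0.9, n_sims=5000, seed=1):
        rng = np.random.default_rng(seed)
        kappas = []
        for _ in range(n_sims):
            truth = rng.random(n_parts) < p_defective
            r1 = np.where(rng.random(n_parts) < rater_accuracy, truth, ~truth)
            r2 = np.where(rng.random(n_parts) < rater_accuracy, truth, ~truth)
            po = np.mean(r1 == r2)                      # observed agreement
            p1, p2 = r1.mean(), r2.mean()
            pe = p1 * p2 + (1 - p1) * (1 - p2)          # agreement expected by chance
            if pe < 1.0:
                kappas.append((po - pe) / (1 - pe))
        return np.percentile(kappas, [2.5, 50, 97.5])

    for n in (10, 20, 50, 100):
        lo, med, hi = simulate_kappa(n_parts=n)
        print(f"{n:3d} parts per observer: kappa 95% range ~ [{lo:.2f}, {hi:.2f}]")

With only 10 parts the kappa estimate is very noisy; the interval tightens noticeably past 20 and continues to narrow with more parts, which is the intuition behind the rule of thumb.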

METRICS: KAPPA
Kappa is a metric intended to measure true agreement: it accounts for the agreement that would occur by pure chance, which is not true agreement. It is computed as κ = (po − pe) / (1 − pe), where po is the observed proportion of agreement and pe is the proportion of agreement expected by chance. The possible values of Kappa range from -1 to 1, though they usually fall between 0 and 1: a Kappa of 1 represents perfect agreement, 0 represents agreement no better than chance (simply guessing every rating), and negative values represent agreement worse than expected by chance. Kappa can be used for categorical responses with more than two levels, as well as for ordinal responses; for ordinal responses, the weighted Kappa should be used. Constructing confidence intervals around Kappa is recommended, and hypothesis tests against a predetermined null-hypothesis Kappa level can yield valuable insights.
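As a worked illustration (a sketch, not the presenter's code), the following computes Cohen's kappa for two raters on a pass/fail response from a 2x2 agreement table, with the simple large-sample approximation to the standard error used for a rough confidence interval. The cell counts are hypothetical and chosen only to give a kappa near the baseline value reported later.

    import math

    def cohens_kappa(a, b, c, d):
        """a = both fail, b = rater1 fail / rater2 pass,
           c = rater1 pass / rater2 fail, d = both pass."""
        n = a + b + c + d
        po = (a + d) / n                                        # observed agreement
        p1_fail, p2_fail = (a + b) / n, (a + c) / n
        pe = p1_fail * p2_fail + (1 - p1_fail) * (1 - p2_fail)  # chance agreement
        kappa = (po - pe) / (1 - pe)
        se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))     # simple approximation
        return kappa, (kappa - 1.96 * se, kappa + 1.96 * se)

    kappa, ci = cohens_kappa(a=42, b=6, c=5, d=47)
    print(f"kappa = {kappa:.3f}, approximate 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")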

SITUATIONAL METRICS: PREVALENCE & BIAS
The prevalence index accounts for the difference in the proportions of agreement on cases classified with positive versus negative connotations. When the prevalence index is high, the proportion of chance agreement is also high, resulting in lower kappa values; this effect is greater for higher values of kappa. The bias index is similar in nature to the prevalence index, except that it reflects the extent to which the raters disagree on cases with positive versus negative connotations. When the bias index is larger, we expect higher kappa values; in contrast to prevalence, the effect of bias is greater when kappa is small. It is recommended to consider both the prevalence and bias indices when assessing the magnitude of kappa in situations where cases can be considered positive or negative.
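A minimal sketch of the two indices, following the 2x2-table definitions discussed in Sim and Wright (2005): prevalence index (a − d)/n and bias index (b − c)/n, shown here as absolute values, where a and d are the two agreement cells and b and c the two disagreement cells. The counts are made up for illustration.

    def prevalence_and_bias(a, b, c, d):
        n = a + b + c + d
        prevalence_index = abs(a - d) / n   # imbalance between positive and negative agreements
        bias_index = abs(b - c) / n         # imbalance between the two kinds of disagreement
        return prevalence_index, bias_index

    pi, bi = prevalence_and_bias(a=42, b=6, c=5, d=47)
    print(f"prevalence index = {pi:.2f}, bias index = {bi:.2f}")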

CASE STUDY: BACKGROUND
The 7.62mm Short Range Training Ammunition (SRTA) Trace cartridge is a round designed for short range training scenarios. It is used exclusively for training, for example on indoor ranges or ranges with a limited range fan. The 7.62mm SRTA Trace is a recently developed Army ammunition with an added trace capability. A method was needed to assess its trace performance for Lot Acceptance Testing (LAT) within budgetary constraints, so human visual inspection was chosen. The purpose of this study:
- Characterize the baseline capability of the SRTA Trace visual inspection system.
- Use this study to make adjustments and revisions to the rating instructions.
- Compare results from a similar study with the revised instructions to validate improvement.
Figure 2: Visual Summary of SRTA Trace Grading Instruction

CASE STUDY: TEST DESIGN
Many of the experimental planning steps essential for any well-designed experiment are also applicable to these studies, especially with regard to randomization and replication. Also leveraged during the design process were:
- recognition of and statement of the problem
- proper choice of the factors to vary and their levels
- selection of the relevant response variable(s)
- appropriate choice of experimental design
- careful execution of the experiment
- proper data analysis
For the SRTA Trace AAA, 100 samples were fired, with approximately 50% of the product being defective. The observers were given detailed grading instructions for how to rate the trace events.
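As an illustration of the randomization step, the following is a hypothetical planning sketch (not the actual test plan): it builds a randomized firing order that mixes conforming and defective samples so that observers cannot anticipate the outcome from the test sequence. The sample counts, labels, and function name are assumptions.

    import random

    def build_run_order(n_good=50, n_defective=50, seed=2018):
        samples = [("good", i) for i in range(n_good)] + [("defective", i) for i in range(n_defective)]
        random.Random(seed).shuffle(samples)   # randomize the firing order
        return [{"shot": k + 1, "type": t, "sample_id": i} for k, (t, i) in enumerate(samples)]

    run_order = build_run_order()
    print(run_order[:3])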

CASE STUDY: DATA ANALYSIS
After the test was successfully completed and the data were gathered, they were formatted for analysis. For each of the 100 total shots, each of the 5 raters rated the trace event as a pass or a fail. Subject matter experts also rated the same events and reached consensus on a standard (correct) answer, and images of the events were taken for future reference. The data were analyzed using JMP statistical software.
Figure 3: Raw Data Format in JMP Statistical Software
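The study itself was analyzed in JMP; the sketch below shows a hypothetical equivalent data layout in Python, with one row per shot per rater and the expert-consensus standard alongside, and how per-rater effectiveness and across-rater agreement could be tabulated from it. The column names and example records are assumptions, not the actual data.

    import pandas as pd

    records = [
        {"shot": 1, "rater": "R1", "rating": "Pass", "standard": "Pass"},
        {"shot": 1, "rater": "R2", "rating": "Fail", "standard": "Pass"},
        {"shot": 2, "rater": "R1", "rating": "Fail", "standard": "Fail"},
        {"shot": 2, "rater": "R2", "rating": "Fail", "standard": "Fail"},
    ]
    df = pd.DataFrame(records)

    # Effectiveness: how often each rater matches the expert standard.
    effectiveness = (df["rating"] == df["standard"]).groupby(df["rater"]).mean()
    print(effectiveness)

    # Agreement across raters: proportion of shots on which every rater gave the same rating.
    all_agree = df.groupby("shot")["rating"].nunique().eq(1).mean()
    print(f"proportion of shots with full agreement: {all_agree:.2f}")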

CASE STUDY: DATA ANALYSIS CONT.
From the baseline study, the overall kappa value was 0.7872. The overall effectiveness of 89.1% was low compared to the desired 95% goal. The low effectiveness pointed to confusion over specific images, and the output clearly shows which images had the least agreement and effectiveness.
Figure 4: Attribute Agreement Analysis JMP Output for Baseline Study

CASE STUDY: DATA ANALYSIS CONT.
From the revised study, the overall kappa value improved significantly, from 0.7872 to 0.8597. The overall effectiveness also improved, from 89.1% to 94.8%, approximately meeting the 95% goal. Many of the scores for images that had shown confusion among the raters improved drastically under the new rating instructions. There remains some marginal room to clarify the grading instructions based on a few images with lower agreement.
Figure 5: Attribute Agreement Analysis JMP Output for Revised Study

CASE STUDY: CONCLUSIONS
Assessing systems that rely on human observation can be difficult due to the inherently noisy nature of the test environment and subjects. Attribute Agreement Analyses are an important tool for comprehensively and precisely quantifying and assessing the adequacy, agreement, and overall effectiveness of inspection systems that depend on human observers. By identifying the occurrences with the least agreement, we can iteratively adjust the instructions and the system to better meet the needs of the inspection system.
In the case study detailed above, the initial baseline study showed several areas where lack of agreement between the observers and the standard was an issue. Using this information, adjustments to the rating instructions were made. After the second study was run, it was clear that this effort netted a large improvement across the board in % Agreement, Kappa, and overall Effectiveness. With these improvements implemented, the lot acceptance methods for SRTA Trace ammunition are now more accurate and effective, decreasing the risk of providing defective ammunition to the Warfighter.

Table 1: Summary of Attribute Agreement Analysis Metrics from Case Study
                             % Agreement                 Kappa                    Effectiveness
  Test                       LCB     Mean    UCB         LCB     Mean    UCB      LCB     Mean    UCB
  Baseline                   61.5%   71%     79%         0.758   0.787   0.816    87%     89.1%   90.9%
  Revised Instructions       77.9%   86%     91.5%       0.798   0.86    0.922    92.5%   94.8%   96.4%

REFERENCES
1. Sim, J. and Wright, C. C. (2005). "The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements." Physical Therapy, Vol. 85, No. 3, pp. 257-268.
2. Fleiss, J. L. (1971). "Measuring Nominal Scale Agreement Among Many Raters." Psychological Bulletin, Vol. 76, pp. 378-382.
3. Landis, J. R. and Koch, G. G. (1977). "The Measurement of Observer Agreement for Categorical Data." Biometrics, Vol. 33, pp. 159-174.
4. Montgomery, D. (2013). Introduction to Statistical Quality Control, Seventh Edition. Wiley.

QUESTIONS?