Knowledge discovery tools 381

... hours, and prime time is prime time precisely because more people tend to watch television at that time.

- Compare histograms from different periods of time. Changes in histogram patterns from one time period to the next can be very useful in finding ways to improve the process.

- Stratify the data by plotting separate histograms for different sources of data. For example, with the rod diameter histogram we might want to plot separate histograms for shafts made from different vendors' materials or made by different operators or machines. This can sometimes reveal things that even control charts don't detect.

Exploratory data analysis

Data analysis can be divided into two broad phases: an exploratory phase and a confirmatory phase. Data analysis can be thought of as detective work. Before the trial one must collect evidence and examine it thoroughly. One must have a basis for developing a theory of cause and effect. Is there a gap in the data? Are there patterns that suggest some mechanism? Or, are there patterns that are simply mysterious (e.g., are all of the numbers even or odd)? Do outliers occur? Are there patterns in the variation of the data? What are the shapes of the distributions? This activity is known as exploratory data analysis (EDA). Tukey's 1977 book with this title elevated this task to acceptability among serious devotees of statistics.

Four themes appear repeatedly throughout EDA: resistance, residuals, re-expression, and visual display. Resistance refers to the insensitivity of a method to a small change in the data. If a small amount of the data is contaminated, the method shouldn't produce dramatically different results. Residuals are what remain after removing the effect of a model or a summary. For example, one might subtract the mean from each value, or look at deviations about a regression line. Re-expression involves examination of different scales on which the data are displayed.
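The ideas of resistance and residuals can be illustrated with a minimal Python sketch. The two small data sets are hypothetical, invented only for this illustration; the second differs from the first by a single contaminated value.

```python
from statistics import mean, median

# Hypothetical data: the second list has one contaminated value (140 for 14).
clean = [10, 11, 12, 13, 14]
contaminated = [10, 11, 12, 13, 140]

# Resistance: the median ignores the contaminated value, the mean does not.
print("mean:  ", mean(clean), "->", mean(contaminated))
print("median:", median(clean), "->", median(contaminated))

# Residuals: what remains after removing a summary (here, the median).
center = median(contaminated)
residuals = [x - center for x in contaminated]
print("residuals about the median:", residuals)
```

In this sketch the median is the resistant summary, and the one large residual points directly at the contaminated value.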
Tukey focused most of his attention on simple power transformations such as $y = \sqrt{x}$, $y = x^2$, and $y = 1/x$. Visual display helps the analyst examine the data graphically to grasp regularities and peculiarities in the data.

EDA is based on a simple basic premise: "it is important to understand what you can do before you learn to measure how well you seem to have done it" (Tukey, 1977). The objective is to investigate the appearance of the data, not to confirm some prior hypothesis. While there are a large number of EDA methods and techniques, there are two which are commonly encountered in Six Sigma work: stem-and-leaf plots and boxplots. These techniques are included in most statistics packages. (SPSS was used to create the figures used in this book.) However, the graphics of EDA are simple enough to be done easily by hand.

STEM-AND-LEAF PLOTS

Stem-and-leaf plots are a variation of histograms and are especially useful for smaller data sets (n < 200). A major advantage of stem-and-leaf plots over the histogram is that the raw data values are preserved, sometimes completely and sometimes only partially. There is a loss of information in the histogram because the histogram reduces the data by grouping several values into a single cell.

Figure 11.14 is a stem-and-leaf plot of diastolic blood pressures. As in a histogram, the length of each row corresponds to the number of cases that fall into a particular interval. However, a stem-and-leaf plot represents each case with a numeric value that corresponds to the actual observed value. This is done by dividing observed values into two components: the leading digit or digits, called the stem, and the trailing digit, called the leaf. For example, the value 75 has a stem of 7 and a leaf of 5.

Figure 11.14. Stem-and-leaf plot of diastolic blood pressures. From SPSS for Windows Base System User's Guide, p. 183. Copyright © 1993. Used by permission of the publisher, SPSS, Inc., Chicago, IL.

In this example, each stem is divided into two rows. The first row of each pair has cases with leaves of 0 through 4, while the second row has cases with leaves of 5 through 9. Consider the two rows that correspond to the stem of 11. In the first row, we can see that there are four cases with diastolic blood pressure of 110 and one case with a reading of 113. In the second row, there are two cases with a value of 115 and one case each with a value of 117, 118, and 119.

The last row of the stem-and-leaf plot is for cases with extreme values (values far removed from the rest). In this row, the actual values are displayed in parentheses. In the frequency column, we see that there are four extreme cases. Their values are 125, 133, and 160. Only distinct values are listed.

When there are few stems, it is sometimes useful to subdivide each stem even further. Consider Figure 11.15, a stem-and-leaf plot of cholesterol levels. In this figure, stems 2 and 3 are divided into five parts, each representing two leaf values. The first row, designated by an asterisk, is for leaves of 0 and 1; the next, designated by t, is for leaves of 2s and 3s; the third, designated by f, is for leaves of 4s and 5s; the fourth, designated by s, is for leaves of 6s and 7s; and the fifth, designated by a period, is for leaves of 8s and 9s. Rows without cases are not represented in the plot. For example, in Figure 11.15, the first two rows for stem 1 (corresponding to 0-1 and 2-3) are omitted.

Figure 11.15. Stem-and-leaf plot of cholesterol levels. From SPSS for Windows Base System User's Guide, p. 185. Copyright © 1993. Used by permission of the publisher, SPSS, Inc., Chicago, IL.
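The row-splitting rules described above can be sketched in a few lines of Python. This is an illustrative sketch, not the SPSS procedure: the function name and its parameters (`stem_width`, `rows_per_stem`) are assumptions made for this example, it expects non-negative integers, and it does not produce the separate row for extreme values.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_width=10, rows_per_stem=1):
    """Return the rows of a stem-and-leaf display as (stem, leaves) pairs.

    stem_width is the value of one stem unit (a power of 10, at least 10);
    with stem_width=10 the leaf is the units digit.
    rows_per_stem splits each stem into equal leaf ranges:
    2 gives leaves 0-4 and 5-9; 5 gives 0-1, 2-3, 4-5, 6-7, 8-9.
    """
    leaf_unit = stem_width // 10
    leaves_per_row = 10 // rows_per_stem
    rows = defaultdict(list)
    for v in sorted(values):
        stem = v // stem_width
        # Digits to the right of the leaf digit are dropped, not rounded.
        leaf = (v % stem_width) // leaf_unit
        rows[(stem, leaf // leaves_per_row)].append(leaf)
    # Rows without cases never enter the dictionary, so they are omitted.
    return [(stem, "".join(str(d) for d in leaves))
            for (stem, _), leaves in sorted(rows.items())]

# The stem-11 readings described in the text, split into two rows per stem.
readings = [110, 110, 110, 110, 113, 115, 115, 117, 118, 119]
for stem, leaves in stem_and_leaf(readings, stem_width=10, rows_per_stem=2):
    print(f"{stem:>3} | {leaves}")
```

Because each leaf is stored as a digit, the raw values stay recoverable whenever the leaf is the last digit of the observation.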

This stem-and-leaf plot differs from the previous one in another way. Since cholesterol values have a wide range (from 106 to 515 in this example), using the first two digits for the stem would result in an unnecessarily detailed plot. Therefore, we will use only the hundreds digit as the stem, rather than the first two digits. The stem setting of 100 appears in the row labeled Stem width. The leaf is then the tens digit. The last digit is ignored. Thus, from this particular stem-and-leaf plot, it is not possible to determine the exact cholesterol level for a case. Instead, each is classified by only its first two digits.

BOXPLOTS

A display that further summarizes information about the distribution of the values is the boxplot. Instead of plotting the actual values, a boxplot displays summary statistics for the distribution. It is a plot of the 25th, 50th, and 75th percentiles, as well as values far removed from the rest.

Figure 11.16 shows an annotated sketch of a boxplot. The lower boundary of the box is the 25th percentile and the upper boundary is the 75th percentile. Tukey refers to the 25th and 75th percentiles as "hinges." Note that the 50th percentile is the median of the overall data set, the 25th percentile is the median of those values below the median, and the 75th percentile is the median of those values above the median. The horizontal line inside the box represents the median. Fifty percent of the cases are included within the box. The box length corresponds to the interquartile range, which is the difference between the 25th and 75th percentiles.

The boxplot includes two categories of cases with outlying values. Cases with values that are more than 3 box-lengths from the upper or lower edge of the box are called extreme values. On the boxplot, these are designated with an asterisk (*). Cases with values that are between 1.5 and 3 box-lengths from the upper or lower edge of the box are called outliers and are designated with a circle. The largest and smallest observed values that aren't outliers are also shown. Lines are drawn from the ends of the box to these values. (These lines are sometimes called whiskers and the plot is then called a box-and-whiskers plot.)

Despite its simplicity, the boxplot contains an impressive amount of information. From the median you can determine the central tendency, or location. From the length of the box, you can determine the spread, or variability, of your observations. If the median is not in the center of the box, you know that the observed values are skewed. If the median is closer to the bottom of the box than to the top, the data are positively skewed. If the median is closer to the top of the box than to the bottom, the opposite is true: the distribution is negatively skewed. The length of the tail is shown by the whiskers and the outlying and extreme points.
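The quantities a boxplot displays can be computed directly from the definitions above. The following Python sketch is one possible reading of those definitions, not the SPSS algorithm: the function name and the returned keys are assumptions made for illustration, and for an odd number of cases the middle value is left out of both halves when the hinges are computed (other hinge conventions exist). It needs at least two values.

```python
from statistics import median

def boxplot_summary(values):
    """Summarize a data set the way the annotated boxplot does."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    med = median(data)
    q1 = median(data[:half])           # median of the values below the median
    q3 = median(data[half + n % 2:])   # median of the values above the median
    box = q3 - q1                      # box length (interquartile range)

    def distance(v):
        # Distance from the nearest edge of the box (0 if inside the box).
        return max(q1 - v, v - q3, 0)

    extremes = [v for v in data if distance(v) > 3 * box]
    outliers = [v for v in data if 1.5 * box < distance(v) <= 3 * box]
    inside = [v for v in data if distance(v) <= 1.5 * box]

    # Median nearer the bottom of the box suggests positive skew.
    if med - q1 < q3 - med:
        skew = "positive"
    elif med - q1 > q3 - med:
        skew = "negative"
    else:
        skew = "symmetric"

    return {"q1": q1, "median": med, "q3": q3, "box_length": box,
            "whiskers": (inside[0], inside[-1]),
            "outliers": outliers, "extremes": extremes, "skew": skew}

# Hypothetical data, invented for this example.
print(boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 30]))
```

The whisker ends here are the smallest and largest values within 1.5 box-lengths of the box, which matches the "largest and smallest observed values that aren't outliers" in the text.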

2. Write the names of the categories above and below the horizontal line. Think of these as branches from the main trunk of the tree.

3. Draw in the detailed cause data for each category. Think of these as limbs and twigs on the branches.

A good cause and effect diagram will have many "twigs," as shown in Fig. 10.4. If your cause and effect diagram doesn't have a lot of smaller branches and twigs, it shows that the understanding of the problem is superficial. Chances are that you need the help of someone outside of your group to aid in the understanding, perhaps someone more closely associated with the problem.

Cause and effect diagrams come in several basic types. The dispersion analysis type is created by repeatedly asking "why does this dispersion occur?" For example, we might want to know why all of our fresh peaches don't have the same color.

The production process class cause and effect diagram uses production processes as the main categories, or branches of the diagram. The processes are shown joined by the horizontal line. Figure 10.5 is an example of this type of diagram.

The cause enumeration cause and effect diagram simply displays all possible causes of a given problem grouped according to rational categories. This type of cause and effect diagram lends itself readily to the brainstorming approach we are using.

A variation of the basic cause and effect diagram, developed by Dr. Ryuji Fukuda of Japan, is cause and effect diagrams with the addition of cards, or CEDAC. The main difference is that the group gathers ideas outside of the meeting room on small cards, as well as in group meetings. The cards also serve as a vehicle for gathering input from people who are not in the group; they can be distributed to anyone involved with the process. Often the cards provide more information than the brief entries on a standard cause and effect diagram. The cause and effect diagram is built by actually placing the cards on the branches.
Boxplots

A boxplot displays summary statistics for a set of distributions. It is a plot of the 25th, 50th, and 75th percentiles, as well as values far removed from the rest.

Figure 10.6 shows an annotated sketch of a boxplot. The lower boundary of the box is the 25th percentile and the upper boundary is the 75th percentile. Tukey refers to the 25th and 75th percentiles as "hinges." Note that the 50th percentile is the median of the overall data set, the 25th percentile is the median of those values below the median, and the 75th percentile is the median of those values above the median. The horizontal line inside the box represents the median. Fifty percent of the cases are included within the box. The box length corresponds to the interquartile range, which is the difference between the 25th and 75th percentiles.

The boxplot includes two categories of cases with outlying values. Cases with values that are more than 3 box-lengths from the upper or lower edge of the box are called extreme values. On the boxplot, these are designated with an asterisk (*). Cases with values that are between 1.5 and 3 box-lengths from the upper or lower edge of the box are called outliers and are designated with a circle. The largest and smallest observed values that aren't outliers are also shown. Lines are drawn from the ends of the box to these values. (These lines are sometimes called whiskers and the plot is then called a box-and-whiskers plot.)

Despite its simplicity, the boxplot contains an impressive amount of information. From the median you can determine the central tendency, or location. From the length of the box, you can determine the spread, or variability, of your observations. If the median is not in the center of the box, you know that the observed values are skewed. If the median is closer to the bottom of the box than to the top, the data are positively skewed. If the median is closer to the top of the box than to the bottom, the opposite is true: the distribution is negatively skewed. The length of the tail is shown by the whiskers and the outlying and extreme points.

Figure 10.5. Production process class cause and effect diagram.

Figure 10.6. Annotated boxplot. Labels, from top to bottom:
  * Values more than 3 box-lengths above the 75th percentile (extremes)
  o Values more than 1.5 box-lengths above the 75th percentile (outliers)
    Largest observed value that isn't an outlier
    75th percentile
    Median (50th percentile)
    25th percentile
    Smallest observed value that isn't an outlier
  o Values more than 1.5 box-lengths below the 25th percentile (outliers)
  * Values more than 3 box-lengths below the 25th percentile (extremes)

Figure 10.7. Boxplots of salary by job category. (Vertical axis: salary, 0 to 60000; horizontal axis: employment category; group sizes N = 227, 136, 27, 41, 32, 5.)

Boxplots are particularly useful for comparing the distribution of values in several groups. Figure 10.7 shows boxplots for the salaries for several different job titles. The boxplot makes it easy to see the different properties of the distributions. The location, variability, and shapes of the distributions are obvious at a glance. This ease of interpretation is something that statistics alone cannot provide.

Statistical Inference

This section discusses the basic concept of statistical inference. The reader should also consult the glossary in the Appendix for additional information. Inferential statistics belong to the enumerative class of statistical methods. All statements made in this section are valid only for stable processes, that is, processes in statistical control. Although most applications of Six Sigma are analytic, there are times when enumerative statistics prove useful.

The term inference is defined as (1) the act or process of deriving logical conclusions from premises known or assumed to be true, or (2) the act of reasoning from factual knowledge or evidence. Inferential statistics provide information that is used in the process of inference. As can be seen from the definitions, inference involves two domains: the premises and the evidence or factual knowledge. Additionally, there are two conceptual frameworks for addressing premises questions in inference: the design-based approach and the model-based approach.
As discussed by Koch and Gillings (1983), a statistical analysis whose only assumptions are random selection of units or random allocation of units to experimental conditions results in design-based inferences; or, equivalently, randomization-based inferences. The objective is to structure sampling such that the sampled population has the same