Knowledge discovery tools


Spencer Dennis
6 years ago

hours, and prime time is prime time precisely because more people tend to watch television at that time.
- Compare histograms from different periods of time. Changes in histogram patterns from one time period to the next can be very useful in finding ways to improve the process.
- Stratify the data by plotting separate histograms for different sources of data. For example, with the rod diameter histogram we might want to plot separate histograms for shafts made from different vendors' materials or made by different operators or machines. This can sometimes reveal things that even control charts don't detect.

Exploratory data analysis

Data analysis can be divided into two broad phases: an exploratory phase and a confirmatory phase. Data analysis can be thought of as detective work. Before the trial one must collect evidence and examine it thoroughly. One must have a basis for developing a theory of cause and effect. Is there a gap in the data? Are there patterns that suggest some mechanism? Or, are there patterns that are simply mysterious (e.g., are all of the numbers even or odd)? Do outliers occur? Are there patterns in the variation of the data? What are the shapes of the distributions? This activity is known as exploratory data analysis (EDA). Tukey's 1977 book with this title elevated this task to acceptability among serious devotees of statistics.

Four themes appear repeatedly throughout EDA: resistance, residuals, re-expression, and visual display. Resistance refers to the insensitivity of a method to a small change in the data. If a small amount of the data is contaminated, the method shouldn't produce dramatically different results. Residuals are what remain after removing the effect of a model or a summary. For example, one might subtract the mean from each value, or look at deviations about a regression line. Re-expression involves examination of different scales on which the data are displayed.
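
The resistance and residual themes can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions: the measurements are hypothetical (they do not come from the text), and the median is used as the resistant summary. It contaminates a single value, compares how far the mean and the median move, and then computes residuals about the median.

```python
from statistics import mean, median

# Hypothetical measurements, for illustration only (not from the text).
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7]

# Contaminate one value, as a misplaced decimal point might.
contaminated = clean[:-1] + [97.0]

# Resistance: a resistant summary changes little when a small part
# of the data is contaminated.
mean_shift = abs(mean(contaminated) - mean(clean))
median_shift = abs(median(contaminated) - median(clean))

# Residuals: what remains after removing a summary (here, the median).
center = median(contaminated)
residuals = [round(x - center, 2) for x in contaminated]

print(f"shift in mean:   {mean_shift:.2f}")
print(f"shift in median: {median_shift:.2f}")
print("residuals about the median:", residuals)
```

In this sketch the contaminated value shows up as one very large residual, while the median itself is essentially unchanged.
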
Tukey focused most of his attention on simple power transformations such as $y = \sqrt{x}$, $y = x^2$, and $y = 1/x$. Visual display helps the analyst examine the data graphically to grasp regularities and peculiarities in the data.

EDA is based on a simple basic premise: "it is important to understand what you can do before you learn to measure how well you seem to have done it" (Tukey, 1977). The objective is to investigate the appearance of the data, not to confirm some prior hypothesis. While there are a large number of EDA methods and techniques, there are two which are commonly encountered in Six Sigma work: stem-and-leaf plots and boxplots. These techniques are included in most statistics packages. (SPSS was used to create the figures used

in this book.) However, the graphics of EDA are simple enough to be done easily by hand.

STEM-AND-LEAF PLOTS

Stem-and-leaf plots are a variation of histograms and are especially useful for smaller data sets (n < 200). A major advantage of stem-and-leaf plots over the histogram is that the raw data values are preserved, sometimes completely and sometimes only partially. There is a loss of information in the histogram because the histogram reduces the data by grouping several values into a single cell.

Figure is a stem-and-leaf plot of diastolic blood pressures. As in a histogram, the length of each row corresponds to the number of cases that fall into a particular interval. However, a stem-and-leaf plot represents each case with a numeric value that corresponds to the actual observed value. This is done by dividing observed values into two components: the leading digit or digits, called the stem, and the trailing digit, called the leaf. For example, the value 75 has a stem of 7 and a leaf of 5.

Figure Stem-and-leaf plot of diastolic blood pressures. From SPSS for Windows Base System User's Guide, p Copyright © Used by permission of the publisher, SPSS, Inc., Chicago, IL.
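
The stem and leaf split described above can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS procedure. It assumes non-negative integer values, the readings are hypothetical, and the helper names `stem_and_leaf` and `format_plot` are made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (leading digits) and
    collect the leaves (trailing digit) in ascending order."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 75 -> stem 7, leaf 5
        rows[stem].append(leaf)
    return dict(rows)

def format_plot(rows):
    """Render one text row per stem: frequency, stem, then leaves.
    Stems with no cases produce no row."""
    lines = []
    for stem in sorted(rows):
        leaves = "".join(str(leaf) for leaf in rows[stem])
        lines.append(f"{len(rows[stem]):>3}  {stem:>3} | {leaves}")
    return "\n".join(lines)

# Hypothetical diastolic readings, for illustration only.
readings = [75, 70, 82, 88, 90, 75, 110, 113, 68, 95, 81, 79]
rows = stem_and_leaf(readings)
print(format_plot(rows))
```

Because each leaf is an actual trailing digit, every original value can be read back from the plot, which is the advantage over a histogram noted above.
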

In this example, each stem is divided into two rows. The first row of each pair has cases with leaves of 0 through 4, while the second row has cases with leaves of 5 through 9. Consider the two rows that correspond to the stem of 11. In the first row, we can see that there are four cases with diastolic blood pressure of 110 and one case with a reading of 113. In the second row, there are two cases with a value of 115 and one case each with a value of 117, 118, and 119.

The last row of the stem-and-leaf plot is for cases with extreme values (values far removed from the rest). In this row, the actual values are displayed in parentheses. In the frequency column, we see that there are four extreme cases. Their values are 125, 133, and 160. Only distinct values are listed.

When there are few stems, it is sometimes useful to subdivide each stem even further. Consider Figure 11.15, a stem-and-leaf plot of cholesterol levels. In this figure, stems 2 and 3 are divided into five parts, each representing two leaf values. The first row, designated by an asterisk, is for leaves of 0 and 1; the next, designated by t, is for leaves of 2 and 3; the third, designated by f, is for leaves of 4 and 5; the fourth, designated by s, is for leaves of 6 and 7; and the fifth, designated by a period, is for leaves of 8 and 9. Rows without cases are not represented in the plot. For example, in Figure 11.15, the first two rows for stem 1 (corresponding to 0-1 and 2-3) are omitted.

Figure 11.15 Stem-and-leaf plot of cholesterol levels. From SPSS for Windows Base System User's Guide, p. 185. Copyright © Used by permission of the publisher, SPSS, Inc., Chicago, IL.
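
Splitting each stem into two or five rows can be sketched as follows. This is an illustrative Python fragment, not the SPSS implementation. The function names are hypothetical, values are assumed to be non-negative integers, and the sample readings are the ten stem-11 blood pressure values described in the text. Applying the five-part split to those readings is done here only to show the row markers.

```python
from collections import defaultdict

# Row markers for five-part stems, as described above:
# * for leaves 0-1, t for 2-3, f for 4-5, s for 6-7, . for 8-9.
FIVE_PART_MARKERS = "*tfs."

def split_stem_rows(values, parts=2):
    """Group non-negative integers into (stem, part) rows.

    parts=2 puts leaves 0-4 and 5-9 on separate rows;
    parts=5 gives each row two leaf values.
    """
    if parts not in (2, 5):
        raise ValueError("parts must be 2 or 5")
    leaves_per_row = 10 // parts
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[(stem, leaf // leaves_per_row)].append(leaf)
    return dict(rows)  # rows without cases are simply absent

def format_rows(rows, parts=2):
    """Render frequency, stem, row marker, and leaves for each row."""
    lines = []
    for stem, part in sorted(rows):
        marker = FIVE_PART_MARKERS[part] if parts == 5 else " "
        leaves = "".join(str(leaf) for leaf in rows[(stem, part)])
        lines.append(f"{len(leaves):>3}  {stem:>3} {marker} | {leaves}")
    return "\n".join(lines)

# The stem-11 readings described in the text.
readings = [110, 110, 110, 110, 113, 115, 115, 117, 118, 119]

two_rows = split_stem_rows(readings, parts=2)
five_rows = split_stem_rows(readings, parts=5)
print(format_rows(two_rows, parts=2))
print(format_rows(five_rows, parts=5))
```
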

This stem-and-leaf plot differs from the previous one in another way. Since cholesterol values have a wide range (from 106 to 515 in this example), using the first two digits for the stem would result in an unnecessarily detailed plot. Therefore, we will use only the hundreds digit as the stem, rather than the first two digits. The stem setting of 100 appears in the row labeled Stem width. The leaf is then the tens digit. The last digit is ignored. Thus, from this particular stem-and-leaf plot, it is not possible to determine the exact cholesterol level for a case. Instead, each is classified by only its first two digits.

BOXPLOTS

A display that further summarizes information about the distribution of the values is the boxplot. Instead of plotting the actual values, a boxplot displays summary statistics for the distribution. It is a plot of the 25th, 50th, and 75th percentiles, as well as values far removed from the rest.

Figure shows an annotated sketch of a boxplot. The lower boundary of the box is the 25th percentile and the upper boundary is the 75th percentile. Tukey refers to the 25th and 75th percentiles as "hinges." Note that the 50th percentile is the median of the overall data set, the 25th percentile is the median of those values below the median, and the 75th percentile is the median of those values above the median. The horizontal line inside the box represents the median. 50% of the cases are included within the box. The box length corresponds to the interquartile range, which is the difference between the 25th and 75th percentiles.

The boxplot includes two categories of cases with outlying values. Cases with values that are more than 3 box-lengths from the upper or lower edge of the box are called extreme values. On the boxplot, these are designated with an asterisk (*). Cases with values that are between 1.5 and 3 box-lengths from the upper or lower edge of the box are called outliers and are designated with a circle.
The largest and smallest observed values that aren't outliers are also shown. Lines are drawn from the ends of the box to these values. (These lines are sometimes called whiskers and the plot is then called a box-and-whiskers plot.)

Despite its simplicity, the boxplot contains an impressive amount of information. From the median you can determine the central tendency, or location. From the length of the box, you can determine the spread, or variability, of your observations. If the median is not in the center of the box, you know that the observed values are skewed. If the median is closer to the bottom of the box than to the top, the data are positively skewed. If the median is closer to the top of the box than to the bottom, the opposite is true: the distribution is negatively skewed. The length of the tail is shown by the whiskers and the outlying and extreme points.
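
The boxplot quantities described above can be computed directly. The Python below is a sketch under stated assumptions. The hinges are taken as the medians of the lower and upper halves of the sorted data, following the description in the text, whereas statistics packages use several slightly different quartile conventions. The data are hypothetical, at least two values are assumed, and `boxplot_summary` is a name made up for this example.

```python
from statistics import median

def boxplot_summary(values):
    """Summarize data the way the boxplot description above does."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    # Hinges: medians of the values below and above the overall median.
    q1, med, q3 = median(data[:half]), median(data), median(data[n - half:])
    box = q3 - q1  # box length = interquartile range

    def beyond_box(x):
        # Distance of x past the nearer box edge (0 if inside the box).
        return max(q1 - x, x - q3, 0)

    extremes = [x for x in data if beyond_box(x) > 3 * box]
    outliers = [x for x in data if 1.5 * box < beyond_box(x) <= 3 * box]
    inside = [x for x in data if beyond_box(x) <= 1.5 * box]

    # Median nearer the bottom of the box suggests positive skew.
    if med - q1 < q3 - med:
        skew = "positive"
    elif med - q1 > q3 - med:
        skew = "negative"
    else:
        skew = "none apparent"

    return {
        "q1": q1, "median": med, "q3": q3, "box_length": box,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers, "extremes": extremes, "skew": skew,
    }

# Hypothetical readings, for illustration only.
summary = boxplot_summary([60, 62, 65, 68, 70, 72, 75, 78, 80, 85, 110, 160])
for name, value in summary.items():
    print(f"{name}: {value}")
```

The whiskers end at the largest and smallest values that are not flagged, so a single distant reading lengthens neither the box nor the whiskers.
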


2. Write the names of the categories above and below the horizontal line. Think of these as branches from the main trunk of the tree.
3. Draw in the detailed cause data for each category. Think of these as limbs and twigs on the branches.

A good cause and effect diagram will have many "twigs," as shown in Fig. 10.4. If your cause and effect diagram doesn't have a lot of smaller branches and twigs, it shows that the understanding of the problem is superficial. Chances are that you need the help of someone outside of your group to aid in the understanding, perhaps someone more closely associated with the problem.

Cause and effect diagrams come in several basic types. The dispersion analysis type is created by repeatedly asking "why does this dispersion occur?" For example, we might want to know why all of our fresh peaches don't have the same color.

The production process class cause and effect diagram uses production processes as the main categories, or branches of the diagram. The processes are shown joined by the horizontal line. Figure 10.5 is an example of this type of diagram.

The cause enumeration cause and effect diagram simply displays all possible causes of a given problem grouped according to rational categories. This type of cause and effect diagram lends itself readily to the brainstorming approach we are using.

A variation of the basic cause and effect diagram, developed by Dr. Ryuji Fukuda of Japan, is cause and effect diagrams with the addition of cards, or CEDAC. The main difference is that the group gathers ideas outside of the meeting room on small cards, as well as in group meetings. The cards also serve as a vehicle for gathering input from people who are not in the group; they can be distributed to anyone involved with the process. Often the cards provide more information than the brief entries on a standard cause and effect diagram. The cause and effect diagram is built by actually placing the cards on the branches.
Boxplots

A boxplot displays summary statistics for a set of distributions. It is a plot of the 25th, 50th, and 75th percentiles, as well as values far removed from the rest.

Figure 10.6 shows an annotated sketch of a boxplot. The lower boundary of the box is the 25th percentile. Tukey refers to the 25th and 75th percentile "hinges." Note that the 50th percentile is the median of the overall data set, the 25th percentile is the median of those values below the median, and the 75th percentile is the median of those values above the median. The horizontal line inside the box represents the median. Fifty percent of the cases are included within the box. The box length corresponds to the interquartile range, which is the difference between the 25th and 75th percentiles.

The boxplot includes two categories of cases with outlying values. Cases with values that are more than 3 box-lengths from the upper or lower edge of the box are called extreme values. On the boxplot, these are designated with an asterisk (*). Cases with values that are between 1.5 and 3 box-lengths from the upper or lower edge of the box are called outliers and are designated with a circle. The largest and smallest observed values that aren't outliers are also shown. Lines are drawn from the ends of the box to these values. (These lines are sometimes called whiskers and the plot is then called a box-and-whiskers plot.)

Despite its simplicity, the boxplot contains an impressive amount of information. From the median you can determine the central tendency, or location. From the length

of the box, you can determine the spread, or variability, of your observations. If the median is not in the center of the box, you know that the observed values are skewed. If the median is closer to the bottom of the box than to the top, the data are positively skewed. If the median is closer to the top of the box than to the bottom, the opposite is true: the distribution is negatively skewed. The length of the tail is shown by the whiskers and the outlying and extreme points.

FIGURE 10.5 Production process class cause and effect diagram. (Diagram labels: Process, Problem, Cause A, Cause B, Cause C, Cause D, Subcause.)

FIGURE 10.6 Annotated boxplot. (Labels, top to bottom: values more than 3 box-lengths above the 75th percentile (extremes, marked *); values more than 1.5 box-lengths above the 75th percentile (outliers, marked o); largest observed value that isn't an outlier; 75th percentile; median (50th percentile); 25th percentile; smallest observed value that isn't an outlier; values more than 1.5 box-lengths below the 25th percentile (outliers, marked o); values more than 3 box-lengths below the 25th percentile (extremes, marked *).)

Boxplots are particularly useful for comparing the distribution of values in several groups. Figure 10.7 shows boxplots for the salaries for several different job titles. The boxplot makes it easy to see the different properties of the distributions. The location, variability, and shapes of the distributions are obvious at a glance. This ease of interpretation is something that statistics alone cannot provide.

FIGURE 10.7 Boxplots of salary by job category. (Horizontal axis: employment category.)

Statistical Inference

This section discusses the basic concept of statistical inference. The reader should also consult the glossary in the Appendix for additional information. Inferential statistics belong to the enumerative class of statistical methods. All statements made in this section are valid only for stable processes, that is, processes in statistical control. Although most applications of Six Sigma are analytic, there are times when enumerative statistics prove useful.

The term inference is defined as (1) the act or process of deriving logical conclusions from premises known or assumed to be true, or (2) the act of reasoning from factual knowledge or evidence. Inferential statistics provide information that is used in the process of inference. As can be seen from the definitions, inference involves two domains: the premises and the evidence or factual knowledge. Additionally, there are two conceptual frameworks for addressing premises questions in inference: the design-based approach and the model-based approach.

As discussed by Koch and Gillings (1983), a statistical analysis whose only assumptions are random selection of units or random allocation of units to experimental conditions results in design-based inferences; or, equivalently, randomization-based inferences.
The objective is to structure sampling such that the sampled population has the same
