MBA 605 Business Analytics Don Conant, PhD. GETTING TO THE STANDARD NORMAL DISTRIBUTION

Variables In the social sciences, data are the observed and/or measured characteristics of individuals and groups of individuals. In management, data also include characteristics of business processes. Constant: a characteristic that remains the same for every individual or process. Variable: a characteristic that can take on different values. Example: In a study of third-graders, grade level is a constant; gender, height, and weight are variables.

Variables Researchers distinguish independent from dependent variables. Independent variables: variables that a researcher controls or manipulates. Dependent variables: variables that measure the effect of the independent variable. Example: In an ad campaign to increase sales, the amount spent on advertising (independent variable) affects the sales volume (dependent variable).

Scales of Measurement Measurement is a process of assigning numbers to characteristics according to a defined rule. Not all measurement is the same. Some measurements are more precise than others. The precision of measurement of a variable is important to determining what statistical methods should be used to analyze the data. Measurement scales of variables are classified in a hierarchy based on their precision. This hierarchy includes the nominal, ordinal, interval, and ratio scales.

Scales of Measurement Nominal: The least precise measurement scale classifies objects into numbered categories based on some defined characteristic for the purpose of counting each object in a category. A nominal measurement of automobiles would classify them by make (Ford, GM, etc.). Other nominal measurements include gender, eye color, ethnicity, marital status, and religion. Properties: Data categories are mutually exclusive. Data categories have no logical order. Statistical methods: Count (mode).

Scales of Measurement Ordinal: If nominal objects can be placed in a logical order, they fit the ordinal scale. Categories are identified and assigned numbers. These numbers relate to the amount of the measured characteristic. Categories can be ranked from lowest to highest. Academic class standing (1-Freshman, 2-Sophomore, 3-Junior, 4-Senior) is an ordinal scale, as are movie ratings (1 to 5 stars). Properties: Data categories are mutually exclusive. Data categories have logical order. Data categories are scaled to the amount of the characteristic possessed. Statistical methods: Count, greater or less than (median, mode).

Scales of Measurement Interval: If the distances between characteristics on the ordinal scale are equal, they fit the interval scale. Temperature is an example of a variable measured on the interval scale. The difference between 85°F and 88°F is 3 degrees, which is the same as the difference between 22°F and 25°F. Properties: Data categories are mutually exclusive. Data categories have logical order. Data categories are scaled to the amount of the characteristic possessed. Data category characteristics are separated by an equal distance. The point 0 is just another point on the scale. Statistical methods: Addition and subtraction (mean, median, mode, standard deviation).

Scales of Measurement Ratio: If a zero value for characteristics on the interval scale represents a starting point or the absence of the characteristic, then they fit the ratio scale, e.g., annual income, degrees Kelvin, length, or distance. The presence of a true 0 allows for comparison of the proportions of a characteristic. 50°F is not twice as warm as 25°F, but a 30 lb bag of apples is twice as heavy as a 15 lb bag. Properties: Data categories are mutually exclusive. Data categories have logical order. Data categories are scaled to the amount of the characteristic possessed. Data category characteristics are separated by an equal distance. The point 0 reflects the absence of the characteristic. Statistical methods: Addition, subtraction, multiplication, and division (mean, median, mode, standard deviation).

The Likert Scale The scale is named after Rensis Likert, who first developed it. It is often used in social science research to promote ordinal data to interval data for analysis. The Likert scale is a psychometric scale commonly used in questionnaires, and is the most widely used scale in survey research. When responding to a Likert questionnaire item, respondents specify their level of agreement with a statement. A seven-level Likert scale: 1 2 3 4 5 6 7

The Likert Scale Likert scales can be used to evaluate subjective or objective criteria. They are bipolar scales in that generally some level of agreement or disagreement is measured. Many social science researchers recommend scales consisting of 7 to 9 items.

The Likert Scale Likert scale values should always be accompanied by item labels. Without labels, a mean result with a scale value of 1.9 would be reported as 1.9 on a scale of 1 to 7. With labels, a mean result of 1.9 can additionally be reported as Dissatisfied. This adds meaning to the interpretation of the results. A seven-level Likert scale: 1 = Very Dissatisfied, 2 = Dissatisfied, 3 = Somewhat Dissatisfied, 4 = Neither Dissatisfied nor Satisfied, 5 = Somewhat Satisfied, 6 = Satisfied, 7 = Very Satisfied.

The Likert Scale Some researchers object to the middle item indicating the absence of satisfaction (or absence of agreement or disagreement). They suggest that participants should be forced to respond or to select a no response option.

The Likert Scale Moving the no response option out of the scale adds to the meaningfulness of measures of central tendency (mean, median, mode). It also allows data to be interpreted in two groups. The researcher can indicate the number of participants responding as neither satisfied nor dissatisfied. And, when the calculations are made, they are based on participants with an opinion about satisfaction. A six-level Likert scale: 1 = Very Dissatisfied, 2 = Dissatisfied, 3 = Somewhat Dissatisfied, 4 = Somewhat Satisfied, 5 = Satisfied, 6 = Very Satisfied; 0 = Neither Satisfied nor Dissatisfied.
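
The two-group interpretation described above can be sketched in Python. The responses below are hypothetical values made up for illustration (they are not data from the course), and the variable names are assumptions.

```python
from statistics import mean

# Hypothetical responses (illustration only, not data from the slides).
# 1-6 are the satisfaction levels; 0 means "Neither Satisfied nor Dissatisfied".
responses = [1, 2, 2, 0, 5, 6, 0, 4, 3, 6]

# Group 1: participants without an opinion are counted and reported separately.
no_opinion_count = responses.count(0)

# Group 2: central tendency is computed only from participants with an opinion.
opinions = [r for r in responses if r != 0]
mean_with_opinion = mean(opinions)

print(f"No opinion: {no_opinion_count} of {len(responses)}")
print(f"Mean satisfaction (n={len(opinions)}): {mean_with_opinion}")
```

Leaving the 0 codes in the list would pull the mean toward the dissatisfied end even though those participants expressed no dissatisfaction, which is why they are filtered out before averaging.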

Descriptive and Inferential Statistics The study of statistics is often described by two broad categories: descriptive and inferential statistics. Descriptive statistics are used to classify and summarize (describe) numerical data. Mean, median, mode, variance, and standard deviation are examples. Inferential statistics consist of procedures for making generalizations about a population by studying a sample from that population. t-tests and Pearson's r are examples.

Measures of Central Tendency Many studies have distributions with heavy concentrations of scores in the middle and fewer scores trailing out into either extreme (tails). The measures of central tendency for these distributions lie near the center of the distribution. These measures most commonly consist of the mean, median, and mode.

Measures of Central Tendency Mode: The mode is the simplest index of central tendency. It is the most frequently occurring score in a distribution. In this example, 75 is the mode. There can be more than one mode in a distribution: unimodal distributions have one mode, bimodal distributions have exactly two modes, and multimodal distributions have more than two modes. The mode identifies the most frequent occurrence(s) but does not lend itself to mathematical manipulation and thus has limited statistical value. Example scores (X): 53, 55, 61, 65, 65, 67, 67, 69, 70, 70, 72, 72, 73, 75, 75, 75, 78, 88, 89, 91, 93.

Measures of Central Tendency Median: The median is the value that occurs in the middle of the distribution when it is sorted in ascending order (lowest to highest). In this example, the median is 72. If the number of values in the distribution is odd, the median is the middle value. If the number of values is even, the median is the average of the two middle values. With an even number of values the median becomes a calculated rather than an actual value. To avoid this, some remove an outlying value to make the number of values in the distribution odd. Others report two medians. Example scores (X): 53, 55, 61, 65, 65, 67, 67, 69, 70, 70, 72, 72, 73, 75, 75, 75, 78, 88, 89, 91, 93.

Measures of Central Tendency Mean: The mean is the arithmetic average of the scores in a distribution. It is determined by adding the scores and dividing by the total number of scores: $\bar{X} = \frac{\sum X_i}{n} = \frac{1{,}523}{21} \approx 72.5$. The symbol for the mean of the population is µ (mu). The symbol for the sample mean is $\bar{X}$ (X-bar). In inferential statistics it is important to distinguish between the mean of the population and the corresponding mean of the sample. Example scores (X, n = 21): 53, 55, 61, 65, 65, 67, 67, 69, 70, 70, 72, 72, 73, 75, 75, 75, 78, 88, 89, 91, 93; sum = 1,523.
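
As a quick check, the three measures for the 21 example scores can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are assumptions, and the course itself works in Excel.

```python
from statistics import mean, median, multimode

# The 21 example scores from the slides, sorted in ascending order.
scores = [53, 55, 61, 65, 65, 67, 67, 69, 70, 70, 72,
          72, 73, 75, 75, 75, 78, 88, 89, 91, 93]

modes = multimode(scores)   # list of every most-frequent value
middle = median(scores)     # middle value of the sorted scores
average = mean(scores)      # sum of the scores divided by n

print("mode(s):", modes)
print("median:", middle)
print("mean:", round(average, 1))
```

`multimode` is used instead of `mode` because it returns all modes, so a bimodal or multimodal distribution would show up as a list with more than one entry.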

Measures of Central Tendency Which is the best measure of central tendency: the mean, median, or mode? If the distribution is symmetrical and unimodal, the mean, median, and mode are the same.

Measures of Central Tendency If the distribution is symmetrical and bimodal the mean and the median coincide but there are two modes.

Measures of Central Tendency If the distribution is skewed to the left (negatively skewed) the mean is less than the median. If the distribution is skewed to the right (positively skewed) the mean is greater than the median.

Measures of Central Tendency So which measure is best: mean, median, or mode? Consider the following distribution of salaries.

Position                   Number of Employees   Salary
President                  1                     180,000
Executive vice president   1                      60,000
Vice presidents            2                      40,000
Controller                 1                      22,800 (Mean)
Senior salespeople         3                      20,000
Junior salespeople         4                      14,800
Foreman                    1                      12,000 (Median)
Machinist                  12                      8,000 (Mode)

How is this distribution skewed (left or right; positive or negative)? Which measure would you use if you were chairman of the machinists' union negotiating a new contract? Which would you use as management's representative in the negotiations?
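
The mean, median, and mode marked in the salary table can be verified with a short Python sketch. It expands the table to one salary per employee; the variable names are assumptions.

```python
from statistics import mean, median, mode

# (salary, number of employees) pairs taken from the salary table.
salary_counts = [(180_000, 1), (60_000, 1), (40_000, 2), (22_800, 1),
                 (20_000, 3), (14_800, 4), (12_000, 1), (8_000, 12)]

# Expand to one salary per employee (25 employees in total).
salaries = [salary for salary, count in salary_counts for _ in range(count)]

salary_mean = mean(salaries)
salary_median = median(salaries)
salary_mode = mode(salaries)

print("mean:", salary_mean)
print("median:", salary_median)
print("mode:", salary_mode)
```

Comparing the printed mean with the median gives the direction of the skew, using the rule from the earlier slide on skewed distributions.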

Measures of Variation Variance: The variance is the average of the squared deviations around the mean. The population variance formula is: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$, where N is the number of values in the population. However, we often work with a sample, not the population. The difference between working with the sample and working with the population has to do with restrictions that are placed on a sample when drawn from a population. These restrictions aren't present when working with the entire population. These restrictions are referred to as degrees of freedom (df).

Measures of Variation Degrees of freedom: The concept of degrees of freedom is fundamentally mathematical. In the example of calculating the sample variance, because the mean is known, a restriction is placed on the sample size (n). Imagine the population mean (25) is being estimated from a sample of five values. The first four of those values are free to assume any value (21, 23, 24, 30). However, the fifth value must result in the known mean of 25. Thus the fifth value must be 27. The calculated mean results in a loss of 1 degree of freedom, which is why the sample variance divides by n - 1 rather than n.
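
The arithmetic behind the fifth value can be sketched in a few lines of Python (the variable names are assumptions). Once the mean and four of the five values are fixed, the last value is fully determined.

```python
known_mean = 25
n = 5
free_values = [21, 23, 24, 30]   # the four values that are free to vary

# The five values must sum to known_mean * n, so the last one has no freedom.
last_value = known_mean * n - sum(free_values)

print("fifth value:", last_value)
print("degrees of freedom:", n - 1)
```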

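Measures of Variation The sample variance: The sample variance formula is: $s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$. For the sample 9, 12, 7, 5, 2, 3, 4 (n = 7), the mean is 6 and the variance is 12.67.

X_i   X-bar   X_i - X-bar   (X_i - X-bar)^2
9     6        3             9
12    6        6            36
7     6        1             1
5     6       -1             1
2     6       -4            16
3     6       -3             9
4     6       -2             4

Sum of squared deviations: 76. Degrees of freedom (n - 1): 6. Sample variance: $s^2 = 76 / 6 = 12.67$.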
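
The table's steps can be followed in Python as a minimal sketch. The manual calculation mirrors the columns above, and `statistics.variance` (which also divides by n - 1) is used only as a cross-check. Variable names are assumptions.

```python
from statistics import mean, variance

sample = [9, 12, 7, 5, 2, 3, 4]

x_bar = mean(sample)                                # sample mean
squared_devs = [(x - x_bar) ** 2 for x in sample]   # (X_i - X-bar)^2 column
sum_sq = sum(squared_devs)                          # sum of squared deviations
s2 = sum_sq / (len(sample) - 1)                     # divide by n - 1 (df)

print("mean:", x_bar)
print("sum of squares:", sum_sq)
print("sample variance:", round(s2, 2))
print("statistics.variance:", round(variance(sample), 2))
```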

Measures of Variation The standard deviation: The standard deviation is the square root of the variance: $s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} = \sqrt{12.67} = 3.56$. The standard deviation indicates, roughly, how far values typically fall from the mean.
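
Continuing with the same seven-value sample, the square-root step can be sketched as follows. `statistics.stdev` is the sample (n - 1) version and is shown only as a cross-check; variable names are assumptions.

```python
from math import sqrt
from statistics import stdev, variance

sample = [9, 12, 7, 5, 2, 3, 4]

s2 = variance(sample)   # sample variance, divides by n - 1
s = sqrt(s2)            # standard deviation = square root of the variance

print("standard deviation:", round(s, 2))
print("statistics.stdev:", round(stdev(sample), 2))
```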

Standard Scores Standard scores use the standard deviation as a unit of measure, thereby describing the relative position of a single score in the entire distribution of scores in terms of the mean and standard deviation. A common standard score is the z score: $z = \frac{X - \bar{X}}{s}$. In our example, a raw score of 5 would have a z score of -0.28: $z = \frac{5 - 6}{3.56} = -0.28$. The z score corresponding to a given raw score indicates how many standard deviations the raw score falls either above or below the mean.
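
A minimal Python sketch of the z score calculation for the running example. The helper name `z_score` and its parameter names are assumptions chosen for illustration.

```python
from statistics import mean, stdev

sample = [9, 12, 7, 5, 2, 3, 4]
x_bar = mean(sample)   # sample mean
s = stdev(sample)      # sample standard deviation (n - 1)

def z_score(raw, centre, spread):
    """Standard deviations that raw lies above (+) or below (-) the mean."""
    return (raw - centre) / spread

z_for_5 = z_score(5, x_bar, s)
print("z for a raw score of 5:", round(z_for_5, 2))

# The same helper applied to every value in the sample.
for x in sample:
    print(x, round(z_score(x, x_bar, s), 2))
```

A raw score equal to the mean gets a z score of 0, and the sign shows on which side of the mean the score falls.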

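The Standard Normal Distribution When a variable is normally distributed and we know the mean and standard deviation, we can convert its values to z scores and use the standard normal distribution. Notice the area under the normal curve for plus and minus 3 standard deviations (sigma): about 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.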

The Standard Normal Distribution We used to calculate the area under the standard normal curve for a value of z by consulting a table.

The Standard Normal Distribution Today we use the NORM.S.DIST() function in Excel. The function requires a z value. A z-value of 1 with cumulative = TRUE returns 0.8413: the 0.3413 shown in the table plus the 0.5 for the left half of the curve. If the cumulative option were set to FALSE, Excel would return the ordinate value of 0.2420 (the height of the curve at z). A z-value of -1 would yield 0.1587, the area preceding -1 (or beyond 1 as shown in the table).
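
For readers without Excel, the same values can be reproduced with `NormalDist` from Python's standard library. This is a sketch of an equivalent calculation, not part of the Excel workflow; the variable names are assumptions.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

cum_at_1 = std_normal.cdf(1)          # comparable to NORM.S.DIST(1, TRUE)
ordinate_at_1 = std_normal.pdf(1)     # comparable to NORM.S.DIST(1, FALSE)
cum_at_minus_1 = std_normal.cdf(-1)   # comparable to NORM.S.DIST(-1, TRUE)

print(round(cum_at_1, 4))
print(round(ordinate_at_1, 4))
print(round(cum_at_minus_1, 4))
```

Here `cdf` plays the role of cumulative = TRUE (area to the left of z) and `pdf` the role of cumulative = FALSE (height of the curve at z).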

The Standard Normal Distribution Using Excel calculate the area under the normal curve for a z-value of 0.85.

The Standard Normal Distribution Using Excel, calculate the area under the normal curve for a z-value of 0.85. Answer: =NORM.S.DIST(0.85, TRUE) returns 0.8023.

The Standard Normal Distribution Using Excel calculate the area under the normal curve for a z-value of -0.25.

The Standard Normal Distribution Using Excel, calculate the area under the normal curve for a z-value of -0.25. Answer: =NORM.S.DIST(-0.25, TRUE) returns 0.4013.

The Standard Normal Distribution Using Excel calculate the area under the normal curve between z = -0.5 and z = 1.5.

The Standard Normal Distribution Area below z = 1.5: 0.9332. Area below z = -0.5: 0.3085. Area between z = -0.5 and z = 1.5: 0.9332 - 0.3085 = 0.6247.
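
The three exercises can be checked with a short Python sketch that uses the standard library's `NormalDist` in place of Excel. The helper name `area_between` is an assumption chosen for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()   # defaults to mean 0, standard deviation 1

def area_between(z_low, z_high):
    """Area under the standard normal curve between two z values."""
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)

area_below_085 = std_normal.cdf(0.85)
area_below_neg025 = std_normal.cdf(-0.25)
area_mid = area_between(-0.5, 1.5)

print(round(area_below_085, 4))
print(round(area_below_neg025, 4))
print(round(area_mid, 4))
```

The area between two z values is the cumulative area below the upper value minus the cumulative area below the lower value, which is the subtraction shown in the last exercise.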