CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS


Abner Thomas
5 years ago

Chapter Objectives:

- Understand Null Hypothesis Significance Testing (NHST)
- Understand statistical significance and probabilities (p-values)
- Understand Type I and Type II Errors in hypothesis testing
- Understand t-tests, ANOVA / MANOVA, and ANCOVA / MANCOVA and how they are used in analyzing experiments and quasi-experiments
- Understand Effect Size Estimates
- Understand the difference between statistical and practical significance
- Understand Power Analysis
- Understand how validity (quality) of experiments and quasi-experiments is evaluated

- Understand internal validity and threats
- Understand statistical conclusion validity

I. Data Analysis

Data analysis in experimental and quasi-experimental research is a multilayered process that begins with collecting data on the dependent variable after the experiment has been implemented and ends with making inferences about the statistical soundness and the meaning of the results. There are three complementary approaches to statistical analysis.

- Null Hypothesis Significance Testing (NHST) is the term applied to the set of procedures that (1) establishes statistical significance of results and (2) confirms or disconfirms an experiment's hypothesis.
- Effect Size refers to the magnitude of differences on the dependent variable after a treatment.
- Power Analysis refers to the probability of avoiding errors in conclusions drawn from NHST.

Null Hypothesis Significance Testing: Significance Testing

Significance testing is the first step in NHST. A result is statistically significant when there is only a low probability that a difference as large as the one observed would arise from sampling error or chance alone. It is important to note that significance as used here does not mean importance. It reports only that the outcome is unlikely to be a product of error; whether the outcome can be attributed to the treatment also depends on the design of the experiment.

Significance is based on probability. The p-value is the probability of obtaining differences at least as large as those observed if only sources of error (e.g., sampling) were at work. In most cases, the lower the p-value, the better. Alpha is the p-value that the researcher sets at the beginning of the study as the acceptable level of probability that differences are due to sampling error.
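One way to make the idea of a p-value concrete is a permutation test: if the treatment made no difference, the group labels would be arbitrary, so we can relabel the scores in every possible way and count how often chance alone produces a gap as large as the one observed. The sketch below is an illustration only. The scores are hypothetical, the function name is invented for this example, and the permutation test is not one of the tests the chapter goes on to describe.

```python
from itertools import combinations

# Hypothetical post-test scores, invented for illustration only.
treatment = [78, 85, 90, 72, 88, 95, 81, 79]
control = [70, 75, 80, 68, 74, 82, 77, 71]


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    Assumes both groups are the same size, so comparing group sums is
    equivalent to comparing group means.
    """
    if len(group_a) != len(group_b):
        raise ValueError("this sketch assumes equal group sizes")
    pooled = group_a + group_b
    total = sum(pooled)
    # |sum(a) - sum(b)| written in terms of sum(a) and the grand total.
    observed_gap = abs(2 * sum(group_a) - total)
    as_extreme = 0
    arrangements = 0
    # Try every way of relabelling the pooled scores into two equal groups.
    for relabelled in combinations(pooled, len(group_a)):
        arrangements += 1
        if abs(2 * sum(relabelled) - total) >= observed_gap:
            as_extreme += 1
    return as_extreme / arrangements


ALPHA = 0.05  # threshold set before looking at the data
p_value = permutation_p_value(treatment, control)
print(f"p = {p_value:.4f}; significant at alpha = {ALPHA}: {p_value <= ALPHA}")
```

The returned proportion is read exactly as the text describes: the smaller it is, the less plausible it is that sampling error alone produced the observed difference.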

p ≤ .05 is the conventional alpha used in educational research. This means that differences as large as those observed would be expected less than 5% of the time from sampling error alone. Probabilities are derived from the normal distribution curve. Remember from Chapter Six that about 95% of values will randomly occur within the first two standard deviations (plus or minus 2 standard deviations) on the curve.

Figure 1. Normal Distribution Curve, p = .05

By establishing alpha at .05, the researcher is setting a standard: a result counts as significant only if it falls outside the shaded area of the curve, where 95% of the results produced by sampling error or chance would be expected to occur.

Quantitative researchers use inferential statistical tests to arrive at a p-value. Inferential statistics use probability theory to make inferences from data about the likelihood that results occurred by chance and whether the results can be generalized from the subjects in the sample to all of the subjects in the population from which the sample was drawn. The inferential tests that use the means and standard deviations on measured outcomes to calculate statistical significance in experiments are the t-test, ANOVA / MANOVA, and ANCOVA / MANCOVA.

t-tests

t-tests compare outcomes of two groups or two variables. There are two types of t-tests: 1) t-tests for independent means, which analyze the difference in means between two groups, and 2) t-tests for correlated means, which analyze differences in means for the same group before and after a treatment. The t-test yields a t-value that leads to a p-value.

For example, in a post-test only true experiment with randomly assigned treatment and control groups, the researcher sets alpha at p ≤ .05. He uses a t-test to compare math scores of students who were instructed with an innovative method to those who had traditional instruction; the test yields t = 2.29. He then derives a p-value from the t-value to see if the alpha has been achieved. The summary data display will look like this:

N = 60      Treatment     Control     t
Mean                                  2.29*
SD
df = 58
* p = .05

The number of subjects in the study is designated by N (N = 60). The asterisk (*) that appears next to the t-value directs us to the bottom of the chart, where we see df = 58. This stands for degrees of freedom (the number of subjects minus the number of groups: 60 - 2 = 58) and is used to find p = .05 on a Table of Critical t-values. A segment of that table for degrees of freedom from 40 to 75 is presented below.

Table 1. Critical t-values (df = 40-75), listed by df and probability value (p)

In our example df = 58. When we look down the table we use the row for df = 60, which is close enough. When we scan across that row to the column for p = .05, we find the critical t-value. Any t-value above that would signify statistical significance. Since t = 2.29 in our example exceeds the critical value, the results meet the criterion for .05 and are significant.
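The steps above (compute t, find df, compare against a critical value) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the scores are hypothetical, the function name is invented, and the pooled-variance formula is the standard one for an independent-means t-test with roughly equal group variances. The critical value 2.145 is the standard two-tailed table entry for p = .05 at df = 14, used here because the made-up groups hold 8 subjects each.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(treatment, control):
    """Return (t, df) for an independent-means t-test with pooled variance."""
    n1, n2 = len(treatment), len(control)
    df = n1 + n2 - 2  # number of subjects minus number of groups
    # Pool the two sample variances, weighting each by its own df.
    pooled_var = ((n1 - 1) * variance(treatment)
                  + (n2 - 1) * variance(control)) / df
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(treatment) - mean(control)) / std_error
    return t, df


# Hypothetical math scores, invented for illustration only.
treatment_scores = [78, 85, 90, 72, 88, 95, 81, 79]
control_scores = [70, 75, 80, 68, 74, 82, 77, 71]

t_value, df = independent_t(treatment_scores, control_scores)

# Two-tailed critical t for p = .05 at df = 14, taken from a standard table.
CRITICAL_T_DF14 = 2.145
significant = abs(t_value) > CRITICAL_T_DF14
print(f"t = {t_value:.2f}, df = {df}, significant: {significant}")
```

Looking the critical value up in a table, rather than computing an exact p-value, mirrors the procedure the chapter describes.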

ANOVA / MANOVA

ANOVA (Analysis of Variance) and MANOVA (Multivariate Analysis of Variance) also compare outcomes between groups; ANOVA is more flexible than the t-test in that it can compare outcomes for two or more groups. ANOVA works by comparing the differences between groups with the differences within groups. MANOVA is used when the groups are compared on two or more dependent variables at once. Both the ANOVA and MANOVA yield an F-value that leads to a p-value.

For example, a researcher uses ANOVA to determine whether there are differences in results from three approaches to reinforcing skills in math. The sample is selected from two fourth grade classrooms in a school; students are randomly assigned to three groups after receiving direct instruction. Group #1 works in cooperative learning groups; group #2 receives instruction from an interactive computer program; and group #3 uses concept mapping strategies.

               Group #1        Group #2      Group #3
               Cooperative     Computer      Concept Maps
               (n = 15)        (n = 14)      (n = 15)
Mean Scores                                              *
SD
* F (7.14); p = .002

The ANOVA generated an F-value of 7.14, which resulted in p = .002, which is below the .05 alpha. The results are statistically significant, with the difference attributed to the concept-mapping group. The conversion from F to p is based on a Table of Critical F-values that, like the table of t-values, lists degrees of freedom and p-values.

ANCOVA / MANCOVA

ANCOVA (Analysis of Covariance) and MANCOVA (Multivariate Analysis of Covariance) are predominantly used in quasi-experimental studies, when there is a likelihood that the groups are non-equivalent at the start; they are also used in other experimental designs in order to ensure the equivalence of the groups. These tests provide more certainty that the outcomes of a quasi-experiment are not being affected by the differences, or variances, that exist in the subjects before the experiment begins. An ANCOVA or MANCOVA is performed at the end of the experiment to level the playing field; it is like a golf handicap that accommodates differences in competitors before the round begins. ANCOVA and MANCOVA yield an F-value that leads to a p-value.

For example, in a quasi-experimental study that uses two intact classrooms to investigate the effect of an innovative approach to language arts on the two variables of writing and reading scores, the researcher uses MANCOVA to level the playing field between the two non-equivalent groups. The display of data will look like this:

Pre-Test Scores: Writing (M, SD) and Reading (M, SD) for the Treatment and Control groups
Post-Test Scores: Writing (M, SD) and Reading (M, SD) for the Treatment and Control groups
* F (11); p = .0018
** F (20); p = .0001

The MANCOVA generated F-values of 11 and 20, which led to p = .0018 and p = .0001. These values are below the .05 alpha; the results are statistically significant for both reading and writing. These F-values tell you that there are statistically significant differences; they do not specify exactly which groups are different. Researchers follow up with t-tests to make these comparisons.

One- and Two-Tailed Tests

Researchers choose between using a one-tailed or a two-tailed inferential test. A one-tailed test uses one end, or tail, of the curve to generate p ≤ .05; a two-tailed test uses two ends, or tails, with p ≤ .025 on each end of the curve, to generate p ≤ .05. The illustration below shows these probabilities.

Figure 2. One-tailed tests at p = .05

Figure 3. Two-tailed test

The most objective test is the two-tailed test. It represents a higher standard and degree of difficulty for results because the probability must be distributed to both ends of the curve. A one-tailed test is more lenient because all of the probability for the hypothesis test is on one side of the distribution curve. In the final analysis what matters is the probability (p) value and, moreover, a theory that provides a rationale for making a well-founded prediction.

Null Hypothesis Significance Testing: Hypothesis Testing

Hypothesis testing is the next step in the process of null hypothesis significance testing (NHST) and uses statistical significance to confirm or disconfirm a hypothesis. As a reminder, here are the definitions that are useful in this discussion.

- The hypothesis operationalizes the theory in terms of the independent and dependent variables.
- A hypothesis makes a prediction about the effect of the independent variable on the dependent variable and can be stated as either a directional or non-directional hypothesis.

- The directional hypothesis predicts that the treatment will result in a change in outcomes and that the change will be a positive result of the experiment.
- The non-directional hypothesis predicts that a treatment will result in a change in outcomes, but does not predict the direction of the change: whether it will be positive or negative.

In hypothesis testing, the researcher uses a construct known as the null hypothesis. The null hypothesis predicts there will be no significant difference between the experimental group(s) and control group(s) as a result of the treatment. To support a hypothesis, the researcher has to demonstrate that the null is implausible. This is where significance testing comes into play. The researcher uses the p-value derived from significance testing to judge the null: the p-value is the probability of results like those observed if the null were true. If p ≤ .05, such results would be unlikely under the null; therefore the researcher can reject the null hypothesis and treat the hypothesis as confirmed. In effect, the null hypothesis is set up as a straw man; it is easier to prove something false than it is to prove something true. Below is a graphic that summarizes the multi-layered process of data analysis through NHST.

Figure 3. Summary of Null Hypothesis Significance Testing

Type I and Type II Errors

Hypothesis testing is often used in decision-making, and researchers have to guard against making inferential errors.

Type-I error means that the researcher erroneously rejects the null hypothesis. This is a false positive that leads to the incorrect conclusion that the hypothesis has been confirmed.

Type-II error occurs when the researcher erroneously accepts the null hypothesis. This is a false negative that leads to the incorrect conclusion that the hypothesis has not been confirmed.

Figure 4. Type I and Type II Errors; Decision-making about the null hypothesis

A Type I error is called the alpha error because its probability is governed by the alpha; it is more likely when the alpha has been set too high (for example, p = .05), leaving too much room for error. The researcher can reduce the risk of this error by setting a lower alpha (p ≤ .025 or p ≤ .01). The Type II error is called the beta error. To avoid this error, the researcher has to follow a set of statistical procedures before the study begins. This is called power analysis and is described in the section below.

Effect Size Estimates and Power Analysis

NHST is not without its critics. As Thompson stated, "statistical significance is not sufficiently useful to be invoked as the sole criterion for evaluating the noteworthiness" of research (2002a, p. 66). Statistical significance does tell whether differences exist, but it does not tell the size, or magnitude, of those differences; nor does it provide insight into the ultimate meaningfulness of the research. And it may obscure Type I and Type II Errors. Researchers use effect size estimates and power analysis to address those concerns.

Effect Size Estimates

Effect size estimates convey "the magnitude of an effect or the strength of a relationship" (APA, 2001, p. 25); they are meaningful estimations of impact. Effect size estimates provide information that statistical significance does not; they provide evidence about what is termed practical significance.

Practical significance answers the question: Are the differences big enough to have real meaning and to guide decision-making in practical situations? Expressed as ES or d, effect size is an essential statistical calculation in social research. The fifth edition of the APA Publication Manual (2001) states, "The general principle to be followed is to provide the reader not only with information about statistical significance but also with enough information to assess the magnitude of the observed effect or relationship" (pp ).

There are several ways to calculate Effect Size. The formula commonly used in educational research is called the Glass Δ (delta). This formula simply calculates the difference between the means of the treatment and control groups and divides the answer by the standard deviation of the control group (Glass, 1976; Cohen, 1968).

Glass Δ = (Mean of experimental group - Mean of control group) / Standard deviation of control group

The Glass Δ can be translated into units of standard deviation gain: the greater the effect size, the greater the gain in SD units. By way of explanation: an effect size of 1.0 is equivalent to a one standard deviation change in outcomes on a normal distribution curve. An effect size of 0.5 is equivalent to a ½ standard deviation change; this would mean that the average pupil experiencing a treatment would improve by almost one half (½) a standard deviation on a standardized measure. For example, a student would move from the 50th percentile to the 67th percentile on the math SAT, from a score of around 520 to a score of 570, as the result of a test preparation program.

Cohen developed the following guidelines for interpreting effect size. ES = 0.5 is considered adequate for establishing a difference of sufficient magnitude.

- < 0.1 = trivial effect (1/10 SD gain)
- 0.1 - 0.3 = small effect (up to ⅓ SD gain)
- 0.3 - 0.5 = moderate effect (up to ½ SD gain)
- > 0.5 = large effect (above ½ SD gain)

Just as researchers usually have as their goal to achieve an alpha of p ≤ .05, they usually strive for ES = 0.5. Effect Size is not only used to determine practical significance; it is also used in calculating statistical power.

Statistical Power and Power Analysis

Statistical Power tells the likelihood that the researcher will avoid making a Type II error, that is, accepting as true a null hypothesis that is false. Power is expressed as a probability; the higher the power, the greater the probability of avoiding error. A power analysis allows a researcher to calculate the sample size that will avoid a Type II error, given the desired alpha and effect size. A power of 0.80 is considered the acceptable threshold for avoiding error. Poor power cannot be corrected after an experiment is completed; to be useful, the power analysis must occur before the experiment begins.
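The Glass Δ calculation and the sample-size side of a power analysis can both be sketched briefly. What follows is an illustration under assumptions, not the procedure of any particular statistics package: the scores are hypothetical, the function names are invented, and the sample-size function uses a common normal-curve approximation for a two-group, two-tailed comparison, which tends to give slightly smaller numbers than exact t-based power calculations.

```python
from math import ceil
from statistics import NormalDist, mean, stdev


def glass_delta(treatment, control):
    """Glass's delta: mean difference divided by the control group's SD."""
    return (mean(treatment) - mean(control)) / stdev(control)


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group (two groups, two-tailed test).

    Assumption of this sketch: the normal-curve approximation
    n = 2 * ((z_alpha + z_power) / ES) ** 2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # cut-off for the alpha
    z_power = NormalDist().inv_cdf(power)          # cut-off for the power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Hypothetical scores, invented for illustration only.
treatment_scores = [78, 85, 90, 72, 88, 95, 81, 79]
control_scores = [70, 75, 80, 68, 74, 82, 77, 71]
delta = glass_delta(treatment_scores, control_scores)

n_baseline = sample_size_per_group(0.5)                     # alpha .05, ES 0.5
n_larger_effect = sample_size_per_group(0.8)                # larger ES
n_stricter_alpha = sample_size_per_group(0.5, alpha=0.025)  # lower alpha

print(f"Glass delta = {delta:.2f}")
print(f"n per group: baseline {n_baseline}, ES 0.8 {n_larger_effect}, "
      f"alpha .025 {n_stricter_alpha}")
```

Comparing the three sample sizes shows the trade-offs at work: a larger expected effect needs fewer subjects for the same power, while a stricter (lower) alpha needs more.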

1. Before the experiment begins, the researcher sets the desired alpha at .05 and the desired effect size at 0.5.
2. The researcher conducts a statistical power analysis to determine the sample size necessary to reach power = 0.80.
3. The power analysis determines that a sample size of 65 subjects for each experimental group would be adequate to reach a 0.80 power.
4. If the researcher cannot have access to a sample of that size, she can reach the target power with fewer subjects by raising the ES to 0.80, by relaxing the alpha, or by doing both. Note that lowering the alpha (for example, to .025) makes the test stricter and therefore reduces power.

The concept of power is crucial to the conduct of responsible and sound research. An understanding of statistical power enables educators as consumers of research to ask the right questions about what the research says and to make informed judgments about effective interventions they might use in their own practice.

II. Validity of True and Quasi Experiments

In evaluating the quality of an experiment, a reader has to consider the overall validity of the study. Validity is the approximate truth of propositions, inferences, or conclusions (Trochim). There are three types of validity that taken together lead to the overall validity of an experiment.

- Conclusion validity answers the question: Is there a relationship between the independent and dependent variables?
- Internal validity answers the question: Is the relationship causal?
- External validity answers the question: Can we generalize findings to other people and settings?

There are barriers to achieving each validity component that researchers have to be mindful of before they can make a judgment about the quality of the validity profile of the study. Threats to validity are factors that interfere with a study and compromise our confidence about its results.

Conclusion Validity

Conclusion validity is the degree to which it is reasonable to conclude that there is a relationship between variables. Threats to conclusion validity compromise our confidence that a relationship does exist between variables. These threats include the following.

- Low reliability of measures: reliability < 0.7

- Low statistical power: inadequate sample size, alpha set too low
- "Fishing": analyzing and re-analyzing data; making multiple comparisons with the aim of finding significant results
- Mismatch of statistics to sample characteristics

Internal Validity

Internal validity concerns the level of control over causation that the researcher has in an experiment. It is the degree to which results are due to a causal relationship between variables, so that the effect on the DV is not due to any variables (extraneous variables) other than the independent variable. An experiment that has strong internal validity can unambiguously attribute the effects on the dependent variable to the treatment of the independent variable.

Threats to internal validity compromise our confidence in saying that a causal relationship exists between the independent and dependent variables. The most common threats to internal validity are the following.

- History: An unanticipated event occurs while the experiment is in progress
- Maturation: Normal developmental processes that affect subjects differently over time
- Statistical Regression: Subjects at the extremes regress to the mean during post-tests
- Selection: The groups are not equivalent at the beginning of the study

- Mortality: Subjects drop out of the study
- Testing: The pre-test sensitizes subjects to the post-test
- Instrumentation: Changes in the way the dependent variable is measured
- Compensatory Rivalry ("The John Henry Effect"): Social competition motivates a group to over-perform and mask treatment effects

External Validity

External validity (or generalizability) is the degree to which the findings can be applied to other people, settings, and times. External validity can be approached in two ways: as population validity or as ecological validity.

- Population validity is the degree to which the results of an experiment can be generalized to individuals who were not in the study.
- Ecological validity is the extent to which the results of an experiment can be generalized to different settings.

Threats to external validity compromise confidence in stating whether the study's results are applicable to other people and settings. These threats include the following:

- For population validity: not having a representative sample, randomly selected from a target population.
- For ecological validity:
  o The Hawthorne effect (also called reactivity): Outcomes are due to the reaction of subjects to being studied and are not due to the treatment.

  o Experimenter effect: Outcomes are due to the characteristics of the person conducting the study.
  o Insufficient description of the conditions of the experiment: setting, treatment, and measurement

Evaluating Validity

Evaluating the various validities of between-group experiments and quasi-experiments is a complex undertaking. We recommend that research consumers form a judgment by using a framework that considers three criteria: 1) theory and treatment, 2) sampling, and 3) measurement.

Theory and Treatment:

- Fidelity to a theory that is supported by a well-referenced research review and is not subject to modification (through "fishing") enhances confidence about conclusion validity.
- A solid theory that is supported by a well-referenced research review and leads to a hypothesis that clearly states a causal relationship between variables enhances confidence about internal and external validity.
- A theory-based treatment that is clearly described and strictly implemented enhances confidence in internal and external validity.

Sampling:

- Adequate sample size and a good match between the sample and statistical analysis enhance confidence about conclusion validity.
- A detailed description of people in the sample (how many and subject characteristics) enhances confidence about internal and external validity.
- A detailed description of the setting enhances confidence about internal and ecological external validity.
- Control over sampling threats enhances confidence about internal and external validity.
- Random assignment of the sample to the treatment condition enhances confidence about internal and external validity.
- Random selection of the sample from a population enhances confidence about population external validity.

Measurement:

- Reliable measures (r ≥ 0.7) enhance confidence in conclusion, internal, and external validity.
- Valid measures enhance confidence in internal and external validity.
- Consistent measures enhance confidence in internal and external validity.

The table below summarizes these criteria and may serve as a template for evaluating validity.

Criterion: Theory and Treatment

- Conclusion Validity: Clear statement of hypothesis that predicts how IV will affect DV. No evidence of fishing.
- Internal Validity: Well referenced research review leading to a solid causal theory or framework. Clear statement of hypothesis that predicts how IV will affect DV. Evidence of history threat or testing threat?
- External Validity (Population): Well referenced research review leading to a solid causal theory or framework. Clear statement of hypothesis that predicts how IV will affect DV.
- External Validity (Ecological): Well referenced research review leading to a solid causal theory or framework. Clear statement of hypothesis that predicts how IV will affect DV. Detailed description of the treatment and conditions of the study, including history threat.

Criterion: Sampling

- Conclusion Validity: Adequate sample size determined by power analysis (to avoid Type II Error) and match of sample to statistics.
- Internal Validity: Detailed description of sample (size and subject characteristics) and setting. Random assignment to treatment condition. Evidence of maturation, selection, regression, or compensatory rivalry threats?
- External Validity (Population): Detailed description of sample (size and subject characteristics). Random selection from a population and random assignment to treatment.
- External Validity (Ecological): Detailed description of setting. Random assignment to treatment. Evidence of Hawthorne Effect, Experimenter Effect, or insufficient description of the setting?

Criterion: Measurement

- Conclusion Validity: Reliability of measures ≥ .7
- Internal Validity: Use of reliable, valid, and consistent measures for DV. Evidence of instrumentation threat?
- External Validity (Population): Use of reliable, valid, and consistent measures for DV.
- External Validity (Ecological): Use of reliable, valid, and consistent measures for DV.

Rating (for each type of validity): H M L


Summary of Criteria for Evaluating Studies

Summary

- There are three complementary approaches to statistical analysis in experimental and quasi-experimental research: Null Hypothesis Significance Testing (NHST), Effect Size, and Power Analysis.
- Inferential statistics build on descriptive statistics (means and standard deviations) to determine the likelihood that results occurred by chance and whether the results can be generalized.

- Inferential tests that are used in experiments and quasi-experiments are t-tests, ANOVA/MANOVA, and ANCOVA/MANCOVA.
- Inferential tests yield values that are converted to probabilities of results having occurred by chance.
- Experiments and quasi-experiments are evaluated for internal validity, statistical conclusion validity, and external validity (population and ecological validity).
- There are several threats to internal validity that the researcher seeks to control.
- Theory/treatment, sampling, and measurement are key elements in evaluating validity of experiments and quasi-experiments.

Terms and Concepts

NHST, effect size estimates, p-value/probability, t-test, ANCOVA/MANCOVA, degrees of freedom, null hypothesis, internal validity, statistical significance, power analysis, alpha, ANOVA/MANOVA, table of critical values, one- and two-tailed tests, Type I and Type II Errors, threats to internal validity, statistical conclusion validity, fishing, external validity, population validity, ecological validity

Review, Consolidation, and Extension of Knowledge

1. In your own words, explain null hypothesis significance testing.
2. In your own words, explain Effect Size Estimates and Power Analysis.
3. In your own words, explain the difference between statistical and practical significance.
4. Read the data analysis/results section of the experimental study you chose in the previous chapter and complete the Guide below.
5. Use the Guide as a template for writing a critique of about 1,000 words of the experimental or quasi-experimental between group study you selected. See the Appendix for an exemplar.

Guide to Reading and Critiquing an Experimental and Quasi-Experimental Group Study

Research Review and Theory:

- What is the purpose of the research review?
- Does it establish an underlying theory (big ideas) for the research?

Purpose and Design:

- What is the purpose of the study?
- Is there a hypothesis or a research question? If so, what is it? If not, can you infer the question from the text of the article?
- What is the basic research design and type?
- What are the dependent and independent variables? Identify each type of variable in the study. (IV=, DV=)

Sampling:

- How is the sample selected?
- How is the sample assigned to the treatment condition(s): random or nonrandom/intact group?
- Who is in the sample? What are the characteristics of the sample?
- What is the sample size?

Data Collection:

- What measures are used for the dependent variable?
- Are these standardized measures? Adapted measures? Newly-created measures?
- What is the format of the measure(s)?
- Are there indications of validity and reliability of the measures? What are they (r-values)?

Data Analysis and Results:

- What statistical tests are used to analyze the data?
- Were the results (p-values) significant or non-significant?
- What does the researcher conclude about the findings?

Evaluation of Validity:

- How do you evaluate internal validity? What are threats to internal validity?
- How do you evaluate statistical conclusion validity? What is your rationale?
- How do you evaluate external population validity? Ecological validity? What is your rationale?
