Inferential Statistics for Radiation Scientists: A Brief Guide to Better Statistical Interpretations

Journal of Medical Imaging and Radiation Sciences / Journal de l'imagerie médicale et des sciences de la radiation 43 (2012)
Directed Reading Article

Inferential Statistics for Radiation Scientists: A Brief Guide to Better Statistical Interpretations

Yves Bureau, PhD a*
a University of Western Ontario, St Joseph's Health Care, Lawson Health Research Institute, London, Ontario, Canada
* Corresponding author: Yves Bureau, PhD, University of Western Ontario, St Joseph's Health Care London, Lawson Health Research Institute, 288 Grosvenor Street, London, Ontario, N6A 4V2. E-mail address: ybureau@uwo.ca

LEARNING OBJECTIVES
- Distinguish between descriptive and inferential statistics.
- Describe the Monte Carlo method. This is important because it ensures that the investigator understands what P values mean.
- Understand estimation. By understanding estimation and the error that results from it, the investigator will better appreciate statistical results.
- Understand rare events. This is related to the point above and is crucial for hypothesis testing; without this understanding, any interpretation of results is merely the memorization of rules.
- Understand power: what it is and how to interpret it. The reader will gain a greater command of this concept.
- Understand effect size. The Type 1 error is often used, erroneously, as a measure of the effectiveness of a treatment; other measures should be examined as well.
- Interpret outputs from PASW (SPSS).
- Critically evaluate data using the Type 1 error, effect size, and power together.
- Finally, through examples, analyze and interpret the results from an experiment.

ABSTRACT
Inferential statistics is used to help investigators make decisions about their data. This package will help novice researchers understand how to think about inferential statistics and offers examples of specific statistical tests. Presented here is the Monte Carlo technique, an interesting approach to teaching statistics. It is used as a practical way to show why distributions are important and how they relate to the famous .05 probability criterion for declaring results significant. Also presented is how to conduct and interpret the t test, the F test (often referred to as analysis of variance), and interaction analysis. Finally, a discussion of misinterpretations is included to help prevent erroneous statements about statistical analyses.

RÉSUMÉ
Les statistiques déductives aident les chercheurs à prendre des décisions sur leurs données. Cette trousse aidera les chercheurs débutants à comprendre les statistiques déductives et présente des exemples de tests statistiques particuliers. On y trouve la technique Monte Carlo, une approche intéressante des statistiques. Il s'agit d'une façon pratique de démontrer l'importance des distributions et comment elles se relient au fameux critère de probabilité de 0,05 pour la pertinence des résultats. On y trouve aussi l'exécution et l'interprétation du test t et du test F, souvent considéré comme une analyse des variances, et de l'interprétation des interactions. Enfin, on aborde la question des fausses interprétations, dans un objectif d'aide à la prévention des déclarations erronées sur l'analyse statistique.

Introduction

Descriptive statistics describe, which is perfectly useful and appropriate when all that is wanted is to summarize the data. However, the moment an investigator asks questions about populations, an inference is being made. Answering those questions requires something called inferential statistics.
Before inferential statistics, the best an investigator could do was to observe the data and make a blanket statement based on the sample characteristics. It ended up being a best guess. If an investigator were to determine whether or not two groups in an experimental design had different means from one another, all that could be done was to visually judge whether the two group means were different. For lack of a better term, we could call that decision-making process or test the "ocular trauma test." The issue with that type of decision making is that there is a probability of being incorrect, because it is entirely possible that the means differ due to random error. The subjects that make up the two groups might come from the same population, but because of error in choosing the subjects, the means end up being different. In this article, it will be explained in detail that samples are always different from one another due to random sampling error. As a result, better ways are needed to help

investigators make decisions about their data. Thus, inferential statistics were born. No one can be 100% certain that a sample accurately represents the population. We will learn that this is sampling error, and we will have to wrestle with it because samples rarely measure an entire population. This article will deal with random error and how to make inferences. It may seem odd that random error is so central to this document when the title of this article is Inferential Statistics for Radiation Scientists: A Brief Guide to Better Statistical Interpretations. However, to understand how inferential statistics work, it is essential to understand error. Once this is understood, it becomes possible to appreciate inferential statistics, because they always incorporate error.

Descriptive Statistics

Barker [1] introduced descriptive statistics in detail. Description of our data provides us with data summaries. However, these summaries are guesses as to what the population looks like, and therefore are inherently inaccurate when extrapolating to the population. It is known that there will be some error in estimation; this error must be taken into consideration when making decisions about hypotheses. A number of things are known about error [2].

1. It is a function of random sampling. Random sampling means that we sample a population without bias. If the sampling were perfectly performed, it would be possible to obtain the same mean and standard deviation as the population. However, that never happens. This inaccuracy is called error in random sampling, or random sampling error.
2. Sample size is crucial with respect to error when estimating population parameters (such as the mean). The greater the sample size, the less error there is in estimation. If there were access to all subjects in a population there would be no error of estimation, and therefore no need for inferences.
3. Error can be due to instrument imprecision. This point is not crucial to the arguments of this paper. However, the more error the instrument introduces when measuring, the more difficult it is to accurately describe the population.

Estimations and the Monte Carlo Method

Monte Carlo (MC) methods provide approximate solutions to a variety of mathematical problems by performing statistical sampling experiments. This process involves performing many simulations using random numbers and probability to get an approximation of the answer to a problem. The name comes from the Monte Carlo district of Monaco, which is known for its roulette tables. Because roulette tables are great random number generators, the MC method borrowed its name from this location [3, 4]. The purpose of MC methods in this article is to help us understand how sample size influences the accuracy of estimating population means. To perform an MC study, computers are used to generate a population of randomly selected numbers that represent some variable [5]. From this population, a sample of any given number of subjects can be obtained by random selection. After the mean of the sample is calculated, the sample is returned to the population, a process called sampling with replacement. This process is repeated as often as necessary, sometimes as few as 2,000 times, but there are no limits. Of interest for us will be the resulting distribution of means as opposed to individual scores. With small samples, means tend to be more variable compared with distributions from large samples. Consequently, with small samples, it is more likely that we will sample from the extremes of the population.
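To make the sampling-with-replacement procedure concrete before the numbered steps below, here is a minimal sketch in Python (an illustration only; it assumes numpy is available, whereas the article itself used MC2G and PASW, described later). It draws repeated samples from a simulated population and records the mean of each sample, for three sample sizes.

```python
# Monte Carlo sketch: distribution of sample means at three sample sizes.
import numpy as np

rng = np.random.default_rng(seed=1)
population = rng.normal(loc=0.0, scale=1.0, size=10_000)   # mean 0, SD 1

n_samplings = 10_000
for sample_size in (4, 32, 1_000):
    means = np.array([
        rng.choice(population, size=sample_size, replace=True).mean()
        for _ in range(n_samplings)
    ])
    # Larger samples give a narrower distribution of means (less sampling error).
    spread = means.max() - means.min()
    print(f"n = {sample_size:>4}: SD of sample means = {means.std():.3f}, range = {spread:.3f}")
```

Plotting `means` as a histogram for each sample size reproduces the pattern described in Figure 1: the larger the sample, the narrower the distribution of means.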
In short, here is how an MC simulation for the distribution of a statistic is conducted.

1. First, a population of a given number of scores with a fixed mean and standard deviation is generated.

2. If a distribution of means is to be constructed, a specific sample size is chosen.
3. The computer then randomly samples the constructed population and a mean is calculated.
4. The scores or subjects used for the first sampling are returned to the population.
5. The sampling is repeated as often as specified by the investigator.
6. The means are plotted to show the distribution using a histogram.
7. The investigator can then observe this distribution or conduct statistical analyses by comparing distributions (distribution comparisons are beyond the scope of this article).

Below are distributions of means generated by MC2G (Monte Carlo Analysis for 1 or 2 Groups, ver. 5.07, Brooks, 2005, Aspen, Ohio, index.htm) and graphed using PASW version 18 (SPSS Inc, IBM, Chicago, IL). PASW was formerly known as SPSS and has reverted to that name in later versions. To begin the MC analysis, MC2G generated a population of 10,000 individual scores with a mean of zero and a standard deviation of one (μ = 0, σ = 1). The number of scores was chosen arbitrarily; it could easily have been one million, but for the purposes here, 10,000 scores is a sufficiently large population. From this population, a distribution of 10,000 samples was randomly obtained and graphed. Three distributions were produced from samples of 4, 32, and 1,000 scores (Figure 1A, B, C). It can be seen from these MC simulations that the range of the distribution of means decreases as sample size increases. With a sample size of 1,000 scores, the range in means is negligible. This demonstrates that population estimates are more accurate with larger sample sizes. This has serious implications when conducting statistical analyses: sampling from two different populations using small sample sizes will result in greater error, because the distribution of mean differences will be variable.

Figure 1. Three distributions of means constructed using Monte Carlo simulations. (A, B, C) Constructed using sample sizes of 4, 32, and 1,000 subjects, respectively. The larger the sample size, the less range in the distribution of means.

To emphasize the point that sample size matters with respect to estimation, MC simulations were conducted in which two means were randomly sampled from the same population. This is called sampling under the null hypothesis, which is to say that there are no differences between the means in the population. It would be expected, then, that the means are equal for every sampling. However, because of random error the sample means will not be the same [6]. Figure 2 shows MC simulations for two means from the same population; they clearly show that as sample size increases, the distribution of mean differences has less variability. This is similar to saying that the range of mean differences is reduced with increased sample size. This phenomenon is taken seriously when discussing the statistical tests used for inference.

Figure 2. Three distributions of mean differences constructed using Monte Carlo simulations. (A, B, C) Constructed using sample sizes of 4, 32, and 1,000 subjects per group, respectively. The larger the sample sizes, the less range in the distribution of mean differences.

Inferential Evaluation of Data

Inferential statistics are the basis for decision-making after experimental or associational studies. As noted, this article limits itself to a few experimental designs. What follows are explanations of how to use these test tools and how to interpret them. The t test and the F test will be the subjects of the next few sections. To best understand how to use these tools, short explanations of key concepts will be presented first.
Independent and Dependent Variables

Again, Barker [1] should be consulted, but a short description is presented here. The distinction between independent variables (IVs) and dependent variables (DVs) is simple but crucial for experimental designs and the accompanying statistical analysis. This is important whether you are conducting the analysis yourself or consulting with a statistician [7]. IVs are variables that have categorization. Most of the time an IV will denote treatment groups or some distinguishing characteristic. However, a proper IV is a manipulated variable [4]: these are the variables for which subjects are randomly allocated to groups.

For example, if an investigator wishes to determine whether or not a drug will lower cholesterol levels, the investigator will randomly allocate subjects to groups. Obviously, the investigator is manipulating that variable. Random allocation is important here because it ensures that all confounding variables are equally distributed across the groups. This ensures that any observed differences between the groups will be due to the manipulation and not to a disproportionate representation of some confounding characteristic.

Dependent variables (DVs) are the measures influenced by the manipulation. For example, the cholesterol drug will influence the level of cholesterol in the blood. Cholesterol, then, is the DV, and its concentration depends on the group from which it is measured. Thinking in terms of the DV depending on the IV's manipulation is one way to remember the difference between the DV and the IV.

Rare Events

Rare events are events that occur less frequently than others. In statistics, a rare event is a statistic that has less than a predetermined probability of occurring. Traditionally, we view a statistic that has a probability of less than 5% (P < .05) of being observed as rare. The critical values shown for the t test (described later) are the cutoffs at specific probabilities of observing a t statistic (Table 1). Any observed value within and including the critical values is considered common and therefore not rare. Please keep this in mind as you read the following sections.

Table 1. Critical values for the t test at various degrees of freedom, for one-tailed and two-tailed alphas, from small df up to infinity. As the degrees of freedom (df) increase, the critical values decrease. Also, as the acceptable Type 1 error increases, the critical value decreases. Adapted from StatSoft.

Hypotheses

Hypotheses are the questions to be answered by experiments or studies. They are made in reference to populations; after all, the population is what we are trying to estimate. The null hypothesis is the one that states there is no effect in the population, whereas the alternate hypothesis makes a statement about an effect. Directional hypotheses assume that the investigator is interested in making a statement that the effect will go in some direction. This is the more powerful hypothesis. A non-directional hypothesis is stated when the investigator suspects that there will be an effect but has no knowledge with which to propose a direction. This type of hypothesis is less powerful but is often the best to make because of its conservative nature. However, some would debate that statement as advocating inaccuracy in place of proper testing. It is the view of this author that non-directional hypotheses reduce the chances of reporting nonreproducible results.

Ho = null hypothesis. Ha = alternate hypothesis.

Type 1 and Type 2 Errors

All hypotheses can be erroneously rejected. A Type 1 error is the probability of rejecting the null hypothesis when the null hypothesis is correct. It is possible to conclude that an effect is observed when in fact only a random phenomenon has been detected. In such a case the conclusion is wrong and an unacceptable Type 1 error has been committed. The acceptable Type 1 error is less than .05. In other words, it is acceptable to be wrong as long as the probability of being wrong is less than .05. This value of .05 is arbitrary; it could easily have been .10, but the statistical world seems to have settled on .05. In any experiment, the probability of committing a Type 1 error ranges between 0 and 1.0 for any separate analysis.
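To see what this 5% criterion means in practice, the following sketch (an illustration only; it assumes numpy and scipy are installed) repeatedly samples two groups from the same population, exactly the null-hypothesis situation simulated earlier, and counts how often an independent-samples t test is declared significant. Because the null hypothesis is true here, every rejection is a Type 1 error, and about 5% of the simulated experiments produce one.

```python
# Simulated Type 1 error rate: both groups come from the same population,
# so any "significant" result is a false rejection of the null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n_experiments, n_per_group = 10_000, 10

false_rejections = 0
for _ in range(n_experiments):
    group_a = rng.normal(0.0, 1.0, n_per_group)
    group_b = rng.normal(0.0, 1.0, n_per_group)   # same population as group_a
    _, p = stats.ttest_ind(group_a, group_b)
    if p < 0.05:
        false_rejections += 1                     # a Type 1 error

print(f"Observed Type 1 error rate: {false_rejections / n_experiments:.3f}")   # close to .05
```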
A Type 2 error is the probability of failing to reject the null hypothesis when in fact there is an effect. The null hypothesis should be rejected, but for whatever reason the effect was not detected. Neither error is desirable or acceptable, nor would this author suggest that one is more acceptable than the other.

Degrees of Freedom

We will be somewhat non-mathematical in the explanation of degrees of freedom (df). In most statistical textbooks for the behavioral sciences the explanations are relatively unsatisfactory, and it will remain so here; in mathematical texts the explanations are complex, with sophisticated algebraic proofs. The explanation used here is from Hays [2]. Samples are used as estimates of populations. An inherent bias is that samples underestimate the variance in the population. However, one way of reducing this bias, which is a source of error, is to calculate the variance using df rather than sample size. According to Hays [2], the sum of the deviations of scores from the mean of any population must be zero. This fact has consequences. Suppose that you are told that N = 4 in

a sample and that you are to guess the four deviations from the mean. For the first deviation you can guess any number, and suppose you say d1 = 6, for the second d2 = -9, and for the third d3 = -7. However, when you come to the fourth deviation value, you are no longer free to guess any number you please. The value of d4 must equal d4 = 0 - d1 - d2 - d3 = 10. In short, given the values of any N-1 deviations from the mean, which could be any set of N-1 numbers, the value of the last deviation is completely determined. Therefore, we say that there are N-1 df for a sample variance, which is the average squared deviation.

Power

Power is the probability of rejecting the null hypothesis at a predetermined alpha if we were to repeat the experiment with all conditions remaining constant. It could be said that this is the ability of a test to detect effects in the population. This is true for tests of means or associations. The power of an experiment depends on:

1) The effect size: Effect size is the degree to which variation in the data is not due to random error. In designs investigating mean differences, the degree to which means differ relative to error variance is what matters. This is a little different from the discussion that will come later, but suffice it to say that this definition works perfectly when describing tests of means, as shown by Cohen [8].
2) Sample size: As discussed previously, sample size is associated with error in estimation. The larger the sample size, the narrower the distribution of means.
3) Sample variance: Sample variance is quite important. The greater the sample variance, the greater the range in the distribution of means, which translates into greater error in estimating the population mean. Therefore, the likelihood of rejecting the null hypothesis decreases, which results in less power.
4) Directional or non-directional hypothesis: A hypothesis can be directional depending on the investigator's knowledge of the effect. If the direction is known, the test will be more powerful. The investigator must decide on a one-tailed (directional) or two-tailed (non-directional) test. A one-tailed test has smaller critical values (cutoff scores), resulting in a greater probability of rejecting the null hypothesis (more power). For example, if you had six subjects per group for two groups with different means (0 vs. 1), a two-tailed test using the t test would require an observed t of greater than 2.23, whereas a one-tailed test would require a smaller t (Figure 3).

Figure 3. (A) The critical value for a directional hypothesis. The one-tailed test would require that the observed t statistic be greater than the value shown. (B) The critical value for a non-directional hypothesis. The two-tailed test would require that the observed t statistic be greater than the value shown. It should be noted that the rare ts are at one tail of the distribution for the one-tailed test but equally distributed at both tails for the two-tailed test.
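The influence of sample size on power can also be estimated by simulation. The sketch below (an illustration only; it assumes numpy and scipy, and the population means of 0 and 1 echo the hypothetical example in point 4 above) treats power as the proportion of simulated experiments in which the null hypothesis is rejected.

```python
# Estimating power by simulation: groups are drawn from populations whose
# means truly differ (0 vs. 1, SD = 1); power is the proportion of simulated
# experiments in which a two-tailed t test rejects the null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
n_experiments = 5_000

for n_per_group in (6, 20, 60):
    rejections = 0
    for _ in range(n_experiments):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(1.0, 1.0, n_per_group)
        _, p = stats.ttest_ind(control, treated)
        if p < 0.05:
            rejections += 1
    print(f"n = {n_per_group:>2} per group: estimated power = {rejections / n_experiments:.2f}")
```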

Experimental Design and Associated Statistics

One of the simplest types of experiments comprises a treatment group plus a control group. In such an experiment an investigator would be interested in determining whether or not one group mean significantly differs from the other with respect to some measure. Say that the investigator is interested in showing the effectiveness of some cancer drug at shrinking tumors in the lung. The investigator could measure tumor size before and after treatment and calculate the mean differences for both groups. The means of these differences would then be compared. However, as already discussed, group differences calculated from samples vary because of random sampling error. Therefore, determining whether or not two groups differ in the population requires that we know the range of some statistic. Let's begin by exploring the t test's t statistic, a relevant inferential statistic for the present example.

The t-test

All inferential test statistics incorporate error within their formulas. Most do so by calculating a ratio of signal to noise. The signal is the effect of any treatment, and the noise is the variability in the data that cannot be explained by the investigator's manipulations. For the example given previously, the signal would be calculated using the mean difference between groups, whereas the noise is calculated using individual differences between scores within the groups. Such a ratio for our example would be the t ratio. Figure 4 shows the equation for the t statistic. Notice that it is a ratio, as described previously, with the difference between group means divided by the standard error of mean differences (the standard deviation of differences between means). The variation of mean differences is estimated from differences between individuals, which is our measurement of random fluctuation.

Figure 4. A standard formula for calculating a two-sample independent t test. Of interest here is that the difference between means is equated to signal and the standard error of mean differences is equated to noise, which gives a useful perspective on statistical analysis:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} = \frac{\text{difference between means}}{\text{standard error of mean differences}} = \frac{\text{signal}}{\text{noise}}

where s^2 = group variance, \bar{x} = group mean, and n = group sample size.

Once the calculations are complete, the resulting ratio has to be evaluated. This will dictate whether or not the null hypothesis is rejected. To make those statements of significance, it is necessary to compare the obtained t to a frequency distribution of t statistics under the null hypothesis, called the theoretical distribution (Figure 5). The null hypothesis states that there is no difference between groups in the population. This distribution shows that most ts cluster near the mean, which is zero, whereas larger ts are fewer in number and farther from the mean. It is important to consider sample size as well. The larger the sample size, the smaller the range of ts, which in turn influences the probability of obtaining, by chance, a t as large as the one calculated from our experiment. For an experiment conducted with four subjects per group, there is a probability of less than .05 (5%), under the null hypothesis, of obtaining a t beyond 2.44 or below -2.44, whereas the corresponding cutoff would be 2.02 if the experiment were conducted with 20 subjects per group (Table 1). Therefore, an experiment with 20 subjects per group would be more powerful, which is to say we would be more likely to reject the null hypothesis. The distributions look somewhat different at different degrees of freedom. A t statistic obtained from an experiment with four subjects per group should not be compared to a distribution of ts constructed with sample sizes of 20; you would be falsely tempted to conclude that your experiment worked. This is why we compare an obtained statistic to a distribution of ts obtained using the same df as our experiment.

Figure 5. A theoretical distribution of ts. This distribution was constructed using a large sample size; consequently, it resembles the Z distribution. Represented by beta (β) are the t values considered common under the null hypothesis, bounded by and including the two critical values. All other values are considered rare and are observed at the extremes (tails) of the distribution; those values, beyond the critical values, are represented by alpha (α).
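As a concrete illustration of the signal-to-noise idea, the sketch below (made-up scores; it assumes numpy and scipy are installed, which the article itself does not use) computes the t ratio directly from the Figure 4 formula, checks it against scipy, and compares it with the two-tailed critical value.

```python
# Two-sample t by hand (Figure 4 formula) and with scipy, plus the critical value.
import numpy as np
from scipy import stats

group1 = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 4.4])    # hypothetical scores
group2 = np.array([5.9, 6.3, 5.1, 6.8, 5.7, 6.0])

signal = group1.mean() - group2.mean()                # difference between means
noise = np.sqrt(group1.var(ddof=1) / len(group1) +    # standard error of the
                group2.var(ddof=1) / len(group2))     # difference between means
t_by_hand = signal / noise

t_scipy, p_two_tailed = stats.ttest_ind(group1, group2)   # equal group sizes, so this matches
df = len(group1) + len(group2) - 2
t_critical = stats.t.ppf(1 - 0.05 / 2, df)            # two-tailed cutoff at alpha = .05

print(f"t by hand = {t_by_hand:.2f}, t from scipy = {t_scipy:.2f}, p = {p_two_tailed:.4f}")
print(f"Reject the null hypothesis if |t| exceeds {t_critical:.2f} (df = {df})")
```

With six subjects per group (df = 10), the cutoff printed here is the 2.23 quoted in the Power section above.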
Example with Analysis

A radiation scientist is interested in using a rat model of cancer to determine whether various radiation dosages can reduce the burn area in patients being treated for cancer without compromising the effectiveness of treatment. The investigator used two dosages in the experiment. Through imaging, one epithelial tumor was targeted and then exposed to radiation. Each rat was then randomly allocated to a dosage group (n = 10/group): 1) 60 Gy or 2) 80 Gy. In this experiment the dependent measures are: 1) the reduction in

percentage of the tumor from the beginning to the final radiation exposure and 2) the area of burn on the skin in mm². All statistical analyses for this example were conducted using PASW version 18 (Chicago, IL). Fictitious data were generated for this example using the literature for ranges. As seen below, there are two sets of hypotheses because there are two dependent variables.

Null and Alternate Hypotheses

H0 = null hypothesis
Ha = alternate hypothesis

First set of hypotheses:
H0: the larger dosage will not be more effective in reducing tumor size.
Ha: the larger dosage will be more effective in reducing tumor size.

Second set of hypotheses:
H0: the larger dosage will not result in a larger skin burn area.
Ha: the larger dosage will result in a larger skin burn area.

Both sets of hypotheses are directional. It is reasonable to suppose that higher dosages of radiation would kill more tumor cells and result in more burn. Because PASW conducts two-tailed tests only, the observed Type 1 error must be divided by two to be interpreted as a one-tailed test.

As Table 2 shows, PASW prints out descriptive as well as inferential statistics, which is quite useful when summarizing the data and making inferences. PASW displays a number of statistics, but only the relevant outputs were selected here. For the descriptive statistics, the sample size, mean, standard deviation, and standard error of the mean are shown. There are no associated P values, nor is there any possibility of making statements of significance from that information alone. With respect to the table titled Independent Samples Test, all the information needed to make statements of significance is available: first the t value, then the df, followed by significance (Sig.; two-tailed), or the probability of making a Type 1 error. The analysis showed that there was no significant difference between dosage groups with respect to percent change in tumor size (t(18), P = NS), but there was a significantly greater area of skin burn in the 80 Gy group, t(18) = -8.50, P < .001. Those statements can be made because the Sig. value is less than .05 for skin burn area but greater than .05 for percent change in tumor size. Therefore, according to these fictitious data, both radiation dosages have equivalent treatment efficacy but different burn effects.

When reporting results, the t value, df, and Type 1 error must be reported. That being said, some journal editors instruct authors to report the P value only, as a space-saving measure or to avoid visual clutter. This author recommends reporting all that is possible. Finally, the probability of a Type 1 error should be reported as a cutoff (e.g., < .05, < .01, or < .001) rather than an exact value. If this study were repeated, the P value would not equal that of the first study; therefore, a cutoff is more reasonable.

Table 2. Group Descriptive Statistics and Inferential Statistics Output from PASW (SPSS)
A. Group Statistics: for each dependent measure (percent change in tumor size; area of skin burned by treatment) and each dosage group (60 Gy, 80 Gy), the output lists n, mean, standard deviation, and standard error of the mean.
B. Independent Samples Test (t test for equality of means): for each dependent measure, with equal variances assumed and not assumed, the output lists t, df, and significance (2-tailed).
Note. For the first analysis, dose did not influence tumor size (P > .05, or P = NS, not significant), as indicated by the significance value (P = .072). Dose did influence burn area, as shown by a significance value of .000 (P < .001).
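For readers reproducing this kind of analysis outside PASW, the sketch below (hypothetical numbers, not the article's data; it assumes numpy and scipy) shows the same logic: run the independent-samples t test and either halve the two-tailed P for a directional hypothesis, as described above, or ask for a one-sided test directly.

```python
# Directional (one-tailed) interpretation of an independent-samples t test.
import numpy as np
from scipy import stats

burn_60gy = np.array([12.1, 10.8, 13.0, 11.5, 12.4, 11.9, 12.7, 10.9, 13.2, 11.6])  # hypothetical mm^2
burn_80gy = np.array([15.9, 16.4, 14.8, 17.1, 15.3, 16.0, 16.8, 15.5, 14.9, 16.2])

t, p_two_tailed = stats.ttest_ind(burn_60gy, burn_80gy)
p_one_tailed = p_two_tailed / 2          # valid only when t falls in the predicted direction

# SciPy (1.6+) can also report the one-sided P directly:
t_dir, p_dir = stats.ttest_ind(burn_60gy, burn_80gy, alternative="less")

print(f"t(18) = {t:.2f}, two-tailed P = {p_two_tailed:.4f}, one-tailed P = {p_one_tailed:.4f}")
```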

The F Test (Analysis of Variance) and Multiple Group Analysis

Multiple group analysis is always a problem. In order to fully analyze these designs, multiple comparisons between groups must be performed. However, with each comparison a Type 1 error can be committed. If an experiment has three conditions, three pairs of comparisons must be conducted in order to fully analyze the experiment, resulting in a Type 1 error for each comparison. Thus, the cumulative Type 1 error (.05 × 3), called the familywise error, is .15. This means that the probability of observing at least one significant difference between group means purely by chance is .15, which is an unacceptable error. To get around the familywise error, the F test must be performed. The F test is an omnibus test that provides us with a statistic capable of determining whether or not there are any group differences with one test as opposed to many. Of interest is that this Type 1 error would be .05, and not .15 as indicated previously.

Figure 6 shows equations for the sample variance and for the analysis of variance (ANOVA). There are a number of ways to understand the F test, but the focus here will be predominantly conceptual. Sample variance focuses on random changes between individuals; this variation is not due to any manipulated variables. Interestingly, the same concept can be used to evaluate differences between groups. As means between groups differ, variance between groups increases, which is an excellent way to detect an effect. You will notice that the ratio for the F statistic is variation between groups divided by variation between individuals, providing us with a signal-to-noise ratio. If an F statistic is considered significant at the .05 level, the null hypothesis of equal means between groups is rejected. Thus, no matter the number of treatment groups in the experiment, we have one statistic and a probability of .05 of making a Type 1 error.

Figure 6. The formulas for the sample variance and for the F test, also known as the analysis of variance:

s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} (sample variance, or variance between individuals)

F = \frac{\sum_j n_j (\bar{x}_j - \bar{x}_{..})^2 / (J - 1)}{\sum_{ij} (x_{ij} - \bar{x}_j)^2 / (N - J)} = \frac{MS_{BG}}{MS_{Error}} = \frac{\text{signal}}{\text{noise}}

(J = number of groups, N = total number of scores, n_j = number of scores in group j, \bar{x}_{..} = grand mean). It is evident that the F-test formula is in fact variance over variance: the variance from one source (effect variance) over that of random error. Consequently, a large F that is interpreted as rare indicates that the effect is significantly larger than error, allowing the investigator to reject the null hypothesis.

However, the F test or ANOVA is an omnibus test. In statistics, omnibus is interpreted as determining whether or not an effect is present at all. This test does not provide information about specific group differences. To make that distinction, post hoc tests must be used. These tests take multiple comparisons into consideration and either adjust the observed probabilities for each analysis or use statistics that are more conservative than separate t tests. Two post hoc tests will be discussed here. The Bonferroni method is one of the oldest methods of controlling the familywise error. Because the Type 1 error is compounded with every comparison made, the per-comparison alpha level is adjusted by dividing the familywise alpha by the number of comparisons made. The investigator then compares the P value associated with each obtained t test to the alpha calculated by the Bonferroni procedure. For example, if an experiment has four groups, there are six possible comparisons; the alpha of .05 would be divided by 6, resulting in a new per-comparison alpha of approximately .0083. This ensures that the familywise error is always .05.
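A sketch of this workflow (hypothetical data; it assumes numpy and scipy) runs the omnibus F test first and only then makes the pairwise comparisons, each judged against a Bonferroni-adjusted alpha.

```python
# Omnibus one-way ANOVA followed by Bonferroni-corrected pairwise t tests.
import itertools
import numpy as np
from scipy import stats

groups = {
    "control": np.array([3.1, 2.8, 3.6, 3.0, 2.9, 3.3]),
    "dose 1":  np.array([3.9, 4.2, 3.7, 4.5, 4.0, 4.1]),
    "dose 2":  np.array([5.0, 4.8, 5.4, 5.1, 4.7, 5.2]),
}

f_stat, p_omnibus = stats.f_oneway(*groups.values())
print(f"Omnibus F = {f_stat:.2f}, P = {p_omnibus:.4f}")

if p_omnibus < 0.05:                                   # an effect is present somewhere
    pairs = list(itertools.combinations(groups, 2))
    alpha_per_comparison = 0.05 / len(pairs)           # Bonferroni adjustment (.05 / 3)
    for name_a, name_b in pairs:
        t, p = stats.ttest_ind(groups[name_a], groups[name_b])
        verdict = "significant" if p < alpha_per_comparison else "not significant"
        print(f"{name_a} vs {name_b}: t = {t:.2f}, P = {p:.4f} -> {verdict}")
```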
The Tukey test, also known as the honestly significant difference (HSD) post hoc test, is somewhat different. Its equation is similar to the t test equation but provides us with a studentized q statistic (Figure 7). The major difference is that the error term is taken from the F test conducted before the post hoc tests. This statistic is then compared with the theoretical distribution of all possible q statistics under the null hypothesis. If the statistic calculated for any comparison is larger than the critical value, the groups are considered significantly different from one another.

Figure 7. The studentized q formula for the honestly significant difference (HSD) Tukey post hoc test:

q = \frac{\bar{x}_{largest} - \bar{x}_{smallest}}{\sqrt{MS_{Error} / n}}

The form is very similar to the t test formula. Like the t test, the signal is the difference between groups, although there is an emphasis on subtracting the smallest mean from the largest, which is unnecessary for the t test. The error, or mean squares (MS) error, is taken directly from the F test conducted before the HSD Tukey and divided by the group size; the square root of the result is then taken. We then compare the q to a table of critical q values, just as we would the results from a t test (available in any statistics textbook, such as Hays 1994).

Effect Size: Eta Squared (η²)

Effect size was previously discussed in the context of the t test. However, eta squared is perhaps the best and most useful way to understand effect size. Eta squared provides us with the proportion of variance that is due to the independent variable (ie, the treatment in our experiment). The greater the variability between the groups relative to the total variance, the larger the effect is said to be. This is an incredibly useful statistic that is insensitive to sample size. Thus, regardless of the sample size, it is possible to determine the size of the influence an independent variable has on the dependent variable.
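A sketch of the calculation (hypothetical data; numpy assumed) shows eta squared as the between-groups sum of squares divided by the total sum of squares, the ratio given in Figure 8 below.

```python
# Eta squared: proportion of total variation attributable to group membership.
import numpy as np

groups = [
    np.array([3.1, 2.8, 3.6, 3.0, 2.9, 3.3]),   # hypothetical scores, three groups
    np.array([3.9, 4.2, 3.7, 4.5, 4.0, 4.1]),
    np.array([5.0, 4.8, 5.4, 5.1, 4.7, 5.2]),
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

ss_total = ((all_scores - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

eta_squared = ss_between / ss_total
print(f"eta squared = {eta_squared:.2f}")        # e.g., 0.80 means 80% of the variance is explained
```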

Eta squared is not tested; its importance is not determined using hypotheses. It is simply a matter of reporting the value.

Figure 8. The formula for eta squared, a ratio of the variation between groups to the total variation in the data, and an indication of effect size (SS = sum of squares, BG = between groups):

\eta^2 = \frac{SS_{BG}}{SS_{Total}}

Interactions

Interactions have the misfortune of being underused because of interpretation difficulties. Many investigators will attempt to interpret their data using a number of single-factor (one IV) analyses of variance instead of conducting more complex analyses such as factorial analysis of variance (more than one IV). It is true that the interaction is the most complicated effect in an analysis of variance, but it will become evident that the more complicated analysis often simplifies the interpretation of the data. Interactions are possible when an experiment has multiple IVs. These designs cross all treatment groups, so that each level of one IV is combined separately with all levels of the other IVs. The definition of an interaction is that the effect of one IV is not the same at all levels of the other IVs. For example, if an investigator is interested in exploring the effect of radiation and chemotherapy dosage on the number of years a patient lives after being diagnosed with breast cancer, it would be reasonable to assume that the radiation dosage would not have the same ability to increase survival at all levels of chemotherapy dosage. This is the inconsistency mentioned earlier. This experiment, with two levels per IV, would have all levels crossed (Table 3). The interaction is best visualized using line graphs (Figure 9): the lines do not run parallel, as is expected when an interaction is observed.

Table 3. The design for chemotherapy dose by radiation dose. Radiation (dose 1, dose 2) is crossed with chemotherapy (dose 1, dose 2), giving four cells. All levels are crossed; every dose of chemotherapy is combined with every dose of radiation.

Hypotheses

There would be a minimum of three sets of hypotheses: two main effects and one interaction hypothesis.

Main effects:
Ho: There are no differences between the means for radiation.
Ha: There are differences between the means for radiation.
Ho: There are no differences between the means for chemotherapy.
Ha: There are differences between the means for chemotherapy.

Interaction:
Ho: There is no interaction between radiation and chemotherapy on survival means.
Ha: There is an interaction between radiation and chemotherapy on survival means.

You will notice that the hypotheses are somewhat general. This is because of the omnibus nature of the F test: the F test is not designed to determine specific group differences. That is done using post hoc techniques afterwards.

Figure 9. The interaction between radiation dose and chemotherapy dose on years lived. The effect of radiation dose is not the same at each chemotherapy dose: at chemotherapy dose 1, the difference in years lived is not as great as the difference at dose 2, indicating that chemotherapy dose and radiation dose work together to increase lifespan. Consequently, it is concluded that radiation and chemotherapy dose interact, resulting in an inconsistency in the effect of one independent variable across levels of another. The graph shown is from PASW (SPSS). It is used, as opposed to graphs from other programs, because readers will most likely begin with PASW graphs to explore their data. Note that the title is "estimated marginal means"; this is to say that PASW estimates the means as they might be in the population. There is no need for alarm: the means are nearly identical to those calculated from the data, so the graph accurately represents the data.
Also, this graph does not have error bars; PASW does not make it easy to add them. For a quick perusal this is not important; however, for publication an investigator must include error bars and proper titles for the axes.

Why Not Use Multiple Single-Factor Analyses of Variance?

For any set of data, it is advantageous to explain as much of the variance as possible. Designs with several IVs explain more of the variance in a data set because there are more sources of variance. Consequently, there is less unexplained variation (also called error), resulting in a smaller error term. Because the F statistic is then more likely to be large, rejecting the null hypothesis is more probable. Table 4 shows three sources of variance, the main effect for radiation dose, the main effect for chemotherapy dose, and the interaction between the two, plus error. It should be evident now that the error term for the main effects is smaller, and the signal-to-noise ratio therefore larger, than it would be if separate single-factor analyses of variance were conducted.
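A sketch of such a factorial analysis outside PASW (hypothetical data; it assumes pandas and statsmodels are installed) partitions the variance into the same sources that appear in Table 4: two main effects, their interaction, and error.

```python
# Two-way (factorial) ANOVA: main effects, interaction, and error.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "radiation": ["dose1"] * 8 + ["dose2"] * 8,
    "chemo": (["dose1"] * 4 + ["dose2"] * 4) * 2,
    "years_lived": [4.2, 3.9, 4.5, 4.1, 5.0, 5.3, 4.8, 5.1,   # hypothetical survival times
                    4.6, 4.4, 4.9, 4.7, 7.2, 7.8, 7.0, 7.5],
})

# 'C(radiation) * C(chemo)' expands to both main effects plus their interaction.
model = smf.ols("years_lived ~ C(radiation) * C(chemo)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)   # sum of squares, df, F, and P per source
print(anova_table)
```

Each row of the printed table corresponds to a source of variance; dividing a source's sum of squares by the total sum of squares gives its eta squared.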

Table 4. Analysis of Variance Summary Table for the Years Surviving Cancer Study
A. Tests of between-subjects effects (dependent variable: Yrslivd). For each source (corrected model, intercept, radiation, chemotherapy, radiation × chemotherapy, error, total, corrected total), the output lists the Type III sum of squares, df, mean square, F, and significance. Adjusted R² = 0.633.
B. Tests of between-subjects effects (dependent variable: Yrslivd). For each source, the output lists partial eta squared, the noncentrality parameter, and observed power, computed using alpha = .05.
Note. There is a significant main effect for radiation dosage and for chemotherapy dosage; a significant radiation by chemotherapy interaction is also observed.

Simple Main Effects

On occasion, after observing a significant interaction, an investigator might be interested in determining whether or not there are differences between the means of one IV at each level of another IV. This is said by some to "investigate the interaction." However, it has little to do with the interaction, as it is not necessary to observe significant differences in means to observe an interaction. Nevertheless, simple main effect analysis consists of determining differences between levels of one IV at each level of another IV. You can conduct this analysis using t tests, but the results might not be accurate. For a proper analysis, the error from the original factorial analysis must be used in separate single-factor ANOVAs of one IV at each level of another IV. This is because there is no need to recalculate error: the error has already been determined and can therefore be used in subsequent analyses. That error term may be much smaller, resulting in a larger signal-to-noise ratio. Most investigators will use an F test but replace the error term with the one from the overall factorial analysis of variance.

Limitations of This Package

1. First, only a few tests were discussed: the t test and the F test for completely between-subjects designs. Repeated measures designs are beyond the scope of this package; twice the material would be needed to do them justice. I chose instead to concentrate on key concepts not previously covered in this journal.
2. Second, the reader might recognize that the material here is somewhat unorthodox. Explaining the concepts using MC methods is not common. However, teaching statistics for nearly 15 years to students unfamiliar with the subject has shown me that this is an excellent method. Students can complete an entire course and have no idea what the Type 1 error represents; this method ensures that the student understands the meaning behind a rare event and consequently the Type 1 error.
3. Third, we did not touch upon designs of association. Again, this would be somewhat extensive. They comprise the Pearson product-moment correlation, simple regression, multiple regression, and one of my favorites, factor analysis. Some of these were touched upon in Barker [1].
4. Fourth, we did not touch on any of the multivariate tests. For that type of discussion, one must have a strong grasp of associational analysis and repeated measures analysis. Again, this was beyond the scope of this package.

Strengths of This Package

1. This document clearly instructs the reader in how to evaluate tests of means. The t test and F test should now be clearly understood.
2. The reader is more likely to understand error and estimation problems. Consequently, the Type 1 error should no longer be a conceptual or interpretive challenge.
3.
Even though there was little in terms of graphing data, the reader may have a greater appreciation for data presentation.

4. Not previously covered in this journal are interaction effects. This may be the single most important tool discussed in this package. Interaction effects are often underused. By implementing factorial designs, the reader will be capable of evaluating main and interaction effects alike, which is a clear boon.

Dos and Don'ts When Interpreting Results

There are so many mistakes made in performing statistical analyses or interpreting them that it is not reasonable to go over them all. However, there are a few that pertain directly to this package and for which there is now an understanding.

1. The Type 1 error, indicated as Sig. in PASW, is not an indication of the strength of the effect. Often, an investigator will be tempted to say, based on the Type 1 value (P value), that their results are either very or not very significant. All that should be said is whether the results are significant or not. Therefore, do not report on the strength of your effect using the P value.
2. In addition to the P value, two more things should be reported if the journal editors allow: 1) effect size and 2) power. This allows an investigator to indicate the size of the effect and the probability of observing a significant result if the study were conducted again.
3. Graphs should always include error bars (standard error of the mean). This is the best way to present data.

Conclusion

This package was written in the spirit of properly instructing the reader in some fundamental concepts underlying statistical analyses. The Type 1 error, estimation error, effect size, power, the t test, F tests, post hoc analysis, simple main effects, and interaction effects were covered. I would recommend reading texts on repeated measures next; those tools are the logical continuation of this package. Following those readings, I would study associational designs and then the non-parametric tests, such as the Mann-Whitney U test and the Wilcoxon signed-rank test. I hope that this package is well received and that it has proved useful.

References

[1] Barker, R. F. (2007). Deciphering statistics in research: a beginner's guide to statistics for radiation science professionals. Can J Med Radiat Technol.
[2] Hays, W. L. (1994). Statistics (5th ed.). New York: Harcourt Brace.
[3] Metropolis, N., & Ulam, S. (1949). The Monte Carlo method. J Am Stat Assoc 44.
[4] Metropolis, N. (1987). The beginning of the Monte Carlo method. Los Alamos Science (Special Issue).
[5] Boneau, C. A. (1960). The effects of violations of assumptions underlying the t test. Psychol Bull 57.
[6] Howell, D. C. (1994). Statistical Methods for Psychology (5th ed.). Pacific Grove, CA: Duxbury.
[7] De Muth, J. E. (2008). Preparing for the first meeting with a statistician. Am J Health Syst Pharm 15.
[8] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.


More information

Hypothesis Testing. Richard S. Balkin, Ph.D., LPC-S, NCC

Hypothesis Testing. Richard S. Balkin, Ph.D., LPC-S, NCC Hypothesis Testing Richard S. Balkin, Ph.D., LPC-S, NCC Overview When we have questions about the effect of a treatment or intervention or wish to compare groups, we use hypothesis testing Parametric statistics

More information

Designing Psychology Experiments: Data Analysis and Presentation

Designing Psychology Experiments: Data Analysis and Presentation Data Analysis and Presentation Review of Chapter 4: Designing Experiments Develop Hypothesis (or Hypotheses) from Theory Independent Variable(s) and Dependent Variable(s) Operational Definitions of each

More information

The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016

The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016 The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016 This course does not cover how to perform statistical tests on SPSS or any other computer program. There are several courses

More information

04/12/2014. Research Methods in Psychology. Chapter 6: Independent Groups Designs. What is your ideas? Testing

04/12/2014. Research Methods in Psychology. Chapter 6: Independent Groups Designs. What is your ideas? Testing Research Methods in Psychology Chapter 6: Independent Groups Designs 1 Why Psychologists Conduct Experiments? What is your ideas? 2 Why Psychologists Conduct Experiments? Testing Hypotheses derived from

More information

Advanced ANOVA Procedures

Advanced ANOVA Procedures Advanced ANOVA Procedures Session Lecture Outline:. An example. An example. Two-way ANOVA. An example. Two-way Repeated Measures ANOVA. MANOVA. ANalysis of Co-Variance (): an ANOVA procedure whereby the

More information

Correlation and Regression

Correlation and Regression Dublin Institute of Technology ARROW@DIT Books/Book Chapters School of Management 2012-10 Correlation and Regression Donal O'Brien Dublin Institute of Technology, donal.obrien@dit.ie Pamela Sharkey Scott

More information

Two-Way Independent ANOVA

Two-Way Independent ANOVA Two-Way Independent ANOVA Analysis of Variance (ANOVA) a common and robust statistical test that you can use to compare the mean scores collected from different conditions or groups in an experiment. There

More information

UNEQUAL CELL SIZES DO MATTER

UNEQUAL CELL SIZES DO MATTER 1 of 7 1/12/2010 11:26 AM UNEQUAL CELL SIZES DO MATTER David C. Howell Most textbooks dealing with factorial analysis of variance will tell you that unequal cell sizes alter the analysis in some way. I

More information

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14 Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14 Still important ideas Contrast the measurement of observable actions (and/or characteristics)

More information

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests Objectives Quantifying the quality of hypothesis tests Type I and II errors Power of a test Cautions about significance tests Designing Experiments based on power Evaluating a testing procedure The testing

More information

POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY

POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY No. of Printed Pages : 12 MHS-014 POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY Time : 2 hours Maximum Marks : 70 PART A Attempt all questions.

More information

Comparing 3 Means- ANOVA

Comparing 3 Means- ANOVA Comparing 3 Means- ANOVA Evaluation Methods & Statistics- Lecture 7 Dr Benjamin Cowan Research Example- Theory of Planned Behaviour Ajzen & Fishbein (1981) One of the most prominent models of behaviour

More information

Basic Statistics and Data Analysis in Work psychology: Statistical Examples

Basic Statistics and Data Analysis in Work psychology: Statistical Examples Basic Statistics and Data Analysis in Work psychology: Statistical Examples WORK PSYCHOLOGY INTRODUCTION In this chapter we examine a topic which is given too little coverage in most texts of this kind,

More information

Problem #1 Neurological signs and symptoms of ciguatera poisoning as the start of treatment and 2.5 hours after treatment with mannitol.

Problem #1 Neurological signs and symptoms of ciguatera poisoning as the start of treatment and 2.5 hours after treatment with mannitol. Ho (null hypothesis) Ha (alternative hypothesis) Problem #1 Neurological signs and symptoms of ciguatera poisoning as the start of treatment and 2.5 hours after treatment with mannitol. Hypothesis: Ho:

More information

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to CHAPTER - 6 STATISTICAL ANALYSIS 6.1 Introduction This chapter discusses inferential statistics, which use sample data to make decisions or inferences about population. Populations are group of interest

More information

Appendix B Statistical Methods

Appendix B Statistical Methods Appendix B Statistical Methods Figure B. Graphing data. (a) The raw data are tallied into a frequency distribution. (b) The same data are portrayed in a bar graph called a histogram. (c) A frequency polygon

More information

Reliability, validity, and all that jazz

Reliability, validity, and all that jazz Reliability, validity, and all that jazz Dylan Wiliam King s College London Published in Education 3-13, 29 (3) pp. 17-21 (2001) Introduction No measuring instrument is perfect. If we use a thermometer

More information

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA PART 1: Introduction to Factorial ANOVA ingle factor or One - Way Analysis of Variance can be used to test the null hypothesis that k or more treatment or group

More information

9 research designs likely for PSYC 2100

9 research designs likely for PSYC 2100 9 research designs likely for PSYC 2100 1) 1 factor, 2 levels, 1 group (one group gets both treatment levels) related samples t-test (compare means of 2 levels only) 2) 1 factor, 2 levels, 2 groups (one

More information

t-test for r Copyright 2000 Tom Malloy. All rights reserved

t-test for r Copyright 2000 Tom Malloy. All rights reserved t-test for r Copyright 2000 Tom Malloy. All rights reserved This is the text of the in-class lecture which accompanied the Authorware visual graphics on this topic. You may print this text out and use

More information

Sheila Barron Statistics Outreach Center 2/8/2011

Sheila Barron Statistics Outreach Center 2/8/2011 Sheila Barron Statistics Outreach Center 2/8/2011 What is Power? When conducting a research study using a statistical hypothesis test, power is the probability of getting statistical significance when

More information

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug?

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug? MMI 409 Spring 2009 Final Examination Gordon Bleil Table of Contents Research Scenario and General Assumptions Questions for Dataset (Questions are hyperlinked to detailed answers) 1. Is there a difference

More information

Audio: In this lecture we are going to address psychology as a science. Slide #2

Audio: In this lecture we are going to address psychology as a science. Slide #2 Psychology 312: Lecture 2 Psychology as a Science Slide #1 Psychology As A Science In this lecture we are going to address psychology as a science. Slide #2 Outline Psychology is an empirical science.

More information

Chapter 9: Comparing two means

Chapter 9: Comparing two means Chapter 9: Comparing two means Smart Alex s Solutions Task 1 Is arachnophobia (fear of spiders) specific to real spiders or will pictures of spiders evoke similar levels of anxiety? Twelve arachnophobes

More information

Kepler tried to record the paths of planets in the sky, Harvey to measure the flow of blood in the circulatory system, and chemists tried to produce

Kepler tried to record the paths of planets in the sky, Harvey to measure the flow of blood in the circulatory system, and chemists tried to produce Stats 95 Kepler tried to record the paths of planets in the sky, Harvey to measure the flow of blood in the circulatory system, and chemists tried to produce pure gold knowing it was an element, though

More information

PSY 216: Elementary Statistics Exam 4

PSY 216: Elementary Statistics Exam 4 Name: PSY 16: Elementary Statistics Exam 4 This exam consists of multiple-choice questions and essay / problem questions. For each multiple-choice question, circle the one letter that corresponds to the

More information

Chapter 12. The One- Sample

Chapter 12. The One- Sample Chapter 12 The One- Sample z-test Objective We are going to learn to make decisions about a population parameter based on sample information. Lesson 12.1. Testing a Two- Tailed Hypothesis Example 1: Let's

More information

CHAPTER III METHODOLOGY

CHAPTER III METHODOLOGY 24 CHAPTER III METHODOLOGY This chapter presents the methodology of the study. There are three main sub-titles explained; research design, data collection, and data analysis. 3.1. Research Design The study

More information

11/24/2017. Do not imply a cause-and-effect relationship

11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

Intro to SPSS. Using SPSS through WebFAS

Intro to SPSS. Using SPSS through WebFAS Intro to SPSS Using SPSS through WebFAS http://www.yorku.ca/computing/students/labs/webfas/ Try it early (make sure it works from your computer) If you need help contact UIT Client Services Voice: 416-736-5800

More information

Overview of Lecture. Survey Methods & Design in Psychology. Correlational statistics vs tests of differences between groups

Overview of Lecture. Survey Methods & Design in Psychology. Correlational statistics vs tests of differences between groups Survey Methods & Design in Psychology Lecture 10 ANOVA (2007) Lecturer: James Neill Overview of Lecture Testing mean differences ANOVA models Interactions Follow-up tests Effect sizes Parametric Tests

More information

A Spreadsheet for Deriving a Confidence Interval, Mechanistic Inference and Clinical Inference from a P Value

A Spreadsheet for Deriving a Confidence Interval, Mechanistic Inference and Clinical Inference from a P Value SPORTSCIENCE Perspectives / Research Resources A Spreadsheet for Deriving a Confidence Interval, Mechanistic Inference and Clinical Inference from a P Value Will G Hopkins sportsci.org Sportscience 11,

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

Reliability, validity, and all that jazz

Reliability, validity, and all that jazz Reliability, validity, and all that jazz Dylan Wiliam King s College London Introduction No measuring instrument is perfect. The most obvious problems relate to reliability. If we use a thermometer to

More information

LAB ASSIGNMENT 4 INFERENCES FOR NUMERICAL DATA. Comparison of Cancer Survival*

LAB ASSIGNMENT 4 INFERENCES FOR NUMERICAL DATA. Comparison of Cancer Survival* LAB ASSIGNMENT 4 1 INFERENCES FOR NUMERICAL DATA In this lab assignment, you will analyze the data from a study to compare survival times of patients of both genders with different primary cancers. First,

More information

Bayesian Tailored Testing and the Influence

Bayesian Tailored Testing and the Influence Bayesian Tailored Testing and the Influence of Item Bank Characteristics Carl J. Jensema Gallaudet College Owen s (1969) Bayesian tailored testing method is introduced along with a brief review of its

More information

EXPERIMENTAL DESIGN Page 1 of 11. relationships between certain events in the environment and the occurrence of particular

EXPERIMENTAL DESIGN Page 1 of 11. relationships between certain events in the environment and the occurrence of particular EXPERIMENTAL DESIGN Page 1 of 11 I. Introduction to Experimentation 1. The experiment is the primary means by which we are able to establish cause-effect relationships between certain events in the environment

More information

Lecture 4: Research Approaches

Lecture 4: Research Approaches Lecture 4: Research Approaches Lecture Objectives Theories in research Research design approaches ú Experimental vs. non-experimental ú Cross-sectional and longitudinal ú Descriptive approaches How to

More information

1 The conceptual underpinnings of statistical power

1 The conceptual underpinnings of statistical power 1 The conceptual underpinnings of statistical power The importance of statistical power As currently practiced in the social and health sciences, inferential statistics rest solidly upon two pillars: statistical

More information

ANOVA. Thomas Elliott. January 29, 2013

ANOVA. Thomas Elliott. January 29, 2013 ANOVA Thomas Elliott January 29, 2013 ANOVA stands for analysis of variance and is one of the basic statistical tests we can use to find relationships between two or more variables. ANOVA compares the

More information

Online Introduction to Statistics

Online Introduction to Statistics APPENDIX Online Introduction to Statistics CHOOSING THE CORRECT ANALYSIS To analyze statistical data correctly, you must choose the correct statistical test. The test you should use when you have interval

More information

CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS

CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS Chapter Objectives: Understand Null Hypothesis Significance Testing (NHST) Understand statistical significance and

More information

Chapter 02 Developing and Evaluating Theories of Behavior

Chapter 02 Developing and Evaluating Theories of Behavior Chapter 02 Developing and Evaluating Theories of Behavior Multiple Choice Questions 1. A theory is a(n): A. plausible or scientifically acceptable, well-substantiated explanation of some aspect of the

More information

investigate. educate. inform.

investigate. educate. inform. investigate. educate. inform. Research Design What drives your research design? The battle between Qualitative and Quantitative is over Think before you leap What SHOULD drive your research design. Advanced

More information

Experimental Psychology

Experimental Psychology Title Experimental Psychology Type Individual Document Map Authors Aristea Theodoropoulos, Patricia Sikorski Subject Social Studies Course None Selected Grade(s) 11, 12 Location Roxbury High School Curriculum

More information

A SAS Macro to Investigate Statistical Power in Meta-analysis Jin Liu, Fan Pan University of South Carolina Columbia

A SAS Macro to Investigate Statistical Power in Meta-analysis Jin Liu, Fan Pan University of South Carolina Columbia Paper 109 A SAS Macro to Investigate Statistical Power in Meta-analysis Jin Liu, Fan Pan University of South Carolina Columbia ABSTRACT Meta-analysis is a quantitative review method, which synthesizes

More information

Psy201 Module 3 Study and Assignment Guide. Using Excel to Calculate Descriptive and Inferential Statistics

Psy201 Module 3 Study and Assignment Guide. Using Excel to Calculate Descriptive and Inferential Statistics Psy201 Module 3 Study and Assignment Guide Using Excel to Calculate Descriptive and Inferential Statistics What is Excel? Excel is a spreadsheet program that allows one to enter numerical values or data

More information

Statistics Guide. Prepared by: Amanda J. Rockinson- Szapkiw, Ed.D.

Statistics Guide. Prepared by: Amanda J. Rockinson- Szapkiw, Ed.D. This guide contains a summary of the statistical terms and procedures. This guide can be used as a reference for course work and the dissertation process. However, it is recommended that you refer to statistical

More information

Statistics as a Tool. A set of tools for collecting, organizing, presenting and analyzing numerical facts or observations.

Statistics as a Tool. A set of tools for collecting, organizing, presenting and analyzing numerical facts or observations. Statistics as a Tool A set of tools for collecting, organizing, presenting and analyzing numerical facts or observations. Descriptive Statistics Numerical facts or observations that are organized describe

More information

Statistical Methods and Reasoning for the Clinical Sciences

Statistical Methods and Reasoning for the Clinical Sciences Statistical Methods and Reasoning for the Clinical Sciences Evidence-Based Practice Eiki B. Satake, PhD Contents Preface Introduction to Evidence-Based Statistics: Philosophical Foundation and Preliminaries

More information

Chapter 19. Confidence Intervals for Proportions. Copyright 2010 Pearson Education, Inc.

Chapter 19. Confidence Intervals for Proportions. Copyright 2010 Pearson Education, Inc. Chapter 19 Confidence Intervals for Proportions Copyright 2010 Pearson Education, Inc. Standard Error Both of the sampling distributions we ve looked at are Normal. For proportions For means SD pˆ pq n

More information

Repeated Measures ANOVA and Mixed Model ANOVA. Comparing more than two measurements of the same or matched participants

Repeated Measures ANOVA and Mixed Model ANOVA. Comparing more than two measurements of the same or matched participants Repeated Measures ANOVA and Mixed Model ANOVA Comparing more than two measurements of the same or matched participants Data files Fatigue.sav MentalRotation.sav AttachAndSleep.sav Attitude.sav Homework:

More information

Between Groups & Within-Groups ANOVA

Between Groups & Within-Groups ANOVA Between Groups & Within-Groups ANOVA BG & WG ANOVA Partitioning Variation making F making effect sizes Things that influence F Confounding Inflated within-condition variability Integrating stats & methods

More information

AMSc Research Methods Research approach IV: Experimental [2]

AMSc Research Methods Research approach IV: Experimental [2] AMSc Research Methods Research approach IV: Experimental [2] Marie-Luce Bourguet mlb@dcs.qmul.ac.uk Statistical Analysis 1 Statistical Analysis Descriptive Statistics : A set of statistical procedures

More information

CLINICAL RESEARCH METHODS VISP356. MODULE LEADER: PROF A TOMLINSON B.Sc./B.Sc.(HONS) OPTOMETRY

CLINICAL RESEARCH METHODS VISP356. MODULE LEADER: PROF A TOMLINSON B.Sc./B.Sc.(HONS) OPTOMETRY DIVISION OF VISION SCIENCES SESSION: 2006/2007 DIET: 1ST CLINICAL RESEARCH METHODS VISP356 LEVEL: MODULE LEADER: PROF A TOMLINSON B.Sc./B.Sc.(HONS) OPTOMETRY MAY 2007 DURATION: 2 HRS CANDIDATES SHOULD

More information

Table of Contents. Plots. Essential Statistics for Nursing Research 1/12/2017

Table of Contents. Plots. Essential Statistics for Nursing Research 1/12/2017 Essential Statistics for Nursing Research Kristen Carlin, MPH Seattle Nursing Research Workshop January 30, 2017 Table of Contents Plots Descriptive statistics Sample size/power Correlations Hypothesis

More information

Biostatistics 3. Developed by Pfizer. March 2018

Biostatistics 3. Developed by Pfizer. March 2018 BROUGHT TO YOU BY Biostatistics 3 Developed by Pfizer March 2018 This learning module is intended for UK healthcare professionals only. Job bag: PP-GEP-GBR-0986 Date of preparation March 2018. Agenda I.

More information

STATISTICS AND RESEARCH DESIGN

STATISTICS AND RESEARCH DESIGN Statistics 1 STATISTICS AND RESEARCH DESIGN These are subjects that are frequently confused. Both subjects often evoke student anxiety and avoidance. To further complicate matters, both areas appear have

More information

Chapter 19. Confidence Intervals for Proportions. Copyright 2010, 2007, 2004 Pearson Education, Inc.

Chapter 19. Confidence Intervals for Proportions. Copyright 2010, 2007, 2004 Pearson Education, Inc. Chapter 19 Confidence Intervals for Proportions Copyright 2010, 2007, 2004 Pearson Education, Inc. Standard Error Both of the sampling distributions we ve looked at are Normal. For proportions For means

More information

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Vs. 2 Background 3 There are different types of research methods to study behaviour: Descriptive: observations,

More information

Confidence Intervals On Subsets May Be Misleading

Confidence Intervals On Subsets May Be Misleading Journal of Modern Applied Statistical Methods Volume 3 Issue 2 Article 2 11-1-2004 Confidence Intervals On Subsets May Be Misleading Juliet Popper Shaffer University of California, Berkeley, shaffer@stat.berkeley.edu

More information

Lesson 9: Two Factor ANOVAS

Lesson 9: Two Factor ANOVAS Published on Agron 513 (https://courses.agron.iastate.edu/agron513) Home > Lesson 9 Lesson 9: Two Factor ANOVAS Developed by: Ron Mowers, Marin Harbur, and Ken Moore Completion Time: 1 week Introduction

More information

Applied Statistical Analysis EDUC 6050 Week 4

Applied Statistical Analysis EDUC 6050 Week 4 Applied Statistical Analysis EDUC 6050 Week 4 Finding clarity using data Today 1. Hypothesis Testing with Z Scores (continued) 2. Chapters 6 and 7 in Book 2 Review! = $ & '! = $ & ' * ) 1. Which formula

More information

Encoding of Elements and Relations of Object Arrangements by Young Children

Encoding of Elements and Relations of Object Arrangements by Young Children Encoding of Elements and Relations of Object Arrangements by Young Children Leslee J. Martin (martin.1103@osu.edu) Department of Psychology & Center for Cognitive Science Ohio State University 216 Lazenby

More information