B. Weaver (18-Oct-2006) MC Procedures... 1

Chapter 1: Multiple Comparison Procedures

1.1 Introduction

The omnibus F-test in a one-way ANOVA is a test of the null hypothesis that the population means of all k samples are equal. Note that rejection of this null hypothesis (when k > 2) does not provide us with very much detailed information. In other words, rejection of the null hypothesis does not tell us which means differ significantly from other means--apart from the fact that the smallest and largest means are different, of course! And so, if the null hypothesis is rejected, a search for which differences are significant is in order. These search techniques are known as multiple comparison (MC) procedures.

1.2 Fisher’s Least Significant Difference (LSD) Method

You may recall from your introduction to ANOVA that there are definite problems associated with multiple t-tests. Specifically, whenever you carry out multiple t-tests (rather than a one-way ANOVA) on some set of means with the alpha per comparison (α_PC) set at .05, the probability of at least one Type I error can be much greater than .05. To illustrate the point, if you carried out c tests (or contrasts), and if each test was independent of all the others, then the maximum probability of at least one Type I error in the set, or the familywise alpha level, would be given by:

    α_FW ≤ 1 − (1 − α_PC)^c    (1.1)

The reason this formula gives you a maximum (rather than an exact value) for α_FW should be clear: If you fail to reject the null hypothesis in all c cases, then the probability of at least one Type I error is zero. If you reject all c null hypotheses, then α_FW will be equal to the right-hand portion of equation 1.1. In actual fact, there are dependencies among all possible pairwise comparisons, and this makes things more difficult. It is not possible to determine exactly the familywise (FW) alpha for several nonindependent t-tests.
However, it is known that under all circumstances:

    α_FW ≤ c·α_PC    (1.2)

Equation 1.2 captures what is known as the Bonferroni inequality. Howell (1997, p. 362) explains it as follows: “...the probability of occurrence of one or more events can never exceed the sum of their individual probabilities. This means that when we make three comparisons, each with a probability = [.05] of Type I error, the probability of at least one Type I error can never exceed .15.”

All of the foregoing has been concerned with doing multiple t-tests in place of the one-way ANOVA. The use of t-tests after rejection of the null hypothesis for an omnibus F-test is somewhat different. This technique is known as Fisher’s least significant difference (LSD) test, or sometimes as Fisher’s protected t. Under certain circumstances, Fisher’s LSD test will ensure that the FW alpha level is no greater than the per comparison (PC) alpha. In order to understand what these circumstances are, we must first clarify some terminology used by Howell (1997).

What Howell calls the complete null hypothesis is just the null hypothesis for a one-way ANOVA, which states that all population means are equal. If you have 5 treatments, for example, the complete null hypothesis specifies that µ1 = µ2 = µ3 = µ4 = µ5. When you are doing multiple comparisons, it may be that the complete null hypothesis is not true, but a more limited null hypothesis may be. For example, it may be that µ1 = µ2 < µ3 = µ4 < µ5.

According to Howell (1997, p. 369), “When the complete null hypothesis is true, the requirement of a significant overall F ensures that the familywise error rate will equal α. Unfortunately, if the complete null hypothesis is not true but some other more limited null hypothesis involving several means is true, the overall F no longer affords protection for FW.” Therefore, the LSD technique is not recommended except when you have three treatment levels.
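Equations 1.1 and 1.2 can be sketched in a few lines of code. This is an illustrative calculation, not part of the original text; note that for c = 3 it reproduces the .15 bound from the Howell quotation above.

```python
# Sketch of equations 1.1 and 1.2: the maximum familywise Type I error
# rate for c independent tests, and the (looser) Bonferroni upper bound.

def alpha_fw_independent(c, alpha_pc=0.05):
    """Maximum familywise alpha for c independent tests (equation 1.1)."""
    return 1 - (1 - alpha_pc) ** c

def alpha_fw_bonferroni(c, alpha_pc=0.05):
    """Bonferroni upper bound on familywise alpha (equation 1.2)."""
    return c * alpha_pc

for c in (1, 3, 10):
    print(c, round(alpha_fw_independent(c), 4), round(alpha_fw_bonferroni(c), 4))
# For c = 3: 1 - .95^3 = .1426, which never exceeds the Bonferroni bound of .15.
```

The Bonferroni bound is always at least as large as the independent-tests value, which is why equation 1.2 holds "under all circumstances" while equation 1.1 assumes independence.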
In this case the FW error rate will remain at α. To see why, consider the following scenarios.

First, consider the case where the complete null hypothesis is true: µ1 = µ2 = µ3. In this case, the probability that you will commit a Type I error with your overall F-test is α; and any subsequent Type I errors that might occur (i.e., when you carry out the three possible pairwise t-tests) will not affect the FW error rate. (Note that this is true for any number of means when the complete null hypothesis is true.)

If the complete null hypothesis is not true, but a more limited null hypothesis is true, then it must be the case that two of the means are equal and different from the third. For example, it may be that µ1 = µ2 < µ3. In this case, it is not possible to make a Type I error when carrying out the omnibus F-test. (You can only make a Type I error when the null hypothesis is true, and the null hypothesis for the omnibus F is the complete null hypothesis--and we have just said that it is not true.) In this case, there will be only one pairwise comparison for which the null hypothesis is true, and therefore only one opportunity to make a Type I error. And so the probability of a Type I error will be α_PC.

1.3 Calculation of t when k > 2

Before we go on, it should be noted that when you have more than 2 independent samples (i.e., when k > 2), t-values for multiple comparisons are computed in a way that differs somewhat from what you learned previously. Before everyone panics, let me emphasise that the difference is very slight, and that the reasons for the difference even make sense!

Let us begin with the t-test for 2 independent samples. The formula boils down to:

    t = (X̄1 − X̄2) / s_{X̄1−X̄2}    (1.3)

And the standard error of the difference between 2 independent means is often calculated with this formula:
    s_{X̄1−X̄2} = √[ (SS1 + SS2)/(n1 + n2 − 2) · (1/n1 + 1/n2) ]    (1.4)

But note that (SS1 + SS2)/(n1 + n2 − 2) is really a pooled variance estimate, or s²_p. And so equation 1.4 can be rewritten as:

    s_{X̄1−X̄2} = √[ s²_p (1/n1 + 1/n2) ]    (1.5)

Finally, note that when n1 = n2 = n, equation 1.5 can be rearranged to give:

    s_{X̄1−X̄2} = √[ s²_p (1/n + 1/n) ] = √(2s²_p / n)    (1.6)

It should be clear that the main component of the denominator of the t-ratio is a pooled estimate of the population variance. Normally when you perform an independent samples t-test, this variance estimate is based on 2 samples. However, the more samples you can base it on, the more accurately it will estimate the population variance. And so, when you have k independent samples/groups (where k > 2), and when the homogeneity of variance assumption is tenable, it makes abundant sense to use a pooled variance estimate that is based on all k samples: (SS1 + SS2 + ... + SSk) / (n1 + n2 + ... + nk − k). This formula should look familiar to you, by the way, because it is the MS_within or MS_error for the one-way ANOVA.

Finally then, when calculating t-ratios in the context of more than two independent samples, when all sample sizes are equal (and when the homogeneity of variance assumption is tenable), it is customary to use the following formula:

    t = (X̄i − X̄j) / √(2MS_error / n)    (1.7)

1.4 The studentized range statistic, q

In our discussion to this point, we have glossed over another problem that arises when one has more than 2 samples and does multiple t-tests. It is well known that as sample size (n) increases, so does the magnitude of the sample range. (In case you’ve forgotten, the range is the highest score in the sample minus the lowest score.) Imagine drawing random samples of various sizes from a normally distributed population with σ = 10.
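The pooled-variance logic above can be sketched as follows. The three small groups of scores here are made-up numbers purely for illustration; the point is that MS_error pools the sums of squares across all k groups before any pairwise t is computed.

```python
import math

# Sketch of equation 1.7: a pairwise t-ratio whose denominator pools the
# variance estimate across all k groups via MS_error.
# Assumes equal n per group and homogeneous variances.

def group_ss(g):
    """Sum of squared deviations from the group mean."""
    m = sum(g) / len(g)
    return sum((x - m) ** 2 for x in g)

def ms_error(groups):
    """Pooled variance: (SS1 + ... + SSk) / (n1 + ... + nk - k)."""
    ss = sum(group_ss(g) for g in groups)
    df = sum(len(g) for g in groups) - len(groups)
    return ss / df

def t_pairwise(groups, i, j):
    """t for comparing means of groups i and j (equation 1.7, equal n)."""
    n = len(groups[i])
    mean_i = sum(groups[i]) / n
    mean_j = sum(groups[j]) / n
    return (mean_i - mean_j) / math.sqrt(2 * ms_error(groups) / n)

groups = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]  # hypothetical data
print(ms_error(groups))                     # pooled across all 3 groups
print(round(t_pairwise(groups, 0, 2), 3))
```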
The effect of sample size on the expected value of the range is clear in the following table:

Table 1.1 Expected values of the range for samples of various sizes drawn from a normal population with σ = 10.

    Sample Size    Expected Value of Range
          2                 11
          5                 23
         10                 31
         20                 37
         50                 45
        100                 50
        200                 55
        500                 61
       1000                 65

Note that the effect of increased sample size on the expected value of the range is most pronounced when the sample sizes are relatively small. That is, an increase from n = 2 to n = 5 results in more than a doubling of the expected value of the range (11 to 23). But an increase from n = 500 to n = 1000 results in a very modest increase in the expected value of the range (from 61 to 65).

If you’re wondering what this has to do with multiple t-tests, remember that when you do a t-test, you are really comparing two scores that are drawn from a normal distribution: The two scores you compare are two sample means, and the normal distribution is the sampling distribution of the mean. Note that if you really have 5 samples (and 5 means), then the expected difference between the largest and smallest means (i.e., the range for the set of 5 means) will be much larger than if there are only 2 samples (and 2 means). In other words, if you were to draw 5 random samples from a normal population and compare the smallest and largest sample means, “the observed t-ratio would exceed the critical t-ratio far more often than the probability denoted by the nominal value of alpha” (Glass & Hopkins, 1984, p. 369).

Not surprisingly, there is a statistic that does take into account the number of samples: It is called the studentized range statistic, or q. The value of q is calculated by subtracting the smaller of two sample means from the larger, and dividing by the standard error of the mean:

    q = (X̄_L − X̄_S) / √(MS_error / n)    (1.8)

This formula looks very similar to formula 1.7, but note that MS_error is not multiplied by 2 when you calculate q.
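Equation 1.8 can be sketched directly. The means, MS_error, and n below come from the worked example introduced in the next section (Table 1.2), where the largest and smallest means are 29 and 4.

```python
import math

# Sketch of equation 1.8: the studentized range statistic q for the
# largest and smallest of a set of sample means.

def q_stat(means, ms_error, n):
    """q = (largest mean - smallest mean) / sqrt(MS_error / n)."""
    return (max(means) - min(means)) / math.sqrt(ms_error / n)

q = q_stat([4, 10, 11, 24, 29], ms_error=32.0, n=8)
print(q)  # (29 - 4) / sqrt(32/8) = 25 / 2 = 12.5
```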
Perhaps the difference between t and q is even clearer in equation 1.9:

    q = t√2    (1.9)

Just as there are critical values of t, there are critical values of q. Many statistics texts (including Howell, 1997) have tables of critical values of q. Most such tables show Error df down the left side of the table. This refers to the degrees of freedom for MS_error in the overall ANOVA. And across the top of the table is something like: r = Number of Steps Between Ordered Means. This refers to the number of means encompassed by the two means being tested. For example, if you have a set of 5 sample means rank ordered from smallest (M1) to largest (M5), the number of means encompassed by M1 and M5 would be 5--or in other words, r = 5.

1.5 The Tukey HSD Method of Multiple Comparisons

The Tukey HSD test is designed to make all pairwise comparisons between means while maintaining the FW error probability at α. (That is, if you set α = .05 for the Tukey HSD test, the probability of at least one Type I error will be no greater than .05.) The test statistic is the studentized range statistic q as defined in equation 1.8. The critical value of q for all pairwise comparisons is the critical value with the maximum value of r for that set of means. For example, if there are 5 means in the set, then all mean differences are treated as if the two means were 5 steps apart.

Let us look at an example with k = 5 independent groups (with 8 in each group). The means of the 5 groups, and the results of a one-way ANOVA, are shown below (and in Table 12.1 in Howell, 1997):

Table 1.2 Group means and ANOVA summary table

    Group    Mean        Source     df    MS       F       p
    M-S        4         Between     4    874.40   27.33   < .01
    M-M       10         Error      35     32.00
    S-S       11
    S-M       24
    Mc-M      29

There are two ways we could go about carrying out the Tukey HSD test. On the one hand, we could calculate q for each of the possible pairwise comparisons, and compare the obtained value to the critical value.
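Equation 1.9 amounts to a simple unit conversion between the two statistics, which a couple of helper functions (illustrative, not from the text) make explicit:

```python
import math

# Sketch of equation 1.9: converting between the t-ratio of equation 1.7
# and the q statistic of equation 1.8 for the same pair of means.
# Because t divides by sqrt(2*MS_error/n) and q divides by
# sqrt(MS_error/n), the two differ by a factor of sqrt(2).

def t_to_q(t):
    return t * math.sqrt(2)

def q_to_t(q):
    return q / math.sqrt(2)

print(round(t_to_q(2.878), 3))  # a t of 2.878 corresponds to q = 4.07
```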
(We would reject the null hypothesis of no difference between the means if the obtained value of q was equal to or greater than the critical value.) The other approach is to rearrange equation 1.8 in such a way that we can calculate a critical difference between means. To be consistent with Howell (1997), I will call this critical difference W (for width). W = the smallest width (or difference) between means that will be significant. The formula for calculating W is:

    W = q_crit √(MS_error / n)    (1.10)

As described earlier, the critical value of q depends on the degrees of freedom associated with the error term (from the overall ANOVA), and on the number of means in the set. It also depends on the alpha level you have chosen. For this example, df_error = 35 (from the ANOVA summary table), and r = 5. If we set α = .05, then the critical value of q = 4.07. (Note that there is no entry for df = 35, so I averaged the values for df = 30 and df = 40.) The square root of (MS_error/n) = 2. And so W = 4.07(2) = 8.14. In other words, any difference between means that is equal to or greater than 8.14 will be declared significant at the .05 level.

The next step is to construct a table of mean differences as follows (each cell entry is the mean at the top of the column minus the mean at the left end of the row):

Table 1.3 Mean pairwise differences between conditions.

                 M-S    M-M    S-S    S-M    Mc-M
                 (4)    (10)   (11)   (24)   (29)
    M-S   (4)            6      7      20     25
    M-M   (10)                  1      14     19
    S-S   (11)                         13     18
    S-M   (24)                                 5

All significant differences (i.e., those larger than 8.14) appear in the upper right portion of the table: 13, 14, 18, 19, 20, and 25. It is clear that there are no significant differences amongst the 3 smallest means, and that there is no significant difference between the 2 largest means. Furthermore, the largest 2 means do differ significantly from the smallest 3.
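The whole Tukey HSD procedure just described can be sketched in a few lines: compute the single critical width W from equation 1.10, then flag every pairwise difference that reaches it. All numbers follow the worked example (q_crit = 4.07 for r = 5 and df = 35; MS_error = 32; n = 8).

```python
import math
from itertools import combinations

# Sketch of the Tukey HSD test from the worked example: one critical
# width W = q_crit * sqrt(MS_error / n) is applied to every pair.

means = {"M-S": 4, "M-M": 10, "S-S": 11, "S-M": 24, "Mc-M": 29}
q_crit = 4.07                      # r = 5, df = 35, alpha = .05
w = q_crit * math.sqrt(32.0 / 8)   # = 4.07 * 2 = 8.14

sig = [(a, b, abs(ma - mb))
       for (a, ma), (b, mb) in combinations(means.items(), 2)
       if abs(ma - mb) >= w]

for a, b, diff in sig:
    print(f"{a} vs {b}: difference {diff} >= {w:.2f}, significant")
```

Running this flags the six differences (13, 14, 18, 19, 20, 25) identified in Table 1.3.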
This information is sometimes conveyed by writing down the treatments (i.e., just the group names/codes in this example) and underlining the homogeneous subsets. For the present results, for example:

    M-S  M-M  S-S    S-M  Mc-M
    -------------    ---------

1.6 The Newman-Keuls Method

The Newman-Keuls (NK) method of multiple comparisons is very similar to the Tukey HSD method. But whereas the Tukey test uses only one critical value of q (i.e., the critical value for the largest value of r), the NK test uses k−1 critical values of q. So in the previous example with k = 5 treatments, you would need 4 different critical values of q. Using df = 35 once again, the critical values of q would be:

    q_2 = 2.875
    q_3 = 3.465
    q_4 = 3.815
    q_5 = 4.070

The subscripts on the q’s indicate the value of r, or the number of means encompassed by the 2 means being compared. Thus, when you compare the largest and smallest means, the critical value of q = 4.07, just as it was for the Tukey HSD test. But as the number of means encompassed by the 2 being compared decreases, so does the critical value of q. These critical values can be converted to critical differences (W) using equation 1.10:

    W_2 = 2.875(2) = 5.75
    W_3 = 3.465(2) = 6.93
    W_4 = 3.815(2) = 7.63
    W_5 = 4.070(2) = 8.14

The mean differences in Table 1.3 would then be compared to these critical differences. Note that with this set of critical differences, we would conclude that the difference between M-S and M-M (10 − 4) and the difference between M-S and S-S (11 − 4) are both significant (which we were unable to do with the Tukey HSD test). Thus, the results of the Newman-Keuls test on these data could be summarized as follows:

    M-S    M-M  S-S    S-M  Mc-M
           --------    ---------

1.7 Comparison of Tukey and NK Methods

The Tukey and NK methods both use the studentized range statistic, q. The main difference between them is that the Tukey method limits the FW alpha level more strictly. (In fact, it limits the probability of at least one Type I error to α.)
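The layered-critical-value idea can be sketched as follows, reusing the means and the df = 35 critical q values quoted above. Note this simple sweep tests every pair against its own W_r; a full NK implementation also stops testing within any range already declared non-significant, but for these data the two approaches agree.

```python
import math

# Sketch of the Newman-Keuls comparisons from the worked example: each
# pair of rank-ordered means is tested against W_r, where r is the
# number of means spanned by the pair.

means = [("M-S", 4), ("M-M", 10), ("S-S", 11), ("S-M", 24), ("Mc-M", 29)]
q_crit = {2: 2.875, 3: 3.465, 4: 3.815, 5: 4.070}  # critical q, df = 35
se = math.sqrt(32.0 / 8)  # sqrt(MS_error / n) = 2

results = []
for i in range(len(means)):
    for j in range(i + 1, len(means)):
        r = j - i + 1               # number of means spanned, inclusive
        w_r = q_crit[r] * se        # layered critical difference W_r
        diff = means[j][1] - means[i][1]
        results.append((means[i][0], means[j][0], diff, diff >= w_r))

for a, b, diff, sig in results:
    print(a, "vs", b, diff, "significant" if sig else "n.s.")
```

This reproduces the text's conclusion: eight significant differences, including M-S vs M-M and M-S vs S-S, which Tukey's single width of 8.14 did not detect.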
It achieves this by treating all pairwise comparisons as if the 2 means were k steps apart, and using just one critical value of q. Note that for the initial comparison of the smallest and largest means, the Tukey and NK tests are identical. But for subsequent comparisons, the NK method will reject the null hypothesis more easily, because the critical value of q becomes smaller as the number of means in the range decreases. Some researchers and statisticians shy away from the NK test, because they feel that it is too liberal--i.e., that the probability of Type I error is too high. We will return to this issue later.

1.8 Linear Contrasts

Before going on to other MC methods, we must understand what a linear contrast is. A pairwise t-test (see equation 1.7) is a special kind of linear contrast that allows comparison of one mean with another mean. In general, linear contrasts allow us to compare one mean (or set of means) with another mean (or set of means).

It may be easier to understand what a linear contrast is if we first define a linear combination. According to Howell (1997, p. 355), “a linear combination [L] is a weighted sum of treatment means”:

    L = a_1 X̄_1 + a_2 X̄_2 + ... + a_k X̄_k = Σ a_i X̄_i    (1.11)

A linear combination becomes a linear contrast when we impose the restriction that the coefficients must sum to zero (Σa_i = 0). The contrast coefficients are just positive and negative numbers (and zeroes) that define the hypothesis to be tested by the contrast. Table 1.4 shows the coefficients for making all possible pairwise contrasts when there are 5 means.

Table 1.4 Coefficients for all pairwise contrasts involving 5 sample means.
    Means being                                Σa_i
    compared       1    2    3    4    5
    1 v 5          1    0    0    0   -1       0
    1 v 4          1    0    0   -1    0       0
    1 v 3          1    0   -1    0    0       0
    1 v 2          1   -1    0    0    0       0
    2 v 5          0    1    0    0   -1       0
    2 v 4          0    1    0   -1    0       0
    2 v 3          0    1   -1    0    0       0
    3 v 5          0    0    1    0   -1       0
    3 v 4          0    0    1   -1    0       0
    4 v 5          0    0    0    1   -1       0

1.9 Simple versus Complex Contrasts

Contrasts that involve only two means, with contrast coefficients equal to 1 and -1, are called simple or pairwise contrasts. All of the contrasts in Table 1.4, for example, are simple contrasts. Complex contrasts involve 3 or more means. For example, with a set of 5 means, it may be hypothesised that the mean of the first two means is different from the mean of the last 3 means. One set of coefficients that would work for this particular contrast is:

    3    3    -2    -2    -2

It may not be immediately obvious how I arrived at these coefficients, so let’s work through it in steps.

Step 1: Mean of groups 1 & 2 = (M_1 + M_2)/2 = (1/2)M_1 + (1/2)M_2

Step 2: Mean of groups 3-5 = (M_3 + M_4 + M_5)/3 = (1/3)M_3 + (1/3)M_4 + (1/3)M_5

Step 3: We could use as our coefficients the fractions in the right-hand portions of the equations in Steps 1 and 2. Note that one set of coefficients (which set is arbitrary) would have to be made negative so that the sum of the coefficients is zero. Thus, our coefficients could be:

    1/2    1/2    -1/3    -1/3    -1/3

Step 4: Some texts recommend stopping at this point, but others suggest that it is easier to work with coefficients that are whole numbers. To convert the coefficients to whole numbers, multiply each one by the lowest common denominator. In this case, that means multiplying each one by 6. Doing so yields the set of coefficients shown above.

Note that there is a shortcut that does not involve so many steps. The coefficient for the first set of means in a contrast equals the number of means in the second set; and the coefficient for the second set of means equals the number of means in the first set.
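The Step 1 through Step 4 recipe can be sketched as a small function (illustrative, not from the text): assign each group in the first set the fraction 1/(size of set 1), each group in the second set −1/(size of set 2), then scale by the lowest common denominator.

```python
from fractions import Fraction
from math import lcm

# Sketch of Steps 1-4: whole-number contrast coefficients for comparing
# the mean of one set of groups against the mean of another set.

def contrast_coefficients(set1, set2, k):
    """Return k whole-number coefficients for groups in set1 vs set2."""
    coefs = [Fraction(0)] * k
    for g in set1:
        coefs[g] = Fraction(1, len(set1))    # Step 1: +1/|set1| each
    for g in set2:
        coefs[g] = Fraction(-1, len(set2))   # Steps 2-3: -1/|set2| each
    denom = lcm(*(c.denominator for c in coefs))
    return [int(c * denom) for c in coefs]   # Step 4: scale by LCD

coefs = contrast_coefficients({0, 1}, {2, 3, 4}, 5)
print(coefs)            # [3, 3, -2, -2, -2], as in the worked example
assert sum(coefs) == 0  # the defining property of a contrast
```

Because the fractions are reduced before scaling, this also reproduces the reduction noted in the text: comparing 4 means against 6 (out of 10) yields coefficients 3 and −2 rather than 6 and −4.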
Finally, the coefficient for one of the two sets is arbitrarily selected and made negative. (Note that if you had 10 means, and had coefficients of 6 and -4, these could be reduced to 3 and -2.) In the example given earlier, there are 3 means in the second set, and so the coefficient for the first set of means is 3; and there are 2 means in the first set, so the coefficient for the second set of means is 2. Making the 2’s negative yields the set of coefficients listed earlier.

1.10 Testing the Significance of a Linear Contrast

Many MC methods use a modified t-ratio, or an F-ratio, as the test statistic. Here, we will focus on the modified t-test. (See Howell (1997, 1992) for an explanation of the F-test--and note that F = t².)

Before looking directly at the t-ratio for a linear contrast, let me remind you that in general, t = (statistic − parameter) / (standard error of statistic). In the case of a single sample t-test, the statistic is a sample mean; the parameter is the (null hypothesis) population mean; and the term in the denominator is the standard error of the mean (i.e., the sample standard deviation divided by the square root of the sample size). For an independent samples t-test (equation 1.3), the statistic is the difference between 2 sample means; the parameter is the (null hypothesis) difference between the two population means (which is usually equal to zero); and the term in the denominator is the standard error of the difference between two independent means (see equations 1.4 and 1.5).

When you test the significance of a linear contrast, the statistic in the numerator of the t-ratio is the linear contrast L (equation 1.11). Note that L is computed using sample means. Therefore, L is really an estimate of a corresponding contrast that uses population means. This contrast that uses population means (rather than sample means) is the parameter for our t-ratio.
In almost all cases, however, the null hypothesis specifies that this parameter equals zero, and so it can be left out of the formula. The final piece of the formula is the standard error of the linear contrast, which is computed as follows:

    s_L = √[ MS_error Σ(a_i²/n_i) ]    (1.12)

And when all sample sizes are equal, this reduces to:

    s_L = √[ (MS_error/n) Σa_i² ]    (1.13)

Finally, note that for simple (pairwise) contrasts with n_1 = n_2 = n, the coefficients are 1 and -1, so formula 1.13 reduces to:

    s_L = √(2MS_error / n)    (1.14)

This is the same as the denominator of the t-ratio shown in equation 1.7. Finally then, the t-ratio for a linear contrast is calculated with equation 1.15:

    t = L / s_L    (1.15)

1.11 Planned versus Post Hoc Comparisons

There is a very important distinction between planned (or a priori) and post hoc (or a posteriori) contrasts. Glass and Hopkins (1984, p. 380) say this about the distinction:

    In planned contrasts, the hypotheses (contrasts) to be tested must be specified prior to data collection. MC methods which employ planned comparisons can be advantageous if the questions that the researcher is interested in are a relatively small subset of questions. The distribution theory and probability statements for these MC methods are valid only if there is no chance for the user to be influenced by the data in the choice of which hypotheses are to be tested. The rationale for planned contrasts is similar to that for “one-tailed” t-tests--to be valid, the decision must be made a priori.

Post hoc MC techniques do not require advance specification of the hypotheses (contrasts) to be tested. The Tukey and NK MC methods are considered to be post hoc methods since there is no delimitation as to which pairs of means will be contrasted.

To further illustrate how important this distinction really is, consider the scenario described by Howell (1997, p. 350). If you have 5 means, there will be 10 possible pairwise comparisons.
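Equations 1.11 through 1.15 can be put together in a short sketch. The means, MS_error, and n reuse the chapter's worked example, and the contrast (first two groups vs last three) is the one built in section 1.9; applying this particular contrast to these data is my illustration, not a computation from the text.

```python
import math

# Sketch of equations 1.11-1.15: the linear contrast L, its standard
# error with equal n per group, and the resulting t-ratio.

means = [4, 10, 11, 24, 29]      # group means from the worked example
coefs = [3, 3, -2, -2, -2]       # groups 1-2 vs groups 3-5
ms_error, n = 32.0, 8

L = sum(a * m for a, m in zip(coefs, means))                # equation 1.11
s_L = math.sqrt(ms_error / n * sum(a * a for a in coefs))   # equation 1.13
t = L / s_L                                                 # equation 1.15
print(L, round(s_L, 3), round(t, 3))
```

Here L = −86, Σa² = 30, and s_L = √(4 · 30) ≈ 10.954, giving t ≈ −7.85. For a pairwise contrast (coefficients 1 and −1) the same code reduces to equation 1.7, as noted above.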
Let us imagine that the complete null hypothesis is true (i.e., all 5 population