Psicológica (2010), 31, 401-430.

A New Nonparametric Levene Test for Equal Variances

David W. Nordstokke* (1) & Bruno D. Zumbo (2)

(1) University of Calgary, Canada
(2) University of British Columbia, Canada

Tests of the equality of variances are sometimes used on their own to compare variability across groups of experimental or non-experimental conditions, but they are most often used alongside other methods to support assumptions made about variances. A new nonparametric test of equality of variances is described and compared to the current ‘gold standard’ method, the median-based Levene test, in a computer simulation study. The simulation results show that when sampling from either symmetric or skewed population distributions both the median-based and nonparametric Levene tests maintain their nominal Type I error rate; however, when one is sampling from skewed population distributions the nonparametric test has more statistical power.

Most studies in education, psychology, and the psycho-social and health sciences more broadly, use statistical hypothesis tests, such as the independent samples t-test or analysis of variance, to test the equality of two or more means, or other measures of location. In addition, some, but far fewer, studies compare variability across groups or experimental or non-experimental conditions. Tests of the equality of variances can therefore be used on their own for this purpose, but they are most often used alongside other methods to support assumptions made about variances. This is often done so that variances can be pooled across groups to yield an estimate of variance that is used in the standard error of the statistic in question. In current statistical practice, there is no consensus about which statistical test of variances maintains its nominal Type I error rate and maximizes the statistical power when data are sampled from skewed population distributions.

* Address Correspondence to: David W. Nordstokke, Ph.D.
Division of Applied Psychology. University of Calgary. 2500 University Drive NW. Education Tower 302. Calgary, Alberta, Canada T2N 1N4. Email: [email protected]

The widely used hypothesis for the test of equal variances, when, for example, there are two groups, is

$$H_0: \sigma_1^2 = \sigma_2^2, \qquad H_1: \sigma_1^2 \neq \sigma_2^2, \tag{H1}$$

wherein a two-tailed test of the null hypothesis ($H_0$) that the variances are equal against the alternative hypothesis ($H_1$) that the variances are not equal is performed.

Nordstokke and Zumbo (2007) recently investigated the widely recommended parametric mean-based Levene test for testing equal variances and, like several papers before them (e.g., Carroll and Schneider, 1985; Shoemaker, 2003; Zimmerman, 2004), they highlighted that the Levene test is a family of techniques, and that the original mean version is not robust to skewness of the population distribution of scores. As a way of highlighting this latter point, Nordstokke and Zumbo showed that if one is using the original variation of Levene’s test, a mean-based test, such as that found in widely used statistical software packages like SPSS and widely recommended in textbooks, one may be doing as poorly as (or worse than) the notorious F test of equal variances, which the original Levene test (1960) was intended to replace. As a reminder, the mean version of the Levene test (1960) is

$$\text{ANOVA}\left(\left|X_{ij} - \bar{X}_j\right|\right), \tag{T1}$$

wherein equation (T1) shows that this test is a one-way analysis of variance conducted on the absolute deviation values, which are calculated by subtracting the group mean value, denoted $\bar{X}_j$, from each individual’s score, denoted $X_{ij}$, for each individual i in group j. The family of Levene tests can be applied to more than two independent groups but, without loss of generality, the current study focuses on the two-group case.
The primary goal of this study is to compare the Type I error rates and the statistical power of the median version of the Levene test and a new nonparametric Levene test (described in detail below) that was briefly introduced by Nordstokke and Zumbo (2007). Nordstokke and Zumbo remind readers that the mean version of the Levene test for equality of variances does not maintain its nominal Type I error rate when the underlying population distribution is skewed and, in so doing, introduced the nonparametric version of the Levene test, which is intended to be more robust when samples are collected from skewed population distributions.

As Conover and his colleagues (1981) showed, the top performing test for equality of variances was the median-based Levene test. The median-based version of the Levene test for equal variances is

$$\text{ANOVA}\left(\left|X_{ij} - Mdn_j\right|\right), \tag{T2}$$

wherein, building on our notation above, the analysis of variance is conducted on the absolute deviations of each individual’s score, denoted $X_{ij}$, from their group median value, denoted $Mdn_j$, for each individual i in group j. The median-based version has been shown to perform well in situations wherein data were skewed. This test is available in widely used statistical software programs such as SAS (using, for example, PROC GLM and MEANS sample / HOVTEST=BF) and R using the packages “lawstat” (Hui, Gel, & Gastwirth, 2008) or “car” (Fox, 2009). Browne and Forsythe (1974), who are widely recognized as the developers of the median-based version of the Levene test, also demonstrated that this test was suitable for use with skewed distributions.
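The two forms (T1) and (T2) differ only in the centre used for the absolute deviations, which the following minimal Python sketch makes explicit. It uses only the standard library and returns the F statistic with its degrees of freedom; converting F to a p-value requires the F distribution and is left out. The function names `one_way_anova_f` and `levene_f` are illustrative choices, not part of any package named in the text.

```python
from statistics import mean, median


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA on `groups`."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def levene_f(groups, center="median"):
    """Levene-type statistic: ANOVA on absolute deviations from a group centre.

    center="mean" corresponds to (T1); center="median" corresponds to (T2).
    """
    centre_fn = mean if center == "mean" else median
    deviations = []
    for g in groups:
        c = centre_fn(g)  # one centre per group
        deviations.append([abs(x - c) for x in g])
    return one_way_anova_f(deviations)


if __name__ == "__main__":
    narrow = [1, 2, 3, 4, 5]
    wide = [2, 4, 6, 8, 10]
    print(levene_f([narrow, wide], center="median"))
    print(levene_f([narrow, wide], center="mean"))
```

For symmetric samples such as the two above, the mean and median coincide, so both variants give the same statistic; they diverge once the samples are skewed.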
In terms of skewed distributions, Browne and Forsythe investigated the Chi-square distribution with 4 degrees of freedom (skew equals approximately 1.5) and showed that the Type I error rates were maintained in all of the conditions and that power values were above .80 when the effect size was large (4/1), the ratio of sample sizes was 1/1, and the sample size was 40. This result suggests that the median version of the test could potentially become the widely used standard if these results hold across a broader range of conditions. Therefore, another purpose of this paper is to investigate the performance of the Levene median test for equal variances under a wider range of conditions than studied by Browne and Forsythe (1974). Conover, Johnson, and Johnson (1981) also showed that the Levene median test maintains its Type I error rates under an asymmetric double exponential distribution, but had average power values of .10. It is important that the Levene median test be studied further to assess its usefulness across varying research situations. For these reasons, the Levene median test for equal variances is used as a comparison for the newly developed nonparametric Levene test.

As Nordstokke and Zumbo (2007) describe it, the nonparametric Levene test involves pooling the data from all groups, ranking the scores (allowing, if necessary, for ties), placing the rank values back into their original groups, and running the Levene test on the ranks. Of course, if one were using SPSS (or some other program wherein the means version of the Levene test is computed) then one would merely have to apply the rank transformation and then submit the resulting ranks to the means Levene test, and one would have the nonparametric test, as described in Nordstokke and Zumbo (2007, pp. 11-12).
Using our earlier notation, the nonparametric Levene test can be written as

$$\text{ANOVA}\left(\left|R_{ij} - \bar{X}_j^{R}\right|\right), \tag{T3}$$

wherein a one-way analysis of variance is conducted on the absolute value of the mean of the ranks for each group, denoted $\bar{X}_j^{R}$, subtracted from each individual’s rank, $R_{ij}$, for individual i in group j. This nonparametric Levene test is based on the principle of the rank transformation (Conover & Iman, 1981). When the data are extremely non-normal, perhaps because of several outliers, because the variable is genuinely non-normal (e.g., salary), or because of some other intervening variables, the transformation changes the distribution and makes it uniform. Conover and Iman suggested conducting parametric analyses, for example, the analysis of variance, on rank transformed data. The use of rank transformed data, although popularized by Conover and Iman, is an idea that has had currency in the field of statistics for many years as a way to avoid the assumption of normality in analysis of variance (see, for example, Friedman, 1937; 1940). Thus the nonparametric Levene test is a parametric Levene test on the rank transformed data.

It should be noted that the null hypothesis for both the median and nonparametric Levene tests is not the same as for the mean version of the Levene test. The null hypothesis of these two tests is that the populations are identically distributed in shape (not necessarily in location), and the alternate hypothesis is that they are not identically distributed in shape. If two or more distributions are identically distributed in shape, then it is implied that the variances are equal. That is, if the researcher can assume identical distributions, then they can assume homogeneity of variances.
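The steps of (T3), pooling, ranking with ties, returning the ranks to their groups, and running the mean-based Levene test on the ranks, can be sketched in Python as below. This is a minimal standard-library illustration under the assumption that ties receive the average of the ranks they span (midranks); the names `midranks` and `nonparametric_levene_f` are hypothetical, and only the F statistic is returned, not a p-value.

```python
def midranks(values):
    """Rank values 1..N, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def anova_f(groups):
    """One-way ANOVA F statistic."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (len(groups) - 1)) / (ss_within / (n_total - len(groups)))


def nonparametric_levene_f(groups):
    """(T3): ANOVA on |R_ij - mean rank of group j| after pooled ranking."""
    pooled = [x for g in groups for x in g]
    ranks = midranks(pooled)
    deviations, start = [], 0
    for g in groups:
        group_ranks = ranks[start:start + len(g)]  # ranks back in their group
        start += len(g)
        mean_rank = sum(group_ranks) / len(group_ranks)
        deviations.append([abs(r - mean_rank) for r in group_ranks])
    return anova_f(deviations)


if __name__ == "__main__":
    middle = [4, 5, 6, 7]    # ranks cluster in the middle of the pooled data
    extremes = [1, 2, 9, 10]  # ranks cluster at the bottom and top
    print(nonparametric_levene_f([middle, extremes]))
```

The usage example mirrors the situation described in the text: one group's ranks sit in the middle of the pooled ordering and the other's sit at the extremes, which produces a large F statistic.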
Thus the overlap between the hypotheses of parametric and nonparametric tests allows for interchangeability between them when testing for equal variances because implicit in the assumption of equal variances is identical distributions. This overlap allows one to test for equal variances using the nonparametric hypothesis of identical distributions. Rank transformations are appropriate for testing for equal variances because, if the rankings between the two groups are widely disparate, it will be reflected by a significant result. For example, if the ranks of one of the groups tend to have values whose ranks are clustered near the top and bottom of the distribution and the other group has values whose ranks cluster near the middle of the distribution, the result of the nonparametric Levene test would lead one to conclude that the variances are not homogeneous.

METHOD

Data Generation

A computer simulation was performed following standard simulation methodology (e.g., Nordstokke & Zumbo, 2007; Zimmerman, 1987; 2004). Population distributions were generated and the statistical tests were performed using the statistical software package for the social sciences, SPSS. A pseudo random number sampling method with the initial seed selected randomly was used to produce $\chi^2$ distributions. An example of the syntax used to create the population distribution of one group belonging to a normal distribution is included in Appendix 1.

Building from Nordstokke and Zumbo (2007), the design of the simulation study was a 4 x 3 x 3 x 9 completely crossed design with: (a) four levels of skew of the population distribution, (b) three levels of sample size, (c) three levels of sample size ratio, $n_1/n_2$, and (d) nine levels of ratios of variances. The dependent variables in the simulation design are the proportion of rejections of the null hypothesis in each cell of the design and, more specifically, the Type I error rates (when the variances are equal), and power under the eight conditions of unequal variances. Staying consistent with Nordstokke and Zumbo (2007), we only investigated statistical power in those conditions wherein the nominal Type I error rate, in our study .05, is maintained.

Shape of the population distributions¹

Four levels of skew, 0, 1, 2, and 3, were investigated. As is well known, as the degrees of freedom of a $\chi^2$ distribution increase it more closely approximates a normal distribution. The skew of the distributions for both groups was always the same in all replications; the distributions are shown in Figure 1 (reading from top left to bottom right) for skew values of 0, 1, 2, and 3 respectively.

¹ It should be noted that the population skew was determined empirically for large sample sizes of 120,000 values with 1000, 7.4, 2.2, and .83 degrees of freedom resulting in skew values of 0.03, 1.03, 1.92, and 3.06, respectively; because the degrees of freedom are not whole numbers, the distributions are approximations. The well known mathematical relation is $\gamma_1 = \sqrt{8/df}$.

Figure 1. Shape of population distributions used in simulations (panels: Skew = 0, Skew = 1, Skew = 2, Skew = 3).

Sample Sizes

Three different sample sizes, $N = n_1 + n_2$, were investigated: 24, 48, and 96. Three levels of ratio of group sizes ($n_1/n_2$: 1/1, 2/1, and 3/1) were also investigated.

Population variance ratios

Nine levels of variance ratios were investigated ($\sigma_1^2/\sigma_2^2$: 5/1, 4/1, 3/1, 2/1, 1/1, 1/2, 1/3, 1/4, 1/5). Variance ratios were manipulated by multiplying the population of one of the groups in the design by a constant: 2 (2/1, 1/2 ratios), 3 (3/1, 1/3 ratios), 4 (4/1, 1/4 ratios), and 5 (5/1, 1/5 ratios).
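The factor levels and population shapes described above can be laid out in a short Python sketch. It is an illustration only and not the SPSS syntax used in the study: it enumerates the crossed design, evaluates the relation $\gamma_1 = \sqrt{8/df}$, and draws $\chi^2$ values with non-integer degrees of freedom through the equivalent gamma distribution. All names (`DF_BY_SKEW`, `group_sizes`, `draw_chi_square`, and so on) are hypothetical. The multiplier applied to one group's scores is exposed as a plain `scale` parameter; since multiplying scores by a constant c multiplies their variance by c squared, the sketch makes no assumption about which convention the original syntax followed.

```python
import math
import random
from itertools import product

# Factor levels as listed in the text.
DF_BY_SKEW = {0: 1000, 1: 7.4, 2: 2.2, 3: 0.83}  # chi-square df per nominal skew
TOTAL_N = (24, 48, 96)
SIZE_RATIOS = (1, 2, 3)                           # n1/n2 of 1/1, 2/1, 3/1
VARIANCE_RATIOS = ((5, 1), (4, 1), (3, 1), (2, 1), (1, 1),
                   (1, 2), (1, 3), (1, 4), (1, 5))


def theoretical_skew(df):
    """Skewness of a chi-square distribution: sqrt(8 / df)."""
    return math.sqrt(8.0 / df)


def group_sizes(total_n, ratio):
    """Split total_n into (n1, n2) so that n1 / n2 equals `ratio`."""
    n2 = total_n // (ratio + 1)
    return total_n - n2, n2


def draw_chi_square(df, n, rng, scale=1.0):
    """Draw n chi-square(df) values, each multiplied by `scale`.

    chi-square(df) is Gamma(shape=df/2, scale=2), which permits non-integer df.
    """
    return [scale * rng.gammavariate(df / 2.0, 2.0) for _ in range(n)]


# Every combination of skew, total N, size ratio and variance ratio.
cells = list(product(DF_BY_SKEW, TOTAL_N, SIZE_RATIOS, VARIANCE_RATIOS))

if __name__ == "__main__":
    print(len(cells), "cells in the crossed design")
    for skew, df in DF_BY_SKEW.items():
        print(skew, df, round(theoretical_skew(df), 2))
    rng = random.Random(2010)
    n1, n2 = group_sizes(24, 2)
    group_1 = draw_chi_square(DF_BY_SKEW[2], n1, rng, scale=2.0)
    group_2 = draw_chi_square(DF_BY_SKEW[2], n2, rng)
    print(n1, n2, len(group_1), len(group_2))
```

Using a dedicated `random.Random` instance keeps each draw reproducible from its seed without touching the global generator.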
The value of the constant was dependent on the amount of variance imbalance that was required for the cell of the design. For example, for a variance ratio of 2/1, the scores would be multiplied by 2. The design was created so that there was both direct pairing and inverse pairing in relation to unbalanced groups and direction of variance imbalance. Direct pairing occurs when the larger sample size is paired with the larger variance and inverse pairing occurs when the smaller sample size is paired with the larger variance (Tomarken & Serlin, 1986). This was done to investigate a more complete range of data possibilities. In addition, Keyes and Levy (1997) drew our attention to concern with unequal sample sizes, particularly in the case of factorial designs; see also O’Brien (1978, 1979) for discussion of Levene’s test in additive models for variances. Findings suggest that the validity and efficiency of a statistical test are somewhat dependent on the direction of the pairing of sample sizes with the ratio of variance. As a whole, the complex multivariate variable space represented by our simulation design captures many of the possibilities found in day-to-day research practice.

Determining Type I Error Rates & Power

The frequency of Type I errors was tabulated for each cell in the design. In all, there were 324 cells in the simulation design. As a description of our methodology, the following will describe the procedure for (T2) and (T3) for completing the steps for one cell in the design. First, for both tests, two similarly distributed populations were generated and sampled from; for this example, it was two normally distributed populations that were sampled to create two groups. In this case each group had 12 members, and the population variances of the two groups are equal. So, this example tests the Type I errors for the two tests under the current conditions on the same set of data.
For (T2), the absolute deviation from the median is calculated for each value in the sampled distribution and an ANOVA is performed on these values to test if the variances are significantly different at the nominal alpha value of .05 (±.01). For (T3), values are pooled and ranked, then partitioned back into their respective groups. An independent samples t-test is then performed on the ranked data of the two groups. A Levene’s test for equality of variances, by which we mean (T3), is reported in this procedure as a default test to determine if the variances are statistically significantly different at the nominal alpha value of .05 (±.01). The value of ±.01 represents moderate robustness and comes from Bradley (1978). The choice of Bradley’s criterion is somewhat arbitrary, although it is a middle ground between his alternatives, and some of our conclusions may change with the other criteria. It should be noted that when Type I error rates are less than .05, the validity of the test is not jeopardized to the same extent as it is when they are inflated. A test is invalid if its Type I error rate is inflated, but when the rate falls below the nominal level the test merely becomes more conservative, which reduces power. Reducing power does not invalidate the results of a test, so tests will be considered to be invalid only if the Type I error rate is inflated. Again, note that we intend to mimic day-to-day research practice, hence the number of cells under varying conditions.

This procedure was replicated 5000 times for each cell in the design. In the cells where the ratio of variances was not equal and that maintained their Type I error rates, statistical power is represented by the proportion of times that the Levene’s median test (T2) and the nonparametric Levene’s test (T3) correctly rejected the null hypothesis.
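The procedure for a single cell can be sketched in Python as follows. This is a minimal standard-library illustration, not the SPSS procedure used in the study, for the example cell described above (two normal populations, 12 per group, equal variances). It assumes the F statistics of (T2) and (T3) are compared with the upper .05 critical value of F(1, 22), hard-coded here from standard F tables, and it ranks without tie handling because the simulated data are continuous. All function names and the seed are hypothetical.

```python
import random
from statistics import median

N_PER_GROUP = 12
F_CRIT_1_22 = 4.301  # upper .05 point of F(1, 22); two groups of 12 give df (1, 22)


def anova_f(groups):
    """One-way ANOVA F statistic."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:  # degenerate case, not expected with continuous data
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (len(groups) - 1)) / (ss_within / (n_total - len(groups)))


def median_levene_f(groups):
    """(T2): ANOVA on absolute deviations from each group's median."""
    deviations = []
    for g in groups:
        m = median(g)
        deviations.append([abs(x - m) for x in g])
    return anova_f(deviations)


def rank_levene_f(groups):
    """(T3): pool, rank, split back, ANOVA on |rank - group mean rank|."""
    pooled = [x for g in groups for x in g]
    order = sorted(range(len(pooled)), key=pooled.__getitem__)
    ranks = [0.0] * len(pooled)
    for position, index in enumerate(order, start=1):
        ranks[index] = float(position)  # continuous data: ties not handled
    deviations, start = [], 0
    for g in groups:
        r = ranks[start:start + len(g)]
        start += len(g)
        mean_rank = sum(r) / len(r)
        deviations.append([abs(v - mean_rank) for v in r])
    return anova_f(deviations)


def type_i_error_rates(reps=2000, seed=12345):
    """Proportion of rejections under equal variances, both tests on the same data."""
    rng = random.Random(seed)
    rejections = {"median": 0, "nonparametric": 0}
    for _ in range(reps):
        groups = [[rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
                  for _ in range(2)]
        if median_levene_f(groups) > F_CRIT_1_22:
            rejections["median"] += 1
        if rank_levene_f(groups) > F_CRIT_1_22:
            rejections["nonparametric"] += 1
    return {name: count / reps for name, count in rejections.items()}


def within_bradley(rate, alpha=0.05, tolerance=0.01):
    """True if rate lies in alpha +/- tolerance (small slack for float rounding)."""
    return abs(rate - alpha) <= tolerance + 1e-12


def inflated(rate, alpha=0.05, tolerance=0.01):
    """True only when the rate exceeds the upper bound, the case treated as invalid."""
    return rate > alpha + tolerance + 1e-12


if __name__ == "__main__":
    for name, rate in type_i_error_rates().items():
        print(name, rate, within_bradley(rate), inflated(rate))
```

Keeping `within_bradley` and `inflated` separate reflects the distinction drawn in the text: a rate below the lower bound marks a conservative test, while only a rate above the upper bound marks an invalid one.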
RESULTS

The Type I error rates for the Levene median test (T2) and the nonparametric Levene test (T3) for all of the conditions in the study are illustrated in Table 1. For example, in the first row in Table 1 (reading across the row left to right), for a skew of 0, and a sample size of 24 with equal group sizes each containing 12 per group, the Type I error rate for the nonparametric Levene test is .049 and the Type I error rate for the Levene median test is .038. In all of the conditions of the simulation, both tests maintain their Type I error rate, with the Levene median test (T2) being somewhat conservative in some of the conditions.

As mentioned previously, the power values of the Levene median test (T2) and the nonparametric Levene test (T3) will only be investigated if the nominal Type I error rate was maintained. It was the case that the Type I error rates of both tests were maintained in all of the conditions of the present study. Table 2 reports the power values of the Levene median test (T2) and the nonparametric Levene test (T3) when the population skew is equal to 0. In nearly all of the cells of Table 2 the Levene median test (T2) has slightly higher power values. For example, in the first row of the table are the results for the nonparametric Levene test (T3), for which, for a sample size of 24 with equal groups and a ratio of variances of 5/1, the power is .42; that is, 42 percent of the null hypotheses were correctly rejected. In comparison, the power of the Levene median test (T2) (the next row in the table) under the same conditions was .50. In 61 of the 72 cells in Table 2, the median test had higher power than the nonparametric test.

The values for the power of the nonparametric Levene test (T3) and the Levene median test (T2) when the population distributions have a skew equal to 1 are illustrated in Table 3.
Again, in most of the cases, the Levene median test (T2) had slightly higher power values than the nonparametric Levene test (T3); however, the discrepancy between the scores is reduced. The power values are much closer than when the population skew was equal to 0. For example, in the first row of Table 3 are the power values for the nonparametric Levene test (T3). For a sample size of 24 with equal groups and a ratio of variances of 5/1, the power value is .474. In comparison, the Levene median test (T2) under identical conditions has a power value of .434. The median test was more powerful than the nonparametric test in 25 of the 72 cells in Table 3.

The power of the two tests when population skew is equal to 2 is listed in Table 4. In a great number of the cells of the table, the nonparametric Levene test (T3) has higher power values than the Levene median test (T2). For example, the power for the nonparametric Levene test (T3) is presented in the first row of Table 4. For a sample size of 24 with equal group sizes and a ratio of variances of 5/1, the power value is .572. In comparison, the power of the Levene median test under the same conditions is .296. The nonparametric test was more powerful than the median test in every cell of Table 4.

When population skew was equal to 3, the greatest differences between the power values of the two tests were present; these are illustrated in Table 5. The nonparametric Levene test (T3) has notably higher power values than the Levene median test (T2). For example, the first row of Table 5 lists the power values for the nonparametric Levene test (T3). For a sample size of 24 with equal group sizes and a ratio of variances of 5/1, the power value is .667. In comparison, the power of the Levene median test (T2) is .155. The nonparametric test was more powerful than the median test in every cell of Table 5.

Table 1. Type I error rates of the Nonparametric and Median versions of the Levene tests.
N     n1/n2   Nonparametric Levene   Levene Median

Skew = 0
24    1/1     .049                   .038
24    2/1     .050                   .037
24    3/1     .047                   .039
48    1/1     .044                   .039
48    2/1     .053                   .043
48    3/1     .054                   .046
96    1/1     .047                   .040
96    2/1     .051                   .043
96    3/1     .051                   .043

Skew = 1
24    1/1     .043                   .040
24    2/1     .048                   .041
24    3/1     .049                   .039
48    1/1     .046                   .041
48    2/1     .048                   .040
48    3/1     .058                   .044
96    1/1     .050                   .052
96    2/1     .048                   .044
96    3/1     .047                   .042

Skew = 2
24    1/1     .051                   .050
24    2/1     .049                   .047
24    3/1     .054                   .050
48    1/1     .053                   .055
48    2/1     .052                   .047
48    3/1     .053                   .045
96    1/1     .051                   .051
96    2/1     .049                   .050
96    3/1     .054                   .048

Skew = 3
24    1/1     .049                   .053
24    2/1     .054                   .050
24    3/1     .046                   .049
48    1/1     .049                   .045
48    2/1     .046                   .043
48    3/1     .050                   .044
96    1/1     .045                   .044
96    2/1     .054                   .048
96    3/1     .050                   .043