B. Weaver (17-Jan-2008)

Chapter 2: Analysis of Categorical Data

2.1 Introduction

Categorical, or nominal, data are most often encountered when observations are grouped into discrete, mutually exclusive categories (i.e., each observation can fall into one and only one category), and one examines the frequency of occurrence for each of the categories. The most common statistical test for such data is some form of chi-square test. (The "ch" in chi is pronounced like a k, and chi rhymes with eye.)

2.2 One-way classification: Chi-square "goodness of fit" test

Let us begin with an example. You have been hired by a supermarket chain to conduct some "Pepsi-challenge" market research. Let us assume that you asked 150 shoppers to sample 3 different brands of cola: Coke, Pepsi, and Brand X (a no-name brand produced for the supermarket). (We will assume that you were a careful experimenter, that you counterbalanced the order in which the 3 colas were presented to subjects, and that subjects were blind as to what they were drinking.) Each participant had to indicate which of the 3 colas they preferred, and the data looked like this:

Table 2.1 Observed frequencies for the Pepsi-challenge problem

    Coke   Pepsi   Brand X   Total
     45      40        65     150

So in your sample of 150, more people preferred Brand X to either Coke or Pepsi. But can we conclude from this that more people in the population of interest prefer Brand X? Not necessarily. It could be that there are no differences between Coke, Pepsi, and Brand X in the population, and that the differences we see in this sample are due to sampling error. Fortunately, we can evaluate this possibility with a chi-square test.

The chi-square test is based on the difference between the observed frequencies and the frequencies that are expected if the null hypothesis is true. The null hypothesis often (but not always) states that the frequencies will be equal in all categories. In this case, let us assume that it does. The expected frequencies would therefore be:

Table 2.2 Expected frequencies for the Pepsi-challenge problem

    Coke   Pepsi   Brand X   Total
     50      50        50     150

The X² statistic can be computed with the following formula, where the sum is taken over all categories:

    X² = Σ (O − E)² / E        (2.1)

Note that many textbooks call the statistic calculated with this formula χ² rather than X². Siegel and Castellan (1988) use X² to emphasise the distinction between the observed value of the statistic (X²) and the theoretical probability distribution that is its (asymptotic) sampling distribution under a true null hypothesis (χ²). That is, if H₀ is true (and certain conditions/assumptions are met; we will discuss those assumptions a bit later, in section 2.9), the χ² distribution with df = k−1 provides a pretty good approximation to the sampling distribution of X². Therefore, we can use the χ² distribution with df = k−1 to obtain the probability of getting the observed value of X², or a larger value, given that the null hypothesis is true. This conditional probability is the p-value, of course. And if the p-value is small enough (.05 or less, usually), we can reject the null hypothesis.

Table 2.3 Calculation of X² for the Pepsi-challenge problem

    Observed (O)   Expected (E)    O−E    (O−E)²   (O−E)²/E
          45             50         -5        25        0.5
          40             50        -10       100        2.0
          65             50         15       225        4.5
         150            150          0                  7.0

For the Pepsi-challenge example, X²(df=2, n=150) = 7.0, p = 0.030 (see Table 2.3 for the calculations). Therefore, we can reject the null hypothesis that all 3 brands are preferred equally in the population from which we have sampled.
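For readers who want to verify this result by machine, here is a minimal sketch in Python. It assumes the scipy library is available; the counts come from Table 2.1.

```python
from scipy.stats import chisquare

# Observed preference counts from Table 2.1 (Coke, Pepsi, Brand X).
observed = [45, 40, 65]

# When no expected frequencies are supplied, chisquare() assumes they
# are equal in all categories (here 50, 50, 50), which matches the
# null hypothesis used in the text.
statistic, p_value = chisquare(observed)

print(f"X^2 = {statistic:.1f}, p = {p_value:.3f}")  # X^2 = 7.0, p = 0.030
```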
2.3 Unequal proportions under the null hypothesis

As mentioned earlier, the null hypothesis does not always state that the same number of observations is expected in each category. For example, a geneticist might know that in the fruit fly population, 4 different sub-types of fruit flies appear in the ratio 4:3:2:1. So if a sample of 100 fruit flies was randomly selected from this population, the expected frequencies according to the null hypothesis would be 40, 30, 20, and 10. Would the geneticist be able to reject the null hypothesis (that the sample was randomly drawn from this population) if the observed frequencies were 44, 36, 12, and 8? Let us work it out and see. The calculations shown in Table 2.4 reveal that X²(df=3, n=100) = 5.2, p = 0.158. Therefore, the geneticist would not have sufficient evidence to allow rejection of the null hypothesis.

Table 2.4 Calculation of X² for the fruit fly problem

    Observed (O)   Expected (E)    O−E    (O−E)²   (O−E)²/E
          44             40          4        16        0.4
          36             30          6        36        1.2
          12             20         -8        64        3.2
           8             10         -2         4        0.4
         100            100          0                  5.2

2.4 Chi-square test of independence (or association)

So far we have considered cases with one categorical variable. We now move on to another common use of chi-square tests: a test of independence between two categorical variables. In this case, it is common to present the data in a contingency table, with the levels of one variable on different rows and the levels of the other variable in different columns. Let us assume, for example, that we have a problem-solving task that requires subjects to use a screwdriver as a pendulum. We randomly assign 90 subjects to 3 groups of 30. One group is given no special instructions (a₁); a second group is asked to list 5 common uses of screwdrivers (a₂); and the third group is asked to list 5 uncommon uses of screwdrivers (a₃). Then all subjects are given the problem, and each one is categorised as to whether or not they solved it. The frequency data look like this:

Table 2.5 Observed frequencies for the "screwdriver" experiment

                    a₁    a₂    a₃   Total
    Solve            9    17    22      48
    Fail to solve   21    13     8      42
    Total           30    30    30      90

The null hypothesis states that the proportion of subjects solving the problem is independent of instructions. (Because the null hypothesis states that the two variables are independent of each other, the alternative must state that they are associated. Hence the two names for the same test: test of independence, and test of association.) In other words, we expect the proportions of solvers and non-solvers to be the same in all 3 groups. If that is so, then based on the marginal (row) totals, we expect 48/90, or about 53%, of subjects from each group to solve the problem. The general rule for calculating E, the expected frequency for a cell under the null hypothesis, is:

    E = (row total × column total) / grand total        (2.2)

In terms of frequencies, then, we expect 30 × 48/90 = 16 subjects in each group to solve the problem, and 30 × 42/90 = 14 to fail. Using these expected frequencies, we can calculate X² as before: X² = 11.5178 (see Table 2.6). For a chi-square test of independence, the number of degrees of freedom is given by:

    df = (r − 1)(c − 1)        (2.3)

where r = the number of rows and c = the number of columns. For this problem, therefore, df = (2−1)(3−1) = 2, and the p-value can be obtained from the χ² distribution with df = 2 and compared to the conventional α = .05. Using SPSS, I found that p = 0.003. Therefore, I would report the results of my analysis as follows: X²(df=2, n=90) = 11.5178, p = 0.003. Because the p-value is lower than the conventional alpha level of 0.05, we can reject H₀.

Table 2.6 Calculation of X² for the "screwdriver" problem

    Observed (O)   Expected (E)    O−E    (O−E)²   (O−E)²/E
           9             16         -7        49     3.0625
          17             16          1         1     0.0625
          22             16          6        36     2.2500
          21             14          7        49     3.5000
          13             14         -1         1     0.0714
           8             14         -6        36     2.5714
          90             90          0              11.5178
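Both of the preceding analyses can be checked with a short Python sketch (again assuming scipy): chisquare() accepts unequal expected frequencies, and chi2_contingency() handles two-way tables. The variable names are mine.

```python
from scipy.stats import chisquare, chi2_contingency

# Fruit fly problem (section 2.3): expected counts in a 4:3:2:1
# ratio for a sample of 100 (Table 2.4).
stat, p = chisquare([44, 36, 12, 8], f_exp=[40, 30, 20, 10])
print(f"X^2 = {stat:.1f}, p = {p:.3f}")  # X^2 = 5.2, p = 0.158

# Screwdriver problem (section 2.4): rows are solve / fail to solve,
# columns are groups a1, a2, a3 (Table 2.5). chi2_contingency()
# derives the expected frequencies from the margins, as in
# equation 2.2, and uses df = (r-1)(c-1).
table = [[9, 17, 22],
         [21, 13, 8]]
stat, p, df, expected = chi2_contingency(table)
print(f"X^2 = {stat:.3f}, df = {df}, p = {p:.3f}")
# X^2 = 11.518, df = 2, p = 0.003
```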
2.5 Alternative to Pearson's X²: Likelihood ratio tests

There is an alternative to Pearson's chi-square that has been around since the 1950s. It is based on likelihood ratios (which we have not discussed) and is often referred to as the maximum likelihood chi-square. Note that this statistic is calculated by both Statview (where it is called the "G" statistic) and SPSS (where it is called the "Likelihood chi-square" and symbolized G² in some of the manuals). Following Howell (1992, 1997), I will use L² to symbolise the statistic that is calculated. There are various ways to calculate L², but the easiest is probably the way given by Howell. The formula is as follows, with the sum taken over all cells:

    L² = 2 Σ O ln(O/E)        (2.4)

where:

    ln = natural logarithm (see Appendix A)
    O  = observed frequency for a cell
    E  = expected frequency for a cell

L² and X² (computed with Pearson's formula) usually yield slightly different values. Nevertheless, for practical purposes, L² is distributed as χ² with df = k−1 for one-way classification (goodness of fit) problems, and df = (r−1)(c−1) for tests of independence. Thus it can be used in any situation where Pearson's formula could be used. (Pearson's formula has remained prominent and popular because it was computationally easier in the days before electronic calculators and computers.)

One might wonder if there is any advantage to using L² rather than X². For the moment, let me simply assure you that there are some advantages. We will be in a better position to discuss them after discussing two or three other topics. But first, let us redo two of the problems we looked at earlier using L² as our statistic.

As you can see by comparing Tables 2.3 and 2.7, L² and Pearson's X² yield slightly different values for the Pepsi-challenge data: L²(df=2, n=150) = 6.7736, p = 0.034; and X²(df=2, n=150) = 7.0, p = 0.030. The p-values differ only in the 3rd decimal place, and in both cases we can reject the null hypothesis.

Table 2.7 Calculation of L² for the Pepsi-challenge problem

    Observed (O)   Expected (E)   ln(O/E)   O ln(O/E)
          45             50       -0.1054     -4.7412
          40             50       -0.2231     -8.9257
          65             50        0.2624     17.0537
         150            150                    3.3868    L² = 6.7736

The steps in calculating L² for the "screwdriver" problem are shown in Table 2.8. Again, the values of L² and X² are somewhat different: L²(df=2, n=90) = 11.866, p = 0.003; and X²(df=2, n=90) = 11.518, p = 0.003. In this case, the p-values for the two tests are the same (to 3 decimal places), and both tests lead to rejection of the null hypothesis.

Table 2.8 Calculation of L² for the "screwdriver" problem

    Observed (O)   Expected (E)   ln(O/E)   O ln(O/E)
           9             16       -0.5754     -5.1783
          17             16        0.0606      1.0306
          22             16        0.3185      7.0060
          21             14        0.4055      8.5148
          13             14       -0.0741     -0.9634
           8             14       -0.5596     -4.4769
          90             90                    5.9328    L² = 11.8656
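The same scipy routines used earlier can compute L² instead of X²: passing lambda_="log-likelihood" selects the likelihood ratio (G²) statistic. A sketch, reusing the data from the two examples above:

```python
from scipy.stats import power_divergence, chi2_contingency

# Pepsi-challenge problem: L^2 for a one-way classification.
stat, p = power_divergence([45, 40, 65], lambda_="log-likelihood")
print(f"L^2 = {stat:.4f}, p = {p:.3f}")
# L^2 = 6.7734, p = 0.034 (Table 2.7's 6.7736 reflects rounding of
# the ln values to 4 decimal places in the hand calculation)

# Screwdriver problem: L^2 for the 2x3 contingency table.
table = [[9, 17, 22],
         [21, 13, 8]]
stat, p, df, _ = chi2_contingency(table, lambda_="log-likelihood")
print(f"L^2 = {stat:.3f}, df = {df}, p = {p:.3f}")
# L^2 = 11.866, df = 2, p = 0.003
```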
2.6 Additivity of independent χ²-distributed variables

It is a fact that the sum of two independent chi-squares with v₁ and v₂ degrees of freedom respectively is itself a chi-square with df = v₁ + v₂. Note also that this additivity can be extended to any number of chi-squares, provided that they are all independent of each other.

2.7 Partitioning an overall chi-square: One-way classification

From section 2.6, it follows that we should be able to take a chi-square with df = v (when v > 1) and partition it into v independent chi-squares, each with df = 1. I hope that the reason why we might want to do this will become clear through a couple of examples. To begin, let us return again to the Pepsi-challenge problem. The L² value for that problem was 6.7736 (see Table 2.7), with df = 2. We ought to be able to partition this overall L² into two independent L² values with 1 degree of freedom each. With one of these partitions, we could compare Coke and Pepsi to see if the proportion of people choosing them differs; and with the second, we could compare Coke and Pepsi combined to Brand X.

Table 2.9 Calculation of L² for the Coke vs Pepsi comparison

    Observed (O)   Expected (E)   ln(O/E)   O ln(O/E)
          45           42.5        0.0572      2.5721
          40           42.5       -0.0606     -2.4250
          85           85                      0.1471    L² = 0.2942

The observed frequencies for Coke and Pepsi are 45 and 40, so the total number of observations for this sub-table is 85. The null hypothesis is that Coke and Pepsi are preferred by equal numbers of people; therefore the expected frequency is 42.5 for both cells. Working it out, we find that L²(df=1, n=85) = 0.2942, p = 0.588 (see Table 2.9). Therefore, we cannot reject the null hypothesis for this test.

Table 2.10 Calculation of L² for the (Coke+Pepsi) vs Brand X comparison

    Observed (O)   Expected (E)   ln(O/E)   O ln(O/E)
          85           100        -0.1625    -13.8141
          65            50         0.2624     17.0537
         150           150                     3.2396    L² = 6.4792

For our second comparison, the observed frequencies are 85 (for Coke and Pepsi combined) and 65 (for Brand X). The original null hypothesis asserted that the proportions in the 3 categories were equal. Therefore, if we combine 2 of those categories, the expected proportion for the combined category is 2/3, and for the single category it is 1/3. Thus, according to the null hypothesis, the expected frequencies are 100 and 50. Working it through, we get L²(df=1, n=150) = 6.4792, p = 0.011. This value of L² would allow us to reject the null hypothesis, and we could conclude that more than one third of the population prefers Brand X.

The important point to be made here is that the sum of the L² values for these two orthogonal (or independent) comparisons is equal, apart from rounding error, to the L² for the overall data: 0.2942 + 6.4792 = 6.7734. The degrees of freedom are also additive: for the overall data, df = 2; and for each of the two comparisons, df = 1.
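As a quick arithmetic check, here is a sketch of the same partition in Python (again assuming scipy); the expected frequencies are those of Tables 2.9 and 2.10.

```python
from scipy.stats import power_divergence

# Component 1: Coke vs Pepsi (Table 2.9), expected 42.5 in each cell.
l2_a, p_a = power_divergence([45, 40], f_exp=[42.5, 42.5],
                             lambda_="log-likelihood")
print(f"Coke vs Pepsi:      L^2 = {l2_a:.3f}, p = {p_a:.3f}")
# L^2 = 0.294, p = 0.588

# Component 2: (Coke+Pepsi) vs Brand X (Table 2.10), expected 100
# and 50 under the original equal-thirds null hypothesis.
l2_b, p_b = power_divergence([85, 65], f_exp=[100, 50],
                             lambda_="log-likelihood")
print(f"(Coke+Pepsi) vs X:  L^2 = {l2_b:.3f}, p = {p_b:.3f}")
# L^2 = 6.479, p = 0.011

# The two 1-df components sum to the overall 2-df L^2.
print(f"Sum of components:  {l2_a + l2_b:.3f}")  # 6.773
```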
2.8 Partitioning the overall chi-square for a contingency table

It is also possible to partition the overall chi-square for a contingency table, provided that df > 1. To see how, let us return to the "screwdriver" problem (Table 2.5). You may recall that there were 3 groups of subjects, and that group 1 was given no special instructions, whereas groups 2 and 3 were given instructions. Therefore, it may be sensible to begin by comparing group 1 to groups 2 and 3 combined. This will allow us to assess the effect of no instructions versus instructions. The observed frequencies for this comparison are shown in Table 2.11, and the expected frequencies can be calculated in the usual fashion. For this comparison, L²(df=1, n=90) = 10.0208, p = .002 (see Table 2.12). Therefore we would reject the null hypothesis, and conclude that there is a difference between having and not having instructions.

Table 2.11 Observed frequencies for comparison of a₁ to (a₂ + a₃)

                    a₁   a₂+a₃   Total
    Solve            9      39      48
    Fail to solve   21      21      42
    Total           30      60      90

Table 2.12 L² calculations for comparison of a₁ to (a₂ + a₃)

    Observed (O)   Expected (E)   ln(O/E)   O ln(O/E)
           9            16        -0.5754     -5.1783
          39            32         0.1978      7.7152
          21            14         0.4055      8.5148
          21            28        -0.2877     -6.0413
          90            90                     5.0104    L² = 10.0208

Having discovered that there is a difference between having and not having instructions, we might wish to go on and compare the 2 kinds of instructions that were given. The null hypothesis for this comparison is that the 2 instructional conditions do not differ. The observed frequencies for this comparison are shown in Table 2.13, and the calculations are summarised in Table 2.14. (Note that the expected frequencies are calculated in the usual fashion, but on the basis of this sub-table alone.) For this comparison, L²(df=1, n=60) = 1.8448, p = 0.174 (see Table 2.14). Therefore we cannot reject the null hypothesis, and must conclude that there is no difference between the two instructional conditions.

Table 2.13 Observed frequencies for comparison of a₂ and a₃

                    a₂    a₃   Total
    Solve           17    22      39
    Fail to solve   13     8      21
    Total           30    30      60

Table 2.14 L² calculations for comparison of a₂ and a₃

    Observed (O)   Expected (E)   ln(O/E)   O ln(O/E)
          17           19.5       -0.1372     -2.3324
          22           19.5        0.1206      2.6538
          13           10.5        0.2136      2.7765
           8           10.5       -0.2719     -2.1755
          60           60                      0.9224    L² = 1.8448

Finally, it should be noted that the sum of the L² values for these two comparisons (each with df = 1) is equal to the L² value we calculated for the overall data (in Table 2.8): the sum of the values for our two comparisons is 10.0208 + 1.8448 = 11.8656, and the original L² value was 11.8656.

2.9 Assumptions/restrictions for use of chi-square based tests

The value of Pearson's X² will have a χ² distribution if the null hypothesis is true, and if the following conditions are met:

1. Each observation is independent of all the others (i.e., one observation per subject).

2. "No more than 20% of the expected counts are less than 5 and all individual expected counts are 1 or greater" (Yates, Moore & McCabe, 1999, p. 734).

3. For 2x2 tables:

   a) All expected frequencies should be 10 or greater.

   b) If any expected frequencies are less than 10, but greater than or equal to 5, some authors suggest that Yates' correction for continuity should be applied. This is done by subtracting 0.5 from the absolute value of (O−E) before squaring (see equation 2.5). However, the use of Yates' correction is controversial, and is not recommended by all authors.

   c) If any expected frequencies are smaller than 5, then some other test should be used (e.g., Fisher's exact test for 2x2 contingency tables).

    χ²(Yates) = Σ (|O − E| − 0.5)² / E        (2.5)
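To make equation 2.5 concrete, here is a sketch of the correction in Python. The 2x2 table is hypothetical (it is not from the text); it was chosen so that all expected frequencies fall between 5 and 10, the range in which some authors recommend the correction.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

def yates_chi2(table):
    """Pearson chi-square for a 2x2 table with Yates' continuity
    correction, following equation 2.5."""
    obs = np.asarray(table, dtype=float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row * col / obs.sum()          # expected frequencies (equation 2.2)
    stat = (((np.abs(obs - exp) - 0.5) ** 2) / exp).sum()
    return stat, chi2.sf(stat, df=1)     # df = (2-1)(3-1) style rule gives 1 here

# Hypothetical 2x2 table; its expected frequencies work out to
# 7.5, 7.5, 8.5, and 8.5, i.e., all between 5 and 10.
table = [[5, 10],
         [11, 6]]
print(yates_chi2(table))

# scipy's chi2_contingency() applies the same correction to 2x2
# tables by default (correction=True), so the two results should
# agree: a statistic of about 2.01 and a p-value of about 0.16.
stat, p, df, exp = chi2_contingency(table)
print(stat, p)
```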
2.10 Advantages of likelihood chi-square tests

It was suggested earlier that there are certain advantages to using L² rather than Pearson's X². First, according to Hays (1963), there is reason to believe that likelihood ratio tests are less affected by small sample sizes (and small expected frequencies) than are standard Pearson chi-squares, particularly when df > 1. In these circumstances, Hays suggests that the likelihood ratio test is superior. However, Alan Agresti (1990), who is more of an authority on this topic in my opinion, makes exactly the opposite claim:

    It is not simple to describe the sample size needed for the chi-squared distribution to approximate well the exact distributions of X² and G² [i.e., L²]. For a fixed number of cells, X² usually converges more quickly than G². The chi-squared approximation is usually poor for G² when n/IJ < 5 [where n = the grand total and IJ = rc = the number of cells in the table]. When I or J [i.e., r or c] is large, it can be decent for X² for n/IJ as small as 1, if the table does not contain both very small and moderately large expected frequencies. (Agresti, 1990, p. 49)

Another advantage concerns partitioning of an overall contingency table into orthogonal components. Had we done this using Pearson's X² method, we would have needed to use two different expected frequencies for each cell: the expected frequency in the numerator (E_n) is based on the current sub-table, and the expected frequency in the denominator (E_d) is based on the original, overall table of frequencies. (For some comparisons, E_n = E_d, but this is not always the case.) Things are much simpler with L², however, because there is only one expected frequency for each cell, and it is always based on the current sub-table you are working with.

Also, when you partition a contingency table into as many orthogonal components as possible with X², the orthogonal components usually do not add up exactly to the overall X². With L², on the other hand, rounding error aside, things always add up as they should. (I don't know about you, but this property of L² tests gives me a warm feeling inside. Things that are supposed to add up to a certain total do, and that gives me confidence in the test.) The short demonstration below makes this difference concrete for the Pepsi-challenge data.

Finally, L² tests of this sort can be viewed as the simplest form of loglinear analysis, which is increasing in popularity. We will not go into it here, but some of you may come across it in the future.
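Here is a small sketch (assuming scipy) of that additivity contrast, using the one-way Pepsi-challenge partition from section 2.7. For simplicity it uses the naive sub-table expected frequencies throughout, not the two-expected-frequency X² method described above; the point is simply that the X² components overshoot the overall value, while the L² components reproduce it.

```python
from scipy.stats import chisquare, power_divergence

overall = [45, 40, 65]                 # Table 2.1
comp1 = ([45, 40], [42.5, 42.5])       # Coke vs Pepsi (Table 2.9)
comp2 = ([85, 65], [100.0, 50.0])      # (Coke+Pepsi) vs Brand X (Table 2.10)

# Pearson X^2: sub-table components do not reproduce the overall value.
x2_all = chisquare(overall).statistic
x2_sum = sum(chisquare(o, f_exp=e).statistic for o, e in (comp1, comp2))
print(f"X^2 overall = {x2_all:.4f}, sum of components = {x2_sum:.4f}")
# X^2 overall = 7.0000, sum of components = 7.0441

# L^2: the components add up exactly (floating-point error aside).
l2_all = power_divergence(overall, lambda_="log-likelihood").statistic
l2_sum = sum(power_divergence(o, f_exp=e, lambda_="log-likelihood").statistic
             for o, e in (comp1, comp2))
print(f"L^2 overall = {l2_all:.4f}, sum of components = {l2_sum:.4f}")
# L^2 overall = 6.7734, sum of components = 6.7734
```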
2.11 Fisher's exact test

As noted earlier, it may be more appropriate to use Fisher's exact test to analyze the data in a 2x2 contingency table if any of the expected frequencies are less than 5. Consider the following (slightly modified) example, taken from the BMJ's Statistics at Square One chapter on the exact probability test:

    Some soldiers are being trained as parachutists. One rather windy afternoon 55 practice jumps take place at two localities, dropping zone A and dropping zone B. Of 40 men who jump at dropping zone A, two suffer sprained ankles, and of 15 who jump at dropping zone B, five suffer this injury. The casualty rate at dropping zone B seems unduly high, so the medical officer in charge decides to investigate the disparity. Is it a difference that might be expected by chance? If not it deserves deeper study. (From http://www.bmj.com/collections/statsbk/9.shtml , downloaded on 2-Mar-2001.)

The data are summarized in Table 2.15. Note that the smallest expected frequency is 7(15)/55 = 1.91, which is considerably less than 5. Therefore the sampling distribution of Pearson's X² will not be all that well approximated by a χ² distribution with df = 1.

Table 2.15 Numbers of injured and uninjured men at two different drop zones

                   Injured   Uninjured   Total
    Drop zone A       2          38        40
    Drop zone B       5          10        15
    Total             7          48        55

The medical officer's null hypothesis is that there is no (population) difference in the injury rate at the two drop zones, i.e., that the difference between 2/40 (5.0%) and 5/15 (33.3%) is due to chance. So what we really want to know is this: how likely is it that we would see a discrepancy in injury rates this large or larger if there is really no difference in the population from which we have sampled?

Table 2.16 More extreme differences in injury rates at the two drop zones

    (a)            Injured   Uninjured   Total
    Zone A            1          39        40
    Zone B            6           9        15
    Total             7          48        55

    (b)            Injured   Uninjured   Total
    Zone A            0          40        40
    Zone B            7           8        15
    Total             7          48        55
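The logic behind Tables 2.16(a) and (b), summing the probabilities of the observed table and of all tables (with the same margins) that are even more extreme, is exactly what Fisher's exact test does. A minimal sketch in Python, assuming scipy and using the data of Table 2.15:

```python
from scipy.stats import fisher_exact, hypergeom

# 2x2 table from Table 2.15: rows = drop zones A and B,
# columns = injured, uninjured.
table = [[2, 38],
         [5, 10]]

odds_ratio, p_two_sided = fisher_exact(table, alternative="two-sided")
print(f"two-sided p = {p_two_sided:.4f}")  # about 0.0126

# The one-tailed probability can also be built up directly from the
# hypergeometric distribution: P(2, 1, or 0 injuries at zone A),
# given 7 injuries in all, 55 jumps, and 40 of them at zone A.
# The three terms correspond to Table 2.15 and Tables 2.16(a)
# and (b) respectively.
p_one_sided = hypergeom.cdf(2, 55, 7, 40)
print(f"one-sided p = {p_one_sided:.4f}")  # about 0.0126

# For these data the two p-values coincide, because no table in the
# opposite tail is as improbable as the observed one.
```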