Title

    anova — Analysis of variance and covariance

    Syntax    Menu    Description    Options    Remarks and examples    Stored results    References    Also see

Syntax

    anova varname [termlist] [if] [in] [weight] [, options]

where termlist is a factor-variable list (see [U] 11.4.3 Factor variables) with the following additional features:

  • Variables are assumed to be categorical; use the c. factor-variable operator to override this.
  • The | symbol (indicating nesting) may be used in place of the # symbol (indicating interaction).
  • The / symbol is allowed after a term and indicates that the following term is the error term for the preceding terms.

    options               Description
    --------------------------------------------------------------------------------------
    Model
      repeated(varlist)   variables in terms that are repeated-measures variables
      partial             use partial (or marginal) sums of squares
      sequential          use sequential sums of squares
      noconstant          suppress constant term
      dropemptycells      drop empty cells from the design matrix

    Adv. model
      bse(term)           between-subjects error term in repeated-measures ANOVA
      bseunit(varname)    variable representing lowest unit in the between-subjects error term
      grouping(varname)   grouping variable for computing pooled covariance matrix
    --------------------------------------------------------------------------------------

bootstrap, by, fp, jackknife, and statsby are allowed; see [U] 11.1.10 Prefix commands.
Weights are not allowed with the bootstrap prefix; see [R] bootstrap.
aweights are not allowed with the jackknife prefix; see [R] jackknife.
aweights and fweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.

Menu

    Statistics > Linear models and related > ANOVA/MANOVA > Analysis of variance and covariance

Description

The anova command fits analysis-of-variance (ANOVA) and analysis-of-covariance (ANCOVA) models for balanced and unbalanced designs, including designs with missing cells; for repeated-measures ANOVA; and for factorial, nested, or mixed designs.
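The termlist conventions listed under Syntax can be sketched with a few commands. This is only an illustration: y, a, b, and x are hypothetical variable names that do not come from any dataset in this entry, and the // comments assume the lines are run from a do-file.

```stata
* y, a, b, and x are hypothetical variable names used for illustration only
anova y a b a#b      // a and b are treated as categorical; a#b is their interaction
anova y a##b         // shorthand for the same full factorial model
anova y a c.x        // c. marks x as continuous, giving an ANCOVA model
anova y a / b|a /    // b nested in a; b|a serves as the error term for a
```

In the last command, the trailing / with nothing after it leaves the residual as the error term for b|a.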
The regress command (see [R] regress) will display the coefficients, standard errors, etc., of the regression model underlying the last run of anova.

If you want to fit one-way ANOVA models, you may find the oneway or loneway command more convenient; see [R] oneway and [R] loneway. If you are interested in MANOVA or MANCOVA, see [MV] manova.

Options

Model

repeated(varlist) indicates the names of the categorical variables in the terms that are to be treated as repeated-measures variables in a repeated-measures ANOVA or ANCOVA.

partial presents the ANOVA table using partial (or marginal) sums of squares. This setting is the default. Also see the sequential option.

sequential presents the ANOVA table using sequential sums of squares.

noconstant suppresses the constant term (intercept) from the ANOVA or regression model.

dropemptycells drops empty cells from the design matrix. If c(emptycells) is set to keep (see [R] set emptycells), this option temporarily resets it to drop before running the ANOVA model. If c(emptycells) is already set to drop, this option does nothing.

Adv. model

bse(term) indicates the between-subjects error term in a repeated-measures ANOVA. This option is needed only in the rare case when the anova command cannot automatically determine the between-subjects error term.

bseunit(varname) indicates the variable representing the lowest unit in the between-subjects error term in a repeated-measures ANOVA. This option is rarely needed because the anova command automatically selects the first variable listed in the between-subjects error term as the default for this option.

grouping(varname) indicates a variable that determines which observations are grouped together in computing the covariance matrices that will be pooled and used in a repeated-measures ANOVA.
This option is rarely needed because the anova command automatically selects the combination of all variables except the first (or as specified in the bseunit() option) in the between-subjects error term as the default for grouping observations.

Remarks and examples

Remarks are presented under the following headings:

    Introduction
    One-way ANOVA
    Two-way ANOVA
    N-way ANOVA
    Weighted data
    ANCOVA
    Nested designs
    Mixed designs
    Latin-square designs
    Repeated-measures ANOVA
    Video examples

Introduction

anova uses least squares to fit the linear models known as ANOVA or ANCOVA (henceforth referred to simply as ANOVA models).

If your interest is in one-way ANOVA, you may find the oneway command to be more convenient; see [R] oneway.

Structural equation modeling provides a more general framework for fitting ANOVA models; see the Stata Structural Equation Modeling Reference Manual.

ANOVA was pioneered by Fisher. It features prominently in his texts on statistical methods and his design of experiments (1925, 1935).

Many books discuss ANOVA; see, for instance, Altman (1991); van Belle et al. (2004); Cobb (1998); Snedecor and Cochran (1989); or Winer, Brown, and Michels (1991). For a classic source, see Scheffé (1959). Kennedy and Gentle (1980) discuss ANOVA's computing problems. Edwards (1985) is concerned primarily with the relationship between multiple regression and ANOVA. Acock (2014, chap. 9) illustrates his discussion with Stata output. Repeated-measures ANOVA is discussed in Winer, Brown, and Michels (1991); Kuehl (2000); and Milliken and Johnson (2009). Pioneering work in repeated-measures ANOVA can be found in Box (1954); Geisser and Greenhouse (1958); Huynh and Feldt (1976); and Huynh (1978). For a Stata-specific discussion of ANOVA contrasts, see Mitchell (2012, chap. 7–9).

One-way ANOVA

anova, entered without options, performs and reports standard ANOVA.
For instance, to perform a one-way layout of a variable called endog on exog, you would type anova endog exog.

Example 1: One-way ANOVA

We run an experiment varying the amount of fertilizer used in growing apple trees. We test four concentrations, using each concentration in three groves of 12 trees each. Later in the year, we measure the average weight of the fruit.

If all had gone well, we would have had 3 observations on the average weight for each of the four concentrations. Instead, two of the groves were mistakenly leveled by a confused man on a large bulldozer. We are left with the following data:

. use http://www.stata-press.com/data/r13/apple
(Apple trees)

. list, abbrev(10) sepby(treatment)

       treatment   weight
  1.           1    117.5
  2.           1    113.8
  3.           1    104.4

  4.           2     48.9
  5.           2     50.4
  6.           2     58.9

  7.           3     70.4
  8.           3     86.9

  9.           4     87.7
 10.           4     67.3

To obtain one-way ANOVA results, we type

. anova weight treatment

                  Number of obs =      10     R-squared     =  0.9147
                  Root MSE      = 9.07002     Adj R-squared =  0.8721

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  5295.54433     3  1765.18144      21.46     0.0013
             |
   treatment |  5295.54433     3  1765.18144      21.46     0.0013
             |
    Residual |  493.591667     6  82.2652778
  -----------+----------------------------------------------------
       Total |    5789.136     9  643.237333

We find significant (at better than the 1% level) differences among the four concentrations.

Although the output is a usual ANOVA table, let's run through it anyway. Above the table is a summary of the underlying regression. The model was fit on 10 observations, and the root mean squared error (Root MSE) is 9.07. The R² for the model is 0.9147, and the adjusted R² is 0.8721.

The first line of the table summarizes the model. The sum of squares (Partial SS) for the model is 5295.5 with 3 degrees of freedom (df). This line results in a mean square (MS) of 5295.5/3 ≈ 1765.2. The corresponding F statistic is 21.46 and has a significance level of 0.0013. Thus the model appears to be significant at the 0.13% level.

The next line summarizes the first (and only) term in the model, treatment.
Because there is only one term, the line is identical to that for the overall model.

The third line summarizes the residual. The residual sum of squares is 493.59 with 6 degrees of freedom, resulting in a mean squared error of 82.27. The square root of this latter number is reported as the Root MSE.

The model plus the residual sum of squares equals the total sum of squares, which is reported as 5789.1 in the last line of the table. This is the total sum of squares of weight after removal of the mean. Similarly, the model plus the residual degrees of freedom sum to the total degrees of freedom, 9. Remember that there are 10 observations. Subtracting 1 for the mean, we are left with 9 total degrees of freedom.

Technical note

Rather than using the anova command, we could have performed this analysis by using the oneway command. Example 1 in [R] oneway repeats this same analysis. You may wish to compare the output.

Type regress to see the underlying regression model corresponding to an ANOVA model fit using the anova command.

Example 2: Regression table from a one-way ANOVA

Returning to the apple tree experiment, we found that the fertilizer concentration appears to significantly affect the average weight of the fruit. Although that finding is interesting, we next want to know which concentration appears to grow the heaviest fruit. One way to find out is by examining the underlying regression coefficients.

. regress, baselevels

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  3,     6) =   21.46
       Model |  5295.54433     3  1765.18144           Prob > F      =  0.0013
    Residual |  493.591667     6  82.2652778           R-squared     =  0.9147
-------------+------------------------------           Adj R-squared =  0.8721
       Total |    5789.136     9  643.237333           Root MSE      =    9.07

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |
          1  |          0  (base)
          2  |  -59.16667   7.405641    -7.99   0.000    -77.28762   -41.04572
          3  |     -33.25   8.279758    -4.02   0.007    -53.50984   -12.99016
          4  |      -34.4   8.279758    -4.15   0.006    -54.65984   -14.14016
             |
       _cons |      111.9   5.236579    21.37   0.000     99.08655    124.7134
------------------------------------------------------------------------------

See [R] regress for an explanation of how to read this table. The baselevels option of regress displays a row indicating the base category for our categorical variable, treatment.

In summary, we find that concentration 1, the base (omitted) group, produces significantly heavier fruits than concentrations 2, 3, and 4; concentration 2 produces the lightest fruits; and concentrations 3 and 4 appear to be roughly equivalent.

Example 3: ANOVA replay

We previously typed anova weight treatment to produce and display the ANOVA table for our apple tree experiment. Typing regress displays the regression coefficients. We can redisplay the ANOVA table by typing anova without arguments:

. anova

                  Number of obs =      10     R-squared     =  0.9147
                  Root MSE      = 9.07002     Adj R-squared =  0.8721

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  5295.54433     3  1765.18144      21.46     0.0013
             |
   treatment |  5295.54433     3  1765.18144      21.46     0.0013
             |
    Residual |  493.591667     6  82.2652778
  -----------+----------------------------------------------------
       Total |    5789.136     9  643.237333

Two-way ANOVA

You can include multiple explanatory variables with the anova command, and you can specify interactions by placing ‘#’ between the variable names. For instance, typing anova y a b performs a two-way layout of y on a and b. Typing anova y a b a#b performs a full two-way factorial layout. The shorthand anova y a##b does the same.

With the default partial sums of squares, when you specify interacted terms, the order of the terms does not matter. Typing anova y a b a#b is the same as typing anova y b a b#a.

Example 4: Two-way factorial ANOVA

The classic two-way factorial ANOVA problem, at least as far as computer manuals are concerned, is a two-way ANOVA design from Afifi and Azen (1979).
Fifty-eight patients, each suffering from one of three different diseases, were randomly assigned to one of four different drug treatments, and the change in their systolic blood pressure was recorded. Here are the data:

             Disease 1                Disease 2                 Disease 3
  Drug 1     42, 44, 36, 13, 19, 22   33, 26, 33, 21            31, -3, 25, 25, 24
  Drug 2     28, 23, 34, 42, 13       34, 33, 31, 36            3, 26, 28, 32, 4, 16
  Drug 3     1, 29, 19                11, 9, 7, 1, -6           21, 1, 9, 3
  Drug 4     24, 9, 22, -2, 15        27, 12, 12, -5, 16, 15    22, 7, 25, 5, 12

Let's assume that we have entered these data into Stata and stored the data as systolic.dta. Below we use the data, list the first 10 observations, summarize the variables, and tabulate the control variables:

. use http://www.stata-press.com/data/r13/systolic
(Systolic Blood Pressure Data)

. list in 1/10

       drug   disease   systolic
  1.      1         1         42
  2.      1         1         44
  3.      1         1         36
  4.      1         1         13
  5.      1         1         19
  6.      1         1         22
  7.      1         2         33
  8.      1         2         26
  9.      1         2         33
 10.      1         2         21

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        drug |        58         2.5    1.158493          1          4
     disease |        58    2.017241    .8269873          1          3
    systolic |        58    18.87931    12.80087         -6         44

. tabulate drug disease

           |        Patient's Disease
 Drug Used |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |         6          4          5 |        15
         2 |         5          4          6 |        15
         3 |         3          5          4 |        12
         4 |         5          6          5 |        16
-----------+---------------------------------+----------
     Total |        19         19         20 |        58

Each observation in our data corresponds to one patient, and for each patient we record drug, disease, and the increase in the systolic blood pressure, systolic. The tabulation reveals that the data are not balanced: there are not equal numbers of patients in each drug–disease cell. Stata does not require that the data be balanced. We can perform a two-way factorial ANOVA by typing

. anova systolic drug disease drug#disease

                  Number of obs =      58     R-squared     =  0.4560
                  Root MSE      = 10.5096     Adj R-squared =  0.3259

        Source |  Partial SS    df       MS           F     Prob > F
  -------------+----------------------------------------------------
         Model |  4259.33851    11  387.212591       3.51     0.0013
               |
          drug |  2997.47186     3  999.157287       9.05     0.0001
       disease |  415.873046     2  207.936523       1.88     0.1637
  drug#disease |  707.266259     6   117.87771       1.07     0.3958
               |
      Residual |  5080.81667    46  110.452536
  -------------+----------------------------------------------------
         Total |  9340.15517    57  163.862371

Although Stata's table command does not perform ANOVA, it can produce useful summary tables of your data (see [R] table):

. table drug disease, c(mean systolic) row col f(%8.2f)

           |        Patient's Disease
 Drug Used |        1         2         3     Total
-----------+---------------------------------------
         1 |    29.33     28.25     20.40     26.07
         2 |    28.00     33.50     18.17     25.53
         3 |    16.33      4.40      8.50      8.75
         4 |    13.60     12.83     14.20     13.50
           |
     Total |    22.79     18.21     15.80     18.88

These are simple means and are not influenced by our anova model. More useful is the margins command (see [R] margins) that provides marginal means and adjusted predictions. Because drug is the only significant factor in our ANOVA, we now examine the adjusted marginal means for drug.

. margins drug, asbalanced

Adjusted predictions                              Number of obs   =         58

Expression   : Linear prediction, predict()
at           : drug            (asbalanced)
               disease         (asbalanced)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |
          1  |   25.99444   2.751008     9.45   0.000     20.45695    31.53194
          2  |   26.55556   2.751008     9.65   0.000     21.01806    32.09305
          3  |   9.744444   3.100558     3.14   0.003     3.503344    15.98554
          4  |   13.54444   2.637123     5.14   0.000     8.236191     18.8527
------------------------------------------------------------------------------

These adjusted marginal predictions are not equal to the simple drug means (see the total column from the table command); they are based upon predictions from our ANOVA model. The asbalanced option of margins corresponds with the interpretation of the F statistic produced by ANOVA: each cell is given equal weight regardless of its sample size (see the following three technical notes). You can omit the asbalanced option and obtain predictive margins that take into account the unequal sample sizes of the cells.

. margins drug

Predictive margins                                Number of obs   =         58

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |
          1  |   25.89799   2.750533     9.42   0.000     20.36145    31.43452
          2  |   26.41092   2.742762     9.63   0.000     20.89003    31.93181
          3  |   9.722989   3.099185     3.14   0.003     3.484652    15.96132
          4  |   13.55575   2.640602     5.13   0.000      8.24049      18.871
------------------------------------------------------------------------------

Technical note

How do you interpret the significance of terms like drug and disease in unbalanced data? If you are familiar with SAS, the sums of squares and the F statistic reported by Stata correspond to SAS type III sums of squares. (Stata can also calculate sequential sums of squares, but we will postpone that topic for now.)

Let's think in terms of the following table:

\[
\begin{array}{l|ccc|c}
 & \text{Disease 1} & \text{Disease 2} & \text{Disease 3} & \\ \hline
\text{Drug 1} & \mu_{11} & \mu_{12} & \mu_{13} & \mu_{1\cdot} \\
\text{Drug 2} & \mu_{21} & \mu_{22} & \mu_{23} & \mu_{2\cdot} \\
\text{Drug 3} & \mu_{31} & \mu_{32} & \mu_{33} & \mu_{3\cdot} \\
\text{Drug 4} & \mu_{41} & \mu_{42} & \mu_{43} & \mu_{4\cdot} \\ \hline
 & \mu_{\cdot 1} & \mu_{\cdot 2} & \mu_{\cdot 3} & \mu_{\cdot\cdot}
\end{array}
\]

In this table, $\mu_{ij}$ is the mean increase in systolic blood pressure associated with drug $i$ and disease $j$, while $\mu_{i\cdot}$ is the mean for drug $i$, $\mu_{\cdot j}$ is the mean for disease $j$, and $\mu_{\cdot\cdot}$ is the overall mean.

If the data are balanced, meaning that there are equal numbers of observations going into the calculation of each mean $\mu_{ij}$, the row means, $\mu_{i\cdot}$, are given by

\[
\mu_{i\cdot} = \frac{\mu_{i1} + \mu_{i2} + \mu_{i3}}{3}
\]

In our case, the data are not balanced, but we define the $\mu_{i\cdot}$ according to that formula anyway. The test for the main effect of drug is the test that

\[
\mu_{1\cdot} = \mu_{2\cdot} = \mu_{3\cdot} = \mu_{4\cdot}
\]

To be absolutely clear, the F test of the term drug, called the main effect of drug, is formally equivalent to the test of the three constraints:

\[
\begin{aligned}
\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} &= \frac{\mu_{21} + \mu_{22} + \mu_{23}}{3} \\
\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} &= \frac{\mu_{31} + \mu_{32} + \mu_{33}}{3} \\
\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} &= \frac{\mu_{41} + \mu_{42} + \mu_{43}}{3}
\end{aligned}
\]

In our data, we obtain a significant F statistic of 9.05 and thus reject those constraints.
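As a rough check on this definition, the unweighted row means can be computed by hand from the cell means shown in the table command output. The hat notation below is introduced here for illustration and is not part of the original text. Because the model includes the full drug#disease interaction, its fitted cell means coincide with the sample cell means, so plugging in the rounded values gives

```latex
\[
\begin{aligned}
\hat\mu_{1\cdot} &= (29.33 + 28.25 + 20.40)/3 \approx 25.99 \\
\hat\mu_{2\cdot} &= (28.00 + 33.50 + 18.17)/3 \approx 26.56 \\
\hat\mu_{3\cdot} &= (16.33 + 4.40 + 8.50)/3 \approx 9.74 \\
\hat\mu_{4\cdot} &= (13.60 + 12.83 + 14.20)/3 \approx 13.54
\end{aligned}
\]
```

These agree, up to rounding, with the margins reported by margins drug, asbalanced. The simple drug means in the Total column of the table output instead weight each cell by its sample size; for drug 1, for example, (6 × 29.33 + 4 × 28.25 + 5 × 20.40)/15 ≈ 26.07.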
Technical note

Stata can display the symbolic form underlying the test statistics it presents, as well as display other test statistics and their symbolic forms; see Obtaining symbolic forms in [R] anova postestimation. Here is the result of requesting the symbolic form for the main effect of drug in our data:

. test drug, symbolic

 drug
          1   -(r2+r3+r4)
          2    r2
          3    r3
          4    r4
 disease
          1    0
          2    0
          3    0
 drug#disease
       1  1   -1/3 (r2+r3+r4)
       1  2   -1/3 (r2+r3+r4)
       1  3   -1/3 (r2+r3+r4)
       2  1    1/3 r2
       2  2    1/3 r2
       2  3    1/3 r2
       3  1    1/3 r3
       3  2    1/3 r3
       3  3    1/3 r3
       4  1    1/3 r4
       4  2    1/3 r4
       4  3    1/3 r4
 _cons         0

This says exactly what we said in the previous technical note.

Technical note

Saying that there is no main effect of a variable is not the same as saying that it has no effect at all. Stata's ability to perform ANOVA on unbalanced data can easily be put to ill use.

For example, consider the following table of the probability of surviving a bout with one of two diseases according to the drug administered to you:

             Disease 1   Disease 2
  Drug 1         1           0
  Drug 2         0           1

If you have disease 1 and are administered drug 1, you live. If you have disease 2 and are administered drug 2, you live. In all other cases, you die.

This table has no main effects of either drug or disease, although there is a large interaction effect. You might now be tempted to reason that because there is only an interaction effect, you would be indifferent between the two drugs in the absence of knowledge about which disease infects you. Given an equal chance of having either disease, you reason that it does not matter which drug is administered to you: either way, your chances of surviving are 0.5.

You may not, however, have an equal chance of having either disease. If you knew that disease 1 was 100 times more likely to occur in the population, and if you knew that you had one of the two diseases, you would express a strong preference for receiving drug 1.
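The arithmetic behind that preference can be sketched as follows, under the assumption that "100 times more likely" means the two diseases occur in a 100:1 ratio, so that, given you have one of them, the chance of disease 1 is 100/101. The symbols $D_1$ and $D_2$ are introduced here for illustration.

```latex
\[
\begin{aligned}
P(\text{survive} \mid \text{drug 1}) &= 1 \cdot P(D_1) + 0 \cdot P(D_2) = \tfrac{100}{101} \approx 0.990 \\
P(\text{survive} \mid \text{drug 2}) &= 0 \cdot P(D_1) + 1 \cdot P(D_2) = \tfrac{1}{101} \approx 0.010
\end{aligned}
\]
```

With equal disease probabilities, both expressions reduce to 0.5, which is the indifference described above.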
When you calculate the significance of main effects on unbalanced data, you must ask yourself why the data are unbalanced. If the data are unbalanced for random reasons and you are making predictions for a balanced population, the test of the main effect makes perfect sense. If, however, the data are unbalanced because the underlying populations are unbalanced and you are making predictions for such unbalanced populations, the test of the main effect may be practically, if not statistically, meaningless.

Example 5: ANOVA with missing cells

Stata can perform ANOVA not only on unbalanced populations, but also on populations that are so unbalanced that entire cells are missing. For instance, using our systolic blood pressure data, let's refit the model eliminating the drug 1–disease 1 cell. Because anova follows the same syntax as all other Stata commands, we can explicitly specify the data to be used by typing the if qualifier at the end of the anova command. Here we want to use the data that are not for drug 1 and disease 1:

. anova systolic drug##disease if !(drug==1 & disease==1)

                  Number of obs =      52     R-squared     =  0.4545
                  Root MSE      = 10.1615     Adj R-squared =  0.3215

        Source |  Partial SS    df       MS           F     Prob > F
  -------------+----------------------------------------------------
         Model |  3527.95897    10  352.795897       3.42     0.0025
               |
          drug |  2686.57832     3  895.526107       8.67     0.0001
       disease |  327.792598     2  163.896299       1.59     0.2168
  drug#disease |  703.007602     5   140.60152       1.36     0.2586
               |
      Residual |  4233.48333    41  103.255691
  -------------+----------------------------------------------------
         Total |  7761.44231    51  152.185143

Here we used drug##disease as a shorthand for drug disease drug#disease.
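One way to see where the degrees of freedom in this table come from is to tabulate the estimation sample and inspect what anova stores. The following is a minimal sketch intended for a do-file; it uses the same dataset and if qualifier as the example, and the comments reflect our reading of the output above rather than additional documented results.

```stata
use http://www.stata-press.com/data/r13/systolic, clear

* Cross-tabulate the sample used in the fit; the drug 1, disease 1 cell is empty,
* leaving 11 nonempty cells
tabulate drug disease if !(drug==1 & disease==1)

* Spelled-out form of the drug##disease model
anova systolic drug disease drug#disease if !(drug==1 & disease==1)

* Model and residual degrees of freedom stored by anova
display e(df_m)
display e(df_r)
```

With 11 nonempty cells and 52 observations, the model has 11 − 1 = 10 degrees of freedom and the residual has 52 − 11 = 41. After the 3 and 2 degrees of freedom for drug and disease, 5 remain for drug#disease, one fewer than the 6 in the complete design.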