Permutation procedures for ANOVA, Regression and PCA

by Christine Storm

Submitted in partial fulfilment of the requirements for the degree Master of Science (Mathematical Statistics) in the Faculty of Natural & Agricultural Sciences, University of Pretoria, Pretoria

2012

© University of Pretoria

Declaration

I, the undersigned, hereby declare that this dissertation, which I hereby submit for the degree Master of Science at the University of Pretoria, is my own work and has not previously been submitted by me for a degree at this or any other tertiary institution.

Signature ______________________ Date ______________________

Acknowledgements

I would like to sincerely thank my supervisor, Dr. L Fletcher, for her extraordinary efforts on my behalf throughout my tenure as an MSc student in the Department of Statistics at the University of Pretoria. Her knowledge and guidance during my research were indispensable.

Summary

Parametric methods are effective and appropriate when data sets are obtained by well-defined random sampling procedures, the population distribution for responses is well defined, the null sampling distributions of suitable test statistics do not depend on any unknown entity, and well-defined likelihood models are provided for nuisance parameters. Permutation testing methods, on the other hand, are appropriate and unavoidable when distribution models for responses are not well specified, are nonparametric or depend on too many nuisance parameters; when ancillary statistics in well-specified distributional models have a strong influence on inferential results or are confounded with other nuisance entities; when the sample sizes are less than the number of parameters; and when data sets are obtained by ill-specified selection-bias procedures.
In addition, permutation tests are useful not only when parametric tests are not possible, but also when more importance needs to be given to the observed data set than to the population model, as is typical, for example, in biostatistics.

The different types of permutation methods for analysis of variance, multiple linear regression and principal component analysis are explored. More specifically, one-way, two-way and three-way ANOVA permutation strategies are discussed. Approximate and exact permutation tests for the significance of one or more regression coefficients in a multiple linear regression model are explained next and, lastly, the use of permutation tests as a means to validate and confirm the results obtained from exploratory PCA is described.

Contents

1. Introduction
2. Notation and Abbreviations
3. Conditionality and Exchangeability
3 Randomization and Permutation
4 When Permutation is Appropriate
5 The Beginnings of Permutations
5.1.1 1920–1939
5.1.2 1940–1959
6 Computational Aspects
6.1.1 1960–1979
6.1.2 1980–1999
6.1.3 2000–2012
7 Optimal Procedures
8 Analysis of Variance
8.1 Introduction
8.2 One-Way Analysis of Variance
8.2.1 The Parametric Approach
8.2.2 The Permutation Approach
8.3 Two-Way Analysis of Variance
8.3.1 The Parametric Approach
8.3.2 Permutation of Raw Data
8.3.3 Permutation under the Reduced Model
8.3.4 Permutation of the Full Model
8.3.5 Permutation for an Exact Test
8.4 Three-Way Analysis of Variance
8.5 Simulation Study
8.6 Numerical Example
9 Multiple Linear Regression
9.1 Introduction
9.2 The Parametric Approach
9.3 Permutation of Raw Data
9.3.1 The Manly Method
9.4 Permutation under the Reduced Model
9.4.1 The Freedman and Lane Method
9.4.2 The Kennedy Method
9.5 Permutation under the Full Model
9.5.1 The Ter Braak Method
9.5.2 The Tantawanich Method
9.6 Permutation for an Exact Test
9.6.1 The Kherad-Pajouh and Renaud Method
9.7 Simulations
10 Principal Component Analysis
10.1 Introduction
10.2 The Principal Component Procedure
10.3 The Permutation Approach
10.4 Two Permutation Strategies
10.5 Numerical Example
11 Appendix
12 References
13 Appendix
13.1 Code for Section 8.3
13.2 Code for Section 8.5
13.3 Code for Section 9.7
13.4 Code for Section 10.5

1. Introduction

Permutation tests are currently regarded as the gold standard against which conventional parametric tests are evaluated. In this document, permutation statistical methods are introduced, a historical chronology of the development of permutation methods is provided and the advantages of permutation methods are detailed. The different types of permutation methods are also described for analysis of variance, multiple linear regression and principal component analysis. These permutation methods are then compared with the traditional parametric tests using examples and simulations.

The population model assumes random sampling from one or more specified populations. Under the population model, the level of statistical significance that results from applying a statistical test to the results of an experiment or a survey corresponds to the frequency with which the null hypothesis would be rejected in repeated random samplings from the same specified population. Because repeated sampling of the true population is usually impractical, it is assumed that the sampling distribution of the test statistic under repeated random sampling conforms to an assumed theoretical distribution, such as the normal distribution. The size of the test, for example 0.05, is the probability under a specified null hypothesis that repeated outcomes based on random samples of the same size are equal to or more extreme than the observed outcome. In the population model, the assignment of treatments to subjects is viewed as fixed, with a stochastic element taking the form of an error that would vary if the experiment were repeated. Probabilities are then calculated based on the potential outcomes of conceptual repeated draws of these errors.
With the permutation approach, a test statistic is computed for the observed data, then the data are permuted over all possible arrangements of the observed data and the test statistic is computed for each equally likely arrangement. An ordered sequence of $n$ exchangeable objects yields $n!$ equally likely arrangements of the $n$ objects. The proportion of arrangements with test statistic values equal to or more extreme than the observed value yields the probability of the observed test statistic. Probabilities are thus calculated over all outcomes associated with the possible assignments of treatments to subjects.

Permutation tests differ from traditional parametric tests in several ways. Permutation tests are data dependent, in that all the information required for analysis is contained within the observed data set. Permutation tests do not assume any underlying theoretical distribution. Permutation tests do not depend on the assumptions associated with traditional parametric tests, such as normality and homogeneity of variance. Permutation tests provide probability values based on the discrete permutation distribution of equally likely test statistic values, rather than an approximate probability value based on a theoretical distribution, such as the normal distribution.

2. Notation and Abbreviations

ANOVA: analysis of variance
PCA: principal component analysis
i.i.d.: independent and identically distributed
$N(\mu, \sigma^2)$: Gaussian or normal variable with mean $\mu$ and variance $\sigma^2$
OLS: ordinary least squares
~: distributed as
$n$: the (finite) sample size
tr( ): the trace of a matrix
$X$: a univariate random variable
$\mathbf{X}$: a multivariate variable or a sample of $n$ units, $\mathbf{X} = \{X_i, i = 1, \ldots, n\}$
$\mathbf{X}^*$: a permutation of $\mathbf{X}$
$F_M^*$: permutation statistic for Manly (1991, 1997, 2007)
$F_E^*$: permutation statistic for Edgington (2007)
$F_{SW}^*$: permutation statistic for Still and White (1981)
$F_J^*$: permutation statistic for Jung et al. (2006)
$F_{FL}^*$: permutation statistic for Freedman and Lane (1983)
$F_K^*$: permutation statistic for Kennedy (1995)
$F_{TB}^*$: permutation statistic for Ter Braak (1992)
$F_T^*$: permutation statistic for Tantawanich (2006)
$F_{KR}^*$: permutation statistic for Kherad-Pajouh and Renaud (2010)

3. Conditionality and Exchangeability

For most problems of hypothesis testing, the observed data set $\mathbf{x} = \{x_1, \ldots, x_n\}$ is usually obtained by an experiment performed $n$ times on a population variable $X$. For the purposes of analysis, the data set $\mathbf{x}$ is generally partitioned into groups or samples, according to the treatment levels of the experiment. For any general testing problem, under the null hypothesis, which assumes that the data come from only one (with respect to groups) unknown population distribution $P$, the whole set of observed data $\mathbf{x}$ is considered to be a random sample.

Pesarin (2001) defines nonparametric distributions as follows: a family of distributions $\mathcal{P}$ is said to behave nonparametrically when we are not able to find a parameter $\theta$, belonging to a known finite-dimensional parameter space $\Theta$, such that there is a one-to-one relationship between $\Theta$ and $\mathcal{P}$, in the sense that each member of $\mathcal{P}$ is identified by only one member of $\Theta$ and vice versa. This definition includes families of distributions which are either unspecified, or specified except for an infinite number of unknown parameters.

All nonparametric families $\mathcal{P}$ which are of interest in permutation analysis are assumed to be sufficiently rich, in the sense that if $x$ and $x'$ are any two points, then $x \neq x'$ implies $f_P(x) \neq f_P(x')$ for at least one $P \in \mathcal{P}$, except for points with null density. The characterisation of a family $\mathcal{P}$ as being nonparametric essentially depends on the knowledge we assume about it. When we assume that the underlying family $\mathcal{P}$ contains all continuous distributions, the data set $\mathbf{x}$ is sufficient. By sufficiency we mean that $\mathbf{x}$ and $f_P$ contain essentially the same amount of information with respect to $P$.
They are equivalent for inferential purposes. The same conclusion is obtained if the sample distribution is assumed to be invariant with respect to permutations of the arguments of $\mathbf{x}$. This happens when the assumption of independence for observable data is replaced by that of exchangeability:

$$f(x_1, \ldots, x_n) = f(x_{u_1^*}, \ldots, x_{u_n^*}),$$

where $(u_1^*, \ldots, u_n^*)$ is any permutation of $(1, \ldots, n)$. The data sets under the null hypothesis always contain a finite number of points, as $n$ is finite (Good, 2005).
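Under exchangeability, every rearrangement of the pooled observations is equally likely under the null hypothesis, which is what justifies the permutation approach outlined in the Introduction. The following is a minimal Python sketch of an exact test for two groups. The function name `exact_permutation_pvalue`, the choice of the absolute difference in group means as test statistic, and the toy data are illustrative assumptions and are not taken from the dissertation. Rather than listing all n! orderings, the sketch enumerates the distinct splits into two groups; each split corresponds to the same number of orderings, so the resulting proportion is unchanged.

```python
from fractions import Fraction
from itertools import combinations


def exact_permutation_pvalue(group_a, group_b):
    """Exact permutation p-value for the absolute difference in group means.

    Hypothetical helper for illustration only; it assumes the pooled
    observations are exchangeable under the null hypothesis and that
    both groups are non-empty.
    """
    pooled = list(group_a) + list(group_b)
    n, n_a = len(pooled), len(group_a)
    total = sum(pooled)

    def statistic(indices_a):
        # Absolute difference between the two group means. Fractions keep
        # the arithmetic exact for integer data, so arrangements that tie
        # with the observed value are counted reliably.
        sum_a = sum(pooled[i] for i in indices_a)
        mean_a = Fraction(sum_a) / n_a
        mean_b = Fraction(total - sum_a) / (n - n_a)
        return abs(mean_a - mean_b)

    # The first n_a pooled values are the observed group A.
    observed = statistic(range(n_a))

    # Each choice of n_a indices is one distinct assignment to group A.
    arrangements = list(combinations(range(n), n_a))
    extreme = sum(1 for idx in arrangements if statistic(idx) >= observed)

    # Proportion of arrangements at least as extreme as the observed one.
    return Fraction(extreme, len(arrangements))


if __name__ == "__main__":
    # Only 2 of the 20 possible splits are as extreme as the observed one.
    print(exact_permutation_pvalue([1, 2, 3], [4, 5, 6]))  # prints 1/10
```

Full enumeration is feasible only for small samples, since the number of splits grows combinatorially with n; for larger data sets a random subset of arrangements is usually drawn instead, which yields an approximate rather than exact probability.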