Archeologia e Calcolatori 21, 2010, 211-228

CORRESPONDENCE ANALYSIS IN R FOR ARCHAEOLOGISTS: AN EDUCATIONAL ACCOUNT

M.J. Baxter, H.E.M. Cool

1. Introduction

We have previously published papers that involve the use of a statistical technique, Correspondence Analysis (CA), for comparing assemblages of finds across different sites. The positive response from archaeological colleagues with similar concerns to those we addressed has been encouraging, but it is apparent that many of these colleagues – particularly those not located within a university (and therefore without access to effective but costly statistical software and easy access to expert statistical advice) – have no problems in understanding how CA works, but do have problems implementing it. The main purpose of this article is to try to indicate how CA can be implemented in a software package, R, by an archaeologist prepared to invest some effort in exploring the technique.

R is an open source (that is to say, free) statistical package. It has been developed by and for experienced statisticians so is not necessarily easy to use by those without some statistical training or guidance. We aim to provide the latter.

A brief and non-technical account of CA is given in the next section, but the ideal reader of this paper will already know about CA and will be more interested in how to use it. Expositions of CA written for an archaeological readership include Shennan (1997, 308-341), Baxter (1994, 100-139) and Baxter (2003, 136-146). Greenacre (1984) provides a thorough mathematical account. Many articles on CA applications to archaeological case studies have also been published in this Journal (for a theoretical account see lastly Djindjian 2009).

The next section includes a brief description of the aims of CA and the way its use has developed in archaeology. Following this, some information about the software, R, is given.
The remainder of the paper illustrates how R can be used for CA, using archaeological data, and concludes with advice on some practical issues of implementation. We have trialled the instructions using a PC running Windows XP Pro with a 2.2 Mb Broadband connection.

2. Correspondence Analysis

At its simplest, CA can be viewed as a statistical technique for visualising a table of non-negative numbers. As a concrete example, suppose information has been collected from r sites (or contexts) and a count has been made of the numbers of each of c artefact types (or more generally finds) present within the context. The results can be collected in tabular form, where each of the r rows corresponds to a context, and each of c columns corresponds to an artefact type or find. A natural question that arises with this kind of data is to ask how similar contexts are in terms of the profile of finds within them. It is also of interest to ask how similar the profile is of find types across sites.

Essentially CA reduces a table of data to two maps (or plots). In the first map, points on the plot correspond to the rows of the table (i.e. the contexts). Points on the plot that are close to each other can be identified with contexts that have a similar profile in terms of their finds assemblage; points which are very distant correspond to contexts which have very different assemblage profiles. In the second map, points correspond to the columns of the table (i.e. the finds type), and points which are close together identify finds which have a similar distribution across sites. The two maps can be “superimposed” and viewed together they allow the similarities and differences between contexts to be assessed. Examples will follow.

The way CA has developed in archaeology is quite curious and the story is described in some detail, up to 1992, in Baxter (1994, 133-139).
The use of CA for the purpose of seriation (not something discussed much further in this paper) is now common. An early application, appearing in a statistical journal and having little immediate influence on archaeological practice, was Hill (1974). The «World Archaeology» paper by Bølviken et al. (1982) is often credited in the archaeological literature for introducing CA to archaeology, but this ignores contributions in the French-language literature dating from the mid-1970s, much of it associated with the work of François Djindjian (see Baxter 1994, 134 for details).

Given that CA is an obviously useful method, its diffusion into certain sections of the archaeological community was painfully slow. Orton (1999) identified CA as the most important statistical technique introduced into archaeology in the 1980s. Baxter (1994, 134) credits Ringrose (1988) as being the first fully fledged British application of CA that post-dated Bølviken et al. (1982), and Ringrose was a statistician, so regular British use of CA by archaeologists only really dates from about the early 1990s (Baxter disqualified a slightly earlier British contribution on the grounds that it was assisted by an Australian).

Given the contribution of North American scholarship to the development of quantitative methodology in archaeology, the length of time it took for CA to penetrate the North American literature borders on the astonishing. Cowgill (2001) has said that CA was «virtually unheard of» in the US until the late 1980s; Baxter (1994, 135) was unable to identify much in the way of American usage of CA up to about 1992; and Duff (1996, 90), at a comparatively late date, was able to write that CA «was not well established in Americanist literature».

The situation has changed now (to some extent), but why the comparative neglect?
For a start, many archaeologists, even if they have been exposed to some training in quantitative methodology, are averse to statistics and – even when they acknowledge the potential usefulness of statistical methods – lack the confidence to use them. Collaboration between archaeologists and statisticians is the obvious solution to this problem, but a lot of archaeologists do not have ready access to a usable statistician.

3. Simple Correspondence Analysis in R

3.1 Getting the package

The “Comprehensive R Archive Network” (CRAN) at http://cran.r-project.org/ provides a great deal of information on R. The information summarised below was current in August 2010, but as R is constantly updated details will change. Here version 2.11.1 is used, and it is assumed that this is to be installed on a Windows platform.

The route Windows → base → Download R 2.11.1 for Windows provides access to the package. Downloading the file places the application R-2.11.1-win32 in the folder of your choice. This can be installed by clicking on it and running the Windows Installation Wizard as normal. Accepting defaults will load R in C:/Program Files/R/R-2.11.1 and create a desktop icon (R 2.11.1) that can be used to launch the package. The file to be downloaded is c. 33 Mb and the complete download and installation should take only a minute or so with a Broadband connection. Please note that, depending on how the firewalls on your computer are set up, you may have to override them sometimes, for example when installing the library packages discussed below.

The package operates from typed lines of command entered after the > prompt. It is case-sensitive so it is important to use upper case letters where shown. When you start using the package you are likely to encounter error messages because of mistakes in your typing. If you see «Error: syntax error in …», it will often be because of quite simple mistakes, such as either omitting a space or introducing one accidentally.
An error message «object ….. not found» often means the double quotation marks have been omitted. Plots open in separate windows so the window where you type the commands (R console) is best kept in a slightly minimised mode so that you can see them.

3.2 Data entry

This can be done in more than one way. For illustration, data from Table 1 of Cool and Baxter (1999) are used. For 18 sites the amounts of six glass vessel types are quantified by estimated vessel equivalents (EVEs). The sites differ in date and type. The data are given in Table 1 in the Appendix.

If the data are set up in EXCEL (including variable names) highlight the data you wish to use; within EXCEL use Edit → Copy and then, within R, create a data file JRA1 using

> JRA1 <- read.table(file = "clipboard", header = TRUE)

and type

> JRA1

to see the data¹.

An alternative method of data entry is to create a plain text file (.txt) with a plain-text editor (Windows NotePad in this instance, which can be found in Accessories in the Windows Start menu). Note that column (variable) headings are provided. For illustration purposes the file is named JRA1 and placed in a folder called CAinArch in My Documents. The path for this will be similar to

"C:/Documents and Settings/Your Name/My Documents/CAinArch/"

where the element Your Name relates to whatever name your system is set up under.

The file can be read in after the R prompt, >, using the read.table function. This contains two components within the brackets; the first specifies the path to the file and the second, header = TRUE, indicates that variable names are to be expected in the first line of the file.

> JRA1 <- read.table("C:/Documents and Settings/Your Name/My Documents/CAinArch/JRA1.txt", header = TRUE)

Note that the path is enclosed in straight double quotation marks and uses forward slashes (/) (this is essentially what was done to import the file from EXCEL, where the copy command within EXCEL placed the data on the "clipboard"). An error message will occur if there are problems. If there are none type

> JRA1

to see the data. For full help on the function type

> ?read.table

These notes mainly provide information on what is needed to get started. It is worth getting into the habit of using the help facility [help(read.table) is an alternative to ?read.table] to see what else is available. For example, read.csv and read.delim can be used with comma separated variables and Tab delimited variables respectively.

¹ The writers of the R manual for data import/export prefer you to write the EXCEL file to a Tab or comma-separated file and use read.delim or read.csv (see below).

3.3 Packages in R

By default R comes with a “base” statistics package, but to make the most of it you need to be able to access packages of functions that are either bundled with R or contributed by users and accessible from R. In the first instance we shall use the bundled package MASS, associated with the book Modern Applied Statistics with S by Venables and Ripley (2002). Load this either by typing

> library(MASS)

(R is case-sensitive so it is important to use capital letters) or, from the menu

Packages → Load Package → MASS

This latter route will show you other available packages. To get help on the library function use ?library; to get help on a particular package, for example MASS,

> library(help = MASS)

will provide basic information. Access to non-bundled packages is discussed later.

3.4 Simple Correspondence Analysis

At this stage a data set has been created, and the MASS package loaded. For simple CA the function corresp is available, and ?corresp provides help on this. Using

> JRAca1 <- corresp(JRA1, nf = 2)
Warning message:
negative or non-integer entries in table in: corresp.matrix(as.matrix(x), ...)
> biplot(JRAca1)

gives Fig. 1, which may be compared with Fig. 3 in Cool and Baxter (1999).
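For readers without the original data to hand, the same sequence can be tried on a small invented table. This is only a minimal sketch: the counts and the names toy and toyca are hypothetical and are not taken from Cool and Baxter (1999). It assumes the MASS package is installed, as it is with a standard R installation.

```r
library(MASS)  # bundled with R; provides corresp()

# Hypothetical counts: 4 contexts (rows) by 3 find types (columns).
# These numbers are invented purely for illustration.
toy <- data.frame(
  Cup  = c(12, 10, 2, 1),
  Bowl = c(5, 6, 7, 8),
  Jar  = c(1, 2, 10, 12),
  row.names = c("S1", "S2", "S3", "S4")
)

toyca <- corresp(toy, nf = 2)  # simple CA, keeping two axes
toyca$rscore                   # coordinates for the rows (contexts)
toyca$cscore                   # coordinates for the columns (find types)
biplot(toyca)                  # both sets of points on one map
```

Because the invented counts are whole numbers, no warning about non-integer entries should appear here. The rscore and cscore components hold the coordinates that biplot draws.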
The analysis can be done in a single line using

> biplot(corresp(JRA1, nf = 2))

The warning message can be ignored as it is simply alerting us to the fact that some of the data are non-integer, without stopping calculations. In the present context there is no problem with non-integer numbers, but some software will not allow this and requires data manipulation (e.g., multiplication by some power of 10) before analysis can proceed. To save a figure, from within R use File → Save as and select from the file formats available.

Numbers in the figure label the rows of the data set from 1 to 18, and column names are also given. Before interpreting the results, note that we have not used any information about site date. To do this, create a new variable, date, as follows

> date <- c(1,1,1,1,1,1,1,1,2,2,4,4,4,4,4,4,4,4)

where 1 = sites of the 1st/2nd century AD, 2 = sites of the 2nd/3rd century, and 4 = sites of the 4th century, and use

> biplot(JRAca1, xlabs = date)

which gives Fig. 2. The addition of xlabs = date results in points corresponding to rows being labelled by the numbers in date. The plot shows quite nicely that later assemblages have a composition distinct from earlier assemblages, and that this is largely attributable to the relatively higher proportion of cups present in later assemblages. Note that there is no need to re-do the CA, since the results from this are held in the “object” JRAca1 previously created.

The appearance of the plot is not complicated here, and the message is quite clear. For larger data sets and/or longer labels for the variables, plots like those of Figs. 1 and 2 can become overcrowded and difficult to read, and we often prefer to present plots for rows and columns separately. This is now done, where the opportunity is also taken to simplify labelling of the types.
The latter is not really necessary here but, for illustration, can be done using

> type <- c("C", "Bw", "Ja", "F", "Ju", "Bt")

Note that, because the labels are names rather than numbers, they have to be enclosed in double quotation marks. To get the row plot use

> biplot(JRAca1, xlabs = date, ylabs = rep(" ", 6))

where ylabs = rep(" ", 6) makes 6 copies of a blank label. For the column plot use

> biplot(JRAca1, xlabs = rep(" ", 18), ylabs = type)

As given, two separate plots are produced. If they are sandwiched between the directives

> par(mfrow = c(1,2))

and

> par(mfrow = c(1,1))

Fig. 3 results. The commands would thus look like this on the worksheet

> par(mfrow = c(1,2))
> biplot(JRAca1, xlabs = date, ylabs = rep(" ", 6))
> biplot(JRAca1, xlabs = rep(" ", 18), ylabs = type)
> par(mfrow = c(1,1))

The first usage of par() produces a plotting region of 1 “row” and two “columns”, and the second usage restores things to normal. Such multiple plots are not always satisfactory and some manipulation of plot parameters (beyond the scope of the present article) may be needed before getting aesthetically pleasing and informative results.

3.5 Other packages

A large number of user-written packages are available for R. To access those not automatically bundled with R go to Packages on the tool bar, and select a download site with

Packages → Set CRAN mirror

then

Packages → Install package(s)

and select the package of choice. Here we select ade4. Once the package is downloaded type

> library(ade4)

to access the functions within it. The quality of documentation and transparency of use for different packages is variable. Some come with extensive and helpful documentation; others with little at all. Apart from information available via CRAN, judicious use of Google is often very helpful.

The sequence

> JRAca2 <- dudi.coa(JRA1, scannf = FALSE)
> scatter(JRAca2)

produces Fig.
4, which may be compared with Fig. 1. The shaded bars in the plot to the top left show the relative importance of the first two CA axes compared to the rest, while d = 0.5 in the top-right defines the scale, the length of the side of a grid square being 0.5. Replacing the scatter() directive with

> s.label(JRAca2$li, label = date)

produces Fig. 5, which shows the row plot labelled by date. Using

> s.label(JRAca2$co, label = type)

produces the variable plot (not shown). The $li and $co parts in the above code identify plotting coordinates held in the object JRAca2 originally created.

You can do some quite clever things with a little experimentation (suggested by the help facilities and material found via Google). For example,

> bet <- between(JRAca2, as.factor(date), scannf = FALSE)
> s.class(bet$ls, as.factor(date), xax = 1, yax = 2)

produces Fig. 6 in which the ellipses emphasise the separation of the date groups. As with many R functions there is a lot of control over the appearance and labelling, and what is presented here is basic. Use the help() directive described earlier to see what is available.

3.6 Seriation and detrended CA

A common use of CA is for seriation (Madsen 1988 provides numerous, and in some cases idealized, examples). Usually results from a CA are presented as a two-dimensional graph from which it is hoped that a one-dimensional ordering, interpretable as a temporal “gradient”, can be read off. In some instances, as in the example used here, a “gradient” can be interpreted as a spatial one.

To fix ideas, Table 2 reproduces Table 5 from Cool and Baxter (1999). This shows EVE values for seven glass drinking vessel types, from contexts dating to the later 1st century AD.
The contexts are ordered from north to south – Carlisle to Fishbourne – with the first three from the north, the next two from the Midlands, and the remaining five from the south (we are aware that the numbers are rather small, and discuss the more general issue of sample size in a later section). Calling the data set Flavian, and using corresp from the MASS library

> Flavianca1 <- corresp(Flavian, nf = 2)
> biplot(Flavianca1, ylabs = rep(" ", 7))

Fig. 7, showing the plot for contexts, results. Labels correspond to the order of contexts in Table 2.

The figure has an approximate “horseshoe” shape, which is what is usually hoped for. We can read round the horseshoe from 10 in the bottom left to 3 in the bottom right to get a one-dimensional ordering which in this case corresponds, more-or-less, to the ordering on the first axis (the positioning of context 1, which lies off the horseshoe, is a little ambiguous). With the exception of context 2 (York), which is a bit out of order, the ordering corresponds to a south-north gradient and, in conjunction with the plot for vessel types, Cool and Baxter (1999, 90-91) interpreted this as evidence of regionality in the assemblages. (It was argued that the southern sites were characterised by newer Flavian forms, with the northern and Midland sites characterised by older Claudio-Neronian forms – these differences not being related to site type.)

CA as used here suggests relative chronological or spatial ordering. Ecologists, and others, who have used and developed CA extensively, would sometimes like to be able to interpret distances between points on the graph in absolute terms. Characteristically, and with larger data sets, there is also bunching at the terminals of the horseshoe that can hamper interpretation.
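Reading a one-dimensional ordering off the first axis can also be done numerically. The following is a minimal sketch on an invented, idealised table (not the Flavian data); the names ideal, shuffled, ser and ordering are hypothetical. It assumes the MASS package and uses the rscore component returned by corresp.

```r
library(MASS)

# Hypothetical abundance table with an idealised "seriated" structure:
# each type peaks in one context and tails off on either side.
ideal <- matrix(c(5, 3, 1, 0, 0,
                  3, 5, 3, 1, 0,
                  1, 3, 5, 3, 1,
                  0, 1, 3, 5, 3,
                  0, 0, 1, 3, 5),
                nrow = 5, byrow = TRUE,
                dimnames = list(paste0("ctx", 1:5), paste0("type", 1:5)))

# Store the rows in a scrambled order, as they might arrive from the field.
shuffled <- ideal[c(3, 1, 5, 2, 4), ]

ser <- corresp(shuffled, nf = 2)

# Sorting the contexts on their first-axis scores gives a
# one-dimensional ordering.
ordering <- rownames(shuffled)[order(ser$rscore[, 1])]
ordering
```

The sign of a CA axis is arbitrary, so the ordering may run from ctx1 to ctx5 or the reverse; which end is "early" or "northern" has to be decided on archaeological grounds. Real data will rarely be as tidy as this invented table.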
Detrended Correspondence Analysis (DCA) attempts to rectify these two problems, the lack of an absolute interpretation of distances and the bunching at the ends of the horseshoe (for a more extensive discussion of DCA and seriation in archaeology see Lockyear 2000a). For archaeological applications to seriation problems we do not view the horseshoe as a problem (in fact achievement of the horseshoe effect is often seen as evidence of success of a seriation). Some aspects of DCA methods, which can be thought of as algorithms to “unbend” the horseshoe, have been considered to be arbitrary (see Baxter 2003, 139-40 for a brief discussion), and DCA is discussed here primarily to further illustrate the potential power of R and for the benefit of those archaeologists who may wish to explore it.

The function decorana in the library vegan may be used. This library will need to be downloaded in the same way as ade4 was. Then

> library(vegan)
> Flaviandca <- decorana(Flavian)
> plot(Flaviandca)

results in Fig. 8. We don’t think this adds anything to the previous analysis.

In first using R it will often be the case that, as above, the simplest of analyses (that will often be suggested by examples given in the help for a function) are adequate. To improve graphs – for publication purposes, for example – a little experimentation usually helps. It does not take long, for instance, to work out that

> plot(Flaviandca, display = "sites")

drops the vessel-type labelling from the previous plot. If you prefer the sites to have names rather than numbers create a variable, for example sitenames, as discussed previously, and use

> plot(Flaviandca, display = "sites", type = "n")
> text(Flaviandca, display = "sites", sitenames)

Here type = "n" produces a blank plot and the text() function adds the desired labels.

4. Some practicalities

4.1 Dealing with small numbers

By “small numbers” we mean “small” row and column totals. It is possible, though not inevitable, for such small numbers to have an adverse effect on a CA display and interpretation.
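Before deciding how to treat small totals it helps to look at them. A minimal sketch using base R only; the table tab and its counts are invented for illustration, and usable is a hypothetical name for the reduced table.

```r
# Hypothetical table: context B is empty and Flask is very rare.
tab <- data.frame(
  Cup   = c(8, 0, 5, 2),
  Bowl  = c(4, 0, 6, 9),
  Flask = c(0, 0, 1, 0),
  row.names = c("A", "B", "C", "D")
)

rowSums(tab)  # totals per context
colSums(tab)  # totals per find type

# Keep only rows and columns with a non-zero total.
usable <- tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]
usable
```

Only the empty context B is dropped here. The rare Flask column is retained, and whether to keep it, omit it or merge it with another type remains an archaeological judgement.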
In the most extreme case, with a zero total, the row or column affected cannot be used at all. With small but non-zero totals, omitting offending rows (columns) is an obvious possibility. It is also legitimate to amalgamate rows (columns) to obtain larger totals, providing the newly defined rows (columns) have a legitimate archaeological interpretation.

Our preferred approach is to retain all the data, in the first instance. This is because, in some applications, numbers are inevitably small and it seems wasteful to throw them away without first seeing if they nevertheless have a useful story to tell. Table 2 is an example. If there are problems (see below) then the courses of action already alluded to are available.

4.2 Dealing with outliers

One problem that can arise with small totals is that the associated row (column) marker on a plot appears as an outlier. An outlier is a point that lies at some distance from other points on a CA plot. It can represent a “rogue” data point, possibly arising from small numbers, or may be genuine and associated with a large total but simply very different from other rows (columns).

Whatever the cause, a major problem is that a serious outlier will determine the scale of a plot, possibly causing other points to bunch together, obscuring interpretable pattern in them. In our view it is almost always sensible to re-do a CA omitting obvious outliers, to see what patterns – if any – have been obscured in the remaining data. This is related to, but separate from, the final presentation and interpretation of results. This will be determined by those analyses which tell the most informative story, and could, for example, involve plots both with and without outliers, or only the latter, with
