PCA and K-Means Decipher Genome 8 0 Alexander N. Gorban1,3 and Andrei Y. Zinovyev2,3 0 2 1 University of Leicester, University Road, Leicester, LE1 7RH, UK, n [email protected] a 2 Institut Curie, 26, rue d’Ulm,Paris, 75248, France, J [email protected] 5 3 InstituteOf Computational Modeling of Siberian Branch of Russian Academy of Science, Krasnoyarsk, Russia ] M Summary. In this paper, we aim to give a tutorial for undergraduate students Q studying statistical methods and/or bioinformatics. The students will learn how data visualization can help in genomic sequence analysis. Students start with a . o fragment of genetic text of a bacterial genome and analyze its structure. By means i ofprincipalcomponentanalysisthey“discover”thattheinformationinthegenome b - isencoded bynon-overlappingtriplets.Next,theylearn howtofindgenepositions. q This exercise on PCA and K-Means clustering enables active study of the basic [ bioinformatics notions. Appendix 1 contains program listings that go along with this exercise. Appendix 2 includes 2D PCA plots of triplet usage in moving frame 2 v foraseriesofbacterialgenomesfromGC-poortoGC-richones.Animated3DPCA 3 plots are attached as separate gif files. Topology (cluster structure) and geometry 1 (mutualpositions of clusters) of these plots dependsclearly on GC-content. 0 4 0 Key words: Bioinfomatics; Data visualization; Cryptography; Clustering; 5 Principal component analysis 0 / o i 1 Introduction b - q Whenitisclaimedinnewspapersthatanew genomeisdeciphered,itusually : v means that the sequence of the genome has been read only, so that a long i X sequence using four genetic letters: A, C, G and T is known. The first step in complete deciphering of the genome is identification of the positions of r a elementarymessagesinthistextordetecting genesinthebiologicallanguage. This is imperative before trying to understandwhat these messagesmeanfor a living cell. Bioinformatics – and genomic sequence analysis, in particular – is one of the hottest topics in modern science. The usefulness of statistical techniques inthis fieldcannotbe underestimated.Inparticular,a successfulapproachto 2 A.N.Gorban, and A.Y.Zinovyev theidentificationofgenepositionsisachievedbystatisticalanalysisofgenetic text composition. Inthisexercise,weuseMatlabtoconvertgenetictextintoatableofshort wordfrequencies andto visualize this table using principal componentanaly- sis(PCA).Studentscanfind,bythemselves,thatthesequenceoflettersisnot random and that the information in the text is encoded by non-overlapping triplets.Usingthe simplestK-Meansclusteringmethodfromthe MatlabSta- tistical toolbox, it is possible to detect positions of genes in the genome and even to predict their direction. 2 Required Materials To follow this exercise, it is necessary to prepare a genomic sequence. We providea fragmentofthe genomicsequence ofCaulobacter Crescentus.Other sequencescanbedownloadedfromtheGenbankFTP-site[1].Ourprocedures work with the files in the Fasta format (the corresponding files have a .fa extension) and are limited to analyzing fragments of 400–500kb in length. Five simple functions are used: LoadFreq.m loads Fasta-file into a Matlab string CalcFreq.m converts a text into a numerical table of short word frequencies PCAFreq.m visualizes a numerical table using principal component analysis ClustFreq.m is used to perform clustering with the K-Means algorithm GenBrowser.m is used to visualize the results of clustering on the text Allsequencefilesandthem-filesshouldbe placedintothecurrentMatlab working folder. The sequence of Matlab commands for this exercise is the following: str = LoadSeq(’ccrescentus.fa’); xx1 = CalcFreq(str,1,300); xx2 = CalcFreq(str,2,300); xx3 = CalcFreq(str,3,300); xx4 = CalcFreq(str,4,300); PCAFreq(xx1); PCAFreq(xx2); PCAFreq(xx3); PCAFreq(xx4); fragn = ClustFreq(xx3,7); GenBrowser(str,300,fragn,13000); All the required materials can be downloaded from [2]. PCA and K-Means DecipherGenome 3 3 Genomic Sequence 3.1 Background The information that is needed for a living cell functioning is encoded in a long molecule of DNA. It can be presented as a text with an alphabet that has only four letters A, C, G and T. The diversity of living organisms and their complex properties is hidden in their genomic sequences. One of the most exciting problems in modern science is to understand the organization of living matter by reading genomic sequences. One distinctive message in a genomic sequence is a piece of text, called a gene. Genes can be oriented in the sequence in the forward and backward directions (see Fig. 1). This simplified picture with unbroken genes is close to reality for bacteria. In the highest organisms (humans, for example), the notion of a gene is more complex. Itwasoneofmanygreatdiscoveriesofthetwentiethcenturythatbiological information is encoded in genes by means of triplets of letters, called codons in the biological literature. In the famous paper by Crick et al. [3], this fact was proven by genetic experiments carried out on bacteria mutants. In this exercise, we will analyze this by using the genetic sequence only. In nature, there is a special mechanism that is designed to read genes. It is evident that as the information is encoded by non-overlapping triplets, it is important for this mechanism to start reading a gene without a shift, from the first letter of the first codon to the last one; otherwise, the information decoded will be completely corrupted. An easy introduction to modern molecular biology can be found in [4]. 3.2 Sequences for the Analysis The work starts with a fragment of genomic sequence of the Caulobacter Crescentusbacterium.Ashortbiologicaldescriptionofthis bacteriumcanbe foundin[5].Thesequenceisgivenasalongtextfile(300kb),andthestudents areaskedtolookatthe fileandensurethatthe textusesthe alphabetoffour letters (a, c, g and t) and that these letters are used without spaces. It is noticeable that, although the text seems to be random, it is well organized, but we cannot understand it without special tools. Statistical methods can help us understand the text organization. The sequence can be loaded in the Matlab environment by LoadSeq func- tion: str = LoadSeq(’ccrescentus.fa’); 4 A.N.Gorban, and A.Y.Zinovyev 1) cgtggtgaATGgatgctagggcgcacgTAGtgagctgatgctctagcgacgtggtgagctgcatctagg gcaccacttacctacgatcccgcgtgcatcactcgactacgaGATcgctgcaccactcgacGTAgatcc 2) cgtggtgaATGgatgctagggcgcacgTAGtgagctgatgctCTAgcgacgtggtgagctgCATctagg 3) cgtggtgaATGgatgctagggcgcacgTAGtgagctgatgctCTAgcgacgtggtgagctgCATctagg a) d) c) b) Fig. 1. 1) DNA can be represented as two complementary text strings. “Comple- mentary” here means that instead of any letter “a”,“t”,“g” and “c” in one string therestands“t”,“a”,“c” and“g”, respectively,intheotherstring.Elementary mes- sages or genes can be located in both strings, but in the lower one they are read from right to the left. Genes usually start with “atg” and end with “tag” or “taa” or“tga”words.2)Indatabasesoneobtainsonlytheupper“forward”string.Genes from the lower “backward” string can be reflected on it, but should be read in the oppositedirectionandchangingtheletters,accordinglytothecomplementaryrules. 3)Ifwetakearandomlychosenfragment,itcanbeofoneofthreetypes:a)entirely in a gene from the forward string; b) entirely in a gene from the backward string; c) entirely outside genes; d) partially in genes and partially outside genes 4 Converting Text to a Numerical Table Awordisanycontinuouspieceoftextthatcontainsseveralsubsequentletters. As there are no spaces in the text, separationinto words is not unique. The method we use is as follows.We clip the whole text into fragments of 4 300 letters in length and calculate the frequencies of short words (of length 1–4) inside every fragment. This will give us a description of the text in the formof a numericaltable. There will be four suchtables for everyshortword length from 1 to 4. As there are only four letters, there are four possible words of length 1 (singlets),16=42possiblewordsoflength2(duplets),64=43possiblewords of length 3 (triplets) and 256 = 44 possible words of length 4 (quadruplets). The first table contains four columns (frequency of every singlet) and the number of rows equals the number of fragments. The second table has 16 columns and the same number of rows, and so on. 4 Mean gene size in bacteria is about 1000 genetic letters, the fragment length in 300 letters corresponds well to detect genes on thisscale, with some resolution PCA and K-Means DecipherGenome 5 To calculate the tables, students use the CalcFreq.m function. The first input argument for the function CalcFreq is the string containing the text, the second input argument is the length of the words to be counted, and the third argument is the fragment length. The output argument is the resulting table of frequencies. Students use the following set of commands to generate tables corresponding to four different word lengths: xx1 = CalcFreq(str,1,300); xx2 = CalcFreq(str,2,300); xx3 = CalcFreq(str,3,300); xx4 = CalcFreq(str,4,300); 5 Data Visualization 5.1 Visualization PCAFreq.m function has only one input argument, the table obtained from the previous step. It produces a PCA plot for this table (PCA plot shows distribution of points on a principal plane, with the x axis corresponding to the projection of the point on the first principal component and the y axis corresponding to the projection on the second one). For an introduction to PCA, see [6]. By typing the commands below, students produce four plots for the word lengths from 1 to 4 (see Fig. 2). PCAFreq(xx1); PCAFreq(xx2); PCAFreq(xx3); PCAFreq(xx4); 5.2 Understanding Plots The main message in these four pictures is that the genomic text contains informationthatis encodedbynon-overlappingtriplets, becausethe plotcor- responding to the triplets is evidently highly structured as opposed to the pictures of singlets, duplets and quadruplets. The triplet picture evidently contains 7 clusters. It is important to explain to students how 7-cluster structure in Fig. 1 occurs in nature. Let us suppose that the text contains genes and that information is en- coded by non-overlapping subsequent triplets (codons), but we do not know where the genes start and end or what the frequency distribution of codons is. 6 A.N.Gorban, and A.Y.Zinovyev PCA plot for text fragments PCA plot for text fragments 4 8 3 n=1 n=2 6 2 1 4 0 2 −1 −2 0 −3 −2 −4 −5 −4 a) −6 −4 −2 0 2 4 b) −6 −4 −2 0 2 4 6 8 PCA plot for text fragments PCA plot for text fragments 6 n=3 2 n=4 4 0 2 −2 0 −4 −2 −6 −4 −8 −10 −6 c) −8 −6 −4 −2 0 2 4 6 8 d) −10 −5 0 5 Fig. 2. PCA plots of word frequencies of different length. c) Shows the most structured distribution.The structurecan beinterpreted as theexistence of anon- overlapping triplet code Letusblindly cutthetextintofragments.Anyfragmentcancontain:a)a piece of a gene in the forwarddirection; b) a piece of a gene in the backward direction;c)nogenes(non-codingpart);d)oramixofcodingandnon-coding parts. Consider the first case (a). The fragmentcan overlapwith a gene in three possible ways,with three possible shifts (mod3) of the first letter of the frag- ment with respect to the start of the gene. If we enumerate the letters in the gene form the first one, 1-2-3-4-..., then the first letter of the fragment can be in the sequence 1-4-7-10-... (=1(mod(3)) (a “correct” shift), the sequence 2-5-8-11-... (=2(mod(3)), or 3-6-9-12-...(=0(mod(3)). If we start to read the information triplet by triplet, starting from the first letter of the fragment, we can read the gene correctly only if the fragment overlaps with it with a correct shift (see Fig. 1). In general, if the start of the fragment is not cho- sen deliberately, then we canreadthe gene in three possibleways.Therefore, this firstcase(a)generatesthreepossiblefrequencydistributions,eachoneof which is “shifted” in relation to another. The second case (b) is analogous and also gives three possible triplet dis- tributions. They are not independent of the ones obtained in the first cases PCA and K-Means DecipherGenome 7 for the following reason. The difference is the triplets are read “from the end to the beginning” which produces a kind of mirror reflection of the triplet distributions from the first case (a). The third case (c) produces only one distribution, which is symmetrical with respect to the ‘shifts’ (or rotations) in the first two cases, and there is a hypothesis that this is a result of genomic sequence evolution. This can be explained as follows. Vitality of a bacterium depends on the correct functioning of all biolog- ical mechanisms. These mechanisms are encoded in genes, and if something wronghappenswithgenesequences(forexamplethereisanerrorwhenDNA is duplicated), then the organism risks becoming non-vital. Nothing is per- fect in our world and the errors happen all the time, including during DNA duplication. These errors are called mutations. The most dangerous mutations are those that change the reading frame, i.e. letter deletions or insertions. If such a mutation happens inside a gene sequence, the rest of the gene becomes corrupted: the reading mechanism (which reads the triplets one by one and does not know about the mutation) will read it with a shift. Because of this the organisms with such mutations often die before producing offspring. On the other hand, if such a mutation happensinthenon-codingpart(wherenogenesarepresent)thisdoesnotlead to a serious problem, and the organism produces offspring. Therefore, such mutationsareconstantlyaccumulatedinthenon-codingpartandthreeshifted triplet distributions are mixed into one. The fourth case (d) also produces a mix of triplet distributions. As a result, we have three distributions for the first case (a), three for the second case (b) and one, symmetrical distribution for the ‘non-coding’ fragments(third case (c)). Becauseof naturalstatisticaldeviations andother reasons, we have 7 clusters of points in the multidimensional space of triplet frequencies. For more illustrative material see [7, 8, 11, 15, 12]. 6 Clustering and Visualizing Results Thenextstepisclusteringthefragmentsinto7clusters.Thiscanbeexplained to the students that as a classification of fragments of text by similarity in their triplet distributions. This is an unsupervised classification (the cluster centers are not known in advance). Clustering is performed by the K-Means algorithm using the Matlab Sta- tistical toolbox.It is implemented in the ClustFreq.m file. The first argument isthefrequencytablenameandthesecondargumentisthenumberofclusters proposed. As we visually identified 7 clusters, in this activity we put 7 as the value of the second argument: fragn = ClustFreq(xx3,7); 8 A.N.Gorban, and A.Y.Zinovyev The function assigns different colors onto the cluster points. The cluster thatistheclosesttothecenterofthepictureisautomaticallycoloredinblack (see Fig. 3). After clustering, every fragment of the text is assigned a cluster label (a color). The GenBrowser function puts this color back to the text and then visualizesit.Theinputargumentsofthisfunctionareastringwiththegenetic text, the fragment size, a vector with cluster labels and a position in the genetic text to read from (we use a value 13000, which can be changed for any other position): GenBrowser(str,300,fragn,13000); This function implements a simple genome browser (a programfor visual- izinggenomesequencetogetherwithotherproperties)withinformationabout gene positions. It is explained to the students that clustering offragmentscorrespondsto segmentationofthewholesequenceintohomogeneousparts.Thehomogeneity is understood as similarity of the short word frequency distributions for the fragments of the same cluster (class). For example, for the following text aaaaaaaatataaaaaattttatttttttattttttggggggggggaagagggggccccccgcctccccccc one can try to apply the frequency dictionary of length 1 for a fragment size around 5-10 to separate the text into four homogeneous parts. 7 Task List and Further Information Interested students can continue this activity. They can modify the Matlab functions in order to solve the following problems (the first three of them are rather difficult and require programming): Determine a cluster corresponding to the correct shift. Asitwasexplained in the “Understanding plots” section, some fragments overlap with genes in such a way that the information can be read with the correct shift, whereas otherscontainthe“shifted” information.Theproblemis todetectthe cluster correspondingto the correctshift. To give a hint, we cansaythat the correct triplet distribution (probably) will contain the lowest frequency of the stop codons TAA, TAG and TGA (see [16, 17]). Stop codon can appear only once in a gene because it terminates its transcription. Measure information content for every phase. The informationofatriplet distribution with respect to a letter distribution is I = P f ln fijk , ijk ijk pipjpk where p is a frequency of letter i (a genetic text is characterized by four i such frequencies), and f is the frequency of triplet ijk. Each cluster is a ijk collection of fragments. Each fragment F can be divided on triplets starting PCA and K-Means DecipherGenome 9 K−means clustering 6 4 2 0 −2 −4 −6 −8 −6 −4 −2 0 2 4 6 8 a) acgatgccgttgcggatgtcgcccatcgggaaaaactcgatgccgttgtcggcgacgttgcgggccagcgtcttcagctgcaggcgcgcttcctcgtcgg ccacggcgtcgacgcccagggcctggccctcggttgggatgttgtggtcggccacggccagggtgcggtcaggccggcgcaccttgcggccggccgcgcg aaggcctgcgaaggcctgtggcgtggtcacctcatggatgaggtgcaggtcgatatagaggatcgcttcgccgcctgcttcgctgacgacgtgggcgtcc cagatcttgtcgtaaagggttttgccggacatggcggggatatagggcggatcgtgcgccaaaatccagtggcgccccaggctcgcgacaatttcttgcg ccctgtggccttccagtctcccgaacagccccgaacgacatagggcgcgcccatgtttcggcgcgccctcgcatcgttttcagccttcgggggccgatgg gctccgcccgaagagggccgtccactagaaagaagacggtctggaagcccgagcgacaagcgccgggggtggcatcacgttctcacgaaacgctggtttg tcaatctatgtgtggttgcacccaagcttgaccgtacggcgcgggcaagggcgagggctgttttgacgttgagtgttgtcagcgtcgtggaaaccgtcct cgcccttcgacaagctcagggtgagcacgactaatggccacggttcgcacgaaatcctcatcctgagcttgtcccccgggaccgaccgaaggtcggcccg aggataaactccggacgaggatttccatccaggctagcgccaacagccagcgcgaagtcagccaaagaaaaaggccccggtgtctccaccggggcctcgt tcatttccgaaggtcggaagcttaggcttcggtcgaagccgaagccttcgggccgcccgcgcgctcggcgatacgggccgacttaccgcgacggtcgcgc aggtagtacagcttggcgcgacggacgacgccgcgacgcttgacttcgatgctttcgatgctcggcgacagcagcgggaacagacgctccacgccttcgc cgaacgaaatcttacggaccgtgaagctctcgtgcacgcccgtgccggcgcgggcgatgcagacgccttcataggcctgaacgcgctcgcgctcgccttc cttgatcttcacgttgacgcgcagggtgtcgccgggacggaagtccgggatcgcgcgggcggccagcagacgggccgattcttcttgctcgagctgagcg atgatgttgatcgccatgatcatgtctccttgggcttacccggtttcgggtcgccttttgcctgttgattggcgaggtgggcctcccagagatccgggcg ccgctcgcgggtcgtttcttcacgcatacgcttgcgccattggtcgatcttcttgtgatcgcccgacagcagcacctcggggatgtcgagctcttcgaac gtccgcggtctcgtgtactgcggatgctcgaggagaccgtcctcgaagctttcttccagcgtgctctcgatattccccagaacccccggggcaagtcgaa cgcacgcctcgatcacgaccatcgccgccgcttcgccaccggcgagtacggcgtctccgaccgagacctcttcgaaccctcgggcgtcgagcacccgctg gtccaccccctcgaagcggccgcacagcaccacgatgccgggcgccttggaccactccctcacgcgcgcctgggtcaggggcctgccccgggcgctcatg tacaaaagcggccgcccatcgcgttcgaggctgtccagcgccgaggcgatcacgtccgctttgagtacggctcctgcgccgccacccgcaggggtgtcgt cgaggaagccgcgcttatccttggaaaaggcgcgaatgtccagtgtttccaaacgccacaggtcctgctccttccaggcggtcccgatcatcgagacgcc gagcgggccggggaaggcctcggggaacatggtcaggacggtggcggtgaacggcatggccgctgtttagcggctggacgctccagagtcgaagccctgc gcggcgcgggggttcatcaagcgcattttccagttgttcaccgtcatcggcgaagttcggcggacaaactgaggtccacttcgtcccatgccgaccaaga gccgtttccttcgacgccttagcgagggcgtcgctgctttcgcccgccgtctgagacgtgacgaccgcggggcgatcgcgatccagttcgcgctgttggc cctgccgctgtcgatccttctgtttggtctcttggatgtgggccgcctcagcctgcagcggcgccagatgcaggacgcgctcgatgcggcgaccctgatg b) Fig. 3. K-Means clustering of triplet frequencies (a) and visualizing the cluster- ing results onto the text (b). There are three scales in the browser. The bottom color code represents the genetic text as a whole. The second line shows colors of 100 fragments starting from a position in the text specified as an argument of the GenBrowser.m function. The letter color code shows 2400 symbols, starting from the same position. The black color corresponds to the fragments with non-coding information (the central cluster); other colors correspond to locations of coding in- formation in different directions and with different shifts with respect to randomly chosen division of the text intofragments 10 A.N.Gorban, and A.Y.Zinovyev from the first letter. We can calculate the information value of this triplet distribution I(F) for each afragment F . Is the information of fragments in the cluster with a correct shift significantly different from the information of fragments in other clusters? Is the mean information value in the cluster with a correct shift significantly different from the mean information value of fragments in other clusters? Could you verify the hypothesis that in the clusterwithacorrectshiftthemeaninformationvalueishigherthaninother clusters? Increase resolution of determining gene positions. In this exercise, we use the same set of fragments to calculate the cluster centers and to annotate the genome. As a result the gene positions are determined with precision that is equal to the fragment size. In fact, cluster centers can be calculated with non-overlapping fragments and then the genome can be scanned again with a sliding window (this will produce a set of overlapping fragments). Following this, triplet frequencies for each window can be calculated and the corresponding cluster in the 64-dimensional space can be determined. The position in the text is assigned a color corresponding to the content of the fragment centered in this position. For further details of this process, see [9, 10, 11, 13]. Precise start and end positions of genes. The biological mechanism for reading genes (the polymerasemolecule) identifies the beginning and the end ofageneusingspecialsignals,knownasspecializedcodons.Therefore,almost all genes start with “ATG” start codon and end with “TAG”, “TAA” or “TGA” stop codons. Try to use this information to find the beginning and end of every gene. Play with natural texts. The CalcFreq function is not designed specifically for the four-letter alphabet text of genome sequences; it can work with any text. For natural text, it is also possible to construct local frequency dictio- naries and segment it into “homogeneous” (in short word frequencies) parts. But be carefulwith longerfrequency dictionaries,for anEnglishtext one has 2 (26+1) =729 possible duplets (including space as an extra letter)! Studentscanalsolookatvisualizationsof143bacterialgenomicsequences at[14].Allpossibletypesofthe7-clusterstructurehavebeendescribedin[15]. Nonlinear principal manifolds were utilized for visualization of the 7-cluster structure in [17]. Principal trees are applied for visualization of the same structure in [18]. 8 Conclusion In this exercise on applying PCA and K-means to the analysis of a genomic sequence, the students learn basic bioinformatics notions and train to apply PCA to the visualization of local frequency dictionaries in genetic text.