ACM Transactions on Information Systems (January 2005)
Introduction to Genomic Information Retrieval

Genomic information retrieval (GIR) is an emerging area of intersection between information retrieval (IR) and bioinformatics. Bioinformaticians have defined their discipline as the "research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data."¹ This definition clearly overlaps with that of information retrieval. Indeed, IR practitioners can bring substantial expertise to bioinformatics by applying and innovating with techniques used in text and multimedia retrieval for storing, organizing, analyzing, and visualizing data.

The application domain of GIR is broad. Expertise can be applied to the retrieval and management of collections of genomic data (such as DNA and protein sequences), medical and biological literature, microarray data, or other bioinformatics domains. Problems that can be solved include understanding the structure, function, and role of proteins, discovering evidence for the causes of disease, and annotating medical and biological artifacts. However, the skills needed to work on such problems are broad: sufficient understanding of biology and genetics, information retrieval expertise, and skills in fundamental computer science techniques such as compression, string matching, data mining, and data structures and algorithms.

This special section serves to bring together articles that show how diverse, important bioinformatics problems can be solved with GIR. As editor of this section, I was pleased by the response to the call for papers, and particularly impressed by the breadth of work submitted for review.

Articles in This Section

This special section contains two articles that describe solutions to very different problems in GIR.
Korodi and Tabus describe, to my knowledge, the most effective DNA compression scheme to date. Compression of DNA is interesting for both practical reasons (such as reduced storage and transmission costs) and functional reasons (such as inferring structure and function from compression models). DNA consists of four symbols, and so compaction of the ASCII representation to two bits per symbol is trivial. However, compression to less than two bits per symbol is much harder: for example, compression with the well-known bzip2 algorithm is ineffective, leading typically to expansion. Prior to the work described in Korodi and Tabus's article, the most effective schemes (described in detail in the article) achieved compression to between 1.41 and 1.86 bits per symbol. This article describes a scheme that is always more effective than other approaches, achieving between 1.36 and 1.82 bits per symbol. Their approach is based on a normalized maximum likelihood model, with heuristics that make the scheme fast on desktop hardware.

¹This is a working definition proposed by the US National Institutes of Health. The working document can be found at http://grants1.nih.gov/grants/bistic/CompuBioDef.pdf.

ACM Transactions on Information Systems, Vol. 23, No. 1, January 2005, Pages 1–2.

The article of Sander et al. proposes and examines techniques for analyzing Serial Analysis of Gene Expression (SAGE) data. SAGE data is a new but important resource that allows the comparison of normal human tissues, cells, and cancers. The work in this article shows that SAGE data is robust, that is, that data produced from different laboratories is comparable and can be used together for large studies. Importantly, the work also suggests interesting similarities between brain and ovarian cancer, and shows that brain and breast cancer have strong gene expression signatures. This may be an interesting direction for further biomedical research.
The GIR techniques described and used include methods for preprocessing the SAGE data, clustering and selecting data subsets of interest, and methods for presenting and understanding SAGE experiments.

ACKNOWLEDGMENTS

I thank the referees for their hard work in reviewing and providing feedback on the papers, in some cases to very tight deadlines. I also thank the authors of all submitted papers. Last, I am grateful to the Editor-in-Chief, Gary Marchionini, for supporting the GIR special section, and to Associate Editor Javed Mostafa for his hands-on support and mentoring throughout the reviewing, discussion, and finalizing of the special section. I hope that TOIS readers enjoy the special section on Genomic Information Retrieval.

HUGH E. WILLIAMS
RMIT University
Melbourne, Australia

An Efficient Normalized Maximum Likelihood Algorithm for DNA Sequence Compression

GERGELY KORODI and IOAN TABUS
Tampere University of Technology

This article presents an efficient algorithm for DNA sequence compression, which achieves the best compression ratios reported over a test set commonly used for evaluating DNA compression programs. The algorithm introduces many refinements to a compression method that combines: (1) encoding by a simple normalized maximum likelihood (NML) model for discrete regression, through reference to preceding approximate matching blocks, (2) encoding by a first-order context coding, and (3) representing strings in clear, to make efficient use of the redundancy sources in DNA data, under fast execution times.
One of the main algorithmic features is the constraint on the matching blocks to include reasonably long contiguous matches, which not only reduces the search time significantly, but also can be used to modify the NML model to exploit the constraint for getting smaller code lengths. The algorithm handles the changing statistics of DNA data in an adaptive way, and by predictively encoding the matching pointers it is successful in compressing long approximate matches. Apart from comparison with previous DNA encoding methods, we present compression results for the recently published human genome data.

Categories and Subject Descriptors: E.4 [Coding and Information Theory]—Data compaction and compression; G.3 [Probability and Statistics]—Correlation and regression analysis; Markov processes; J.3 [Life and Medical Sciences]—Biology and genetics

General Terms: Algorithms, Theory

Additional Key Words and Phrases: Approximate sequence matching, DNA compression, normalized maximum likelihood model

1. INTRODUCTION

Being the definitive codebook for the traits and biological functional behavior of every living organism, DNA sequences are the most challenging information sources that humans try to decipher and use. An enormous quantity of different DNA sequences exists, the size of each sequence varying in the range of millions to billions of nucleotides. In both scientific and commercial communities there is intense activity targeted at sequencing the DNA of many species and studying the variability of DNA between individuals of the same species, which produces

Authors' address: Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland; email: {gergely.korodi,ioan.tabus}@tut.fi.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].
© 2005 ACM 1046-8188/05/0100-0003 $5.00

ACM Transactions on Information Systems, Vol. 23, No. 1, January 2005, Pages 3–34.

huge amounts of information that need to be stored and communicated to a large number of people. Therefore there is a great need for fast and efficient compression of DNA sequences.

The traditional way to tackle the lossless compression of data is to make use of statistically significant repetitions of the symbols or groups of symbols in the data. Many of the commonly used general-purpose methods eliminate the redundancies of exact replicas of a certain subsequence by using efficient referencing to its closest occurrence in the past.

Unfortunately, these general-purpose compression algorithms typically fail to attain any size reduction when applied to DNA sequences, which implies that these sequences rarely possess the kind of regularities (like high predictability of the next symbol based on its context) that are typical for, say, human-created text. However, it is well known that redundancy is commonly present in DNA, but it involves more subtle forms, such as palindrome matches and approximate repetitions. The effective handling of these features is the key to successful DNA compression.

The direct practical advantages of compressing DNA sequences stem from the already large, and rapidly increasing, amount of available data.
Although in recent years mass storage media and devices have become easily affordable and ubiquitous, data compression now has an even more important role in reducing the costs of data transmission, since DNA files are typically shared and distributed over the Internet. Space-efficient representation of the data reduces the load on FTP service providers, as the transmissions are done faster, and it also saves costs for clients who access and download the sequences. Since the price of transmission is proportional to the sizes of accessed files, even savings of 10–20% are useful. In this article we demonstrate that this size reduction can be easily accomplished for typical eukaryotic data.

The compression of DNA has an intrinsic value in itself, but additionally it provides many useful clues to the nature of regularities that are statistically significant in the DNA data, indicating how different parts of the sequence are in relation with each other, how sensitive the genome is to random changes such as crossover and mutation, what the average composition of a sequence is, and where the important composition changes occur. Due to its many potential uses as a statistical analysis tool for DNA sequences, any DNA compression method is faced with all of the challenges that modeling tools are facing: maximizing the statistical relevance of the model used, and choosing between various models based on the overall performance-complexity merit.
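As a point of reference for the compression rates quoted throughout (the trivial baseline of two bits per base, against which rates such as 1.66 bits/base are measured), two-bit packing of the four DNA symbols can be sketched as follows. This is a minimal illustration of ours, not code from the article:

```python
# Minimal sketch: pack a DNA string over {A, C, G, T} into 2 bits per base,
# versus 8 bits per base in the plain ASCII representation.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq: str) -> bytes:
    """Pack 4 bases into each output byte (the last byte may be partial)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for j, ch in enumerate(seq[i:i + 4]):
            byte |= CODE[ch] << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from a packed buffer."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))

seq = "ACGTACGTGGTA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
assert len(packed) == 3   # 12 bases * 2 bits = 24 bits = 3 bytes
```

Against this 4x reduction over ASCII, the gains reported for dedicated DNA compressors (roughly 1.4–1.9 bits per base) correspond to the further 10–20% savings mentioned above.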
One example of a statistical question easily solved by an efficient compression scheme is whether or not a certain approximate match of two subsequences is statistically relevant. To solve this, one simply has to check whether encoding one subsequence based on a reference to the other subsequence, and also including the description of the mismatches, is able to compress the overall DNA sequence or not. To maximize the likelihood of a correct answer, the compression program that is used has to be the best known compressor; otherwise statistically relevant matches will remain undetected. This shows that designing the most efficient compressor is a goal of paramount importance for statistical and biological inference [Li et al. 2001; Li et al. 2003], even if the gains over other compressors are just marginal when judging only from the storage savings benefits.

The complexity of the models that are implicitly or explicitly used by each specific lossless compression method for capturing the regularities in the data should be weighed carefully when comparing various compression methods, due to its impact on various aspects such as: statistical relevance of model parameters, simplicity of parameter estimation, simplicity of using the model for encoding, and attainable compression rate.

The present article discusses a compression method aimed at achieving very efficient compression with a fast execution time. The foundations for this work have been laid down in Tabus et al. [2003a], which is in turn based on the normalized maximum likelihood model presented in Tabus et al. [2003b]. Whereas in Tabus et al. [2003a] the simplicity of the model was one of the stipulated objectives, here we relax this restriction in favor of increased compression efficiency and practical feasibility. As a result, in Section 2 we introduce the framework for our algorithm, more complex than the one considered in Tabus et al. [2003a], followed by a detailed discussion and analysis in Section 3 regarding implementation issues, solutions and performance evaluations.
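The code-length test for match relevance can be caricatured with a simplified cost accounting. The sketch below is our own idealized illustration, not the paper's NML costing: a match is judged relevant when a pointer into the search window plus a description of the mismatch pattern is cheaper than the two-bit-per-base representation in clear.

```python
import math

def h2(p: float) -> float:
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def match_is_relevant(n: int, mismatches: int, window: int) -> bool:
    """Idealized test: does referencing a past block beat 2 bits/base?

    Cost of coding the block in clear: 2 bits per base.
    Cost via the match (hypothetical accounting, not the paper's NML code):
    a pointer into the search window, the mismatch positions coded at the
    binary entropy rate, and log2(3) bits for each substituted base.
    """
    literal = 2.0 * n
    e = mismatches / n
    via_match = math.log2(window) + n * h2(e) + mismatches * math.log2(3)
    return via_match < literal

# A long block with few mismatches is worth referencing...
assert match_is_relevant(n=48, mismatches=3, window=1 << 16)
# ...while a short, noisy one is not.
assert not match_is_relevant(n=16, mismatches=8, window=1 << 16)
```

The better the cost model approximates an optimal code, the fewer statistically relevant matches are misjudged, which is the point made above about needing the best known compressor.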
In Section 4 we present the empirical evaluation of our method, concentrating on constraints such as compression efficiency, processing speed and memory requirements. We show that at the time of this writing our program surpasses in terms of compressibility all previously published results on a de facto set of test sequences. The practical feasibility of our program is demonstrated by encoding the entire human genome, achieving a compression rate of 1.66 bits/base.

1.1 Connections with Previous Works

The first compression algorithm dedicated exclusively to nucleic acid sequences, and hence significantly surpassing general-purpose methods, was published by Grumbach and Tahi [1993]. Their program, called Biocompress, used a Lempel-Ziv style substitutional algorithm [Ziv and Lempel 1977] that, apart from the exact direct matches, was also able to detect complementary palindromes, which are exact matches of an invertible transform of the subsequence, obtained by taking the sequence in reversed order and replacing each base with its complement (A↔T and C↔G). Such palindrome matches are in general as frequent as direct matches, being a redundancy source worth exploiting for compression. In a later method, called Biocompress-2, Grumbach and Tahi [1994] improved Biocompress by adding an order-2 context coder followed by arithmetic coding [Rissanen 1976].

A new generation of DNA compression algorithms appeared when practical use was made of the data redundancy involved in approximate repetitions [Chen et al. 2001, 2002]. Especially efficient are two algorithms introduced in Chen et al. [2001] by Chen, Kwong and Li: the first version, GenCompress-1, again used the Lempel-Ziv algorithm and an order-2 context coder, but here the substitutional method only used replacement operations when matching two subsequences. In the second version, GenCompress-2, insertion and deletion operations were allowed in addition to the replacements when looking for an approximate match.
However, this did not prove to offer consistent improvements over GenCompress-1, according to the reported results [Chen et al. 2001], leaving open the usefulness of considering insertions and deletions in addition to substitutions in their scheme. In both GenCompress-1 and GenCompress-2, along with the pointer to the matched subsequence (including offset and length), the encoder has to provide codes for the edit operations involved in the approximate match, in order that compression be lossless. Establishing codes for the edit operations was an empirical process, and as such, subject to overtraining with respect to the test data that was used.

An earlier substitutional method, Cfact [Rivals et al. 1995], used a two-pass coding strategy: in the first pass the exact matches worth providing a gain in compression were detected in the overall file, and in the second pass the encoding was performed. As such, it can be thought of as a precursor to one of the best DNA compression algorithms, DNACompress, introduced by Chen, Li, Ma, and Tromp [Chen et al. 2002], which also employs a two-pass strategy, and is again based on a substitutional (Lempel-Ziv style) compression scheme. In the first pass a specialized program called PatternHunter is used as a preprocessor for finding significant approximate repetitions. The second pass then encodes these by a pointer to their previous occurrence, and it also encodes the edit operations to correct the approximations of the match, while the rest of the sequences are coded using an order-2 context arithmetic coder.

The success of DNACompress shows the importance of the method used for finding significant approximate matches, which in one way or another should be part of a substitutional compression algorithm. We briefly mention a few major methods for aligning sequences, because they are well related to the approximate matching problem, and furthermore because their principle of search is worth exploiting in a compression algorithm for the approximate match search task.
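The complementary-palindrome transform exploited since Biocompress is easy to state in code. The following is a minimal sketch of ours (the helper names are not from the article):

```python
# Reverse-complement matching: a block matches a past region if that region
# equals the block read backwards with A<->T and C<->G swapped.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse the sequence and swap each base for its complement."""
    return seq.translate(COMPLEMENT)[::-1]

def find_palindrome_match(history: str, block: str) -> int:
    """Return the offset in `history` where `block` occurs as a
    complementary palindrome, or -1 if there is none."""
    return history.find(reverse_complement(block))

history = "GGGAATTCCC"
# "GAATTC" is its own reverse complement, so it matches itself in the history.
assert reverse_complement("GAATTC") == "GAATTC"
assert find_palindrome_match(history, "GAATTC") == 2
```

A substitutional coder can run this search alongside the direct-match search and spend one extra bit per match to say which of the two transforms was used, as the NML scheme described later does.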
Locating in a database the sequences that are significantly well aligned with a given sequence was a task first solved in the seventies by exact computing methods based on dynamic programming, but those methods were too slow for searching in large databases in a practical time. Wilbur and Lipman developed several algorithms for aligning two nucleotide sequences, such as in Wilbur and Lipman [1983], which was based on matching fixed-length seeds, that is, consecutive strings of nucleotides. This approach became popular and it was later refined in further algorithms with the same objective, such as the FASTA algorithm by Pearson and Lipman [1988], and the family of BLAST programs [Altschul et al. 1990], which became very popular ways of comparing one sequence against all the sequences of a database. A more recent search algorithm attaining significantly faster speed than BLAST programs at the same level of sensitivity is PatternHunter by Ma, Tromp and Li [Ma et al. 2002]. In contrast to earlier algorithms, it uses strings of nonconsecutive symbols as seeds for search.

All of the mentioned methods are well optimized for the task of finding alignment scores for sequences of arbitrary (not necessarily the same) lengths, and they operate in two steps: first locating a potential good match using a seed, then
subsequently trying to extend the seed sequence at both sides, evaluating after each extension the matching (or alignment, if gaps are included) score, to get an approximate match as long as possible. This second part makes the process even more computationally expensive. In our compression program the approximate search task is more particular: it needs only to compare blocks of the same size, and although we use the same principle of seed-based acceleration of the search, the subsequent operations are less expensive than those used by the BLAST family of programs, and we optimize them to get a very fast encoding algorithm. At the decoder side, the search process is not needed, and hence decoding is much faster than encoding, which is true for all substitutional compression methods.

Several other methods that do not use the substitution principle have also been proposed for DNA compression. Some of them only offer estimates of the achievable compression, or estimates of the entropy [Loewenstern and Yianilos 1997; Lanctot et al. 2000]; others are full compression programs. From the latter category we mention a combination of the context tree weighting (CTW) method with Lempel-Ziv coding called CTW+LZ [Matsumoto et al. 2000], which achieves very good compression results but is very slow, and a Burrows-Wheeler Transform-based compressor [Adjeroh et al. 2002], which attains only modest compression results. Though several fundamentally different algorithms have also been proposed in the 10 years following the introduction of Biocompress, Lempel-Ziv style algorithms have always represented the state-of-the-art in DNA compression.

A departure from Lempel-Ziv style algorithms is the method NMLComp presented in Tabus et al. [2003a], which replaces the Lempel-Ziv method by encoding with the normalized maximum likelihood model for discrete regression [Tabus et al. 2003b]. In Tabus et al. [2003a] it was demonstrated that the NML model is well suited for encoding approximate matches based on replacement operations, and also that it can be implemented with very low algorithmic and computational requirements.

In our present article we introduce refinements and algorithmic improvements for the method introduced in Tabus et al.
[2003a], directed towards speeding up the execution process and obtaining better compression results. The kernel of the method was presented in Tabus et al. [2003a], and it consists of block-parsing the DNA sequence and using one of three options for encoding: the first uses a reference to a previously encoded DNA segment for conditionally encoding the current block, by means of a simple NML model; the second uses an order-1 context coder; and the third uses an "in clear" representation of the current block. The present article refines the kernel scheme, proposing improvements in the way the reference to the previous match is encoded, in how the very expensive search for the previous match can be accelerated by constraining the matches, with just a small price in the coding performance, and in how the new match constraints can be used to derive a more efficient NML model, which is as easy to handle algorithmically as was the one in Tabus et al. [2003a]. Additionally, more adaptive policies are introduced, such as the selection of the block size used in parsing, and scalable forgetting factors for the memory model. As a result of these and other algorithmic improvements, our program achieves the best compression ratios published to date for a de facto standard test set, with very fast execution.

1.2 Organization of the Article

In Section 2 we start by reviewing the three encoders that compete for encoding a block of data, and present all components of Tabus et al. [2003a] which are necessary to make this paper self-contained. The NML encoder principle is discussed and some important issues are elaborated in more detail than in Tabus et al.
[2003a], for example, the analysis of the threshold for deciding which strings to include in the normalization sum, in Sections 2.1.3 and 2.1.4. The selection of the block size becomes a more refined process, being intertwined with decisions over other important parameters, for example, forgetting memory, in an overall algorithm, which is presented in Section 2.3. Section 3 contains other major algorithmic novelties in the article, first addressing the acceleration of the approximate match search, and then considering improvements in the performance which can be achieved by exploiting the particular search algorithm used. A new version of NML encoding is derived that uses an additional constraint in the block model, namely that in each approximate match found, a contiguous run of matched characters that has at least a (given) fixed length is guaranteed to exist. The algorithmic improvements are assessed over artificial random sequences (where expected values can be computed) and over a benchmark of DNA sequences. In Section 4 we compare our results with previous DNA compression programs over some DNA sequences, then evaluate various options in our algorithm by tuning it to different constraints, and finally we present the results of our program when compressing all of the 3 billion bases of the human chromosomes, revealing an average compressibility of 1.66 bits/base for the well-specified bases.

2. ALGORITHM DESCRIPTION

In this section we present an algorithm with the objective of attaining good compression of DNA sequences while being practically low-cost, especially on the side of the decoder. In order to efficiently process various types of DNA sequences, we build up our encoder from three modules, each with several parameters, targeting each to address different types of redundancy in the input sequence. In Sections 2.1 and 2.2 we present these three algorithms, while the final encoder is described in Section 2.3.

2.1 The NML Model for Discrete Regression

The NML model for discrete regression is a powerful tool to detect hidden regularities in the input data [Tabus et al. 2003b].
Such regularities relevant to DNA compression are the approximate repetitions, by which one sequence can be efficiently represented using knowledge of a sequence that occurred earlier, by replacing, deleting or inserting characters in the earlier one to match the later [Chen et al. 2001]. Our strategy to handle approximate repeats is to focus on blocks of relatively small length and study their encoding based on substitution operations alone (not considering insertions and deletions). Thus our first objective is to match two sequences only by possibly substituting a certain number of bases in the first sequence, and to derive an efficient representation for the operations needed to recover the second sequence from the first. This demand is efficiently accomplished by a simplified version of the NML model, first described in Tabus et al. [2003a]. The idea behind this approach is to split the input sequence into fixed-length blocks, and to find for each block an appropriate regressor sequence in the already encoded (thus known) data to code that block. The regressor sequence should have minimal Hamming distance to the current block, among all possible sequences in the past.

By our independent addressing of the blocks, we can model long approximate repeats well, since the block concatenation can model substitution operations (inside blocks), and insertions and deletions (by shifting the regressor blocks).

A more complex NML model would enable us to capture and eliminate more sophisticated redundancies from the DNA sequence.
However, such a model would require a different choice of the regressor data, which naturally imposes larger costs for finding the best matching sequence, both in terms of speed and memory requirements. Since in practical applications the data to be processed can have enormous sizes (a typical chromosome of a eukaryote organism varies in the range of several tens up to several hundreds of millions of base pairs), the advantage of using complex NML models, which may offer only slightly better compression rates but at the same time increase computational requirements considerably, is not yet clear. For this reason, in the following we base our algorithm on the simpler NML variant first described in Tabus et al. [2003a]. In the present section we give an overview of this algorithm, as well as improvements that have not been discussed in Tabus et al. [2003a], and a suitable way of integrating them into an efficient implementation.

2.1.1 The Probability Model. From the statistical point of view, DNA sequences are messages $s = s_0, s_1, s_2, \ldots, s_{N_s-1}$ emitted by a source with an alphabet size of $M = 4$ symbols [Grumbach and Tahi 1993]. In the biological literature these symbols, called bases, are denoted by the letters A, C, G and T, but for convenience we associate to them the numbers 0, 1, 2 and 3 in respective order; thus $s_k \in \{0, 1, 2, 3\}$. In our method we split the sequence $s$ into successive, nonoverlapping blocks of length $n$, where one generic block $s_{\ell n}, \ldots, s_{\ell n+n-1}$ will be referred to as $y^n = y_0, \ldots, y_{n-1}$. For convenience, we drop the reference to the running index $\ell$ from the block notation, although we always consider $y^n$ to be the "current" block, with $y_0 = s_{\ell n}, \ldots, y_{n-1} = s_{\ell n+n-1}$ for some integer $\ell$. Each block $y^n$ is encoded with the help of some regressor block $s_k, \ldots, s_{k+n-1}$, written as $x^n = x_0, \ldots, x_{n-1}$, in the already encoded stream, which does not necessarily start at a position that is a multiple of $n$. We denote by $\theta$ the probability of the event that the symbol at any position in the current block matches the corresponding symbol in the regressor block.
If the symbols do not match, there is an additional choice of 1 in $M-1$ to specify the correct base $y_i$, and thus the probability model is

$$P(y_i \mid x_i; \theta) = \begin{cases} \theta & \text{if } y_i = x_i \\ \psi & \text{if } y_i \neq x_i \end{cases}, \qquad (1)$$

where $\psi = \frac{1-\theta}{M-1}$. From now on we assume that the symbols within the sequence are independent; though this is not true for real DNA sequences, it does provide us a simple model, which turns out to be very efficient for our purposes. The probability model (1) extended to blocks of $n$ symbols is then

$$P(y^n \mid x^n; \theta) = \theta^{\sum_{i=0}^{n-1} \chi(y_i = x_i)} \, \psi^{\sum_{i=0}^{n-1} \chi(y_i \neq x_i)} = \theta^{n_m} \psi^{n-n_m}, \qquad (2)$$

where $\chi(\text{condition})$ is 1 if the condition is true and 0 otherwise. The number of matching bases in blocks $x^n$ and $y^n$ is indicated by $n_m$. Consequently, the maximum likelihood estimate of the parameter $\theta$ is $\hat{\theta}(y^n, x^n) = \frac{n_m}{n}$, and the maximized likelihood is

$$P(y^n \mid x^n; \hat{\theta}(y^n, x^n)) = \left(\frac{n_m}{n}\right)^{n_m} \left(\frac{n-n_m}{n(M-1)}\right)^{n-n_m}. \qquad (3)$$

It turns out that there are many possible blocks that are dissimilar to the extent that there is no gain in using $x^n$ for them as regressor. Collecting only the "good" blocks for which NML is profitable into the set $Y_{x^n}$, we normalize the maximized likelihood to obtain the universal model

$$\hat{P}(y^n \mid x^n) = \frac{P(y^n \mid x^n; \hat{\theta}(y^n, x^n))}{\sum_{z^n \in Y_{x^n}} P(z^n \mid x^n; \hat{\theta}(z^n, x^n))} = \frac{\left(\frac{n_m}{n}\right)^{n_m} \left(\frac{n-n_m}{n(M-1)}\right)^{n-n_m}}{\sum_{m \in \Omega_n} \binom{n}{m} (M-1)^{n-m} \left(\frac{m}{n}\right)^m \left(\frac{n-m}{n(M-1)}\right)^{n-m}}, \qquad (4)$$

where the second expression makes use of the fact that the "good" blocks are clearly those with high probability in (3), and since this probability depends only on $n_m$, constraining $Y_{x^n}$ is straightforwardly done by constraining $n_m$ to a set $\Omega_n \subseteq \{0, \ldots, n\}$.

2.1.2 The NML Code Length. From (4) it follows that the code length that is spent on encoding the block $y^n$ knowing the regressor block $x^n$ is

$$L_{\mathrm{NML}}(n_m, \Omega_n) = -\log_2 \hat{P}(y^n \mid x^n). \qquad (5)$$

(The arguments of the code length, $n_m$ and $\Omega_n$, indicate that this length depends only on the number of matching bases and the set to which $n_m$ is constrained.)

The block $x^n$ is most conveniently specified by a pointer to its start, and one bit indicating whether direct or palindrome matching is used. Since the $\ell$th block $y^n$ starts at position $\ell n$, and we exclude regressor sequences overlapping this block, the pointer to $x^n$ can be encoded with $\log_2(\ell n - n + 1)$ bits. However, to prevent exploding run-time requirements, in practice it is often important to introduce a window $W = \{\ell n - n - w + 1, \ldots, \ell n - n\}$ of size $w$ in which the search for the starting position of the regressor is carried out. This way the pointer can also be sent in $\log_2 w$ bits, so the number of bits needed to completely specify the block $x^n$ is either $1 + \log_2(\ell n - n + 1)$ or $1 + \log_2 w$, whichever is smaller. This means that the overall number of bits the NML algorithm spends on encoding $y^n$ is

$$L_1(n_m, \Omega_n) = L_{\mathrm{NML}}(n_m, \Omega_n) + \log_2 \min\{(\ell - 1)n + 1, w\} + 1. \qquad (6)$$
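The code lengths in equations (3) through (6) can be evaluated numerically. The sketch below is our own illustration, not the authors' implementation; for simplicity it takes the constraint set to be the full range {0, ..., n}, and `block_index` plays the role of the block counter in (6):

```python
import math

M = 4  # alphabet size for DNA

def log_ml(n: int, m: int) -> float:
    """log2 of the maximized likelihood (3) for m matching bases out of n."""
    val = 0.0
    if m > 0:
        val += m * math.log2(m / n)
    if m < n:
        val += (n - m) * math.log2((n - m) / (n * (M - 1)))
    return val

def nml_code_length(n: int, n_m: int) -> float:
    """-log2 Phat(y^n|x^n) per (4)-(5), with the full constraint set."""
    # Normalizing sum over all m: C(n,m) * (M-1)^(n-m) * max-likelihood(m).
    log_norm = math.log2(sum(
        math.comb(n, m) * (M - 1) ** (n - m) * 2.0 ** log_ml(n, m)
        for m in range(n + 1)))
    return log_norm - log_ml(n, n_m)

def total_bits(n: int, n_m: int, block_index: int, w: int) -> float:
    """Eq. (6): NML length + pointer + 1 bit for the palindrome flag."""
    return nml_code_length(n, n_m) + math.log2(
        min((block_index - 1) * n + 1, w)) + 1

# A near-perfect match of a 24-base block costs far less than the
# 48 bits of the two-bit-per-base representation in clear...
assert total_bits(n=24, n_m=23, block_index=1000, w=1 << 14) < 48
# ...while a chance-level match (about n/M matching bases) does not pay off.
assert total_bits(n=24, n_m=6, block_index=1000, w=1 << 14) > 48
```

Restricting the sum in `nml_code_length` to a smaller set, as the article's threshold analysis in Sections 2.1.3 and 2.1.4 does, shrinks the normalizer and hence shortens the code for the blocks that remain eligible.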
