1 Exploring the efficacy of molecular fragments of different complexity in computational SAR modeling Albrecht Zimmermann, Bjo¨rn Bringmann, Luc De Raedt 5 Abstract—AnimportantfirststepincomputationalSARmod- selected according to their correlation with the target value 1 elingistotransformthecompoundsintoarepresentationthatcan to ones selected using a threshold on their length. 0 beprocessedbypredictivemodelingtechniques.Thisistypically 2 The paper is structured as follows: we first discuss the a feature vector where each feature indicates the presence or conceptoffingerprintsanditsextensiontogeneralizedfinger- n absenceofamolecularfragment.Whilethetraditionalapproach a to SAR modeling employed size restricted fingerprints derived prints,aswellasdifferencesinthecomplexityoffragmentson J from path fragments, much research in recent years focussed whichto base them. Followingthis, we lay outour methodol- 3 on mining more complex graph based fragments. Today, there ogy for the experimentalcomparisons in terms of complexity 1 seems to be a growing consensus in the data mining community andselectioncriterion,onwhichwereportafterwards.Finally, that these more expressive fragments should be more useful. ] We question this consensus and show experimentally that we discuss the observed phenomena and draw conclusions. E fragments of low complexity, i.e. sequences, perform better than C equally large sets of more complex ones, an effect we explain II. (GENERALIZED)FINGERPRINTS by pairwise correlation among fragments and the ability of . s a fragment set to encode compounds from different classes The usual approach in computational biology/chemistry c distinctly. The size restriction on these sets is based on ordering [ when using the second-orderrepresentation for SAR involves the fragments by class-correlation scores. In addition, we also 1 evaluatetheeffectsofusingasignificancevalueinsteadofalength assessing the structural similarity of molecules. They are v restrictionforpathfragmentsandfindasignificantreductionin decomposed into sets of (potentially overlapping) fragments the number of features with little loss in performance. and the similarity of any two molecules evaluated comparing 5 1 their respective fragments using kernel functions [3]. 0 A variety of different fragments has been used in the 3 I. INTRODUCTION literature, from paths/walks [4], [2], [5], via fragments with 0 Structure-activity relationship (SAR) prediction is an im- branches (trees) [6], to those with cycles (graphs) [7], [8], . 1 portant task in computational biochemistry. The aim is to [9], [10]. Often, a new kernel function is proposed as well. 0 predicttheeffectofcompoundsbasedontheirstructuralchar- These approachesshare two characteristics:1) they start from 5 acteristics – the second-order representation comprising the vertices (atoms) of individual or pairs of molecules, enumer- 1 : topologicalarrangementof atoms and bonds of the molecule. ating the paths starting from this vertex, or the neighborhood v For algorithms to process molecular data and build models graphs surrounding it. 2) the fragments are size restricted, i X to predict their activity, molecules have to be simplified by length restricted for paths (such as 0≤l ≤8 or 3≤l ≤10), r transforming them into a different representation. To this or diameter restricted for graphs. a end, molecules are abstracted as graphs – networks of atoms Theresultingfragmentsare oftenusedto mapmoleculesto linked to each other. A commonapproachto SAR consists of bit-vectorsofagivensizek(suchas512or1024),inaprocess constructing fragments from individual or pairs of molecules, called fingerprinting, involving the generation of b random andsubjectingthosemoleculestofingerprintingtogainafinal integers that are mapped using a modulo k reduction. While representation that is more easily accessible for prediction al- values such as k, l, and b are based on empirical knowledge gorithmssuch asSupportVectorMachines.Thegraphmining of biochemical practitioners, it has been shown that, e.g., community,ontheotherhand[1],approachestheconstruction differentlength restrictions can have a profoundeffect on the of fragmentsslightly differentlyand while the differencesare usefulness of derived fragments [11]. subtle, they can have significant effects. As an alternative to hashing, Swamidass et al. [2] pro- In this paper, we build on earlier work [2] that aimed at posed generalized fingerprints (gfp) in which the explicit generalizing the existing fingerprinting approach and explore size restriction on bit-strings is lifted. Thus, potential loss how to derive the molecular fragments on which to base of information is avoided since each fragment is represented. generalizedfingerprintsinapredictivesetting.Specifically,we Also,itbecomespossibletouseinformationthatgoesbeyond compare fragments of different complexity in terms of their presence, e.g. how often a fragmentoccursin the data, which usefulness. Additionally, we compare the use of fragments the authorsexploitedinproposinga kernel.However,hashing fragments to the same bit can weed out redundancyand such Albrecht Zimmermann, Luc De Raedt, KU Leuven, first- arepresentationpotentiallyavoidsthecurseofdimensionality, [email protected] B.Bringmann,Deloitte &ToucheGmbH,[email protected] and reducesmemory requirementsfor the modeling step. The 2 retention of information seems to outweigh the benefits of III. APPROACH hashing, since Swamidass et al. showed that gfp outperform We use substructures that correlate with one of two target fp, especially for smaller fps. classes (e.g. active and inactive) – and therefore discriminate Theirapproachusedpathfragments,andintheirconclusion among the two. Techniques exist for mining top-k substruc- they suggested the use of shallow trees as fragments from turesaccordingtoconvexmeasuressuchasχ2 orInformation which to construct gfp. This coincides with trends in the data Gainwhilestillpruninglargepartsofthesearchspace.Similar miningcommunitywheregraphminingistoutedasthetoolof search strategies can be used to find all substructures with a choice to derive fragments for SAR prediction. In contrast to score above a user defined threshold. Please note that in this this there have been claims that simpler features may well workweonlyuseχ2 sinceearlierworkshowedthatthisleads suffice [12], [13]. This assumption has been supported by to better results than employing Information Gain [15]. recent work [8] that evaluated the efficacy of structures of Regardingchemical compounds,there exist three very well different complexity against one another and found little, if studied types of substructures, namely: any, advantage in using more complex structures such as LG subgraphs, most expressive, but expensive to mine; graphs.Ithastoberemarked,however,thatthelatterworkstill LT subtrees, can represent anything but cycles; constructssizerestrictedfragmentsfromindividualmolecules. LS subsequences, least expressive, rather easy to mine. Incontrasttothis,wehavefoundinthepastthatsequential fragments are more useful than more complex ones [14]. Yet The relation LS ⊂ LT ⊂ LG holds, implying that we constructfragmentsdifferently:theyarenotsize restricted |LS|≤|LT|≤|LG|.Notethatsequencesareslightlydifferent from paths as used by Swamidass et al. as they only allow a but chosen based on how well they correlate with the target variable,measuredbyχ2,acorrelationthatisevaluatedonthe bijective mapping of the nodes and edges from the fragment tothedata,i.e.avertexcanoccuratmostonceina sequence. entire data. The size restriction on fingerprints can either be Our first question is concerned with comparing these three enforcedexplicitlybytakingthe k best-correlatingfragments, types of structures w.r.t. their value in terms of predictive or implicitly by requiring a minimum correlation score. accuracy. To carry out this task, we extract a number of We reproduce our experiments on new data and perform substructures from the data, and use them to describe each additional analysis to answer the following questions: of the seen or unseen chemical compounds. The molecules Q1. Are fragments with low complexity as useful as more are transformed into generalized fingerprints indicating the complex fragments and if so, what are the underlying substructures’ presence or absence. From the feature vectors phenomena? a model for the activity of the compounds is learned. Restrictingthenumberoffragmentsusedgivesusacontrolled Support vector machines (SVMs) have been used success- setting in which to evaluate the efficacy of fragments from fullyforSARproblemsandcan filter outredundant/irrelevant different fragment classes. By analyzing the encoding of the features. We use the Tanimoto kernel that has been used to data that can be derived from the mined patterns, and the goodeffecton the NCI cancer data set we do our comparison relationamongpatternsthemselves,we gainan intuitionas to on [2]. The data is encoded as undirected graphs, vertices why simpler structuresare the better choice when the number labeled with their atom type, edges as single, double, or of patterns is limited in the mining process. It is not obvious aromaticbonds.Hydrogenatomsarenotencoded.Theshortest whether these results will transfer to a size restricted setting possible sequence consists of a single edge, i.e. two atoms. but we can answer a related question, namely: Q2. Is mining fragments using a significance threshold as IV. EXPERIMENTALEVALUATION good as the size restricted approach to building gfps? The NIC60 cancer data set is a popular data set for testing Thesizerestrictedapproachisequivalenttoconsideringthe SAR predictions [2], [9]. It consists of approximately 4000 occurrencesofallfragmentsadheringtothosesizerestrictions compounds that have been tested against 60 tumor cell lines. and using those that occur at least once. This can lead to an Each of the 60 subsets consists of around 3,500 compounds. explosionin the numberof enumeratedfragments,even using For each of the 60 cell lines comprising the NCI cancer length restriction on the patterns, as we will show. Arbitrarily data set, we performed a stratified 10-fold cross validation. increasing this threshold, on the other hand, might exclude Fragments are mined exclusively on the training folds since interesting fragments, and as mentioned above, the effect of there is no information about class labels in the test data. changing the size restrictions is not always predictable [11]. Analogously, minimally correlating fragments can be con- A. Comparing differentcomplexity classes – fps of equalsize sidered to consist of at least two atoms and to adhere to a correlation constraint. Changing the size constraint can still The classical approach to encoding molecules in fps lies in have unpredictable results whereas changing the correlation fixing the size k of the fp and hashing fragments to a bit- threshold has a clear interpretation. Consequently, in a final string of this length. In the technique we propose, mining experiment, we compare the effect of increasing the number significant fragments according to the χ2 statistic, fixing the offragmentsoflowcomplexitybyloweringtheminingthresh- size of fps can be done by mining only the k highest- old, showing the increasing quality of the encoding (and its scoring fragments. In earlier work [14], we showed that for consequences for the quality of the classification model), and a fixed k, sequences were more effective in encoding data contrasting their usefulness with length restricted fragments. for classification purposesthan trees or graphs. We repeatthe 3 Fig.3. Averagenumberoffeaturespermoleculeforthetop-1000fragments. Fig. 1. AUC results of SVM-classification on encodings derived from the top-1000fragments andnumberofcorrespondences ineachencoding. of fragments to each other. Figure 2 shows the strength of correlation of different fragments’ presence. As can be seen, sequencesaremuchmorediversethangraphs(withgraphsand 100 1 100 1 trees virtually identical), describing data more distinctively. 90 0.8 90 0.8 80 0.6 80 0.6 At first glance, the plot seems to disagree with this de- 6700 0.4 6700 0.4 scription since the left-hand side of the figure shows far less 0.2 50 0.2 50 0 yellow,i.e.highpairwisecorrelations,thantheright-handside. 40 0 40 30 30 -0.2 However,in the figure,the lowest scoringsequencecontained 1200 --00..42 1200 --00..64 in the top-100 graphs corresponds to the sequence ranked 41 0 -0.6 0 -0.8 in the top-100 sequences. Hence, the lower-left part (up to 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 fragment 41) of the top-100 sequences in the right is like a smallscaledversionoftheleftfigure.Thisleadstothecoarser Fig.2. Intercorrelation ofthe100bestclass-correlating graphs(sequences) appearance on the left-hand side, indicating that for each intheleft(right)correlation matrix.Thesequences ranked42andlowerare sequence there is at least one highly correlating graph. The notinthetop-100 graphs. most distinctive example for this can be found in the graphs ranked∼8−26:themajorityofthesegraphsshowverysimilar behavior w.r.t. pairwise correlation with other fragments and experimentsofthisworkhere,thistimeusingtheNCI60data the entire block is equivalent to sequences ranked ∼8−15. set used by Swamidass et al.. For the mining process, k was Note that these figures display only pairwise correlation. set to 1000,the top-k sequences,trees, and graphsmined and Using the final set of fragments during modeling allows for the AUC (area under the ROC) estimated as described above. much more complex combinations than pairs. Hence, these As the lower half of Figure 1 shows, fps using sequences figures show only the tip of the iceberg, and we can expect are significantly more useful for learning a modelon the data the overall intercorrelation effects amongst combinations of using an SVM. This is consistentwith our earlier results. The fragments in the result set to be much stronger. WilcoxonMatchedPairsSignedRankTest shows39outof60 Figure 3, plotting the average number of features per cases in which using sequences is significantly (p-level 99%) molecule, shows that the more complex fragments, while not better than using graphs and 38 out of 60 cases in which the reducingcorrespondencesasefficiently,usemorefeatures.The tree-basedencodingisoutperformed.Togainanunderstanding lowest curve in this figure shows how few sequences are on of the reasons for this, it is helpful to consider the encoding average used for encoding once cyclic graphs crowd out less of molecules that can be derived from each mined set. complex structures from the set. We focus specifically on pairs of molecules from different classesthatareencodedbythesame fp(correspondences).To B. Comparing different complexity classes – from fps to gfps give an idea of the prevalence of this phenomenon, we show The higher diversity mentioned above also means that the for each fragment type and data set their average number in scoresoftreesandgraphsfallintoasmallerintervalthanthose the upper half of Figure 1. In particular, this also includes all of sequences.1 In other words, the 1000th-best χ2 sequence molecules that are not matched by any fragment at all. Such score is lower than the 1000th-best score for more complex fpsinthetrainingdatacorrespondtodatapointsthattheSVM structures (see Figure 4). The horizontallines show the worst cannoteffectivelylearntodistinguish,whilewhenunseentheir scoreforeach structuretype,whichalso indicatesthenumber classification is essentially up to chance. Generally speaking, of fragments of other types that exceed it. the betterperformanceofthe SVM onsequence-encodeddata aligns with fewer correspondenceson this encoding. 1In practice the best fragment is often a sequence. In general, the best This phenomenon can be explained by the relationship graphsmight scoremuchbetter thananysequences. 4 Fig.4. Scoredistributionforthetop-1000fragmentsaccordingtoχ2ofone representative training fold. Fig. 6. AUC results of SVM-classification on encodings derived from the less complex fragments contained in the top-1000 graphs and number of correspondences ineachencoding. Figure 6. It is important to note the trade-off among the number of features and the quality of the feature set. The numberofcorrespondencesarerathersimilarforeachdataset for all structure types, despite the differences in total number of fragments. This indicates that the large sets of tree- and graph-structuredfragmentsstillshowmuchredundancy,which inturnmeansthatwhileadditionalcomplexityallowsforsome morediversityforagiventhreshold,thegainisrelativelysmall compared to simply increasing the number of fragments. C. Increasing gfp-size by lowering the mining threshold Fig. 5. Number of fragments remaining after the set is reduced using the 1000th-worst graphscore. Increasingthe number of featuresimprovesthe chance that moleculesfromdifferentclassesare encodedina way thatal- lowstodistinguishthoseclasses.Treeandgraphminingbeing To normalize the observed advantage of sequences with far more expensive than sequcne mining [14], it is unrealistic regard to diversity, we use the 1000th-best graph score (the to try and mine large amounts of complex fragments. Also, red horizontal line) to crop the size of fps derived from tree using all graphs which have a χ2-score exceeding a given and sequential fragments. We effectively obtain generalized threshold does little to increase diversity over sequences. fingerprints, without explicit size restriction of bit-vectors, The computational complexity of fragment mining arises similar to the ones used by Swamidas et al., with the length fromtheneedtosystematicallyexplorealargesearchspaceof restrictiononfragmentsreplacedwithaminimumsignificance potentiallyinterestingfragmentsandcounttheiroccurrencesin value. The number of fragments left is shown in Figure 5. thedata.Approachesthatstartfromindividualmoleculesavoid This reduction is in fact rather severe, pushing the number of thisbottleneck,yetwhilethefragmentsderivedinthatmanner sequencefragmentsdownto 10%−20%ofthe original1000. can be used for assessing molecules’ similarities this is often Asmentionedbefore,Figure3alsoshowstheaveragenum- the extent of their usefulness, especially since they are often berofsequentialfeaturespermoleculeinthetop1000graphs, tiedtotheirrespectivekernelfunctions.Fragmentscorrelating which is equivalent to using the reduced set of sequential with the target value, on the other hand, capture information features (bottom curve). In comparison to the second curve aboutthe relationshipofstructureandactivitythemselvesand from the bottom, one can see that, again, the reduction is can be analyzed independently from the modeling step. severe, yet not as severe as for the entire set of features, only In a third experiment we thus mine sequences which have dropping to 25%−33% of the original number (∼40−50). a χ2-score exceeding the (unadjusted) 95%, 99%, and 99.9% Reducing the number of features in this way leads to a p-values, respectively. As expected, Figure 8 shows a direct very slight advantage for the graph-structured features w.r.t. relationship between lowering the significance threshold and AUC results(lowerhalfof Figure6). In fact,accordingto the the number of features mined. More features also leads to Wilcoxontest, there are onlytwo significantdifferences,even fewercorrespondenceswhichleadthentocorrespondingAUC though so many fewer sequences than graphs are used. values (Figure 7). According to the Wilcoxon Test, using the Similarly to the results described in the preceding section, p-value for 99% improves significantly on the 99.9% value the decrease in accuracy goes along with an increase in the in 36 cases. Using the 95% value increases this to 45 (and number of correspondences, as shown in the upper half of improves in 17 cases on the 99% value). 5 Fig.9. Averagenumberoffragmentspermoleculeminedwiththreedifferent significance thresholds andaslength restricted pathsoffrequency 1. Fig.7. AUCforsequences basedonthreedifferent significance thresholds and length-restricted paths of frequency 1 along with the number of corre- spondences ineachencoding. andaso-calledMin-Max-Kernel.Whilethelattergivesslightly betterresultsw.r.t.predictiveaccuracy,itisevaluatedonarep- resentationof the data that not only denotesabsence/presence of substructures but also the number of times they occur in a molecule. Since the semantic information of significant fragments is such that only their presence correlates with an activity, we do not adopt this representation and thus do not compare against the Min-Max-Kernel. The lower half of Figure 7 also lists the average AUC the SVM achieved on gfps using length-restriction. As we expected,using length-restrictedpathswith minimum support ofoneleadstoslightlymoreusefulfeaturesetsbutatthecost of significantly larger fps. In fact, while the AUC increase is not significant for most data sets (only 13/60 according to the Wilcoxon test), Figure 9 shows that the average number of fragments used to describe a single molecule effectively Fig. 8. Number of sequential-fragments for three different significance doubles. The gain derived from increasing the amount of thresholds andlength-restricted pathsoffrequency 1. features is thus affected by diminishing returns. Those fragments are relevant in terms of the similarity of molecules yet do not capture any tendencies in the data D. Comparing techniques for determining gfp-size: signifi- themselves. The kernel matrix gives a global view of the cance versus length-restriction similarity of molecules and the SVM is used to discover the The preceding experiments had the main purpose of as- underlyingphenomena.Figure10showsthatthelowestscores sessing the usefulness of graph-, tree- and sequential frag- for fragments derived in the length-restricted approach are ments for gfps, chosen by the significance of their χ2 score. clearly non-significant and it would be hard to base actual Swamidass et al. [2] use gfps whose number is determined understanding of the data on them. It also shows that the by a length restriction – all fragments are paths of maximal scores of the worst sequence and worst tree included in the length 10 occuring in at least one molecule in the training top-thousand graphs are virtually indistinguishable from each data. This approach gives rise to more than one hundred other and from what is considered the worst graph. Finally, thousand features (the top graph in Figure 8), significantly the score of the 1000th-worst tree is marginally worse than more than result even from using the permissive 95% p- the score of the 1000th-worst graph, indicating that most of value. According to the results described above, this should the top-1000 graphs are in fact trees. allow those feature sets to encode all molecules distinctively. Indeed,the low frequencythresholdmeansthatthere are very few correspondences, as can be seen the top part of Figure V. CONCLUSIONS 7. While the number of correspondences is reduced even We performed an empirical evaluation to gain insight into further,however,theyarenoteliminatedcompletely–itwould the reasons for the superiority of sequential molecular frag- probably need longer paths (and thus many more fragments) ments as features for classification, compared to complex to effect this. ones. We find that the reason lies in the greater diversity InSwamidass’sworktwo kernelsareusedforclassification of sequences which leads to a more distinctive encoding of – one based on the well-established Tanimoto-similarity [16] instances, effectively giving classifiers a better representation 6 . .... . of sequencesforclassification. Additionally,dueto sequences ................................................................................................................ bstheeeeinmegnsdgqrouaniptcheespatoghsaesiminbslgeeilvvtehesar,tisasesutceohxapalsaemintieondfinsigneqotuhpeeenrtiaintaitloronfrdawugcomtuioelndnt,si.int .... ACKNOWLEDGEMENTS We would like to thank Kurt De Grave for his invaluable help in preparingthis manuscript.AlbrechtZimmermannwas supported by the Fonds Wetenschappelijk Onderzoek (FWO) by the time of writing. REFERENCES [1] T.WashioandH.Motoda,“Stateoftheartofgraph-baseddatamining,” SIGKDDExplor.Newsl.,vol.5,pp.59–68,July2003. [2] S. J. Swamidass, J. H. Chen, J. Bruand, P. Phung, L. Ralaivola, Fig. 10. Worstscore forfragments mined with three different significance and P. Baldi, “Kernels for small molecules and the prediction of thresholds andaslengthrestricted pathsoffrequency1. mutagenicity,toxicity andanti-cancer activity,” inISMB(Supplement of Bioinformatics), pp.359–368,2005. [3] D.Haussler,“Convolution kernels ondiscrete structures,” TR,1999. [4] T. Ga¨rtner, P. A. Flach, and S. Wrobel, “On graph kernels: Hardness to work with. A straight-forward way of improving the en- results and efficient alternatives,” in COLT (B. Scho¨lkopf and M. K. coding lies in increasing the number of fragments used. Our Warmuth,eds.),vol.2777ofLNCS,pp.129–143, Springer, 2003. experimentsshow,however,thatthereisalwaysneedforafar [5] K.M.Borgwardt andH.-P.Kriegel, “Shortest-path kernels ongraphs,” in ICDM (J. Han, B. W. Wah, V. Raghavan, X. Wu, and R. Rastogi, greaternumberoftrees/graphsthansequences.Asthesestruc- eds.),(Houston,Texas,USA),pp.74–81,IEEE,Nov.2005. tures are also computationally more expensive to enumerate, [6] N.ShervashidzeandK.M.Borgwardt,“Fastsubtreekernelsongraphs,” this leads to vastly increased computational complexity. Our inNIPS(Y. Bengio, D.Schuurmans, J. D.Lafferty, C. K.I.Williams, andA.Culotta, eds.),pp.1660–1668,CurranAssociates, Inc.,2009. results indicate that this should be avoided. [7] S. Menchetti, F. Costa, and P. Frasconi, “Weighted decomposition Enumerating a subset of all sequences that cover at least kernels,”inICML(L.D.RaedtandS.Wrobel,eds.),vol.119ofACM, pp.585–592, ACM,2005. one moleculeproducesan effectivefeatureset butalso a very [8] N.Wale,I.A.Watson,andG.Karypis,“Comparisonofdescriptorspaces large one. Also, these fragments are hard to interpret outside for chemical compound retrieval and classification,” Knowl. Inf. Syst., of their use in a pairwise similarity measure. In contrast to vol.14,no.3,pp.347–375, 2008. [9] F. Costa and K. D. Grave, “Fast neighborhood subgraph pairwise this, fragments that are selected based on their correlation distancekernel,”inICML(J.Fu¨rnkranzandT.Joachims,eds.),pp.255– with the target can be ranked based on their score and the 262,Omnipress,2010. most interesting ones inspected and interpreted by an end [10] L.Schietgat,F.Costa,J.Ramon,andL.D.Raedt,“Effectivefeaturecon- structionbymaximumcommonsubgraphsampling,”MachineLearning, user. While it would be possible to evaluate all size restricted vol.83,no.2,pp.137–161, 2011. fragmentson the data and performa similar ranking,this will [11] G. B. McGaughey, R. P. Sheridan, C. I. Bayly, J. C. Culberson, be computationally expensive due to their large number. C. Kreatsoulas, S. Lindsley, V. Maiorov, J.-F. Truchon, and W. D. Cornell, “Comparison of topological, shape, and docking methods in Our experimentsalso indicate a clear trade-offbetween the virtual screening,” JCIM,vol.47,no.4,pp.1504–1519, 2007. number of fragments (which can be set by adjusting the k in [12] C.Helma,ed.,Predictive Toxicology. CRCPress,2005. top-k mining or the minimum significance threshold) and the [13] C. Helma, T. Cramer, S. Kramer, and L. D. Raedt, “Data mining and machinelearningtechniquesfortheidentificationofmutagenicityinduc- quality of the feature set. Given existing results, it is unlikely ing substructures and structure activity relationships of noncongeneric that similar clear-cut effects would appear when changing the compounds,”JCIM,vol.44,no.4,pp.1402–1411, 2004. length or minimum support of length restricted paths. [14] B. Bringmann, A. Zimmermann, L. De Raedt, and S. Nijssen, “Don’t beafraidofsimplerpatterns,”in10thPKDD(J.Fu¨rnkranz,T.Scheffer, Redundancy among complex patterns could be reduced andM.Spiliopoulou, eds.),pp.55–66,Springer, 2006. explicitly, e.g. in a post-processing step. We have suggested [15] B. Bringmann and A. Zimmermann, “Tree2 - Decision trees for tree a technique that achieves this [17] and since the fragments structured data.,” in9th PKDD (A.Jorge, L.Torgo,P.Brazdil, R. Ca- macho,andJ.Gama,eds.),pp.46–58,Springer, 2005. can be considered features, feature selection techniques are [16] D.J.Rogers andT.T.Tanimoto, “Acomputer program forclassifying applicable [18], but this would again require the mining of a plants,”Science, vol.21,pp.1115–1118, oct1960. large set of trees or graphs. It would need to be larger than [17] B. Bringmann and A. Zimmermann, “The chosen few: On identifying valuable patterns,” in ICDM (N. Ramakrishnan and O. Zaiane, eds.), a set of sequential patterns that could be used without post- pp.63–72,IEEEComputerSociety, 2007. processing,increasingcomputationalcomplexitysignificantly. [18] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” JMLR,vol.3,pp.1157–1182,2003. Given these arguments, class-correlated sequences seem to [19] M. Thoma, H. Cheng, A. Gretton, J. Han, H.-P.Kriegel, A. J. Smola, be the best choice for the large-scale mining of molecular L. Song, P. S. Yu, X. Yan, and K. M. Borgwardt, “Near-optimal fragments as features for SAR prediction. supervisedfeatureselectionamongfrequentsubgraphs,”inSDM,pp.1– 12,SIAM,2009. Analternativeliesiniterativeapproaches,inwhichpatterns [20] A.Zimmermann,B.Bringmann,andU.Ru¨ckert,“Fast,effectivemolec- areminedanddatamanipulated[15],[19],[20],orredundancy ular feature mining by local optimization,” in ECML/PKDD (3) (J. L. with already found patterns made part of the quality function Balca´zar, F. Bonchi, A. Gionis, and M. Sebag, eds.), pp. 563–578, Springer, 2010. [21].Effective,verycompactpatternsetscanbeminedinthis [21] U.Ru¨ckertandS.Kramer,“Optimizingfeaturesetsforstructureddata,” way. So far there is however no clear understanding about in18thECML(J.N.Kok,J.Koronacki,R.L.deMa´ntaras,S.Matwin, whetherthesefeaturesetswouldbecompetitivewithlargesets D.Mladenic, andA.Skowron, eds.),pp.716–723,Springer, 2007.