HybridGO-Loc: Mining Hybrid Features on Gene Ontology for Predicting Subcellular Localization of Multi-Location Proteins

Shibiao Wan1, Man-Wai Mak1*, Sun-Yuan Kung2

1 Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China, 2 Department of Electrical Engineering, Princeton University, Princeton, New Jersey, United States of America

Abstract

Protein subcellular localization prediction, as an essential step to elucidate the functions in vivo of proteins and identify drug targets, has been extensively studied in previous decades. Instead of only determining subcellular localization of single-label proteins, recent studies have focused on predicting both single- and multi-location proteins. Computational methods based on Gene Ontology (GO) have been demonstrated to be superior to methods based on other features. However, existing GO-based methods focus on the occurrences of GO terms and disregard their relationships. This paper proposes a multi-label subcellular-localization predictor, namely HybridGO-Loc, that leverages not only the GO term occurrences but also the inter-term relationships. This is achieved by hybridizing the GO frequencies of occurrences and the semantic similarity between GO terms. Given a protein, a set of GO terms is retrieved by searching against the gene ontology database, using the accession numbers of homologous proteins obtained via BLAST search as the keys. The frequency of GO occurrences and semantic similarity (SS) between GO terms are used to formulate frequency vectors and semantic similarity vectors, respectively, which are subsequently hybridized to construct fusion vectors. An adaptive-decision based multi-label support vector machine (SVM) classifier is proposed to classify the fusion vectors. Experimental results based on recent benchmark datasets and a new dataset containing novel proteins show that the proposed hybrid-feature predictor significantly outperforms predictors based on individual GO features as well as other state-of-the-art predictors.
For readers' convenience, the HybridGO-Loc server, which is for predicting virus or plant proteins, is available online at http://bioinfo.eie.polyu.edu.hk/HybridGoServer/.

Citation: Wan S, Mak M-W, Kung S-Y (2014) HybridGO-Loc: Mining Hybrid Features on Gene Ontology for Predicting Subcellular Localization of Multi-Location Proteins. PLoS ONE 9(3): e89545. doi:10.1371/journal.pone.0089545

Editor: Peter Csermely, Semmelweis University, Hungary

Received November 11, 2013; Accepted January 23, 2014; Published March 19, 2014

Copyright: © 2014 Wan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was in part supported by HKPolyU Grant Nos. G-YJ86 and G-YL78. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

PLOS ONE | www.plosone.org | March 2014 | Volume 9 | Issue 3 | e89545

Introduction

Proteins located in appropriate physiological contexts within a cell are of paramount importance to exert their biological functions. Subcellular localization of proteins is essential to the functions of proteins and has been suggested as a means to maximize functional diversity and economize on protein design and synthesis [1]. Aberrant protein subcellular localization is closely correlated to a broad range of human diseases, such as Alzheimer's disease [2], kidney stone [3], primary human liver tumors [4], breast cancer [5], pre-eclampsia [6] and Bartter syndrome [7]. Knowing where a protein resides within a cell can give insights on drug target identification and drug design [8,9]. Wet-lab experiments such as fluorescent microscopy imaging, cell fractionation and electron microscopy are the gold standard for validating subcellular localization and are essential for the design of high quality localization databases such as The Human Protein Atlas (http://www.proteinatlas.org/). However, wet-lab experiments are time-consuming and laborious. With the avalanche of newly discovered protein sequences in the post-genomic era, computational methods are required to assist biologists to deal with large-scale proteomic data to determine the subcellular localization of proteins.

Conventionally, subcellular-localization predictors can be roughly divided into sequence-based and annotation-based. Sequence-based methods use (1) amino-acid compositions [10,11], (2) sequence homology [12,13], and (3) sorting signals [14,15] as features. Annotation-based methods use information beyond the protein sequences, such as Gene Ontology (GO) terms [16–21], Swiss-Prot keywords [22], and PubMed abstracts [23,24]. A number of studies have demonstrated that methods based on GO information are superior to methods based on sequence-based features [25–28]. Note that the GO database contains not only experimental data but also predicted data (http://www.geneontology.org/GO.evidence.shtml), which may be determined by sequence-based methods. From this point of view, the GO-based prediction, which uses the GO annotation database to retrieve GO terms, is a filtering method for sequence-based predictions.

The GO comprises three orthogonal taxonomies whose terms describe the cellular components, biological processes, and molecular functions of gene products. The GO terms in each taxonomy are organized within a directed acyclic graph. These terms are placed within structural relationships, of which the most important are the 'is-a' relationship (parent and child) and the 'part-of' relationship (part and whole) [29,30]. Recently, the GO consortium has been enriched with more structural relationships, such as 'positively-regulates', 'negatively-regulates' and 'has-part' [31,32]. These relationships reflect that the GO hierarchical tree for each taxonomy contains redundant information, for which semantic similarity over GO terms can be found.

Instead of only determining subcellular localization of single-label proteins, recent studies have been focusing on predicting both single- and multi-location proteins. Since there exist multi-location proteins that can simultaneously reside at, or move between, two or more subcellular locations, it is important to include these proteins in the predictors. Actually, multi-location proteins play important roles in some metabolic processes that take place in more than one cellular compartment, e.g., fatty acid β-oxidation in the peroxisome and mitochondria, and antioxidant defense in the cytosol, mitochondria and peroxisome [33].

Recently, several multi-label predictors based on GO have been proposed, including Plant-mPLoc [34], Virus-mPLoc [35], iLoc-Plant [36], iLoc-Virus [37], KNN-SVM [38], mGOASVM [39] and others [40,41]. These predictors have demonstrated superiority over sequence-based methods. These predictors use the occurrences of the GO terms but do not take the semantic relationships between GO terms into account.

Since the relationship between GO terms reflects the association between different gene products, protein sequences annotated with GO terms can be compared on the basis of semantic similarity measures. The semantic similarity over GO has been extensively studied and has been applied to many biological problems, including protein function prediction [42,43], subnuclear localization prediction [44], protein-protein interaction inference [45–47] and microarray clustering [48]. The performance of these predictors depends on whether the similarity measure is relevant to the biological problems. Over the years, a number of semantic similarity measures have been proposed, some of which have been used in natural language processing.

Semantic similarity measures can be applied at the GO-term level or the gene-product level. At the GO-term level, methods are roughly categorized as node-based and edge-based. The node-based measures basically rely on the concept of information content of terms, which was proposed by Resnik [49] for natural language processing. Later, Lord et al. [50] applied this idea to measure the semantic similarity among GO terms. Lin et al. [51] proposed a method based on information theory and structural information. Subsequently, more node-based measures [52–54] were proposed. Edge-based measures are based on using the length or the depth of different paths between terms and/or their common ancestors [55–58]. At the gene-product level, the two most common methods are pairwise approaches [59–63] and groupwise approaches [64–67]. Pairwise approaches measure similarity between two gene products by combining the semantic similarities between their terms. Groupwise approaches, on the other hand, directly group the GO terms of a gene product as a set, a graph or a vector, and then calculate the similarity by set similarity techniques, graph matching techniques or vector similarity techniques. More recently, Pesquita et al. [68] reviewed the semantic similarity measures applied to biomedical ontologies, and Guzzi et al. [69] provide a comprehensive review on the relationship between semantic similarity measures and biological features.

This paper proposes a multi-label predictor based on hybridizing frequency of occurrences of GO terms and semantic similarity between the terms for protein subcellular localization prediction. Compared to existing multi-label subcellular-localization predictors, our proposed predictor has the following advantages: (1) it formulates the feature vectors by hybridizing GO frequency of occurrences and GO semantic similarity features, which contain richer information than only GO term frequencies; (2) it adopts a new strategy to incorporate richer and more useful homologous information from more distant homologs rather than using the top homologs only; (3) it adopts an adaptive decision strategy for multi-label SVM classifiers so that it can effectively deal with datasets containing both single-label and multi-label proteins. Results on two recent benchmark datasets and a new dataset containing novel proteins demonstrate that these three properties enable the proposed predictor to accurately predict multi-location proteins and outperform several state-of-the-art predictors.

Methods

Legitimacy of Using GO Information

Despite their good performance, GO-based methods have received some criticisms from the research community. The main argument of these criticisms is that the cellular component GO terms already have the cellular component categories, i.e., if the GO terms are known, the subcellular locations will also be known. The prediction problem can therefore be easily solved by creating a lookup table using the cellular component GO terms as the keys and the cellular component categories as the hashed values. Such a naive solution, however, will lead to very poor prediction performance, as demonstrated and explained in our previous studies [28,39]. A number of studies [70–72] by other groups also strongly support the legitimacy of using GO information for subcellular localization. For example, as suggested by [72], the good performance of GO-based methods is due to the high representation power of the GO space as compared to the Euclidean feature spaces used by the conventional sequence-based methods.

Retrieval of GO Terms

The proposed predictor can use either the accession numbers (AC) or amino acid (AA) sequences of query proteins as input. Specifically, for proteins with known ACs, their respective GO terms are retrieved from the Gene Ontology annotation (GOA) database (http://www.ebi.ac.uk/GOA) using the ACs as the searching keys. For proteins without ACs, their AA sequences are presented to BLAST [73] to find their homologs, whose ACs are then used as keys to search against the GOA database.

While the GOA database allows us to associate the AC of a protein with a set of GO terms, for some novel proteins, neither their ACs nor the ACs of their top homologs have any entries in the GOA database; in other words, no GO terms can be retrieved by using their ACs or the ACs of their top homologs. In such a case, the ACs of the homologous proteins, as returned from BLAST search, will be successively used to search against the GOA database until a match is found. With the rapid progress of the GOA database, it is reasonable to assume that the homologs of the query proteins have at least one GO term [17]. Thus, it is not necessary to use back-up methods to handle the situation where no GO terms can be found. The procedures are outlined in Fig 1.

Figure 1. Procedures of retrieving GO terms. $\mathbb{Q}_i$: the i-th query protein; $k_{max}$: the maximum number of homologs retrieved by BLAST with the default parameter setting; $\mathcal{Q}_{i,k_i}$: the set of GO terms retrieved by BLAST using the $k_i$-th homolog for the i-th query protein $\mathbb{Q}_i$; $k_i$: the $k_i$-th homolog used to retrieve the GO terms. doi:10.1371/journal.pone.0089545.g001

GO Frequency Features

Let $\mathcal{W}$ denote a set of distinct GO terms corresponding to a data set. $\mathcal{W}$ is constructed in two steps: (1) identifying all of the GO terms in the dataset and (2) removing the repetitive GO terms. Suppose $W$ distinct GO terms are found, i.e., $|\mathcal{W}| = W$; these GO terms form a GO Euclidean space with $W$ dimensions. For each sequence in the dataset, a GO vector is constructed by matching its GO terms against $\mathcal{W}$, using the number of occurrences of individual GO terms in $\mathcal{W}$ as the coordinates. Specifically, the GO vector $\mathbf{p}_i^{F}$ of the i-th protein $\mathbb{Q}_i$ is defined as:

$$\mathbf{p}_i^{F} = [b_{i,1}, \cdots, b_{i,j}, \cdots, b_{i,W}]^{T}, \quad b_{i,j} = \begin{cases} f_{i,j}, & \text{GO hit} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $f_{i,j}$ is the number of occurrences of the j-th GO term (term-frequency) in the i-th protein sequence. The rationale is that the term-frequencies contain important information for classification. Note that the $b_{i,j}$'s are analogous to the term-frequencies commonly used in document retrieval.

Similarly, for the t-th query protein $\mathbb{Q}_t$, the GO frequency vector is defined as:

$$\mathbf{q}_t^{F} = [b_{t,1}, \cdots, b_{t,j}, \cdots, b_{t,W}]^{T}, \quad b_{t,j} = \begin{cases} f_{t,j}, & \text{GO hit} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

In the following sections, we use the superscript F to denote the GO frequency features in Eq. 2.

Semantic-Similarity Features

Semantic similarity (SS) is a measure for quantifying the similarity between categorical data (e.g., words in documents), where the notion of similarity is based on the likeness of meanings in the data. It was originally developed by Resnik [49] for natural language processing. The idea is to evaluate semantic similarity in an 'is-a' taxonomy using the shared information contents of categorical data. In the context of gene ontology, the semantic similarity between two GO terms is based on their most specific common ancestor in the GO hierarchy. The relationships between GO terms in the GO hierarchy, such as 'is-a' ancestor-child, or 'part-of' ancestor-child, can be obtained from the SQL database through the link: http://archive.geneontology.org/latest-termdb/go_daily-termdb-tables.tar.gz. Note that here only the 'is-a' relationship is considered for semantic similarity analysis [51]. Specifically, the semantic similarity between two GO terms x and y is defined as [49]:

$$sim(x,y) = \max_{c \in A(x,y)} \left[ -\log(p(c)) \right], \tag{3}$$

where $A(x,y)$ is the set of ancestor GO terms of both x and y, and $p(c)$ is the probability of the GO term c, i.e., the number of gene products annotated to c divided by the total number of gene products annotated in the GO taxonomy.

While Resnik's measure is effective in quantifying the shared information between two GO terms, it ignores the distance between the terms and their common ancestors in the GO hierarchy. To further incorporate structural information from the GO hierarchy into the similarity measure, we have explored three extensions of Resnik's measure, namely Lin's measure [51], Jiang's measure [74], and relevance similarity (RS) [52].

Given two GO terms x and y, the similarity by Lin's measure is:

$$sim_{Lin}(x,y) \equiv sim_1(x,y) = \max_{c \in A(x,y)} \left( \frac{2 \cdot [-\log(p(c))]}{-\log(p(x)) - \log(p(y))} \right) \tag{4}$$

The similarity by Jiang's measure is:

$$sim_{Jiang}(x,y) \equiv sim_2(x,y) = \max_{c \in A(x,y)} \left( \frac{1}{1 - \log(p(x)) - \log(p(y)) - 2 \cdot [-\log(p(c))]} \right) \tag{5}$$

The similarity by RS is calculated as:

$$sim_{RS}(x,y) \equiv sim_3(x,y) = \max_{c \in A(x,y)} \left( \frac{2 \cdot [-\log(p(c))]}{-\log(p(x)) - \log(p(y))} \cdot (1 - p(c)) \right) \tag{6}$$

Among the three measures, $sim_{Lin}(x,y)$ and $sim_{Jiang}(x,y)$ are relative measures that are proportional to the difference in information content between the terms and their common ancestors, which is independent of the absolute information content of the ancestors. On the other hand, $sim_{RS}(x,y)$ incorporates the probability of annotating the common ancestors as a weighting factor to Lin's measure. To simplify notation, we refer to $sim_{Lin}(x,y)$, $sim_{Jiang}(x,y)$ and $sim_{RS}(x,y)$ as $sim_1(x,y)$, $sim_2(x,y)$ and $sim_3(x,y)$, respectively.

Based on the semantic similarity between two GO terms, we adopted a continuous measure proposed in [48] to calculate the similarity between two proteins. Specifically, given two proteins $\mathbb{Q}_i$ and $\mathbb{Q}_j$, we retrieved their corresponding GO terms $\mathcal{P}_i$ and $\mathcal{P}_j$ as described in the subsection "Retrieval of GO Terms". (Note that strictly speaking, $\mathcal{P}_i$ should be $\mathcal{P}_{i,k_i}$, where $k_i$ is the $k_i$-th homolog used to retrieve the GO terms for the i-th protein. To simplify notation, we write it as $\mathcal{P}_i$.) Then, we computed the semantic similarity between the two sets of GO terms $\{\mathcal{P}_i, \mathcal{P}_j\}$ as follows:

$$S_l(\mathcal{P}_i, \mathcal{P}_j) = \sum_{x \in \mathcal{P}_i} \max_{y \in \mathcal{P}_j} sim_l(x,y), \tag{7}$$

where $l \in \{1,2,3\}$, and $sim_l(x,y)$ is defined in Eq. 4 to Eq. 6. $S_l(\mathcal{P}_j, \mathcal{P}_i)$ is computed in the same way by swapping $\mathcal{P}_i$ and $\mathcal{P}_j$. Finally, the overall similarity between the two proteins is given by:

$$SS_l(\mathcal{P}_i, \mathcal{P}_j) = \frac{S_l(\mathcal{P}_i, \mathcal{P}_j) + S_l(\mathcal{P}_j, \mathcal{P}_i)}{S_l(\mathcal{P}_i, \mathcal{P}_i) + S_l(\mathcal{P}_j, \mathcal{P}_j)}, \tag{8}$$

where $l \in \{1,2,3\}$. In the sequel, we refer to the SS measures by Lin, Jiang and RS as SS1, SS2 and SS3, respectively.

Thus, for a testing protein $\mathbb{Q}_t$ with GO term set $\mathcal{Q}_t$, a GO semantic similarity (SS) vector $\mathbf{q}_t^{S_l}$ can be obtained by computing the semantic similarity between $\mathcal{Q}_t$ and each of the training proteins $\{\mathbb{Q}_i\}_{i=1}^{N}$, where N is the number of training proteins. Thus, $\mathbb{Q}_t$ can be represented by an N-dimensional vector:

$$\mathbf{q}_t^{S_l} = [SS_l(\mathcal{Q}_t, \mathcal{P}_1), \cdots, SS_l(\mathcal{Q}_t, \mathcal{P}_i), \cdots, SS_l(\mathcal{Q}_t, \mathcal{P}_N)]^{T}, \tag{9}$$

where $l \in \{1,2,3\}$. In other words, $\mathbf{q}_t^{S_l}$ represents the SS vector obtained by using the l-th SS measure.

Hybridization of Two GO Features

As can be seen from the subsections "GO Frequency Features" and "Semantic-Similarity Features", the GO frequency features (Eq. 2) use the frequency of occurrences of GO terms, while the GO SS features (Eq. 4 to Eq. 6) use the semantic similarity between GO terms. These two features are developed from two different perspectives. It is therefore reasonable to believe that these two kinds of information complement each other. Based on this assumption, we combine these two GO features and form a hybridized vector as:

$$\mathbf{q}_t^{H_l} = \begin{bmatrix} \mathbf{q}_t^{F} \\ \mathbf{q}_t^{S_l} \end{bmatrix}, \tag{10}$$

where $l \in \{1,2,3\}$. In other words, $\mathbf{q}_t^{H_l}$ represents the hybridized-feature vector obtained by combining the GO frequency features and the SS features derived from the l-th SS measure. We refer to them as Hybrid1, Hybrid2 and Hybrid3, respectively.

Multi-label Multi-class SVM Classification

The hybridized-feature vectors obtained from the previous subsection are used for training multi-label one-vs-rest support vector machines (SVMs). Specifically, for an M-class problem (here M is the number of subcellular locations), M independent binary SVMs are trained, one for each class. Denote the hybrid GO vector of the t-th query protein using the l-th SS measure as $\mathbf{q}_t^{H_l}$. Given the t-th query protein $\mathbb{Q}_t$, the score of the m-th SVM using the l-th SS measure is

$$s_{m,l}(\mathbb{Q}_t) = \sum_{r \in \mathcal{S}_{m,l}} \alpha_{m,r}\, y_{m,r}\, K(\mathbf{p}_r^{H_l}, \mathbf{q}_t^{H_l}) + b_m \tag{11}$$

where $\mathbf{q}_t^{H_l}$ is the hybrid GO vector derived from $\mathbb{Q}_t$ (see Eq. 10), $\mathcal{S}_{m,l}$ is the set of support vector indexes corresponding to the m-th SVM, $\alpha_{m,r}$ are the Lagrange multipliers, $y_{m,r} \in \{-1,+1\}$ indicates whether the r-th training protein belongs to the m-th class or not, and $K(\cdot,\cdot)$ is a kernel function. Here, the linear kernel was used.

Unlike the single-label problem where each protein has one predicted label only, a multi-label protein could have more than one predicted label. In this work, we compared two different decision schemes for this multi-label problem. In the first scheme, the predicted subcellular location(s) of the t-th query protein are given by

$$\mathcal{M}_l^{*}(\mathbb{Q}_t) = \begin{cases} \bigcup_{m=1}^{M} \{m : s_{m,l}(\mathbb{Q}_t) > 0\}, & \text{when } \exists\, m \in \{1,\ldots,M\} \text{ s.t. } s_{m,l}(\mathbb{Q}_t) > 0; \\ \arg\max_{m=1}^{M} s_{m,l}(\mathbb{Q}_t), & \text{otherwise.} \end{cases} \tag{12}$$

The second scheme is an improved version of the first one in that the decision threshold is dependent on the test protein. Specifically, the predicted subcellular location(s) of the t-th query protein are given by: if $\exists\, s_{m,l}(\mathbb{Q}_t) > 0$,

$$\mathcal{M}_l(\mathbb{Q}_t) = \bigcup_{m=1}^{M} \left\{ m : s_{m,l}(\mathbb{Q}_t) \geq \min\{1.0, f(s_{max,l}(\mathbb{Q}_t))\} \right\} \tag{13}$$

otherwise,

$$\mathcal{M}_l(\mathbb{Q}_t) = \arg\max_{m=1}^{M} s_{m,l}(\mathbb{Q}_t). \tag{14}$$

In Eq. 13, $f(s_{max,l}(\mathbb{Q}_t))$ is a function of $s_{max,l}(\mathbb{Q}_t)$, where $s_{max,l}(\mathbb{Q}_t) = \max_{m=1}^{M} s_{m,l}(\mathbb{Q}_t)$. In this work, we used a linear function as follows:

$$f(s_{max,l}(\mathbb{Q}_t)) = \theta\, s_{max,l}(\mathbb{Q}_t), \tag{15}$$

where $\theta \in [0.0, 1.0]$ is a hyper-parameter that can be optimized through cross-validation.

In fact, besides SVMs, many other machine learning models, such as hidden Markov models (HMMs) and neural networks (NNs) [75,76], have been used in protein subcellular-localization predictors. However, HMMs and NNs are not suitable for GO-based predictors because of the high dimensionality of GO vectors. The main reason is that under such a condition, HMMs and NNs can be easily overtrained and thus lead to poor performance. On the other hand, linear SVMs can handle high-dimensional data well because even if the number of training samples is smaller than the feature dimension, linear SVMs are still able to find an optimal solution.

Materials and Performance Metrics

Datasets

In this paper, a virus dataset [35,37] and a plant dataset [36] were used to evaluate the performance of the proposed predictor. The virus and the plant datasets were created from Swiss-Prot 57.9 and 55.3, respectively. The virus dataset contains 207 viral proteins distributed in 6 locations. Of the 207 viral proteins, 165 belong to one subcellular location, 39 to two locations, 3 to three locations and none to four or more locations. This means that about 20% of the proteins in the dataset are located in more than one subcellular location. The plant dataset contains 978 plant proteins distributed in 12 locations. Of the 978 plant proteins, 904 belong to one subcellular location, 71 to two locations, 3 to three locations and none to four or more locations. The sequence identity of both datasets was cut off at 25%.

The breakdown of these two datasets is shown in Figs. 2(a) and 2(b). Fig. 2(a) shows that the majority (68%) of viral proteins in the virus dataset are located in host cytoplasm and host nucleus, while proteins located in the rest of the subcellular locations account for only around one third in total. This means that this multi-label dataset is imbalanced across the six subcellular locations. Similar conclusions can be drawn from Fig. 2(b), where most of the plant proteins exist in chloroplast, cytoplasm, nucleus and mitochondrion, while proteins in the other 8 subcellular locations account for less than 30% in total. This imbalance makes the prediction of these two multi-label datasets difficult. These two benchmark datasets are downloadable from the hyperlinks in the HybridGO-Loc server.

Figure 2. Breakdown of the (a) virus and (b) plant datasets. The number of proteins shown in each subcellular location represents the number of 'locative proteins' [37,39]. For (a), there are 207 actual proteins and 252 locative proteins; for (b), there are 978 actual proteins and 1055 locative proteins. doi:10.1371/journal.pone.0089545.g002

Performance Metrics

Compared to traditional single-label classification, multi-label classification requires more complicated performance metrics to better reflect the multi-label capabilities of classifiers. Conventional single-label measures need to be modified to adapt to multi-label classification. These measures include Accuracy, Precision, Recall, F1-score (F1) and Hamming Loss (HL) [77,78]. Specifically, denote $\mathcal{L}(\mathbb{Q}_i)$ and $\mathcal{M}(\mathbb{Q}_i)$ as the true label set and the predicted label set for the i-th protein $\mathbb{Q}_i$ ($i = 1, \ldots, N$), respectively. Here, $N = 207$ for the virus dataset and $N = 978$ for the plant dataset. Then the five measurements are defined as follows:

$$Accuracy = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{|\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|}{|\mathcal{M}(\mathbb{Q}_i) \cup \mathcal{L}(\mathbb{Q}_i)|} \right) \tag{16}$$

$$Precision = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{|\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|}{|\mathcal{M}(\mathbb{Q}_i)|} \right) \tag{17}$$

$$Recall = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{|\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|}{|\mathcal{L}(\mathbb{Q}_i)|} \right) \tag{18}$$

$$F1 = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{2\,|\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|}{|\mathcal{M}(\mathbb{Q}_i)| + |\mathcal{L}(\mathbb{Q}_i)|} \right) \tag{19}$$

$$HL = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{|\mathcal{M}(\mathbb{Q}_i) \cup \mathcal{L}(\mathbb{Q}_i)| - |\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|}{M} \right) \tag{20}$$

where $|\cdot|$ means counting the number of elements in the set therein, and $\cap$ and $\cup$ represent the intersection and union of sets, respectively.

Accuracy, Precision, Recall and F1 indicate the classification performance. The higher the measures, the better the prediction performance. Among them, Accuracy is the most commonly used criterion. F1-score is the harmonic mean of Precision and Recall, which allows us to compare the performance of classification systems by taking the trade-off between Precision and Recall into account. The Hamming Loss (HL) [77,78] is different from the other metrics. As can be seen from Eq. 20, when all of the proteins are correctly predicted, i.e., $|\mathcal{M}(\mathbb{Q}_i) \cup \mathcal{L}(\mathbb{Q}_i)| = |\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|$ ($i = 1, \ldots, N$), then $HL = 0$, whereas the other metrics will be equal to 1. On the other hand, when the predictions of all proteins are completely wrong, i.e., $|\mathcal{M}(\mathbb{Q}_i) \cup \mathcal{L}(\mathbb{Q}_i)| = M$ and $|\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)| = 0$, then $HL = 1$, whereas the other metrics will be equal to 0. Therefore, the lower the HL, the better the prediction performance.

Two additional measurements [37,39] are often used in multi-label subcellular localization prediction. They are overall locative accuracy (OLA) and overall actual accuracy (OAA). The former is given by:

$$OLA = \frac{1}{\sum_{i=1}^{N} |\mathcal{L}(\mathbb{Q}_i)|} \sum_{i=1}^{N} |\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|, \tag{21}$$

and the overall actual accuracy (OAA) is:

$$OAA = \frac{1}{N} \sum_{i=1}^{N} \Delta[\mathcal{M}(\mathbb{Q}_i), \mathcal{L}(\mathbb{Q}_i)] \tag{22}$$

where

$$\Delta[\mathcal{M}(\mathbb{Q}_i), \mathcal{L}(\mathbb{Q}_i)] = \begin{cases} 1, & \text{if } \mathcal{M}(\mathbb{Q}_i) = \mathcal{L}(\mathbb{Q}_i) \\ 0, & \text{otherwise.} \end{cases} \tag{23}$$

According to Eq. 21, a locative protein is considered to be correctly predicted if any of the predicted labels matches any label in the true label set. On the other hand, Eq. 22 suggests that an actual protein is considered to be correctly predicted only if all of the predicted labels match those in the true label set exactly. For example, for a protein that coexists in, say, three subcellular locations, if only two of the three are correctly predicted, or the predicted result contains a location not belonging to the three, the prediction is considered to be incorrect. In other words, the prediction is considered correct when and only when all of the subcellular locations of a query protein are exactly predicted without any overprediction or underprediction. Therefore, OAA is a more stringent measure than OLA. OAA is also more objective than OLA. This is because locative accuracy is liable to give biased performance measures when the predictor tends to over-predict, i.e., giving large $|\mathcal{M}(\mathbb{Q}_i)|$ for many $\mathbb{Q}_i$. In the extreme case, if every protein is predicted to have all of the M subcellular locations, according to Eq. 21, the OLA is 100%. But obviously, the predictions are wrong and meaningless. On the contrary, OAA is 0% in this extreme case, which reflects the real performance.

Among all the metrics mentioned above, OAA is the most stringent and objective. This is because if only some (but not all) of the subcellular locations of a query protein are correctly predicted, the numerators of the other measures (Eqs. 16 to 21) are non-zero, whereas the numerator of OAA in Eq. 22 is 0 (thus contributing nothing to the frequency count).

In statistical prediction, there are three methods that are often used for testing the generalization capabilities of predictors: independent tests, sub-sampling tests (or K-fold cross-validation) and leave-one-out cross validation (LOOCV). For independent tests, the selection of the independent dataset often bears some sort of arbitrariness [79]; for K-fold cross validation, different partitionings of a dataset will lead to different results, so it is still liable to statistical arbitrariness; LOOCV yields a unique outcome and is considered to be the most rigorous and bias-free method [80]. Hence, LOOCV was used to examine the performance of all predictors in this work. More detailed analysis of the statistical methods can be found in the supplementary materials. Note that the jackknife cross validation in iLoc-Plant and its variants is the same as LOOCV, as mentioned in [36,79]. Because the term jackknife also refers to the methods that estimate the bias and variance of an estimator [81], to avoid confusion, we only use the term LOOCV in this paper.

Results

Comparing Different Features

Fig. 3(a) shows the performance of individual and hybridized GO features on the virus dataset based on leave-one-out cross validation (LOOCV). In the figure, SS1, SS2 and SS3 represent Lin's, Jiang's and RS similarity measures, respectively. Hybrid1, Hybrid2 and Hybrid3 represent the hybridized features obtained from these measures. As can be seen, in terms of all six performance metrics, the performance of the hybrid features is remarkably better than the performance of the individual features, regardless of whether the GO frequency features or any of the three GO SS features were used. Specifically, the OAAs (the most stringent and objective metric) of all three hybrid features are at least 3% (absolute) higher than those of the individual features, which suggests that hybridizing the two features can significantly boost the prediction performance. Moreover, among the hybridized features, Hybrid2, namely combining GO frequency features and GO SS features by Jiang's measure, outperforms Hybrid1 and Hybrid3. Another interesting observation is that although all of the individual GO SS features perform much worse than the GO frequency features, the performance of the three hybridized features is still better than that of any of the individual features. This suggests that the GO frequency features and SS features are complementary to each other.

Similar conclusions can be drawn from the plant dataset shown in Fig. 3(b). However, comparison between Fig. 3(a) and Fig. 3(b) reveals that for the plant dataset, the hybridized features outperform all of the individual features in terms of all metrics except OLA and Recall, while for the virus dataset, the former are superior to the latter in terms of all metrics. However, the losses in these two metrics do not outweigh the significant improvement on the other metrics, especially on OAA, which improves by around 3% (absolute) when hybridized features are used instead of individual features. Among the hybridized features, Hybrid2 also outperforms Hybrid1 and Hybrid3 in terms of OLA, Accuracy, Recall and F1-score, whereas Hybrid1 performs better than the others in terms of OAA and Precision. These results demonstrate that the GO SS features obtained by Lin's measure and Jiang's measure are better candidates than the RS measure for combining with the GO frequency features; however, there is no evidence suggesting which of the two measures is better. It is also interesting to see that the performance of the three individual GO SS features is better than that of the GO frequency features, contrary to the results shown in Fig 3(a).

Figure 3. Performance of the hybrid features and individual features on the (a) virus and (b) plant datasets. Freq: GO frequency features; SS1, SS2 and SS3: GO semantic similarity features by using Lin's measure [51], Jiang's measure [74] and RS measure [52], respectively; Hybrid1, Hybrid2 and Hybrid3: GO hybrid features by combining GO frequency features with GO semantic similarity features based on SS1, SS2 and SS3, respectively. doi:10.1371/journal.pone.0089545.g003

Comparing with State-of-the-Art Predictors

Table 1 and Table 2 compare the performance of the proposed predictor against several state-of-the-art multi-label predictors on the virus and plant datasets based on leave-one-out cross validation. Note that we used the best performing hybridized features with the adaptive decision strategy. Specifically, for both the virus and plant datasets, the best performance was achieved when Hybrid2 and the adaptive decision strategy with $\theta = 0.3$ were used. $\theta$ was determined by cross-validation as stated previously. Unless stated otherwise, we used Hybrid2 to represent HybridGO-Loc in subsequent experiments. Our proposed predictor uses the GO frequency features and GO semantic similarity features, whereas the other predictors use only the GO frequency of occurrences as features. From the classification perspective, Virus-mPLoc [35] uses an ensemble OET-KNN (optimized evidence-theoretic K-nearest neighbors) classifier; iLoc-Virus [37] uses a multi-label KNN classifier; KNN-SVM [38] uses an ensemble of classifiers combining KNN and SVM; mGOASVM [39] uses a multi-label SVM classifier; and the proposed predictor uses a multi-label SVM classifier incorporated with the adaptive decision scheme.

As shown in Table 1, the proposed predictor performs significantly better than the other predictors. The OAA and OLA of the proposed predictor are more than 15% (absolute) higher than those of iLoc-Virus and Virus-mPLoc. It also performs significantly better than KNN-SVM in terms of OLA. When compared with mGOASVM, the proposed predictor performs remarkably better in all of the performance metrics, especially for the OAA (0.937 vs 0.889). These results demonstrate that hybridizing the GO frequency features and GO SS features can significantly boost prediction performance, which also suggests that these two kinds of information are complementary to each other in terms of predicting subcellular localization.

Similar conclusions can be drawn for the plant dataset from Table 2, except that the OLA of the proposed predictor is slightly worse than that of mGOASVM, and the Recall is equivalent to that of mGOASVM. Nevertheless, the small losses do not outweigh the impressive improvement in the other metrics, especially in the OAA (0.936 vs 0.874).

Prediction of Novel Proteins

To further demonstrate the effectiveness of HybridGO-Loc, a newer plant dataset constructed for mGOASVM [39] was used to compare with state-of-the-art multi-label predictors using independent tests. Specifically, this new plant dataset contains 175 plant proteins, of which 147 belong to one subcellular location, 27 belong to two locations, 1 belongs to three locations and none to four or more locations. These plant proteins were added to Swiss-Prot between 08-Mar-2011 and 18-Apr-2012. Because the plant dataset used for training the predictors was created on 29-Apr-2008, there is an almost 3-year time gap between the training data and test data in our experiments.

Table 3 compares the performance of HybridGO-Loc against several state-of-the-art multi-label plant predictors on the new plant dataset. All the predictors use the 978 proteins of the plant dataset (see Fig. 2(b)) for training the classifier and make independent tests on the new 175 proteins. As can be seen, HybridGO-Loc performs significantly better than all the other predictors in terms of all of the performance metrics. Similar conclusions can also be drawn from the performance in individual subcellular locations.

Fig. 4 shows the distribution of the E-values of the test proteins, which were obtained by using the training proteins as the repository and the test proteins as the query proteins in the BLAST search. If we use the common criterion that homologous proteins should have an E-value less than $10^{-4}$, then 74 out of 175 test proteins are homologs of the training proteins, which account for 42% of the test set. Note that this homologous relationship does not mean that using BLAST's homology transfers can predict all of the 74 test proteins correctly. In fact, BLAST's homology transfers (based on the CC field of the homologous proteins) can only achieve a prediction accuracy of 26.9% (47/175). As the

Table 1. Comparing the proposed predictor with state-of-the-art multi-label predictors based on leave-one-out cross validation (LOOCV) using the virus dataset.
Label SubcellularLocation LOOCVLocativeAccuracy(LA) Virus-mPLoc[35] KNN-SVM[38] iLoc-Virus[37] mGOASVM[39] HybridGO-Loc 1 Viralcapsid 8/8=1.000 8/8=1.000 8/8=1.000 8/8=1.000 8/8=1.000 2 Hostcellmembrane 19/33=0.576 27/33=0.818 25/33=0.758 32/33=0.970 32/33=0.970 3 HostER 13/20=0.650 15/20=0.750 15/20=0.750 17/20=0.850 18/20=0.900 4 Hostcytoplasm 52/87=0.598 86/87=0.988 64/87=0.736 85/87=0.977 85/87=0.966 5 Hostnucleus 51/84=0.607 54/84=0.651 70/84=0.833 82/84=0.976 82/84=0.988 6 Secreted 9/20=0.450 13/20=0.650 15/20=0.750 20/20=1.000 20/20=1.000 OverallLocativeAccuracy(OLA) 152/252=0.603 203/252=0.807 197/252=0.782 244/252=0.968 245/252=0.972 OverallActualAccuracy(OAA) – – 155/207=0.748 184/207=0.889 194/207=0.937 Accuracy – – – 0.935 0.961 Precision – – – 0.939 0.965 Recall – – – 0.973 0.976 F1 – – – 0.950 0.968 HL – – – 0.026 0.016 ‘‘–’’meansthecorrespondingreferencesdonotprovidetheresultsontherespectivemetrics.HostER:Hostendoplasmicreticulum. doi:10.1371/journal.pone.0089545.t001 predictionaccuracyofHybridGO-Loconthistestset(seeTable3) Discussion is significantly higher than this percentage, the extra information available from the GOA database plays a very important role in Semantic Similarity Measures theprediction. In this paper, we have compared three of the most common semanticsimilaritymeasuresforsubcellularlocalization,including Table2.Comparing theproposed predictorwith state-of-the-artmulti-labelpredictors basedonleave-one-outcrossvalidation (LOOCV)usingthe plantdataset. 
Label SubcellularLocation LOOCVLocativeAccuracy(LA) Plant-mPLoc[34] iLoc-Plant[36] mGOASVM[39] HybridGO-Loc 1 Cellmembrane 24/56=0.429 39/56=0.696 53/56=0.946 51/56=0.911 2 Cellwall 8/32=0.250 19/32=0.594 27/32=0.844 28/32=0.875 3 Chloroplast 248/286=0.867 252/286=0.881 272/286=0.951 278/286=0.972 4 Cytoplasm 72/182=0.396 114/182=0.626 174/182=0.956 168/182=0.923 5 Endoplasmicreticulum 17/42=0.405 21/42=0.500 38/42=0.905 38/42=0.905 6 Extracellular 3/22=0.136 2/22=0.091 22/22=1.000 21/22=0.955 7 Golgiapparatus 6/21=0.286 16/21=0.762 19/21=0.905 19/21=0.905 8 Mitochondrion 114/150=0.760 112/150=0.747 150/150=1.000 149/150=0.993 9 Nucleus 136/152=0.895 140/152=0.921 151/152=0.993 150/152=0.987 10 Peroxisome 14/21=0.667 6/21=0.286 21/21=1.000 21/21=1.000 11 Plastid 4/39=0.103 7/39=0.179 39/39=1.000 38/39=0.974 12 Vacuole 26/52=0.500 28/52=0.538 49/52=0.942 48/52=0.923 OverallLocativeAccuracy(OLA) 672/1055=0.637 756/1055=0.717 1015/1055=0.962 1009/1055=0.956 OverallActualAccuracy(OAA) – 666/978=0.681 855/978=0.874 915/978=0.936 Accuracy – – 0.926 0.959 Precision – – 0.933 0.972 Recall – – 0.968 0.968 F1 – – 0.942 0.966 HL – – 0.013 0.007 ‘‘–’’meansthecorrespondingreferencesdonotprovidetheresultsontherespectivemetrics. doi:10.1371/journal.pone.0089545.t002 PLOSONE | www.plosone.org 8 March2014 | Volume 9 | Issue 3 | e89545 HybridGO-Loc Table3. ComparingHybridGO-Loc with state-of-the-art multi-labelplantpredictors basedon independenttestsusingthe new plantdataset. 
Label SubcellularLocation IndependentTestLocativeAccuracy Plant-mPLoc[34] iLoc-Plant[36] mGOASVM[39] HybridGO-Loc 1 Cellmembrane 8/16=0.500 1/16=0.063 7/16=0.438 16/16=1.000 2 Cellwall 0/1=0 0/1=0 0/1=0% 1/1=1.000 3 Chloroplast 27/54=0.500 45/54=0.833 39/54=0.722 30/54=0.556 4 Cytoplasm 5/38=0.132 15/38=0.395 19/38=0.500 31/38=0.816 5 Endoplasmicreticulum 1/9=0.111 1/9=0.111 3/9=0.333 4/9=0.444 6 Extracellular 0/3=0 0/3=0 1/3=0.333 0/3=0 7 Golgiapparatus 3/7=0.429 1/7=0.143 3/7=0.429 7/7=1.000 8 Mitochondrion 6/16=0.375 3/16=0.188 11/16=0.688 16/16=1.000 9 Nucleus 31/46=0.674 43/46=0.935 33/46=0.717 44/46=0.957 10 Peroxisome 4/6=0.667 0/6=0 3/6=0.500 4/6=0.667 11 Plastid 0/1=0 0/1=0 0/1=0 0/1=0 12 Vacuole 2/7=0.286 4/7=0.571 4/7=0.571 7/7=1.000 OverallLocativeAccuracy(OLA) 87/204=0.427 113/204=0.554 123/204=0.603 160/204=0.784 OverallActualAccuracy(OAA) 60/175=0.343 91/175=0.520 97/175=0.554 127/175=0.726 Accuracy 0.417 0.574 0.594 0.784 Precision 0.444 0.626 0.630 0.826 Recall 0.474 0.577 0.609 0.798 F1 0.444 0.592 0.611 0.803 HL 0.116 0.076 0.075 0.037 doi:10.1371/journal.pone.0089545.t003 N Lin’smeasure[51],Jiang’smeasure[74],andrelevancesimilarity B1) Inter-term relationship. SS vectors are based on inter- measure [52]. We excluded Resnik’s measure because it ignores term relationships. They are defined on a space in which thedistancebetweenthetermsandtheircommonancestorsinthe each basis corresponds to one training protein and the GO hierarchy. In addition to these measures, many online tools coordinate along that basis is defined by the semantic arealsoavailableforcomputingthesemanticsimilarityattheGO- similarity between a testing protein and the term level and gene-product level [44,82–84]. However, these corresponding trainingprotein. N measuresarediscretemeasureswhereasthemeasuresthatweused B2) Inter-group relationship. 
The pairwise relationships arecontinuous.Researchhasshownthatcontinuousmeasuresare betweena test protein and the training proteins are better than discrete measuresinmanyapplications [48]. hierarchically structured. This is because each basis of the SS space depends on a group of GO terms of the GO-Frequency Features versus SS Features corresponding training protein, and the terms are NotethatwedonotreplacetheGOfrequencyvectors.Instead, arranged in a hierarchical structure (parent- child weaugmenttheGOfrequency featurewithamoresophisticated relationship). Because the GO terms in different groups feature,i.e.theGOSSvectors,whicharetobecombinedwiththe are not mutually exclusive, the bases in the SS space are GO frequency vectors. A GO frequency vector is found by not independent of eachother. countingthenumberofoccurrencesofeveryGOterminasetof distinctGOtermsobtainedfromthetrainingdataset,whereasan Bias Analysis SS vector is constructed by computing the semantic similarity Except for the new plant dataset, we adopted LOOCV to between a test protein with each of the training proteins at the examine the performance of all predictors in this work, which is gene-product level. That is, each element in an SS vector considered to be the most rigorous and bias-free [80]. Neverthe- represents the semantic similarity of two GO-term groups. This less,determiningthesetofdistinctGOtermsWfromadatasetis can be easily seen from their definitions in Eq. 2 and Eq. 4–9, by no means without bias, which may favor the LOOCV respectively. performance. This is because the set of distinct GO terms W TheGOfrequencyvectorsandtheGOSSvectorsaredifferent derived from a given dataset may not be representative for other in twofundamental ways. datasets; in other words, the generalization capabilities of the N predictorsmaybeweakenedwhennewGOtermsoutsideWare A). GO frequency vectors are more primitive in the sense that foundin thetest proteins. 
their elements are based on individual GO terms without However,wehavethefollowingstrategiestominimizethebias. considering the inter-term relationship, i.e., the elements in a First, the two benchmark datasets used in this paper were GO frequency vectors are independent ofeach other. constructed based on the whole Swiss-Prot database (although in N different years), which, to some extent, incorporated all the B). GO SS vectors are more sophisticated in the following two senses. PLOSONE | www.plosone.org 9 March2014 | Volume 9 | Issue 3 | e89545 HybridGO-Loc Figure4.Distributionoftheclosenessbetweenthenewtestingproteinsandthetrainingproteins.TheclosenessisdefinedastheBLAST E-valuesofthetrainingproteinsusingthetestproteinsasthequeryproteinsintheBLASTsearches.NumberofProteins:Thenumberoftesting proteinswhoseE-valuesfallintotheintervalspecifiedunderthebar.SmallE-valuessuggestthatthecorrespondingnewproteinsareclosehomologs ofthetrainingproteins. doi:10.1371/journal.pone.0089545.g004 possible information of plant proteins or virus proteins in the this bias will not exist during LOOCV (see the accompanying database. In other words, W was constructed based on all of the supplementarymaterialsfortheproof).Furthermore,theresultsof GOtermscorrespondingtothewholeSwiss-Protdatabase,which theindependenttests(SeeTable3)forwhichnosuchbiasoccurs enables W to be representative for all of the distinct GO terms. also strongly suggest that HybridGO-Loc outperforms other Second,thesetwobenchmarkdatasetswerecollectedaccordingto predictors by alarge margin. strict criteria. Details of the procedures can be found in the supplementary materials. 
and the sequence similarity of both Conclusions datasetswascutoffat25%,whichenablesustouseasmallsetof representative proteins to represent all of the proteins of the Thispaperproposesanewmulti-labelpredictorbyhybridizing correspondingspecies(i.e.,virusorplant)inthewholedatabase.In GOfrequencyfeaturesandsemanticsimilarityfeaturestopredict other words, W will vary from species to species, yet still be the subcellular locations of multi-label proteins. Three different statistically representative for all of the useful GO terms for the semantic similarity measures have been investigated to be corresponding species.Third,usingWforstatisticalperformance combined with GO frequency features to formulate GO hybrid evaluationisequivalentoratleastapproximatetousingallofthe feature vectors. The feature vectors are subsequently recognized distinctGOtermsintheGOAdatabase.ThisisbecauseotherGO by multi-label multi-class support vectors machine (SVM) classi- terms that do not correspond to the training proteins will not fiersequippedwithanadaptivedecisionstrategythatcanproduce participateintrainingthelinearSVMs,norwilltheyplayessential multiple class labels for a query protein. Compared to existing roles in contributing to the final predictions. In other words, the multi-label subcellular-localization predictors, our proposed pre- generalizationcapabilitiesofHybridGO-Locwillnotbeweakened dictor has the following advantages: (1) it formulates the feature even if some new GO terms are found in the test proteins. A vectors by hybridizing GO frequency of occurrences and GO mathematical proof of this statement can be found in the semantic similarity features which contains richer information supplementary materials available in theHybridGO-Locserver. 
than only GO term frequencies; (2) it adopts a new strategy to Onemayarguethattheperformancebiasmightarisewhenthe incorporatericherandmoreusefulhomologousinformationfrom whole W was used to construct the hybrid GO vectors for both more distant homologs rather than using the top homologs only; training and testing during cross validation. This is because, in (3) it adopts an adaptive decision strategy for multi-label SVM eachfoldoftheLOOCV,thetrainingproteinsandthesingled-out classifiers so that it can effectively deal with datasets containing test protein will use the same W to construct the GO vectors, both single-label and multi-label proteins. Experimental results meaning that the SVM training algorithm can see some demonstrate the superiority of the proposed hybrid features over information of the test protein indirectly through the GO vector each individual features. It was also found that the proposed space defined by W. It is possible that for a particular fold of predictorperformsremarkablybetterthanexistingstate-of-the-art LOOCV,theGOtermsofatestproteindonotexistinanyofthe predictors. For readers’ convenience, HybridGO-Loc is available training proteins. However, we have mathematically proved that online athttp://bioinfo.eie.polyu.edu.hk/HybridGoServer/. PLOSONE | www.plosone.org 10 March2014 | Volume 9 | Issue 3 | e89545
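To make the role of the adaptive decision strategy and the difference between OLA and OAA more concrete, the following Python sketch may help. It is only an illustration under stated assumptions: the thresholding rule (keep every class whose SVM score reaches min(1, θ·s_max), falling back to the single best class when no score is positive) is an assumed form, not a rule quoted from this section, and the function names, location names and toy scores are hypothetical. OLA is computed as the fraction of actual (protein, location) pairs that are recovered, and OAA as the fraction of proteins whose predicted label set matches the true set exactly, which agrees with how the totals in Tables 1 to 3 are formed.

```python
from typing import Dict, List, Set


def adaptive_decision(scores: Dict[str, float], theta: float = 0.3) -> Set[str]:
    """Turn per-class SVM scores into a label set.

    Assumed rule (illustrative, not taken from the paper's equations):
    if the best score is positive, keep every class whose score is at
    least min(1.0, theta * best); otherwise return only the best class.
    """
    best_label = max(scores, key=scores.get)
    best = scores[best_label]
    if best <= 0:
        return {best_label}
    cutoff = min(1.0, theta * best)
    return {label for label, s in scores.items() if s >= cutoff}


def overall_locative_accuracy(pred: List[Set[str]], true: List[Set[str]]) -> float:
    """Correctly predicted (protein, location) pairs over all actual pairs."""
    hits = sum(len(p & t) for p, t in zip(pred, true))
    total = sum(len(t) for t in true)
    return hits / total


def overall_actual_accuracy(pred: List[Set[str]], true: List[Set[str]]) -> float:
    """Fraction of proteins whose predicted label set is exactly right."""
    exact = sum(1 for p, t in zip(pred, true) if p == t)
    return exact / len(true)


# Hypothetical SVM scores for three query proteins.
toy_scores = [
    {"nucleus": 1.2, "cytoplasm": 0.5, "vacuole": -0.8},
    {"nucleus": -0.4, "cytoplasm": -0.1, "vacuole": -0.9},
    {"nucleus": 0.2, "cytoplasm": 2.0, "vacuole": 0.7},
]
truth = [{"nucleus", "cytoplasm"}, {"cytoplasm"}, {"cytoplasm"}]

predicted = [adaptive_decision(s, theta=0.3) for s in toy_scores]
ola = overall_locative_accuracy(predicted, truth)
oaa = overall_actual_accuracy(predicted, truth)
print(predicted, ola, oaa)
```

In this toy case every true location is recovered, so OLA is 1.0, but the third protein receives an extra location and OAA drops to 2/3. This is the sense in which OAA is the stricter metric: it penalizes over-prediction, whereas OLA does not.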
