Large Collection of Diverse Gene Set Search Queries Recapitulate Known Protein-Protein Interactions and Gene-Gene Functional Associations Neil R. Clark1,2,3, Avi Ma’ayan1,2,3,∗ 1Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, One Gustave L Levy Place, Box 1215, New York, NY, USA 2Big Data to Knowledge (BD2K) Library of Integrated Network-based Cellular Signatures (LINCS) 6 Data Coordination and Integration Center (DCIC) 1 3Mount Sinai Knowledge Management Center (KMC) for Illuminating the Druggable Genome (IDG) 0 2 ∗Corresponding author: [email protected] n a J 7 Abstract—Popular online enrichment analysis tools from the associated biological theme, e.g., members of cell signaling fieldofmolecularsystemsbiologyprovideuserswiththeabilityto pathways, or targets of transcription factors. Enrichr [2] is a ] N submittheirexperimentalresultsasgenesetsforindividualanal- popular online tool we developed to enable users to submit ysis. Such queries are kept private, and have never before been M gene sets for this type of analysis. Since its launch in 2013, considered as a resource for integrative analysis. By harnessing gene set query submissions from thousands of users, we aim to over900,000genesetshavebeensubmittedtoEnrichrbyover . o discover biological knowledge beyond the scope of an individual 25,000 unique users. These gene sets originate from diverse i study. In this work, we investigated a large collection of gene experimentalplatforms,profilinggenesandproteinsatvarious b sets submitted to the tool Enrichr by thousands of users. Based - regulatory levels of mammalian molecular data. q on co-occurrence, we constructed a global gene-gene association We hypothesize that the gene set submitted to Enrichr may [ network. We interpret this inferred network as providing a summary of the structure present in this crowdsourced gene set contain patterns and structure that could reveal novel associa- 1 library,andshowthatthisnetworkrecapitulatesknownprotein- tions between genes, gene modules and molecular biological v protein interactions and functional associations between genes. mechanisms.Thelargesizeofthisdatasetandthediversityof 3 Thisfindingimpliesthatthisnetworkalsoofferspredictivevalue. the scientific community from which it originates potentially 5 Furthermore, we visualize this gene-gene association network provide a unique perspective that may reveal new insights. 6 using a new edge-pruning algorithm that retains both the local 1 andglobalstructuresoflarge-scalenetworks.Ourabilitytomake In combinatorial mathematical terms, the nearly 1 million 0 predictions for currently unknown gene associations, that may gene sets that were submitted to Enrichr comprise a family of 1. not be captured by individual researchers and data sources, is a sets,i.e.,acollectionofsubsetsfromthesetofallgenesinthe demonstrationofthepotentialofharnessingcollectiveknowledge 0 human, mouse and rat genomes. A common approach to the from users of popular tools in the field of molecular systems 6 analysisofthisformofdataistotreateachsetasanitem-setin biology. 1 Keywords - gene sets, big data, item set mining analogy to a market basket, and identify frequently occurring : v combinationsofitems[5].Analternativehasbeenproposedin i I. INTRODUCTION whichitem-setsthatarelogicallyrelatedareidentified[6].The X Systems approaches to genome-wide molecular data are identificationoflogicalitem-setsisparticularlyrelevantforthe r a increasinglyusinggenesets,asopposedtoindividualgenes,as gene sets submitted to Enrichr because this approach takes thebasicunitsofanalysis.Forexample,inthefieldofmolec- into account rare items, and because biological knowledge is ular diagnostics, biomarker sets, as opposed to single gene incomplete, logically associated groups of items may only be biomarkers,areincreasinglybeingapplied[1].Beyondmolec- partially represented in any given basket. These are important ular diagnostics, genomics, transcriptomics and proteomics properties, as we aim to identify functional modules even experiments often identify gene sets that are differentially ex- for rarely occurring genes. We expect that due to partial pressed,orgenesetsthatcontaingeneticvariationsassociated information contained within an individual query, not all withaphenotype.Usingmethodssuchasenrichmentanalyses members of a relevant functional module will be present in [2]–[4], common biological functions can illuminate prior any given gene set submitted for analysis with Enrichr. This knowledge originating from the collection of experimentally partial information is expected to produce high level of false identified gene sets. positive associations. Due to the expected high false positive One approach to enrichment analysis is to take a set of rates,adegreeofnoise-filteringmayalsoimprovetheaccuracy experimentally identified genes and analyze these genes by of such analysis. comparing them to a library of gene sets with a known Here we provide an interpretable picture of the global propertiesofalargecollectionofthelistssubmittedtoEnrichr 104 by thousands of users. To accomplish this we employ the Distributionofgenefrequencies method of logical item-set to extract gene modules. As a con- sequence, we infer a network of relationships between genes 1000 andvisualizethisnetworkusinganoveledge-pruningmethod we devised. Next, we examine the degree of compatibility 100 of our findings with annotated gene set libraries such as the nt. u Human Gene Atlas or the Gene Ontology (GO), as well as o C with known protein-protein interactions. 10 II. RESULTS 1 By the middle of 2015, the Enrichr gene set queries con- sisted of 172,798 gene sets submitted from 5,114 unique in- 0 200 400 600 800 1000 1200 1400 Enrichrgenefreq ternetprotocol(IP)addresses.Ouranalysisbeganbyapplying a number of filters. First, we removed any query that did not Fig. 1. The distribution of the occurrences of unique gene symbols in correspond to an official gene symbol. Next, we ensured that Enrichrgenesetqueries. no individual users dominated in their contributions to the data set by removing all entries from users that contributed many gene sets (see methods for more details). After this Distributionofquery submissions lengths strict cleaning criteria, the result was a family of 19,196 gene -3 10 sets, composed of 27,770 genes supplied from 3,308 unique IP addresses of unique users of Enrichr. -4 A. Enrichr gene set library metrics nt.10 u o Thedistributionofgenefrequenciesissurprisinglycomplex C (Fig. 1). The distribution approximately decays exponentially -5 10 from frequencies greater than 300; however, the lower fre- quencies are characterized by a bimodal distribution. The largemodeatthelowestfrequencieshasanoverrepresentation -6 10 of uncharacterized genes, gene names with no functional 50 100 500 1000 associations or known protein-protein interactions; but even Enrichrlinelength discounting those genes, this peak remains relatively large. Fig. 2. The distribution of the size of each gene set query submitted to The lower frequencies are also characterized by a mode at Enrichr. around200occurrences.Thefrequencyofoccurrenceofgenes in submissions to Enrichr has a characteristic frequency that decays exponentially, as opposed to being characterized by a In order to visualize the network of gene associations, we powerlaw.Giventhepredominanceofscale-freedistributions, employ an edge-pruning algorithm that results in an inter- it is perhaps surprising that there is a definite scale for pretable representation of the global network structure while frequencies of gene occurrence in this data set. preserving local features of the network topology (Fig. 3a). A The distribution of the lengths of gene sets submitted to selection of the largest local structures, preserved by the edge Enrichr approximately fits a power-law, with a notable peak pruning algorithm, is shown in Fig. 3b. One potential use of at about 200 and 400 genes (Fig. 2). The distribution of the thisnetworkistomakepredictionsofnovelassociationsbased number of submissions to Enrichr per user IP address also onpriorknowledge.Asanexampleofthis,wehighlightgenes follows a power-law, suggesting that most users submit only known to be associated with Adult Onset Diabetes (KEGG) a few lists, but there is a substantial body of heavy users. (Fig. 3c). Illustrating local network structures that include at least one of these genes suggests local structures as candidate B. Gene-gene and gene-set/gene-set network inference novel predictions for genes likely associated with the adult- TheEnrichrcollectionofgenesetqueriescanbetransposed, onset diabetes pathway. such that for each gene there is an associated set of Enrichr Oneofthelocalstructurescontainstheadult-onsetdiabetes submission queries. To explore the structure in this data, Genes MAFA and NEUROG3 (KEGG) along with two other and also to potentially form the basis for an informative genes,UTS2RandFLJ45717,thatarenotidentifiedinKEGG decomposition of the data, we can infer networks of similar- as being associated with diabetes. However, UTS2 and its ities between gene pairs and, in addition, networks of Enricr receptor UTS2R have been reported [7] to be involved in queries. We employ the methods developed in [6] to infer glucose metabolism and insulin resistance, which lead to the noise-filtered networks of associations between these entities. developmentoftype-2diabetesinhumans.Thereisnoknown b a c MNX1 NR5A2 MAFA HNF4A FOXA3 INS BHLHB8 ONECUT1 FOXA2 HNF4G NEUROG3 Fig.3. Theedge-prunednetworkofgeneassociationsinferredfromEnrichrqueries.(a)Theglobalviewofthenetwork.(b)Aselectionofthelargestlocal networkstructures.(c)Asanexampleofonepossibleuseofthisnetwork,wehighlightgenesknowntobeassociatedwithadult-onsetdiabetes(KEGG)and illustratethelocalnetworkstructurethatincludesatleastoneofthesegenes. association of FLJ45717 with diabetes; however, as this gene D. Examiningthegenenetworkforrecoveryofknownprotein- forms the center of a star graph local structure, we consider protein interactions it a novel candidate for involvement in adult-onset diabetes. Next, we asked whether the gene-gene association network We can also apply the same approach to the network of created from the Enrichr queries can be used to predict associations between gene set queries submitted to Enrichr physicalproteininteractions.WeusedthePSICQUICdatabase (Fig. 4). This network appears to show a degree of clustering of protein-protein interactions (PPIs) and examined the statis- by user; this perhaps reflects individual users special research tical significance of the overlap between the PSICQUIC PPI interests, or a common method of data acquisition. networkandthegene-genenetworkinferredfromEnrichrsub- missions.Afterapplyingathresholdof0.3forthesimilarityof anormalizedpoint-wisemutualinformationforthegene-gene association network, two binary matrices remained. C. Comparing the clustering of Enrichr networks to random In the first test of the significance of the similarity between scale-free networks these two binary matrices, we count the number of non- zero entries in each matrix and the number of elements that The clustering coefficients of the gene and gene-set net- are nonzero in both matrices. Based on a null distribution works are 0.43 and 0.17, respectively. In order to gauge the whereby the matrices are randomly permuted, we can use the significance of this, we compared these clustering coefficients hypergeometric distribution to quantify the significance of the to the typical clustering coefficients for a random scale-free overlap between the two matrices. network generated by the Barabasi-Albert random scale-free There are n = 13,586 genes that are in both the Enrichr 1 graphmodel.Artificialshufflednetworksofasimilarsizeand queries network and the PSICQUIC PPI network. With a degree distribution generated by the Barabasi-Albert model threshold on the normalized point-wise mutual information of have a typical clustering coefficient of 0.0026 and 0.0054 0.05,therearen =99,758edgesintheEnrichrnetworkand 1 respectively. Hence, we see that the Enrichr queries gene, and n =156,388 known PPIs. There are n =2,763 edges that 2 k gene-set,networkshaveaclusteringcoefficientwhichisorders are present in both the Enrichr network and the PSICQUIC of magnitude greater than would be expected if the networks PPI network. In this regime, the Poisson approximation to the were randomly scale-free by the Barabasi-Albert model. hypergeometricdistributionisapplicable.Withatotalnumber (a) (b) Fig.4. Theedge-prunednetworkofassociationsbetweenquerysubmissionstoEnrichr.(a)Theglobalviewofthenetwork.(b)Aselectionofthelargest localstructures.(c)ThegenesetssubmittedbythefiveEnrichruserswiththelargestnumberofsubmissionsarehighlighted,eachwithadifferentcolor. of possible edges given by n (1−n ), the rate parameter for g g the Poisson distribution is given by: n n 1 2 =169.1 (1) n (1−n ) g g Under the null hypothesis that edges are randomly assigned, weexpectameannumberofsharededgestobe169.1withthe standard deviation of approximately 13.0. Consequently, the observedvalueofn =2,763isabout200standarddeviations k greater than the expected value. Therefore, under the null hypothesis, the number of edges present in both the Enrichr gene network and the PSICQUIC PPI network is extremely statistically significant. It is conceivable that this result is due to users submitting gene set queries containing proteins known to have many interactors, and hence the null distribution we used for the calculation would be inappropriate and the result invalid. One way to account for this is to condition the test on the frequency of each gene. To do this, we perform a separate hypergeometric test for each gene, and correct for multiple hypothesis testing. There are 899 proteins that have at least one interaction that overlaps with the Enrichr network; when corrected for multiple hypotheses testing, with a False Dis- Fig.5. Thepartofthenetworkofthegene-geneassociationsinferredfrom covery Rate of 2 Enrichr that is also supported by known protein-protein interactions from PSICQUIC. Another possible explanation for the apparently significant overlap between edges in the Enrichr gene network and the PSICQUIC PPI network is that there are a significant number of user queries containing genes with known PPIs. To address this, we recovered the Enrichr lists that contribute to the prediction of at least one PPI and counted the number of the significant overlap in the edges in the gene network proteins in each query list for which there is another protein inferred from Enrichr submissions is predictive of known and in the query with which it interacts. If a user submitted a potentially novel protein-protein interactions. list of proteins with known protein interactions, then the ratio of the counted proteins to the length of the query will be large. We note that the actual ratio is less than 1%, which We show the network of edges between genes that are suggests that users are not submitting gene sets with known present in both the Enrichr-inferred gene network and the PPIs to Enrichr to any significant degree. This suggests that PSICQUIC PPI network (Fig. 5). TABLEI OMIMExpanded VTiFru PsMWIMNTs GENESETLIBRARIESUSEDTOASSESSRECOVERYOFFUNCTIONAL micTroFRPNPAIss ASSOCIATIONSBYTHENETWORKINFERREDFROMTHEENRICHR BioCartapathwKaEyAs QUERIES. MGIMPtop3 MGIMPtop4 NURSAIP-MS CORUM GeneSetLibrary #ofSets MeanSetSize GeneSigDB GeneOntologyCC BioCartapathways 249 18 GeneOntologyBP PPIHubProteins CancerCellLineEncyclopedia 967 176 NCI60 HMDBMetabolites ChEA 240 1456 Structural Domains GeneOntologyMF Chromosomelocation 386 85 Chromosomelocation KEWGikGiPpaatthhwwaayyss CORUM 1673 5 OMIMdiseaseCgeCnLeEs GeneOntologyBiologicalProcess 941 78 ReactomepathCwhaEyAs GeneOntologyCellularComponent 205 172 HMuomuasenGGeenneeAAttllaass GeneOntologyMolecularFunction 402 122 0.0 0.1 0.2 0.3 0.4 0.5 GeneSigDB 2139 127 Fractionofgenesetssignificantlyclustered(FDR1%) GenomeBrowserPWMs 615 275 HMDBMetabolites 3906 47 Fig. 6. The proportion of gene sets in each library that are significantly HumanGeneAtlas 84 450 clustered in the gene-gene association network inferred from queries to KEA 474 37 Enrichr. KEGGpathways 200 48 MGIMPtop3 71 717 MGIMPtop4 476 202 E. Evaluating the recapitulation of prior knowledge about microRNAs 222 155 MouseGeneAtlas 96 660 genefunctionswiththegene-geneassociationnetworkinferred NCI60 93 343 from Enrichr queries NURSA-IP-MS 1796 158 OMIMdiseasegenes 90 25 We took a collection of 28 annotated gene set libraries that OMIMExpanded 187 89 represent prior knowledge of associations between genes and Pfam-InterPro-domains 311 35 PPIHubProteins 385 247 their functions from a range of biological themes, including Reactomepathways 78 73 GeneOntology,mousephenotypes,proteincomplexes,histone TFPPIs 290 79 modificationsandDNAbinding.Weaddressedthequestionof VirusMINT 85 15 WikiPathwayspathways 199 39 the extent to which the information in these gene set libraries isrecapitulatedbythegene-geneassociationnetworksinferred by Enrichr queries. sets in the library are recoverable by the Enrichr queries’ In order to test this for a single gene set, we mapped the gene-gene association network. The libraries containing data genes in the set to nodes in the Enrichr queries gene-gene from genome-wide gene expression profiling have the most association network, and then calculated the mean point-wise overlap,followedbyChEA,whichhaslistsofgenesassociated mutual information between each pair of genes. In order to with transcription factors from ChIP-seq profiling. Next are assess the significance of this collective measure of similarity, pathwaylibrariessuchasreactome,KEGGandWikiPathways. we numerically calculated a null distribution by randomly This rank order of libraries likely reflects the nature of choosing a set of nodes in the Enrichr queries’ network with most submitted lists to analysis with Enrichr. These query the same cardinality as the gene set in question. The random submissions are likely dominated by differentially expressed choice was weighted by the overall frequency of the gene in genes from genome-wide mRNA profiling. The recovery of thegenesetlibraryfromwhichitoriginates.Thisisequivalent annotated pathways suggests that the inferred gene-gene as- to generating a null distribution based on random permutation sociation network likely contains new and more complete ofthegenesetlibraryrepeatedmanytimes.Bycomparingthe pathway knowledge. actualcollectivesimilaritytothenulldistribution,wecalculate a significance p value for the collective similarity of the gene III. DISCUSSIONANDCONCLUSIONS set in the Enrichr queries’ gene-gene association network. A In this study, we analyzed for the first time, a large collec- small p value indicates that the members of the gene set in tion of gene set queries submitted by the many users of the question are significantly more similar to each other in the popular enrichment analysis tool Enrichr. We examined the network inferred by Enrichr queries than would be expected distribution of occurrence of genes, the length of submitted if the gene set library from which the set was originated was gene sets, and the distribution of submissions by individual randomlypermuted.Weinterpretasmallpvalueasindicating users. We show that the distribution of gene occurrence in that the network inffered by theEnrichr queries has recovered queries is complex, with some genes appearing often in the gene set in question. The list of all gene set libraries submitted lists. The length of gene set queries peaks at 200- examined (Table I). 400genesperquery,andthedistributionofsubmittersfollows After correcting for multiple hypotheses testing and setting apowerlaw.Byconstructingagene-geneassociationnetwork afalsediscoveryratethresholdof1%,wecountedthenumber of co-occurrence, we were able to show that such a network of gene sets in each library that are significantly recovered in captures known protein-protein interactions and functional the Enrichr queries’ gene-gene network (Fig. 6). associationsmuchmorefrequentlythanbychance.Thismeans We observed that for some libraries, over 30% of the gene that the collective data from thousands of queries potentially holds new knowledge about the local and global structure B. Network Inference of the human functional interactome. As more submissions We employ the methods described in [6] to reconstruct accumulate,itisexpectedthatsuchpredictionswillberefined networks from the filtered gene sets. and improved. While experimental validation of these predic- Themostbasicapproachtoconstructanetworkfromalarge tions is outside the scope of this present work, the gene-gene collection of gene sets is to define a distance matrix between associationnetworkresourcewedevelopedherecanbeusedto the gene sets and examine the resulting distance matrix. The prioritize predictions for general and specific hypotheses that mostobvioussimilaritymeasurebetweentwogenesetsAand can be tested experimentally and systematically drive rational B is the Jaccard index: research explorations. The reuse of privately submitted gene |A∩B| sets by the Enrichr users should be handled with caution. The J(A,B)= (2) |A∪B| analysisthatweconductedhereandthegene-geneassociation networks we constructed keep the identities of users private. Let the set of all elements be V, and the family of sets be M, Furthermore, we do not provide the gene sets used for our then: analysis publicly. M =(cid:26)m(n) =(cid:110)m(n)(cid:111)Ln ⊂V(cid:27) (3) l l=1 IV. METHODS and the co-occurrence counts of pairs of genes is given by: A. Data Preprocessing (cid:88)N φ(α,β)= δ(α∈m(n))δ(β ∈m(n)) (4) As of June 25th 2015, 172,798 queries of gene sets were n=1 submitted to Enrichr from 5,114 unique IP addresses. In where δ(bool) is a Dirac delta function defined such that: order to analyze the global structure of these gene sets, (cid:40) preprocessing was necessary because the raw data include 1 ifbool=True, δ(bool)= multiple instances of the example data set supplied by the 0 Otherwise Enrichrwebsitefordemonstrationpurposes,specialcharacters and strings that cannot be interpreted as gene lists caused by The marginal counts are then defined in terms of the co- erroneous input from users, and large collections of queries occurrence counts as: from individual users who utilized the Enrichr API. These (cid:88) φ(α)= φ(α,β) (5) entries were removed so that the resulting dataset is repre- sentative of the entire community of Enrichr users and is not β∈G,α(cid:54)=β dominated by any individual user or small subset thereof. whichisthenumberofallco-occurrences.Finally,thetotal The first filter was applied to retain only word strings that number of co-occurrences is given by: are members of a reference set of 39920 standard human, 1 (cid:88) 1 (cid:88) (cid:88) mouse and rat gene names. The resulting sets of genes φ0 = 2 φ(α)= 2 φ(α,β) (6) corresponding to each user input list was then subjected to α∈V α∈V β∈V a sequence of subsequent filters. Then the co-occurrence probability is given by: Use of the Enrichr API enables programmatic access to φ(α,β) Enrichr.Thisresultedinaminorityofuserssubmittingalarge P(α,β)= (7) φ number of gene lists. In order for the data set to be balanced 0 and representative of the entire community of Enrichr users, and the marginal probability is given by: and not dominated by a small minority of users, we filtered P(α)=φ(α)/phi (8) out gene lists associated with IP addresses from which large 0 numbers of lists were received. The Jaccard index can now be defined in these terms as: The results of 11,579 differential expression analyses were P(α,β) received from the GEO2Enrichr Chrome Extension, which ψ = (9) jac P(α)+P(β)−P(α,β)) submits three gene sets for every differential expression com- putationforanalysiswithEnrichr:upregulated,downregulated When the collection of gene set queries is transposed such andcombinedgenesets,respectively.Onlythecombinedgene that the genes are the labels and the queries are the members sets were retained in this analysis. of sets, the Jaccard index can be applied. Gene sets that contained more than 2,000 genes were We can also use the following distance measures: rejected on the basis that they were non-specific and incurred • Cosine Distance: (cid:112) undue computational expense in subsequent analysis. ψ =P(α,β)/ P(α)P(β) cos Finally,allinstancesofthedemonstrationgenesetfromthe • Point-Wise Mutual Information: Enrichr website were removed. The result was 19,196 gene ψ =max{0,Log(P(α,β)/P(α)P(β))} pmi sets composed of 27,770 genes supplied from 3,308 unique • Normalized Point-Wise Mutual Information: IP addresses. ψ =−ψ /(Log(P(α,β))) nmi pmi We apply a noise filter by iteratively applying a threshold for ACKNOWLEDGMENT similarity: the pair counts for gene pairs that do not pass We would like to thank Dr. Kathleen M. Jagodnik for pro- the threshold are set to zero, and the similarity matrix is viding useful comments and copyediting. Funding: This work recalculated. wassupportedinpartbygrantsfromtheNIH:R01GM098316, Throughout the rest of this analysis, we only use the U54HG008230 and U54CA189201 to AM. normalized point-wise mutual information as a measure of similarity between genes for the Enrichr queries data. REFERENCES [1] X.Wang,E.Dalkic,M.Wu,andC.Chan,“Genemodulelevelanalysis: C. Network visualization identificationtonetworksanddynamics,”Currentopinioninbiotechnol- ogy,vol.19,no.5,pp.482–491,2008. Becausetheresultantnetworksaredenselyconnected,direct [2] E.Y.Chen,C.M.Tan,Y.Kou,Q.Duan,Z.Wang,G.V.Meirelles,N.R. visualization of the graph is not interpretable. Edge pruning Clark,andA.Ma?ayan,“Enrichr:interactiveandcollaborativehtml5gene has been proposed for the simplification of graphs to aid in listenrichmentanalysistool,”BMCbioinformatics,vol.14,no.1,p.128, 2013. generating interpretable visualizations of their structure [8]. [3] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics We find that such edge pruning algorithms, applied to very enrichmenttools:pathstowardthecomprehensivefunctionalanalysisof large graphs, are able to represent the global structure at the largegenelists,”Nucleicacidsresearch,vol.37,no.1,pp.1–13,2009. [4] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, expense of the local structure. To potentially improve this, we M.A.Gillette,A.Paulovich,S.L.Pomeroy,T.R.Golub,E.S.Lander employed an edge pruning approach that is able to capture et al., “Gene set enrichment analysis: a knowledge-based approach both global and local structure in the graph. for interpreting genome-wide expression profiles,” Proceedings of the NationalAcademyofSciencesoftheUnitedStatesofAmerica,vol.102, In the initial step, a set of local structures is derived by no.43,pp.15545–15550,2005. varying the threshold for similarity and identifying connected [5] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, “Dynamic itemset graph components at the point at which they disconnect from countingandimplicationrulesformarketbasketdata,”inACMSIGMOD Record,vol.26,no.2. ACM,1997,pp.255–264. the giant component. Such point is defined as any connected [6] S.Kumar,V.Chandrashekar,andC.Jawahar,“Logicalitemsetmining,” component that contains more than 10% of all nodes or has in Data Mining Workshops (ICDMW), 2012 IEEE 12th International an absolute number of nodes greater than 100 nodes; this Conferenceon. IEEE,2012,pp.603–610. [7] Z. Jiang, J. J. Michal, D. J. Tobey, Z. Wang, M. D. MacNeil, and number is chosen based on the visualizability properties of N. S. Magnuson, “Comparative understanding of uts2 and uts2r genes the resulting local structures. for their involvement in type 2 diabetes mellitus,” International journal The local structures represent local regions of the network ofbiologicalsciences,vol.4,no.2,p.96,2008. [8] F. Zhou, S. Malher, and H. Toivonen, “Network simplification with that are distinguishable from the rest of the network at an ap- minimallossofconnectivity,”inDataMining(ICDM),2010IEEE10th propriatescaleofsimilarity.Assuch,theydepictthesimilarity InternationalConferenceon. IEEE,2010,pp.659–668. relationshipsbetweenthenodesandtheirneighborhoodinthe network. Once the local structures have been identified, we employ a naive edge-pruning algorithm. We do not prune edges that are members of the local structures, thereby preserving them in the final simplified network. The number of edges that are available to be pruned is equal to the difference between the total number of edges in the original network, n , and the e numberofedgesthatarepartofthelocalstructures,n .The e;ls pruning process preserves the connectivity of the network, so after the maximum degree of pruning, the minimum number of edges remaining must be equal to: n +n (10) ls e;ls Accordingly, the maximum number of edges that can be pruned while preserving the connectivity of the network is: n −n −n (11) e e;ls ls Theiterativeprocessbywhichedgesareprunedisasfollows: 1) Set a counter i=0 2) Identify the edge with the lowest similarity measure 3) If removing the edge does not cut the network, then remove it. 4) i→i+1 5) Ifi<γ(n −n −n )thenreturntostep1;otherwise, e e;ls ls stop.