Mining Social Ties Beyond Homophily Hongwei Liang Ke Wang Feida Zhu School of Computing Science School of Computing Science School of Information Systems Simon Fraser University, Canada Simon Fraser University, Canada Singapore Management University, Singapore [email protected] [email protected] [email protected] Abstract—Summarizing patterns of connections or social ties in addition to substitutes (similar products). Similarly, the in a social network, in terms of attributes information on nodes social ties following the homophily principle are usually well- and edges, holds a key to the understanding of how the actors expected and people can easily dope out them without much interact and form relationships. We formalize this problem as effort, even though they are somewhat useful. Therefore, mining top-k group relationships (GRs), which captures strong discovering the social ties that are popular and interesting, but social ties between groups of actors. While existing works focus are not simply expected from homophily is more practical. In onpatternsthatfollowfromthewellknownhomophilyprinciple, this paper we study this problem. weareinterestedinsocialtiesthatdonotfollowfromhomophily, thus, provide new insights. Finding top-k GRs faces new chal- lenges: it requires a novel ranking metric because traditional Asian Latino White metrics favor patterns that are expected from the homophily principle; it requires an innovative search strategy since there is Grad College High School no obvious anti-monotonicity for such GRs; it requires a novel datastructuretoavoiddataexplosioncausedbymultidimensional F 1 2 3 4 5 6 7 nodes and edges and many-to-many relationships in a social network. We address these issues through presenting an efficient algorithm, GRMiner, for mining top-k GRs and we evaluate its effectiveness and efficiency using real data. I. INTRODUCTION Information networks, such as social networks, citation M 8 9 10 11 12 13 14 networks, dating networks, etc., are heterogeneous and mul- tidimensional [1] in that nodes and edges belong to certain classes and each class has description on multiple attributes. (a) Network topology (patterns represent Race and shapes represent For example, in addition to links, Facebook contains large Education) quantities of user demographic data that reveal detailed per- sonal information. The frequent patterns of connections in social networks, concisely in terms of attribute information of ID SEX RACE EDU 1 F Asian Grad nodesandedges,indicatespecificcommonsocialinteractions. 2 F Latino Grad We call this kind of patterns “social ties”. Summarizing such 3 F White Grad social ties holds a key to the understanding of how the actors 4 F Asian College interact with each other and form relationships, which is 5 F White College useful in user behavior analysis and modelling, friends/items 6 F Asian High School recommendation, inferring user demographics, etc. 7 F Latino High School 8 M Asian Grad In addition, it is well known that social ties follow the 9 M Latino Grad homophily principle, or “love of the same” [2]: a contact 10 M White Grad between similar people occurs at a higher rate than among 11 M Latino College dissimilar people, where similarity is measured by common 12 M White College characteristics such as beliefs/religion, value, race, age, etc. 13 M Asian High School So far, the literature largely focuses on applications based on 14 M White High School the homophily principle, such as community detection, link prediction, friend/product recommendation and information (b) Attributes on nodes diffusion. However, there is more and more voice from the academic circle and industry that they want to break the Fig. 1: A toy dating network boundaries to unearth the treasures beyond homophily. For example,theauthorsin[3]claimthat“recommendingpopular items is unlikely to result in more gain than discovering A. Motivating Examples insignificant yet liked items because the popular ones might be already known to the user”, and [4] infers networks of To motivate our work, we consider the toy online dating product relationships to recommend complementary products network in Fig. 1a with the node information in Fig. 1b. A pair of individuals in a dyadic tie is a dating relationship and GR4 is ranked low by the confidence metric because most each individual has attributes SEX, RACE, and EDU. We can females with Grad education dated male partners with Grad represent a group of links between two groups of individuals education according to the homophily principle. However, bya grouprelationship or GR,denoted l−→w r. l and r arethe if we exclude this “homophily effect” by restricting to the attributes information describing the two groups of nodes and male partners not having Grad education, GR4 holds 100% w is the attributes information describing the edges between of the time, which indicates a strong preference beyond the the groups. GRs serve as the representation of social ties. homophilyprinciple.Thisobservationmotivatesanewranking metric, called non-homophily preference. Intuitively, non- Example 1: According to the recent study on Facebook homophily preference captures “secondary bonds” beyond dating app, Are You Interested1, men tended to prefer Asian the “primary bonds” of homophily. One contributing factor women for dating. This finding can be represented by GR1 in of secondary bonds is heterophily [6], i.e., the tendency of the following table. individuals to collect in diverse groups. It was shown in that heterophilious networks are better to promote and spread GR1 (SEX:M) → (SEX:F, RACE:Asian) innovations[6].Thereby,thoughtheprimarybondisimportant supp = 7/15; conf = 7/14 in multiple applications, exploring the secondary bond can GR2 (SEX:M, RACE:Asian) → (SEX:F, RACE:Asian) result in more interesting findings and bring extra value to supp = 0; conf = 0 manybusinesses.Thenextexamplefurtherexplainsthispoint. The edge descriptor w = dates is omitted. supp = 7/15 Example 3: To leverage social influence for promoting and conf = 7/14 are two intuitive metrics, support and products, an obvious strategy for a financial institution is to confidence, originally used for association rule mining [5]. use GRs following from homophily, such as supp = 7/15 means that 7 out of the 15 links are involved (JOB :Lawyer,PRODUCT :Stocks)→(PRODUCT :Stocks) in this relationship, and conf = 7/14 means that 7 out of the 14 links originating from the nodes for male go to the nodes to promote Stocks to the friends, f, of existing customers who forAsianwomen.HavingfoundGR1,peoplewonderwhether arelawyersandhaveboughtStocks(onLHS).Thiseffortfails Asian men particularly prefer Asian women, so propose the if most such friends f already bought or do not like Stocks. variation GR2, whose supp and conf can be obtained by On the other hand, suppose queries on the data. GR2 and GR1 together suggest that while mostmenpreferredAsianwomen,Asianmenareanexception, (JOB :Lawyer,PRODUCT :Stocks)→(PRODUCT :Bonds) which is a finding in Are You Interested. This finding could has a high non-homophily preference, that is, among the be interesting to a dating service provider. (cid:3) friends f who do not buy Stocks, many buy Bonds. This GR can be used to promote Bonds to a friend if he/she has not GRsthatareexpectedfromthehomophilyprincipleusually boughtBonds,andthehighnon-homophilypreferenceimplies tend to have a high confidence, and such GRs generally have w a high adoption rate. (cid:3) theforml−→r wherethevaluesinr occurinl.Inthispaper, we assume that the homophily principle is known, and our Indeed, many companies have both e-commerce services goal is to find interesting GRs not expected from homophily. and social network services, enabling them to create informa- Example 2 illustrates that such GRs can be potentially useful tionnetworkstomineGRsforeconomicbenefits.Forexample, but they are not ranked high by confidence. Alibaba Group2 provides various sales services, and has the instantmessengerAliwangwangthatbuildsthesocialnetwork Example 2: Consider the two GRs, GR3 and GR4, listed amongcustomersandvendors.Asanotherexample,Facebook in the table below. Platform3 allows a third party business to build application based on their platforms. This tool enables integrating the (SEX:F, EDU:Grad) → (SEX:M, EDU:Grad) GR3 socialgraphwiththecustomerinformationownedbythethird supp = 4/15; conf = 4/6 party business, and applications on facebook.com are allowed (SEX:F, EDU:Grad) → (SEX:M, EDU:College) GR4 supp = 2/15; conf = 2/6 to access the graph. B. Contributions Assume that the attribute EDU follows the homophily principle. Therefore, GR3 likely has a high confidence but In summary, we make the following contributions. is not interesting because it is expected from the homophily principle. GR4 likely has a low confidence since GR3 has a • We propose a novel ranking metric called non- high confidence. supp and conf are obtained from the data in homophily preference (Section III-B) to identify Fig. 1. A closer inspection of the data reveals that if a female strong social ties beyond the homophily principle; withGradeducationdoesNOTwantherpartnertohaveGrad we define the problem of mining top-k GRs (Section education, there is a high chance that she prefers a partner III-C) to extract k most interesting social ties under with College education. This preference of College education, the non-homophily preference metric. whichisconditionedontheeducationsotherthanGrad,could • The search space of top-k GRs is large due to be interesting to the dating service provider. Though GR4 multidimensional nodes and edges and the lack of correctly captures this relationship, it will not be ranked high usualanti-monotonicityofnon-homophilypreference. by the confidence metric. (cid:3) 2http://en.wikipedia.org/wiki/Alibaba Group 1http://huffingtonpost.com/jenny-davis/race-online-dating b 4449946.html 3http://en.wikipedia.org/wiki/Facebook Platform We first propose a compact data model to store proposed “self-sufficiency” to measure the interestingness of the multidimensional nodes and edges information itemsets.Multi-relationaldatamining[25]generalizesfrequent in social networks, then we present a novel search patterns by allowing multiple predicates and variables in a strategytoenableanewformofanti-monotonicityfor pattern, but it does not consider the issues associated with non-homophily preference (Section IV). This strategy social networks. As pointed out in Section I, the homophily ensuresthatonlynon-trivialGRsthatmeetaminimum of social networks requires reconsideration of interestingness requirementonsupportandnon-homophilypreference metrics and new strategies of pruning. [26] studied mining are enumerated. unexpectedrulesbasedonpriorknowledgewhereunexpected- nessismeasuredbysimilaritybetweenfuzzyterms.Suchnon- • We present an efficient top-k GRs mining algorithm, statistical rules cannot be used for social network applications GRMiner,basedonthenewdatamodelandtheabove thatmotivateourwork.[26]usessupport-basedpruning,which search strategy (Section V). is too weak to unearth non-homophily GRs buried deeply in • We evaluate our approach on two real world data sets many homophily based GRs. Our interestingness metric is (Section VI), and provide potential extensions of our a statistical measure and captures the notion of conditional framework (Section VII). probability, which is good for inference. Social structure of social networks. Our work has simi- II. RELATEDWORK larity with the study on social structure of Facebook networks in [27], which focuses on calculating the propensity for two Graph mining. Most previous works on graph mining nodeswiththesamecategoricalvaluetoformatie.Whiletheir summarizes a large graph by simple statistics, such as degree workcanbeusedtoquantifyandspecifyhomophilyattributes distributions, hop-plots, clustering coefficients and number in our problem, our focus is on searching for unexpected ties of triangles. See surveys [7], [8]. Other works summarize that do not follow from homophily. a large graph by densely connected subgraphs [9], [10], network motifs [11], and frequent sub-structures [12], [13]. Linkprediction[14]usesneighbourhoodinformationtopredict III. PROBLEMSTATEMENTS the existence of a link between two nodes. Majority of these AsocialnetworkisapairG=(V,E)withV beingasetof works exploit only the topological structure of graphs. The nodes/vertices and E being a set of directed edges/links. |V| work on community detection with node attributes, such as denotesthenumberofnodesinV and|E|denotesthenumber [15], develops a probabilistic model to model the interaction of edges in E. Each node in V has descriptions over a fixed between network structures and node attributes for detecting setofnodeattributesandeachedgeinE hasdescriptionsover overlappingcommunities.Theworkslike[16]and[17]jointly a fixed set of edge attributes. Each attribute A has a discrete model network structure and vertex attributes with probability domain {0,1,··· ,|A|}, where |A| is the domain size, with models. The motivation of these works is quite different from 0 representing the null value. We consider directed edges; an ours. [18] focuses on class propagation in a social network undirectededgecanberepresentedbyapairofdirectededges using a given influence matrix. Our GRs can serve as the in the opposite directions. assumedinfluencematrix.Infact,GRscaptureamoregeneral type of influences between sub-populations summarized by A subset of nodes of V that share same values a on some RHS and LHS, which makes more sense because strong node attributes A can be represented by a set of pairs (A : influences typically exist at sub-population level. a) called a node descriptor. For example, (SEX:F, JOB:IT) represents all the nodes having the values (SEX:F, JOB:IT). Information network summarization. This body of Similarly, a subset of edges in E that share same values w on works considers information networks where nodes and edges some edge attributes W can be represented by a set of pairs have attributes like ours [1], [19], [20], [21]. These works (W : w) called an edge descriptor. Table I summarizes the focused on summarizing the entire graph, whereas our focus main notations used in the paper. is on identifying strong relationships that exist for certain groups of nodes and certain types of edges. The work in [22] focuses on community detection whereas we focus on strong TABLE I: Frequently used notations non-homophily patterns between individuals. They considered Notation Definition multigraphs that allow only values on edges (called dimen- G(V,E) graph with the nodes V and the edges E sions) but not on nodes such as SEX and RACE like ours. l, r, w three parts of attributes values in a GR l−→w r Suchmultigraphscannotmodelourgraphswithvaluesonboth L edges and nodes, such as “a male dates a female Asian”. The R attributes for the values in l, r, and w itemsetsin[22]canconstructrulesaboutanindividuallike“if W X publishes in VLDB, X also publishes in SIGMOD” where Al,Ar Al is an attribute in L, and Ar is Al in R VLDB and SIGMOD are dimensions on edges, but cannot l−→w l[β] homophily effect, see Eqn. (5) construct GRs that aim at a pair (X,Y) of individuals such as “if X is a male, X tends to date a female Asian Y”, which is useful to gauge the influence between two persons. A. Group Relationships Association rule mining. Support and confidence were Definition 1: [GR]Agrouprelationship(GR)hastheform w firstintroducedforassociationrulemining[5]fromtransaction l −→ r, where l and r are node descriptors and w is an edge data. Mining frequent combinations of attribute-values in a descriptor. l is called LHS. r is called RHS. L, W, R denote relational table was studied as iceberg cube queries [23]. [24] the attribute sets for l, w, and r, respectively. (cid:3) w For GR3 in Example 2, l = (SEX:F, EDU:Grad), w = TrivialGRs.WesaythataGRl−→ristrivialifallofthe (TYPE:dates),andr=(SEX:M,EDU:Grad).ThisGRsaysthat values in r are from homophily attributes and r ⊆l. A trivial femaleswithGrad educationtendtoprefermalepartnerswith GR is expected from the homophily principle, so we are only Grad education. The “tendency” can be measured by support interested in non-trivial GRs. and confidence [5]. TocaptureandranktheGRsnotexpectedfromhomophily, w Definition 2: [Support] Support of l −→ r indicates the we propose to exclude the homophily effect from confi- probabilitythatanedgesatisfiesalltheconditionsinl∧w∧r: dence. Consider GR4, (SEX:F, EDU:Grad) −d−a−te→s (SEX:M, |E(l∧w∧r)| EDU:College), in Example 2. Assume that EDU is a ho- w supp(l−→r)=P(l∧w∧r)= (1) mophily attribute. The confidence of GR4 is given by |E| supp(GR4)/supp(l∧dates)=2/6.4ofthesupportsupp(l∧ |E(l∧w∧r)|denotesthenumberofedgessatisfyingl∧w∧r. dates)=6 is contributed by the homophily effect represented Support of l∧w is defined as by GR3 (SEX : F,EDU : Grad) −d−a−te→s (EDU : Grad), |E(l∧w)| supp(GR3)=4.Excludingthiseffectfromsupp(l∧dates)= supp(l∧w)=P(l∧w)= (2) 6, the new metric of GR4 is 2/(6 − 4) = 100%, read |E| as: for women with Grad education, when not dating men |E(l∧w)| is the number of edges satisfying l∧w. (cid:3) having Grad education, they were dating men having College education with 100% probability. With |E| being a constant for a given network, we can use absolute support by dropping the denominator |E|. While In general, for a GR l −→w r, let β denote the homophily supportmeasuresthegeneralityofaGR,confidencemeasures attributes in R that occur in L but have different values in the the strength of a GR. two sides, i.e., w Definition 3: [Confidence]Confidenceofl−→r isdefined β ={Ar ∈R|Al ∈L,r[Ar](cid:54)=l[Al]} (4) as Letl[β]denotetheconditionfortheRHScontainingthevalues w supp(l−→r) conf(l−→w r)=P(r |l∧w)= (cid:3) (3) in l restricted to β. We define homophily effect as supp(l∧w) w l−→l[β] (5) B. Non-homophily Preference In example GR4, EDU is the only homophily attribute and The above support/confidence metric has been used in the the values for EDU on both sides are different, thereby, β literature to mine interesting association rules, by specifying = {EDU} , and the homophily effect is (SEX:F, EDU:Grad) a minimum threshold on support and a minimum threshold −d−a−te→s (EDU:Grad). Recall conf (l −→w r) = supp(l−→w r). We on confidence. However, many GRs that have a high support supp(l∧w) w and a high confidence, like GR3, often are known and well can exclude the homophily effect by subtracting supp(l −→ expected because of the homophily effect, and those that do l[β]) from the denominator supp(l ∧ w) in the confidence. not follow from homophily but are still interesting, like GR4, This gives rise to the following new metric. are missed due to a low confidence. To unearth interesting Definition 4: [Non-homophily Preference] The definition GRs like GR4, the support/confidence metric approach has to w of non-homophily preference of a GR l−→r is given by set the thresholds for support and confidence at a very low level, leading to a much larger search space. In this paper w nhp(l−→r)=P(r |l∧w∧¬l[β]) we assume that the homophily principle is known and we are w interested in GRs that are not expected from the homophily = supp(l−→r) (cid:3) (6) principle. Therefore, the confidence metric is not suitable for supp(l∧w)−supp(l−→w l[β]) our purpose and we need a new metric to identify interesting GRs. First, let us clarify the notion of homophily. w Intuitively, nhp(l −→ r) is the conditional probability of Homophily attributes. Intuitively, a GR is considered to links going to a node described by r, given that they satisfy follow from the homophily principle if the LHS and RHS of l∧w and do not go to a node described by l[β]. this GR share a (set of) value(s). However, the homophily is Remark 1: In the case of β = ∅, the edges due to the attribute sensitive so that sharing values on certain attributes w homophily effect do not exist, we define supp(l−→l[β])=0; are not considered as trivial, such as the attribute SEX in a consequently,nhpdegeneratestoconf.Hence,confisaspecial dating site. We differentiate an attribute as either a homophily caseofnhpwherethereisnohomophilyattribute.Inthecase attribute or a non-homophily attribute. An homophily attribute of β (cid:54)= ∅, nhp ≥ conf, so excluding the homophily effect refers to an attribute on which the individuals sharing the boosts the rank of a GR not expected from homophily. This same value are more likely to connect to each other. For a is exactly what we want to achieve. givensocialnetwork,weassumethatthesettingofhomophily attributes is specified. Some existing works, like [27], studied w Theorem 1: Assume supp(l −→ r) (cid:54)= 0. (i) The denomi- the methods to identify homophily attributes. In many cases, nator in Eqn. (6) is not zero. (ii) nhp ∈ [0,1]. homophily attributes are known from a common sense. For example, EDU is likely a homophily attribute for dating Proof:(i)Ifβ =∅,thedenominatorinEqn.(6)isequalto relationships whereas SEX is a non-homophily attribute since supp(l∧w), which is not equal to 0. Assume β (cid:54)=∅. Suppose w dating could be between two people of same or opposite sex. supp(l ∧ w) − supp(l −→ l[β]) = 0, i.e., supp(l ∧ w) = w supp(l−→l[β]), this implies that all edges satisfying l∧w go for high dimensional nodes with large #Attr and densely V to the nodes covered by l[β] and no edge goes to the nodes connected graphs with large |E|. w covered by r, i.e., supp(l −→r)=0. But this contradicts the Another straightforward approach is to use a threshold for assumption. standard confidence (as defined in Definition 3), minConf, and (ii) If β = ∅, the denominator in Eqn. (6) is equal to minSupp to mine all the GRs that satisfy these two criteria, supp(l∧w), so nhp has a value in the range [0,1]. If β (cid:54)=∅, then remove the trivial (homophilic) GRs in a post-processing itsufficestonotethatthelinksaccountedforbysupp(l−→w r) phase to get the final results. This approach has the following w drawbacks. First, as discussed in Section III-B, the confidence and the links accounted for by supp(l −→ l[β]) are disjoint metric favors GRs that follow from the homophily principle (because r and l disagree on β), and both are subsets of those so that the majority of the high-confidence GRs in the top-k accounted for by supp(l∧w). resultsaretrivial(wewillshowthisinTableII).Thatistosay, many non-trivial and interesting GRs are not returned because C. Top-k GRs either their conf are less than minConf or they are not ranked as the top k GRs. As a result, this algorithm has to set a very Some GRs are interesting to users while some are not, we small minConf and very large k to first let the non-trivial GRs can use a threshold of support and non-homophily preference tobereturnedbeforedoingthepost-processing.Bydoingthis, to select the interesting ones. Furthermore, for two GRs g : 1 l −w−→1 r andg :l −w−→2 r ,ifl ⊆l ,w ⊆w ,andr =r , the efficiency of this approach becomes terrible owing to the 1 1 2 2 2 1 2 1 2 1 2 computation of the huge number of trivial GRs. Second, the wesaythatg ismoregeneralthang ,andg ismorespecial 1 2 2 post-processing for the great number of trivial GRs is another thang .Intuitively,ifg ismoregeneralthang ,g isasimilar 1 1 2 1 cost, which makes this approach rather worse. tendency to g but covers more nodes on LHS. In this case, 2 if both g1 and g2 satisfy certain support and non-homophily An ideal algorithm is that it examine only the necessary preference thresholds, g1 would make g2 redundant. GRs and return the top-k results in one phase. To achieve this andaddresstheissuesmentionedabove,thekeyistopushthe On account of the above discussion, finding the k most minNhp threshold, in addition to the minSupp threshold, as interesting GRs offers a brief and valuable overview of the early as possible. Besides, storing edge and node attributes entire social network. Hence, this problem is formulated as information separately without duplication helps a lot. In this follows. section, we first introduce the data model of representing Definition 5: [Top-k GRs] Given the homophily settings the social networks that contain edge and node attributes forattributes,asupportthresholdminSupp,anon-homophily information, then we mainly focus on a new search strategy preference threshold minNhp, and an integer k, a non-trivial for pushing the minNhp threshold. The full implementation l−→w r is a top-k GR if the three conditions hold: of our algorithm will be presented in Section IV. w w • (1) supp(l −→ r) ≥ minSupp and nhp(l −→ r) ≥ A. Data Model minNhp; w • (2) no non-trivial GR is more general than l −→ r Social LArray EArray RArray while satisfying (1); Networks W Ptr • (3) no more than k−1 non-trivial GRs have a higher ID Al Bl Out Ind 1 2 Ar Br rank while satisfying (1) and (2), where the rank is 2 4 measured by non-homophily preference, followed by 1 1 2 3 1 1 5 1 2 support, followed by the alphabetical order of GRs. 2 2 2 2 4 1 3 2 2 2 5 The objective is to mine the top-k GRs. (cid:3) 5 3 2 5 10 6 2 1 2 5 1 4 4 1 3 7 16 1 3 IV. MININGTOP-k GRS 3 6 w …. … … … … 1 7 … … Onebaselinealgorithmforfindingtop-k l−→r istoapply … … regularApriori-likealgorithmssuchas[5]tofindfrequentsets l∧w and l∧w∧r above the minSupp threshold and then construct GRs in a post-processing step using the minNhp Fig. 2: Data model: LArray, EArray and RArray threshold. This algorithm does not work well for GRs with a small support because there are too many frequent sets when minNhp is small. In fact, strong social ties typically exist For the sake of illustration, let’s consider two node at- among small groups. Another issue is that frequent set mining tributesA,BandoneedgeattributeW.Foreachnodeattribute usually requires collecting all information in one table. For A, we use the symbol Al for the occurrence in LHS of a GR graph data, this means replicating the node information for and use the symbol Ar for the occurrence in RHS. Then, we every edge adjacent to the node, and the size of this table is shall store the node and edge information of social networks |E|×(2×#Attr +#Attr ), where #Attr is the number V E V separately as shown in Fig. 2. ofattributesinV and#Attr isthenumberofattributesinE. E The term |E|×2×#Attr usually causes storage explosion LArraycontainstherecordsforindividualsthatcouldoccur V andimposesabottleneckformostgraphalgorithms,especially in the LHS of GRs and RArray contains the records for w individuals that could occur in the RHS of GRs. Out is the addition, β (cid:54)= ∅ (see Eqn. (4)), so supp(l −→ l[β]) (cid:54)= 0. This out-degree of a record and Ind is the starting position of the change may increase or decrease nhp(l −→w r). In this case, outgoingedgesinEArray.EArraycontainsonerecordforeach nhp does not have anti-monotonicity. edge and Ptr is the pointer to the record for the destination Intheremainderofthissection,weproposeacarefulorder node in RArray. We assume that this structure is held in w memoryanduseittopartitionthedataforcountingthesupport of enumerating GRs l −→r so that GRs can be pruned based for GRs. For example, the first row in LArray represents the on nhp in all the cases of adding a value to r. record1forLHS,whichconnectstothedestinationrecords2, 4and5forRHS,foundbythepointersPtr keptintheentries C. Subset-First Depth-First (SFDF) Enumeration [Ind,Ind+Out−1]ofEArray.NotethatRArrayandLArray WeuseatreestructuretorepresentallGRswhereeachtree are for destinations and sources of edges (thus, not a subset node represents a subset LWR (see Table I for the definition of each other) and will be sorted by the different attributes for w of L, W and R) and all the corresponding GRs l −→ r. This RHS and LHS for counting support. For this reason, RArray tree structure is only a conceptual representation and is not and LArray must be separately stored. stored in entirety. The nodes of this tree are enumerated to This compact data structure has the size |V|×(#AttrV + ensure two properties: 2)+|E|×(#Attr +1)+|V|×#Attr , which eliminates E V the bottleneck term |E| × 2 × #Attr of the single table • Property 1: Enumerate a subset LWR by adding V representation as mentioned above. The difference is usually attributes in the order of those in L, W, and R. This large because #Attr is typically much larger than #Attr order enables the pruning in Theorem 2(1,2,3) where V E andanodetypicallyconnectsto,orisconnectedfrom,multiple the values for r are added after those for l and w. nodes.Evenforasparsenetwork,thespacerequirementofthe • Property 2: Enumerate a subset L W R before any compact data model is also smaller since the nodes with zero 1 1 1 subset L W R where L ⊆ L , W ⊆ W , and out-degree of in-degree will not appear in LArray or RArray. 2 2 2 1 2 1 2 R ⊆ R . This order ensures that the node for 1 2 w w l −→ l[β] is enumerated before the node for l −→ r B. Pruning Strategies w (because β is a subset of R), hence, supp(l −→l[β]) TopruneGRsusingaminimumthresholdonnhp(Defini- was computed before computing nhp(l−→w r). This is tion4),thechallengeisthat,showninTheorem2(2,3),nhphas necessary because the latter depends on the former. anti-monotonicity only for “certain cases”; for the remaining cases, adding a value for a homophily attribute to RHS would Theregulardepth-firstenumerationdoesnotprovideProp- increase or decrease nhp, so the traditional tree-based pattern erty 2, and the regular breadth-first enumeration (level order) enumeration cannot prune GRs using a threshold of nhp. See meets these requirements but has to keep all nodes and their more discussion in Remark 2. In the rest of this section, we GRsatthesamelevel,whichimposesabottleneckonmemory. deviseanewenumerationtomanifesttheanti-monotonicityof We propose a novel strategy, called subset-first depth-first nhpinallcases(Theorem3).Thisstrategyallowsustoprune (SFDF), that will enumerate a subset before a superset like GRs based on the threshold of nhp. First, the next theorem the breadth-first enumeration but is depth-first (to avoid the states pruning properties of GRs. memory bottleneck). Theorem 2: (1) supp(l −→w r) is not increased by adding To ensure that each subset LWR is enumerated at most an attribute value to l or r or w. (2) If β (cid:54)=∅, nhp(l −→w r) is once, we impose the following order on all attributes: w not increased by adding a value to r. (3) If β =∅, nhp(l −→ τ :NHr,Hr,W,NHl,Hl (7) r)isnotincreasedbyaddingavaluetor foranon-homophily attribute or for a homophily attribute not occurring in l. where NHl denotes non-homophily attributes for LHS, and NHr denotes non-homophily attributes for RHS. Similarly, Proof: (1) follows from the anti-monotonicity of support. Hl and Hr denote homophily attributes for LHS and RHS, nhpisequalto supp(l−→w r) (Definition4).Adding respectively. W denotes edge attributes. supp(l∧w)−supp(l−→w l[β]) a value to r does not affect supp(l∧w), and if β (cid:54)=∅, never w w 0nil increases supp(l −→ l[β]) and supp(l −→ r). This shows (2). If β = ∅, adding a value to r for a non-homophily attribute, or a homophily attribute not occurring in l, preserves β = ∅, 1Br 2Ar 4W 8Bl 16Al w thus, supp(l −→l[β])=0. Then (3) holds similarly as in (2). 3Br 5Br 6Ar 9Ar 10Br 12W 17Br 18Ar 20W 24Bl 7Br 11Ar 13Ar 14Br 19Br 21Br 22Ar 25Br 26Ar 28W Remark 2: Theorem 2(1) enables supp based pruning of GRs, and Theorem 2(2,3) enables nhp based pruning when 15Ar 23Br 27Br 29Br 30Ar expanding the RHS r of a GR under certain cases. The remaining case is expanding a value to r for a homophily 31Br attribute that occurs in l when β = ∅. In this case, nhp does Fig. 3: Subset-First Depth-First enumeration not have the anti-monotonicity. To see this, suppose that we add a value br for a homophily attribute B to r, where some value bl of Bl, bl (cid:54)= br, has already occurred in l. Before At any tree node t, label(t) denotes the labeling attribute w the addition, β = ∅, thus, supp(l −→ l[β]) = 0, but after the for t, path(t) denotes the attribute set LWR constructed by all the labels for the nodes on the path from the root to t, enumeratedinpath(t ).Then,tail(t )isdynamicallyordered 8 8 and tail(t) denotes the prefix of the list τ to the left of the as (Ar,Br,W), instead of the static order (Br,Ar,W). This attribute label(t). tail(t) is the set of unused attributes that order ensures that Br is added to path(t ) before Ar if 8 can be to expand path(t) in the subtree below t. Initially, both Br and Ar appear in the path, as shown by the path label(root) is nil, path(root) is empty, and tail(root) = τ. t ,t ,t .Onthepatht ,t ,t ,Br isaddedtopath(t )after 8 10 11 4 6 7 4 If tail(t) (cid:54)= ∅, for each attribute in tail(t) in order, t has Ar. This does not contradict our order because no homophily one child t(cid:48) labeled by the attribute. Note that this order will attributewasenumeratedinpath(t ),i.e.,Hr ={Br,Ar}and 4 1 expandthesubsetpath(t)byaddingtheattributesintheorder Hr = ∅. The next theorem shows that this dynamical order 2 Hl,NHl,W,Hr,NHr,i.e.,thoseforLHS,followedbythose restores the anti-monotonicity of nhp. for edges, followed by those for RHS. This gives Property 1. Theorem 3: Assume that tail(t) is dynamically ordered at For the sake of illustration, we assume that both A a node t described above, and g(cid:48) and g are non-trivial GRs and B are homophily attributes (the enumeration of non- where g(cid:48) is obtained from g by adding one or more values to homophily attributes are straightforward as discussed below). the RHS of g. Then nhp(g(cid:48))≤nhp(g). NHr = NHl = ∅, Hr = {Br,Ar}, Hl = {Bl,Al}, and Proof: If β (cid:54)= ∅ for g, Theorem 2(2) implies nhp(g(cid:48)) ≤ τ =(Br,Ar,W,Bl,Al). Fig. 3 shows the SFDF enumeration nhp(g).Weassumeβ =∅forg.Letg(cid:48) betheresultofadding of all subsets LWR with the order indicated by the sequence a value b to the RHS of g. If b is a value for an attribute numbersasidethenodes.Lett denotethetreenodenumbered i in Hr or NHr, Theorem 2(3) implies nhp(g(cid:48)) ≤ nhp(g). i. At the root t , label(t ) = nil, path(t ) = ∅, and 1 0 0 0 So we assume that b is a value for a homophily attribute tail(t ) = τ. The root has five child nodes, t ,t ,t ,t ,t , 0 1 2 4 8 16 in Hr. In this case, according to the dynamic ordering of labeled Br,Ar,W,Bl,Al in that order. Next, the SFDF order 2 tail(t), the RHS of g contains only values for attributes in enumeratest .tail(t )=∅andt hasnochild.Thenextnode 1 1 1 Hr since the values for attributes in Hr and NHr are added enumerated is t labeled Ar, tail(t )=(Br) and t has one 2 1 2 2 2 after those for attributes in Hr. Thereby, all attributes in the child t labeled Br. path(t ) = {Ar,Br}, which represents 2 3 3 RHS of g are homophily attributes and occur in the LHS. all GRs l −→w r with L = ∅, W = ∅, and R = {Ar,Br}. Then the assumption β = ∅ implies that the values of these Similarly, t ,t ,t ,t are enumerated following this order. 4 5 6 7 attributes are contained in the LHS of g, hence, g is a trivial Atnodet ,path(t )={Bl}andtail(t )=(Br,Ar,W). GR, contradicting our assumption. This shows that b cannot 8 8 8 For the first time, some homophily attribute, B, occurs in the be a value for a homophily attribute in H2r if β =∅. The case LHS.ThisnoderepresentstheenumeratedsubsetLWRwhere of adding more values to the RHS of g follows by repeating L = {Bl} and W = R = ∅. Note β = ∅. t has three the above argument on g(cid:48). 8 child nodes labeled Br,Ar,W. Following the above order, The above enumeration order ensures that our depth-first the subset BlBr will be enumerated before the subset BlAr, traversal enumerates smaller subsets LWR before enumerat- then the subset BlArBr will be enumerated as a child node ing larger ones, i.e., Property 2, adds the attributes for LHS of BlAr (by adding Br). This is exactly the case discussed before adding the attributes for RHS, i.e., Property 1, and in Remark 2 where a homophily attribute Bl has a value in l restores the anti-monotonicity of nhp, i.e., Theorem 3. All and adding a new value for Br to r changes β =∅ to β (cid:54)=∅, these properties are essential for pruning GRs based on the causing the lack of anti-monotonicity of nhp. threshold of nhp. To avoid this problem, at node t , it helps to add Ar 8 (because Al does not occur in the LHS) before adding Br D. Computing Non-homophily Preference (because Bl occurs in the LHS). In Fig. 3, this order ensures A remaining issue is how to compute nhp at a node. that the subset BlAr (at t9) is enumerated before the subset Suppose that we are enumerating the current node t for GRs BlBr (at t10), therefore, the subset BlBrAr (at t11) is l−→w r.Innhp(l−→w r)= supp(l−→w r) (Definition4), enumeratedasachildnodeofBlBr insteadofachildnodeof supp(l∧w)−supp(l−→w l[β]) BlAr. This is defined by the dynamic order of tail(t) below. supp(l −→w r)iscomputedattandsupp(l∧w)wascomputed atanearlier nodebecausetheattributesetforl∧w isasubset Dynamic ordering of tail(t). At any node t with path(t) w = LWR, let NHl,NHr,W,Hl,Hr be the same as in Eqn. oftheattributesetforl−→r(i.e.,Property2).Inthefollowing w (7). Let Hr and Hr be the partitioning of Hr, where Hr discussion, we consider supp(l −→ l[β]) and assume β (cid:54)= ∅, 1 2 1 w contains those Ar with the corresponding Al not enumerated thus, supp(l−→l[β]) (cid:54)=0. Note β ⊆R. There are two cases: in path(t) and Hr contains those Ar with Al enumerated in w 2 Case 1: If β ⊂R, the node for l−→l[β] was enumerated path(t). We dynamically order the attributes in tail(t) at a w at an earlier node because the attribute set for l −→ l[β] is a node t as follows: w w subset of the attribute set for l −→ r. Note that l −→ l[β] is a NHr,Hr,Hr,W,NHl,Hl (8) trivial GR, its support can be easily computed. 1 2 Inotherwords,thehomophilyattributesinHr aredynamically Case 2: β =R. In this case, supp(l−→w l[β]) is computed ordered on the basis of whether their corresponding attributes at the current node t for l −→w r. An example is a GR g at were enumerated in the LHS at t. Consequently, the attributes t : (a ,b ) → (a ,b ), where a ,a are different values for 27 2 2 1 1 2 1 in Hr are added to path(t) before the attributes in Hr. attribute A, and b ,b are different values for B. So β = 2 1 2 1 {Ar,Br} and Consider the node t again. Recall that path(t ) = {Bl} 8 8 and tail(t8) = (Br,Ar,W). H1r = {Ar} and H2r = {Br} nhp(g)= supp((a2,b2)→(a1,b1)) (9) because Al was not enumerated in path(t8) and Bl was supp((a2,b2))−supp((a2,b2)→(a2,b2)) Algorithm 1: GRMiner Our algorithm prunes further partitioning using the thresholds on supp and nhp as in Theorem 2(1) and Theorem 3. We 1 Procedure Main() discussedhowtocomputenhpforagivenpartitioninSection 2 initiate LArray, EArray and RArray; IV-D. Below, we focus on how to partition the data using 3 RIGHT(RArray, tail(nil)); LArray, EArray, and RArray as introduced in Section IV-A. 4 EDGE(EArray, tail(nil)); 5 LEFT(LArray, tail(nil))); Algorithm 1, GRMiner, gives the pseudo-code of our 6 Output(top[k]); algorithm. The main procedure starts with loading LArray, EArray and RArray into memory at line 2. tail() returns the 7 Procedure LEFT(data, Tail) attributes(dimensions)thatwillbeusedtoexpandtheattribute 8 forall the dimension d both in Tail and in LArray do setLWR,similartotail(t)inSectionIV-C.Initially,tail(nil) 9 forall the partition p of data on dimension d do returnsalltheattributesintheorderinEqn.(8).Inourrunning 10 if supp(p) < minSupp then example, tail(nil) = {Br,Ar,W,Bl,Al}, where {Br,Ar} is 11 return; in RArray, {W} is in EArray, and {Bl,Al} is in LArray. 12 RIGHT(getRight(p), tail(p.Att)); 13 EDGE(getEdge(p), tail(p.Att)); At the current node t of the tree, data denotes the data 14 LEFT(p, tail(p.Att)); partition generated by LWR at t. Since the attributes in tail(t) are contained in the tables LArray, EArray, and RAr- ray, we use three recursive procedures RIGHT(data,Tail), 15 Procedure EDGE(data, Tail) EDGE(data,Tail) and LEFT(data,Tail) to partition data, 16 forall the dimension d both in Tail and in EArray do whereTailisavariablefortail(t).Initially,dataistheentire 17 forall the partition p of data on dimension d do tables LArray, EArray, and RArray and Tail = tail(nil), 18 if supp(p) < minSupp then at lines 3 - 5. Partitioning data by an attribute in tail(t) 19 return; generates the partitions for a child node created as described 20 RIGHT(getRight(p), tail(p.Att)); in Section IV-C. These calls then search recursively deeper 21 EDGE(p, tail(p.Att)); into the enumeration tree, explained below. On return from all calls, top[k] contains the top-k GRs. 22 Procedure RIGHT(data, Tail) LEFT(data,Tail) partitions data using each dimension 23 forall the dimension d both in Tail and in RArray do occurring both in Tail and in LArray (line 8) (i.e., the 24 forall the partition p of data on dimension d do dimensionsinTailcontainedinLArray).Byabuseofnotation, 25 if supp(p) < minSupp OR nhp(p) < minNhp foreachpartitionp,wealsouseptodenotethecorresponding then GR. supp(p) returns the support of p and p.Att returns the 26 return; attributes on which p has been partitioned. p.Att corresponds 27 if p is a non-trivial GR and no more general to path(t) in Section IV-C. If supp(p) < minSupp, the GR than p found then procedure returns immediately (line 11), otherwise, p is re- 28 update top[k] and minNhp if necessary; cursively partitioned on the next three lines. The functions 29 RIGHT(p, tail(p.Att)); getRight(p) and getEdge(p) expand the partition p to the records in RArray and EArray, respectively. EDGE(data,Tail) is similar to LEFT(data,Tail) ex- cept that it partitions data by each dimension occurring both Ifwegenerate(a ,b )→(a ,b )beforegeneratinganyother 2 2 2 2 inTailandinEArray,andrecursivelyprocesseseachpartition GRs with (a ,b ) on the LHS, supp((a ,b ) → (a ,b )) 2 2 2 2 2 2 p by the calls RIGHT() and EDGE(). will be available when generating g. Enforcing this order only requires knowing the LHS of the current GR g, i.e., (a ,b ) RIGHT(data,Tail) partitions data by each dimension 2 2 in this example, therefore, can be easily implemented. occurring both in Tail and in RArray, and recurs on each partition. Line 25 checks if p meets the thresholds for support w In both cases, supp(l −→ l[β]) is either already computed and non-homophily preference, and Line 27 checks if p w orcanbecomputedatthesamenodeasforl−→r.Therefore, represents a non-trivial GR and if a more general GR than w w nhp(l−→r) can be computed at the node for l−→r. the GR for p was generated before. Since our enumeration examinessmallersubsetsofattributesbeforeexamininglarger V. ALGORITHMFRAMEWORK subsets, once a GR passes this checking, no later GR can be more general than it, so every GR in top[k] is a most general Wenowpresentthealgorithmframework,whichpartitions GR. Line 28 updates the top[k] list if the GR for p is among the data stored in the format as in Fig.2 and leverages the the top k GRs so far, and upgrades minNhp by the non- enumeration and pruning strategies presented in Section IV. homophily preference of the least ranked GR in top[k]. Our algorithm enumerates each attribute subset (LWR) Theorem 4: (1) top[k] returned by GRMiner contains the following the SFDF order as described in the last section. To top-k GRs. (2) A non-trivial GR is examined by Algorithm 1 compute supp and nhp of the GRs at the node for LWR, it only if it passes both minSupp and minNhp. (cid:3) partitions the data using the attribute set and then considers each partition recursively. A linear sorting method, Counting The work of Algorithm 1 is proportional to the number of Sort [28], is adopted to sort and get the aggregate of each GRs examined. (2) implies that no time is spent on examining partition. It sorts in O(N) time without any key comparisons. any non-trivial GRs that do not meet the thresholds minSupp and minNhp, thanks to the checking at lines 10, 18, 25, commonthatstudentsandprofessorsareco-authorsbutgener- and Theorem 3. Typically, much fewer GRs are examined allystudentshavemuchfewerpublicationsthanprofessors.We because minNhp is dynamically updated to the smallest non- construct one edge attribute Collaboration Strength homophily preference of the current top-k GRs (line 28). We (S) with three domain values: occasional (f =1), moderate will examine this effect of minNhp on real life data sets in (2≤f <5), often (f ≥5), where f is the number of papers Section VI. co-authored by the two authors at the ends of an edge. We evaluated the interestingness of GRs in Sections VI-B VI. EXPERIMENTS and VI-C, and evaluated the efficiency of the GRMiner algo- rithm in Section VI-D. We evaluated the GRMiner algorithm on real life data on CentOS 6.4 with Intel 8-core processors 2.53GHz and 12G of RAM. The programs were written in C++. B. Interestingness Study for Pokec Data One of our claims is that the proposed non-homophily A. Data Sets preferencemetric(i.e.,nhp)helpstoidentifyinterestingsocial ties beyond the well-known homophily principle. We evaluate We used two public real-world data sets: Pokec Social this claim by comparing the top-k GRs ranked by nhp with Network data4 and DBLP co-authorship data5 because the the top-k GRs ranked by the standard confidence, conf. Note domains of these data sets are easy to understand, which is that when applying conf, homophily effect is not excluded. essential for interestingness studies. We set minSupp=0.1% (i.e., absolute minSupp = 21078), minNhp and minConf at 50%, and k = 300. Table IIa shows Pokec Social Network Data. Pokec is the most popu- the top-5 GRs ranked by nhp (in boldface) and top-5 GRs lar online social network service in Slovakia for discover- ranked by conf, plus one less ranked GR by nhp (the last ing, chatting and dating with online friends. This data set row). 4 of the top-5 GRs ranked by conf are trivially expected contains anonymized users with profile data and directed from the homophily principle as both LHS and RHs contain friendships between users. We extracted 6 most important node attributes: Gender (G,3), Age (A,11), Region the same value; this trend continues further down the list (R,188), Education (E,10), What-Looking-For (not shown here). This suggests that the conf metric fails to (L,11), and Marital Status (S,7), where the letter find interesting relationships beyond what is known from the homophily effect. In contrast, the GRs ranked by nhp, i.e., and number in a bracket are the abbreviation and domain size ofanattribute.WespecifyA,R,E,Lashomophilyattributes. P1-P5 and P207, tend to provide more insights. The conf of these GRs are included for comparison. These GRs are found While all attributes have drop lists for choosing their values, E,L,S arealsofillablewithanytext.Weusedthevaluesfrom because their nhp is high, even though their conf is low. Note that the proportion of data covered by a GR is captured by the drop list whenever they were chosen, and otherwise, the supp. We pick P2, P5, and P207 to discuss in details, other user-filledtextsubjecttothefollowingpreprocessinginorder: GRs are interpreted in a similar way. (1) Remove all characters except letters and apply standard IR pre-processing to the filled text. (2) For the words that P2: (E:Basic)→ (E:Secondary). This GR indicates that for occur in more than 200 user profiles, replace them by their people with Basic education, when not partnering with people English synonym and mark the other words as “invalid”. (3) withthesameeducationastheirown,theypreferred(in68.7% Use the highest level for E (for example, keep “Master” if cases) those with Secondary education. With Training being both“Bachelor”and“Master”arefilled);andforLandS,use closer to Basic, this GR is less expected from homophily of the word with highest frequency. (4) Keep only user profiles Education because Training is expected to be more popular containing no “invalid” value. The final induced graph has among people with Basic education. Further examination of 1,436,515 (87.98%) users and 21,078,140 (68.83%) directed data reveals that the proportion of Secondary is 19.54% and edges. In addition, we discretized the domain of Age into “0- theproportionofTrainingisonly1.9%,whichisprobablythe 6”,“7-13”,“14-17”,“18-24”,“25-34”,“35-44”,“45-54”,“55- reason for the high nhp of this GR. 64”, “65-79”, and “80 or older”. P5: (L:Sexual Partner) → (G:Female). For this GR, nhp DBLPData.Thisistheco-authorshipDBLPdatasetused degenerates into conf because β = ∅ (no homophily attribute in [1], and it contains 28,702 authors and 66,832 directed co- occurs on both sides). This GR suggests that for people authorrelationships(wereplaceeachundirectededgewithtwo describingthemselvesaslookingforsexualpartners,64.7%of directed edges in opposite directions). Each author has two theirpartnersarefemale.StartingwiththisGRandwondering node attributes, Area (A) and Productivity(P), and whether gender has any impact on this behavior, we formed Areahas4valuesDB,DM,AI andIR,andProductivity the following two hypothesis by varying P5, and queried their has 4 values Poor, Fair, Good and Excellent. We use the nhp and supp from the data: exact same criteria as in [1] to discretize the values for the two attributes. Definitely, an author may belong to multiple (G:Male,L:Sexual Partner)→(G:Female) areas, we select one only among the four in which she/he nhp=68.1%;supp=392652 publishes most. We specify Area as a homophily attribute since authors in the same areas tend to collaborate; while we (G:Female,L:Sexual Partner)→(G:Male) specify Productivity as a non-homophily one, since it is nhp=48.8%;supp=71699 4http://snap.stanford.edu/data/soc-pokec.html Thispairsuggestsabigdifferenceinthepreferenceofopposite 5http://dblp.uni-trier.de/xml/ sex partners by males and females when looking for sexual (a) Pokec data set (b) DBLP data set Rankedbynhp Rankedbyconf Rankedbynhp Rankedbyconf (L:Chat)→(L:GoodFriend) (A:AI)→(P:Poor) (R:27)→(R:27) (A:AI)→(A:AI) P1: nhp=69.5%;supp=649723 D1: nhp=74.3%;supp=31330 conf=72.2%;supp=250930 conf=88.8%;supp=37458 (conf=30.9%) (conf=74.3%) (E:Basic)→(E:Secondary) (A:DB)−o−f−t−e→n (A:DM) (R:24)→(R:24) (A:DB)→(A:DB) P2: nhp=68.7%;supp=682715 D2: nhp=71.5%;supp=98 conf=66.1%;supp=197374 conf=88.7%;supp=44980 (conf=15.4%) (conf=6.98%) (E:Preschool)→(E:Basic) (P:Poor)→(P:Poor) (R:32)→(R:32) (A:IR)→(A:IR) P3: nhp=66.1%;supp=54765 D3: nhp=70.6%;supp=63174 conf=65.1%;supp=143219 conf=75.9%;supp=16020 (conf=30.4%) (conf=70.6%) (E:HardlyAny)→(E:Basic) (P:Excellent)→(A:DB) (R:10)→(R:10) (A:AI)→(P:Poor) P4: nhp=65%;supp=34099 D4: nhp=68.1%;supp=2744 conf=65%;supp=279623 conf=74.3%;supp=31330 (conf=30.7%) (conf=68.1%) (L:SexualPartner)→(G:Female) (A:IR)→(P:Poor) (L:SexualPartner)→(G:Female) (A:DM)→(A:DM) P5: nhp=64.7%;supp=468012 D5: nhp=68.1%;supp=14368 conf=64.7%;supp=468012 conf=72.3%;supp=14232 (conf=64.7%) (conf=68.1%) (G:Male,A:25-34)→(A:18-24) (A:AI,P:Good)→(A:DM) P207: nhp=50.8%;supp=593785 D16: nhp=55.2%;supp=272 (conf=33.9%) (conf=11.6%) TABLE II: Comparison of top GRs ranked by nhp and conf partners. Without first finding P5, it is difficult to find this a true preference to DM, not due to data skewness. A possible difference from the collection of GRs. reason is that DM is an interdisciplinary field that intersects database and machine learning (a subarea of AI). P207: (G:Male, A:25-34) → (A:18-24). Again, we form hypothesis from the seed P207. We replace Male with Female Remark 3: Finding top-k GRs typically serves the entry on the LHS and get nhp = 32.8% and supp = 204780, which point in pattern mining. In the above case studies, the human suggeststhatwomenmuchlesspreferredyoungerpartnersthan analyst starts with top-k GRs found, forms new hypothesis men. The next two variations show that this difference is even throughvaryingtheGRsfound,andcomparessuchhypothesis bigger for partner with opposite sex: as well as data distribution. This process can apply to the new hypothesis recursively. This cycle of hypothesis formulation (G:Male,A:25-34)→(G:Female,A:18-24) and hypothesis comparison often leads to new insights into nhp=39.1%;supp=456201 the behaviors of different groups of actors or an explanation of the presence of a GR. Unlike manual probing of a data (G:Female,A:25-34)→(G:Male,A:18-24) set, top-k GRs provide an entry point to this cycle by filtering nhp=12.8%;supp=80070 many uninteresting and trivial patterns. D. Efficiency of Algorithms C. Interestingness Study for DBLP Data OuralgorithmfinishedrunningontheDBLPdatasetinno For DBLP data, we set minSupp = 0.1% (i.e., absolute more than 0.483 seconds for all parameter settings. Therefore, minSupp = 67), minNhp and minConf at 50%, and k = 20. our study below focuses on the Pokec data, which is much Table IIb shows the top GRs ranked by nhp (in boldface) largerthantheDBLPdata.GRMiner(k)denotesthealgorithm and conf. Similar to the study on Pokec Data, the top GRs that pushes all the constraints of minSupp, minNhp, top-k, ranked by nhp are more interesting than those ranked by and generality of GRs to prune search space, as described in conf. Recall that Area (A) is a homophily attribute and Section VI-D. GRMiner pushes all constraints except for the Productivity(P) is not. top-k constraint. The difference will tell the effectiveness of D1 & D3 & D5: On surface, D1 & D3 & D5 suggests the dynamically upgrading minNhp to that of top-k GRs. preferencetoauthorswithPoorproductivity.Thisisinteresting We consider two baseline solutions. One stores the node asitcontradictswiththecommonsense.Aquickcheckonthe and edge attributes information in a single table, applies the data(byexaminingthevaluesdistributionontheattribute)tells BUC algorithm [23] to mine the combinations of attribute that 91.18% of the authors have the value Poor for P because values above the threshold minSupp. We denote this baseline many authors are students and most co-authorship is between byBL1.Thesecondbaseline,BL2,issimilartoBL1butworks supervisors and students. withthenodeandedgeattributesinformationseparatelystored often inthreetables.Bothbaselinesprunethesearchspaceusingthe D2: (A:DB)−−−−→(A:DM) D2 suggests that authors in the anti-monotonicity of support, but not minNhp, and find the DB area often collaborate with those in the DM area when top-k GRs in a post-processing step. collaboratingwiththosenotintheirownarea.D16isasimilar pattern for authors in AI area. In fact, DM has the least Unless otherwise stated, we consider the four node proportion among all areas. Therefore, these GRs represent attributes with largest domain sizes, i.e., Age, Region,
Description: