ebook img

Who With Whom And How?: Extracting Large Social Networks Using Search Engines PDF

0.77 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Who With Whom And How?: Extracting Large Social Networks Using Search Engines

Who With Whom And How? - Extracting Large Social Networks Using Search Engines Stefan Siersdorfer, Philipp Kemkes, Hanno Ackermann, Sergej Zerr L3SResearchCenter,Hannover,Germany {siersdorfer,kemkes,zerr}@L3S.de [email protected] 7 ABSTRACT 1 0 Social network analysis is leveraged in a variety of applica- 2 tions such as identifying influential entities, detecting com- n munities with special interests, and determining theflow of a information and innovations. However, existing approaches J for extracting social networks from unstructured Web con- 8 tentdonotscalewellandareonlyfeasibleforsmallgraphs. 2 In this paper, we introduce novel methodologies for query- based search engine mining, enabling efficient extraction of ] social networks from large amounts of Web data. To this I end, we use patterns in phrase queries for retrieving entity S connections, and employa bootstrappingapproach for iter- . s atively expandingthe pattern set. Our experimental evalu- c ationindifferentdomainsdemonstratesthatouralgorithms [ provide high quality results and allow for scalable and effi- 1 cient construction of social graphs. v Categories andSubject Descriptors 5 8 H.3.3[InformationSystems]: Informationstorageandre- 2 trieval; E.1 [Data Structures]: Graphsand networks 8 Figure 1: Subgraph of a network extracted using our ap- 0 Keywords proachalongwithexamplesforconnectingphrasepatterns. . 1 pattern based queries,social network extraction 0 items and new contacts to their users. Furthermore, know- 7 1. INTRODUCTION ingthetopologyofsocialgraphscanshedlightontheprop- 1 agationofideasandtrustinsocialnetworks[36, 1, 12],and Networkingandcommunicationarenaturalhumanactiv- : can enhance various IR applications such as personalized v ities that are massively supported through modern infor- queryexpansion and media recommendation [4, 13]. i mation technologies, most prominently the internet. Large X In order to enable the analysis of social networks they networks of people are present in many parts of the Web, havetobeextractedfromunderlyingdatainthefirstplace. r for instance, in form of contacts and friends in Social Web a Some sources already provide explicit and easy to extract platforms, co-occurring named entities in web pages, or co- information about user relations. This includesonline plat- authors of articles in scientific portals. Social networks ex- forms such as Facebook, Twitter, or LinkedIn that main- tractedfromtheseinformationsourcesareexploitedforvar- tain user databases and offer software interfaces for ac- ious purposes: Node centrality measures in networks can cessing contacts, friends, or followers. Other sources - for help to identify important and influential persons or com- instance, email corpora revealing communication links be- munitiesinareassuchasentertainment,science,orpolitics. tween persons, or scientific portals comprising information Informationonsocialconnectionsincustomernetworkscan onco-authorship-containmoreimplicit,yeteasytoextract be leveraged as part of recommender systems that suggest network information. However, in many cases information aboutsocialconnectionsishiddenwithinunstructureddata Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor such as Web pages, archives,and multimediarepositories. classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed Search engines constitute our main access points to the forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcita- tiononthefirstpage. Copyrightsforcomponentsofthisworkownedbyothersthan Web,andthereexistsanumberofworksthatemploysearch ACMmustbehonored.Abstractingwithcreditispermitted.Tocopyotherwise,orre- queries for detectingco-occurrences of entities and leverage publish,topostonserversortoredistributetolists,requirespriorspecificpermission theseconnectionsforextractingsocialnetworks[20,25,30]. and/[email protected]. However, the suggested algorithmic approaches do not CIKM’15,October19–23,2015,Melbourne,Australia. scale well and are only suitable for rather small datasets (cid:13)c 2015ACM.ISBN978-1-4503-3794-6/15/10...$15.00. comprising just a few hundredor a few thousandnodes. DOI:http://dx.doi.org/10.1145/2806416.2806582. In this paper, we exploit pattern-based search engine textual patterns [2, 22]. In [32] a bootstrapping algorithm queries for efficient acquisition of large scale networks from is employed for iteratively discovering new patternsand se- unstructured Web data. These queries consist of entities manticconnections. In[33]and[5]categoricalfactsandcon- (e.g.“BarackObama”)andconnectingpatterns(e.g.“meets cepthierarchiesareextractedfromwebandnewscorporaas with”, “and his wife”) leveraged for mining links between entity-pattern-entity triples. Cimiano et al. [7, 8] leverage persons. Using such search patterns we can simultaneously pattern-based search engine queries for building ontologies. explore a large number of connections with a single query. In [16] comparable entities such as consumer items are ex- Startingwithaninitialsetofentities,weiterativelyexpand tracted using an expandable set of contrasting expressions both entity pair and pattern sets. We identify connections (e.g.“vs.”and“insteadof”);Jiangetal.[17]usesimilarrela- whicharelikelytobeimportantandusethisinformationto tionsforconstructingentitygraphsandproductrecommen- formulate subsequent queries, thereby greatly reducing the dations in web search. The KnowItAll system [11] employs number of search engine requests. For illustration, a small asupervisedlearningapproachforvalidatingfactsextracted excerpt of aconstructed graph is shown in Figure 1. usingsearchenginequeriesbasedonmanuallyprovidedpat- tern templates. Open information extraction systems such The main contributions of this work can be summarized as TextRunner [35] employ shallow parsing, and, as pat- as follows: terns,employnormalizedexpressionsbetweennounphrases in an unsupervised manner. The output of these systems • Weintroduceintelligentprioritizationapproachesforef- are databases that contain fact tuples such as “(Berkley, ficientlyexpandingsocial networksusingpattern based located in, Bay Area)”or“(Oppenheimer, professor of, the- search engine queries. oretical physics)”. In contrast to works focusing on general fact mining, we providea novelmethodology for extracting • Weemploya bootstrappingapproach forcovering mul- social networks. Our contributions include the systematic tiple aspects of social relations by iteratively extending construction of connected social graphs, their efficient and search pattern and entity sets for discovering domain intelligent expansion, as well as the analysis of their struc- specific social connections. tural properties. • Weconductanextensiveexperimentalevaluationdemon- The first approach employing search engines for network stratingthehighefficiencyandaccuracyofourmethods mining is probably the work by Kautz et al. [20]. The au- clearlyoutperformingthebaseline. Furthermore,wean- thors exploit co-occurrences of names on result pages to alyzevariousaspectsoftheextractedsocialgraphs,such identifyconnectionsbetweenpersons;theresultingnetworks as their structural properties, connecting patterns, and are relatively small and just centered around a single per- strength of social relations. son. The POLYPHONET system [25, 26] determines co- occurrencecountsinGooglequeriestoidentifypairwisecon- The remainder of this paper is organized as follows: Sec- nections. In addition, different types of relationships and tion 2 discusses related work on social network construc- correlations with topic-related keywords are taken into ac- tionfromsemi-structureddatasources,searchenginebased count to disambiguate entities. In [6] search engines are network mining, and extraction of semantic relationships. used to retrieve connected email addresses. The Flink sys- In Section 3 we describe our methods for network extrac- tem[27]focusesontheminingofauthornetworksandcom- tionincludingsearchpatternsandtheirexpansionbasedon bines web search with information from emails and pub- bootstrappingaswellasdifferentsearchprioritizationcrite- lications. In [29] co-occurrences of persons in search result ria. The evaluation presented in Section 4 studies the cost snippetsareleveraged. Thereareafewattemptstoincrease efficiency and accuracy of our methods, and provides fur- efficiencyindiscoveringentityrelationsontheweb [15,30]; therinsightsaboutthestructureoftheextracted networks. they focus on snippet clustering and entity disambiguation Finally, in Section 5 we conclude and describe directions of to reduce the number of search requests. However, all of our futurework. the current methods are based on pair-wise entity queries, making them infeasible for mining larger graphs. 2. RELATED WORK To the best of our knowledge, our methods are the first to go beyond pair-wise querying for social connections and useintelligentquerypatternconstructionandprioritization There is a plethora of work on social network extraction approaches for graph expansion that allow for tunable and fromtextandvisualdata. In[9],forinstance,socialconnec- larger scale network construction. tions between fictional characters are inferred from dialogs inbooks,and,similarly,in[19]asocialnetworkisextracted from the narrative of an Ottoman scholar and world trav- 3. SOCIALGRAPHMINING eler. In [10] social connections are constructed from a his- torical multimedia repository by leveraging co-occurrences In this section we describe our methodology for mining offacesinimages. Birdetal.[3]extractsocialnetworksus- social graphs using search engines. To this end, we first in- ing information aboutsenders andrecipientsobtained from troduce an efficient basic algorithm using web search with headers in email corpora. Social networks extracted from a fixed pattern set for identifying connections between per- other (semi-)structured sources include academic networks sons. Wethenextendthismethodbyaniterativeapproach or co-citation graphs in publication databases [28, 34]. for automatically collecting new patterns. Finally, in order Information extraction tackles the problem of deriving to further improve the cost efficiency of network construc- structured information from unstructured data. There is tion, we introduce methods for prioritizing nodes for graph a body of work on semantic relationship mining based on expansion. 3.1 Mining GraphswithaStaticSetofQuery Algorithm1:GraphConstructionwithPatternSetEx- Patterns pansion Given an initial seed set of entities I and a set of pat- Input: I: initial set of entities terns P, we use connectivity search queries of the form IP: initial set of patterns <entity> <p>(where<p>isaconnectingpatternase.g. in τ: threshold for edge weights “Barack Obama meets”) in order to identify links to other σ: threshold for pattern weights entities. We iteratively issue these queries over an initial maxIter: maximum numberof iterations seed set of entities (e.g. a list of a politicians). In this Output: social graph G=(V,E) way we obtain links between entities in the seed set as well 1 begin as new entities. This is repeated for all patterns in our 2 E ←∅ pattern set (in our experiments we used a small initial set 3 V ←I consistingofthesingleterm“and”). Forsnippetsofthetop- 4 Cand←I // graph extension candidates k search results in Bing we check for exact strings of the 5 P ←IP // pattern set form <entity> <p> <entity> where both entities match a 6 iter←0 person name in a database (we were using entities from 7 repeat DBpedia and IMDb in our experiments).1 The process of 8 NewCand←∅ querying entities is repeated for the newly discovered enti- 9 NewPatterns←∅ ties in a“breadth-first”manner. The resulting (undirected) 10 for e∈Cand do edges between two entities are weighted by the overall oc- 11 edges←search(e,P,τ) // web search for currence count in the snippets across all patterns, omitting edges with node e, using query patterns edges whose weight is below a threshold τ. P and weight threshold τ Algorithm 2 shows thedetails of thealgorithm. The net- 12 for ({e,ei},w)∈edgesdo work and a priority queue of nodes are initialized based on 13 if {e,ei}6∈E then the entity seed set (lines 2–4). Here the priority queue is 14 V ←V ∪{ei} a simple“first-in-first-out”queue; in Section 3.3 we will in- 15 NewCand←NewCand∪{ei} troduce more enhanced mechanisms for prioritization. The 16 E ←E∪{({e,ei},w)} nodes in the queue serve as input for pattern-based queries 17 else (line8),andqueryresultsareusedfor expandingthesocial 18 E ←E\{({e,ei},x)} network and updatingthe queue(lines 9–17). The network 19 E ←E∪{({e,ei},x+w)} expansion continues until the query budget is spent or un- til there are no unvisited nodes left (line 19). The running 20 Cand←Cand∪NewCand timeofthealgorithmislinearinthesumofnodesandedges 21 NewPatterns←pSearch(E,σ) // web search visited and found. for patterns, using edges from E, and weight threshold σ 3.2 Iterative PatternMining 22 P ←P ∪NewPatterns;iter←iter+1 We extend the Graph Mining method described above 23 until iter=maxIter; through an approach for iteratively discovering new pat- 24 return (V,E) terns. To this end, we start with an initial set of seed pat- terns and extend it in each round of the algorithm, taking into account theweights of edges extracted in that round. We then iterate over the top-h resulting edges with the highest weight and use them to issue queries of The details of the algorithm are shown in Algorithm 1. the form “<entity>” “<entity>”. Snippets of the Seed patterns and entities are first used to initialize the search results are then checked for strings of the form bootstrapping approach (lines 2–5). Then, for a fixed num- <entity> <c> <entity>, where <c> is an arbitrary string ber of iterations, the social network is expanded based on betweentwodifferententities;<c>becomesacandidatefor the current query patterns (lines 11–19), and, conversely, a new pattern. Among the candidates we want to identify the pattern set is expanded using the current edges in the patterns with high coverage in the result snippets. To this network(lines21–22). Therunningtimeofthealgorithmis end, let n be number of occurrences of candidate pattern linear in the sum of nodes and edges visited, multiplied by <c> (support), m be the number of distinct entity pairs thenumberof patternsfound. (diversity) and d be the number of distinct domains (spam resistance)comprisingthepattern. Weinclude<c>intothe 3.3 Prioritization of Nodes for Network Ex- pattern set if n·m·d2 is above a threshold σ. Extracting pansion top-k patterns can be managed effectively using a priority The algorithms described so far expand the social graph queuewithlog(n)runningtimeforoperationslikeinsertion in a “breath-first-search” manner. The advantage of this and updating. This pattern mining method is employed af- approach is its good coverage: Seedset and subsequentially tereachcompleteiterationofthealgorithmdescribedinthe found nodes are systematically visited and expanded until previous paragraph. our budget of connectivity queries is exhausted. However, 1 in order to further optimize budget usage we might want Wedecidedtokeeptheentityrecognitionsub-componentsimple to put higher priority on more“essential”nodes during the and lightweight. More enhanced NLP and disambiguation tech- niquesareoutofthescopeofthiswork,butcouldhelptofurther networkexpansionprocess. Inthefollowingwefocusontwo improveouralreadyverygoodresults. criteria for prioritizing nodes: popularity and novelty. 4. EVALUATION Algorithm 2: Graph Construction with Static Pattern Set In this section we evaluate the network extraction meth- Input: I: initial set of entities ods described in Section 3, starting with entities from two P: set of patterns domains: politics and movies. The objective of our evalua- τ: threshold for edge weights tion was to study (1) the cost efficiency of different strate- maxReq: maximum numberof requests gies in terms of network size obtained with a given budget pQueue: A container which sorts thecandidates of search engine requests, (2) the accuracy of the extracted according to theemployed algorithm networks, and (3) the structure of the social graphs as well Output: social graph G=(V,E) as properties of discovered social relations. In our exper- 1 begin iments, more than 300,000 persons and 1.4 million social 2 E ←∅ // Edges connections were retrieved using two seed sets containing 3 V ←I // Vertices just 342 politicians and 100 actors. The resulting networks 4 pQueue.addAll(I)// graph extension candidates are coined WikiNet and IMDbNet. 5 requestCnt←0 4.1 Setup 6 repeat 7 e←pQueue.pop() 8 (edges,requests)←search(e,P,τ)// web DomainsandSeedSets. Forinitializingouralgorithmswe search for edges with node e, using query used two entity seed sets from political and movie domains patterns P and weight threshold τ as follows: 9 for ({e,ei},w)∈edgesdo 10 if {e,ei}6∈E then • WikiNetseed: Weextracted342politiciannamesfrom 11 V ←V ∪{ei} the Wikipedia list of the current heads of state and 12 E ←E∪{({e,ei},w)} government from all over theworld2. 13 if ei 6∈pQueue then 14 pQueue.push(ei) • IMDbNetseed: Weconsideredthe100currentandfor- mer leading actors as listed in IMDB3. 15 else 16 E ←E\{({e,ei},x)} In order to simplify named entity recognition in search en- 17 E ←E∪{({e,ei},x+w)} ginesnippets,werestrictedthesetofpossiblepersonsinour 18 requestCnt←requestCnt+requests WikiNettothoselistedinDBpediaVersion2014 (containing 19 until requestCnt≥maxReq or pQueue.empty(); about1milliondistinctpersonnames)andintheIMDbNet 20 return (V,E) tonamesfromtheIMDbdirectory(about2milliondistinct names). Implementation Details. We employed the Bing Search EngineAPIforissuingournetworkexpansionqueries. Pre- liminary tests showed that in most cases the overall result list returned per query did not exceed k = 200 entries, al- Popularity in this context intuitively corresponds to the though the estimated number of results often went beyond importance of nodes within the network. We employ the thousands. Therefore, we issued 4 requests per query as numberofinlinksρinthecurrentversionofthesocialgraph Bing provides 50 results per request (if available). In order during the extraction process as popularity measure. Al- to avoid recurring API requests we simultaneously cached though very effective in our experiments, one could easily results from previous queries. The search API has some imagine using alternative node centrality measures such as technicallimitations, includingthefact that a few (special) PageRank [31],hubscores inHITS[21], oroneoftheirvar- characters,like“&”,“,”,“+”cannotbeusedassearchquery. ious variants (e.g. [23, 18]). Furthermore, most of the snippets returned for disjunctive queries do not contain query terms, and, thus, cannot be Novelty, on the other hand, corresponds to how recently used for relation extraction. anodewasaddedinthecourseofournetworkconstruction process. Intuitively, exploring more recent nodes can help TestedStrategies. Weevaluated thefollowing methods: to reach out more quickly to new communities within the network. Wecombinepopularityandnoveltyintoapriority • The breadth-first graph mining approach with fixed measure Φ for a node v using exponential temporal decay as follows: Φ(v) = ρ·e−α·t, where our time measure t cor- pattern set described in Section 3.1 (BF). responds to the numberof expansion node steps conducted • TheprioritizationapproachfromSection3.3,withdif- since the node was added to the queue, and α is a param- ferent values for the tunable decay parameter α (0, eter for balancing the influence of popularity and novelty. 0.005, and 0.01) (Prio). Exponentialdecayisoftenusedfortimedependentranking ofitemsinthecontextofIR(seee.g. TempPageRank[37]). • The iterative pattern mining approach described in This approach can be directly integrated in the framework Section 3.2 (PatternIter). providedinAlgorithm2usingapriorityqueuebasedonthe Φ score, and could, in principle, also be integrated into the 2http://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_governme dynamicpattern mining process (Algorithm 1). 3http://www.imdb.com/list/ls050274118/ Algorithm Nodes Edges • Ournewmethodsclearlyoutperformthebaselinewhich baseline t=0.1 6,956 13,571 does not scale well for expanding entity sets and pro- duces much smaller networks for the given budget. baseline t=0.4 2,925 3,838 Even for a small co-occurrence threshold t = 0.1 the BF 113,988 368,806 WikiNet networksobtainedbythebaselinemethodarealready decay α=0 98,479 456,223 betweenoneandtwoordersofmagnitudesmaller. For decay α=0.005 110,260 346,663 t = 0.4 (suggested in POLYPHONET for extracting decay α=0.01 116,081 323,372 relations of acceptable quality) the reduction in net- work growth becomes even more apparent. baseline t=0.1 7,159 34,903 baseline t=0.4 2,192 6,666 Note, that in addition to the number of search engine BF 109,453 376,429 requests taken into account for this evaluation, the base- IMDbNet decay α=0 107,085 425,830 line also introduces a substantial overhead by requiring the decay α=0.005 112,736 264,679 downloadandprocessingofalargesetofWebpages(about decay α=0.01 112,782 242,184 85,000 pages in ourexperiments). Incontrast, ourmethods avoid these extra costs by working directly on search result Table 1: Number of nodes and edges obtained for different snippets. strategies. Figure2providesadditionaldetailsaboutthenetworksize We set the edge weight threshold τ to 2 for all algorithms. development with respect to the amount of search engine For PatternIter we set the cut-off value σ for pattern se- requestsspent. Theeffectofastrongeremphasisonnovelty lection to 5, and used value h = 100 for top connections (corresponding to higher α) is reflected in a slight increase employed for discovering new patterns. The initial query in the number of new nodes, starting from around 50k – pattern sets used in our experiments consisted only of the 100k requests. On theotherhand, prioritization of popular generalpattern“and”;forextractingconnectionsfrom snip- nodeswithhighin-degree(lowerα)leadstoahighincrease pets we employed an additional small set of manually se- in the number of edges (starting from around 10k requests lected patterns4. for both datasets). Baseline. All of the current works on social graph con- 4.3 Network Properties struction using search engines are based on pairwise entity In the following we describe structural properties of the queries. Asbaselinewetestedthemethoddescribedin[25] extracted networks as well as the development of the net- (POLYPHONET)whichtriestoreducethenumberofpair- works duringtheexecution of ouralgorithms. wisequeriesbyidentifyingsubsetsofpromisingquerypairs. Morespecifically,startingwithaseedsetofentities,foreach Node Degree and Edge Weight Distributions. Table 2 entityalistofcandidateentities(containedintheentityset) provides summary statistics for the distribution of nodes is extracted from Web pages obtained by querying for the andedgesintheextractednetworks. Forallexperimentswe entity. Then queries for the pairs of seed entities and the observearight-skeweddistributionofnodesandedgeswith candidates for associated entities are issued in order to ob- median node degree of 1 or 2 and median edge weight of 3. tainco-occurrencevaluesbasedonsearchresultcounts,and (Theedgeweightsinalltheexperimentsfollowapowerlaw edges with a value above a threshold t are included in the distributionasexemplarilydepictedinFigure5fortheBF- graph. In order to allow for network expansion beyond the constructed WikiNet.) The decay parameter α for thePrio seedset,weincludedfurtherentitiesfoundintheWebpages strategyhasaclearimpactonthedistributions: Decreasing for identifyingadditional querycandidates. the value of α results in an increase of the means for both 4.2 CostEfficiency node degree and edge weights due to the higher prioritiza- tionofnodeswithalreadyhighdegree,andalsoresultsina Table 1 shows the number of nodes and edges in the ex- wider“spread”ofthevaluesasreflectedbyahigherstandard pandednetworkscomputedbyour methodsusing thesame deviation. seed sets and exhaustinga fixedbudget of 200k queries per Figure3showsthedegreedistributionsofthenodesinthe run. The main observations are thefollowing: constructednetworks. Thedistributionforthebreadth-first • Starting with the small seed sets consisting of just a (BF) strategy is close to a power law distribution. For the fewhundredentitiesdescribedinthesetup,ourmeth- Priostrategiesthedescribedeffectforthedecayparameterα ods are able to extract networks containing around interfereswiththepowerlaw,resultingintwolocalextrema: 100,000entities,andbetween200,000and450,000edges a minimum for nodes with lower in-degree followed by a for both domains. Our quality oriented evaluation in maximum for nodeswith higherin-degree. Section 4 will demonstrate thehigh accuracy of edges generated byour methods. GraphStructures. Figure 4 visualizes the social networks • The tunable decay parameter α in the prioritization obtained fordifferent methodsand seed sets. Fortheprior- approachPrioisapplicablefortradingnoveltyagainst itization strategy Prio the decay parameter α has a strong popularity. This is reflected in both domains by the influence on the network structure: For a stronger focus larger number of social relations extracted for lower on popularity (α = 0) the larger number of edges results values of α and the larger number of nodes found for in higher graph density; putting more emphasis on novelty higher valuesof α. (α = 0.01) leads to a wider spread of the graph, providing a bird’s eye view of the discovered space. A more in-depth 4 Thefollowingpatternswereused:“meets”,“ ”,“&”,“,”,“speaks explorationofcommunitystructuresissubjecttoourfuture with”,“und”,“et”,“y”,“-” work. Node Degree Edge Weight Algorithm Mean SD Median Mean SD Median BF 6.47 12.53 2 8.53 23.62 3 decay α=0 9.27 16.24 2 8.84 23.93 3 WikiNet decay α=0.005 6.29 9.46 2 7.65 18.43 3 decay α=0.01 5.57 8.39 2 7.65 17.95 3 BF 6.88 14.03 1 8.69 24.11 3 decay α=0 7.95 15.43 1 8.97 25.04 3 IMDbNet decay α=0.005 4.70 7.60 2 7.43 19.14 3 decay α=0.01 4.29 6.74 2 7.35 18.67 3 Table 2: Summarystatistics for node degrees and edge weights in theextracted networks. 500k 120k BF BF α=0 α=0 400k α=0.005 α=0.005 α=0.01 90k α=0.01 es 300k es dg od 60k #E 200k #N 30k 100k 0 0 0 50k 100k 150k 200k 0 50k 100k 150k 200k #Requests #Requests (a) WikiNet edges (b) WikiNet nodes 500k 120k BF BF α=0 α=0 400k α=0.005 α=0.005 α=0.01 90k α=0.01 es 300k es dg od 60k #E 200k #N 30k 100k 0 0 0 50k 100k 150k 200k 0 50k 100k 150k 200k #Requests #Requests (c) IMDbNetedges (d) IMDbNetnodes Figure 2: Nodesand edges versusrequests spent for WikiNet and IMDbNet. Example Connections. Finally, Table 3 lists the top-15 reference connections. One of the connected pairs was ob- connections in the IMDbNet found by the BF method. tained usingourmethod andtheotherconsisted ofpersons Among those connections are famous co-actors (“Terence uniformly sampled from the entity set (the order of these Hill”and“BudSpencer”),romanticcouples(“RyanGosling” two options was randomized). Users were asked the follow- and“Eva Mendes”), or a combination of both (“Brad Pitt” ing question: “Which of the given pairs has the stronger and“AngelinaJolie”-thestrongestconnectionobtained). In relation?”,andwereencouragedtouseanysourceincluding addition, we observe that our methods are also capable of searchenginestoanswerthisquestion(asanaidweprovided discovering connections from other domains (“Sergey Brin” links to a Google search page). The assessors consisted of and“Larry Page”). 15 undergraduate students of Computer Science and other disciplines, and 5 graduatestudentsof Computer Science. 4.4 QualityofExtracted Networks Table 4 shows that all of our methods achieved accuracy valuesbetween95%and99%,demonstratingthecorrectex- Weconductedalargeuserstudytoevaluatethequalityof traction of almost all tested edges. Overall, for 4694 out of theconstructednetworkswithoverall5,600evaluatededges. 4800 edge pairs the users preferred the connections discov- The goal of our evaluation was two-fold: First to check the eredbyouralgorithms. Wemeasuredinter-rateragreement correctness of the extracted social connections; second to with 5users for a subset of 200 edgesfrom each of thecon- evaluate strength of theconnections in more detail. structednetworksandobtainedanaveragepairwisepercent Pairwiseassessment. Determiningthecorrectnessofacon- agreement of 93.7% and 96.5% for WikiNet and IMDbNet, nectionbetweentwopersonsisadifficulttaskforhumanas- respectively,alsoreflected byhigh Fleiss’ Kappa[14]values sessors. Therefore,insteadoflettingassessorsdirectlyassign of 0.83 for WikiNet and 0.97 for IMDbNet. arelevancevalue,wefirstaskedthemtochoosebetweentwo 105 105 105 104 104 104 y y y nc 103 nc 103 nc 103 e e e u u u eq 102 eq 102 eq 102 Fr Fr Fr 10 10 10 1 1 1 1 10 100 1 10 100 1 10 100 Node degree Node degree Node degree (a) WikiNet BF (b) WikiNet α=0.01 (c) WikiNet α=0.0 105 105 105 104 104 104 y y y nc 103 nc 103 nc 103 e e e u u u eq 102 eq 102 eq 102 Fr Fr Fr 10 10 10 1 1 1 1 10 100 1 10 100 1 10 100 Node degree Node degree Node degree (d) IMDbNetBF (e) IMDbNet α=0.01 (f) IMDbNetα=0.0 Figure 3: Nodedegree distribution for different algorithms and theWikiNet/IMDbNetseed sets. (a) WikiNet BF (b)WikiNet α=0.0 (c) WikiNet α=0.01 (d) IMDbNetBF (e) IMDbNetα=0.0 (f) IMDbNetα=0.01 Figure 4: Resulting WikiNet and IMDbNetgraphs for different strategies and decay parameters. 106 Pattern Edges Pairs Domains 105 and 4,230 94 91 y 104 , 1,656 93 87 c & 828 90 95 n ue 103 806 91 89 q Fre 102 und 1,124 79 71 ; 54 33 25 10 : 36 27 24 - 36 23 20 1 1 10 100 1000 or 33 19 14 Edge weight / 28 22 11 Figure 5: Edge weight distribution for the BF algorithm with 20 15 15 on 23 15 14 applied on WikiNet seed set y 16 14 13 and hiswife 32 11 10 PERSON 1 PERSON 2 et 16 11 9 Brad Pitt – Angelina Jolie | 18 13 7 Trey Parker – Matt Stone mit 14 12 9 e 14 10 9 SpencerPratt – HeidiMontag vs. 13 10 9 StephenMoyer – AnnaPaquin and wife 25 7 7 Bud Abbott – Lou Costello + 9 8 7 Robert Pattinson – Kristen Stewart left and 13 7 6 Bruce Robison – Kelly Willis and actor 12 7 7 Ryan Gosling – EvaMendes hat 8 7 8 Dev Patel – Freida Pinto Pictures Photo of 31 28 1 Terence Hill – Bud Spencer Table 5: Top-25 patternsafter 1st iteration of patternmin- Sergey Brin – Larry Page ing algorithm PatternIter applied toIMDbNet Nick Cannon – Mariah Carey Tod Williams – Billie Tsien other in person or not”, 4 -“agree - Persons most probably Josh Dallas – Ginnifer Goodwin knoweachotherinperson”,5-“stronglyagree-Personsare Eric Dane – RebeccaGayheart strongly related or know each other well (in person)”. We assessed 600 edges (100 edges per dataset for each of our Table 3: Top-15 social relations found for IMDbNet using methods) sampled from the generated social networks and theBF extraction method. mixed in random order with 200 uniformly sampled pairs of unconnected nodes. Pairs of nodes generated using our Algorithm Votes Accuracy approaches obtained an average rating of 4.55 (stdv 0.87), incontrasttoanaverageof1.81(stdv0.99)forunconnected BF 800 0.969 ± 0.008 nodes. We also found that just 10.2% of the social connec- WikiNet decay a=0.0 800 0.990 ± 0.007 tionsfound in ournetworksobtained arating of3 orlower. decay a=0.01 800 0.983 ± 0.009 Theseresultsfurtherconfirmthehighaccuracyofourgraph BF 800 0.958 ± 0.014 construction methods. IMDbNet decay a=0.0 800 0.983 ± 0.009 4.5 Pattern ExpansionandAnalysis decay a=0.01 800 0.986 ± 0.008 Overall 4800 0.978 ± 0.004 In this subsection we show results for the cost efficiency and accuracy of the iterative pattern mining algorithm de- Table 4: User evaluation results: accuracy of different scribed in Section 3.2, and analyze the phrase patterns ob- methods tained for social connections. Cost Efficiency and Accuracy. We conducted two iter- Wealsotestedhowmanyoftheedgesinourgraphscorre- ations of the pattern mining algorithm leading to 100,319 spondedtoco-workersinmoviesaslisted inIMDb. Alarge requestsforIMDbNetand44,089forWikiNetandobtained percentage of edges found by our algorithms (e.g. 33.8% for method BF, and 32.7% for method Prio with α = 0) over 2,100 and over 5,000 distinct patterns (occurring with frequencyofatleast10)forWikiNetandIMDbNet,respec- wereco-actinginthesamemovies. However,ouralgorithms tively. The extracted patterns provide additional informa- are capable to additionally reveal many relationships be- tion about type and topical focus of the extracted social tween actors not appearing in the same movie as captured connections. OveralltheresultingWikiNetcontained20,622 by IMDb (for instance,“Tom Cruise”and“Katie Holmes”). nodes and 40,342 edges, and the resulting IMDbNet 35,649 Assessmentofindividualedges. In addition to the pair- nodes and 129,326 edges. wisechecksdescribedabovewealsodirectlyassessedstrength Ahumanassessmentofedgesextractedbythealgorithm, and correctness of individual connections using a 5 point which was conducted in the same way as described in Sec- Likert scale. Levels 1 to 5 corresponded to the following tion 4.4, resulted in accuracy values of more than 0.96 and answers to the question“Are these people connected?”: 1 0.97 for the graph expansions of the WikiNet and IMDb- -“strongly disagree - Persons are not connected”, 2 -“dis- Net seed sets, respectively - with 95% confidence intervals agree-Personsdonotlikelyknoweachotherinperson”,3- of0.01. Thisdemonstratestheviabilityofourapproachfor “not known - it is not clear whether the persons know each constructinghigh-quality networks. Pattern terms in WikiNet (politicianseed) Pattern terms in IMDbNet (actor seed) meet minist presid coach prime star direct photo produc actor forward former senat rep leader director pictur wife written actress center beat back guard defens galleri husband latest daughter pic king gener sai manag mayor plai date welcom film cast right defend sen defeat chief girlfriend join alongsid marri boyfriend midfield lineback score receiv captain featur video screenwrit attend movi left governor secretari foreign tackl imag danc perform show sister replac khan player against face camilla frontman screenplai dure skarsgard assist deputi premier state while mother born parker writer comedi head u. professor republican talk fan also stellan best kiss Table 6: Top-50 stemmed terms extracted from patterns according to their Mutual Information values with seed sets of politicians in Wikipediaand actors in IMDb. Extracted Phrase Patterns. Table 5 shows the 25 best peripheral or assisting positions (“guard”, “assist”, “secre- patterns extracted by searching for the 100 most highly tari”), as well as concepts related to competition (“defend”, weightededgesafterthefirstiterationofactorseededgraph “score”, “beat”). In contrast, terms from patterns in the construction,andcomputingpatternratingsasdescribedin movie domain mostly refer to actors (“star”, “actor”, “ac- Section 3.2. Apart from obvious - yet useful - relationship tress”), film shooting (“cast”,“screenwrit”,“plai”), and fam- patternslike“with”,“or”, and“on”, many of the mined pat- ily relations (“husband”, “sister”, “boyfriend”). This result terns consist of special characters such as“,”,“&”, and“ ” indicatesthattermsfoundinpatternsareoftenstronglycor- which often serve to connect persons linked to each other. relatedwithsocialcommunities,andillustratesthepotential AlthoughEnglish patternsareprevalent,ourmethodfound merit of using machine learning techniquesfor domain spe- alsousefulpatternsinotherlanguagesallowingformoregen- cific selection of pattern candidates. This opens promising eral, multilingualretrievalofsocialconnections,andwiden- directions for focused, topic-oriented network expansion. ingthescopeoftheminednetworks. Onesuchpatternwas, forinstance,theconjunctionterm“and”inSpanish,French, and German (“y”, “et”, and “und”). Just a small number 5. CONCLUSIONSAND FUTUREWORK of patternsdonot describe relations between persons. This We have introduced efficient methods for extracting so- holdsforinstancethephrase“Pictures Photos of”whichoc- cial networks from unstructuredWeb datausing connectiv- curred in a large spam site containing many person names. ity search queries. Our graph expansion algorithms lever- However,thispatternobtainedacomparablylowratingbe- age pattern based query mining enhanced by a bootstrap- causeitappearedinjustasingledomain. Finally,weobserve pingapproachforfindingnewqueryphrasepatternsandby a few more specific patterns such as “and US president”, methods for intelligently prioritizing nodes. Our evaluation “and actors”, “and his wife”that can help to reveal inter- shows the applicability of our methods for extracting large esting information about the type of detected relationships scalesocialnetworks,anddemonstratesthehighaccuracyof and contexts. ourapproach. Wealsofoundthatpopularityandnoveltyof nodesareimportantcriteriaforcontrollingthenetworkcon- Term Analysis of Patterns. Patterns can provide useful struction process: Popularity of nodes can be leveraged for cluesondomainsandtopicalfociofextracted(sub)networks. efficiently extracting networks with more connections and Althoughgeneralpatternslike“and”and“with”occurred,for higher density, while exploiting novelty is useful for faster instance, very frequently both in thecontext of movies and exploration of more entities. An in-depth analysis of the politics we also found various domain specific differences: characteristics of the extracted networks sheds additional Patterns such as “meets”, “in conversation with”, and“is light on information obtained though connectivity queries joined by”were more dominant in the politician seed based and on properties of ouralgorithms. networks, while patterns like“and director”,“and actress”, Inourfutureworkweaimtoanalyzepatterntorevealsen- “is starring alongside” were more prominent in the movie timent and other information about the social connections domain. found: Are the linked individuals friends or foes? Do they Inordertoobtainthemostdiscriminativetermsfromthe have a personal or private connection? Are the relation- patterns, we computed the Mutual Information (MI) mea- ships bidirectional or rather unidirectional? Furthermore, sure[24]frominformationtheory,whichmeasureshowmuch weplantoimprovetheefficiencyofouralgorithmsbypriori- the joint distribution of features (terms from patterns in tizingsearchpatternsbasedonthecontextofanentityusing our case) and categories deviates from a hypothetical dis- observed structures in graphs as training data for machine tribution in which features and categories (“politics” and learning approaches. Finally, weaim tostudystrategies for “movies”) are independent of each other. Table 6 shows extracting social graphs in a way that allows for simultane- the top-50 representative stemmed words automatically ex- ously detecting communities in an efficient way. We think tractedfrom ourquerypatternsforthepolitician andactor that this work has direct applications to social graph ex- seedednetworksdescribedinSection4.1. Manyoftheterms ploration and people search and provides a foundation for inthepatternsfallingintothepoliticscategorydescribeper- interesting analyses and findingsabout social networks and sonal roles like leadership (“minist”,“presid”,“senat”) and communities. 6. ACKNOWLEDGMENTS [16] A.Jain and P. Pantel. Identifyingcomparable entities This work is partly funded by the European Research on theweb. CIKM ’09, pages 1661–1664. ACM, 2009. Council under ALEXANDRIA (ERC 339233) and by the [17] Z.Jiang, L. Ji, J. Zhang, J. Yan,P. Guo, and N.Liu. EuropeanCommissionFP7underQualiMaster(grantagree- Learning open-domaincomparable entitygraphs from ment No. 619525). usersearch queries. CIKM, 2013. [18] X.Jin, S.Spangler,R.Ma, andJ.Han.Topicinitiator detection on theworld wide web. WWW ’10, pages 7. REFERENCES 481–490. ACM, 2010. [19] C. Karbeyaz, E. Can, F. Can, and M. Kalpakli. A [1] A.Anagnostopoulos, R.Kumar, and M. Mahdian. content-basedsocial network study of evliya celebi’s Influenceand correlation in social networks. KDD ’08, seyahatname-bitlissection. InComputer and pages 7–15. ACM, 2008. Information Sciences II.Springer, 2012. [2] N.Bach and S. Badaskar. A Review of Relation [20] H.A.Kautz, B. Selman, and M. A.Shah.The hidden Extraction. 2007. web. AI Magazine, 18(2):27–36, 1997. [3] C. Bird, A. Gourley,P. Devanbu,M. Gertz, and [21] J. M. Kleinberg. Hubs,authorities, and communities. A.Swaminathan. Mining email social networks.In ACM Comput. Surv., 31(4es), dec1999. Proceedings of the 2006 International Workshop on [22] N.Konstantinova.Review of relation extraction Mining Software Repositories, MSR ’06, pages methods: What is new out there? Springer, 2014. 137–143. ACM, 2006. [23] A.N.Langville and C. D. Meyer. Google’s PageRank [4] M. R.Bouadjenek, H. Hacid,and M. Bouzeghoub. and Beyond: The Science of Search Engine Rankings. Sopra: A new social personalized rankingfunction for Princeton University Press, 2006. improvingweb search. SIGIR’13, pages 861–864. [24] C. Manning, P. Raghavan,H.Schutze,and ACM, 2013. E. Corporation. Introduction to information retrieval. [5] M. J. Cafarella, J. Madhavan,and A. Halevy. Cambridge UniversityPress, 2008. Web-scaleextraction of structured data. SIGMOD [25] Y.Matsuo, J. Mori, M. Hamasaki, K. Ishida, Rec., 37(4):55–61, mar 2009. T. Nishimura, H.Takeda, K.Hasida, and M. Ishizuka. [6] X.Canaleta, P. Ros, A.Vallejo, D. Vernet,and Polyphonet: An advanced social network extraction A.Zaballos. Asystemtoextractsocial networksbased system from theweb. WWW,2006. on the processing of information obtained from [26] Y.Matsuo, H.Tomobe, and T. Nishimura.Robust internet.In Proceedings of the 11th International estimation of google countsfor social network Conference of the Catalan Association for Artificial extraction.In AAAI,2007. Intelligence, pages 283–292. IOS Press, 2008. [27] P.Mika. Flink: Semanticweb technology for the [7] P. Cimiano, S.Handschuh,and S.Staab. Towards the extraction and analysis of social networks. Web self-annotating web. WWW ’04, pages 462–471. ACM, Semant., 3(2-3):211–223, oct 2005. 2004. [28] A.Mohaisen, A.Yun,and Y. Kim. Measuring the [8] P. Cimiano, G. Ladwig, and S.Staab. Gimme’ the mixingtime of social graphs. InSIGCOMM,2010. context: Context-drivenautomatic semantic annotation with c-pankow.WWW ’05, pages 332–341. [29] M. K.M. Nasution and S.A. Noah.Superficial ACM, 2005. method for extracting social network for academics usingweb snippets.In RSKT, 2010. [9] D.K. Elson, N.Dames, and K. R.McKeown. Extractingsocial networksfrom literary fiction.In [30] R.Nuray-Turan,Z. Chen,D. V.Kalashnikov, and ACL’10. S.Mehrotra. Exploitingweb queryingfor web people search in weps2. InWeb People Search Evaluation [10] L. W.et al. Building thesocial graph of the history of Workshop (WePS), 18th WWW Conference, 2009. european integration - A pipelinefor humanist-machineinteraction in the digital [31] L. Page, S. Brin, R.Motwani, and T. Winograd. The humanities. InHISTOINFORMATICS,2013. pagerank citation ranking: Bringing orderto theweb. Technical report, 1999. [11] O.Etzioni, M. Cafarella, D.Downey,S. Kok,A.-M. Popescu, T. Shaked,S.Soderland, D.S. Weld,and [32] P.Pantel and M. Pennacchiotti. Espresso: Leveraging A.Yates. Web-scaleinformation extraction in generic patternsfor automatically harvestingsemantic knowitall: (preliminary results). InWWW’09. relations. InACL 2006, Sydney, Australia, 2006. [12] A.Goyal, F.Bonchi, and L. V.Lakshmanan. Learning [33] M. Pasca. Acquisition of categorized named entities influenceprobabilities in social networks.WSDM’10, for web search. In CIKM’04. pages 241–250. ACM, 2010. [34] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang,and Z.Su. [13] I.Guy,N. Zwerdling, I. Ronen,D. Carmel, and Arnetminer: Extraction and miningof academic social E. Uziel. Social media recommendation based on networks.KDD ’08, pages 990–998. ACM, 2008. people and tags. SIGIR’10, pages 194–201. ACM, [35] A.Yates, M. Cafarella, M. Banko,O. Etzioni, 2010. M. Broadhead, and S.Soderland. Textrunner: Open [14] K.Gwet. Handbook of Inter-rater Reliability: The information extraction on theweb. In ACL, 2007. DefinitiveGuide to Measuringthe Extent of Agreement [36] S.Yeand S. Wu.Measuring message propagation and Among Raters. AdvancedAnalytics,LLC, 2010. social influenceon twitter.com. In LNCS.2010. [15] J. He, Y.Liu, Q.Tu, C. Yao, and N.Di. Efficient [37] P.S. Yu,X. Li, and B. Liu. On thetemporal entityrelation discovery on web. JCIS, 2007. dimension of search. WWW,2004.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.