Preprint of: Varish Mulwad, Tim Finin and Anupam Joshi, A Domain Independent Framework for Extracting Linked Semantic Data from Tables, in Search Computing - Broadening Web Search, Stefano Ceri and Marco Brambilla (eds.), LNCS volume 7538, Springer, 2012.

A Domain Independent Framework for Extracting Linked Semantic Data from Tables

Varish Mulwad, Tim Finin and Anupam Joshi
Computer Science and Electrical Engineering
University of Maryland, Baltimore County
Baltimore, MD 21250 USA
{varish1,finin,joshi}@cs.umbc.edu

Abstract. Vast amounts of information are encoded in tables found in documents, on the Web, and in spreadsheets or databases. Integrating or searching over this information benefits from understanding its intended meaning and making it explicit in a semantic representation language like RDF. Most current approaches to generating Semantic Web representations from tables require human input to create schemas and often result in graphs that do not follow best practices for linked data. Evidence for a table's meaning can be found in its column headers, cell values, implicit relations between columns, caption and surrounding text, but interpreting it also requires general and domain-specific background knowledge. Approaches that work well for one domain may not necessarily work well for others. We describe a domain independent framework for interpreting the intended meaning of tables and representing it as Linked Data. At the core of the framework are techniques grounded in graphical models and probabilistic reasoning to infer the meaning associated with a table. Using background knowledge from resources in the Linked Open Data cloud, we jointly infer the semantics of column headers, table cell values (e.g., strings and numbers) and relations between columns, and represent the inferred meaning as a graph of RDF triples. A table's meaning is thus captured by mapping columns to classes in an appropriate ontology, linking cell values to literal constants, implied measurements, or entities in the linked data cloud (existing or new), and discovering and identifying relations between columns.

Keywords: linked data, RDF, Semantic Web, tables, entity linking, machine learning, graphical models

1 Introduction

The Web has become a primary source of knowledge and information, largely replacing encyclopedias and reference books. Most Web text is written in a narrative form as news stories, blogs, reports, letters, etc., but significant amounts of information are also encoded in structured forms as stand-alone spreadsheets or tables and as tables embedded in Web pages and documents. Cafarella et al.
[5] estimated that the Web contains over 150 million high quality relational HTML tables.

Tables are also used to present and summarize key data and results in documents in many subject areas, including science, medicine, healthcare, finance, and public policy. As a part of a coordinated open data and transparency initiative, nearly 30 nations are publishing government data on sites in structured formats. The US data.gov site shares more than 390,000 datasets drawn from many federal agencies and is complemented by similar sites from state and local government organizations. Tables are used to represent significant amounts of information and knowledge, yet we are not able to fully exploit them. Both integrating and searching over this information will benefit from a better understanding of the intended meaning of the data and its mapping to other reference datasets. The goal of our research is to unlock the knowledge encoded in tables.
In this paper, we present a domain independent framework for automatically inferring the intended meaning and semantics associated with tables. Using Linked Open Data [2] (or a provided ontology and knowledge base (KB)) as background knowledge, our techniques, grounded in graphical models and probabilistic reasoning, map every column header to a class from an ontology, link table cell values to entities from the KB and discover relations between table columns. The inferred information is represented as a graph of RDF triples, allowing other applications to utilize the recovered knowledge.

2 Impact

Many real world problems and applications can benefit from exploiting information stored in tables, including evidence based medical research [22]. Its goal is to judge the efficacy of drug dosages and treatments by performing meta-analyses (i.e., systematic reviews) over published literature and clinical trials. The process involves finding appropriate studies, extracting useful data from them and performing statistical analysis over the data to produce an evidence report.

Key information required to produce evidence reports includes data such as patient demographics, drug dosage information, different types of drugs used, brands of the drugs used, the number of patients cured with a particular dosage, etc. Most of this information is encoded in tables, which are currently beyond the scope of regular text processing systems and search engines. This makes the process manual and cumbersome for medical researchers.

Fig. 1. The number of papers reporting on systematic reviews and meta-analyses is small compared to those reporting on individual clinical trials, as shown in this data from MEDLINE.

Presently medical researchers perform keyword based search on systems such as PubMed's MEDLINE, which end up producing many irrelevant studies, requiring researchers to manually evaluate all of the studies to select the relevant ones. Figure 1, obtained from [6], clearly shows the huge difference between the number of meta-analyses and the number of clinical trials published every year. By adding semantics to tables like the one in Figure 2, we can develop systems that can easily correlate, integrate and search over different tables, allowing data from different studies to be combined for a single meta-analysis.

Web search is another area that can benefit from understanding information stored in tables. Search engines work well at searching over text in web pages, but poorly when searching over tables. If recovered semantics are available, search engines can answer queries like dog breeds life span, wheat production in Africa or temperature change in the Arctic, with tables or web pages containing them as results. We also see our work helping to generate high quality semantic linked data, which in turn will aid the growth of the Semantic Web.

3 Inferring the Semantics of Tables

Analyzing tables presents unique challenges. One might be tempted to think that regular text processing might work with tables as well; after all, tables also store text. However, that is not the case. To differentiate between text processing and table processing, consider the text "Barack Hussein Obama II (born August 4, 1961) is the 44th and current President of the United States. He is the first African American to hold the office." The overall meaning can be understood from the meaning of the words in the sentence. The meaning of each word can be recovered from the word itself or by using the context of the surrounding words. Now consider the table shown in Figure 2.
In some ways, this information is easier to understand because of its structure, but in others it is more difficult because it lacks the normal organization and context of narrative text. The message conveyed by the table in Figure 2 is the different eradication rates for different drug dosages and treatment regimes for the disease H. pylori. Similarly, consider the table shown in Figure 3. The table represents information about cities in the United States of America. A closer look at the table tells us that the cities in column one are the largest cities of the respective states in column two.

Fig. 2. Tables in the clinical trials literature have characteristics that differ from typical, generic Web tables. They often have row headers as well as column headers, most of the cell values are numeric, cell values are often structured and captions can contain detailed metadata. (From [32])

City          State  Mayor                  Population
Baltimore     MD     S. C. Rawlings-Blake   640,000
Philadelphia  PA     M. Nutter              1,500,000
New York      NY     M. Bloomberg           8,400,000
Boston        MA     T. Menino              610,000

Fig. 3. A simple table representing information about cities in the United States of America.

To extract such information from tables, it is important to interpret the meaning of column (and row) headers, the correlations between columns, and the entities and literals mentioned in tables. Additional context and information can also be obtained from the caption of the table as well as the text surrounding the table. The intended meaning of column headers can be extracted by analyzing the values in the columns. For example, the strings in column one in Figure 3 can be recognized as entity mentions that are instances of the dbpedia-owl:Place class. Additional analysis can automatically generate a narrower description such as major U.S. cities. The strings in the third column match the names of people and also the narrower class of politicians. The column header provides additional evidence for the better interpretation that the strings in column three are the mayors of the cities in column one. Linking the table cell values to known entities enriches the table further. Linking S. C. Rawlings-Blake to dbpedia:Stephanie_C._Rawlings-Blake, T. Menino to dbpedia:Thomas_Menino and M. Nutter to dbpedia:Michael_Nutter, we can automatically infer the additional information that all three belong to the Democratic party. Discovering correlations between table columns also adds key information. For example, in this case, the correlation between columns one and two helps us infer that the cities in column one are the largestCities of the respective states in column two.

The techniques above work well when the table cell values are strings, but not necessarily when the cell values are literals, e.g., numerical values such as the ones from the table in Figure 2 or the values in column four of the table in Figure 3. We discuss the challenges posed by such literals and how to tackle them later in the paper.

Producing an overall interpretation of a table is a complex task that requires developing an understanding of the intended meaning of the table as well as attention to the details of choosing the right URIs to represent both the schema and the instances. We break the process down into the following major tasks: (i) assign every column (or row) header a class label from an appropriate ontology; (ii) link table cell values to appropriate LOD entities, if possible; (iii) discover relationships between the table columns and link them to linked data properties; and (iv) generate a linked data representation of the inferred data.
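To make the target output of these tasks concrete, here is a minimal sketch (in Python, using the rdflib library) of the kind of RDF graph that might be produced for the first row of the table in Figure 3. The specific DBpedia classes and properties used (dbo:City, dbo:mayor, dbo:largestCity, dbo:populationTotal) are illustrative choices for this example, not necessarily the ones our system would select.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import XSD

# Illustrative namespaces for DBpedia resources and the DBpedia ontology.
DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
g.bind("dbr", DBR)
g.bind("dbo", DBO)

baltimore = DBR["Baltimore"]

# Task (i): column one is assigned the class dbpedia-owl:City.
g.add((baltimore, RDF.type, DBO["City"]))
# Task (ii): the cell string "Baltimore" is linked to the entity dbpedia:Baltimore.
g.add((baltimore, RDFS.label, Literal("Baltimore", lang="en")))
# Task (iii): relations discovered between columns become properties.
g.add((baltimore, DBO["mayor"], DBR["Stephanie_C._Rawlings-Blake"]))
g.add((DBR["Maryland"], DBO["largestCity"], baltimore))
# Literal-valued cells (e.g., population) are kept as typed literals.
g.add((baltimore, DBO["populationTotal"], Literal(640000, datatype=XSD.integer)))

# Task (iv): serialize the inferred interpretation as Linked Data (Turtle).
print(g.serialize(format="turtle"))
```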
Fig. 4. We are developing a robust domain independent framework for table interpretation that will result in a representation of the extracted information as RDF Linked Open Data.

4 DIF-LDT: A Domain Independent Framework

We present DIF-LDT, our domain independent framework for inferring the semantics associated with tables, in Figure 4. With little or no domain dependence, the framework should work equally well with tables found on web pages, in medical literature or in tabular datasets from sites like data.gov. The goal of this framework is also to address a number of practical challenges, including handling large tables containing many rows, tables with acronyms and encoded values, and literal data in the form of numbers and measurements.

At the core of the framework are two modules: a) a module that queries and generates an initial set of mappings for column headers, cell values and relations between columns in a table, and b) a module grounded in a probabilistic graphical model, which performs joint inference. Once the table passes through initial pre-processing, the query phase generates a set of candidate classes, entities and relations between columns for every column header and cell value in a table. The module for joint inference then jointly assigns values to column headers, cell values and relations between columns in a table. The table interpretation will be useful only when we are able to generate an appropriate representation of it which can be reasoned and queried over by other systems. Thus, the next step is generating an appropriate representation of the inferred information. Certain applications may require that the user review and, if necessary, change the generated interpretation. To incorporate this requirement, an additional module provides an interactive framework to allow a human to work with the system to produce the interpretation. In the following sections we describe each module in detail.

4.1 Pre-processing

The goal of the pre-processing modules at the start of the process is to deal with special cases. For example, certain tables or datasets can be too large to be handled by the module for joint inference. In such cases, it is better to sample the table, generate a smaller version and let that pass through the rest of the workflow. While applying joint inference / joint assignment techniques to large tables is not feasible, we believe that it is also not necessary. We note that people can usually understand a large table's meaning by looking only at its initial portion. Our approach will be similar: given a large table, we will sample the rows to select a smaller number to which we will apply the graphical model. The other pre-processing module we present is for acronyms. Many tables tend to use acronyms. Replacing them with their expanded forms provides more accurate context and thus helps in generating a better interpretation. While we present only two such modules, given their independent nature, more modules can easily be added without breaking the rest of the workflow.
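The following is a minimal sketch of what these two pre-processing modules might look like, assuming the table is represented as a simple list of rows; the row-count threshold, the sampling strategy and the acronym dictionary are assumptions made for illustration, not parameters prescribed by the framework.

```python
import random

def sample_rows(rows, max_rows=100, seed=0):
    """Keep small tables intact; otherwise sample rows so the joint inference
    module only has to process a manageable portion of the table.
    (Taking the initial rows instead of a random sample would also fit the
    observation that people understand a table from its initial portion.)"""
    if len(rows) <= max_rows:
        return rows
    rng = random.Random(seed)
    return rng.sample(rows, max_rows)

# A hypothetical acronym dictionary; in practice this could be built from
# domain resources (e.g., state abbreviations or medical acronym lists).
ACRONYMS = {"MD": "Maryland", "PA": "Pennsylvania", "NY": "New York", "MA": "Massachusetts"}

def expand_acronyms(rows):
    """Replace cell values that are known acronyms with their expansions,
    giving the later modules more accurate context."""
    return [[ACRONYMS.get(cell, cell) for cell in row] for row in rows]

# Example: pre-process the table from Figure 3.
table = [
    ["Baltimore", "MD", "S. C. Rawlings-Blake", "640,000"],
    ["Philadelphia", "PA", "M. Nutter", "1,500,000"],
    ["New York", "NY", "M. Bloomberg", "8,400,000"],
    ["Boston", "MA", "T. Menino", "610,000"],
]
processed = expand_acronyms(sample_rows(table))
```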
4.2 Generate and Rank Candidates

The goal of the querying phase is to access knowledge sources and generate an initial set of mappings of classes, entities and relations for each mention in the table. The knowledge sources used in the query process include datasets such as DBpedia [3] and Yago [25] from the LOD cloud. For other specialized domains, such as the medical domain or open government data, additional ontologies and knowledge resources may be needed. For general tables, like the ones found on the web, DBpedia, Yago and Wikitology [26] provide very good coverage.

Presently, we use Wikitology, a hybrid KB based on Wikipedia's structured and unstructured information augmented with information from structured sources like DBpedia, Freebase [4], WordNet [17] and Yago, to generate our initial mappings. The query module generates a set of candidate entities for each cell value in a table by querying Wikitology, using the query techniques described in [19]. Each returned entity has a set of associated classes (or types). For example, a subset of the classes associated with the entity dbpedia:Baltimore is yago:IndependentCitiesInTheUnitedStates, dbpedia-owl:PopulatedPlace, dbpedia-owl:City and yago:CitiesInMaryland. The set of candidate classes for a given column in a table can be obtained by taking the union of the sets of classes associated with the candidate entities in that column. Our current focus is restricted to column headers and entities in a table.

Once the candidate sets are generated, the next step is to rank the candidates. We developed two functions, ψ_1 and ψ_2, for this purpose. ψ_1 ranks the candidate classes in a given set, whereas ψ_2 ranks the candidate entities. ψ_1 computes the 'affinity' between a column header string (e.g., City) and a class from the candidate set (say dbpedia-owl:City). We define ψ_1 as the exponential of the product of a weight vector and a feature vector computed for the column header. ψ_1 assigns a score to each candidate class which can be used to rank the candidate classes. Thus,

ψ_1 = exp(w_1^T · f_1(C_i, L_{C_i}))

where w_1 is the weight vector, L_{C_i} is the candidate class label and C_i is the string in column header i. The feature vector f_1 is composed of the following features:

f_1 = [LevenshteinDistance(C_i, L_{C_i}), DiceScore(C_i, L_{C_i}), SemanticSimilarity(C_i, L_{C_i}), InformationContent(L_{C_i})]

f_1 includes a set of string similarity metrics (Levenshtein distance [15], Dice score [24]) to capture the string similarity between the class and the column header string. To overcome cases where there is no string or content match (e.g., dbpedia-owl:AdministrativeRegion and State), we also include a metric to capture the Semantic Similarity [10] between the candidate class and the column header string.

Selecting 'specific' classes is more useful than selecting 'general' classes. For example, it is better to infer that a column header is of type dbpedia-owl:City than to infer that it is dbpedia-owl:Place or owl:Thing. Thus, to promote classes like dbpedia-owl:City, f_1 incorporates an Information Content measure. Based on the semantic similarity defined in [21], we define Information Content as I.C.(L_C) = −log_2[p(L_C)], where p(L_C) is the probability of the class L_C. We computed I.C. for classes from the DBpedia ontology and observed that specific classes have higher I.C. values than more general classes.
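As a concrete, simplified illustration of ψ_1, the sketch below scores candidate classes for a column header using Levenshtein distance, a Dice score over character bigrams and a precomputed information content value; the semantic similarity feature is omitted because it needs an external resource, and the weights and I.C. values shown are illustrative assumptions only.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Standard edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def dice_score(a: str, b: str) -> float:
    """Dice coefficient over character bigrams."""
    bigrams = lambda s: {s[k:k + 2] for k in range(len(s) - 1)}
    x, y = bigrams(a.lower()), bigrams(b.lower())
    return 2 * len(x & y) / (len(x) + len(y)) if x and y else 0.0

def psi1(header: str, class_label: str, info_content: float, w) -> float:
    """psi_1 = exp(w^T . f_1): affinity between a column header string
    and a candidate class label."""
    f1 = [
        -levenshtein(header.lower(), class_label.lower()),  # smaller distance is better
        dice_score(header, class_label),
        info_content,  # I.C.(L_C) = -log2 p(L_C), precomputed from the ontology
    ]
    return math.exp(sum(wi * fi for wi, fi in zip(w, f1)))

# Example: rank two candidate classes for the header "City" with assumed
# weights and assumed information content values.
weights = [0.1, 1.0, 0.5]
for label, ic in [("City", 6.0), ("Place", 2.0)]:
    print(label, psi1("City", label, ic, weights))
```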
We also develop a function ψ_2 to rank and compute the affinity between the string in a table row cell (say Baltimore) and a candidate entity (say dbpedia:Baltimore). We define ψ_2 as the exponential of the product of a weight vector and a feature vector computed for a cell value. Once again, ψ_2 assigns a score to each entity which can be used to rank the entities. Thus,

ψ_2 = exp(w_2^T · f_2(R_{i,j}, E_{i,j}))

where w_2 is the weight vector, E_{i,j} is the candidate entity and R_{i,j} is the string value in column i and row j. The feature vector f_2 is composed as follows:

f_2 = [LevenshteinDistance(R_{i,j}, E_{i,j}), DiceScore(R_{i,j}, E_{i,j}), PageRank(E_{i,j}), KBScore(E_{i,j}), PageLength(E_{i,j})]

f_2 consists of a set of string similarity metrics (Levenshtein distance, Dice score) and a set of popularity metrics (predicted PageRank [27], page length and the Wikitology KB score for the entity). When it is difficult to disambiguate between entities, the more popular entity is likely to be the correct answer; hence the inclusion of the popularity metrics. The weight vectors w_1 and w_2 can be tweaked via experiments or learned using standard machine learning procedures. As we continue to make progress in our work, we will develop a similar function for ranking candidate relations.
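A corresponding simplified sketch for ψ_2 is shown below; difflib's SequenceMatcher stands in for the Levenshtein and Dice features, the popularity features are treated as precomputed inputs, and the candidate labels, numeric values and weights are all made up for illustration.

```python
import math
from difflib import SequenceMatcher

def psi2(cell_string: str, entity_label: str, popularity: dict, w: dict) -> float:
    """psi_2 = exp(w^T . f_2): affinity between a table cell string and a
    candidate entity from the knowledge base."""
    f2 = {
        # SequenceMatcher stands in for the Levenshtein/Dice string features.
        "string_sim": SequenceMatcher(None, cell_string.lower(), entity_label.lower()).ratio(),
        # Popularity features are assumed to be precomputed for each entity.
        "page_rank": popularity["page_rank"],
        "kb_score": popularity["kb_score"],
        "page_length": popularity["page_length"],
    }
    return math.exp(sum(w[k] * f2[k] for k in f2))

# Example: two candidate entities for the cell string "Baltimore",
# with made-up popularity values and weights.
weights = {"string_sim": 2.0, "page_rank": 1.0, "kb_score": 0.5, "page_length": 0.1}
candidates = {
    "Baltimore": {"page_rank": 0.9, "kb_score": 0.8, "page_length": 0.7},
    "Baltimore_Ravens": {"page_rank": 0.4, "kb_score": 0.3, "page_length": 0.5},
}
for label, pop in candidates.items():
    print(label, psi2("Baltimore", label, pop, weights))
```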
4.3 Joint Inference

Given candidate sets for the column headers, cell values and relations between table columns, the joint inference module is responsible for jointly assigning values to the mentions in the table and inferring the meaning of the table as a whole. Probabilistic graphical models [13] provide a powerful and convenient framework for expressing a joint probability over a set of variables and performing inference or joint assignment of values to the variables. Probabilistic graphical models use graph based representations to encode a probability distribution over the set of variables in a given system. The nodes in such a graph represent the variables of the system and the edges represent the probabilistic interactions between the variables. Based on the graphical representation used to model the system, the graph needs to be parameterized and then an appropriate inference algorithm needs to be selected to perform inferencing over the graph. Thus constructing a graphical model involves the following steps: (i) identifying the variables in the system; (ii) specifying the interactions between the variables and representing them as a graph; (iii) parameterizing the graphical structure; and (iv) selecting an appropriate algorithm for inferencing. Following this plan, we describe how a graphical model for inferring the semantics of tables is constructed.

Variables in the System. The column headers, cell values (strings and literals) and relations between columns in a table represent the set of variables in an interpretation framework. Each variable has an associated set of candidates, generated as described in Section 4.2. The initial assignment to each variable will be its top ranked candidate.

Graphical Representation. There are three major representation techniques for encoding the distribution over a set of variables: directed models (e.g., Bayesian networks), undirected models (e.g., Markov networks), and partially directed models. In the context of graphical models, Markov networks are undirected graphs in which nodes represent the set of variables in a system and the undirected edges represent the probabilistic interactions between them. The edges in the graph are undirected because the interactions between the variables are symmetrical. In the case of tables, the interactions between the column headers, table cell values and relations between table columns are symmetrical. Thus we choose a Markov network based graphical model for inferring the semantics of tables.

Figure 5(a) shows the interaction between the variables in a table. In a typical well formed table, each column contains data of a single syntactic type (e.g., strings) that represent entities or values of a common semantic type (e.g., people). For example, in a column of cities, the column header City represents the semantic type of the values in the column, and Baltimore, Boston and Philadelphia are instances of that type. Thus knowing the type (or class) of the column header influences the assignment to the table cells in that column and vice-versa. To capture this influence, we insert an edge between the column header and each of the table cells in the column. Edges between the table cells themselves in
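A minimal structural sketch of the Markov network described so far, assuming a list-of-lists table representation and using networkx purely to hold the graph: one variable per column header, one variable per cell, and an undirected edge between each column header and every cell in its column. In the full model, parameterizing this structure (step iii above) would attach potentials to these nodes and edges, for example based on the ψ functions.

```python
import networkx as nx

def build_table_markov_network(headers, rows):
    """Build the undirected graph structure for a table: a node for each
    column header variable and each cell variable, with an edge between a
    column header and every cell in its column."""
    g = nx.Graph()
    for col, header in enumerate(headers):
        header_node = ("header", col, header)
        g.add_node(header_node, kind="column_header")
        for row, cells in enumerate(rows):
            cell_node = ("cell", row, col, cells[col])
            g.add_node(cell_node, kind="cell_value")
            g.add_edge(header_node, cell_node)
    return g

# Example: the structure for the table in Figure 3.
headers = ["City", "State", "Mayor", "Population"]
rows = [
    ["Baltimore", "MD", "S. C. Rawlings-Blake", "640,000"],
    ["Philadelphia", "PA", "M. Nutter", "1,500,000"],
    ["New York", "NY", "M. Bloomberg", "8,400,000"],
    ["Boston", "MA", "T. Menino", "610,000"],
]
network = build_table_markov_network(headers, rows)
print(network.number_of_nodes(), "variables,", network.number_of_edges(), "edges")
```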
