
ALGORITHMS THAT LEARN TO EXTRACT INFORMATION
BBN: DESCRIPTION OF THE SIFT SYSTEM AS USED FOR MUC-7

Scott Miller, Michael Crystal, Heidi Fox, Lance Ramshaw, Richard Schwartz, Rebecca Stone, Ralph Weischedel, and the Annotation Group
BBN Technologies
70 Fawcett Street
Cambridge, MA 02138
[email protected]

ABSTRACT

For MUC-7, BBN has for the first time fielded a fully-trained system for NE, TE, and TR; results are all the output of statistical language models trained on annotated data, rather than programs executing hand-written rules. Such trained systems have some significant advantages:

• They can be easily ported to new domains by simply annotating data with semantic answers.
• The complex interactions that make rule-based systems difficult to develop and maintain can here be learned automatically from the training data.

We believe that the results in this evaluation are evidence that such trained systems, even at their current level of development, can perform roughly on a par with rules hand-tailored by experts.

Since MUC-3, BBN has been steadily increasing the proportion of the information extraction process that is statistically trained. Already in MET-1, our name-finding results were the output of a fully statistical, HMM-based model, and that statistical IdentiFinder™ model was also used for the NE task in MUC-7. For the MUC-7 TE and TR tasks, BBN developed SIFT, a new model that represents a significant further step along this path, replacing PLUM, a system requiring handwritten patterns, with a single integrated trained model.

SIFT: AN INTEGRATED TE/TR EXTRACTION SYSTEM

Introduction

At the sentence level, the SIFT system ("Statistics for Information From Text") employs a unified statistical process to map from words to semantic structures. That is, part-of-speech determination, name-finding, parsing, and relationship-finding all happen as part of the same process.
This allows each element of the model to influence the others, and avoids the assembly-line trap of having to commit to a particular part-of-speech choice, say, early on in the process, when only local information is available to inform the choice.

The SIFT sentence-level model was trained from two sources:

• General knowledge of English sentence structure was learned from the Penn Treebank corpus of one million words of Wall Street Journal text.
• Specific knowledge about how the target entities and relations are expressed in English was learned from about 500K words of on-domain text annotated with named entities, descriptors, and semantic relations.

For extraction in a new domain, the names and descriptors of relevant items (persons, organizations, locations, and artifacts) are marked, as well as the target relationships between them that are signaled syntactically. For example, in the phrase "GTE Corp. of Stamford", the annotation would record a "location-of" connection between the company and the city. The model can thus learn the structures that are typically used in English to convey the target relationships.

While the bulk of the TE/TR task happens within the sentence-level decoder, some further processing was still required to produce TE and TR answer templates. After the names, descriptors, and local relationships had been extracted from the decoder output, a merging process had to be applied to link multiple occurrences of the same name or of alternative forms of the name from different sentences. A second, cross-sentence model was then invoked to try to identify non-local relationships that were missed by the decoder, as when the two entities do not occur in the same sentence. Finally, type and country information was filled in using heuristic tests and a gazetteer database, and output filters were applied to select which of the proposed internal structures should be included in the output.
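The name-merging step described above can be sketched as a simple token-overlap grouping. This is an illustrative reconstruction, not BBN's actual procedure: the normalization and the containment test are simplifying assumptions, and a greedy pass like this can over-merge short names that match more than one group.

```python
def name_tokens(name):
    """Normalize a name mention to a set of lowercase tokens (assumed normalization)."""
    return set(name.lower().replace(".", "").split())

def merge_names(mentions):
    """Greedy merging: a mention joins an existing group when its token set
    contains, or is contained in, the group's accumulated token set
    (e.g. 'Lewis' matches 'David Edwin Lewis')."""
    groups = []  # each group is [token_set, list_of_mentions]
    for m in mentions:
        t = name_tokens(m)
        for g in groups:
            if t <= g[0] or g[0] <= t:
                g[0] |= t
                g[1].append(m)
                break
        else:
            groups.append([t, [m]])
    return [g[1] for g in groups]
```

A real merger would also need to handle acronyms, nicknames, and descriptor-based coreference, which this containment test ignores.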
We are actively exploring ways to integrate these post-processing steps more closely with the main model, since an integrated statistical model is the only way to make every choice in a nuanced way, based on all the available information.

Sentence-Level Model

Figure 1 is a block diagram of the sentence-level model showing the main components and data paths. Two types of annotations are used to train the model: semantic annotations for learning about the target entities and relations, and syntactic annotations for learning about the general structure of English. From these annotations, the training program estimates the parameters of a unified statistical model that accounts for both syntax and semantics. Later, when presented with a new sentence, the search program explores the statistical model to find the most likely combined semantic and syntactic interpretation.

Figure 1: Block diagram of the sentence-level model. (Syntactic annotations from the Penn Treebank and semantic annotations feed the training program, which estimates the statistical model; the search program then decodes new sentences into combined semantic-syntactic interpretations.)

Training Data

Our source for syntactically annotated training data was the Penn Treebank (Marcus et al., 1993). Significantly, we do not require that syntactic annotations be from the same source, or cover the same domain, as the target task. For example, while the Penn Treebank consists of Wall Street Journal text, the target source for this evaluation was New York Times newswire. Similarly, although the Penn Treebank domain covers general and financial news, the target domain for this evaluation was space technology. The ability to use syntactic training from a different source and domain than the target is an important feature of our model. Since the Penn Treebank serves as our syntactically annotated training corpus, we need only create a semantically annotated corpus.
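One plausible in-memory form for such a semantic annotation, shown for the "GTE Corp. of Stamford" example from earlier, might look like the following. The field names and the token-span convention (start, end-exclusive) are illustrative assumptions, not BBN's annotation format.

```python
# Hypothetical annotation record; field names are illustrative only.
annotation = {
    "tokens": ["GTE", "Corp.", "of", "Stamford"],
    "entities": [
        {"id": "e1", "span": (0, 2), "type": "organization", "kind": "name"},  # GTE Corp.
        {"id": "e2", "span": (3, 4), "type": "location",     "kind": "name"},  # Stamford
    ],
    "coreference": [],                              # pairs of co-referring entity ids
    "relations": [("location_of", "e1", "e2")],     # the "location-of" connection
}
```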
Stated generally, semantic annotations serve to denote the entities and relations of interest in the target domain. More specifically, entities are marked as either names or descriptors, with co-reference between entities marked as well. Figure 2 shows a semantically annotated fragment of a typical sentence. From only these simple semantic annotations, the system can be trained to work in a new domain. To train SIFT for MUC-7, we annotated approximately 500,000 words of New York Times newswire text, covering the domains of air disasters and space technology. (We have not yet run experiments to see how performance varies with more/less training data.)

Figure 2: An example of semantic annotation for the fragment "Nance, who is also a paid consultant to ABC News, said ...", marking the person name, the person descriptor, the organization name, the employee relation, and the coreference link.

Semantic/Syntactic Structure

While our semantic annotations are quite simple, the internal model of sentence structure is substantially more complicated, since it must account for syntactic structure as well as entities and semantic relations. Our underlying training algorithm requires examples of these internal structures in order to estimate the parameters of the unified semantic/syntactic model. However, we do not wish to incur the high cost of annotating parse trees. Instead, we use the following multi-step training procedure, exploiting the Penn Treebank:

1) Train the sentence-level model on the purely syntactic parse trees in the Treebank. Once this step is complete, the model will function as a state-of-the-art statistical parser.
2) For each sentence in the semantically annotated corpus:
   a) Apply the sentence-level model to syntactically parse the sentence, constraining the model to produce only parses that are consistent with the semantic annotation.
   b) Augment the resulting parse tree to reflect semantic structure as well as syntactic structure.
3) Retrain the sentence-level model on the augmented parse trees produced in step 2.
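The multi-step procedure above can be sketched as a short bootstrapping loop. The callables here (trainer, constrained parser, tree augmenter) are hypothetical stand-ins for the components described in the text, passed in as parameters so the sketch stays self-contained.

```python
def build_integrated_model(train_parser, constrained_parse, augment_tree,
                           treebank_trees, semantic_corpus):
    """Sketch of the three-step training procedure; all callables are
    hypothetical placeholders, not BBN's actual components."""
    # Step 1: purely syntactic training on the Penn Treebank.
    parser = train_parser(treebank_trees)
    # Step 2: constrained parsing of each annotated sentence, then
    # semantic augmentation of the resulting tree.
    augmented = [augment_tree(constrained_parse(parser, sent, ann), ann)
                 for sent, ann in semantic_corpus]
    # Step 3: retrain the same model on the augmented trees.
    return train_parser(augmented)
```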
Once this step is complete, we have an integrated model of semantics and syntax. Details of the statistical model will be discussed later. For now, we turn our attention to (a) constraining the decoder and (b) augmenting the parse trees with semantic structure.

Constraints are simply bracketing boundaries that may not be crossed by any parse constituent. There are two types of constraints: hard constraints, which cannot be violated under any conditions, and soft constraints, which may be violated only if enforcing them would result in no plausible parse. All named entities and descriptors are treated as hard constraints; the model is prohibited from producing any structures that would break up these elements. In addition, we attempt to keep possible appositives together through soft constraints. Whenever there is a co-referential relation between two entities that are either adjacent or separated by only a comma, we posit an appositive and introduce a soft constraint to encourage the parser to keep the elements together.

Once a constrained parse is found, it must be augmented to reflect the semantic structure. Augmentation is a five-step process:

1) Nodes are inserted into the parse tree to distinguish names and descriptors that are not bracketed in the parse. For example, the parser produces a single noun phrase with no internal structure for "Lt. Cmdr. David Edwin Lewis". Additional nodes must be inserted to distinguish the descriptor, "Lt. Cmdr.," and the name, "David Edwin Lewis."
2) Semantic labels are attached to all nodes that correspond to names or descriptors. These labels reflect the entity type, such as person, organization, or location, as well as whether the node is a proper name or a descriptor.
3) For relations between entities, where one entity is not a syntactic modifier of the other, the lowermost parse node that spans both entities is identified. A semantic tag is then added to that node denoting the relationship.
For example, in the sentence "Mary Fackler Schiavo is the inspector general of the U.S. Department of Transportation," a co-reference semantic label is added to the S node spanning the name, "Mary Fackler Schiavo," and the descriptor, "the inspector general of the U.S. Department of Transportation."
4) Nodes are inserted into the parse tree to distinguish the arguments to each relation. In cases where there is a relation between two entities, and one of the entities is a syntactic modifier of the other, the inserted node serves to indicate the relation as well as the argument. For example, in the phrase "Lt. Cmdr. David Edwin Lewis," a node is inserted to indicate that "Lt. Cmdr." is a descriptor for "David Edwin Lewis."
5) Whenever a relation involves an entity that is not a direct descendant of that relation in the parse tree, semantic pointer labels are attached to all of the intermediate nodes. These labels serve to form a continuous chain between the relation and its argument.

Figure 3 shows an augmented parse tree corresponding to the semantic annotation in Figure 2. Note that nodes with semantic labels ending in "-r" are MUC reportable names and descriptors.

Statistical Model

In SIFT's statistical model, augmented parse trees are generated according to a process similar to that described in Collins (1996, 1997). For each constituent, the head is generated first, followed by the modifiers, which are generated from the head outward. Head words, along with their part-of-speech tags and features, are generated for each modifier as soon as the modifier is created. Word features are introduced primarily to help with unknown words, as in Weischedel et al. (1993). We illustrate the generation process by walking through a few of the steps of the parse shown in Figure 3. At each step in the process, a choice is made from a statistical distribution, with the probability of each possible selection dependent on particular features of previously-generated elements.
We pick up the derivation just after the topmost S and its head word, said, have been produced. The next steps are to generate in order:

1. A head constituent for the S, in this case a VP.
2. Pre-modifier constituents for the S. In this case, there is only one: a PER/NP.
3. A head part-of-speech tag for the PER/NP, in this case PER/NNP.
4. A head word for the PER/NP, in this case nance.
5. Word features for the head word of the PER/NP, in this case capitalized.
6. A head constituent for the PER/NP, in this case a PER-R/NP.
7. Pre-modifier constituents for the PER/NP. In this case, there are none.
8. Post-modifier constituents for the PER/NP. First a comma, then an SBAR structure, and then a second comma are each generated in turn.

This generation process is continued until the entire tree has been produced.

Figure 3: An augmented parse tree for "Nance, who is also a paid consultant to ABC News, said ...", layering semantic labels such as per/np, per-desc-r/np, emp-of/pp-lnk, and org-r/np over the syntactic structure.

We now briefly summarize the probability structure of the model. The categories for head constituents, c_h, are predicted based solely on the category of the parent node, c_p:

P(c_h | c_p), e.g. P(vp | s)

Modifier constituent categories, c_m, are predicted based on their parent node, c_p, the head constituent of their parent node, c_hp, the previously generated modifier, c_{m-1}, and the head word of their parent, w_p. Separate probabilities are maintained for left (pre) and right (post) modifiers:

P_L(c_m | c_p, c_hp, c_{m-1}, w_p), e.g. P_L(per/np | s, vp, null, said)
P_R(c_m | c_p, c_hp, c_{m-1}, w_p), e.g. P_R(null | s, vp, null, said)

Part-of-speech tags, t_m, for modifiers are predicted based on the modifier, c_m, the part-of-speech tag of the head word, t_h, and the head word itself, w_h:

P(t_m | c_m, t_h, w_h), e.g. P(per/nnp | per/np, vbd, said)

Head words, w_m, for modifiers are predicted based on the modifier, c_m, the part-of-speech tag of the modifier word, t_m, the part-of-speech tag of the head word, t_h, and the head word itself, w_h:

P(w_m | c_m, t_m, t_h, w_h), e.g. P(nance | per/np, per/nnp, vbd, said)

Finally, word features, f_m, for modifiers are predicted based on the modifier, c_m, the part-of-speech tag of the modifier word, t_m, the part-of-speech tag of the head word, t_h, the head word itself, w_h, and whether or not the modifier head word, w_m, is known or unknown:

P(f_m | c_m, t_m, t_h, w_h, known(w_m)), e.g. P(cap | per/np, per/nnp, vbd, said, true)

The probability of a complete tree is the product of the probabilities of generating each element in the tree. If we generalize the tree components (constituent labels, words, tags, etc.) and treat them all as simply elements, e, and treat all the conditioning factors as the history, h, we can write:

P(tree) = ∏_{e ∈ tree} P(e | h)

Training the Model

Maximum likelihood estimates for all model probabilities are obtained by observing frequencies in the training corpus. However, because these estimates are too sparse to be relied upon, they must be smoothed by mixing in lower-dimensional estimates. We determine the mixture weights using the Witten-Bell smoothing method.
For modifier constituents, the mixture components are:

P′(c_m | c_p, c_hp, c_{m-1}, w_p) = λ_1 P(c_m | c_p, c_hp, c_{m-1}, w_p)
                                  + λ_2 P(c_m | c_p, c_hp, c_{m-1})

For part-of-speech tags, the mixture components are:

P′(t_m | c_m, t_h, w_h) = λ_1 P(t_m | c_m, w_h)
                        + λ_2 P(t_m | c_m, t_h)
                        + λ_3 P(t_m | c_m)

For head words, the mixture components are:

P′(w_m | c_m, t_m, t_h, w_h) = λ_1 P(w_m | c_m, t_m, w_h)
                             + λ_2 P(w_m | c_m, t_m, t_h)
                             + λ_3 P(w_m | c_m, t_m)
                             + λ_4 P(w_m | t_m)

Finally, for word features, the mixture components are:

P′(f_m | c_m, t_m, t_h, w_h, known(w_m)) = λ_1 P(f_m | c_m, t_m, w_h, known(w_m))
                                         + λ_2 P(f_m | c_m, t_m, t_h, known(w_m))
                                         + λ_3 P(f_m | c_m, t_m, known(w_m))
                                         + λ_4 P(f_m | t_m, known(w_m))

Searching the Model

Given a sentence to be analyzed, the search program must find the most likely semantic and syntactic interpretation; more concretely, it must find the most likely augmented parse tree. Although mathematically the model predicts tree elements in a top-down fashion, we search the space bottom-up using a chart-based search. The search is kept tractable through a combination of CKY-style dynamic programming and pruning of low-probability elements.

Dynamic Programming: Whenever two or more constituents are equivalent relative to all possible later parsing decisions, we apply dynamic programming, keeping only the most likely constituent in the chart. Two constituents are considered equivalent if:

1. They have identical category labels.
2. Their head constituents have identical labels.
3. They have the same head word.
4. Their leftmost modifiers have identical labels.
5. Their rightmost modifiers have identical labels.

Pruning: Given multiple constituents that cover identical spans in the chart, only those constituents with probabilities within a threshold of the highest scoring constituent are maintained; all others are pruned.
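The Witten-Bell weights used in the mixtures above set λ = n / (n + d) for a conditioning context observed n times with d distinct outcomes, passing the remaining mass to the lower-dimensional estimate. The following is a minimal sketch assuming a single backoff level rather than the full mixture chains; it is illustrative, not BBN's implementation.

```python
from collections import Counter

def witten_bell(counts_by_context, backoff):
    """counts_by_context: {context: Counter(event -> count)}.
    backoff: callable event -> lower-dimensional probability estimate.
    Returns a smoothed conditional probability function."""
    def prob(event, context):
        c = counts_by_context.get(context, Counter())
        n = sum(c.values())   # total observations in this context
        d = len(c)            # distinct events seen in this context
        if n == 0:
            return backoff(event)
        lam = n / (n + d)     # Witten-Bell mixture weight
        return lam * (c[event] / n) + (1 - lam) * backoff(event)
    return prob
```

Contexts with many observations and few distinct outcomes get weights near 1, trusting the maximum-likelihood estimate; sparse or diverse contexts lean on the backoff.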
For purposes of pruning, and only for purposes of pruning, the prior probability of each constituent category is multiplied by the generative probability of that constituent (Goodman, 1997). We can think of this prior probability as an estimate of the probability of generating a subtree with the constituent category, starting at the topmost node. Thus, the scores used in pruning can be considered as the product of:

1. The probability of generating a constituent of the specified category, starting at the topmost node.
2. The probability of generating the structure beneath that constituent, having already generated a constituent of that category.

The outcome of the search process is a tree structure that encodes both the syntactic and semantic structure of the sentence, so that the TE entities and local TR relations can be directly extracted from these sentential trees.

Cross-Sentence Model

The cross-sentence model uses structural and contextual clues to hypothesize template relations between two elements that are not mentioned within the same sentence. Since 80-90% of the relations found in the answer keys connect two elements that are mentioned in the same sentence, the cross-sentence model has a narrow target to shoot for. Very few of the pairs of entities seen in different sentences turn out to be actually related. This model uses features extracted from related pairs in training data to try to identify those cases. It is a classifier model that considers all pairs of entities in a message whose types are compatible with a given relation; for example, a Person and an Organization would suggest a possible Employee_Of. For the three MUC-7 relations, it turned out to be somewhat advantageous to build in a functional constraint, so that the model would not consider, for example, a possible Employee_Of relation for a person already known from the sentence-level model to be employed elsewhere.
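Candidate generation with the functional constraint might be sketched as follows. The entity and relation representations here are illustrative assumptions, not the system's actual data structures.

```python
def candidate_pairs(entities, relation_types, known_relations):
    """Yield type-compatible entity pairs for the classifier.
    entities: list of (entity_id, type).
    relation_types: {relation_name: (first_slot_type, second_slot_type)}.
    known_relations: set of (relation_name, id1, id2) already found by
    the sentence-level model."""
    # Functional constraint: skip a first-slot entity whose slot for this
    # relation is already filled (e.g. a person already employed elsewhere).
    already_filled = {(rel, e1) for rel, e1, _ in known_relations}
    for rel, (t1, t2) in relation_types.items():
        for e1, ty1 in entities:
            if ty1 != t1 or (rel, e1) in already_filled:
                continue
            for e2, ty2 in entities:
                if ty2 == t2 and e1 != e2:
                    yield (rel, e1, e2)
```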
Given the measured features for a possible relation, the probability of a relation holding or not holding can be computed as follows:

p(rel | feats) = p(feats | rel) p(rel) / p(feats)
p(~rel | feats) = p(feats | ~rel) p(~rel) / p(feats)

If the ratio of those two probabilities, computed as follows, is greater than 1, the model predicts a relation:

p(rel | feats) / p(~rel | feats) = p(feats | rel) p(rel) / (p(feats | ~rel) p(~rel))

We approximate this ratio by assuming feature independence and taking the product of the contributions for each feature:

p(rel | feats) / p(~rel | feats) ≈ [p(rel) ∏_i p(feat_i | rel)] / [p(~rel) ∏_i p(feat_i | ~rel)]

The cross-sentence feature model applies to entities found by the sentence-level model, which is run over all of the sentence-like portions of the text. An initial heuristic procedure checks for sections of the preamble or trailer that look like sentential material that should be treated like the body text. There is also a separate handwritten procedure that searches the preamble text for any byline and, if one is found, instantiates an appropriate employee relationship.

Model Features

Two classes of features were used in this model: structural features that reflect properties of the text surrounding references to the entities involved in the suggested relation, and content features based on the actual entities and relations encountered in the training data.

Structural Features

The structural features exploit simple characteristics of the text surrounding references to the possibly-related entities. The most powerful structural feature, not surprisingly, was distance, reflecting the fact that related elements tend to be mentioned in close proximity, even when they are not mentioned in the same sentence.
Given a pair of entity references in the text, the distance between them was quantized into one of three possible values:

Code  Distance Value
0     Within the same sentence
1     Neighboring sentences
2     More remote than neighboring sentences

For each pair of possibly-related elements, the distance feature value was defined as the minimum distance between some reference in the text to the first element and some reference to the second.

A second structural feature grew out of the intuition that entities mentioned in the first sentence of an article often play a special topical role throughout the article. The "Topic Sentence" feature was defined to be true if some reference to one of the two entities involved in the suggested relation occurred in the first sentence of the text-field body of the article. Other structural features that were considered but not implemented included the count of the number of references to each entity.

Content Features

While the structural features learn general facts about the patterns in which related references occur and the text that surrounds them, the content features learn about the actual names and descriptors of entities seen to be related in the training data. The three content features in current use test for a similar relationship in training by name or by descriptor, or for a conflicting relationship in training by name.

The simplest content feature tests, using names, whether the entities in the proposed relationship have ever been seen before to be related. To test this feature, the model maintains a database of all the entities seen to be related in training, and of the names used to refer to them. The "by name" content feature is true if, for example, a person in some training message who shared at least one name string with the person in the proposed relationship was employed in that training message by an organization that shared at least one name string with the organization in the proposed relationship.
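The distance quantization is straightforward to state in code. This sketch assumes each entity's references are available as sentence indices, which the merging step would supply.

```python
def distance_code(refs_a, refs_b):
    """Quantized distance between two entities, given the sentence
    indices of their references. Returns 0 (same sentence),
    1 (neighboring sentences), or 2 (more remote)."""
    # Minimum gap over all cross-pairs of references.
    gap = min(abs(a - b) for a in refs_a for b in refs_b)
    return min(gap, 2)
```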
A somewhat weaker feature makes the same kind of test for a previously seen relationship using descriptor strings. This feature fires when an entity that shares a descriptor string with the first argument of the suggested relation was related in training to an entity that shares a name with the second argument. Since titles like "General" count as descriptor strings, one effect of this feature is to increase the likelihood of generals being employed by armies. Observing such examples, but noting that the training didn't include all the reasonable combinations of titles and organizations, we seeded the training for this feature by adding a virtual message constructed from a list of such titles and organizations, so that any reasonable such pair would turn up in training.

The third content feature was a kind of inverse of the first "by name" feature; it was true if some entity sharing a name with the first argument of the proposed relation was related in training to an entity that did not share a name with the second argument. Using Employee_Of again as an example, it is less likely (though still possible) that a person who was known in another message to be employed by a different organization should be reported here as employed by the suggested one.

Training

Given enough fully annotated data, with both sentence-level semantic annotation and message-level answer keys recorded along with the connections between them, training the features would be quite straightforward. For each possibly-related pair of entities mentioned in a document, one would just count up the 2x2 table showing how many of them exhibited the given structural feature and how many of them were actually related. The training issues that did arise stemmed from the limited supply of answer keys and from the fact that the keys were not connected to the sentence-level annotations. The government training and dry-run data provided 200 messages' worth of TE and TR answer keys.
Those answer keys, however, contained strings without recording where in the text they were found. In order to train structural features from that data, we needed the locations of references within the text. A heuristic string-matching process was used to make that connection, with a special check for names to ensure that the shorter version of a name did not match a string in the text that also matched a longer version of the same name.

Training the content features, on the other hand, did not require positional information about the references. The plain answer keys could be used in combination with a database of the name and descriptor strings for entities related in training to count up the feature probabilities for actually related and non-related pairs. The string database was collected first, and one-out training was then used, so that the rest of the training corpus provided the string database for training the feature counts on each particular message. The additional training data that was semantically annotated for training the sentence-level model, but for which answer keys were not available, could still be used in building up the string database for the content features. The probabilities based on the final feature counts were smoothed by mixing them with 0.01% of a uniform model.

Contribution of the Cross-Sentence Model

When measured on 10 randomly-selected messages from the airplane-crash training data, the cross-sentence model improved TR scores by 5 points. It proved a bit less effective on the 100 messages of the formal test set, improving scores there by only 2 points. (The F score on the formal test set with the cross-sentence model component disabled was 69.33%.)
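The decision rule of the cross-sentence model, the naive-Bayes likelihood ratio derived earlier, can be sketched as follows. The probability-table representation is an illustrative assumption; in the system the component probabilities come from the smoothed feature counts described above.

```python
def predicts_relation(features, p_rel, p_feat_given_rel, p_feat_given_notrel):
    """Predict a relation when the naive-Bayes ratio exceeds 1.
    features: {feature_name: value}.
    p_rel: prior probability of a relation holding.
    p_feat_given_rel / p_feat_given_notrel: {(feature_name, value): prob}."""
    # Prior odds p(rel) / p(~rel), then one likelihood-ratio factor
    # per feature under the independence assumption.
    ratio = p_rel / (1.0 - p_rel)
    for fv in features.items():
        ratio *= p_feat_given_rel[fv] / p_feat_given_notrel[fv]
    return ratio > 1.0
```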
