Answering Definitional Questions Before They Are Asked

by

Aaron D. Fernandes

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2004

© Massachusetts Institute of Technology 2004. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 17, 2004

Certified by: Boris Katz, Principal Research Scientist, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Abstract

Most question answering systems narrow down their search space by issuing a boolean IR query on a keyword-indexed corpus. This technique often proves futile for definitional questions, because they contain only one keyword or name. Thus, an IR search for only that term is likely to produce many spurious results: documents that mention the keyword, but not in a definitional context. An alternative approach is to glean the corpus in pre-processing for syntactic constructs in which entities are defined. In this thesis, I describe a regular expression language for detecting such constructs, with the help of a part-of-speech tagger and a named-entity recognizer. My system, named CoL. ForBIN, extracts entities and their definitions, and stores them in a database. This reduces the task of definitional question answering to a simple database lookup.
Thesis Supervisor: Boris Katz
Title: Principal Research Scientist

"Col. Forbin, I know why you've come here
And I'll help you with the quest to gain the knowledge that you lack."
-Icculus

Acknowledgments

I'd like to thank the following people for making this thesis possible:

- Boris Katz, for giving me this opportunity, and for all his guidance and patience throughout the years.
- Gregory Marton for providing the initial spark (and being my personal sysadmin), and Jimmy Lin for reigniting it.
- Sue Felshin, for her thoughtful editing remarks.
- Prof. Winston, who taught me all the rules for writing a thesis (so I could go ahead and break them).
- Jerome MacFarland, for showing me the Infolab way; Matthew Bilotti, for keeping me company in 822; Federico, Daniel, and Roni, for entertaining me all summer; and all my other fellow Infolabers.
- Jason Chiu, "the human annotator".
- All my friends beyond the Infolab, including, but not limited to the DU boys, the Husky Club, and the Phans. Thanks for keeping me sane.
- Dad, the Bezners, the Morgensterns (no precedence, just alphabetical!), and the rest of my family, for all their love and support. Without you, none of this is possible.

1. Oh yeah, I'd also like to thank the band that provided these words of wisdom, and a world of enlightenment. This has all been wonderful, and you will be missed.

Dedication

In loving memory of Mom, Sitoo, and Grandpa, who never doubted me even when I did. This one's for you.

Contents

1 Introduction
    1.1 Question Answering Overview
    1.2 Definitional Questions
    1.3 Syntactic Pattern Matching

2 Related Work
    2.1 TREC 10 & 11
        2.1.1 InsightSoft-M
        2.1.2 ISI's TextMap
    2.2 Restricting the Domain to Definitional Questions
        2.2.1 Fleischman et al.
        2.2.2 MIT CSAIL's Intrasentential Definition Extractor
        2.2.3 State of the Art Alternatives: BBN and Yang

3 CoL. ForBIN: A Pattern Grammar
    3.1 A Syntactic Regular Expression Library
        3.1.1 Atomic Elements
        3.1.2 Detecting Phrase Boundaries
    3.2 Building Patterns
        3.2.1 Age
        3.2.2 Affiliation
        3.2.3 Also Known As (AKA)
        3.2.4 Also called
        3.2.5 Named
        3.2.6 Like
        3.2.7 Such As
        3.2.8 Occupation
        3.2.9 Appositive
        3.2.10 Copular
        3.2.11 Became
        3.2.12 Relative Clause
        3.2.13 Was Named
        3.2.14 Other Verb Patterns
    3.3 Putting it all Together
        3.3.1 Entity-First Patterns
        3.3.2 Entity-Second Patterns
        3.3.3 Occupations

4 Experimental Results
    4.1 Generating Ground Truth
    4.2 Testing Against Truth
        4.2.1 Generating Benchmark Statistics
        4.2.2 Comparison to Benchmark
    4.3 Human Judgements
    4.4 Testing in QA Context
    4.5 Discussion

5 Future Work
    5.1 Target Expansion
    5.2 CFG Parsing
    5.3 Statistical Methods

6 Contributions