QACTIS Enhancements in TREC QA-2006

P. Schone, G. Ciany (*), R. Cutts (b), P. McNamee (=), J. Mayfield (=), Tom Smith (=)
U.S. Department of Defense, Ft. George G. Meade, MD 20755-6000
(*) Dragon Development Corporation, Columbia, MD
(b) Henggeler Computer Consultants, Columbia, MD
(=) Johns Hopkins Applied Physics Laboratory, Laurel, MD

ABSTRACT

The QACTIS system has been tested in previous years at the TREC Question Answering Evaluations. This paper describes new enhancements to the system specific to TREC-2006, including basic improvements and thresholding experiments, filtered and Internet-supported pseudo-relevance feedback for information retrieval, and emerging statistics-driven question-answering. For contrast, we also compare our TREC-2006 system performance to that of our top systems from TREC-2004 and TREC-2005 applied to this year's data. Lastly, we analyze evaluator-declared unsupportedness of factoids and nugget decisions of "other" questions to understand major negative changes in performance for these categories over last year.

1. INTRODUCTION

QACTIS (pronounced "cactus"), which breaks out to "Question-Answering for Cross-Lingual Text, Image, and Speech," is a research prototype system being developed by the U.S. Department of Defense. The goals and descriptions of this system are given in past TREC papers (see Schone et al., 2004 and 2005, in [1], [2]). In this paper, though, we provide a self-contained description of modifications that have been made to the system in 2006. There were three major points of study upon which we conducted research this year: (1) basic improvements to the general processing strategy, (2) information retrieval enhancements as a prefilter, and (3) a move toward integration of more purely-statistical question answering. We describe each of these research avenues in some detail. For the sake of demonstrating these improvements, we evaluate our best systems from TREC-2004 and TREC-2005 on this year's evaluation data as a means of comparison. This comparison does not completely re-create all of the nuances of past-year systems, but we believe it provides an appropriate reflection of system performance over time.

After discussion of the system enhancements, we conduct a post-evaluation analysis of the results from the TREC-2006 evaluation. Our system improved slightly this year in terms of factoid and list answering. However, we experienced 10% and 20% relative losses in system performance due respectively to unsupportedness and inexactness -- numbers which are too large to go without notice. The inexactness losses seem high, but hopefully such degradation has been uniformly observed across systems. On the other hand, the unsupportedness is more suspicious. Unlike many other systems, which use the Internet as a pre-mechanism for finding answers and thereby have a chance of recalling the wrong file, our system solely draws its answers from TREC documents and does not mine the Web for answers. We conducted a number of post-evaluation experiments. One of these attempted to make a determination as to whether the unsupported labels were justified. We found through this analysis no systematic biases. Along a similar vein, at TREC-2005, QACTIS's "Other" answerer received the highest score, whereas this year it suffered a 40% degradation in overall score. We conducted a study to understand this degradation. This evaluation also eliminated concerns about potential assessment problems.

2. CY2006 SYSTEM ENHANCEMENTS

In 2006, there were a number of new avenues of research on QACTIS. As mentioned earlier, these fall into three main directions. Specifically, these involved improvements to the base system, information retrieval enhancements for preselection of appropriate documents, and, lastly, a process to move away from symbolic processing to a more statistical system. We discuss each of these in turn.
2.1 General System Improvements

2.1.1. Overcoming Competition-level Holes

One major modification was designed to overcome problems that only arise with the appearance of fresh data -- problems which unfortunately only really occur during competitions. In TREC-2005, we had noticed that there were a number of questions which our parser failed to properly handle; there were other questions for which the system did not know what kind of information it was seeking; there were questions that were so long that our NIL-tagger generated inadvertent false positives; and lastly, there were many QACTIS-provided answers (as much as 20% relative) marked inexact.

The first two of these were readily solved. We ensured that the system always parsed any previously-unprocessed documents and that it handled these properly at runtime. We also attempted to require that any factoid question whose target answer form was unknown would at least return an entity type. Also, we sought to prevent non-responses for other-style questions by requiring the system to revisit the question in a new way if an earlier stage failed to provide a response. These changes were important for yielding a more robust question answerer, and it is possible that as much as 0.5-1% of the factoid score improvement is attributable to these fixes (particularly the parsing fixes).

Perhaps the most dramatic change to handling the problem of the unforeseen was to change the system's scoring metric to take the length of the question into consideration. Our system attempts to provide a probability-like estimate of the likelihood that some answer is legitimate given the question. All the non-stopwords of the question are important in the question-answering process, so longer questions will naturally have lower estimated probabilities than shorter questions. Nevertheless, it had previously been our policy to use a threshold as a means of estimating whether an answer should be reported as a NIL. This meant that long questions were more likely to erroneously report NIL as their answers than short questions. Through a least-squares fit, we determined that the probability scores were approximately proportional to 0.01^length_in_words for questions with at least three words. Therefore, we multiplied the scores of such questions by 100^(length_in_words - 3). We conducted an exhaustive search and determined an optimal new NIL threshold (of 10^-12). In our developments, this augmentation to the weight seemed to have only limited effect on score but prevented accidental discarding of legitimate answers.
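To make the length normalization concrete, the following minimal sketch (our illustration only, not the QACTIS source; the function names are hypothetical) applies the rescaling described above before the NIL decision, using the fitted per-word base of 0.01 and the new threshold of 10^-12:

    NIL_THRESHOLD = 1e-12   # new NIL threshold found by exhaustive search

    def length_normalized_score(raw_score, question):
        # Scores were observed (via least-squares fit) to decay roughly as
        # 0.01 ** length_in_words for questions of three or more words, so
        # multiplying by 100 ** (length_in_words - 3) undoes that decay.
        length_in_words = len(question.split())
        if length_in_words < 3:
            return raw_score
        return raw_score * 100.0 ** (length_in_words - 3)

    def report_nil(raw_score, question):
        # An answer is reported as NIL only if its normalized score
        # still falls below the threshold.
        return length_normalized_score(raw_score, question) < NIL_THRESHOLD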
The last challenge was to overcome the large number of answers that QACTIS produced that were being identified as inexact in past years. Our system might identify the appropriate last name of a person that should be reported as an answer, or it might identify the city without the state. We had previously built name and anaphora resolution into our system which we had not been using, and we experimented with various settings of these components to see if we might get some additional gains, but we were unsuccessful. We reasoned that use of a more recently-built content extractor with such resolution embedded could be especially beneficial. BBN was able to generate output for us from their SERIF [3] engine, and we began work to incorporate this information into an exactness filter. Unfortunately, we were not able to make use of this information prior to the evaluation. Ultimately, the only additional resolution that we could incorporate into the system by the time of the competition was to get the base system to augment city names with their corresponding state names when such information was present in the data.

2.1.2. What is the Actual Information Need?

Based on evaluations over past years, we noted that QACTIS produced erroneous answers for questions about court decisions, court cases, ranks, ball teams, scores, campuses, manufacturers, and, in some cases, titles of works of art. These problems were due largely to issues of either underspecificity or of providing a hyponym for a concept rather than a required instance of that concept.

With regard to underspecificity, "teams" provide a great example. If a question were of the form "What team won Super Bowl ...," it is clear to a human that the team being referenced is an NFL football team. If instead "World Cup" replaced "Super Bowl," the team should be a soccer team. Formerly, the system would seek out any kind of team for the appropriate response -- a problem of underspecificity. To avoid this problem, we encoded knowledge into the system to help it better reach the correct level of specificity, particularly with teams. Likewise, in TREC-2005, the system was prepared to identify titles of works of art, but whether or not it could do so was subject to the way the question was posed. We tried to incorporate more generality into its ability to recognize when such a work of art was being requested. We have not removed all problems with this underspecificity, so this will be continued work as we prepare toward TREC-2007.

The hyponym/instance-of issue is likewise a prominent problem in the system. If the system were to see a question "What was the court decision in ..." or "What was the score ...," the system would think that it was looking for some hyponym of "court decision" and "score" rather than a particular verdict (guilty, innocent) or a numeric value, respectively. We implemented a number of specialized functions to tackle rarer-occurring questions such as these and ensure that the appropriate type of information was provided to the user.
2.1.3. Missing Information Needs: Auto-hypernymy

There are related problems to looking for either hyponyms or instances of classes which are due to lack of world knowledge in some areas. Systems that use the Internet to provide them potential answers get around the problem of missing world information. Our base system is, for the most part, self-contained, and we do not currently make direct use of the Web. Therefore, we need to ingest resources to support our question-answering. In past years, we have made use of WordNet [4] and a knowledge source we had previously developed called Semantic Forests [5], plus we had targeted specific categories of questions and derived large inventories of potential concepts under those categories through the use of Wikipedia [6]. This year, however, we tried to grow our world knowledge to much more than hundreds of categories and instead tried to (a) ingest much or all of Wikipedia's taxonomic structure, and (b) automatically induce taxonomic structure on the fly.

For the first effort, we downloaded the entire English component of Wikipedia and distilled out all of its lists. Then we developed code that could turn those lists into a structure akin to the taxonomic structure required for ingestion by our system. Time constraints limited our ability to do this in a flawless fashion (and revisiting this issue is certainly in order for the future).
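As a rough illustration of the kind of transformation involved (our own sketch under assumed conventions, not the actual ingestion code; real list pages need far more cleanup than is shown here), a Wikipedia "List of X" page can be reduced to (instance, hypernym) pairs:

    import re

    def list_page_to_taxonomy(title, wikitext):
        # Treat the list topic as the hypernym and each bulleted wiki link
        # as an instance of it.  Illustrative sketch only.
        match = re.match(r"List of (.+)", title, re.IGNORECASE)
        if not match:
            return []
        hypernym = match.group(1).strip().rstrip("s")    # crude singularization
        pairs = []
        for line in wikitext.splitlines():
            if not line.lstrip().startswith("*"):         # bulleted entries only
                continue
            for link in re.findall(r"\[\[([^\]|#]+)", line):
                pairs.append((link.strip(), hypernym))
        return pairs

    # list_page_to_taxonomy("List of volcanoes", "* [[Mount Etna]], Italy")
    # -> [("Mount Etna", "volcanoe")] -- the crude singularizer illustrates
    # why such a conversion is hard to make flawless under time pressure.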
In addition to the use of this Wiki-generated taxonomic structure, we also experimented with hypernym induction as a means of finding still more information. In 2004, Snow et al. [7] described an interesting method for hypernym induction that was based on supervised machine learning using features derived from dependency parses. Once trained, the learner can be applied to unannotated corpora to identify new hypernym pairs. Snow provided us with his code, and we began investigating this technology and extending its scalability for application to our question-answering problem.

We used these two new datasets to augment the data that we previously had (and which we had been growing by hand throughout the year) to see if these approaches would yield system improvements. We created two variant taxonomic dictionaries of different sizes and plugged them into the existing system. In Table 1, we illustrate the results of these variants as compared with our baseline system that makes use of largely hand-assembled data. (Note that the scores listed are the number of number-one answers (#1) and the F-score for lists on past TREC sets with question identifiers in the specified ranges. In these evaluations, also, the system judgments are automatic, and the judgments do not count off for unsupportedness nor for temporal incorrectness. Moreover, the evaluation counts as correct any outputs which have exact answers embedded therein, but which may not truly be exact. For example, "Coach Harold Solomon" would be scored as correct even if the exact form should only be "Harold Solomon.") The table illustrates a disappointing result: despite this interesting effort, the taxonomy-growing approach as it currently stands yielded slightly negative results for factoid answering, and Variant 1 actually yielded significant losses in list performance. Needless to say, we chose not to select these updated data sources for use in our actual evaluation systems. This will be a subject of study for the future.

Table 1: Wikipedia Ingestion
QA Set         Baseline (#1 / List)   Variant1 (#1 / List)   Variant2 (#1 / List)
201-700        165 / --               164 / --               --
894-1393       116 / --               113 / --               --
1394-1893      133 / --               133 / --               133 / --
1894-2393      154 / .165             152 / .137             152 / .169
1.1-65.5        72 / .197              71 / .130              71 / .188
66.1-140.5     121 / .163             124 / .109             121 / .157

2.1.4. Longer Term Attempts

There were two other areas of research on QACTIS which we undertook with intentions of incorporating by the time of the evaluation but which required more effort than expected to make ready on time. These are mentioned only briefly for the sake of completeness. One of these areas dealt with morphological information and the other with multitiered processing.

With morphology, our system attempts to crudely generate directed conflation sets for all of the words of the question and for the documents which hopefully contain the answers. A number of questions have been answered incorrectly due to incorrect morphological information, related often to word sense. We therefore began what turned out to be a significant effort to convert the way the system did morphological processing to one that would also make use of the part of speech in its stemming process. We hope that this information will strengthen the system at a later point even though it is currently not embedded in the processing.

Another effort which we were not able to finish attempted to convert QACTIS's current process from one that fuses all forms of annotation (named entity, parsing, etc.) into a single stream to one where each stream can be accessed independently. This would allow the system to derive an answer from one stream even if another stream would have yielded some other interpretation. This notion has potentially very positive gains, but we are currently at some distance away from knowing its long-term benefits.

Although the full-blown morphology revision was not incorporated by the time of the evaluation, we were able to incorporate a weaker effort regarding morphology with regard to list-processing. The QACTIS system has been developed primarily with a focus on factoid answering, but the morphological structure of questions was not well addressed for tackling lists. We therefore did work on beefing the system up in terms of its list-handling capability and, in the long run, this effort proved to be quite useful in that our list-answering capability on the whole improved substantially.

2.2 Retrieving Documents

Like many other question-answering systems, QACTIS begins its question-answering phase by first attempting to find documents on the same subject. We have used the Lemur system [8], version 2.2, since TREC-13 and have found its results to be satisfactory. In fact, at TREC-14, using this system out of the box yielded one of the top IR systems. We experimented with new versions of Lemur, but were not able to get any better results for QA.

Even still, when we look at the results of our question-answerer, we see that it has a less-than-perfect upper bound due to limitations in information retrieval. If we could enhance our ability to identify appropriate documents, we would likely have a higher performance and a higher upper bound on performance. We set out to improve our ability to preselect documents which would hopefully contain the desired answers. We experimented with two approaches. The first of these was an approach which identified key phrases from the question and tried to ensure that the returned documents actually contained those phrases. We will call this process phrase-based filtering of information retrieval. The second process used the Web as a mechanism for pseudo-relevance feedback. We discuss each of these techniques.
2.2.1 Phrase-based IR Filtering

By the time we had competed our system in the TREC-2005 competition, the base system as applied to one of the older TREC collections (the 1894 question series) was getting mean reciprocal ranks of about 40%. Upon examination of the documents returned by the IR component, it was discovered that a large number of irrelevant documents were being returned. One reason for this was that people's names, such as 'Virginia Woolf', were broken into two separate query terms. The resultant documents returned some that contained 'Virginia Woolf' as well as some that related to a 'Bob Woolf' who lived in 'Virginia'. A question pertaining to 'Ozzy Osborne' also returned a document containing a reference to a woman who owned a dog named Ozzy which bit a 'Mrs. Osborne' on the wrist.

Further analysis of the IR set showed that the top 10 documents for each question contained the correct answer 55% of the time; the top 30 documents contained the correct answer 67% of the time. (The numbers were determined by comparing the document list returned by the IR system to the list of correct documents for each question as provided by the TREC competition committee.) The IR system for the 2005 dataset was much better -- the top 10 documents contained the correct answer 70% of the time, while the top 30 documents contained the correct answer 80% of the time.

A pseudo-IR set was built for the 1894 series using the answer set provided by TREC -- we referred to this set as a 'perfect IR' set. The number of correct answers provided by the base system when this dataset was used was ~60% -- an absolute gain of 20% just by removing irrelevant documents.

An additional step was added to the overall system that attempted to filter the IR using the named entities present in the question. This list also includes dates, titles, and anything in quotes. This process did provide an increase in scores as long as it was not overly aggressive in filtering out too many documents.

Further attempts were made to include multi-word terms and low-frequency words (words in the question which had a lower frequency of occurrence in the overall corpus) as filter terms, but there was not enough time to adequately analyze the effect. Additional parameters, such as how many of the top 1000 documents we should examine, how many documents we should retain, and how many documents from the original IR we should keep by default, also had to be factored into the result.

By the time of the TREC-2006 evaluation, it was determined that no more than the top 50 documents should be examined. There was no difference in our system in examining 50 or 75 documents; 100 documents degraded the overall system performance. There was also a significant boost in looking at 50 documents as opposed to just 30. Also, because of list questions, it was determined arbitrarily that at least 10 documents should be retained. Since our IR system showed the top 5 documents for each question to be relevant about 60% of the time, we decided to keep the top 5 documents as a matter of course.

We ran our phrase-based filtering on all of the collections at our disposal on the day before the TREC-2006 evaluation. Table 2 illustrates these results (whose scoring follows the paradigm mentioned in Table 1). As can be seen, this approach affords small (2.6% relative) but positive improvements in the overall system performance.

Table 2: Filtered IR DevSet Improvements
QA Set         Baseline (#1 / Mrr / List)   w/ Filtered IR (#1 / Mrr / List)   Diff in #1s
201-700        165 / .432 / --              172 / .446 / --                    +7
894-1393       116 / .351 / --              118 / .348 / --                    +2
1394-1893      133 / .393 / --              138 / .402 / --                    +5
1894-2393      154 / .431 / .165            158 / .444 / .176                  +4
1.1-65.5        72 / .395 / .197             71 / .402 / .197                  -1
66.1-140.5     121 / .445 / .163            127 / .456 / .162                  +6
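A minimal sketch of this kind of filter is shown below (our reconstruction, not the QACTIS code; the phrase extraction is assumed to be supplied by the question analyzer, and the constants are the ones reported above):

    def phrase_filter(ranked_docs, key_phrases, doc_text,
                      max_examined=50, always_keep=5, min_keep=10):
        """Keep retrieved documents that contain the question's key phrases
        (named entities, dates, titles, quoted strings).  Illustrative only."""
        kept = list(ranked_docs[:always_keep])          # top 5 kept as a matter of course
        for doc_id in ranked_docs[always_keep:max_examined]:
            text = doc_text[doc_id].lower()
            # Strict variant: require every key phrase; a looser "some of the
            # phrases" criterion is equally consistent with the description.
            if all(p.lower() in text for p in key_phrases):
                kept.append(doc_id)
        # Back-fill from the original ranking if the filter was too aggressive.
        for doc_id in ranked_docs:
            if len(kept) >= min_keep:
                break
            if doc_id not in kept:
                kept.append(doc_id)
        return kept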
2.2.2. Google-Enhanced Information Retrieval

The second approach to prefiltering was a multistep technique that took advantage of the Internet without actually trying to mine answers from it. This is a process which apparently has been used by other TREC-QA participants in past years.

To improve our document retrieval phase, we used Google to select augmentation terms for each question. Each question in the test set was converted to a Google query and submitted to Google using the Google API. The top 80 unique snippets returned for each question were used as the augmentation collection. Given a question, we counted the number of times each snippet word co-occurred with a question word in the snippet sentence. These sums were multiplied by the inverse document frequency of the term; document frequencies were calculated from the AQUAINT collection. The resulting scores were used to rank the co-occurring terms. The top eight terms were selected as augmentation terms and were added to the original query with weight 1/3. The resulting queries were then used for document retrieval.
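The expansion step can be sketched as follows (an illustrative reconstruction, not the authors' code; snippet retrieval, sentence splitting, and the AQUAINT document-frequency table are assumed to be available, and the exact definition of co-occurrence was itself one of the tuned parameters):

    import math
    from collections import Counter

    def expansion_terms(question_words, snippets, aquaint_df, num_docs,
                        top_k=8, stopwords=frozenset()):
        """Select pseudo-relevance-feedback terms from web snippets.
        question_words: lowercased question terms; snippets: up to 80 unique
        snippets, each given as a list of sentences.  Returns top_k terms to
        be added to the original query (with weight 1/3 in the description above)."""
        cooc = Counter()
        for snippet in snippets:
            for sentence in snippet:
                words = [w.lower() for w in sentence.split()]
                if not any(w in question_words for w in words):
                    continue                     # no question word in this sentence
                for w in words:
                    if w not in question_words and w not in stopwords:
                        cooc[w] += 1             # co-occurs with a question word
        def idf(term):
            return math.log(num_docs / (1 + aquaint_df.get(term, 0)))
        scored = {t: c * idf(t) for t, c in cooc.items()}
        return sorted(scored, key=scored.get, reverse=True)[:top_k]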
In selecting these parameters, we were faced with many choices for the definition of co-occurrence, term weighting, etc. With over a thousand combinations to choose from, it was not practical to run full question-answering tests on each one to select the best. Instead, we used a proxy for question-answering performance: the number of words occurring in question answers that were selected as expansion terms by the method. We mined the answer patterns from past TREC QA tracks for words. Each time a method selected one of these words as an expansion term for the training collection, it was given a point. We used the highest scoring method in our TREC-2006 run.

In our developments with this process, we were quite excited about the gains we were seeing with it. We experimented with five different configurations of this process, and one process (S2) yielded particularly successful results. In Table 3, as can be seen from the three most recent years of TREC, as compared to the baseline, the S2 system in preliminary tests yielded a 6.4% relative improvement in factoid performance and a 4.1% relative improvement in list performance.

Table 3: Google-Enhanced Improvements
QA Set                BL      TF      LS      LT      S       S2
1894-2393     #1      154     156     130     156     152     157
              Mrr     .431    .455    .377    .455    .442    .462
              List    .165    .191    .148    .192    .179    .191
1.1-65.5      #1       72      76      80      81      76      79
              Mrr     .395    .434    .405    .443    .434    --
              List    .191    .176    .179    .182    .189    .195
66.1-140.5    #1      121     126     116     123     122     130
              Mrr     .445    .464    .473    .458    .463    .473
              List    .163    .158    .179    .159    .163    .168

2.2.3. A Word about Coupled Retrieval

One last experiment we tried was the coupling of these two prefilters. The hope was that if each could give an improvement, then perhaps in combination the improvement would increase. Unfortunately, this was not the case. It turns out that the approaches are somewhat at odds with each other. The phrase-filtering approach attempts to ensure that only documents that contain some or all of the important question phrases are retained, while the Google-assisted approach attempts to look for documents that might have terminology that was not in the original question. If one applies a phrase-based filtering system to documents that have been obtained by the Google-assisted process, the likelihood is that even fewer of the documents than before will actually have the appropriate terminology. We tried several other variations on this theme after the actual TREC submission, but no combination yielded an improvement in overall performance.

2.3. Beginning to Incorporate Statistical QA

In the past few TREC evaluations, there has been an emergence of statistical QA systems which have the property that they learn relations between the posed question, the answer passage, and the actual answer. Then, when a user poses a future question, various answer passages are evaluated using statistical or machine-learning processes to determine how likely they are to contain a needed answer. As a final step, the system must distill out the answer from the existing passage. Statistical learners are particularly appealing in that they hold the potential capability of developing language-independent QA. For such capability, one need only provide question-answer pairs for training.

We began in 2005 to develop a statistical QA system. The infrastructure for this system was in place by the time of the TREC-2006 evaluation, and the system had begun to be taught how to automatically answer a limited number of questions, so we thought we would couple it with the existing system and allow it to answer those few question structures that it was equipped to address. Since this system is new and emerging, we provide a bit more information about the process and ways that we attempted to exploit the process during 2006.

2.3.1. From Document Selection to Passage Selection

The first step of the process of developing a statistical system was to move from mere document selection to some form of passage selection. The first 45 documents reported from the Lemur IR system were screened to identify sentences (and sometimes surrounding sentences) which reflected the information from the important noun words of the question. The first 80 sentences that satisfied the criteria were retained and the sentence selection process was terminated. The 45-document and 80-sentence limits were determined to be empirically optimal thresholds.

The next goal was to order these sentences by their potential for answering the question. The reordering component was based on support vector machines (SVMs). The reordering effort was treated as a two-way classification problem -- a customary domain for SVMs. The classifier was based on 26-dimensional feature vectors that were drawn from the data. Examples of these features were: (a) reporting 1.0 if the direct object from the question was in the putative answer sentence and 0.25 otherwise; (b) reporting 1.0 if the direct object from the question was missing but a WordNet synonym was in the putative answer sentence, and 0.25 otherwise; (c) reporting the ratio of question words to putative answer sentence words; and so forth. The classifier was then presented with positive examples of feature vectors drawn from actual answer sentences of past-year TRECs, and it was also presented with a comparable number of negative examples drawn from bogus sentences.

From these training examples, the system was taught with quite good accuracy to learn the difference between good question-answering sentences and poor ones. In fact, for questions where the initial IR actually captured relevant documents, the percentage of true answer sentences identified by the SVM on a held-out TREC QA collection was: 30% in the top-1 sentence, 43% in the top-2 sentences, 59% in the top-5 sentences, 74% in the top-10 sentences, and 80% in the top-15 sentences. It seemed highly likely that this process could afford a dramatic improvement in overall performance.
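In outline, the reranker pairs a hand-crafted feature extractor with a standard two-class SVM and sorts candidate sentences by the classifier's decision value. The sketch below is our own illustration (using scikit-learn rather than whatever SVM package was actually used, and showing only three of the 26 features):

    from sklearn.svm import SVC

    def sentence_features(question, sentence, direct_object, synonyms):
        # Three of the 26 features described above (illustrative only).
        sent_words = sentence.lower().split()
        f_dobj = 1.0 if direct_object and direct_object.lower() in sent_words else 0.25
        f_syn = 1.0 if f_dobj == 0.25 and any(s.lower() in sent_words for s in synonyms) else 0.25
        f_ratio = len(question.split()) / max(1, len(sent_words))
        return [f_dobj, f_syn, f_ratio]

    def train_reranker(pos_vectors, neg_vectors):
        # Positives come from true answer sentences of past TRECs, negatives
        # from a comparable number of bogus sentences.
        X = pos_vectors + neg_vectors
        y = [1] * len(pos_vectors) + [0] * len(neg_vectors)
        return SVC(kernel="rbf").fit(X, y)

    def rerank(svm, candidate_vectors, sentences):
        # Higher decision values first.
        scores = svm.decision_function(candidate_vectors)
        return [s for _, s in sorted(zip(scores, sentences), reverse=True)]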
2.3.2. Pulling out the Answers from the Sentences

The next issue was to extract the answer from the answer sentences. One strategy was to insert these sentences into the existing question-answerer and hope it could distill out the answer. This process yielded a 20% relative degradation in performance, due largely to the fact that the current system requires itself to find all relevant components of questions, whereas the best sentence may have the answer but not all relevant question components (needed for supportedness). Although we will attempt to modify the base system during the remainder of 2006 and 2007 to tackle the problem, there was also a desire to get a fully-statistical QA system.

We have yet to develop a full statistical learning process for finding answers, but we did begin a simple and potentially language-independent process for answering the questions. Using a copy of Wikipedia, we first used named-entity matching and part-of-speech tagging to see if we could draw out an answer directly from the Wiki pages. Barring this, we looked for redundant but normally rare information from the SVM sentences and, if it existed, this information was returned as the answer.
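One way to read that last heuristic (our interpretation; the exact scoring is not spelled out here) is as a redundancy-times-rarity vote over the SVM-ranked sentences:

    import math
    from collections import Counter

    def redundant_rare_answer(ranked_sentences, question_words, corpus_df, num_docs,
                              min_sentences=2):
        # Return a term that recurs across the top-ranked sentences (redundant)
        # yet is rare in the corpus overall -- one plausible reading only.
        support = Counter()
        for sentence in ranked_sentences:
            for w in set(sentence.lower().split()):
                if w not in question_words:
                    support[w] += 1              # number of sentences containing w
        best, best_score = None, 0.0
        for w, count in support.items():
            if count < min_sentences:
                continue                         # not redundant enough
            rarity = math.log(num_docs / (1 + corpus_df.get(w, 0)))
            score = count * rarity
            if score > best_score:
                best, best_score = w, score
        return best                              # None if nothing qualifies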
3. SYSTEM EVALUATIONS

3.1. Description of Results

In TREC-2006, we submitted three runs from among the various configurations at our disposal. All of our runs used the same "Other" processing as in TREC-2005 (except that this system was slightly more robust to failure than last year). Also, in each situation, the system reported the top 20 answers from our factoid system as the "list" response. In terms of factoids, the first of these runs made use of our base engine, but its information retrieval phase was prefiltered using phrase-based filtering as mentioned before. The second system replaced phrase-based filtering with our Google-enhanced information retrieval efforts. The third system was the same as the second, but whenever the statistical system deemed itself able to answer the question, it would supplant the original answer with its own. Since the statistical system is in its infancy, there were very few answers that it actually supplanted.

The results of these runs are detailed in Table 4. Under "Factoid," the number of correct answers is listed and is followed by the triple (unsupported, inexact, temporally-incorrect) and by the fraction of first-place answers. Under the "List" and "Other" columns are the NIST-reported F-scores. The "All" category is the average of the three preceding columns, which represents the official NIST score. To our surprise, none of these variations provided significantly different results in the "All" category. However, it seems clear that factoids were negatively impacted by Google-enhanced IR as compared to phrase-based filtering, and the opposite is true for lists.

Table 4: TREC 2006 Performance
Strategy                                      Factoid                  List    Other   All
Phrase-filtered IR + improved QA [#1]         107 (10/20/5) .266       .147    .148    .185
Google-enhanced IR, improved QA [#2]           95 (14/22/4) .236       .156    .151    .181
Google-enhanced IR, improved QA,
  some statistical QA [#3]                     96 (14/22/4) .238       .156    .154    .183

3.2. Comparison to Past Years

With these scores seeming to be only marginally better than they were last year, we wanted to determine if there had actually been any true system improvements since last year. We were able to identify our best competition systems from 2004 and 2005 and conduct a small experiment to test for system improvements by applying these past systems to this year's data. Our experiments would focus solely on factoids, and we would not identify questions for which an assessor, had he or she seen the output of the older system, may have judged an answer as correct. Likewise, whereas we had some parsing problems in years past, we would allow the systems to directly access the new and updated parses of today (since the broken and/or empty parses have long since been removed). It was our expectation that these two oversights would likely balance each other and provide a fairly accurate comparison of past-year performance to that of the current year.

Additionally, since past-year systems were not concerned with temporally inaccurate answers, and since "unsupported" is difficult to truly judge without the input of an assessor, we scored the three systems by allowing what TREC-2006 assessors had declared to be "Right," "Locally Correct," and "Correct but Unsupported." Table 5 provides the performance comparisons.

Table 5: TREC 2006 vs. Past Years
TREC Year                    Factoid #R+L+U / "Correct" Score
TREC-2006 System             122 / .303
TREC-2005 System              95 / .236
TREC-2004 System              53 / .132
(TREC-2006 w/o Filter)       (116 / .288)

Table 5 shows a gratifying result. From the first three rows we see that there have been reasonable improvements to our system over the course of the past two years.

The last row of Table 5 is merely informational. Since we were only able to submit three runs to TREC, we were not able to determine the impact of our basic system improvements as opposed to those that were coupled with IR improvements. With the incorporation of the last row, we are able to see that the basic system improvements contributed about 21 more right answers and that IR contributed 6 beyond that. We did not run the experiment of enhanced IR without the basic additions to the system, but while we were originally developing the algorithms, this paradigm was tested, and it was typically the case that IR improvements and basic improvements had 1-3 correct answers in common.
3.3. Considering the "Other"s

At TREC-2005, the F=0.248 that our system generated was the maximum score for any of the "Other" question answerers. This year, our F-score dropped by an absolute 10% and our position fell to just above the median. If we had made changes to our "Other" answerer, we would have believed that we had simply put forth a poor effort in making changes. On the other hand, since the only change we made was to add a stage which would thrice ensure against empty answers, we had to seek to understand why the performance had fallen as it did.

At TREC-2004, our first year of participation, our system received a very high score and only 1/4 of the answers were given zero credit (with half of these due to non-responsiveness of our system). In this year, though, half of our answers were given no credit even though our method for "Other"-answering is sort of a "kitchen sink" approach which reports tons of information. We therefore reviewed the first ten of these zero-scored answers to see what had changed.

Reviewing the "Other" questions is a non-trivial task. The core issue with these is not knowing how the determination is made as to whether something is vital or not. It appeared this year that since there were so many questions asked from each series, the "Other" questions had little information to choose from that was both novel and vital. Even so, there are three situations that arise in giving a system a zero score in such an evaluation: (a) the QA system did not return items that assessors found to be valuable, (b) the QA system did return such items and received no credit, and (c) the QA system produced items that these assessors deemed to be non-vital but that other assessors might have perceived as nuggets. Since the task is subjective, an evaluation of "c" is not a particularly helpful direction to study. Yet we will touch briefly on the first two of these given our in-depth review of the ten zero-scored answers that we studied. (It should be noted for reference, though, that there is some subjectivity which is inconsistent: in 164.8, credit is given for the fact that Judi Dench was awarded the Order of the British Empire, but credit was not given in 153.8 for Alfred Hitchcock's receiving of higher honors as a Knight Commander.)

By far the biggest problem for us with category "a" above was that what were being deemed as nuggets this year were largely less-important pieces of information which surfaced to the top as vitals because more interesting information was posed as questions. If one were to ask a system "Tell me everything interesting and unique you can about Warren Moon" (Q141.8), one would expect to receive information about his life profile: birth, death (if appropriate), his profession, and other major successes. Since the Q141 series asks about his position on a football team, the college where he played ball, his birth year, his time as a pro bowler, his coaches, and the teams on which he played, the remaining relevant information must address his successes and possibly his death. Since he has not died, only successes remain. Thus, the facts that he had the third all-time highest career passing yards and a 15-year football career are noteworthy. Even so, our IR prefilter rejected one of the documents containing one of these items, and the other item appeared in a 17th-place document ... deeper than our "Other" processing typically goes.

Further reviewing in the "a" arena, we noted that vital nuggets for 152.7 (Mozart) appeared in our 17th and 60th documents; a vital for 163.8 (Hermitage) was not in our top 100; the sole vital for 164.8 was in 47th place; the two vitals from 175.5 were in 18th place and non-top-100; and so forth. The absence of such information in the higher-ranked IR documents was obviously a leading contributor to our reduced scores. In past years, there were fewer questions asked per series, and many of those questions did not focus so much on key events of people and organizations but were focused more on exercising the QA systems. These facts seem to be the primary reason why our "Other" results were so drastically degraded.

However, there are a few instances of "b" occurring as well. That is, our system reported vital information that was overlooked. In 163.8, our system reported without credit that the Hermitage was "comparable to the Louvre in Paris or the Metropolitan Museum in New York," which was a specified vital nugget. Such issues are less frequent, though, and they are not unexpected given that our system reports tons of information as answers. Furthermore, if the answers we perused are indicative, the issue of vitals not receiving credit would probably contribute less than .05 absolute to the current F-score.
Such issues are less 189.7 Edinburgh Needed city of JK Rowling in frequent, though, and they are not unexpected given that [NYT20000 2000. Document states that she our system reports tons of information as answers. Fur- 112.0203] lived in Edinburgh in 1993. thermore, if the answers we perused are indicative, the Unclear if she lived there in 2000. issue of vitals not receiving credit would probably con- 190.1 PITTS- Needed city of HJ Heinz. The tribute less that .05 absolute to the current F-score. BURGH byline gives Pittsburgh and dis- [APW19990 cusses company profits, but does 3.4 Unsupportedness 615.0036] not explicitly say its base as there. As mentioned previously, the large number of answers from our system that assessors were tagging as “unsup- 191.3 Germany Neededcountrythatwonfirstfour ported”seemedsomewhatsuspicioustousgiventhatour [XIE199906 IRFR World Cups. Document system does not draw its answers from the Web. We 20.0031] mentions “Germany” and “four” sought to review the answers being proposed by our sys- but not that Germany won 4 times. tem and determine what the unsupported issues were. 194.3 six Need number of players at 1996 First, based on the cumulative information provided [XIE199603 WorldChessSuperTournament. for all 59 competing system runs, we were able to deter- 20.0094] Documentmentions6players,but mine that the average run had 12 answers that were wrong tournament and game. declared to be inexact. We looked at our highest-scored 214.6 seven Need number of Miss America factoid run and noted that we had 10 apparently unsup- [NYT20000 pageant judges. Document is on ported answers. Although 10 was less than the average, 717.0370] Miss Texas Scholastic Pageant we still wanted to understand the issues. We reviewed which had 7 judges. each answer and found that all the answers were indeed unsupported (and possible inexact as well). The table below summarizes this information: the question number 4 FUTURE DIRECTIONS (QID), the answer our system reported, and the reason The future of QACTIS still holds a direction of multilin- why the answer was unsupported. gual and multimedia question-answering as a primary goal. YetweanticipatefutureparticipationinTRECnext yearuntilwehaveironedoutthewrinklesinoursystem. Our focus on textual QA for the next year will be to addresstheissuesthathaveyettobecompletedbutwhat were mentioned in this paper, such as improvements and exactnessfilteringusingmoremoderncontentextractors, better incorporation of hypernyms, and making improve- mentstoourstatisticalQAsystem. Wealsoplantomake our baseline system cleaner and more robust. 5 REFERENCES [1] Schone, P., Ciany, G., McNamee, P., Kulman, A., Bassi, T. , "Question Answering with QACTIS at TREC 2004” The 13th Text Retrieval Conference (TREC-2004),Gaithersburg,MD.NISTSpecialPub- lication 500-261,2004. [2] Schone, P., Ciany, G., Cutts, R.., McNamee, P., May- field, J., Smith, T. , "QACTIS-based Question Answering at TREC-2005” The 14th Text Retrieval Conference (TREC-2005), Gaithersburg, MD, ,2005. [3]BBN’s SERIFTM engine. [4]MillerG.A.,BeckwithR.,FellbaumC.,GrossD.,and Miller K. J. “WordNet: An online lexical database.” InternationalJournalofLexicography3(4):235-244, 1990. [5]P.Schone,J.Townsend,C.Olano,T.H.Crystal. “Text Retrieval via Semantic Forests.” TREC-6, Gaithers- burg, MD. NIST Special Publication 500-240, pp. 761-773, 1997 [6] www.wikipedia.org [7] Snow, R., Jurafsky, D., and Ng, A.Y., "Learning syn- tactic patterns for automatic hypernym discovery". NIPS 2004. [8] The LEMUR System. 
URL: http://www- 2.cs.cmu.edu/~lemur
