NAIST-IS-DT0161021 Doctor’s Thesis Studies on Improving Retrieval Accuracy in Web Information Retrieval KAZUNARI SUGIYAMA February6, 2004 Department ofInformationSystems GraduateSchoolofInformationScience Nara Institute ofScienceandTechnology Doctor’sThesis submitted toGraduateSchoolofInformationScience, Nara Institute ofScienceandTechnology inpartial fulfillmentofthe requirementsforthe degreeof DOCTORofENGINEERING KAZUNARISUGIYAMA Thesiscommittee: ShunsukeUemura,Professor HiroyukiSeki,Professor YujiMatsumoto, Professor MasatoshiYoshikawa,Professor (NagoyaUniversity) Studies on Improving Retrieval Accuracy in Web Information Retrieval (cid:3) KAZUNARI SUGIYAMA Abstract With the rapid spread of the Internet, anyone can easily access a variety of infor- mation on the World Wide Web (WWW). Since information resources on the WWW continue to grow, it has become increasingly difficult for users to find relevant infor- mationontheWWW.Inthissituation,Websearchenginesareoneofthemostpopular methods for finding valuable information effectively. However, users are not satisfied withtheretrievalaccuracy ofsearchengines. Weconsiderthatthefollowingtwofacts causethisproblem: (i) accurateindexingofWebpagesby consideringtheir contentsisnotperformed, (ii) most search engines cannot provide search results that satisfy each user’s infor- mationneed. Theaimofthisthesisistoaddresstheaboveproblemsinordertoimproveretrieval accuracy in Webinformation retrieval. In information retrieval systems based on the vector space model, the TF-IDF scheme is widely used to characterize documents. However, in the case of documents withhyperlinkstructuressuchasWebpages,itisnecessarytodevelopatechniquefor representing the contents of Web pages more accurately by exploiting the contents of their hyperlinked neighboring pages. Therefore, in order to address the problem (i), we propose several approaches to refining the TF-IDF scheme for a target Web page byusingthecontentsofitshyperlinkedneighboringpages. Experimentalresultsshow (cid:3)Doctor’s Thesis, Department of Information Systems, Graduate School of Information Science, NaraInstituteofScienceandTechnology,NAIST-IS-DT0161021, February6,2004. i that,generally,moreaccuratefeature vectorsofatarget Webpagecanbegeneratedin the case of utilizing the contents of its hyperlinked neighboring pages at levels up to secondin thebackward directionfromthe target page. On the other hand, Web search engines help users find useful information on the WWW. However, when the same query is submitted by different users, typical search engines return the same result regardless of who submitted the query. Generally,each userhas differentinformation needs for his/her query. Hence, the search result should beadaptedtouserswithdifferentinformationneeds. Therefore,toaddresstheproblem (ii),we proposeseveralapproachesto adaptingsearchresults accordingto each user’s needforrelevantinformation. Experimentalresultsshowthatmorefine-grainedsearch systems that adapt to user’s preferences can be achieved by constructing user profiles basedon modifiedcollaborativefiltering. Webelievethat,inthefieldofWebinformationretrieval,ourproposedapproaches contribute for indexing a target Web page more accurately, and allowing each user to performmorefine-grainedsearch thatsatisfyhis/herinformationneed. Keywords: WWW, information retrieval, hyperlink, TF-IDF scheme, information filtering, rele- vancefeedback,usermodeling ii Acknowledgements I am most grateful to Professor Shunsuke Uemura for giving me a chance to study Information Retrieval in Database Labolatory. He not only gave me lots of advices derived from his vast range of knowledge but also often encouraged me. I am also grateful to Professor Masatoshi Yoshikawa at Nagoya University, Associate Professor Jun Miyazaki, Assistant Professor Kenji Hatano and Assistant Professor Toshiyuki Amagasaforgiving me alotofcommentsandadvices. IsincerelythankthemembersofthethesiscommitteeincludingProfessorHiroyuki Seki and Professor Yuji Matsumoto for reviewing this thesis. They gave me useful commentswhich Ihadnot noticed. I would like to thank Professor Hiroshi Nakagawa (he is currently at The Uni- versity of Tokyo), Associate Professor Tatsunori Mori and Dr. Koichi Yamada (he is currently at Tokyo Denki University) at Yokohama National University where I had studied in undergraduate and master course days. They gave me informative knowl- edge about information retrieval, natural language processing, image processing, and machinelearning. I appreciate Professor Shlomo Moran at Technion (Israel Institute of Technology) who gave me valuable comments at my poster presentation in VLDB2002. I also appreciate Mr. Chen Yongjun, FUJITSU Corporation, who asked me questions at DEXA2002inFrance. Itwasquiteunexpectedthat,inFrance,apersonwhoworksfor aJapanesecompanyaskedmequestions. IgratefullyappreciateDr. KrishnaBharatat Google who was my mentor of SIGIR2003. Due to a lot of his suggestive and valu- able comments, I could refine the quality of my paper although it was not accepted at SIGIR2003. Furthermore,atHyperText’03,IsincerelyappreciateDr. KevinMcCurley at IBM Almaden Research Center who gave me valuable comments for my research, Professor Andruid Kerne at Texas A&M University who highly praised my research aftermypresentation,andconferencecommitteeswhonominatedmypaperforoneof the Ted NelsonNewcomer Awardcandidates. Iwould liketo acknowledgethe contri- bution of Dr. Ayse Goker at Robert Gordon University in reviewing my paper as my mentorofSIGIR2004. ShegavemealotofinformativeadvicethatIhadnotnoticed. I would also like to express my sincere appreciation to anonymousreviewers of several conferences. Icould evolvemystudyduetoalot oftheir helpfulcomments. I also thank former andcurrent members in DatabaseLabolatory. Discussion with iii themprovidedme withnewandinnovativeideasformyresearch. I thank my best friends, Keisuke Fujimori at KAJIMA Corporation, Kunihiko Fu- tami at Olympus Corporation, Masaki Kobayashi at Seiko Epson Corporation, and YoshihisaShimizuatNipponOilCorporation. Theyencouragedmewheneverwemet together. Thanks also go to the members of swimming club in my high school days. They always have great zeal in obtaining their own goal. Their such attitudes always inspiremeto greaterefforts. Finally, I thank my family for their support and encouragement throughout my doctoralcoursedays atNara InstituteofScienceandTechnology. iv Contents Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii 1. Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 OrganizationofthisThesis . . . . . . . . . . . . . . . . . . . . . . . 3 2. BasicTechniquesinInformationRetrieval 5 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 ExtractionofIndexTerms . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.1 DocumentPreprocessing . . . . . . . . . . . . . . . . . . . . 8 2.2.2 LexicalAnalysisoftheText . . . . . . . . . . . . . . . . . . 8 2.2.3 Elimination ofStopwords . . . . . . . . . . . . . . . . . . . 9 2.2.4 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 RetrievalModels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.1 BooleanModel . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.2 VectorSpaceModel . . . . . . . . . . . . . . . . . . . . . . 14 2.3.3 ProbabilisticModel . . . . . . . . . . . . . . . . . . . . . . . 18 2.4 RetrievalEvaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4.1 Recall andPrecision . . . . . . . . . . . . . . . . . . . . . . 22 2.4.2 AlternativeMeasures . . . . . . . . . . . . . . . . . . . . . . 30 2.5 Related TechniquesofInformationRetrieval . . . . . . . . . . . . . . 31 2.5.1 RelevanceFeedback . . . . . . . . . . . . . . . . . . . . . . 31 2.5.2 InformationExtraction . . . . . . . . . . . . . . . . . . . . . 34 2.5.3 InformationFiltering . . . . . . . . . . . . . . . . . . . . . . 39 v 3. Related Work 43 3.1 Related Workon WebSearchBasedon itsHyperlinkStructures . . . 43 3.1.1 InformationRetrievalSystemsbasedonthe Conceptof “OptimalDocumentGranularity” . . . . . . . . . . . . . . . 43 3.1.2 HITSAlgorithm . . . . . . . . . . . . . . . . . . . . . . . . 44 3.1.3 PageRankAlgorithm . . . . . . . . . . . . . . . . . . . . . . 45 3.2 Related Workon WebSearchBasedon User’sPreferences . . . . . . 46 3.2.1 Hyperlink-BasedPersonalizedWebSearch . . . . . . . . . . 46 3.2.2 PersonalizedWebSites . . . . . . . . . . . . . . . . . . . . . 47 3.2.3 RecommenderSystems . . . . . . . . . . . . . . . . . . . . . 49 3.2.4 PersonalizedMultimediaSystems . . . . . . . . . . . . . . . 52 4. WebSearch Based onitsHyperlink Structures 55 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2 ProposedMethods . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.2.1 MethodI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.2.2 MethodII . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.2.3 MethodIII . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.1 ExperimentalSetup . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.2 ExperimentalResults . . . . . . . . . . . . . . . . . . . . . . 72 4.3.3 ExperimentsforFurtherImprovement . . . . . . . . . . . . . 78 4.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.4 ConclusionofthisChapter . . . . . . . . . . . . . . . . . . . . . . . 85 5. WebSearch Based onUser’s Preferences 87 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.2 ProposedMethods . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.2.1 UserProfile ConstructionBased onPure BrowsingHistory . . 91 5.2.2 User Profile Construction Based on Modified Collaborative FilteringAlgorithm . . . . . . . . . . . . . . . . . . . . . . . 93 5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.3.1 ExperimentalSetup . . . . . . . . . . . . . . . . . . . . . . . 98 5.3.2 ExperimentalResults . . . . . . . . . . . . . . . . . . . . . . 100 vi 5.3.3 ExperimentsforFurtherImprovement . . . . . . . . . . . . . 106 5.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.4 ConclusionofthisChapter . . . . . . . . . . . . . . . . . . . . . . . 114 6. Conclusions 115 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 A ClusteringAlgorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 131 A.1 K-MeansClustering . . . . . . . . . . . . . . . . . . . . . . 131 A.2 K-Nearest NeighborClustering . . . . . . . . . . . . . . . . 132 vii List of Figures 2.1 Ageneralmodelofinformationretrieval. . . . . . . . . . . . . . . . 6 2.2 Thethreeconjunctivecomponentsforthe query[Q = t ^(:t _t )]. 13 a b c 2.3 Definitionofthe Booleanmodel. . . . . . . . . . . . . . . . . . . . . 13 2.4 Definitionofthe vectorspacemodel. . . . . . . . . . . . . . . . . . . 16 2.5 VectorSpaceModel. . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.6 Definitionofthe probabilistic model. . . . . . . . . . . . . . . . . . . 19 2.7 Recall andprecisionforagiven exampleinformationrequest. . . . . . 23 2.8 Precisionat11standardrecalllevels. . . . . . . . . . . . . . . . . . . 25 2.9 Recall andprecisionforagiven exampleinformationrequest. . . . . . 27 2.10 Aprecisionhistogram. . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.11 InformationExtractiontransformingatext’sstructures. . . . . . . . . 38 2.12 Typical information extractionsystemcomponents. . . . . . . . . . . 38 2.13 Ageneralmodelofinformationfiltering. . . . . . . . . . . . . . . . . 40 4.1 The refinement of a feature vector as performed by Method I [(a) in the Webspace,(b)in thevectorspace]. . . . . . . . . . . . . . . . . 61 4.2 The refinement of a feature vector as performed by Method II [(a) in the Webspace,(b)in thevectorspace]. . . . . . . . . . . . . . . . . 65 4.3 The refinement of a feature vector as performed by Method III [(a) in the Webspace,(b)in thevectorspace]. . . . . . . . . . . . . . . . . 69 4.4 AverageprecisionbasedonMethodI–i. . . . . . . . . . . . . . . . . 73 4.5 AverageprecisionbasedonMethodI–ii. . . . . . . . . . . . . . . . . 73 4.6 Distribution ofaveragesimilarity. . . . . . . . . . . . . . . . . . . . 73 4.7 AverageprecisionbasedonMethodII–i(a). . . . . . . . . . . . . . . 74 4.8 AverageprecisionbasedonMethodII–i(b). . . . . . . . . . . . . . . 74 4.9 AverageprecisionbasedonMethodII–i(c). . . . . . . . . . . . . . . 74 viii
Description: