Internet Fish

by
Brian A. LaMacchia

Artificial Intelligence Laboratory
and
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

Abstract

I have invented "Internet Fish," a novel class of resource-discovery tools designed to help users extract useful information from the Internet. Internet Fish (IFish) are semi-autonomous, persistent information brokers; users deploy individual IFish to gather and refine information related to a particular topic. An IFish will initiate research, continue to discover new sources of information, and keep tabs on new developments in that topic. As part of the information-gathering process the user interacts with his IFish to find out what it has learned, answer questions it has posed, and make suggestions for guidance.

Internet Fish differ from other Internet resource discovery systems in that they are persistent, personal and dynamic. As part of the information-gathering process IFish conduct extended, long-term conversations with users as they explore. They incorporate deep structural knowledge of the organization and services of the net, and are also capable of on-the-fly reconfiguration, modification and expansion. Human users may dynamically change the IFish in response to changes in the environment, or an IFish may initiate such changes itself. Each IFish maintains internal state, including models of its own structure, behavior, information environment and user; these models permit an IFish to perform meta-level reasoning about its own structure.

To facilitate rapid assembly of particular IFish I have created the Internet Fish Construction Kit. This system provides enabling technology for the entire class of Internet Fish tools; it facilitates both the creation of new IFish and the addition of new capabilities to existing ones.
The Construction Kit includes a collection of encapsulated heuristic knowledge modules that may be combined in mix-and-match fashion to create a particular IFish; interfaces to new services written with the Construction Kit may be immediately added to "live" IFish.

Using the Construction Kit I have created a demonstration IFish specialized for finding World-Wide Web documents related to a given group of documents. This "Finder" IFish includes heuristics that describe how to interact with the Web in general, explain how to take advantage of various public indexes and classification schemes, and provide a method for discovering similarity relationships among documents.

Thesis Supervisor: Gerald J. Sussman
Matsushita Professor of Electrical Engineering

This report is a revised version of a thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology in May, 1996.

Notice of Copyright and Terms of Limited License

This technical report, including all figures, tables and code fragments, is Copyright © 1996 Brian A. LaMacchia. Country of first publication: United States of America. All rights granted to the author in accordance with 17 USC §§101 et seq. are hereby reserved.

Pursuant to 17 USC §201(d)(2), the author hereby grants to the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology (hereinafter "the AI Lab") certain nonexclusive, non-transferable, limited rights related to the copyright of this document:

1. The AI Lab may reproduce paper copies of this technical report for use within the MIT community for educational or research purposes (an action which is an exclusive right of the copyright holder under 17 USC §106(1)).

2. The AI Lab may reproduce paper copies of this technical report and distribute such copies to the public (an action that is an exclusive right of the copyright holder under 17 USC §106(1) and 17 USC §106(3)) so long as no fee is charged for such copies in excess of the actual cost of making the copy.

3. All copies of this technical report made by the AI Lab under this license must include a copy of this copyright notice and license.

4. All other uses of this technical report within the scope of the exclusive rights of the copyright holder as specified in 17 USC §106 are reserved by the author, and any action by the AI Lab that infringes any of those exclusive rights, except as explicitly granted above, requires the expressed written consent of the author.

Acknowledgments

This thesis could not have been completed without the support of a great many people. I wish to take this opportunity to express my appreciation for their help in bringing this thesis to a successful conclusion.

First, my sincerest thanks to those whose work became part of the Internet Fish. Steven Adams wrote the Scheme code to support s-expression HTML. Michael ("Ziggy") Blair provided the demonstration problem for the Finder IFish. The Architext index and search engine (now called "Excite for Web servers") was provided courtesy of Excite, Inc. The HTTP proxy server code was written by CERN.

I am deeply grateful to all my friends at Silverglate and Good for their overwhelming generosity, which permitted me to work from home this past year and write this thesis. Thanks especially to Harvey Silverglate, Andy Good, Sandie Fennell, Dana Gurwitch, and Gia Barresi.

For the past year I have been fortunate in having a retreat up Massachusetts Ave. to which I could escape when I needed to do something other than thesis. I thank Professors Arthur Miller and Charles Nesson of Harvard Law School for graciously allowing me to audit Copyright during the Fall term of 1995 and Law, Internet and Society during the Spring term of 1996.
My thanks also to the students in both classes who helped make those classes the most fun I’ve had in a classroom since I entered graduate school at MIT.

My legal education began at Silverglate and Good and Zalkind, Rodriguez, Lunt and Duncan; in addition to those already mentioned above I’d like to thank Sharon Beckman, Phil Cormier, Jason Gull and Daffodil Tyminski of S&G and David Duncan of ZRL&D for their help leading me through the finer points of the law.

When I wasn’t working on IFish or reading case law, I was helping others overcome the wonders of modern hardware and software. Thanks much to all the clients of LaMacchia Computer Consulting: Tricia Prevett at KHJ Integrated Marketing (formerly KelleyHabibJohn Marketing and Advertising), Hearst New Media, Terry Ehling at The MIT Press, Errol Morris at Fourth Floor Productions, Jay Lupica at Buyers Advantage, and John Habib at Alexander Mortgage.

For the past four years I have been fortunate to be part of the local fundraising efforts of St. Jude Children’s Research Hospital. Thanks to all the people involved in making TomorrowNite ’93-’96 happen, especially Paul and Jane Ayoub, Joe and Christa Ayoub, and Steven and Karen Salhaney.

For almost ten years I have worked as a member of Project MAC (the Project on Mathematics and Computation) at the MIT Artificial Intelligence Laboratory. What made that group such a great place to work in the years gone by were the people who inhabited it: Michael Blair, Liz Bradley, Mike Eisenberg, Arthur Gleckler, Philip Greenspun, Kleanthes Koniaris, Bill Rozas, Thanos Siapas, Jason Wilson and Henry Wu. Together they formed a very special collection of people, one with which I was proud to have been associated.

My thanks to Ellen Spertus, David LaMacchia, Jim Miller and Gerald Sussman for providing comments on early drafts of this thesis.

For most of my graduate career I was fortunate to have been supported by an AT&T Foundation PhD Fellowship.
My thanks to the AT&T Foundation and the former AT&T Bell Laboratories for that financial support. The majority of the work in this thesis was funded by this fellowship. Portions of this thesis were also supported by the Open Software Foundation Research Institute. Portions of this technical report were supported by a Packard Fellowship in Science and Engineering from the David and Lucile Packard Foundation.

Gerald J. Sussman, Matsushita Professor of Electrical Engineering, supervised this thesis. Harold Abelson, Class of 1922 Professor of Computer Science and Engineering, and James Miller, World Wide Web Consortium and MIT Laboratory for Computer Science, served as readers on my thesis committee.

There are others whose actions contributed to the completion of this thesis at this time, in this manner. While their actions must be acknowledged, it does not seem appropriate to do so here in this place. I commend to readers interested in those stories my forthcoming book, Defending Dave, and other tales of the ’Net, which shines the bright light of truth and public scrutiny in a number of dark corners.

This thesis describes research conducted at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support for the Laboratory’s artificial intelligence research is provided in part by the Advanced Research Projects Agency of the Department of Defense under Office of Naval Research contract N00014-92-J-4097.

This one’s dedicated to a lot of people.

In memory of my father’s parents
Hon. Otto H.
and Dinah LaMacchia
(I know the Judge approves of what I have done)

and

In honor of my mother’s parents
Gerald and Leona Tigar
(who have been my home away from home here in Boston)

For my parents,
Robert and Sherry LaMacchia
who have always given of themselves so that I might have
the best possible education, whatever the price

For all my friends who stood by me in the darkness,
keeping alive the flickering flame of hope:
Ziggy, who taught me how to argue,
Arthur, who opened my eyes to life outside the lab,
Russ, who taught me how to keep score in a ballgame,
Liz, who taught me how to climb mountains, real and metaphorical,
Retta, who always had an ear, or advice, or just a shoulder to cry on,
Philip, who showed me Lexis, Westlaw, and the path to Harvard Law School,
Bill, who taught me to stand by my principles and beliefs, no matter the cost,
and, most especially,
Henry, my mentor, office mate, friend, confidant and drinking buddy,
who taught me how to appreciate wine, route network cable, design circuits,
give of myself to charity, and hack, be it code, restaurants or hotels,

but, most of all,

For Dave,
who had to endure what no man should ever have to endure,
and in doing so showed a depth and strength of character,
conviction and sheer will that I can only hope to equal.

Contents

1 Introduction
  1.1 The Internet: Evolution in Action
  1.2 Resource Discovery on the Internet
    1.2.1 Jurassic Net – FTP and Usenet
    1.2.2 Gopher and other Campus-Wide Information Systems (CWIS)
    1.2.3 The World Wide Web
    1.2.4 Indexing Local Filesystems
    1.2.5 Client-side Approaches
  1.3 The Need for Something More – the Internet Fish
    1.3.1 Heuristic Knowledge
    1.3.2 Long-Term Conversations
    1.3.3 Serendipitous Resource Discovery
    1.3.4 Other Goals and Limitations
  1.4 The Road Ahead

2 Encapsulating Heuristic Knowledge
  2.1 Claims and Assumptions
  2.2 Infochunks
  2.3 Operations: Transducers and Rules
  2.4 Infochunk-rule Interactions and Supporting System Software
    2.4.1 The Interaction Loop
    2.4.2 Events
    2.4.3 Error Handling and Recovery
    2.4.4 Resource Management

3 User Interactions and Interestingness
  3.1 User Interaction
    3.1.1 Questions and Answers
    3.1.2 System Support
    3.1.3 Putting It All Together
    3.1.4 Ordering Questions
  3.2 Interestingness
    3.2.1 Design Goals
    3.2.2 Prototype Implementation of Interestingness

4 The "Finder" IFish
  4.1 Building an IFish that Finds Web Pages "Like These"
  4.2 Heuristic Knowledge in the Finder IFish
    4.2.1 Heuristics to Find New Sources of Information
    4.2.2 Heuristics to Look For Relationships Among Retrieved Objects
  4.3 Querying the User to Refine the Search
  4.4 Approximating Interestingness of Web Pages
  4.5 A Session with the IFish

5 The Future of IFish
  5.1 Evaluating IFish Performance
  5.2 Future Work
    5.2.1 Straight-line Improvements
    5.2.2 Self-analysis
    5.2.3 Inter-IFish Communication
    5.2.4 IFish in Other Information Oceans
    5.2.5 Toward Serendipitous Resource Discovery
  5.3 IFish and the Future of Information Markets
    5.3.1 The Marginal Cost of Content
    5.3.2 The Marginal Price of Time
    5.3.3 Selling Time: the Next Layer of the "Internet Wars"
  5.4 Conclusion

List of Tables

4.1 Results of an Architext concept search. The listed filename is the IFish-generated filename of a locally-cached copy of the document. The document title is the HTML-tagged title in the document, if one exists.
4.2 User evaluation of top documents found by the Finder IFish.