
Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL PDF

493 Pages·2016·5.91 MB·English

Preview Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL

Webbots, Spiders, and Screen Scrapers
by Michael Schrenk

Publisher: No Starch
Pub Date: March 15, 2007
Print ISBN-10: 1-59327-120-4
Print ISBN-13: 978-1-59327-120-6
Pages: 328

Overview

The Internet is bigger and better than what a mere browser allows. Webbots, Spiders, and Screen Scrapers is for programmers and businesspeople who want to take full advantage of the vast resources available on the Web. There's no reason to let browsers limit your online experience, especially when you can easily automate online tasks to suit your individual needs.

Learn how to write webbots and spiders that do all this and more:

● Programmatically download entire websites
● Effectively parse data from web pages
● Manage cookies
● Decode encrypted files
● Automate form submissions
● Send and receive email
● Send SMS alerts to your cell phone
● Unlock password-protected websites
● Automatically bid in online auctions
● Exchange data with FTP and NNTP servers

Sample projects using standard code libraries reinforce these new skills. You'll learn how to create your own webbots and spiders that track online prices, aggregate different data sources into a single web page, and archive the online data you just can't live without. You'll learn inside information from an experienced webbot developer on how and when to write stealthy webbots that mimic human behavior, tips for developing fault-tolerant designs, and various methods for launching and scheduling webbots. You'll also get advice on how to write webbots and spiders that respect website owner property rights, plus techniques for shielding websites from unwanted robots.

As a bonus, visit the author's website to test your webbots on sample target pages, and to download the scripts and code libraries used in the book.

Some tasks are just too tedious (or too important!) to leave to humans.
Once you've automated your online life, you'll never let a browser limit the way you use the Internet again.

Table of Contents

Dedication
ACKNOWLEDGMENTS
Introduction

FUNDAMENTAL CONCEPTS AND TECHNIQUES

WHAT'S IN IT FOR YOU?
  Uncovering the Internet's True Potential
  What's in It for Developers?
  What's in It for Business Leaders?
  Final Thoughts
IDEAS FOR WEBBOT PROJECTS
  Inspiration from Browser Limitations
  A Few Crazy Ideas to Get You Started
  Final Thoughts
DOWNLOADING WEB PAGES
  Think About Files, Not Web Pages
  Downloading Files with PHP's Built-in Functions
  Introducing PHP/CURL
  Installing PHP/CURL
  LIB_http
  Final Thoughts
PARSING TECHNIQUES
  Parsing Poorly Written HTML
  Standard Parse Routines
  Using LIB_parse
  Useful PHP Functions
  Final Thoughts
AUTOMATING FORM SUBMISSION
  Reverse Engineering Form Interfaces
  Form Handlers, Data Fields, Methods, and Event Triggers
  Unpredictable Forms
  Analyzing a Form
  Final Thoughts
MANAGING LARGE AMOUNTS OF DATA
  Organizing Data
  Making Data Smaller
  Thumbnailing Images
  Final Thoughts

PROJECTS

PRICE-MONITORING WEBBOTS
  The Target
  Designing the Parsing Script
  Initialization and Downloading the Target
  Further Exploration
IMAGE-CAPTURING WEBBOTS
  Example Image-Capturing Webbot
  Creating the Image-Capturing Webbot
  Further Exploration
  Final Thoughts
LINK-VERIFICATION WEBBOTS
  Creating the Link-Verification Webbot
  Running the Webbot
  Further Exploration
ANONYMOUS BROWSING WEBBOTS
  Anonymity with Proxies
  The Anonymizer Project
  Final Thoughts
SEARCH-RANKING WEBBOTS
  Description of a Search Result Page
  What the Search-Ranking Webbot Does
  Running the Search-Ranking Webbot
  How the Search-Ranking Webbot Works
  The Search-Ranking Webbot Script
  Final Thoughts
  Further Exploration
AGGREGATION WEBBOTS
  Choosing Data Sources for Webbots
  Example Aggregation Webbot
  Adding Filtering to Your Aggregation Webbot
  Further Exploration
FTP WEBBOTS
  Example FTP Webbot
  PHP and FTP
  Further Exploration
NNTP NEWS WEBBOTS
  NNTP Use and History
  Webbots and Newsgroups
  Further Exploration
WEBBOTS THAT READ EMAIL
  The POP3 Protocol
  Executing POP3 Commands with a Webbot
  Further Exploration
WEBBOTS THAT SEND EMAIL
  Email, Webbots, and Spam
  Sending Mail with SMTP and PHP
  Writing a Webbot That Sends Email Notifications
  Further Exploration
CONVERTING A WEBSITE INTO A FUNCTION
  Writing a Function Interface
  Final Thoughts

ADVANCED TECHNICAL CONSIDERATIONS

SPIDERS
  How Spiders Work
  Example Spider
  LIB_simple_spider
  Experimenting with the Spider
  Adding the Payload
  Further Exploration
PROCUREMENT WEBBOTS AND SNIPERS
  Procurement Webbot Theory
  Sniper Theory
  Testing Your Own Webbots and Snipers
  Further Exploration
  Final Thoughts
WEBBOTS AND CRYPTOGRAPHY
  Designing Webbots That Use Encryption
  A Quick Overview of Web Encryption
  Local Certificates
  Final Thoughts
AUTHENTICATION
  What Is Authentication?
  Example Scripts and Practice Pages
  Basic Authentication
  Session Authentication
  Final Thoughts
ADVANCED COOKIE MANAGEMENT
  How Cookies Work
  PHP/CURL and Cookies
  How Cookies Challenge Webbot Design
  Further Exploration
SCHEDULING WEBBOTS AND SPIDERS
  The Windows Task Scheduler
  Complex Schedules
  Non-Calendar-Based Triggers
  Final Thoughts

LARGER CONSIDERATIONS

DESIGNING STEALTHY WEBBOTS AND SPIDERS
  Why Design a Stealthy Webbot?
  Stealth Means Simulating Human Patterns
  Final Thoughts
WRITING FAULT-TOLERANT WEBBOTS
  Types of Webbot Fault Tolerance
  Error Handlers
DESIGNING WEBBOT-FRIENDLY WEBSITES
  Optimizing Web Pages for Search Engine Spiders
  Web Design Techniques That Hinder Search Engine Spiders
  Designing Data-Only Interfaces
KILLING SPIDERS
  Asking Nicely
  Building Speed Bumps
  Setting Traps
  Final Thoughts
KEEPING WEBBOTS OUT OF TROUBLE
  It's All About Respect
  Copyright
  Trespass to Chattels
  Internet Law
  Final Thoughts
PHP/CURL REFERENCE
  Creating a Minimal PHP/CURL Session
  Initiating PHP/CURL Sessions
  Setting PHP/CURL Options
  Executing the PHP/CURL Command
  Closing PHP/CURL Sessions
STATUS CODES
  HTTP Codes
  NNTP Codes
SMS EMAIL ADDRESSES
Colophon
Index

WEBBOTS, SPIDERS, AND SCREEN SCRAPERS. Copyright © 2007 by Michael Schrenk.

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

Printed on recycled paper in the United States of America
11 10 09 08 07    1 2 3 4 5 6 7 8 9

ISBN-10: 1-59327-120-4
ISBN-13: 978-1-59327-120-6

Publisher: William Pollock
Production Editor: Christina Samuell
Cover and Interior Design: Octopod Studios
Developmental Editors: Tyler Ortman and William Pollock
Technical Reviewer: Peter MacIntyre
Copyeditor: Megan Dunchak
Compositors: Megan Dunchak, Riley Hoffman, and Christina Samuell
Proofreader: Stephanie Provines
Indexer: Nancy Guenther

For information on book distributors or translations, please contact No Starch Press, Inc. directly:

No Starch Press, Inc.
555 De Haro Street, Suite 250, San Francisco, CA 94107
phone: 415.863.9900; fax: 415.863.9950; [email protected]; www.nostarch.com

Library of Congress Cataloging-in-Publication Data

Schrenk, Michael.
Webbots, spiders, and screen scrapers : a guide to developing internet agents with PHP/CURL / Michael Schrenk.
p. cm.
Includes index.
ISBN-13: 978-1-59327-120-6
ISBN-10: 1-59327-120-4
1. Web search engines. 2. Internet programming. 3. Internet searching. 4. Intelligent agents (Computer software) I. Title.
TK5105.884.S37 2007
025.04--dc22
2006026680

No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The information in this book is distributed on an "As Is" basis, without warranty. While every precaution has been taken in the preparation of this work, neither the author nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.
Dedication

In loving memory
Charlotte Schrenk
1897–1982
