For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: [email protected]

©2014 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Photographs in this book were created by Martin Evans and Jordan Hochenbaum, unless otherwise noted. Illustrations were created by Martin Evans, Joshua Noble, and Jordan Hochenbaum. Fritzing (fritzing.org) was used to create some of the circuit diagrams.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editors: Elizabeth Lexleigh, Susan Conant
Copyeditor: Melinda Rankin
Proofreader: Elizabeth Martin
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617291029
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 19 18 17 16 15 14

Table of Contents

Foreword
Preface
Acknowledgments
    Trey Grainger
    Timothy Potter
About this Book
    Roadmap
    How to use this book
    Code conventions and downloads
    Author Online
    About the cover illustration

Part 1. Meet Solr

Chapter 1. Introduction to Solr
    1.1. Why do I need a search engine?
    1.2. What is Solr?
    1.3. Why Solr?
    1.4. Features overview
    1.5. Summary
Chapter 2. Getting to know Solr
    2.1. Getting started
    2.2. Searching is what it’s all about
    2.3. Tour of the Solr administration console
    2.4. Adapting the example to your needs
    2.5. Summary
Chapter 3. Key Solr concepts
    3.1. Searching, matching, and finding content
    3.2. Relevancy
    3.3. Precision and Recall
    3.4. Searching at scale
    3.5. Summary
Chapter 4. Configuring Solr
    4.1. Overview of solrconfig.xml
    4.2. Query request handling
    4.3. Managing searchers
    4.4. Cache management
    4.5. Remaining configuration options
    4.6. Summary
Chapter 5. Indexing
    5.1. Example microblog search application
    5.2. Designing your schema
    5.3. Defining fields in schema.xml
    5.4. Field types for structured nontext fields
    5.5. Sending documents to Solr for indexing
    5.6. Update handler
    5.7. Index management
    5.8. Summary
Chapter 6. Text analysis
    6.1. Analyzing microblog text
    6.2. Basic text analysis
    6.3. Defining a custom field type for microblog text
    6.4. Advanced text analysis
    6.5. Summary

Part 2. Core Solr capabilities

Chapter 7. Performing queries and handling results
    7.1. The anatomy of a Solr request
    7.2. Working with query parsers
    7.3. Queries and filters
    7.4. The default query parser (Lucene query parser)
    7.5. Handling user queries (eDisMax query parser)
    7.6. Other useful query parsers
    7.7. Returning results
    7.8. Sorting results
    7.9. Debugging query results
    7.10. Summary
Chapter 8. Faceted search
    8.1. Navigating your content at a glance
    8.2. Setting up test data
    8.3. Field faceting
    8.4. Query faceting
    8.5. Range faceting
    8.6. Filtering upon faceted values
    8.7. Multiselect faceting, keys, and tags
    8.8. Beyond the basics
    8.9. Summary
Chapter 9. Hit highlighting
    9.1. Overview of hit highlighting
    9.2. How highlighting works
    9.3. Improving performance using FastVectorHighlighter
    9.4. PostingsHighlighter
    9.5. Summary
Chapter 10. Query suggestions
    10.1. Spell-check
    10.2. Autosuggesting query terms
    10.3. Suggesting document field values
    10.4. Suggesting queries based on user activity
    10.5. Summary
Chapter 11. Result grouping/field collapsing
    11.1. Result grouping vs. field collapsing
    11.2. Skipping duplicate documents
    11.3. Returning multiple documents per group
    11.4. Grouping by functions and queries
    11.5. Paging and sorting grouped results
    11.6. Grouping gotchas
    11.7. Efficient field collapsing with the Collapsing query parser
    11.8. Summary
Chapter 12. Taking Solr to production
    12.1. Developing a Solr distribution
    12.2. Deploying Solr
    12.3. Hardware and server configuration
    12.4. Data acquisition strategies
    12.5. Sharding and replication
    12.6. Solr core management
    12.7. Managing clusters of servers
    12.8. Querying and interacting with Solr
    12.9. Monitoring Solr’s performance
    12.10. Upgrading between Solr versions
    12.11. Summary

Part 3. Taking Solr to the next level

Chapter 13. SolrCloud
    13.1. Getting started with SolrCloud
    13.2. Core concepts
    13.3. Distributed indexing
    13.4. Distributed search
    13.5. Collections API
    13.6. Basic system-administration tasks
    13.7. Advanced topics
    13.8. Summary
Chapter 14. Multilingual search
    14.1. Why linguistic analysis matters
    14.2. Stemming vs. lemmatization
    14.3. Stemming in action
    14.4. Handling edge cases
    14.5. Available language libraries in Solr
    14.6. Searching content in multiple languages
    14.7. Language identification
    14.8. Summary
Chapter 15. Complex query operations
    15.1. Function queries
    15.2. Geospatial search
    15.3. Pivot faceting
    15.4. Referencing external data
    15.5. Cross-document and cross-index joins
    15.6. Big data analytics with Solr
    15.7. Summary
Chapter 16. Mastering relevancy
    16.1. The impact of relevancy tuning
    16.2. Debugging the relevancy calculation
    16.3. Relevancy boosting
    16.4. Pluggable Similarity class implementations
    16.5. Personalized search and recommendations
    16.6. Creating a personalized search experience
    16.7. Running relevancy experiments
    16.8. Summary

Appendix A. Working with the Solr codebase
    A.1. Pulling the right version of Solr
    A.2. Setting up Solr in your IDE
    A.3. Debugging Solr code
    A.4. Downloading and applying Solr patches
    A.5. Contributing patches
Appendix B. Language-specific field type configurations
Appendix C. Useful data import configurations
    C.1. Indexing Wikipedia
    C.2. Indexing Stack Exchange

Foreword

Solr has had a long and successful history, but a major new chapter began recently with the advent of Solr 4 and SolrCloud. This is the perfect time for Solr in Action. With clear examples, enlightening diagrams, and coverage from key concepts through the newest features, Solr in Action will have you successfully using Solr in no time!

Solr was born out of necessity in 2004, at CNET Networks (now CBS Interactive), to replace a commercial search engine being discontinued by the vendor. Even though I had no formal search background when I started writing Solr, it felt like a very natural fit, because I have always enjoyed making software “go fast.” I viewed Solr more as an alternate type of datastore designed around an inverted index than as a full-text search engine, and that has helped Solr extend beyond the legacy enterprise search market.

By the end of 2005, Solr was powering the search and faceted navigation of a number of CNET sites, and soon it was made open source.
Solr was contributed to the Apache Software Foundation in January 2006 and became a subproject of the Lucene PMC (with Lucene Java as its sibling). There had always been a large degree of overlap between the Solr committers and those of Lucene (the core full-text search library used by Solr), and in 2010 the projects were merged. Separate Lucene and Solr downloads would still be available, but they would be developed by a single unified team. Solr’s version number jumped to match that of Lucene, and the releases have since been synchronized.

The recent Solr 4 release is a major milestone, adding SolrCloud, the set of highly scalable features including distributed indexing with no single points of failure. The NoSQL feature set was also expanded to include transaction logs, update durability, optimistic concurrency, and atomic updates. Solr in Action, written by longtime Solr power users and community members, Trey and Timothy, covers these important recent Solr features and provides an excellent starting point for those new to Solr.

Solr is now used in more places than I could ever have imagined, from integrated library systems to e-commerce platforms, analytics and business intelligence products, content-management systems, internet searches, and more. It’s been rewarding to see Solr grow from a few early adopters to a huge global community of helpful users and active volunteers cooperatively pushing development forward.

Solr in Action gives you the knowledge and techniques you need to use Solr’s features that have been under development since 2004. With Solr in Action in hand, you too are now well equipped to join the global community and help take Solr to new heights!

YONIK SEELEY
CREATOR OF SOLR

Preface

In 2008, I was asked to take over leadership of CareerBuilder’s search technology team.
We were using the Microsoft FAST search platform at the time, but realized that search was too important to the success of our business for us to continue relying on a commercial vendor instead of developing the domain expertise internally. I immediately began investigating open source alternatives such as Solr, which seemed to provide most of the key features needed for our products. By the summer of 2009, we decided that we were ready to bring our search expertise in-house and convert our systems to Solr. The timing was great. Lucene, the open source search library upon which Solr is built, had become a full top-level Apache project in February 2005, and Solr, which had been contributed to the Apache Software Foundation in 2006, had become a top-level Apache project in January of 2007. Both technologies were reaching critical mass and would soon be merged (in March 2010) into a unified project. By the summer of 2010, our entire platform was converted to Solr. In the process, we increased the speed of our searches, significantly reduced the number of servers necessary to support our search infrastructure, dropped expensive licensing fees, increased platform stability, and in-sourced much of the search expertise for which we had previously been dependent on a commercial vendor. Little did we know at that time how much additional value we would gain by bringing search in-house. We have been able to build entirely new suites of search-based products—from traditional keyword and semantic search, to big data analytics products, to real-time recommendation engines—utilizing Solr as a scalable search architecture to handle billions of documents and millions of queries an hour across hundreds of servers. We have entered the era of cloud services, elastic scalability, and an explosion of data that we strive to make meaningful for society, and with Solr we are able to tackle each of these challenges head-on. 
When Manning approached me about writing Solr in Action, I was hesitant because I knew it would be a large undertaking. My one requirement was that I needed a strong coauthor, and that is exactly what I found in Timothy Potter. Tim also has years of experience developing search-based solutions with Lucene and Solr. He has a wealth of expertise building text analysis systems for social data and architecting real-time analytics solutions using Solr and other cutting-edge big data technologies.

With both of us having received so much help from the Solr community over the years and with such a clear need for an example-driven guide to Solr, Tim and I are excited to be able to provide Solr in Action to help the next generation of search engineers. It’s the book we wish we’d had five years ago when we started with Solr, and we hope that you find it to be useful, whether you are just getting introduced to Solr or are looking to take your knowledge to the next level.

TREY GRAINGER