Structured search for big data : from keywords to key-objects PDF

100 Pages·2016·6.523 MB·English

Preview Structured search for big data : from keywords to key-objects

Structured Search for Big Data
From Keywords to Key-objects

Mikhail Gilula

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA

Copyright © 2016 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-12-804631-9

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

For information on all Morgan Kaufmann publications visit our website at www.mkp.com

Dedication

To my parents, Max and Asya; my wife, Natalia; my children, Maria, Victoria, and Maxim; and my grandson, Sava.

QUOTATION

Getting information off the Internet is like taking a drink from a fire hydrant.
Mitchell Kapor

PREFACE

OBJECTIVE

We are now in the Big Data era, which is characterized by three Vs: Volume, Variety, and Velocity. This new VVV world not surprisingly follows the WWW one.

While large data volumes are not uncommon for traditional databases, it is mostly the other two Vs that spell trouble. When data structures vary or change rapidly, the classic database technology becomes not as useful. At the same time, NoSQL share is growing, though some say these are not even databases because they generally do not aim to support ad hoc queries or full-blown query languages. Proponents of NoSQL point out that ad hoc querying is not necessary for many applications, but rich data structures and high availability along with speed of access are paramount. High availability may not be a decisive differentiator, but the rich data structure handling and ease of access to data from applications do not belong to the advantages of SQL databases. It is worth mentioning that some data from NoSQL databases end up in SQL data warehouses for analytical processing.

Another big trend of the WWW–VVV era is the ubiquitous use of keyword search. Internet search companies have immensely advanced the technology and that probably accounts for use cases where the keyword search alone is a suboptimal solution.
One example is e-commerce where goods and services are searched by keywords rather than by specifications, which would be the case in the database paradigm of structured queries. If the structured query interfaces were used, researching complex merchandise for the best deals would take minutes instead of the hours it might take with keywords. A typical remedy is classifiers helping users reduce search outputs by checking the classification boxes. It requires classifying each item individually but falls short of providing on-par functionality. This is essentially equivalent to labeling the table rows with multiple tags in lieu of employing query languages.

The above suggests that we may be failing to uncover Big Information by not fully interrogating Big Data with structured queries. The question is do we want to, or are we fine with just keywords and NoSQL.

Our goal is to present the advantages of structured search in the realm of Big Data so that the readers will be better informed to answer this question.

AUDIENCE

This book is for a wide audience of enlightened readers defined by the dictionary as “factually well-informed, tolerant of alternative opinions, and guided by rational thought.” It is addressed to anyone who works with, studies, or simply is interested in Big Data, SQL or NoSQL databases, information retrieval, or Internet search. This includes, but is not limited to, IT professionals and managers, data architects and modelers, software developers, undergraduate and graduate students in information systems, computer science or engineering, and their teachers as well. Some parts can be useful for business professionals, students and teachers, especially for those working or planning to work in e-commerce.

The book does not require special training in computer science or programming skills. An introductory course in information systems or databases should suffice for understanding most of the material.
We have tried to make it brief, interesting, and thought provoking.

OUTLINE OF THE BOOK

Chapter 1 conceptualizes structured search as a technology for querying multiple data sources in an independent and scalable manner. It occupies the middle ground between keyword search and database search. As in the keyword search paradigm, query originators do not need to know the structure or the number of data sources being queried. As in the database paradigm, users can pose precise queries, control the output order, and access data in real time.

Chapter 2 introduces key-objects as a generalization of keywords. The key-objects can be thought of as data structures reflecting the properties or characteristics of things and re-encapsulating them from their names presented by keywords. For example, the keyword “chair” does not allow distinguishing between different chairs; and therefore, does not allow narrowing the search down to specific chair instances. A key-object “chair” allows specification of the chair properties the query originators may be looking for, and there is no limit to the number of different key-objects reflecting the concept of a chair. Keywords allow posing queries independently of the indexed documents or web pages, that is, without knowledge of their content, or number. Similarly, key-objects allow posing queries to structured data sources without knowledge of their organization or the number of data sources being queried.

The key-object concept is further developed in Chapter 3. It presents an abstract key-object data model based on hereditarily-finite sets – a mathematical structure having the finite set as the only constructor. The key-object model is a generalization of the relational model where data objects – key-object instances – can be arbitrarily structured and multi-valued, and the phenomenon of multiple values receives its formal explication.
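The contrast between the keyword "chair" and a key-object "chair" can be illustrated with a minimal Python sketch. The nested-dictionary encoding, the property names (`material`, `legs`, `color`), and the helper functions are assumptions made for illustration only; they are not the formal key-object model or any KeySQL syntax described in the book.

```python
# Hypothetical encoding (an assumption, not the book's model): a key-object
# instance is a nested dict, and a property may hold a set to model
# multiple values.
chairs = [
    {"chair": {"material": "oak", "legs": 4, "color": {"brown"}}},
    {"chair": {"material": "steel", "legs": 5, "color": {"black", "gray"}}},
]


def keyword_search(items, word):
    """Keyword view: an item matches if it is named by the keyword."""
    return [item for item in items if word in item]


def matches(instance, spec):
    """Key-object view: every property requested in spec must be satisfied."""
    for key, wanted in spec.items():
        if key not in instance:
            return False
        actual = instance[key]
        if isinstance(wanted, dict):
            # Nested specification: recurse into the nested structure.
            if not isinstance(actual, dict) or not matches(actual, wanted):
                return False
        elif isinstance(actual, (set, frozenset)):
            # Multi-valued property: the wanted value must be one of the values.
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


def key_object_search(items, spec):
    """Return the items whose structure satisfies the specification."""
    return [item for item in items if matches(item, spec)]
```

Under these assumptions, `keyword_search(chairs, "chair")` cannot tell the two chairs apart, while `key_object_search(chairs, {"chair": {"color": "black"}})` narrows the result to the steel chair.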
Sets of key-object instances form data stores, which can be viewed as analogs of relational tables and databases at the same time. Particularly, tables correspond to flat homogeneous data stores and databases correspond to flat data stores. All relational operations have their generalized analogs in data store operations of the key-object model.

Unlike their relational counterparts, all operations on data stores are total, that is, defined for any operands, whereas relational set-theoretic operations, for example, are only defined for relations having an equal number of attributes of compatible types.

The totality of data store operations contributes to scalability of the native data store systems because any two data stores can be viewed as the parts of one and the same data store. In the relational setting that would correspond to any two tables being the parts of one and the same table. Under this analogy, the same query could be addressed to all tables in all databases and the response could be formed as the union of responses returned from each table.

Chapter 4 introduces design principles, framework, and data architecture for structured search systems based on the key-object model. It presents eight design principles of the systems realizing the structured search paradigm. They include query independence, search scalability, security control, and others. Not all principles may be important or useful for all cases. However, the presented framework and data architecture aim to satisfy all listed principles so that the designers of concrete systems could choose a mix of features they need to implement.

The functions of the systems are as follows: facilitating query origination, delivering queries to data providers, collecting responses to the queries from the data providers, and delivering the responses to the query originators. Key-object catalogs provide federating namespaces for the structured search systems.
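The idea that one query can be addressed to every data store, with the response formed as the union of the individual responses, can be sketched in a few lines of Python. Everything named here is hypothetical: instances are encoded as frozensets of (property, value) pairs, and `make_store`, `select`, and `federated_select` are illustrative helpers, not the book's framework, Q-format, or R-format.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

# Hypothetical encoding (an assumption): an instance is a frozenset of
# (property, value) pairs, and a data store is a frozenset of instances.
Instance = FrozenSet[Tuple[str, object]]
Store = FrozenSet[Instance]
Query = Dict[str, object]


def make_store(rows: Iterable[Dict[str, object]]) -> Store:
    """Build a data store from dicts; rows may have different structures."""
    return frozenset(frozenset(row.items()) for row in rows)


def select(store: Store, query: Query) -> Store:
    """Return the instances containing every (property, value) pair asked for."""
    wanted = frozenset(query.items())
    return frozenset(inst for inst in store if wanted <= inst)


def federated_select(stores: Iterable[Store], query: Query) -> Store:
    """Deliver the same query to each store and union the responses."""
    result: Store = frozenset()
    for store in stores:
        # Set union is defined for any two stores, whatever their structure.
        result |= select(store, query)
    return result


# Two providers with differently structured instances.
furniture = make_store([
    {"type": "chair", "legs": 4},
    {"type": "table", "legs": 4},
])
catalog = make_store([
    {"type": "chair", "material": "oak", "price": 120},
])

result = federated_select([furniture, catalog], {"type": "chair"})
```

In this sketch the query originator never inspects how each store is organized or how many stores exist. A store with nothing relevant simply contributes an empty response to the union.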
Queries are explicated using the KeySQL language and are delivered to data providers using the Q-format. Responses are returned using the R-format designed for transporting key-object instances. Those formats can be machine and user readable or binary for increased performance.

Two principal types of structured search systems are considered: federative and native. In the federative scenario, data manipulation is limited to the federative SELECT statement. In a sense, this mimics the keyword search where no inserts, updates, or deletes can be performed. In the native scenario, the full data manipulation functionality is available.

Chapter 5 describes KeySQL – a structured query language based on the key-object data model. It consists of two main parts: catalog management language (CML) and store manipulation language (SML), and provides two types of data manipulation functionality via the federative and native sublanguages. The sublanguages share the major part of CML, but have no SML statements in common. CML plays the role of data definition language and deals with creating and dropping key-objects, catalogs, and synonymies. The federative SML includes only the federative SELECT statement. The native SML includes CREATE, DROP, INSERT, SELECT, UPDATE, and DELETE statements for the data stores as sets of key-object instances.

The positioning of structured search within the landscape of historical and contemporary database trends is discussed in Chapter 6. The topics considered are: key-objects and object-oriented programming paradigm, key-objects and object-oriented databases, KeySQL and NoSQL, query independence and data independence, and KeySQL and MPP architectures.

Chapter 7 presents examples of structured search solutions, applications, and use cases.
They include e-commerce and mobile e-commerce applications, secure federated information systems, healthcare information systems, Big Data warehousing, implementation of KeySQL on MapReduce clusters, and others.

The last section is devoted to the place of structured search in the Internet evolution. It describes an implementation of structured Internet search via key-object instances linked or embedded into web pages. The key-object instances are then collected by search engines, stored, and made available for the structured search. Alternatively, a real-time structured Internet search can be employed. In this setting, websites become data providers and play an active role in the search instead of waiting for search engines to collect their data. Real-time and nonreal-time search results can be combined within the federative framework.

US PATENTS

The book comprises material protected by granted and pending US patents.

ACKNOWLEDGMENTS

Konstantin Andreyev was one of the first to recognize the potential of structured search; with the help of Alex Kouznetsov, he created the first structured search portal and worked with Alexander Denisov on the implementation of KeySQL catalog management. ZEDventures and personally Saurbh Khera organized a series of presentations, which helped shape the book. Maxim Gilula read through several versions of the text and helped to clean it up. The value of constructive critical remarks made by Alexei Lisitsa, Alexei Stolboushkin, Seva Yakhontov, and Vladas Leonas cannot be overestimated. Chris Date, Paul Smoot, and Shell Finkelstein have encouraged me over the years. I am very grateful to all these people, and also to everyone who took the time to participate in the structured search presentations.

Mikhail Gilula
Foster City, CA, 2015
