ebook img

Flux: A Mechanism for Building Robust, Scalable Dataflows PDF

206 Pages·2004·1.18 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Flux: A Mechanism for Building Robust, Scalable Dataflows

Flux: A Mechanism for Building Robust, Scalable Dataflows by Mehul Arunkumar Shah B.S. (Massachusetts Institute of Technology) 1996 B.S. (Massachusetts Institute of Technology) 1996 M.Eng. (Massachusetts Institute of Technology) 1997 A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY Committee in charge: Professor Joseph M. Hellerstein, Chair Professor Michael Franklin Professor John Chuang Fall 2004 The dissertation of Mehul Arunkumar Shah is approved: Professor Joseph M. Hellerstein, Chair Date Professor Michael Franklin Date Professor John Chuang Date University of California, Berkeley Fall 2004 Flux: A Mechanism for Building Robust, Scalable Dataflows Copyright (cid:13)c 2004 by Mehul Arunkumar Shah Abstract Flux: A Mechanism for Building Robust, Scalable Dataflows by Mehul Arunkumar Shah Doctor of Philosophy in Computer Science University of California, Berkeley Professor Joseph M. Hellerstein, Chair We present techniques for robustly scaling high-throughput, 24x7, data-stream processing applications. Examples of such applications include intrusion or denial- of-service detection, click-stream processing, and online analysis of financial quote streams. In the TelegraphCQ project, we implement these applications using a general-purpose continuous query (CQ) engine that executes long-running dataflows. To scale the performance of these dataflows, we parallelize them across a cluster of workstations. For these critical applications, high availability, fault-tolerance, and scalability are important goals. These goals are challenging to achieve on a cluster because machines are bound to fail, and load imbalances are likely to arise. In this thesis, we develop the design of Flux, a reusable communication abstraction that enables long- running, parallel dataflows to adapt on-the-fly to these problems. Flux encapsulates mechanisms that allow a dataflow to mask failures and to automatically recover from them as they occur during execution. Flux leverages these same mechanisms to periodically rebalance a dataflow and keep it running efficiently. By encapsulating the critical, fault-tolerance and load-balancing logic into Flux, we enable its reuse in a variety of dataflow applications with little modification to existing dataflow 1 components and interfaces. Thus, by simply constructing a parallel dataflow using Flux, an application developer can make the dataflow robust. Professor Joseph M. Hellerstein, Chair Date 2 Acknowledgements The dissertation is really a small part of the process that ultimately led to this document. In my opinion, the most important aspects of this process are not the results or exposition, but rather the fascinating interactions I had with the people I met along the way. These people helped me explore technical, social, and political ideas that otherwise would have made this seemingly-endless process a colorless and unbearable experience. Without further ado, I want to thank the following people that I am truly fortunate to have passed through my life. I thank the advisors and mentors who guided me. I thank my advisor, Joe Heller- stein. He gave me the opportunity and support to pursue many of my own ideas and judiciously provided gentle nudges to push me along the right course. I was always baffled by his optimism in my own ability. His constant enthusiasm for my work provided the impetus needed to complete it. And, I am amazed by how deep and accurate his advice is. His advice taught me more than just the mechanics of research. His advice molded my versatile style that I believe will help my ideas endure. I thank MichaelFranklinforhishonestyandhissuccessfuleffortstokeepmeinline. Hissense for knowing what is and what will be important is unparalleled. Without having the privilege to leverage his vision, I would still be wandering around for thesis topics. I thank Eric Brewer. He challenged me to really understand and pinpoint what the research in this thesis was about. I thank Michael Carey. I strive to have the balance and composure he exhibits in both academic and personal settings. I thank the colleagues and friends that I gathered along this journey. They formed a professional and personal support structure that provided a springboard for my work and that I could lean upon in my weakest moments. I thank Fred Reiss and Amol Deshpande, without whom, I never could have completed any of the experi- i ments in this thesis. I also thank Amol for his invaluable insight and willingness to work through deep technical details with me. I thank Sailesh Krishnamurthy; his friendship, support, and optimism helped me survive the journey. I thank Sirish Chandrasekaran for helping me embark on this thesis. I thank Owen Cooper for his efforts in helping me debug numerous deadlocks and non-deterministic bugs. I also thank both Sirish and Owen for the moral support they offered during the bleak days. I thank Mark Paskin for sharing his Latex files which make this thesis look so pretty. I thank the following friends that helped foster new ideas and stimulat- ing discussions: Steve Czerwinski, Alex Berg, Marcel Kornacker, Paul Aoki, Sam Madden, Remzi Arpaci-Dusseau, Vijayshankar Raman, David T. Liu, Matt Denny, Yanlei Diao, Ryan Heubsch, Boon Thau Loo, and Anil Edakkunni. I also thank my officemates for the lively not-always-4 o’clock political discussions. Please forgive me for forgetting any others that I have missed. I am deeply indebted to my family. They gave me the strength to endure the innumerable obstacles thrown at me during this trek. I owe my dearest sister, Guddi, niece, Dhara, and brother-in-law, Alpesh for their cross-country trips at a moments notice, for relieving me of many of my familial obligations, and for their unwavering support. I am grateful to my nephew, Nikhil, and sister-in-law, Ashima, and in-laws, Saroj and Indra. I thank them for their unqualified support and entrusting Nupur to me in spite of the risks that I place upon her. I thank my grandparents, and especially my late grandfather, Nandlal L. Shah. It was his vision that enabled me to pursue my goals. I am in awe of my parents, Arun and Nayana, who managed to do so much with so little. They are the most altruistic people I know. They provided the standards and nourishment that sculpted the person that I am today. It was their dedication and commitment to my education that brought me to this level. All my accolades and achievements are a result of their sacrifices. Although I will never be able to ii repay them, I promise to continue to seek in my efforts what they sought in me. Finally, I am blessed to have the opportunity to share my life with Maya and Nupur. Maya, my daughter and inspiration, has taught me what it means to care and to know what is truly important. Nupur, my dedicated partner in life, deserves as much credit as I do for this dissertation. While the ideas are not hers, she suffered more than her fair share of the pains and enjoyed less than her fair share of the joys that accompanied this process. I am grateful for her faith and devotion to my causes. Words cannot describe the innumerable lessons of life that I have learned from her, and I have so much more yet to discover with her. Without Maya and Nupur, I could not have survived the emotional roller coaster this process puts one through. These two are the source of optimism and intoxication I feel when I rise each morning. They are the source of comfort and peacefulness I feel when I rest each evening. iii Dedicated to Maya and Nupur, with love and gratitude iv Contents List of Figures ix List of Tables xii 1 Introduction 1 1.1 Continuous Query Systems . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Flux Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2.1 High Availability and Fault-Tolerance . . . . . . . . . . . . . . 6 1.2.2 Adapting Dataflows to Imbalances . . . . . . . . . . . . . . . 7 1.2.3 A Unified Flux . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.4 Encapsulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2 Background 11 2.1 Dataflow Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.1 CQ Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1.2 Queues and Scheduling . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 Partitioned Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3 Flux-HA: Fortifying Dataflows 24 3.1 Environment and Setup . . . . . . . . . . . . . . . . . . . . . . . . . 25 v

Description:
them as they occur during execution 5 Flux: The Greater Whole .. is, the cost of executing redundant dataflows with Flux-HA is only slightly more
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.