The Architecture of Open Source Applications
Amy Brown and Greg Wilson (eds.)
Lulu.com, 2011, 978-1-257-63801-7

Architects look at thousands of buildings during their training, and study critiques of those buildings written by masters. In contrast, most software developers only ever get to know a handful of large programs well (usually programs they wrote themselves) and never study the great programs of history. As a result, they repeat one another's mistakes rather than building on one another's successes.

This book's goal is to change that. In it, the authors of twenty-five open source applications explain how their software is structured, and why. What are each program's major components? How do they interact? And what did their builders learn during their development? In answering these questions, the contributors to this book provide unique insights into how they think.

If you are a junior developer, and want to learn how your more experienced colleagues think, this book is the place to start. If you are an intermediate or senior developer, and want to see how your peers have solved hard design problems, this book can help you too.

Contents

Introduction: Amy Brown and Greg Wilson
1. Asterisk: Russell Bryant
2. Audacity: James Crook
3. The Bourne-Again Shell: Chet Ramey
4. Berkeley DB: Margo Seltzer and Keith Bostic
5. CMake: Bill Hoffman and Kenneth Martin
6. Eclipse: Kim Moir
7. Graphite: Chris Davis
8. The Hadoop Distributed File System: Robert Chansler, Hairong Kuang, Sanjay Radia, Konstantin Shvachko, and Suresh Srinivas
9. Continuous Integration: C. Titus Brown and Rosangela Canino-Koning
10. Jitsi: Emil Ivov
11. LLVM: Chris Lattner
12. Mercurial: Dirkjan Ochtman
13. The NoSQL Ecosystem: Adam Marcus
14. Python Packaging: Tarek Ziadé
15. Riak and Erlang/OTP: Francesco Cesarini, Andy Gross, and Justin Sheehy
16. Selenium WebDriver: Simon Stewart
17. Sendmail: Eric Allman
18. SnowFlock: Roy Bryant and Andrés Lagar-Cavilla
19. SocialCalc: Audrey Tang
20. Telepathy: Danielle Madeley
21. Thousand Parsec: Alan Laudicina and Aaron Mavrinac
22. Violet: Cay Horstmann
23. VisTrails: Juliana Freire, David Koop, Emanuele Santos, Carlos Scheidegger, Claudio Silva, and Huy T. Vo
24. VTK: Berk Geveci and Will Schroeder
25. Battle For Wesnoth: Richard Shimooka and David White
Bibliography
Making Software

This work is made available under the Creative Commons Attribution 3.0 Unported license. All royalties from sales of this book will be donated to Amnesty International. Follow us at http://third-bit.com or search for #aosa on Twitter.

Purchasing

Copies of this book may be purchased from Lulu.com and other online booksellers. All royalties from these sales will be donated to Amnesty International. If you do buy a copy, please buy directly from Lulu:

                Lulu      Amazon
You pay:        $35.00    $35.00
Lulu gets:      $3.74     $0.94
Amazon gets:              $17.50
Amnesty gets:   $14.98    $3.78

Contributing

Dozens of volunteers worked hard to create this book, but there is still lots to do. You can help by reporting errors, by helping to translate the content into other languages, or by describing the architecture of other open source projects. Please contact us at [email protected] if you would like to get involved.

Introduction
Amy Brown and Greg Wilson

Carpentry is an exacting craft, and people can spend their entire lives learning how to do it well.
But carpentry is not architecture: if we step back from pitch boards and miter joints, buildings as a whole must be designed, and doing that is as much an art as it is a craft or science. Programming is also an exacting craft, and people can spend their entire lives learning how to do it well. But programming is not software architecture. Many programmers spend years thinking about (or wrestling with) larger design issues: Should this application be extensible? If so, should that be done by providing a scripting interface, through some sort of plugin mechanism, or in some other way entirely? What should be done by the client, what should be left to the server, and is "client-server" even a useful way to think about this application? These are not programming questions, any more than where to put the stairs is a question of carpentry. Building architecture and software architecture have a lot in common, but there is one crucial difference. While architects study thousands of buildings in their training and during their careers, most software developers only ever get to know a handful of large programs well. And more often than not, those are programs they wrote themselves. They never get to see the great programs of history, or read critiques of those programs' design written by experienced practitioners. As a result, they repeat one another's mistakes rather than building on one another's successes. This book is our attempt to change that. Each chapter describes the architecture of an open source application: how it is structured, how its parts interact, why it's built that way, and what lessons have been learned that can be applied to other big design problems. The descriptions are written by the people who know the software best, people with years or decades of experience designing and re-designing complex applications. 
The applications themselves range in scale from simple drawing programs and web-based spreadsheets to compiler toolkits and multi-million line visualization packages. Some are only a few years old, while others are approaching their thirtieth anniversary. What they have in common is that their creators have thought long and hard about their design, and are willing to share those thoughts with you. We hope you enjoy what they have written. Contributors Eric P. Allman (Sendmail): Eric Allman is the original author of sendmail, syslog, and trek, and the co-founder of Sendmail, Inc. He has been writing open source software since before it had a name, much less became a "movement". He is a member of the ACM Queue Editorial Review Board and the Cal Performances Board of Trustees. His personal web site is http://www.neophilic.com/~eric. Keith Bostic (Berkeley DB): Keith was a member of the University of California Berkeley Computer Systems Research Group, where he was the architect of the 2.10BSD release and a principal developer of 4.4BSD and related releases. He received the USENIX Lifetime Achievement Award ("The Flame"), which recognizes singular contributions to the Unix community, as well as a Distinguished Achievement Award from the University of California, Berkeley, for making the 4BSD release Open Source. Keith was the architect and one of the original developers of Berkeley DB, the Open Source embedded database system. Amy Brown (editorial): Amy has a bachelor's degree in Mathematics from the University of Waterloo, and worked in the software industry for ten years. She now writes and edits books, sometimes about software. She lives in Toronto and has two children and a very old cat. C. Titus Brown (Continuous Integration): Titus has worked in evolutionary modeling, physical meteorology, developmental biology, genomics, and bioinformatics. 
He is now an Assistant Professor at Michigan State University, where he has expanded his interests into several new areas, including reproducibility and maintainability of scientific software. He is also a member of the Python Software Foundation, and blogs at http://ivory.idyll.org. Roy Bryant (Snowflock): In 20 years as a software architect and CTO, Roy designed systems including Electronics Workbench (now National Instruments' Multisim) and the Linkwalker Data Pipeline, which won Microsoft's worldwide Winning Customer Award for High-Performance Computing in 2006. After selling his latest startup, he returned to the University of Toronto to do graduate studies in Computer Science with a research focus on virtualization and cloud computing. Most recently, he published his Kaleidoscope extensions to Snowflock at ACM's Eurosys Conference in 2011. His personal web site is http://www.roybryant.net/. Russell Bryant (Asterisk): Russell is the Engineering Manager for the Open Source Software team at Digium, Inc. He has been a core member of the Asterisk development team since the Fall of 2004. He has since contributed to almost all areas of Asterisk development, from project management to core architectural design and development. He blogs at http://www.russellbryant.net. Rosangela Canino-Koning (Continuous Integration): After 13 years of slogging in the software industry trenches, Rosangela returned to university to pursue a Ph.D. in Computer Science and Evolutionary Biology at Michigan State University. In her copious spare time, she likes to read, hike, travel, and hack on open source bioinformatics software. She blogs at http://www.voidptr.net. Francesco Cesarini (Riak): Francesco Cesarini has used Erlang on a daily basis since 1995, having worked in various turnkey projects at Ericsson, including the OTP R1 release. He is the founder of Erlang Solutions and co-author of O'Reilly's Erlang Programming. 
He currently works as Technical Director at Erlang Solutions, but still finds the time to teach graduates and undergraduates alike at Oxford University in the UK and the IT University of Gothenburg in Sweden. Robert Chansler (HDFS): Robert is a Senior Manager for Software Development at Yahoo!. After graduate studies in distributed systems at Carnegie-Mellon University, he worked on compilers (Tartan Labs), printing and imaging systems (Adobe Systems), electronic commerce (Adobe Systems, Impresse), and storage area network management (SanNavigator, McDATA). Returning to distributed systems and HDFS, Rob found many familiar problems, but all of the numbers had two or three more zeros. James Crook (Audacity): James is a contract software developer based in Dublin, Ireland. Currently he is working on tools for electronics design, though in a previous life he developed bioinformatics software. He has many audacious plans for Audacity, and he hopes some, at least, will see the light of day. Chris Davis (Graphite): Chris is a software consultant and Google engineer who has been designing and building scalable monitoring and automation tools for over 12 years. Chris originally wrote Graphite in 2006 and has led the open source project ever since. When he's not writing code he enjoys cooking, making music, and doing research. His research interests include knowledge modeling, group theory, information theory, chaos theory, and complex systems. Juliana Freire (VisTrails): Juliana is an Associate Professor of Computer Science at the University of Utah. Before that, she was a member of technical staff at the Database Systems Research Department at Bell Laboratories (Lucent Technologies) and an Assistant Professor at OGI/OHSU. Her research interests include provenance, scientific data management, information integration, and Web mining. She is a recipient of an NSF CAREER and an IBM Faculty award. 
Her research has been funded by the National Science Foundation, Department of Energy, National Institutes of Health, IBM, Microsoft and Yahoo!. Berk Geveci (VTK): Berk is the Director of Scientific Computing at Kitware. He is responsible for leading the development effort of ParaView, an award-winning visualization application based on VTK. His research interests include large-scale parallel computing, computational dynamics, finite elements and visualization algorithms. Andy Gross (Riak): Andy Gross is Principal Architect at Basho Technologies, managing the design and development of Basho's Open Source and Enterprise data storage systems. Andy started at Basho in December of 2007 with 10 years of software and distributed systems engineering experience. Prior to Basho, Andy held senior distributed systems engineering positions at Mochi Media, Apple, Inc., and Akamai Technologies. Bill Hoffman (CMake): Bill is CTO and co-Founder of Kitware, Inc. He is a key developer of the CMake project, and has been working with large C++ systems for over 20 years. Cay Horstmann (Violet): Cay is a professor of computer science at San Jose State University, but every so often, he takes a leave of absence to work in industry or teach in a foreign country. He is the author of many books on programming languages and software design, and the original author of the Violet and GridWorld open-source programs. Emil Ivov (Jitsi): Emil is the founder and project lead of the Jitsi project (previously SIP Communicator). He is also involved with other initiatives like the ice4j.org and JAIN SIP projects. Emil obtained his Ph.D. from the University of Strasbourg in early 2008, and has been focusing primarily on Jitsi-related activities ever since. David Koop (VisTrails): David is a Ph.D. candidate in computer science at the University of Utah (finishing in the summer of 2011). His research interests include visualization, provenance, and scientific data management. 
He is a lead developer of the VisTrails system, and a senior software architect at VisTrails, Inc. Hairong Kuang (HDFS) is a long-time contributor and committer to the Hadoop project, which she has passionately worked on currently at Facebook and previously at Yahoo!. Prior to industry, she was an Assistant Professor at California State Polytechnic University, Pomona. She received her Ph.D. in Computer Science from the University of California at Irvine. Her interests include cloud computing, mobile agents, parallel computing, and distributed systems. H. Andrés Lagar-Cavilla (Snowflock): Andrés is a software systems researcher who does experimental work on virtualization, operating systems, security, cluster computing, and mobile computing. He has a B.A.Sc. from Argentina, and an M.Sc. and Ph.D. in Computer Science from the University of Toronto, and can be found online at http://lagarcavilla.org. Chris Lattner (LLVM): Chris is a software developer with a diverse range of interests and experiences, particularly in the area of compiler tool chains, operating systems, graphics and image rendering. He is the designer and lead architect of the Open Source LLVM Project. See http://nondot.org/~sabre/ for more about Chris and his projects. Alan Laudicina (Thousand Parsec): Alan is an M.Sc. student in computer science at Wayne State University, where he studies distributed computing. In his spare time he codes, learns programming languages, and plays poker. You can find more about him at http://alanp.ca/. Danielle Madeley (Telepathy): Danielle is an Australian software engineer working on Telepathy and other magic for Collabora Ltd. She has bachelor's degrees in electronic engineering and computer science. She also has an extensive collection of plush penguins. She blogs at http://blogs.gnome.org/danni/. Adam Marcus (NoSQL): Adam is a Ph.D. student focused on the intersection of database systems and social computing at MIT's Computer Science and Artificial Intelligence Lab. 
His recent work ties traditional database systems to social streams such as Twitter and human computation platforms such as Mechanical Turk. He likes to build usable open source systems from his research prototypes, and prefers tracking open source storage systems to long walks on the beach. He blogs at http://blog.marcua.net. Kenneth Martin (CMake): Ken is currently Chairman and CFO of Kitware, Inc., a research and development company based in the US. He co-founded Kitware in 1998 and since then has helped grow the company to its current position as a leading R&D provider with clients across many government and commercial sectors. Aaron Mavrinac (Thousand Parsec): Aaron is a Ph.D. candidate in electrical and computer engineering at the University of Windsor, researching camera networks, computer vision, and robotics. When there is free time, he fills some of it working on Thousand Parsec and other free software, coding in Python and C, and doing too many other things to get good at any of them. His web site is http://www.mavrinac.com. Kim Moir (Eclipse): Kim works at the IBM Rational Software lab in Ottawa as the Release Engineering lead for the Eclipse and Runtime Equinox projects and is a member of the Eclipse Architecture Council. Her interests lie in build optimization, Equinox and building component based software. Outside of work she can be found hitting the pavement with her running mates, preparing for the next road race. She blogs at http://relengofthenerds.blogspot.com/. Dirkjan Ochtman (Mercurial): Dirkjan graduated as a Master in CS in 2010, and has been working at a financial startup for 3 years. When not procrastinating in his free time, he hacks on Mercurial, Python, Gentoo Linux and a Python CouchDB library. He lives in the beautiful city of Amsterdam. His personal web site is http://dirkjan.ochtman.nl/. 
Sanjay Radia (HDFS): Sanjay is the architect of the Hadoop project at Yahoo!, and a Hadoop committer and Project Management Committee member at the Apache Software Foundation. Previously he held senior engineering positions at Cassatt, Sun Microsystems and INRIA where he developed software for distributed systems and grid/utility computing infrastructures. Sanjay has a Ph.D. in Computer Science from the University of Waterloo, Canada. Chet Ramey (Bash): Chet has been involved with bash for more than twenty years, the past seventeen as primary developer. He is a longtime employee of Case Western Reserve University in Cleveland, Ohio, from which he received his B.Sc. and M.Sc. degrees. He lives near Cleveland with his family and pets, and can be found online at http://tiswww.cwru.edu/~chet. Emanuele Santos (VisTrails): Emanuele is a research scientist at the University of Utah. Her research interests include scientific data management, visualization, and provenance. She received her Ph.D. in Computing from the University of Utah in 2010. She is also a lead developer of the VisTrails system. Carlos Scheidegger (VisTrails): Carlos has a Ph.D. in Computing from the University of Utah, and is now a researcher at AT&T Labs–Research. Carlos has won best paper awards at IEEE Visualization in 2007, and Shape Modeling International in 2008. His research interests include data visualization and analysis, geometry processing and computer graphics. Will Schroeder (VTK): Will is President and co-Founder of Kitware, Inc. He is a computational scientist by training and has been one of the key developers of VTK. He enjoys writing beautiful code, especially when it involves computational geometry or graphics. Margo Seltzer (Berkeley DB): Margo is the Herchel Smith Professor of Computer Science at Harvard's School of Engineering and Applied Sciences and an Architect at Oracle Corporation. She was one of the principal designers of Berkeley DB and a co-founder of Sleepycat Software. 
Her research interests are in filesystems, database systems, transactional systems, and medical data mining. Her professional life is online at http://www.eecs.harvard.edu/~margo, and she blogs at http://mis-misinformation.blogspot.com/. Justin Sheehy (Riak): Justin is the CTO of Basho Technologies, the company behind the creation of Webmachine and Riak. Most recently before Basho, he was a principal scientist at the MITRE Corporation and a senior architect for systems infrastructure at Akamai. At both of those companies he focused on multiple aspects of robust distributed systems, including scheduling algorithms, language-based formal models, and resilience. Richard Shimooka (Battle for Wesnoth): Richard is a Research Associate at Queen's University's Defence Management Studies Program in Kingston, Ontario. He is also a Deputy Administrator and Secretary for the Battle For Wesnoth. Richard has written several works examining the organizational cultures of social groups, ranging from governments to open source projects. Konstantin V. Shvachko (HDFS), a veteran HDFS developer, is a principal Hadoop architect at eBay. Konstantin specializes in efficient data structures and algorithms for large-scale distributed storage systems. He discovered a new type of balanced trees, S-trees, for optimal indexing of unstructured data, and was a primary developer of an S-tree-based Linux filesystem, treeFS, a prototype of reiserFS. Konstantin holds a Ph.D. in computer science from Moscow State University, Russia. He is also a member of the Project Management Committee for Apache Hadoop. Claudio Silva (VisTrails): Claudio is a full professor of computer science at the University of Utah. His research interests are in visualization, geometric computing, computer graphics, and scientific data management. He received his Ph.D. in computer science from the State University of New York at Stony Brook in 1996. 
Later in 2011, he will be joining the Polytechnic Institute of New York University as a full professor of computer science and engineering. Suresh Srinivas (HDFS): Suresh works on HDFS as a software architect at Yahoo!. He is a Hadoop committer and PMC member at the Apache Software Foundation. Prior to Yahoo!, he worked at Sylantro Systems, developing scalable infrastructure for hosted communication services. Suresh has a bachelor's degree in Electronics and Communication from National Institute of Technology Karnataka, India. Simon Stewart (Selenium): Simon lives in London and works as a Software Engineer in Test at Google. He is a core contributor to the Selenium project, was the creator of WebDriver and is enthusiastic about Open Source. Simon enjoys beer and writing better software, sometimes at the same time. His personal home page is http://www.pubbitch.org/. Audrey Tang (SocialCalc): Audrey is a self-educated programmer and translator based in Taiwan. She currently works at Socialtext, where her job title is "Untitled Page", as well as at Apple as a contractor for localization and release engineering. She previously designed and led the Pugs project, the first working Perl 6 implementation; she has also served in language design committees for Haskell, Perl 5, and Perl 6, and has made numerous contributions to CPAN and Hackage. She blogs at http://pugs.blogs.com/audreyt/. Huy T. Vo (VisTrails): Huy is receiving his Ph.D. from the University of Utah in May, 2011. His research interests include visualization, dataflow architecture and scientific data management. He is a senior developer at VisTrails, Inc. He also holds a Research Assistant Professor appointment with the Polytechnic Institute of New York University. David White (Battle for Wesnoth): David is the founder and lead developer of Battle for Wesnoth. David has been involved with several Open Source video game projects, including Frogatto, which he also co-founded. 
David is a performance engineer at Sabre Holdings, a leader in travel technology. Greg Wilson (editorial): Greg has worked over the past 25 years in high-performance scientific computing, data visualization, and computer security, and is the author or editor of several computing books (including the 2008 Jolt Award winner Beautiful Code) and two books for children. Greg received a Ph.D. in Computer Science from the University of Edinburgh in 1993. He blogs at http://third-bit.com and http://software-carpentry.org. Tarek Ziadé (Python Packaging): Tarek lives in Burgundy, France. He's a Senior Software Engineer at Mozilla, building servers in Python. In his spare time, he leads the packaging effort in Python. Acknowledgments We would like to thank our reviewers: Eric Aderhold Muhammad Ali Lillian Angel Robert Beghian Taavi Burns Luis Pedro Coelho David Cooper Mauricio de Simone Jonathan Deber Patrick Dubroy Igor Foox Alecia Fowler Marcus Hanwell Johan Harjono Vivek Lakshmanan Greg Lapouchnian Laurie MacDougall Sookraj Josh McCarthy Jason Montojo Colin Morris Christian Muise Victor Ng Nikita Pchelin Andrew Petersen Andrey Petrov Tom Plaskon Pascal Rapicault Todd Ritchie Samar Sabie Misa Sakamoto David Scannell Clara Severino Tim Smith Kyle Spaans Sana Tapal Tony Targonski Miles Thibault David Wright Tina Yee We would also like to thank Jackie Carter, who helped with the early stages of editing. The cover image is a photograph by Peter Dutton of the 48 Free Street Mural by Chris Denison in Portland, Maine. The photograph is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license. License, Credits, and Disclaimers This work is licensed under the Creative Commons Attribution 3.0 Unported license (CC BY 3.0). 
You are free:
  to Share: to copy, distribute and transmit the work
  to Remix: to adapt the work
under the following conditions:
  Attribution: you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
with the understanding that:
  Waiver: Any of the above conditions can be waived if you get permission from the copyright holder.
  Public Domain: Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
  Other Rights: In no way are any of the following rights affected by the license:
    Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
    The author's moral rights;
    Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
  Notice: For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by/3.0/.

To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.

Product and company names mentioned herein may be the trademarks of their respective owners.

While every precaution has been taken in the preparation of this book, the editors and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Dedication

Dedicated to Brian Kernighan, who has taught us all so much; and to prisoners of conscience everywhere.

Chapter 1. Asterisk
Russell Bryant

Asterisk[1] is an open source telephony applications platform distributed under the GPLv2.
In short, it is a server application for making, receiving, and performing custom processing of phone calls.

The project was started by Mark Spencer in 1999. Mark had a company called Linux Support Services and he needed a phone system to help operate his business. He did not have a lot of money to spend on buying one, so he just made his own. As the popularity of Asterisk grew, Linux Support Services shifted focus to Asterisk and changed its name to Digium, Inc.

The name Asterisk comes from the Unix wildcard character, *. The goal for the Asterisk project is to do everything telephony. Through pursuing this goal, Asterisk now supports a long list of technologies for making and receiving phone calls. This includes many VoIP (Voice over IP) protocols, as well as both analog and digital connectivity to the traditional telephone network, or the PSTN (Public Switched Telephone Network). This ability to get many different types of phone calls into and out of the system is one of Asterisk's main strengths.

Once phone calls are made to and from an Asterisk system, there are many additional features that can be used to customize the processing of the phone call. Some features are larger pre-built common applications, such as voicemail. There are other smaller features that can be combined together to create custom voice applications, such as playing back a sound file, reading digits, or speech recognition.

1.1. Critical Architectural Concepts

This section discusses some architectural concepts that are critical to all parts of Asterisk. These ideas are at the foundation of the Asterisk architecture.

1.1.1. Channels

A channel in Asterisk represents a connection between the Asterisk system and some telephony endpoint (Figure 1.1). The most common example is when a phone makes a call into an Asterisk system. This connection is represented by a single channel. In the Asterisk code, a channel exists as an instance of the ast_channel data structure. This call scenario could be a caller interacting with voicemail, for example.

Figure 1.1: A Single Call Leg, Represented by a Single Channel

1.1.2. Channel Bridging