HPE Reference Architecture for Cloudera Enterprise 5.7 with HPE Apollo 2000 and HPE Apollo 4200 Gen9 servers

Contents
  Executive summary
  Introduction
  Solution overview
    Storage nodes
    Compute nodes
    Hadoop YARN
  Cloudera Enterprise overview
  Benefits of the HPE BDRA solution
  Solution components
    Edge node
    Management and head nodes
    Compute nodes
    Storage nodes
    Power and cooling
    Networking
  Capacity and sizing
    Expanding the base configuration
  Best practices and configuration guidance for the solution
    Setting up Cloudera Enterprise
    Configuring compression
  Cloudera Impala
    Installation
    Configuration
    YARN and Impala
    Performance and configuration guidelines
  Implementing a proof-of-concept
  Summary
  Appendix A: Bill of materials
  Appendix B: Alternate compute node components
  Appendix C: Alternate storage node components
  Appendix D: HPE value added services and support
  Resources and additional links

Executive summary

This white paper describes a big data solution that deploys Cloudera Enterprise, including Cloudera Impala, on the Hewlett Packard Enterprise Big Data Reference Architecture (BDRA), with HPE Apollo 2000 compute servers and HPE Apollo 4200 storage servers as the key components of the solution. The solution delivers the recognized benefits of running Cloudera Enterprise on HPE BDRA. In addition to simplifying the procurement process, this paper provides guidelines for configuring Cloudera Enterprise once the system has been deployed. There is an ever-growing need for a scalable, modern architecture for the consolidation, storage, access, and processing of big data analytics.
Big data solutions using Hadoop are evolving from a simple model, in which each application was deployed on a dedicated cluster of identical nodes, to a more complex model in which applications are deployed on a cluster of asymmetric nodes. The analytics and processing engines themselves have grown from MapReduce to a broader set that now includes Spark and SQL-based interfaces such as Impala. By integrating the significant advances that have occurred in fabrics, storage, container-based resource management, and workload-optimized servers since the inception of the Hadoop architecture in 2005, HPE BDRA provides a cost-effective and flexible way to optimize computing infrastructure in response to these ever-changing requirements in the Hadoop ecosystem. This implementation focuses on Cloudera Enterprise along with Cloudera Impala. These technologies, and guidelines for their configuration and usage, are described in this document.

Target audience: This paper is intended for decision makers, system and solution architects, system administrators, and experienced users who are interested in reducing design time or simplifying the purchase of a big data architecture containing both HPE and Cloudera components. An intermediate knowledge of Apache Hadoop and scale-out infrastructure is recommended.

Document purpose: The purpose of this document is to describe the optimal way to configure Cloudera Enterprise and Cloudera Impala on HPE BDRA. This document supersedes the previous version, titled "HPE Big Data Reference Architecture for Cloudera Enterprise: Using HPE Apollo 2000 and Apollo 4200 Gen9 servers". This reference architecture describes solution testing performed with Cloudera Enterprise 5.7 in April 2016.

Disclaimer: Products sold prior to the separation of Hewlett-Packard Company into Hewlett Packard Enterprise Company and HP Inc.
on November 1, 2015 may have a product name and model number that differ from current models.

Introduction

As companies grow their big data implementations, they often find themselves deploying multiple clusters to support their needs: to support different big data environments (MapReduce, Spark, NoSQL databases, MPP DBMSs, etc.), to support rigid workload partitioning for departmental requirements, or simply as a byproduct of multi-generational hardware. These multiple clusters often lead to data duplication and the movement of large amounts of data between systems to accomplish an organization's business requirements. Many enterprise customers are searching for ways to recapture some of the traditional benefits of a converged infrastructure, such as the ability to share data more easily between different applications running on different platforms, to scale compute and storage separately, and to rapidly provision new servers without repartitioning data to achieve optimal performance.

To address these needs, HPE engineers challenged the traditional Hadoop architecture, which always co-locates compute elements in a server with data. While this approach works, the real power of Hadoop is that tasks run against specific slices of data without the need for coordination; consequently, the distributed lock management and distributed cache management found in older parallel database designs are no longer required. In fact, in a modern server, there is often more network bandwidth available to ship data off the server than there is bandwidth to disk. The HPE BDRA for Cloudera Enterprise is deployed as an asymmetric cluster with some nodes dedicated to compute and others dedicated to the Hadoop Distributed File System (HDFS).
Hadoop still works the same way – tasks still have complete ownership of their portion of the data, and functions are still shipped to the data – but the computation is executed on a node optimized for that task, and the file system operations are executed on a node optimized for that work. Of particular interest is that this approach can actually perform better than a traditional Hadoop cluster, for several reasons. For more information on this architecture1 and the benefits it provides, see the document at https://h20195.www2.hpe.com/V2/GetDocument.aspx?docname=4AA6-8931ENW

1 HPE BDRA, also known as the HPE Elastic Platform for Big Data Analytics (EPA)

Solution overview

HPE Big Data solutions provide world-class performance and availability, with integrated software, services, infrastructure, and management – all delivered as one tested configuration, described in more detail at hpe.com/info/hadoop. As shown below in Figure 1, the HPE BDRA solution consists of a highly optimized configuration built using unique servers offered by HPE: the HPE Apollo 4200 Gen9 for the high-density storage layer and the HPE Apollo 2000 system with HPE ProLiant XL170r nodes for the high-density computational layer. The servers and components described below form an enterprise-class asymmetric Hadoop architecture designed and optimized for a variety of workloads.

• The HPE Apollo 4200 Gen9 server offers revolutionary storage density in a 2U form factor. It provides more than double the storage capacity of previous generations, along with an unprecedented selection of processors to match, for data-intensive workloads. The HPE Apollo 4200 Gen9 server allows companies to save valuable data center space through its unique density-optimized 2U form factor, which holds up to 28 LFF or 54 SFF hot-plug drives.
• The HPE Apollo 2000 system offers a dense solution with up to four independent, hot-pluggable HPE ProLiant XL170r Gen9 server nodes in a standard 2U chassis. It delivers twice the compute density of 1U rack-mount servers, with front hot-pluggable drives and rear-serviceable nodes, yielding very cost-effective configurations for various workloads.

• HPE ProLiant DL360 servers are two-socket servers with eight-core processors from the Intel® Xeon® E5-2600 v3 product family, providing the high performance required for the management services of the Hadoop cluster. The HPE iLO management engine on these servers contains HPE Integrated Lights-Out 4 (iLO 4) and provides a complete set of embedded management capabilities – Power/Cooling, Agentless Management, Active Health System, and Intelligent Provisioning – that reduce node- and cluster-level administration costs for Hadoop.

The reference architecture configuration is the result of extensive testing and optimization by HPE engineers, yielding the right set of software, drivers, firmware, and hardware for extremely high density and performance. As shown in Figure 1, this architecture is changing the economics of work distribution in big data.

Figure 1. HPE BDRA – changing the economics of work distribution in big data

To simplify the build for customers, HPE provides the exact bill of materials in this document so that a customer can purchase the complete solution. HPE recommends that customers purchase the option in which HPE Technical Services Consulting installs the prebuilt operating system images, verifies that all firmware and versions are correctly installed, and runs a suite of tests to verify that the configuration is performing optimally. Once this has been done, the customer can perform a standard Cloudera Enterprise installation using the recommended guidelines in this document. The HPE BDRA design is anchored by the following HPE technologies.
Storage nodes

HPE Apollo 4200 Gen9 servers make up the HDFS storage layer, providing a single repository for big data. The HPE Apollo 4200 Gen9 server offers the right mix of storage and compute performance in a 2U form factor: the 28 LFF disks are backed by 20 physical cores dedicated exclusively to the compute needs of HDFS. The storage controllers in the HPE Apollo 4200 support HPE Secure Encryption, an HPE Smart Array controller-based data encryption solution that provides encryption for data at rest.

Compute nodes

HPE Apollo 2000 system nodes deliver a scalable, high-density layer for compute tasks and provide a framework for workload optimization, with four HPE ProLiant XL170r nodes in a single chassis. High-speed networking separates compute nodes from storage nodes, creating an asymmetric architecture that allows each tier to be scaled individually; there is no commitment to a particular CPU/storage ratio. Since big data is no longer co-located with compute resources, Hadoop does not need to achieve node locality. However, rack locality works in exactly the same way as in a traditional converged infrastructure; that is, as long as you scale within a rack, overall scalability is not affected. With compute and storage de-coupled, users gain many of the advantages of a traditional converged system. For example, you can scale compute and storage independently, simply by adding compute nodes or storage nodes. Testing carried out by HPE indicates that most workloads respond almost linearly to additional compute resources.

Hadoop YARN

YARN is the key feature for resource management in the Hadoop cluster on HPE BDRA. It decouples MapReduce's resource management and scheduling capabilities from the data processing components, allowing Hadoop to support more varied processing approaches and a broader array of applications.
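As a rough illustration of how compute-only nodes map onto YARN's resource model, the sketch below derives NodeManager resource settings from a node's hardware. This is a minimal sketch, not HPE's tuning guidance: the OS overhead, vcore count, and container size are illustrative assumptions, although the two `yarn.nodemanager.resource.*` property names are the real YARN configuration keys.

```python
def yarn_nodemanager_settings(node_mem_gb, node_vcores,
                              os_overhead_gb=16, container_mem_gb=4):
    """Derive illustrative YARN NodeManager resource settings for a
    compute-only node (no HDFS DataNode sharing the host)."""
    usable_mem_gb = node_mem_gb - os_overhead_gb  # reserve for OS/daemons
    return {
        # Memory the NodeManager may hand out to containers, in MB
        "yarn.nodemanager.resource.memory-mb": usable_mem_gb * 1024,
        # vcores the NodeManager may hand out to containers
        "yarn.nodemanager.resource.cpu-vcores": node_vcores,
        # How many containers of the chosen size fit on the node
        "max-containers": usable_mem_gb // container_mem_gb,
    }

# A 128 GB compute node with 24 vcores (illustrative figures)
settings = yarn_nodemanager_settings(128, 24)
print(settings)
```

Because compute nodes carry no DataNode role in this architecture, nearly all of a node's memory and cores can be handed to YARN containers; on a converged cluster, a further reservation for HDFS daemons would be subtracted first.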
Cloudera Enterprise overview

Founded in 2008, Cloudera was the first company to commercialize Apache Hadoop and to develop enterprise-grade solutions built on this powerful open source technology. Today, Cloudera is the leading innovator in, and largest contributor to, the Hadoop open source software community. Cloudera employs a "hybrid open" subscription software business model, affording customers all the benefits of open source software plus the features and support expected from traditional enterprise software, such as security, data governance, and system management.

As shown in Figure 2, Cloudera Enterprise is built on top of Cloudera's Enterprise Data Hub (EDH) software platform. It empowers organizations to store, process, and analyze all enterprise data, of whatever type, in any volume – creating remarkable cost-efficiencies as well as enabling business transformation. It is one place to store all your data, for as long as desired, in its original fidelity. With Apache Hadoop at its core, it is a new, more powerful and scalable data platform with the flexibility to run a variety of workloads – batch processing, interactive SQL, enterprise search, advanced analytics – together with the robust security, governance, data protection, and management that enterprises require. For detailed information on Cloudera Enterprise, see cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise.html.

Figure 2. Cloudera Enterprise

Benefits of the HPE BDRA solution

While the most obvious benefits of the HPE BDRA solution center on density and price/performance, other benefits include:

• Elasticity – HPE BDRA is designed for flexibility. Compute nodes can be allocated very flexibly without redistributing data; for example, nodes can be allocated by time of day or even for a single job. The organization is no longer committed to yesterday's CPU/storage ratios, leading to much more flexibility in design and cost.
Moreover, with HPE BDRA, the system is grown only where needed.

• Consolidation – HPE BDRA is based on HDFS, which has enough performance, and can scale to large enough capacities, to be the single source for big data within any organization. The various pools of data currently used in big data projects can be consolidated into a single, central repository. YARN-compliant workloads access the big data directly via HDFS; other workloads can access the same data via appropriate connectors.

• Workload optimization – There is no single go-to software for big data; instead, there is a federation of data management tools. After selecting the appropriate tool to meet organizational requirements, run the job using the compute nodes best suited for the workload, such as those with a high core count, more memory, or both.

• Enhanced capacity management – Compute nodes can be provisioned on the fly, while storage nodes now constitute a smaller subset of the cluster and, as such, are less costly to overprovision. In addition, managing a single data repository rather than multiple different clusters reduces overall management costs.

• Faster time-to-solution – Processing big data typically requires the use of multiple data management tools. When these tools are deployed on conventional Hadoop clusters with dedicated – often fragmented – copies of the data, time-to-solution can be lengthy. With HPE BDRA, data is not fragmented but consolidated in a single data lake, allowing different tools to access the same data. More time can be spent on analysis and less on shipping data, so time-to-solution is typically faster.

Solution components

Figure 3 provides a basic conceptual diagram of HPE BDRA.

Figure 3. HPE BDRA concept

For full BOM listings of the products selected for the proof-of-concept, refer to Appendix A: Bill of materials of this white paper.
Minimum configuration

Figure 4 shows a minimum HPE BDRA configuration, with 12 worker nodes and 3 storage nodes housed in a single 42U rack.

Figure 4. Base HPE BDRA configuration, with detailed lists of components

Best practice
HPE recommends starting with three HPE Apollo 2000 chassis consisting of 12 HPE ProLiant XL170r nodes with 128 GB of memory per node, combined with three HPE Apollo 4200 Gen9 servers as storage nodes. For performance-sensitive Impala installations, consider increasing the memory in the compute nodes to 256 GB.

The following nodes are used in the base HPE BDRA configuration.

Edge node

The HPE BDRA solution includes a multi-homed network server called an edge node. It acts as a gateway between the cluster's private VLAN and the external routable network. Any application that requires both external network access and access to the cluster's private network can run on this server. When significant storage and network bandwidth are required, add more HPE ProLiant DL360 Gen9 servers as edge nodes.

Management and head nodes

Three HPE ProLiant DL360 Gen9 servers are configured as management/head nodes. Refer to Table 2 in the Node components section for the suggested software components for these nodes. Management and head nodes store important metadata for the cluster; for greater data protection and performance, a RAID configuration is suggested for their data storage. RAID 1 is recommended for the OS and NameNode metadata, and RAID 10 for database data. Because ZooKeeper and Quorum Journal Nodes are extremely latency sensitive, they are best implemented on a single spindle each, outside any RAID set (if RAID is a must for some reason, use RAID 0).

Compute nodes

The base HPE BDRA configuration features three HPE Apollo 2000 chassis containing a total of 12 HPE ProLiant XL170r Gen9 hot-pluggable server nodes.
The HPE ProLiant XL170r Gen9 is shown in Figure 5; the rear and front of the HPE Apollo 2000 chassis are shown in Figure 6.

Figure 5. HPE ProLiant XL170r Gen9

Figure 6. HPE Apollo 2000 chassis (rear and front)

The HPE Apollo 2000 system is a dense solution with four independent HPE ProLiant XL170r Gen9 hot-pluggable server nodes in a standard 2U chassis. Each HPE ProLiant XL170r Gen9 server node can be serviced individually without impacting the operation of the other nodes sharing the same chassis, providing increased server uptime. Each server node harnesses the performance of 2133 MHz memory (16 DIMM slots per node) and dual Intel Xeon E5-2600 v3 processors in a very efficient solution that shares both power and cooling infrastructure. Other features of the HPE ProLiant XL170r Gen9 server include:

• Support for high-performance DDR4 memory and Intel Xeon E5-2600 v3 processors with up to 18 cores at 145W
• Additional PCIe riser options for flexible and balanced I/O configurations
• FlexibleLOM feature for additional network expansion options
• Support for dual M.2 drives

For more information on the HPE Apollo 2000 chassis, visit hpe.com/us/en/product-catalog/servers/proliant-servers/pip.hpe-apollo-r2000-chassis.7832023.html. For more information on the HPE ProLiant XL170r Gen9 server, visit hpe.com/us/en/product-catalog/servers/proliant-servers/pip.hpe-proliant-xl170r-gen9-server.7799270.html.

Each of these compute nodes typically runs YARN NodeManagers or Impala daemons.

Storage nodes

There are three HPE Apollo 4200 Gen9 servers. Each server is configured with 28 LFF disks and typically runs an HDFS DataNode. The HPE Apollo 4200 Gen9 server is shown in Figure 7.

Figure 7. HPE Apollo 4200 Gen9 server

The HPE Apollo 4200 Gen9 server offers revolutionary storage density in a 2U form factor for data-intensive workloads such as Apache Spark on Hadoop HDFS.
The HPE Apollo 4200 Gen9 server allows you to save valuable data center space through its unique density-optimized 2U form factor, which holds up to 28 LFF disks with capacity for up to 224 TB per server, and lets you grow your big data solution on a density-optimized infrastructure that is ready to scale. Another benefit is that the HPE Apollo 4200 fits easily into standard racks, with a depth of 32 inches per server – no special racks are required. For more detailed information, visit hpe.com/us/en/product-catalog/servers/proliant-servers/pip.hpe-apollo-4200-gen9-server.8261831.html.

Key point
The storage controllers in the HPE Apollo 4200 and HPE ProLiant XL170r support HPE Secure Encryption, an HPE Smart Array controller-based data encryption solution. It provides encryption for data at rest, an important component for complying with government regulations that impose data privacy requirements, such as HIPAA and Sarbanes-Oxley. This optional feature enhances the functionality of the storage controller in cases where encryption is a required feature of the solution.

Best practice
HPE recommends starting with a minimum of three HPE Apollo 2000 chassis, with 12 compute nodes, and three to five HPE Apollo 4200 Gen9 servers as storage nodes.

Note
Cloudera recommends a minimum of five storage nodes in production deployments for optimal utilization of nodes in the context of HDFS replication and fault tolerance.

Power and cooling

When planning large clusters, it is important to properly manage power redundancy and distribution. To ensure that servers and racks have adequate power redundancy, HPE recommends that each HPE Apollo 2000 chassis and each HPE Apollo 4200 Gen9 server have a backup power supply, and that each rack have at least two Power Distribution Units (PDUs).
There is additional cost associated with procuring redundant power supplies; however, redundancy is less critical in larger clusters, where the inherent redundancy within CDH means a failure has less impact.
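The storage-node figures above, together with Cloudera's note on HDFS replication, can be combined into a quick capacity estimate. The sketch below is illustrative only: the 8 TB drive size and the 10% non-DFS reserve are assumptions and not mandated by this architecture, while the replication factor of 3 is the HDFS default.

```python
def usable_hdfs_tb(storage_nodes, drives_per_node=28, drive_tb=8.0,
                   replication=3, non_dfs_reserve=0.10):
    """Estimate usable HDFS capacity for an HPE BDRA storage tier.

    replication      -- HDFS block replication factor (default 3)
    non_dfs_reserve  -- fraction held back for non-DFS use (assumption)
    """
    raw_tb = storage_nodes * drives_per_node * drive_tb  # e.g. 224 TB/node
    dfs_tb = raw_tb * (1 - non_dfs_reserve)              # space HDFS may use
    return dfs_tb / replication                          # usable after replication

# Base configuration: three Apollo 4200 storage nodes
print(round(usable_hdfs_tb(3), 1))
```

Under these assumptions, the three-node base configuration (672 TB raw) yields roughly 200 TB of usable HDFS capacity, and Cloudera's recommended five storage nodes yield roughly 336 TB.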
Description: