OLTP on Hardware Islands

Danica Porobic, Ippokratis Pandis†, Miguel Branco, Pınar Tözün, Anastasia Ailamaki
École Polytechnique Fédérale de Lausanne, Lausanne, VD, Switzerland
†IBM Almaden Research Center, San Jose, CA, USA
{danica.porobic,miguel.branco,pinar.tozun,anastasia.ailamaki}@epfl.ch, [email protected]

ABSTRACT

Modern hardware is abundantly parallel and increasingly heterogeneous. The numerous processing cores have non-uniform access latencies to the main memory and to the processor caches, which causes variability in the communication costs. Unfortunately, database systems mostly assume that all processing cores are the same and that microarchitecture differences are not significant enough to appear in critical database execution paths. As we demonstrate in this paper, however, hardware heterogeneity does appear in the critical path, and conventional database architectures achieve suboptimal and, even worse, unpredictable performance.

We perform a detailed performance analysis of OLTP deployments in servers with multiple cores per CPU (multicore) and multiple CPUs per server (multisocket). We compare different database deployment strategies where we vary the number and size of independent database instances running on a single server, from a single shared-everything instance to fine-grained shared-nothing configurations. We quantify the impact of non-uniform hardware on various deployments by (a) examining how efficiently each deployment uses the available hardware resources and (b) measuring the impact of distributed transactions and skewed requests on different workloads. Finally, we argue in favor of shared-nothing deployments that are topology- and workload-aware and that take advantage of fast on-chip communication between islands of cores on the same socket.

1. INTRODUCTION

On-Line Transaction Processing (OLTP) is a multi-billion dollar industry (e.g., http://www.gartner.com/DisplayDocument?id=1044912) and one of the most important and demanding database applications. Innovations in OLTP continue to deserve significant attention, as evidenced by the recent emergence of appliances (such as Oracle's Exadata database machine), startups (such as VoltDB, MongoDB, NuoDB, and others), and research projects (e.g. [31, 18, 25, 21, 24]). OLTP applications are mission-critical for many enterprises, with little margin for compromising either performance or scalability. Thus, it is not surprising that all major OLTP vendors spend significant effort in developing highly optimized software releases, often with platform-specific optimizations.

Over the past decades, OLTP systems benefited greatly from improvements in the underlying hardware. Innovations in their software architecture have been plentiful, but there is a clear benefit from processor evolution. Uni-processors grew predictably faster with time, leading to better OLTP performance. Around 2005, when processor vendors hit the frequency-scaling wall, they started obtaining performance improvements by adding multiple processing cores to the same CPU chip, forming chip multiprocessors (multicore or CMP), and by building servers with multiple CPU sockets of multicore processors (SMP of CMP).

Multisockets of multicores are highly parallel and characterized by heterogeneity in the communication costs: sets, or islands, of processing cores communicate with each other very efficiently through common on-chip caches, and communicate less efficiently with others through bandwidth-limited and higher-latency links. Even though multisocket multicore machines dominate modern data centers, it is unclear how well software systems, and in particular OLTP systems, exploit their hardware capabilities.

This paper characterizes the impact of hardware topology on the behavior of OLTP systems running on modern multisocket multicore servers. As recent studies argue and this paper corroborates, traditional shared-everything OLTP systems underperform on modern hardware because of (a) excessive communication between the various threads [5, 14] and (b) contention among threads [26, 31]. Practitioners report that even commercial shared-everything systems with support for non-uniform memory architectures (NUMA) underperform [11, 36]. On the other hand, shared-nothing deployments [30] face the challenges of (a) higher execution costs when distributed transactions are required [16, 9, 12, 27], even within a single node, particularly if the communication occurs over slower links (e.g. across CPU sockets), and (b) load imbalances due to skew [33].

Many real-life workloads cannot be easily partitioned across instances or can have significant data and request skews, which may also change over time. In this paper, we examine the impact of perfectly partitionable and non-partitionable workloads, with and without data skew, on shared-nothing deployments of varying sizes as well as on shared-everything deployments. Our experiments show that perfectly partitionable workloads perform significantly better on fine-grained shared-nothing configurations, but non-partitionable workloads favor coarse-grained configurations, due to the overhead of distributed transactions. We identify the overheads as messaging, additional logging, and increased contention, all of which depend on workload characteristics such as the percentage of multisite transactions, the number of sites touched by each transaction, and the amount of work done within each transaction. Additionally, we find that skewed accesses cause performance to drop significantly when using fine-grained shared-nothing configurations; this effect is less evident on coarser configurations and when using shared-everything deployments.

To our knowledge, this is the first study that systematically analyzes the performance of shared-everything and shared-nothing OLTP configurations of varying size on modern multisocket multicore machines. The contributions are as follows:

• We provide experimental evidence of the impact of non-uniform hardware on the performance of transaction processing systems and conclude that high-performance software has to minimize contention among cores and avoid frequent communication between distant cores.

• Our experiments show that fine-grained shared-nothing deployments can achieve more than four times higher throughput than a shared-everything system when the workload is perfectly partitionable. By contrast, when the workload is not partitionable and/or exhibits skew, shared-everything achieves twice the throughput of shared-nothing. Therefore, there is no single optimal deployment strategy that is independent of the workload.

• We demonstrate that a careful assignment of threads to islands of cores can combine the best features of a broad range of system configurations, thereby achieving flexibility in the deployment as well as more predictable and robust performance. In particular, islands-aware thread assignment can improve the worst-case scenario by a factor of 2 without hurting the best-case performance much.

The rest of the document is structured as follows. Section 2 presents the background and related work, describing the two main database deployment approaches. Section 3 identifies recent trends in modern hardware and their implications for software design. Section 4 discusses the dependence of database system performance on hardware topology and on workload characteristics such as the percentage of distributed transactions. Section 5 presents the experimental methodology. Section 6 describes cases favoring fine-grained shared-nothing configurations, and Section 7 analyzes possible overheads when deploying shared-nothing configurations. Finally, Section 8 summarizes the findings and discusses future work.

2. BACKGROUND AND RELATED WORK

Shared-everything and shared-nothing database designs, described in the next two sections, are the most widely used approaches for OLTP deployments. Legacy multisocket machines, which gained popularity in the 1990s as symmetric multiprocessing servers, had non-uniform memory access (NUMA) latencies. This required changes to the database and operating systems to diminish the impact of NUMA, as discussed in Section 2.3.

2.1 Shared-everything Database Deployments

Within a database node, shared-everything is any deployment where a single database instance manages all the available resources. As database servers have long been designed to operate on machines with multiple processors, shared-everything deployments assume equally fast communication channels between all processors, since each thread needs to exchange data with its peers. Until recently, shared-everything was the most popular deployment strategy on a single node. All major commercial database systems adopt it.

OLTP has been studied extensively on shared-everything databases. For instance, transactions suffer significant stalls during execution [3, 2, 14], a result we corroborate in Section 6.2. It has also been shown that shared-everything systems have frequent shared read-write accesses [5, 14], which are difficult to predict [29]. Modern systems enter numerous contentious critical sections even when executing simple transactions, which affects single-thread performance, requires frequent inter-core communication, and causes contention among threads [26, 25, 18]. These characteristics make distributed memories (as in multisockets), distributed caches (as in multicores), and prefetchers ineffective. Recent work suggests a departure from the traditional transaction-oriented execution model toward a data-oriented execution model that circumvents the aforementioned properties, and flaws, of traditional shared-everything OLTP [25, 26].

2.2 Shared-nothing Database Deployments

Shared-nothing deployments [30], based on fully independent (physically partitioned) database instances that collectively process the workload, are an increasingly appealing design even within a single node [31, 21, 28]. This is due to the scalability limitations of shared-everything systems, which suffer from contention when concurrent threads attempt to access shared resources [18, 25, 26].

The main advantage of shared-nothing deployments is the explicit control over the contention within each physical database instance. As a result, shared-nothing systems exhibit high single-thread performance and low contention. In addition, shared-nothing databases typically make better use of the available hardware resources whenever the workload executes transactions touching data in a single database instance. Systems such as H-Store [31] and HyPer [21] apply the shared-nothing design to the extreme, deploying one single-threaded database instance per CPU core. This enables the simplification or removal of expensive database components such as logging and locking.

Shared-nothing systems appear ideal from the hardware utilization perspective, but they are sensitive to the ability to partition the workload. Unfortunately, many workloads are not perfectly partitionable, i.e. it is hardly possible to allocate data such that every transaction touches a single instance. Whenever multiple instances must collectively process a request, shared-nothing databases require expensive distributed consensus protocols, such as two-phase commit, which many argue are inherently non-scalable [16, 9]. Similarly, handling data and access skew is problematic [33].

The overhead of distributed transactions urged system designers to explore partitioning techniques that reduce the frequency of distributed transactions [12, 27], and to explore alternative concurrency control mechanisms, such as speculative locking [20], multiversioning [6], and optimistic concurrency control [22, 24], to reduce the overheads when distributed transactions cannot be avoided. Designers of large-scale systems have circumvented the problems of distributed transactions by using relaxed consistency models such as eventual consistency [35]. Eventual consistency eliminates the need for synchronous distributed transactions, but it makes programming transactional applications harder, with consistency checks left to the application layer.

The emergence of multisocket multicore hardware adds further complexity to the ongoing debate between shared-everything and shared-nothing OLTP designs. As Section 3 describes, multisocket multicores introduce an additional level into the memory hierarchy. Communication between processors is no longer uniform: cores that share caches communicate differently from cores in the same socket and from cores in other sockets.

2.3 Performance on Multisocket Multicores

Past work focused on adapting databases to legacy multisocket systems. For instance, commercial database systems provide configuration options to enable NUMA support, but this setting is optimized for legacy hardware where each individual CPU is assumed to contain a single core. On newer multisocket servers, enabling NUMA support might lead to high CPU usage and degraded performance [11, 36].

An alternative approach is taken by the Multimed project, which views the multisocket multicore system as a cluster of machines [28]. Multimed uses replication techniques and a middleware layer to split database instances into those that process read-only requests and those that process updates. The authors report higher performance than a single shared-everything instance. However, Multimed does not explicitly address NUMA-awareness, and the work is motivated by the fact that the shared-everything system being used has inherent scalability limitations. In this paper, we use a scalable open-source shared-everything OLTP system, Shore-MT [18], which scales nearly linearly with the available cores on single-socket machines; nevertheless, we still observe benefits from shared-nothing deployments based on Shore-MT.

A comparison of techniques for executing hash joins on multicore machines [8], corresponding broadly to shared-everything and shared-nothing configurations of different sizes, illustrates a case where shared-everything has appealing characteristics. The operation under study, however, the hash join, has different characteristics from OLTP.

Exploiting NUMA effects at the operating system level is an area of active research. Some operating system kernels, such as Mach [1] and exokernels [13], or, more recently, Barrelfish [4], employ the message-passing paradigm. Message passing potentially facilitates the development of NUMA-aware systems, since the communication between threads is done explicitly through messages, which the operating system can schedule in a NUMA-aware way. Other proposals include schedulers that detect contention and react in a NUMA-aware manner [7, 32]. None of these proposals is specific to database systems, and they likely require extensive changes to the database engine.
3. HARDWARE HAS ISLANDS

Hardware has long departed from uniprocessors, which had predictable and uniform performance. Due to thermal and power limitations, vendors cannot improve the performance of processors by clocking them to higher frequencies or by using more advanced techniques such as increased instruction width and extended out-of-order execution. Instead, two approaches are mainly used to increase the processing capability of a machine. The first is to put together multiple processor chips that communicate through shared main memory. For several decades, such multisocket designs provided the only way to scale performance within a single node, and the majority of OLTP systems have historically used such hardware. The second approach places multiple processing cores on a single chip, such that each core is capable of processing several independent instruction streams, or hardware contexts, concurrently. The communication between cores in these multicore processors happens through on-chip caches. In recent years, multicore processors have become a commodity.

Multisocket multicore systems are the predominant configuration for database servers and are expected to remain popular in the future. Figure 1 shows a simplified block diagram of a typical machine that has two sockets with quad-core CPUs (adapted from http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf). Communication between the numerous cores happens through different mechanisms. For example, cores in the same socket share a common cache, while cores located in different sockets communicate via the interconnect (called QPI for Intel processors). Cores may also communicate through the main memory if the data is not currently cached. The result is that inter-core communication is variable: communication within a multicore chip is more efficient than across sockets, which communicate over a slower, power-hungry, and often bandwidth-limited interconnect.

Figure 1: Block diagram of a typical machine. Cores communicate either through a common cache, an interconnect across sockets, or main memory.

Hence, there are two main trends in modern hardware: the variability in communication latencies and the abundance of parallelism. In the following two subsections we discuss how each trend affects the performance of software systems.

3.1 Variable Communication Latencies

The impact of modern processor memory hierarchies on application performance is significant because it causes variability in access latency and bandwidth, making the overall software performance unpredictable. Furthermore, it is difficult to implement synchronization or communication mechanisms that are globally optimal in different environments: multicores, multisockets, and multisockets of multicores.

We illustrate the problem of optimal synchronization mechanisms with a simple microbenchmark. Figure 2 plots the throughput of a program running on a machine that has 8 CPUs with 10 cores each (the "Octo-socket" machine of Table 2). There are 80 threads in the program, divided into groups of 10 threads, where each group increments a counter protected by a lock in a tight loop. There are 8 counters in total, matching the number of sockets in the machine. We vary the allocation of the worker threads and plot the total throughput (million counter increments per second). The first bar ("Spread" threads) spreads the worker threads across all sockets. The second bar ("Grouped" threads) allocates all threads of a group to the same socket as their counter. The third bar ("OS") leaves the thread allocation to the operating system. Allocating threads and memory in a topology-aware manner results in the best performance and the lowest variability. Leaving the allocation to the operating system leads to suboptimal results and higher variability; this observation has also been made by others, e.g. [4], and is an area of active research.

Figure 2: Throughput (millions of increments per second) for "Spread" threads, "Grouped" threads, and OS-scheduled threads. Allocating threads and memory in a topology-aware manner provides the best performance and lower variability.

We obtain similar results when running OLTP workloads. To demonstrate the impact of NUMA latencies on OLTP, we run TPC-C Payment transactions on a machine that has 4 CPUs with 6 cores each ("Quad-socket" in Table 2). Figure 3 plots the average throughput and standard deviation across multiple executions on a database with 4 worker threads. In each configuration we vary the allocation of individual worker threads to cores. The first configuration ("Spread") assigns each thread to a core in a different socket. The second configuration ("Group") assigns all threads to the same socket. The configuration "Mix" assigns two cores per socket. In the "OS" configuration, we let the operating system do the scheduling.

Figure 3: Throughput (KTps) of the TPC-C Payment workload under the "Spread", "Group", "Mix", and "OS" thread placements. Running with all cores on the same socket achieves 20-30% higher performance than the other configurations.

This experiment corroborates the observations of Figure 2: the OS does not allocate work to cores optimally, and a topology-aware configuration achieves 20-30% better performance and less variation. The absolute difference in performance is much lower than in the counter-incrementing case because executing a transaction has significant start-up and finish costs, and during transaction execution a large fraction of the time is spent on operations other than accessing data. For instance, studies show that around 20% of the total instructions executed during OLTP are data loads or stores (e.g. [3, 14]).
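To make the placement experiment concrete, the following is a minimal sketch of such a counter microbenchmark. It assumes Linux, the pthread affinity API, and a simple core numbering in which socket s owns cores [s*10, (s+1)*10); the constants, the core-id mapping, and the run length are illustrative assumptions, not the exact harness used in the paper.

```cpp
// Sketch of the Figure 2 experiment: 80 threads, 8 lock-protected counters,
// with "spread" vs. "grouped" thread placement. Build: g++ -O2 -pthread pin.cpp
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kSockets = 8;            // "Octo-socket" machine of Table 2
constexpr int kCoresPerSocket = 10;
constexpr int kThreads = kSockets * kCoresPerSocket;

struct alignas(64) Counter {           // one lock-protected counter per socket
  std::mutex lock;
  long value = 0;
};
Counter counters[kSockets];
std::atomic<bool> stop{false};

void pin_to_core(int core) {           // bind the calling thread to one core
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

long worker(int tid, bool grouped) {
  int counter_socket = tid / kCoresPerSocket;   // counter this thread increments
  int core = grouped
                 ? tid                                                    // on the counter's socket
                 : (tid % kSockets) * kCoresPerSocket + tid / kSockets;  // spread across sockets
  pin_to_core(core);
  long done = 0;
  while (!stop.load(std::memory_order_relaxed)) {
    std::lock_guard<std::mutex> g(counters[counter_socket].lock);
    counters[counter_socket].value++;
    done++;
  }
  return done;
}

double run(bool grouped, int seconds = 5) {
  stop = false;
  std::vector<std::thread> threads;
  std::vector<long> done(kThreads, 0);
  for (int t = 0; t < kThreads; t++)
    threads.emplace_back([&, t] { done[t] = worker(t, grouped); });
  std::this_thread::sleep_for(std::chrono::seconds(seconds));
  stop = true;
  for (auto& th : threads) th.join();
  long total = 0;
  for (long d : done) total += d;
  return total / (double)seconds / 1e6;           // million increments per second
}

int main() {
  std::printf("spread : %.1f M increments/s\n", run(false));
  std::printf("grouped: %.1f M increments/s\n", run(true));
}
```

Counter memory placement in this sketch simply relies on the operating system's default first-touch policy; a fuller version would allocate each counter explicitly on its socket, e.g. with libnuma.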
3.2 Abundant Hardware Parallelism

Another major trend is the abundant hardware parallelism available in modern database servers. Higher hardware parallelism potentially causes additional contention in multisocket multicore systems, as a larger number of cores compete for shared data accesses. Table 1 shows the results obtained on the octo-socket machine when varying the number of worker threads accessing a set of counters, each protected by a lock. An exclusive counter per core achieves lower variability and 18x higher throughput than a counter per socket, and 517x higher throughput than a single counter for the entire machine. In both cases, this is a super-linear speedup. Shared-nothing deployments are better suited to handle contention, since they provide explicit control by physically partitioning data, leading to higher performance.

Table 1: Throughput and variability when increasing the number of counters, each protected by a lock.

Counter setup | Total counters | Throughput M/sec (Speedup) | Std. dev. (%)
Single        | 1              | 18.4                       | 9.33
Per socket    | 8              | 341.7 (18.5x)              | 0.86
Per core      | 80             | 9527.8 (516.8x)            | 0.03
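The experiment of Table 1 comes down to how many threads share each synchronization point. The fragment below is a hedged companion sketch of that idea: it uses atomic counters instead of the lock-protected counters of the original harness and does not pin threads, so it illustrates only the contention effect, not the exact numbers.

```cpp
// Same increment loop with 1, per-socket, and per-core counters.
// Build (hypothetically): g++ -O2 -std=c++17 -pthread counters.cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct alignas(64) PaddedCounter { std::atomic<long> v{0}; };  // avoid false sharing

double run(int num_threads, int num_counters, int seconds = 3) {
  std::vector<PaddedCounter> counters(num_counters);
  std::atomic<bool> stop{false};
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; t++)
    workers.emplace_back([&, t] {
      PaddedCounter& c = counters[t % num_counters];   // counter this thread uses
      while (!stop.load(std::memory_order_relaxed))
        c.v.fetch_add(1, std::memory_order_relaxed);
    });
  std::this_thread::sleep_for(std::chrono::seconds(seconds));
  stop = true;
  for (auto& w : workers) w.join();
  long total = 0;
  for (auto& c : counters) total += c.v.load();
  return total / (double)seconds / 1e6;                // million increments per second
}

int main() {
  const int threads = 80;                 // one per core on the octo-socket box
  std::printf("single counter : %.1f M/s\n", run(threads, 1));
  std::printf("per-socket (8) : %.1f M/s\n", run(threads, 8));
  std::printf("per-core  (80) : %.1f M/s\n", run(threads, threads));
}
```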
In summary, modern hardware poses new challenges to software systems. Contention and topology have a significant impact on the performance and predictability of software. Predictably fast transaction processing systems have to take advantage of the hardware islands in the system. They need to (a) avoid frequent communication between "distant" cores in the processor topology and (b) keep the contention among cores low. The next section argues in favor of topology-aware OLTP deployments that adapt to those hardware islands.

4. ISLANDS: HARDWARE TOPOLOGY- AND WORKLOAD-AWARE SHARED-NOTHING OLTP

Traditionally, database systems fall into one of two main categories: shared-everything or shared-nothing. This distinction into two strict categories, however, captures neither the fact that there are many alternative shared-nothing configurations of different sizes, nor how each shared-nothing instance is mapped to CPU cores.

Figure 4 illustrates three different shared-nothing configurations on a four-socket machine with four cores per socket. The two left-most configurations, labeled "2 Islands" and "4 Islands", dedicate a different number of cores per instance but, for the given size, minimize the NUMA effects as much as possible: computation within an instance is done on nearby cores. The third configuration, "4 Spread", has the same size per instance as "4 Islands"; however, it does not minimize the NUMA effects, as it forces communication across sockets when that is strictly not needed. The first two configurations are islands in our terminology, where an island is a shared-nothing configuration in which each shared-nothing instance is placed in a topology-aware manner. The third configuration is simply a shared-nothing configuration. As hardware becomes more parallel and more heterogeneous, the design space of possible shared-nothing configurations grows, and it becomes harder to determine the optimal deployment.

Figure 4: Different shared-nothing configurations ("2 Islands", "4 Islands", "4 Spread") on a four-socket four-core machine.

On top of the hardware complexity, we have to consider that the cost of a transaction in a shared-nothing environment also depends on whether the transaction is local to a database instance or distributed. A transaction is local when all the data required by the transaction is stored in a single database instance. A transaction is distributed when multiple database instances need to be contacted and slow distributed consensus protocols (such as two-phase commit) need to be employed. Thus, the throughput also heavily depends on the workload, adding another dimension to the design space and making the optimal deployment decision nearly "black magic" (explaining, among other reasons, the high compensation for skilled database administrators).

An oversimplified estimation of the throughput of a shared-nothing deployment as a function of the number of distributed transactions is the following. If T_local is the performance of the shared-nothing system when each instance executes only local transactions, and T_distr is the performance of a shared-nothing deployment when every transaction requires data from more than one database instance, then the total throughput T is:

    T = (1 - p) * T_local + p * T_distr

where p is the fraction of distributed transactions executed.

In a shared-everything configuration all transactions are local (p_SE = 0). On the other hand, the percentage of distributed transactions in a shared-nothing system depends on the partitioning algorithm and the system configuration. Typically, shared-nothing configurations of larger size execute fewer distributed transactions, as each database instance contains more data. That is, a given workload has a set of transactions that access data in a single logical site, and transactions that access data in multiple logical sites, which we call multisite transactions. A single database instance may hold the data of multiple logical sites. In that case, multisite transactions can actually be physically local transactions, since all the required data reside in the same database instance. Distributed transactions are required only for multisite transactions whose data reside in different physical database instances. Assuming the same partitioning algorithm is used (e.g. [12, 27]), the more data a database instance contains, the more likely a transaction is to be local.
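The model above can be used for quick what-if estimates, for example to find the fraction of distributed transactions at which a coarser deployment overtakes a finer one. The snippet below is such a back-of-the-envelope calculation; the throughput numbers and configuration names are invented for illustration, and only the formula itself comes from the text.

```cpp
// Toy use of T(p) = (1-p)*T_local + p*T_distr. Values are hypothetical.
#include <cstdio>

struct Config { const char* name; double t_local, t_distr; };   // in ktps

double throughput(const Config& c, double p) {   // p = fraction of distributed txns
  return (1.0 - p) * c.t_local + p * c.t_distr;
}

int main() {
  Config fine   {"24ISL (fine-grained)", 180.0, 20.0};
  Config coarse {"4ISL (islands)",       120.0, 45.0};
  for (double p = 0.0; p <= 1.0; p += 0.25)
    std::printf("p=%.2f  %-22s %6.1f ktps   %-16s %6.1f ktps\n",
                p, fine.name, throughput(fine, p),
                coarse.name, throughput(coarse, p));
  // Break-even fraction where the coarser deployment starts to win:
  double p_star = (fine.t_local - coarse.t_local) /
                  ((fine.t_local - coarse.t_local) + (coarse.t_distr - fine.t_distr));
  std::printf("coarse overtakes fine at p = %.2f\n", p_star);
}
```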
Given the previous reasoning, one could argue that an optimal shared-nothing configuration consists of a few coarse-grained database instances. This would be a naive assumption, as it ignores the effects of hardware parallelism and variable communication costs. For example, if we consider contention, then the cost of a (local) transaction in a coarse-grained shared-nothing configuration C_coarse is higher than the cost of a (local) transaction in a very fine-grained configuration C_fine, because the number of concurrently contending threads is larger; that is, T_coarse < T_fine. If we consider communication latency, then the cost of a topology-aware islands configuration C_islands of a certain size is lower than the cost of a topology-unaware shared-nothing configuration C_naive of the same size; that is, T_naive < T_islands.

As a result, this paper makes the case for OLTP Islands, which are hardware topology- and workload-aware shared-nothing deployments. Figure 5 illustrates the expected behavior of Islands, shared-everything, and finer-grained shared-nothing configurations as the percentage of multisite transactions in the workload increases. Islands exploit the properties of modern hardware by using the sets of cores that communicate faster with each other. Islands are shared-nothing designs, but they partially combine the advantages of both shared-everything and shared-nothing deployments. Similarly to a shared-everything system, Islands provide robust performance even when transactions in the workload vary slightly. At the same time, performance on well-partitioned workloads should be high, due to lower contention and the avoidance of higher-latency communication links. Their performance, however, is not as high as that of a fine-grained shared-nothing system, since each node has more worker threads operating on the same data. At the other side of the spectrum, the performance of Islands does not deteriorate as sharply as that of fine-grained shared-nothing in the presence of, e.g., skew.

Figure 5: Performance of the various deployment configurations (fine-grained shared-nothing, Islands, shared-everything) as the percentage of multisite transactions in the workload increases.

5. EXPERIMENTAL SETUP

In the following sections we perform a thorough evaluation of the benefits of various deployment strategies under a variety of workloads on two modern multisocket multicore machines, one with four sockets of 6-core CPUs and one with eight sockets of 10-core CPUs (for more details see http://www.supermicro.com/manuals/motherboard/7500/X8OBN-F.pdf).

Hardware and tools. Table 2 describes in detail the hardware used in the experiments. We disable HyperThreading to reduce variability in the measurements. The operating system is Red Hat Enterprise Linux 6.2 (kernel 2.6.32). In the experiment of Section 7.4, we use two 146 GB 10kRPM SAS 2.5" HDDs in RAID-0. We use Intel VTune Amplifier XE 2011 to collect basic microarchitectural and time-breakdown profiling results. VTune does hardware counter sampling, which is both accurate and lightweight. Our database system is compiled using GCC 4.4.3 with maximum optimizations. In most experiments, the database fits in the aggregate buffer pool, so the only I/O is due to the flushing of log entries. However, since the disks are not capable of sustaining the I/O load, we use memory-mapped disks for both data and log files. Overall, we exercise all code paths in the system and utilize all available hardware contexts.

Table 2: Description of the machines used.

Quad-socket: 4 x Intel Xeon E7530 @ 1.86 GHz, 6 cores per CPU, fully connected with QPI, 64 GB RAM, 64 KB L1 and 256 KB L2 cache per core, 12 MB shared L3 cache per CPU.
Octo-socket: 8 x Intel Xeon E7-L8867 @ 2.13 GHz, 10 cores per CPU, connected using 3 QPI links per CPU, 192 GB RAM, 64 KB L1 and 256 KB L2 cache per core, 30 MB shared L3 cache per CPU.

IPC mechanisms. The performance of any shared-nothing system depends heavily on the efficiency of its communication layer. Figure 6 shows the performance on the quad-socket machine of various inter-process communication (IPC) mechanisms, using a simple benchmark that exchanges messages between two processes located either in the same CPU socket or in different sockets. Unix domain sockets achieve the highest performance and are used throughout the remaining evaluation.

Figure 6: Throughput of message exchanging (in thousands of messages per second) for a set of inter-process communication mechanisms (FIFOs, POSIX message queues, pipes, TCP sockets, Unix domain sockets), with the two processes on the same socket or on different sockets. Unix domain sockets are the highest performing.
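A measurement of this kind can be reproduced with a small ping-pong benchmark. The sketch below covers only the Unix domain socket case and assumes Linux, a 64-byte message, and illustrative core ids for the same-socket and cross-socket placements; it is not the benchmark used for Figure 6.

```cpp
// Unix domain socket ping-pong between two pinned processes. g++ -O2 ipc.cpp
#include <sched.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>

static void pin(int core) {
  cpu_set_t set; CPU_ZERO(&set); CPU_SET(core, &set);
  sched_setaffinity(0, sizeof(set), &set);      // 0 = the calling process
}

int main() {
  const int iters = 200000;
  const int core_a = 0, core_b = 1;             // e.g. 1 -> same socket, 6 -> remote socket
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) { perror("socketpair"); return 1; }

  char msg[64] = "payload";                     // 64-byte request/response
  if (fork() == 0) {                            // child: echo every message back
    pin(core_b);
    close(fds[0]);
    for (int i = 0; i < iters; i++) {
      if (read(fds[1], msg, sizeof msg) <= 0) break;
      write(fds[1], msg, sizeof msg);
    }
    _exit(0);
  }

  pin(core_a);
  close(fds[1]);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; i++) {             // synchronous request/response
    write(fds[0], msg, sizeof msg);
    read(fds[0], msg, sizeof msg);
  }
  auto secs = std::chrono::duration<double>(
                  std::chrono::steady_clock::now() - start).count();
  std::printf("%.0f round-trips/s (%.1f KMsgs/s counting both directions)\n",
              iters / secs, 2 * iters / secs / 1000.0);
  wait(nullptr);
  return 0;
}
```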
5.1 Prototype System

To evaluate the performance of various shared-nothing deployments on multisocket multicore hardware, we implemented a prototype shared-nothing transaction processing system on top of the Shore-MT [18] storage manager. We opted for Shore-MT as the underlying system since it provides near-linear scalability on single multicore machines. Shore-MT is the improved version of the SHORE storage manager, originally developed as an object-relational data store [10]. Shore-MT is designed to remove scalability bottlenecks, significantly improving Shore's original single-thread performance. Its performance and scalability are at the highest end of open-source storage managers [18].

Shore-MT is originally a shared-everything system. Therefore, we extended Shore-MT with the ability to run in shared-nothing configurations by implementing a distributed transaction coordinator that uses the standard two-phase commit protocol.

Shore-MT includes a number of state-of-the-art optimizations for local transactions, such as speculative lock inheritance [17]. We extended these features to distributed transactions, providing a fair comparison between the execution of local and distributed transactions.
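For readers unfamiliar with the protocol, the control flow that such a coordinator has to implement looks roughly as follows. This is a deliberately simplified, in-process sketch, not the Shore-MT code: in the prototype the prepare/commit/abort messages travel between database instances over Unix domain sockets, and both phases force log records to disk, which is where much of the extra cost of distributed transactions discussed in Section 7 comes from.

```cpp
// Minimal illustration of two-phase commit coordination (assumed interfaces).
#include <cstdio>
#include <string>
#include <vector>

enum class Vote { kCommit, kAbort };

struct Participant {                             // one per database instance in the txn
  std::string name;
  virtual Vote prepare(long txn_id) = 0;         // phase 1: force a PREPARE log record, vote
  virtual void commit(long txn_id)  = 0;         // phase 2: apply and log COMMIT
  virtual void abort(long txn_id)   = 0;         //          or roll back
  virtual ~Participant() = default;
};

bool two_phase_commit(long txn_id, std::vector<Participant*>& parts) {
  // Phase 1: ask every participant to prepare; any "no" vote aborts the transaction.
  bool all_yes = true;
  for (auto* p : parts)
    if (p->prepare(txn_id) != Vote::kCommit) { all_yes = false; break; }

  // The coordinator's decision must be logged durably before phase 2 (omitted here).
  // Phase 2: broadcast the outcome; participants apply it and release their locks.
  for (auto* p : parts) {
    if (all_yes) p->commit(txn_id); else p->abort(txn_id);
  }
  return all_yes;
}

// A trivial participant that always votes commit, for illustration only.
struct DummyInstance : Participant {
  explicit DummyInstance(std::string n) { name = std::move(n); }
  Vote prepare(long t) override { std::printf("%s: prepared %ld\n", name.c_str(), t); return Vote::kCommit; }
  void commit(long t) override  { std::printf("%s: committed %ld\n", name.c_str(), t); }
  void abort(long t) override   { std::printf("%s: aborted %ld\n", name.c_str(), t); }
};

int main() {
  DummyInstance a("inst0"), b("inst1");
  std::vector<Participant*> parts{&a, &b};
  std::printf("outcome: %s\n", two_phase_commit(42, parts) ? "commit" : "abort");
}
```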
5.2 Workload and Experimental Methodology

In our experiments, we vary the number of instances (i.e., partitions) of the database system. Each instance runs as a separate process. In all experiments, the total amount of input data is kept constant and the data is range-partitioned across all instances of the system. For every experiment, with the exception of Section 7.4, we use a small dataset with 240,000 rows (roughly 60 MB). We show results using different database configurations, but we always use the same total amount of data, processors, and memory resources. Only the number of instances and the distribution of resources across instances change.

We ensure that each database instance is optimally deployed. That is, each database process is bound to cores within a single socket (minimizing NUMA effects) whenever possible, and its memory is allocated in the nearest memory bank. As noted in Section 3, allowing the operating system to schedule processes arbitrarily leads to suboptimal placement and thread migration, which degrades performance.

The configurations in the graphs are labeled "NISL", where N is the number of instances. For instance, 8ISL represents the configuration with 8 database instances, each of which holds 1/8th of the total data and uses 3 processor cores (the quad-socket machine has 24 cores in total). The number of instances is varied from 1 (i.e., a shared-everything system) to 24 (i.e., a fine-grained shared-nothing system). Optimizations are also applied to particular configurations whenever possible; e.g., fine-grained shared-nothing allows certain optimizations to be applied. The optimizations used are noted along with the corresponding experimental results.

We run two microbenchmarks. The first consists of read-only transactions that retrieve N rows. The second consists of read-write transactions that update N rows. For each microbenchmark, we run two types of transactions:

• Local transactions, which perform their action (read or update) on N rows located in the local partition;

• Multisite transactions, which perform their action (read or update) on one row located in the local partition, while the remaining N-1 rows are chosen uniformly from the whole data range. Such transactions are distributed if some of the input rows happen to be located in remote partitions.
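The following sketch illustrates how such a driver routes rows to range-partitioned instances and when a multisite transaction actually becomes distributed. The row count, instance count, and random-number usage are assumptions made for the example; the text specifies only range partitioning and the two transaction types.

```cpp
// Routing rows of local/multisite transactions to range-partitioned instances.
// Build (hypothetically): g++ -O2 route.cpp
#include <cstdio>
#include <random>
#include <set>
#include <vector>

constexpr long kTotalRows = 240000;     // small dataset (~60 MB)
constexpr int  kInstances = 8;          // the "8ISL" configuration

int instance_of(long row) {             // range partitioning: contiguous chunks
  long rows_per_instance = kTotalRows / kInstances;
  long i = row / rows_per_instance;
  return i < kInstances ? (int)i : kInstances - 1;
}

// Build the N input rows of one transaction submitted to the `home` instance.
std::vector<long> make_inputs(int home, int n, bool multisite, std::mt19937_64& rng) {
  long chunk = kTotalRows / kInstances;
  std::uniform_int_distribution<long> local_row(home * chunk, (home + 1) * chunk - 1);
  std::uniform_int_distribution<long> any_row(0, kTotalRows - 1);
  std::vector<long> rows;
  rows.push_back(local_row(rng));                     // first row is always local
  for (int i = 1; i < n; i++)
    rows.push_back(multisite ? any_row(rng) : local_row(rng));
  return rows;
}

int main() {
  std::mt19937_64 rng(42);
  int distributed = 0, trials = 100000, n = 10;
  for (int t = 0; t < trials; t++) {
    auto rows = make_inputs(/*home=*/t % kInstances, n, /*multisite=*/true, rng);
    std::set<int> touched;
    for (long r : rows) touched.insert(instance_of(r));
    if (touched.size() > 1) distributed++;            // needs two-phase commit
  }
  std::printf("%.1f%% of multisite transactions are physically distributed\n",
              100.0 * distributed / trials);
}
```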
6. CASES FAVORING FINE-GRAINED PARTITIONING

This section presents two cases where fine-grained shared-nothing configurations outperform coarser-grained shared-nothing configurations as well as shared-everything.

6.1 Perfectly Partitionable Workloads

If the workload is perfectly partitionable, then fine-grained shared-nothing provides better performance. An example is shown in Figure 7, obtained on the quad-socket machine, which compares the performance of the shared-everything version of Shore-MT with the fine-grained shared-nothing version of Shore-MT (24ISL). Both systems run a modified version of the TPC-C benchmark [34] Payment transaction, where all the requests are local and, hence, the workload is perfectly partitionable on Warehouses. The fine-grained shared-nothing configuration outperforms shared-everything by 4.5x, due in large part to contention on the Warehouse table in the shared-everything case. Experiments with short-running microbenchmarks in later sections, however, do not show such a large difference between shared-everything and shared-nothing. This is because of the larger number of rows in the table being accessed, which implies lower contention on any particular row.

Figure 7: Running the TPC-C benchmark with only local transactions. Fine-grained shared-nothing is 4.5x faster than shared-everything.

6.2 Read-only Workloads

Fine-grained shared-nothing configurations are also appropriate for read-only workloads. In the following experiment we run the microbenchmark with local transactions that retrieve 10 rows each. We test multiple configurations, ranging from 24ISL to 1ISL, on the quad-socket machine. The 24ISL configuration is run without locking or latching.

Figure 8 (left) shows that the fine-grained shared-nothing configurations, whose instances have fewer threads, make better utilization of the CPU. Single-threaded instances, apart from not communicating with other instances, have simpler execution leading to shorter code paths, which decreases the number of instruction misses. On the other hand, instances that span sockets have a much higher percentage of stalled cycles (shown in the middle of Figure 8), in part due to the significant percentage of expensive last-level cache (LLC) misses. Within the same socket, smaller instances have a higher ratio of instructions per cycle due to less sharing between cores running threads of the same instance, as shown in Figure 8 (right).

Figure 8: Microarchitectural data for the different deployments: instructions per cycle (left), percentage of stalled cycles (middle), and percentage of cycles when data is shared between cores on the same socket (right). IPC is much higher for smaller instances.

7. CHALLENGES FOR FINE-GRAINED PARTITIONING

A significant number of real-life workloads cannot be partitioned in a way that transactions access a single partition. Moreover, many workloads contain data and access skews, which may also change dynamically. Such workloads are more challenging for systems that use fine-grained partitioning, and coarser-grained shared-nothing configurations provide a robust alternative.

7.1 Cost of Distributed Transactions

Distributed transactions are known to incur a significant cost, and this problem has been the subject of previous research, with, e.g., proposals to reduce the overhead of distributed transaction coordination [20] or to determine an initial optimal partitioning strategy [12, 27]. Our experiment, shown in Figure 9, corroborates these results. We run the two microbenchmarks, whose transactions read and update 10 rows respectively, on the quad-socket machine. As expected, the 1ISL configuration (i.e. shared-everything) is not affected by varying the percentage of multisite transactions. However, there is a drop in the performance of the remaining configurations, which is more significant for the fine-grained one. The following experiments analyze this behavior further.

Figure 9: Performance of the 24ISL, 4ISL, and 1ISL configurations as the number of distributed transactions increases, when retrieving (left) or updating (right) 10 rows. While shared-everything remains stable, the performance of the shared-nothing configurations decreases.

7.1.1 Read-only Case: Overhead Proportional to the Number of Participating Instances

Figure 10 (upper left) shows the cost of a local read-only transaction in the various database configurations as the number of rows retrieved per transaction increases. The results are obtained on the quad-socket machine. The 24ISL configuration runs with a single worker thread per instance, so locking and latching are disabled, which leads to roughly 40% lower costs than the next best configuration, corroborating previous results [15].

The costs of multisite read-only transactions (Figure 10, upper right) show the opposite trend from the local read-only transactions. In the local case, the cost of a single transaction rises as the size of an instance grows. In the multisite case, however, the cost decreases with the size of an instance. This is due to a decrease in the number of instances participating in the execution of any single transaction. The exception is the shared-everything configuration, which has higher costs due to inter-socket communication, as discussed in Section 6.

Figure 10: Cost per transaction (in microseconds) of local and multisite transactions in the read-only and update microbenchmarks, as the number of rows accessed grows, for configurations ranging from 24ISL to 1ISL. Coarse-grained shared-nothing has more robust performance than fine-grained shared-nothing and shared-everything.
7.1.2 Update Case: Additional Logging Overhead Is Significant

The lower left plot of Figure 10 describes the cost of the update microbenchmark with local transactions only, on the quad-socket machine. The cost of a transaction increases with the number of threads in the system, due to contention on shared data structures. As in the read-only case, the 24ISL configuration runs without locks or latches and hence has lower costs.

Multisite shared-nothing transactions (Figure 10, lower right) are significantly more expensive than their local counterparts. This is due to the overhead associated with distributed transactions and to the (mandatory) use of locking. Any configuration that requires distributed transactions is more expensive than the shared-everything configuration.

7.1.3 Profiling

To characterize the overhead of inter-process communication in relation to the remaining costs of a distributed transaction, we profile the execution of a set of read-only and update transactions on the quad-socket machine, using the 4ISL configuration. Figure 11 plots the time breakdown for a lightweight transaction which reads or updates 4 rows. The messaging overhead is high in the read-only case, although it has a constant cost per transaction. The relative cost of communication can be seen by comparing the 0% multisite (i.e. local transactions only) and the 100% multisite bars. Although distributed transactions require the exchange of twice as many messages in the update case, this overhead is comparatively smaller because of the additional logging, as well as the increased contention, which further increase the cost of a transaction.

Figure 11: Time breakdown (transaction execution, transaction management, communication, locking, logging) for a transaction that retrieves (left) or updates (right) 4 rows. The cost of communication dominates the cost of a distributed transaction in the read-only case, while in the update case the overhead is divided between communication and additional logging.

7.2 Increasing Hardware Parallelism

Hardware parallelism, as well as communication variability, will likely continue to increase in future processors. Therefore, it is important to study the behavior of alternative database configurations as hardware parallelism and communication variability grow. In Figure 12, we run the microbenchmark which reads (left) or updates (right) 10 rows, with the percentage of multisite transactions fixed at 20%, while the number of active cores in the machine is increased gradually. Results are shown for both the quad-socket and the (more parallel and variable) octo-socket machine.

The shared-nothing configurations scale linearly, with the CG (coarse-grained shared-nothing) configuration being competitive with the best case across different machines and across different levels of hardware parallelism. The configuration labeled SE (shared-everything) does not scale linearly, particularly on the machine with 8 sockets. In the SE configuration, there is no locality when accessing the buffer pool, locks, or latches. To verify the poor locality of SE, we measured the QPI/IMC ratio, i.e. the ratio of inter-socket traffic over memory controller traffic. A higher QPI/IMC ratio means that the system does more inter-socket traffic while reading (i.e. processing) less data overall: it is less NUMA-friendly. The QPI/IMC ratio for the experiment with the read-only workload on the octo-socket server using all 80 cores is 1.73 for SE, 1.54 for CG, and 1.52 for FG. The FG and CG configurations still have a relatively high ratio due to multisite transactions but, unlike SE, this traffic consists of useful work. When restricting all configurations to local transactions only, we observe steady data traffic of 100 Mb/s on the inter-socket links for FG and CG (similar to the values observed when the system is idle), while SE exceeds 2000 Mb/s. Clearly, to scale the SE configuration to a larger number of cores, data locality has to be increased. Additionally, one of the main reasons for the poor performance of the SE configuration is high contention on locks and latches. Using partitioned shared-everything designs with data-oriented execution can significantly improve the locality of accesses and remove or minimize the overheads coming from the lock and latch managers [25, 26].

Figure 12: Performance of the FG, CG, and SE configurations as the hardware parallelism increases, for a 20% multisite read-only workload (left) and a 20% multisite update workload (right), on the four-socket and eight-socket machines. Coarser-grained shared-nothing provides an adequate compromise between performance and predictability.
microbenchmark, mainly due to optimized logging, which Forexample,manyworkloadsseemtofollowthepopular80- significantly improves the performance of short read-write 20 distribution rule, where the 80% of requests accesses only transactions [19]. In coarser-grained islands, the increased the 20% of the data. This subsection describes experiments load due to skewed accesses is naturally distributed among with workloads that exhibit skew. allworkerthreadsintheaffectedinstance. Withfine-grained The following microbenchmark reads or updates two rows instances, which have a single worker thread, the additional chosen with skew over the whole data range. We use Zipfian 20% multisite read-only workload 20 % multisite update workload 4 sockets 8 sockets 4 sockets 8 sockets 200 600 60 200 FG 500 50 ps) 150 CG 150 KT SE 400 40 ut ( 100 300 30 100 p h g u 200 20 ro 50 50 h T 100 10 0 0 0 0 6 12 18 24 20 40 60 80 6 12 18 24 20 40 60 80 # Cores # Cores # Cores # Cores Figure 12: Performance of alternative configurations as the hardware parallelism increases. Coarser-grained shared-nothing provides an adequate compromise between performance and predictability. Read-only Workload 800 0% multisite 800 20% multisite 800 50% multisite s) 24ISL Tp 600 600 600 4ISL K ( t 1ISL u p 400 400 400 h g u ro 200 200 200 h T 0 0 0 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 Skew factor Skew factor Skew factor Update Workload 300 0% multisite 300 20% multisite 300 50% multisite 250 250 250 s) p T 200 200 200 K ( ut 150 150 150 p gh 100 24ISL 100 100 u o 4ISL r 50 50 50 Th 1ISL 0 0 0 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 Skew factor Skew factor Skew factor Figu re 13: Performance of read-only (top) and update (bottom) workloads with skewed accesses. As skew increases, shared-everythingsuffersfromincreasedcontention, whilefine-grainedshared-nothingsuffersfrom a highly-loaded instance that slows others. Coarse-grained shared-nothing configuration cope better with a highly loaded instances, due to multiple internal threads. load cannot be divided and the most loaded instance be- resistant to load imbalances. comes a bottleneck. Furthermore, as the skew increases to 7.4 IncreasingDatabaseSize the point where all remote requests go to a single instance, the throughput of other instances also drops as they cannot Although main memory sizes in modern servers continue complete transactions involving the overloaded instance. togrow,therearemanyworkloadsthatarenotmainmemory Overall, coarse-grained shared-nothing configurations ex- resident and rely on disk-resident data. To evaluate various hibit good performance in the presence of skewed requests, database configurations on growing dataset sizes, we gradu- as they suffer less from increased contention and are more allyincreasethenumberofrowsinthedatasetfrom240,000 to 120,000,000 (i.e. from 60 MB to 33 GB). Contrary to
Skew has a dramatic effect on the performance of the different configurations. For shared-everything, heavily skewed workloads result in a significant performance drop due to increased contention. This effect is particularly apparent in the update case. When requests are not strongly skewed, shared-everything achieves fairly high performance in the update microbenchmark, mainly due to optimized logging, which significantly improves the performance of short read-write transactions [19]. In coarser-grained islands, the increased load due to skewed accesses is naturally distributed among all worker threads in the affected instance. With fine-grained instances, which have a single worker thread, the additional load cannot be divided and the most loaded instance becomes a bottleneck. Furthermore, as the skew increases to the point where all remote requests go to a single instance, the throughput of the other instances also drops, as they cannot complete transactions involving the overloaded instance.

Overall, coarse-grained shared-nothing configurations exhibit good performance in the presence of skewed requests, as they suffer less from increased contention and are more resistant to load imbalances.

Figure 13: Performance of read-only (top) and update (bottom) workloads with skewed accesses, for 0%, 20%, and 50% multisite transactions and skew factors from 0 to 1 (configurations 24ISL, 4ISL, 1ISL). As skew increases, shared-everything suffers from increased contention, while fine-grained shared-nothing suffers from a highly loaded instance that slows down the others. Coarse-grained shared-nothing configurations cope better with a highly loaded instance, due to multiple internal threads.

7.4 Increasing Database Size

Although main memory sizes in modern servers continue to grow, there are many workloads that are not main-memory resident and rely on disk-resident data. To evaluate the various database configurations on growing dataset sizes, we gradually increase the number of rows in the dataset from 240,000 to 120,000,000 (i.e. from 60 MB to 33 GB). Contrary to