ABSTRACT

Context. Cassandra is a NoSQL database which handles large amounts of data simultaneously and provides high availability for the data it stores. Compaction in Cassandra is a process of removing stale data and making data more available to the user. This thesis focuses on analyzing the impact of Cassandra compaction on Cassandra's performance when running inside a Docker container.

Objectives. In this thesis, we investigate the impact of Cassandra compaction on database performance when Cassandra is used within a Docker based container platform. We further fine-tune Cassandra's compaction settings to arrive at a sub-optimal scenario which maximizes its performance while operating within Docker.

Methods. A literature review is performed to identify the compaction-related metrics and compaction-related parameters which have an effect on Cassandra's performance. Further, experiments are conducted using different sets of mixed workloads to estimate the impact of compaction on database performance when Cassandra is used within Docker. Once these experiments are conducted, we modify the compaction settings while operating under a write-heavy workload and assess database performance in each of these scenarios to identify a sub-optimal value of each parameter for maximum database performance. Finally, we combine these sub-optimal parameter values in one experiment and assess the database performance.

Results. The Cassandra-related and operating-system-related parameters and metrics which affect Cassandra compaction are listed, and their effect on Cassandra's performance has been tested through experiments. Based on these experiments, a few sub-optimal values are proposed for the listed parameters.

Conclusions. It can be concluded that, for better performance of Dockerized Cassandra, the proposed values for each of the parameters in the results (i.e. 5120 for memtable_heap_size_in_mb, 24 for concurrent_compactors, 16 for compaction_throughput_mb_per_sec, 6 for memtable_flush_writers and 0.14 for memtable_cleanup_threshold) can be chosen separately, but not the union of those proposed values (confirmed by the final experiment performed). The metrics and parameters affecting Cassandra performance are also listed in this thesis.

Keywords: Docker, Cassandra, Cassandra compaction, NoSQL database

1 INTRODUCTION

New advanced technologies like distributed databases (all NoSQL DBs) are capable of handling big data with their inherent schema-free data model and horizontal scalability, storing and retrieving the data quickly [1]. The popular NoSQL databases available are Facebook's Cassandra [2], MongoDB [3], Oracle's NoSQL DB [4], Amazon's Dynamo [5], Google's Bigtable [6] and Apache's HBase [7]. NoSQL stands for Not Only Structured Query Language. NoSQL databases are distributed, non-relational databases which are not primarily built on tables. They can store huge amounts of data and can process the stored data in parallel across a large number of commodity servers. They use non-SQL languages to query and manipulate the data [8]. NoSQL databases are divided into five different types based on their characteristics and data storage models: wide column store/column family, document store, key value/tuple store, consistent key value store and graph databases. Cassandra is a 'column store/column family' NoSQL database with a distributed storage system developed by Facebook for storing and managing huge amounts of unstructured data [9].
It has become increasingly popular: big organizations such as Yahoo and Facebook are migrating to Cassandra for their data consolidation needs, and many other leading IT firms are developing data mining applications for big data on the Cassandra platform [10]. Cassandra does not have a fixed schema like traditional relational databases, and it can have any number of columns in a row, contrary to the fixed-row limitations of relational databases [11]. Its popularity today has been elevated by its reliability, scalability and robustness [12]. It stores its data on disk in immutable files called SSTables and uses a process called compaction to merge them [13][14]. Compaction plays an important role in the performance of Cassandra, as huge amounts of data are merged into SSTables by one of the compaction strategies (i.e. Size Tiered Compaction Strategy, Leveled Compaction Strategy and Date Tiered Compaction Strategy), and the response time of any read/write request to Cassandra is affected by the compaction process. In this thesis we test the impact of Cassandra compaction when running inside a Docker container.

According to the definition of Docker mentioned on their website, "Docker containers wrap a piece of software in a complete filesystem that contains everything needed to run: code, runtime, system tools, system libraries – anything that can be installed on a server. This guarantees that the software will always run the same, regardless of its environment" [15]. Essentially, Docker allows software to run without a dedicated operating system environment, while providing it with the necessary utilities to do its intended task. This setup increases the software's performance, as all the available resources are used by the software to ensure its successful execution.

1.1 What is Cassandra?

Cassandra is a NoSQL database and a distributed storage system developed by Facebook to store and manage huge amounts of unstructured data [9]. Cassandra differs from relational databases in its table schema: there is no fixed schema for a Cassandra database, that is, there can be any number of columns in a row [11]. It is designed to run on top of many nodes, but there is a chance of components of a node failing.
Cassandra solves this problem by continuing to run in a steady state, showing its robustness. Cassandra also has the feature of reliability (the system is designed with cheap hardware) and scalability (the user handles the design and configuration of data) [12]. Cassandra ships with a YAML file called cassandra.yaml, which contains the settings of all its parameters, including the compaction parameters used by Cassandra. This YAML file can be edited to tune the parameters. After making changes to the .yaml file, the Cassandra process needs to be restarted to bring the changes into effect.

1.1.1 Cassandra Data Model

The Cassandra data model consists of different parts. The different parts are as follows: column, key, rows, column family, keyspace and cluster. The column is the smallest part of the data stored. Each column has three different attributes, namely column name, column value and timestamp. Each column has a specific key and each key is represented as a row. Each row can have a different structure, as it may contain any number of columns. All these rows combine to form a column family, and the rows inside a column family are placed by using a partitioning strategy. These column families form a keyspace. The keyspaces act as databases, and all the keyspaces combine together to form a cluster [9]. The data model of Cassandra is shown in Figure 1.

Figure 1: Data Model of Cassandra
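To make the data model in Figure 1 concrete, the following CQL sketch creates a keyspace and a column family (table); the keyspace name, table name, columns and replication settings are hypothetical illustrations, not taken from the thesis experiments:

    -- A keyspace groups column families and carries the replication settings
    CREATE KEYSPACE demo_ks
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

    -- A column family: each row is addressed by its key (user_id) and can
    -- hold columns; every column value internally carries a timestamp
    CREATE TABLE demo_ks.users (
        user_id text PRIMARY KEY,
        name    text,
        email   text
    );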
1.1.2 Cassandra Architecture

The architecture of Cassandra uses peer-to-peer networking to communicate with the other Cassandra machines, or nodes, which are connected into a cluster. There is continuous communication between the nodes at all times. It has the features of replication and automatic synchronization of data across all the nodes; that is, any change to the data in a node is replicated to the other nodes in the cluster and automatically synchronized. Due to the replication, there is no single point of failure and there is no data loss if any of the nodes fail. Any node in the cluster can be set as a seed node, and the client directly communicates with the seed node to perform the operations. The seed node acts as the coordinator and transmits the client's requests to the other nodes in the cluster. [9]
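As a side note, the ring membership and per-node state described above can be inspected on a running installation with Cassandra's nodetool utility (a sketch; on the single-node setup used in this thesis the listing would contain just one entry):

    # Show ring membership and node state (UN = Up/Normal), load and tokens
    nodetool status

    # Summary information about the local node (uptime, heap usage, etc.)
    nodetool info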
1.1.2.1 Cassandra Storage Engine

Cassandra stores the data on disk by using a mechanism similar to Log-Structured Merge (LSM) trees [13]. The different components used to store the Cassandra data are as follows:

1. CommitLog: helps Cassandra achieve its durable nature; that is, any change in data or a write operation is first stored/written to the CommitLog, which stores it permanently. There is no chance of data loss if any system failure occurs. [13]

2. MemTable: is similar to cached data, sorted by keys. MemTables don't have any replica, and any change to existing data is overwritten if the same key is used. [13]

3. SSTable: is the on-disk representation of the data. It is immutable in nature. [13]

As shown in Figure 2, any write operation first writes/appends the data to the CommitLog and then updates the MemTable. The MemTable stores the data until some criterion is met and then flushes the data to the immutable SSTables. The CommitLog then gets rid of the data that was stored earlier and starts storing new data again. [13]

Figure 2: Cassandra Storage Engine [13]
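The flush behaviour shown in Figure 2 is governed by a handful of cassandra.yaml settings, several of which are exactly the parameters tuned later in this thesis. A representative excerpt follows; the values are illustrative rather than recommendations, and note that in stock Cassandra 2.1+ the heap setting appears under the key memtable_heap_space_in_mb:

    # cassandra.yaml (excerpt, illustrative values)
    memtable_heap_space_in_mb: 2048        # heap memory reserved for MemTables
    memtable_cleanup_threshold: 0.11       # fill ratio that triggers a MemTable flush
    memtable_flush_writers: 2              # threads flushing MemTables to SSTables
    concurrent_compactors: 2               # compaction tasks allowed to run in parallel
    compaction_throughput_mb_per_sec: 16   # throttle on total compaction I/O

As noted earlier, Cassandra must be restarted after editing this file for the changes to take effect.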
1.1.3 Cassandra Compaction

Compaction is the process of merging the immutable SSTables to form a new, single SSTable. As the data gets stored in different SSTables, any read operation may require reading data from several SSTables. This problem is mitigated by compaction. There are different compaction strategies, namely Size Tiered Compaction Strategy (STCS), Leveled Compaction Strategy (LCS) and Date Tiered Compaction Strategy (DTCS); STCS is the default strategy used for Cassandra compaction. [13][14]

1.1.3.1 Size Tiered Compaction Strategy (STCS)

STCS is the default compaction strategy set in the cassandra.yaml file. In this strategy, SSTables of similar size are merged together to form a new, larger SSTable. These larger SSTables of similar size are in turn merged to form even larger SSTables. It works well for write-heavy workloads, but it holds onto stale data for a long time, due to which the amount of storage required increases. [13][14]

1.1.3.2 Leveled Compaction Strategy (LCS)

This compaction strategy can be selected by changing the settings in the cassandra.yaml file. The compaction process works in levels. The data from the MemTable gets stored into SSTables at the first level (L0). These SSTables are merged with the larger SSTables at the next level (L1) by compaction, and this process of merging SSTables carries on level by level. It works well for read-heavy workloads, but operation latency is affected by the higher I/O (input/output) utilization. [13][14]

1.1.3.3 Date Tiered Compaction Strategy (DTCS)

This compaction strategy can be selected by changing the settings in the cassandra.yaml file. The strategy is similar to STCS, but the compaction process merges SSTables written within the same time frame. It uses Time-To-Live (TTL) timestamps to merge SSTables of similar age, which works well for compacting time-series data. [14]
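While cassandra.yaml carries the compaction defaults and throttles, the strategy itself is applied per table (column family). A minimal CQL sketch is shown below; the table name is the hypothetical one from the earlier example, and min_threshold/max_threshold are standard STCS sub-options (how many similar-sized SSTables must accumulate before a merge, and the most merged per run):

    -- Select size-tiered compaction for one table
    ALTER TABLE demo_ks.users
      WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                         'min_threshold': 4, 'max_threshold': 32};

    -- Switching the same table to leveled compaction would be:
    -- ALTER TABLE demo_ks.users
    --   WITH compaction = {'class': 'LeveledCompactionStrategy'};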
1.1.4 Cassandra Stress Tool

According to the definition given in the DataStax documentation, "Cassandra-stress tool is a Java-based stress testing utility for basic benchmarking and load testing a Cassandra cluster". It is used to populate a Cassandra cluster with huge amounts of data and for stress testing Cassandra tables and queries. A single experiment should be conducted several times to find issues in the data model. The data gets stored into keyspaces and tables when a stress test is run. The default keyspace to store the data is 'keyspace1', and the default table to store the data is 'standard1' or 'counter1', depending on the type of table being tested. [16]
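The exact cassandra-stress syntax varies between Cassandra versions, but as a sketch, populating the default keyspace1/standard1 tables and then running the write-heavy mixed workload used later in this thesis (90% writes, 10% reads) looks roughly as follows; the operation counts and thread counts are illustrative:

    # Populate the cluster (creates keyspace1/standard1 by default)
    cassandra-stress write n=1000000 -rate threads=50

    # Mixed workload: 9 writes for every 1 read, i.e. 90% writes / 10% reads
    cassandra-stress mixed ratio\(write=9,read=1\) n=1000000 -rate threads=50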
1.2 What is Docker?

Docker is a container-based virtualization technique. Each container can hold an application or business-service software together with all the system libraries and tools required to run it. Containers running on the same machine share the operating system kernel. They start instantly and their RAM (Random Access Memory) usage is very low. Different applications can run in different containers simultaneously, and the processes are isolated and protected from each other. [15]

Lately, Docker has become a popular tool for managing Linux containers. Research in the field of containers reveals that, when containers are compared with bare metal, the processing speed and throughput are almost equal. This makes Docker a silver lining in a cloud full of virtual machines. Docker is used for testing applications in an isolated state. Docker containers are implemented in Linux using 'cgroups', which allows resources such as CPU and memory to be shared among them. The containers are isolated and have different kernel namespaces. [17]

1.2.1 Hypervisors vs Docker Containers

The hypervisor technique is the most commonly used virtualization technique. In this technique, a virtual machine monitor (VMM) runs on top of the host operating system (OS) [18]. The VMM controls the different virtual machines (VMs), where each VM contains its own OS. This is the traditionally used virtualization technique, but it causes a high performance overhead when used for HPC [18]. With hypervisors, a guest operating system has to be installed before the application and its supporting binaries and libraries can be deployed [15]. With containers, we only need to download the application and all its dependent binaries and libraries; there is no need to install a separate operating system [15]. Figure 3 gives an idea of the working of hypervisors and containers.

Figure 3: (a) Hypervisor based virtualization, (b) Docker Container based virtualization [19]
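For concreteness, the official Cassandra image from Docker Hub can be started in a single command. The container name, version tag and port mapping below are illustrative choices, not necessarily the exact configuration used in the experiments:

    # Start a single Cassandra node inside a Docker container
    # (9042 is the CQL native protocol port)
    docker run --name cassandra-node -d -p 9042:9042 cassandra:2.2

    # Follow the database log to confirm the node has started
    docker logs -f cassandra-node

    # Run nodetool inside the running container
    docker exec -it cassandra-node nodetool status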
1.3 Scope

The scope of the thesis is limited to analyzing the impact of Cassandra compaction on Cassandra's performance when used inside a Docker container with the Size Tiered Compaction Strategy (STCS) on a single node, i.e. a single machine. Experiments are conducted using different workloads, and the output is a graphical representation of OS metrics (CPU usage, disk usage etc.) vs time and Cassandra metrics (bytes compacted, completed tasks etc.) vs workloads, to better analyze the impact of compaction. The thesis is further limited to tuning the compaction parameters one at a time and analyzing the results to find a sub-optimal value of each compaction parameter only for a write-heavy workload (i.e. 90% writes and 10% reads) on a Docker based platform. Outputs yielded from the experiments are presented in the form of graphs to better analyze the results and find a sub-optimal compaction setting for a write-heavy workload on a Docker based platform.

1.4 Problem Description and Motivation

Cassandra and Docker are leading contenders in the fields of NoSQL databases and container technology respectively. Cassandra stores its data onto disk through the process of compaction, and compaction plays a very important role when measuring the performance of Cassandra: compaction statistics are key performance indicators/markers (KPI/KPM) for Cassandra. Docker, a container technology, is at the forefront of today's virtualization for big-data cloud computing. However, very little research has been done on the integration of Cassandra with Docker. Important questions such as "What is the impact when Cassandra runs inside a Docker container?" and "How is Cassandra's performance affected in a Docker environment?" are what we try to assess and answer. Finding answers to these questions has motivated us to embark on this thesis, in which we integrate Cassandra on top of Docker and then investigate the impact of Cassandra compaction on Cassandra's performance.

1.5 Aim and Objectives

1.5.1 Aim

The aim of the project is to:
1. Run Cassandra inside a Docker container
2. Investigate the impact of Cassandra compaction (with the default and tuned parameters) on Cassandra's performance when working within Docker
3. Identify KPIs (Key Performance Indicators) to assess the performance

1.5.2 Objectives

The objectives of this thesis are to:
1. Learn about the Cassandra database and Cassandra compaction
2. Learn about Docker containers
3. Study the impact of Cassandra compaction when Cassandra runs inside a Docker container
4. Do a literature review to find the different performance metrics and parameters on which Cassandra compaction depends
5. Learn and study the different performance parameters related to Cassandra compaction
6. Learn and study the different performance metrics (outputs) related to Cassandra compaction
7. Perform experiments with default Cassandra settings using different workloads and analyze the results to find out the impact of Cassandra compaction
8. Perform experiments by tuning the compaction parameters under a write-heavy workload, and then analyze the results to find the sub-optimal values of the Cassandra compaction parameters to be used inside a Docker container

1.6 Research Questions

This section describes the research questions that need to be answered for a successful execution of this research.

RQ1: What are the performance metrics that best capture the effect of compaction on Cassandra performance for a Docker based platform?

This question is framed to learn about the different Cassandra compaction metrics which can be used to measure the performance of Cassandra running inside a Docker container. It is answered through a literature review.

RQ2: What are the Cassandra compaction parameters that allow tuning of Cassandra performance for a Docker based platform?

This question is framed to learn about the different parameters that are directly or indirectly related to Cassandra compaction. These parameters may affect the performance of Cassandra when tuned. It is answered through a literature review.
RQ3: What is the impact of Cassandra compaction (using the default compaction settings) on Cassandra performance for a Docker based platform, when the workloads (1) 90% writes and 10% reads, (2) 50% writes and 50% reads, and (3) 70% writes and 30% reads are considered?

This question is framed to investigate the impact of compaction on Cassandra's performance on a Docker platform using the default settings of Cassandra. It is answered by performing experiments using the different workloads.

RQ4: What is the impact of Cassandra compaction (tuning each compaction parameter) on Cassandra performance for a Docker based platform, when a 90% writes and 10% reads workload is considered?

This question is framed to investigate the impact of compaction on Cassandra's performance on a Docker platform by tuning one parameter at a time. Sub-optimal values of the parameters, which can offer better performance of Cassandra on a Docker platform, are then collected. It is answered by tuning the parameters one at a time and performing experiments with each tuned setting under a write-heavy workload (i.e. 90% writes and 10% reads).

1.7 Contribution

The main contribution of the thesis is an understanding of the effect of compaction when Cassandra runs inside a Docker container. First, experiments are conducted using the default compaction settings and the default compaction strategy, the Size Tiered Compaction Strategy (STCS), with different workloads, to find the impact of Cassandra compaction on Cassandra's performance for a Docker based platform. The results are shown as graphical representations of OS metrics (CPU usage, disk usage etc.) vs time and Cassandra metrics (bytes compacted, completed tasks etc.) vs workloads, to better analyze the impact of compaction. Then the compaction parameters are tuned and experiments are carried out to find sub-optimal values of the Cassandra compaction parameters for a write-heavy workload (i.e. 90% writes and 10% reads) on a Docker based platform. Graphs are drawn from the results, and analysis of these results provides the sub-optimal compaction parameter values.

1.8 Outline

The rest of the thesis is organized as follows. Chapter 2 gives an overview of related work that has already been done in this field of research. Chapter 3 gives a detailed view of how the different research methods are carried out. Chapter 4 gives a detailed view of the results and analysis of the different experiments performed. Chapter 5 gives a detailed view of the discussion, including the threats to validity and the answers to the RQs. Chapter 6 gives an overview of the conclusion and future work.

2 RELATED WORK

This chapter highlights work done by other researchers that relates to our research.

2.1 Cassandra

Research on Cassandra is advancing at a quick pace [20]. In [21], the authors discuss the effect of the replication factor and the consistency strategies for the databases HBase and Cassandra using the YCSB benchmarking tool. In [22], the authors provide an in-depth analysis of compaction-related performance overhead for HBase and Cassandra. In [23], the authors design a performance monitoring tool for Cassandra, which they use on low-end (cost-effective) machines. In [24], the authors estimate a performance model for Cassandra that shows the relationship between memory space and system performance.
In [25], the authors discuss the different background mechanisms that help improve the performance of data-intensive cloud systems using Cassandra, which has exceptional read-write performance; among these mechanisms is Cassandra compaction, which runs in the background and merges SSTables depending on the compaction strategy used. In [26], the author performs a benchmark test of Apache Cassandra, Apache Lucene and MySQL. Apache Cassandra outperforms the others in the benchmark test, as it has a high read/write throughput and can handle larger amounts of data than the other databases in the comparison.

2.2 Cassandra Compaction

In [27], the author compares the performance of the different compaction strategies, that is, STCS, LCS and DTCS, for different workloads (i.e. 50% writes and 50% reads, and 10% writes and 90% reads). The author also researches the different Cassandra metrics and Operating System (OS) metrics required to evaluate the success of the different compaction strategies. In [14], DataStax discusses configuring compaction using the different compaction parameters, and also discusses the different compaction metrics used to evaluate the performance of Cassandra. The different Cassandra metrics, OS metrics and Cassandra compaction parameters are as follows:

2.2.1 Cassandra compaction metrics

1. Bytes compacted: It is the total amount of data that has been compacted since the start of the experiment.
2. Completed tasks: It is the number of compaction tasks that have been completed since the start of the server.
3. Pending tasks: It is the number of compaction tasks that are still to be done to reach the desired state.
4. Total compactions count: It is the total number of compactions that occur within the specified duration of an experiment.
5. Compaction rate: It is the number of compactions performed per second.

2.2.2 OS metrics

1. Memory usage: It is the percentage of RAM used while performing each experiment
Description: