Improving optimal sequence alignments through a SIMD-accelerated library Jakob Tobias Frielingsdorf Master’s Thesis Spring 2015 Abstract The recent years have seen an increasing demand in fast sequence align- ments,fuelledbyarapidlygrowingamountofsequencedata. Meanwhile, with growing computing power optimal sequence alignment algorithms camebackintothefocusoftheseanalyses. This work presents a new library for fast database searches based on optimal sequence alignments. It performs database searches accelerated on multiple threads and single instruction multiple data (SIMD) opera- tions. The library implements Rognes’ approach for accelerating database searches, while being designed for extensibility. A modular structure al- lows for an easy integration of new and improved algorithms. Addition- ally,theapplicationprogrammableinterface(API)ofthelibraryisdesigned foreasyuseandflexibility,allowinganextensiveconfigurationofthecom- putations. Besides the modular structure, the key features are the database searches based on SIMD instructions. These are optimised for the widely used streamingSIMDextensions(SSE)andthemorerecentadvancedvectorex- tensions(AVX)implementingtwiceaswideregisters. The focus of this thesis is the evaluation of the performance of libssa and of the optimised implementation of the database searches, with emphasis on the benefits of the computation on AVX over SSE. It presents that AVX significantlyimprovestheperformancebyupto1.83timesoverSSE. iii Acknowledgements IwouldliketothankmysupervisorTorbjørnRognesforalotofinteresting discussionsandvaluablefeedbackwhichhelpedtoimprovethiswork. LukasandOllideserveaspecialthanksforproofreadingpartsofmythesis. Both provided me with useful comments. Another big thanks goes to my friendsandfamilyformoralsupport. On a personal note I would like to thank my dear Julia for her love and encouragement. JakobTobiasFrielingsdorf UniversityofOslo May,2015 iv Contents 1 Introduction 1 2 Background&Theory 3 2.1 Biologicalbackground . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Sequencealignment . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.1 Applicationsofsequencealignment . . . . . . . . . . 5 2.2.2 Sequencetranslations . . . . . . . . . . . . . . . . . . 5 2.2.3 Algorithmsforsequencealignment . . . . . . . . . . 6 2.3 Parallelisationofsequencealignment . . . . . . . . . . . . . 15 2.3.1 Flynn’sTaxonomy . . . . . . . . . . . . . . . . . . . . 15 2.3.2 Levelsofparallelisation . . . . . . . . . . . . . . . . . 15 2.3.3 CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.4 ParallelisingsequencealignmentsonCPUs . . . . . . 17 2.4 APIdesigninC . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.1 Generaldesignrules . . . . . . . . . . . . . . . . . . . 19 2.4.2 DesignrulesforimplementationsinC . . . . . . . . . 20 2.4.3 ExistingsequencealignmentAPIs . . . . . . . . . . . 21 2.5 Measuringperformance . . . . . . . . . . . . . . . . . . . . . 23 2.5.1 Measuringspeed . . . . . . . . . . . . . . . . . . . . . 23 2.5.2 Amdahl’sLaw . . . . . . . . . . . . . . . . . . . . . . 23 2.6 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3 Design 27 3.1 DesignoftheAPI . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.1 Usecases . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.2 AlignmentAPI . . . . . . . . . . . . . . . . . . . . . . 28 3.1.3 DatabaseAPI . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Designofthelibrary . . . . . . . . . . . . . . . . . . . . . . . 30 3.3 Parallelisationofsequencealignment . . . . . . . . . . . . . 32 3.3.1 Threads . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3.2 Nonvectorisedsequencealignments. . . . . . . . . . 33 3.3.3 Vectorisedalignments . . . . . . . . . . . . . . . . . . 33 4 Implementation 37 4.1 Configurationandinternalformats . . . . . . . . . . . . . . . 37 4.1.1 Datatypes . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1.2 Sequences . . . . . . . . . . . . . . . . . . . . . . . . . 38 v 4.1.3 Scoringschemes . . . . . . . . . . . . . . . . . . . . . 39 4.1.4 Gappenalties . . . . . . . . . . . . . . . . . . . . . . . 40 4.1.5 Validationoftheconfiguration . . . . . . . . . . . . . 40 4.2 Databaseintegration . . . . . . . . . . . . . . . . . . . . . . . 41 4.3 Controllingdatabasesearchesandalignments . . . . . . . . 41 4.3.1 Threadpool . . . . . . . . . . . . . . . . . . . . . . . . 42 4.3.2 Min-max-heap . . . . . . . . . . . . . . . . . . . . . . 42 4.4 Databasesearches . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.4.1 64bitimplementation . . . . . . . . . . . . . . . . . . 44 4.4.2 8and16bitimplementations . . . . . . . . . . . . . . 45 4.5 Computingalignments . . . . . . . . . . . . . . . . . . . . . . 54 4.6 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.7 Measuringperformance . . . . . . . . . . . . . . . . . . . . . 56 5 Resultsanddiscussion 59 5.1 Evaluatingperformance . . . . . . . . . . . . . . . . . . . . . 59 5.1.1 Collectingresults . . . . . . . . . . . . . . . . . . . . . 59 5.1.2 Basetestrun . . . . . . . . . . . . . . . . . . . . . . . . 61 5.1.3 Querylengths,bitwidths,andSIMDcapabilities . . 63 5.1.4 Threadcounts . . . . . . . . . . . . . . . . . . . . . . . 69 5.1.5 Chunksizes . . . . . . . . . . . . . . . . . . . . . . . . 71 5.1.6 Comparisonofalignmenttools . . . . . . . . . . . . . 74 5.1.7 Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.2 Evaluatingoptimisations . . . . . . . . . . . . . . . . . . . . . 78 5.2.1 Searchcolumns . . . . . . . . . . . . . . . . . . . . . . 78 5.2.2 Memorymanagement . . . . . . . . . . . . . . . . . . 80 5.2.3 Optimisationsdonebythecompiler . . . . . . . . . . 82 5.3 Exploitingopensource . . . . . . . . . . . . . . . . . . . . . . 85 6 Futurework 89 6.1 Newandimprovedalgorithms . . . . . . . . . . . . . . . . . 89 6.1.1 Needleman-Wunsch-Sellersalgorithm . . . . . . . . . 89 6.1.2 Differentgappenalties . . . . . . . . . . . . . . . . . . 89 6.1.3 Optimisingalignments . . . . . . . . . . . . . . . . . . 89 6.2 Errorhandling . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 6.3 Creatingthetemporaryscoreprofile . . . . . . . . . . . . . . 91 6.3.1 Computeon8bit . . . . . . . . . . . . . . . . . . . . . 91 6.3.2 Differentimplementationfornucleotidesequences . 92 6.3.3 Fillingforonlymatch/mismatchvalues . . . . . . . . 92 6.4 Otherwork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 7 Conclusion 95 Bibliography 97 A Numberofsearchcolumnsfordifferentdatabases 99 B Reproducingperformanceresults 101 vi C Installingandrunninglibssa 103 vii Chapter 1 Introduction Sequencealignmentsareanimportantpartoftheanalysisofgenomicdata. They help to discover similarities in genomic sequences and form a base forfurtheranalysesandresearch. Especiallyintherecentyearsthecostfor sequencing genomic and protein data is as low as ever before. The effect isarapidlygrowingamountofsequencedatafuellingthedemandforfast analyses. Algorithmsforoptimallysolvingtheproblemofsequencealignmentsexist since the 1970s. In the past, the biggest challenge of these algorithms was theirquadraticcomplexityintimeandmemory. Asolutionwereheuristic algorithmssolvingtheproblemgoodbutnotoptimally. One of the applications based on sequence alignments are database searches. Here, a database is searched for sequences, that are similar to a querysequence. Sincethebeginningofthe1990sheuristicswerethemain kindofalgorithmsusedforsuchapplications. With growing computing power and more advanced hardware, optimal sequence alignment algorithms came back into focus. One technique that is used to speed up these algorithms are single instruction multiple data (SIMD) operations. CPUs implement this feature since the middle of the 1990s. It allows for computing on multiple data elements in parallel and was used to develop different approaches to accelerate optimal sequence alignments. In software development it is common practice to implement tasks in li- braries,whicharethenusedinmultipleprograms. Thisallowsforre-using an implementation of a task in different programs, thus speeding up the development of these. This thesis describes a new library for database searches using optimal sequence alignment algorithms called libssa (li- brary for SIMD accelerated optimal Sequence Alignments). It implements the searches based on global and local alignments using the Needleman- Wunsch and Smith-Waterman algorithms. The focus of the library lies on fast implementations of the database search and algorithms as well as on 1 providing a modular structure, which allows for an easy extension of the library. TwooftheSIMDinstructionsets,whichimplementtheSIMDoperationsin modernCPUs,arethestreamingSIMDextensions(SSE)andtheadvanced vector extensions (AVX). The main difference of both is the amount of data elements, which can be computed in parallel. AVX allows for twice as many data elements than SSE. The research question of this thesis is whetheranimplementationoftheoptimalsequencealignmentalgorithms based on AVX instructions performs better than an implementation based onSSEinstructions. Inadditiontothequantitativeanalysisofthelibrary’s performance, it is going to be compared to existing implementations of databasesearches. 2 Chapter 2 Background & Theory 2.1 Biological background Theentiregeneticinformationofanorganismisstoredinthegenome. En- codedinDNA(Deoxyribonucleicacid),itdescribesthefeaturesandprop- erties of the organism. DNA is a molecule which consists of two com- plementing nucleotide sequences that form a double helix. This struc- ture is formed by bonds between the nucleotides of the complementing sequences. The four different nucleotides or nucleobases, which DNA is buildof,are: adenine(A),guanine(G),cytosine(C),andthymine(T).Ade- nine and guanine, and cytosine and thymine are complementary to each other. The structure of cells and nearly all cell functions in the body of an or- ganism are built of or executed by proteins. The blueprints for building proteinsareencodedingenes,thecodingentitiesofthegenome. Theseare transcribedintoRNA(Ribonucleicacid),anintermediarymoleculesimilar to DNA, which guides the synthesis of protein molecules. RNA is build of the nucleotides adenine (A), guanine (G), cytosine (C) and uracil (U). UracilsubstitutesthethymineofDNA.TheRNAthenaretranslatedintoa chainofaminoacids,whichformsaprotein. ThetranslationfromRNAto aminoacidsisdescribedinthegeneticcode(seefigure2.1). Eachofthe20 aminoacidsthatoccurinanorganismisdescribedbyatripletofRNAnu- cleotides,asocalledcodon. Thiscodeisuniversalforallorganisms. [Sung, 2010,chapter1] The actual process of the protein biosynthesis is more complex than de- scribed here. The focus here was to introduce terms used in the next chapters,whichdescribedifferentapproachesandalgorithmsforsequence alignments. 3
Description: