Computing and Informatics, Vol. 37, 2018, 186–212, doi:10.4149/cai 2018 1 186 HYBRID DATA RACE DETECTION FOR MULTICORE SOFTWARE Alper Sen, Onder Kalaci Department of Computer Engineering Bogazici University, Turkey e-mail: {alper.sen, onder.kalaci}@boun.edu.tr Abstract. Multithreaded programs are prone to concurrency errors such as dead- locks,raceconditionsandatomicityviolations. Theseerrorsarenotoriouslydifficult todetectduetothenon-deterministicnatureofconcurrentsoftwarerunningonmul- ticore hardware. Data races result from the concurrent access of shared data by multiple threads and can result in unexpected program behaviors. Main dynamic data race detection techniques in the literature are happens-before and lockset al- gorithms which suffer from high execution time and memory overhead, miss many data races or produce a high number of false alarms. Our goal is to improve the performance of dynamic data race detection, while at the same time improving its accuracybygeneratingfewerfalsealarms. Wedevelopahybriddataracedetection algorithm that is a combination of the happens-before and lockset algorithms in a tool. Rather than focusing on individual memory accesses by each thread, we focus on sequence of memory accesses by each thread, called a segment. This al- lows us to improve the performance of data race detection. We implement several optimizations on our hybrid data race detector and compare our technique with traditional happens-before and lockset detectors. The experiments are performed withC/C++multithreadedbenchmarksusingPthreadslibraryfromPARSECsuite and large applications such as Apache web server. Our experiments showed that our hybrid detector is 15% faster than the happens-before detector and produces 50% less potential data races than the lockset detector. Ultimately, a hybrid data race detector can improve the performance and accuracy of data race detection, enhancing its usability in practice. Keywords: Software testing and debugging, multithreaded programs, data race, concurrency, happens-before, lockset Mathematics Subject Classification 2010: 68-N19 Hybrid Data Race Detection for Multicore Software 187 1 INTRODUCTION Multicore processors provide high computation power, however, in order to utilize the increased power, concurrent software must be designed and written. Concur- rency is achieved by multithreading in many systems. The interleaving of multiple threads can result in concurrency bugs which are hard to reproduce. Data race is a well-known concurrency problem which is defined as two threads accessing a single memory address where at least one access is write and there is no appropriate synchronization among the accesses [19]. There have been many works in the past for detection of data races in multithreaded programs. The prior work is divided into two broad categories, which are static data race detection [28, 22, 3] and dynamic data race detection [13, 21, 2, 20, 15, 31]. Static data race detectors generate all possible thread interleavings and data races are searched among these interleavings. This approach is often infeasible due to state space explosion problem [12]. Moreover, they end up with many false posi- tives due to the interleavings that are not possible to happen with any input data. Typically, the number of false positives is more than the number of real races [18]. Dynamic data race detectors observe memory accesses while an application is run- ning. This approach is scalable to real life programs. The deficiency of dynamic analysis is that they consider an execution with a single input data, which limits the coverage of the analysis. Due to limited coverage, dynamic detectors miss some of the potential data races. In other words, dynamic detectors produce false neg- atives. In order to overcome this problem, dynamic analysis must be repeatedly executed with different inputs. Due to its scalability and potentially lower false positive rates, dynamic data race detection is the most commonly used method of data race detection. There are two state-of-the-art algorithms used in dynamic data race detection, lockset and happens-before data race detection algorithms. Lockset algorithm [25] checks whether two threads access a shared variable while holding a common lock ornot. Foreachsharedmemoryaddress, thealgorithmmaintainsacandidatelock- set. Lockset detectors produce many false positives. In other words, some of the data races detected by the detectors are not real races. The main source of the false positives is that the detectors ignore all the synchronization operations ex- cept for locks. For instance, the detectors produce false positives for every shared memory access where the synchronization among accesses is generated by condi- tion variables. Happens-before data race detection algorithm is based on Lam- port’s happens-before relation [13]. Happens-before relation defines a partial order among the events generated during the execution of a program in a distributed system. This relation has been extended for applications using shared memory as well. Happens-before data race detection algorithms utilize vector clocks for main- taining the happens-before relation [21, 15]. These detectors do not produce false positives, however they may miss real races (false negative). Happens-before based detectors suffer from high execution time and memory overhead. The size of each vector clock is proportional to the number of threads count in the program. On the 188 A. Sen, O. Kalaci contrary, lockset based detectors are scalable and can be implemented with a low overhead. There has been prior work [20, 26, 11, 21] for combining the benefits of lockset and happens-before detectors in a hybrid data race detector that has good per- formance and few false positives. We develop a hybrid approach in this paper as well. Almost all dynamic data race detection algorithms [2, 20, 21, 12, 15, 5] de- tect potential data races by tracking accesses on each memory address during the execution. Thiscanresultinmemoryoverhead,whichiscrucialwhenworkingwith large programs. Our algorithm instead detects potential races by tracking accesses on each segment [26], where a segment is formed by consecutive memory accesses of a single thread. No synchronization operation is allowed inside a segment, thus, all the memory accesses use the same vector clocks and the same locks. Moreover, as we later show in Table 2, for most of the applications, the number of segments is much less than the number of memory addresses accessed. This observation allows us to reduce memory overhead. Weproposefouroptimizationsonoursegmentbasedhybridalgorithm. Thefirst optimization is based on the observation that vector clock values of a segment does not change after it is assigned. Thus, we added a limited vector clock cache for the vector clocks of segments. The second optimization is based on exploiting the same memory accesses inside a segment. If a memory address is accessed more than once inside a segment, the second and subsequent accesses have no effect on detecting the potential data races. The third optimization is based on the active number of segments. We define a maximum number of active segments, if that is exceeded, we start discarding older segments. Although this may lead to missing some of the potential data races, performance increase is considerable in many cases. Our last optimization is based on sampling a user given percentage of memory accesses during analysis. WeimplementedourhybriddetectorandouroptimizationsusingPINDynamic Binary Instrumentation (DBI) tool [14]. This tool allows us to work with binaries of applications rather than their source code, which may be crucial in commer- cial settings. In order to compare our hybrid detector we also implemented using PIN a lockset based detector (Eraser [25]) and a happens-before based detector (DJIT+ [21]). We performed experiments on 8 different applications from PAR- SEC benchmark suite [1], Apache httpd web server [9] and a parallel compression tool pbzip2 [6]. All benchmarks are written in C/C++ using Pthreads library. Our experiments showed that our hybrid detector is 15% faster than happens-before detector and produces 50% less potential data races than lockset detector. The main contribution of this work can be summarized as follows: • We develop a segment-based hybrid dynamic data race detector and present a formal treatment of the concepts and our algorithms. • We propose four different optimizations on the hybrid algorithm. Two of these optimizationsincreasetheperformanceofdataracedetectionwithoutsacrificing the precision. The remaining two optimizations provide a trade-off between Hybrid Data Race Detection for Multicore Software 189 the number of potential data races detected and the performance of data race detection. • We implement our techniques in a tool and compare with traditional dynamic data race algorithms on large multithreaded benchmarks. Therestofthepaperisorganizedasfollows. Section2givesbackgroundonour multithreaded program model and happens-before relation. We describe dynamic data race detection algorithms in Section 3. In Section 4, we present our segment based hybrid data race detection algorithm and describe several optimizations on this algorithm in Section 5. We discuss experimental results in Section 6, which is followed by related work and conclusions. 2 BACKGROUND Figure 1. Overview of our dynamic data race detection In Figure 1, an overview of our dynamic data race detection is displayed. A dy- namic binary instrumentation tool instruments a multithreaded application and when this instrumented program is executed it generates events, which are input to the data race detection algorithm. Then, the algorithm decides whether the application has potential data race(s) or not. 2.1 Multithreaded Program Model Inthiswork,weconsidermultithreadedC/C++applicationsthatusethePthreads library [23]. A multithreaded program consists of threads, memory addresses, and synchronizationobjectssuchaslocks,conditionvariables,barriers,andsemaphores. During the execution of an instrumented program, a sequence of atomic opera- tions (events), denoted by e ,...,e , are generated by each thread. We utilize the x y following types of events in our data race detection algorithms, similar to earlier works [20, 5, 26]. • READ(x,t): Memory address x is read by thread t. i i • WRITE(x,t): Memory address x is written by thread t. i i 190 A. Sen, O. Kalaci • ACCESS(x,t): Either READ(x,t) or WRITE(x,t). i i i • WR LOCK(l,t): Lock l is acquired write-held by thread t. i i • RD LOCK(l,t): Lock l is acquired read-held by thread t. i i • LOCK(l,t): Either WR LOCK(l,t) or RD LOCK(l,t). i i i • UNLOCK(l,t): Lock l is released by thread t. This is also composed of read i i and write unlock operations. • SIGNAL(cv,t): Unblock at least one of the threads that are blocked on the i condition variable cv. • SIGNAL ALL(cv,t): Unblock all threads currently blocked on the condition i variable cv. • WAIT(cv,t): Thread t blocks on a condition variable cv. i i We use the notion of synchronization points to generate the causality relation- ship among events. The following pair of corresponding events constitute the start (left) and the end (right) of synchronization points, denoted by SYNCH POINTS: (UNLOCK(l,t),LOCK(l,t )), (SIGNAL(cv,t),WAIT(cv,t )), (SIGNAL ALL(cv, i j i j t),WAIT(cv,t )). Note that similar synchronization points can be defined for i j semaphores, barriers as well as thread creation and exit events. 2.2 Happens-Before Relation and Vector Clocks Thereexistseveraltechniquesfortrackingtheconcurrencyinformationorthedepen- dencies between events. Lamport’s happened-before relation [13], which is a partial order relation, is used for capturing ordering between events generated during the executionofaconcurrentprogram. Moreformally,thehappens-beforerelation(→) among two events e and e is denoted as (e → e ) and is the smallest transitive x y x y relation that satisfies the following properties (where x(cid:54)=y and i(cid:54)=j) [21, 5]: • Program Order: (e ∈t ∧e ∈t)∧(e is executed before e in t), x i y i x y i • SynchronizationOrder: (e ∈t ∧e ∈t )∧(i(cid:54)=j)∧(e and e is a pair of events x i y j x y from SYNC POINTS), • Transitivity: (e →e )∧(e →e ). x z z y Two events, e and e are concurrent ((cid:107)) if neither of them happens-before the x y other, that is, e (cid:107)e ⇔(¬(e →e )∧¬(e →e )). x y x y y x Vector clocks [4, 17] are used to capture the happens-before relation among the events generated during program execution. A vector clock assigns timestamps to events such that the partial order relation between events can be determined by using the timestamps. A vector clock, VC, consists of a vector of n integers where n is the total number of threads in the execution. VC identifies the vector clock of i threadt andVC[j]holdsthelogicaltimeofthreadt knownbythreadt. Initially, i i j i VC[j] = 0, for i (cid:54)= j, and VC[i] = 1. A thread increments its own component of i i Hybrid Data Race Detection for Multicore Software 191 the vector clock after each event. For certain events that will be described in the nextsection,itupdatesitsvectorclockbytakingacomponentwisemaximumwith the vector clock included in the message. Below we describe these common vector clock operations: • INIT(VC): Initialize a Vector Clock, VC[j] = 1, for i == j and VC[j] = 0, i i i for i(cid:54)=j, • INCREMENT(VC): Increment a Vector Clock, VC[i]=VC[i]+1, i i i • RECV(VC,VC ): Receive Vector Clock, VC[k] = max(VC[k],VC [k]), ∀k ∈ i j i i j {1,...,n}, • Compare Two Vector Clocks: VC <VC is true, if ∀k ∈{1,...,n}:VC[k]≤ i j i VC [k]∧∃k ∈{1,...,n}:VC[k]<VC [k]. j i j We say that e →e iff e .VC <e .VC. Hence, vector clocks precisely capture x y x y the happens-before relation. A sample execution of the vector clock algorithm is given in Figure 2, where the tuples in brackets represent the vector clocks. In the example, event (action) s happened-before t since [1,0,0]<[2,1,3], where v <v if all elements of v are less i j i than or equal to the corresponding elements of v and at least one element of v is j i strictly less than the corresponding element of v . Whereas u is concurrent with t j since their vector clocks are not comparable. Figure 2. Happens-before relation and vector clocks 3 DATA RACE DETECTION ALGORITHMS We briefly describe two state-of-the-art algorithms in data race detection, namely happens-before algorithm and lockset algorithm. 3.1 Happens-Before Data Race Detection Happens-before data race detection algorithms check whether concurrent accesses bymultiplethreadstothesamememoryaddressarepossible,ifsotheyreturnarace 192 A. Sen, O. Kalaci warning. These detectors do not produce false positives but they can produce false negatives. We describe DJIT+ [21] algorithm as the happens-before algorithm. The al- gorithm maintains a vector clock for each thread t, denoted by VC, which is i i initialized at startup. A vector clock is kept for each synchronization object s, denoted by s.VC, which is initialized to all zeros at startup. For each left event of a SYNC POINTS, such as UNLOCK(s,t) the following operations take place, i s.VC = VC and VC = INCREMENT(VC). Similarly, for each right event i i i of a SYNC POINTS, such as LOCK(s,t), the following operation takes place, i VC = RECV(VC,s.VC). For example, in Figure 2, in Thread 1, one can think i i of having an UNLOCK event after the second event with vector clock [2,1,0] and the corresponding LOCK event occurs on Thread 3 changing the vector clock to [2,1,3]. Two vector clocks are kept for each memory address x, denoted by x .VC r and x .VC, which are used to keep track of the last read and the last write to x by w each thread, respectively. In Algorithm 1, we display the race detection portion of the happens-before baseddataracedetectionalgorithm. Foreachreadofxbyt inline1,thealgorithm i checks whether x .VC happens-before VC in line 3. If not, a race is detected in w i line 12. Similarly, for each write of x by t in line 6, the algorithm checks whether i each of x .VC and x .VC happens-before VC. If one of them does not, then the w r i algorithm concludes a potential data race in line 12. Note that by exploiting the transitivity of the happens-before relation, the algorithm can decide on potential race conditions by only comparing the vector clock of the last read and write with the vector clock of the current access. Algorithm 1 Happens-Before based data race detection algorithm Input: Thread t generates a memory access event on address x, ACCESS(x,t) i i Output: Potential data race detected or not 1: if ACCESS(x,ti)==READ(x,ti) then 2: xr.VC[i]=VCi[i]; 3: if xw.VC <VCi then 4: return false; (cid:46) No Race Found 5: end if 6: else if ACCESS(x,ti)==WRITE(x,ti) then 7: xw.VC[i]=VCi[i]; 8: if (xw.VC <VCi)∧(xr.VC <VCi) then 9: return false; (cid:46) No Race Found 10: end if 11: end if 12: return true; (cid:46) Race Found Hybrid Data Race Detection for Multicore Software 193 3.2 Lockset Data Race Detection Lockset data race detection algorithms check whether accesses by multiple threads tothesamememoryaddresscanoccurwhilethreadsarenotholdingacommonlock, ifsotheyreturnaracewarning. Thesedetectorsdonotproducefalsenegativesbut they can produce false positives for a given execution. We describe Eraser [25] algorithm as the basic lockset algorithm and show it in Algorithm 2. For each shared memory address x, the algorithm maintains a can- didate lockset, CLS(x). The name candidate is given since the algorithm cannot determine which lock is intended for which memory address. Thus, via candidate locksets, the algorithm attempts to infer whether a shared memory address is pro- tected by a unique lock throughout the execution. When a memory address is accessed for the first time during the execution, its candidate lockset is assigned to include all possible locks. Then, on each access in line 1, its candidate lockset is updated to its intersection with the lockset of the thread that is executing the access, LS(t). This lock refinement step aims to find the unique locks that protect i the variable during the execution. If the intersection ends up with an empty set in line 2, the algorithm concludes that there is a potential data race in line 3. Eraser lockset algorithm includes optimizations and we implemented the Eraser lockset algorithm that includes optimizations in this paper. Algorithm 2 Lockset based data race detection algorithm Input: Thread t generates a memory access event on address x i Output: Potential data race detected or not 1: CLS(x)=CLS(x)∩LS(ti) 2: if CLS(x)=={} then 3: return true; (cid:46) Race Found 4: end if 5: return false; (cid:46) No Race Found 3.3 Hybrid Data Race Detection When the above two approaches are examined in terms of preciseness, lockset de- tectors can produce false positives, whereas happens-before based detectors can produce false negatives. In terms of performance, lockset detectors are scalable, whereas happens-before based detectors are not because happens-before detectors require a high memory and processing overhead. A hybrid data race detector combines the lockset and happens-before approa- ches. A naive hybrid algorithm maintains two vector clocks and a lockset for each memory address. Such an approach may help reduce the number of false positives compared to a lockset detector. However, performance of the detector would be worsethanahappens-beforebaseddetector. First,thememoryrequirementswould 194 A. Sen, O. Kalaci bemorethanahappens-beforedetectorsinceitalsousesalocksetforeachmemory. Second, the computation overhead would increase since both vector clock compar- isons and lockset calculations are needed. 4 SEGMENT BASED HYBRID DATA RACE DETECTION ALGORITHM In order to combine the best of traditional race detectors, we developed a seg- ment based hybrid algorithm that utilizes the concept of a segment to improve performance. Our segment based hybrid approach is based on Threadsanitizer [26] algorithm. We formalize this algorithm and extend it with several performance optimizations as discussed in the next chapter. A segment seg of a thread t is a sequence of memory accesses of t, denoted i i i by {e}, where the lockset and vector clock of the segment is the same as that of i thread t. No function calls or synchronization operations are allowed inside a seg- i ment. Theoutcomeofthisisthatalltheeventsinsideasegmentareexecutedwhile the thread holds the same vector clock and the same locks. Thus, the vector clocks and the locksets could be kept on the granularity of segments instead of memory accesses. Since we observe that the total number segments is much less than the totalnumberofmemoryaddressesinanexecutionformostoftheapplications,this approach reduces the memory requirements and increases the performance consid- erably as shown in the experiments. Oursegmentbasedhybridalgorithmmaintainsseveraldatastructuresasshown in Table 1. Algorithm 3 shows how these data structures are updated during pro- gram execution. Note that a happens-before relation is established from a SIGNAL to a WAIT but not from an UNLOCK to a LOCK operation as in the case for happens-before algorithm. This is because the happens-before relation on lock operations ensures that the same lock is used by threads during memory access, however the lockset algorithm portion of the hybrid algorithm will consider this case. For each memory access in line 17 of Algorithm 3, Algorithm 4 is executed. In Algorithm 4, first, the current segment of thread t that is executing the memory i access is obtained as seg in line 1. Then, in line 2 for the write access, writer seg- i ment set of the memory address is updated so that it only includes segments that do not happen-before the current executing segment in line 3. Additionally, the current segment is added to the writer segment set of the memory address. Simi- larly, reader segment set of the memory address is updated so that it only includes segments that do not happen-before the current executing segment in line 4. It can be observed from the definition of concurrency that segments in write segment sets are pairwise concurrent, similarly for read segment sets. This is a useful per- formance optimization because accesses from non-concurrent segments, which are happens-before ordered, cannot cause a potential data race on a shared memory address. Hybrid Data Race Detection for Multicore Software 195 Thread t i Vector Clock VC i Writer Lockset WRLS(t) i Reader Lockset RDLS(t) i Segment s i Vector Clock seg.VC i Writer Lockset WRLS(seg) i Reader Lockset RDLS(seg) i Condition Variable cv Vector Clock cv.VC Memory Address x Writer Segment Set WRST . x Reader Segment Set RDST . x Table 1. Segment-based hybrid algorithm data structures It can be seen from Algorithm 4 that read and write accesses update segment setsdifferently. Onwriteaccesses,bothwriterandreadersegmentsetsareupdated (line3and4),whereas,onreadaccesses,onlyreadersegmentsetisupdated(line6). Thereasonisasfollows,onwriteaccesses,itissafetoremoveanyofthereadaccesses from RDST . Remember that, RDST consists of concurrent segments where x is x x read. Since there is no read-read type of data race, removing any segment from RDST does not lead to missing any races. On the contrary, on read accesses it is x not safe to remove any segment from WRST . The reason is that, it may lead to x missingawrite-writedataracebecausetheremovedsegmentmighthaveapotential race with one of the prospective segments in the same set. The outcome of this is that all segments within any segment set are concurrent with each other. However, not all segments in RDST are concurrent with all segments in WRST , which is x x handled while checking race among WRST and RDST . x x Algorithm 5 describes the data race checking between segment sets check- Race(WRST,RDST),whichiscalledatline8inAlgorithm4. Thealgorithmchecks forcommonlocksamongconcurrentsegmentssuchthatonesegmentisawriterseg- mentandtheotheriseitherareaderorawritersegmentsincethereisnoread-read datarace. Ifthesegmentsareawriterandareadersegmentthenthealgorithmalso makes sure that there is no happens-before relation from the writer to the reader segment as shown in line 8. We know that a happens-before relation cannot exist fromthereadertothewriterasthesehavebeenremovedduringthereadersegment set update in Algorithm 4. If the algorithm could not find a common lock among concurrent segments (lines 4 and 9), it returns true indicating a data race. Note that if any two segments are ordered by the happens-before relation, they are not checkedfordatarace,whichissimilartothehappens-beforealgorithm. Then,iftwo segments are concurrent then they are checked for holding a common lock, which is similar to the lockset algorithm.
Description: