
ARMageddon: Last-Level Cache Attacks on Mobile Devices

16 Pages·2015·0.51 MB·English


Moritz Lipp, Daniel Gruss, Raphael Spreitzer, Stefan Mangard
Graz University of Technology, Austria

arXiv:1511.04897v1 [cs.CR] 16 Nov 2015

Abstract: In the last 10 years cache attacks on Intel CPUs have gained increasing attention among the scientific community. More specifically, powerful techniques to exploit the cache side channel have been developed. However, so far only a few investigations have been performed on modern smartphones and mobile devices in general. In this work, we describe Evict+Reload, the first access-based cross-core cache attack on modern ARM Cortex-A architectures as used in most of today's mobile devices. Our attack approach overcomes several limitations of existing cache attacks on ARM-based devices, for instance, the requirement of a rooted device or specific permissions. Thereby, we broaden the scope of cache attacks in two dimensions. First, we show that all existing attacks on the x86 architecture can also be applied to mobile devices. Second, despite the general belief, these attacks can also be launched on non-rooted devices and, thus, on millions of off-the-shelf devices.

Similarly to the well-known Flush+Reload attack for the x86 architecture, Evict+Reload allows launching generic cache attacks on mobile devices. Based on cache template attacks we identify information leaking through the last-level cache that can be exploited, for instance, to infer tap and swipe events, inter-keystroke timings as well as the length of words entered on the touchscreen, and even cryptographic primitives implemented in Java. Furthermore, we demonstrate the applicability of Prime+Probe attacks on ARM Cortex-A CPUs. The performed example attacks demonstrate the immense potential of our proposed attack techniques.

I. INTRODUCTION

Cache attacks represent a powerful means of exploiting the different access times within the memory hierarchy of modern system architectures. Until recently these attacks explicitly targeted cryptographic implementations. The seminal paper of Yarom and Falkner [58], however, introduced the so-called Flush+Reload attack, which allows an attacker to infer which specific parts (instructions as well as data) of a binary (shared object or executable) are accessed by a victim program. Thereby, Flush+Reload allows more fine-grained attacks than previous approaches like, for instance, plain timing attacks [8], or the well-known Evict+Time and Prime+Probe techniques [42]. Recently, Gruss et al. [18] even demonstrated the possibility to use Flush+Reload to automatically exploit cache-based side channels by so-called cache template attacks on Intel platforms. Flush+Reload does not only allow for efficient attacks against crypto implementations (cf. [7], [24], [55]) but also to infer keystroke information and even to build keyloggers on Intel platforms [18]. Thus, cache attacks represent a significant threat to today's computing platforms.

Although a few publications about cache attacks against AES T-table implementations on mobile devices exist (cf. [9], [49]–[51], [56]), the recent advances in terms of highly accurate and generic attacks have been demonstrated on x86 platforms only. More specifically, open challenges caused by the significant differences in terms of cache architecture and privileged instructions on modern x86 and ARM architectures have prevented these attacks from being mounted on ARM-based devices for years. Therefore, we investigate these open challenges in more detail and show that all of these issues can be overcome in practice. Thereby, we demonstrate that all existing cache attacks proposed for x86 architectures can also be applied on ARM architectures, which allows for real-world attack scenarios on millions of off-the-shelf Android devices without any special privileges or permissions.

As smartphones continue to evolve as the most important personal computing platform, the investigation of more generic cache attacks on mobile devices is of utmost importance. Based on related work in this area of research, we found that cache attacks are not practically relevant on ARM because of the following issues, which we overcome in this work.

1) Random replacement policy: The random replacement policy, used to replace specific cache lines within a cache set, has been mentioned as one possible source of noise in case of time-driven cache attacks [49], [51]. Furthermore, due to the random replacement policy, the Evict+Time approach is so far considered more appropriate than the Prime+Probe approach [50].
2) Precise timing: So far, precise timings on the ARM platform relied on the cycle count register (PMCCNTR) [3] that is only accessible in unprivileged mode if access is explicitly granted by a privileged application. Thus, these attacks assume that the Android device is rooted and the exploitation of cache attacks on non-rooted devices has been mentioned as an interesting open challenge [50].
3) Flush instruction: In contrast to Intel x86 platforms, ARM restricts the usage of dedicated flush instructions to privileged mode only. Thus, the device needs to be rooted and the attacker needs to permanently install a kernel module. However, an attacker can also perform cache eviction by accessing multiple congruent addresses to evict all data from a specific cache set. Due to improved replacement policies, previously published eviction is not practical anymore on more recent CPUs.
4) Cache architecture: One of the prerequisites of the Flush+Reload attack is an inclusive shared last-level cache. Inclusive means that data which is present in the lower cache levels, e.g., L1 cache, must also be present within higher cache levels and shared means that the last-level cache is shared among all cores. These properties allow any process to evict specific data from another core's L1 cache. However, ARM Cortex-A processors did not support an inclusive shared last-level cache until the ARM Cortex-A53/A57 generation.

The above mentioned challenges show the significant differences between modern Intel platforms and ARM platforms. Eventually, we show how to tackle these challenges and our insights clearly demonstrate the feasibility of highly efficient cache attacks on ARM platforms. Thereby, we do not restrict our investigations to cryptographic implementations but also consider cache attacks as a means to infer inter-keystroke timings or the length of swipe actions. In addition, our investigations show that our attack approach allows a malicious application to determine when a specific library is used. For instance, the presented attack can be used to determine whether the GPS sensor is active, or whether other features like the microphone or the camera are used. This information further allows an adversary to infer privacy-sensitive information about the user.

We do not aim to list the possible exploits exhaustively but only demonstrate the immense attack potential of our proposed Evict+Reload attack. Given this powerful technique, we believe that many sophisticated cache attacks will be presented in the future. We attribute this to the fact that Evict+Reload can be used to scan any library or program binary for possible information leaks resulting from the cache side channel.

Contributions. The contributions of this work can be summarized as follows.

• We are the first to successfully perform highly efficient and accurate cache attacks on ARM CPUs, which so far have only been shown for x86 platforms. More specifically, we demonstrate that the Evict+Reload approach can be used to apply Flush+Reload-like attacks. Furthermore, we are the first to demonstrate the feasibility of Prime+Probe attacks on ARM. Our attacks are the first access-driven cache attacks that exploit the last-level cache and, thus, work across CPU cores.
• Our attack is the first to demonstrate that local instruction caches can be attacked via data memory accesses, even if the last-level cache is only inclusive on the instruction side but not on the data side.
• We show that cache-based covert channels significantly outperform existing covert channels on Android.
• We show that cache template attacks can be used to launch sophisticated attacks on mobile devices, including attacks against cryptographic implementations used in practice but also more fine-grained attacks like inter-keystroke timings and swipe actions on the touchscreen.

Outline. The remainder of this paper is structured as follows. In Section II, we start with basic information about CPU caches and shared memory in general. Furthermore, we cover related work in this area of research. Section III presents a more realistic adversary model with fewer assumptions than existing attacks and identifies the challenges that need to be solved for such an adversary model. We describe the Evict+Reload attack in Section IV and we also state solutions to the previously identified challenges on ARM platforms. In Section V, we evaluate the performance of a cross-core covert channel between two Android applications based on our Evict+Reload attack. In Section VI, we demonstrate cache template attacks on Android based on our Evict+Reload approach. In Section VII, we describe how the eviction from the Evict+Reload attack can be used to build a Prime+Probe attack. We discuss possible countermeasures against the identified weaknesses in Section VIII and we discuss interesting open research challenges in Section IX. Last but not least, we conclude this work in Section X.

II. BACKGROUND AND RELATED WORK

In this section, we give a basic introduction to the concept of CPU caches and compare modern Intel CPU caches to modern ARM CPU caches. We discuss the basics of shared memory. Furthermore, we provide a basic introduction to cache attacks on ARM and Intel architectures.

A. CPU Caches

Today's CPU performance is influenced not only by the clock frequency but also by the latency of instructions, operand fetches, and other interaction with internal and external devices. All CPU computations require data from the system memory, but reducing the latency of system memory is difficult. Instead, CPUs employ caches to buffer frequently used data in smaller and faster internal memories, effectively hiding the latency of slow accesses to the system memory.

Modern caches are organized in sets of cache lines, which is also known as set-associative caches. Each memory address maps to one of these cache sets and memory addresses that map to the same cache set are considered as being congruent. Congruent addresses compete for cache lines within the same set. Therefore, CPUs implement replacement policies, for example, least-recently used (LRU) eviction which is common on Intel CPUs. However, these replacement policies are widely undocumented [17].

CPU caches can be virtually indexed or physically indexed, deriving the index from the virtual or physical address, respectively. Virtually indexed caches are generally faster because they do not require virtual to physical address translation before the cache lookup. However, using the virtual address leads to the situation that the same physical address might be cached in different cache lines, which again introduces performance penalties. In order to uniquely identify the actual address that is cached within a specific cache line, a so-called tag is used. This tag again can be based on the virtual or physical address. Most modern caches use physical tags because they can be computed simultaneously to locating the cache set.

CPUs have multiple cache levels, with the lower levels being faster and smaller than the higher levels. If all cache lines from lower levels are also stored in a higher-level cache, we call the higher-level cache inclusive. If a cache line can only reside in one of the cache levels at any point in time, we call the caches exclusive. The last-level cache is often shared among all cores to enhance the performance upon transitioning threads between cores and to simplify cross-core cache lookups. However, with shared last-level caches that are exclusive or inclusive, one core can (intentionally) influence the cache content of all other cores. This is the basis for cache attacks such as the Flush+Reload [58] attack.

In this paper we focus on two CPUs, a Qualcomm Krait 400 and an ARM Cortex-A53, both very common quad-core CPUs used in today's smartphones.

The Qualcomm Krait 400 runs at 2.88 GHz. It has two 16 KB L1 caches (one for data and one for instructions) per core, with a cache line size of 64 bytes. The L1 caches are physically indexed and physically tagged and have 4 ways and 64 cache sets. The L2 cache is shared among all cores, but it is neither inclusive nor exclusive [3]. The L2 cache has a size of 2 MB and is divided into 2048 sets, each with 8 cache lines and a cache line size of 128 bytes.

The more recent ARM Cortex-A53 runs at 1.2 GHz. It has two 4-way 32 KB L1 caches (one for data and one for instructions) per core, with a cache line size of 64 bytes. The L1 caches are again physically indexed and physically tagged and have 128 cache sets. The L2 cache is shared among all cores and inclusive with respect to the L1 instruction cache but exclusive with respect to the L1 data cache [1], [5]. It has a total size of 512 KB and is divided into 512 sets with 16 cache lines each and a cache line size of 64 bytes.

B. Shared Memory

While writeable shared memory can be used as a means for communication between two processes, read-only shared memory can be used as a means of memory optimization. In case of shared libraries this does not only reduce the memory footprint of a system but also enhances the speed as shared code is kept only once in memory and in CPU caches as well as address translation units. The operating system implements this behavior by mapping the same physical memory into the address space of each process.

When executing self-modifying code or just-in-time compiled code, this advantage cannot be used in general. Android applications are usually implemented in Java and therefore would incur just-in-time compilation. However, there have been several approaches to improve the performance, first with optimized virtual machine binaries and more recently with native code binaries that are compiled from Java byte code using ART [2] (Android Runtime).

The operating system employs the same memory sharing mechanism when opening a file as read-only and mapping it into memory using system calls like mmap. Thus, an attacker can map code or data of a shared library or any accessible binary into its own address space, resulting in read-only shared memory, even if the program is statically linked.

Content-based page deduplication is another form of shared memory. Here, an operating system service scans the entire system memory for identical physical pages. Identical pages are merged to the same physical page and marked as copy-on-write. This mechanism can enhance system performance in cases where system memory is scarce, such as cloud systems with a high number of virtual machines or smartphones with limited physical memory.

Processes can retrieve information on virtual and physical address mappings using operating-system services like /proc/<pid>/maps or /proc/<pid>/pagemap, both on Linux and on Android. While Linux gradually restricts unprivileged access to these resources, these patches have not yet been merged into stock Android kernels. Thus, a process can retrieve a list of all loaded shared-object files and the program binary of any process and even perform virtual to physical address translation without any privileges.

C. Cache Attacks

In the groundbreaking work of Kocher [30] about timing attacks on cryptographic implementations, the CPU cache was first mentioned as a possible information leak. Later, Kelsey et al. [28] observed that cache-hit ratios can be exploited in order to break ciphers employing large S-boxes. Based on these theoretical observations, more practical attacks against DES have been proposed by Page [43] and also by Tsunoo et al. [54]. With the standardization of the Advanced Encryption Standard (AES) [14], [37], cache attacks against this block cipher have been investigated, i.e., Bernstein [8] presented the well-known cache-timing attack against AES that has been further analyzed by Neve [38] and Neve et al. [40].

While Bernstein's cache-timing attack exploited the overall execution time of the encryption algorithm, more fine-grained exploitations of the memory accesses to the CPU cache have been proposed by Percival [44] and also by Osvik et al. [42]. More specifically, Osvik et al. formalized two concepts, namely Evict+Time and Prime+Probe, to determine which specific cache sets have been accessed by a victim program. Both of these approaches consist of three basic steps which are outlined within the following paragraphs.

Evict+Time:
1) Measure the execution time of the victim program.
2) Evict a specific cache set.
3) Measure the execution time of the victim program again.

Prime+Probe:
1) Occupy specific cache sets.
2) The victim program is scheduled.
3) Determine which cache sets are still occupied.

Both approaches, Evict+Time and Prime+Probe, allow an adversary to determine which cache sets are used during the victim's computations and have been exploited to attack cryptographic implementations (cf. [23], [32], [42], [53]) but also to build cross-VM covert channels (cf. [35]).

A significantly more powerful, i.e., more fine-grained, attack approach denoted as Flush+Reload has been proposed by Yarom and Falkner [58] in 2014. This sophisticated technique exploits three fundamental concepts of modern system architectures. First, the availability of shared memory between the victim process and the adversary. Second, last-level caches are typically shared among all cores. Third, Intel platforms use inclusive last-level caches, meaning that the eviction of information from the last-level cache leads to the eviction of this data from all lower-level caches of other cores, which allows any program to evict data from other programs on other cores. While the basic idea of this attack has been proposed by Gullasch et al. [20], Yarom and Falkner extended this idea to shared last-level caches which allows for cross-core attacks. The basic working principle of Flush+Reload is as follows.

Flush+Reload:
1) Map binary (shared object or program) into address space.
2) Flush a specific cache line (code or data) from the cache.
3) Schedule the victim program.
4) Check if the corresponding code from step 2) has been loaded by the victim program.

Thereby, the Flush+Reload approach allows an attacker to determine which specific instructions are executed and also which specific data is accessed by the victim program. Thus, rather fine-grained attacks are possible and have already been demonstrated against cryptographic implementations (cf. [21], [25], [26]). Furthermore, Gruss et al. [18] demonstrated the possibility to automatically exploit cache-based side-channel information based on the Flush+Reload approach. Besides attacking cryptographic implementations like AES T-table implementations, they showed how to infer keystroke information and even how to build a keylogger according to the cache side channel. Similarly, Oren et al. [41] demonstrated the possibility to exploit cache attacks on Intel platforms from JavaScript and showed how to infer visited websites and to track the user's mouse activity.

While the above discussed attacks have been proposed and investigated for Intel processors, only few studies consider the possible exploitation of cache-based side-channel information on modern smartphones. So far, these investigations on modern smartphone platforms, like ARM Cortex-A processors, only considered the exploitation of cache attacks in order to attack cryptographic implementations. For instance, Weiß et al. [56] investigated Bernstein's cache-timing attack on a Beagleboard employing an ARM Cortex-A8 processor. As Weiß et al. claimed that noise makes the attack difficult, Spreitzer and Plos [51] investigated the applicability of Bernstein's cache-timing attack on different ARM Cortex-A8 and ARM Cortex-A9 smartphones running the Android operating system. Both investigations [51], [56] confirmed that timing information is leaking, but two major drawbacks restrict the practical application of this attack. First, many measurement samples are required, e.g., about 2^30 AES encryptions, and second, the key space could only be reduced to about 65 bits which is still rather impractical. Later on, Spreitzer and Gérard [49] improved upon these results and managed to reduce the key space to a complexity which is practically relevant.

Besides Bernstein's cache-timing attack, another attack against AES T-table implementations has been proposed by Bogdanov et al. [9], who exploited so-called wide collisions on an ARM9 microprocessor. In addition, power analysis attacks [16] and also electromagnetic emanations [15] have been shown to be powerful techniques to visualize cache accesses during AES computations on ARM microprocessors. Furthermore, Spreitzer and Plos [50] implemented the Evict+Time [42] approach in order to attack an AES T-table implementation on Android-based smartphones. However, so far only cache attacks against the AES T-table implementation have been considered on smartphone platforms and none of the most recent advances has been evaluated and investigated on mobile devices. As mobile devices advance to be the major computing platforms, we consider it especially important to further investigate the possible threat arising from cache attacks. In this paper, we aim to close this gap.

III. ADVERSARY MODEL AND ATTACK SCENARIO

As we consider a scenario where an adversary attacks a smartphone user, we obviously require the user to install a malicious application. Nevertheless, we consider this a realistic assumption due to the following reasons.

1) Non-rooted smartphones: Our application does not require a rooted smartphone as the malicious application can be executed in unprivileged userspace.
2) No permissions: Our application does not require any permission at all, which means that users will not be able to notice any suspicious behavior due to the presented permissions during the install process.
3) Malicious app can be spread through popular app markets: Based on the above mentioned advantages, we note that such a malicious application can be spread easily via available app markets, such as Google Play and the Amazon Appstore. An adversary only needs to convince the user to install our application, which can be done through a useful tool or an addictive game.
4) Independent of Android version: Our presented attack does not rely on a specific Android version and works on all stock Android ROMs as well as customized ROMs in use today.

We derive the prerequisites for our Evict+Reload attack according to the challenges identified in Section I.

1) Efficient eviction without a dedicated flush instruction: On ARM platforms we are facing two obstacles that need to be overcome. First, the dedicated flush instruction is restricted to privileged mode only. As our attack should be deployable without any privileged instructions, i.e., on non-rooted devices, we need to rely on memory accesses to congruent addresses in order to evict data from the cache. Second, due to the random replacement policy, current attacks on ARM platforms [50] rely on a number of memory accesses which is two to three times the number of ways. Thus, we optimize the number of memory accesses, i.e., we aim for a more efficient cache eviction.
2) Precise timings without performance monitor registers: Cache attacks on ARM use the PMCCNTR register [3], which allows for cycle-accurate timing measurements. However, access to this register must be granted by a privileged application, which means that a rooted smartphone is required for state-of-the-art cache attacks. Even though the challenge of launching cache attacks on non-rooted devices has been mentioned by Spreitzer and Plos [50] as interesting future work, it has not been solved yet.
3) Shared memory between applications: The Flush+Reload attack proposed by Yarom and Falkner [58] requires read-only shared memory between a spy process and a victim process. This ensures that both processes work on the same physical addresses and the spy process can determine the victim's memory accesses by means of cache hits and cache misses, respectively.
4) Shared inclusive/exclusive last-level cache: The Flush+Reload attack also relies on a shared inclusive last-level cache. The most recent ARM Cortex-A53/A57 generation has a unified shared last-level cache that is inclusive on the instruction side and exclusive on the data side. We investigate how this cache can be instrumented to implement Flush+Reload-like attacks. In addition, we show how to launch our presented attack even against shared last-level caches that are neither inclusive nor exclusive and, thus, our attack can be used to attack older ARM Cortex-A8 and ARM Cortex-A9 devices as well.

The same prerequisites also apply for the Prime+Probe attack, except that the spy and the victim process do not need to share memory.

In the next section we show how to fulfill all these prerequisites and how an adversary can exploit these features in order to launch generic cache template attacks as described by Gruss et al. [18].

IV. THE EVICT+RELOAD ATTACK

In this section we present Evict+Reload, a highly accurate access-based cross-core cache side-channel attack. It is a variant of the Flush+Reload attack proposed by Yarom and Falkner [58] and was first mentioned by Gruss et al. [18] as an alternative to Flush+Reload. Instead of using an instruction to flush data from the cache, Evict+Reload evicts data from the cache by performing memory accesses, similarly as in Evict+Time or Prime+Probe attacks. However, Gruss et al. observed that Evict+Reload has no practical use on x86 platforms as the clflush instruction is available in unprivileged mode on Intel platforms. While the ARMv7-A and the ARMv8-A instruction sets support various privileged flush instructions, there is no flush instruction that is accessible from userspace [3]. Thus, we need to rely on memory accesses for eviction purposes.

The Evict+Reload attack as described by Gruss et al. is specifically tailored to the x86 architecture and assumes the availability of several x86 instructions, such as rdtsc. Therefore, it is not directly applicable to ARMv7-A and ARMv8-A architectures. We first describe Evict+Reload in a generic way and subsequently we show how it can be implemented on ARMv7-based architectures.

Similarly to the Flush+Reload attack, Evict+Reload relies on the availability of shared memory between the attacker and the victim, e.g., shared libraries or program binaries. Following the notion of Flush+Reload, the attacker process maps the library or binary under attack into its own virtual address space. The attacker then "probes" addresses within this shared memory area. More specifically, for each address a within the shared memory area, the attacker evicts address a from the CPU cache, waits for the victim process to be scheduled, and finally the attacker determines whether the victim fetched address a into the CPU cache again. Algorithm 1 summarizes the single steps for a specific address a.

Algorithm 1: Evict+Reload attack
Input: Address a of mapped shared memory m
Output: Cache hit/miss on address a after victim has been scheduled
  Evict address a from cache
  Wait for victim to be scheduled
  Check whether or not victim loaded a into the cache
  if Victim loaded a into the cache then
    return Hit
  else
    return Miss
  end

The basic idea is that depending on whether or not the victim process accessed the memory region (instruction or data) corresponding to address a, the attacker either observes a cache hit or a cache miss when accessing address a again. Therefore, either timing information, a cycle counter, an instruction counter, or a cache-miss counter can be employed. In case the attacker observes a cache hit, the victim process must have loaded the memory address a into the cache before. Otherwise, the attacker learns that the victim most likely did not access this address. The Evict+Reload attack has a high accuracy as it is highly unlikely that exactly one cache line is evicted by other system activity before the attacker is able to reload it. However, if the victim's access happens after the reload step and before finishing eviction in the next iteration, the cache line can be evicted by the attacker accidentally. The execution time of this code fragment is always below 1 µs. By choosing the duration of the waiting step, the attacker chooses a trade-off between the temporal resolution and the probability of missing an event. For instance, a waiting duration of 1 ms yields a probability below 0.1% to miss an event while still being accurate enough for most use cases. In other scenarios spy and victim might be synchronized as the spy triggers the victim computation. In this case the probability of missing a cache hit is even lower.

As already mentioned above, an attacker can rely on various mechanisms, e.g., timing information or a performance counter like a cache-miss counter, in order to distinguish between cache hits and cache misses. In this work we employ a cycle-accurate timer in order to distinguish between a cache hit and a cache miss. Figure 1 illustrates a histogram for memory accesses resulting in a cache hit and memory accesses resulting in a cache miss. Based on this histogram, an attacker can reliably determine a threshold to distinguish between cache hits and cache misses, which is then used in Algorithm 1. The plot in Figure 1 is based on the privileged cycle counter register (PMCCNTR). However, we will demonstrate a means to overcome this privileged register in Section IV-B.

As already mentioned in Section III, a few challenges need to be overcome in order for Evict+Reload to be applicable for our adversary model. Within the following subsections we investigate how to fulfill all the requirements for our attack strategy. First, we discuss the technicalities in order for our attack to work on shared inclusive as well as shared exclusive last-level caches. Afterwards, we show how to get rid of the prerequisite of a rooted smartphone in order to get cycle-accurate timings. Last but not least, we show how to evict specific cache sets without relying on the privileged flush instruction.

Fig. 1. Histogram of access times for cache hits and cache misses on the ARM Cortex-A53 test device. (x-axis: execution time in cycles, 0 to 300; y-axis: number of accesses, ·10^4; series: Hit, Miss.)

Fig. 2. Cross-core instruction cache eviction through data accesses. (Core 0 and Core 1, each with L1I and L1D caches, above a shared L2 unified cache; steps (1) to (3) are marked.)

A. Shared Cache Lines on ARM CPUs

Sharing memory between victim and spy process is straightforward on Linux and Windows. As the operating system tries to minimize the memory footprint, binaries and shared objects are always mapped as read-only shared memory into all processes. Since Android is based on Linux, the same concepts also apply to the Android OS. Even though most Android applications are implemented in Java and, thus, sharing memory between a spy and a victim process is more difficult, the underlying system employs shared memory just as in Linux. Hence, we target shared libraries or program binaries on Android as well.

When it comes to caches, ARMv7-A and ARMv8-A CPUs are very heterogeneous compared to Intel CPUs. Whether a CPU has a second-level cache can be decided by the manufacturer. As we only consider multi-core CPUs with a second-level cache as they are predominant in Android smartphones, there is only a limited number of properties that influence whether cache lines are shared and to what extent. The last-level cache on ARMv7-A and ARMv8-A devices is usually shared among all cores. However, the last-level cache can be inclusive to lower-level caches, that is, every cache line in any core's lower-level cache is also contained in the last-level cache. It can also be exclusive, that is, no cache line can be in two cache levels at the same time. The last-level caches of ARMv7-A CPUs, i.e., ARM Cortex-A8 and ARM Cortex-A9 processors, are usually neither inclusive nor exclusive. Thus it is difficult to perform cross-core attacks. However, in this scenario an attacker can run the spy process on all cores simultaneously and thus fall back to a same-core attack. This changed with the ARMv8-A architecture, e.g., ARM Cortex-A53 and ARM Cortex-A57 processors. On this architecture the last-level cache is inclusive on the instruction side and exclusive on the data side [1], [5].

To perform a cross-core attack we do not execute the instructions we want to spy on. Instead we load enough data into the cache to fully evict the corresponding last-level cache set. Thereby, we exploit that the last-level cache is inclusive on the instruction cache side and can evict instructions from the other core's local caches. Figure 2 illustrates such an eviction. In step 1, an instruction is allocated to the last-level cache and the instruction cache of one core. In step 2, a process fills its core's data cache, thereby evicting cache lines into the last-level cache. In step 3, the process has filled the last-level cache set using only data accesses and thereby evicts the instructions from other cores' instruction caches as well.

Although the data caches are exclusive, we observe that we can perform cross-core attacks on these caches as well. After evicting data from the last-level cache using memory accesses we measure higher access times. When another process running on a different core reaccesses the data we observe a lower access time again. These timing measurements would suggest an inclusive cache architecture although it is exclusive according to the documentation [1], [5]. We assume that this is due to the cache-coherency protocol between the CPU cores. If remote-core data fetches are performed instead of accesses to physical memory it might be fast enough to be observed as a cache hit. However, this attack is generally not possible on exclusive last-level caches. Further investigations on exclusive last-level caches and the influence of cache-coherency protocols are beyond the scope of this work and we consider these investigations as possible future work.

On the Krait 400 the L2 cache is neither inclusive on the data side nor on the instruction side. However, we observed that cross-core cache attacks are still possible in most cases. The reason for this could be the cache-coherency protocol between the CPU cores. In cases where the Evict+Reload attack did not work on the Krait 400 we found that launching the attack in parallel on all CPU cores allows performing Evict+Reload in a local-core attack. Thus, the attack can be applied to older devices with non-inclusive caches as well.

B. Distinguishing Cache Hits and Cache Misses

In order to distinguish cache hits and cache misses, timing sources or dedicated performance counters can be used, i.e., anything that captures the difference between cache hits and cache misses. We focus on timing sources as cache misses have a significantly higher access latency. While cache attacks on x86 CPUs employ the rdtsc instruction, which provides sub-nanosecond resolution timestamps and can be accessed by any unprivileged user program, the ARMv7-A architecture does not provide an instruction for this purpose. Instead, the ARMv7-A architecture has a performance monitoring unit that allows monitoring CPU activity. One of these performance counters, denoted as cycle count register (PMCCNTR), can be used to distinguish between a cache hit and a cache miss by relying on the number of CPU cycles that passed during a memory access. However, the performance monitoring unit is not accessible from userspace by default.

Fig. 3. Histogram of cache hits and cache misses on the ARM Cortex-A53 according to the PMCCNTR register and the perf_event_open interface. (x-axis: access time in cycles, 0 to 300; y-axis: number of accesses, ·10^4; series: Hit (PMCCNTR), Hit (syscall), Miss (PMCCNTR), Miss (syscall).)

[...] fetched from the physical memory, an existing cache line needs to be replaced, i.e., evicted according to a predefined replacement policy. The ARMv7-A architecture defines two different cache replacement policies, namely round-robin and pseudo-random replacement policy. In practice only the pseudo-random replacement policy is used for reasons of performance and since switching the cache replacement policy is only possible in privileged mode.

In order to intentionally evict a specific cache set without a dedicated flush instruction, an attacker needs to access congruent memory addresses that map to the same cache set. This has already been proposed by Osvik et al. [42]. Similarly, Hund et al. [22] flush the whole CPU cache on an older Intel CPU without using cache maintenance operations by accessing a physically consecutive memory buffer which is larger than the cache.
Although this also evicts the targeted address it is definitely not the fastest approach, as it would Previous work assumed that an attacker is able to enable be sufficient to access only addresses which are congruent to userspace access to these performance counters by setting a the address to be evicted. Cache eviction has recently been certain register while running in privileged mode. In order to investigated in more detail for the x86 architecture [32], [35], do so, it is required to obtain root privileges on the device [41] in order to perform Prime+Probe attacks. However, as and to load a kernel module. While this is possible on rooted already observed by Spreitzer and Plos [50], on ARMv7-A devices if specific applications are installed and executed, the CPUs it is necessary to access more addresses than there vast majority of Android smartphones are not rooted. are cache lines per cache set, because of the pseudo-random replacement policy. We improve these eviction strategies by Thus, in order to broaden the attack surface, we do not applying methods of Gruss et al. [17]. want to rely on root privileges in order for our attack to work. Our observations showed that in Linux kernel version 2.6.31 WhilethecachereplacementonARMCortex-Aprocessors the perf_event_open syscall has been introduced as an is described as a pseudo-random replacement policy, there are abstractlayertoaccessruntimeperformanceinformationinde- no details on the actual implementation. Gruss et al. [17] pendentoftheunderlyinghardware.Itallowstoaccessperfor- recently proposed an algorithm to find access patterns that mance counters on different CPUs through a unified interface. achieve a high eviction rate and a low execution time on Intel In our case we use the PERF_COUNT_HW_CPU_CYCLES CPUs. Although they claim that their algorithm works on any performance counter that returns an accurate cycle count just architecture, they only examined different Intel architectures. 
as the privileged instructions. However, due to the fact that Whiletheiralgorithmisveryslowandcomputesboth,congru- this approach relies on a syscall to acquire the cycle counter entaddressesandanaccesspattern,weonlyneedtosearchfor value, a latency overhead can be observed. theaccesspattern.ThisisduetothefactthatAndroidprovides access to the mapping of virtual to physical addresses through In Figure 3 we show the cycle count distribution as /proc/self/pagemap. Although access to this mapping measured using the perf_event_open interface and via has already been identified as a potential security issue on access to the privileged PMCCNTR register. Although the x86 [47] and recent Linux kernels [29] restrict access to this system call introduces latency and noise, cache hits and cache mapping, current Android versions still allow access to any misses are still clearly distinguishable. Thus, even with this unprivilegedapplication.Therefore,wecancomputecongruent latency overhead we are able to exploit this syscall in order to addresses directly and only use their algorithm to evaluate the successfully launch our proposed Evict+Reload attack. access patterns using our set of congruent addresses. Throughthissyscallinterface,itispossibletogettheCPU We applied the algorithm of Gruss et al. [17] to the set cycle counter value from userspace without privileged access. of congruent addresses—which has been established via the Since only Android 1.0 used a kernel version below 2.6.31 access to /proc/pid/pagemap—for our two test devices. and more recent Android versions deploy kernel versions 3.4 or 3.10, the perf_event_open interface is available on Figure 4 shows the best access pattern for the Krait 400 CPU. On the y-axis we illustrate the different (but congruent) all Android devices above Android 1.0. Furthermore, today addresses and on the x-axis we illustrate the memory accesses the number of devices running Android 1.0 is negligible over time. 
Hence, for the Krait 400, which has a 4-way L1 and, therefore, we assume that about 1.4 billion Android cache and an 8-way L2 cache, our eviction set consists of a devices [33] are affected. total of 12 congruent addresses. These addresses are accessed usingreadaccesseswithinaloopof10rounds,with3memory C. Unprivileged Cache Eviction accesses per round, and theses accesses are shifted by 1 after As described in Section II-A, modern CPUs employ set- every round. Based on this eviction strategy, we measured an associative caches. Multiple cache lines comprise one cache eviction rate of 100% and an average execution time of 599 set and addresses that map into the same set are considered cycles if performed in an Evict+Reload attack. Although a as congruent. These congruent addresses compete for cache strategy accessing every address in the eviction set only once lines in this set. If a cache line has to be allocated for data wouldperformsignificantlylessmemoryaccesses,itconsumes s s e r d ess Ad r d d A Time Fig.5. ExcerptoftheoptimalaccesspatternonourCortex-A53testdevice. 31 2423 8 7 0 Time SendSequence Payload CRC Number Fig.4. OptimalaccesspatternonourKrait400testdevice. 10 3 2 0 ResponseSequence CRC TABLEI. DIFFERENTEVICTIONSTRATEGIESONTHEKRAIT400 Number N A D Cycles Rate Fig.6. Formatofsenddataframes(above)andresponsedataframes(below). - - - 549 100.00% 6 1 3 32 0.00% 12 1 3 599 100.00% V. AHIGHPERFORMANCECACHECOVERTCHANNEL 13 1 2 582 50.03% 32 3 2 16689 99.97% In this Section we describe a high performance cross- core cache covert channel on modern smartphones using Evict+Reload. A covert channel enables two unprivileged ap- moreCPUcycles.Foranevictionrateof100%theevictionset plicationsonasystemtocommunicatewitheachotherwithout size is at least 16 and the execution time at least 1460 cycles using any data transfer mechanisms provided by the operating in the same attack scenario. system. 
This communication evades the sandboxing concept Table I summarizes different eviction strategies for the and the permission system. Particularly on Android this is a Krait400.Thefirstcolumnindicatesthetotalevictionsetsize problem,asthiscovertchannelcanbeusedtoexfiltrateprivate N.Adenotestheshiftoffsettobeappliedaftereachroundand datafromthedevicethattheAndroidpermissionsystemwould D indicates the number of memory accesses in each iteration. normally restrict. An attacker could use one application that The column cycles states the execution time for the eviction hasaccesstothepersonalcontactsoftheownerofthedeviceto and the last column indicates the eviction rate. For instance, a senddataviathecovertchanneltoanotherapplicationthathas strategy with the same loop as before, but only 2 accesses to Internet access (cf. collusion attacks [34]). In such a scenario different addresses per round over 13 rounds has an average an adversary can steal personal information. executiontimeofonly582cyclesbuttheevictionratedropsto Our covert channel is established on addresses of a shared 50%.ThefirstlineinTableIstatestheexecutiontimeandthe librarythatisusedbyboth,thesenderandthereceiver.While eviction rate for the privileged flush instruction, which gives bothprocesseshaveread-onlyaccesstothesharedlibrary,they the best result in terms of execution time (549 cycles). Our cantransmitinformationbyloadingaddressesfromtheshared best identified eviction strategy also achieves an eviction rate library into the cache or evicting it from the cache. of 100% but takes 599 cycles. Thus, there is a small trade-off between using the optimal but privileged flush instruction and We implement the covert channel using a simple protocol. the unprivileged eviction strategy which takes slightly more Dataistransmittedinnbits,anadditionals-bitsequencenum- time. ber, and a c-bit checksum that is calculated over the payload and the sequence number. 
The sequence number is used to We performed the same evaluation on our ARM Cortex- distinguish consecutive packages and the checksum is used to A53testsystem.Figure5showsanexcerptofthebesteviction check the integrity of the payload and the sequence number. pattern we found for this CPU. The access pattern consists of If a received data package is valid, the s-bit sequence number aloopof18rounds,eachwith4repeatedmemoryaccessesto is sent back accompanied with an additional x-bit checksum the same 6 addresses and where theses accesses are shifted by calculated over the returned sequence number. By adjusting 1 in each iteration. Thus the eviction set contains 23 different the sizes of checksums and sequence numbers the error rate addresses for the ARM Cortex-A53 with a 4-way L1 cache ofthecovertchannelcanbecontrolled.Figure6illustratesthe and a 16-way L2 cache. Based on this eviction strategy, we data frames for sending data and the corresponding responses. measuredanevictionrateof99.86%andanaverageexecution time of 8789 cycles. Again accessing every address only once Eachbitisrepresentedbyoneaddressinthesharedlibrary, in the eviction set only once is much less efficient although it whilenotwoaddressesarechosenthatmaptothesamecache involves significantly less memory accesses. For this strategy, set. In order to transmit a bit value of 1, the sender accesses we had to increase the eviction set size to 800 to achieve an the corresponding address in the shared library. To transmit a eviction rate of 99.04%. Eviction then takes 131804 cycles, bit value of 0, the sender does not access the corresponding which is 15 times as much as with the best strategy we found. address, resulting in a cache miss on the receiver’s side. Thus, WesuspectthereasonforthisinexclusivenessoftheL2cache the receiving process observes a cache hit or a cache miss on the data side. We can only fill an L2 cache set by evicting dependingonthememoryaccessperformedbythesender.The data from L1. 
Therefore, it is better to reaccess the data that sender then checks whether the acknowledge bit has been set is already in the L2 cache and gradually add new addresses to by the receiver. If it is set, the response sequence number will the set instead of accessing more different addresses. be measured on the corresponding cache sets. If the response sequence number and the response checksum are valid, the TABLEII. COMPARISONOFCOVERTCHANNELSONANDROID sender continues with the next packet and resends the current Work Type Permission Bandwidth[bps] packetotherwise.Algorithm2andAlgorithm3provideamore Ours Cache - 18534 formal description of the sender and the receiver, respectively. Marforioetal.[34] TypeofIntents - 4300 Marforioetal.[34] UNIXsocketdiscovery - 2600 Schlegeletal.[46] Filelocks - 685 Algorithm 2: Sending data Schlegeletal.[46] Volumesettings - 150 Schlegeletal.[46] Vibrationsettings - 87 Input: Mapped shared library m Schlegeletal.[46] Screenwakelock WAKE_LOCK 5 Input: Data to send d sn← Initial sequence number; for fn ← 1 to number of frames do mightbepossible.Inparticularthepacketpayloadsizecanbe p← Current package (sn, d , CS(fn,d )); increased or checksum and sequence number size decreased x x received← false; until the error rate reaches 1% while increasing the overall do transmission rate. However, the comparison in Table II shows Access sending bit address; that our covert channel clearly outperforms existing covert Access or evict packet bit addresses; channels by a factor of 4 in terms of bandwidth. In addition, ack ← Access Acknowledge bit address; Marforio et al. [34] also reported that the XManDroid frame- if ack ≡true then work [10], [11] is able to detect the Type of Intents covert Measure response data addresses; channel as well as the UNIX socket discovery channel, which sn ,cs ← Response seq. number, CRC; indicatesthatthepreviouslyfastestcovertchannelscanalready m m if CS(sn,d )≡cs and sn≡sn then be prevented. 
x m m received←true; end VI. SIDE-CHANNELATTACKSONMOBILEDEVICES end while received ≡ false; In this section we demonstrate access-driven cache side- sn←sn+1; channel attacks on mobile Android devices. We implement end cache template attacks as described by Gruss et al. [18] to create and exploit accurate cache-usage profiles using the Evict+Reload attack. Cache template attacks consist of a profiling phase and an exploitation phase. In the profiling Algorithm 3: Receiving data phase, a template matrix is computed that represents how Input: Mapped shared library m many cache hits occur on a specific address after triggering while true do a specific event. The exploitation phase uses this matrix to received←false; infer events from cache hits. For further details about cache do template attacks we refer to Gruss et al. [18]. sn← Initial sequence number; sending ←false; To perform cache template attacks, an attacker has to be do ablemapsharedbinariesorsharedlibrariesasread-onlyshared sending ← Measure sending bit address; memory into its own address space. We have already shown while sending ≡ false; that this is possible in the previous section. By using shared Measure packet data addresses; libraries, the attacker bypasses any potential countermeasures sn ,d ,cs ← Sequence number, data, CRC; taken by the operating system, such as restricted access to m m m if CS(sn ,d )≡cs then runtime data of other apps or address space layout random- m m m if sn≡sn then ization (ASLR). The attack can even be performed online on m received←true; the device under attack if the event can be simulated. We Report d ; exemplary show how to simulate these events in case of touch m sn←sn+1; actions below. end Access acknowledge bit address; Withinthefollowingsubsectionsweshowhowcachetem- Access or evict response data bit addresses; plateattackscan beusedtoinfertouch actionsandkeystrokes else on Android-based smartphones. 
Furthermore, we demonstrate Evict acknowledge bit address; an effective attack against the AES T-table implementation, end which is still deployed on all smartphones running stock while received ≡ false; versions of Android. end A. Inferring Touch Actions and Keystrokes We implemented this covert channel on our ARM Cortex- Welaunchedcachetemplateattacksonuserinputeventsin A53 device with the following parameters. We use an 8-bit order to find addresses in shared libraries that are responsible checksum, a 64-bit payload, and an 8-bit sequence number. for user input handling. After identifying such addresses, the According to these parameters, we achieve a transmission rate idea of cache template attacks is to monitor these addresses in of 18534bps. A practical evaluation of our implementation of order to infer user input events. Just as Linux, Android uses the proposed covert channel with these parameters yields an a large number of shared libraries, each with a size of up to error rate of 0%, which indicates that further improvements several megabytes. We inspected all available libraries on the 200 key s e longpress cl nt cy 150 e swipe n Ev ei tap m 100 ti text s s e 840 880 280 700 080 100 140 840 880 900 940 980 000 040 080 Acc 50 Tap Tap Tap Swipe Swipe Swipe Tap Tap Tap Swipe Swipe x x 3 7 8 8 8 8 8 8 8 8 1 1 1 0 0 x x x x x x x x x x 1 1 1 0 0 0 0 0 0 0 0 0 0 0x 0x 0x 0 5 10 15 Addresses Timeinseconds Fig.7. Cachetemplatematrixforlibinput.so. Fig. 8. A user input sequence of 3 tap events, 3 swipe events, 3 tap events, and 2 swipe events measured on address 0x11040 of /system/lib/libinput.soonourARMCortex-A53testdevice. systembymanuallyscanningthenamesandidentifiedlibraries thatmightberesponsibleforhandlinguserinput,e.g.,likethe libinput.so library.1 tack application reliably reported cache hits on the mon- itored addresses. 
For instance, on address 0x11040 of After we identified libinput.so2 as our main target /system/lib/libinput.so tap actions and swipe ac- library, we launched the profiling phase of the cache template tions on the screen can be distinguished. Tap actions cause a attack. Thus, we simulated the user input events like tapping smallernumberofcachehitsthanswipeactions.Swipeactions or swiping on the screen as well as sending text and key cause cache hits in a high frequency as long as the screen is events to applications. We simulated these events via the touched. In order to illustrate this, Figure 8 shows a sequence android-debug bridge (adb shell) with two different methods.3 of 3 tap events, 3 swipe events, 3 tap events, and 2 swipe The first method uses the input command line tool to events.Theseeventscanbeclearlyobservedbythefastaccess simulateuserinputevents.Thesecondmethodiswritingevent times. The gaps mark periods of time where our program was messages to /dev/input/event*. As the second method not scheduled on the CPU. Events that would occur in those only requires a write() statement it is significantly faster, periods can be missed by our attack. but it is also more device specific. Therefore, we used the inputcommandlinetoolinmostscenarios.Whilesimulating Due to the fact that we are able to determine how long a theseevents,wesimultaneouslyprobedalladdresseswithinthe swipeactionlasts,weareabletoinferthelengthoftheentered libinput.solibrary,i.e.,wemeasuredthenumberofcache word on Android keyboards that use the so-called swipe input hits that occurred on each address after triggering a specific method. Swipe input allows a user to enter words by swiping event. over the soft-keyboard and thereby connecting the single characters to form a word. Since we are able to determine Figure 7 shows the cache template matrix for the the length of swipe movements, it is possible to correlate the libinput.so library. 
We triggered the following events: length of the swipe movement with the actual length of the keyeventsincludingthepowerbutton(key),longtouchevents word. Furthermore, for the pattern-unlock mechanism we can (longpress), swipe events, touch events (tap), and text input determine the actual length of the unlock pattern. events (text) via the input command line tool as often as possibleandmeasuredeachaddressforonesecond.Thecache In addition it is also possible to apply this attack to opti- template matrix clearly reveals addresses with high cache-hit mized virtual machine binaries or more recently used ahead- rates for the corresponding events. More specifically, squares of-time compiled ART (Android Runtime) executables [2]. with a darker color represent addresses with a higher cache- We have used this attack on the AOSP Android Keyboard hit rate for the corresponding event, whereas lighter colors and evaluated the number of accesses to every address in the represent addresses with lower cache-hit rates. Hence, the optimized executable that responds to a tap of every letter on cache template matrix reveals addresses that can be used to the keyboard. It is possible to find addresses that correspond distinguish different events. to a key press and more importantly to distinguish between them.InFigure9weobserveaddresseswherenohitsoccurred We also verified the previously revealed addresses by and addresses where the hit distribution is almost uniform monitoring the identified addresses while operating the smart- for every letter. These addresses can be used to monitor key phone manually, i.e., we touched the screen and our at- presses on the keyboard. In addition we identified an address that corresponds only to letters on the keyboard and hardly on 1Note that we only restricted the set of possible libraries since testing all the space bar or the Enter button. 
With this information it is librarieswouldhavetakenasignificantamountoftime.Yet,anadversarycan possible to precisely determine the length of single words. alsoexhaustivelyprobeallavailablelibraries. 2Weconsideredthislibraryasbeingresponsibleforprocessinginputevents Similar to the attack on the libinput.so, we are able duetoitsname. to infer the length of words when the user simply taps on the 3Observethattriggeringtheactualeventwhichanattackerwantstospyon, mightrequireeither(1)anofflinephaseor(2)privilegedaccess.Forinstance, characters of the soft-keyboard. This is due to the fact that we incaseofakeylogger,theattackercangatheracachetemplatematrixoffline areabletodistinguishtapsoncharactersandtapsonthespace foraspecificversionofalibrary,ortheattackerreliesonprivilegedaccessof bar. We illustrate an example of this observation in Figure 10. theapplication(oradedicatedpermission)inordertobeabletosimulatethe The red dots highlight the lower peaks of the red curve, that eventforgatheringthecachetemplatematrix.However,theactualexploitation is the spaces. The plot shows that we can clearly determine ofthecachetemplatematrixtoinfertheeventsdoesneitherrequireprivileged accessnoranypermission. the length of entered words and monitor user input accurately
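Section IV-C states that congruent addresses can be computed directly once the virtual-to-physical mapping is known from /proc/self/pagemap. The following Python sketch illustrates that step. The entry layout (page frame number in bits 0 to 54, present flag in bit 63) follows the Linux kernel pagemap documentation. The 4 KiB page size, the 64-byte line size, the 512 sets, and the plain modulo set indexing are illustrative assumptions and not the parameters of the test devices; all function names are hypothetical.

```python
from typing import Iterable, List, Optional

PAGE_SIZE = 4096            # assumed page size
PFN_MASK = (1 << 55) - 1    # bits 0-54 of a pagemap entry hold the page frame number
PAGE_PRESENT = 1 << 63      # bit 63 is set when the page is present in RAM


def physical_address(entry: int, vaddr: int, page_size: int = PAGE_SIZE) -> Optional[int]:
    """Translate a virtual address using its 64-bit pagemap entry."""
    if not entry & PAGE_PRESENT:
        return None  # page not in RAM, no physical address available
    return (entry & PFN_MASK) * page_size + (vaddr % page_size)


def cache_set_index(paddr: int, line_size: int = 64, num_sets: int = 512) -> int:
    """Set index under plain modulo indexing (assumed, not taken from the paper)."""
    return (paddr // line_size) % num_sets


def congruent_addresses(candidates: Iterable[int], target: int,
                        line_size: int = 64, num_sets: int = 512) -> List[int]:
    """Keep the candidate physical addresses that map to the target's cache set."""
    wanted = cache_set_index(target, line_size, num_sets)
    return [p for p in candidates
            if p != target and cache_set_index(p, line_size, num_sets) == wanted]
```

With these assumed parameters, two addresses are congruent when they differ by a multiple of 64 * 512 = 32768 bytes, so every eighth 4 KiB page contributes one congruent address per line offset.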
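The access patterns of Section IV-C (a window of D addresses that is accessed in each round and then shifted by A) can be sketched as a small generator. This is only an illustration of the loop structure described in the text, not the authors' implementation: it yields indices into an eviction set instead of performing memory reads, and the `repeat` parameter is an assumed reading of "4 repeated memory accesses to the same 6 addresses" for the Cortex-A53 pattern. The names are hypothetical.

```python
from typing import Iterator


def num_rounds(n_addrs: int, shift: int, per_round: int) -> int:
    """Number of rounds needed until the window reaches the end of the eviction set."""
    return (n_addrs - per_round) // shift + 1


def eviction_pattern(n_addrs: int, shift: int, per_round: int,
                     repeat: int = 1) -> Iterator[int]:
    """Yield indices into an eviction set of n_addrs congruent addresses.

    Each round touches a window of per_round consecutive addresses, repeat
    times, and the window then moves forward by shift addresses.
    """
    if per_round > n_addrs or shift < 1 or repeat < 1:
        raise ValueError("invalid pattern parameters")
    for start in range(0, n_addrs - per_round + 1, shift):
        for _ in range(repeat):
            for offset in range(per_round):
                yield start + offset


# Krait 400 strategy from Table I: N = 12, A = 1, D = 3
krait = list(eviction_pattern(12, shift=1, per_round=3))
# Cortex-A53 strategy: 23 addresses, window of 6, shifted by 1, repeated 4 times (assumed)
a53 = list(eviction_pattern(23, shift=1, per_round=6, repeat=4))
```

Under this reading, the Krait 400 parameters give 10 rounds over 12 distinct addresses and the Cortex-A53 parameters give 18 rounds over 23 distinct addresses, which agrees with the numbers reported in the text.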
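The framing of the covert channel in Section V (sequence number, payload, checksum, one cache line per bit) can be illustrated without any cache access. The sketch below packs and checks a frame with the parameters reported for the Cortex-A53 evaluation (64-bit payload, 8-bit sequence number, 8-bit checksum). The text does not specify the checksum, so the CRC-8 with polynomial 0x07 is an assumption, as are all function names. A bit value of 1 would correspond to the sender accessing the address assigned to that bit position, a 0 to leaving it uncached.

```python
from typing import List, Optional, Tuple

PAYLOAD_BITS = 64   # parameters reported for the Cortex-A53 evaluation
SEQ_BITS = 8
CRC_BITS = 8
FRAME_BITS = SEQ_BITS + PAYLOAD_BITS + CRC_BITS


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 with zero initial value; the polynomial is an assumption."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc


def build_frame(seq: int, payload: int) -> List[int]:
    """Return the frame as a bit list: sequence number, payload, then checksum."""
    seq &= (1 << SEQ_BITS) - 1
    payload &= (1 << PAYLOAD_BITS) - 1
    body = (seq << PAYLOAD_BITS) | payload
    check = crc8(body.to_bytes((SEQ_BITS + PAYLOAD_BITS) // 8, "big"))
    word = (body << CRC_BITS) | check
    return [(word >> (FRAME_BITS - 1 - i)) & 1 for i in range(FRAME_BITS)]


def parse_frame(bits: List[int]) -> Optional[Tuple[int, int]]:
    """Return (sequence number, payload) if the checksum matches, else None."""
    if len(bits) != FRAME_BITS:
        return None
    word = 0
    for bit in bits:
        word = (word << 1) | (bit & 1)
    check = word & ((1 << CRC_BITS) - 1)
    body = word >> CRC_BITS
    if crc8(body.to_bytes((SEQ_BITS + PAYLOAD_BITS) // 8, "big")) != check:
        return None  # receiver would evict the acknowledge bit address
    return body >> PAYLOAD_BITS, body & ((1 << PAYLOAD_BITS) - 1)
```

A receiver in the sense of Algorithm 3 would call `parse_frame` on the measured hit/miss pattern and acknowledge only when it returns a value whose sequence number matches the expected one.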
