Application-aware snoop filtering for low-power cache coherence in embedded multiprocessors

Xiangrong Zhou, Chenjie Yu, Alokika Dash, Peter Petrov
2008 ACM Transactions on Design Automation of Electronic Systems  
Maintaining local caches coherently in shared-memory multiprocessors results in significant power consumption. The customization methodology we propose exploits the fact that in embedded systems, important knowledge is available to the system designers regarding memory sharing between tasks. We demonstrate how the snoop-induced cache probings can be significantly reduced by identifying and exploiting in a deterministic way the shared memory regions between the processors. Snoop activity is
... ed only for the accesses referring to known shared regions. The hardware support is not only cost efficient, but also software programmable, which allows for reprogrammability and customization across different tasks and applications.

system platforms, frequently in the form of a complex multiprocessor systems-on-a-chip (MPSoC) [Barroso et al. 2000; Wolf 2004; Cumming 2003]. Such multiprocessor platforms are quite natural, as task parallelism and specialization are inherent for these embedded applications. However, most of these applications are extremely energy constrained, as they often need to operate on autonomous power sources. Even for many stationary devices with direct access to the power grid, such as set-top boxes and advanced network routers and gateways, energy efficiency is crucial due to the high cost of cooling and packaging.

Embedded systems have been traditionally designed using general-purpose hardware architectures and system software infrastructure in order to achieve flexible implementation, low design cost, and short time-to-market through the exploitation of off-the-shelf components and design reuse techniques [Wolf 2001; Sangiovanni-Vincentelli and Martin 2001]. However, such general-purpose computing architectures come with the price of excessive power consumption [Gonzalez 2000; Kathail et al. 2002], a characteristic of extreme importance for many wearable and battery-powered devices.

Shared-memory multiprocessor architectures are typically used for MPSoC platforms, as they provide for low communication latency, well-understood programming models, and flexible platforms that can easily support heterogeneous processing units and hardware coprocessors. Since accesses to the shared memory can easily overwhelm the interconnect to memory, caches are used to bring data closer to the requesting processors, thus hiding the interconnect latency and reducing the required interconnect bandwidth.
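The filtering idea summarized in the abstract, probing the local cache only for bus accesses that refer to known shared regions, can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware proposed by the authors: the names `SharedRegion`, `SnoopFilter`, `program`, and `needs_snoop`, as well as the addresses in the trace, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SharedRegion:
    """A half-open address range [base, base + size) known to be shared."""
    base: int
    size: int

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


@dataclass
class SnoopFilter:
    """Hypothetical software-programmable table of shared regions."""
    regions: list = field(default_factory=list)

    def program(self, regions) -> None:
        """Reload the region table, e.g. when a different task is scheduled."""
        self.regions = list(regions)

    def needs_snoop(self, addr: int) -> bool:
        """True only when the address falls inside a known shared region."""
        return any(r.contains(addr) for r in self.regions)


def count_probes(snoop_filter: SnoopFilter, bus_addresses) -> int:
    """Count the bus transactions that would trigger a local cache probe."""
    return sum(1 for a in bus_addresses if snoop_filter.needs_snoop(a))


# Made-up shared region and bus trace, for illustration only.
flt = SnoopFilter()
flt.program([SharedRegion(base=0x8000, size=0x1000)])
trace = [0x1000, 0x8000, 0x8FFF, 0x9000, 0x2000]
probes = count_probes(flt, trace)
print(probes, "of", len(trace), "bus accesses probe the local cache")
```

In this toy trace only the two addresses inside the programmed region cause a probe; an unfiltered snoop scheme would look up the cache for all five. Reprogramming the table per task, as `program` does here, mirrors the software programmability mentioned in the abstract.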
However, caches must be maintained coherently, since when a processor modifies cached data, other remote caches may be left with an older version of the same data. To resolve this issue, cache-coherence protocols are commonly used. In this article we focus on one of the large classes of cache-coherence protocols, namely the snoop-coherence protocols. Snoop cache-coherence protocols are designed for, and very well suited to, architectures in which a shared communication medium, most often in the form of a high-speed bus, is used to access the shared memory. The broadcast nature of the memory requests enables all the cache controllers to "snoop" the shared bus and, through invalidations, updates, and write-backs, to ensure that the local caches are kept coherent with respect to the shared memory. The provision for easily extendable multiprocessing platforms and their software-transparent implementation have made the snoop-based cache-coherence protocols easy to maintain, deploy, and reuse, while providing minimal impact on performance and memory access latency [Martin et al. 2002, 2003].

However, a major obstacle in utilizing these protocols in many modern embedded applications is their extreme energy inefficiency. The very general-purpose nature of these schemes, which has enabled their ease of integration, results in significant power consumption, thus preventing their utilization for energy-constrained embedded applications. It has been reported [Ekman et al. 2002; Loghi et al. 2005] that the power due to snoop-related cache lookups can amount to 40% of the total power consumed at the cache subsystem. Snoop-based cache-coherence schemes are general purpose in nature, as no prior knowledge regarding the application structure, and sharing patterns in particular, is available. It is assumed that all memory references can potentially
doi:10.1145/1297666.1297682 fatcat:72xf7unyuzcqxjeutbvsb3vwse