Architectural Techniques for Improving the Power Consumption of NoC-Based CMPs: A Case Study of Cache and Network Layer

Emmanuel Ofori-Attah, Washington Bhebhe, Michael Agyeman
2017 Journal of Low Power Electronics and Applications  
The performance disparity between memory and CPU has been ameliorated by the introduction of Network-on-Chip-based Chip-Multiprocessors (NoC-based CMPs). However, power consumption continues to be a major stumbling block halting the progress of technology. Miniaturised transistors enable many-core integration, but at the cost of high power consumption caused by the components in NoC-based CMPs, particularly caches and routers. If NoC-based CMPs are to be standardised as the future of technology design, it is imperative that the power demands of their components are optimised. Much research effort has been put into finding techniques that can improve the power efficiency of both cache and router architectures. This work presents a survey of power-saving techniques for efficient NoC designs, with a focus on the cache and on router components such as the buffer and crossbar. Ultimately, the aim of this work is to compile a quick reference guide of power-saving techniques for engineers and researchers.

In an NoC-based CMP, each node consists of a router and a logic block, and connections between nodes are established through the routers using links. However, power consumption in NoC-based CMPs is proving to be a problem for SoC designers, particularly in the cache and the router. Furthermore, preliminary reports indicate that as technology advances and transistor sizes continue to shrink, leakage power will become a major contributor to NoC power consumption [4,5]. Routers consume a staggering amount of NoC power: power-hungry components such as the input buffers and crossbars prevent designers from maximising the capabilities of these systems, and continuous switching activity results in high dynamic and leakage power consumption, causing a surge in the amount of power consumed on the chip. Elsewhere, recent advancements in video streaming, image processing and high-speed wireless communication have immensely affected the design methodology for caches [6]. These advancements place demands for high performance and low power consumption on embedded systems, and objectives have shifted from achieving high peak performance to achieving power efficiency. Embedded systems therefore present a challenge for designers because of their tight power and performance budgets.

Cache: Power Concept, Architecture, Power Saving Chips (Engineering Approach to Power Saving)

This section presents the design methodology of the cache architecture, with a view toward understanding how the cache is designed, the components of the cache and the several types of cache organisation. The last part of this background focuses on how power is dissipated in the cache architecture, the different types of power consumption and the parts of the cache in which they materialise.

The cache memory is a small, high-speed memory built from Static RAM (SRAM) that holds the most recently accessed data of the main memory. Due to their size, caches cannot store all of the code and data of an executing program. The cache memory is situated between the processor and the main memory, which is Dynamic RAM (DRAM). Caches are known to perform 75% faster than their DRAM counterparts [10], because it takes a shorter amount of time (15 ns) to retrieve information stored in the cache memory than in the DRAM (60 ns). Moreover, the process of fetching an instruction from storage consumes time and power; therefore, to avoid a performance bottleneck at the input, the cache needs to be fast. The memory design centres on the principle of locality of reference, meaning that at any given time the processor accesses a small, localised region of memory, and the cache loads this localised region. The internal 16 KB cache of a Pentium processor contains over 90% of the addresses requested by the processor, giving a hit rate of 90% [11]. It is not feasible to replace the main memory with SRAM to upgrade performance, because SRAM is very expensive, less dense and consumes more power than DRAM.
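To put the latency and hit-rate figures above into perspective, consider a short worked example of Average Memory Access Time (AMAT). The formula below is the standard textbook AMAT model, assumed here purely for illustration rather than taken from the cited works; it uses the 15 ns cache latency, 60 ns DRAM latency and 90% hit rate quoted above.

\[
\text{AMAT} = t_{\text{hit}} + (1 - H)\, t_{\text{penalty}} = 15\,\text{ns} + (1 - 0.90) \times 60\,\text{ns} = 21\,\text{ns}
\]

Even under this simple model, a 90% hit rate cuts the effective access time to roughly a third of the 60 ns DRAM latency, which is why placing a small, fast SRAM in front of the DRAM pays off in both time and, since shorter fetches consume less energy, in power.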
At the same time, simply increasing the amount of SRAM has a negative effect on performance, since the processor has a larger area to search, resulting in more time and more dynamic power being spent on fetching. The cache therefore needs to be of a size at which the processor can quickly determine a hit or a miss, to avoid performance degradation.

A cache architecture has two policies: read and write. The read architecture can be either look-aside or look-through, whereas the write policy can be write-back or write-through. A cache subsystem can be divided into three functional blocks: the SRAM, the Tag RAM (TRAM) and the cache controller. The SRAM is the memory block that contains the data; accordingly, the size of the SRAM memory block determines the size of the cache. The TRAM, on the other hand, is a small section of SRAM that stores the addresses of the data held in the SRAM. The Cache Controller (CC) is the brain of the cache. It is responsible for the following actions: performing snoops and snarfs, implementing the write policy, and updating the SRAM and TRAM. In addition, it determines whether memory requests are cacheable and whether a request is a hit or a miss.

Caches also come in different organisations: Fully-Associative (FA), Direct-Mapped (DM) and Set-Associative (SA). A cache is FA if a memory block can be mapped to any of its entries: FA permits any line in the main memory to be stored at any location in the cache, and for this reason it is deemed to provide the best performance. In addition, it does not use cache pages, only lines. The main disadvantage of FA is its high complexity during fetching, because the current address must be compared with all of the addresses in the TRAM; this requires a very large number of comparators (illustrated in the sketch below), increasing the complexity and cost of implementing large caches. DM, on the other hand, divides the main memory into cache pages, where the size of each page is equal to the size of the cache, and a DM cache may only store a specific line of memory within the same line of the cache. Although DM is the least complex and least expensive of the organisations, it is far less flexible, making performance much slower, especially when moving between pages. Lastly, the SA scheme is a combination of FA and DM. Under SA, the SRAM is divided into
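The mapping schemes above can be made concrete with a short sketch of how a cache decomposes an address. The code below is a minimal, hypothetical illustration: the cache size, line size, field widths and function names are assumptions for the example, not parameters from the paper. It shows that a direct-mapped lookup derives an index and tag from the address and needs only a single tag comparison, whereas a fully-associative lookup must compare the tag against every entry, which is where the large comparator count, and its power cost, comes from.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical cache: 16 KB capacity, 32-byte lines.
 * 16 KB / 32 B = 512 lines, so 9 index bits and 5 offset bits. */
#define LINE_SIZE   32u
#define NUM_LINES   512u
#define OFFSET_BITS 5u
#define INDEX_BITS  9u

typedef struct {
    bool     valid;
    uint32_t tag;             /* held in the Tag RAM (TRAM)    */
    uint8_t  data[LINE_SIZE]; /* held in the SRAM data array   */
} cache_line_t;

/* One array reused for both lookups below, purely for illustration. */
static cache_line_t cache[NUM_LINES];

/* Direct-mapped lookup: exactly one candidate line, so one comparator. */
bool dm_lookup(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_LINES - 1u);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    return cache[index].valid && cache[index].tag == tag; /* hit or miss */
}

/* Fully-associative lookup: any line may hold the block, so the tag must
 * be checked against every entry (conceptually, NUM_LINES comparators
 * operating in parallel in hardware). */
bool fa_lookup(uint32_t addr)
{
    uint32_t tag = addr >> OFFSET_BITS; /* no index field in an FA cache */

    for (uint32_t i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == tag)
            return true;
    }
    return false;
}
```

A set-associative cache sits between these two extremes: the index selects a small set of lines (for example, two or four), and only the tags within that set are compared, keeping the comparator count, and hence the dynamic power, far below that of a fully-associative design.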
doi:10.3390/jlpea7020014