Dynamic Scratch-Pad Memory Management with Data Pipelining for Embedded Systems

Yanqin Yang, Meng Wang, Zili Shao, Minyi Guo
2009 International Conference on Computational Science and Engineering
In this paper, we propose an effective data pipelining technique, SPDP (Scratch-Pad Data Pipelining), for dynamic scratch-pad memory (SPM) management with DMA (Direct Memory Access). Our basic idea is to overlap the execution of CPU instructions and DMA operations. In SPDP, based on the iteration access patterns of arrays, we group multiple iterations into a block to improve the data locality of regular array accesses, and we allocate the data of multiple iterations to different portions of the SPM. In this way, while the CPU executes instructions and accesses data in one portion of the SPM, DMA operations can simultaneously transfer data between off-chip memory and another portion of the SPM. We perform code transformation to insert DMA instructions and achieve the data pipelining. We have implemented our SPDP technique in the IMPACT compiler and conducted experiments using a set of loop kernels from DSPstone, Mibench, and Mediabench on the cycle-accurate VLIW simulator of Trimaran. The experimental results show that our technique achieves performance improvements over previous work.

INTRODUCTION

The ever-widening performance gap between the CPU and off-chip memory calls for effective techniques to reduce memory accesses. To alleviate this gap, scratch-pad memory (SPM), a small, fast, software-managed on-chip SRAM (Static Random Access Memory), is widely used in embedded systems because of its advantages in energy and area [1]. A recent study [1] shows that SPM has 34% smaller area and 40% lower power consumption than a cache of the same capacity. Since the cache typically consumes 25-50% of the total energy and area of a processor, SPM can significantly reduce the energy consumption of embedded processors. Embedded software is usually optimized for specific applications, so SPM can also be used to improve performance and predictability by avoiding cache misses. Owing to these advantages, SPM has become the most common SRAM in embedded processors. However, because SPM is completely controlled by software, fully exploiting it poses a significant challenge for the compiler.

To manage SPM effectively, two kinds of compiler-managed methods have been proposed: static methods [2] and dynamic methods [3, 4]. With static SPM management, the content of the SPM is fixed and does not change while an application runs. With dynamic SPM management, the content of the SPM is changed at run time according to the behavior of the application.

For dynamic SPM management, it is important to select an effective way to transfer data between off-chip memory and the SPM. The latency of an off-chip memory access is about 10-100 times that of an SPM access [3], and many embedded applications in the image and video processing domains have significant data transfer requirements in addition to their computational requirements. To reduce off-chip memory access overhead, dedicated cost-efficient hardware, DMA (Direct Memory Access), is used to transfer data. In this paper, we focus on how to combine SPM and DMA in dynamic SPM management to optimize loops, which are usually the most critical sections of embedded applications such as DSP and image processing.

Our work is closely related to the work in [4-8]. In [4], DMA is applied to transfer data between the SPM and off-chip memory. The same DMA cost model is used in [7] to accelerate data transfer between off-chip memory and the SPM, and the work in [8] uses DMA only to pre-fetch data from off-chip memory into the SPM. However, these approaches focus on array allocation for the SPM without exploiting the parallelism between DMA transfers and CPU computation. In our technique, we show that this parallelism can be achieved across multiple iterations of a loop.
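For context, the sketch below (our illustration, not code from the paper) shows the kind of non-pipelined dynamic-SPM loop that such allocation-only schemes imply: each group of iterations is brought into the SPM by DMA and only then processed, so the CPU waits for every transfer. The names BLOCK, A_offchip, spm_buf, dma_get, and dma_wait are hypothetical, and the DMA calls are emulated with a synchronous memcpy so the fragment compiles as ordinary C.

#include <string.h>

#define BLOCK 256                         /* iterations grouped per transfer */

static int A_offchip[4096];               /* array resident in off-chip memory */
static int spm_buf[BLOCK];                /* buffer assumed to be placed in the SPM */

/* Hypothetical stand-ins for a platform DMA interface: here they copy
   synchronously, whereas a real dma_get would program the DMA engine. */
static void dma_get(void *spm_dst, const void *mem_src, size_t bytes)
{
    memcpy(spm_dst, mem_src, bytes);
}
static void dma_wait(void) { /* nothing to wait for in this stand-in */ }

/* Non-pipelined baseline: fetch a group, then process it; the CPU is idle
   during every DMA transfer.  Assumes n is a multiple of BLOCK. */
int sum_serial(int n)
{
    int sum = 0;
    for (int i = 0; i < n; i += BLOCK) {
        dma_get(spm_buf, &A_offchip[i], BLOCK * sizeof(int));
        dma_wait();                       /* transfer latency adds to run time */
        for (int j = 0; j < BLOCK; j++)
            sum += spm_buf[j];            /* compute only after the transfer */
    }
    return sum;
}

In this form, every cycle spent on DMA is added directly to the loop's execution time, which is exactly the overhead SPDP targets.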
In this paper, we propose SPDP (Scratch-Pad Data Pipelining), an effective data pipelining technique for dynamic SPM management with DMA. Our basic idea is to overlap the execution of CPU instructions and DMA operations. In SPDP, based on the iteration access patterns of arrays, we group multiple iterations into a block to improve the data locality of regular array accesses, and we allocate the data of multiple iterations to different portions of the SPM. In this way, while the CPU executes instructions and accesses data in one portion of the SPM, DMA operations can simultaneously transfer data between off-chip memory and another portion of the SPM. We perform code transformation to insert DMA instructions and achieve the data pipelining. We have implemented our technique in the IMPACT compiler [9] and conducted experiments using a set of loop kernels from DSPstone [10], Mibench [11], and Mediabench [12] on the cycle-accurate VLIW simulator of Trimaran [13]. The experimental results show that the SPDP technique achieves performance improvements over previous work [5, 6, 8].
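A minimal sketch of this double-buffering idea, under the same assumptions as the previous fragment (hypothetical dma_get_async/dma_wait primitives emulated with a synchronous memcpy), could look as follows. On real hardware, dma_get_async would return immediately while the DMA engine copies in the background, so the inner loop over one SPM portion overlaps with the transfer into the other.

#include <string.h>

#define BLOCK 256                         /* iterations grouped per SPM block */

static int A_offchip[4096];               /* array resident in off-chip memory */
static int spm_buf[2][BLOCK];             /* two portions of the SPM (ping-pong) */

/* Hypothetical asynchronous DMA interface: a real dma_get_async would start
   the transfer and return, and dma_wait would block until it completes. */
static void dma_get_async(void *spm_dst, const void *mem_src, size_t bytes)
{
    memcpy(spm_dst, mem_src, bytes);      /* synchronous stand-in */
}
static void dma_wait(void) { /* nothing pending in this stand-in */ }

/* SPDP-style pipelined loop: while the CPU processes the iteration group in
   one SPM portion, DMA fills the other.  Assumes n is a multiple of BLOCK. */
int sum_pipelined(int n)
{
    int sum = 0, cur = 0;

    /* Prologue: fetch the first group before the steady state begins. */
    dma_get_async(spm_buf[cur], &A_offchip[0], BLOCK * sizeof(int));

    for (int i = 0; i < n; i += BLOCK) {
        dma_wait();                       /* group i has landed in spm_buf[cur] */

        if (i + BLOCK < n)                /* prefetch group i+1 into the other portion */
            dma_get_async(spm_buf[1 - cur], &A_offchip[i + BLOCK],
                          BLOCK * sizeof(int));

        for (int j = 0; j < BLOCK; j++)   /* overlaps with the ongoing transfer */
            sum += spm_buf[cur][j];

        cur = 1 - cur;                    /* swap the roles of the two portions */
    }
    return sum;
}

Each pass swaps the roles of the two SPM portions; this is the structure that SPDP's code transformation aims to produce by inserting the DMA instructions automatically.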
doi:10.1109/cse.2009.295 dblp:conf/cse/YangWSG09 fatcat:n7hcrdtakfg5xkumtijyhyxns4