Efficient asynchronous memory copy operations on multi-core systems and I/OAT

K. Vaidyanathan, L. Chai, W. Huang, D. K. Panda
2007 IEEE International Conference on Cluster Computing
Bulk memory copies incur large overheads such as CPU stalling (i.e., no overlap of computation with the memory copy operation), small register-size data movement, and cache pollution. Asynchronous copy engines introduced by Intel's I/O Acceleration Technology (I/OAT) help alleviate these overheads by offloading the memory copy operations using several DMA channels. However, the startup overheads associated with these copy engines, such as pinning the application buffers, posting the descriptors, and checking for completion notifications, limit their overlap capability. In this paper, we propose two schemes to provide complete overlap of the memory copy operation with computation by dedicating the critical tasks to a single core in a multi-core system. In the first scheme, MCI (Multi-Core with I/OAT), we offload the memory copy operation to the copy engine and onload the startup overheads to the dedicated core. For systems without any hardware copy engine support, we propose a second scheme, MCNI (Multi-Core with No I/OAT), that onloads the memory copy operation to the dedicated core. We further propose a mechanism for an application-transparent asynchronous memory copy operation using memory protection. We analyze our schemes based on overlap efficiency, performance, and associated overheads using several micro-benchmarks and applications. Our micro-benchmark results show that memory copy operations can be significantly overlapped (up to 100%) with computation using the MCI and MCNI schemes. Evaluation with MPI-based applications such as respectively, as compared to traditional implementations. Evaluations with data-centers using the MCI scheme show up to 37% improvement compared to the traditional implementation. Our evaluations with the gzip SPEC benchmark using application-transparent asynchronous memory copy show considerable potential for using such mechanisms in several application domains.
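The control flow of the MCNI idea described in the abstract (post a copy request, keep computing, then wait for completion) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `CopyWorker` and its methods `copy_async` and `shutdown` are hypothetical names, and a Python thread stands in for the dedicated core. Because of Python's global interpreter lock, the sketch shows only the request/completion structure and does not demonstrate real overlap or performance.

```python
import queue
import threading


class CopyWorker:
    """Hypothetical stand-in for the dedicated core: a thread that services copy requests."""

    def __init__(self):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Service requests until a None sentinel arrives.
        while True:
            item = self._requests.get()
            if item is None:
                break
            dst, src, done = item
            dst[:len(src)] = src  # the copy itself, performed by the worker thread
            done.set()            # completion notification for the requester

    def copy_async(self, dst, src):
        """Post a copy request and return an Event that is set on completion."""
        if len(dst) < len(src):
            raise ValueError("destination buffer is smaller than source")
        done = threading.Event()
        self._requests.put((dst, src, done))
        return done

    def shutdown(self):
        self._requests.put(None)
        self._thread.join()


# Usage: start the copy, do unrelated computation, then wait for completion.
worker = CopyWorker()
src = bytes(range(256)) * 4096        # 1 MiB source buffer
dst = bytearray(len(src))             # destination buffer of the same size
done = worker.copy_async(dst, src)    # returns immediately
total = sum(i * i for i in range(10000))  # computation issued while the copy is pending
done.wait()                           # block only when the copied data is needed
worker.shutdown()
```

In this sketch the requester blocks only at `done.wait()`, which mirrors the abstract's goal of moving the copy and its bookkeeping off the application's core. A real implementation in a systems language would pin the worker to a specific core and use shared memory rather than a Python queue.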
doi:10.1109/clustr.2007.4629228 dblp:conf/cluster/VaidyanathanCHP07 fatcat:p2xw7l3ecvcgrovrzynlkyazum