Collective memory transfers for multi-core chips

George Michelogiannakis, Alexander Williams, Samuel Williams, John Shalf
2014 Proceedings of the 28th ACM international conference on Supercomputing - ICS '14  
Future performance improvements for microprocessors have shifted from clock frequency scaling towards increases in on-chip parallelism. Performance improvements for a wide variety of parallel applications require domain decomposition of data arrays from a contiguous arrangement in memory to a tiled layout for on-chip L1 caches and scratchpads. However, DRAM performance suffers under the non-streaming access patterns generated by many independent cores. In this paper, we propose collective memory scheduling (CMS) that uses simple software and inexpensive hardware to identify collective transfers and guarantee that loads and stores arrive in memory address order to the memory controller. CMS actively takes charge of collective transfers and pushes or pulls data to or from the on-chip processors according to memory address order. CMS reduces application execution time by up to 55% (20% average) compared to a state-of-the-art architecture where each processor reads and writes its data independently. CMS also reduces DRAM read power by 2.2× and write power by 50%.
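The effect of request ordering described in the abstract can be illustrated with a toy model. The sketch below is not the paper's hardware or software. It assumes a row-major square array split into square tiles, one tile per core, a single DRAM bank with one open row, and a round-robin interleaving of the cores' request streams as a stand-in for independent access. All function names and parameters are hypothetical.

```python
def tile_addresses(n, tiles_per_side, tile_row, tile_col):
    """Row-major element addresses of one tile of an n x n array."""
    t = n // tiles_per_side  # tile edge length (assumes n divides evenly)
    return [
        (tile_row * t + r) * n + (tile_col * t + c)
        for r in range(t)
        for c in range(t)
    ]


def independent_order(n, tiles_per_side):
    """Assumed model of independent cores: their streams interleave round-robin."""
    streams = [
        tile_addresses(n, tiles_per_side, i, j)
        for i in range(tiles_per_side)
        for j in range(tiles_per_side)
    ]
    order = []
    for step in range(len(streams[0])):
        for stream in streams:
            order.append(stream[step])
    return order


def collective_order(n, tiles_per_side):
    """The same requests, issued in memory address order."""
    return sorted(independent_order(n, tiles_per_side))


def row_switches(addresses, dram_row_elems):
    """Count requests that must open a different DRAM row (single-bank toy model)."""
    switches = 0
    open_row = None
    for address in addresses:
        row = address // dram_row_elems
        if row != open_row:
            switches += 1
            open_row = row
    return switches


if __name__ == "__main__":
    n, tiles, row_elems = 8, 2, 16
    print("independent:", row_switches(independent_order(n, tiles), row_elems))
    print("collective: ", row_switches(collective_order(n, tiles), row_elems))
```

With an 8×8 array, 2×2 tiles, and 16 elements per DRAM row, the interleaved order opens a row 32 times while the address-ordered stream opens a row 4 times. These counts come from the toy model only and say nothing about the paper's measured results.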
doi:10.1145/2597652.2597654 dblp:conf/ics/MichelogiannakisWWS14 fatcat:vgcoibzaibes3cjlhe5xjxjmxy