A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2015; you can also visit the original URL.
Future performance improvements for microprocessors have shifted from clock frequency scaling towards increases in on-chip parallelism. Performance improvements for a wide variety of parallel applications require domain decomposition of data arrays from a contiguous arrangement in memory to a tiled layout for on-chip L1 caches and scratchpads. However, DRAM performance suffers under the non-streaming access patterns generated by many independent cores. In this paper, we propose collective memory
doi:10.1145/2597652.2597654 dblp:conf/ics/MichelogiannakisWWS14 fatcat:vgcoibzaibes3cjlhe5xjxjmxy