Lazy release consistency for GPUs

Johnathan Alsop, Marc S. Orr, Bradford M. Beckmann, David A. Wood
2016, 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
The heterogeneous-race-free (HRF) memory model has been embraced by the Heterogeneous System Architecture (HSA) Foundation and OpenCL™ because it clearly and precisely defines the behavior of current GPUs. However, compared to the simpler SC for DRF memory model, HRF has two shortcomings. The first is that HRF requires programmers to label atomic memory operations with the correct scope of synchronization. This explicit labeling can save significant coherence overhead when synchronization is local, but it is tedious and error-prone. The second shortcoming is that HRF restricts important dynamic data sharing patterns like work stealing. Prior work on remote scope promotion (RSP) attempted to resolve the second shortcoming. However, RSP further complicates the memory model, and no scalable implementation of RSP has been proposed. For example, we found that the previously proposed RSP implementation actually results in slowdowns of up to 30% on large GPUs, compared to a naïve baseline system that forgoes work stealing and scopes. Meanwhile, DeNovo has been shown to offer efficient synchronization with an SC for DRF memory model, performing on average 21% better than our baseline system, but it introduces additional coherence traffic to maintain ownership of all modified data. To resolve these deficiencies, we propose to adapt lazy release consistency, previously only proposed for homogeneous CPU systems, to a heterogeneous system. Our approach, called hLRC, uses a DeNovo-like mechanism to track ownership of synchronization variables, lazily performing coherence actions only when a synchronization variable changes locations. hLRC allows GPU programmers to use the simpler SC for DRF memory model without tracking ownership for all modified data. Our evaluation shows that lazy release consistency provides robust performance improvement across a set of graph analysis applications: 29% on average versus the baseline system.

Inv all L1s bcast: A broadcast request to all remote L1 caches to invalidate their valid data.
Lock all RMWs: A broadcast request to all remote L1 caches to block RMWs.
Unlock all RMWs: A broadcast request to all remote L1 caches to unblock RMWs.
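The lazy idea summarized in the abstract (coherence actions happen only when a synchronization variable changes locations) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the hLRC protocol from the paper: the class names `ToyL1` and `ToySyncTracker`, the per-variable owner table, and the rule "previous owner flushes, new owner invalidates" are all hypothetical simplifications chosen for illustration.

```python
class ToyL1:
    """Hypothetical L1 cache: tracks only which addresses are dirty or valid."""

    def __init__(self, name):
        self.name = name
        self.dirty = set()
        self.valid = set()

    def write(self, addr):
        self.dirty.add(addr)
        self.valid.add(addr)

    def flush(self):
        # Write back all dirty data (modeled as simply clearing the dirty set).
        self.dirty.clear()

    def invalidate(self):
        # Drop clean copies that may be stale; keep this cache's own dirty lines.
        self.valid = set(self.dirty)


class ToySyncTracker:
    """Hypothetical tracker: records which L1 last acquired each sync variable."""

    def __init__(self):
        self.owner = {}          # sync variable -> owning ToyL1
        self.flushes = 0
        self.invalidations = 0

    def acquire(self, cache, sync_var):
        """Return True if the acquire triggered coherence actions."""
        prev = self.owner.get(sync_var)
        if prev is cache:
            return False         # variable did not move: no coherence action
        if prev is not None:
            prev.flush()         # previous owner makes its writes visible
            self.flushes += 1
        cache.invalidate()       # new owner discards possibly stale data
        self.invalidations += 1
        self.owner[sync_var] = cache
        return True

    def release(self, cache, sync_var):
        # Lazy (assumption of this sketch): nothing is flushed at release time.
        pass


def demo():
    cu0, cu1 = ToyL1("cu0"), ToyL1("cu1")
    tracker = ToySyncTracker()
    tracker.acquire(cu0, "lock")      # first acquire: cu0 becomes owner
    cu0.write(0x10)
    tracker.release(cu0, "lock")
    tracker.acquire(cu0, "lock")      # same owner again: no action
    tracker.release(cu0, "lock")
    tracker.acquire(cu1, "lock")      # variable moves: cu0 flushes, cu1 invalidates
    return tracker.flushes, tracker.invalidations, len(cu0.dirty)
```

In this sketch, repeated synchronization by the same cache costs nothing, and the flush of `cu0` is deferred until `cu1` actually acquires the variable. A real protocol would also need to handle races between concurrent acquirers, which this toy model ignores.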
doi:10.1109/micro.2016.7783729 dblp:conf/micro/AlsopOBW16 fatcat:p5u2mv5gyzbcfpu7gjoxop4rse