Vladimir Dimić, Miquel Moretó, Marc Casas, Jan Ciesko, Mateo Valero
2020 Proceedings of the 34th ACM International Conference on Supercomputing  
Reductions constitute a frequent algorithmic pattern in high-performance and scientific computing. Sophisticated techniques are needed to ensure their correct and scalable concurrent execution on modern processors. Reductions on large arrays represent the most demanding case, where traditional approaches are not always applicable due to low performance scalability. To address these challenges, we propose RICH, a runtime-assisted solution that relies on architectural and parallel programming model extensions. RICH updates the reduction variable directly in the cache hierarchy with the help of added in-cache functional units. Our programming model extensions fit with the most relevant parallel programming solutions for shared memory environments, such as OpenMP. RICH does not modify the ISA, which allows the use of algorithms with reductions from pre-compiled external libraries. Experiments show that our solution achieves a speedup of 1.11× on average compared to state-of-the-art hardware-based approaches, while introducing 2.4% area and 3.8% power overhead.
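For readers unfamiliar with the pattern, the following is a minimal sketch of an array reduction written with standard OpenMP (array-section reductions, available since OpenMP 4.5). It illustrates the conventional software approach, in which each thread works on a private copy of the array, and is not the RICH mechanism or its programming model extensions. The function name `histogram` and its parameters are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts how often each bin index occurs in keys[].
 * Assumptions: hist has nbins entries and every keys[i] is < nbins. */
void histogram(const unsigned *keys, size_t n, long *hist, size_t nbins)
{
    assert(hist != NULL);

    for (size_t b = 0; b < nbins; ++b)
        hist[b] = 0;

    /* Array-section reduction (OpenMP 4.5+): each thread receives a
     * private, zero-initialized copy of hist[0:nbins]; the runtime
     * combines the copies into the original array when the loop ends.
     * Compiled without OpenMP, the pragma is ignored and the loop
     * runs serially with the same result. */
    #pragma omp parallel for reduction(+ : hist[:nbins])
    for (size_t i = 0; i < n; ++i)
        hist[keys[i]] += 1;
}
```

With this style of reduction, the number of private copies grows with the thread count, so the memory footprint and the final combine step both scale with the array size. This is one plausible reason large-array reductions are the demanding case the abstract describes.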
doi:10.1145/3392717.3392736 fatcat:57ph5q6wobbehmbigehddigppa