Improving Operational Intensity in Data Bound Markov Chain Monte Carlo

Balazs Nemeth, Tom Haber, Thomas J. Ashby, Wim Lamotte
2017 Procedia Computer Science  
Typically, parallel algorithms are developed to leverage the processing power of multiple processors simultaneously speeding up overall execution. At the same time, discrepancy between DRAM bandwidth and microprocessor speed hinders reaching peak performance. This paper explores how operational intensity improves by performing useful computation during otherwise stalled cycles. While the proposed methodology is applicable to a wide variety of parallel algorithms, and at different scales, the concepts are demonstrated in the machine learning context. Performance improvements are shown for Bayesian logistic regression with a Markov chain Monte Carlo sampler, either with multiple chains or with multiple proposals, on a dense data set two orders of magnitude larger than the last level cache on contemporary systems.
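The idea in the abstract, doing more useful work per byte fetched from DRAM, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`loglik_one_pass_per_chain`, `loglik_shared_pass`, `operational_intensity`) and the cost model (dot-product FLOPs only, 8-byte values, no cache reuse of data between passes) are assumptions made for illustration. The sketch contrasts evaluating the logistic-regression log-likelihood once per chain, which streams the data set once for every chain, with a fused loop that reads each row once and reuses it for every chain or proposal.

```python
import math
import random


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def loglik_one_pass_per_chain(X, y, thetas):
    """Baseline: every chain streams the whole data set on its own."""
    out = []
    for theta in thetas:
        total = 0.0
        for row, label in zip(X, y):
            z = sum(w * x for w, x in zip(theta, row))
            # Labels are assumed to be 0 or 1.
            total += log_sigmoid(z if label == 1 else -z)
        out.append(total)
    return out


def loglik_shared_pass(X, y, thetas):
    """Fused: each row is read once and reused for every chain/proposal."""
    totals = [0.0] * len(thetas)
    for row, label in zip(X, y):  # single pass over the data
        for k, theta in enumerate(thetas):
            z = sum(w * x for w, x in zip(theta, row))
            totals[k] += log_sigmoid(z if label == 1 else -z)
    return totals


def operational_intensity(n_rows, n_features, n_chains, shared, bytes_per_value=8):
    """Rough FLOPs per byte under the assumed cost model (hypothetical helper)."""
    flops = 2 * n_rows * n_features * n_chains  # one multiply-add per feature
    passes = 1 if shared else n_chains          # how often the data is streamed
    bytes_moved = n_rows * n_features * bytes_per_value * passes
    return flops / bytes_moved


if __name__ == "__main__":
    random.seed(0)
    X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(100)]
    y = [random.randint(0, 1) for _ in range(100)]
    thetas = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
    print(loglik_one_pass_per_chain(X, y, thetas))
    print(loglik_shared_pass(X, y, thetas))
    print(operational_intensity(100, 3, 4, shared=False))
    print(operational_intensity(100, 3, 4, shared=True))
```

Both functions compute the same log-likelihoods. Under the assumed cost model, the fused version moves the data once instead of once per chain, so its estimated operational intensity grows in proportion to the number of chains or proposals. Pure Python is used only to show the loop structure. Any real performance gain would depend on a compiled, cache-aware implementation.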
doi:10.1016/j.procs.2017.05.024 fatcat:ihgigk7d3bc6rbrvern6qo5hqu