Speeding up distributed machine learning using codes

Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, Kannan Ramchandran
2016 IEEE International Symposium on Information Theory (ISIT)
Codes are widely used in many engineering applications to offer some form of reliability and fault tolerance. The high-level idea of coding is to exploit resource redundancy to deliver higher robustness against system noise. In distributed systems, there are several types of "noise" that can affect ML algorithms: straggler nodes, system failures, communication bottlenecks, etc. Moreover, redundancy is abundant: a plethora of nodes, a lot of spare storage, etc. However, there has been little interaction between Codes, Machine Learning, and Distributed Systems. In this work, we scratch the tip of the "Coding for Distributed ML" iceberg. We show how codes can be used to speed up two of the most basic building blocks of distributed ML algorithms: data shuffling and matrix multiplication. In data shuffling, we use codes to exploit the excess in storage and reduce communication bottlenecks. For matrix multiplication, we use codes to leverage the plethora of nodes and alleviate the effects of stragglers. We provide theoretical insights, along with synthetic and OpenMPI experiments on Amazon EC2, that highlight significant gains of coded solutions over uncoded ones.
doi:10.1109/isit.2016.7541478 · dblp:conf/isit/LeeLPPR16 · fatcat:zrzj7v7dxrh5novlu2qsh4gj2m
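
To illustrate the straggler-mitigation idea from the abstract, below is a minimal NumPy sketch of MDS-coded matrix-vector multiplication. It is not the paper's exact construction: the parameters (n = 5 workers, any k = 3 results suffice) and the Vandermonde generator matrix are illustrative assumptions. The point is that the master can recover A @ x from any k worker results, so up to n - k stragglers can simply be ignored.

```python
import numpy as np

# Illustrative parameters (hypothetical): n workers, any k results suffice.
n, k = 5, 3
rows, cols = 6, 4            # number of rows of A must be divisible by k here
rng = np.random.default_rng(0)
A = rng.standard_normal((rows, cols))
x = rng.standard_normal(cols)

# Split A into k row blocks and encode them into n coded blocks using an
# n x k Vandermonde generator; any k of its rows form an invertible matrix.
blocks = np.split(A, k)                                   # k blocks, (rows//k, cols) each
G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Each worker i computes coded[i] @ x; pretend only a subset finishes in time.
partial = {i: coded[i] @ x for i in range(n)}
done = [0, 2, 4]                                          # any k non-straggler workers

# Decode at the master: invert the k x k submatrix of G for the finished workers.
G_sub = G[done, :]                                        # (k, k), invertible
Y = np.stack([partial[i] for i in done])                  # (k, rows//k)
decoded = np.linalg.solve(G_sub, Y)                       # rows are blocks[j] @ x
result = decoded.reshape(-1)                              # concatenate block results

assert np.allclose(result, A @ x)                         # recovered despite stragglers
```

In an uncoded scheme, the master would have to wait for all n workers, so a single straggler delays the whole multiplication; with the (n, k) code above, the completion time is governed by the k-th fastest worker instead.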