Distributed Training of Deep Neural Networks with Spark: The MareNostrum Experience

Leonel Cruz, Ruben Tous, Beatriz Otero
2019 Pattern Recognition Letters  
Deploying a distributed deep learning technology stack on a large parallel system is a complex process that involves integrating and configuring several layers of both general-purpose and custom software. The details of such deployments are rarely described in the literature. This paper presents the experience gained during the deployment of a technology stack to enable deep learning workloads on MareNostrum, a petascale supercomputer. The components of a layered architecture, based on Apache Spark, are described, and the performance and scalability of the resulting system are evaluated. This is followed by a discussion of the impact of different configurations, including parallelism, storage, and networking alternatives, and of other aspects related to the execution of deep learning workloads on a traditional HPC setup. The conclusions should be useful for guiding similarly complex deployments in the future.
doi:10.1016/j.patrec.2019.01.020 fatcat:47iumueflfdvbdgnml6wux3qyi