Performance and monetary cost optimizations for HPC applications in the cloud [thesis]

Yifan Gong
With cloud computing, high performance computing (HPC) tasks can run on many virtual machines instead of a physical cluster, so users can move HPC tasks from private clusters to public clouds. The virtualized environment calls for a careful examination of existing HPC software stacks for efficiency. Because of the pay-as-you-go pricing model, performance and monetary cost optimizations matter not only for improving productivity but also for reducing the total ownership cost. Our research focuses on both for network-intensive applications such as scientific computing and big data. For performance optimizations, we first evaluate the performance of Message Passing Interface (MPI) collective operations, including broadcast, reduce, gather and scatter, in the cloud. For existing MPI implementations, network topology information is very important for performance optimization. In the cloud, however, virtualization not only hides the network topology from users but also introduces traffic interference and makes network performance dynamic. In this setting, topology-aware optimizations are not useful. We therefore develop novel network performance aware algorithms for MPI collective communication operations. We further implement two common applications, namely conjugate gradient (CG) and N-body. We have conducted our experiments with two complementary methods (on Amazon EC2 and in simulations). Our evaluation results demonstrate that network performance awareness yields 25.4% and 28.3% performance improvement over MPICH2 on Amazon EC2 and in simulations, respectively. Evaluations on N-body and CG show application performance improvements of 41.6% and 14.3%, respectively. Furthermore, we find that network interference is the norm in cloud networking and has a large performance impact on network-intensive applications that are deployed and run in virtual clusters. As noted above, many existing network optimizations that rely on topology information or on the stable network performance of physical clusters are no longer effective for virtual clusters. We therefore propose a novel network performance model that decouples the constant and volatility components from the dynamic network performance of virtual clusters. The notions of constant and volatility represent the long-term and transient network performance factors of the virtual cluster, respectively. We develop i
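The constant/volatility idea can be illustrated with a minimal Python sketch. The abstract does not say how the two components are computed, so everything here is an assumption made for illustration: the median of repeated bandwidth measurements stands in for the long-term "constant" component, the per-sample deviation from it stands in for the transient "volatility" component, and the function names `decompose` and `volatility_ratio` are hypothetical rather than taken from the thesis.

```python
from statistics import median


def decompose(samples):
    """Split bandwidth samples (e.g. MB/s) for one VM pair into two parts.

    Assumption for illustration only: the median is used as the long-term
    "constant" component, and each sample's deviation from the median is
    treated as the transient "volatility" component.
    """
    constant = median(samples)
    volatility = [s - constant for s in samples]
    return constant, volatility


def volatility_ratio(samples):
    """Mean absolute deviation relative to the constant component."""
    constant, volatility = decompose(samples)
    mean_abs_dev = sum(abs(v) for v in volatility) / len(volatility)
    return mean_abs_dev / constant


# Hypothetical measurements between two virtual machines, with one dip
# caused by interference from other tenants.
samples = [100.0, 98.0, 102.0, 60.0, 101.0]
constant, volatility = decompose(samples)
```

In this sketch a single interference dip barely moves the constant component, while it dominates the volatility values. That separation is what would let an optimizer plan on the stable part and treat the transient part as noise.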
doi:10.32657/10356/69201 fatcat:7membcqkbzcb7l4jecttqpdfru