Network-aware big data processing

Lukas Rupprecht, Peter Pietzuch
2017
The scale-out approach of modern data-parallel frameworks such as Apache Flink or Apache Spark has enabled them to deal with large amounts of data. These applications are often deployed in large-scale data centres with many resources. However, as deployments and data continue to grow, more network communication is incurred during a data processing query. At the same time, data centre networks (DCNs) are becoming increasingly complex in terms of the physical network topology, the variety of
applications that are sharing the network, and the different requirements of these applications on the network. The high complexity of DCNs combined with the increased traffic demands of applications has made the network a bottleneck for query performance. In this thesis, we explore ways of making data-parallel frameworks network-aware, i.e. we combine specific knowledge about the application and the physical network to reduce query completion times. We identify three main types of traffic that occur during query processing and add network-awareness to each of them to optimise network usage. 1) Traffic reduction for aggregatable traffic exploits the physical network topology and the associativity and commutativity of aggregation queries to reduce traffic as early as possible. In-network aggregation trees utilise existing networking hardware and the tree topology of DCNs to partially aggregate and thereby reduce data as it flows through the network. 2) Traffic balancing for non-aggregatable traffic monitors the network throughput of an application and uses knowledge about the query to optimise the overall network utilisation. By dynamically changing the destinations of parts of the transferred data, network hotspots, which can occur when many applications share the network, can be avoided. 3) Traffic elimination for storage traffic gives control over data placement to the application instead of the distributed storage system. This allows the application to optimise where data is stored across the cluster based on applicatio [...]
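The traffic-reduction idea in point 1 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the two-level rack layout, the word-count style data, and the names `merge`, `aggregate_flat`, `aggregate_tree` and `records_sent` are all assumptions made for illustration. It shows only that, because the combine step is associative and commutative, merging partial results per rack before they cross the core of the network gives the same answer while forwarding fewer records.

```python
from collections import Counter
from functools import reduce


def merge(a, b):
    """Associative, commutative combine step: add per-key counts."""
    return a + b


def aggregate_flat(worker_outputs):
    """Baseline: every worker sends its partial result straight to the root."""
    return reduce(merge, worker_outputs, Counter())


def aggregate_tree(racks):
    """Hypothetical two-level tree: merge within each rack, then at the root."""
    rack_partials = [reduce(merge, rack, Counter()) for rack in racks]
    return reduce(merge, rack_partials, Counter()), rack_partials


def records_sent(partials):
    """Number of key/value records forwarded towards the root."""
    return sum(len(p) for p in partials)


# Made-up partial counts from four workers spread over two racks.
racks = [
    [Counter(a=2, b=1), Counter(a=1, c=4)],
    [Counter(b=3, c=1), Counter(b=2, c=2)],
]
workers = [w for rack in racks for w in rack]

flat_result = aggregate_flat(workers)
tree_result, rack_partials = aggregate_tree(racks)

print(tree_result == flat_result)   # same final aggregate either way
print(records_sent(workers))        # records crossing the core without the tree
print(records_sent(rack_partials))  # records crossing the core with the tree
```

In this toy layout the per-rack merge stands in for an aggregation point inside the network; how much traffic is saved depends on how many keys the workers in a rack have in common.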
doi:10.25560/52455 fatcat:6skrdu4cvfexnacoqizvy5ub6q