SMARTH: Enabling Multi-pipeline Data Transfer in HDFS

Hong Zhang, Liqiang Wang, Hai Huang
2014 43rd International Conference on Parallel Processing
Hadoop is a popular open-source implementation of the MapReduce programming model for handling large data sets, and HDFS is one of Hadoop's most commonly used distributed file systems. Surprisingly, we found that HDFS is inefficient when handling uploads of data files from a client's local file system, especially when the storage cluster is configured to use replicas. The root cause is HDFS's synchronous pipeline design. In this paper, we introduce an improved HDFS design called SMARTH. It utilizes asynchronous multi-pipeline data transfers instead of a single-pipeline stop-and-wait mechanism. SMARTH records the actual transfer speed of data blocks and sends this information to the namenode along with periodic heartbeat messages. The namenode sorts datanodes according to their past performance and tracks this information continuously. When a client initiates an upload request, the namenode sends it a list of "high performance" datanodes that it thinks will yield the highest throughput for the client. By choosing higher-performance datanodes relative to each client and by taking advantage of the multi-pipeline design, our experiments show that SMARTH significantly improves the performance of data write operations compared to HDFS. Specifically, SMARTH is able to improve the throughput of data transfer by 27-245% in a heterogeneous virtual cluster on Amazon EC2.
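To make the datanode-ranking idea concrete, below is a minimal Java sketch of how a namenode-side component might aggregate per-datanode transfer speeds reported via heartbeats and pick the fastest nodes for a new upload pipeline. The class and method names (DatanodeRanker, reportSpeed, pickPipeline) and the moving-average scheme are assumptions for illustration only; they are not taken from the SMARTH or HDFS code bases.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: tracks per-datanode transfer speeds reported in
// heartbeat messages and returns the fastest datanodes for a new pipeline.
public class DatanodeRanker {

    // Exponentially weighted moving average of observed transfer speed (MB/s),
    // keyed by datanode identifier.
    private final Map<String, Double> avgSpeed = new HashMap<>();
    private static final double ALPHA = 0.3; // weight given to the newest sample

    // Called when a heartbeat carries the latest block-transfer speed.
    public synchronized void reportSpeed(String datanodeId, double speedMBps) {
        avgSpeed.merge(datanodeId, speedMBps,
                (old, fresh) -> (1 - ALPHA) * old + ALPHA * fresh);
    }

    // Returns up to `replicas` datanodes sorted by descending average speed,
    // i.e. the "high performance" nodes handed to an uploading client.
    public synchronized List<String> pickPipeline(int replicas) {
        List<String> ranked = new ArrayList<>(avgSpeed.keySet());
        ranked.sort(Comparator.comparingDouble(
                (String id) -> avgSpeed.get(id)).reversed());
        return ranked.subList(0, Math.min(replicas, ranked.size()));
    }
}
```

In this sketch, each heartbeat refreshes a running estimate of a datanode's throughput, and an upload request simply takes the top of the sorted list; the paper's actual policy for weighting history and selecting replica targets may differ.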
doi:10.1109/icpp.2014.12 dblp:conf/icpp/ZhangWH14 fatcat:vx2g4cvncbc7tarolsd7vy4iv4