Lokendra Gour, Dr. Akhilesh A. Waoo
2020 International Journal of Engineering Applied Sciences and Technology  
Large-scale distributed systems comprise heterogeneous machines, workloads, and sub-systems dispersed across the cloud environment. These sub-systems frequently encounter faults and failures caused by differing data structures, hardware or software malfunctions, and communication delays. To speed up computation in such situations, a fault-tolerant infrastructure is implemented using a machine learning approach. Under machine learning, an artificial neural network (ANN) captures, manipulates, and updates the states and behaviors of the sub-systems on the servers and worker machines. Multiple layers of neurons (i.e., deep learning) can handle large-scale distributed systems with large datasets. Applying variants of the stochastic gradient descent algorithm on the sub-systems (also known as computational nodes) significantly enhances the efficiency and reliability of a distributed system. In high-performance computing (HPC) applications, fault tolerance mechanisms must be embedded to recover from system failures.
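The abstract does not say how the stochastic gradient descent variants cope with failed nodes, so the following is only a minimal sketch of one common idea: a server averages gradients from whichever workers respond in a given round and skips the update if none do. The model (a single weight fitting y = w * x), the data, the failure probability, and every function name here are hypothetical and are not taken from the paper.

```python
import random

def worker_gradient(w, shard):
    """Mean gradient of squared error for the model y = w * x on one data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, steps=200, lr=0.05, fail_prob=0.3, seed=0):
    """Synchronous SGD in which each worker may fail independently at every step."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # Collect gradients only from workers that did not fail this round.
        grads = [worker_gradient(w, s) for s in shards if rng.random() >= fail_prob]
        if not grads:
            continue  # every worker failed: skip the update and retry next step
        # Average over the surviving workers (an assumed recovery policy).
        w -= lr * sum(grads) / len(grads)
    return w

# Hypothetical noiseless data generated from y = 3x, split across four workers.
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 21)]
shards = [data[i::4] for i in range(4)]
w = train(shards)
```

In this toy setting every shard shares the same optimum, so dropping workers only changes how fast `w` approaches 3.0. With real, noisy, or unevenly distributed data, averaging over survivors can bias the update, which is one reason practical systems also rely on checkpointing or replication.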
doi:10.33564/ijeast.2020.v05i04.096 fatcat:yflf277oovadfncrfze47ko6zi