FAULT TOLERATING MECHANISM IN DISTRIBUTED COMPUTING ENVIRONMENT

Lokendra Gour, Dr. Akhilesh A. Waoo
2020 International Journal of Engineering Applied Sciences and Technology  
Large-scale distributed systems encompass heterogeneous computational machines, workloads, and sub-systems dispersed across the cloud environment. These sub-systems frequently encounter faults and failures due to differing data structures, hardware/software malfunctions, and communication delays. To speed up computation in such situations, a fault-tolerating infrastructure is implemented by adopting a machine learning approach. Under machine learning, an artificial neural network (ANN)
... captures, manipulates, and updates the states and behaviors of the sub-systems on the server and worker machines. Multiple layers of neurons (i.e., deep learning) can handle large-scale distributed systems with large datasets. By adopting variants of the stochastic gradient descent algorithm on the sub-systems (also known as computational nodes), the efficiency and reliability of a distributed system are enhanced significantly. In high-performance computing (HPC) applications, fault tolerance mechanisms must be embedded to recover from system failures.
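As an illustration of the idea of running gradient descent across computational nodes while tolerating node failures, the following is a minimal sketch and not the authors' method. It assumes a simulated parameter server that averages gradients from whichever workers survive a step; the names `worker_gradient`, `train`, and `fail_prob`, the linear model, and the synthetic data are all hypothetical choices made for this example.

```python
import random


class WorkerFailure(Exception):
    """Raised when a simulated worker node crashes during a step."""


def worker_gradient(w, b, shard, rng, fail_prob):
    # Hypothetical worker: may crash; otherwise returns the mean-squared-error
    # gradient of the model y = w*x + b on its local data shard.
    if rng.random() < fail_prob:
        raise WorkerFailure()
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return gw, gb


def train(shards, steps=2000, lr=0.05, fail_prob=0.3, seed=0):
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    w, b = 0.0, 0.0
    skipped = 0
    for _ in range(steps):
        grads = []
        for shard in shards:
            try:
                grads.append(worker_gradient(w, b, shard, rng, fail_prob))
            except WorkerFailure:
                continue  # tolerate the fault: drop this worker's contribution
        if not grads:
            skipped += 1  # every worker failed; keep the last good parameters
            continue
        # Average the gradients of the surviving workers and update.
        w -= lr * sum(g[0] for g in grads) / len(grads)
        b -= lr * sum(g[1] for g in grads) / len(grads)
    return w, b, skipped


# Synthetic, noise-free data for y = 3x + 1, split across four simulated nodes.
data = [(x / 10.0, 3 * (x / 10.0) + 1) for x in range(-20, 21)]
shards = [data[i::4] for i in range(4)]
w, b, skipped = train(shards)
print(f"w={w:.4f} b={b:.4f} skipped_steps={skipped}")
```

Because the synthetic data is noise-free, every shard shares the same optimum, so training still converges when a random subset of workers drops out at each step. With real, noisy data, dropping workers would bias individual updates, and a production system would typically add checkpointing or worker restart on top of this.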
doi:10.33564/ijeast.2020.v05i04.096 fatcat:yflf277oovadfncrfze47ko6zi