IMPROVING FAULT TOLERANT RESOURCE OPTIMIZED AWARE JOB SCHEDULING FOR GRID COMPUTING
Journal of Computer Science
Workflow brokers in existing Grid scheduling systems lack a cooperation mechanism, which leads to inefficient scheduling of applications across distributed resources and worsens the utilization of resources such as network bandwidth and computational cycles. Furthermore, according to the literature, these existing brokering systems have primarily evolved around centralized hierarchical or client/server models. In such models, vital responsibilities such as resource discovery are delegated to centralized server machines, so they suffer from well-known disadvantages: a single point of failure, limited scalability and network congestion on the links leading to the server. To overcome these issues, we implement a new approach for decentralized cooperative workflow scheduling in a dynamically distributed resource-sharing Grid environment. The various actors in the system, namely users who belong to multiple control domains, workflow brokers and resources, work together to form a single cooperative resource-sharing environment. However, this approach ignores the fact that each grid site may have its own fault-tolerance strategy, because each site is itself an autonomous domain. For instance, if a grid site uses a job checkpointing mechanism, each computational node must be able to periodically transmit the transient state of the job it is executing to the server. When a job fails, it migrates to another computational node and resumes from the last stored checkpoint. A Glowworm Swarm Optimization (GSO) algorithm for job scheduling is used to address the heterogeneity of fault-tolerance mechanisms in the computational grid; a Weighted GSO, which overcomes the position-update imperfections of general GSO, is shown to be more efficient in the comparison analysis. The system supports four kinds of fault-tolerance mechanisms (job migration, job retry, checkpointing and job replication) and also considers the risky nature of the Grid computing environment. The risk relationship between jobs and nodes is defined by the security demand and the trust level. Our simulation-based evaluation shows that our algorithm achieves a shorter makespan and is more efficient.
We also analyze the efficiency of the proposed approach against a centralized coordinated workflow scheduling technique and show that our approach is more efficient than the centralized technique with respect to achieving highly coordinated schedules.
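The checkpoint-and-migrate behaviour mentioned in the abstract can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the scheduler evaluated in the paper: the names `CheckpointServer`, `run_on_node` and `run_with_migration`, the unit-of-work progress counter and the `fail_at` failure model are all hypothetical and chosen for illustration. A node saves its progress to the server at a fixed interval, and when it fails the job moves to the next node and resumes from the last stored checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointServer:
    """Hypothetical in-memory store for the last saved progress of each job."""
    saved: dict = field(default_factory=dict)

    def store(self, job_id: str, progress: int) -> None:
        self.saved[job_id] = progress

    def last(self, job_id: str) -> int:
        # A job with no checkpoint restarts from the beginning.
        return self.saved.get(job_id, 0)


def run_on_node(job_id, total_units, server, interval, fail_at=None):
    """Run a job from its last checkpoint; return True if it finished.

    fail_at is the progress value at which this node fails (assumed
    failure model; None means the node does not fail).
    """
    progress = server.last(job_id)
    while progress < total_units:
        if fail_at is not None and progress == fail_at:
            # Node failed: work done since the last checkpoint is lost.
            return False
        progress += 1
        if progress % interval == 0:
            # Periodic transmission of the job state to the server.
            server.store(job_id, progress)
    server.store(job_id, progress)
    return True


def run_with_migration(job_id, total_units, interval, failures):
    """Try nodes in order; failures[i] is the fail_at value for node i.

    Returns (index of the node that finished the job or None,
    list of progress values each node resumed from).
    """
    server = CheckpointServer()
    resume_points = []
    for node_index, fail_at in enumerate(failures):
        resume_points.append(server.last(job_id))
        if run_on_node(job_id, total_units, server, interval, fail_at):
            return node_index, resume_points
    return None, resume_points


if __name__ == "__main__":
    # Node 0 fails at progress 7; checkpoints were taken at 3 and 6,
    # so node 1 resumes from 6 rather than from 0.
    print(run_with_migration("j1", total_units=10, interval=3, failures=[7, None]))
```

In this sketch a shorter checkpoint interval reduces the work lost on failure at the cost of more frequent state transfers to the server, which is the trade-off a site-level fault-tolerance strategy would have to weigh.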