78 Hits in 4.7 sec

Byzantine Fault-Tolerant Distributed Machine Learning Using Stochastic Gradient Descent (SGD) and Norm-Based Comparative Gradient Elimination (CGE) [article]

Nirupam Gupta, Shuo Liu, Nitin H. Vaidya
2021 arXiv   pre-print
We propose a norm-based gradient-filter, named comparative gradient elimination (CGE), that robustifies the D-SGD method against Byzantine agents.  ...  This paper considers the Byzantine fault-tolerance problem in distributed stochastic gradient descent (D-SGD) method - a popular algorithm for distributed multi-agent machine learning.  ...  Fault-tolerance in the distributed stochastic gradient descent (D-SGD) method: We propose a fault-tolerance mechanism that confers fault-tolerance to the D-SGD method - a standard distributed machine learning  ... 
arXiv:2008.04699v2 fatcat:v3bvhnb4vvffrp3t7dsawb3etu
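The CGE filter described in this entry is simple enough to sketch. A minimal illustration based on our reading of the abstract (not the authors' reference code): sort the n received gradients by Euclidean norm, discard the f largest, and average the remaining n - f.

```python
import numpy as np

def cge_aggregate(gradients, f):
    """Comparative Gradient Elimination (CGE): sort the received gradients
    by Euclidean norm, discard the f largest-norm ones (suspected
    Byzantine), and average the remaining n - f."""
    norms = [np.linalg.norm(g) for g in gradients]
    keep = np.argsort(norms)[: len(gradients) - f]  # indices of the n - f smallest norms
    return np.mean([gradients[i] for i in keep], axis=0)

# Toy usage: one large-norm Byzantine gradient among three honest ones.
honest = [np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([0.9, 1.1])]
byzantine = [np.array([100.0, -100.0])]
agg = cge_aggregate(honest + byzantine, f=1)
```

The rationale: inflating a gradient's norm is the cheapest way for an adversary to skew a plain average, so eliminating the largest-norm vectors bounds the aggregate's deviation from the honest mean.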

Byzantine Fault Tolerance in Distributed Machine Learning : a Survey [article]

Djamila Bouhata, Hamouma Moumen
2022 arXiv   pre-print
Byzantine Fault Tolerance (BFT) is among the most challenging problems in Distributed Machine Learning (DML).  ...  Mainly in first-order optimization methods, especially Stochastic Gradient Descent (SGD). We highlight the key techniques as well as fundamental approaches.  ...  in distributed machine learning A system is fault-tolerant if it can detect and eliminate fault-caused errors or crashes and automatically recover its execution.  ... 
arXiv:2205.02572v1 fatcat:h2hkcgz3w5cvrnro6whl2rpvby

A Survey on Fault-tolerance in Distributed Optimization and Machine Learning [article]

Shuo Liu
2021 arXiv   pre-print
This survey investigates the current state of fault-tolerance research in distributed optimization, and aims to provide an overview of the existing studies on both fault-tolerant distributed optimization  ...  The robustness of distributed optimization is an emerging field of study, motivated by various applications of distributed optimization including distributed machine learning, distributed sensing, and  ...  Norm-based methods Norm filtering (or comparative gradient elimination, CGE) and norm-cap filtering (or comparative gradient clipping, CGC) [43, 46, 49] are a group of filters looking at the norms of  ... 
arXiv:2106.08545v2 fatcat:g6fys4icrbbr5k3bd3ycylaptu
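The norm-cap variant (CGC) named in this entry clips rather than discards. A hedged sketch, assuming the cap is the (n-f)-th smallest norm as described in the cited line of work:

```python
import numpy as np

def cgc_aggregate(gradients, f):
    """Comparative Gradient Clipping (CGC), a norm-cap filter: gradients
    whose norm exceeds the (n-f)-th smallest norm are scaled down to that
    threshold instead of being discarded; all n vectors are then averaged."""
    norms = np.array([np.linalg.norm(g) for g in gradients])
    threshold = np.sort(norms)[len(gradients) - f - 1]  # (n-f)-th smallest norm
    clipped = [g if n <= threshold else g * (threshold / n)
               for g, n in zip(gradients, norms)]
    return np.mean(clipped, axis=0)
```

Compared with elimination, clipping keeps a contribution from every agent, which can matter when an honest gradient occasionally has a large norm.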

Approximate Byzantine Fault-Tolerance in Distributed Optimization [article]

Shuo Liu, Nirupam Gupta, Nitin H. Vaidya
2021 arXiv   pre-print
This paper considers the problem of Byzantine fault-tolerance in distributed multi-agent optimization.  ...  In the case when the agents' cost functions are differentiable, we obtain conditions for (f,ϵ)-resilience of the distributed gradient-descent method when equipped with robust gradient aggregation.  ...  Byzantine fault-tolerant distributed machine learning using stochastic gradient descent (sgd) and norm-based comparative gradient elimination (cge), 2021. [24] Gupta, N., and Vaidya, N. H.  ... 
arXiv:2101.09337v4 fatcat:jlhclmf2ljhzvlaf6almnqdyri

SignSGD: Fault-Tolerance to Blind and Byzantine Adversaries [article]

Jason Akoun, Sebastien Meyer
2022 arXiv   pre-print
We implemented SignSGD along with Byzantine strategies attempting to crush the learning process, and we provide empirical observations from our experiments to support our theory.  ...  Our code is available on GitHub and our experiments are reproducible using the provided parameters.  ...  Finally, Signum is much more fault-tolerant than SignSGD, as the algorithm achieves an accuracy similar to that of distributed SGD, even with a proportion of Byzantine adversaries close to  ... 
arXiv:2202.02085v2 fatcat:ddiyflbcwnfjzm4oljrlv5u77y

signSGD with Majority Vote is Communication Efficient And Fault Tolerant [article]

Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, Anima Anandkumar
2019 arXiv   pre-print
The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults.  ...  Benchmarking against the state of the art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in training time on AWS p3.2xlarge machines.  ...  Byzantine fault tolerant optimisation: the problem of modifying SGD to make it Byzantine fault tolerant has recently attracted interest in the literature (Yin et al., 2018).  ... 
arXiv:1810.05291v3 fatcat:4izrbblni5asdkla5bjxkujs4m
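The majority-vote scheme in this entry compresses each worker's gradient to one bit per coordinate. A minimal sketch of the aggregation step (the learning rate and toy values below are illustrative, not from the paper):

```python
import numpy as np

def majority_vote_step(worker_grads, lr=0.01):
    """signSGD with majority vote: each worker transmits only the
    elementwise sign of its stochastic gradient; the server takes the
    elementwise majority (the sign of the summed signs) and the model
    moves one step against that voted direction."""
    signs = np.sign(worker_grads)           # 1 bit per coordinate per worker
    vote = np.sign(np.sum(signs, axis=0))   # elementwise majority vote
    return -lr * vote                       # descent update to broadcast
```

The fault-tolerance intuition: a Byzantine worker controls exactly one vote per coordinate, so a minority of adversaries cannot flip the majority sign, no matter how large their gradient values are.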

Genuinely Distributed Byzantine Machine Learning [article]

El-Mahdi El-Mhamdi and Rachid Guerraoui and Arsany Guirguis and Lê Nguyên Hoang and Sébastien Rouault
2020 arXiv   pre-print
We initiate in this paper the study of the "general" Byzantine-resilient distributed machine learning problem where no individual component is trusted.  ...  We present a new algorithm, ByzSGD, which solves the general Byzantine-resilient distributed machine learning problem by relying on three major schemes.  ...  KEYWORDS: distributed machine learning, Byzantine fault tolerance, Byzantine parameter servers. The fundamental problem addressed here is induced by the multiplicity of servers and consists of bounding the  ... 
arXiv:1905.03853v2 fatcat:u6irl56wsregref72p74napnka

Fault Tolerance in Iterative-Convergent Machine Learning [article]

Aurick Qiao, Bryon Aragam, Bingjing Zhang, Eric P. Xing
2018 arXiv   pre-print
We show that SCAR can reduce the iteration cost of partial failures by 78% - 95% when compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms.  ...  Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature.  ...  Byzantine failures are one of the most general assumptions on failures, and thus a Byzantine fault-tolerant training system is naturally tolerant to many other types of faults and perturbations.  ... 
arXiv:1810.07354v1 fatcat:igj7rdwakncunbfbysitugqg6q
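SCAR is benchmarked against checkpoint-based recovery. For reference, a minimal sketch of that baseline (the function and its parameters are our own hypothetical illustration): snapshot the model periodically and roll back to the last snapshot on failure; the repeated work after a restore is exactly the iteration cost SCAR is reported to reduce.

```python
import copy

def train_with_checkpoints(init_params, steps, update, period=10, fail_at=None):
    """Checkpoint-based fault-tolerance baseline: snapshot every `period`
    steps; on a (simulated) failure, restore the last snapshot and redo
    the lost steps. `update(params, step)` is the training step."""
    params = copy.deepcopy(init_params)
    snapshot, snap_step = copy.deepcopy(params), 0
    step = 0
    while step < steps:
        if step == fail_at:  # simulated crash: current state is lost
            params, step = copy.deepcopy(snapshot), snap_step
            fail_at = None
        params = update(params, step)
        step += 1
        if step % period == 0:
            snapshot, snap_step = copy.deepcopy(params), step
    return params
```

Despite the rollback at step 15, training still completes all 20 effective steps, at the price of recomputing the 5 steps since the last checkpoint.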

SignGuard: Byzantine-robust Federated Learning through Collaborative Malicious Gradient Filtering [article]

Jian Xu, Shao-Lun Huang, Linqi Song, Tian Lan
2021 arXiv   pre-print
Gradient-based training in federated learning is known to be vulnerable to faulty/malicious worker nodes, which are often modeled as Byzantine clients.  ...  Based on our theoretical analysis of state-of-the-art attacks, we propose a novel approach, SignGuard, to enable Byzantine-robust federated learning through collaborative malicious gradient filtering.  ...  In [18] the authors show that even if the parameter server only collects the sign of the gradient, model training can still converge with small accuracy degradation and keep the training process fault-tolerant.  ... 
arXiv:2109.05872v1 fatcat:e5jeogz4jjhxvgmyywlm652c6m
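To make the phrase "malicious gradient filtering" concrete, here is a sketch of the general idea of filtering by sign statistics; this is not the paper's exact pipeline, only a minimal stand-in: clients whose fraction of positive gradient coordinates deviates far from the median fraction are excluded before averaging.

```python
import numpy as np

def sign_statistic_filter(client_grads, tol=0.3):
    """Sketch of sign-statistics filtering in the spirit of SignGuard
    (not the paper's exact algorithm): compute each client's fraction of
    positive gradient coordinates and keep only clients whose fraction
    lies within `tol` of the median fraction, then average survivors."""
    fracs = np.array([(g > 0).mean() for g in client_grads])
    median = np.median(fracs)
    keep = [g for g, p in zip(client_grads, fracs) if abs(p - median) <= tol]
    return np.mean(keep, axis=0)
```

The design point: sign statistics are cheap to compute and hard for an attacker to mimic while still pushing the average in a chosen direction.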

A Hitchhiker's Guide On Distributed Training of Deep Neural Networks [article]

Karanbir Chahal, Manraj Singh Grover, Kuntal Dey
2018 arXiv   pre-print
Training on a benchmark dataset like ImageNet on a single machine with a modern GPU can take up to a week; distributing training across multiple machines has been observed to drastically bring this time down.  ...  More specifically, we explore the synchronous and asynchronous variants of distributed Stochastic Gradient Descent, various All Reduce gradient aggregation strategies and best practices for obtaining higher  ...  Fault Tolerance: A fault-tolerance approach for training with Synchronous SGD has not, to the best of our knowledge, been addressed in the literature.  ... 
arXiv:1810.11787v1 fatcat:wy36x3sdwvhvfdrnc5tvzn7sty

DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation [article]

Shashank Rajput, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos
2020 arXiv   pre-print
To improve the resilience of distributed training to worst-case, or Byzantine, node failures, several recent approaches have replaced gradient averaging with robust aggregation methods.  ...  We provide extensive experiments over real distributed setups across a variety of large-scale machine learning tasks, showing that DETOX leads to orders-of-magnitude accuracy and speedup improvements over  ...  A stronger form of Byzantine resilience is desirable for most distributed machine learning applications.  ... 
arXiv:1907.12205v2 fatcat:zgtftuws2nb53kwlhb25yxoshm
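A simplified reading of DETOX's two-stage redundancy idea can be sketched as follows. In the actual framework, workers in a group compute identical gradients and the group is reduced by majority vote; the coordinate-wise median below is a hedged stand-in for that filter, followed by a second robust aggregation across groups.

```python
import numpy as np

def detox_style_aggregate(grads, group_size):
    """Simplified DETOX-style two-stage aggregation: workers are split
    into groups that each computed the same gradient; each group is
    reduced by a coordinate-wise median (standing in for DETOX's majority
    vote over identical replicas), and the per-group outputs are then
    combined by a second robust aggregator (median again)."""
    groups = [grads[i:i + group_size] for i in range(0, len(grads), group_size)]
    filtered = [np.median(g, axis=0) for g in groups]  # within-group filtering
    return np.median(np.array(filtered), axis=0)       # across-group aggregation
```

The redundancy buys speed as well as robustness: the cheap within-group filter removes most Byzantine gradients before the (more expensive) robust aggregator ever sees them.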

Secure Distributed Training at Scale [article]

Eduard Gorbunov, Alexander Borzunov, Michael Diskin, Max Ryabinin
2021 arXiv   pre-print
Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance.  ...  As a result, it can be infeasible to apply such algorithms to large-scale distributed deep learning, where models can have billions of parameters.  ...  Byzantine fault-tolerance in peer-to-peer distributed gradient-descent. arXiv preprint arXiv:2101.12316, 2021. Nirupam Gupta, Thinh T Doan, and Nitin H Vaidya.  ... 
arXiv:2106.11257v2 fatcat:whcd527c6bf2pknucgdise4ope

Byzantine Fault-Tolerance in Peer-to-Peer Distributed Gradient-Descent [article]

Nirupam Gupta, Nitin H. Vaidya
2021 arXiv   pre-print
We consider the problem of Byzantine fault-tolerance in the peer-to-peer (P2P) distributed gradient-descent method – a prominent algorithm for distributed optimization in a P2P system.  ...  We refer to this fault-tolerance goal as f-resilience where f is the maximum number of Byzantine faulty agents in a system of n agents, with f < n.  ...  named comparative gradient elimination (CGE) [17, 18].  ... 
arXiv:2101.12316v1 fatcat:erhlcelxcvbo3d5zclyrjtz77i

Byzantine Fault-Tolerance in Decentralized Optimization under Minimal Redundancy [article]

Nirupam Gupta, Thinh T. Doan, Nitin H. Vaidya
2020 arXiv   pre-print
We propose a decentralized optimization algorithm with provable exact fault-tolerance against a bounded number of Byzantine agents, provided the non-faulty agents have a minimal redundancy.  ...  This paper considers the problem of Byzantine fault-tolerance in multi-agent decentralized optimization. In this problem, each agent has a local cost function.  ...  Notable applications of decentralized optimization include swarm robotics [3] , multi-sensor networks [4] , and distributed machine learning [1] .  ... 
arXiv:2009.14763v1 fatcat:rzafb6l3e5falablitb5ila7cy

Byzantine Resilient Distributed Multi-Task Learning [article]

Jiani Li, Waseem Abbas, Xenofon Koutsoukos
2021 arXiv   pre-print
In this paper, we present an approach for Byzantine resilient distributed multi-task learning.  ...  distributed algorithms for learning relatedness among tasks are not resilient in the presence of Byzantine agents.  ...  Introduction Distributed machine learning models are gaining much attention recently as they improve the learning capabilities of agents distributed within a network with no central entity or server.  ... 
arXiv:2010.13032v2 fatcat:x5i3nlirvjcatfjohs66g2jwmy
Showing results 1 — 15 out of 78 results