44 Hits in 3.7 sec

Parle: parallelizing stochastic gradient descent [article]

Pratik Chaudhari, Carlo Baldassi, Riccardo Zecchina, Stefano Soatto, Ameet Talwalkar, Adam Oberman
2017 arXiv   pre-print
We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving significantly improved error rates  ...  Parle requires very infrequent communication with the parameter server and instead performs more computation on each client, which makes it well-suited to both single-machine, multi-GPU settings and distributed  ...  PARLE: A stochastic gradient descent step to minimize (5) amounts to combining Entropy-SGD in (6) and Elastic-SGD in (7).  ...
arXiv:1707.00424v2 fatcat:4a6xw6bgpfgnxm3mzylo2mpzwe
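[Illustrative sketch] The snippet above says Parle's SGD step combines Entropy-SGD and Elastic-SGD; equations (5)-(7) are not reproduced in this listing. As a rough, non-authoritative illustration of the elastic-coupling idea only, the following minimal NumPy sketch lets each replica take a gradient step on a toy loss plus a pull toward the replica average; the quadratic toy loss, the coupling strength rho, and the learning rate are assumptions, not the paper's implementation.

# Minimal sketch of an elastic-coupling SGD step (illustration, not Parle's code).
import numpy as np

rng = np.random.default_rng(0)

def grad_loss(x):
    # Toy quadratic loss gradient standing in for a network's minibatch gradient (assumption).
    return 2.0 * (x - 1.0)

n_replicas, dim = 4, 10
replicas = [rng.normal(size=dim) for _ in range(n_replicas)]
lr, rho = 0.1, 0.5  # learning rate and elastic coupling strength (illustrative values)

for step in range(100):
    center = np.mean(replicas, axis=0)  # shared reference, akin to a parameter-server average
    for i in range(n_replicas):
        g = grad_loss(replicas[i])  # local (stochastic) gradient
        replicas[i] = replicas[i] - lr * (g + rho * (replicas[i] - center))  # pull toward center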

GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent [article]

Jeff Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, Vinay Amatya
2018 arXiv   pre-print
In this paper, we present GossipGraD - a gossip-communication-protocol-based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems.  ...  in SGD to prevent over-fitting, 5) asynchronous communication of gradients for further reducing the communication cost of SGD and GossipGraD.  ...  An important type of gradient descent is batch/stochastic gradient descent (SGD), where a random subset of samples is used for iterative feed-forward computation (calculation of the predicted value) and back-propagation  ...
arXiv:1803.05880v1 fatcat:tun5qumqbvbjhay4q2dwbzdyxi
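[Illustrative sketch] GossipGraD, per the snippet, replaces global collectives with gossip-style exchanges between workers. Below is a minimal single-process simulation of that pattern, assuming random pairwise partner averaging after each local SGD step; the toy gradient, the pairing scheme, and the hyperparameters are assumptions and do not reproduce the authors' partner-rotation schedule.

# Toy simulation of gossip-averaged SGD on one process (illustration only).
import numpy as np

rng = np.random.default_rng(1)

def stochastic_grad(w):
    # Noisy gradient of a toy quadratic loss, standing in for a minibatch gradient (assumption).
    return 2.0 * (w - 3.0) + rng.normal(scale=0.1, size=w.shape)

n_workers, dim, lr = 8, 5, 0.05
weights = [rng.normal(size=dim) for _ in range(n_workers)]

for step in range(200):
    # Local SGD step on every worker.
    for i in range(n_workers):
        weights[i] = weights[i] - lr * stochastic_grad(weights[i])
    # Gossip: pair workers at random and average each pair's parameters.
    order = rng.permutation(n_workers)
    for a, b in zip(order[0::2], order[1::2]):
        avg = 0.5 * (weights[a] + weights[b])
        weights[a] = avg.copy()
        weights[b] = avg.copy()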

What does fault tolerant Deep Learning need from MPI? [article]

Vinay Amatya, Abhinav Vishnu, Charles Siegel, Jeff Daily
2017 arXiv   pre-print
We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for check-pointing of any critical data structures; and most importantly  ...  useful, but do not provide strict equivalence to the default stochastic gradient descent (SGD) algorithm.  ...  FT-SGD (Fault Tolerant Stochastic Gradient Descent): Yes; implements strong scaling by dividing the batch and using all-to-all reduction.  ...
arXiv:1709.03316v1 fatcat:2yxepfrxtfaa7hqc4wyxwsv6km
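[Illustrative sketch] The table fragment above mentions an FT-SGD variant that splits the batch across ranks and combines gradients with an all-to-all reduction. A bare-bones data-parallel gradient-averaging step using mpi4py is sketched below; the toy loss, the per-rank data shard, and the absence of any fault-tolerance logic are simplifying assumptions, so this is not the paper's FT-SGD.

# Minimal data-parallel SGD step: each rank computes a local gradient, then Allreduce sums them.
# Run with, e.g.: mpirun -n 4 python this_script.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)
w = np.zeros(10)                                    # every rank starts from the same parameters
local_data = rng.normal(loc=2.0, size=(100, 10))    # this rank's shard of the batch (toy data)

lr = 0.1
for step in range(50):
    local_grad = 2.0 * (w - local_data.mean(axis=0))     # gradient of a toy quadratic loss
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)  # sum gradients across all ranks
    w -= lr * global_grad / size                         # average and apply the update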

OpenGraphGym-MG: Using Reinforcement Learning to Solve Large Graph Optimization Problems on MultiGPU Systems [article]

Weijian Zheng, Dali Wang, Fengguang Song
2021 arXiv   pre-print
, graph-level and node-level batched processing, distributed sparse graph storage, efficient parallel RL training and inference algorithms, repeated gradient descent iterations, and adaptive multiple-node  ...  This study performs a comprehensive performance analysis on parallel efficiency and memory cost that proves the parallel RL training and inference algorithms are efficient and highly scalable on a number  ...  . 2020. https://github.com/PaddlePaddle/PARL.  ... 
arXiv:2105.08764v2 fatcat:r6ejpsl4srdizgh6vld5cffnum

Entropic gradient descent algorithms and wide flat minima [article]

Fabrizio Pittorino, Carlo Lucibello, Christoph Feinauer, Gabriele Perugini, Carlo Baldassi, Elizaveta Demyanenko, Riccardo Zecchina
2021 arXiv   pre-print
One area of ongoing research is the connection between the flatness of minima found by optimization algorithms like stochastic gradient descent (SGD) and the generalization performance of the network  ...  Replicated stochastic gradient descent (rSGD) replaces the local entropy objective by an objective involving several replicas of the model, each one moving in the potential induced by the loss while also  ... 
arXiv:2006.07897v4 fatcat:uunb56piljgkjism6uyyxcfsvy
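[Schematic formula] The replica-coupled objective alluded to in the snippet can be written schematically as follows; this generic form (y replicas attracted to a common reference) is reconstructed from the description rather than copied from the paper, so the exact coupling term and notation may differ.

\min_{w,\, w_1, \dots, w_y} \; \sum_{a=1}^{y} \Big[ L(w_a) + \frac{\gamma}{2}\, \lVert w_a - w \rVert^2 \Big]

Here L is the training loss and γ controls how strongly each replica w_a is pulled toward the reference w.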

Volume-of-Interest Aware Deep Neural Networks for Rapid Chest CT-Based COVID-19 Patient Risk Assessment

Anargyros Chatzitofis, Pierandrea Cancian, Vasileios Gkitsas, Alessandro Carlucci, Panagiotis Stalidis, Georgios Albanis, Antonis Karakottas, Theodoros Semertzidis, Petros Daras, Caterina Giannitto, Elena Casiraghi, Federica Mrakic Sposta (+12 others)
2021 International Journal of Environmental Research and Public Health  
We trained the model with Stochastic Gradient Descent (SGD) [51], an initial learning rate of 0.001, and a momentum of 0.7.  ...  Gradient Descent, with an initial learning rate of 0.001, and momentum 0.7.  ...
doi:10.3390/ijerph18062842 pmid:33799509 fatcat:3xj2mbylfjbfzkaqifjgjdxky4
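[Illustrative sketch] The training setup quoted above (SGD, initial learning rate 0.001, momentum 0.7) corresponds to a standard optimizer configuration. A minimal PyTorch sketch is below; the placeholder network, loss, and train_step helper are assumptions for illustration, and only the two hyperparameters come from the snippet (reference [51] is not reproduced here).

# Optimizer configuration matching the quoted hyperparameters (model and data are placeholders).
import torch
import torch.nn as nn

model = nn.Linear(512, 2)  # placeholder network (assumption)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.7)

def train_step(inputs, targets):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()  # SGD-with-momentum parameter update
    return loss.item()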

An Interactive Self-Learning Game and Evolutionary Approach Based on Non-Cooperative Equilibrium

Yan Li, Mengyu Zhao, Huazhi Zhang, Fuling Yang, Suyu Wang
2021 Electronics  
Sample a minibatch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_g(z) and update the generator by descending its stochastic gradient: min_G V(D, G) = E_{z∼p_z(z)}[log  ...  Therefore, the parameter θ_Q of the evaluation network is updated by the gradient descent method, as shown in equation (4): θ_Q ← θ_Q − η ∂L(θ_Q)/∂θ_Q, where η denotes the learning rate.  ...
doi:10.3390/electronics10232977 fatcat:5cueeunuerbw5fehkqxrn6zehu
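[Illustrative sketch] The reconstructed snippet above is the usual GAN generator update: sample noise from the prior, push it through the generator, and descend the stochastic gradient of the adversarial objective. A compact PyTorch sketch of one such step follows; the network shapes and the non-saturating log D(G(z)) objective are illustrative choices and may not match the objective truncated in the snippet.

# One generator update step of a GAN (illustrative shapes and objective).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784))  # generator (assumed shapes)
D = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1))   # discriminator (assumed shapes)
opt_G = torch.optim.SGD(G.parameters(), lr=0.01)

m = 32
z = torch.randn(m, 64)                      # minibatch of m noise samples from the prior p_g(z)
d_fake = torch.sigmoid(D(G(z)))             # discriminator scores on generated samples
loss_G = -torch.log(d_fake + 1e-8).mean()   # non-saturating generator loss (maximize log D(G(z)))
opt_G.zero_grad()
loss_G.backward()
opt_G.step()                                # descend the generator's stochastic gradient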

Applications of Reinforcement Learning in Deregulated Power Market: A Comprehensive Review [article]

Ziqing Zhu, Ze Hu, Ka Wing Chan, Siqi Bu, Bin Zhou, Shiwei Xia
2022 arXiv   pre-print
Among these methods, the steepest descent, also known as "policy gradient", is the most straightforward way to solve the aforementioned problem, by calculating the value of ( ) in the steepest-descent  ...  Stochastic Policy Gradient (SPG): Before elaborating on the detailed algorithms, it is necessary to develop the concept of the policy gradient in RL.  ...  For stochastic RL, the core idea is to incorporate the conditional value-at-risk (CVaR) into the Q-value as a penalty term:  ...
arXiv:2205.08369v1 fatcat:yqdarokpnzf4zitcilkgigq4vm
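[Illustrative sketch] As a concrete instance of the policy-gradient idea the snippet refers to, the REINFORCE estimator moves the policy parameters along ∇_θ log π_θ(a|s) weighted by the observed return. A minimal NumPy version for a softmax policy on a toy 3-armed bandit is below; the environment, rewards, and step size are assumptions, not drawn from the surveyed power-market papers.

# REINFORCE for a softmax policy on a toy 3-armed bandit (stochastic policy gradient).
import numpy as np

rng = np.random.default_rng(2)
true_rewards = np.array([1.0, 2.0, 3.0])  # toy environment (assumption)
theta = np.zeros(3)                       # policy parameters
lr = 0.05

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    reward = true_rewards[a] + rng.normal(scale=0.5)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                 # gradient of log softmax at the chosen action
    theta += lr * reward * grad_log_pi    # ascend the stochastic policy gradient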

Artificial Intelligence for Prosthetics - challenge solutions [article]

Łukasz Kidziński, Carmichael Ong, Sharada Prasanna Mohanty, Jennifer Hicks, Sean F. Carroll, Bo Zhou, Hongsheng Zeng, Fan Wang, Rongzhong Lian, Hao Tian, Wojciech Jaśkowski, Garrett Andersen, Odd Rune Lykkebø, Nihat Engin Toklu (+31 others)
2019 arXiv   pre-print
We used Stochastic Gradient Descent with Warm Restarts (SGDR [28]) to produce an ensemble of 10 networks, and then we chose the best combination of 4 networks by grid search.  ...  Methods / Policy Representation: A stochastic policy was used.  ...
arXiv:1902.02441v1 fatcat:hf7xzitrhjdqfb5cfaneovlfa4
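[Illustrative sketch] SGDR, as used above to build an ensemble, restarts a cosine learning-rate schedule periodically, and one model snapshot can be kept per restart. A minimal PyTorch sketch with the built-in CosineAnnealingWarmRestarts scheduler is below; the placeholder model, the restart period, and the snapshot bookkeeping are assumptions, and the paper's grid search over 4-network combinations is not reproduced.

# SGDR-style cosine annealing with warm restarts, snapshotting a model at each restart.
import copy
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

restart_epochs = {10, 30, 70}  # with T_0=10 and T_mult=2, restarts fall at these epochs
snapshots = []
for epoch in range(1, 71):
    # ... one full training epoch over the data would go here ...
    scheduler.step()           # advance the cosine schedule
    if epoch in restart_epochs:
        snapshots.append(copy.deepcopy(model.state_dict()))  # keep a snapshot for the ensemble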

Detecting Cross-Lingual Plagiarism Using Simulated Word Embeddings [article]

Victor Thompson
2018 arXiv   pre-print
Unlike most existing models, the proposed model does not require parallel corpora, and accommodates multiple languages (multi-lingual).  ...  When contexts are created, the final stage is to train a feed-forward neural network, using the back propagation algorithm with stochastic gradient descent to learn the word distributions in the embeddings  ...  were created using the multi-lingual Europarl-corpus; from a non-English source document, a text passage is removed and used to retrieve its corresponding English version from the multi-lingual Euro-Parl  ... 
arXiv:1712.10190v2 fatcat:fd3ukosck5fv7fqs3o5why7vpq
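[Illustrative sketch] The snippet describes training a feed-forward network with back-propagation and SGD to learn word embeddings. As a stand-in (explicitly not the paper's own model), a comparable skip-gram embedding can be trained with gensim's Word2Vec; the toy corpus and hyperparameters below are assumptions.

# Stand-in for SGD/backprop embedding training: gensim's Word2Vec in skip-gram mode.
from gensim.models import Word2Vec

sentences = [["cross", "lingual", "plagiarism"], ["word", "embeddings", "plagiarism"]]  # toy corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)
vector = model.wv["plagiarism"]                        # learned embedding for one word
similar = model.wv.most_similar("plagiarism", topn=2)  # nearest neighbours in the embedding space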

Algorithmes génétiques appliqués à la gestion du trafic aérien

N. Durand, J.-B. Gotteland
2003 Journal sur l'enseignement des sciences et technologies de l'information et des systèmes  
In the shorter term, one often speaks of pre-regulation: it consists of organizing a day's traffic one or two days in advance.  ...  When the aircraft is less than 50 nautical miles from its top of descent, it can begin its descent early at t_0 and level off at t_1 to rejoin its descent trajectory.  ...
doi:10.1051/bib-j3ea:2003506 fatcat:o3hmdkluyfdkvk6io53c4au57u

H2 optimal and frequency limited approximation methods for large-scale LTI dynamical systems

Pierre Vuillemin, Charles Poussot-Vassal, Daniel Alazard
2013 IFAC Proceedings Volumes  
In the case of Linear Time-Invariant (LTI) dynamical models, complexity translates into a large dimension of the state vector, and one then speaks of large-scale models.  ...  The first-order optimality conditions of the optimal H_{2,Ω} approximation problem are derived and used to build a complex-domain descent algorithm aimed at finding a local minimum of the problem.  ...  Conclusion: The pole-residue formulation of the H_{2,Ω} norm and its gradient were used to create a descent algorithm in the complex domain: DARPO.  ...
doi:10.3182/20130204-3-fr-2033.00061 fatcat:ijxrjgml45ephgvq53v37pxiyi

Machine learning at the service of meta-heuristics for solving combinatorial optimization problems: A state-of-the-art

Maryam Karimi-Mamaghan, Mehrdad Mohammadi, Patrick Meyer, Amir Mohammad Karimi-Mamaghan, El-Ghazali Talbi
2021 European Journal of Operational Research  
[Flattened table excerpt: a survey classification with columns such as "Coop. level", "Parl./Seq.", "Learning", and "ML tech.", listing entries like Meignan et al. and Cadenas, Garrido, and Muñoz (2009) under algorithm-level parallel (Alg. Parl.) integration with learning techniques such as Apriori, RL, and DT applied to metaheuristics (TS, EA) for VRP and FLP.]  ...
doi:10.1016/j.ejor.2021.04.032 fatcat:bdbbv2o4kff5hemwafi7l5p3ga

Interactions sociales et dispersion dans des populations structurées dans l'espace

Jean-François Le Galliard, Jean Clobert, Régis Ferrière
2003 Zenodo  
Deterministic dynamics (dashed curve) paralleled the stochastic simulations (continuous curve, mean of 1500 runs).  ...  Thus, identity by state rather than by descent may also influence dispersal behaviour.  ... 
doi:10.5281/zenodo.3529133 fatcat:7lspn2ixb5bzjn6q6wofungjhq

Out of equilibrium Statistical Physics of learning [article]

LUCA SAGLIETTI
2018
), but allows one to avoid the clamping of the magnetizations. • stochastic gradient descent (SGD), where the gradient is evaluated only over random mini-batches of the training set, injecting noise  ...  Notwithstanding all these developments, (stochastic) gradient descent has remained the main ingredient for learning (even though it might be "cooked" in different ways) [62] .  ...  In the simulations, we chose to employ the natural gradient (with (1 − m_i²)∂_{m_i} instead of ∂_{m_i}), with a learning rate equal to 1; the loss function was set to be that of equation 7.84, with ρ_k  ...
doi:10.6092/polito/porto/2710532 fatcat:s3keuq5jrzb4bnoaatpzuij6wq
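[Illustrative sketch] The reconstructed expression above rescales each partial derivative by (1 − m_i²) before the update, with a learning rate of 1. Written as a single step on toy data, that reads as below; the placeholder gradient stands in for the thesis's loss (equation 7.84 is not reproduced), so only the rescaling and the learning rate come from the snippet.

# "Natural gradient" step from the snippet: rescale each partial derivative by (1 - m_i^2).
import numpy as np

rng = np.random.default_rng(3)
m = np.clip(rng.normal(scale=0.3, size=20), -0.99, 0.99)  # toy magnetizations in (-1, 1)

def grad_loss(m):
    # Placeholder gradient standing in for dL/dm_i of the thesis's loss (assumption).
    return m - 0.5

eta = 1.0  # learning rate equal to 1, as quoted in the snippet
for step in range(100):
    m = m - eta * (1.0 - m**2) * grad_loss(m)
    m = np.clip(m, -0.999, 0.999)  # keep the magnetizations strictly inside (-1, 1)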
Showing results 1 — 15 out of 44 results