10 Hits in 7.7 sec

A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation [article]

Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, Richard Socher
2018 arXiv   pre-print
In particular, we explore knowledge distillation and learning rate heuristics of (cosine) restarts and warmup using mode connectivity and CCA.  ...  The convergence rate and final performance of common deep learning models have significantly benefited from heuristics such as learning rate schedules, knowledge distillation, skip connections, and normalization  ...  WARMUP LEARNING RATE SCHEME Learning rate warmup is a common heuristic used by many practitioners for training deep neural nets for computer vision (Goyal et al., 2017) and natural language processing  ... 
arXiv:1810.13243v1 fatcat:wakptps6sraknorrxbwwiqch2a
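The snippet above names two of the learning-rate heuristics the paper analyzes: a warmup phase and cosine-annealed (warm) restarts. Below is a minimal sketch of such a schedule; the function name, step counts, and learning-rate values are illustrative assumptions, not taken from the paper.

```python
import math

def lr_at_step(step, base_lr=0.1, warmup_steps=500, cycle_steps=2000, min_lr=1e-4):
    """Illustrative schedule: linear warmup, then cosine annealing
    with periodic (warm) restarts."""
    if step < warmup_steps:
        # Linear warmup: ramp from ~0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Position within the current cosine cycle (restart every cycle_steps).
    t = (step - warmup_steps) % cycle_steps
    cos_decay = 0.5 * (1.0 + math.cos(math.pi * t / cycle_steps))
    return min_lr + (base_lr - min_lr) * cos_decay

# Example: learning rate at a few points in training.
for s in (0, 499, 500, 1500, 2499, 2500):
    print(s, round(lr_at_step(s), 5))
```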

Attention, Learn to Solve Routing Problems! [article]

Wouter Kool, Herke van Hoof, Max Welling
2019 arXiv   pre-print
With the same hyperparameters, we learn strong heuristics for two variants of the Vehicle Routing Problem (VRP), the Orienteering Problem (OP) and (a stochastic variant of) the Prize Collecting TSP (PCTSP)  ...  The recently presented idea to learn heuristics for combinatorial optimization problems is promising as it can save costly development.  ...  We thank Thomas Kipf for helpful discussions and anonymous reviewers for comments that helped improve the paper.  ... 
arXiv:1803.08475v3 fatcat:2u54w3n63zgjdfvucb7pvddv5a

Self-Distillation Amplifies Regularization in Hilbert Space [article]

Hossein Mobahi, Mehrdad Farajtabar, Peter L. Bartlett
2020 arXiv   pre-print
Knowledge distillation introduced in the deep learning context is a method to transfer knowledge from one architecture to another.  ...  Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and solely evolves by looping over training.  ...  Acknowledgement We would like to thank colleagues at Google Research for their feedback: Moshe Dubiner, Pierre Foret, Sergey Ioffe, Yiding Jiang, Alan MacKey, Sam Schoenholz, Matt Streeter, and Andrey  ... 
arXiv:2002.05715v3 fatcat:g3ksy53bxrd23bdnfjepcqajha
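The snippet describes (self-)distillation as transferring knowledge from one model to another by training on the teacher's outputs. A minimal sketch of the standard soft-target distillation loss (commonly attributed to Hinton et al., 2015) is shown below; the temperature, weighting, and tensor names are illustrative assumptions, and the paper's own analysis is carried out in a Hilbert-space regression setting rather than with this exact loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Illustrative soft-target distillation loss: cross-entropy on hard
    labels plus a KL term pushing the student's temperature-softened
    distribution toward the teacher's."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft term is comparable to the hard term
    return alpha * hard + (1.0 - alpha) * soft

# Self-distillation loops this process: the trained student becomes the
# teacher for the next round, with no new labels or data added.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```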

A continual learning survey: Defying forgetting in classification tasks [article]

Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Gregory Slabaugh, Tinne Tuytelaars
2020 arXiv   pre-print
A comprehensive experimental comparison of 11 state-of-the-art continual learning methods and 4 baselines.  ...  Artificial neural networks thrive in solving the classification problem for a particular rigid task, acquiring knowledge through generalized learning behaviour from a distinct training phase.  ...  IMM with dropout exhibits higher performance only for the WIDE and DEEP model, coming closer to the performance obtained by the other methods. iCaRL and GEM.  ... 
arXiv:1909.08383v2 fatcat:vhvlwslqa5cefitcnajs7hp5nu

Dota 2 with Large Scale Deep Reinforcement Learning [article]

OpenAI: Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov (+10 others)
2019 arXiv   pre-print
We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months.  ...  By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.  ...  Distilling the Knowledge in a Neural Network in NIPS Deep Learning and Representation Learning Workshop (2015). <http://arxiv.org/abs/1503.02531>. 55. Ross, S., Gordon, G. & Bagnell, D.  ... 
arXiv:1912.06680v1 fatcat:cu237lzmbjff5ecpjd26berpbu

Masking as an Efficient Alternative to Finetuning for Pretrained Language Models [article]

Mengjie Zhao, Tao Lin, Fei Mi, Martin Jaggi, Hinrich Schütze
2020 arXiv   pre-print
Extensive evaluations of masking BERT and RoBERTa on a series of NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint when several  ...  Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy.  ...  Acknowledgments We thank the anonymous reviewers for the insightful comments and suggestions.  ... 
arXiv:2004.12406v2 fatcat:n4ao5uyodvgfbedoqxtm5khhly
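The idea summarized in this hit is to keep the pretrained weights frozen and learn per-weight binary masks instead of finetuning. The sketch below illustrates that general mechanism with real-valued mask scores binarized through a straight-through estimator; the class name, initialization, and thresholding are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Illustrative masked linear layer: the pretrained weight stays frozen
    and only the mask scores are trained. The forward pass uses a hard 0/1
    mask; gradients flow through a straight-through estimator."""

    def __init__(self, pretrained_weight, pretrained_bias=None):
        super().__init__()
        self.weight = nn.Parameter(pretrained_weight, requires_grad=False)
        self.bias = (nn.Parameter(pretrained_bias, requires_grad=False)
                     if pretrained_bias is not None else None)
        # One trainable score per weight; initialized so most weights start "on".
        self.scores = nn.Parameter(torch.full_like(pretrained_weight, 2.0))

    def forward(self, x):
        probs = torch.sigmoid(self.scores)
        hard = (probs > 0.5).float()
        # Straight-through: hard mask forward, sigmoid gradient backward.
        mask = hard.detach() - probs.detach() + probs
        return nn.functional.linear(x, self.weight * mask, self.bias)

# Usage: wrap a pretrained layer and train only layer.scores.
pretrained = nn.Linear(768, 768)
layer = MaskedLinear(pretrained.weight.data.clone(), pretrained.bias.data.clone())
out = layer(torch.randn(4, 768))
```

The memory advantage claimed in the abstract comes from the fact that, once trained, only the binary mask needs to be stored per task on top of the single shared pretrained model.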

Intelligence, physics and information – the tradeoff between accuracy and simplicity in machine learning [article]

Tailin Wu
2020 arXiv   pre-print
How can we enable machines to make sense of the world, and become better at learning?  ...  Secondly, for representation learning, when can we learn a good representation, and how does learning depend on the structure of the dataset?  ...  We use noisy label to mimic realistic settings where the data may be noisy and also to have controllable difficulty for different classes.  ... 
arXiv:2001.03780v2 fatcat:piduzlhoafcjhhsgthulbbhtke

Masking as an Efficient Alternative to Finetuning for Pretrained Language Models

Mengjie Zhao, Tao Lin, Fei Mi, Martin Jaggi, Hinrich Schütze
2020 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)   unpublished
Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy.  ...  Extensive evaluations of masking BERT, RoBERTa, and DistilBERT on eleven diverse NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint  ...  Acknowledgments We thank the anonymous reviewers for the insightful comments and suggestions.  ... 
doi:10.18653/v1/2020.emnlp-main.174 fatcat:3liaklamzfawlprhiyswiig6c4

Learning Collections of Functions

Emmanouil Platanios
2021
and effectively than a system that learns to perform each task in isolation.  ...  More specifically, human learning is often not about learning a single skill in isolation, but rather about learning collections of skills and utilizing relationships between them to learn more efficiently  ...  closer to human learning).  ... 
doi:10.1184/r1/13574678 fatcat:727q4m7sifc7viy3il6a6hpfi4

Logical partitioning of parallel system simulations

Hari Angepat, The University of Texas at Austin, Derek Chiou, Mattan Erez
2019
Simulation has been a fundamental tool to prototype, hypothesize, and evaluate new ideas to continue improving system performance.  ...  By leveraging partitioning in a structured manner, it is possible to design simulators that better address the open challenges of parallel and heterogeneous systems design.  ...  To normalize for the differences between FPGA frequencies, we look at the activation rate of each simulator, combined with its FPGA operating frequency to obtain its effective simulation rate in target-core-cycles  ... 
doi:10.26153/tsw/3268 fatcat:wkotdvpeyrahpatsfwcv4aogti
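The normalization mentioned in the snippet amounts to simple arithmetic: if the activation rate is read as the fraction of FPGA cycles in which a simulator advances a target core cycle (an assumption here, since the snippet is truncated), then multiplying it by the FPGA clock frequency gives an effective simulation rate in target core cycles per second. A small worked example with made-up numbers:

```python
def effective_sim_rate(activation_rate, fpga_freq_hz):
    """Target core cycles simulated per wall-clock second, assuming
    activation_rate = target cycles advanced per FPGA cycle."""
    return activation_rate * fpga_freq_hz

# Hypothetical comparison of two simulators with different FPGA clocks.
sim_a = effective_sim_rate(activation_rate=0.25, fpga_freq_hz=100e6)  # 25M target cycles/s
sim_b = effective_sim_rate(activation_rate=0.40, fpga_freq_hz=60e6)   # 24M target cycles/s
print(f"A: {sim_a / 1e6:.1f} M target cycles/s, B: {sim_b / 1e6:.1f} M target cycles/s")
```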