A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation
[article]
2018
arXiv
pre-print
In particular, we explore knowledge distillation and learning rate heuristics of (cosine) restarts and warmup using mode connectivity and CCA. ...
The convergence rate and final performance of common deep learning models have significantly benefited from heuristics such as learning rate schedules, knowledge distillation, skip connections, and normalization ...
WARMUP LEARNING RATE SCHEME: Learning rate warmup is a common heuristic used by many practitioners for training deep neural nets for computer vision (Goyal et al., 2017) and natural language processing ...
arXiv:1810.13243v1
fatcat:wakptps6sraknorrxbwwiqch2a
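
The warmup and cosine-restart heuristics examined in the entry above are easy to sketch. The schedule below is a generic Python illustration, not the authors' exact configuration; the base learning rate, warmup length, and restart period are arbitrary placeholder values.

```python
import math

def lr_at_step(step, base_lr=0.1, warmup_steps=500, restart_period=2000):
    """Illustrative schedule: linear warmup, then cosine annealing that
    restarts every `restart_period` steps. All constants are arbitrary
    placeholders, not values from the paper."""
    if step < warmup_steps:
        # Linear warmup from ~0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Position within the current cosine cycle, in [0, 1).
    t = ((step - warmup_steps) % restart_period) / restart_period
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))

# Inspect the schedule at a few points.
for s in (0, 250, 500, 1500, 2500):
    print(s, round(lr_at_step(s), 4))
```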
Attention, Learn to Solve Routing Problems!
[article]
2019
arXiv
pre-print
With the same hyperparameters, we learn strong heuristics for two variants of the Vehicle Routing Problem (VRP), the Orienteering Problem (OP) and (a stochastic variant of) the Prize Collecting TSP (PCTSP ...
The recently presented idea to learn heuristics for combinatorial optimization problems is promising as it can save costly development. ...
We thank Thomas Kipf for helpful discussions and anonymous reviewers for comments that helped improve the paper. ...
arXiv:1803.08475v3
fatcat:2u54w3n63zgjdfvucb7pvddv5a
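
The entry above learns construction heuristics with REINFORCE and a greedy-rollout baseline. The sketch below illustrates that training signal on a toy TSP instance; the per-node score "policy" is only a stand-in for the paper's attention encoder-decoder, and all sizes and hyperparameters are arbitrary.

```python
import torch

def sample_tour(scores, greedy=False):
    """Build a tour by repeatedly picking an unvisited node from a softmax
    over learnable per-node scores; returns the tour and its total
    log-probability. A toy stand-in for the attention decoder."""
    n = scores.shape[0]
    visited = torch.zeros(n, dtype=torch.bool)
    tour, log_prob = [], torch.zeros(())
    for _ in range(n):
        masked = scores.masked_fill(visited, float("-inf"))
        probs = torch.softmax(masked, dim=0)
        idx = probs.argmax() if greedy else torch.multinomial(probs, 1).squeeze()
        log_prob = log_prob + torch.log(probs[idx])
        visited[idx] = True
        tour.append(idx)
    return torch.stack(tour), log_prob

def tour_length(coords, tour):
    """Total Euclidean length of the closed tour visiting coords[tour]."""
    ordered = coords[tour]
    return (ordered - ordered.roll(-1, dims=0)).norm(dim=1).sum()

coords = torch.rand(8, 2)                    # one random TSP instance
scores = torch.zeros(8, requires_grad=True)  # toy policy parameters
opt = torch.optim.Adam([scores], lr=0.1)

for step in range(100):
    tour, log_prob = sample_tour(scores)
    with torch.no_grad():
        greedy_tour, _ = sample_tour(scores, greedy=True)
        baseline = tour_length(coords, greedy_tour)   # greedy-rollout baseline
    advantage = tour_length(coords, tour) - baseline
    loss = advantage * log_prob                       # REINFORCE gradient estimator
    opt.zero_grad()
    loss.backward()
    opt.step()
```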
Self-Distillation Amplifies Regularization in Hilbert Space
[article]
2020
arXiv
pre-print
Knowledge distillation introduced in the deep learning context is a method to transfer knowledge from one architecture to another. ...
Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and solely evolves by looping over training. ...
Acknowledgement We would like to thank colleagues at Google Research for their feedback: Moshe Dubiner, Pierre Foret, Sergey Ioffe, Yiding Jiang, Alan MacKey, Sam Schoenholz, Matt Streeter, and Andrey ...
arXiv:2002.05715v3
fatcat:g3ksy53bxrd23bdnfjepcqajha
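
Self-distillation, as studied in the entry above, repeatedly retrains the same architecture on the previous round's predictions instead of the original labels. The paper's analysis is for kernel ridge regression in a Hilbert space; the sketch below only illustrates the loop itself, using a small neural regressor and arbitrary hyperparameters.

```python
import torch
import torch.nn as nn

def fit(model, xs, targets, epochs=200, lr=0.05):
    """Fit `model` to `targets` (ground-truth labels in round 0, the previous
    round's predictions afterwards) with plain MSE. Hyperparameters are
    arbitrary placeholders."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(xs), targets)
        loss.backward()
        opt.step()
    return model

def make_model():
    return nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

# Toy 1-D regression data.
xs = torch.linspace(-1, 1, 64).unsqueeze(1)
ys = torch.sin(3 * xs) + 0.1 * torch.randn_like(xs)

targets = ys
for round_idx in range(3):          # successive self-distillation rounds
    model = fit(make_model(), xs, targets)
    # The next round trains a fresh model of the same architecture on the
    # current model's predictions rather than the original labels.
    targets = model(xs).detach()
```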
A continual learning survey: Defying forgetting in classification tasks
[article]
2020
arXiv
pre-print
... a comprehensive experimental comparison of 11 state-of-the-art continual learning methods and 4 baselines. ...
Artificial neural networks thrive in solving the classification problem for a particular rigid task, acquiring knowledge through generalized learning behaviour from a distinct training phase. ...
IMM with dropout exhibits higher performance only for the WIDE and DEEP models, coming closer to the performance obtained by the other methods. iCaRL and GEM ...
arXiv:1909.08383v2
fatcat:vhvlwslqa5cefitcnajs7hp5nu
Dota 2 with Large Scale Deep Reinforcement Learning
[article]
2019
arXiv
pre-print
We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. ...
By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task. ...
Distilling the Knowledge in a Neural Network in NIPS Deep Learning and Representation Learning Workshop (2015). <http://arxiv.org/abs/1503.02531>. 55. Ross, S., Gordon, G. & Bagnell, D. ...
arXiv:1912.06680v1
fatcat:cu237lzmbjff5ecpjd26berpbu
Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
[article]
2020
arXiv
pre-print
Extensive evaluations of masking BERT and RoBERTa on a series of NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint when several ...
Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy. ...
Acknowledgments We thank the anonymous reviewers for the insightful comments and suggestions. ...
arXiv:2004.12406v2
fatcat:n4ao5uyodvgfbedoqxtm5khhly
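
The masking scheme in the entry above trains a binary mask over frozen pretrained weights instead of updating the weights themselves. The layer below is one common way to implement such a mask with a straight-through estimator; the thresholding and initialisation choices here are illustrative assumptions, not necessarily the paper's exact parameterisation.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer whose pretrained weight and bias are frozen; training
    only updates real-valued scores whose sign decides which weights are
    kept. Threshold and initialisation are illustrative choices."""

    def __init__(self, pretrained: nn.Linear, threshold: float = 0.0):
        super().__init__()
        self.weight = nn.Parameter(pretrained.weight.detach().clone(), requires_grad=False)
        self.bias = nn.Parameter(pretrained.bias.detach().clone(), requires_grad=False)
        self.scores = nn.Parameter(torch.randn_like(self.weight) * 0.01)
        self.threshold = threshold

    def forward(self, x):
        hard_mask = (self.scores > self.threshold).float()
        # Straight-through estimator: the forward pass uses the hard 0/1
        # mask, the backward pass sends gradients to the scores.
        mask = hard_mask + self.scores - self.scores.detach()
        return nn.functional.linear(x, self.weight * mask, self.bias)

# Usage: wrap a "pretrained" layer and confirm only the scores are trainable.
layer = MaskedLinear(nn.Linear(16, 4))
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['scores']
```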
Intelligence, physics and information – the tradeoff between accuracy and simplicity in machine learning
[article]
2020
arXiv
pre-print
How can we enable machines to make sense of the world, and become better at learning? ...
Secondly, for representation learning, when can we learn a good representation, and how does learning depend on the structure of the dataset? ...
We use noisy labels to mimic realistic settings where the data may be noisy and also to have controllable difficulty for different classes. ...
arXiv:2001.03780v2
fatcat:piduzlhoafcjhhsgthulbbhtke
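
The last snippet of the entry above mentions injecting label noise with controllable per-class difficulty. A generic way to do this (not the thesis code) is to flip a chosen fraction of each class's labels to uniformly random other classes, as sketched below; the rates and class counts are placeholder values.

```python
import numpy as np

def corrupt_labels(labels, noise_rate, num_classes, rng=None):
    """Replace a fraction `noise_rate` of each class's labels with uniformly
    random other classes; a scalar or a per-class array of rates controls
    difficulty per class. A generic illustration of label-noise injection."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels).copy()
    rates = np.broadcast_to(np.asarray(noise_rate, dtype=float), (num_classes,))
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        flip = idx[rng.random(idx.size) < rates[c]]
        # Draw a replacement class guaranteed to differ from class c.
        labels[flip] = (c + rng.integers(1, num_classes, size=flip.size)) % num_classes
    return labels

# Example: 30% noise on class 0, 10% on class 1, class 2 left clean.
y = np.repeat([0, 1, 2], 100)
y_noisy = corrupt_labels(y, [0.3, 0.1, 0.0], num_classes=3, rng=0)
```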
Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
2020
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
unpublished
Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy. ...
Extensive evaluations of masking BERT, RoBERTa, and DistilBERT on eleven diverse NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint ...
Acknowledgments We thank the anonymous reviewers for the insightful comments and suggestions. ...
doi:10.18653/v1/2020.emnlp-main.174
fatcat:3liaklamzfawlprhiyswiig6c4
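
The loss-landscape claim in this entry, that masked and finetuned solutions sit in minima connected by a line segment of nearly constant test accuracy, can be checked by evaluating models interpolated along that segment. The sketch below shows such an evaluation; the toy model and random data at the end are placeholders for the user's own checkpoints and test loader.

```python
import copy
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha):
    """Pointwise linear interpolation (1 - alpha) * A + alpha * B between two
    state dicts with identical keys and shapes."""
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

@torch.no_grad()
def accuracy_along_segment(model, sd_a, sd_b, loader, steps=11):
    """Test accuracy at evenly spaced points on the segment between two
    trained solutions (e.g. a masked and a finetuned model)."""
    accs = []
    for i in range(steps):
        alpha = i / (steps - 1)
        model.load_state_dict(interpolate_state_dicts(sd_a, sd_b, alpha))
        model.eval()
        correct = total = 0
        for x, y in loader:
            pred = model(x).argmax(dim=-1)
            correct += (pred == y).sum().item()
            total += y.numel()
        accs.append(correct / total)
    return accs

# Tiny usage example with a toy model and random data as placeholders.
model = torch.nn.Linear(8, 3)
sd_a = copy.deepcopy(model.state_dict())
sd_b = {k: v + 0.1 * torch.randn_like(v) for k, v in sd_a.items()}
loader = [(torch.randn(32, 8), torch.randint(0, 3, (32,))) for _ in range(4)]
print(accuracy_along_segment(model, sd_a, sd_b, loader))
```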
Learning Collections of Functions
2021
... and effectively than a system that learns to perform each task in isolation. ...
More specifically, human learning is often not about learning a single skill in isolation, but rather about learning collections of skills and utilizing relationships between them to learn more efficiently ...
... closer to human learning). ...
doi:10.1184/r1/13574678
fatcat:727q4m7sifc7viy3il6a6hpfi4
Logical partitioning of parallel system simulations
2019
Simulation has been a fundamental tool to prototype, hypothesize, and evaluate new ideas to continue improving system performance. ...
By leveraging partitioning in a structured manner, it is possible to design simulators that better address the open challenges of parallel and heterogeneous systems design. ...
To normalize for the differences between FPGA frequencies, we look at the activation rate of each simulator, combined with its FPGA operating frequency to obtain its effective simulation rate in target-core-cycles ...
doi:10.26153/tsw/3268
fatcat:wkotdvpeyrahpatsfwcv4aogti
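
The normalization described in the last snippet combines a simulator's activation rate with its FPGA operating frequency to obtain an effective simulation rate. The helper below assumes the activation rate is the fraction of host FPGA cycles in which each simulated target core advances one cycle; that reading, and the sample numbers, are illustrative assumptions rather than values from the thesis.

```python
def effective_sim_rate(activation_rate: float, fpga_freq_hz: float,
                       num_target_cores: int = 1) -> float:
    """Effective simulation rate in target-core-cycles per second, assuming
    `activation_rate` is the fraction of host FPGA cycles in which each
    simulated target core advances by one cycle (an assumption made for
    illustration; the thesis defines the metric precisely)."""
    return activation_rate * fpga_freq_hz * num_target_cores

# Example: two simulators at different FPGA frequencies compared on equal footing.
print(effective_sim_rate(0.5, 90e6, num_target_cores=4))   # 1.8e8 target-core-cycles/s
print(effective_sim_rate(0.8, 45e6, num_target_cores=4))   # 1.44e8 target-core-cycles/s
```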