
Weighted Transformer Network for Machine Translation [article]

Karim Ahmed, Nitish Shirish Keskar, Richard Socher
2017 arXiv   pre-print
State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recursion. Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution completely. Instead, it uses only self-attention and feed-forward layers. While the proposed architecture achieves state-of-the-art results on several machine translation tasks, it requires a large number of parameters and training iterations to converge. We propose Weighted Transformer, a Transformer with modified attention layers, that not only outperforms the baseline network in BLEU score but also converges 15-40% faster. Specifically, we replace the multi-head attention by multiple self-attention branches that the model learns to combine during the training process. Our model improves the state-of-the-art performance by 0.5 BLEU points on the WMT 2014 English-to-German translation task and by 0.4 on the English-to-French translation task.
arXiv:1711.02132v1 fatcat:45u2pz33xjd3hcqq53uhzixxye
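
A minimal sketch of the branch-and-combine attention idea described in the abstract above; this is not the authors' implementation, and the module name, the use of single-head branches, and the softmax-normalized mixing weights are assumptions made for brevity.

import torch
import torch.nn as nn

class BranchedSelfAttention(nn.Module):
    """Multiple self-attention branches mixed by learned, normalized weights."""
    def __init__(self, d_model, n_branches):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
            for _ in range(n_branches)
        )
        # One learnable mixing weight per branch; softmax keeps them on the simplex.
        self.kappa = nn.Parameter(torch.zeros(n_branches))

    def forward(self, x):
        weights = torch.softmax(self.kappa, dim=0)
        outs = [attn(x, x, x)[0] for attn in self.branches]
        return sum(w * o for w, o in zip(weights, outs))

x = torch.randn(2, 10, 64)                      # (batch, seq_len, d_model)
y = BranchedSelfAttention(64, n_branches=4)(x)  # output has the same shape as x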

Regularizing and Optimizing LSTM Language Models [article]

Stephen Merity, Nitish Shirish Keskar, Richard Socher
2017 arXiv   pre-print
Analogous strategies have also been proposed for learning-rate reduction in SGD (Keskar & Saon, 2015). ...
arXiv:1708.02182v1 fatcat:ljt4xxy7lrcarfwlc3gxfxk55q

Identifying Generalization Properties in Neural Networks [article]

Huan Wang, Nitish Shirish Keskar, Caiming Xiong, Richard Socher
2018 arXiv   pre-print
One may also assume the same τ for all parameters for a simpler argument. ... We had the same observation as Keskar et al. (2016): as the batch size grows, the gap between the test loss and the training loss tends to get larger. ...
arXiv:1809.07402v1 fatcat:rf7lmibqfbfwpglg24vazuqrvm

Pretrained AI Models: Performativity, Mobility, and Change [article]

Lav R. Varshney, Nitish Shirish Keskar, Richard Socher
2019 arXiv   pre-print
The paradigm of pretrained deep learning models has recently emerged in artificial intelligence practice, allowing deployment in numerous societal settings with limited computational resources, but also embedding biases and enabling unintended negative uses. In this paper, we treat pretrained models as objects of study and discuss the ethical impacts of their sociological position. We discuss how pretrained models are developed and compared under the common task framework, but that this may make self-regulation inadequate. Further, we discuss how pretrained models may have a performative effect on society that exacerbates biases. We then discuss how pretrained models move through actor networks as a kind of computationally immutable mobile, but that users also act as agents of technological change by reinterpreting them via fine-tuning and transfer. We further discuss how users may use pretrained models in malicious ways, drawing a novel connection between the responsible innovation and user-centered innovation literatures. We close by discussing how this sociological understanding of pretrained models can inform AI governance frameworks for fairness, accountability, and transparency.
arXiv:1909.03290v1 fatcat:7doni7tc3rginpokkow2wtiqmy

Neural Text Summarization: A Critical Evaluation [article]

Wojciech Kryściński, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, Richard Socher
2019 arXiv   pre-print
Text summarization aims at compressing long documents into a shorter form that conveys the most important parts of the original document. Despite increased interest in the community and notable research effort, progress on benchmark datasets has stagnated. We critically evaluate key ingredients of the current research setup: datasets, evaluation metrics, and models, and highlight three primary shortcomings: 1) automatically collected datasets leave the task underconstrained and may contain noise detrimental to training and evaluation, 2) the current evaluation protocol is weakly correlated with human judgment and does not account for important characteristics such as factual correctness, and 3) models overfit to layout biases of current datasets and offer limited diversity in their outputs.
arXiv:1908.08960v1 fatcat:5gew2vbmvjgjjb33njpm3j7ucq

Improving Generalization Performance by Switching from Adam to SGD [article]

Nitish Shirish Keskar, Richard Socher
2017 arXiv   pre-print
Correspondence to: Nitish Shirish Keskar <nkeskar@salesforce.com>. ... the following non-convex optimization problem, $\min_{w \in \mathbb{R}^n} f(w)$, where $f$ is a loss function. ...
arXiv:1712.07628v1 fatcat:uksgec7lfnfbjpcjbuz5e2l3mu
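
A hedged sketch of the switching idea named in the title: train with Adam first, then hand the parameters to SGD. The fixed switch point and learning rates below are assumptions for illustration only; the paper's method chooses the switch point and the SGD learning rate automatically.

import torch

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    if step == 500:  # assumed switch point, chosen here for illustration only
        opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()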

Using Mode Connectivity for Loss Landscape Analysis [article]

Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, Richard Socher
2018 arXiv   pre-print
Correspondence to: Nitish Shirish Keskar <nkeskar@salesforce.com>. ... Particularly for the large-batch training case, previous works (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016) have empirically established that small-batch training leads to wider minima and large-batch ...
arXiv:1806.06977v1 fatcat:j6xuni3hxrbuxn5v57oybf3yge
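
A minimal sketch of one way to probe mode connectivity: evaluate the loss along a quadratic Bezier curve joining two trained solutions. The helper names are invented, and the bend point theta is taken as a given state dict here, whereas curve-finding methods optimize it; this is an illustration, not the authors' code.

import torch

def curve_point(w1, w2, theta, t):
    # Quadratic Bezier curve phi(t) = (1-t)^2 w1 + 2 t (1-t) theta + t^2 w2,
    # applied tensor by tensor over the model state dicts.
    return {k: (1 - t) ** 2 * w1[k] + 2 * t * (1 - t) * theta[k] + t ** 2 * w2[k]
            for k in w1}

def loss_along_curve(model, w1, w2, theta, batch, loss_fn, n_points=11):
    x, y = batch
    losses = []
    for i in range(n_points):
        t = i / (n_points - 1)
        model.load_state_dict(curve_point(w1, w2, theta, t))
        with torch.no_grad():
            losses.append(loss_fn(model(x), y).item())
    return losses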

An Analysis of Neural Language Modeling at Multiple Scales [article]

Stephen Merity, Nitish Shirish Keskar, Richard Socher
2018 arXiv   pre-print
Many of the leading approaches in language modeling introduce novel, complex and specialized architectures. We take existing state-of-the-art word-level language models based on LSTMs and QRNNs and extend them to both larger vocabularies and character-level granularity. When properly tuned, LSTMs and QRNNs achieve state-of-the-art results on character-level (Penn Treebank, enwik8) and word-level (WikiText-103) datasets, respectively. Results are obtained in only 12 hours (WikiText-103) to 2 days (enwik8) using a single modern GPU.
arXiv:1803.08240v1 fatcat:5qfnak75nfdofoycebgyszdwu4

A Limited-Memory Quasi-Newton Algorithm for Bound-Constrained Nonsmooth Optimization [article]

Nitish Shirish Keskar, Andreas Waechter
2016 arXiv   pre-print
We consider the problem of minimizing a continuous function that may be nonsmooth and nonconvex, subject to bound constraints. We propose an algorithm that uses the L-BFGS quasi-Newton approximation of the problem's curvature together with a variant of the weak Wolfe line search. The key ingredient of the method is an active-set selection strategy that defines the subspace in which search directions are computed. To overcome the inherent shortsightedness of the gradient for a nonsmooth function, we propose two strategies. The first relies on an approximation of the ϵ-minimum norm subgradient, and the second uses an iterative corrective loop that augments the active set based on the resulting search directions. We describe a Python implementation of the proposed algorithm and present numerical results on a set of standard test problems to illustrate the efficacy of our approach.
arXiv:1612.07350v1 fatcat:dbgyww2alzbmji6uox2b7fa3qm
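
For context only, a hedged illustration of the problem class (bound-constrained minimization of a nonsmooth objective), solved here with SciPy's off-the-shelf L-BFGS-B; the paper proposes its own L-BFGS variant with a weak Wolfe line search and an active-set strategy, which this snippet does not implement. The test objective is invented for the example.

import numpy as np
from scipy.optimize import minimize

def f(w):
    # Nonsmooth test objective: an L1 term plus a shifted quadratic.
    return np.abs(w).sum() + 0.5 * ((w - 1.0) ** 2).sum()

res = minimize(f, x0=np.zeros(5), method="L-BFGS-B",
               bounds=[(-2.0, 2.0)] * 5)
print(res.x, res.fun)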

Limits of Detecting Text Generated by Large-Scale Language Models [article]

Lav R. Varshney, Nitish Shirish Keskar, Richard Socher
2020 arXiv   pre-print
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns. Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated. We show that error exponents for particular language models are bounded in terms of their perplexity, a standard measure of language generation performance. Under the assumption that human language is stationary and ergodic, the formulation is extended from considering specific language models to considering maximum likelihood language models, among the class of k-order Markov approximations; error probabilities are characterized. Some discussion of incorporating semantic side information is also given.
arXiv:2002.03438v1 fatcat:o636j5cl4ngo7mzgpmjajyafg4
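
One standard way to write down such a test, sketched here with assumed notation (P for the human-text distribution, Q for the language model) rather than the paper's exact statement:

\[
H_0:\; x_1^n \sim P \ (\text{genuine})
\quad\text{vs.}\quad
H_1:\; x_1^n \sim Q \ (\text{generated}).
\]

By the Chernoff–Stein lemma, the best achievable type-II error exponent at a fixed type-I error level is the relative entropy $D(P\|Q)$; since the cross-entropy satisfies $H(P,Q) = H(P) + D(P\|Q)$ and perplexity (in bits) is $\mathrm{PPL}_Q = 2^{H(P,Q)}$, the exponent can be written as $D(P\|Q) = \log_2 \mathrm{PPL}_Q - H(P)$, which ties detectability to the model's perplexity.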

ProGen: Language Modeling for Protein Generation [article]

Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R Eguchi, Possu Huang, Richard Socher
2020 bioRxiv   pre-print
Notably different from Keskar et al. (2019), protein engineering requires a finer-grained, much larger, and more complex set of conditioning tags. ... However, there has been no attempt to adapt state-of-the-art methods for artificial text generation, and in particular the kind of controllable generation (Keskar et al., 2019) that would be most useful ...
doi:10.1101/2020.03.07.982272 fatcat:4rbzpctnxzf5pa3n7gniinmroe
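
A hedged sketch of the conditioning-tag idea mentioned in the snippet: tags are prepended to the residue sequence so a standard left-to-right language model can condition on them during training and generation. The tag names and the protein fragment below are invented for illustration only.

def make_training_example(tags, sequence):
    # tags might describe organism, function, or localization; sequence is a
    # string of amino-acid residues, tokenized here one character at a time.
    return " ".join(tags) + " " + " ".join(sequence)

example = make_training_example(
    ["<organism=E.coli>", "<function=hydrolase>"],
    "MKTAYIAKQR",
)
print(example)  # "<organism=E.coli> <function=hydrolase> M K T A Y I A K Q R"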

The Natural Language Decathlon: Multitask Learning as Question Answering [article]

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, Richard Socher
2018 arXiv   pre-print
Deep learning has improved performance on many natural language processing (NLP) tasks individually. However, general NLP models cannot emerge within a paradigm that focuses on the particularities of a single metric, dataset, and task. We introduce the Natural Language Decathlon (decaNLP), a challenge that spans ten tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution. We cast all tasks as question answering over a context. Furthermore, we present a new Multitask Question Answering Network (MQAN) that jointly learns all tasks in decaNLP without any task-specific modules or parameters in the multitask setting. MQAN shows improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification. We demonstrate that the MQAN's multi-pointer-generator decoder is key to this success and that performance further improves with an anti-curriculum training strategy. Though designed for decaNLP, MQAN also achieves state-of-the-art results on the WikiSQL semantic parsing task in the single-task setting. We also release code for procuring and processing data, training and evaluating models, and reproducing all experiments for decaNLP.
arXiv:1806.08730v1 fatcat:pdvwr3fqfrdnjdzwotzahsjf3e
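
A minimal sketch of the "cast every task as question answering over a context" framing; the example strings below are invented for illustration and are not drawn from the decaNLP datasets.

examples = [
    {"question": "What is the summary?",
     "context": "The proposed model converges faster than the baseline while matching its accuracy.",
     "answer": "The model is faster and just as accurate."},
    {"question": "What is the translation from English to German?",
     "context": "The house is small.",
     "answer": "Das Haus ist klein."},
    {"question": "Is this review positive or negative?",
     "context": "The movie was a complete waste of time.",
     "answer": "negative"},
]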

ProGen: Language Modeling for Protein Generation [article]

Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, Richard Socher
2020 arXiv   pre-print
Notably different from Keskar et al. (2019), protein engineering requires a finer-grained, much larger, and more complex set of conditioning tags. ... However, there has been no attempt to adapt state-of-the-art methods for artificial text generation, and in particular the kind of controllable generation (Keskar et al., 2019) that would be most useful ...
arXiv:2004.03497v1 fatcat:2iudbwmfnvdcfaewvs7fw3xkqu

Unsupervised Paraphrasing with Pretrained Language Models [article]

Tong Niu, Semih Yavuz, Yingbo Zhou, Nitish Shirish Keskar, Huan Wang, Caiming Xiong
2021 arXiv   pre-print
Paraphrase generation has benefited extensively from recent progress in the designing of training objectives and model architectures. However, previous explorations have largely focused on supervised methods, which require a large amount of labeled data that is costly to collect. To address this drawback, we adopt a transfer learning approach and propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting. Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking (DB). To enforce a surface form dissimilar from the input, whenever the language model emits a token contained in the source sequence, DB prevents the model from outputting the subsequent source token for the next generation step. We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair (QQP) and the ParaNMT datasets and is robust to domain shift between the two datasets of distinct distributions. We also demonstrate that our model transfers to paraphrasing in other languages without any additional finetuning.
arXiv:2010.12885v2 fatcat:wcronrkhx5cidasbpv7uvmmbdu
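
A hedged sketch of the Dynamic Blocking rule as described in the abstract: if the last generated token occurs in the source, the token that follows it in the source is blocked for the next decoding step. This is a simplified reading for illustration, not the authors' implementation (the full method includes details this sketch omits).

import math

def dynamic_blocking(source_ids, last_generated_id, logits):
    # Collect every source token that immediately follows an occurrence of the
    # last generated token, then suppress those tokens by setting their logits
    # to -inf for the next step.
    blocked = {source_ids[i + 1]
               for i, tok in enumerate(source_ids[:-1])
               if tok == last_generated_id}
    return [(-math.inf if i in blocked else logit)
            for i, logit in enumerate(logits)]

# Toy usage over a vocabulary of 4 token ids: the source is [1, 2, 3] and the
# model just emitted token 2, so token 3 is blocked at the next step.
print(dynamic_blocking([1, 2, 3], 2, [0.1, 0.2, 0.3, 0.4]))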

Global Capacity Measures for Deep ReLU Networks via Path Sampling [article]

Ryan Theisen, Jason M. Klusowski, Huan Wang, Nitish Shirish Keskar, Caiming Xiong, Richard Socher
2019 arXiv   pre-print
Classical results on the statistical complexity of linear models have commonly identified the norm of the weights w as a fundamental capacity measure. Generalizations of this measure to the setting of deep networks have been varied, though a frequently identified quantity is the product of weight norms of each layer. In this work, we show that for a large class of networks possessing a positive homogeneity property, similar bounds may be obtained instead in terms of the norm of the product of weights. Our proof technique generalizes a recently proposed sampling argument, which allows us to demonstrate the existence of sparse approximants of positive homogeneous networks. This yields covering number bounds, which can be converted to generalization bounds for multi-class classification that are comparable to, and in certain cases improve upon, existing results in the literature. Finally, we investigate our sampling procedure empirically, which yields results consistent with our theory.
arXiv:1910.10245v1 fatcat:cclznvmcdzfmfefetg5vqw4ngu
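
A small numeric illustration of the distinction drawn in the abstract (the matrix sizes and scaling are arbitrary): by submultiplicativity, the norm of the product ||W2 W1|| never exceeds the product of norms ||W2||·||W1||, and for random layers it is typically much smaller, which is why bounds phrased in terms of the former can be tighter.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((256, 256)) / np.sqrt(256)
W2 = rng.standard_normal((256, 256)) / np.sqrt(256)

norm_of_product = np.linalg.norm(W2 @ W1, 2)           # spectral norm of W2 W1
product_of_norms = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)
print(norm_of_product, product_of_norms)               # first value <= second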
Showing results 1–15 of 220