240 Hits in 2.1 sec

Online Sparse Linear Regression [article]

Dean Foster, Satyen Kale, Howard Karloff
2016 arXiv   pre-print
This computational hardness result resolves an open problem presented in COLT 2014 (Kale, 2014) and also posed by Zolghadr et al. (2013).  ...  Also, Kale (2014) sketches a different algorithm with performance guarantees similar to the algorithm presented in this paper; our work builds upon that sketch and gives tighter regret bounds.  ... 
arXiv:1603.02250v1 fatcat:gcp5loocojekxc5n3mo3gpyn74

Federated Functional Gradient Boosting [article]

Zebang Shen, Hamed Hassani, Satyen Kale, Amin Karbasi
2021 arXiv   pre-print
In this paper, we initiate a study of functional minimization in Federated Learning. First, in the semi-heterogeneous setting, when the marginal distributions of the feature vectors on client machines are identical, we develop the federated functional gradient boosting (FFGB) method that provably converges to the global minimum. Subsequently, we extend our results to the fully-heterogeneous setting (where marginal distributions of feature vectors may differ) by designing an efficient variant of FFGB called FFGB.C, with provable convergence to a neighborhood of the global minimum within a radius that depends on the total variation distances between the client feature distributions. For the special case of square loss, but still in the fully heterogeneous setting, we design the FFGB.L method that also enjoys provable convergence to a neighborhood of the global minimum, but within a radius depending on the much tighter Wasserstein-1 distances. For both FFGB.C and FFGB.L, the radii of convergence shrink to zero as the feature distributions become more homogeneous. Finally, we conduct proof-of-concept experiments to demonstrate the benefits of our approach against natural baselines.
arXiv:2103.06972v1 fatcat:io3lm3pozzg5vlizx5iu3uviia

Combinatorial Approximation Algorithms for MaxCut using Random Walks [article]

Satyen Kale, C. Seshadhri
2010 arXiv   pre-print
Arora and Kale [AK07] gave an efficient near-linear-time implementation of the SDP algorithm for MaxCut.  ... 
arXiv:1008.3938v1 fatcat:ghgprkco55hgxns7bkidj3jd6y

Online Gradient Boosting [article]

Alina Beygelzimer, Elad Hazan, Satyen Kale, Haipeng Luo
2015 arXiv   pre-print
We extend the theory of boosting for regression problems to the online learning setting. Generalizing from the batch setting for boosting, the notion of a weak learning algorithm is modeled as an online learning algorithm with linear loss functions that competes with a base class of regression functions, while a strong learning algorithm is an online learning algorithm with convex loss functions that competes with a larger class of regression functions. Our main result is an online gradient boosting algorithm which converts a weak online learning algorithm into a strong one where the larger class of functions is the linear span of the base class. We also give a simpler boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the convex hull of the base class, and prove its optimality.
arXiv:1506.04820v2 fatcat:7e3h67djebggfaelbhkfj7gpce

An optimal algorithm for stochastic strongly-convex optimization [article]

Elad Hazan, Satyen Kale
2010 arXiv   pre-print
We consider stochastic convex optimization with a strongly convex (but not necessarily smooth) objective. We give an algorithm which performs only gradient updates with optimal rate of convergence.
arXiv:1006.2425v1 fatcat:w4l3ko76wjai3azbmbek37k5ti
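For context, the standard baseline this line of work improves on is plain SGD with step size 1/(λt) on a λ-strongly-convex objective. The sketch below is that baseline on a toy quadratic, not the paper's epoch-based algorithm; the toy objective, noise level, and all names are assumptions made for illustration.

```python
import numpy as np

def sgd_strongly_convex(grad_oracle, x0, lam, T, rng):
    # Plain SGD with step size 1/(lam * t) for a lam-strongly-convex
    # objective; a baseline sketch, not the paper's method.
    x = np.array(x0, dtype=float)
    for t in range(1, T + 1):
        g = grad_oracle(x, rng)
        x = x - g / (lam * t)
    return x

# Toy problem: f(x) = 0.5 * lam * ||x - x_star||^2 with noisy gradients.
rng = np.random.default_rng(0)
lam = 1.0
x_star = np.array([1.0, -2.0])
oracle = lambda x, r: lam * (x - x_star) + 0.1 * r.standard_normal(2)
x_hat = sgd_strongly_convex(oracle, np.zeros(2), lam, 5000, rng)
```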

Optimal and Adaptive Algorithms for Online Boosting [article]

Alina Beygelzimer, Satyen Kale, Haipeng Luo
2015 arXiv   pre-print
We study online boosting, the task of converting any weak online learner into a strong online learner. Based on a novel and natural definition of weak online learnability, we develop two online boosting algorithms. The first algorithm is an online version of boost-by-majority. By proving a matching lower bound, we show that this algorithm is essentially optimal in terms of the number of weak learners and the sample complexity needed to achieve a specified accuracy. This optimal algorithm is not adaptive, however. Using tools from online loss minimization, we derive an adaptive online boosting algorithm that is also parameter-free, but not optimal. Both algorithms work with base learners that can handle example importance weights directly, as well as by rejection sampling examples with probability defined by the booster. Results are complemented with an extensive experimental study.
arXiv:1502.02651v1 fatcat:svaj4rrgxfgfxbbzta54l5vuq4

Near-Optimal Algorithms for Online Matrix Prediction [article]

Elad Hazan, Satyen Kale, Shai Shalev-Shwartz
2012 arXiv   pre-print
The algorithm, forms of which independently appeared in the work of Tsuda et al. [2006] and Arora and Kale [2007], performs exponentiated gradient steps followed by Bregman projections onto K.  ...  Obtain loss matrix L_t. Update X_{t+1} = argmin_{X ∈ K} ∆(X, exp(log(X_t) − η L_t)).  ...  Algorithm 1 has the following regret bound (essentially following Tsuda et al. [2006], Arora and Kale  ... 
arXiv:1204.0136v1 fatcat:6jhsc2asvfharetxem3zlkrxly
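The exponentiated-gradient update quoted in the snippet can be made concrete. The sketch below assumes symmetric matrices and takes K to be the set of density matrices {X ⪰ 0, Tr X = 1}, for which the Bregman projection under the quantum relative entropy reduces to trace normalization; for the general sets K in the paper the projection is more involved, and the function and variable names here are assumptions made for the example.

```python
import numpy as np

def matrix_eg_step(X, L, eta):
    # One matrix exponentiated gradient step: X' ∝ exp(log X − eta * L),
    # followed by trace normalization (the Bregman projection onto the
    # density matrices {X ⪰ 0, Tr X = 1}).
    w, V = np.linalg.eigh(X)
    # Matrix log via eigendecomposition; clip guards zero eigenvalues.
    logX = (V * np.log(np.clip(w, 1e-12, None))) @ V.T
    M = logX - eta * L
    mw, MV = np.linalg.eigh(M)
    mw = mw - mw.max()            # stabilize before exponentiating
    expM = (MV * np.exp(mw)) @ MV.T
    return expM / np.trace(expM)
```

The eigenvalue shift before exponentiating changes nothing after normalization but avoids overflow, a standard trick for matrix-exponential updates.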

On the Convergence of Adam and Beyond [article]

Sashank J. Reddi, Satyen Kale, Sanjiv Kumar
2019 arXiv   pre-print
Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSProp, Adam, Adadelta, and Nadam, are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of the Adam algorithm. Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with 'long-term memory' of past gradients, and we propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance.
arXiv:1904.09237v1 fatcat:ctg52u4p5fgufdfxrgwmvydvlu
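The 'long-term memory' fix described in the abstract amounts, in the AMSGrad variant, to keeping a running maximum of the second-moment estimate so that the effective step size never grows back. A minimal single-step sketch; the function name, state layout, and default hyperparameters are assumptions made for illustration.

```python
import numpy as np

def amsgrad_step(x, g, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # AMSGrad update: like Adam, but divide by the running MAX of the
    # exponential moving average of squared gradients (vhat), so the
    # per-coordinate step size is non-increasing.
    m, v, vhat = state
    m = b1 * m + (1 - b1) * g              # first-moment EMA
    v = b2 * v + (1 - b2) * g * g          # second-moment EMA
    vhat = np.maximum(vhat, v)             # long-term memory of past v_t
    x = x - lr * m / (np.sqrt(vhat) + eps)
    return x, (m, v, vhat)
```

Replacing `vhat` with `v` in the update recovers (bias-correction-free) Adam, which is exactly the step the paper's counterexample breaks.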

Bargaining for Revenue Shares on Tree Trading Networks [article]

Arpita Ghosh, Satyen Kale, Kevin Lang, Benjamin Moseley
2013 arXiv   pre-print
We study trade networks with a tree structure, where a seller with a single indivisible good is connected to buyers, each with some value for the good, via a unique path of intermediaries. Agents in the tree make multiplicative revenue share offers to their parent nodes, who choose the best offer and offer part of it to their parent, and so on; the winning path is determined by who finally makes the highest offer to the seller. In this paper, we investigate how these revenue shares might be set via a natural bargaining process between agents on the tree, specifically, egalitarian bargaining between endpoints of each edge in the tree. We investigate the fixed point of this system of bargaining equations and prove various desirable properties of this solution concept, including (i) existence, (ii) uniqueness, (iii) efficiency, (iv) membership in the core, (v) strict monotonicity, and (vi) polynomial-time computability to any given accuracy. Finally, we present numerical evidence that asynchronous dynamics with randomly ordered updates always converges to the fixed point, indicating that the fixed point shares might arise from decentralized bargaining amongst agents on the trade network.
arXiv:1304.5822v1 fatcat:denzjxiuubfizny73yvqubeq74

Learning rotations with little regret

Elad Hazan, Satyen Kale, Manfred K. Warmuth
2016 Machine Learning  
We describe online algorithms for learning a rotation from pairs of unit vectors in R^n. We show that the expected regret of our online algorithm compared to the best fixed rotation chosen offline is O(√(nL)), where L is the loss of the best rotation. We also give a lower bound that proves that this expected regret bound is optimal within a constant factor. This resolves an open problem posed in COLT 2008. Our online algorithm for choosing a rotation matrix in each trial is based on the Follow-The-Perturbed-Leader paradigm. It adds a random spectral perturbation to the matrix characterizing the loss incurred so far and then chooses the best rotation matrix for that loss. We also show that any deterministic algorithm for learning rotations has Ω(T) regret in the worst case.
doi:10.1007/s10994-016-5548-x fatcat:v7azubwm6zcbfkcccbu7jhntby

Efficient Optimal Learning for Contextual Bandits [article]

Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, Tong Zhang
2011 arXiv   pre-print
We address the problem of learning in an online setting where the learner repeatedly observes features, selects among a set of actions, and receives reward for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses a cost sensitive classification learner as an oracle and has a running time $\mathrm{polylog}(N)$, where $N$ is the number of classification rules among which the oracle might choose. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay as in all previous work.
arXiv:1106.2369v1 fatcat:ws5uavavibchtbu7gpe6blysxy

SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs [article]

Satyen Kale, Ayush Sekhari, Karthik Sridharan
2021 arXiv   pre-print
Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) has been the method of choice for learning with large over-parameterized models. A popular theory for explaining why SGD works well in practice is that the algorithm has an implicit regularization that biases its output towards a good solution. Perhaps the theoretically most well understood learning setting for SGD is that of Stochastic Convex Optimization (SCO), where it is well known that SGD learns at a rate of $O(1/\sqrt{n})$, where $n$ is the number of samples. In this paper, we consider the problem of SCO and explore the role of implicit regularization, batch size and multiple epochs for SGD. Our main contributions are threefold: (a) We show that for any regularizer, there is an SCO problem for which Regularized Empirical Risk Minimization fails to learn. This automatically rules out any implicit regularization based explanation for the success of SGD. (b) We provide a separation between SGD and learning via Gradient Descent on empirical loss (GD) in terms of sample complexity. We show that there is an SCO problem such that GD with any step size and number of iterations can only learn at a suboptimal rate: at least $\widetilde{\Omega}(1/n^{5/12})$. (c) We present a multi-epoch variant of SGD commonly used in practice. We prove that this algorithm is at least as good as single pass SGD in the worst case. However, for certain SCO problems, taking multiple passes over the dataset can significantly outperform single pass SGD. We extend our results to the general learning setting by showing a problem which is learnable for any data distribution, and for this problem, SGD is strictly better than RERM for any regularization function. We conclude by discussing the implications of our results for deep learning, and show a separation between SGD and ERM for two layer diagonal neural networks.
arXiv:2107.05074v1 fatcat:uan3uyb2krbqdkiufemgfneuo4

Efficient Methods for Online Multiclass Logistic Regression [article]

Naman Agarwal, Satyen Kale, Julian Zimmert
2021 arXiv   pre-print
Kale. Newtron: an efficient bandit algorithm for online multiclass prediction. In Advances in Neural Information Processing Systems, pages 891-899, 2011. Elad Hazan, Amit Agarwal, and Satyen Kale.  ...  Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, pages 2323-2331. PMLR, 2015.  ... 
arXiv:2110.03020v2 fatcat:cx3pfnihyvc5losso633fi4ude

An Expansion Tester for Bounded Degree Graphs [chapter]

Satyen Kale, C. Seshadhri
2008 Lecture Notes in Computer Science  
We consider the problem of testing graph expansion (either vertex or edge) in the bounded degree model [10]. We give a property tester that, given a graph with degree bound d, an expansion bound α, and a parameter ε > 0, accepts the graph with high probability if its expansion is more than α, and rejects it with high probability if it is ε-far from any graph with expansion α′ with degree bound d, where α′ < α is a function of α. For edge expansion, we obtain α′ = Ω(α^2/d), and for vertex expansion, we obtain α′ = Ω(α^2/d^2). In either case, the algorithm runs in time Õ(n^((1+µ)/2) d^2 / (ε α^2)) for any given constant µ > 0.
doi:10.1007/978-3-540-70575-8_43 fatcat:w3kjfjxhzfcplduhrpgr6efklu
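Testers in this family probe how fast lazy random walks mix: on a good expander the endpoint distribution of short walks is near-uniform, so few pairs of walks end at the same vertex, while a sparse cut traps walks and inflates collisions. The following is a toy sketch of that collision-counting primitive only, not the paper's tester or its parameter settings; all names and constants are assumptions made for the example.

```python
import random
from collections import Counter

def endpoint_collisions(adj, start, d, walk_len, num_walks, rng):
    # Run lazy random walks from `start` (stay put with prob 1/2, else pick
    # one of d slots uniformly and move only if that slot is a real
    # neighbour, which pads every vertex to degree bound d) and count
    # pairwise collisions among the walk endpoints. Fewer collisions
    # indicate faster mixing, i.e. better expansion around `start`.
    ends = []
    for _ in range(num_walks):
        v = start
        for _ in range(walk_len):
            if rng.random() < 0.5:
                continue                  # lazy step
            i = rng.randrange(d)          # slot in the padded neighbourhood
            if i < len(adj[v]):
                v = adj[v][i]
        ends.append(v)
    counts = Counter(ends)
    return sum(c * (c - 1) // 2 for c in counts.values())
```

Comparing the count against what a near-uniform endpoint distribution would produce is the accept/reject test; turning that comparison into the guarantees above is where the paper's analysis does its work.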

Adaptive Feature Selection: Computationally Efficient Online Sparse Linear Regression under RIP [article]

Satyen Kale, Zohar Karnin, Tengyuan Liang, Dávid Pál
2017 arXiv   pre-print
Kale [2014] posed the open question of whether it is possible to design an efficient algorithm for the problem with a sublinear regret bound.  ...  This bound has optimal dependence on T , since even in the full information setting where all features are observed there is a lower bound of Ω(log T ) [Hazan and Kale, 2014] .  ... 
arXiv:1706.04690v1 fatcat:ugrzmhbqpjhcvi6agga2mrx76a