A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL.
The file type is `application/pdf`


### Online Sparse Linear Regression
[article]

2016 *arXiv* pre-print

This computational hardness result resolves an open problem presented in COLT 2014 (Kale, 2014) and also posed by Zolghadr et al. (2013). ... Also, Kale (2014) sketches a different algorithm with performance guarantees similar to the algorithm presented in this paper; our work builds upon that sketch and gives tighter regret bounds. ...

arXiv:1603.02250v1
fatcat:gcp5loocojekxc5n3mo3gpyn74
### Federated Functional Gradient Boosting
[article]

2021 *arXiv* pre-print

In this paper, we initiate a study of functional minimization in Federated Learning. First, in the semi-heterogeneous setting, when the marginal distributions of the feature vectors on client machines are identical, we develop the federated functional gradient boosting (FFGB) method that provably converges to the global minimum. Subsequently, we extend our results to the fully-heterogeneous setting (where marginal distributions of feature vectors may differ) by designing an efficient variant of FFGB called FFGB.C, with provable convergence to a neighborhood of the global minimum within a radius that depends on the total variation distances between the client feature distributions. For the special case of square loss, but still in the fully heterogeneous setting, we design the FFGB.L method that also enjoys provable convergence to a neighborhood of the global minimum but within a radius depending on the much tighter Wasserstein-1 distances. For both FFGB.C and FFGB.L, the radii of convergence shrink to zero as the feature distributions become more homogeneous. Finally, we conduct proof-of-concept experiments to demonstrate the benefits of our approach against natural baselines.

arXiv:2103.06972v1
fatcat:io3lm3pozzg5vlizx5iu3uviia
### Combinatorial Approximation Algorithms for MaxCut using Random Walks
[article]

2010 *arXiv* pre-print

Arora and Kale [AK07] gave an efficient near-linear-time implementation of the SDP algorithm for MaxCut. ...

arXiv:1008.3938v1
fatcat:ghgprkco55hgxns7bkidj3jd6y
### Online Gradient Boosting
[article]

2015 *arXiv* pre-print

We extend the theory of boosting for regression problems to the online learning setting. Generalizing from the batch setting for boosting, the notion of a weak learning algorithm is modeled as an online learning algorithm with linear loss functions that competes with a base class of regression functions, while a strong learning algorithm is an online learning algorithm with convex loss functions that competes with a larger class of regression functions. Our main result is an online gradient boosting algorithm which converts a weak online learning algorithm into a strong one where the larger class of functions is the linear span of the base class. We also give a simpler boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the convex hull of the base class, and prove its optimality.

arXiv:1506.04820v2
fatcat:7e3h67djebggfaelbhkfj7gpce
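The weak-to-strong reduction described in this abstract can be sketched in a toy form. Everything below is an illustrative assumption, not the paper's algorithm: the base class is just the constant predictors in [-1, 1], the weak learner is online gradient descent over that class, the loss is squared loss, and each weak learner receives the gradient of the loss at the running partial sum as its linear loss.

```python
class WeakLearner:
    """Toy weak online learner: online gradient descent over constant
    predictors (the base class here is just the constants in [-1, 1])."""
    def __init__(self, lr=0.1):
        self.c = 0.0
        self.lr = lr

    def predict(self, x):
        return self.c

    def update(self, grad):
        # The linear loss g * h(x) has gradient g w.r.t. the constant c.
        self.c -= self.lr * grad
        self.c = max(-1.0, min(1.0, self.c))


def online_gradient_boost(stream, n_weak=10, eta=0.3):
    """Sketch of the weak-to-strong reduction: weak learner i is fed the
    gradient of the convex loss evaluated at the partial sum of the first
    i weak predictions, which for it is just a linear loss."""
    learners = [WeakLearner() for _ in range(n_weak)]
    total_loss = 0.0
    for x, y in stream:
        partial = 0.0
        grads = []
        for h in learners:
            grads.append(partial - y)   # d/dp of 0.5*(p - y)^2 at the partial sum
            partial += eta * h.predict(x)
        total_loss += 0.5 * (partial - y) ** 2
        for h, g in zip(learners, grads):
            h.update(g)
    return total_loss, learners


stream = [(None, 0.5) for _ in range(5000)]   # fixed target 0.5
loss, learners = online_gradient_boost(stream)
# the boosted prediction tracks the target, up to the coarse
# discretization of this toy base class
final_pred = sum(0.3 * h.predict(None) for h in learners)
```

The key design point is that the booster never needs the weak learners to handle the convex loss directly; they only ever see linear losses, which matches the weak-learnability definition in the abstract.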
### An optimal algorithm for stochastic strongly-convex optimization
[article]

2010 *arXiv* pre-print

We consider stochastic convex optimization with a strongly convex (but not necessarily smooth) objective. We give an algorithm which performs only gradient updates with optimal rate of convergence.

arXiv:1006.2425v1
fatcat:w4l3ko76wjai3azbmbek37k5ti
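For context, here is a minimal sketch of plain SGD on a strongly convex objective with the classical 1/(λt) step size and suffix averaging. This is a generic illustration of the problem setting, not the paper's algorithm; the quadratic objective, Gaussian gradient noise, and averaging scheme are all assumptions made for the example.

```python
import random


def sgd_strongly_convex(T=20000, lam=1.0, seed=0):
    """SGD with step size 1/(lam * t) on f(w) = (lam/2) * (w - 1)^2,
    observing noisy gradients. Averaging the last half of the iterates
    (suffix averaging) gives an O(1/T) suboptimality rate in this
    strongly convex setting."""
    rng = random.Random(seed)
    w = 0.0
    suffix = []
    for t in range(1, T + 1):
        g = lam * (w - 1.0) + rng.gauss(0.0, 1.0)  # unbiased noisy gradient
        w -= g / (lam * t)
        if t > T // 2:
            suffix.append(w)
    return sum(suffix) / len(suffix)


w_hat = sgd_strongly_convex()
# w_hat lands close to the minimizer w* = 1
```

Note that averaging matters: the last iterate alone has higher variance, which is one reason algorithms in this literature combine gradient steps with some form of averaging.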
### Optimal and Adaptive Algorithms for Online Boosting
[article]

2015 *arXiv* pre-print

We study online boosting, the task of converting any weak online learner into a strong online learner. Based on a novel and natural definition of weak online learnability, we develop two online boosting algorithms. The first algorithm is an online version of boost-by-majority. By proving a matching lower bound, we show that this algorithm is essentially optimal in terms of the number of weak learners and the sample complexity needed to achieve a specified accuracy. This optimal algorithm is not adaptive, however. Using tools from online loss minimization, we derive an adaptive online boosting algorithm that is also parameter-free, but not optimal. Both algorithms work with base learners that can handle example importance weights directly, as well as by rejection sampling examples with probability defined by the booster. Results are complemented with an extensive experimental study.

arXiv:1502.02651v1
fatcat:svaj4rrgxfgfxbbzta54l5vuq4
### Near-Optimal Algorithms for Online Matrix Prediction
[article]

2012 *arXiv* pre-print

The algorithm, forms of which independently appeared in the work of Tsuda et al. [2006] and Arora and Kale [2007], performs exponentiated gradient steps followed by Bregman projections onto K. ... Obtain loss matrix L_t; update X_{t+1} = arg min_{X ∈ K} ∆(X, exp(log(X_t) − η L_t)). ... Algorithm 1 has the following regret bound (essentially following Tsuda et al. [2006], Arora and Kale ...). ...

arXiv:1204.0136v1
fatcat:6jhsc2asvfharetxem3zlkrxly
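The update in the snippet can be made concrete when K is specialized to the set of trace-one PSD matrices, where the Bregman projection under the quantum relative entropy ∆ reduces to trace normalization. The choice of K, the loss matrix, and the learning rate below are illustrative assumptions, not the paper's general setting.

```python
import numpy as np


def matrix_exp(S):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.exp(vals)) @ vecs.T


def matrix_log(X):
    """Matrix logarithm of a symmetric PSD matrix; tiny eigenvalues are
    clipped to guard against floating-point round-off going negative."""
    vals, vecs = np.linalg.eigh(X)
    vals = np.maximum(vals, 1e-300)
    return (vecs * np.log(vals)) @ vecs.T


def mmw_step(X, L, eta):
    """One exponentiated-gradient step X <- exp(log X - eta * L), followed
    by the Bregman projection onto trace-one PSD matrices, which for the
    quantum relative entropy is simply trace normalization."""
    Y = matrix_exp(matrix_log(X) - eta * L)
    return Y / np.trace(Y)


n = 4
X = np.eye(n) / n                      # maximally mixed starting point
L = np.diag([1.0, 0.0, 0.0, 0.0])      # loss concentrated on one direction
for _ in range(200):
    X = mmw_step(X, L, eta=0.5)
# probability mass flows away from the high-loss direction,
# so X[0, 0] ends up negligibly small
```

With a general convex K the projection step is the expensive part; specializing K to the spectraplex, as above, is what makes the matrix-multiplicative-weights form of the update so cheap.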
### On the Convergence of Adam and Beyond
[article]

2019 *arXiv* pre-print

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSProp, Adam, Adadelta, and Nadam, are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of the Adam algorithm. Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with 'long-term memory' of past gradients, and we propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance.

arXiv:1904.09237v1
fatcat:ctg52u4p5fgufdfxrgwmvydvlu
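The "long-term memory" fix takes the form, in the paper's AMSGrad variant, of a running maximum of the second-moment estimate, so the effective step size can never grow back. The scalar objective and hyperparameters below are illustrative assumptions for a minimal sketch.

```python
import math


def amsgrad(grad_fn, w0, steps=2000, lr=0.1, b1=0.9, b2=0.99, eps=1e-8):
    """Minimal scalar sketch of the AMSGrad update: identical to Adam
    except that the denominator uses the running maximum v_hat of the
    second-moment estimate v, giving 'long-term memory' of past
    gradients and a non-increasing effective step size."""
    w, m, v, v_hat = w0, 0.0, 0.0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g          # first-moment EMA (as in Adam)
        v = b2 * v + (1 - b2) * g * g      # second-moment EMA (as in Adam)
        v_hat = max(v_hat, v)              # the fix: never forget large gradients
        w -= lr * m / (math.sqrt(v_hat) + eps)
    return w


# minimize f(w) = (w - 3)^2
w_star = amsgrad(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Plain Adam would replace `v_hat` with `v` in the update; the counterexample in the paper exploits exactly the fact that `v` can shrink between rare large-gradient rounds, which `max` rules out.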
### Bargaining for Revenue Shares on Tree Trading Networks
[article]

2013 *arXiv* pre-print

We study trade networks with a tree structure, where a seller with a single indivisible good is connected to buyers, each with some value for the good, via a unique path of intermediaries. Agents in the tree make multiplicative revenue share offers to their parent nodes, who choose the best offer and offer part of it to their parent, and so on; the winning path is determined by who finally makes the highest offer to the seller. In this paper, we investigate how these revenue shares might be set via a natural bargaining process between agents on the tree, specifically, egalitarian bargaining between endpoints of each edge in the tree. We investigate the fixed point of this system of bargaining equations and prove various desirable properties for this solution concept, including (i) existence, (ii) uniqueness, (iii) efficiency, (iv) membership in the core, (v) strict monotonicity, (vi) polynomial-time computability to any given accuracy. Finally, we present numerical evidence that asynchronous dynamics with randomly ordered updates always converges to the fixed point, indicating that the fixed point shares might arise from decentralized bargaining amongst agents on the trade network.

arXiv:1304.5822v1
fatcat:denzjxiuubfizny73yvqubeq74
### Learning rotations with little regret

2016 *Machine Learning*

We describe online algorithms for learning a rotation from pairs of unit vectors in R^n. We show that the expected regret of our online algorithm compared to the best fixed rotation chosen offline is O(√(nL)), where L is the loss of the best rotation. We also give a lower bound that proves that this expected regret bound is optimal within a constant factor. This resolves an open problem posed in COLT 2008. Our online algorithm for choosing a rotation matrix in each trial is based on the Follow-The-Perturbed-Leader paradigm. It adds a random spectral perturbation to the matrix characterizing the loss incurred so far and then chooses the best rotation matrix for that loss. We also show that any deterministic algorithm for learning rotations has Ω(T) regret in the worst case.

doi:10.1007/s10994-016-5548-x
fatcat:v7azubwm6zcbfkcccbu7jhntby
### Efficient Optimal Learning for Contextual Bandits
[article]

2011 *arXiv* pre-print

We address the problem of learning in an online setting where the learner repeatedly observes features, selects among a set of actions, and receives reward for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses a cost sensitive classification learner as an oracle and has a running time $\mathrm{polylog}(N)$, where $N$ is the number of classification rules among which the oracle might choose. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay, as in all previous work.

arXiv:1106.2369v1
fatcat:ws5uavavibchtbu7gpe6blysxy
### SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs
[article]

2021 *arXiv* pre-print

Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) has been the method of choice for learning with large over-parameterized models. A popular theory for explaining why SGD works well in practice is that the algorithm has an implicit regularization that biases its output towards a good solution. Perhaps the theoretically most well understood learning setting for SGD is that of Stochastic Convex Optimization (SCO), where it is well known that SGD learns at a rate of $O(1/\sqrt{n})$, where $n$ is the number of samples. In this paper, we consider the problem of SCO and explore the role of implicit regularization, batch size and multiple epochs for SGD. Our main contributions are threefold: (a) We show that for any regularizer, there is an SCO problem for which Regularized Empirical Risk Minimization fails to learn. This automatically rules out any implicit regularization based explanation for the success of SGD. (b) We provide a separation between SGD and learning via Gradient Descent on empirical loss (GD) in terms of sample complexity. We show that there is an SCO problem such that GD with any step size and number of iterations can only learn at a suboptimal rate: at least $\widetilde{\Omega}(1/n^{5/12})$. (c) We present a multi-epoch variant of SGD commonly used in practice. We prove that this algorithm is at least as good as single pass SGD in the worst case. However, for certain SCO problems, taking multiple passes over the dataset can significantly outperform single pass SGD. We extend our results to the general learning setting by showing a problem which is learnable for any data distribution, and for this problem, SGD is strictly better than RERM for any regularization function. We conclude by discussing the implications of our results for deep learning, and show a separation between SGD and ERM for two layer diagonal neural networks.

arXiv:2107.05074v1
fatcat:uan3uyb2krbqdkiufemgfneuo4
### Efficient Methods for Online Multiclass Logistic Regression
[article]

2021 *arXiv* pre-print

Kale. Newtron: an efficient bandit algorithm for online multiclass prediction. In Advances in Neural Information Processing Systems, pages 891-899, 2011. Elad Hazan, Amit Agarwal, and Satyen Kale. ... Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, pages 2323-2331. PMLR, 2015. ...
### An Expansion Tester for Bounded Degree Graphs
[chapter]

2008 *Lecture Notes in Computer Science*

We consider the problem of testing graph expansion (either vertex or edge) in the bounded degree model [10]. We give a property tester that, given a graph with degree bound d, an expansion bound α, and a parameter ε > 0, accepts the graph with high probability if its expansion is more than α, and rejects it with high probability if it is ε-far from any graph with expansion α′ with degree bound d, where α′ < α is a function of α. For edge expansion, we obtain α′ = Ω(α²/d), and for vertex expansion, we obtain α′ = Ω(α²/d²). In either case, the algorithm runs in time Õ(n^((1+µ)/2) d² / (ε α²)) for any given constant µ > 0.

doi:10.1007/978-3-540-70575-8_43
fatcat:w3kjfjxhzfcplduhrpgr6efklu
### Adaptive Feature Selection: Computationally Efficient Online Sparse Linear Regression under RIP
[article]

2017 *arXiv* pre-print

Kale [2014] posed the open question of whether it is possible to design an efficient algorithm for the problem with a sublinear regret bound. ... This bound has optimal dependence on T, since even in the full information setting where all features are observed there is a lower bound of Ω(log T) [Hazan and Kale, 2014]. ...

*Showing results 1 — 15 out of 240 results*