We study an expansion of the log-likelihood in undirected graphical models such as the Restricted Boltzmann Machine (RBM), where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables (the visible vector and the hidden vector, in RBMs). We are particularly interested in estimators of the gradient of the log-likelihood obtained through this expansion. We show that its residual term converges to zero, justifying the use of a truncation, i.e. running only a short Gibbs chain, which is the main idea behind the Contrastive Divergence (CD) estimator of the log-likelihood gradient. By truncating even more, we obtain a stochastic reconstruction error, related through a mean-field approximation to the reconstruction error often used to train auto-associators and stacked auto-associators. The derivation is not specific to the particular parametric forms used in RBMs and only requires convergence of the Gibbs chain. We present theoretical and empirical evidence linking the number of Gibbs steps k and the magnitude of the RBM parameters to the bias in the CD estimator. These experiments also suggest that the sign of the CD estimator is ...
doi:10.1162/neco.2008.11-07-647 pmid:19018704
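The truncated Gibbs chain behind CD-k can be sketched for a binary RBM as follows. This is a minimal illustration of the standard CD-k recipe (positive phase on the data, negative phase on the chain's k-th sample); the exact sampling conventions here (binary sampling of both layers, probabilities kept in the positive phase) are one common choice, not necessarily the precise variant analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_gradient(W, b, c, v0, k=1, rng=rng):
    """One CD-k estimate of the log-likelihood gradient of a binary RBM
    with weights W, visible bias b, hidden bias c.
    Runs a short Gibbs chain of k steps starting from the data v0."""
    # Positive phase: hidden probabilities given the data.
    h0_prob = sigmoid(v0 @ W + c)
    v = v0.copy()
    h_prob = h0_prob
    for _ in range(k):
        h = (rng.random(h_prob.shape) < h_prob).astype(float)   # sample h
        v_prob = sigmoid(h @ W.T + b)
        v = (rng.random(v_prob.shape) < v_prob).astype(float)   # sample v
        h_prob = sigmoid(v @ W + c)
    # Negative phase uses the chain's last sample (the truncation).
    dW = np.outer(v0, h0_prob) - np.outer(v, h_prob)
    db = v0 - v
    dc = h0_prob - h_prob
    return dW, db, dc
```

Running more steps (larger k) reduces the bias of this estimate at the cost of more Gibbs sampling, which is the trade-off the abstract discusses.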
This observation leads to a more general cost criterion involving a trade-off between (11.9) and (11.10) (Belkin et al., Delalleau et al.). ... regularization, which naturally leads to a regularization term based on the graph Laplacian (Belkin and Niyogi, Joachims, Zhou et al., Zhu et al., Belkin et al., Delalleau ...
doi:10.7551/mitpress/9780262033589.003.0011
We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums) through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now; they help motivate recent research on learning deep sum-product networks and, more generally, research in Deep Learning.
dblp:conf/nips/DelalleauB11
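To make the architecture concrete: a shallow sum-product network has one hidden layer of product units feeding a single weighted-sum output. The sketch below evaluates such a network; the structure is purely illustrative and not one of the hard function families from the paper.

```python
import numpy as np

def shallow_sum_of_products(x, weights, monomials):
    """Evaluate a one-hidden-layer sum-product network: each hidden
    unit computes the product of a subset of the inputs (a monomial),
    and the output unit is their weighted sum."""
    hidden = [np.prod(x[list(m)]) for m in monomials]
    return float(np.dot(weights, hidden))
```

For example, with inputs (1, 2, 3, 4), hidden units computing x0*x1 and x2*x3, and unit output weights, the network computes 2 + 12 = 14. The paper's point is that some functions need exponentially many such hidden product units when the network is shallow, but only polynomially many units when sums and products are stacked in multiple layers.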
This article is inspired by previous work that has shown such limitations in the case of kernel methods with a local kernel (Bengio, Delalleau, and Le Roux 2006a) as well as in the case of so-called ...
doi:10.1111/j.1467-8640.2010.00366.x
Experimental Results: We performed experiments on the 2-D double moon toy dataset (as used in Delalleau, Bengio and Le Roux, 2005), to be able to compare with the exact version of the algorithm. ...
dblp:conf/nips/BengioRVDM05
Lecture Notes in Computer Science
On the Expressive Power of Deep Architectures. Yoshua Bengio and Olivier Delalleau. ...
doi:10.1007/978-3-642-24412-4_3
While most current research in Reinforcement Learning (RL) focuses on improving the performance of the algorithms in controlled environments, the use of RL under constraints like those met in the video game industry is rarely studied. Operating under such constraints, we propose Hybrid SAC, an extension of the Soft Actor-Critic algorithm able to handle discrete, continuous and parameterized actions in a principled way. We show that Hybrid SAC can successfully solve a high-speed driving task in one of our games, and is competitive with the state of the art on parameterized-actions benchmark tasks. We also explore the impact of using normalizing flows to enrich the expressiveness of the policy at minimal computational cost, and identify a potential undesired effect of SAC when used with normalizing flows, which may be addressed by optimizing a different objective.
arXiv:1912.11077v1
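A parameterized action couples a discrete choice with continuous parameters attached to that choice. The sketch below samples one such action from a policy head; the names, shapes, and tanh-squashed Gaussian are illustrative assumptions, not the exact Hybrid SAC architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hybrid_action(logits, means, log_stds, rng=rng):
    """Sample one parameterized action: a discrete choice plus the
    continuous parameters attached to that choice.
    `logits` has shape (n_discrete,); `means` and `log_stds` have
    shape (n_discrete, n_params)."""
    probs = np.exp(logits - logits.max())           # stable softmax
    probs /= probs.sum()
    a = rng.choice(len(logits), p=probs)            # discrete component
    eps = rng.standard_normal(means.shape[1])
    # Continuous parameters for the chosen action: squashed Gaussian.
    params = np.tanh(means[a] + np.exp(log_stds[a]) * eps)
    return a, params
```

Training such a policy with SAC additionally requires log-probabilities of both components for the entropy term, which is where the principled treatment in the paper comes in.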
Studies in Fuzziness and Soft Computing
(..., 2004; Belkin, Matveeva and Niyogi, 2004; Delalleau, Bengio and Le Roux, 2005), which also fall in this category and share many ideas with manifold learning algorithms. ... The graph-based algorithms we consider here can be seen as minimizing the following cost function, as shown in (Delalleau, Bengio and Le Roux, 2005):

C(Ŷ) = ‖Ŷ_l − Y_l‖² + µ ŶᵀLŶ + µε‖Ŷ‖²   (9)

with Ŷ ...
dblp:conf/nips/BengioDR05
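Because the cost in (9) is quadratic in Ŷ, setting its gradient to zero gives a linear system with a closed-form solution. The sketch below solves it directly on a small graph; L = D − W is the graph Laplacian, S selects the labeled examples, and ε is the small regularizer that keeps the system well-posed. This is a minimal sketch of that quadratic criterion, not the paper's full algorithm.

```python
import numpy as np

def graph_ssl_labels(W, y_labeled, labeled_idx, mu=0.5, eps=1e-3):
    """Closed-form minimizer of
        C(Yhat) = ||Yhat_l - Y_l||^2 + mu * Yhat' L Yhat + mu*eps*||Yhat||^2
    where L = D - W is the graph Laplacian of the similarity matrix W.
    Solving grad C = 0 gives (S + mu*L + mu*eps*I) Yhat = S Y."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian
    S = np.zeros((n, n))
    S[labeled_idx, labeled_idx] = 1.0       # selects labeled examples
    y = np.zeros(n)
    y[labeled_idx] = y_labeled
    return np.linalg.solve(S + mu * L + mu * eps * np.eye(n), S @ y)
```

On a three-node chain with a single positive label at one end, the Laplacian smoothness term propagates the label along the chain, which is exactly the label-propagation behavior these graph-based methods exploit.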
One should keep in mind that the EM algorithm assumes the missing ... Olivier Delalleau, Aaron Courville and Yoshua Bengio are with the Department of Computer Science and Operations Research, University of ...
arXiv:1209.0521v2
There has been an increase of interest in semi-supervised learning recently, because of the many datasets with large amounts of unlabeled examples and only a few labeled ones. This paper follows up on proposed non-parametric algorithms which provide an estimated continuous label for the given unlabeled examples. First, it extends them to function induction algorithms that minimize a regularization criterion applied to an out-of-sample example, and happen to have the form of Parzen windows regressors. This allows predicting test labels without solving again a linear system of dimension n (the number of unlabeled and labeled training examples), which can cost O(n³). Second, this function induction procedure gives rise to an efficient approximation of the training process, reducing the linear system to be solved to m ≪ n unknowns, using only a subset of m examples. An improvement of O(n²/m²) in time can thus be obtained. Comparative experiments are presented, showing the good performance of the induction formula and approximation algorithm.
dblp:conf/aistats/DelalleauBR05
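A Parzen-windows-style induction formula predicts the label of a new point as a kernel-weighted average of the already-estimated training labels, so no n×n system needs to be re-solved at test time. The sketch below uses a Gaussian kernel; the paper's exact normalization may differ, so treat this as an illustration of the functional form only.

```python
import numpy as np

def induce_label(x_new, X_train, y_hat, sigma=1.0):
    """Predict the label of x_new as a kernel-weighted average of the
    estimated labels y_hat of the training points, Parzen-windows style.
    X_train has shape (n, d); y_hat has shape (n,)."""
    # Gaussian kernel weights between x_new and every training point.
    w = np.exp(-np.sum((X_train - x_new) ** 2, axis=1) / (2 * sigma ** 2))
    return np.sum(w * y_hat) / np.sum(w)
```

Evaluating this formula costs O(n) per test point, versus the O(n³) of re-solving the full linear system with the test point included.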
Codistillation has been proposed as a mechanism to share knowledge among concurrently trained models by encouraging them to represent the same function through an auxiliary loss. This contrasts with the more commonly used fully-synchronous data-parallel stochastic gradient descent methods, where different model replicas average their gradients (or parameters) at every iteration and thus maintain identical parameters. We investigate codistillation in a distributed training setup, complementing previous work which focused on extremely large batch sizes. Surprisingly, we find that even at moderate batch sizes, models trained with codistillation can perform as well as models trained with synchronous data-parallel methods, despite using a much weaker synchronization mechanism. These findings hold across a range of batch sizes and learning rate schedules, as well as different kinds of models and datasets. Obtaining this level of accuracy, however, requires properly accounting for the regularization effect of codistillation, which we highlight through several empirical observations. Overall, this work contributes to a better understanding of codistillation and how to best take advantage of it in a distributed computing environment.
arXiv:2010.02838v2
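The auxiliary loss at the heart of codistillation adds a distillation term that pulls each replica's predictions toward those of another replica, on top of the usual supervised loss. The sketch below uses cross-entropy against the other replica's softmax outputs; the weighting and exact form are illustrative assumptions, not the paper's precise objective.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def codistillation_loss(logits_self, logits_other, labels, alpha=0.5):
    """Supervised cross-entropy on the true labels plus an auxiliary
    term pulling this replica's predictions toward the other replica's
    predictions (treated as a fixed target; no gradient flows through
    logits_other).  alpha weights the distillation term."""
    p = softmax(logits_self)
    q = softmax(logits_other)
    n = len(labels)
    ce = -np.mean(np.log(p[np.arange(n), labels] + 1e-12))
    distill = -np.mean(np.sum(q * np.log(p + 1e-12), axis=1))  # H(q, p)
    return ce + alpha * distill
```

Because the replicas only exchange predictions (not gradients or parameters at every step), this is the "much weaker synchronization mechanism" the abstract contrasts with fully-synchronous data parallelism.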
In this paper, we show a direct relation between spectral embedding methods and kernel PCA, and how both are special cases of a more general learning problem, that of learning the principal eigenfunctions of an operator defined from a kernel and the unknown data-generating density. Whereas spectral embedding methods only provided coordinates for the training points, the analysis justifies a simple extension to out-of-sample examples (the Nyström formula) for Multi-Dimensional Scaling, spectral clustering, Laplacian eigenmaps, Locally Linear Embedding (LLE) and Isomap. The analysis provides, for all such spectral embedding methods, the definition of a loss function, whose empirical average is minimized by the traditional algorithms. The asymptotic expected value of that loss defines a generalization performance and clarifies what these algorithms are trying to learn. Experiments with LLE, Isomap, spectral clustering and MDS show that this out-of-sample embedding formula generalizes well, with a level of error comparable to the effect of small perturbations of the training set on the embedding.
doi:10.1162/0899766041732396 pmid:15333211
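The Nyström out-of-sample extension can be sketched in a few lines: eigendecompose the training Gram matrix once, then embed a new point from its kernel values against the training set. The sketch below uses a plain (uncentered) Gaussian kernel for brevity; as the paper describes, MDS, LLE, Isomap, and spectral clustering each require their own data-dependent kernel to fit this template.

```python
import numpy as np

def nystrom_embed(x_new, X, k=2, sigma=1.0):
    """Out-of-sample (Nystrom) extension of a spectral embedding:
    project the kernel row of x_new onto the top-k eigenvectors of the
    training Gram matrix, scaled by the inverse eigenvalues."""
    def kern(a, b):
        return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))
    K = kern(X[:, None, :], X[None, :, :])         # n x n Gram matrix
    lam, V = np.linalg.eigh(K)
    lam, V = lam[::-1][:k], V[:, ::-1][:, :k]      # top-k eigenpairs
    kx = kern(X, x_new)                            # kernel row for x_new
    return (V.T @ kx) / lam                        # Nystrom formula
```

A useful sanity check: applying the formula to a training point recovers that point's row of the top-k eigenvectors, so the extension agrees with the original embedding on the training set.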
We compare the recently proposed Discriminative Restricted Boltzmann Machine to the classical Support Vector Machine on a challenging classification task that consists of identifying weapon classes from audio signals. The three weapon classes considered in this work (mortar, rocket and rocket-propelled grenade) are difficult to reliably classify with standard techniques, since they tend to have similar acoustic signatures. In addition, specificities of the data available in this study make it challenging to rigorously compare classifiers, and we address methodological issues arising from this situation. Experiments show good classification accuracy that could make these techniques suitable for fielding on autonomous devices. Discriminative Restricted Boltzmann Machines appear to yield better accuracy than Support Vector Machines, and are less sensitive to the choice of signal preprocessing and model hyperparameters. This last property is especially appealing in such a task, where the lack of data makes model validation difficult.
doi:10.1111/j.1467-8640.2012.00419.x
Despite CD's popularity, it does not yield the best approximation of the log-likelihood gradient (Carreira-Perpiñán & Hinton, 2005; Bengio & Delalleau, 2009). ...
dblp:journals/jmlr/DesjardinsCBVD10