Justifying and Generalizing Contrastive Divergence
2009
Neural Computation
We study an expansion of the log-likelihood in undirected graphical models such as the Restricted Boltzmann Machine (RBM), where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables (the visible vector and the hidden vector, in RBMs). We are particularly interested in estimators of the gradient of the log-likelihood obtained through this expansion. We show that its residual term converges to zero, justifying the use of a truncation, running only a short Gibbs chain, which is the main idea behind the Contrastive Divergence (CD) estimator of the log-likelihood gradient. By truncating even more, we obtain a stochastic reconstruction error, related through a mean-field approximation to the reconstruction error often used to train auto-associators and stacked auto-associators. The derivation is not specific to the particular parametric forms used in RBMs, and only requires convergence of the Gibbs chain. We present theoretical and empirical evidence linking the number of Gibbs steps k and the magnitude of the RBM parameters to the bias in the CD estimator. These experiments also suggest that the sign of the CD estimator is ...
doi:10.1162/neco.2008.11-07-647
pmid:19018704
fatcat:cmm4n7r65fhmtbmohf2yxnqdfi
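The CD-k estimator discussed in this abstract can be sketched in a few lines for a binary RBM. The following NumPy sketch is illustrative only (the helper names, shapes and sampling details are my own, not taken from the paper): it runs k alternating Gibbs steps starting from a training example and returns the usual positive-minus-negative statistics as the gradient estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_grad(W, b, c, v0, k=1):
    """One CD-k estimate of the log-likelihood gradient for a binary RBM.

    W: (n_visible, n_hidden) weights, b: visible biases, c: hidden biases,
    v0: one training visible vector (0/1 entries). Returns approximate gradients.
    """
    # Positive phase: hidden probabilities given the data.
    ph0 = sigmoid(v0 @ W + c)
    v = v0
    # k steps of alternating Gibbs sampling: h ~ p(h|v), then v ~ p(v|h).
    for _ in range(k):
        h = (rng.random(ph0.shape) < sigmoid(v @ W + c)).astype(float)
        v = (rng.random(v0.shape) < sigmoid(h @ W.T + b)).astype(float)
    phk = sigmoid(v @ W + c)
    # CD-k gradient estimate: positive statistics minus negative statistics.
    dW = np.outer(v0, ph0) - np.outer(v, phk)
    db = v0 - v
    dc = ph0 - phk
    return dW, db, dc
```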
Label Propagation and Quadratic Criterion
[chapter]
2006
Semi-Supervised Learning
This observation leads to a more general cost criterion involving a trade-off between (11.9) and (11.10) (Belkin et al. [2004], Delalleau et al. [2005]). ...
regularization, which naturally leads to a regularization term based on the graph Laplacian (Belkin and Niyogi [2003], Joachims [2003], Zhou et al. [2004], Zhu et al. [2003], Belkin et al. [2004], Delalleau ...
doi:10.7551/mitpress/9780262033589.003.0011
fatcat:hsu674c5d5fhnmmk2xfsx5dv5a
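As a minimal illustration of the label-propagation side of this chapter (my own notation, not the chapter's exact algorithm): labels are repeatedly averaged over graph neighbors, with the known labels clamped back after each sweep.

```python
import numpy as np

def label_propagation(W, y, labeled_mask, n_iters=100):
    """Propagate labels over a graph with affinity matrix W.

    W: (n, n) symmetric nonnegative affinities, y: (n,) initial labels
    (arbitrary on unlabeled points), labeled_mask: boolean (n,).
    """
    P = W / W.sum(axis=1, keepdims=True)       # row-normalized transition matrix
    y_hat = y.astype(float)
    for _ in range(n_iters):
        y_hat = P @ y_hat                      # average the neighbors' current labels
        y_hat[labeled_mask] = y[labeled_mask]  # clamp the known labels
    return y_hat
```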
Shallow vs. Deep Sum-Product Networks
2011
Neural Information Processing Systems
We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning.
dblp:conf/nips/DelalleauB11
fatcat:tqgz3xj54ngnhmbw4v5z6tigfq
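The flavor of the depth-vs-size separation can be illustrated with a toy example (my own construction, not one of the paper's function families): a product of n/2 pairwise sums is computed by a "deep" alternation of sums and products with O(n) units, while writing the same function as a single weighted sum of products requires exponentially many product units.

```python
from itertools import product

def deep_eval(x):
    """Evaluate f(x) = prod_i (x[2i] + x[2i+1]) with O(n) sum and product units."""
    result = 1.0
    for i in range(0, len(x), 2):
        result *= x[i] + x[i + 1]
    return result

def shallow_eval(x):
    """Same function as one sum over 2^(n/2) monomials (a single 'hidden layer')."""
    pairs = [(x[i], x[i + 1]) for i in range(0, len(x), 2)]
    total = 0.0
    for choice in product(*pairs):   # one monomial per choice of a term in each factor
        monomial = 1.0
        for v in choice:
            monomial *= v
        total += monomial
    return total

x = [1.0, 2.0, 0.5, 3.0, 2.0, 1.0]
assert abs(deep_eval(x) - shallow_eval(x)) < 1e-9   # both compute 3 * 3.5 * 3 = 31.5
```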
Decision Trees Do Not Generalize to New Variations
2010
Computational intelligence
This article is inspired by previous work that has shown such limitations in the case of kernel methods with a local kernel (Bengio, Delalleau, and Le Roux 2006a) as well as in the case of so-called ...
doi:10.1111/j.1467-8640.2010.00366.x
fatcat:tqyvj6kr2bhm7ggra6hg7e3om4
Convex Neural Networks
2005
Neural Information Processing Systems
Experimental Results: We performed experiments on the 2-D double moon toy dataset (as used in Delalleau, Bengio and Le Roux, 2005), to be able to compare with the exact version of the algorithm. ...
dblp:conf/nips/BengioRVDM05
fatcat:7szjzlln3vdgjcddgjf2q44j6y
On the Expressive Power of Deep Architectures
[chapter]
2011
Lecture Notes in Computer Science
Yoshua Bengio and Olivier Delalleau ...
doi:10.1007/978-3-642-24412-4_3
fatcat:dvr4psguy5gl5nze2gn2v3soni
Discrete and Continuous Action Representation for Practical RL in Video Games
[article]
2019
arXiv
pre-print
While most current research in Reinforcement Learning (RL) focuses on improving the performance of the algorithms in controlled environments, the use of RL under constraints like those met in the video game industry is rarely studied. Operating under such constraints, we propose Hybrid SAC, an extension of the Soft Actor-Critic algorithm able to handle discrete, continuous and parameterized actions in a principled way. We show that Hybrid SAC can successfully solve a high-speed driving task in one of our games, and is competitive with the state-of-the-art on parameterized actions benchmark tasks. We also explore the impact of using normalizing flows to enrich the expressiveness of the policy at minimal computational cost, and identify a potential undesired effect of SAC when used with normalizing flows, that may be addressed by optimizing a different objective.
arXiv:1912.11077v1
fatcat:znyoi6kog5fhfmoyso5f2yd6ka
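A minimal sketch of what a "hybrid" action looks like in the parameterized-action setting the abstract describes: a discrete choice sampled from a categorical distribution together with tanh-squashed continuous parameters, as in SAC. The names, shapes and pairing of components are my own assumptions; this is not the paper's Hybrid SAC implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hybrid_action(logits, mean, log_std):
    """Sample one hybrid action: a discrete choice plus continuous parameters.

    logits: (n_discrete,) scores for the discrete component,
    mean/log_std: (n_continuous,) Gaussian parameters for the continuous part.
    The continuous sample is squashed with tanh to stay in [-1, 1].
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over discrete actions
    discrete = rng.choice(len(logits), p=probs)  # e.g. which maneuver to perform
    continuous = np.tanh(mean + np.exp(log_std) * rng.standard_normal(mean.shape))
    return discrete, continuous                  # e.g. (maneuver, [steering, throttle])

a_d, a_c = sample_hybrid_action(np.array([0.1, 1.5, -0.3]),
                                np.array([0.0, 0.2]), np.array([-1.0, -1.0]))
```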
Spectral Dimensionality Reduction
[chapter]
2006
Studies in Fuzziness and Soft Computing
The Curse of Highly Variable Functions for Local Kernel Machines
2005
Neural Information Processing Systems
..., 2004; Belkin, Matveeva and Niyogi, 2004; Delalleau, Bengio and Le Roux, 2005), which also fall in this category, and share many ideas with manifold learning algorithms. ...
The graph-based algorithms we consider here can be seen as minimizing the following cost function, as shown in (Delalleau, Bengio and Le Roux, 2005): $C(\hat{Y}) = \|\hat{Y}_l - Y_l\|^2 + \mu\, \hat{Y}^{\top} L \hat{Y} + \mu\, \|\hat{Y}\|^2$ (Eq. 9), with $\hat{Y}$ ...
dblp:conf/nips/BengioDR05
fatcat:vpk5xvbjr5dd5pzriwfjn6cm44
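The quadratic criterion quoted above has a closed-form minimizer: setting its gradient to zero yields a linear system in the estimated labels. A small NumPy sketch under my own notation (S is a diagonal 0/1 matrix selecting the labeled points; the exact regularization terms vary slightly across the cited papers):

```python
import numpy as np

def minimize_quadratic_criterion(W, y, labeled_mask, mu=0.1):
    """Minimize ||Y_hat_l - Y_l||^2 + mu * Y_hat' L Y_hat + mu * ||Y_hat||^2.

    W: (n, n) symmetric affinity matrix, y: (n,) labels (0 on unlabeled points),
    labeled_mask: boolean (n,). L is the unnormalized graph Laplacian D - W.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W           # graph Laplacian
    S = np.diag(labeled_mask.astype(float))  # selects the labeled examples
    # Setting the gradient of the criterion to zero gives the linear system
    # (S + mu * L + mu * I) Y_hat = S y.
    A = S + mu * L + mu * np.eye(n)
    return np.linalg.solve(A, S @ y)
```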
Efficient EM Training of Gaussian Mixtures with Missing Data
[article]
2018
arXiv
pre-print
One should keep in mind that the EM algorithm assumes the missing ... Olivier Delalleau, Aaron Courville and Yoshua Bengio are with the Department of Computer Science and Operations Research, University of ...
arXiv:1209.0521v2
fatcat:4jn3cfkukfdw5pky3u3xu5tjym
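For context on the technique named in the title: with missing entries, the E-step conditions each Gaussian component on the observed coordinates of a point. The sketch below is my own minimal version of that conditioning step, not the paper's efficient algorithm (which is precisely about avoiding redundant work across points and components).

```python
import numpy as np

def conditional_mean_missing(mu, Sigma, x, observed_mask):
    """E[x_missing | x_observed] under one Gaussian component N(mu, Sigma).

    mu: (d,) mean, Sigma: (d, d) covariance, x: (d,) float data point whose
    entries where observed_mask is False are missing.
    """
    o = observed_mask
    m = ~observed_mask
    # Conditional Gaussian: E[x_m | x_o] = mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)
    Sigma_oo = Sigma[np.ix_(o, o)]
    Sigma_mo = Sigma[np.ix_(m, o)]
    x_filled = x.copy()
    x_filled[m] = mu[m] + Sigma_mo @ np.linalg.solve(Sigma_oo, x[o] - mu[o])
    return x_filled
```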
Efficient Non-Parametric Function Induction in Semi-Supervised Learning
2005
International Conference on Artificial Intelligence and Statistics
There has been an increase of interest for semi-supervised learning recently, because of the many datasets with large amounts of unlabeled examples and only a few labeled ones. This paper follows up on proposed non-parametric algorithms which provide an estimated continuous label for the given unlabeled examples. First, it extends them to function induction algorithms that minimize a regularization criterion applied to an out-of-sample example, and happen to have the form of Parzen windows regressors. This allows one to predict test labels without solving again a linear system of dimension n (the number of unlabeled and labeled training examples), which can cost O(n^3). Second, this function induction procedure gives rise to an efficient approximation of the training process, reducing the linear system to be solved to m ≪ n unknowns, using only a subset of m examples. An improvement of O(n^2/m^2) in time can thus be obtained. Comparative experiments are presented, showing the good performance of the induction formula and approximation algorithm.
dblp:conf/aistats/DelalleauBR05
fatcat:yej3f3zde5aozdniyi4jh4uuxy
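The induction formula described in this abstract has a Parzen-windows / weighted-average form: the label of a new point is a kernel-weighted average of the labels estimated on the training set, so no new n x n linear system has to be solved at test time. A small sketch under assumed notation (Gaussian kernel; y_hat are the labels obtained from the transductive step):

```python
import numpy as np

def induce_label(x, X_train, y_hat, bandwidth=1.0):
    """Predict a label for a new point x from previously estimated labels y_hat.

    Parzen-windows style weighted average of the training labels with a Gaussian
    kernel; avoids re-solving the training linear system for each test point.
    """
    sq_dists = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-sq_dists / (2.0 * bandwidth ** 2))   # kernel weights W(x, x_i)
    return float(w @ y_hat / w.sum())
```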
A Closer Look at Codistillation for Distributed Training
[article]
2021
arXiv
pre-print
Codistillation has been proposed as a mechanism to share knowledge among concurrently trained models by encouraging them to represent the same function through an auxiliary loss. This contrasts with the more commonly used fully-synchronous data-parallel stochastic gradient descent methods, where different model replicas average their gradients (or parameters) at every iteration and thus maintain identical parameters. We investigate codistillation in a distributed training setup, complementing previous work which focused on extremely large batch sizes. Surprisingly, we find that even at moderate batch sizes, models trained with codistillation can perform as well as models trained with synchronous data-parallel methods, despite using a much weaker synchronization mechanism. These findings hold across a range of batch sizes and learning rate schedules, as well as different kinds of models and datasets. Obtaining this level of accuracy, however, requires properly accounting for the regularization effect of codistillation, which we highlight through several empirical observations. Overall, this work contributes to a better understanding of codistillation and how to best take advantage of it in a distributed computing environment.
arXiv:2010.02838v2
fatcat:s7ugrgic2rdfdik5rm2neuwwxe
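The auxiliary loss at the heart of codistillation can be sketched abstractly: each replica minimizes its usual task loss plus a term pulling its predictions toward a peer replica's (typically stale) predictions. The NumPy sketch below uses cross-entropy on softmax outputs; the names and the weighting coefficient alpha are assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def codistillation_loss(logits, labels, peer_logits, alpha=0.5):
    """Task loss plus a distillation term toward a peer model's predictions.

    logits: (batch, classes) from this replica, labels: (batch,) int targets,
    peer_logits: (batch, classes) from the other replica (treated as constants,
    so no gradient would flow through them in an actual training loop).
    """
    p = softmax(logits)
    q = softmax(peer_logits)
    task = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    distill = -np.mean(np.sum(q * np.log(p + 1e-12), axis=1))  # cross-entropy to peer
    return task + alpha * distill
```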
Learning Eigenfunctions Links Spectral Embedding and Kernel PCA
2004
Neural Computation
In this paper, we show a direct relation between spectral embedding methods and kernel PCA, and how both are special cases of a more general learning problem, that of learning the principal eigenfunctions of an operator defined from a kernel and the unknown data generating density. Whereas spectral embedding methods only provided coordinates for the training points, the analysis justifies a simple extension to out-of-sample examples (the Nyström formula) for Multi-Dimensional Scaling, spectral clustering, Laplacian eigenmaps, Locally Linear Embedding (LLE) and Isomap. The analysis provides, for all such spectral embedding methods, the definition of a loss function, whose empirical average is minimized by the traditional algorithms. The asymptotic expected value of that loss defines a generalization performance and clarifies what these algorithms are trying to learn. Experiments with LLE, Isomap, spectral clustering and MDS show that this out-of-sample embedding formula generalizes well, with a level of error comparable to the effect of small perturbations of the training set on the embedding.
doi:10.1162/0899766041732396
pmid:15333211
fatcat:uzc4rzlorzfczdclqyeq7hohxy
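The out-of-sample extension mentioned here is the Nyström formula: the embedding coordinates of a new point are obtained by projecting its kernel values against the training points onto the eigenvectors of the training kernel matrix. A minimal sketch (my notation; the exact scaling and kernel normalization differ between MDS, Isomap, LLE, spectral clustering, etc.):

```python
import numpy as np

def nystrom_embedding(k_x, eigvecs, eigvals):
    """Embed a new point given its kernel values against the n training points.

    k_x: (n,) vector of K(x, x_i), eigvecs: (n, d) top eigenvectors of the
    training kernel matrix, eigvals: (d,) corresponding eigenvalues.
    Coordinate k is (1 / lambda_k) * sum_i v_ik * K(x, x_i).
    """
    return (eigvecs.T @ k_x) / eigvals

# Usage sketch: eigendecompose the training kernel matrix K beforehand, e.g.
# eigvals, eigvecs = np.linalg.eigh(K), keeping only the top-d pairs.
```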
Detonation Classification from Acoustic Signature with the Restricted Boltzmann Machine
2012
Computational intelligence
We compare the recently proposed Discriminative Restricted Boltzmann Machine to the classical Support Vector Machine on a challenging classification task consisting in identifying weapon classes from audio signals. The three weapon classes considered in this work (mortar, rocket and rocket-propelled grenade) are difficult to reliably classify with standard techniques since they tend to have similar acoustic signatures. In addition, specificities of the data available in this study make it challenging to rigorously compare classifiers, and we address methodological issues arising from this situation. Experiments show good classification accuracy that could make these techniques suitable for fielding on autonomous devices. Discriminative Restricted Boltzmann Machines appear to yield better accuracy than Support Vector Machines, and are less sensitive to the choice of signal preprocessing and model hyperparameters. This last property is especially appealing in such a task where the lack of data makes model validation difficult.
doi:10.1111/j.1467-8640.2012.00419.x
fatcat:65o2c3meo5gsnngzenbh4w3flu
Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines
2010
Journal of machine learning research
Despite CD's popularity, it does not yield the best approximation of the log-likelihood gradient (Carreira-Perpiñan & Hinton, 2005; Bengio & Delalleau, 2009). ...
dblp:journals/jmlr/DesjardinsCBVD10
fatcat:i7oamsfwvbcqrfvfryt6hlg2si
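Tempered MCMC for RBM training runs several Gibbs chains at different inverse temperatures and occasionally proposes to swap the states of neighboring chains, accepting with a Metropolis ratio. The sketch below shows only that swap step, in my own notation (energy is the RBM energy of a joint configuration); it is not the paper's full training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def maybe_swap(state_i, state_j, beta_i, beta_j, energy):
    """Metropolis swap between two tempered chains (replica exchange).

    Accept with probability min(1, exp((beta_i - beta_j) * (E_i - E_j))),
    which leaves the product of the tempered distributions invariant.
    """
    e_i, e_j = energy(state_i), energy(state_j)
    log_accept = (beta_i - beta_j) * (e_i - e_j)
    if np.log(rng.random()) < log_accept:
        return state_j, state_i   # exchange the states between the two temperatures
    return state_i, state_j
```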
Showing results 1 — 15 out of 62 results