Regret-Aware Black-Box Optimization with Natural Gradients, Trust-Regions and Entropy Control [article]

Maximilian Hüttenrauch, Gerhard Neumann
2022 arXiv pre-print
Most successful stochastic black-box optimizers, such as CMA-ES, use rankings of the individual samples to obtain a new search distribution. Yet the use of rankings introduces several issues. First, the underlying optimization objective is often unclear, i.e., we do not optimize the expected fitness. Second, while these algorithms typically produce a high-quality mean estimate of the search distribution, the produced samples can have poor quality, as these algorithms are ignorant of the regret. Lastly, noisy fitness function evaluations may result in solutions that are highly sub-optimal in expectation. In contrast, stochastic optimizers that are motivated by policy gradients, such as the Model-based Relative Entropy Stochastic Search (MORE) algorithm, directly optimize the expected fitness function without the use of rankings. MORE can be derived by applying natural policy gradients and compatible function approximation, and uses information-theoretic constraints to ensure the stability of the policy update. While MORE does not suffer from the listed limitations, it often cannot achieve state-of-the-art performance in comparison to ranking-based methods. We improve MORE in three ways: we decouple the update of the mean and covariance of the search distribution, allowing for more aggressive updates on the mean while keeping the update on the covariance conservative; we introduce an improved entropy scheduling technique based on an evolution path, which results in faster convergence; and we use a simplified and more effective model learning approach in comparison to the original paper. We compare our algorithm to state-of-the-art black-box optimization algorithms on standard optimization tasks as well as on episodic RL tasks in robotics, where it is also crucial to have small regret. We obtain competitive results on benchmark functions and clearly outperform ranking-based methods in terms of regret on the RL tasks.
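The distinction the abstract draws between a good mean estimate and good samples can be illustrated with a small sketch. This is not the authors' algorithm or code. It assumes a toy sphere fitness (minimized, optimum 0 at the origin), an isotropic Gaussian search distribution, and the hypothetical helper names `sphere` and `sample_regret`. It estimates the expected fitness of the samples, which for this toy function equals the regret of sampling from the distribution.

```python
import random


def sphere(x):
    """Toy fitness to be minimized (an assumption); optimum is 0 at the origin."""
    return sum(xi * xi for xi in x)


def sample_regret(mean, sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[f(x)] - f* for x ~ N(mean, sigma^2 I).

    For the sphere function f* = 0, so the regret equals the expected fitness.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(m, sigma) for m in mean]
        total += sphere(x)
    return total / n_samples


# The mean sits exactly at the optimum in both cases.
mean = [0.0] * 5
mean_fitness = sphere(mean)

# Only the spread of the search distribution differs.
wide = sample_regret(mean, sigma=1.0)
narrow = sample_regret(mean, sigma=0.1)

print("fitness of the mean:", mean_fitness)
print("sample regret, sigma=1.0:", wide)
print("sample regret, sigma=0.1:", narrow)
```

With the mean at the optimum, the fitness of the mean is zero in both cases, while the expected fitness of the samples is close to d·sigma² for dimension d. An optimizer judged only by its mean would rate both distributions equally, whereas the regret of the evaluated samples differs by a factor of about 100.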
arXiv:2206.06090v1 fatcat:awvdtyldjnedncp4kviefvl5o4