Augmenting the reward with entropy is known to soften the greedy argmax policy into a softmax policy. The entropy augmentation is reformulated, which motivates adding an extra entropy term to the objective function in the form of a KL-divergence that regularizes the optimization process. The result is a policy that interpolates between the current policy and the softmax greedy policy. This policy is used to build a continuously parameterized algorithm that optimizes the policy and the Q-function simultaneously, and whose extreme limits correspond to policy gradient and Q-learning, respectively. Experiments show that an intermediate algorithm can yield a performance gain.
Original title: Entropy-Augmented Entropy-Regularized Reinforcement Learning and a Continuous Path from Policy Gradient to Q-Learning
Original abstract: Entropy augmented to reward is known to soften the greedy argmax policy to softmax policy. Entropy augmentation is reformulated and leads to a motivation to introduce an additional entropy term to the objective function in the form of KL-divergence to regularize optimization process. It results in a policy interpolating between the current policy and the softmax greedy policy. This policy is used to build a continuously parameterized algorithm which optimize policy and Q-function simultaneously and whose extreme limits correspond to policy gradient and Q-learning, respectively. Experiments show that there can be a performance gain using an intermediate algorithm.
Original author: Donghoon Lee
Original link: https://arxiv.org/abs/2005.08844
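To make the interpolation idea concrete, below is a minimal sketch of how a policy could be blended between the current policy and the entropy-induced softmax-greedy policy. The geometric-mixture form, the temperature tau, and the mixing coefficient beta are illustrative assumptions for this sketch, not the paper's exact formulation; they only show how one knob could trace a path from a policy-gradient-like update (beta near 0) to a Q-learning-like one (beta near 1).

```python
import numpy as np

def softmax_greedy(q, tau=1.0):
    """Softmax (Boltzmann) policy induced by entropy-augmented rewards:
    pi(a) proportional to exp(Q(s, a) / tau)."""
    z = q / tau
    z -= z.max()              # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def interpolated_policy(pi_current, q, tau=1.0, beta=0.5):
    """Illustrative interpolation (assumed geometric mixture, renormalized)
    between the current policy and the softmax-greedy policy.
    beta = 0 keeps the current policy (policy-gradient-like limit);
    beta = 1 gives the softmax-greedy policy (Q-learning-like limit)."""
    pi_soft = softmax_greedy(q, tau)
    mix = pi_current ** (1.0 - beta) * pi_soft ** beta
    return mix / mix.sum()

# Example: three actions at a single state
q_values = np.array([1.0, 2.0, 0.5])
pi = np.array([0.5, 0.3, 0.2])        # current policy over the actions
print(interpolated_policy(pi, q_values, tau=0.5, beta=0.3))
```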