
Introduction

In reinforcement learning policy gradient methods, the goal is usually to maximize $\mathbb{E}[R_t]$, the expected value of the return at time step $t$, where $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ and $\gamma$ is the discount factor.


We parameterize the policy as $\pi(a|s; \theta)$ and update the parameter $\theta$ so that $\mathbb{E}[R_t]$ is maximized.


The standard REINFORCE algorithm updates the policy in the direction $R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)$. That is to say, we have a loss function of $-R_t \log \pi (a_t|s_t; \theta)$ and we minimize this loss.
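To make the loss concrete, here is a minimal sketch of the single-step REINFORCE loss in PyTorch. The policy network, the function name, and the toy values are assumptions for illustration; $R_t$ is treated as a constant scalar.

```python
import torch

# A minimal sketch of the single-step REINFORCE loss, assuming `logits`
# come from a hypothetical policy network and `return_t` is the observed
# return R_t, treated as a constant scalar.
def reinforce_loss(logits: torch.Tensor, action: int, return_t: float) -> torch.Tensor:
    log_probs = torch.log_softmax(logits, dim=-1)  # log pi(a | s_t; theta)
    log_prob_action = log_probs[action]            # log pi(a_t | s_t; theta)
    # Minimizing -R_t * log pi(a_t | s_t; theta) follows the policy gradient.
    return -return_t * log_prob_action

# Toy usage with a 3-action policy.
logits = torch.tensor([0.1, 0.5, -0.2], requires_grad=True)
loss = reinforce_loss(logits, action=1, return_t=2.0)
loss.backward()
print(logits.grad)  # gradient of -R_t * log pi(a_t | s_t; theta) w.r.t. the logits
```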

REINFORCE

But in order to maximize $\mathbb{E}[R_t]$, why do we calculate the derivatives of $R_t \log \pi (a_t|s_t; \theta)$? It turns out that this is because $R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)$ is an unbiased estimate of $\nabla_{\theta} \mathbb{E}[R_t]$, i.e., $\mathbb{E}[R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)] = \nabla_{\theta} \mathbb{E}[R_t]$.

Proof

We use the definition of the expected value: the random variable $R_t$ depends on $a_t$, and $a_t$ follows the distribution $\pi (a_t|s_t; \theta)$.

$$\mathbb{E}[R_t] = \int R_t \, \pi (a_t|s_t; \theta) \, da_t$$

Then we apply a special case of the Leibniz integral rule, where the integration limits do not depend on $\theta$:

$$\nabla_{\theta} \int f(a_t, \theta) \, da_t = \int \nabla_{\theta} f(a_t, \theta) \, da_t$$

We then have

$$\nabla_{\theta} \mathbb{E}[R_t] = \nabla_{\theta} \int R_t \, \pi (a_t|s_t; \theta) \, da_t = \int \nabla_{\theta} \big( R_t \, \pi (a_t|s_t; \theta) \big) \, da_t$$

Because $R_t$ depends on $a_t$ but not on $\theta$ (is it? see the Final Remarks), we have

$$\nabla_{\theta} \mathbb{E}[R_t] = \int R_t \, \nabla_{\theta} \pi (a_t|s_t; \theta) \, da_t$$

We apply an identity trick here,

$$\nabla_{\theta} \pi (a_t|s_t; \theta) = \pi (a_t|s_t; \theta) \, \nabla_{\theta} \log \pi (a_t|s_t; \theta)$$

so that

$$\nabla_{\theta} \mathbb{E}[R_t] = \int R_t \, \nabla_{\theta} \log \pi (a_t|s_t; \theta) \, \pi (a_t|s_t; \theta) \, da_t$$

Because $R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)$ is also a function of $a_t$, which follows the distribution $\pi (a_t|s_t; \theta)$, the integral is itself an expectation:

$$\int R_t \, \nabla_{\theta} \log \pi (a_t|s_t; \theta) \, \pi (a_t|s_t; \theta) \, da_t = \mathbb{E}[R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)]$$

Therefore,

$$\nabla_{\theta} \mathbb{E}[R_t] = \mathbb{E}[R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)]$$

This concludes the proof.
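As a sanity check, here is a small numerical verification of the identity for a toy softmax policy over a discrete action space. The parameter values and the per-action returns are made up for illustration, and both sides of the identity are computed exactly by summing over actions.

```python
import numpy as np

# Numerical check of grad E[R_t] == E[R_t * grad log pi] for a toy softmax
# policy pi(a | s_t; theta) = softmax(theta) and a fixed return per action.
rng = np.random.default_rng(0)
theta = rng.normal(size=4)            # policy parameters (logits)
R = np.array([1.0, -0.5, 2.0, 0.3])   # R_t as a function of the action only

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)

# Left-hand side: grad_theta E[R_t] = grad_theta sum_a pi(a) R(a).
# For a softmax, d pi(a) / d theta(b) = pi(a) * (1[a == b] - pi(b)).
jacobian = np.diag(pi) - np.outer(pi, pi)
lhs = jacobian @ R

# Right-hand side: E[R_t * grad_theta log pi(a_t | s_t; theta)], summed
# exactly over actions; grad_theta log pi(a) = one_hot(a) - pi.
score = np.eye(len(theta)) - pi       # row a is grad_theta log pi(a)
rhs = pi @ (R[:, None] * score)

print(np.allclose(lhs, rhs))  # True
```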


Generalization

More generally, it is not hard to see that for a random variable $x$ following a parameterized distribution $p(x; \theta)$ and a function $f$ that does not depend on $\theta$,

$$\nabla_{\theta} \mathbb{E}_{x \sim p(x; \theta)}[f(x)] = \mathbb{E}_{x \sim p(x; \theta)}[f(x) \, \nabla_{\theta} \log p(x; \theta)]$$
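A quick Monte Carlo sketch of this general identity, assuming $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$ (both toy choices, not from the original derivation): the analytic gradient is $\nabla_{\theta} \mathbb{E}[x^2] = \nabla_{\theta}(\theta^2 + 1) = 2\theta$, and the score-function estimate should approach it.

```python
import numpy as np

# Monte Carlo check of grad_theta E[f(x)] == E[f(x) * grad_theta log p(x; theta)]
# for a toy example: x ~ N(theta, 1) and f(x) = x**2.
rng = np.random.default_rng(0)
theta = 0.7
samples = rng.normal(loc=theta, scale=1.0, size=1_000_000)

# For a unit-variance Gaussian, grad_theta log p(x; theta) = x - theta.
score_estimate = np.mean(samples**2 * (samples - theta))

# Analytic value: E[x^2] = theta^2 + 1, so grad_theta E[x^2] = 2 * theta.
print(score_estimate, 2 * theta)  # the two numbers should be close
```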

Extension

We can also use REINFORCE with a baseline to reduce the variance of the estimate of $\nabla_{\theta} \mathbb{E}[R_t]$ while keeping the estimate unbiased.

REINFORCE with Baseline

Instead of using $R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)$ as an unbiased estimate of $\nabla_{\theta} \mathbb{E}[R_t]$, we use $(R_t - v(s_t)) \nabla_{\theta} \log \pi (a_t|s_t; \theta)$, as long as the baseline $v(s_t)$ does not depend on $a_t$.


To see why $\mathbb{E}[(R_t - v(s_t)) \nabla_{\theta} \log \pi (a_t|s_t; \theta)] = \nabla_{\theta} \mathbb{E}[R_t]$, we have

$$\begin{aligned}
\mathbb{E}[(R_t - v(s_t)) \nabla_{\theta} \log \pi (a_t|s_t; \theta)] &= \mathbb{E}[R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)] - \int v(s_t) \, \nabla_{\theta} \pi (a_t|s_t; \theta) \, da_t \\
&= \nabla_{\theta} \mathbb{E}[R_t] - \nabla_{\theta} \int v(s_t) \, \pi (a_t|s_t; \theta) \, da_t
\end{aligned}$$

where $\nabla_{\theta} \int v(s_t) \, \pi (a_t|s_t; \theta) \, da_t = 0$.


Because $v$ uses parameters other than $\theta$ and does not depend on $a_t$,

$$\nabla_{\theta} \int v(s_t) \, \pi (a_t|s_t; \theta) \, da_t = \nabla_{\theta} \left( v(s_t) \int \pi (a_t|s_t; \theta) \, da_t \right) = \nabla_{\theta} v(s_t) = 0$$
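The corresponding loss is a small change to the earlier sketch: subtract the baseline from the return before weighting the log-probability. The value estimate `value_t` and the function name below are assumptions for illustration; detaching the baseline keeps the policy gradient equal to $(R_t - v(s_t)) \nabla_{\theta} \log \pi (a_t|s_t; \theta)$.

```python
import torch

# A minimal sketch of the REINFORCE-with-baseline loss, assuming `value_t`
# is a hypothetical estimate of v(s_t) produced by a separate value network.
def reinforce_with_baseline_loss(logits: torch.Tensor,
                                 action: int,
                                 return_t: float,
                                 value_t: torch.Tensor) -> torch.Tensor:
    log_prob_action = torch.log_softmax(logits, dim=-1)[action]
    # The baseline does not depend on a_t or theta, so detach it from the graph.
    advantage = return_t - value_t.detach()
    return -advantage * log_prob_action
```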

Quick Proof of the Leibniz Integral Rule

Thanks to the guidance from my friend Guotu Li on this.
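For reference, here is a minimal sketch of the special case used above, assuming the integration limits do not depend on $\theta$ and that the limit and the integral may be exchanged (for example, under the dominated convergence theorem):

$$\nabla_{\theta} \int f(a, \theta) \, da = \lim_{\Delta\theta \to 0} \int \frac{f(a, \theta + \Delta\theta) - f(a, \theta)}{\Delta\theta} \, da = \int \lim_{\Delta\theta \to 0} \frac{f(a, \theta + \Delta\theta) - f(a, \theta)}{\Delta\theta} \, da = \int \nabla_{\theta} f(a, \theta) \, da$$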

Final Remarks

There is an assumption we made during the proof: $R_t$ does not depend on $\theta$. This is because, in this case, $R_t$ is a scalar value rather than an output from the policy neural network.