### Introduction

In policy gradient methods for reinforcement learning, the goal is usually to maximize $\mathbb{E}[R_t]$, the expected value of the return $R_t$ at time step $t$, where $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ and $\gamma$ is the discount factor.
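As a small illustration of the return defined above, here is a minimal sketch that computes $R_t$ for a finite list of rewards. The function name `discounted_return` is hypothetical, and truncating the infinite sum to a finite episode is an assumption made for the example.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma^k * r_{t+k} for a finite reward list starting at step t.

    Assumption: the episode is finite, so the infinite sum is truncated.
    """
    total = 0.0
    # Work backwards through the rewards using R = r + gamma * R_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```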

We have parameterized the policy $\pi(a|s; \theta)$ and update the parameter $\theta$ so that $\mathbb{E}[R_t]$ is maximized.

The standard REINFORCE algorithm updates the policy in the direction $\nabla_{\theta} R_t \log \pi (a_t|s_t; \theta)$. That is to say, we have a loss function of $-R_t \log \pi (a_t|s_t; \theta)$ and we minimize that loss.

But in order to maximize $\mathbb{E}[R_t]$, why do we have to calculate the derivatives of $R_t \log \pi (a_t|s_t; \theta)$? It turns out that $\nabla_{\theta} R_t \log \pi (a_t|s_t; \theta)$ is an unbiased estimate of $\nabla_{\theta} \mathbb{E}[R_t]$, i.e., $\mathbb{E}[R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)] = \nabla_{\theta} \mathbb{E}[R_t]$.

### Proof

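We start from the definition of expected value. The random variable $R_t$ depends on $a_t$, and $a_t$ follows the distribution $\pi (a_t|s_t; \theta)$, so

$$\nabla_{\theta} \mathbb{E}[R_t] = \nabla_{\theta} \int R_t \, \pi (a_t|s_t; \theta) \, da_t$$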

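Then we apply a special case of the Leibniz integral rule, where the limits of integration $a$ and $b$ do not depend on $x$:

$$\frac{d}{dx} \int_{a}^{b} f(x, t) \, dt = \int_{a}^{b} \frac{\partial}{\partial x} f(x, t) \, dt$$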

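We then have

$$\nabla_{\theta} \int R_t \, \pi (a_t|s_t; \theta) \, da_t = \int \nabla_{\theta} \left[ R_t \, \pi (a_t|s_t; \theta) \right] da_t$$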

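Because $R_t$ depends on $a_t$ but not on $\theta$ (an assumption revisited in the Final Remarks), we have

$$\int \nabla_{\theta} \left[ R_t \, \pi (a_t|s_t; \theta) \right] da_t = \int R_t \, \nabla_{\theta} \pi (a_t|s_t; \theta) \, da_t$$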

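We apply an identity trick here. Since $\nabla_{\theta} \log \pi (a_t|s_t; \theta) = \frac{\nabla_{\theta} \pi (a_t|s_t; \theta)}{\pi (a_t|s_t; \theta)}$, we can write $\nabla_{\theta} \pi (a_t|s_t; \theta) = \pi (a_t|s_t; \theta) \nabla_{\theta} \log \pi (a_t|s_t; \theta)$, and therefore

$$\int R_t \, \nabla_{\theta} \pi (a_t|s_t; \theta) \, da_t = \int R_t \, \pi (a_t|s_t; \theta) \, \nabla_{\theta} \log \pi (a_t|s_t; \theta) \, da_t$$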

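Because $R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)$ is a function of $a_t$, and $a_t$ follows the distribution $\pi (a_t|s_t; \theta)$, the integral is again an expectation:

$$\int R_t \, \pi (a_t|s_t; \theta) \, \nabla_{\theta} \log \pi (a_t|s_t; \theta) \, da_t = \mathbb{E}[R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)]$$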

Therefore,

$$\nabla_{\theta} \mathbb{E}[R_t] = \mathbb{E}[R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)]$$

This concludes the proof.
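The result can also be checked numerically. The sketch below is an illustration only, under assumptions that are not part of the derivation: a softmax policy over three discrete actions (so the integral becomes a sum), one fixed return per action, and hypothetical helper names. It compares $\mathbb{E}[R_t \nabla_{\theta} \log \pi]$, computed exactly, with a finite-difference estimate of $\nabla_{\theta} \mathbb{E}[R_t]$.

```python
import numpy as np


def softmax(theta):
    # Numerically stable softmax over action preferences theta.
    z = np.exp(theta - np.max(theta))
    return z / z.sum()


def expected_return(theta, returns):
    # E[R] = sum_a pi(a; theta) * R(a)
    return float(softmax(theta) @ returns)


def score_function_gradient(theta, returns):
    """E[R * grad_theta log pi(a; theta)], computed exactly by summing over actions."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for a in range(len(theta)):
        # For a softmax policy, grad_theta log pi(a) = one_hot(a) - pi.
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0
        grad += pi[a] * returns[a] * grad_log_pi
    return grad


def finite_difference_gradient(theta, returns, eps=1e-6):
    """Central-difference approximation of grad_theta E[R]."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        upper = expected_return(theta + step, returns)
        lower = expected_return(theta - step, returns)
        grad[i] = (upper - lower) / (2 * eps)
    return grad


theta = np.array([0.2, -0.5, 1.0])      # arbitrary policy parameters
returns = np.array([1.0, 3.0, -2.0])    # arbitrary return for each action

lhs = score_function_gradient(theta, returns)
rhs = finite_difference_gradient(theta, returns)
print(lhs, rhs)
```

The two vectors should agree up to the finite-difference error, which is what the identity predicts when the returns do not depend on $\theta$.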

### Generalization

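More generally, it is not hard to see that, for a random variable $x$ following a distribution $p(x; \theta)$ and a function $f(x)$ that does not depend on $\theta$,

$$\nabla_{\theta} \mathbb{E}_{x \sim p(x; \theta)}[f(x)] = \mathbb{E}_{x \sim p(x; \theta)}[f(x) \nabla_{\theta} \log p(x; \theta)]$$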
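A one-dimensional Gaussian gives a quick Monte Carlo illustration of this kind of identity. The choices below are assumptions made for the example: $x \sim \mathcal{N}(\mu, 1)$, $f(x) = x^2$, and the parameter is the mean $\mu$. In that case $\mathbb{E}[x^2] = \mu^2 + 1$, so the exact gradient with respect to $\mu$ is $2\mu$, and $\nabla_{\mu} \log p(x; \mu) = x - \mu$.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible
mu = 1.0
n_samples = 200_000

# Sample x ~ N(mu, 1) and evaluate f(x) = x^2.
x = rng.normal(loc=mu, scale=1.0, size=n_samples)
f = x ** 2

# Score of a unit-variance Gaussian with respect to its mean.
grad_log_p = x - mu

# Monte Carlo estimate of E[f(x) * grad_mu log p(x; mu)].
estimate = float(np.mean(f * grad_log_p))

# Exact gradient: d/dmu (mu^2 + 1) = 2 * mu.
exact = 2.0 * mu

print(estimate, exact)
```

The estimate is noisy, so it only matches the exact value up to sampling error.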

### Extension

We can also use REINFORCE with a baseline to reduce the variance of the estimate of $\nabla_{\theta} \mathbb{E}[R_t]$ while keeping the estimate unbiased.

Instead of using $\nabla_{\theta} R_t \log \pi (a_t|s_t; \theta)$ as an unbiased estimate of $\nabla_{\theta} \mathbb{E}[R_t]$, we use $\nabla_{\theta} (R_t - v(s_t)) \log \pi (a_t|s_t; \theta)$, which remains unbiased as long as $v(s_t)$ does not depend on $a_t$.

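To see why $\mathbb{E}[(R_t - v(s_t)) \nabla_{\theta} \log \pi (a_t|s_t; \theta)] = \nabla_{\theta} \mathbb{E}[R_t]$, we have

$$
\begin{aligned}
\mathbb{E}[(R_t - v(s_t)) \nabla_{\theta} \log \pi (a_t|s_t; \theta)]
&= \mathbb{E}[R_t \nabla_{\theta} \log \pi (a_t|s_t; \theta)] - \mathbb{E}[v(s_t) \nabla_{\theta} \log \pi (a_t|s_t; \theta)] \\
&= \nabla_{\theta} \mathbb{E}[R_t] - \int v(s_t) \, \pi (a_t|s_t; \theta) \, \nabla_{\theta} \log \pi (a_t|s_t; \theta) \, da_t \\
&= \nabla_{\theta} \mathbb{E}[R_t] - \int v(s_t) \, \nabla_{\theta} \pi (a_t|s_t; \theta) \, da_t \\
&= \nabla_{\theta} \mathbb{E}[R_t] - \nabla_{\theta} \int v(s_t) \, \pi (a_t|s_t; \theta) \, da_t \\
&= \nabla_{\theta} \mathbb{E}[R_t]
\end{aligned}
$$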

where $\nabla_{\theta} \int v(s_t) \pi (a_t|s_t; \theta) \, da_t = 0$.

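This holds because $v$ uses parameters other than $\theta$ and does not depend on $a_t$, so it can be pulled out of the integral, and $\pi$ integrates to one:

$$\nabla_{\theta} \int v(s_t) \, \pi (a_t|s_t; \theta) \, da_t = \nabla_{\theta} \left( v(s_t) \int \pi (a_t|s_t; \theta) \, da_t \right) = \nabla_{\theta} v(s_t) = 0$$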
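The effect of a baseline can be illustrated with a small exact computation. Everything in this sketch is an assumption chosen for the example: a single state, a uniform softmax policy over three actions, returns of 10, 11 and 12, and a baseline equal to the expected return. The helper `estimator_stats` is hypothetical. It computes the exact mean and total variance (trace of the covariance) of the estimator $(R_t - b) \nabla_{\theta} \log \pi$.

```python
import numpy as np

# Uniform softmax policy over 3 actions (all preferences equal to zero).
pi = np.full(3, 1.0 / 3.0)
returns = np.array([10.0, 11.0, 12.0])   # one fixed return per action


def estimator_stats(baseline):
    """Exact mean and total variance of (R - baseline) * grad log pi over actions."""
    samples = []
    for a in range(3):
        # For a softmax policy, grad_theta log pi(a) = one_hot(a) - pi.
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0
        samples.append((returns[a] - baseline) * grad_log_pi)
    samples = np.array(samples)

    mean = pi @ samples                                  # expectation over actions
    second_moment = pi @ np.sum(samples ** 2, axis=1)    # E[||g||^2]
    variance = float(second_moment - np.sum(mean ** 2))  # trace of the covariance
    return mean, variance


mean_plain, var_plain = estimator_stats(baseline=0.0)
mean_base, var_base = estimator_stats(baseline=float(pi @ returns))

print(mean_plain, var_plain)
print(mean_base, var_base)
```

Both estimators have the same mean, while the variance is much smaller with the baseline in this particular setup. A baseline is not guaranteed to reduce variance for every choice of $v(s_t)$.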

### Quick Proof to Leibniz Integral Rule

Thanks to the guidance from my friend Guotu Li on this.
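Short of a proof, the special case of the rule with constant limits of integration can be sanity-checked numerically. The sketch below is only an illustration under assumed choices: the integrand $f(x, t) = \sin(xt)$ on $t \in [0, 1]$, a hand-written midpoint rule (the hypothetical helper `integrate`), and a central difference for the outer derivative.

```python
import numpy as np


def integrate(g, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    t = a + h * (np.arange(n) + 0.5)   # midpoints of the n sub-intervals
    return float(np.sum(g(t)) * h)


def F(x):
    # F(x) = integral over t in [0, 1] of sin(x * t)
    return integrate(lambda t: np.sin(x * t), 0.0, 1.0)


x0 = 0.7
eps = 1e-5

# Left-hand side: differentiate the integral with a central difference.
derivative_of_integral = (F(x0 + eps) - F(x0 - eps)) / (2 * eps)

# Right-hand side: integrate the partial derivative d/dx sin(x t) = t cos(x t).
integral_of_derivative = integrate(lambda t: t * np.cos(x0 * t), 0.0, 1.0)

print(derivative_of_integral, integral_of_derivative)
```

The two numbers should agree up to discretization and rounding error.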

### Final Remarks

The proof relies on one assumption: that $R_t$ does not depend on $\theta$. This holds because, in this setting, $R_t$ is a scalar value rather than an output of the policy neural network.