### Introduction

In machine learning papers we often see the proportionality symbol $\propto$ used in place of the equality symbol $=$. In this blog post, we will see why we should use it, when we could use it, and how we could understand it.

### Why Should We Use It?

I will use Bayes' theorem as a concrete example first.

\[\underbrace{p(\theta|x)}_\text{posterior} = \frac{p(x|\theta) p(\theta)}{p(x)} \propto \underbrace{p(x|\theta)}_\text{likelihood} \underbrace{p(\theta)}_\text{prior} \propto p(x,\theta)\]

Because $p(\theta|x)$ is a probability distribution over the random variable $\theta$, its density must integrate to 1.

\[\int_{\theta} p(\theta|x) d \theta = \int_{\theta} \frac{p(x|\theta) p(\theta)}{p(x)} d \theta = \frac{\int_{\theta} p(x|\theta) p(\theta) d \theta}{p(x)} = 1\]

Therefore,

\[p(x) = \int_{\theta} p(x|\theta) p(\theta) d \theta\]

This means that if we know the closed form of $p(x|\theta) p(\theta)$, we also know $p(x)$. This holds for any probability distribution, so there is no need to write out the normalizer, which does not depend on the random variable of the distribution.
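We can check this relation numerically. The sketch below assumes a single Bernoulli observation $x = 1$ with likelihood $p(x=1|\theta) = \theta$ and a Beta(2, 2) prior $p(\theta) = 6\theta(1-\theta)$; both are illustrative choices of mine, not part of the derivation above.

```python
# A minimal numeric check of p(x) = integral of p(x|theta) p(theta) d theta,
# assuming (hypothetically) a Bernoulli likelihood and a Beta(2, 2) prior.
n = 100_000
thetas = [(i + 0.5) / n for i in range(n)]   # midpoint grid on (0, 1)
dtheta = 1.0 / n

def prior(t):        # Beta(2, 2) density: 6 t (1 - t)
    return 6.0 * t * (1.0 - t)

def likelihood(t):   # p(x = 1 | theta) for one Bernoulli observation
    return t

# Evidence p(x = 1) by numeric integration; analytically 2 / (2 + 2) = 0.5
evidence = sum(likelihood(t) * prior(t) for t in thetas) * dtheta

# Bayes' theorem: divide the unnormalized product by the evidence
posterior = [likelihood(t) * prior(t) / evidence for t in thetas]

print(evidence)
print(sum(posterior) * dtheta)   # the posterior integrates to 1
```

Once the posterior is divided by the evidence, it integrates to 1 by construction, which is exactly why the evidence acts as the normalizer.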

For any probability distribution $p(\theta)$, if it could be written as

\[p(\theta) = h(\theta, \eta)g(\eta)\]

where $g(\eta)$ does not depend on the random variable $\theta$, we could always use the proportionality symbol

\[p(\theta) \propto h(\theta, \eta)\]

Because

\[\int_{\theta} p(\theta) d \theta = g(\eta) \int_{\theta} h(\theta, \eta) d \theta = 1\\ g(\eta) = \frac{1}{\int_{\theta} h(\theta, \eta) d \theta}\]

### When Could We Use It?

For any probability distribution $p(\theta)$, if at any step of a derivation it can be written as

\[p(\theta) \propto h(\theta, \eta)g(\eta)\]

where $g(\eta)$ does not depend on the random variable $\theta$, we could always drop the term $g(\eta)$, which is unrelated to $\theta$:

\[p(\theta) \propto h(\theta, \eta)\]

### How Could We Understand It?

For any probability distribution $p(\theta)$, if it can be written as

\[p(\theta) = h(\theta, \eta)g(\eta)\]

then $h(\theta, \eta)$ determines the shape of the distribution, and all the terms unrelated to $\theta$ can be merged into a single term $g(\eta)$. This term is constant with respect to $\theta$; it is only a scaling factor (normalizer) that ensures the distribution integrates to 1.
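To make the role of $g(\eta)$ concrete, here is a sketch that recovers it numerically. The unnormalized Gaussian kernel $h(\theta, \sigma) = \exp(-\theta^2 / (2\sigma^2))$ is my own illustrative choice; its known normalizer is $g(\sigma) = 1/(\sigma\sqrt{2\pi})$.

```python
import math

# Illustrative assumption (not from the post): the unnormalized Gaussian
# kernel h(theta, sigma) = exp(-theta^2 / (2 sigma^2)), with eta = sigma.
sigma = 1.5
n = 200_000
lo, hi = -20.0, 20.0                      # wide enough to capture all the mass
dtheta = (hi - lo) / n
thetas = [lo + (i + 0.5) * dtheta for i in range(n)]

h = [math.exp(-t * t / (2.0 * sigma * sigma)) for t in thetas]

# g(eta) = 1 / (integral of h(theta, eta) d theta)
g = 1.0 / (sum(h) * dtheta)

print(g)                                  # closed form: 1 / (sigma * sqrt(2 pi))
print(1.0 / (sigma * math.sqrt(2.0 * math.pi)))
```

The shape of the distribution lives entirely in $h$; the integral fixes the single number $g$ that scales it into a proper density.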

### Final Remarks

The terms in a probability density expression that are unrelated to the random variable of the distribution are usually unimportant during derivation. We can ignore them along the way and always recover the normalizer at the end of the derivation.
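This workflow can be sketched in code: drop a $\theta$-independent factor mid-derivation, then normalize at the end. The Gaussian kernel and the constant 37 below are arbitrary illustrative choices.

```python
import math

def normalize(unnorm, dtheta):
    """Recover the normalizer at the end: divide by the numeric integral."""
    z = sum(unnorm) * dtheta
    return [u / z for u in unnorm]

n = 100_000
lo, hi = -10.0, 10.0
dtheta = (hi - lo) / n
thetas = [lo + (i + 0.5) * dtheta for i in range(n)]

# Unnormalized kernel h(theta), and the same kernel times a theta-independent
# constant -- the kind of factor we are free to ignore during a derivation.
h = [math.exp(-t * t / 2.0) for t in thetas]
scaled = [37.0 * u for u in h]

p1 = normalize(h, dtheta)
p2 = normalize(scaled, dtheta)

# Both normalize to the same distribution: the constant never mattered.
print(max(abs(a - b) for a, b in zip(p1, p2)))
```

Any constant multiplied into the unnormalized density cancels out at normalization time, which is exactly why $\propto$ loses no information.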