### Introduction

Naive Bayes classifier is one of the simplest classification algorithm based on the application of Bayes theorem with a strong naive conditional independence assumption. Although it is no longer state-of-the-art classification model nowadays, its mathematics are simple and elegant.

In this blog post, I would like to discuss the naive Bayes classifier.

### Bayesian Probabilistic Model

Given a feature vector variable $\mathbf{x} = (x_1, x_2, \cdots, x_n)$ representing $n$ features, the Bayesian probabilistic model produce probability distribution $p( y | \mathbf{x} )$ where $y$ is the label variable. It should also be noted that $x_1, x_2, \cdots, x_n$ and $y$ could be high-dimensional data, although usually they are scalars.

Bayesian probabilistic model is commonly used for classification problems. If $y$ is a discrete variable representing all the possible candidate classes in a classification problem, we could compute the probability of being each candidate class given a feature vector.

According to Bayes theorem, mathematically we have

\[\begin{align} P(y | \mathbf{x}) = \frac{P(y) P( \mathbf{x} | y)}{P(\mathbf{x})} \end{align}\]where $P(\mathbf{x})$ is the total probability or evidence and

\[\begin{align} P(\mathbf{x}) = \sum_{y \in Y} P(y) P( \mathbf{x} | y) \end{align}\]for discrete variable $y$,

\[\begin{align} P(\mathbf{x}) = \int_{Y} P(y) P( \mathbf{x} | y) dy \end{align}\]for continuous variable $y$.

Notice that $P(\mathbf{x})$ is constant with respect to $y$. It can be computed using the above summation. But if we just want to know which $y$ is the most likely, calculating the exact $P(y | \mathbf{x})$ is not necessary. Instead, we compute and compare $P(y) P( \mathbf{x} | y)$ for each $y$.

Computing $P(y)$ is easy from a training dataset, we just have to count the fraction of each labels in the training dataset. However, modeling $P( \mathbf{x} | y)$ is a hard problem. According to the probability chain rule, $P( \mathbf{x} | y)$ could be expressed as

\[\begin{align} P(\mathbf{x} | y) &= P(x_1, x_2, \cdots, x_n | y) \\ &= P( x_1 | y) P( x_2 | x_1, y) \cdots P( x_n | x_1, x_2, \cdots, x_{n-1}, y) \\ \end{align}\]But it is extremely complicated to model $P( x_i | x_1, x_2, \cdots, x_{i-1}, y)$ when $i$ is large. Not to mention $n$ might also be large. So the question becomes how could we model $P( \mathbf{x} | y)$ that is simple and efficient to compute and generalize well? The naive Bayes assumption comes to simplify the modeling process.

### Naive Bayes Classifier

The naive Bayes assumption is basically a conditional independence assumption. It assumes all the features in $\mathbf{x}$ are mutually independent, conditional on the category. Mathematically, we assume

\[\begin{align} P( x_i | x_1, x_2, \cdots, x_{i-1}, y) &= P( x_i | y) \\ \end{align}\]Modeling $P( x_i | y)$ is much more easier than modeling $P( x_i | x_1, x_2, \cdots, x_{i-1}, y)$. Therefore,

\[\begin{align} P(\mathbf{x} | y) &= P(x_1, x_2, \cdots, x_n | y) \\ &= \prod_{i=1}^{n} P( x_i | y) \\ \end{align}\]The Bayesian probabilistic model comes

\[\begin{align} P(y | \mathbf{x}) &= \frac{P(y) \prod_{i=1}^{n} P( x_i | y) }{P(\mathbf{x})} \\ &\propto P(y) \prod_{i=1}^{n} P( x_i | y) \\ \end{align}\]So for a naive Bayesian classifier, we only have to train $n+1$ models, $P(y)$, $P( x_1 | y)$, $P( x_2 | y)$, $\cdots$, $P( x_n | y)$, usually via maximum likelihood estimation.

To predict the label $\tilde{y}$ for a given $\mathbf{x}$,

\[\DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \begin{align} \tilde{y} = \argmax_{y} P(y) \prod_{i=1}^{n} P( x_i | y) \\ \end{align}\]One could imagine that if all the features in $\mathbf{x}$ are indeed conditionally independent the naive Bayes classifier should work very well.

The distribution selections for $P( x_1 | y)$, $P( x_2 | y)$, $\cdots$, $P( x_n | y)$ varies. But using Gaussian distribution is the most common. The scikit-learn naive Bayes module has built-in support for modeling using other distributions.

### Bernoulli Naive Bayes Classifier

Modeling a naive Bayes classifier given a tabular dataset is simple, since most likely the number of features are fixed and the first-hand features have been prepared. In many problems in practice, there is no such nice tabular dataset prepared for us. We would have to prepare a tabular dataset in order to use naive Bayes classifier.

For example, for E-mail spam and non-spam classification, the length of each Email are different. So setting variable $x_i$ as the word in the $i$th position is not appropriate. What people commonly are doing is, they have a dictionary of words where the number of words is fixed. Then $x_i$ is the number of times the word $i$ from the dictionary shows in the Email or even whether the word $i$ from the dictionary shows in the Email. In this way, the number of features is fixed. We are learning $k$ different Bernoulli distributions if $x_i$ is whether the word $i$ from the dictionary shows in the Email, where $k$ is the number of words in the dictionary.

### Multinomial Naive Bayes Classifier

There is another way to model Naive Bayes Classifier. Basically, for each word variable $x$ given the class variable $y$, it is a multinomial distribution. In this case, the distributions $P( x | y) = P( x_1 | y) = P( x_2 | y) = \cdots = P( x_n | y)$, and $n$ is the length of a document. We are learning one single $k$-multinomial distribution, where $k$ is the number of words in the dictionary.

### Caveats

There are scenarios where during inference time a word in the Email does not exist in the dictionary. There are two strategies. Either we expand our dictionary to include the word and create a new Bayes classifier, or we simply ignore the word. The latter one is more common since the dictionary are the collection of words we carefully selected. Preparing the dictionary or the tabular dataset for naive Bayes classifier could be tricky and might affect the model quality significantly. But that is another story belonging to feature engineers.

### Conclusions

Naive Bayes classifier is just a Bayesian probabilistic model with an assumption of feature mutual conditional independence on labels.