Lei Mao bio photo

Lei Mao

Machine Learning, Artificial Intelligence. On the Move.

Twitter Facebook LinkedIn GitHub   G. Scholar E-Mail RSS

Preface

I was reading the perceptron convergence theorem, which is a proof for the convergence of perceptron learning algorithm, in the book “Machine Learning - An Algorithmic Perspective” 2nd Ed. I found the authors made some error in the mathmatical derivation by introducing some unstated assumptions. Obviously, the author was looking at the materials from multiple different sources but did not generalize it very well to match his proceeding writings in the book. I then tried to look up the right derivation on the internet, and I found that most of the materials includes too many assumptions that did not generalize the theorem very well. I finally found one from the MIT opencourse and I think this is the best one I was looking for.

Mathematical Proof and Caveats

In case you forget the perceptron learning algorithm, you may find it here.


The perceptron convergence theorem basically states that the perceptron learning algorithm converges in finite number of steps, given a linearly separable dataset. More precisely, if for each data point x, \( \|\mathbf{x}\| < R \) where \( R \) is certain constant number, \( \gamma = (\theta^{\ast})^{T} \mathbf{x_c} \) where \( \mathbf{x_c} \) is the data point that is the closest to the linear separate hyperplane. It should be noted that mathematically \(\frac{\gamma}{\|\theta^{\ast}\|^2}\) is the distance \(d\) of the closest datapoint to the linear separate hyperplane (it could be negative). The number of steps is bounded by \(\frac{R^{2}\|\theta^{\ast}\|^{2}}{\gamma^2}\) or \(\frac{R^{2}}{d^2}\).


In some materials, for simplicity, someone added assumption without generality that the weight of separate hyperplane is a unit vector (\( \|\theta^{\ast}\|^{2} = 1 \)) and one could claim that the physical meaning of \( \gamma \) is the the distance of the closest datapoint to the linear separate hyperplane. However, sometimes people ignored this assumption and claim \( \gamma \) is the the distance of the closest datapoint to the linear separate hyperplane. That was wrong.


The comprehensive correct proof could be found here.


PS: “MathJax is against humanity.” and “Happy Mother’s Day.”

Updates

6/16/2017


I happened to find an interesting non-mathematical informal proof in Hinton’s “Neural Networks for Machine Learning” course. Unlike the previous mathematical proofs I provided, this proof discussed the problems in the weight space. The whole informal proof is very simple and easy to understand. The informal proof could be downloaded here.