<h2>John Carreyrou's Bad Blood</h2>

<h3 id="introduction">Introduction</h3>
<p>I have two degrees in life science related disciplines. However, I did not know of the company Theranos and its “legend” until I watched a book recommendation video on YouTube at the end of 2018. I was too busy with my studies in computer science and did not find time to read the book until November 2019. Even from that short video, though, I was deeply astonished by the evil nature of Theranos, which confirmed that my decision to leave the field of life science and anything related to it was correct.</p>
<p><br /></p>
<p>I purchased the book and finished reading it in one month. John Carreyrou is a two-time Pulitzer Prize-winning author, and his word choices in the book are excellent. As a non-native English speaker who rarely reads books other than textbooks, I sometimes had to look up words in a dictionary. However, because I have a professional background in life science academia and industry, and because the central idea of the book is extremely strong, I did not have much difficulty finishing it.</p>
<p><br /></p>
<p>In this blog post, I would like to share some of my thoughts after reading the entire book.</p>
<h3 id="characters">Characters</h3>
<p>I cannot remember how many characters there are in the book, but they can be categorized into three groups: Theranos helmsmen, Theranos zealots, and Theranos breakers.</p>
<h4 id="theranos-helmsmen">Theranos Helmsmen</h4>
<p>The Theranos helmsmen are, of course, the CEO Elizabeth Holmes and her older boyfriend, the COO Ramesh “Sunny” Balwani. Although nobody would doubt that the initial motivation of Theranos was to make the world better when Elizabeth first founded the company, the two did not have as solid a background in science and engineering as someone like Thomas Edison, who was eventually able to “fake it until you make it”.</p>
<p><br /></p>
<p>The way they treated their employees was extremely disgusting. Threats, lies, and surveillance were routine. This reminds me of someone I knew at Duke University who was very similar to the Theranos CEO. Such a place is a notorious one to stay at, regardless of how revolutionary its “invention” is.</p>
<h4 id="theranos-zealots">Theranos Zealots</h4>
<p>The Theranos board of directors consisted of a good number of highly powerful and influential people from academia, politics, legal services, and so on. While they were receiving extraordinary salaries from Theranos, they made almost no contribution to steering the company in the right direction. Although most of them lacked the technical background, they should still have had enough skepticism to tell that something unusual was going on after having been on the board for several years, instead of continuing to believe in and advocate for Theranos without thinking twice.</p>
<h4 id="theranos-breakers">Theranos Breakers</h4>
<p>The most admirable and honorable people are the Theranos quitters, the whistleblowers, and of course the author of the book, John Carreyrou, who dug into this and exposed the truth to the public. They abided by research integrity, <a href="https://en.wikipedia.org/wiki/Hippocratic_Oath">the Hippocratic Oath</a>, and the fundamental principle of being a good person, which is to be honest, even while being constantly threatened by Theranos in various ways. Without them, more investment might have been wasted, and patients could have died. They truly saved the world.</p>
<h3 id="theranos-could-not-be-apple">Theranos Could Not Be Apple</h3>
<p>Creating a medicine or medical device is different from inventing an algorithm, writing software, or developing a conventional electronic device, because human health and life really matter. An Apple device might malfunction or miss some features that users want, but it would usually not harm the user’s health or put the user’s life in danger. Because of this, Apple can usually afford to be more imaginative and creative, and more wild ideas and technologies can be assembled together. A pharmaceutical or medical device company, however, has to be more conservative: every single step and component has to be evaluated carefully for its potential harm to human health before the product becomes commercially available. Most of the time, the evaluation results would be no harm but also no effect, or effective but harmful. In the Theranos case, unfortunately, the product was not only ineffective but also harmful.</p>
<p><br /></p>
<p>Therefore, advertising Theranos as the next Apple was wrong from the very root.</p>
<h3 id="theranos-is-not-alone">Theranos Is Not Alone</h3>
<p>Not only in the life science industry but also in academia, I have seen too many similar stories. People who have taken research ethics training, no matter how prestigious they are, can easily betray those ethics, not to mention those who never had such training at all, such as Elizabeth and Sunny.</p>
<p><br /></p>
<p>Life science product development and studies usually have a binary, 1-or-0 outcome: either your research is true and helpful, or it completely fails and is useless. However, people care more about power, wealth, and results than about the process. Given that it is extremely easy to end up with a 0, fabricating data, cheating, intentionally misinterpreting results, and so on are not new tricks for obtaining a 1.</p>
<p><br /></p>
<p>This is why the field has become so corrupted. People of integrity can hardly survive in the life science field: when most others cheat, you also have to cheat, otherwise you cannot win, and you would be kicked out of the game if you blew the whistle.</p>
<h3 id="conclusions">Conclusions</h3>
<p>This is truly a well-written book, and I would rate it 5/5. I would recommend it to everyone, especially those who are not familiar with life science academia and industry.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://www.amazon.com/Bad-Blood-Secrets-Silicon-Startup-ebook/dp/B078VW3VM7/">Bad Blood: Secrets and Lies in a Silicon Valley Startup</a></li>
<li><a href="https://leimao.github.io/blog/Do-Not-Study-Life-Science/">Reasons Not to Study Life Science or Anything Related</a></li>
<li><a href="https://leimao.github.io/miscellaneous/bio-discouragement/">Bio-Discouragement Plan</a></li>
</ul>
<p><a href="https://leimao.github.io/reading/John-Carreyrou-Bad-Blood/">John Carreyrou's Bad Blood</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on November 30, 2019.</p><![CDATA[Correlation VS Causation]]>https://leimao.github.io/blog/Correlation-vs-Causation2019-11-28 14:17:25 -0400T00:00:00-00:002019-11-28T00:00:00-06:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>The concepts of correlation and causation are sometimes confusing to amateur researchers. In practice, I have often seen researchers mistaking a correlation for causation and drawing wrong conclusions. Mathematically, correlation is a necessary but insufficient condition for causation. In other words, if two things have a causal relationship, they must be correlated as well. However, if two things are correlated, they do not necessarily have a causal relationship.</p>
<p><br /></p>
<p>In this blog post, I will use an example to discuss the concepts of correlation and causation, how to verify causation using experiments, and the caveats in doing so.</p>
<h3 id="example">Example</h3>
<p>Suppose we have a system containing four variables: the temperature ($t$), the volume of water Mike drinks daily ($x_1$), the volume of urine Mike has daily ($x_2$), and the number of fires in California daily ($x_3$). The ground truth values are listed in the table below, and we assume there is no measurement error during experimental data collection.</p>
<p><br /></p>
<p>The ground truth relationships are listed in the Relationships column of the table; these relationships are unknown to the researchers.</p>
<p><br /></p>
<p>During data collection, we only collected data for $x_1$, $x_2$, and $x_3$, but not $t$.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-9wq8{border-color:inherit;text-align:center;vertical-align:middle}
.tg .tg-uzvj{font-weight:bold;border-color:inherit;text-align:center;vertical-align:middle}
</style>
<table class="tg">
<tr>
<th class="tg-uzvj">Variable</th>
<th class="tg-uzvj">Symbol</th>
<th class="tg-uzvj">Relationships</th>
<th class="tg-uzvj">Data #1</th>
<th class="tg-uzvj">Data #2</th>
<th class="tg-uzvj">Data #3</th>
<th class="tg-uzvj">Data #4</th>
<th class="tg-uzvj">Data #5</th>
</tr>
<tr>
<td class="tg-9wq8">Temperature</td>
<td class="tg-9wq8">$t$</td>
<td class="tg-9wq8">$t$</td>
<td class="tg-9wq8">0</td>
<td class="tg-9wq8">10</td>
<td class="tg-9wq8">20</td>
<td class="tg-9wq8">30</td>
<td class="tg-9wq8">40</td>
</tr>
<tr>
<td class="tg-9wq8">Volume of Water <br />Mike Drinks Daily<br />(mL)</td>
<td class="tg-9wq8">$x_1$</td>
<td class="tg-9wq8">$x_1 = f_1(t)$</td>
<td class="tg-9wq8">500</td>
<td class="tg-9wq8">800</td>
<td class="tg-9wq8">900</td>
<td class="tg-9wq8">1500</td>
<td class="tg-9wq8">3000</td>
</tr>
<tr>
<td class="tg-9wq8">Volume of Urine <br />Mike Has Daily<br />(mL)<br /></td>
<td class="tg-9wq8">$x_2$</td>
<td class="tg-9wq8">$x_2 = f_2(x_1)$</td>
<td class="tg-9wq8">500</td>
<td class="tg-9wq8">800</td>
<td class="tg-9wq8">900</td>
<td class="tg-9wq8">1500</td>
<td class="tg-9wq8">3000</td>
</tr>
<tr>
<td class="tg-9wq8">Number of Fires <br />at California<br />Daily</td>
<td class="tg-9wq8">$x_3$</td>
<td class="tg-9wq8">$x_3 = f_3(t)$</td>
<td class="tg-9wq8">5</td>
<td class="tg-9wq8">8</td>
<td class="tg-9wq8">9</td>
<td class="tg-9wq8">15</td>
<td class="tg-9wq8">30</td>
</tr>
</table>
<p>From the data, we find that each pair among $x_1$, $x_2$, and $x_3$ is highly correlated. Can we say that $x_1$ and $x_2$ have a causal relationship, and further that $x_1$ caused $x_2$? Can we say that $x_1$ and $x_3$ have a causal relationship, and further that $x_1$ caused $x_3$?</p>
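<p>These pairwise correlations can be checked numerically. Below is a minimal sketch using NumPy on the table values above (the array names are mine):</p>

```python
import numpy as np

# Values from the table above (t is excluded since it was not collected)
x1 = [500, 800, 900, 1500, 3000]  # volume of water Mike drinks daily (mL)
x2 = [500, 800, 900, 1500, 3000]  # volume of urine Mike has daily (mL)
x3 = [5, 8, 9, 15, 30]            # number of fires in California daily

# Pairwise Pearson correlation coefficients
corr = np.corrcoef([x1, x2, x3])
print(corr)  # every off-diagonal coefficient is 1.0 for this toy data
```

<p>Every pair is perfectly correlated here because the toy values are exact linear functions of one another; real data would be noisier but still show strong correlations.</p>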
<h3 id="correlation-is-not-necessarily-causation">Correlation Is Not Necessarily Causation</h3>
<p>With some common sense, we know that $x_1$, the volume of water Mike drinks daily, caused $x_2$, the volume of urine Mike has daily. However, since Mike is just an ordinary person without divine powers, $x_1$ certainly could not have caused $x_3$, the number of fires in California daily.</p>
<p><br /></p>
<p>This example concretely shows that correlation is a necessary but insufficient condition for causation.</p>
<p><br /></p>
<p>The next question is how to confirm or rule out causation among all the correlated pairs. The correct way is to do experiments.</p>
<h3 id="determine-causation-by-experiment">Determine Causation By Experiment</h3>
<p>In this case, we keep $t$ the same (although we are not monitoring it), increase $x_1$, and monitor the changes in $x_2$ and $x_3$. That is to say, we keep the temperature the same, ask Mike to drink more water daily, and monitor the changes in the volume of urine Mike has daily and in the number of fires in California daily.</p>
<p><br /></p>
<p>Of course, Mike would produce more urine daily, but the number of fires in California should not change. This experimental result confirms that $x_1$ caused $x_2$, but that $x_1$ did not cause $x_3$.</p>
<h3 id="caveats">Caveats</h3>
<p>However, if we do the experiment incorrectly, we might find spurious causal relationships.</p>
<p><br /></p>
<p>For example, if $t$ goes up without our awareness (since we are not monitoring it) while we increase $x_1$ and monitor the changes in $x_2$ and $x_3$, then $x_2$ and $x_3$ would both increase. We would then draw the incorrect conclusion that $x_1$ caused $x_3$, i.e., that the volume of water Mike drinks daily caused the number of fires in California daily.</p>
<p><br /></p>
<p>Since we are not aware of $t$ increasing, or might not even know of the existence of $t$, how do we rule out the possibility that $x_1$ caused $x_3$? In this particular setup, it is almost impossible. However, if we had a clone of Mike who is exactly the same as Mike in every aspect, we could ask the first Mike to drink more water daily, ask the second Mike to drink the same amount of water daily as usual, and monitor the changes in $x_2$ and $x_3$ for both. We would find that although the first Mike suggested that $x_1$ caused $x_3$, the second Mike's $x_3$ increased even though his $x_1$ remained the same, which is inconsistent with the causal relationship that $x_1$ caused $x_3$. This rules out the causal relationship between $x_1$ and $x_3$.</p>
<p><br /></p>
<p>Such experiments are called control experiments. Essentially, when a control experiment shows something inconsistent with the causal relationship you found, that causal relationship is fake. This is extremely useful: even though you do not know how many hidden variables you are failing to capture, as long as you can guarantee that the variables having no causal relationship with the variable you are experimenting on remain the same in both the control experiment and the actual experiment, any causal relationships you find will be reliable.</p>
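<p>The logic of the control experiment can be sketched as a toy simulation. The functional forms of $f_2$ and $f_3$ below are hypothetical, invented purely for illustration:</p>

```python
def f2(x1):
    """Hypothetical: urine volume simply equals water intake."""
    return x1

def f3(t):
    """Hypothetical: fires depend on temperature only."""
    return t // 10 * 5

# The hidden variable t rises during the experiment without our knowledge.
t_before, t_after = 20, 40

# Actual experiment: Mike drinks more water; x3 also rises (due to t).
x1_actual = (900, 1800)
x3_actual = (f3(t_before), f3(t_after))   # looks as if x1 caused x3

# Control experiment: the clone drinks the same amount as usual.
x1_control = (900, 900)
x3_control = (f3(t_before), f3(t_after))  # x3 still rises although x1 is unchanged

# The control is inconsistent with "x1 caused x3", so that causal claim is ruled out.
print(x1_control, x3_control)
```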
<p><br /></p>
<p>In practice, however, the system is much more complicated, and the single variable you thought you were changing might actually turn out to contain many variables. For example, when Mike starts to drink more water, he has to use exactly the same cup he used to drink from, and the water has to be exactly the same water he used to drink. If the cup or the water quality changed, then in principle the conclusion that $x_1$ caused $x_2$ and that $x_1$ did not cause $x_3$ would not hold.</p>
<h3 id="conclusions">Conclusions</h3>
<p>Determining causation is extremely complicated and can often go wrong. This is because there are infinitely many variables when you are doing experiments; even with a very good control experiment, you might not be aware of how many variables you have changed in the actual experiment, and thus derive the wrong conclusion. This is essentially why many sophisticated scientific findings, especially in life and biomedical science, turn out not to be true.</p>
<p><br /></p>
<p>Nevertheless, the key ideas in doing experiments to determine causal relationships are:</p>
<ul>
<li>Do control experiments.</li>
<li>Minimize, and be aware of, changes to other variables when you think you are changing only one single variable.</li>
</ul>
<p><a href="https://leimao.github.io/blog/Correlation-vs-Causation/">Correlation VS Causation</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on November 28, 2019.</p><![CDATA[Bilingual Evaluation Understudy (BLEU)]]>https://leimao.github.io/blog/BLEU-Score2019-11-17 14:17:25 -0400T00:00:00-00:002019-11-17T00:00:00-06:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>BLEU is a standard algorithm for evaluating machine translations against human translations. At first I thought it would be very straightforward to use. However, it turns out that there are a lot of caveats.</p>
<p><br /></p>
<p>In this blog post, I am going to walk through the BLEU algorithm in detail and discuss the caveats.</p>
<h3 id="english-translation-example">English Translation Example</h3>
<p>We will use the following examples to illustrate how to compute the BLEU scores.</p>
<h4 id="example-1">Example 1</h4>
<p>Chinese: 猫坐在垫子上</p>
<p>Reference 1: the cat is on the mat</p>
<p>Reference 2: there is a cat on the mat</p>
<p>Candidate: the cat the cat on the mat</p>
<h4 id="example-2">Example 2</h4>
<p>Chinese: 猫坐在垫子上</p>
<p>Reference 1: the cat is on the mat</p>
<p>Reference 2: there is a cat on the mat</p>
<p>Candidate: the the the the the the the the</p>
<h3 id="precision">Precision</h3>
<p>For each $n$-gram in the candidate sentence, we check whether it appears in any of the reference sentences. We then gather the total count for each unique $n$-gram, sum these counts, and divide by the total number of $n$-grams in the candidate sentence.</p>
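<p>This count-based precision can be sketched in a few lines of Python (an illustrative implementation using the two examples above; the function names are mine):</p>

```python
def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def precision(candidate, references, n):
    """Fraction of candidate n-grams that appear in any reference sentence."""
    cand = ngrams(candidate.split(), n)
    ref_grams = set()
    for ref in references:
        ref_grams.update(ngrams(ref.split(), n))
    return sum(g in ref_grams for g in cand) / len(cand) if cand else 0.0

refs = ["the cat is on the mat", "there is a cat on the mat"]
print(precision("the cat the cat on the mat", refs, 1))       # 1.0
print(precision("the cat the cat on the mat", refs, 2))       # 5/6 ≈ 0.833
print(precision("the the the the the the the the", refs, 2))  # 0.0
```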
<h4 id="example-1-1">Example 1</h4>
<p>We first compute the unigram precision for example 1. All the unigrams in the candidate sentence appear in the reference sentences.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Unigram</th>
<th class="tg-c3tw">Shown?</th>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">cat</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">cat</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">on</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-baqh">the</td>
<td class="tg-baqh">1</td>
</tr>
<tr>
<td class="tg-baqh">mat</td>
<td class="tg-baqh">1</td>
</tr>
</table>
<p>We then merge the unigram counts together.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Unique Unigram</th>
<th class="tg-c3tw">Count</th>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">3</td>
</tr>
<tr>
<td class="tg-c3ow">cat</td>
<td class="tg-c3ow">2</td>
</tr>
<tr>
<td class="tg-c3ow">on</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">mat</td>
<td class="tg-c3ow">1</td>
</tr>
</table>
<p>The total number of counts for the unique unigrams in the candidate sentence is 7, and the total number of unigrams in the candidate sentence is 7. The unigram precision is 7/7 = 1.0 for example 1.</p>
<p><br /></p>
<p>We then try to compute the bigram precision for example 1.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Bigram</th>
<th class="tg-c3tw">Shown?</th>
</tr>
<tr>
<td class="tg-c3ow">the cat</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">cat the</td>
<td class="tg-c3ow">0</td>
</tr>
<tr>
<td class="tg-c3ow">the cat</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">cat on</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-baqh">on the</td>
<td class="tg-baqh">1</td>
</tr>
<tr>
<td class="tg-baqh">the mat</td>
<td class="tg-baqh">1</td>
</tr>
</table>
<p>We then merge the bigram counts together.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Unique Bigram</th>
<th class="tg-c3tw">Count</th>
</tr>
<tr>
<td class="tg-c3ow">the cat</td>
<td class="tg-c3ow">2</td>
</tr>
<tr>
<td class="tg-c3ow">cat the</td>
<td class="tg-c3ow">0</td>
</tr>
<tr>
<td class="tg-c3ow">cat on</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-baqh">on the</td>
<td class="tg-baqh">1</td>
</tr>
<tr>
<td class="tg-baqh">the mat</td>
<td class="tg-baqh">1</td>
</tr>
</table>
<p>The total number of counts for the unique bigrams in the candidate sentence is 5, and the total number of bigrams in the candidate sentence is 6. The bigram precision is 5/6 = 0.833 for example 1.</p>
<h4 id="example-2-1">Example 2</h4>
<p>We first compute the unigram precision for example 2. All the unigrams in the candidate sentence appear in the reference sentences.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Unigram</th>
<th class="tg-c3tw">Shown?</th>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">1</td>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">1</td>
</tr>
</table>
<p>We then merge the unigram counts together.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Unique Unigram</th>
<th class="tg-c3tw">Count</th>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">8</td>
</tr>
</table>
<p>The total number of counts for the unique unigrams in the candidate sentence is 8, and the total number of unigrams in the candidate sentence is 8. The unigram precision is 8/8 = 1.0 for example 2.</p>
<p><br /></p>
<p>We then try to compute the bigram precision for example 2.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Bigram</th>
<th class="tg-c3tw">Shown?</th>
</tr>
<tr>
<td class="tg-c3ow">the the</td>
<td class="tg-c3ow">0</td>
</tr>
<tr>
<td class="tg-c3ow">the the</td>
<td class="tg-c3ow">0</td>
</tr>
<tr>
<td class="tg-c3ow">the the</td>
<td class="tg-c3ow">0</td>
</tr>
<tr>
<td class="tg-c3ow">the the</td>
<td class="tg-c3ow">0</td>
</tr>
<tr>
<td class="tg-baqh">the the</td>
<td class="tg-baqh">0</td>
</tr>
<tr>
<td class="tg-baqh">the the</td>
<td class="tg-baqh">0</td>
</tr>
<tr>
<td class="tg-baqh">the the</td>
<td class="tg-baqh">0</td>
</tr>
</table>
<p>We then merge the bigram counts together.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Unique Bigram</th>
<th class="tg-c3tw">Count</th>
</tr>
<tr>
<td class="tg-c3ow">the the</td>
<td class="tg-c3ow">0</td>
</tr>
</table>
<p>The total number of counts for the unique bigrams in the candidate sentence is 0, and the total number of bigrams in the candidate sentence is 7. The bigram precision is 0/7 = 0 for example 2.</p>
<h4 id="drawbacks">Drawbacks</h4>
<p>We can see from examples 1 and 2 that unigram precision can easily be over-confident about the quality of a machine translation. To overcome this, the clipped count and modified precision were proposed.</p>
<h3 id="modified-precision">Modified Precision</h3>
<p>For each unique $n$-gram in the candidate sentence, we count its frequency within each single reference sentence and take the maximum over all references. The minimum of this reference count and the original candidate count is called the clipped count. That is to say, the clipped count is no greater than the original count. We then use this clipped count, in place of the original count, when computing the modified precision.</p>
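<p>A minimal sketch of the clipped count and modified precision in Python (illustrative only; the function names are mine):</p>

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Each candidate n-gram count is clipped by its maximum count
    within any single reference sentence."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
    total = sum(cand_counts.values())
    clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
    return clipped / total if total else 0.0

refs = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_precision("the cat the cat on the mat", refs, 1))       # 5/7 ≈ 0.714
print(modified_precision("the cat the cat on the mat", refs, 2))       # 4/6 ≈ 0.667
print(modified_precision("the the the the the the the the", refs, 1))  # 2/8 = 0.25
```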
<h4 id="example-1-2">Example 1</h4>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Unique Unigram</th>
<th class="tg-c3tw">Count</th>
<th class="tg-c3tw">Clipped Count</th>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">3</td>
<td class="tg-baqh">2</td>
</tr>
<tr>
<td class="tg-c3ow">cat</td>
<td class="tg-c3ow">2</td>
<td class="tg-baqh">1</td>
</tr>
<tr>
<td class="tg-c3ow">on</td>
<td class="tg-c3ow">1</td>
<td class="tg-baqh">1</td>
</tr>
<tr>
<td class="tg-c3ow">mat</td>
<td class="tg-c3ow">1</td>
<td class="tg-baqh">1</td>
</tr>
</table>
<p>The total number of clipped counts for the unique unigrams in the candidate sentence is 5, and the total number of unigrams in the candidate sentence is 7. The unigram modified precision is 5/7 = 0.714 for example 1.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Unique Bigram</th>
<th class="tg-c3tw">Count</th>
<th class="tg-c3tw">Clipped Count</th>
</tr>
<tr>
<td class="tg-c3ow">the cat</td>
<td class="tg-c3ow">2</td>
<td class="tg-baqh">1</td>
</tr>
<tr>
<td class="tg-c3ow">cat the</td>
<td class="tg-c3ow">0</td>
<td class="tg-baqh">0</td>
</tr>
<tr>
<td class="tg-c3ow">cat on</td>
<td class="tg-c3ow">1</td>
<td class="tg-baqh">1</td>
</tr>
<tr>
<td class="tg-c3ow">on the</td>
<td class="tg-c3ow">1</td>
<td class="tg-baqh">1</td>
</tr>
<tr>
<td class="tg-baqh">the mat</td>
<td class="tg-baqh">1</td>
<td class="tg-baqh">1</td>
</tr>
</table>
<p>The total number of clipped counts for the unique bigrams in the candidate sentence is 4, and the total number of bigrams in the candidate sentence is 6. The bigram modified precision is 4/6 = 0.667 for example 1.</p>
<h4 id="example-2-2">Example 2</h4>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Unique Unigram</th>
<th class="tg-c3tw">Count</th>
<th class="tg-c3tw">Clipped Count</th>
</tr>
<tr>
<td class="tg-c3ow">the</td>
<td class="tg-c3ow">8</td>
<td class="tg-baqh">2</td>
</tr>
</table>
<p>The total number of clipped counts for the unique unigrams in the candidate sentence is 2, and the total number of unigrams in the candidate sentence is 8. The unigram modified precision is 2/8 = 0.25 for example 2.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-c3tw{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-c3tw">Unique Bigram</th>
<th class="tg-c3tw">Count</th>
<th class="tg-c3tw">Clipped Count</th>
</tr>
<tr>
<td class="tg-c3ow">the the</td>
<td class="tg-c3ow">0</td>
<td class="tg-baqh">0</td>
</tr>
</table>
<p>The total number of clipped counts for the unique bigrams in the candidate sentence is 0, and the total number of bigrams in the candidate sentence is 7. The bigram modified precision is 0/7 = 0 for example 2.</p>
<h4 id="advantages">Advantages</h4>
<p>Compared to the vanilla precision, modified precision is a better metric, at least for the unigram examples above.</p>
<h3 id="bleu">BLEU</h3>
<h4 id="algorithm">Algorithm</h4>
<p>BLEU is computed from a set of $n$-gram modified precisions. Specifically,</p>
<script type="math/tex; mode=display">\text{BLEU} = \text{BP} \cdot \exp \bigg( \sum_{n=1}^{N} w_n \log p_n \bigg)</script>
<p>where $p_n$ is the modified precision for $n$-grams, the base of $\log$ is the natural base $e$, $w_n$ is a weight between 0 and 1 for $\log p_n$ with $\sum_{n=1}^{N} w_n = 1$, and BP is the brevity penalty that penalizes short machine translations.</p>
<script type="math/tex; mode=display">% <![CDATA[
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
\exp \big(1-\frac{r}{c}\big) & \text{if } c \leq r
\end{cases} %]]></script>
<p>where $c$ is the number of unigrams in the candidate sentence, and $r$ is the number of unigrams in the reference sentence whose length is closest to the candidate&#8217;s length (the effective reference length).</p>
<p><br /></p>
<p>It is not hard to see that BLEU is always a value between 0 and 1, because BP, $w_n$, and $p_n$ are always between 0 and 1, and</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\exp \bigg( \sum_{n=1}^{N} w_n \log p_n \bigg) &= \prod_{n=1}^{N} \exp \big( w_n \log p_n \big) \\
&= \prod_{n=1}^{N} \Big[ \exp \big( \log p_n \big) \Big]^{w_n} \\
&= \prod_{n=1}^{N} {p_n}^{w_n} \\
&\in [0,1]
\end{align} %]]></script>
<p>Usually, BLEU uses $N = 4$ and $w_n = \frac{1}{N}$.</p>
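<p>Putting the modified precisions, the weights, and the brevity penalty together, BLEU can be implemented in a short self-contained sketch. This is my own illustration, not the NLTK implementation; like our manual computation below, it substitutes $10^{-100}$ for zero precisions so that $\log$ is defined.</p>

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25)):
    log_sum = 0.0
    for n, w in enumerate(weights, start=1):
        cand_counts = ngram_counts(candidate, n)
        max_ref_counts = Counter()
        for reference in references:
            for ngram, count in ngram_counts(reference, n).items():
                max_ref_counts[ngram] = max(max_ref_counts[ngram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        p_n = clipped / max(sum(cand_counts.values()), 1)
        # Substitute a tiny value for p_n = 0 so that log is defined.
        log_sum += w * math.log(p_n if p_n > 0 else 1e-100)
    c = len(candidate)
    # r: the length of the reference closest in length to the candidate.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_sum)

references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(round(bleu(references, "the cat the cat on the mat".split()), 3))  # 0.467
```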
<h4 id="example-1-3">Example 1</h4>
<p>We have computed the modified precisions for some of the $n$-grams. It is not hard to compute the others. Concretely, we have</p>
<script type="math/tex; mode=display">p_1 = \frac{5}{7}\\
p_2 = \frac{4}{6}\\
p_3 = \frac{2}{5}\\
p_4 = \frac{1}{4}</script>
<script type="math/tex; mode=display">w_1 = w_2 = w_3 = w_4 = \frac{1}{4}</script>
<script type="math/tex; mode=display">\text{BP} = 1</script>
<p>Plugging these values into the BLEU equation, the BLEU is</p>
<script type="math/tex; mode=display">\text{BLEU} = 0.467</script>
<p>We further compare the BLEU to the BLEU computed using <a href="https://www.nltk.org/">NLTK</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">nltk</span>
<span class="o">>>></span> <span class="n">reference_1</span> <span class="o">=</span> <span class="s">"the cat is on the mat"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">reference_2</span> <span class="o">=</span> <span class="s">"there is a cat on the mat"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">candidate</span> <span class="o">=</span> <span class="s">"the cat the cat on the mat"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">bleu</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">translate</span><span class="o">.</span><span class="n">bleu_score</span><span class="o">.</span><span class="n">sentence_bleu</span><span class="p">(</span><span class="n">references</span><span class="o">=</span><span class="p">[</span><span class="n">reference_1</span><span class="p">,</span> <span class="n">reference_2</span><span class="p">],</span> <span class="n">hypothesis</span><span class="o">=</span><span class="n">candidate</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="p">(</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">))</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">bleu</span><span class="p">)</span>
<span class="mf">0.4671379777282001</span>
</code></pre></div></div>
<p>The value of <code class="language-plaintext highlighter-rouge">bleu</code> is 0.467, which exactly matches the BLEU we computed manually.</p>
<h4 id="example-2-3">Example 2</h4>
<p>Similarly,</p>
<script type="math/tex; mode=display">p_1 = \frac{2}{8}\\
p_2 = \frac{0}{7}\\
p_3 = \frac{0}{6}\\
p_4 = \frac{0}{5}</script>
<script type="math/tex; mode=display">w_1 = w_2 = w_3 = w_4 = \frac{1}{4}</script>
<script type="math/tex; mode=display">\text{BP} = 1</script>
<p>When we plug these values into the BLEU equation, we would actually need to compute $\log 0$, which is not mathematically defined. We use a small number $10^{-100}$ instead of $0$ for $p_2$, $p_3$, and $p_4$. The BLEU is</p>
<script type="math/tex; mode=display">\text{BLEU} = 0</script>
<p>We again compare this BLEU to the one computed using <a href="https://www.nltk.org/">NLTK</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">nltk</span>
<span class="o">>>></span> <span class="n">reference_1</span> <span class="o">=</span> <span class="s">"the cat is on the mat"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">reference_2</span> <span class="o">=</span> <span class="s">"there is a cat on the mat"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">candidate</span> <span class="o">=</span> <span class="s">"the the the the the the the the"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">bleu</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">translate</span><span class="o">.</span><span class="n">bleu_score</span><span class="o">.</span><span class="n">sentence_bleu</span><span class="p">(</span><span class="n">references</span><span class="o">=</span><span class="p">[</span><span class="n">reference_1</span><span class="p">,</span> <span class="n">reference_2</span><span class="p">],</span> <span class="n">hypothesis</span><span class="o">=</span><span class="n">candidate</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="p">(</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">))</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">bleu</span><span class="p">)</span>
<span class="mf">1.2882297539194154e-231</span>
</code></pre></div></div>
<p>The value of <code class="language-plaintext highlighter-rouge">bleu</code> is $1.3 \times 10^{-231}$, which is effectively the 0 we computed manually; the tiny discrepancy presumably comes from NLTK substituting a different small constant for the zero precisions.</p>
<h3 id="caveats">Caveats</h3>
<p>In some scenarios, BLEU does not score translations very well, especially short translations with few reference sentences. For example,</p>
<p><br /></p>
<p>Chinese: 你准备好了吗？</p>
<p>Reference 1: are you ready ?</p>
<p>Candidate: you are ready ?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">nltk</span>
<span class="o">>>></span> <span class="n">reference_1</span> <span class="o">=</span> <span class="s">"are you ready ?"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">candidate</span> <span class="o">=</span> <span class="s">"you are ready ?"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">bleu</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">translate</span><span class="o">.</span><span class="n">bleu_score</span><span class="o">.</span><span class="n">sentence_bleu</span><span class="p">(</span><span class="n">references</span><span class="o">=</span><span class="p">[</span><span class="n">reference_1</span><span class="p">],</span> <span class="n">hypothesis</span><span class="o">=</span><span class="n">candidate</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="p">[</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">])</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">bleu</span><span class="p">)</span>
<span class="mf">1.133422688662942e-154</span>
</code></pre></div></div>
<p>This is actually a very good machine translation to me. However, the BLEU score is essentially 0, which suggests that the machine translation is totally wrong.</p>
<p><br /></p>
<p>In NLTK, you are allowed to provide <a href="https://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.SmoothingFunction">smoothing functions</a>. For example,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">nltk</span>
<span class="o">>>></span> <span class="n">reference_1</span> <span class="o">=</span> <span class="s">"are you ready ?"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">candidate</span> <span class="o">=</span> <span class="s">"you are ready ?"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">bleu</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">translate</span><span class="o">.</span><span class="n">bleu_score</span><span class="o">.</span><span class="n">sentence_bleu</span><span class="p">(</span><span class="n">references</span><span class="o">=</span><span class="p">[</span><span class="n">reference_1</span><span class="p">],</span> <span class="n">hypothesis</span><span class="o">=</span><span class="n">candidate</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="p">[</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.25</span><span class="p">],</span> <span class="n">smoothing_function</span><span class="o">=</span><span class="n">nltk</span><span class="o">.</span><span class="n">translate</span><span class="o">.</span><span class="n">bleu_score</span><span class="o">.</span><span class="n">SmoothingFunction</span><span class="p">()</span><span class="o">.</span><span class="n">method7</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">bleu</span><span class="p">)</span>
<span class="mf">0.4002926439114545</span>
</code></pre></div></div>
<p>This time, the value of <code class="language-plaintext highlighter-rouge">bleu</code> is 0.4, which is much higher than the vanilla one we computed without a smoothing function.</p>
<p><br /></p>
<p>However, one should always be cautious about the smoothing function used in BLEU computation. At the very least, we have to make sure that the BLEU scores we compare against use no smoothing function or the exact same smoothing function.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://www.aclweb.org/anthology/P02-1040/">BLEU: a Method for Automatic Evaluation of Machine Translation</a></li>
<li><a href="https://www.youtube.com/watch?v=DejHQYAGb7Q">BLEU - Andrew Ng</a></li>
</ul>
<p><a href="https://leimao.github.io/blog/BLEU-Score/">Bilingual Evaluation Understudy (BLEU)</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on November 17, 2019.</p><![CDATA[RSA Algorithm]]>https://leimao.github.io/article/RSA-Algorithm2019-11-10 14:17:25 -0400T00:00:00-00:002019-11-10T00:00:00-06:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>The RSA (Rivest–Shamir–Adleman) algorithm is an asymmetric cryptographic algorithm that is widely used in <a href="https://leimao.github.io/blog/Public-Key-Cryptosystem-and-Digital-Signature/">the modern public-key cryptosystems</a>. We hear about the RSA algorithm all the time, but some of us do not actually know what it really is and how it works.</p>
<p><br /></p>
<p>In this article, I will systematically discuss the theory behind the RSA algorithm. The theory guarantees that cryptosystems built on top of the RSA algorithm are relatively safe and hard to crack, which is fundamentally interesting.</p>
<h3 id="prerequisites">Prerequisites</h3>
<h4 id="eulers-totient-function">Euler’s Totient Function</h4>
<p>In number theory, Euler’s totient function, also called Euler’s phi function and denoted $\varphi(n)$, counts the positive integers up to a given integer $n$ that are relatively prime to $n$. In other words, it is the number of integers $k$ in the range $1 \leq k \leq n$ for which the greatest common divisor $\gcd(n, k)$ is equal to 1.</p>
<p><br /></p>
<p>Euler’s totient function is a multiplicative function, meaning that if two numbers $m$ and $n$ are relatively prime, then,</p>
<script type="math/tex; mode=display">\varphi(mn) = \varphi(m)\varphi(n)</script>
<p>If $k$ numbers, $\{m_1, m_2, \cdots, m_k\}$, are pairwise relatively prime, then</p>
<script type="math/tex; mode=display">\varphi(\prod_{i=1}^{k}m_i) = \prod_{i=1}^{k} \varphi(m_i)</script>
<p>A concrete proof of this property could be found <a href="https://exploringnumbertheory.wordpress.com/2015/11/13/eulers-phi-function-is-multiplicative/">here</a>; it requires the Chinese remainder theorem.</p>
<p><br /></p>
<p>When $n$ is a prime number, by the definition of a prime, $\varphi(n) = n-1$. If $m$ and $n$ are distinct prime numbers, then because $m$ and $n$ are relatively prime, we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\varphi(mn) &= \varphi(m)\varphi(n) \\
&= (m-1)(n-1)
\end{aligned} %]]></script>
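<p>A brute-force computation of $\varphi$ makes these properties easy to verify numerically. The following sketch is for illustration only; it is far too slow for the large numbers used in RSA.</p>

```python
from math import gcd

def phi(n):
    # Brute-force Euler's totient: count k in [1, n] with gcd(n, k) == 1.
    return sum(1 for k in range(1, n + 1) if gcd(n, k) == 1)

# Multiplicativity for relatively prime m and n.
assert phi(4 * 9) == phi(4) * phi(9)
# For distinct primes p and q: phi(p * q) = (p - 1) * (q - 1).
assert phi(11 * 13) == (11 - 1) * (13 - 1)
print(phi(11 * 13))  # 120
```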
<h4 id="eulers-theorem">Euler’s Theorem</h4>
<p>If $m$ and $n$ are relatively prime, then,</p>
<script type="math/tex; mode=display">m^{\varphi(n)} \equiv 1 \pmod n</script>
<p>where $\varphi(n)$ is Euler’s totient function. This theorem is very famous, and there are a couple of different proofs of it. One of them could be found <a href="https://brilliant.org/wiki/eulers-theorem/">here</a>.</p>
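<p>Euler’s theorem can be checked numerically with Python’s three-argument <code>pow</code>, which performs modular exponentiation. A quick self-contained sketch, using a brute-force totient:</p>

```python
from math import gcd

def phi(n):
    # Brute-force Euler's totient for small n.
    return sum(1 for k in range(1, n + 1) if gcd(n, k) == 1)

for m, n in [(3, 10), (7, 20), (5, 12), (2, 9)]:
    assert gcd(m, n) == 1
    # Euler's theorem: m^phi(n) = 1 (mod n) when gcd(m, n) == 1.
    assert pow(m, phi(n), n) == 1
print("Euler's theorem verified for the sample pairs")
```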
<h4 id="multiplicative-inverse-theorem">Multiplicative Inverse Theorem</h4>
<p>Let $n$ and $x$ be positive integers. Then $x$ has a multiplicative inverse modulo $n$ if and only if $\gcd(n, x) = 1$. Moreover, if it exists, then the multiplicative inverse is unique.</p>
<p><br /></p>
<p>Equivalently, that is to say, let $n$ and $x$ be positive integers,</p>
<script type="math/tex; mode=display">xy \equiv 1 \pmod n</script>
<p>$y \bmod n$ exists if and only if $\gcd(n, x) = 1$, and $y \bmod n$ is unique.</p>
<p><br /></p>
<p>Note that as long as the multiplicative inverse $y \bmod n$ exists, all the integers that have the same $y \bmod n$ satisfy $xy \equiv 1 \pmod n$, but there is only one such integer with $0 \leq y \leq n-1$. For instance, if there is a $y^\ast$ such that $xy^\ast \equiv 1 \pmod n$, then $xy^\ast - 1 = kn$ for some integer $k$. Any other $y$ with $y=y^\ast + tn$, for any integer $t$, also satisfies $xy - 1 = k^\prime n$ for some integer $k^\prime$ (this is easy to show). Therefore, we also have $xy \equiv 1 \pmod n$. Because there could be infinitely many $y$ satisfying $xy \equiv 1 \pmod n$, we take the multiplicative inverse to be $y \bmod n$, which is unique if it exists.</p>
<p><br /></p>
<p><em>Proof</em></p>
<p><br /></p>
<p>To prove the sufficient condition and the uniqueness: there are $n$ possibilities for $y \bmod n$, namely $0, 1, 2, \cdots, n-1$, so the value of $xy$ could be $0x, 1x, 2x, \cdots, (n-1)x$. We are going to show that $0x \bmod n, 1x \bmod n, 2x \bmod n, \cdots, (n-1)x \bmod n$ are distinct if $\gcd(n, x) = 1$. Suppose there are two distinct integers $a, b$ with $0 \leq a, b \leq n-1$ such that $ax \equiv bx \pmod n$. Then $(a-b)x = kn$ for some integer $k$. Because $\gcd(n, x) = 1$, we must have $a-b = hn$ for some integer $h$. However, since $a-b$ lies in the range $[-n+1, n-1]$ and $a-b \neq 0$ because $a$ and $b$ are distinct, no integer $h$ could satisfy $a-b = hn$. Thus, $0x \bmod n, 1x \bmod n, 2x \bmod n, \cdots, (n-1)x \bmod n$ have to be distinct if $\gcd(n, x) = 1$. Since these are $n$ distinct values among the $n$ possible remainders, exactly one of them must be 1. Therefore, $y \bmod n$ exists if $\gcd(n, x) = 1$, and $y \bmod n$ is unique.</p>
<p><br /></p>
<p>To prove the necessary condition: given a non-negative integer $y$ such that $xy \equiv 1 \pmod n$, we have $xy - 1 = kn$ for some integer $k$. Suppose $\gcd(n, x) > 1$; we divide both sides of the equation by $\gcd(n, x)$.</p>
<script type="math/tex; mode=display">\begin{gather}
\frac{xy - 1}{\gcd(n, x)} = \frac{kn}{\gcd(n, x)} \\
\frac{xy}{\gcd(n, x)} - \frac{1}{\gcd(n, x)} = \frac{kn}{\gcd(n, x)}
\end{gather}</script>
<p>Because $\frac{xy}{\gcd(n, x)}$ and $\frac{kn}{\gcd(n, x)}$ are integers, but $\frac{1}{\gcd(n, x)}$ is not an integer since $\gcd(n, x) > 1$, the equation cannot hold. This is a contradiction. Therefore, $\gcd(n, x) = 1$ if $y \bmod n$ exists.</p>
<p><br /></p>
<p>This concludes the proof. $\square$</p>
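<p>The theorem also suggests an inefficient but transparent way to find the multiplicative inverse: scan all candidates $0 \leq y \leq n-1$. The following sketch (function name is my own) is for illustration; in practice the extended Euclidean algorithm is used instead.</p>

```python
def mod_inverse(x, n):
    # Brute-force search for the unique y in [0, n) with x * y = 1 (mod n).
    for y in range(n):
        if (x * y) % n == 1:
            return y
    return None  # No inverse exists when gcd(n, x) > 1.

assert mod_inverse(3, 10) == 7     # gcd(10, 3) == 1, and 3 * 7 = 21 = 1 (mod 10)
assert mod_inverse(4, 10) is None  # gcd(10, 4) == 2 > 1, so no inverse
```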
<h4 id="lemma-1">Lemma 1</h4>
<p>If $m$ and $n$ are relatively prime, then,</p>
<script type="math/tex; mode=display">m^{k\varphi(n)+1} \equiv m \pmod n</script>
<p>where $\varphi(n)$ is Euler’s totient function, and $k$ is any integer.</p>
<p><br /></p>
<p><em>Proof</em></p>
<p><br /></p>
<p>Using the compatibility with scaling in the <a href="https://en.wikipedia.org/wiki/Modular_arithmetic#Properties">modular arithmetic properties</a>, we multiply both sides of $m^{\varphi(n)} \equiv 1 \pmod n$ from Euler’s theorem by $m$ and obtain</p>
<script type="math/tex; mode=display">m^{\varphi(n)+1} \equiv m \pmod n</script>
<p>We further multiply both sides of $m^{\varphi(n)+1} \equiv m \pmod n$ by $m^{\varphi(n)}$ and obtain</p>
<script type="math/tex; mode=display">m^{2\varphi(n)+1} \equiv m^{\varphi(n)+1} \pmod n</script>
<p>By induction, we could show that</p>
<script type="math/tex; mode=display">m^{k\varphi(n)+1} \equiv m \pmod n</script>
<p>for any non-negative integer $k$.</p>
<p><br /></p>
<p>Similarly, we multiply both sides of $m^{\varphi(n)+1} \equiv m \pmod n$ by $m^{-\varphi(n)}$ and obtain</p>
<script type="math/tex; mode=display">m \equiv m^{-\varphi(n)+1} \pmod n</script>
<p>By induction, we could show that</p>
<script type="math/tex; mode=display">m \equiv m^{-k\varphi(n)+1} \pmod n</script>
<p>for any negative integer $-k$.</p>
<p><br /></p>
<p>This concludes the proof that the congruence is valid for any integer $k$. $\square$</p>
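<p>Lemma 1 can be verified numerically as well, again with a brute-force totient (a quick sketch for small numbers only):</p>

```python
from math import gcd

def phi(n):
    # Brute-force Euler's totient for small n.
    return sum(1 for k in range(1, n + 1) if gcd(n, k) == 1)

m, n = 3, 10  # gcd(3, 10) == 1
for k in range(5):
    # Lemma 1: m^(k * phi(n) + 1) = m (mod n) for any non-negative k.
    assert pow(m, k * phi(n) + 1, n) == m % n
print("Lemma 1 verified for m = 3, n = 10, k = 0..4")
```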
<h3 id="rsa-algorithm">RSA Algorithm</h3>
<h4 id="basic-features-of-public-key-cryptosystems">Basic Features of Public-Key Cryptosystems</h4>
<p>The RSA algorithm is used as a typical public-key cryptosystem. Therefore, it has to satisfy the <a href="https://leimao.github.io/blog/Public-Key-Cryptosystem-and-Digital-Signature/">four basic features</a> of public-key cryptosystems. I am copying the descriptions of the features here for the completeness of this article.</p>
<p><br /></p>
<p>The encryption and decryption functions using public key and private key in the RSA algorithm are denoted by $E$ and $D$, respectively. $M$ is used to represent the message to be encrypted and sent. The four basic features of a public-key cryptosystem as well as the RSA algorithm are:</p>
<ul>
<li>Decrypting an encrypted message gives you the original message.</li>
</ul>
<script type="math/tex; mode=display">D(E(M)) = M</script>
<ul>
<li>Encrypting a decrypted message gives you the original message.</li>
</ul>
<script type="math/tex; mode=display">E(D(M)) = M</script>
<ul>
<li>
<p>$E$ and $D$ are easy to compute.</p>
</li>
<li>
<p>The publicity of $E$ does not compromise the secrecy of $D$.</p>
</li>
</ul>
<h4 id="rsa-basic-principle">RSA Basic Principle</h4>
<p>A basic principle behind RSA is the observation that it is practical to find three very large positive integers $e$, $d$ and $n$ such that with modular exponentiation for all integers $m$ (with $0 \leq m < n$):</p>
<script type="math/tex; mode=display">(m^e)^d \equiv m \pmod n</script>
<p>Here, the tuple $(n, e)$ is usually called the public key for encryption, and the tuple $(n, d)$ is usually called the private key for decryption. $m$ is the message, because you could always represent a message uniquely as an integer. If the message is too long and $m$ exceeds $n$, we dissect the message into chunks and encrypt each chunk separately.</p>
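<p>The principle can be checked exhaustively for a toy key set. The numbers below ($p = 61$, $q = 53$, so $n = 3233$, with $e = 17$ and $d = 2753$) are the classic textbook example; real keys use primes of hundreds of digits, and such small values must never be used in practice.</p>

```python
import math

n, e, d = 3233, 17, 2753  # n = 61 * 53; toy values for illustration only

# Verify (m^e)^d = m (mod n) for every message m relatively prime to n.
for m in range(1, n):
    if math.gcd(m, n) == 1:
        assert pow(pow(m, e, n), d, n) == m
print("RSA basic principle verified for the toy keys")
```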
<h4 id="rsa-key-generations">RSA Key Generations</h4>
<p>We now show how $e$, $d$, and $n$ are generated in the RSA algorithm to satisfy the RSA basic principle.</p>
<p><br /></p>
<p>In the RSA algorithm,</p>
<script type="math/tex; mode=display">n = p \times q</script>
<p>where $p$ and $q$ are some large distinct prime numbers.</p>
<p><br /></p>
<p>Because of Euler’s theorem in the prerequisites, if the message $m$ and $n$ are relatively prime, then</p>
<script type="math/tex; mode=display">m^{\varphi(n)} \equiv 1 \pmod n</script>
<p>where $\varphi(n)$ is Euler’s totient function.</p>
<p><br /></p>
<p>There is an extremely rare case, when $m$ and $n$ are not relatively prime (that is, when $m$ is a multiple of $p$ or $q$), in which the decryption of the encrypted message would not recover the original content of the message. If we were that lucky, we would also have cracked the RSA encryption system, since $\gcd(m, n)$ would reveal a factor of $n$.</p>
<p><br /></p>
<p>I am not sure whether people set rules to eliminate this extremely rare corner case. If we really wanted to do so, in each encryption we could encrypt the same message several times, say three times, using $E(m)$, $E(m+1)$, and $E(m+2)$. We also have the <a href="https://leimao.github.io/blog/Public-Key-Cryptosystem-and-Digital-Signature/">digital signature</a> for each of the messages. If the messages are from the authentic author, the message contents are unmodified, and we did not hit a multiple of $p$ or $q$ by chance, the three digital signatures should all pass verification. We would recover the three messages $m$, $m+1$, and $m+2$, and further reduce them to the exact same message $m$. However, if we somehow hit a multiple of $p$ or $q$ by chance, some of the digital signatures would fail verification, and we just have to extract the message information from the messages that passed. After all, the three messages contain the exact same information, and by a pigeonhole-style argument the three consecutive messages cannot all be multiples of $p$ or $q$, so at least one of them is relatively prime to $n$.</p>
<p><br /></p>
<p>Based on the property of Euler’s totient function in the prerequisites, computing Euler’s totient function for the product of two distinct prime numbers is actually very easy.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\varphi(n) &= \varphi(pq) \\
&= \varphi(p)\varphi(q) \\
&= (p-1)(q-1)
\end{aligned} %]]></script>
<p>Based on Lemma 1 in the prerequisites, for any integer $k$,</p>
<script type="math/tex; mode=display">m^{k\varphi(n)+1} \equiv m \pmod n</script>
<p>We immediately find that, based on the RSA basic principle, $ed = k\varphi(n)+1$. Although $n$ is public, factorizing $n$ into $p$ and $q$ is practically infeasible on a modern computer, so computing $\varphi(n)$, whether from its mathematical definition or from the equation $\varphi(n) = (p-1)(q-1)$, is almost impossible as well. Therefore, releasing $e$ as the public key does not easily lead to the disclosure of $d$.</p>
<p><br /></p>
<p>Then the question becomes how to choose appropriate integers $e$ and $d$. It seems that $e$ and $d$ could be any values as long as the equation $ed = k\varphi(n)+1$ is satisfied for some integer $k$.</p>
<p><br /></p>
<p>This is equivalent to saying that we need to satisfy</p>
<script type="math/tex; mode=display">ed \equiv 1 \pmod {\varphi(n)}</script>
<p>Based on the multiplicative inverse theorem in the prerequisites, as long as $\gcd(e,\varphi(n)) = 1$, there must be a unique $d \bmod \varphi(n)$ which satisfies the above congruence. Getting such an $e$ is not hard. Although $e$ does not have to be prime, for convenience we could simply select a prime number from a corpus of prime numbers and verify that $\gcd(e,\varphi(n)) = 1$, since verifying relative primeness is easy if one of the numbers is known to be prime. A typical choice of $e$ is 65537, which is a prime number.</p>
<p><br /></p>
<p>Once $e$ is determined, $d \bmod \varphi(n)$ could be determined using the <a href="https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm">Extended Euclidean algorithm</a>, which takes $O((\log\varphi(n))^2)$ time to run. Note that it is not necessary to make $d$ extremely large to make the private key $d$ less susceptible to cracking: any $d$s that have the same remainder $d \bmod \varphi(n)$ decrypt the encrypted message exactly the same way.</p>
<p><br /></p>
<p>With such $e$ and $d$, the RSA basic principle is satisfied.</p>
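<p>The key-generation steps above can be sketched end to end in Python. This is a toy illustration with tiny primes, and the function name is my own; since Python 3.8, <code>pow(e, -1, phi_n)</code> computes the modular inverse for us, playing the role of the extended Euclidean algorithm.</p>

```python
import math

def generate_keys(p, q, e=65537):
    # Toy RSA key generation; p and q must be distinct primes.
    n = p * q
    phi_n = (p - 1) * (q - 1)
    assert math.gcd(e, phi_n) == 1
    d = pow(e, -1, phi_n)  # modular inverse of e mod phi(n), Python >= 3.8
    return (n, e), (n, d)

public_key, private_key = generate_keys(61, 53, e=17)
n, e = public_key
_, d = private_key
print((n, e, d))            # (3233, 17, 2753)
m = 65                      # a message relatively prime to n
assert pow(pow(m, e, n), d, n) == m  # the RSA basic principle holds
```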
<h4 id="message-encryption-and-decryption">Message Encryption and Decryption</h4>
<p>With the orchestrated $e$ and $d$, and the RSA basic principle, it is not hard to find that the encryption function $E$ is</p>
<script type="math/tex; mode=display">c \equiv m^e \pmod n</script>
<p>where $c$ is the encrypted message. In practice,</p>
<script type="math/tex; mode=display">c = m^e \bmod n</script>
<p>The decryption function $D$ is</p>
<script type="math/tex; mode=display">c^d \equiv (m^e)^d \equiv m \pmod n</script>
<p>In the above congruences, the first congruence is due to the compatibility with exponentiation in the <a href="https://en.wikipedia.org/wiki/Modular_arithmetic#Properties">modular arithmetic properties</a>, and the second congruence is because of the RSA basic principle. Similarly, in practice,</p>
<script type="math/tex; mode=display">m = c^d \bmod n</script>
<p>Without any doubt, using such encryption and decryption, the first feature of the public-key cryptosystem, decrypting an encrypted message gives you the original message, is satisfied.</p>
<p><br /></p>
<p>If we swap the positions of $e$ and $d$ in the RSA basic principle, surprisingly (or not?), the congruences and equivalences still hold, meaning that the second feature of the public-key cryptosystem, encrypting a decrypted message gives you the original message, also holds.</p>
<p><br /></p>
<p>How about the third feature, that $E$ and $D$ are easy to compute? $E$ and $D$ both involve exponentiations, which would be extremely slow and memory-consuming using trivial algorithms if the exponents $e$ and $d$ are large (in fact, no memory could even hold the full intermediate numbers for moderately large exponents). However, given $e$, it is meaningless for $d$ to be “infinitely” large to satisfy the congruence $ed \equiv 1 \pmod {\varphi(n)}$; note that only $e$ and $d \bmod \varphi(n)$ are the actual keys. In addition, there are <a href="https://en.wikipedia.org/wiki/Modular_exponentiation">fast modular exponentiation algorithms</a> which take $O(\log e)$ or $O(\log d)$ (i.e., $O(\log (d \bmod \varphi(n)))$) time and are memory-efficient. We will not elaborate on them here, and take it for granted that the third feature is satisfied.</p>
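<p>The best-known fast modular exponentiation method is square-and-multiply, which keeps every intermediate value below $n^2$ and uses $O(\log e)$ multiplications. A minimal sketch, equivalent to Python’s built-in three-argument <code>pow</code>:</p>

```python
def mod_pow(base, exp, mod):
    # Square-and-multiply: O(log exp) multiplications, intermediates < mod^2.
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:  # multiply in the current square when the bit is set
            result = (result * base) % mod
        base = (base * base) % mod
        exp >>= 1
    return result

assert mod_pow(65, 17, 3233) == pow(65, 17, 3233)
assert mod_pow(2, 1000, 10**9 + 7) == pow(2, 1000, 10**9 + 7)
```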
<p><br /></p>
<p>The fourth feature of the public-key cryptosystems has also been satisfied. We have seen the related information in the earlier sections and will see more in the following sections.</p>
<h3 id="cracking-the-rsa-cryptosystem">Cracking the RSA Cryptosystem</h3>
<h4 id="modern-computer">Modern Computer</h4>
<p>Cracking the RSA encryption system by brute force is not practically feasible. If the private key $d$ is large, it would take an extremely large number of iterations to guess the correct private key $d$. Moreover, in each iteration, since you usually do not know the message content, there is hardly any way to verify whether the message decrypted with the guessed private key $d$ is the original message you have never seen.</p>
<p><br /></p>
<p>A better way is to factorize the public $n$. If somehow you knew the value of $\varphi(n)$, then with the public key $e$ you could derive the value of $d \bmod \varphi(n)$. Remember that what matters in the RSA algorithm is $d \bmod \varphi(n)$ instead of the actual value of $d$.</p>
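<p>To make the factorization attack concrete, here is a toy sketch (function name is my own): trial division factorizes a tiny modulus, after which the private exponent falls out immediately. Trial division is hopeless for real key sizes, which is exactly why such moduli are safe from it.</p>

```python
def crack_private_key(n, e):
    # Trial division: instant for a toy modulus, infeasible for real RSA keys.
    p = next(k for k in range(2, int(n ** 0.5) + 1) if n % k == 0)
    q = n // p
    phi_n = (p - 1) * (q - 1)
    return pow(e, -1, phi_n)  # d mod phi(n), Python >= 3.8

assert crack_private_key(3233, 17) == 2753  # recovers the toy private key
```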
<h4 id="quantum-computer">Quantum Computer</h4>
<p>The ordinary algorithms for integer factorization take sub-exponential time according to <a href="https://en.wikipedia.org/wiki/Integer_factorization">Wikipedia</a>. This is the fundamental reason why the RSA cryptosystem is so reliable.</p>
<p><br /></p>
<p>Quantum computers, however, are good at integer factorization. Using <a href="https://en.wikipedia.org/wiki/Shor%27s_algorithm">Shor’s algorithm</a>, a quantum computer can do integer factorization in polynomial time, which makes cracking the RSA cryptosystem possible.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://sites.math.washington.edu/~morrow/336_09/papers/Yevgeny.pdf">The RSA Algorithm</a></li>
<li><a href="https://www.youtube.com/watch?v=JR4_RBb8A9Q">What are Digital Signatures and How Do They Work?</a></li>
</ul>
<p><a href="https://leimao.github.io/article/RSA-Algorithm/">RSA Algorithm</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on November 10, 2019.</p><![CDATA[Public-Key Cryptosystems and Digital Signatures]]>https://leimao.github.io/blog/Public-Key-Cryptosystem-and-Digital-Signature2019-11-06 14:17:25 -0400T00:00:00-00:002019-11-06T00:00:00-06:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>Sometimes I get interesting questions from my friends about internet security, such as whether it is possible to leak the bank account password when logging in to the online bank via an untrusted network, whether it is possible that the message you received from someone via an untrusted network was modified maliciously, and whether it is possible that someone pretends to be you and sends messages in your name. My answer to those questions is that, as long as your computer and cell phone are uncontaminated, these are almost impossible, thanks to our modern public-key cryptosystems and digital signatures.</p>
<p><br /></p>
<p>There are a couple of introductions to public-key cryptosystems and digital signatures available online. However, I think most of them are incomplete or hard to understand. In this blog post, I am going to describe public-key cryptosystems and digital signatures using extremely simple math, so that there would be no ambiguity at all.</p>
<h3 id="usages">Usages</h3>
<p>Modern cryptosystems are public-key encryption systems in which everyone has a public key for encryption and a private key for decryption. The public key is visible to everyone, but the private key is only accessible to its owner. Public-key encryption systems could also generate digital signatures, which could be used to verify that the message you received is unmodified and was truly sent by the sender.</p>
<p><br /></p>
<p>RSA (Rivest–Shamir–Adleman) is a typical algorithm for the public-key cryptosystems used by modern computers to encrypt and decrypt messages. However, we are not going to introduce the RSA algorithm in this blog post. Instead, we will describe public-key cryptosystems and digital signatures at a high level.</p>
<h3 id="essential-features">Essential Features</h3>
<p>Each user has their own encryption and decryption functions, $E$ and $D$, using the public key and the private key respectively. We use $M$ to represent the message to be encrypted and sent. There are four features that are essential to a public-key cryptosystem.</p>
<ul>
<li>Decrypting an encrypted message gives you the original message (Of course!). Specifically,</li>
</ul>
<script type="math/tex; mode=display">D(E(M)) = M</script>
<ul>
<li>Encrypting a decrypted message gives you the original message (Hmm…). Specifically,</li>
</ul>
<script type="math/tex; mode=display">E(D(M)) = M</script>
<ul>
<li>
<p>$E$ and $D$ are easy to compute. This means the encryption and decryption process should be fast.</p>
</li>
<li>
<p>The publicity of $E$ does not compromise the secrecy of $D$. This means that even if you know how to encrypt a message, you could hardly find a way to decrypt an encrypted one.</p>
</li>
</ul>
<p>We would ignore how to satisfy the four features in this blog post.</p>
<h3 id="message-encryption-and-decryption">Message Encryption and Decryption</h3>
<p>Suppose we have two people, Alice and Bob. Both of them use the same public-key cryptosystem. This means Alice and Bob both have their private keys stored secretly and their public keys published to some authority. From the authority, we could look up the public keys of Alice and Bob unambiguously. We denote encrypting a message with Alice’s and Bob’s public keys by $E_A$ and $E_B$ respectively, and decrypting a message with Alice’s and Bob’s private keys by $D_A$ and $D_B$ respectively.</p>
<p><br /></p>
<p>One day, Alice wanted to send a private message $M$ to Bob. Alice looked up Bob’s public key and encrypted the message $M$ with it. The encrypted message for Bob is denoted as $C$.</p>
<script type="math/tex; mode=display">E_B(M) = C</script>
<p>Once Bob received the encrypted message $C$, he could decrypt $C$ using his private key.</p>
<script type="math/tex; mode=display">D_B(C) = D_B(E_B(M)) = M</script>
<p>Even if the network were compromised and someone intercepted $C$, it would still be almost impossible to decrypt $C$, because $D_B$ is unknown and, by the feature “the publicity of $E$ does not compromise the secrecy of $D$”, it could not be derived from $E_B$.</p>
<p><br /></p>
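<p>The round trip $D_B(E_B(M)) = M$ can be sketched with a toy RSA-style key pair, using the classic textbook numbers $p = 61$, $q = 53$, $e = 17$ (a sketch only; real deployments use huge keys and randomized padding, and the function names <code>E_B</code>/<code>D_B</code> simply mirror the notation above):</p>

```python
# Toy RSA-style sketch of Bob's key pair; illustration only.
p, q = 61, 53
n = p * q                    # 3233, part of Bob's public key
phi = (p - 1) * (q - 1)      # 3120
e = 17                       # Bob's public exponent
d = pow(e, -1, phi)          # Bob's private exponent (Python 3.8+)

def E_B(m):                  # encrypt with Bob's public key
    return pow(m, e, n)

def D_B(c):                  # decrypt with Bob's private key
    return pow(c, d, n)

M = 65                       # a message encoded as a number below n
C = E_B(M)
assert C != M                # the ciphertext does not reveal M directly
assert D_B(C) == M           # D_B(E_B(M)) = M
```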
<p>The message content is safe because of the public-key encryption system. However, it does not provide any assurance about the sender. For example, James could write a message $M^\prime$ which specifically says it is from Alice, encrypt it using $E_B$, and send it to Bob. If there is no author verification procedure and Bob is not careful enough, Bob might actually think the message was sent by Alice. In another scenario, James might have intercepted the encrypted message $C$ sent from Alice to Bob, blocked its transmission, replaced the original message with $M^\prime$ which specifically says it is from Alice, encrypted it using $E_B$, and sent it to Bob. Bob might also be convinced that $M^\prime$ was the original message Alice had sent. Specifically,</p>
<script type="math/tex; mode=display">E_B(M^\prime) = C^\prime \\
D_B(C^\prime) = D_B(E_B(M^\prime)) = M^\prime</script>
<p>Digital signatures, derived from the public-key cryptosystems, are designed to solve these authentication problems.</p>
<h3 id="digital-signatures">Digital Signatures</h3>
<p>In addition to the encrypted message $C$ that Alice sent to Bob, Alice would also have to send her digital signature $S$ to Bob. Namely,</p>
<script type="math/tex; mode=display">E_B(D_A(M)) = S</script>
<p>Alice could find $E_B$ using Bob’s public key and $D_A$ using her private key.</p>
<p><br /></p>
<p>Once Bob received both $S$ and $C$, he could decrypt them using his private key and Alice’s public key. Specifically,</p>
<script type="math/tex; mode=display">D_B(C) = D_B(E_B(M)) = M \\
E_A(D_B(S)) = E_A(D_B(E_B(D_A(M)))) = E_A((D_A(M))) = M</script>
<p>We find that the two messages decrypted from $S$ and $C$ are exactly the same. This is expected if the message was sent by Alice and its content was not modified. Let’s see what happens if someone pretends to be Alice, or if the content of the message has been modified.</p>
<p><br /></p>
<p>James, again, wanted to send Bob a message $M^\prime$ which specifically says it is from Alice, or had intercepted an encrypted message $C$ from Alice, blocked it, and created a message $M^\prime$ which specifically says it is from Alice. To make the message readable by Bob, James encrypted $M^\prime$ using $E_B$ and sent the encrypted message $C^\prime$ to Bob.</p>
<script type="math/tex; mode=display">E_B(M^\prime) = C^\prime</script>
<p>Because Bob does not accept any message without a signature, James had to make up a signature. However, because James knew nothing about Alice’s private key, he used a decryption function $D_J$ which is different from Alice’s secret $D_A$. The signature James generated would be</p>
<script type="math/tex; mode=display">E_B(D_J(M^\prime)) = S^\prime</script>
<p>Once Bob received both $S^\prime$ and $C^\prime$, he could decrypt them using his private key and Alice’s public key as usual. Specifically,</p>
<script type="math/tex; mode=display">D_B(C^\prime) = D_B(E_B(M^\prime)) = M^\prime \\
E_A(D_B(S^\prime)) = E_A(D_B(E_B(D_J(M^\prime)))) = E_A((D_J(M^\prime))) = M^{\prime\prime}</script>
<p>In this case, Bob would see that the two decrypted messages are not the same. Bob would then realize that something unusual has happened and that he should not trust anything about the message.</p>
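<p>This mismatch can be sketched with the same toy textbook numbers. For simplicity, the sketch drops Bob’s $E_B$/$D_B$ wrapping and, purely for illustration, puts Alice’s and James’s key pairs on one shared modulus (real users have distinct moduli, and sharing one would itself be insecure):</p>

```python
# Toy sketch: a valid signature verifies with Alice's public key,
# while James's forgery does not. Textbook numbers; illustration only.
n, phi = 61 * 53, 60 * 52          # shared toy modulus 3233, phi(n) = 3120
e_A, e_J = 17, 7                   # Alice's and James's public exponents
d_A = pow(e_A, -1, phi)            # Alice's private exponent (Python 3.8+)
d_J = pow(e_J, -1, phi)            # James's private exponent

M = 65                             # the message, as a number below n

S = pow(M, d_A, n)                 # Alice signs: D_A(M)
assert pow(S, e_A, n) == M         # Bob verifies: E_A(D_A(M)) = M

S_forged = pow(M, d_J, n)          # James forges with his own key: D_J(M')
assert pow(S_forged, e_A, n) != M  # E_A(D_J(M')) != M', so Bob rejects it
```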
<h3 id="hacking-the-public-key-cryptosystems">Hacking the Public-Key Cryptosystems</h3>
<p>As long as the four features of public-key cryptosystems hold, cracking them is almost impossible. If some day the almighty quantum computer becomes available, the feature “the publicity of $E$ does not compromise the secrecy of $D$” would be broken, and therefore the modern public-key cryptosystems would no longer be reliable. I may talk about this topic in the future.</p>
<p><br /></p>
<p>There is another way to send fake messages to Bob in the name of Alice, without using a quantum computer. If James could somehow crack Alice’s account name and password on the web application and replace Alice’s public encryption function $E_A$ with $E_J$, then when Bob tried to retrieve Alice’s encryption function, he would get $E_J$ instead of $E_A$. The decryption of the signature $S^\prime$ would then become $M^\prime$ instead of $M^{\prime\prime}$. Concretely,</p>
<script type="math/tex; mode=display">E_J(D_B(S^\prime)) = E_J(D_B(E_B(D_J(M^\prime)))) = E_J((D_J(M^\prime))) = M^{\prime}</script>
<p>In this case, the two decrypted messages match. Bob would be convinced that the message is from Alice and has not been modified.</p>
<p><br /></p>
<p>It should be noted that James replaced Alice’s public encryption function $E_A$ with $E_J$, instead of replacing his private decryption function $D_J$ with $D_A$. In principle, $D_A$ would only be kept on Alice’s local computer and nowhere else. Even if James had Alice’s account name and password on the web application, he would not get a copy of $D_A$ unless he specifically hacked Alice’s physical computer.</p>
<h3 id="final-remarks">Final Remarks</h3>
<p>This reminds us that keeping our passwords safe is the most important thing.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://sites.math.washington.edu/~morrow/336_09/papers/Yevgeny.pdf">The RSA Algorithm</a></li>
<li><a href="https://www.youtube.com/watch?v=JR4_RBb8A9Q">What are Digital Signatures and How Do They Work?</a></li>
</ul>
<p><a href="https://leimao.github.io/blog/Public-Key-Cryptosystem-and-Digital-Signature/">Public-Key Cryptosystems and Digital Signatures</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on November 06, 2019.</p><![CDATA[Number of Alignments in Connectionist Temporal Classification (CTC)]]>https://leimao.github.io/blog/CTC-Alignment-Combinations2019-11-02 14:17:25 -0400T00:00:00-00:002019-11-02T00:00:00-05:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>In some sequence modeling problems, the labels are shorter than the outputs. For example, in speech recognition, given an audio clip, the model predicts a sequence of tokens, but the transcript is often shorter than the sequence of predicted tokens. These problems differ from classic sequence classification problems, where each predicted token has an unambiguous label, and could be modeled as Connectionist Temporal Classification (CTC) problems. The CTC loss could be employed for training the model accordingly.</p>
<p><br /></p>
<p>The article <a href="https://distill.pub/2017/ctc/">“Sequence Modeling With CTC”</a> on Distill describes CTC very well. I doubt I could write a better article on CTC, so I will not try. However, there is an interesting mathematical statement in the article that is not well justified. So in this blog post, I will go over the CTC process at a very high level, and then discuss that interesting mathematical statement from “Sequence Modeling With CTC”. To understand more details on CTC, readers should refer to “Sequence Modeling With CTC” on Distill.</p>
<h3 id="overview-of-connectionist-temporal-classification">Overview of Connectionist Temporal Classification</h3>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-11-02-CTC-Alignment-Combinations/ctc_loss.png" style="width: 90%; height: 90%" />
<figcaption>Connectionist Temporal Classification Overview from "Sequence Modeling With CTC"</figcaption>
</figure>
</div>
<p>In CTC classification, we have a special blank token $\epsilon$ in the vocabulary. Given an input sequence, the model predicts a probability vector for each token position in the output sequence. Each possible predicted token sequence has a probability computed as the product of the probabilities of its tokens. Each possible predicted token sequence could be further reduced by first merging identical consecutive tokens, followed by the removal of $\epsilon$.</p>
<p><br /></p>
<p>For example, a predicted token sequence [h, e, l, l, $\epsilon$, l, l, o, o] would be reduced to sequence [h, e, l, l, o], a predicted token sequence [h, e, l, l, l, l, $\epsilon$, o, o] would be reduced to sequence [h, e, l, o], a predicted token sequence [h, e, l, $\epsilon$, l, l, l, o, o] would also be reduced to sequence [h, e, l, l, o].</p>
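<p>The reduction rule behind these examples can be sketched in a few lines (the function name <code>reduce_ctc</code> is mine for illustration, not from any CTC library):</p>

```python
from itertools import groupby

def reduce_ctc(seq, blank='ε'):
    # Merge identical consecutive tokens, then drop the blank token.
    return [t for t, _ in groupby(seq) if t != blank]

assert reduce_ctc(['h','e','l','l','ε','l','l','o','o']) == ['h','e','l','l','o']
assert reduce_ctc(['h','e','l','l','l','l','ε','o','o']) == ['h','e','l','o']
assert reduce_ctc(['h','e','l','ε','l','l','l','o','o']) == ['h','e','l','l','o']
```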
<p><br /></p>
<p>As you have probably noticed, the predicted token sequence corresponding to a certain reduced sequence is not unique. Instead of a one-to-one mapping, it is a many-to-one mapping.</p>
<p><br /></p>
<p>Given the label sequence for the input sequence, we need to find all the predicted token sequences that could be reduced to the label sequence, compute their probabilities, and sum them up as the marginalized probability of the label sequence. During training, we would like to maximize the marginalized probability of the label sequence.</p>
<p><br /></p>
<p>The number of predicted token sequences that could be reduced to the label sequence could be huge, so computing the marginalized probability naively could be intractable. Fortunately, with the help of dynamic programming, the computation cost can be asymptotically reduced.</p>
<h3 id="number-of-alignments-in-connectionist-temporal-classification">Number of Alignments in Connectionist Temporal Classification</h3>
<p>The number of predicted token sequences that could be reduced to the label sequence could be huge, so computing the marginalized probability naively could be intractable.</p>
<p><br /></p>
<p>In “Sequence Modeling With CTC”, there is an interesting mathematical statement (with some modification from me): for a $Y$ of length $U$ without any repeated characters and an $X$ of length $T$, assuming $T \geq U$, the number of CTC alignments (the number of different $X$s) is ${{T + U}\choose{T - U}}$. Here, $Y$ is the label sequence, and $X$ is any predicted token sequence.</p>
<p><br /></p>
<p>Probably because I have not read as many papers on CTC as the author of “Sequence Modeling With CTC” has, I was not able to find a proof of this statement in the literature. Because it is an interesting statement, I would like to prove it myself.</p>
<p><br /></p>
<p>To be honest, if I did not know the number of CTC alignments is ${{T + U}\choose{T - U}}$, I might have difficulty deriving it. However, proving that the number of CTC alignments is ${{T + U}\choose{T - U}}$ is somewhat easier, though still tricky.</p>
<p><br /></p>
<p>Given $X$ of length $T$ and $Y$ of length $U$, we prepare an empty sequence $Z$ of length $T + U$. Because $T \geq U$ and thus $T + U \geq 2U$, we could always select $2U$ positions from the sequence $Z$, and the number of combinations is ${{T + U}\choose{2U}} = {{T + U}\choose{T - U}}$.</p>
<p><br /></p>
<p>The $2U$ positions are for all the tokens from $Y$, plus the last token in $X$ corresponding to each token in $Y$. We call the sequence of last tokens in $X$ corresponding to the tokens in $Y$ the sequence $Y^\prime$. Starting with a token from $Y$, we alternately place the tokens from $Y$ and $Y^\prime$ into the $2U$ selected positions. This might sound a little vague, so let’s see an example.</p>
<p><br /></p>
<p>We have $Y = \{a,b\}$ of length $U=2$. We also have $X$ of length $T=7$. Therefore, $Z$ has length $U+T=9$. An example of the placement of the $2U$ tokens in $Z$ would be:</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-11-02-CTC-Alignment-Combinations/ctc_alignment_generation.png" style="width: 75%; height: 75%" />
<figcaption>Connectionist Temporal Classification Alignment Generation</figcaption>
</figure>
</div>
<p>We then fill in the blanks. The rule is that, in $Z$, the blanks that lie after a token from $Y$ and before its corresponding token from $Y^\prime$ are filled with that same token. The rest of the blanks are simply filled with $\epsilon$. This is a one-to-one mapping.</p>
<p><br /></p>
<p>The next step is simply to remove the tokens from $Y$ in $Z$. This is also a one-to-one mapping.</p>
<p><br /></p>
<p>By this means, each combination of $2U$ positions out of the $T + U$ positions corresponds one-to-one to a valid CTC alignment. Because we have ${{T + U}\choose{2U}} = {{T + U}\choose{T - U}}$ combinations in total, we have ${{T + U}\choose{2U}} = {{T + U}\choose{T - U}}$ CTC alignments.</p>
<p><br /></p>
<p>This concludes the proof.</p>
<p><br /></p>
<p>We could see that the number of CTC alignments could be very large, even for small $U$ and $T$. Therefore, we need to use dynamic programming to compute the probability of the label asymptotically faster.</p>
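<p>For small $U$ and $T$, the counting statement can be checked by brute force, enumerating every sequence over the vocabulary and counting those that reduce to $Y$ (the function names below are mine for illustration):</p>

```python
from itertools import groupby, product
from math import comb  # Python 3.8+

def collapse(x, blank='ε'):
    # Merge identical consecutive tokens, then drop the blank.
    return tuple(t for t, _ in groupby(x) if t != blank)

def count_alignments(y, t, blank='ε'):
    # Count the length-t sequences over the vocabulary that reduce to y.
    vocab = set(y) | {blank}
    return sum(collapse(x, blank) == tuple(y)
               for x in product(vocab, repeat=t))

y = ('a', 'b')                 # U = 2, no repeated characters
u = len(y)
for t in range(u, 7):          # T >= U
    # Brute-force count matches C(T + U, T - U)
    assert count_alignments(y, t) == comb(t + u, t - u)
```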
<h3 id="references">References</h3>
<ul>
<li><a href="https://distill.pub/2017/ctc/">Sequence Modeling With CTC</a></li>
</ul>
<p><a href="https://leimao.github.io/blog/CTC-Alignment-Combinations/">Number of Alignments in Connectionist Temporal Classification (CTC)</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on November 02, 2019.</p><![CDATA[Fixing a Bad PC Power Button Problem]]>https://leimao.github.io/blog/Fix-PC-Power-Button-PCB-Issue2019-10-27 14:17:25 -0400T00:00:00-00:002019-10-27T00:00:00-05:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>I assembled a PC for gaming and deep learning earlier this year, and so far I would say it has been working well in general. However, there is a problem that has been bothering me. Sometimes, after my PC has been off for several hours, I cannot boot it by pressing the power button on the PC case. The lights for the CMOS button and the power button on the motherboard at the back of the PC case are on, and I can press the power button on the motherboard to boot the PC. After the PC started successfully, if I turned it off and immediately pressed the power button on the PC case again, the PC would boot normally. It did not happen all the time, but I would say it happened frequently, which is annoying.</p>
<p><br /></p>
<p>In this blog post, I am going to describe the possible causes of this phenomenon, how I experimented to find out where the problem most likely was, and how I fixed it.</p>
<h3 id="possible-causes">Possible Causes</h3>
<p>Most people suggested that it might be due to a mechanical problem with the power button. This was not convincing to me, because a mechanical problem should persist whenever I turn off the PC. In that scenario, pressing the power button on the PC case should never work, even immediately after I turned off the PC. Some other people suggested that it might be a bad wire, or even a motherboard problem (which would be unfortunate).</p>
<p><br /></p>
<p>One of my friends who studied electrical engineering, however, considered it a capacitor problem on the power button PCB in the PC case. He told me that, after the PC was started using the power button on the motherboard, the “bad” capacitors on the power button PCB likely got recharged somehow. That was why I could boot using the power button on the PC case immediately after turning off the PC. After several hours, the charge stored in the capacitors somehow drained, so the power button on the PC case would no longer work. I found this suggestion at least reasonable, because it explained the inconsistency in the phenomenon.</p>
<h3 id="experiments">Experiments</h3>
<p>I did a small experiment to loosely confirm the capacitor hypothesis, although I never formally studied electrical engineering. One morning, when I found I could not boot the PC using the power button on the PC case, I started my experiment.</p>
<p><br /></p>
<p>I disconnected the power switch cable, which connects to the power button on the PC case, from the motherboard. Then, instead of booting the computer using the power button on the motherboard, I booted the PC with a screwdriver by shorting the +/- pins that the power switch cable had connected to. The PC started normally, which is stronger evidence of an intact motherboard than booting with the power button on the motherboard.</p>
<p><br /></p>
<p>Once the PC started normally, I turned it off and connected the power switch cable back to the motherboard. Then I tried to boot the PC by pressing the power button on the PC case. No response at all. Because during the last boot the power button PCB had no contact with the motherboard, I think this loosely confirmed the capacitor hypothesis.</p>
<h3 id="fixes">Fixes</h3>
<p>Fixing the problem was trivial. I requested a replacement power button PCB from the PC case vendor, replaced the old PCB, and the computer seems to work fine now.</p>
<h3 id="acknowledgement">Acknowledgement</h3>
<p>I would like to thank my electrical engineering friend. Because I did not study electrical engineering, I never knew that even a power button could be so sophisticated. I thought it was just a simple wire or something, which is why I could not understand how such a problem could arise until he proposed the capacitor hypothesis to me.</p>
<p><br /></p>
<p>I would also like to thank the warm-hearted people from <a href="https://www.techpowerup.com/forums/threads/pressing-power-button-sometimes-does-not-boot-pc.260495/#post-4140028">TechPowerUp</a> who provided a lot of good suggestions.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://leimao.github.io/blog/PC-Build-Gaming-Deep-Learning/">PC Configurations</a></li>
<li><a href="https://www.gamersnexus.net/guides/2011-jumping-a-motherboard-without-power-switch-button">Use Screwdriver to Start PC</a></li>
</ul>
<p><a href="https://leimao.github.io/blog/Fix-PC-Power-Button-PCB-Issue/">Fixing a Bad PC Power Button Problem</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on October 27, 2019.</p><![CDATA[Python String Format]]>https://leimao.github.io/blog/Python-String-Format2019-10-26 14:17:25 -0400T00:00:00-00:002019-10-26T00:00:00-05:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>Python string formatting is widely used to control variables in a string and format the string the way the user prefers. However, in practice, the printed strings often still do not look beautiful, for reasons such as bad text alignment and insufficient spacing.</p>
<p><br /></p>
<p>In this blog post, I am going to describe the general rules of Python string formatting, and how to use it to print beautiful strings to the console for machine learning and data science projects.</p>
<h3 id="basic-python-string-format-syntax">Basic Python String Format Syntax</h3>
<h4 id="syntax">Syntax</h4>
<p>Although the Python string format syntax could be more complicated, I think the following syntax might be sufficient for most projects involving scientific computing.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{id : padding_char alignment sign width comma num_decimals data_type}
</code></pre></div></div>
<h4 id="instruction">Instruction</h4>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;border-color:#ccc;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#ccc;color:#333;background-color:#fff;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#ccc;color:#333;background-color:#f0f0f0;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-nrix{text-align:center;vertical-align:middle}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-nrix">Token</th>
<th class="tg-nrix">Optional</th>
<th class="tg-baqh">Explanation</th>
</tr>
<tr>
<td class="tg-nrix">id</td>
<td class="tg-nrix">Yes</td>
<td class="tg-0lax">The id of the string format placeholder. </td>
</tr>
<tr>
<td class="tg-nrix">padding_char</td>
<td class="tg-nrix">Yes</td>
<td class="tg-0lax">The character used for filling the padding spaces at the start and the end of the string. <br />If no character is given, empty space will be used.</td>
</tr>
<tr>
<td class="tg-nrix">alignment</td>
<td class="tg-nrix">Yes</td>
<td class="tg-0lax">`^` is align center; `<` is align left; `>` is align right.</td>
</tr>
<tr>
<td class="tg-baqh">sign</td>
<td class="tg-baqh">Yes</td>
<td class="tg-0lax">If `+` is used, + or - would be used for positive and negative values, respectively.</td>
</tr>
<tr>
<td class="tg-baqh">width</td>
<td class="tg-baqh">Yes</td>
<td class="tg-0lax">The width of the whole string. If the width is larger than the length of the string to be print, <br />`padding_char` will be used.</td>
</tr>
<tr>
<td class="tg-baqh">comma</td>
<td class="tg-baqh">Yes</td>
<td class="tg-0lax">If `,` is used, large numbers will have commas as separator.</td>
</tr>
<tr>
<td class="tg-baqh">num_decimals</td>
<td class="tg-baqh">Yes</td>
<td class="tg-0lax">The number of decimals for floating numbers. Has to be of format `.n` where n is an integer.</td>
</tr>
<tr>
<td class="tg-baqh">data_type</td>
<td class="tg-baqh">Yes</td>
<td class="tg-0lax">`s` is string, `f` is floating number, `d` is integer number.</td>
</tr>
</table>
<h4 id="example">Example</h4>
<p>If we run the following code in Python,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">example_line</span> <span class="o">=</span> <span class="s">"|{pi:@^+25,.8f}|"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">pi</span><span class="o">=</span><span class="mf">314159.26</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">example_line</span><span class="p">)</span>
</code></pre></div></div>
<p>The message printed to the console would be</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|@@@@+314,159.26000000@@@@|
</code></pre></div></div>
<h3 id="python-string-format-for-machine-learning-and-data-science">Python String Format for Machine Learning and Data Science</h3>
<p>We would use the following Python generator to generate fake machine learning training statistics for illustration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Generate fake training statistics
</span><span class="k">def</span> <span class="nf">gen_func</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
<span class="n">loss_max</span> <span class="o">=</span> <span class="mf">10000.0</span>
<span class="n">accuracy_max</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
<span class="c1"># epoch, training loss, training accuracy
</span> <span class="k">yield</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">/</span><span class="n">n</span><span class="p">)</span><span class="o">*</span><span class="n">loss_max</span><span class="p">,</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">/</span><span class="n">n</span><span class="o">*</span><span class="n">accuracy_max</span>
</code></pre></div></div>
<p>The following Python code could be used to print the aligned training statistics to console automatically, as long as the variable <code class="language-plaintext highlighter-rouge">header_items</code>, and <code class="language-plaintext highlighter-rouge">width</code> were given.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_op</span> <span class="o">=</span> <span class="n">gen_func</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">header_items</span> <span class="o">=</span> <span class="p">[</span><span class="s">"Epoch"</span><span class="p">,</span> <span class="s">"Loss"</span><span class="p">,</span> <span class="s">"Accuracy"</span><span class="p">]</span>
<span class="n">width</span> <span class="o">=</span> <span class="mi">60</span>
<span class="n">dash</span> <span class="o">=</span> <span class="s">"-"</span> <span class="o">*</span> <span class="n">width</span>
<span class="n">column_width</span> <span class="o">=</span> <span class="n">width</span> <span class="o">//</span> <span class="nb">len</span><span class="p">(</span><span class="n">header_items</span><span class="p">)</span>
<span class="n">column_width_items</span> <span class="o">=</span> <span class="p">[</span><span class="n">column_width</span><span class="p">]</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">header_items</span><span class="p">)</span>
<span class="n">header_format_content</span> <span class="o">=</span> <span class="p">[</span><span class="bp">None</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">header_items</span><span class="p">)</span> <span class="o">+</span> <span class="nb">len</span><span class="p">(</span><span class="n">column_width_items</span><span class="p">))</span>
<span class="n">header_format_content</span><span class="p">[::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">header_items</span>
<span class="n">header_format_content</span><span class="p">[</span><span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">column_width_items</span>
<span class="c1"># Expand list using asterisk
# We could have {} inside {}
</span><span class="n">header</span> <span class="o">=</span> <span class="s">"{:^{}s}{:^{}s}{:^{}s}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="o">*</span><span class="n">header_format_content</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">dash</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">header</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">dash</span><span class="p">)</span>
<span class="k">for</span> <span class="p">(</span><span class="n">epoch</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="n">accuracy</span><span class="p">)</span> <span class="ow">in</span> <span class="n">train_op</span><span class="p">:</span>
<span class="n">line</span> <span class="o">=</span> <span class="s">"{:^{}d}{:^{}.4f}{:^{}.2</span><span class="si">%</span><span class="s">}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">epoch</span><span class="p">,</span> <span class="n">column_width</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="n">column_width</span><span class="p">,</span> <span class="n">accuracy</span><span class="p">,</span> <span class="n">column_width</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">dash</span><span class="p">)</span>
</code></pre></div></div>
<p>The aligned training statistics printed out would be</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>------------------------------------------------------------
       Epoch                Loss              Accuracy
------------------------------------------------------------
         0               9000.0000             10.00%
         1               8000.0000             20.00%
         2               7000.0000             30.00%
         3               6000.0000             40.00%
         4               5000.0000             50.00%
         5               4000.0000             60.00%
         6               3000.0000             70.00%
         7               2000.0000             80.00%
         8               1000.0000             90.00%
         9                 0.0000             100.00%
------------------------------------------------------------
</code></pre></div></div>
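<p>The trick above generalizes: a nested <code class="language-plaintext highlighter-rouge">{}</code> inside a format spec is filled from the argument list, so the column width can be chosen at runtime. A minimal self-contained sketch (the header names and width below are made up for illustration):</p>

```python
# Interleave the header names with the column width, then expand
# the list with an asterisk so each nested {} receives the width.
headers = ["Epoch", "Loss", "Accuracy"]
column_width = 20  # assumed width; any runtime value works

header_format_content = [None] * (2 * len(headers))
header_format_content[0::2] = headers
header_format_content[1::2] = [column_width] * len(headers)

# "{:^{}s}" centers each header string in a field of the given width.
header = "{:^{}s}{:^{}s}{:^{}s}".format(*header_format_content)
print(header)
```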
<h3 id="reference">Reference</h3>
<ul>
<li><a href="https://mkaz.blog/code/python-string-format-cookbook/">Python String Format Cookbook</a></li>
</ul>
<p><a href="https://leimao.github.io/blog/Python-String-Format/">Python String Format</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on October 26, 2019.</p><![CDATA[Reasons Not to Study Life Science or Anything Related]]>https://leimao.github.io/blog/Do-Not-Study-Life-Science2019-10-19 14:17:25 -0400T00:00:00-00:002019-10-19T00:00:00-05:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>Human beings are becoming more ambitious, and perhaps more presumptuous, nowadays, as we want to understand everything about life and cure all diseases. This has significantly motivated the industry and education of life science and its related disciplines. Lots of money goes there, lots of advertisements and progress reports show up in the media, and lots of young promising students choose to study life science in college and graduate school.</p>
<p><br /></p>
<p>While life science still belongs to science in general, unfortunately, the majority of the people in this field have turned it into a cult. Outsiders who do not know much about life science think highly of it, but many insiders are constantly enslaved and suffering.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-10-19-Do-Not-Study-Life-Science/brooks_was_here.jpeg" style="width: 100%; height: 100%" />
<figcaption>"Brooks Was Here" - The Shawshank Redemption</figcaption>
</figure>
</div>
<p>In this blog post, I am going to describe what this field really is, and how it can ruin one’s dignity and career. I wish the people who are studying life science or related disciplines would quit, and that the people who are going to study life science or related disciplines would think twice and not waste their talents there.</p>
<h3 id="knowledge-you-would-learn-for-life-science">Knowledge You Would Learn For Life Science</h3>
<p>Most of the disciplines in science and engineering share a lot in common. Every new finding is based on mathematical axioms, laws of physics, rigorous mathematical derivations, or verification in both experiments and practice. In principle, every student or researcher in science and engineering should have very solid mathematical skills, because mathematics is the basic tool with which you do science and engineering. In college and graduate school, students from different science or engineering departments take the same fundamental courses, which heavily emphasize mathematics, or specialized courses that often share a lot of content and basic ideas in common. This is how science and engineering work and how people should study them.</p>
<p><br /></p>
<p>However, life science, although it has “science” in its name, is totally different from other science and engineering disciplines. In college, at least in some colleges, students majoring in life science still have to attend courses such as calculus, linear algebra, and probability and statistics. This is usually because the college requires every student studying science and engineering to attend those courses, which is a well-intentioned policy. Those courses, however, are hardly ever used in the specialized life science studies. Because you would hardly use math in your study and research, you forget it. To my knowledge, almost all of my former colleagues who studied life science have totally forgotten how to do mathematics and how to read mathematical symbols and expressions, even though they once studied them. Many professors in life science who claim they discovered something via mathematical derivation, or who would like to formulate some equations in class but cannot justify them with mathematical derivations, are often bluffing and know almost nothing about mathematics and physics. Life science is closer to chemistry than to any other discipline in science and engineering, so in principle, students and researchers studying life science should have very good knowledge of chemistry. However, based on my teaching and research experience, this is often not the case.</p>
<p><br /></p>
<p>So it sounds like you don’t need to know anything and there are no prerequisites for doing life science studies. This is true to some extent. Otherwise you would not see so many middle school and high school students spending their summers doing life science research in labs. The one thing I think is useful in life science, or that you can learn from life science, is the design of experiments. This is probably the only thing in life science that shares something in common with other disciplines. What makes you stand out in life science is not how well you do in course work, but how well you know how to use different kinds of experiment instruments, and your experience with different kinds of life science experiments. This knowledge and these skills are highly domain specific, and they do not transfer to other disciplines.</p>
<p><br /></p>
<p>Wait, how about data analysis? Can we learn data analysis from life science? In life science, experiments can be categorized by the amount of data they generate. Experiments generating small amounts of data are usually too simple to analyze, and you would learn nothing: compute the mean and standard deviation of the samples, and test whether there is any statistically significant difference between the control group and the experiment group. Because the students, and often even the professors, do not know much about statistics, they frequently make mistakes in choosing the right statistical methods for analysis, resulting in erroneous conclusions. This is called “you don’t know what you were doing”. Experiments generating large amounts of data, such as genome sequencing experiments, are usually handled and processed by professional software. Essentially you get results magically from a black-box software package without knowing what the underlying analytical algorithms are. This is called “you don’t know what it was doing”.</p>
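<p>To make the small-data case concrete, here is a toy sketch of the kind of two-group comparison described above, with made-up measurements. Even here there is a choice to get wrong: Welch’s t statistic below does not assume equal group variances, while the pooled Student version does.</p>

```python
import statistics

# Hypothetical measurements from a control and a treated group.
control = [5.1, 4.9, 5.0, 5.2, 4.8]
treated = [5.9, 6.1, 6.0, 5.8, 6.2]

mean_c, mean_t = statistics.mean(control), statistics.mean(treated)
var_c, var_t = statistics.variance(control), statistics.variance(treated)

# Welch's t statistic: difference of means scaled by the standard
# error, without assuming the two groups have equal variances.
t = (mean_t - mean_c) / (var_c / len(control) + var_t / len(treated)) ** 0.5
print(round(t, 4))
```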
<h3 id="research-you-would-do-for-life-science">Research You Would Do For Life Science</h3>
<p>Although lots of advanced experiment instruments have been invented to help life science researchers automate their workflows, most life science researchers still spend more than 90% of their time doing labor-intensive work. Powerful researchers who manage a lot of resources, including funding and experiment instruments, do not have to do experiments in person. They have sufficient time to read the literature, think through potentially promising proposals, and design experiments, and they have the funding to hire someone to do the experiments for them. These powerful researchers are usually professors in universities or research institutions, and the people hired to do the experiments are usually graduate students or postdoctoral fellows. Unfortunately, it takes many years for a junior life science student or researcher to finally become a powerful life science researcher, and the competition is extremely fierce.</p>
<p><br /></p>
<p>Doing experiments is extremely tedious and usually trivial. Once you become familiar with some experiments, you do them routinely and hardly learn anything new. Doing life science experiments requires extremely high concentration. If you make an error during an experiment, say preparing a bottle of solution with the wrong concentration of some component, it can hardly be traced back, and your experiment results will be wrong and irreproducible. Some experiments cannot be fully automated by advanced instruments, and they require good hands to do fine operations. If you don’t have good hands, your experiment results will usually be inconsistent and untrustworthy. Experiments also usually take an extremely long time to conduct. Unlike computer programs, they usually cannot be “saved” halfway. Many experiment materials and samples are fragile, sensitive to the environment, such as temperature and light, and have their own “life cycles”. This means that fresh experiment materials and samples should be used as soon as possible to ensure their quality; the same material or sample is likely different from what it was two hours ago. If you realize that anything went wrong in an experiment, you usually have to start from scratch, whereas with a computer program you can always restart from somewhere in the middle as long as you saved your progress. This further means that your life will be managed by experiments and you will no longer have control over it. For example, once you start a scheduled 24-hour experiment, you have to follow the experiment plan exactly. If you scheduled to do something for the experiment, even at 2:00 AM with a storm outside, you would still need to show up in the lab and start on time. Otherwise, the whole experiment might be ruined, and you would need to restart it on another day.</p>
<p><br /></p>
<p>You can see that there are many variables we can hardly control in experiments, and these poorly controlled variables contribute to the noise in our experiment results. In fact, even with these variables appropriately controlled, most of the experiments are intrinsically chaotic, and the data will always be noisy. Someone may argue that datasets from other disciplines might be noisy as well, and that extracting valuable information from noisy data is the goal of science and engineering. This is true without any doubt. But there are two factors that help us extract valuable information from noisy data: the signal-to-noise ratio and the number of samples. Due to the cost of life science experiments, the number of experiment replicates is usually not large. With a poor signal-to-noise ratio, extracting valuable information from limited samples is not possible. In addition, the experiment results might be determined by some hidden variables in the environment that you are not aware of. Just like computer programs, if the software or hardware environment does not match what the program requires, the program might generate incorrect results, or it might not run at all. Many life science experiments can only be reproduced in the original researcher’s lab, and not in other labs. While the experiment results might still be valid in the original lab, they might just be an artifact, and cannot be generalized to practice or to nature.</p>
<h3 id="the-correct-way-of-doing-science-and-engineering">The Correct Way of Doing Science and Engineering</h3>
<p>In my opinion, the correct way to do science is one of the following:</p>
<ul>
<li>Deriving theoretical proofs for hypothesis, and doing experiment to verify.</li>
<li>Doing thought experiments, such as Einstein’s chasing a beam of light.</li>
</ul>
<p>The correct way to do engineering is:</p>
<ul>
<li>Proposing a model, finding parameters for the model using the data from experiments, and improving the model using new data.</li>
</ul>
<p>Because life science currently does not have solid mathematical and physical foundations, the two correct ways of doing science can hardly be applied to it. The way to do life science research is actually always the way of doing engineering: “proposing a model, finding parameters for the model using the data from experiments, and improving the model using new data”. However, the underlying assumption of this way of doing engineering is that “the model is always wrong”. This means that no matter how much data and evidence you collect, the model is likely to be wrong and will fail in some unprecedented cases. It also means that life science studies only find models but never learn what nature is. In this sense, the life science that people are currently doing is actually engineering instead of science.</p>
<h3 id="the-darkness-of-life-science">The Darkness of Life Science</h3>
<p>Having said so much, we have learned that doing life science is hard, given that it is chaotic and has no perfect way to be studied. Regardless of whether life science researchers know much about general mathematics, physics, and chemistry, in principle, we should still admire them, because they are working so hard to find the truth in the face of all kinds of difficulties. However, the times are different from several decades ago, when people’s motivation for understanding life was pure. The majority of the people in the field of life science have now corrupted this discipline.</p>
<h4 id="publication-driven-life-science-career">Publication Driven Life Science Career</h4>
<p>A career in life science is driven by publications. It does not matter how much you know about specific domains in life science or how proficient you are at certain kinds of experiments; publications are always the key to getting you a job. However, getting a publication in life science is relatively hard compared to other disciplines, such as computer science. This is because, even if the idea is simple, doing the experiments takes a lot of time and effort, and getting a consistent, self-contained “story” for publication takes even more. Being smart, even as smart as Albert Einstein, would not help your life science career, since being smart can hardly improve the chance of proposing the “correct” model among the enormous number of model candidates. Because life science students and researchers usually do not know much about general science, such as mathematics, physics, and chemistry, accumulating publications becomes their proof and identity card as a “researcher”.</p>
<h4 id="resource-driven-publications">Resource Driven Publications</h4>
<p>Because all life science experiments come at a cost, and usually an expensive one, people who have more resources get more publications. Large labs get more publications, and more publications get those labs more funding. Small labs get fewer publications, and fewer publications hardly get those labs more funding. It becomes a cycle. Because the principal investigator (PI) of a lab, especially a large lab, needs to apply for funding to feed the whole lab, some PIs have no time to instruct students and junior researchers. But whenever there is a publication from the lab, regardless of how much scientific contribution the PI made, the PI’s name will always show up on it, probably because the PI fed you and it is a convention to get the PI’s permission before publishing a paper.</p>
<h4 id="competitions-for-publications-and-resources">Competitions for Publications and Resources</h4>
<p>Competition for publications and resources is a disaster in life science.</p>
<p><br /></p>
<p>Because the number of life science research topics is somewhat limited compared to other disciplines, many research groups all over the world study exactly the same topic. If they publish before you do, and it happens that their conclusion is almost the same as yours, your many years of study will have been in vain. If this happens in other disciplines, usually you can still get your work published in a decent journal or conference. But in life science, this is never the case.</p>
<p><br /></p>
<p>Resource competition is also everywhere, because the total budget from the NIH or the research institution is limited: some applicants get funded, and some do not. While this is true for other disciplines as well, in life science there is also a significant amount of competition for shared resources, such as million-dollar lab instruments. Even within the same lab, there is competition for research topics and experiment resources between students and researchers.</p>
<p><br /></p>
<p>Because of all this, and because there are certainly some selfish people around, people start to dislike, or maybe hate, each other.</p>
<h4 id="lab-hierarchy-and-slavery">Lab Hierarchy and Slavery</h4>
<p>Students need publications for graduation and jobs, junior researchers need publications for promotion, and senior researchers need publications to be elected to the National Academy of Sciences and to get more research resources. This usually creates a hierarchy in the lab.</p>
<p><br /></p>
<p>For students and researchers in other disciplines, if you are good enough and can learn everything on your own, you can be almost entirely independent. However, people in the field of life science can hardly be independent. You can usually tell this from the number of authors on a publication. People studying life science usually only have the skill set to do certain kinds of experiments. They also manage small portions of the lab resources, including instruments and experiment materials, which they do not want to share unless there is a guarantee of authorship. Because doing experiments is expensive and relies on instruments, self-directed independent study does not work in life science, and experiment novices have to learn from other colleagues. This kind of relationship can be unhealthy, because it is often the case that the “master” does not want you to learn all the stuff he or she knows, so you are likely to become affiliated to your “master”. All these weird things create a hierarchy in the lab, and lower-level people have to obey higher-level people. Because of such unequal relationships, people become enslaved. They become cheap labor and have to work enormously long hours in the lab. Students with critical thinking can hardly survive there. No matter how wrong you think your mentor’s ideas are, you have to obey their instructions, or they will force you to obey by any means. In other disciplines, the relationships in the lab are relatively equal, because people are more independent, and you can use mathematics, the universal tool for correctness, to prove that things are correct or wrong. Being asked to work more than 60 hours per week is also not unusual. Even if it later turns out that the mentor’s idea was totally wrong, it is your precious time that got wasted, not theirs. When doing experiments, you also have to be careful not to break anything valuable. If that happens, I am sure your life in the lab will not be easy.</p>
<h4 id="too-many-illusions-and-lies">Too Many Illusions and Lies</h4>
<p>To attract more cheap labor, especially young inexperienced college students, and to keep that cheap labor motivated, the PI of a lab often promises that the students’ credit will be reflected in the authorship of publications. However, such promises are cheap, and in many scenarios they do not come true even if you have spent a good amount of time on the work. The order of authors listed on a publication also matters. I have seen many cases where people’s credit was not reflected in the authorship, or where their rank in the author list was low even though they had contributed a lot.</p>
<p><br /></p>
<p>Because it is usually hard to find cheap and docile labor for a lab, once a PI gets it, they will not let it go. The PI has many ways to keep people in the lab. They can write negative recommendation letters for graduating students, give grades of “B” or “C” without justification, lie to graduating students that the project will soon end and that publications bearing the students’ names will come out, and so on; you name it.</p>
<h4 id="research-ethics">Research Ethics</h4>
<p>Many experiment results in publications can hardly be reproduced, not to mention the validity of the conclusions. This is mainly due to two reasons. One reason is that the experiment is intrinsically chaotic, or the results were controlled by some hidden variables that people were not aware of. The other reason is that people fabricate data or intentionally process data without scientific justification. On one hand, because data are missing or the original data do not support the proposed conclusion, people fabricate data for the publication. These data, of course, cannot be reproduced by any means. On the other hand, the data might be real, but people violate scientific common sense when processing them. For example, people run many experiments but select only the results that match the proposed conclusion. It is possible that the selected data do not reflect the true distribution, and therefore the experiment results cannot be reproduced.</p>
<p><br /></p>
<p>Many research discoveries and findings are over-advertised and exaggerated. A lot of findings, even if they are real, are completely useless. But in order to get further funding, they have to be described as having potential in some aspect, such as curing diseases. Investors, even those with a life science education background, are often misled by the advertisement of life science discoveries, and waste a lot of time and money on something useless.</p>
<p><br /></p>
<p>Because there is a lot of competition in life science research, people hide key technical details in their publications so that others can hardly reproduce the results. Sending emails asking for technical details and getting no response is also common. As I said, the competition also exists inside the lab: stealing colleagues’ research ideas, experimental accomplishments, research credit, and experiment materials and samples is nothing new.</p>
<p><br /></p>
<p>Of course, some biological experiments are simply unethical. But in order to get attention, reputation, or profit from society, some people do not care about ethics at all. For example, there have been labs that tried to clone human beings, or that edited the genomes of human offspring. Do remember: in science fiction, if there are villains, they are always somehow related to life science research.</p>
<h4 id="not-very-glorious">Not Very Glorious</h4>
<p>Nowadays, the prosperity of life science research is mainly due to advancements in physics, mechanical engineering, and computer science. Life science researchers did not invent those tools; they simply use them without understanding much of how they work. To some extent, I think this is not the correct way to do science. But we should not blame life science researchers for this: because of the restrictions and limitations I described above, they are not able to study those advancements in much detail.</p>
<p><br /></p>
<p>Another weird feeling from studying life science is that you feel useless outside a lab, essentially because all the skills and knowledge you learned are highly domain specific and do not work outside the lab. The lab becomes a prison, and you are the prisoner. Once you have been in jail too long, you don’t know how to live outside of it, just like Brooks Hatlen in the film “The Shawshank Redemption”.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-10-19-Do-Not-Study-Life-Science/brooks.jpg" style="width: 100%; height: 100%" />
<figcaption>Brooks Hatlen - The Shawshank Redemption</figcaption>
</figure>
</div>
<h4 id="hard-to-change-career">Hard to Change Career</h4>
<p>As a devoted life scientist, such as what I used to be, you devote almost all of your time to research, whether passively or actively. Therefore, you have almost no time to study anything else. If your career in life science does not work out, your life is ruined: because the knowledge you have is highly domain specific and does not generalize to other disciplines, it is almost impossible to find a job in another field. However, for students and researchers in other disciplines, because their math skills are usually very good, and because all the disciplines in science and technology except life science share knowledge in common, changing careers is usually not hard.</p>
<h4 id="interdisciplinary-studies">Interdisciplinary Studies?</h4>
<p>How about life science related interdisciplinary studies, such as biophysics, biostatistics, bioinformatics, etc.? I would still suggest staying away from them, because the knowledge you learn and use in these interdisciplinary studies is orchestrated to work for life science only. Biophysics, biostatistics, and bioinformatics are derivatives of physics, statistics, and computer science, but you would not actually learn the real physics, statistics, and computer science. They are highly domain specific, and do not generalize to other fields. There is also a lack of innovation there: all the advancements in biophysics, biostatistics, and bioinformatics are essentially the effort of physics, statistics, and computer science.</p>
<h3 id="suggestions">Suggestions</h3>
<p>My suggestion to the public is not to risk your career by studying life science or anything related too early, because it is one of the most complicated subjects to study in the world and the whole field has been corrupted. If you have already obtained a PhD in mathematics, physics, or computer science, you may give life science research a shot. If you don’t like it, you can still go back to what you were doing.</p>
<p><br /></p>
<p>I also have suggestions for governments and education institutions. Please remove life science, and any interdisciplinary majors related to life science, from college. The purpose of college education is to lay good foundations in science and engineering for students. Highly domain-specific education wastes students’ precious time and misleads them when they are about to make important career decisions.</p>
<p><a href="https://leimao.github.io/blog/Do-Not-Study-Life-Science/">Reasons Not to Study Life Science or Anything Related</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on October 19, 2019.</p><![CDATA[Setting Locale in Docker]]>https://leimao.github.io/blog/Docker-Locale2019-10-02 14:17:25 -0400T00:00:00-00:002019-10-02T00:00:00-05:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>Although Python 3 has officially started to use UTF-8 encoding for text files, I still sometimes get errors regarding ASCII/UTF-8 in Docker containers. Surprisingly, there is no such issue in the native system. It turns out to be a system locale problem. In the native system, the locale is usually properly set from the GUI during installation. In a Docker container, the system locale is usually not set, and therefore UTF-8 text cannot be properly read and displayed in the terminal.</p>
<p><br /></p>
<p>In this blog post, I will talk about how to set the locale properly in a Docker container so that there will be no UTF-8 problems at all.</p>
<h3 id="check-system-locale">Check System Locale</h3>
<p>We could check the system locale using the <code class="language-plaintext highlighter-rouge">locale</code> command.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>locale
<span class="nv">LANG</span><span class="o">=</span>en_US.UTF-8
<span class="nv">LANGUAGE</span><span class="o">=</span>en_US.UTF-8
<span class="nv">LC_CTYPE</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_NUMERIC</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_TIME</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_COLLATE</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_MONETARY</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_MESSAGES</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_PAPER</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_NAME</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_ADDRESS</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_TELEPHONE</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_MEASUREMENT</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_IDENTIFICATION</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nv">LC_ALL</span><span class="o">=</span>en_US.UTF-8
</code></pre></div></div>
<p>If the <code class="language-plaintext highlighter-rouge">LANG</code>, <code class="language-plaintext highlighter-rouge">LANGUAGE</code>, and <code class="language-plaintext highlighter-rouge">LC_MESSAGES</code> are not set with <code class="language-plaintext highlighter-rouge">UTF-8</code> locales, you are likely to have UTF-8 read and display issues when running computer programs.</p>
<p><br /></p>
<p>From Python, we can also check the encoding of the system locale using the following command.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python <span class="nt">-c</span> <span class="s2">"import sys; print(sys.stdout.encoding)"</span>
UTF-8
</code></pre></div></div>
<p>If the output is not <code class="language-plaintext highlighter-rouge">UTF-8</code>, you are likely to have UTF-8 read and display issues when running computer programs.</p>
<p><br /></p>
<p>It should be noted that the following command, although somewhat similar to the one we used above, does not reflect the system locale.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-c</span> <span class="s2">"import sys; print(sys.getdefaultencoding())"</span>
</code></pre></div></div>
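<p>The difference between the two can be seen side by side in a short sketch. The exact values printed depend on your system’s locale, but <code class="language-plaintext highlighter-rouge">sys.stdout.encoding</code> follows the environment, while <code class="language-plaintext highlighter-rouge">sys.getdefaultencoding()</code> is the interpreter’s internal default for <code class="language-plaintext highlighter-rouge">str</code>/<code class="language-plaintext highlighter-rouge">bytes</code> conversions and is always <code class="language-plaintext highlighter-rouge">utf-8</code> on Python 3:</p>

```python
import locale
import sys

# Locale-dependent: what the standard output stream encodes to.
print(sys.stdout.encoding)

# The encoding implied by the current locale settings.
print(locale.getpreferredencoding())

# Interpreter-internal default; always "utf-8" on Python 3,
# regardless of LANG/LC_ALL, so it cannot diagnose locale issues.
print(sys.getdefaultencoding())
```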
<h3 id="set-locale-properly-for-docker-container">Set Locale Properly for Docker Container</h3>
<p>It is actually simple to set the locale for a Docker container. When building the Docker image, just add either of the following snippets to the Dockerfile and you are all set.</p>
<h4 id="method-1">Method 1</h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RUN apt-get update
RUN apt-get <span class="nb">install</span> <span class="nt">-y</span> locales
RUN <span class="nb">sed</span> <span class="nt">-i</span> <span class="nt">-e</span> <span class="s1">'s/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/'</span> /etc/locale.gen <span class="o">&&</span> <span class="se">\</span>
locale-gen
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
</code></pre></div></div>
<h4 id="method-2">Method 2</h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RUN apt-get update
RUN apt-get <span class="nb">install</span> <span class="nt">-y</span> locales locales-all
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
</code></pre></div></div>
<h3 id="references">References</h3>
<ul>
<li><a href="https://stackoverflow.com/questions/28405902/how-to-set-the-locale-inside-a-debian-ubuntu-docker-container">Setting Locale for Ubuntu/Docker</a></li>
</ul>
<p><a href="https://leimao.github.io/blog/Docker-Locale/">Setting Locale in Docker</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on October 02, 2019.</p><![CDATA[Git Branch Upstream]]>https://leimao.github.io/blog/Git-Brach-Upstream2019-10-01 14:17:25 -0400T00:00:00-00:002019-10-01T00:00:00-05:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>Sometimes people are confused about the difference between <code class="language-plaintext highlighter-rouge">git push</code> and <code class="language-plaintext highlighter-rouge">git push -u</code>. In this blog post, we will dig into this difference and try to understand the mechanism behind it.</p>
<h3 id="examples">Examples</h3>
<h4 id="example-base-repo">Example Base Repo</h4>
<p>We assume we are in a repo which already has a master branch both locally and remotely.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git branch <span class="nt">-a</span>
<span class="k">*</span> master
remotes/origin/master
</code></pre></div></div>
<h4 id="git-push-experiments-for-branches">Git Push Experiments for Branches</h4>
<p>If we are going to create a new branch called <code class="language-plaintext highlighter-rouge">temp_1</code> and push it to the remote branch <code class="language-plaintext highlighter-rouge">temp_1</code> in the remote repo, running the following commands containing <code class="language-plaintext highlighter-rouge">git push</code> would fail.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git branch temp_1
<span class="nv">$ </span>git checkout temp_1
<span class="nv">$ </span><span class="nb">touch </span>temp_1.txt
<span class="nv">$ </span>git add <span class="nb">.</span>
<span class="nv">$ </span>git commit <span class="nt">-m</span> <span class="s2">"temp_1 commit"</span>
<span class="nv">$ </span>git push
fatal: The current branch temp_1 has no upstream branch.
To push the current branch and <span class="nb">set </span>the remote as upstream, use
git push <span class="nt">--set-upstream</span> origin temp_1
</code></pre></div></div>
<p>Because the current branch has no upstream branch set, Git does not know where the commits should go in the remote repo and complains to the user. We can validate this by running the following command.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git branch <span class="nt">-a</span>
master
<span class="k">*</span> temp_1
remotes/origin/master
</code></pre></div></div>
<p>An upstream has to be a remote branch.</p>
<p><br /></p>
<p>However, running the following commands containing <code class="language-plaintext highlighter-rouge">git push</code> would actually work.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git branch temp_1
<span class="nv">$ </span>git checkout temp_1
<span class="nv">$ </span><span class="nb">touch </span>temp_1.txt
<span class="nv">$ </span>git add <span class="nb">.</span>
<span class="nv">$ </span>git commit <span class="nt">-m</span> <span class="s2">"temp_1 commit"</span>
<span class="nv">$ </span>git push origin temp_1
</code></pre></div></div>
<p>Even though the current branch does not have an upstream branch, we explicitly tell Git which remote branch to push to. If that branch does not exist, it will be created for us in the remote repo.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git branch <span class="nt">-a</span>
master
<span class="k">*</span> temp_1
remotes/origin/master
remotes/origin/temp_1
</code></pre></div></div>
<p>The drawback of doing this is that the next time you try to push some changes to the remote branch, you still need to type the branch name after <code class="language-plaintext highlighter-rouge">git push</code>, which might be tedious.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git checkout temp_1
$ git push
fatal: The current branch temp_1 has no upstream branch.
To push the current branch and set the remote as upstream, use
git push --set-upstream origin temp_1
</code></pre></div></div>
<p>If we are going to create a new branch called <code class="language-plaintext highlighter-rouge">temp_2</code> based on existing branch <code class="language-plaintext highlighter-rouge">temp_1</code> and push it to the remote branch <code class="language-plaintext highlighter-rouge">temp_1</code> in the remote repo, running the following commands containing <code class="language-plaintext highlighter-rouge">git push</code> would update the remote branch <code class="language-plaintext highlighter-rouge">temp_1</code> in the remote repo.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git branch temp_2
<span class="nv">$ </span>git checkout temp_2
<span class="nv">$ </span><span class="nb">touch </span>temp_2.txt
<span class="nv">$ </span>git add <span class="nb">.</span>
<span class="nv">$ </span>git commit <span class="nt">-m</span> <span class="s2">"temp_2 commit"</span>
<span class="nv">$ </span>git push origin temp_1
Everything up-to-date
</code></pre></div></div>
<p>This is because nothing has been updated on the local <code class="language-plaintext highlighter-rouge">temp_1</code> branch, so the remote <code class="language-plaintext highlighter-rouge">temp_1</code> branch is up-to-date.</p>
<p><br /></p>
<p>If we are going to create a new branch called <code class="language-plaintext highlighter-rouge">temp_2</code> and push it to the remote branch <code class="language-plaintext highlighter-rouge">temp_2</code> in the remote repo, running the following commands containing <code class="language-plaintext highlighter-rouge">git push -u</code> would work.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git branch temp_2
<span class="nv">$ </span>git checkout temp_2
<span class="nv">$ </span><span class="nb">touch </span>temp_2.txt
<span class="nv">$ </span>git add <span class="nb">.</span>
<span class="nv">$ </span>git commit <span class="nt">-m</span> <span class="s2">"temp_2 commit"</span>
<span class="nv">$ </span>git push origin <span class="nt">-u</span> temp_2
Branch <span class="s1">'temp_2'</span> <span class="nb">set </span>up to track remote branch <span class="s1">'temp_2'</span> from <span class="s1">'origin'</span><span class="nb">.</span>
</code></pre></div></div>
<p>We can check the branches created so far.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git branch <span class="nt">-a</span>
master
temp_1
<span class="k">*</span> temp_2
remotes/origin/master
remotes/origin/temp_1
remotes/origin/temp_2
</code></pre></div></div>
<p>Next time you try to push some changes to the remote branch, you can simply run <code class="language-plaintext highlighter-rouge">git push</code> without a branch name, because the branch already knows what its upstream is.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git checkout temp_2
$ git push
Everything up-to-date
</code></pre></div></div>
<p>It should be noted that the following three ways of pushing are equivalent:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git push origin <span class="nt">-u</span> temp_2
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git push origin <span class="nt">--set-upstream</span> temp_2
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git push origin temp_2
git branch <span class="nt">--set-upstream-to</span><span class="o">=</span>origin/temp_2 temp_2
</code></pre></div></div>
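<p>To see what <code class="language-plaintext highlighter-rouge">-u</code> actually records, we can inspect Git's per-branch configuration. The following self-contained sketch (my illustration, using a throwaway bare repository in a temporary directory as the "remote"; all paths and names are made up) shows the two config keys that together form the upstream.</p>

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
# Create a bare repository to act as the "remote", then clone it.
git init -q --bare origin.git
git clone -q origin.git work 2>/dev/null
cd work
git config user.email "you@example.com"
git config user.name "You"
git commit -q --allow-empty -m "init"
git checkout -q -b temp_2
git push -q -u origin temp_2
# -u recorded the upstream in two per-branch config keys:
git config branch.temp_2.remote   # prints: origin
git config branch.temp_2.merge    # prints: refs/heads/temp_2
```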
<p>If we are going to create a new branch called <code class="language-plaintext highlighter-rouge">temp_3</code> based on the local branch <code class="language-plaintext highlighter-rouge">temp_2</code>, make some changes, and push them to the remote branch <code class="language-plaintext highlighter-rouge">temp_2</code> in the remote repo, we need to set the upstream of the local branch <code class="language-plaintext highlighter-rouge">temp_3</code> to the remote <code class="language-plaintext highlighter-rouge">temp_2</code>. However, even then, Git does not allow a bare <code class="language-plaintext highlighter-rouge">git push</code> for branches whose local and remote names do not match, unless the remote branch is specified explicitly.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git branch temp_3
<span class="nv">$ </span>git checkout temp_3
<span class="nv">$ </span><span class="nb">touch </span>temp_3.txt
<span class="nv">$ </span>git add <span class="nb">.</span>
<span class="nv">$ </span>git commit <span class="nt">-m</span> <span class="s2">"temp_3 commit"</span>
<span class="nv">$ </span>git branch <span class="nt">--set-upstream-to</span><span class="o">=</span>origin/temp_2 temp_3
<span class="nv">$ </span>git push
fatal: The upstream branch of your current branch does not match
the name of your current branch. To push to the upstream branch
on the remote, use
git push origin HEAD:temp_2
To push to the branch of the same name on the remote, use
git push origin temp_3
To choose either option permanently, see push.default <span class="k">in</span> <span class="s1">'git help config'</span><span class="nb">.</span>
</code></pre></div></div>
<p>We do <code class="language-plaintext highlighter-rouge">git push origin HEAD:temp_2</code> instead, and it would work.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git branch temp_3
<span class="nv">$ </span>git checkout temp_3
<span class="nv">$ </span><span class="nb">touch </span>temp_3.txt
<span class="nv">$ </span>git add <span class="nb">.</span>
<span class="nv">$ </span>git commit <span class="nt">-m</span> <span class="s2">"temp_3 commit"</span>
<span class="nv">$ </span>git branch <span class="nt">--set-upstream-to</span><span class="o">=</span>origin/temp_2 temp_3
<span class="nv">$ </span>git push origin HEAD:temp_2
</code></pre></div></div>
<p>It should be noted that the local branch <code class="language-plaintext highlighter-rouge">temp_2</code> has not been modified, but the remote branch <code class="language-plaintext highlighter-rouge">temp_2</code> has been modified. This means that the local branch <code class="language-plaintext highlighter-rouge">temp_2</code> has fallen behind the remote branch <code class="language-plaintext highlighter-rouge">temp_2</code>.</p>
<p><br /></p>
<p>We can check the branches created so far.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git branch <span class="nt">-a</span>
master
temp_1
temp_2
<span class="k">*</span> temp_3
remotes/origin/master
remotes/origin/temp_1
remotes/origin/temp_2
</code></pre></div></div>
<p>Branch <code class="language-plaintext highlighter-rouge">temp_3</code> only exists locally but not on the remote server.</p>
<p><br /></p>
<p>Such operations are rare, though. Usually people create branches both locally and remotely and always sync the local branch with its remote counterpart. If we want to apply changes from one branch to another, we do <code class="language-plaintext highlighter-rouge">git merge</code>.</p>
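<p>As a minimal sketch of that last point (in a throwaway repository, with branch and file names made up for illustration), merging <code class="language-plaintext highlighter-rouge">temp_3</code> back into the branch it was created from looks like this.</p>

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email "you@example.com"
git config user.name "You"
git commit -q --allow-empty -m "init"
git checkout -q -b temp_3          # branch off and make a change
echo "hello" > temp_3.txt
git add .
git commit -q -m "temp_3 commit"
git checkout -q -                  # go back to the original branch
git merge -q temp_3                # apply temp_3's changes here
ls                                 # temp_3.txt now exists on this branch too
```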
<h3 id="final-remarks">Final Remarks</h3>
<p>Although we use Git every day, it is sometimes hard to understand what is going on behind the Git commands.</p>
<p><a href="https://leimao.github.io/blog/Git-Brach-Upstream/">Git Branch Upstream</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on October 01, 2019.</p><![CDATA[Tmux Tutorial]]>https://leimao.github.io/blog/Tmux-Tutorial2019-09-22 14:17:25 -0400T00:00:00-00:002019-09-22T00:00:00-05:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p><a href="https://github.com/tmux/tmux">Tmux</a> is a very powerful terminal multiplexer that is extremely useful, especially when you are working on a remote server via SSH.</p>
<p><br /></p>
<p>If we want to do multiple tasks simultaneously on the remote server, we usually have two ways to do it. We could SSH into the remote server and run everything in the background with an ‘&’ at the end of each terminal command. This is problematic if you want to monitor the progress of each task. We could also open multiple windows, SSH into the remote server from each window, and run one task per window. This is good for monitoring all the tasks, but the shortcoming is that you would have to type your SSH login information for each of the windows you opened. Sometimes it is also hard to tell which window is doing which task if too many windows are open.</p>
<p><br /></p>
<p>Tmux allows the user to create multiple sessions, and each session could have multiple terminals. The user is able to control multiple tasks in multiple windows via Tmux, with no need for multiple SSH logins. However, Tmux is not very friendly to beginners because you have to memorize a series of commands required for controlling it. Although Tmux is much more useful than a terminal emulator such as <a href="https://leimao.github.io/blog/Gnome-Terminator/">Gnome Terminator</a>, many users would just like to use Tmux as a multi-window terminal emulator. However, Tmux does not remember user settings such as pane layouts, so every time you reboot or restart the Tmux server, all of the user settings are gone.</p>
<p><br /></p>
<p>In this short tutorial, I am going through some of the basic concepts and commands for Tmux, and how to use a Tmux plugin called <a href="https://github.com/tmux-plugins/tmux-resurrect">Tmux Resurrect</a> to restore the Tmux environment after a reboot or a Tmux server restart.</p>
<h3 id="tmux-usages">Tmux Usages</h3>
<h4 id="installation">Installation</h4>
<p>We install Tmux via <code class="language-plaintext highlighter-rouge">apt</code>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>apt update
<span class="nv">$ </span><span class="nb">sudo </span>apt <span class="nb">install </span>tmux
</code></pre></div></div>
<h4 id="concepts">Concepts</h4>
<p>Tmux has sessions, windows, and panes. The hierarchy is that Tmux could have multiple sessions, a session could have multiple windows, and a window could have multiple panes. On the server, users could follow certain conventions or rules to manage Tmux. For example, we could create a session for a specific project. In the project session, we could create multiple windows, and each window would be used for one specific task of the project. In a window, in order to finish the task more efficiently, we could create multiple panes for purposes such as process monitoring and file management.</p>
<h4 id="dual-interface">Dual Interface</h4>
<p>Similar to Docker, Tmux has two layers of interface: the local terminal outside Tmux and the terminal inside Tmux. We could manage Tmux from both layers. While typing bash commands is equivalent in both interfaces, to manage Tmux-related things from inside Tmux, we need to use hotkeys so that Tmux knows when we are addressing it rather than the shell. All the hotkeys are prefixed by <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code>.</p>
<h4 id="tmux-console">Tmux Console</h4>
<p>In the Tmux terminal, we could call out the Tmux console with <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">:</code> and run all the Tmux commands available in the local terminal, without the <code class="language-plaintext highlighter-rouge">tmux</code> prefix. For example, suppose there is a Tmux command for the local terminal like this.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tmux xxxxx
</code></pre></div></div>
<p>In the Tmux console in the Tmux terminal, we could do the equivalent thing by running the following command.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:xxxxx
</code></pre></div></div>
<p>Note that <code class="language-plaintext highlighter-rouge">:</code> is the prompt of the Tmux console, which we do not type. <code class="language-plaintext highlighter-rouge">:</code> could be thought of as the <code class="language-plaintext highlighter-rouge">$ tmux </code> prefix in the local terminal.</p>
<h4 id="create-sessions">Create Sessions</h4>
<p>In the local terminal, we create Tmux sessions by simply running one of the following three equivalent commands.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tmux
<span class="nv">$ </span>tmux new
<span class="nv">$ </span>tmux new-session
</code></pre></div></div>
<p>This will create a new session on the Tmux server. If no Tmux session is running yet, this will create the first one. If there are already Tmux sessions running, this will create an additional one.</p>
<p><br /></p>
<p>In the Tmux terminal, to create Tmux sessions, we need to first call out the Tmux console by hitting <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">:</code>. Just like in Vim, we could then type commands in the Tmux console at the bottom of the Tmux session. We type the following command to create a Tmux session.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:new
</code></pre></div></div>
<p>Tmux requires at least one session to run. If the last session is closed, the Tmux server will automatically shut down.</p>
<p><br /></p>
<p>In the following sections, because the commands in the Tmux console are replicas of the commands in the local terminal, we are not going to elaborate on them.</p>
<h4 id="detach-sessions">Detach Sessions</h4>
<p>To return to the local terminal from a Tmux session, we usually detach by hitting <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">d</code>. Everything will still be running in the background.</p>
<p><br /></p>
<p>In some scenarios, we could return to the local terminal by running the following command in the Tmux terminal.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">exit</span>
</code></pre></div></div>
<p>However, bear in mind that with this method the current session will exit and all the information in the current session will be lost.</p>
<h4 id="create-sessions-with-names">Create Sessions With Names</h4>
<p>By default, Tmux uses natural integers as session names. This is sometimes inconvenient for project management. We could create sessions with names using the following command in the local terminal.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tmux new <span class="nt">-s</span> <span class="o">[</span>session-name]
</code></pre></div></div>
<h4 id="view-sessions">View Sessions</h4>
<p>To view Tmux sessions from the local terminal, run one of the following commands.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tmux <span class="nb">ls</span>
<span class="nv">$ </span>tmux list-sessions
</code></pre></div></div>
<p>We would see the Tmux session information like this.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tmux <span class="nb">ls
</span>deeplabv3: 1 windows <span class="o">(</span>created Sun Sep 22 12:41:33 2019<span class="o">)</span> <span class="o">[</span>80x23]
resnet50: 1 windows <span class="o">(</span>created Sun Sep 22 12:38:25 2019<span class="o">)</span> <span class="o">[</span>80x23]
</code></pre></div></div>
<p>In Tmux terminal, we check Tmux sessions by hitting <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">s</code>. The following information will show up.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(0) + deeplabv3: 1 windows
(1) + resnet50: 1 windows (attached)
┌ resnet50 (sort: index) ──────────────────────────────────────────────────────┐
│ leimao@leimao-evolvx:~$ │
│ │
│ │
│ │
│ │
│ 0:bash │
│ │
│ │
│ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
[resnet50]0:[tmux]* "leimao-evolvx" 12:48 22-Sep-19
</code></pre></div></div>
<p>Hit <code class="language-plaintext highlighter-rouge">Esc</code> or <code class="language-plaintext highlighter-rouge">q</code> to exit the information.</p>
<h4 id="rename-sessions">Rename Sessions</h4>
<p>To rename sessions, from the local terminal, we run the following command.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tmux rename-session <span class="o">[</span><span class="nt">-t</span> session-name] <span class="o">[</span>new-session-name]
</code></pre></div></div>
<p>If <code class="language-plaintext highlighter-rouge">[-t session-name]</code> is not provided, the last session used will be renamed.</p>
<p><br /></p>
<p>Alternatively, we may also hit <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">$</code> to rename the current session in the Tmux terminal.</p>
<h4 id="kill-sessions">Kill Sessions</h4>
<p>To kill all sessions, from the local terminal, we run the following command.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tmux kill-server
</code></pre></div></div>
<p>To kill specific sessions, from the local terminal, we run the following command.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tmux kill-session <span class="nt">-t</span> <span class="o">[</span>session-name]
</code></pre></div></div>
<h4 id="attach-sessions">Attach Sessions</h4>
<p>To attach to specific sessions, from the local terminal, we run the following command.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tmux attach <span class="nt">-t</span> <span class="o">[</span>session-name]
</code></pre></div></div>
<h4 id="createclose-windows">Create/Close Windows</h4>
<p>In Tmux session, we could have multiple windows. To create a window, in the Tmux terminal, we hit <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">c</code>. To kill the current window, in the Tmux terminal, we hit <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">&</code> (<code class="language-plaintext highlighter-rouge">&</code> is <code class="language-plaintext highlighter-rouge">Shift</code> + <code class="language-plaintext highlighter-rouge">7</code>).</p>
<p><br /></p>
<p>The windows in the sessions could have names. We rename the current window by hitting <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">,</code>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>0<span class="o">)</span> + deeplabv3: 2 windows <span class="o">(</span>attached<span class="o">)</span>
<span class="o">(</span>1<span class="o">)</span> + resnet50: 1 windows
┌ deeplabv3 <span class="o">(</span><span class="nb">sort</span>: index<span class="o">)</span> ─────────────────────────────────────────────────────┐
│ o-evolvx:~<span class="nv">$ </span> │o-evolvx:~<span class="nv">$ </span> │
│ │ │
│ │ │
│ │ │
│ │ │
│ 0:htop-monitor │ 1:main │
│ │ │
│ │ │
│ │ │
│ │ │
└──────────────────────────────────────────────────────────────────────────────┘
<span class="o">[</span>deeplabv30:htop-monitor- 1:main<span class="k">*</span> <span class="s2">"leimao-evolvx"</span> 14:35 22-Sep-19
</code></pre></div></div>
<p>The window name could be identified in the session information.</p>
<h4 id="select-windows">Select Windows</h4>
<p>Each window in the session, regardless of whether it has a name or not (its default name is actually always <code class="language-plaintext highlighter-rouge">bash</code>), has a window id that is a natural integer: <code class="language-plaintext highlighter-rouge">0</code>, <code class="language-plaintext highlighter-rouge">1</code>, etc. We select a specific window by hitting <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + the window id.</p>
<p><br /></p>
<p>Sometimes it is also convenient to use <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">n</code> to move to the next window, or <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">p</code> to move to the previous window.</p>
<h4 id="createclose-panes">Create/Close Panes</h4>
<p>Each window in the session could have multiple panes, just like Gnome Terminator. To split the pane vertically, we hit <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">%</code>. To split the pane horizontally, we hit <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">"</code>. To close the current pane, we hit <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">x</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>leimao@leimao-evolvx:~$ │leimao@leimao-evolvx:~$
│
│
│
│
│
│
│
│
│
│
────────────────────────────────────────┼───────────────────────────────────────
leimao@leimao-evolvx:~$ │leimao@leimao-evolvx:~$
│
│
│
│
│
│
│
│
│
│
[deeplabv30:htop-monitor- 1:main* "leimao-evolvx" 14:56 22-Sep-19
</code></pre></div></div>
<p>To toggle between panes in the window, we simply hit <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">↑</code>/<code class="language-plaintext highlighter-rouge">↓</code>/<code class="language-plaintext highlighter-rouge">←</code>/<code class="language-plaintext highlighter-rouge">→</code>.</p>
<h3 id="tmux-resurrect-usages">Tmux Resurrect Usages</h3>
<h4 id="installation-1">Installation</h4>
<p>To install Tmux Resurrect, it is recommended to install <a href="https://github.com/tmux-plugins/tpm">Tmux Plugin Manager</a> first. Please check the GitHub repo for installation instructions.</p>
<p><br /></p>
<p>Then we add the Tmux Resurrect plugin by adding <code class="language-plaintext highlighter-rouge">set -g @plugin 'tmux-plugins/tmux-resurrect'</code> to <code class="language-plaintext highlighter-rouge">~/.tmux.conf</code>. An example <code class="language-plaintext highlighter-rouge">~/.tmux.conf</code> would be</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> ~/.tmux.conf
<span class="c"># List of plugins</span>
<span class="nb">set</span> <span class="nt">-g</span> @plugin <span class="s1">'tmux-plugins/tpm'</span>
<span class="nb">set</span> <span class="nt">-g</span> @plugin <span class="s1">'tmux-plugins/tmux-sensible'</span>
<span class="nb">set</span> <span class="nt">-g</span> @plugin <span class="s1">'tmux-plugins/tmux-resurrect'</span>
<span class="c"># Other examples:</span>
<span class="c"># set -g @plugin 'github_username/plugin_name'</span>
<span class="c"># set -g @plugin 'git@github.com/user/plugin'</span>
<span class="c"># set -g @plugin 'git@bitbucket.com/user/plugin'</span>
<span class="c"># Initialize TMUX plugin manager (keep this line at the very bottom of tmux.conf)</span>
run <span class="nt">-b</span> <span class="s1">'~/.tmux/plugins/tpm/tpm'</span>
</code></pre></div></div>
<p>Finally we install the plugin by hitting <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">I</code> in the Tmux terminal. We would see the following information if the installation was successful.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [0/0]
TMUX environment reloaded.
Done, press ESCAPE to continue.
</code></pre></div></div>
<h4 id="save-and-restore-tmux-environment">Save and Restore Tmux Environment</h4>
<p>To save the Tmux environment, we hit <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">s</code> in the Tmux Terminal. If the save was successful, a message of <code class="language-plaintext highlighter-rouge">Tmux environment saved!</code> would pop up.</p>
<p><br /></p>
<p>To restore the Tmux environment, we hit <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> + <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">r</code> in the Tmux Terminal. If the restore was successful, a message of <code class="language-plaintext highlighter-rouge">Tmux restore complete!</code> would pop up.</p>
<p><br /></p>
<p>All the sessions, windows, and panes would be saved and restored by Tmux Resurrect. Some running programs, such as <code class="language-plaintext highlighter-rouge">htop</code>, would be restored as well.</p>
<h3 id="last-tricks">Last Tricks</h3>
<h4 id="prefix-key-binding">Prefix Key Binding</h4>
<p>Sometimes hitting the hotkey prefix <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code> could be tedious. We could bind a single key to act as <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code>. The right <code class="language-plaintext highlighter-rouge">⊞ Win</code> key on my keyboard seems to be useless in Ubuntu, so we could bind the right <code class="language-plaintext highlighter-rouge">⊞ Win</code> key to <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">b</code>.</p>
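<p>Binding the <code class="language-plaintext highlighter-rouge">⊞ Win</code> key has to be done at the OS level, since Tmux itself cannot see that key. A common Tmux-side alternative, sketched below as a suggestion rather than part of my original setup, is to change the prefix to a shorter chord such as <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">a</code> in <code class="language-plaintext highlighter-rouge">~/.tmux.conf</code>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ~/.tmux.conf: use Ctrl-a as the prefix instead of the default Ctrl-b
unbind C-b
set -g prefix C-a
# Press Ctrl-a twice to send a literal Ctrl-a to the program in the pane.
bind C-a send-prefix
</code></pre></div></div>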
<h3 id="acknowledgement">Acknowledgement</h3>
<p>I would like to thank my friend Dong Meng for recommending Tmux Resurrect to me.</p>
<h3 id="final-remarks">Final Remarks</h3>
<p>There are conceptual similarities between Tmux, Docker, and Vim. A more comprehensive list of Tmux commands can be found on the <a href="https://tmuxcheatsheet.com/">Tmux Cheat Sheet</a>.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://github.com/tmux/tmux">Tmux</a></li>
<li><a href="https://tmuxcheatsheet.com/">Tmux Cheat Sheet</a></li>
<li><a href="https://github.com/tmux-plugins/tmux-resurrect">Tmux Resurrect</a></li>
</ul>
<p><a href="https://leimao.github.io/blog/Tmux-Tutorial/">Tmux Tutorial</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on September 22, 2019.</p><![CDATA[Introduction to Dirichlet Distribution]]>https://leimao.github.io/blog/Introduction-to-Dirichlet-Distribution2019-09-10 14:17:25 -0400T00:00:00-00:002019-09-10T00:00:00-05:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>The Dirichlet distribution, also called the multivariate beta distribution, is widely used in text mining techniques, such as the Dirichlet process and latent Dirichlet allocation. To better understand these text mining techniques, we have to first understand the Dirichlet distribution thoroughly. To understand the Dirichlet distribution from scratch, we also need to understand the binomial distribution, the multinomial distribution, the gamma function, the beta distribution, and their relationships.</p>
<p><br /></p>
<p>In this tutorial, we will go through the fundamentals of the binomial distribution, the multinomial distribution, the gamma function, the beta distribution, and the Dirichlet distribution, laying the foundation for the Dirichlet process and latent Dirichlet allocation.</p>
<h3 id="binomial-distribution">Binomial Distribution</h3>
<p>The binomial distribution, parameterized by $n$ and $p$, is the discrete probability distribution of the number of successes $x$ in a sequence of $n$ Bernoulli trials with success probability $p$. Formally, we denote $P(x;n,p) \sim B(n,p)$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P(x;n,p) &= \binom{n}{x} p^{x} (1-p)^{n-x} \\
&= \frac{n!}{x!(n-x)!} p^{x} (1-p)^{n-x}
\end{align} %]]></script>
<p>where $x \in \mathbb{Z}$ and $0 \leq x \leq n$.</p>
<p><br /></p>
<p>It is very easy to understand the formula. We select $x$ balls from $n$ balls for success, and the remaining $n-x$ balls are considered failures. There are $\binom{n}{x}$ combinations to select the successful balls. The probability for each combination is $p^{x} (1-p)^{n-x}$.</p>
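As a quick numerical illustration (a plain-Python sketch; <code>binomial_pmf</code> is a hypothetical helper name, not from the text), the formula can be evaluated directly and checked to sum to 1 over all $x$:

```python
from math import comb

def binomial_pmf(x, n, p):
    # P(x; n, p) = C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3
# The probabilities over x = 0, 1, ..., n sum to 1.
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))
print(round(total, 10))  # 1.0
```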
<h3 id="multinomial-distribution">Multinomial Distribution</h3>
<p>The multinomial distribution is simply a higher-dimensional generalization of the binomial distribution. The variable, instead of being a single scalar value as in the binomial distribution, is a multivariate vector in the multinomial distribution.</p>
<p><br /></p>
<p>In the multinomial distribution, we are no longer doing Bernoulli trials. Instead, each trial has $k$ possible outcomes, with success probabilities $\boldsymbol{p} = \{p_1, p_2, \cdots, p_k\}$ for each possible outcome. The multinomial distribution is the discrete probability distribution of the number of successes $\boldsymbol{x} = \{x_1, x_2, \cdots, x_k\}$ for each of the possible outcomes in a sequence of $n$ such trials. Formally, we denote $P(\boldsymbol{x};n,\boldsymbol{p}) \sim \text{Mult}(n,\boldsymbol{p})$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P(\boldsymbol{x};n, \boldsymbol{p}) &= \binom{n}{x_1} \binom{n - x_1}{x_2} \cdots \binom{n - \sum_{j=1}^{i-1}x_j}{x_i} \cdots \binom{x_k}{x_k} {p_1}^{x_1} {p_2}^{x_2} \cdots {p_k}^{x_k}\\
&= \prod_{i=1}^{k} \binom{n - \sum_{j=1}^{i-1}x_j}{x_i} \prod_{i=1}^{k} {p_i}^{x_i}\\
&= \prod_{i=1}^{k} \frac{(n - \sum_{j=1}^{i-1}x_j)!}{x_i!(n - \sum_{j=1}^{i}x_j)!} \prod_{i=1}^{k} {p_i}^{x_i}\\
&= \frac{n!}{x_1!(n-{x_1})!} \frac{(n-x_1)!}{x_2!(n-x_1-x_2)!}\cdots\frac{x_k!}{x_k!0!} \prod_{i=1}^{k} {p_i}^{x_i}\\
&= \frac{n!}{\prod_{i=1}^{k} {x_i}!} \prod_{i=1}^{k} {p_i}^{x_i}
\end{align} %]]></script>
<p>where for $1 \leq i \leq k$, $x_i \in \mathbb{Z}$, $0 \leq x_i \leq n$, $\sum_{i=1}^{k}x_i = n$, and $\sum_{i=1}^{k}p_i = 1$.</p>
<p><br /></p>
<p>It is also not hard to understand the formula. We select $x_1$ balls from $n$ balls for trials with outcome $1$, $x_2$ balls from the remaining $n-x_1$ balls for trials with outcome $2$, etc., and finally the last $x_k$ balls for trials with outcome $k$. There are $\prod_{i=1}^{k} \binom{n - \sum_{j=1}^{i-1}x_j}{x_i}$ ways to select the balls. The probability for each combination is $\prod_{i=1}^{k} {p_i}^{x_i}$.</p>
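The multinomial pmf can likewise be computed directly from the closed form (a plain-Python sketch; <code>multinomial_pmf</code> is a hypothetical helper name). With $k = 2$ it reduces to the binomial pmf:

```python
from math import factorial

def multinomial_pmf(xs, n, ps):
    # P(x; n, p) = n! / (x_1! * ... * x_k!) * p_1^x_1 * ... * p_k^x_k
    assert sum(xs) == n
    coef = factorial(n)
    for x in xs:
        coef //= factorial(x)
    prob = float(coef)
    for x, p in zip(xs, ps):
        prob *= p ** x
    return prob

# With k = 2, this is the binomial pmf of x successes and n - x failures.
print(multinomial_pmf([3, 7], 10, [0.3, 0.7]))  # ~ 0.2668
```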
<h3 id="gamma-function">Gamma Function</h3>
<p>We will talk about the gamma function, instead of the gamma distribution, because the gamma distribution is not directly needed for the Dirichlet distribution. The gamma function is well defined for any complex number $x$ whose real part $\Re(x)$ is greater than 0. The definition of the gamma function is</p>
<script type="math/tex; mode=display">\begin{align}
\Gamma(x) = \int_{0}^{\infty} {s^{x-1} e^{-s} ds}
\end{align}</script>
<p>where $\Re(x) > 0$. In this tutorial, all the numbers we use are real.</p>
<p><br /></p>
<p>The gamma function has a special property, which will be used to derive the properties of the beta distribution and the Dirichlet distribution.</p>
<script type="math/tex; mode=display">\begin{align}
\Gamma(x+1) = x\Gamma(x)
\end{align}</script>
<p>The proof is presented as follows, using the definition of the gamma function and integration by parts.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\Gamma(x+1) &= \int_{0}^{\infty} {s^{x} e^{-s} ds} \\
&= \big[s^{x} (-e^{-s})\big] \big|_{0}^{\infty} - \int_{0}^{\infty} {(x s^{x-1}) (-e^{-s}) ds} \\
&= (0 - 0) + x \int_{0}^{\infty} {s^{x-1} e^{-s} ds} \\
&= x \Gamma(x)
\end{align*} %]]></script>
<p>This concludes the proof.</p>
<p><br /></p>
<p>There are some special values for gamma function.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Gamma(1) & =1\\
\Gamma(\frac{1}{2}) &= \sqrt{\pi}
\end{align} %]]></script>
<p>It might not be trivial to find $\Gamma(\frac{1}{2}) = \sqrt{\pi}$. To show this, we have to use the properties of Gaussian distribution.</p>
<p><br /></p>
<p>The probability density of a Gaussian distribution is well defined from $-\infty$ to $\infty$.</p>
<script type="math/tex; mode=display">\begin{align}
\varphi(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{align}</script>
<p>When $\mu = 0$ and $\sigma^2 = \frac{1}{2}$,</p>
<script type="math/tex; mode=display">\begin{align}
\varphi(x) = \frac{1}{\sqrt{\pi}} e^{-{x^2}}
\end{align}</script>
<p>This Gaussian distribution is symmetric about $x=0$. So we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\int_{0}^{\infty} \varphi(x) dx &= \int_{0}^{\infty} \frac{1}{\sqrt{\pi}} e^{-{x^2}} dx \\
&= \frac{1}{\sqrt{\pi}} \int_{0}^{\infty} e^{-{x^2}} dx \\
&= \frac{1}{2}
\end{align*} %]]></script>
<p>Therefore,</p>
<script type="math/tex; mode=display">\begin{align}
\int_{0}^{\infty} e^{-{x^2}} dx = \frac{\sqrt{\pi}}{2}
\end{align}</script>
<p>We use this integral for $\Gamma(\frac{1}{2})$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\Gamma(\frac{1}{2}) &= \int_{0}^{\infty} {s^{-\frac{1}{2}} e^{-s} ds} \\
&= 2 \int_{0}^{\infty} { e^{-s} d{s^{\frac{1}{2}}}} \\
&= 2 \int_{0}^{\infty} e^{-x^2} d{x} \\
&= 2 \frac{\sqrt{\pi}}{2} \\
&= \sqrt{\pi}
\end{align*} %]]></script>
<p>Because $\Gamma(1) = 1$ and $\Gamma(x+1) = x\Gamma(x)$, for any positive integer $n$ we have $\Gamma(n) = (n-1)!$, so on the positive integers the gamma function is exactly the (shifted) factorial function. In general, the gamma function could be considered the continuous interpolation of the factorial function.</p>
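These identities are easy to verify numerically with the standard library's <code>math.gamma</code> (a sanity-check sketch, not part of the derivation):

```python
import math

# Gamma(1/2) = sqrt(pi), the recurrence Gamma(x+1) = x * Gamma(x),
# and Gamma(n) = (n-1)! for positive integers n.
assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))
assert math.isclose(math.gamma(5.3), 4.3 * math.gamma(4.3))
assert math.gamma(6) == math.factorial(5)
print("gamma identities verified")
```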
<h3 id="jacobian">Jacobian</h3>
<p>The Jacobian, denoted by $J$, is the determinant of the Jacobian matrix, which appears when changing the variables in multiple integrals.</p>
<p><br /></p>
<p>For a continuous 1-to-1 transformation $\Phi$ mapping a region $D$ in the space $(x_1, x_2, \cdots, x_k)$ to a region $D^{\ast}$ in the space $(y_1, y_2, \cdots, y_k)$,</p>
<script type="math/tex; mode=display">\begin{gather*}
x_1 = \Phi_1(y_1, y_2, \cdots, y_k) \\
x_2 = \Phi_2(y_1, y_2, \cdots, y_k) \\
\cdots \\
x_k = \Phi_k(y_1, y_2, \cdots, y_k) \\
\end{gather*}</script>
<p>The Jacobian matrix, denoted by $\frac{\partial(x_1, x_2, \cdots, x_k)}{\partial(y_1, y_2, \cdots, y_k)}$, is defined as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial(x_1, x_2, \cdots, x_k)}{\partial(y_1, y_2, \cdots, y_k)} = \begin{bmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \dots & \frac{\partial x_1}{\partial y_k} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \dots & \frac{\partial x_2}{\partial y_k} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_k}{\partial y_1} & \frac{\partial x_k}{\partial y_2} & \dots & \frac{\partial x_k}{\partial y_k} \\
\end{bmatrix}
\end{align} %]]></script>
<p>The Jacobian is defined as the determinant of the Jacobian matrix.</p>
<script type="math/tex; mode=display">\begin{align}
J = \left| \frac{\partial(x_1, x_2, \cdots, x_k)}{\partial(y_1, y_2, \cdots, y_k)} \right|
\end{align}</script>
<p>We then have the following transformation for the multiple integrals.</p>
<script type="math/tex; mode=display">\begin{align}
\idotsint\limits_{D} f(x_1, x_2, \cdots, x_k) d{x_1}d{x_2}{\cdots}d{x_k}= \idotsint\limits_{D^{\ast}} f(\Phi(y_1, y_2, \cdots, y_k)) J d{y_1}d{y_2}{\cdots}d{y_k}
\end{align}</script>
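As a concrete sanity check of this formula (a plain-Python sketch using the familiar polar-coordinate transformation, which is not used elsewhere in this post): with $x = r\cos\theta$ and $y = r\sin\theta$, the Jacobian is $J = r$, so the area of the unit disk equals $\int_0^{2\pi}\int_0^1 r \, dr \, d\theta = \pi$.

```python
import math

# Midpoint-rule approximation of the integral of J = r over
# r in [0, 1] and theta in [0, 2*pi]; the result should be pi.
N = 400
dr, dtheta = 1.0 / N, 2.0 * math.pi / N
area = sum((i + 0.5) * dr * dr * dtheta for i in range(N) for _ in range(N))
print(area)  # ~ 3.14159...
```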
<h3 id="beta-distribution">Beta Distribution</h3>
<p>The beta distribution is a family of continuous probability distributions well defined on the interval $[0, 1]$, parametrized by two positive shape parameters, denoted by $\alpha$ and $\beta$, that control the shape of the distribution. Formally, we denote $P(p;\alpha,\beta) \sim \text{Beta}(\alpha,\beta)$.</p>
<script type="math/tex; mode=display">\begin{align}
P(p;\alpha,\beta) = \frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1}
\end{align}</script>
<p>where $B(\alpha,\beta)$ is a normalizing constant, and</p>
<script type="math/tex; mode=display">\begin{align}
B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
\end{align}</script>
<p>It is less well known why $B(\alpha,\beta)$ can be expressed in this way, so we will derive it in this tutorial.</p>
<p><br /></p>
<p>Because</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\int_{0}^{1} P(p;\alpha,\beta) dp &= \int_{0}^{1} \frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1} dp \\
&= \frac{1}{B(\alpha,\beta)} \int_{0}^{1} p^{\alpha-1}(1-p)^{\beta-1} dp \\
&= 1
\end{align*} %]]></script>
<p>We have</p>
<script type="math/tex; mode=display">\begin{align}
B(\alpha,\beta) = \int_{0}^{1} p^{\alpha-1}(1-p)^{\beta-1} dp
\end{align}</script>
<p>We then check what $\Gamma(\alpha)\Gamma(\beta)$ is.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Gamma(\alpha)\Gamma(\beta) &= \int_{0}^{\infty}u^{\alpha-1}e^{-u}du \int_{0}^{\infty}v^{\beta-1}e^{-v}dv \nonumber\\
&= \int_{0}^{\infty} \int_{0}^{\infty} u^{\alpha-1}e^{-u} v^{\beta-1}e^{-v} du dv \nonumber\\
&= \int_{0}^{\infty} \int_{0}^{\infty} u^{\alpha-1} v^{\beta-1}e^{-(u+v)} du dv
\end{align} %]]></script>
<p>We set $x=\frac{u}{u+v}$ and $y=u+v$, where $x \in [0,1]$ and $y \in [0,\infty)$, so the mapping from the $uv$ space to the $xy$ space is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
u &= xy \\
v &= (1-x)y
\end{align} %]]></script>
<p>The Jacobian matrix is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial(u,v)}{\partial(x,y)} &=
\begin{bmatrix}
\frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} \\
\frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} \\
\end{bmatrix} \nonumber\\
&=
\begin{bmatrix}
y & x \\
-y & 1-x \\
\end{bmatrix}
\end{align} %]]></script>
<p>The Jacobian is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
J &= \left| \frac{\partial(u,v)}{\partial(x,y)} \right| \nonumber\\
&= y(1-x) - x(-y) \nonumber\\
&= y
\end{align} %]]></script>
<p>By applying the transformation for multiple integrals,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Gamma(\alpha)\Gamma(\beta) &= \int_{0}^{\infty} \int_{0}^{\infty} u^{\alpha-1} v^{\beta-1}e^{-(u+v)} du dv \nonumber\\
&= \int_{y=0}^{\infty} \int_{x=0}^{1} (xy)^{\alpha-1} [(1-x)y]^{\beta-1}e^{-y} y dx dy \nonumber\\
&= \int_{y=0}^{\infty} \int_{x=0}^{1} x^{\alpha-1} (1-x)^{\beta-1} y^{\alpha + \beta -1} e^{-y} dx dy \nonumber\\
&= \int_{0}^{\infty} y^{\alpha + \beta -1} e^{-y} dy\int_{0}^{1} x^{\alpha-1} (1-x)^{\beta-1} dx \nonumber\\
&= \Gamma(\alpha+\beta) \int_{0}^{1} x^{\alpha-1} (1-x)^{\beta-1} dx \nonumber\\
&= \Gamma(\alpha+\beta) B(\alpha,\beta)
\end{align} %]]></script>
<p>Therefore, this concludes the proof for</p>
<script type="math/tex; mode=display">\begin{align*}
B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
\end{align*}</script>
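The identity is also easy to check numerically (a plain-Python sketch; <code>beta_normalizer</code> is a hypothetical helper that integrates the defining integral of $B(\alpha,\beta)$ with the midpoint rule):

```python
import math

def beta_normalizer(a, b, n=200000):
    # Midpoint-rule integration of p^(a-1) * (1 - p)^(b-1) over [0, 1].
    h = 1.0 / n
    return h * sum(((i + 0.5) * h) ** (a - 1) * (1.0 - (i + 0.5) * h) ** (b - 1)
                   for i in range(n))

a, b = 2.5, 3.0
exact = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
print(abs(beta_normalizer(a, b) - exact) < 1e-6)  # True
```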
<p>In addition, beta distribution is the conjugate prior for binomial distribution. We have prior $P(p;\alpha,\beta) \sim \text{Beta}(\alpha, \beta)$, and likelihood $P(x|p;n) \sim B(n,p)$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P(p|x;n,\alpha,\beta) &\propto P(x|p;n) P(p;\alpha,\beta) \nonumber\\
&= \binom{n}{x} p^{x} (1-p)^{n-x} \frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1} \nonumber\\
&= \frac{\binom{n}{x}}{B(\alpha,\beta)}p^{x + \alpha-1}(1-p)^{n - x + \beta-1} \nonumber\\
&\propto p^{x + \alpha-1}(1-p)^{n - x + \beta-1}
\end{align} %]]></script>
<p>Therefore, $P(p|x;n,\alpha,\beta) \sim \text{Beta}(x + \alpha, n - x + \beta)$.</p>
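In other words, a Beta prior is updated by simply adding the observed success and failure counts to its parameters. A minimal sketch (with a hypothetical helper name):

```python
def beta_binomial_update(alpha, beta, successes, trials):
    # Posterior is Beta(successes + alpha, trials - successes + beta).
    return alpha + successes, beta + (trials - successes)

# Start from a uniform prior Beta(1, 1) and observe 7 successes in 10 trials.
a, b = beta_binomial_update(1.0, 1.0, 7, 10)
print(a, b)         # 8.0 4.0
print(a / (a + b))  # posterior mean of p
```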
<h3 id="dirichlet-distribution">Dirichlet Distribution</h3>
<p>Analogous to the relationship between the multinomial and binomial distributions, the Dirichlet distribution is the multivariate version of the beta distribution. The Dirichlet distribution is a family of continuous probability distributions over a discrete probability distribution for $k$ categories $\boldsymbol{p} = \{p_1, p_2, \cdots, p_k\}$, where $0 \leq p_i \leq 1$ for $i \in [1,k]$ and $\sum_{i=1}^{k} p_i = 1$, parameterized by $k$ parameters $\boldsymbol{\alpha} = \{\alpha_1, \alpha_2, \cdots, \alpha_k\}$. Formally, we denote $P(\boldsymbol{p};\boldsymbol{\alpha}) \sim \text{Dir}(\boldsymbol{\alpha})$.</p>
<script type="math/tex; mode=display">\begin{align}
P(\boldsymbol{p};\boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})}\prod_{i=1}^{k} {p_i}^{\alpha_i-1}
\end{align}</script>
<p>where $B(\boldsymbol{\alpha})$ is a normalizing constant, and</p>
<script type="math/tex; mode=display">\begin{align}
B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{k}\Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{k}\alpha_i)}
\end{align}</script>
<p>Not surprisingly, when $k=2$, Dirichlet distribution becomes beta distribution.</p>
<p><br /></p>
<p>Similar to the normalizer in the beta distribution, we will show that $B(\boldsymbol{\alpha})$ can be expressed in this way.</p>
<p><br /></p>
<p>Because</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\int P(\boldsymbol{p};\boldsymbol{\alpha}) d\boldsymbol{p} &= \int \frac{1}{B(\boldsymbol{\alpha})}\prod_{i=1}^{k} {p_i}^{\alpha_i-1} d\boldsymbol{p} \\
&= \int_{p_{k-1}=0}^{1} \cdots \int_{p_1=0}^{1} \frac{1}{B(\boldsymbol{\alpha})}(\prod_{i=1}^{k-1} {p_i}^{\alpha_i-1})(1-\sum_{i=1}^{k-1} p_i)^{\alpha_k-1} d{p_1} d{p_2} \cdots d{p_{k-1}} \\
&= \frac{1}{B(\boldsymbol{\alpha})} \int_{p_{k-1}=0}^{1} \cdots \int_{p_1=0}^{1} (\prod_{i=1}^{k-1} {p_i}^{\alpha_i-1})(1-\sum_{i=1}^{k-1} p_i)^{\alpha_k-1} d{p_1} d{p_2} \cdots d{p_{k-1}} \\
&= 1
\end{align*} %]]></script>
<p>We have</p>
<script type="math/tex; mode=display">\begin{align}
B(\boldsymbol{\alpha}) = \int_{p_{k-1}=0}^{1} \cdots \int_{p_1=0}^{1} (\prod_{i=1}^{k-1} {p_i}^{\alpha_i-1})(1-\sum_{i=1}^{k-1} p_i)^{\alpha_k-1} d{p_1} d{p_2} \cdots d{p_{k-1}}
\end{align}</script>
<p>We then check what $\prod_{i=1}^{k}\Gamma(\alpha_i)$ is.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\prod_{i=1}^{k}\Gamma(\alpha_i) &= \prod_{i=1}^{k} \int_{0}^{\infty}{x_i}^{\alpha_i-1}e^{-x_i}d{x_i} \nonumber\\
&= \int_{x_{k}=0}^{\infty}\int_{x_{k-1}=0}^{\infty} \cdots \int_{x_1=0}^{\infty} \prod_{i=1}^{k} ({x_i}^{\alpha_i-1}e^{-x_i}) d{x_1} d{x_2} \cdots d{x_k} \nonumber\\
&= \int_{x_{k}=0}^{\infty}\int_{x_{k-1}=0}^{\infty} \cdots \int_{x_1=0}^{\infty} e^{-\sum_{i=1}^{k}x_i} \prod_{i=1}^{k} {x_i}^{\alpha_i-1} d{x_1} d{x_2} \cdots d{x_k}
\end{align} %]]></script>
<p>We set</p>
<script type="math/tex; mode=display">\begin{gather*}
z_k = \sum_{i=1}^{k} x_i \\
y_1 = \frac{x_1}{z_k} \\
\cdots \\
y_{k-1} = \frac{x_{k-1}}{z_k}
\end{gather*}</script>
<p>where $y_i \in [0,1]$ for $1 \leq i \leq k-1$ and $z_k \in [0,\infty)$, so the mapping from the $(x_1, x_2, \cdots, x_k)$ space to the $(y_1, y_2, \cdots, y_{k-1}, z_k)$ space is</p>
<script type="math/tex; mode=display">\begin{gather*}
x_1 = y_1 z_k \\
x_2 = y_2 z_k \\
\cdots \\
x_{k-1} = y_{k-1} z_k \\
x_{k} = (1- \sum_{i=1}^{k-1} y_{i}) z_k \\
\end{gather*}</script>
<p>The Jacobian matrix is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial(x_1,x_2,\cdots,x_{k-1},x_k)}{\partial(y_1,y_2,\cdots,y_{k-1},z_k)} &=
\begin{bmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \dots & \frac{\partial x_1}{\partial y_{k-1}} & \frac{\partial x_1}{\partial z_k} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \dots & \frac{\partial x_2}{\partial y_{k-1}} & \frac{\partial x_2}{\partial z_k} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\frac{\partial x_{k-1}}{\partial y_1} & \frac{\partial x_{k-1}}{\partial y_2} & \dots & \frac{\partial x_{k-1}}{\partial y_{k-1}} & \frac{\partial x_{k-1}}{\partial z_k} \\
\frac{\partial x_k}{\partial y_1} & \frac{\partial x_k}{\partial y_2} & \dots & \frac{\partial x_{k}}{\partial y_{k-1}} & \frac{\partial x_k}{\partial z_k} \\
\end{bmatrix} \nonumber\\
&=
\begin{bmatrix}
z_k & 0 & \dots & 0 & y_1 \\
0 & z_k & \dots & 0 & y_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \dots & z_k & y_{k-1} \\
-z_k & -z_k & \dots & -z_k & (1- \sum_{i=1}^{k-1} y_{i}) \\
\end{bmatrix}
\end{align} %]]></script>
<p>The Jacobian is computed via Gaussian elimination, adding each of rows $1$ through $k-1$ to row $k$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
J &= \left| \frac{\partial(x_1,x_2,\cdots,x_{k-1},x_k)}{\partial(y_1,y_2,\cdots,y_{k-1},z_k)} \right| \nonumber\\
&=
\begin{vmatrix}
z_k & 0 & \dots & 0 & y_1 \\
0 & z_k & \dots & 0 & y_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \dots & z_k & y_{k-1} \\
-z_k & -z_k & \dots & -z_k & (1- \sum_{i=1}^{k-1} y_{i}) \\
\end{vmatrix} \nonumber\\
&=
\begin{vmatrix}
z_k & 0 & \dots & 0 & y_1 \\
0 & z_k & \dots & 0 & y_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \dots & z_k & y_{k-1} \\
0 & 0 & \dots & 0 & 1 \\
\end{vmatrix} \\
&= {z_k}^{k-1}
\end{align} %]]></script>
<p>By applying the transformation for multiple integrals,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\prod_{i=1}^{k}\Gamma(\alpha_i) &= \int_{x_{k}=0}^{\infty}\int_{x_{k-1}=0}^{\infty} \cdots \int_{x_1=0}^{\infty} e^{-\sum_{i=1}^{k}x_i} \prod_{i=1}^{k} {x_i}^{\alpha_i-1} d{x_1} d{x_2} \cdots d{x_{k-1}}d{x_k} \nonumber\\
&= \int_{z_{k}=0}^{\infty}\int_{y_{k-1}=0}^{1} \cdots \int_{y_1=0}^{1} e^{-z_k} {z_k}^{\sum_{i=1}^{k} \alpha_i - k} (\prod_{i=1}^{k-1} {y_i}^{\alpha_i-1})(1- \sum_{i=1}^{k-1} y_{i})^{\alpha_k - 1} {z_k}^{k-1} d{y_1} d{y_2} \cdots d{y_{k-1}}d{z_k} \nonumber \\
&= \int_{z_{k}=0}^{\infty} e^{-z_k} {z_k}^{\sum_{i=1}^{k} \alpha_i - 1} d{z_k} \int_{y_{k-1}=0}^{1} \cdots \int_{y_1=0}^{1} (\prod_{i=1}^{k-1} {y_i}^{\alpha_i-1})(1- \sum_{i=1}^{k-1} y_{i})^{\alpha_k - 1} d{y_1} d{y_2} \cdots d{y_{k-1}} \nonumber \\
&= \Gamma(\sum_{i=1}^{k} \alpha_i) \int_{y_{k-1}=0}^{1} \cdots \int_{y_1=0}^{1} (\prod_{i=1}^{k-1} {y_i}^{\alpha_i-1})(1- \sum_{i=1}^{k-1} y_{i})^{\alpha_k - 1} d{y_1} d{y_2} \cdots d{y_{k-1}} \\
&= \Gamma(\sum_{i=1}^{k} \alpha_i) B(\boldsymbol{\alpha})
\end{align} %]]></script>
<p>Therefore, this concludes the proof for</p>
<script type="math/tex; mode=display">\begin{align*}
B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{k}\Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{k}\alpha_i)}
\end{align*}</script>
<p>In addition, Dirichlet distribution is the conjugate prior for multinomial distribution. We have prior $P(\boldsymbol{p};\boldsymbol{\alpha}) \sim \text{Dir}(\boldsymbol{\alpha})$, and likelihood $P(\boldsymbol{x}|\boldsymbol{p};n) \sim \text{Mult}(n,\boldsymbol{p})$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P(\boldsymbol{p}|\boldsymbol{x};n,\boldsymbol{\alpha}) &\propto P(\boldsymbol{x}|\boldsymbol{p};n) P(\boldsymbol{p};\boldsymbol{\alpha}) \nonumber\\
&= \prod_{i=1}^{k} \binom{n - \sum_{j=1}^{i-1}x_j}{x_i} \prod_{i=1}^{k} {p_i}^{x_i} \frac{1}{B(\boldsymbol{\alpha})}\prod_{i=1}^{k} {p_i}^{\alpha_i-1} \nonumber\\
&= \frac{\prod_{i=1}^{k} \binom{n - \sum_{j=1}^{i-1}x_j}{x_i}}{B(\boldsymbol{\alpha})}\prod_{i=1}^{k} {p_i}^{x_i + \alpha_i-1} \nonumber\\
&\propto \prod_{i=1}^{k} {p_i}^{x_i + \alpha_i-1}
\end{align} %]]></script>
<p>Therefore, $P(\boldsymbol{p}|\boldsymbol{x};n,\boldsymbol{\alpha}) \sim \text{Dir}(\boldsymbol{x} + \boldsymbol{\alpha})$.</p>
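The posterior update is therefore just element-wise addition of the observed category counts to the prior parameters. A minimal sketch (with a hypothetical helper name):

```python
def dirichlet_multinomial_update(alpha, counts):
    # Posterior is Dir(x + alpha): add observed category counts element-wise.
    return [a + x for a, x in zip(alpha, counts)]

# Symmetric prior over k = 3 categories, then counts from n = 12 trials.
posterior = dirichlet_multinomial_update([1.0, 1.0, 1.0], [5, 4, 3])
total = sum(posterior)
print(posterior)                       # [6.0, 5.0, 4.0]
print([a / total for a in posterior])  # posterior mean of p
```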
<h3 id="references">References</h3>
<ul>
<li>Jiayu Lin, <a href="https://mast.queensu.ca/~communications/Papers/msc-jiayu-lin.pdf">On The Dirichlet Distribution</a>.</li>
<li><a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian Matrix and Determinant</a></li>
<li><a href="https://web.ma.utexas.edu/users/m408s/m408d/CurrentWeb/LM15-10-4.php">Jacobian</a></li>
</ul>
<p><a href="https://leimao.github.io/blog/Introduction-to-Dirichlet-Distribution/">Introduction to Dirichlet Distribution</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on September 10, 2019.</p><![CDATA[TensorFlow Inference for Estimator]]>https://leimao.github.io/blog/TensorFlow-Estimator-SavedModel2019-08-29 14:17:25 -0400T00:00:00-00:002019-08-29T00:00:00-05:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>Although I am not a big fan of the high-level API <code class="language-plaintext highlighter-rouge">Estimator</code> in TensorFlow, more and more models now use <code class="language-plaintext highlighter-rouge">Estimator</code> for training, evaluation, and inference. Because everything is wrapped up, the machine learning process becomes less transparent. This is likely to cause a lot of trouble when engineering a fast inference process, especially for people who are not familiar with the high-level APIs.</p>
<p><br /></p>
<p>In this blog post, I am going to provide comprehensive guidance on how to set up fast inference protocols for TensorFlow models based on <code class="language-plaintext highlighter-rouge">Estimator</code>.</p>
<h3 id="repository">Repository</h3>
<p>The <a href="https://github.com/leimao/TensorFlow_Estimator_Basics">sample code</a> for this tutorial was forked from Guillaume Genthial’s <a href="https://github.com/guillaumegenthial/tf-estimator-basics">tf-estimator-basics</a>, with some modifications. All the tests were conducted using an NVIDIA RTX 2080 Ti graphics card.</p>
<p><br /></p>
<p>To know more about the details of the model, please check Guillaume Genthial’s <a href="https://guillaumegenthial.github.io/serving-tensorflow-estimator.html">blog post</a>.</p>
<p><br /></p>
<p>Before starting to do inference tests, please train the model by running the following command in the terminal.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python train.py
</code></pre></div></div>
<p>Please also export the model to <code class="language-plaintext highlighter-rouge">SavedModel</code> by running the following command in the terminal.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python export.py
</code></pre></div></div>
<p>I have also provided the pre-trained <code class="language-plaintext highlighter-rouge">ckpt</code> model and <code class="language-plaintext highlighter-rouge">SavedModel</code> in the GitHub repository.</p>
<h3 id="fast-inference-protocols">Fast Inference Protocols</h3>
<p>TensorFlow <code class="language-plaintext highlighter-rouge">Estimator</code> uses the <code class="language-plaintext highlighter-rouge">predict</code> method to do inference. The <a href="https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#predict"><code class="language-plaintext highlighter-rouge">predict</code></a> method takes an <code class="language-plaintext highlighter-rouge">input_fn</code>, which, upon being called, returns an input from a generator to the model. Without orchestration, if new data comes in batches, we would have to create an <code class="language-plaintext highlighter-rouge">input_fn</code> for each batch of the new data and run the <code class="language-plaintext highlighter-rouge">predict</code> method. The <code class="language-plaintext highlighter-rouge">predict</code> method returns a generator over the prediction values corresponding to the input values generated from <code class="language-plaintext highlighter-rouge">input_fn</code>.</p>
<p><br /></p>
<p>The problem is that TensorFlow will create the graph and load all the parameters of the model when <code class="language-plaintext highlighter-rouge">predict</code> is called. Once <code class="language-plaintext highlighter-rouge">input_fn</code> raises an end-of-input exception during the function call, TensorFlow will destroy the graph and release the memory for all the parameters. This overhead takes a very long time. During inference, if we create an <code class="language-plaintext highlighter-rouge">input_fn</code> for each batch of the new data, this overhead will make the inference extremely slow.</p>
<p><br /></p>
<p>There are generally two ways to make the inference of <code class="language-plaintext highlighter-rouge">Estimator</code>-based models faster: using <code class="language-plaintext highlighter-rouge">predict</code> while keeping the graph alive all the time, and converting <code class="language-plaintext highlighter-rouge">Estimator</code>-based models to <code class="language-plaintext highlighter-rouge">SavedModel</code> and serving them.</p>
<h4 id="keeping-graph-alive">Keeping Graph Alive</h4>
<p>As I mentioned previously, the graph will be destroyed when <code class="language-plaintext highlighter-rouge">input_fn</code> raises an end-of-input exception during the function call. So if the <code class="language-plaintext highlighter-rouge">input_fn</code> uses an indefinite generator, it will never raise an end-of-input exception, and the graph will stay alive all the time. Designing such an indefinite generator is therefore very important.</p>
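The core trick is framework-agnostic: feed <code>predict</code> from a generator that blocks waiting for the next example instead of ever raising <code>StopIteration</code>. A minimal plain-Python sketch of the pattern (the <code>FastPredictor</code> class and all names here are hypothetical; the TensorFlow-specific wiring lives in the linked implementations):

```python
import queue

class FastPredictor:
    """Keep a prediction loop alive by never exhausting its input generator."""

    def __init__(self, predict_fn):
        self._inputs = queue.Queue()
        # predict_fn consumes a generator and yields one prediction per item.
        # Because the generator below never raises StopIteration, the
        # underlying graph/session would be built once and kept alive.
        self._predictions = predict_fn(self._infinite_inputs())

    def _infinite_inputs(self):
        while True:
            yield self._inputs.get()  # blocks until the next example arrives

    def predict(self, example):
        self._inputs.put(example)
        return next(self._predictions)

# Usage with a stand-in "model" (any generator-consuming function):
def dummy_model(examples):
    for x in examples:
        yield 2 * x

fp = FastPredictor(dummy_model)
print(fp.predict(3))   # 6
print(fp.predict(10))  # 20
```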
<p><br /></p>
<p>I have tested the vanilla <code class="language-plaintext highlighter-rouge">Estimator</code> <code class="language-plaintext highlighter-rouge">predict</code> by running <a href="https://github.com/leimao/TensorFlow_Estimator_Basics/blob/master/predict.py"><code class="language-plaintext highlighter-rouge">predict.py</code></a> in the repository. It takes 0.1152 seconds per example using a batch size of 1, which is extremely slow.</p>
<p><br /></p>
<p>Marc Stogaitis has implemented <a href="https://github.com/marcsto/rl/blob/master/src/fast_predict2.py">FastPredict</a> as a wrapper for the <code class="language-plaintext highlighter-rouge">predict</code> method of <code class="language-plaintext highlighter-rouge">Estimator</code>, using an indefinite generator. I have applied his wrapper to the same model and tested it by running <a href="https://github.com/leimao/TensorFlow_Estimator_Basics/blob/master/fast_predict.py"><code class="language-plaintext highlighter-rouge">fast_predict.py</code></a>. It takes 0.352 milliseconds per example using a batch size of 1, which is extremely fast. However, the shortcoming of his interface is that it only allows exactly one example at a time.</p>
<p><br /></p>
<p>I modified Marc Stogaitis’s interface implementation so that multiple examples can be fed at a time, although the inference is still done using a batch size of 1. I have tested it by running <a href="https://github.com/leimao/TensorFlow_Estimator_Basics/blob/master/faster_predict.py"><code class="language-plaintext highlighter-rouge">faster_predict.py</code></a>. It takes 0.238 milliseconds per example using a batch size of 1, which is somehow 30% faster than Marc Stogaitis’s implementation.</p>
<h4 id="inference-on-savedmodel">Inference on SavedModel</h4>
<p>Guillaume Genthial has talked about exporting the model to <code class="language-plaintext highlighter-rouge">SavedModel</code> and doing inference on it in his <a href="https://guillaumegenthial.github.io/serving-tensorflow-estimator.html">blog post</a>, so I am not going to elaborate too much on it. I have tested it by running <a href="https://github.com/leimao/TensorFlow_Estimator_Basics/blob/master/serve.py"><code class="language-plaintext highlighter-rouge">serve.py</code></a>. It takes 0.178 milliseconds per example using a batch size of 1, which is 40% faster than my <code class="language-plaintext highlighter-rouge">Estimator</code> <code class="language-plaintext highlighter-rouge">predict</code> solution.</p>
<h4 id="changing-prediction-tensors">Changing Prediction Tensors</h4>
<p>Sometimes, you would like to change the default output tensors from the original settings. For example, you would like to extract some hidden layer tensors. You can change the <code class="language-plaintext highlighter-rouge">model_fn</code> function passed to <code class="language-plaintext highlighter-rouge">Estimator</code> when loading the model using</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">estimator</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">Estimator</span><span class="p">(</span><span class="n">model_fn</span><span class="o">=</span><span class="n">model_fn</span><span class="p">,</span> <span class="n">config</span><span class="o">=</span><span class="n">run_config</span><span class="p">)</span>
</code></pre></div></div>
<p>The output node in the <code class="language-plaintext highlighter-rouge">estimator</code> built using the following <code class="language-plaintext highlighter-rouge">model_fn</code> is the <code class="language-plaintext highlighter-rouge">predictions</code> tensor, and its name in the graph is <code class="language-plaintext highlighter-rouge">'output'</code> by default.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">model_fn</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">mode</span><span class="p">,</span> <span class="n">params</span><span class="p">):</span>
<span class="c1"># pylint: disable=unused-argument
</span> <span class="s">"""Dummy model_fn"""</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span> <span class="c1"># For serving
</span> <span class="n">features</span> <span class="o">=</span> <span class="n">features</span><span class="p">[</span><span class="s">'feature'</span><span class="p">]</span>
<span class="n">hidden</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="n">hidden</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">if</span> <span class="n">mode</span> <span class="o">==</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">ModeKeys</span><span class="o">.</span><span class="n">PREDICT</span><span class="p">:</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">EstimatorSpec</span><span class="p">(</span><span class="n">mode</span><span class="p">,</span> <span class="n">predictions</span><span class="o">=</span><span class="n">predictions</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">l2_loss</span><span class="p">(</span><span class="n">predictions</span> <span class="o">-</span> <span class="n">labels</span><span class="p">)</span>
<span class="k">if</span> <span class="n">mode</span> <span class="o">==</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">ModeKeys</span><span class="o">.</span><span class="n">EVAL</span><span class="p">:</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">EstimatorSpec</span><span class="p">(</span>
<span class="n">mode</span><span class="p">,</span> <span class="n">loss</span><span class="o">=</span><span class="n">loss</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">mode</span> <span class="o">==</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">ModeKeys</span><span class="o">.</span><span class="n">TRAIN</span><span class="p">:</span>
<span class="n">train_op</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">AdamOptimizer</span><span class="p">(</span><span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span>
<span class="n">loss</span><span class="p">,</span> <span class="n">global_step</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">get_global_step</span><span class="p">())</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">EstimatorSpec</span><span class="p">(</span>
<span class="n">mode</span><span class="p">,</span> <span class="n">loss</span><span class="o">=</span><span class="n">loss</span><span class="p">,</span> <span class="n">train_op</span><span class="o">=</span><span class="n">train_op</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span>
</code></pre></div></div>
<p>To add more output nodes, we pass a dictionary <code class="language-plaintext highlighter-rouge">{'hidden': hidden, 'predictions': predictions}</code> as <code class="language-plaintext highlighter-rouge">predictions</code> instead. Here the names of the output nodes in the graph are <code class="language-plaintext highlighter-rouge">'hidden'</code> and <code class="language-plaintext highlighter-rouge">'predictions'</code>, respectively.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">model_fn</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">mode</span><span class="p">,</span> <span class="n">params</span><span class="p">):</span>
<span class="c1"># pylint: disable=unused-argument
</span> <span class="s">"""Dummy model_fn"""</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span> <span class="c1"># For serving
</span> <span class="n">features</span> <span class="o">=</span> <span class="n">features</span><span class="p">[</span><span class="s">'feature'</span><span class="p">]</span>
<span class="n">hidden</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="n">hidden</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">if</span> <span class="n">mode</span> <span class="o">==</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">ModeKeys</span><span class="o">.</span><span class="n">PREDICT</span><span class="p">:</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">EstimatorSpec</span><span class="p">(</span><span class="n">mode</span><span class="p">,</span> <span class="n">predictions</span><span class="o">=</span><span class="p">{</span><span class="s">'hidden'</span><span class="p">:</span><span class="n">hidden</span><span class="p">,</span> <span class="s">'predictions'</span><span class="p">:</span><span class="n">predictions</span><span class="p">})</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">l2_loss</span><span class="p">(</span><span class="n">predictions</span> <span class="o">-</span> <span class="n">labels</span><span class="p">)</span>
<span class="k">if</span> <span class="n">mode</span> <span class="o">==</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">ModeKeys</span><span class="o">.</span><span class="n">EVAL</span><span class="p">:</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">EstimatorSpec</span><span class="p">(</span>
<span class="n">mode</span><span class="p">,</span> <span class="n">loss</span><span class="o">=</span><span class="n">loss</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">mode</span> <span class="o">==</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">ModeKeys</span><span class="o">.</span><span class="n">TRAIN</span><span class="p">:</span>
<span class="n">train_op</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">AdamOptimizer</span><span class="p">(</span><span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span>
<span class="n">loss</span><span class="p">,</span> <span class="n">global_step</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">get_global_step</span><span class="p">())</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">estimator</span><span class="o">.</span><span class="n">EstimatorSpec</span><span class="p">(</span>
<span class="n">mode</span><span class="p">,</span> <span class="n">loss</span><span class="o">=</span><span class="n">loss</span><span class="p">,</span> <span class="n">train_op</span><span class="o">=</span><span class="n">train_op</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span>
</code></pre></div></div>
<p>When using <code class="language-plaintext highlighter-rouge">predictor</code> to run inference on the <code class="language-plaintext highlighter-rouge">SavedModel</code>, we simply extract the values from the output dictionary using the output node names.</p>
<p><br /></p>
<p>We change the <a href="https://github.com/leimao/TensorFlow_Estimator_Basics/blob/master/serve.py"><code class="language-plaintext highlighter-rouge">serve.py</code></a> from</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">for</span> <span class="n">nb</span> <span class="ow">in</span> <span class="n">my_service</span><span class="p">():</span>
<span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">pred</span> <span class="o">=</span> <span class="n">predict_fn</span><span class="p">({</span><span class="s">'number'</span><span class="p">:</span> <span class="p">[[</span><span class="n">nb</span><span class="p">]]})[</span><span class="s">'output'</span><span class="p">]</span>
</code></pre></div></div>
<p>to</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">for</span> <span class="n">nb</span> <span class="ow">in</span> <span class="n">my_service</span><span class="p">():</span>
<span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">pred</span> <span class="o">=</span> <span class="n">predict_fn</span><span class="p">({</span><span class="s">'number'</span><span class="p">:</span> <span class="p">[[</span><span class="n">nb</span><span class="p">]]})</span>
<span class="n">hidden</span> <span class="o">=</span> <span class="n">pred</span><span class="p">[</span><span class="s">'hidden'</span><span class="p">]</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">pred</span><span class="p">[</span><span class="s">'predictions'</span><span class="p">]</span>
</code></pre></div></div>
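<p>The construction of <code class="language-plaintext highlighter-rouge">predict_fn</code> itself is unchanged; only the way we read its output dictionary differs. Below is a minimal sketch of this access pattern with a stubbed <code class="language-plaintext highlighter-rouge">predict_fn</code> standing in for the real <code class="language-plaintext highlighter-rouge">tf.contrib.predictor</code> callable — the stub and its output values are hypothetical and only mimic the dictionary structure produced by the modified <code class="language-plaintext highlighter-rouge">model_fn</code> above.</p>

```python
import numpy as np


def predict_fn(inputs):
    # Stub standing in for tf.contrib.predictor.from_saved_model(export_dir).
    # A real predictor would run the SavedModel graph; here we fabricate
    # outputs with the same dictionary structure as the modified model_fn:
    # one entry per named output node.
    batch = np.asarray(inputs['number'], dtype=np.float32)  # shape (1, 1)
    return {
        'hidden': np.tile(batch, (1, 4)),  # placeholder for the 4-unit hidden layer
        'predictions': batch * 2.0,        # placeholder for the final dense layer
    }


# Same access pattern as in the modified serve.py: index by output node name.
pred = predict_fn({'number': [[3.0]]})
hidden = pred['hidden']            # shape (1, 4)
predictions = pred['predictions']  # shape (1, 1)
```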
<h3 id="conclusions">Conclusions</h3>
<p>Inference using a <code class="language-plaintext highlighter-rouge">SavedModel</code> is a better inference protocol than <code class="language-plaintext highlighter-rouge">Estimator</code>-based <code class="language-plaintext highlighter-rouge">predict</code>.</p>
<h3 id="final-remarks">Final Remarks</h3>
<p>The <code class="language-plaintext highlighter-rouge">predictor</code> class comes from <code class="language-plaintext highlighter-rouge">tf.contrib</code>. Its internal implementation consists of input nodes, output nodes, and a session used to obtain the values from inference. However, in TensorFlow 2.0, there will be no <code class="language-plaintext highlighter-rouge">tf.contrib</code>, and the TensorFlow <code class="language-plaintext highlighter-rouge">session</code> will no longer be exposed to users. Doing inference using <code class="language-plaintext highlighter-rouge">Estimator</code>’s <code class="language-plaintext highlighter-rouge">predict</code> method with a living graph will still work, but we will no longer be able to use <code class="language-plaintext highlighter-rouge">predictor</code> for the <code class="language-plaintext highlighter-rouge">SavedModel</code>. Fortunately, TensorFlow 2.0 has an <a href="https://www.tensorflow.org/beta/guide/saved_model">official tutorial</a> on this which is simple and straightforward. I will probably elaborate on this when TensorFlow 2.0 officially comes out, if it proves necessary.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://github.com/leimao/TensorFlow_Estimator_Basics">TensorFlow Estimator Basics - GitHub</a></li>
<li><a href="https://guillaumegenthial.github.io/serving-tensorflow-estimator.html">Guillaume Genthial’s Save and Restore a tf.estimator for inference</a></li>
<li><a href="https://hanxiao.github.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/">TensorFlow Estimator Serving as Web Service</a></li>
<li><a href="https://www.tensorflow.org/guide/saved_model#build_and_load_a_savedmodel">Build and load a SavedModel in TensorFlow 1.x</a></li>
<li><a href="https://www.tensorflow.org/beta/guide/saved_model">Using the SavedModel format in TensorFlow 2.x</a></li>
</ul>
<p><a href="https://leimao.github.io/blog/TensorFlow-Estimator-SavedModel/">TensorFlow Inference for Estimator</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on August 29, 2019.</p><![CDATA[Gnome Terminator]]><h3 id="introduction">Introduction</h3>
<p>Gnome Terminator is a local terminal emulator that allows multiple terminal sub-windows inside one large window. Because it reduces the need to open new terminal windows, Gnome Terminator can increase our working efficiency on Linux.</p>
<p><br /></p>
<p>In this blog post, I am going to introduce how to set up a customized default layout for Gnome Terminator.</p>
<h3 id="installation">Installation</h3>
<p>We can install Gnome Terminator with a single command on Ubuntu.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>apt <span class="nb">install </span>terminator
</code></pre></div></div>
<h3 id="layout-setup">Layout Setup</h3>
<p>I will use my favorite layout as an example to show how to set up a customized default layout for Gnome Terminator, so that every time you open Gnome Terminator, the layout will always be the one you feel most comfortable with.</p>
<p><br /></p>
<p>My favorite layout is displayed below. I like to have four terminal windows. One of them runs <code class="language-plaintext highlighter-rouge">htop</code> and another runs <code class="language-plaintext highlighter-rouge">nvidia-smi dmon</code>. In this way, I can monitor the usage of my CPU, GPU, memory, etc.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-27-Gnome-Terminator/terminator.png" style="width: 100%; height: 100%" />
<figcaption>Gnome Terminator Layout</figcaption>
</figure>
</div>
<p>One of the benefits of using Gnome Terminator is that you don’t have to memorize and use shortcuts to set up the layout. There are generally two ways to set up the layout: clicking the mouse and editing the configuration file.</p>
<h4 id="mouse-clicking">Mouse Clicking</h4>
<p>In the right-click menu, we fine-tune the layout by clicking <code class="language-plaintext highlighter-rouge">Split Horizontally</code> and <code class="language-plaintext highlighter-rouge">Split Vertically</code>. Once the layout is finalized, we click <code class="language-plaintext highlighter-rouge">Preferences</code>.</p>
<p><br /></p>
<p>It is recommended to maximize the window so that you will not feel the letters are too small in the split windows.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-27-Gnome-Terminator/terminator_1.png" style="width: 100%; height: 100%" />
<figcaption>Maximize Window</figcaption>
</figure>
</div>
<p>Create profiles for the terminals running <code class="language-plaintext highlighter-rouge">htop</code> and <code class="language-plaintext highlighter-rouge">nvidia-smi dmon</code> under the <code class="language-plaintext highlighter-rouge">Profiles</code> tab, respectively. To keep each terminal usable even after stopping <code class="language-plaintext highlighter-rouge">htop</code> or <code class="language-plaintext highlighter-rouge">nvidia-smi dmon</code>, we append <code class="language-plaintext highlighter-rouge">; bash</code> after <code class="language-plaintext highlighter-rouge">htop</code> and <code class="language-plaintext highlighter-rouge">nvidia-smi dmon</code>. Also remember to choose <code class="language-plaintext highlighter-rouge">Hold the terminal open</code> for when the command exits.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-27-Gnome-Terminator/terminator_2.png" style="width: 100%; height: 100%" />
<figcaption>htop Terminal Profiles</figcaption>
</figure>
</div>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-27-Gnome-Terminator/terminator_3.png" style="width: 100%; height: 100%" />
<figcaption>nvidia-smi dmon Terminal Profiles</figcaption>
</figure>
</div>
<p>Create layouts under the <code class="language-plaintext highlighter-rouge">Layouts</code> tab. The configuration of the layout we just fine-tuned will be imported automatically. We double-click the name of the new layout and change it to <code class="language-plaintext highlighter-rouge">default</code>. We also change the terminal profiles to the <code class="language-plaintext highlighter-rouge">htop</code> and <code class="language-plaintext highlighter-rouge">nvidia-smi dmon</code> profiles we have just created. Do not forget to click <code class="language-plaintext highlighter-rouge">Save</code>. The new <code class="language-plaintext highlighter-rouge">default</code> layout will conflict with the old <code class="language-plaintext highlighter-rouge">default</code> layout. After closing Gnome Terminator, we restart the program. The old <code class="language-plaintext highlighter-rouge">default</code> layout will be abandoned, and the new <code class="language-plaintext highlighter-rouge">default</code> layout becomes the default one.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-27-Gnome-Terminator/terminator_4.png" style="width: 100%; height: 100%" />
<figcaption>Add Layouts</figcaption>
</figure>
</div>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-27-Gnome-Terminator/terminator_5.png" style="width: 100%; height: 100%" />
<figcaption>Set Terminal Profiles in Layouts</figcaption>
</figure>
</div>
<h4 id="importing-configuration-file">Importing Configuration File</h4>
<p>All the configurations are stored in the <code class="language-plaintext highlighter-rouge">~/.config/terminator/config</code> file, so it is equivalent to configure the layouts by modifying the configuration file directly. The configuration file corresponding to the settings we made in the mouse-clicking section is provided below.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> ~/.config/terminator/config
<span class="o">[</span>global_config]
window_state <span class="o">=</span> maximise
<span class="o">[</span>keybindings]
<span class="o">[</span>layouts]
<span class="o">[[</span>default]]
<span class="o">[[[</span>child0]]]
fullscreen <span class="o">=</span> False
last_active_term <span class="o">=</span> 76191a62-6770-458d-a39f-4749d681e9f5
last_active_window <span class="o">=</span> False
maximised <span class="o">=</span> True
order <span class="o">=</span> 0
parent <span class="o">=</span> <span class="s2">""</span>
position <span class="o">=</span> 67:27
size <span class="o">=</span> 1853, 1025
title <span class="o">=</span> leimao@leimao-evolvx: ~
<span class="nb">type</span> <span class="o">=</span> Window
<span class="o">[[[</span>child1]]]
order <span class="o">=</span> 0
parent <span class="o">=</span> child0
position <span class="o">=</span> 924
ratio <span class="o">=</span> 0.5
<span class="nb">type</span> <span class="o">=</span> HPaned
<span class="o">[[[</span>child2]]]
order <span class="o">=</span> 0
parent <span class="o">=</span> child1
position <span class="o">=</span> 512
ratio <span class="o">=</span> 0.501960784314
<span class="nb">type</span> <span class="o">=</span> VPaned
<span class="o">[[[</span>child5]]]
order <span class="o">=</span> 1
parent <span class="o">=</span> child1
position <span class="o">=</span> 512
ratio <span class="o">=</span> 0.501960784314
<span class="nb">type</span> <span class="o">=</span> VPaned
<span class="o">[[[</span>terminal3]]]
order <span class="o">=</span> 0
parent <span class="o">=</span> child2
profile <span class="o">=</span> default
<span class="nb">type</span> <span class="o">=</span> Terminal
uuid <span class="o">=</span> 76191a62-6770-458d-a39f-4749d681e9f5
<span class="o">[[[</span>terminal4]]]
order <span class="o">=</span> 1
parent <span class="o">=</span> child2
profile <span class="o">=</span> default
<span class="nb">type</span> <span class="o">=</span> Terminal
uuid <span class="o">=</span> d538eae0-7c50-4eae-9b5f-469406d58aab
<span class="o">[[[</span>terminal6]]]
order <span class="o">=</span> 0
parent <span class="o">=</span> child5
profile <span class="o">=</span> htop
<span class="nb">type</span> <span class="o">=</span> Terminal
uuid <span class="o">=</span> 5ca25170-d241-4993-a018-c004abbdd15b
<span class="o">[[[</span>terminal7]]]
order <span class="o">=</span> 1
parent <span class="o">=</span> child5
profile <span class="o">=</span> nvidia-smi
<span class="nb">type</span> <span class="o">=</span> Terminal
uuid <span class="o">=</span> 1e48a43f-93a4-46c0-b29e-4094c045673a
<span class="o">[[</span>New Layout]]
<span class="o">[[[</span>child0]]]
fullscreen <span class="o">=</span> False
last_active_term <span class="o">=</span> 76191a62-6770-458d-a39f-4749d681e9f5
last_active_window <span class="o">=</span> True
maximised <span class="o">=</span> True
order <span class="o">=</span> 0
parent <span class="o">=</span> <span class="s2">""</span>
position <span class="o">=</span> 67:27
size <span class="o">=</span> 1853, 1025
title <span class="o">=</span> leimao@leimao-evolvx: ~
<span class="nb">type</span> <span class="o">=</span> Window
<span class="o">[[[</span>child1]]]
order <span class="o">=</span> 0
parent <span class="o">=</span> child0
position <span class="o">=</span> 924
ratio <span class="o">=</span> 0.5
<span class="nb">type</span> <span class="o">=</span> HPaned
<span class="o">[[[</span>child2]]]
order <span class="o">=</span> 0
parent <span class="o">=</span> child1
position <span class="o">=</span> 512
ratio <span class="o">=</span> 0.501960784314
<span class="nb">type</span> <span class="o">=</span> VPaned
<span class="o">[[[</span>child5]]]
order <span class="o">=</span> 1
parent <span class="o">=</span> child1
position <span class="o">=</span> 512
ratio <span class="o">=</span> 0.501960784314
<span class="nb">type</span> <span class="o">=</span> VPaned
<span class="o">[[[</span>terminal3]]]
order <span class="o">=</span> 0
parent <span class="o">=</span> child2
profile <span class="o">=</span> default
<span class="nb">type</span> <span class="o">=</span> Terminal
uuid <span class="o">=</span> 76191a62-6770-458d-a39f-4749d681e9f5
<span class="o">[[[</span>terminal4]]]
order <span class="o">=</span> 1
parent <span class="o">=</span> child2
profile <span class="o">=</span> default
<span class="nb">type</span> <span class="o">=</span> Terminal
uuid <span class="o">=</span> d538eae0-7c50-4eae-9b5f-469406d58aab
<span class="o">[[[</span>terminal6]]]
order <span class="o">=</span> 0
parent <span class="o">=</span> child5
profile <span class="o">=</span> htop
<span class="nb">type</span> <span class="o">=</span> Terminal
uuid <span class="o">=</span> 5ca25170-d241-4993-a018-c004abbdd15b
<span class="o">[[[</span>terminal7]]]
order <span class="o">=</span> 1
parent <span class="o">=</span> child5
profile <span class="o">=</span> nvidia-smi
<span class="nb">type</span> <span class="o">=</span> Terminal
uuid <span class="o">=</span> 1e48a43f-93a4-46c0-b29e-4094c045673a
<span class="o">[</span>plugins]
<span class="o">[</span>profiles]
<span class="o">[[</span>default]]
cursor_color <span class="o">=</span> <span class="s2">"#aaaaaa"</span>
<span class="o">[[</span>htop]]
cursor_color <span class="o">=</span> <span class="s2">"#aaaaaa"</span>
custom_command <span class="o">=</span> htop<span class="p">;</span> bash
exit_action <span class="o">=</span> hold
use_custom_command <span class="o">=</span> True
<span class="o">[[</span>nvidia-smi]]
cursor_color <span class="o">=</span> <span class="s2">"#aaaaaa"</span>
custom_command <span class="o">=</span> nvidia-smi dmon<span class="p">;</span> bash
exit_action <span class="o">=</span> hold
use_custom_command <span class="o">=</span> True
</code></pre></div></div>
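<p>Because the entire setup lives in a single plain-text file, it can be backed up and restored with ordinary shell commands, which is handy when migrating to a new machine. Below is a minimal sketch; it uses a scratch directory instead of the real <code class="language-plaintext highlighter-rouge">~/.config/terminator/config</code> so it is safe to run anywhere, and the paths are arbitrary.</p>

```shell
#!/bin/sh
# Scratch stand-in for ~/.config/terminator; swap in the real path when using this.
CONFIG_DIR="${TMPDIR:-/tmp}/terminator-demo/.config/terminator"
BACKUP="${TMPDIR:-/tmp}/terminator-demo/config.backup"

mkdir -p "$CONFIG_DIR"
# Minimal config standing in for the full layout file shown above.
printf '[global_config]\n  window_state = maximise\n' > "$CONFIG_DIR/config"

# Back up the current layout configuration.
cp "$CONFIG_DIR/config" "$BACKUP"

# Restore it later, e.g. on another machine.
cp "$BACKUP" "$CONFIG_DIR/config"
```

<p>After restoring the real file to <code class="language-plaintext highlighter-rouge">~/.config/terminator/config</code>, restarting Gnome Terminator should pick up the saved <code class="language-plaintext highlighter-rouge">default</code> layout automatically.</p>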
<h3 id="notes">Notes</h3>
<ul>
<li>Gnome Terminator is a Gnome-based application. Therefore, unlike Tmux, it cannot be used in a non-Gnome environment, such as an SSH terminal.</li>
</ul>
<p><a href="https://leimao.github.io/blog/Gnome-Terminator/">Gnome Terminator</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on August 27, 2019.</p><![CDATA[Word2Vec Models Revisited]]><h3 id="introduction">Introduction</h3>
<p>The Word2Vec models proposed by Mikolov et al. in 2013, including the Continuous Bag-of-Words (CBOW) model and the Continuous Skip-Gram (Skip-Gram) model, are some of the earliest natural language processing models that could learn good vector representations for words. Although the word vector representations learned from these two models are no longer directly used in the state-of-the-art natural language processing models, such as Transformer and BERT, the basic ideas of the Word2Vec models are still affecting a lot of the latest natural language processing models.</p>
<p><br /></p>
<p>In this blog post, I will go through what the CBOW model and the Skip-Gram model are, how they are trained, and how they have influenced the state-of-the-art models. Some of the math is interesting, so it is well worth revisiting.</p>
<h3 id="word2vec-models">Word2Vec Models</h3>
<p>We will go over both the CBOW model and the Skip-Gram model, with emphasis on the Skip-Gram model. Probably due to computational cost constraints at that time, and unlike a feed-forward neural network with at least one hidden layer, neither the CBOW model nor the Skip-Gram model has any hidden layers.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/articles/2019-08-23-Word2Vec-Classic/word2vec.png" style="width: 95%" />
<figcaption>Word2Vec Models: CBOW and Skip-Gram</figcaption>
</figure>
</div>
<h4 id="trainable-embedding-matrix">Trainable Embedding Matrix</h4>
<p>We first create a trainable embedding matrix $E \in \mathbb{R}^{n \times d}$, where $n$ is the number of words in the corpus and $d$ is the size of embedding vector for each word. Each row is an embedding vector for one unique word. During training, because we allow the values in the embedding vector to be trained, the back propagation will tune the values in the embedding matrix $E$.</p>
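<p>Concretely, looking up the embeddings of a batch of words is just row indexing into $E$. A minimal numpy sketch, where the vocabulary size, embedding dimension, and word indices are all made up for illustration:</p>

```python
import numpy as np

n, d = 10, 4                  # vocabulary size, embedding dimension (made up)
rng = np.random.default_rng(0)
E = rng.normal(size=(n, d))   # trainable embedding matrix: one row per unique word

word_ids = np.array([3, 7])   # e.g. the indices of two words in the corpus
vectors = E[word_ids]         # embedding lookup is plain row indexing
assert vectors.shape == (2, d)
```

<p>During training, gradients flow back into exactly the rows of $E$ that were indexed, which is how back propagation tunes the embedding matrix.</p>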
<h4 id="cbow-model">CBOW Model</h4>
<p>The CBOW model tries to predict a word given its past words and future words. For example, suppose we have four words, two past words “I” and “like”, and two future words “computer” and “games”; we would like to predict the word in the middle. In this case, the word is likely to be “playing”. It should be noted that the order of the input words does not matter in the model. You could imagine that, in this case, the model takes four words as inputs and generates only one output. The embeddings of the four words are projected to four vectors using a shared weight matrix (and possibly a shared bias term); the four vectors are averaged, and the softmax is computed for the predicted word distribution.</p>
<p><br /></p>
<p>More formally, we have a weight matrix $W \in \mathbb{R}^{n \times d}$, and a bias term $b \in \mathbb{R}^{n}$. We have the embeddings for the four words $v_{t-2}$, $v_{t-1}$, $v_{t+1}$, $v_{t+2}$ from the embedding matrix $E$. Each vector $v \in \mathbb{R}^{d}$. Note that all the vectors in the article are column vectors. The logit vector $o_{t} \in \mathbb{R}^{n}$ used for computing softmax is as follows.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
o_{t} &= \frac{1}{4} \big[ (W v_{t-2} + b) + (W v_{t-1} + b) + (W v_{t+1} + b) + (W v_{t+2} + b) \big] \\
&= \frac{1}{4} W (v_{t-2} + v_{t-1} + v_{t+1} + v_{t+2}) + b
\end{aligned} %]]></script>
<p>If we set $v_{t}^{\prime}$ to the average of the input word embeddings, it is equivalent to converting the CBOW model into a model that takes only one input and generates one output.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
o_{t} &= \frac{1}{4} W (v_{t-2} + v_{t-1} + v_{t+1} + v_{t+2}) + b \\
&= W v_{t}^{\prime} + b
\end{aligned} %]]></script>
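<p>This equivalence is just the linearity of the affine map $v \mapsto Wv + b$: averaging the four projections gives the same logits as projecting the average embedding. We can verify it numerically (all shapes and values below are made up for illustration):</p>

```python
import numpy as np

n, d = 6, 3  # toy vocabulary size and embedding dimension
rng = np.random.default_rng(1)
W = rng.normal(size=(n, d))
b = rng.normal(size=(n,))
v = [rng.normal(size=(d,)) for _ in range(4)]  # v_{t-2}, v_{t-1}, v_{t+1}, v_{t+2}

# Average of the four projections (W v_i + b)...
o_avg_proj = sum(W @ vi + b for vi in v) / 4

# ...equals projecting the average embedding v'_t once.
v_prime = sum(v) / 4
o_proj_avg = W @ v_prime + b

assert np.allclose(o_avg_proj, o_proj_avg)
```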
<p>During training, the model tries to maximize the probability of predicting the current word based on the surrounding context.</p>
<p><br /></p>
<p>In summary, to design the architecture of the CBOW model, as we have just discussed above, you can design a model that takes the average of the input embeddings as the only input and generates one output, or you can take multiple input embeddings as inputs, average their projected vectors inside the model, and generate one output.</p>
<h4 id="skip-gram-model">Skip-Gram Model</h4>
<p>The Skip-Gram model, opposite to the CBOW model, tries to predict the past words and future words given the current word. For example, suppose we have the word “playing”; we would like to predict the four words around it. In this case, the surrounding words are likely to be “I”, “like”, “computer”, and “games”. It should be noted that the order of the output words does not matter in the model.</p>
<p><br /></p>
<p>The diagram of the Skip-Gram model looks daunting. It looks as if the model takes one word as input and generates four outputs, but this is misleading. In practice, the Skip-Gram model only takes one input and generates one output. Given the example “I like playing computer games”, here is how we prepare the training data. We would have four input-label tuples for this example: (“playing”, “I”), (“playing”, “like”), (“playing”, “computer”), and (“playing”, “games”). These four examples are fed in the same batch to the neural network for training. The embedding of the input word is projected to one vector using a weight matrix (and possibly a bias term), and the softmax over the projected vector gives the predicted word distribution.</p>
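<p>The training-pair preparation just described is easy to sketch in plain Python. The helper below generates (center, context) pairs with a context window of two words on each side, matching the example sentence:</p>

```python
def skip_gram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for the Skip-Gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every word within `window` positions of the center is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "I like playing computer games".split()
# Pairs whose center word is "playing":
playing = [p for p in skip_gram_pairs(sentence) if p[0] == "playing"]
# → [('playing', 'I'), ('playing', 'like'),
#    ('playing', 'computer'), ('playing', 'games')]
```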
<p><br /></p>
<p>More formally, we have a weight matrix $W \in \mathbb{R}^{n \times d}$, and a bias term $b \in \mathbb{R}^{n}$. We have the embedding for the current word $v_{t}$ from the embedding matrix $E$. The logit vectors $o_{t-2}$, $o_{t-1}$, $o_{t+1}$, $o_{t+2}$ used for computing softmax are as follows. Each vector $o \in \mathbb{R}^{n}$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
o_{t-2} &= W v_{t} + b \\
o_{t-1} &= W v_{t} + b \\
o_{t+1} &= W v_{t} + b \\
o_{t+2} &= W v_{t} + b \\
\end{aligned} %]]></script>
<p>During training, the model tries to maximize the probability of predicting one of the surrounding words based on the current word.</p>
<p><br /></p>
<p>The values of $o_{t-2}$, $o_{t-1}$, $o_{t+1}$, $o_{t+2}$ in the forward propagation are exactly the same because the four training examples have the exact same input. But the labels for the four training examples are different. Mathematically, this is equivalent to having “playing” as input and using a non-one-hot probability vector, where the probabilities of “I”, “like”, “computer”, and “games” are each 0.25, as the label for softmax. The proof is given in the appendix chapter. So in this case, instead of feeding a batch of size 4, we only need to feed a batch of size 1 to the model.</p>
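<p>This equivalence (proved in the appendix) is easy to verify numerically. The sketch below uses made-up logits and word ids; since the four examples share the same input word, they share the same predicted distribution.</p>

```python
import numpy as np

def softmax(z):
    z = z - z.max()   # for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=8)     # same logits for every example (same input word)
q = softmax(logits)

hard_labels = [0, 3, 5, 6]      # toy ids for the four surrounding words

# Batch of 4 one-hot examples: average cross entropy over the batch.
loss_batch4 = -np.mean([np.log(q[y]) for y in hard_labels])

# Batch of 1 with a soft label vector: probability 0.25 on each surrounding word.
p = np.zeros_like(q)
p[hard_labels] = 0.25
loss_batch1 = -(p * np.log(q)).sum()

assert np.isclose(loss_batch4, loss_batch1)
```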
<p><br /></p>
<p>In summary, to design the architecture of the Skip-Gram model, as we have just discussed above, you can design a model which takes the input word embedding as the input and generates one output. During training, you can feed multiple examples with the same input word but different labels, or you can feed one example with the input word and a probability vector representing the probabilities of the surrounding labels.</p>
<h3 id="optimization-methods">Optimization Methods</h3>
<p>Because the word corpus of a language model is very large, computing the full softmax is extremely expensive. So the original authors used the following optimization methods instead.</p>
<h4 id="hierarchical-softmax">Hierarchical Softmax</h4>
<p>Instead of computing the full softmax, we could incorporate some prior knowledge about the hierarchy of the classes, build a tree structure over the labels in the computation graph, and reduce the computation cost by selectively choosing the path for optimization. I have a detailed tutorial on this topic. If you are interested, please read my article <a href="https://leimao.github.io/article/Hierarchical-Softmax/">“Hierarchical Softmax”</a>.</p>
<h4 id="noise-contrastive-estimation">Noise Contrastive Estimation</h4>
<p>Essentially, we introduce a noise distribution and convert the multi-class classification problem into a binary classification problem: distinguishing whether a sampled word comes from the original dataset distribution or the noise distribution. I have a detailed tutorial with all the mathematical derivations on this topic. If you are interested, please read my article <a href="https://leimao.github.io/article/Noise-Contrastive-Estimation/">“Noise Contrastive Estimation”</a>.</p>
<h4 id="negative-sampling">Negative Sampling</h4>
<p>The original authors also proposed Negative Sampling to “approximate” Noise Contrastive Estimation so that the computation is even faster. In my opinion, the complexity of Negative Sampling is asymptotically the same as that of Noise Contrastive Estimation, and there should be no need to use it nowadays. What’s more, mathematically Negative Sampling deviates from Noise Contrastive Estimation. It no longer performs maximum likelihood estimation, while Noise Contrastive Estimation still does when the noise-to-data ratio is high. This probably restricts Negative Sampling to being useful only for embedding training, but not for other machine learning problems. Here I will give a quick explanation of Negative Sampling using the mathematics of Noise Contrastive Estimation. To fully understand it, I suggest readers go through my <a href="https://leimao.github.io/article/Noise-Contrastive-Estimation/">“Noise Contrastive Estimation”</a> first.</p>
<p><br /></p>
<p>Let’s see what the optimization objective function is for Negative Sampling. In the paper, the authors defined it as</p>
<script type="math/tex; mode=display">J = \log \sigma(v_{w_O}^{\prime \top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\big[\log \sigma(-v_{w_i}^{\prime \top} v_{w_I}) \big]</script>
<p>In my opinion, this expression is mathematically incorrect; I believe what they meant is</p>
<script type="math/tex; mode=display">J = \log \sigma(v_{w_O}^{\prime \top} v_{w_I}) + k \mathbb{E}_{w_i \sim P_n(w)}\big[\log \sigma(-v_{w_i}^{\prime \top} v_{w_I}) \big]</script>
<p>To estimate the expected value, we sample $k$ words from the noise distribution.</p>
<script type="math/tex; mode=display">\mathbb{E}_{w_i \sim P_n(w)}\big[ \log \sigma(-v_{w_i}^{\prime \top} v_{w_I}) \big] \approx \frac{1}{k} \sum_{i=1}^{k} \log \sigma(-v_{w_i}^{\prime \top} v_{w_I})</script>
<p>So in practice, we have the optimization function</p>
<script type="math/tex; mode=display">\widehat{J} = \log \sigma(v_{w_O}^{\prime \top} v_{w_I}) + \sum_{i=1}^{k} \log \sigma(-v_{w_i}^{\prime \top} v_{w_I})</script>
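<p>As a toy sketch, the practical Negative Sampling objective, one true pair plus $k$ sampled noise pairs, can be computed with random embeddings standing in for trained ones; all sizes and names below are illustrative.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8                                 # embedding dimension (toy value)
v_input = rng.normal(size=d)          # v_{w_I}: input word embedding
v_output = rng.normal(size=d)         # v'_{w_O}: output (label) word embedding
v_noise = rng.normal(size=(5, d))     # v'_{w_i}: k = 5 noise word embeddings

# Maximizing this pushes the true pair's score up and noise pairs' scores down.
objective = (np.log(sigmoid(v_output @ v_input))
             + np.sum(np.log(sigmoid(-v_noise @ v_input))))
```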
<p>Taking from my article <a href="https://leimao.github.io/article/Noise-Contrastive-Estimation/">“Noise Contrastive Estimation”</a>, the optimization objective function for Noise Contrastive Estimation is</p>
<script type="math/tex; mode=display">J^{h}(\theta) = \mathbb{E}_{w \sim P_d^h(w)}\big[\log \sigma(\Delta s_{\theta^0}(w,h)) \big] + k \mathbb{E}_{w \sim P_n(w)}\big[\log (1 - \sigma(\Delta s_{\theta^0}(w,h))) \big]</script>
<p>where $\Delta s_{\theta^0}(w,h) = s_{\theta^0}(w,h) - \log kP_n(w)$.</p>
<p><br /></p>
<p>To estimate $J^{h}(\theta)$, we could use the following objective function.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
{\widehat{J^{h}}}(\theta) &= \frac{1}{m} \sum_{i=1}^{m} \log \sigma(\Delta s_{\theta^0}(w_i,h)) + \frac{k}{n} \sum_{j=1}^{n} \log (1 - \sigma(\Delta s_{\theta^0}(w_j,h)))
\end{aligned} %]]></script>
<p>If we use $m=1$ and $n=k$ (though we do not have to),</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
{\widehat{J^{h}}}(\theta) &= \log \sigma(\Delta s_{\theta^0}(w,h)) + \sum_{j=1}^{k} \log (1 - \sigma(\Delta s_{\theta^0}(w_j,h)))
\end{aligned} %]]></script>
<p>We will rewrite it a little bit so that it looks closer to the mathematical expression in the original paper.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
{\widehat{J^{h}}}(\theta) &= \log \sigma(\Delta s_{\theta^0}(w,h)) + \sum_{j=1}^{k} \log (1 - \sigma(\Delta s_{\theta^0}(w_j,h))) \\
&= \log \sigma(\Delta s_{\theta^0}(w,h)) + \sum_{i=1}^{k} \log (1 - \sigma(\Delta s_{\theta^0}(w_i,h))) \\
&= \log \sigma(\Delta s_{\theta^0}(w,h)) + \sum_{i=1}^{k} \log \sigma(- \Delta s_{\theta^0}(w_i,h))
\end{aligned} %]]></script>
<p>where $w$ is the labeled word for the input, $w_i$ are the words sampled from the noise distribution, and $\Delta s_{\theta^0}(w,h) = s_{\theta^0}(w,h) - \log kP_n(w)$.</p>
<p><br /></p>
<p>Now let us use Noise Contrastive Estimation for the Word2Vec models we have just described, with the score function</p>
<script type="math/tex; mode=display">s_{\theta^0}(w,h) = v_{w}^{\prime \top} v_{w_I}</script>
<p>We further have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
{\widehat{J^{h}}}(\theta) &= \log \sigma(\Delta s_{\theta^0}(w,h)) + \sum_{i=1}^{k} \log \sigma(- \Delta s_{\theta^0}(w_i,h)) \\
&= \log \sigma(v_{w}^{\prime \top} v_{w_I} - \log kP_n(w)) + \sum_{i=1}^{k} \log \sigma(- v_{w_i}^{\prime \top} v_{w_I} + \log kP_n(w_i)) \\
\end{aligned} %]]></script>
<p>So we immediately find that Negative Sampling is nothing but setting $kP_n(w) = 1$ in Noise Contrastive Estimation!</p>
<p><br /></p>
<p>The original authors said in the paper, “The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.”</p>
<p><br /></p>
<p>Since computing $kP_n(w)$ usually takes only $O(1)$ constant time, and since, as the authors admitted, Negative Sampling no longer approximates maximum likelihood estimation, in my opinion it probably should not exist at all.</p>
<p><br /></p>
<p>In practice, people have been using the Noise Contrastive Estimation (NCE) loss to train Word2Vec models. This can also be seen in the <a href="https://www.tensorflow.org/tutorials/representation/word2vec">TensorFlow official Word2Vec tutorial</a>.</p>
<p><br /></p>
<p>But since Negative Sampling no longer performs maximum likelihood estimation, how could it still successfully train the word embeddings in the original paper in the first place? I am going to show that the gradient of its optimization function is bounded such that it does not deviate significantly from the gradient of maximum likelihood estimation. The sign of its gradient is also always the same as the sign of the gradient of maximum likelihood estimation.</p>
<p><br /></p>
<p>Still taking from my article <a href="https://leimao.github.io/article/Noise-Contrastive-Estimation/">“Noise Contrastive Estimation”</a>, the gradient for the maximum likelihood estimation is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] &= \mathbb{E}_{w \sim P_d^h(w)}\big[ \frac{\partial}{\partial \theta} s_{\theta}(w,h) \big] - \mathbb{E}_{w \sim P_{\theta}^h(w)}\big[ \frac{\partial}{\partial \theta} s_{\theta}(w,h) \big] \\
&= \sum\limits_{w} P_{d}^{h}(w) \frac{\partial}{\partial \theta} s_{\theta}(w,h) - \sum\limits_{w} P_{\theta}^{h}(w) \frac{\partial}{\partial \theta} s_{\theta}(w,h) \\
&= \sum\limits_{w} \big( P_{d}^{h}(w) - P_{\theta}^{h}(w) \big) \frac{\partial}{\partial \theta} s_{\theta}(w,h) \\
&= \sum\limits_{w} \big( P_{d}^{h}(w) - P_{\theta}^{h}(w) \big) \frac{\partial}{\partial \theta} \log u_{\theta}(w,h)
\end{aligned} %]]></script>
<p>and the gradient for Noise Contrastive Estimation is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\partial}{\partial \theta} J^{h}(\theta) &= \mathbb{E}_{w \sim P_d^h(w)}\big[ \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \big] - k \mathbb{E}_{w \sim P_n(w)}\big[ \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \big] \\
&= \sum \limits_{w} P_d^h(w) \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) - k\sum \limits_{w} P_n(w) \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
&= \sum \limits_{w} \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
\end{aligned} %]]></script>
<p>Because in Negative Sampling $kP_n(w) = 1$, the gradient for Negative Sampling is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\partial}{\partial \theta} J^{h}(\theta) &= \sum \limits_{w} \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
&= \sum \limits_{w} \frac{1}{P_{\theta}^h(w) + 1} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
\end{aligned} %]]></script>
<p>Because $0 \leq P_{\theta}^h(w) \leq 1$, and</p>
<script type="math/tex; mode=display">\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] = \sum\limits_{w} \big( P_{d}^{h}(w) - P_{\theta}^{h}(w) \big) \frac{\partial}{\partial \theta} \log u_{\theta}(w,h)</script>
<p>we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\partial}{\partial \theta} J^{h}(\theta) &= \sum \limits_{w} \frac{1}{P_{\theta}^h(w) + 1} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
&\leq \sum \limits_{w} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
&\leq \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big]
\end{aligned} %]]></script>
<p>Similarly, we also have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\partial}{\partial \theta} J^{h}(\theta) &= \sum \limits_{w} \frac{1}{P_{\theta}^h(w) + 1} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
&\geq \sum \limits_{w} \frac{1}{2} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
&\geq \frac{1}{2} \sum \limits_{w} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
&\geq \frac{1}{2} \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big]
\end{aligned} %]]></script>
<p>Therefore, we conclude that the gradient of Negative Sampling is bounded by the gradient of maximum likelihood estimation, and the sign of the gradient of Negative Sampling is also always the same as the sign of the gradient of maximum likelihood estimation.</p>
<script type="math/tex; mode=display">\frac{1}{2} \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] \leq \frac{\partial}{\partial \theta} J^{h}(\theta) \leq \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big]</script>
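<p>The per-word scaling factor $\frac{1}{P_{\theta}^h(w) + 1}$ that Negative Sampling applies to the maximum likelihood gradient terms can be checked numerically to always lie between $\frac{1}{2}$ and $1$; a quick sketch:</p>

```python
import numpy as np

# Scaling factor 1 / (P + 1) applied by Negative Sampling to each gradient
# term, for model probabilities P in [0, 1].
P = np.linspace(0.0, 1.0, 11)
factor = 1.0 / (P + 1.0)

assert np.all(factor <= 1.0)   # never larger than the MLE gradient term
assert np.all(factor >= 0.5)   # never smaller than half of it
```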
<h3 id="insights">Insights</h3>
<p>If you look at the training of BERT, which also learns vector representations of tokens, the fundamental idea is very similar to that of the CBOW model. Essentially, BERT masks several words in a sentence and asks the model to predict the masked words during training. The Skip-Gram model has also influenced the Skip-Thought model, which learns vector representations of sentences.</p>
<p><br /></p>
<p>I may talk about these two models in depth in the future.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/abs/1301.3781">Efficient Estimation of Word Representations in Vector Space</a></li>
<li><a href="https://arxiv.org/abs/1310.4546">Distributed Representations of Words and Phrases and their Compositionality</a></li>
</ul>
<h3 id="appendix">Appendix</h3>
<h4 id="proof-for-the-equivalence-of-different-skip-gram-training-modes">Proof for the Equivalence of Different Skip-Gram Training Modes</h4>
<p>The classification model is usually trained using maximum likelihood estimation. We use $q_{\theta}(x_i)$ to denote the predicted likelihood $q(x_i|\theta)$ from the model for sample $x_i$ from the dataset. Concretely, we have the following objective function</p>
<script type="math/tex; mode=display">% <![CDATA[
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\begin{aligned}
\argmax_{\theta} \prod_{i=1}^{n} q_{\theta}(y_i | x_i) &= \argmax_{\theta} \sum_{i=1}^{n} \log q_{\theta}(y_i | x_i) \\
&= \argmin_{\theta} - \sum_{i=1}^{n} \log q_{\theta}(y_i | x_i)
\end{aligned} %]]></script>
<p>$H_i(p,q_{\theta})$ is the cross entropy of sample $i$ in the dataset.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
H_{i}(p,q_{\theta}) &= - \sum\limits_{y \in Y} p(y|x_i) \log q_{\theta}(y|x_i)
\end{aligned} %]]></script>
<p>If there is only one label $y_i$ for sample $i$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
H_{i}(p,q_{\theta}) &= - \sum\limits_{y \in Y} p(y|x_i) \log q_{\theta}(y|x_i) \\
&= - \log q_{\theta}(y_i|x_i) \\
\end{aligned} %]]></script>
<p>So in this case, we are minimizing the sum or average of the cross entropies from all the training examples.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\argmax_{\theta} \prod_{i=1}^{n} q_{\theta}(y_i | x_i) &= \argmin_{\theta} - \sum_{i=1}^{n} \log q_{\theta}(y_i | x_i) \\
&= \argmin_{\theta} \sum_{i=1}^{n} H_{i}(p,q_{\theta}) \\
&= \argmin_{\theta} \frac{1}{n} \sum_{i=1}^{n} H_{i}(p,q_{\theta}) \\
\end{aligned} %]]></script>
<p>Given a set of examples with the same input but different labels, $\{(x_t, y_1),(x_t, y_2),\cdots,(x_t, y_m)\}$, the average of the cross entropies from all the training examples would be</p>
<script type="math/tex; mode=display">\frac{1}{m} \sum_{i=1}^{m} H_{i}(p,q_{\theta}) = - \frac{1}{m} \sum_{i=1}^{m} \log q_{\theta}(y_i|x_t)</script>
<p>If we convert the above $m$ examples to one single example with multiple labels with equal probability $\frac{1}{m}$, $\{(x_t, (y_1, y_2,\cdots, y_m))\}$, the cross entropy for this single example would be</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
H(p,q_{\theta}) &= - \sum\limits_{y \in Y} p(y|x_t) \log q_{\theta}(y|x_t) \\
&= - \sum_{i=1}^{m} p(y_i|x_t) \log q_{\theta}(y_i|x_t) \\
&= - \sum_{i=1}^{m} \frac{1}{m} \log q_{\theta}(y_i|x_t) \\
&= - \frac{1}{m} \sum_{i=1}^{m} \log q_{\theta}(y_i|x_t) \\
\end{aligned} %]]></script>
<p>This cross entropy is exactly the same as the average of the cross entropies from all the training examples in the previous case. Therefore, to train the Skip-Gram model, feeding a batch consisting of $\{(x_t, y_1),(x_t, y_2),\cdots,(x_t, y_m)\}$ is equivalent to feeding a single example $\{(x_t, (y_1, y_2,\cdots, y_m))\}$.</p>
<p><a href="https://leimao.github.io/article/Word2Vec-Classic/">Word2Vec Models Revisited</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on August 23, 2019.</p>
<h2 id="ubuntu-fan-throttling-noise-removal">Ubuntu Fan Throttling Noise Removal</h2>
<h3 id="introduction">Introduction</h3>
<p>When I was using my <a href="https://leimao.github.io/blog/PC-Build-Gaming-Deep-Learning/">PC</a> this weekend, I found that there was some weird “DADADA” noise from the PC case from time to time when I was using Ubuntu 18.04. Because it was irregular, the noise was extremely disturbing and I could not bear it. Even weirder, when I switched to Windows 10, there was no such noise at all. It is very hard to determine the source of such a noise, even with the PC case open. But I still managed to identify the problem and remove the noise in the end.</p>
<p><br /></p>
<p>In this blog post, I will document how I found and solved the problem. Hopefully this will be helpful to users who have similar problems.</p>
<h3 id="troubleshooting">Troubleshooting</h3>
<h4 id="os-specificity">OS Specificity</h4>
<p>I could only hear the noise when I was using Ubuntu 18.04. When I switched to Windows 10, the noise was gone. This is weird. Was it because I upgraded some packages in Ubuntu several days ago and that caused this problem?</p>
<h4 id="fan-physical-inspection">Fan Physical Inspection</h4>
<p>The most suspicious parts are the PC fans. I have six fans in total: two fans in the front panel, one fan in the back panel, and three fans on the water cooler. My ears were not good enough to determine which fan or fans were generating the noise. But it looked like the fans were running smoothly while the noise was generated.</p>
<h4 id="bios-fan-settings">BIOS Fan Settings</h4>
<p>Unlike on Windows 10, where I have installed software to control the fans, Ubuntu 18.04 does not have fan controlling software; the fans were controlled entirely by the BIOS settings. So I went to BIOS and turned all the fans to full speed. The PC became extremely noisy with all the fans at full speed. However, I did not hear the weird irregular “DADADA” noise at all.</p>
<h4 id="the-most-likely-problem">The Most Likely Problem</h4>
<p>Given all the phenomena observed, I think it is very likely that the fans were throttling, given that the temperature in California had gone up these days. The temperature was at some threshold where the BIOS decided to increase the fan speed, but then the temperature dropped, either due to temperature fluctuation or to the fan speed-up.</p>
<h4 id="solution">Solution</h4>
<p>I changed the fan controlling mode to “manual” in BIOS and played with the fan controlling parameters, such as the fan speed at each temperature, the fan speed increase interval, etc. Finally, there was no more weird noise from the PC when I was using Ubuntu.</p>
<h4 id="lessons">Lessons</h4>
<p>On Windows 10, the fans were controlled by more sophisticated software from Corsair, and that software has different settings from the BIOS fan controller. This is why I had no problems at all on Windows 10.</p>
<h3 id="final-remarks">Final Remarks</h3>
<p>We Ubuntu users enjoy solving problems entirely on our own. This time it was certainly a good experience. However, there are more times when the problems cannot be perfectly solved. Those problems would often keep bothering us until we found some way to work around them. Working around a problem is not solving it, because you often do not know what the source of the problem was, let alone find the perfect solution for it.</p>
<p><a href="https://leimao.github.io/blog/Fan-Throttling-Noise/">Ubuntu Fan Throttling Noise Removal</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on August 18, 2019.</p>
<h2 id="hierarchical-softmax-post">Hierarchical Softmax</h2>
<h3 id="introduction">Introduction</h3>
<p>Because the word corpus of a language is usually very large, training a language model using the conventional softmax takes an extremely long time. In order to reduce the model training time, people have invented some optimization algorithms, such as <a href="https://leimao.github.io/article/Noise-Contrastive-Estimation/">Noise Contrastive Estimation</a>, that approximate the conventional softmax but run much faster.</p>
<p><br /></p>
<p>In this blog post, instead of talking about a fast approximate optimization algorithm, I will talk about the non-approximate hierarchical softmax, which is a specialized softmax alternative that runs extremely fast with orchestration from human beings.</p>
<h3 id="methods">Methods</h3>
<h4 id="conventional-softmax">Conventional Softmax</h4>
<p>In a model with parameters $\theta$, given some context $h$, to compute the probability of word $w$ given $h$ from the model,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
P_{\theta}^{h}(w) &= \frac{\exp(s_{\theta}(w,h))}{Z_\theta^h} \\
&= \frac{u_{\theta}(w,h)}{Z_\theta^h}
\end{aligned} %]]></script>
<p>where $s_{\theta}(w,h)$ is usually called the score or logit for word $w$ in the model, $u_{\theta}(w,h) = \exp(s_{\theta}(w,h))$, and $Z_\theta^h$ is the normalizer given context $h$; it does not depend on word $w$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
Z_\theta^h &= \sum\limits_{w^{\prime}} u_{\theta}(w^{\prime},h) \\
&= \sum\limits_{w^{\prime}} \exp(s_{\theta}(w^{\prime},h))
\end{aligned} %]]></script>
<p>Because the corpus is usually very large, computing $Z_\theta^h$ will usually take a very long time.</p>
<h4 id="hierarchical-softmax">Hierarchical Softmax</h4>
<p>Hierarchical softmax poses the question in a different way. Suppose we could construct a tree structure for the entire corpus, where each leaf represents a word from the corpus. We traverse the tree to compute the probability of a word. The probability of each word is the product of the probabilities of choosing the branches on the path from the root to the word.</p>
<p><br /></p>
<p>For example, if we have the following tree to represent a corpus of words including Golf, Basketball, Football, Soccer, Piano, and Violin.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/articles/2019-08-17-Hierarchical-Softmax/tree.png" style="width: 80%; height: 80%" />
<figcaption>Corpus Tree Example</figcaption>
</figure>
</div>
<p>The probability of the word Football will be</p>
<script type="math/tex; mode=display">P_{\theta}^{h}(w=\text{Football}) = P_{\theta}^{h}(0\rightarrow1)P_{\theta}^{h}(1\rightarrow3)P_{\theta}^{h}(3\rightarrow\text{Football})</script>
<p>Mathematically, each word $w$ can be reached by an appropriate path from the root of the tree. Let $n(w,j)$ be the $j$-th node on the path from the root to $w$, and let $L(w)$ be the length of this path. With these definitions, $n(w,1) = \text{root}$ and $n(w,L(w)) = w$, and we have</p>
<script type="math/tex; mode=display">\begin{aligned}
P_{\theta}^{h}(w) = \prod_{j=1}^{L(w)-1} P_{\theta_j}^{h}(n(w,j) \rightarrow n(w,j+1))
\end{aligned}</script>
<p>In this way, given a corpus, if we construct the tree appropriately, we could reduce the complexity of computing softmax from $O(N)$ to $O(\log N)$ where $N$ is the size of corpus. For example, if we have a corpus of 10,000 words, we construct a two layer hierarchical softmax, the first layer consists of 100 child nodes, each node in the first layer also consists of 100 child nodes. To compute the conventional softmax, we would need to compute $u_{\theta}(w^{\prime},h)$ 10,000 times. To compute the hierarchical softmax, we just have to compute $u_{\theta_1}(n^{\prime},h)$ 100 times in the first layer, and $u_{\theta_2}(w^{\prime},h)$ 100 times in the second layer, totalling 200 times!</p>
<p><br /></p>
<p>Finally, the question is how to compute each $P_{\theta_j}^{h}(n(w,j) \rightarrow n(w,j+1))$ in practice. We assume the layer before projecting to softmax has size $d$. For each edge $e$ in the tree, we denote the number of child nodes under the edge by $c(e)$. We would have a set of weights $W \in \mathbb{R}^{d \times c(e)}$ and biases $b \in \mathbb{R}^{c(e)}$.</p>
<p><br /></p>
<p>In a model using the conventional softmax, the number of weight parameters in the layer before softmax is $d \times N$. In a model using hierarchical softmax, the number of weight parameters in the layers before softmax is larger. The number of weight parameters in the final layer, taken together, is still $d \times N$. However, we need additional parameters for the weights on the edges before the final layer of the tree.</p>
<p><br /></p>
<p>To compute $P_{\theta_j}^{h}(n(w,j) \rightarrow n(w,j+1))$, we have a vector $v \in \mathbb{R}^{d}$ from the model, the weights $W_{n(w,j) \rightarrow n(w,j+1)} \in \mathbb{R}^{d \times c(n(w,j) \rightarrow n(w,j+1))}$, and the bias $b_{n(w,j) \rightarrow n(w,j+1)} \in \mathbb{R}^{c(n(w,j) \rightarrow n(w,j+1))}$. The resulting logit vector is $vW_{n(w,j) \rightarrow n(w,j+1)} + b_{n(w,j) \rightarrow n(w,j+1)}$. From this logit vector, which consists of the logits for all the child nodes, we could easily compute the probability of choosing node $n(w,j+1)$, which is $P_{\theta_j}^{h}(n(w,j) \rightarrow n(w,j+1))$, using a conventional softmax function.</p>
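<p>The path-product computation just described can be sketched for a toy two-layer tree; the corpus size, tree shape, and names below are my own illustrative choices.</p>

```python
import numpy as np

def softmax(z):
    z = z - z.max()   # for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8                            # size of the layer before the softmax
n_groups, group_size = 10, 10    # a 100-word toy corpus as a two-layer tree

v = rng.normal(size=d)           # hidden vector from the model

# One weight/bias pair per internal node of the tree.
W_root = rng.normal(size=(n_groups, d))
b_root = rng.normal(size=n_groups)
W_group = rng.normal(size=(n_groups, group_size, d))
b_group = rng.normal(size=(n_groups, group_size))

def word_probability(group, index):
    # Product of branch probabilities along the path root -> group -> word.
    p_group = softmax(W_root @ v + b_root)[group]
    p_word = softmax(W_group[group] @ v + b_group[group])[index]
    return p_group * p_word

# Probabilities over the whole corpus still sum to one.
total = sum(word_probability(g, i)
            for g in range(n_groups) for i in range(group_size))
assert np.isclose(total, 1.0)
```

Note that computing one word's probability only requires the two small softmaxes along its path; the full sum above is just a sanity check that the tree still defines a proper distribution.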
<h3 id="caveats">Caveats</h3>
<p>The model is highly biased by the structure of the corpus tree. With good trees constructed, the model will likely be trained well. With bad trees, the model will probably never achieve very good performance.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/articles/2019-08-17-Hierarchical-Softmax/good_tree.png" style="width: 48%; height: 48%" />
<img src="https://leimao.github.io/images/articles/2019-08-17-Hierarchical-Softmax/bad_tree.png" style="width: 48%; height: 48%" />
<figcaption>Good Tree vs Bad Tree</figcaption>
</figure>
</div>
<p>In the model using hierarchical softmax, we implicitly introduce additional labels for the words in each layer. If those labels “make sense” and the model can learn the label information in each layer, such a model could usually learn very well. Otherwise, the labels would just “confuse” the model. The example above shows a good tree and a bad tree for the corpus. In the good tree, all the words on the left branch are related to sports and all the words on the right branch are related to musical instruments. Given some context $h$, the model would first judge whether the word to be predicted is related to sports or musical instruments. Once that is determined, say, it is sports, the model will then determine whether the word is Basketball, Soccer, or Football. However, in the bad tree, because the “label” in the first layer is ambiguous, the model could hardly learn any useful information, and could not even determine which branch to go down in the first layer during inference.</p>
<p><br /></p>
<p>There are some methods to construct a relatively good tree, such as a binary <a href="https://leimao.github.io/blog/Huffman-Coding/">Huffman tree</a>, but I am not going to elaborate on them here.</p>
<h3 id="conclusions">Conclusions</h3>
<p>Hierarchical softmax is not an approximate optimization algorithm. It accelerates the optimization by adding human orchestration, which could be highly biased.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/abs/1310.4546">Distributed Representations of Words and Phrases and their Compositionality</a></li>
<li><a href="https://github.com/leimao/Two_Layer_Hierarchical_Softmax_PyTorch">Two Layer Hierarchical Softmax PyTorch Implementation</a></li>
</ul>
<p><a href="https://leimao.github.io/article/Hierarchical-Softmax/">Hierarchical Softmax</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on August 17, 2019.</p>
<h2 id="fcitx-chinese-input-setup-on-ubuntu-for-gaming">Fcitx Chinese Input Setup on Ubuntu for Gaming</h2>
<h3 id="introduction">Introduction</h3>
<p>Inputting Chinese on an English Ubuntu system is nothing new. Ubuntu uses the <code class="language-plaintext highlighter-rouge">IBus</code> keyboard input method system by default, and there is the Chinese input method <code class="language-plaintext highlighter-rouge">ibus-pinyin</code>, which can easily be installed via <code class="language-plaintext highlighter-rouge">sudo apt-get install ibus-pinyin</code>. However, when it comes to gaming, especially in full screen mode, <code class="language-plaintext highlighter-rouge">ibus-pinyin</code> will not show any word selection menu in the game, because the <code class="language-plaintext highlighter-rouge">IBus</code> system is not compatible with most gaming engines.</p>
<p><br /></p>
<p>To support Chinese input in Ubuntu games, in this blog post, I will introduce how to install <code class="language-plaintext highlighter-rouge">fcitx</code>-based input methods, which are compatible with games, on Ubuntu. It should also be noted that this method might also apply to input methods for other languages.</p>
<h3 id="protocols">Protocols</h3>
<h4 id="installing-components">Installing Components</h4>
<p>We would need to install <code class="language-plaintext highlighter-rouge">fcitx</code> and <code class="language-plaintext highlighter-rouge">fcitx-googlepinyin</code> first.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install fcitx input method system</span>
<span class="nv">$ </span><span class="nb">sudo </span>apt <span class="nb">install </span>fcitx-bin
<span class="c"># Install Google Pinyin Chinese input method</span>
<span class="nv">$ </span><span class="nb">sudo </span>apt <span class="nb">install </span>fcitx-googlepinyin
</code></pre></div></div>
<h4 id="ubuntu-setups">Ubuntu Setups</h4>
<p>Change the input method system from <code class="language-plaintext highlighter-rouge">IBus</code> to <code class="language-plaintext highlighter-rouge">fcitx</code> in <code class="language-plaintext highlighter-rouge">Region & Language</code>.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-15-Ubuntu-Gaming-Chinese-Input/region_language.png" style="width: 100%; height: 100%" />
<figcaption>Region & Language on Ubuntu 18.04</figcaption>
</figure>
</div>
<p>Click <code class="language-plaintext highlighter-rouge">Manage Installed Languages</code>.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-15-Ubuntu-Gaming-Chinese-Input/ibus.png" style="width: 48%; height: 48%" />
<img src="https://leimao.github.io/images/blog/2019-08-15-Ubuntu-Gaming-Chinese-Input/fcitx.png" style="width: 48%; height: 48%" />
<figcaption>IBus and fcitx</figcaption>
</figure>
</div>
<p>Click <code class="language-plaintext highlighter-rouge">Install/Remove Languages</code> to install <code class="language-plaintext highlighter-rouge">Chinese Simplified</code> and/or <code class="language-plaintext highlighter-rouge">Chinese Traditional</code> if necessary.</p>
<p><br /></p>
<p>Remove all the input sources except <code class="language-plaintext highlighter-rouge">English (US)</code> under the <code class="language-plaintext highlighter-rouge">Input Sources</code> in <code class="language-plaintext highlighter-rouge">Region & Language</code>. Otherwise, there will be two input icons on your system.</p>
<p><br /></p>
<p>Reboot the computer, and we will see a new input icon in the top-right corner of the desktop. We then add Google Pinyin to <code class="language-plaintext highlighter-rouge">fcitx</code> by starting <code class="language-plaintext highlighter-rouge">fcitx-configtool</code> with the following command in the terminal.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>fcitx-configtool
</code></pre></div></div>
<p>Click <code class="language-plaintext highlighter-rouge">+</code> to add input methods.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-15-Ubuntu-Gaming-Chinese-Input/fcitx_configuretool.png" style="width: 100%; height: 100%" />
<figcaption>fcitx Configuration</figcaption>
</figure>
</div>
<p>Uncheck <code class="language-plaintext highlighter-rouge">Only Show Current Language</code>, select <code class="language-plaintext highlighter-rouge">Google Pinyin</code>, and press <code class="language-plaintext highlighter-rouge">OK</code>.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-15-Ubuntu-Gaming-Chinese-Input/google_pinyin.png" style="width: 100%; height: 100%" />
<figcaption>Add Google Pinyin</figcaption>
</figure>
</div>
<p>Now you can switch to <code class="language-plaintext highlighter-rouge">Google Pinyin</code> by pressing <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">Space</code> (the default toggle).</p>
<h4 id="ugly-default-idle-icon">Ugly Default Idle Icon</h4>
<p>The default <code class="language-plaintext highlighter-rouge">fcitx</code> idle icon is ugly. To make it look better, run the following commands in the terminal.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Remove classic UI</span>
<span class="nb">sudo </span>apt remove fcitx-ui-classic
<span class="c"># Install new UI</span>
<span class="nb">sudo </span>apt <span class="nb">install </span>fcitx-ui-qimpanel
</code></pre></div></div>
<p>Reboot the computer, and we will see a new penguin input icon in the top-right corner of the desktop. It turns into a keyboard icon whenever the cursor is placed somewhere that accepts input.</p>
<h3 id="demo">Demo</h3>
<h4 id="dota-2">Dota 2</h4>
<p>In the game Dota 2, we press <code class="language-plaintext highlighter-rouge">Ctrl</code> + <code class="language-plaintext highlighter-rouge">Space</code> to switch to <code class="language-plaintext highlighter-rouge">Google Pinyin</code> and start to input Chinese.</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-15-Ubuntu-Gaming-Chinese-Input/dota2_chinese_input_1.png" style="width: 100%; height: 100%" />
<figcaption>Dota 2 Chinese Input Using Google Pinyin</figcaption>
</figure>
</div>
<p>It was successful!</p>
<div class="titled-image">
<figure class="titled-image">
<img src="https://leimao.github.io/images/blog/2019-08-15-Ubuntu-Gaming-Chinese-Input/dota2_chinese_input_2.png" style="width: 100%; height: 100%" />
<figcaption>Dota 2 Chinese Input Successful</figcaption>
</figure>
</div>
<h3 id="miscellaneous">Miscellaneous</h3>
<p>To see how to play games in Ubuntu, please check my blog post <a href="https://leimao.github.io/blog/Ubuntu-Gaming-Guide/">“Ubuntu Gaming Guide”</a>.</p>
<p><a href="https://leimao.github.io/blog/Ubuntu-Gaming-Chinese-Input/">Fcitx Chinese Input Setup on Ubuntu for Gaming</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on August 15, 2019.</p><![CDATA[Label Smoothing]]>https://leimao.github.io/blog/Label-Smoothing2019-08-10 14:17:25 -0400T00:00:00-00:002019-08-11T00:00:00-05:00Lei Maohttps://leimao.github.iodukeleimao@gmail.com<h3 id="introduction">Introduction</h3>
<p>In machine learning and deep learning, we usually apply regularization techniques, such as L1, L2, and dropout, to prevent our model from overfitting. In classification problems, the model sometimes learns to predict the training examples extremely confidently. This is not good for generalization.</p>
<p><br /></p>
<p>In this blog post, I am going to talk about label smoothing as a regularization technique for classification problems to prevent the model from predicting the training examples too confidently.</p>
<h3 id="method">Method</h3>
<p>In a classification problem with $K$ candidate labels $\{1,2,\cdots,K\}$, for example $i$, $(x_i, y_i)$, from the training dataset, we have the ground truth distribution $p$ over labels, $p(y|x_i)$, with $\sum_{y=1}^{K} p(y|x_i) = 1$. We have a model with parameters $\theta$ that predicts the label distribution $q_{\theta}(y|x_i)$, where, of course, $\sum_{y=1}^{K} q_{\theta}(y|x_i) = 1$.</p>
<p><br /></p>
<p>As I described in <a href="https://leimao.github.io/blog/Cross-Entropy-KL-Divergence-MLE/">“Cross Entropy, KL Divergence, and Maximum Likelihood Estimation”</a>, the cross entropy for this particular example is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
H_{i}(p,q_{\theta}) &= - \sum_{y=1}^{K} p(y|x_i) \log q_{\theta}(y|x_i) \\
\end{aligned} %]]></script>
<p>If we have $n$ examples in the training dataset, our loss function would be</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
L &= \sum_{i=1}^{n} H_i(p,q_{\theta}) \\
&= - \sum_{i=1}^{n} \sum_{y=1}^{K} p(y|x_i) \log q_{\theta}(y|x_i) \\
\end{aligned} %]]></script>
<h4 id="one-hot-encoding-labels">One-Hot Encoding Labels</h4>
<p>Usually this $p(y|x_i)$ would be a one-hot encoded vector where</p>
<script type="math/tex; mode=display">% <![CDATA[
p(y|x_i) =
\begin{cases}
1 & \text{if } y = y_i \\
0 & \text{otherwise}
\end{cases} %]]></script>
<p>With this, we could further reduce the loss function to</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
L &= \sum_{i=1}^{n} H_i(p,q_{\theta}) \\
&= - \sum_{i=1}^{n} \sum_{y=1}^{K} p(y|x_i) \log q_{\theta}(y|x_i) \\
&= - \sum_{i=1}^{n} p(y_i|x_i) \log q_{\theta}(y_i|x_i) \\
&= - \sum_{i=1}^{n} \log q_{\theta}(y_i|x_i) \\
\end{aligned} %]]></script>
<p>Minimizing this loss function is equivalent to performing maximum likelihood estimation over the training dataset (see my proof <a href="https://leimao.github.io/blog/Cross-Entropy-KL-Divergence-MLE/">here</a>).</p>
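<p>As a quick numerical check (a minimal NumPy sketch; the prediction values are made up purely for illustration), the cross entropy with one-hot targets indeed reduces to the negative log-likelihood of the true classes:</p>

```python
import numpy as np

# Made-up predicted distributions q_theta(y|x_i) for n = 3 examples, K = 4 classes.
q = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
y = np.array([0, 1, 3])  # ground truth class indices y_i

# Full cross entropy against one-hot targets p(y|x_i).
p = np.eye(4)[y]
loss_full = -np.sum(p * np.log(q))

# Reduced form: negative log-likelihood of the true classes.
loss_nll = -np.sum(np.log(q[np.arange(3), y]))

assert np.isclose(loss_full, loss_nll)
```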
<p><br /></p>
<p>During optimization, it is possible to minimize $L$ to almost zero if no inputs in the dataset have conflicting labels. Conflicting labels means that, say, two examples in the dataset have the exact same features but different ground truth labels.</p>
<p><br /></p>
<p>This is because $q_{\theta}(y_i|x_i)$ is usually computed with the softmax function</p>
<script type="math/tex; mode=display">q_{\theta}(y_i|x_i) = \frac{\exp(z_{y_i})}{\sum_{j=1}^{K}\exp(z_j)}</script>
<p>where $z_j$ is the logit for candidate class $j$.</p>
<p><br /></p>
<p>The consequence of using one-hot encoded labels is that $\exp(z_{y_i})$ will be extremely large and the other $\exp(z_j)$ where $j \neq y_i$ will be extremely small. Given a “non-conflicting” dataset, the model will classify every training example correctly with a confidence of almost 1. This is certainly a signature of overfitting, and an overfitted model does not generalize well.</p>
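<p>To illustrate this numerically, here is a small sketch (the logit values are made up) showing that as the true-class logit grows past the others, the softmax confidence approaches 1:</p>

```python
import numpy as np

def softmax(z):
    # Shift by the max logit for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Confidence of the true class (index 0) as its logit gap widens.
confidences = [softmax(np.array([gap, 0.0, 0.0, 0.0]))[0]
               for gap in (1.0, 5.0, 10.0)]
print(confidences)  # increases toward 1 as the gap grows
```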
<p><br /></p>
<p>Then how do we make sure that, during training, the model does not become too confident about the labels it predicts for the training data? With a non-conflicting training dataset and one-hot encoded labels, overfitting seems inevitable. Label smoothing was introduced as a regularization technique to address this.</p>
<h4 id="label-smoothing">Label Smoothing</h4>
<p>Instead of using a one-hot encoded vector, we introduce a noise distribution $u(y|x)$. Our new ground truth label for the example $(x_i, y_i)$ would be</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p^{\prime}(y|x_i) &= (1-\varepsilon) p(y|x_i) + \varepsilon u(y|x_i) \\
&=
\begin{cases}
1 - \varepsilon + \varepsilon u(y|x_i) & \text{if } y = y_i \\
\varepsilon u(y|x_i) & \text{otherwise}
\end{cases}
\end{aligned} %]]></script>
<p>where $\varepsilon$ is a weight factor, and note that $\sum_{y=1}^{K} p^{\prime}(y|x_i) = 1$.</p>
<p><br /></p>
<p>We use this new ground truth label in place of the one-hot encoded ground truth label in our loss function.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
L^{\prime} &= - \sum_{i=1}^{n} \sum_{y=1}^{K} p^{\prime}(y|x_i) \log q_{\theta}(y|x_i) \\
&= - \sum_{i=1}^{n} \sum_{y=1}^{K} \big[ (1-\varepsilon) p(y|x_i) + \varepsilon u(y|x_i) \big] \log q_{\theta}(y|x_i) \\
\end{aligned} %]]></script>
<p>We can further decompose this loss function.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
L^{\prime} &= \sum_{i=1}^{n} \bigg\{ (1-\varepsilon) \Big[ - \sum_{y=1}^{K} p(y|x_i) \log q_{\theta}(y|x_i) \Big] + \varepsilon \Big[ - \sum_{y=1}^{K} u(y|x_i) \log q_{\theta}(y|x_i) \Big] \bigg\} \\
&= \sum_{i=1}^{n} \Big[ (1-\varepsilon) H_i(p,q_{\theta}) + \varepsilon H_i(u,q_{\theta}) \Big] \\
\end{aligned} %]]></script>
<p>We could see that for each example in the training dataset, the loss contribution is a mixture of the cross entropy between the one-hot encoded distribution and the predicted distribution $H_i(p,q_{\theta})$, and the cross entropy between the noise distribution and the predicted distribution $H_i(u,q_{\theta})$. During training, if the model learns to predict the distribution confidently, $H_i(p,q_{\theta})$ will go close to zero, but $H_i(u,q_{\theta})$ will increase dramatically. Therefore, with label smoothing, we actually introduced a regularizer $H_i(u,q_{\theta})$ to prevent the model from predicting too confidently.</p>
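<p>This trade-off can be seen in a small numerical sketch (taking $u$ to be uniform, with made-up confidence values): as the predicted distribution sharpens toward the true class, $H_i(p,q_{\theta})$ shrinks while the regularizer $H_i(u,q_{\theta})$ grows.</p>

```python
import numpy as np

K = 4
u = np.full(K, 1.0 / K)  # uniform noise distribution u(y|x) = 1/K

H_pq_vals, H_uq_vals = [], []
for conf in (0.5, 0.9, 0.999):
    # Predicted distribution that puts `conf` on the true class (index 0)
    # and spreads the rest evenly over the other classes.
    q = np.full(K, (1.0 - conf) / (K - 1))
    q[0] = conf
    H_pq_vals.append(-np.log(q[0]))           # H(p, q) with one-hot p
    H_uq_vals.append(-np.sum(u * np.log(q)))  # H(u, q), the regularizer

print(H_pq_vals)  # decreases as confidence grows
print(H_uq_vals)  # increases as confidence grows
```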
<p><br /></p>
<p>In practice, $u(y|x)$ is a uniform distribution that does not depend on the data. That is to say,</p>
<script type="math/tex; mode=display">u(y|x) = \frac{1}{K}</script>
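<p>Putting it together, here is a minimal NumPy sketch of the label-smoothed loss with uniform noise (the function name and example values are my own, for illustration), along with a check that it matches the decomposition $(1-\varepsilon) H_i(p,q_{\theta}) + \varepsilon H_i(u,q_{\theta})$ derived above:</p>

```python
import numpy as np

def smoothed_cross_entropy(q, y, eps):
    """Label-smoothed cross entropy with uniform noise u(y|x) = 1/K.

    q   : (n, K) predicted distributions q_theta(y|x_i)
    y   : (n,)   ground truth class indices
    eps : smoothing weight epsilon
    """
    n, K = q.shape
    p = np.eye(K)[y]                    # one-hot ground truth
    p_smooth = (1 - eps) * p + eps / K  # smoothed targets p'
    return -np.sum(p_smooth * np.log(q))

# Made-up predictions for n = 2 examples, K = 4 classes.
q = np.array([[0.90, 0.04, 0.03, 0.03],
              [0.05, 0.85, 0.05, 0.05]])
y = np.array([0, 1])
eps = 0.1

loss = smoothed_cross_entropy(q, y, eps)

# Verify the decomposition (1 - eps) * H(p, q) + eps * H(u, q).
H_pq = -np.sum(np.eye(4)[y] * np.log(q))
H_uq = -np.sum(np.full_like(q, 0.25) * np.log(q))
assert np.isclose(loss, (1 - eps) * H_pq + eps * H_uq)
```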
<h3 id="conclusions">Conclusions</h3>
<p>Label smoothing is a regularization technique for classification problems to prevent the model from predicting the labels too confidently during training and generalizing poorly.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/abs/1512.00567">Rethinking the Inception Architecture for Computer Vision</a></li>
</ul>
<p><a href="https://leimao.github.io/blog/Label-Smoothing/">Label Smoothing</a> was originally published by Lei Mao at <a href="https://leimao.github.io">Lei Mao's Log Book</a> on August 11, 2019.</p>