Meng's Notes

Style = Correlation? (2015-09-11)<div class="p1">
<span class="s1">I found this paper </span><span class="s2"><a href="http://arxiv.org/pdf/1508.06576v2.pdf">http://arxiv.org/pdf/1508.06576v2.pdf</a> </span>very clear and well-written. People quickly released its implementation such as <a href="https://github.com/kaishengtai/neuralart"><span class="s2">https://github.com/kaishengtai/neuralart</span></a>.</div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">The major contribution of the paper seems to be the content/style decomposition of an image (as described in Equation 1, 5 and 7). The style representation (equation 4) seems to be the key. </span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">It seems obvious that the “content" and “style” defined in this paper are different from what we know by common sense. The “content” is actually a mixture of both content and style, since it was defined as activations at a certain layer depth in an object-recognition ConvNet, which has no guarantee that neurons won’t be activated by “style” features. Also, the “style” was defined as correlations among feature activation patterns which, for the same reason, could encompass both content and style. </span></div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1">One of the imaginary (could be wrong) examples of the “style" in my mind would be “starry(style) —correlated with-- sky(content)”. So when you minimize loss function of “style” using gradient decent, you are more likely to make areas starry wherever they look like sky. And the level of starry could be adjusted by weights between content and style.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">I interpret the author’s main idea as: To transform an image from X(input) to Y(output), by preserving original features while introducing new features which are highly correlated in image Z(famous painting), then balance the weights between original and new features. I know practically the authors started with a white noise image and minimized its distances from both X and Z’s representations, but the ideas are equivalent.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<br />
<div class="p1">
<span class="s1">Potential applications of this paper's idea could be interesting too. Such as to make camouflage given environmental figures; add/remove accents of human voices; translate scientific papers into novels; </span>translate novels into Sci Fi novels; etc.</div>
An intuitive explanation of PCA (Principal Component Analysis) (2013-05-08)

Many research papers apply PCA (Principal Component Analysis) to their data and present results to readers without further explanation of the method. When people search the internet for a definition of PCA, they sometimes get confused, often by terms like "covariance matrix", "eigenvectors" or "eigenvalues". This is not surprising, because most explanatory articles focus on the detailed calculation process instead of the basic idea of PCA. They are mathematically correct, yet often not intuitively readable at first glance.<br />
<br />
For a mathematical method, I believe most people only need to understand its logic and limitations and let software packages do the rest (implementation, calculation, etc.). Here I am trying to show that PCA is not an impenetrable soup of acronyms but a quite intuitive approach: it can make large-scale data "smaller" and easier to handle.<br />
<br />
<br />
<b>PCA is nothing but coordinate system transformation: A simple example </b><br />
<b><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5KCEIQBZo7hdWPLRxhVPoXsCNraSwRszj30EtzLVBQx1QaRhtwSxHd2TpBy89bXNxruelWpQmPFHtUzp69kRXDNfhpvEBqY5aPTlw9ZT8a2_MYdzyBvPUHHJ2AF7JlTM7u51yXxf3EQ/s1600/PCA_1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5KCEIQBZo7hdWPLRxhVPoXsCNraSwRszj30EtzLVBQx1QaRhtwSxHd2TpBy89bXNxruelWpQmPFHtUzp69kRXDNfhpvEBqY5aPTlw9ZT8a2_MYdzyBvPUHHJ2AF7JlTM7u51yXxf3EQ/s1600/PCA_1.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1. (A) A simple object described by a complex set of variables (or coordinate system). (B) Find <span class="short_text" id="result_box" lang="en"><span class="hps">new
variables (coordinate axes) orthogonal to each other and pointing to
the direction of largest variances. (C) Use new variables (</span></span><span class="short_text" id="result_box" lang="en"><span class="hps"><span class="short_text" id="result_box" lang="en"><span class="hps">coordinate axes</span></span>) to describe object in a more concise way.</span></span></td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
A simple object can be described by complex and unnecessary variables. We don't intend to do so, but often cannot avoid it. One problem is <span style="color: red;">how to find the simplest way to describe an object?</span> Here is an example showing that the answer can be<span style="color: red;"> choosing a proper coordinate system</span>.<br />
<br />
Let's assume that we have a "3-D shape detection machine" that can scan anything in front of it and output a model of that object. We use it to scan an ellipse made of paper (the paper's thickness can be neglected). The output model uses three axes, L (Length), W (Width) and H (Height), perpendicular to each other, to represent the 3-D world (see Figure 1A). So each data point on that object can be written as a function of three variables:<br />
<div style="text-align: center;">
<blockquote class="tr_bq">
<span class="short_text" id="result_box" lang="en"><span class="">Data(i) = f(L(i), W(i), H(i)) [function 1]</span></span></blockquote>
</div>
Apparently, this is not the best way to describe an ellipse. I guess most people would make the following improvements:<br />
<ol>
<li>Find the geometric center of the ellipse, and set it as the coordinate origin;</li>
<li>Find the direction along which the ellipse has the longest radius. We call this direction (or the variable along this direction) "component one".</li>
<li>Find another direction that is perpendicular to the first one and along which the ellipse has the second-longest radius. We call this direction "component two".<br />(steps 1-3, see Figure 1B)</li>
<li>Re-plot the ellipse in the new coordinate system defined by components one (C1) and two (C2) (see Figure 1C).</li>
</ol>
In the new coordinate system, each data point on that ellipse can be re-written as a function of two variables:<br />
<div style="text-align: center;">
<blockquote class="tr_bq">
<span class="short_text" id="result_box" lang="en"><span class="">Data(j) = g(C1(j), C2(j)) [function 2]</span></span></blockquote>
</div>
<b> </b><br />
Now it is quite clear that, after a proper coordinate system transformation, we get:<br />
<ul>
<li>Fewer variables (lower dimensionality) in function 2 compared to function 1.</li>
<ul>
<li>(L, W, H) --> (C1, C2) </li>
</ul>
<li>No information lost.</li>
<ul>
<li>function 1 == function 2 </li>
</ul>
<ul>
<li>The relative geometric positions of all data points remain unchanged.</li>
</ul>
</ul>
That's exactly what PCA does. The term "<b>principal component</b>" denotes the new variables (or coordinate systems) we choose to describe our data in a more concise or convenient way. All principal components must satisfy two conditions:<br />
<ol>
<li>They must be perpendicular (mathematically, orthogonal) to each other.</li>
<ol>
<li>This means the principal components are NOT linearly correlated with each other.</li>
<li>That's why PCA can reduce the number of variables without losing information: variables in the raw data are not independent, and correlated variables cause redundancy.</li>
</ol>
<li>Components are numbered by the length of the radius (mathematically speaking, the variance) of our data points along them. So our data must have the largest variance along the axis of component 1, the second-largest variance along the axis of component 2, and so on.</li>
<ol>
<li>This means a higher-ranked component (one with a smaller number) is more important.</li>
<li>Sometimes we may sacrifice minor components to further reduce the number of variables. For example, if the first two principal components (out of 10 in total) contribute 90% of the variance of the data, we might focus on them only and discard the other 8 components.</li>
</ol>
</ol>
<hr />
<span style="font-size: x-small;">Why <span class="short_text" id="result_box" lang="en"><span class=""> we consider the </span></span></span><span style="font-size: x-small;"><span class="short_text" id="result_box" lang="en"><span class=""><span style="font-size: x-small;"><span class="short_text" id="result_box" lang="en"><span class=""><span style="font-size: x-small;"><span class="short_text" id="result_box" lang="en"><span class="">axes</span></span></span></span></span></span> along which the data has the biggest variance</span></span></span><span style="font-size: x-small;"><span class="short_text" id="result_box" lang="en"><span class=""><span style="font-size: x-small;"><span class="short_text" id="result_box" lang="en"><span class=""> as the most important </span></span></span>one? The reason is that along that axes </span></span><span class="short_text" id="result_box" lang="en"><span class=""><span class="short_text" id="result_box" lang="en"><span class="">we can get the best separation of data points</span></span>. A more natural explanation could be like: along that axes we can see the largest divergence from the mean value. Just like when we talk about trees, rail ways and stars, we describe them as tall, long and far -- the dimensions that have greatest divergences.</span></span></span><br />
<hr />
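Here is a minimal R sketch of the ellipse example above (the numbers are made up; prcomp is base R's PCA routine): a flat ellipse embedded in 3-D is fully captured by its first two principal components, and the original points can be recovered from them exactly.<br />
<pre>
# A flat ellipse (long radius 3, short radius 1) tilted into 3-D space,
# so the "3-D shape detection machine" reports it with three variables (L, W, H).
set.seed(1)
ang <- runif(500, 0, 2 * pi)
ellipse_2d <- cbind(3 * cos(ang), sin(ang))
tilt <- matrix(c(0.8, 0.6, 0,  -0.36, 0.48, 0.8), nrow = 2, byrow = TRUE)
X <- ellipse_2d %*% tilt                       # 500 points described by (L, W, H)

p <- prcomp(X)                                 # centers the data, finds C1, C2, C3
summary(p)$importance["Proportion of Variance", ]
# C1 and C2 carry all the variance; C3 is ~0 because the paper ellipse is flat.

# Describe every point with only (C1, C2), then rebuild (L, W, H): nothing is lost.
X_rec <- sweep(p$x[, 1:2] %*% t(p$rotation[, 1:2]), 2, p$center, "+")
max(abs(X - X_rec))                            # essentially zero
</pre>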
<br />
<b>PCA can reduce the number of variables by eliminating correlations between them: A more realistic example</b><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8n2lymTeNfLjv2Ro6uKJdCloycaqcxMIsCV26Ny_-3lsJLFlWbFga9V9DkX0aqnHiBCLlaIoR3tsCPwwv6ykCDp9peykOP9yUNFyPPZKSMyBqk50FTZBDFevdhmewn6dvn_UJCGNTwA/s1600/PCA_2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8n2lymTeNfLjv2Ro6uKJdCloycaqcxMIsCV26Ny_-3lsJLFlWbFga9V9DkX0aqnHiBCLlaIoR3tsCPwwv6ykCDp9peykOP9yUNFyPPZKSMyBqk50FTZBDFevdhmewn6dvn_UJCGNTwA/s1600/PCA_2.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2 (A) Data set of gene expression level in different tissues. (B) Gene expression level
is a n-dimensional function. (C) PCA process. (D) After PCA, gene
expression level becomes a k-dimensional function (k<n). (E) Data set is "compressed" with no information lost. </td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
Suppose we measure the expression levels of N genes in M tissues. We will get a data set in the form of an M*N matrix (see Figure 2A). Mathematically, we can describe the data set as:<br />
<div style="text-align: center;">
<blockquote class="tr_bq">
Tissue(i) = f(gene1(i), gene2(i)...geneN(i))</blockquote>
<div style="text-align: left;">
Here <i>Tissue(i)</i> means the expression levels of all genes in the i-th tissue, gene1(i) denotes the expression level of gene 1 in the i-th tissue, and so on. As this function has N variables, we cannot draw the N-dimensional function in a graph when N>3, but we can imagine the high-dimensional space as multiple coordinate axes (see Figure 2B). Each axis stands for one gene's expression level. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
We use PCA to reduce the dimension of variables:</div>
</div>
<ol>
<li>Find the mean value of all the data, and set it as the coordinate origin;</li>
<li>Find the direction along which the data set has the largest variance. We call it "component 1".</li>
<li>Find another direction that is orthogonal to all components already found and along which the data has the largest possible variance. We call this direction "component 2".</li>
<li>Repeat step 3 until we get K principal components and no more components can be found.<br />(steps 1-4, see Figure 2C)</li>
<li>Re-plot the data set in the new coordinate system defined by the K principal components (see Figure 2D).</li>
</ol>
<hr />
<span style="font-size: x-small;">Note: These steps are conceptual framework used only for understanding PCA. <span style="font-size: x-small;">In reality we use linear algebra operations of the same idea b<span style="font-size: x-small;">ut may <span style="font-size: x-small;">have</span> <span style="font-size: x-small;">quite different appea<span style="font-size: x-small;">r<span style="font-size: x-small;">a</span></span>nce.</span></span></span></span><br />
<hr />
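As a hedged illustration of steps 1-5 above (a synthetic data set, not real expression data), prcomp in R performs the whole procedure in one call, and the cumulative variance tells us how many components K are worth keeping:<br />
<pre>
set.seed(2)
M <- 50; N <- 1000                   # 50 tissues, 1000 genes (made-up numbers)
# Correlated "expression" data: 5 hidden factors drive all the genes,
# so most of the N variables are redundant.
factors  <- matrix(rnorm(M * 5), M, 5)
loadings <- matrix(rnorm(5 * N), 5, N)
expr <- factors %*% loadings + matrix(rnorm(M * N, sd = 0.1), M, N)

p <- prcomp(expr)                    # steps 1-5: center, find components, re-plot
cum_var <- cumsum(p$sdev^2) / sum(p$sdev^2)
K <- which(cum_var >= 0.9)[1]        # smallest K explaining 90% of the variance
K                                    # about 5: almost all the redundancy is gone
compressed <- p$x[, 1:K]             # the "compressed" M x K table of Figure 2E
dim(compressed)
</pre>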
<span class="short_text" id="result_box" lang="en"><span class="">Now we reduced the dimension of variables from N to K and K<N (see Figure 2E). A example provided by <a href="http://www.nature.com/nbt/journal/v26/n3/full/nbt0308-303.html" target="_blank">Nature Biotechnology 26, 303 - 304 (2008)</a></span></span> that when N = 8,534 , K = 104. This means a huge amount of redundancy was removed from original expression data set. Imagine the difference between two functions that have 8,534 and 104 variables accordingly. PCA can make large scale data much easier to handle. Original data table (Figure 2A) will be greatly compressed (Figure 2E) after PCA process.<br />
<br />
PCA can reduce data dimensionality at minimal cost, but it cannot do anything beyond that. We still need specific approaches for next-step analyses such as clustering, inference, or other jobs.<br />
<br />
<b>PCA has limitations: examples of failure</b><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHatArIYbGgThesLS27pyo4Y8ccScvCbdrVCey6NdsAVsgzh_Px2aIV4sLCo-Gdtsj-XmR2AF1XoKmXmYeEemHX2aHmjbu_Z3Ymqe7YLwYdxb-bD3AlFSe-jAO8g85YhAwbL8RH3MfGw/s1600/PCA_3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHatArIYbGgThesLS27pyo4Y8ccScvCbdrVCey6NdsAVsgzh_Px2aIV4sLCo-Gdtsj-XmR2AF1XoKmXmYeEemHX2aHmjbu_Z3Ymqe7YLwYdxb-bD3AlFSe-jAO8g85YhAwbL8RH3MfGw/s1600/PCA_3.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3. (A) PCA on non-Gaussian source data. (B) Principal components
given by PCA may still be correlated in a non-linear way. <span class="short_text" id="result_box" lang="en"><span class="">Prior knowledge could help us to find out the best principal components. </span></span></td></tr>
</tbody></table>
Any algorithm can fail when its assumptions are not satisfied. PCA makes the "largest variance" assumption. If the data do not follow a multidimensional normal (Gaussian) distribution, PCA may not give the best principal components (see Figure 3A). For non-Gaussian source data, we need independent component analysis (ICA) to separate the data into additive, mutually statistically independent sub-components.<br />
<br />
PCA also assumes that all principal components are orthogonal (linearly uncorrelated) to each other, so it may fail when the underlying components are correlated in a non-linear manner. For a ring-shaped data set (see Figure 3B), PCA will give two principal components C1 and C2, which are actually correlated through the rotation angle θ. It is easy to see that θ, rather than C1 or C2, should be the real first principal component. In such cases, before applying PCA, we first need to perform a non-linear transformation on the raw data to create a new space in which θ becomes a linear dimension. This approach is called kernel PCA.<br />
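A minimal sketch of that work-around (the explicit polar-coordinate transform below stands in for the kernel; real kernel PCA does this implicitly through a kernel function): on ring-shaped data, plain PCA splits the variance almost evenly between C1 and C2, while after the transform almost all the variance lies along θ.<br />
<pre>
set.seed(3)
theta <- runif(400, 0, 2 * pi)
r     <- 1 + rnorm(400, sd = 0.05)            # a thin, noisy ring
ring  <- cbind(r * cos(theta), r * sin(theta))

# Plain PCA: neither component is clearly "first", so PCA reports nothing useful.
prcomp(ring)$sdev^2

# Transform to (angle, radius) first; now one direction clearly dominates.
polar <- cbind(atan2(ring[, 2], ring[, 1]), sqrt(rowSums(ring^2)))
prcomp(polar)$sdev^2                          # almost all variance lies along the angle
</pre>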
<br />
<b>PCA implementations : eigenvalue decomposition (EVD) and singular value decomposition (SVD)</b><br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXYPp3S0KRHBRhK-P4Z7ANnSJed2FTqgjjppAVv6tTvhFXnYtW3Md8aJwbtMOHZQFh3C_WyIGF5F88FxgvNTF8-fLp7SDpkSM3rvFvwkos4hNkKIUp7A6IXOkeNnhG8tw37630QsxCqg/s1600/PCA_4.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXYPp3S0KRHBRhK-P4Z7ANnSJed2FTqgjjppAVv6tTvhFXnYtW3Md8aJwbtMOHZQFh3C_WyIGF5F88FxgvNTF8-fLp7SDpkSM3rvFvwkos4hNkKIUp7A6IXOkeNnhG8tw37630QsxCqg/s1600/PCA_4.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 4</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
As a conceptual framework, PCA can have different implementations. Probably the most popular ones are eigenvalue decomposition (EVD) and singular value decomposition (SVD). Both can be used to find principal components, and they give the same results in most cases.<br />
<br />
Before applying EVD or SVD, we should define our goal: to find a proper transformation through which an NxM matrix can be transformed (or compressed) into an NxK (K<=M) matrix with no information lost (see Figure 4A). <br />
<br />
Now let's imagine the EVD algorithm as a machine with inputs and outputs. We feed EVD our raw data set X, an NxM matrix, as input. EVD will output the following (see Figure 4B):<br />
<ol>
<li>Two new matrices W and D. W contains all the principal component vectors, while the diagonal matrix D contains the eigenvalues that rank those vectors (ordered from the largest variance to the smallest). \(X^T\) and \(W^T\) are the transposes of X and W, respectively, so they are not counted as new.</li>
<li>An equation : $$XX^T = WDW^T$$ </li>
</ol>
We know that WD is an NxK (K<=M) matrix, which is exactly the format we want to transform our data set into. So we can rewrite the equation as:<br />
<div style="text-align: center;">
<blockquote class="tr_bq">
$$XX^TW = WD\ \ \ \text{[Solution 1]}$$</blockquote>
<div style="text-align: left;">
OK, we have a solution (transformation) for our goal using EVD, because we can transform the NxM matrix X into an NxK (K<=M) matrix WD (see Figure 4C). </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
SVD is another method that can be used to reach our goal. Let's imagine SVD as a machine with inputs and outputs. We feed SVD the data set X, the NxM matrix, as input. SVD will output the following (see Figure 4D):</div>
<ol style="text-align: left;">
<li>Three new matrices U, \(\Sigma\) and \(V^T\). U and \(V^T\) contain principal component vectors for the two directions (columns and rows of the raw data), respectively. The diagonal matrix \(\Sigma\) contains the ordered singular values that rank those principal components. </li>
<li>An equation : $$X = U \Sigma V^T$$</li>
</ol>
<div style="text-align: left;">
We know that U\(\Sigma\) is an NxK (K<=M) matrix, which is exactly the format we want to transform our data set into. We can rewrite the equation as:<br />
<div style="text-align: center;">
<blockquote class="tr_bq">
$$XV = U\Sigma\ \ \ \text{[Solution 2]}$$</blockquote>
<div style="text-align: left;">
So we get another solution for our goal using SVD, because we can transform the NxM matrix X into an NxK (K<=M) matrix U\(\Sigma\) (see Figure 4E). </div>
<div style="text-align: left;">
<br />
Another question: since EVD and SVD seemingly do the same work, which one should we choose? I personally recommend SVD. The reason is that EVD requires computing \(XX^T\), which can lose precision when X is a Läuchli matrix. In that scenario, forming \(XX^T\) involves adding very large numbers to very small ones, and the small ones get "eaten" by the large ones. SVD does not have this problem. </div>
</div>
</div>
</div>
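A quick numerical sketch of the two routes in R (eigen and svd are the base-R routines; the matrices are made up): both recover the same principal directions, and the Läuchli-type example shows how forming the cross-product first loses the small values.<br />
<pre>
set.seed(4)
X <- matrix(rnorm(6 * 4), nrow = 6, ncol = 4)    # a small N x M data matrix

ev <- eigen(X %*% t(X))        # EVD route: eigen-decompose X X^T
sv <- svd(X)                   # SVD route: X = U Sigma V^T

# Same principal directions (columns can differ only by sign) ...
max(abs(abs(ev$vectors[, 1:4]) - abs(sv$u)))
# ... and the eigenvalues in D are the squared singular values in Sigma
max(abs(ev$values[1:4] - sv$d^2))

# The precision problem of the EVD route, on a Lauchli-type matrix:
e <- 1e-10
L <- rbind(c(1, 1, 1), diag(e, 3))   # columns differ only by the tiny e
svd(L)$d                             # about 1.732, 1e-10, 1e-10: small values survive
eigen(t(L) %*% L)$values             # should be 3, 1e-20, 1e-20, but 1 + 1e-20 == 1
                                     # in double precision, so the small ones are "eaten"
</pre>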
<hr size="8/" />
<span style="font-size: x-small;">I thank the following articles for providing inspiration: </span><br />
<ol>
<li><span style="font-size: x-small;"></span><span class="short_text" id="result_box" lang="en"><span class=""><a href="http://www.nature.com/nbt/journal/v26/n3/full/nbt0308-303.html" target="_blank">What is principal component analysis? <i>Nature Biotechnology 26, 303 - 304 (2008)</i></a></span></span></li>
<li><span class="short_text" id="result_box" lang="en"><span class=""><a href="http://www.youtube.com/watch?v=ey2PE5xi9-A">Machine Learning Lecture 14, Stanford <i>(by Andrew Ng)</i></a><i> </i></span></span></li>
<li><span class="short_text" id="result_box" lang="en"><span class=""><a href="http://www.snl.salk.edu/~shlens/pca.pdf">A tutorial on Principal Components Analysis (by Jonathon Shlens, 2009)</a><i> </i></span></span></li>
<li><span class="short_text" id="result_box" lang="en"><span class=""><a href="https://en.wikipedia.org/wiki/Principal_component_analysis">Principal component analysis <i>(From Wikipedia)</i></a><i> </i></span></span></li>
<li><span class="short_text" id="result_box" lang="en"><span class=""><a href="http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020209.html">PCA - the maximum variance explanation<i> (JerryLead, in Chinese)</i></a><i> </i></span></span></li>
<li><a href="http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020216.html">PCA - the least square error explaination <i>(JerryLead, in Chinese) </i></a></li>
</ol>
<script type="text/javascript">MathJax.Hub.Queue(["Typeset",MathJax.Hub]);</script>Anonymoushttp://www.blogger.com/profile/00601792748517163575noreply@blogger.com17tag:blogger.com,1999:blog-3373377137104117454.post-72312380478993385992013-03-19T09:23:00.002-07:002013-05-08T14:17:53.811-07:00How much confidence can we obtain from a piece of evidence?Every one knows that most computational predictions on biological systems are "less confidential" although many methods were based on convincible evidences and were announced of having high accuracy. But why these theoretically applicable methods failed to give reliable outputs? Some people believe that it's simply because existing methods are not good enough, or the biological system is too complex to predict. That's probably true. But there's <span class="short_text" id="result_box" lang="en"><span class="">another important factor that was often ignored, the abundance of potential true positives.</span></span><br />
<br />
<span class="short_text" id="result_box" lang="en"><span class="">I will use a simple disease diagnosis example to show that even if we have a excellent prediction method based on strong evidences, we might still get poor predictions as long as the disease is rare in population.</span></span><br />
<br />
<hr />
Suppose there is a cancer diagnostic test with pretty high sensitivity and specificity, both 90%. If a man takes the test and, unfortunately, gets a positive result, can we say he is more likely to have cancer than other people?<br />
<br />
When we say someone is "more likely" to have cancer, we are comparing his chance of having cancer to that of a person randomly selected from the population. The question seems easy. Mathematically, we are asking for the value of:<br />
$$\frac{chance\ of\ people\ having\ cancer,\ given\ an\ evidence\ (positive\ test)}{chance\ of\ people\ having\ cancer,\ in\ general\ (given\ no\ special\ evidence)}$$ $$or$$ $$\frac{Pr(cancer | people\_with\_positive\_test)}{Pr(cancer | people\_randomly\_chosen)}\ \ \ (1)$$ <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
The meaning of expression \((1)\) could be explained by the figure below.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTV7tHsZMj9ii62rGeM_FpDdlT5FvJZVKQdkhVounAPsK5T02Fa7bMkfPmNguPr42NKxMPVzHFovZ9b1DxrDiSR52eN4qv_4Thir2dSu8i6jlXUHSKPai3QY1TH1TBtN3Btrq8HdX5Xw/s1600/Slide1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTV7tHsZMj9ii62rGeM_FpDdlT5FvJZVKQdkhVounAPsK5T02Fa7bMkfPmNguPr42NKxMPVzHFovZ9b1DxrDiSR52eN4qv_4Thir2dSu8i6jlXUHSKPai3QY1TH1TBtN3Btrq8HdX5Xw/s1600/Slide1.jpg" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
It's easy to see that, $$Pr(cancer | people\_with\_positive\_test)$$ $$= \frac{\#\ people\ got\ positive\ test\ and\ having\ cancer}{\#\ people\ who\ got\ positive\ test} = \frac{C2}{C2+H1}\ \ \ (2)$$ $$ and $$ $$Pr(cancer | people\_randomly\_chosen)$$ $$=\frac{\#\ people\ having\ cancer}{whole\ population} = \frac{C1+C2}{C1+C2+H1+H2} \ \ \ (3)$$ <br />
We substitute expression \((1)\) with expression \((2)\) and \((3)\),<br />
$$\frac{Pr(cancer | people\_with\_positive\_test)}{Pr(cancer | people\_randomly\_chosen)}$$ $$= \frac{\frac{C2}{C2+H1}}{ \frac{C1+C2}{C1+C2+H1+H2}} = \frac{\frac{C2}{C1+C2}}{ \frac{C2+H1}{C1+C2+H1+H2}}$$<br />
We know that,<br />
\(\frac{C2}{C1+C2} = Pr(positive\_test|cancer), \frac{C2+H1}{C1+C2+H1+H2} = Pr(positive\_test|people\_randomly\_chosen)\). So,<br />
$$\frac{Pr(cancer | people\_with\_positive\_test)}{Pr(cancer |
people\_randomly\_chosen)} = \frac{Pr(positive\_test|cancer)}{Pr(positive\_test|people\_randomly\_chosen)}\ \ \ (4)$$<br />
<hr />
<span style="font-size: x-small;">We noticed that expression (4) has exactly the same format as <a href="http://en.wikipedia.org/wiki/Bayes%27_theorem" target="_blank">Bayes' theorem</a> :</span><br />
<span style="font-size: x-small;">\(Pr(Cancer|Test)=\frac{Pr(Test|Cancer)}{Pr(Test)}Pr(Cancer)\).</span><br />
<span style="font-size: x-small;">isn<span style="font-size: x-small;">'t it?</span> </span><br />
<hr />
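A quick sanity check of expression (4) with made-up counts for the four groups (C1, C2, H1, H2 are chosen arbitrarily): the two ratios come out identical.<br />
<pre>
# Arbitrary counts for the four groups in the figure
C1 <- 90;  C2 <- 10       # cancer patients: C2 tested positive, C1 tested negative
H1 <- 980; H2 <- 8920     # healthy people:  H1 tested positive, H2 tested negative
N  <- C1 + C2 + H1 + H2

lhs <- (C2 / (C2 + H1)) / ((C1 + C2) / N)   # Pr(cancer | positive) / Pr(cancer)
rhs <- (C2 / (C1 + C2)) / ((C2 + H1) / N)   # Pr(positive | cancer) / Pr(positive)
c(lhs, rhs)                                 # identical, as expression (4) states
</pre>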
<br />
<b>In order to calculate the actual probability</b>, we remember that,<br />
\(\frac{C2}{C1+C2}\) is, by definition, the sensitivity of the diagnostic test, while \(\frac{H2}{H1+H2}\) is the specificity.<br />
$$\frac{Pr(cancer | people\_with\_positive\_test)}{Pr(cancer | people\_randomly\_chosen)} = \frac{\frac{C2}{C1+C2}}{\frac{C2+H1}{C1+C2+H1+H2}} = \frac{sensitivity}{\frac{C2+H1}{C1+C2+H1+H2}}$$
$$= \frac{sensitivity}{\frac{(C1+C2+H1+H2)\cdot\frac{C1+C2}{C1+C2+H1+H2}\cdot\frac{C2}{C1+C2} + (C1+C2+H1+H2)\cdot\left(1-\frac{C1+C2}{C1+C2+H1+H2}\right)\cdot\left(1-\frac{H2}{H1+H2}\right)}{C1+C2+H1+H2}}$$
$$= \frac{sensitivity}{\frac{C1+C2}{C1+C2+H1+H2}\cdot\frac{C2}{C1+C2} + \left(1-\frac{C1+C2}{C1+C2+H1+H2}\right)\cdot\left(1-\frac{H2}{H1+H2}\right)}$$
$$= \frac{sensitivity}{Pr(cancer)\cdot sensitivity + (1-Pr(cancer))\cdot(1-specificity)}\ \ \ (5)$$
$$\rightarrow \left\{\begin{matrix} 1, & Pr(cancer)\rightarrow 1\\ \frac{sensitivity}{1-specificity}, & Pr(cancer)\rightarrow 0 \end{matrix}\right.$$
<hr />
<span style="font-size: x-small;"> \(Pr(cancer)\) is the same thing as \(Pr(cancer | people\_randomly\_chosen)\).</span><br />
<hr />
<br />
Now let's play with it:<br />
<br />
<iframe frameborder="0" height="270" marginheight="0" marginwidth="0" src="http://instacalc.com/10593/embed" width="450"></iframe>
<br />
The results show that when the sensitivity and specificity of the diagnostic test are both as high as 90%, a man who tested positive is 8.3-fold more likely to have cancer than by chance. The absolute probability of him having cancer is 8.3%.<br />
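The embedded calculator can also be reproduced with a few lines of R implementing expression (5); the 1% prevalence below is an assumption, chosen because it reproduces the 8.3 figures above.<br />
<pre>
# Expression (5): fold increase in cancer probability given a positive test
fold_change <- function(sensitivity, specificity, prevalence) {
  sensitivity / (prevalence * sensitivity + (1 - prevalence) * (1 - specificity))
}

fold <- fold_change(0.90, 0.90, 0.01)   # assumed 1% prevalence
fold                                    # ~8.3-fold more likely than a random person
fold * 0.01                             # absolute probability given a positive test, ~8.3%

# The other combinations mentioned in the note below:
fold_change(0.99, 0.99, 0.01) * 0.01    # ~50%
fold_change(0.90, 0.90, 0.50) * 0.50    # ~90%
</pre>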
<br />
<hr />
<span style="font-size: x-small;">It<span style="font-size: x-small;">'s interesting</span> to also try combinations of [99%, 99%, 1%] and [90%, 90%, 50%].</span><br />
<hr />
<br />
Now it is obvious that when making predictions, <b>even if the evidence is very strong</b> (represented by high sensitivity and specificity), <b>we might still get relatively low confidence</b> (probability of true positives) <b>due to the low abundance of background positives</b>. When the absolute probability of a true positive prediction is too low, we are unlikely to appreciate the fold change. The confidence we obtain from each piece of evidence can be diluted by the sea of negatives.<br />
<br />
<hr size="8/" />
<span style="font-size: small;">I thank the following articles for providing inspiration: </span><br />
<ol><span style="font-size: small;">
</span>
<li><span style="font-size: small;"><a href="http://betterexplained.com/articles/understanding-bayes-theorem-with-ratios/" target="_blank">Understanding Bayes Theorem With Ratios</a> -- by BetterExplained</span></li>
<span style="font-size: small;">
</span>
<li><span style="font-size: small;"><a href="http://yudkowsky.net/rational/bayes" target="_blank">An Intuitive Explanation of Bayes' Theorem</a> -- by Yudkowsky</span></li>
</ol>
<script type="text/javascript">MathJax.Hub.Queue(["Typeset",MathJax.Hub]);</script>Anonymoushttp://www.blogger.com/profile/00601792748517163575noreply@blogger.com0tag:blogger.com,1999:blog-3373377137104117454.post-78339578499983411332012-12-19T15:49:00.003-08:002017-06-18T11:54:00.891-07:00Simple Enrichment Test -- calculate hypergeometric p-values in RHypergeometric test are useful for enrichment analysis. For example, having a gene list in hand, people might want to tell which functions (GO terms) are enriched among these genes. Hypergeometric test (or its equivalent: one-tailed
Fisher's exact test) will give you statistical confidence in \(p\)-values. <br />
<br />
R provides the functions <a href="http://stat.ethz.ch/R-manual/R-devel/library/stats/html/Hypergeometric.html">phyper</a> and <a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/fisher.test.html">fisher.test</a> for the hypergeometric and Fisher's exact tests, respectively. However, it is tricky to get the parameters right, so I spent some time making it clear.<br />
<br />
<br />
<b>Here is a simple example:</b><br />
Five cards are drawn from a well-shuffled deck.<br />
<b>x</b> = the number of diamonds selected.<br />
We use a 2x2 table to represent the case: <br />
<br />
<table border="1" cellpadding="4" style="border-collapse: collapse; margin: auto; text-align: center;"><tbody>
<tr><td></td><td>Diamond</td><td>Non-Diamond</td><td>Total</td></tr>
<tr><td>selected</td><td>x</td><td>5-x</td><td>5 sampled cards</td></tr>
<tr><td>left</td><td>13-x</td><td>34+x</td><td>47 cards left after sampling</td></tr>
<tr><td>Total</td><td>13 diamonds</td><td>39 non-diamonds</td><td>52 cards</td></tr>
</tbody></table>
<br />
We are asking whether diamonds are enriched or depleted among our selected cards, compared to the background. 
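Before running the tests, it can help to look at the whole sampling distribution. A minimal sketch with dhyper (base R) gives the probability of drawing each possible number of diamonds; the p-values below are just tail sums of this distribution.<br />
<pre>
# Probability of getting x diamonds when drawing 5 cards from a 52-card deck
x <- 0:5
round(dhyper(x, m = 13, n = 39, k = 5), 4)
# 0.2215 0.4114 0.2743 0.0815 0.0107 0.0005
</pre>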
<hr />
<br />
<b>Here are the different parameters used by <a href="http://stat.ethz.ch/R-manual/R-devel/library/stats/html/Hypergeometric.html">phyper</a> and <a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/fisher.test.html">fisher.test</a>:</b><br />
<blockquote class="tr_bq">
<b>phyper</b>(<span style="background-color: #e06666;">x</span>, <span style="background-color: #f1c232;">13</span>, <span style="background-color: #6aa84f;">39</span>, <span style="background-color: #6fa8dc;">5</span>, lower.tail=TRUE); </blockquote>
<blockquote class="tr_bq">
# Numerical parameters in order: </blockquote>
<blockquote class="tr_bq">
# (<span style="background-color: #e06666;">success-in-sample</span>, <span style="background-color: #ffd966;">success-in-bkgd</span>, <span style="background-color: #93c47d;">failure-in-bkgd</span>, <span style="background-color: #6fa8dc;">sample-size</span>).</blockquote>
<blockquote class="tr_bq">
<b>fisher.test</b>(matrix(c(<span style="background-color: #e06666;">x</span>, <span style="background-color: #ffd966;">13-x</span>, <span style="background-color: #93c47d;">5-x</span>, <span style="background-color: #6fa8dc;">34+x</span>), 2, 2), alternative='less'); </blockquote>
<blockquote class="tr_bq">
# Numerical parameters in order: </blockquote>
<blockquote class="tr_bq">
# (<span style="background-color: #e06666;">success-in-sample</span>, <span style="background-color: #ffd966;">success-in-left-part</span>, <span style="background-color: #93c47d;">failure-in-sample</span>, <span style="background-color: #6fa8dc;">failure-in-left-part</span>). </blockquote>
It is clear that the hypergeometric test compares the sample to the background, while Fisher's exact test compares the sample to the part of the background left after sampling without replacement. They give the same p-value (because they assume the same distribution).<br />
<hr />
<b>Here are the results:</b><br />
<blockquote class="tr_bq">
x=1; # x could be 0~5 </blockquote>
<blockquote class="tr_bq">
hitInSample <span class="s1">= </span><span class="s2">1 # </span>could be 0~5</blockquote>
<blockquote class="tr_bq">
hitInPop <span class="s1">= </span><span class="s2">13</span> </blockquote>
<blockquote class="tr_bq">
failInPop <span class="s1">= </span><span class="s2">52 </span><span class="s1">- </span>hitInPop </blockquote>
<blockquote class="tr_bq">
sampleSize <span class="s1">= </span><span class="s2">5</span></blockquote>
<ul>
<li><b>Test for under-representation (depletion)</b></li>
</ul>
<blockquote class="tr_bq">
<span class="s1">phyper(</span><span style="color: red;">hitInSample</span><span class="s1">, </span>hitInPop<span class="s1">, </span>failInPop<span class="s1">, </span>sampleSize<span class="s1">, </span>lower.tail<span class="s1">= </span><span class="s3">TRUE</span><span class="s1">);</span></blockquote>
<blockquote class="tr_bq">
## [1] 0.6329532</blockquote>
<blockquote class="tr_bq">
<span class="s1">fisher.test(matrix(c(</span>hitInSample<span class="s1">, </span>hitInPop<span class="s1">-</span>hitInSample<span class="s1">, </span>sampleSize<span class="s1">-</span>hitInSample<span class="s1">, </span>failInPop<span class="s1">-</span>sampleSize <span class="s1">+</span>hitInSample<span class="s1">), </span><span class="s2">2</span><span class="s1">, </span><span class="s2">2</span><span class="s1">), </span>alternative<span class="s1">=</span><span class="s4">'less'</span><span class="s1">)$</span>p.value<span class="s1">;</span> </blockquote>
<blockquote class="tr_bq">
## [1] 0.6329532</blockquote>
<ul>
<li><b>Test for over-representation (enrichment)</b></li>
</ul>
<blockquote class="tr_bq">
<span class="s1">phyper(</span><span style="color: red;">hitInSample<span class="s1">-</span><span class="s2">1</span></span><span class="s1">, </span>hitInPop<span class="s1">, </span>failInPop<span class="s1">, </span>sampleSize<span class="s1">, </span>lower.tail<span class="s1">= </span><span class="s3">FALSE</span><span class="s1">);</span></blockquote>
<blockquote class="tr_bq">
## [1] 0.7784664</blockquote>
<blockquote class="tr_bq">
<span class="s1">fisher.test(matrix(c(</span>hitInSample<span class="s1">, </span>hitInPop<span class="s1">-</span>hitInSample<span class="s1">, </span>sampleSize<span class="s1">-</span>hitInSample<span class="s1">, </span>failInPop<span class="s1">-</span>sampleSize <span class="s1">+</span>hitInSample<span class="s1">), </span><span class="s2">2</span><span class="s1">, </span><span class="s2">2</span><span class="s1">), </span>alternative<span class="s1">=</span><span class="s4">'greater'</span><span class="s1">)$</span>p.value<span class="s1">;</span> </blockquote>
<blockquote class="tr_bq">
## [1] 0.7784664</blockquote>
<ul>
<li> <b>Why </b><span style="color: red;">hitInSample-1</span> <b>when testing over-representation?</b></li>
</ul>
<blockquote class="tr_bq">
Because<b> </b><b><code></code></b>if <b><code>lower.tail </code></b>is TRUE (default), probabilities are
<i>P[X ≤ x]</i>, otherwise, <i>P[X > x]</i>. We subtract x by 1, when <i>P[X ≥ x]</i> is needed.<b> </b></blockquote>
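A quick way to convince yourself of the off-by-one is to sum the density directly with dhyper:
<pre>
sum(dhyper(1:5, 13, 39, 5))                    # P[X >= 1] = 0.7784664
phyper(1 - 1, 13, 39, 5, lower.tail = FALSE)   # same value: P[X > 0] equals P[X >= 1]
phyper(1,     13, 39, 5, lower.tail = FALSE)   # without the -1 this is P[X > 1], too small
</pre>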
<blockquote class="tr_bq">
</blockquote>
<hr />
<br />
<br />
<b>So does fisher.test have any advantages over phyper, given that they give the same p-values?</b><br />
Yes, fisher.test can do two other jobs: two-sided tests, and giving confidence intervals for the odds ratio. Please refer to its manual for details. For one-sided p-value calculation, there is no difference between them if the correct parameters are used.<br />
<br />
<script type="text/javascript">MathJax.Hub.Queue(["Typeset",MathJax.Hub]);</script>Anonymoushttp://www.blogger.com/profile/00601792748517163575noreply@blogger.com21