Principal Component Analysis
The first step is mean normalization and feature scaling, which puts the features on a comparable scale. Each feature \(x_j\) is replaced by \((x_j - \mu_j)/\sigma_j\), where

\[\mu_j = \frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}, \qquad \sigma^2_j = \frac{1}{m - 1}\sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2\]
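As a minimal sketch of this normalization step (the data matrix here is made up for illustration):

```python
import numpy as np

# Hypothetical data matrix: m samples (rows) x n features (columns).
X = np.array([[2.0, 40.0],
              [4.0, 60.0],
              [6.0, 80.0]])

mu = X.mean(axis=0)               # per-feature mean
sigma = X.std(axis=0, ddof=1)     # per-feature std, with the 1/(m-1) normalization

X_norm = (X - mu) / sigma         # mean-normalized, scaled features
```

After this step every feature has zero mean and unit variance, so no single feature dominates the covariance computation that follows.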
Let the given data contain n features, and suppose we want to reduce the dimensionality to k with minimum information loss. The task is to find a hyperplane of dimension k such that the squared distance between the datapoints X and the hyperplane is minimized. Once that hyperplane is found, the projection theorem gives the projection of each datapoint onto it.
Covariance Matrix
To measure the similarity between two vectors x1 and x2, we take their dot product. Because we subtracted the mean from the input data, the dot product (scaled by 1/(m-1)) also corresponds to the covariance between the two features.
Consider the matrix A whose columns are the mean-centered feature vectors x1, x2, ..., xn. The covariance between x1 and x2 is given by

\[\mathrm{Cov}(x_1, x_2) = \frac{1}{m - 1} x_1^{T} x_2\]

We want to find the covariance between all pairs of vectors. The simplest way to do this is to take

\[\Sigma = \frac{1}{m - 1} A^{T} A\]
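The covariance computation above can be sketched in NumPy, assuming A is a mean-centered data matrix with samples as rows and features as columns (the random data is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))    # 100 samples, 3 features
A = A - A.mean(axis=0)           # mean-center each feature

m = A.shape[0]
cov = A.T @ A / (m - 1)          # n x n covariance matrix

# Sanity check against NumPy's built-in estimator
# (rowvar=False means columns are treated as the variables).
assert np.allclose(cov, np.cov(A, rowvar=False))
```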
Now we have the covariance between every pair of vectors. The challenge is to find the direction of highest variance. Fortunately, eigenvalues and eigenvectors come to the rescue: an eigenvector is a vector whose direction does not change under the transformation.
To understand more about eigenvalues and eigenvectors, refer to the MIT linear algebra document.
By calculating the eigenvalues and eigenvectors of the covariance matrix \(\Sigma\), we get the directions that remain unchanged when a vector is transformed by \(\Sigma\). The eigenvectors are linearly independent and can serve as basis vectors for the input space. If we choose the k eigenvectors with the largest eigenvalues, they represent the k directions of maximum variance. The input data is then projected onto the hyperplane spanned by these k vectors (let the matrix of the k vectors be U).
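The eigendecomposition and projection can be sketched as follows (random data for illustration; `np.linalg.eigh` is used because the covariance matrix is symmetric and it returns eigenvalues in ascending order):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)           # mean-center

cov = X.T @ X / (X.shape[0] - 1)

# eigh: eigendecomposition for symmetric matrices,
# eigenvalues returned in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

k = 2
U = eigvecs[:, -k:][:, ::-1]     # top-k eigenvectors, largest eigenvalue first

Z = X @ U                        # k-dimensional representation
X_approx = Z @ U.T               # reconstruction back in the original space
```

Because the eigenvectors are orthonormal, projecting and reconstructing amounts to two matrix multiplications.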
To choose k, calculate the average squared distance between the original dataset and its dimensionality-reduced reconstruction, and divide it by the total variance in the dataset. Choose the smallest value of k that satisfies the following condition:

\[\frac{\frac{1}{m} \sum_{i = 1}^{m}\lVert x^{(i)} - x^{(i)}_{approx}\rVert^2}{\frac{1}{m} \sum_{i = 1}^{m}\lVert x^{(i)}\rVert^2} \leq 0.01\]

i.e. 99% of the variance is retained.
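A sketch of this selection rule, using the fact that for mean-centered data the reconstruction-error ratio equals the fraction of total variance carried by the discarded eigenvalues (the function name `choose_k` is my own):

```python
import numpy as np

def choose_k(X, threshold=0.01):
    """Smallest k whose reconstruction-error ratio is <= threshold."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    eigvals, _ = np.linalg.eigh(cov)
    eigvals = eigvals[::-1]                  # descending order
    total = eigvals.sum()
    for k in range(1, len(eigvals) + 1):
        # Fraction of variance lost by discarding all but the top k.
        if eigvals[k:].sum() / total <= threshold:
            return k
    return len(eigvals)
```

Working with eigenvalue sums avoids explicitly reconstructing the data for every candidate k.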
Final note: we run PCA only on the training set, never on the validation or test sets; those are simply projected onto the basis found by PCA.
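A minimal sketch of this train/test discipline, assuming the same NumPy-based setup as above (the data is made up): the mean and the basis U come from the training set only, and the test set reuses both.

```python
import numpy as np

rng = np.random.default_rng(3)
X_train = rng.normal(size=(100, 4))
X_test = rng.normal(size=(20, 4))

# Fit: statistics and basis come from the training set only.
mu = X_train.mean(axis=0)
Xc = X_train - mu
cov = Xc.T @ Xc / (Xc.shape[0] - 1)
_, eigvecs = np.linalg.eigh(cov)
U = eigvecs[:, -2:]              # k = 2

# Transform: the test set is centered with the *training* mean
# and projected onto the training basis.
Z_test = (X_test - mu) @ U
```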