# PCA(I)

*The goal of this tutorial is to provide both an intuitive feel for PCA and a thorough discussion of the topic.*

## INTRODUCTION AND INTUITION

Principal Component Analysis (PCA) is a standard tool in modern data analysis, used in fields as diverse as neuroscience and computer graphics, because it is a simple, non-parametric method for extracting relevant information from confusing data sets.

PCA is mainly used for dimensionality reduction. Its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data. If a multivariate dataset is visualised as a set of coordinates in a high-dimensional data space, PCA can supply the user with a lower-dimensional picture, a “shadow” of this object when viewed from its most informative viewpoint. The main objective of PCA is to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (i.e., uncorrelated with) the preceding components.

Mathematically, given a data matrix X in which every sample $\overrightarrow{X}$ is an m-dimensional vector, where m is the number of measurement types (variables), we want to project X into a lower-dimensional space while preserving as much information as possible. This can be achieved by finding an orthogonal basis that spans the column space of X. In particular, we choose the projection that minimizes the squared error in reconstructing the original data.
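To make the reconstruction-error criterion concrete, here is a minimal sketch on synthetic data. The data, the helper `reconstruction_error`, and the three candidate directions are all illustrative assumptions, not part of the original discussion. The sketch projects 2-dimensional samples onto a single direction, maps them back, and measures how much is lost.

```python
import numpy as np

rng = np.random.default_rng(4)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical 2-D data: large spread along the 45-degree diagonal,
# small spread across it. Columns of X are samples.
d_long = np.array([1.0, 1.0]) / np.sqrt(2.0)
d_short = np.array([-1.0, 1.0]) / np.sqrt(2.0)
X = (np.outer(d_long, rng.normal(0.0, 2.0, n))
     + np.outer(d_short, rng.normal(0.0, 0.5, n)))

def reconstruction_error(X, w):
    """Mean squared error after projecting each column of X onto the
    unit vector w and mapping the result back to the original space."""
    coords = w @ X               # 1-D coordinate of every sample along w
    X_hat = np.outer(w, coords)  # reconstruction from that single coordinate
    return np.sum((X - X_hat) ** 2) / X.shape[1]

# Compare three candidate projection directions (angles in degrees).
errors = {}
for angle in (0, 45, 90):
    theta = np.deg2rad(angle)
    w = np.array([np.cos(theta), np.sin(theta)])
    errors[angle] = reconstruction_error(X, w)

print(errors)
```

In this toy setup the 45-degree direction, along which the data varies most, should give the smallest error of the three candidates, which is the sense in which a good projection "preserves information".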

PCA makes one stringent but powerful assumption: linearity. Linearity vastly simplifies the problem by restricting the set of potential bases. With this assumption, PCA is limited to re-expressing the data as a linear combination of its basis vectors.

Let X be the original recorded data set and Y a new representation of that data set. Our goal is to find a linear transformation matrix P such that PX = Y, where the rows of P are a set of new basis vectors for expressing the columns of X. These row vectors will become the principal components of X.
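As a small illustration of the change of basis PX = Y, the sketch below uses a hand-picked orthonormal P (a 45-degree rotation) and a toy X. Both are assumptions chosen for readability; P is not derived from the data here.

```python
import numpy as np

# Hypothetical toy data: m = 2 measurement types (rows), n = 4 samples (columns).
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.0, 2.0, 3.0, 4.0]])

# Illustrative basis: each row of P is one new basis vector.
s = 1.0 / np.sqrt(2.0)
P = np.array([[ s, s],
              [-s, s]])

# Change of basis: column j of Y holds the coordinates of sample j
# expressed in the basis given by the rows of P.
Y = P @ X

# Entry Y[i, j] is the dot product of row i of P with column j of X.
y_00 = P[0] @ X[:, 0]

print(Y)
```

Because every sample in this toy X lies on the diagonal, the second row of Y is zero: in the new basis a single coordinate describes the data.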

Now comes the question: what does “best express the data” mean? To extract valuable information from a signal, the measurement noise in the data should be low. A common measure that quantifies noise relative to the signal is the signal-to-noise ratio (SNR):

$SNR = \frac{{\sigma}_{signal}^{2}}{{\sigma}_{noise}^{2}}$, where ${\sigma}^{2}$ is the variance.

A high SNR indicates a high-precision measurement, whereas a low SNR indicates very noisy data. Thus, by assumption, the dynamics of interest exist along the directions with the largest variance and presumably the highest SNR.
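The following sketch shows how the SNR can be read off as a ratio of variances along two directions. It assumes synthetic 2-dimensional data in which the signal and noise directions (`u` and `v`, both hypothetical names) are known in advance, which is not the case for real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 5000

# Assumed directions: signal along u, noise along the perpendicular v.
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
v = np.array([-1.0, 1.0]) / np.sqrt(2.0)

a = rng.normal(0.0, 3.0, size=n)   # large spread along u (signal)
b = rng.normal(0.0, 0.3, size=n)   # small spread along v (noise)
X = np.outer(u, a) + np.outer(v, b)  # shape (2, n), columns are samples

# Variance of the data projected onto each direction.
var_signal = np.var(u @ X)
var_noise = np.var(v @ X)
snr = var_signal / var_noise

print(snr)
```

With standard deviations of 3 and 0.3, the ratio comes out on the order of 100; the direction of largest variance is the one carrying the signal.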

Another key factor is redundancy in the data. For example, in the 2-dimensional case with two arbitrary measurement types ${r}_{1}$ and ${r}_{2}$, was it really necessary to record both variables, or would it have been more meaningful to record just one? Can we predict ${r}_{1}$ from ${r}_{2}$, or vice versa, using the best-fit line? This is the central idea behind dimensionality reduction.
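A minimal sketch of the 2-dimensional redundancy check, assuming synthetic measurements in which `r1` is almost a linear function of `r2`. The variable `r3` is an extra, unrelated measurement added only for contrast.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
n = 200

# Hypothetical measurements: r1 is nearly determined by r2.
r2 = rng.uniform(0.0, 10.0, size=n)
r1 = 2.0 * r2 + rng.normal(0.0, 0.1, size=n)

# Best-fit line r1 ~ slope * r2 + intercept (least squares).
slope, intercept = np.polyfit(r2, r1, deg=1)

# Quality of the fit, measured here by the correlation coefficient.
corr_redundant = np.corrcoef(r1, r2)[0, 1]

# For contrast: a measurement that has nothing to do with r2.
r3 = rng.normal(size=n)
corr_unrelated = np.corrcoef(r3, r2)[0, 1]

print(slope, corr_redundant, corr_unrelated)
```

A correlation close to 1 means one variable can be predicted from the other, so recording both adds little; a correlation near 0 means neither variable can stand in for the other.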

Identifying redundancy is easy in the 2-dimensional case (find the slope of the best-fit line and judge the quality of the fit). In higher dimensions it can be done by computing the covariances between the variables and defining the covariance matrix of X as ${C}_{X}=\frac {1}{n}X{X}^{T}$, where n is the number of samples and each row of X (one measurement type) is assumed to have zero mean. The diagonal elements of ${C}_{X}$ are the variances of the individual measurement types, where large values denote interesting structure. The off-diagonal elements are the covariances between different measurement types, where large magnitudes correspond to high redundancy. We want to transform ${C}_{X}$ into the covariance matrix ${C}_{Y}$ of the new representation Y so as to achieve the following two goals:

• minimize redundancy, which is measured by the magnitude of the covariance, and
• maximize the signal, which is measured by the variance.
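The covariance matrix ${C}_{X}$ can be computed directly from its definition. The sketch below assumes synthetic data with three measurement types (rows), two of which are deliberately redundant, and it centers each row first, since the formula $\frac{1}{n}X{X}^{T}$ only yields covariances for zero-mean rows.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the run is repeatable
n = 1000

# Hypothetical measurements: r2 is strongly redundant with r1, r3 is unrelated.
r1 = rng.normal(size=n)
r2 = 0.9 * r1 + 0.1 * rng.normal(size=n)
r3 = rng.normal(size=n)
X = np.vstack([r1, r2, r3])  # shape (m, n) = (3, 1000)

# Subtract the mean of each measurement type (row).
X = X - X.mean(axis=1, keepdims=True)

# Covariance matrix as defined in the text.
C_X = (X @ X.T) / n

print(np.round(C_X, 3))
```

The diagonal of `C_X` holds the three variances, the (r1, r2) entry is large, and the entries involving r3 are close to zero. The same matrix is returned by `np.cov(X, bias=True)`, which also normalizes by n.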

This can be achieved by diagonalizing ${C}_{Y}$, i.e., making all off-diagonal elements zero (so that Y is decorrelated), and by rank-ordering each successive dimension of Y according to variance.

While there are many ways of diagonalizing ${C}_{Y}$, PCA selects the easiest one: it assumes that all basis vectors {${p}_{1}$, ${p}_{2}$, …, ${p}_{m}$} are orthonormal. Therefore, P is an orthonormal matrix. The true benefit of this assumption is that there exists an efficient, analytical solution to this problem.
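As a preview of what such a solution can look like, the sketch below uses one standard approach: taking the rows of P to be the eigenvectors of the symmetric matrix ${C}_{X}$, sorted by decreasing eigenvalue. The data is synthetic, and the choice of `np.linalg.eigh` is an assumption made for illustration rather than the method prescribed by this tutorial.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical correlated 2-D data; columns of X are samples.
r1 = rng.normal(size=n)
r2 = 0.8 * r1 + 0.3 * rng.normal(size=n)
X = np.vstack([r1, r2])
X = X - X.mean(axis=1, keepdims=True)  # zero-mean rows

C_X = (X @ X.T) / n

# eigh handles symmetric matrices: eigenvalues come back in ascending
# order, eigenvectors are the columns of the second return value.
eigvals, eigvecs = np.linalg.eigh(C_X)
order = np.argsort(eigvals)[::-1]  # largest variance first
P = eigvecs[:, order].T            # rows of P are orthonormal basis vectors

# New representation and its covariance matrix.
Y = P @ X
C_Y = (Y @ Y.T) / n

print(np.round(C_Y, 6))
```

Here `C_Y` is diagonal up to floating-point error, its diagonal entries are the eigenvalues of `C_X` in decreasing order, and `P @ P.T` is the identity, so both goals listed above are met for this example.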