Principal component analysis
In this chapter we will discuss how to use the PCA method implemented in mdatools. Besides that, we will use PCA examples to introduce some principles which are common for most of the other methods available in this package (e.g. PLS, SIMCA, PLS-DA, etc.). This includes such things as model and result objects, getting and visualizing performance statistics, validation, the use of different kinds of plots, and so on.
Principal component analysis is one of the methods that decompose a data matrix \(\mathbf{X}\) into a combination of three matrices: \(\mathbf{X} = \mathbf{TP}^\mathrm{T} + \mathbf{E}\). Here \(\mathbf{P}\) is a matrix of unit vectors defined in the original variable space. The unit vectors, also known as loadings, form a new basis, the principal components. The components are mutually orthogonal and oriented in the variable space so that they capture the directions of maximum variation of the data points.
The data points are projected onto the principal components. The coordinates of these projections in the principal component space, known as scores, form the matrix \(\mathbf{T} = \mathbf{XP}\). The product of scores and loadings, \(\mathbf{TP}^\mathrm{T}\), gives the coordinates of the projections in the original variable space. Matrix \(\mathbf{E}\) contains the residuals: the differences between the positions of the projected data points and their original locations. This difference is the part of the data variation which the PCA model does not capture, hence the name.
If the original data matrix has \(I\) rows (observations, objects) and \(J\) variables, and the PCA decomposition is made with \(A\) components, then matrix \(\mathbf{P}\) will have dimension \(J\times A\), matrix \(\mathbf{T}\) will have dimension \(I\times A\), and \(\mathbf{TP}^\mathrm{T}\) and \(\mathbf{E}\) will have the same dimension as the original data matrix, \(\mathbf{X}\).
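To make the decomposition and the matrix dimensions more tangible, here is a minimal sketch in base R (using svd() directly rather than mdatools); the Iris data and the number of components are picked just for illustration, and the names X, P, T and E mirror the notation above.

```r
# mean centered data matrix: I = 150 objects, J = 4 variables
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
A <- 2  # number of components for the decomposition

# loadings: unit vectors in the original variable space (J x A)
P <- svd(X)$v[, 1:A, drop = FALSE]

# scores: coordinates of the projections in PC space (I x A)
T <- X %*% P

# residuals: part of the data variation the model does not capture (I x J)
E <- X - T %*% t(P)

dim(P)  # 4 x 2
dim(T)  # 150 x 2
dim(E)  # 150 x 4, same as X
```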
The relationship between the data objects and the principal component space (the PCA model) can be described using two distances: the orthogonal distance (OD) and the score distance (SD); in some literature they are also called residual distances. The orthogonal distance is the squared Euclidean distance between the position of an object and its projection in the original variable space. It can be computed by taking the sum of squared values of matrix \(\mathbf{E} = \{e_{ij}\}\) along every row:
\[q_i = \sum_{j=1}^{J} e_{ij}^2\]
This distance is usually denoted as \(Q\) or \(q\). It can be considered a lack-of-fit measure for a particular object.
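Continuing the sketch above, the orthogonal distances can be obtained from the residual matrix E with a single line (again a base R illustration, not the mdatools implementation):

```r
# q_i = sum of squared residuals along every row of E
q <- rowSums(E^2)
head(q)
```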
The score distance is the squared Mahalanobis distance between the projection of an object and the origin. It is a measure of the extremeness of an object and is usually denoted as \(h\) or \(T^2\). The latter is used because the Hotelling \(T^2\) distribution is often employed to describe the distribution of the score distance values, so in many software packages this distance is called the Hotelling \(T^2\) distance. The distance can be computed using standardized scores (so that the score values for every component have unit variance):
\[h_i = \sum_{a = 1}^{A} \frac{t_{ia}^2}{\lambda_a} \]
Here \(\lambda_a\) are the eigenvalues of the principal components, which correspond to the variances of the corresponding scores.
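A corresponding sketch for the score distances, reusing the scores T from the first example; here the eigenvalues are simply taken as the variances of the score columns:

```r
# eigenvalues = variances of the scores for each component
lambda <- apply(T, 2, var)

# h_i = sum over components of t_ia^2 / lambda_a
h <- rowSums(sweep(T^2, 2, lambda, "/"))
head(h)
```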
Both the score and the orthogonal distance are important statistics for assessing how well objects are described by a PCA model. They can be assessed visually using the Distance plot, a scatter plot where the orthogonal distance is plotted against the score distance for a particular number of components. In mdatools this plot is called the Residuals plot for historical reasons.
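mdatools produces this plot itself; just to illustrate the idea, the q and h values computed in the sketches above can be plotted against each other with base R graphics:

```r
# orthogonal distance vs score distance, one point per object
plot(h, q, xlab = "Score distance, h", ylab = "Orthogonal distance, q")
```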
Both distances can be described using theoretical distributions. This helps to identify regular and extreme objects as well as outliers. All details will be shown later in this tutorial.
There are several other methods that decompose a data matrix in a way similar to PCA, for example Projection Pursuit (PP) and Independent Component Analysis (ICA). The main difference among the methods is how they find the orientation of the unit vectors. PCA finds them as the directions of maximum variance of the data points and, in addition, all PCA loadings are orthogonal to each other. PP and ICA use other criteria for the orientation of the basis vectors; in ICA, for example, the vectors are not orthogonal.
There are several methods to compute PCA loadings, including Singular Value Decomposition (SVD) and Non-linear Iterative Partial Least Squares (NIPALS). Both methods are implemented in this package and can be selected using the method argument of the pca() function. By default SVD is used. In addition, one can use a randomized version of the two methods, which can be efficient if the data matrix contains a large number of rows. This is explained in the last part of this chapter.
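As a sketch of how the algorithm can be selected (the data matrix here is just the Iris columns again, and the exact value "nipals" for the method argument is an assumption made for illustration):

```r
library(mdatools)

x <- as.matrix(iris[, 1:4])

# SVD is used by default
m1 <- pca(x, ncomp = 3)

# select NIPALS instead via the method argument
m2 <- pca(x, ncomp = 3, method = "nipals")
```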