Principal component analysis

In this chapter we will discuss how to use the PCA method implemented in mdatools. We will also use the PCA examples to introduce some principles that are common to most of the other methods available in this package (e.g. PLS, SIMCA, PLS-DA, etc.). This includes such things as model and result objects, performance statistics for models and results, validation, different kinds of plots, and so on.

Principal component analysis is one of the methods that decompose a data matrix \(X\) into a combination of three matrices: \(X = TP^T + E\). Here \(P\) is a matrix of unit vectors defined in the original variable space. The unit vectors form a new basis, onto which all data points are projected. Matrix \(T\) contains the coordinates of the projections in the new basis, and the product of the two matrices, \(TP^T\), represents the coordinates of the projections in the original variable space. Matrix \(E\) contains the residuals, i.e. the differences between the positions of the projected data points and their original locations.

In PCA terminology, the unit vectors defining the new coordinate space are called loadings, and the coordinate axes oriented along the loadings are the principal components (PC). The coordinates of the data points projected onto the principal components are called scores.
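The following small sketch illustrates the decomposition using base R only. It does not use the mdatools API; the built-in `iris` data and the number of components are chosen purely for illustration.

```r
# a sketch of the decomposition X = TP' + E using base R only
# (not the mdatools API; the built-in iris data is used just for illustration)
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)

A <- 2                              # number of principal components to keep
P <- svd(X)$v[, 1:A, drop = FALSE]  # loadings: unit vectors in the original variable space
T <- X %*% P                        # scores: coordinates of the projections in the new basis
E <- X - tcrossprod(T, P)           # residuals: the part of X the A components do not capture

dim(T)                              # 150 x 2 (objects x components)
dim(P)                              # 4 x 2 (variables x components)
sum(E^2) / sum(X^2)                 # share of total variance left in the residuals
```

Here the scores are obtained by projecting the centered data onto the loadings, which for this decomposition is equivalent to taking them from the SVD directly.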

There are several other methods, such as Projection Pursuit (PP) and Independent Component Analysis (ICA), that work in a similar way and result in the data decomposition shown above. The principal difference among the methods is how they find the orientation of the unit vectors. PCA finds them as the directions of maximum variance of the data points; in addition, all PCA loadings are orthogonal to each other. PP and ICA use other criteria for the orientation of the basis vectors, and in ICA, for example, the vectors are not orthogonal.
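Continuing the sketch above, the orthogonality of PCA loadings can be checked directly: the cross-product of the loadings matrix is (up to rounding) the identity matrix.

```r
# the PCA loadings from the sketch above are orthonormal, so P'P = I
round(crossprod(P), 6)
```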

There are several algorithms for computing the PCA loadings, including Singular Value Decomposition (SVD) and Non-linear Iterative Partial Least Squares (NIPALS). Both are implemented in this package and can be selected using the `method` argument of the `pca()` function. By default SVD is used. In addition, one can use a randomized version of the two methods, which can be efficient if the data matrix contains a large number of rows. This is explained in the last part of this chapter.
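A minimal sketch of how this choice could look, assuming the `method` argument accepts the values `"svd"` and `"nipals"` and using the `people` dataset shipped with mdatools (check `?pca` for the exact argument names and values):

```r
library(mdatools)
data(people)

# default model: loadings are computed with SVD
m1 <- pca(people, ncomp = 5, scale = TRUE)

# the same model computed with NIPALS instead
# (the value "nipals" is an assumption here; see ?pca for the exact options)
m2 <- pca(people, ncomp = 5, scale = TRUE, method = "nipals")
```

Both algorithms produce essentially the same decomposition (up to sign and numerical precision); the choice mainly matters for speed on data of different sizes.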