Models and results

In mdatools, any method for data analysis, such as PCA, PLS regression, SIMCA classification and so on, can create two types of objects — a model and a result. Every time you build a model you get a model object. Every time you apply the model to a dataset you get a result object. Thus for PCA, the objects have classes pca and pcares correspondingly.

Each object includes a list with object’s properties (e.g. loadings for model, scores and explained variance for result) and provides a number of methods for visualization and exploration.

Model calibration

Let’s see how this works using a simple example — People data. We already used this data when was playing with plots, it consists of 32 objects (persons from Scandinavian and Mediterranean countries, 16 male and 16 female) and 12 variables (height, weight, shoesize, annual income, beer and wine consumption and so on.). More information about the data can be found using ?people. We will first load the data matrix and split it into two subsets as following:

library(mdatools)
data(people)

idx = seq(4, 32, 4)
Xc = people[-idx, ]
Xt = people[idx, ]

So Xc is our calibration subset we are going to use for creating a PCA model and Xt is a subset we will apply the calibrated model to. Now let’s calibrate the model:

m = pca(Xc, 7, scale = TRUE, info = "People PCA model")
m = selectCompNum(m, 5)

Here pca is a function that builds (calibrates) a PCA model and returns the model object. In terms of programming we can call function pca() a constructor of pca model object. The constructor has one mandatory argument — matrix or data frame with data values. Other arguments are optional as they all have default values. In this case we use three additional: 7 is maximum number of components, scale = TRUE tells that data values should be standardized (centering is a default option) and info allows to add a short text with information about the model which can be shown later.

Function selectCompNum() allows to select an “optimal” number of components for the model. In our case we calibrate model with 7 principal components however, e.g. after investigation of explained variance, we found out that 5 components is optimal. In this case we have two choices. Either recalibrate the model using 5 components or use the model that is calibrated already but “tell” the model that 5 components is the optimal number. In this case the model will keep all results calculated for 7 components but will use optimal number of components when necessary. For example, when showing distance plot for the model. Or when PCA model is used in SIMCA classification.

Function print prints the model object info:

print(m)
## 
## PCA model (class pca)
## 
## 
## Call:
## pca(x = Xc, ncomp = 7, scale = TRUE, info = "People PCA model")
## 
## Major model fields:
## $loadings - matrix with loadings
## $eigenvals - eigenvalues for components
## $ncomp - number of calculated components
## $ncomp.selected - number of selected components
## $center - values for centering data
## $scale - values for scaling data
## $alpha - significance level for critical limits
## $gamma - significance level for outlier limits
## $Qlim - critical values and parameters for orthogonal distances
## $T2lim - critical values and parameters for score distances
## $res - list with model results (calibration, test)

As you can see, there are no scores, explained variance values, distances and so on. Because they actually are not part of a PCA model, they are results of applying the model to a calibration set. But loadings, eigenvalues, number of calculated and selected principal components, vectors for centering and scaling the data, significance levels and corresponding critical limits for distances — all are parameters of the model.

Here is an example how to get values from loading matrix having the model object:

m$loadings[1:4, 1:4]
##              Comp 1      Comp 2      Comp 3      Comp 4
## Height   -0.3792846  0.08004057 -0.06676611  0.04512380
## Weight   -0.3817929  0.08533809 -0.08527883 -0.04051629
## Hairleng  0.3513874 -0.22676635 -0.02273504  0.01575716
## Shoesize -0.3776985  0.12503739 -0.02117369  0.09929010

One can also notice that the model object has a particular field — res, which is in fact a list of PCA result objects containing results of applying the model to the calibration set (cal) and (if available) test set (test). In case of regression and classification res also contains object with cross-validation results (cv). For example here is the description of object with calibration results:

print(m$res$cal)
## 
## Results for PCA decomposition (class pcares)
## 
## Major fields:
## $scores - matrix with score values
## $T2 - matrix with T2 distances
## $Q - matrix with Q residuals
## $ncomp.selected - selected number of components
## $expvar - explained variance for each component
## $cumexpvar - cumulative explained variance

And if we want to look at scores, here is the way:

m$res$cal$scores[1:4, 1:4]
##           Comp 1     Comp 2     Comp 3      Comp 4
## Lars   -5.108742 -1.2714943  1.0765871  1.08910438
## Peter  -3.021811 -0.3163758 -0.2958259 -1.36053121
## Rasmus -2.887335 -0.4428721  0.1231706 -1.15070563
## Mette   1.116457 -1.3716444 -1.6344512 -0.03803356

Both model and result objects also have related functions (methods), first of all for visualizing various values (e.g. scores plot, loadings plot, etc.). Some of the functions will be discussed later in this chapter, a full list can be found in help for a proper constructor (e.g. ?pca).

The result object is also created every time you apply a model to a new data. Like in many built-in R methods, method predict() is used in this case. The first argument of the method is always a model object. Here is a PCA example (assuming we have already built the model):

res = predict(m, Xt)
print(res)
## 
## Results for PCA decomposition (class pcares)
## 
## Major fields:
## $scores - matrix with score values
## $T2 - matrix with T2 distances
## $Q - matrix with Q residuals
## $ncomp.selected - selected number of components
## $expvar - explained variance for each component
## $cumexpvar - cumulative explained variance

Model validation

PCA model can be validated using only test set validation, while for supervised model, e.g. PLS or SIMCA, you can use cross-validation as well. The validation results are, of course, represented by result objects, which are fields of a model object similar to cal, but with names test.

Here is how to build a PCA model with test set validation (we will use Xt as test data):

m = pca(Xc, 7, scale = TRUE, x.test = Xt, info = "PCA model")
m = selectCompNum(m, 5)

And here is the info for both result objects:

print(m$res$cal)
## 
## Results for PCA decomposition (class pcares)
## 
## Major fields:
## $scores - matrix with score values
## $T2 - matrix with T2 distances
## $Q - matrix with Q residuals
## $ncomp.selected - selected number of components
## $expvar - explained variance for each component
## $cumexpvar - cumulative explained variance
print(m$res$test)
## 
## Results for PCA decomposition (class pcares)
## 
## Major fields:
## $scores - matrix with score values
## $T2 - matrix with T2 distances
## $Q - matrix with Q residuals
## $ncomp.selected - selected number of components
## $expvar - explained variance for each component
## $cumexpvar - cumulative explained variance

Let us compare, for example, the explained variance values for the results:

var = data.frame(
   cal = m$res$cal$expvar,
   test = m$res$test$expvar
)
show(round(var, 1))
##         cal test
## Comp 1 54.2 44.8
## Comp 2 20.3 17.2
## Comp 3 13.1 17.0
## Comp 4  7.9  8.0
## Comp 5  2.3  4.4
## Comp 6  1.1  2.4
## Comp 7  0.5  0.7

Every model and every result object has a method summary(), which shows some statistics for evaluation of a model performance. Here are some examples.

# summary for whole model
summary(m)
## 
## Summary for PCA model (class pca)
## Type of limits: ddmoments
## Alpha: 0.05
## Gamma: 0.01
## 
##        Eigenvals Expvar Cumexpvar Nq Nh
## Comp 1     6.509  54.24     54.24 10  2
## Comp 2     2.434  20.28     74.52  3  9
## Comp 3     1.572  13.10     87.62  2 14
## Comp 4     0.946   7.88     95.51  5 10
## Comp 5     0.272   2.27     97.77  3  9
## Comp 6     0.137   1.14     98.92  3 11
## Comp 7     0.058   0.48     99.39  2 13
# summary for calibration results
summary(m$res$cal)
## 
## Summary for PCA results
## 
## Selected components: 5
## 
##        Expvar Cumexpvar
## Comp 1  54.24     54.24
## Comp 2  20.28     74.52
## Comp 3  13.10     87.62
## Comp 4   7.88     95.51
## Comp 5   2.27     97.77
## Comp 6   1.14     98.92
## Comp 7   0.48     99.39

The same methodology is used for any other method, e.g. PLS or SIMCA. In the next section we will look at how to use plotting functions for models and results.