Models and results

In mdatools, any method for data analysis, such as PCA, PLS regression, SIMCA classification and so on, can create two types of objects — a model and a result. Every time you build a model you get a model object. Every time you apply the model to a dataset you get a result object. Thus for PCA, the objects have classes pca and pcares correspondingly.

Each object includes a list with object’s properties (e.g. loadings for model, scores and explained variance for result) and provides a number of methods for visualization and exploration.

Model calibration

Let’s see how this works using a simple example — People data. We already used this data when was playing with plots, it consists of 32 objects (persons from Scandinavian and Mediterranean countries, 16 male and 16 female) and 12 variables (height, weight, shoesize, annual income, beer and wine consumption and so on.). More information about the data can be found using ?people. We will first load the data matrix:

library(mdatools)
data(people)

Now let’s calibrate the model:

m = pca(people, 7, scale = TRUE, info = "People PCA model")
m = selectCompNum(m, 5)

Here pca is a function that builds (calibrates) a PCA model and returns the model object. In terms of programming we can call function pca() a constructor of pca model object. The constructor has one mandatory argument — matrix or data frame with data values. Other arguments are optional as they all have default values. In this case we use three additional: 7 is maximum number of components, scale = TRUE tells that data values should be standardized (centering is a default option) and info allows to add a short text with information about the model which can be shown later.

Function selectCompNum() allows to select an “optimal” number of components for the model. In our case we calibrate model with 7 principal components however, e.g. after investigation of explained variance, we found out that 5 components is optimal. In this case we have two choices. Either recalibrate the model using 5 components or use the model that is calibrated already but “tell” the model that 5 components is the optimal number. In this case the model will keep all results calculated for 7 components but will use optimal number of components when necessary. For example, when showing distance plot for the model. Or when PCA model is used in SIMCA classification.

Function print prints the model object info:

print(m)
## 
## PCA model (class pca)
## 
## 
## Call:
## pca(x = people, ncomp = 7, scale = TRUE, info = "People PCA model")
## 
## Major model fields:
## $loadings - matrix with loadings
## $eigenvals - eigenvalues for components
## $ncomp - number of calculated components
## $ncomp.selected - number of selected components
## $center - values for centering data
## $scale - values for scaling data
## $alpha - significance level for critical limits
## $gamma - significance level for outlier limits
## $Qlim - critical values and parameters for orthogonal distances
## $T2lim - critical values and parameters for score distances
## $res - list with model results (calibration, test)

As you can see, there are no scores, explained variance values, distances and so on. Because they actually are not part of a PCA model, they are results of applying the model to a calibration set. But loadings, eigenvalues, number of calculated and selected principal components, vectors for centering and scaling the data, significance levels and corresponding critical limits for distances — all are parameters of the model.

Here is an example how to get values from loading matrix having the model object:

m$loadings[1:4, 1:4]
##              Comp 1     Comp 2      Comp 3      Comp 4
## Height   -0.3752858 -0.1354585 -0.07237117  0.07431853
## Weight   -0.3811357 -0.1114472 -0.06810860 -0.03313762
## Hairleng  0.3377735  0.1501634  0.07901752  0.11431601
## Shoesize -0.3776970 -0.1508061 -0.00127806  0.06569227

One can also notice that the model object has a particular field — res, which is in fact a list of PCA result objects containing results of applying the model to the calibration set (cal) and (if available) validation/test set (test). In case of regression and classification res can also contain object with cross-validation results (cv). For example here is the description of object with calibration results:

print(m$res$cal)
## 
## Results for PCA decomposition (class pcares)
## 
## Major fields:
## $scores - matrix with score values
## $T2 - matrix with T2 distances
## $Q - matrix with Q residuals
## $ncomp.selected - selected number of components
## $expvar - explained variance for each component
## $cumexpvar - cumulative explained variance

And if we want to look at scores, here is the way:

m$res$cal$scores[1:4, 1:4]
##           Comp 1    Comp 2     Comp 3     Comp 4
## Lars   -5.332528 0.6765698  1.0673076  1.1008056
## Peter  -3.114319 0.2934016 -0.6707091 -1.3099103
## Rasmus -2.996848 0.3604689 -0.2118207 -1.1169333
## Lene    1.084240 1.8449249 -0.4091936 -0.1228956

Both model and result objects also have related functions (methods), first of all for visualizing various values (e.g. scores plot, loadings plot, etc.). Some of the functions will be discussed later in this chapter, a full list can be found in help for a proper constructor (e.g. ?pca).

The result object is also created every time you apply a model to a new data. Like in many built-in R methods, method predict() is used in this case. The first argument of the method is always a model object. Here is a PCA example (assuming we have already built the model):

# create data matrix with three new measurements
Xt = rbind(
   c(180, 80, -1, 44, 35, 35000, 200, 100, -1, 96, -1, 105),
   c(170, 67,  1, 40, 41, 29500, 100, 200,  1, 94,  1, 105),
   c(175, 70,  1, 41, 28, 21500, 110, 120,  1, 97,  1, 115)
)

# project the new data to the previous PCA model and show the result object
res = predict(m, Xt)
print(res)
## 
## Results for PCA decomposition (class pcares)
## 
## Major fields:
## $scores - matrix with score values
## $T2 - matrix with T2 distances
## $Q - matrix with Q residuals
## $ncomp.selected - selected number of components
## $expvar - explained variance for each component
## $cumexpvar - cumulative explained variance