Models and results

In mdatools, any method for data analysis, such as PCA, PLS regression, SIMCA classification and so on, can create two types of objects — a model and a result. Every time you build a model you get a model object. Every time you apply the model to a dataset you get a result object. Thus for PCA, the objects have classes pca and pcares correspondingly.

Each object includes a list with variables (e.g. loadings for model, scores and explained variance for result) and provides a number of methods for visualisation and exploration.

Model calibration

Let’s see how this works using a simple example — People data. We already used this data when was playing with plots, it consists of 32 objects (persons from Scandinavian and Mediterranean countries, 16 male and 16 female) and 12 variables (height, weight, shoesize, annual income, beer and wine consumption and so on.). More information about the data can be found using ?people. We will first load the data matrix and split it into two subsets as following:

library(mdatools)
data(people)

idx = seq(4, 32, 4)
X.c = people[-idx, ]
X.t = people[idx, ]

So X.c is our calibration subset we are going to use for creating a PCA model and X.t is a subset we will apply the calibrated model to. Now let’s calibrate the model and show an information about the model object:

m = pca(X.c, 7, scale = T, info = "People PCA model")
m = selectCompNum(m, 5)

Here pca is a function that builds (calibrates) a PCA model and returns the model object.

Function selectCompNum allows to select an “optimal” number of components for the model. In our case we calibrate model with 7 principal components (second argument for the function pca()) however, e.g. after investigation of explained variance, we found out that 5 components is optimal. In this case we have two choices. Either recalibrate the model using 5 components or use the model that is calibrated already but “tell” the model that 5 components is the optimal number. In this case the model will keep all results calculated for 7 components but will use optimal number of components when necessary. For example, when showing residuals plot for the model. Or when PCA model is used in SIMCA classification.

Finally, function print prints the model object info:

print(m)
##
## PCA model (class pca)
##
##
## Call:
## pca(x = X.c, ncomp = 7, scale = T, info = "People PCA model")
##
## Major fields:
## $loadings - matrix with loadings ##$eigenvals - eigenvalues for components
## $ncomp - number of calculated components ##$ncomp.selected - number of selected components
## $center - values for centering data ##$scale - values for scaling data
## $cv - number of segments for cross-validation ##$alpha - significance level for Q residuals
## $calres - results (scores, etc) for calibration set As you can see, there are no scores, explained variance values, residuals and so on. Because they actually are not part of a PCA model, they are results of applying the model to a calibration set. But loadings, eigenvalues, number of calculated and selected principal components, vectors for centering and scaling the data, number of segments for cross-validation (if used) and significance levels are the model fields: m$loadings[1:4, 1:4]
##              Comp 1      Comp 2      Comp 3      Comp 4
## Height   -0.3792846  0.08004057 -0.06676611  0.04512380
## Weight   -0.3817929  0.08533809 -0.08527883 -0.04051629
## Hairleng  0.3513874 -0.22676635 -0.02273504  0.01575716
## Shoesize -0.3776985  0.12503739 -0.02117369  0.09929010

One can also notice that the model object has a particular field — calres, which is in fact a PCA result object containing results of applying the model to the calibration set. If we look at the object description we will get the following:

print(m$calres) ## ## Results for PCA decomposition (class pcares) ## ## Major fields: ##$scores - matrix with score values
## $T2 - matrix with T2 distances ##$Q - matrix with Q residuals
## $ncomp.selected - selected number of components ##$expvar - explained variance for each component
## $cumexpvar - cumulative explained variance And if we want to look at scores, here is the way: m$calres$scores[1:4, 1:4] ## Comp 1 Comp 2 Comp 3 Comp 4 ## Lars -5.108742 -1.2714943 1.0765871 1.08910438 ## Peter -3.021811 -0.3163758 -0.2958259 -1.36053121 ## Rasmus -2.887335 -0.4428721 0.1231706 -1.15070563 ## Mette 1.116457 -1.3716444 -1.6344512 -0.03803356 Both model and result objects also have related functions (methods), first of all for visualizing various values (e.g. scores plot, loadings plot, etc.). Some of the functions will be discussed later in this chapter, a full list can be found in help for a proper method. The result object is also created every time you apply a model to a new data. Like in many built-in R methods, method predict() is used in this case. The first argument of the method is always a model object. Here is a PCA example (assuming we have already built the model): res = predict(m, X.t) print(res) ## ## Results for PCA decomposition (class pcares) ## ## Major fields: ##$scores - matrix with score values
## $T2 - matrix with T2 distances ##$Q - matrix with Q residuals
## $ncomp.selected - selected number of components ##$expvar - explained variance for each component
## $cumexpvar - cumulative explained variance Model validation Any model can be validated with cross-validation or/and test set validation. The validation results are, of course, represented by result objects, which are fields of a model object similar to calres, but with names cvres and testres correspondingly. Here is how to build a PCA model with full cross-validation and test set validation (we will use X.t as test data) at the same time: m = pca(X.c, 7, scale = T, cv = 1, x.test = X.t, info = "PCA model") m = selectCompNum(m, 5) Parameter cv specifies options for cross-validation. If a numeric value is provided then it will be used as number of segments for random cross-validation, e.g. if cv = 2 cross-validation with two segments will be used. For full cross-validation use cv = 1 like we did in the example above (this is perhaps a bit misleading, but I keep this option for compatability). For more advanced option you can provide a list with name of cross-validation method, number of segments and number of iterations, e.g. cv = list('rand', 4, 4) for running random cross-validation with four segments and four repetitions. And here is the model object info: print(m) ## ## PCA model (class pca) ## ## ## Call: ## pca(x = X.c, ncomp = 7, scale = T, cv = 1, x.test = X.t, info = "PCA model") ## ## Major fields: ##$loadings - matrix with loadings
## $eigenvals - eigenvalues for components ##$ncomp - number of calculated components
## $ncomp.selected - number of selected components ##$center - values for centering data
## $scale - values for scaling data ##$cv - number of segments for cross-validation
## $alpha - significance level for Q residuals ##$calres - results (scores, etc) for calibration set
## $cvres - results for cross-validation ##$testres - results for test set

As you can see we have all three types of results now — calibration (calres), cross-validation (cvres) and test set validation (testres). Let us compare, for example, the explained variance values for the results:

var = data.frame(cal = m$calres$expvar, cv = m$cvres$expvar, test = m$testres$expvar)
show(round(var, 1))
##         cal   cv test
## Comp 1 54.2 43.1 44.8
## Comp 2 20.3 21.2 17.2
## Comp 3 13.1 14.7 17.0
## Comp 4  7.9 13.0  8.0
## Comp 5  2.3  3.6  4.4
## Comp 6  1.1  2.0  2.4
## Comp 7  0.5  0.8  0.7

Every model and every result has a method summary(), which shows some statistics for evaluation of a model performance. Here are some examples.

summary(m)
##
## PCA model (class pca) summary
##
## Info:
## PCA model
##        Eigvals Expvar Cumexpvar
## Comp 1   6.509  54.24     54.24
## Comp 2   2.434  20.28     74.52
## Comp 3   1.572  13.10     87.62
## Comp 4   0.946   7.88     95.51
## Comp 5   0.272   2.27     97.77
## Comp 6   0.137   1.14     98.92
## Comp 7   0.058   0.48     99.39
summary(m\$calres)
##
## Summary for PCA results
##
## Selected components: 5
##
##        Expvar Cumexpvar
## Comp 1  54.24     54.24
## Comp 2  20.28     74.52
## Comp 3  13.10     87.62
## Comp 4   7.88     95.51
## Comp 5   2.27     97.77
## Comp 6   1.14     98.92
## Comp 7   0.48     99.39

The same methodology is used for any other method, e.g. PLS or SIMCA. In the next section we will look at how to use plotting functions for models and results.