Calibration and validation

The model calibration is similar to PCA, but there are several additional arguments which are important for classification. The first one is a class name — a string that can be used later, e.g. for identifying class members when testing the model. The second important argument is the significance level, alpha. This parameter is used for computing the statistical limits and can be considered as the expected probability of false negatives. The default value is 0.05. Finally, the parameter lim.type allows selecting the method for computing critical limits for the distances, as described in the PCA chapter.
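
As a quick illustration, here is a hypothetical call combining all three arguments (X here is assumed to be a matrix with objects of one class only; in the rest of the chapter we will use the arguments step by step):

# hypothetical call illustrating the three arguments described above
m = simca(X, 'setosa', alpha = 0.01, lim.type = 'chisq')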

In this chapter, as well as for describing the other classification methods, we will use the famous Iris dataset, available in R. The dataset includes 150 measurements of three Iris species: Setosa, Virginica and Versicolor. The measurements are length and width of petals and sepals in cm. Use ?iris for more details.

Let’s get the data and split it into calibration and test sets.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# generate indices for calibration set
idx = seq(1, nrow(iris), by = 2)

# split the values
X.c = iris[idx, 1:4]
c.c = iris[idx, 5, drop = F]

X.t = iris[-idx, 1:4]
c.t = iris[-idx, 5, drop = F]

Now, because calibration requires only objects belonging to the class of interest, we will split X.c into three matrices — one for each species. The data is ordered by species, so this can be done relatively easily by taking every 25 rows.

X.set = X.c[1:25, ]
X.ver = X.c[26:50, ]
X.vir = X.c[51:75, ]

Let’s start with creating a model for the class Versicolor and exploring the available statistics and plots. We will use full cross-validation to validate the results. In this case we do not change the method for critical limits, so the default option (jm) is used.

library(mdatools)
m = simca(X.ver, 'versicolor', ncomp = 3, cv = 1)
summary(m)
## 
## SIMCA model for class "versicolor" summary
## 
## Info: 
## Method for critical limits: jm
## Significance level (alpha): 0.05
## Selected number of components: 3
## 
##        Expvar Cumexpvar Sens (cal) Expvar (cv) Sens (cv)
## Comp 1  76.44     76.44       0.96       71.85      0.88
## Comp 2  13.93     90.37       0.92       13.91      0.84
## Comp 3   8.45     98.82       0.92       12.24      0.84

Let’s look at the plots, starting with the summary plot.

plot(m)

The plot is very similar to what we saw for a PCA model; the only difference is that it shows modelling power instead of loadings. Modelling power is a measure of the contribution of each variable to the model and varies from 0 to 1. Usually variables with modelling power below 0.1 are considered irrelevant.
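
If you want to reproduce the modelling power values manually, here is a minimal sketch. It assumes the model keeps its loadings in m$loadings (as PCA based models in mdatools do) and uses all three selected components:

# compare residual and original standard deviation for each variable
Xc = scale(X.ver, center = TRUE, scale = FALSE)   # center the class data
scores = Xc %*% m$loadings                        # project onto the loadings
E = Xc - scores %*% t(m$loadings)                 # residuals
mp = 1 - apply(E, 2, sd) / apply(Xc, 2, sd)       # modelling power per variable
show(mp)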

Let’s take a closer look at the residuals plot with different values for alpha (we will keep the number of components equal to three in all cases).

m1 = simca(X.ver, 'versicolor', ncomp = 3, cv = 1, alpha = 0.01)
m2 = simca(X.ver, 'versicolor', ncomp = 3, cv = 1, alpha = 0.05)
m3 = simca(X.ver, 'versicolor', ncomp = 3, cv = 1, alpha = 0.10)
m4 = simca(X.ver, 'versicolor', ncomp = 3, cv = 1, alpha = 0.15)

par(mfrow = c(2, 2))
plotResiduals(m1)
plotResiduals(m2)
plotResiduals(m3)
plotResiduals(m4)

As you can see, using alpha = 0.01 reduced the number of false negatives to zero, as the acceptance limits became larger, while alpha = 0.15 gives a lot of incorrectly rejected class members. It must be noted that decreasing alpha will also lead to a larger number of false positives, which we cannot see in this case.

Predictions and validation with a test set

When the model is ready one can test it using a new test set with known classes. In this case we will use objects from all three species and will be able to see how well the model performs on strangers (and calculate the specificity). In order to do that we will provide both the matrix with predictors, X.t, and a vector with the names of the classes for the corresponding objects/rows (c.t). The values with known classes in this case can be:

  • a vector with text values (names)
  • a factor using the names as labels
  • a vector with logical values (TRUE for class members and FALSE for strangers)

In our case we have a factor. Instead of creating a new model and providing the values as a test set, we will simply make predictions, as shown below.
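
For illustration, here is how the same reference values could be represented in each of the three forms (the c.ref.* names are hypothetical):

c.ref.factor = c.t$Species                    # a factor (what we will use)
c.ref.names = as.character(c.t$Species)       # a vector with text values
c.ref.members = c.t$Species == 'versicolor'   # logical values for one class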

res = predict(m, X.t, c.t)
summary(res)
## 
## Summary for SIMCA one-class classification result
## 
## Class name: versicolor
## Number of selected components: 3
## 
##        Expvar Cumexpvar TP FP TN FN Spec Sens
## Comp 1  64.27     64.27 23  5 45  2 0.90 0.92
## Comp 2   1.67     65.95 24  3 47  1 0.94 0.96
## Comp 3  32.45     98.40 22  3 47  3 0.94 0.88

In this case we see more detailed statistics with true/false positives and negatives, specificity and sensitivity. The performance statistics can also be shown as plots.

par(mfrow = c(2, 2))
plotSpecificity(res)
plotSensitivity(res)
plotMisclassified(res)
plotPerformance(res)

The classification results can be shown both graphically and numerically. Here is a prediction plot for the results.

par(mfrow = c(2, 1))
plotPredictions(res, show.legend = F)
plotPredictions(res, ncomp = 2, show.legend = F)

So we can see (the legend has been hidden in order to show all points clearly) that for the model with three components we have three false positives (specificity = 47/50 = 0.94) and three false negatives (sensitivity = 22/25 = 0.88). You can also show the predictions as a matrix with -1 and +1 using the method showPredictions(), or get the array with predicted class values directly; both options are shown below (for ten of the test set rows, all three components and the first classification variable).
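
The call to showPredictions() here is only a sketch — we assume the result object can be passed as the only argument (see ?showPredictions for the actual list of arguments):

showPredictions(res)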

show(res$c.pred[31:40, 1:3, 1])
##    Comp 1 Comp 2 Comp 3
## 62      1      1      1
## 64      1      1      1
## 66      1      1      1
## 68      1      1     -1
## 70      1      1      1
## 72      1      1      1
## 74      1      1     -1
## 76      1      1      1
## 78      1      1      1
## 80      1      1      1

You can also get and show the confusion matrix (rows correspond to the real classes and columns to the predicted ones) for an object with SIMCA results (as well as for results obtained with any other classification method, e.g. PLS-DA). In this example the matrix also shows the 3 false negatives and 3 false positives mentioned before.

show(getConfusionMatrix(res))
##            versicolor None
## setosa              0   25
## versicolor         22    3
## virginica           3   22

Class belonging probabilities

In addition to the array with predicted classes, the object with SIMCA results also contains an array with class belonging probabilities. The probabilities are calculated depending on how close a particular object is to the critical limit border.

To compute the probability we use the same theoretical distributions for the Q and T2 distances as for computing the critical values (defined by the parameter lim.type). The distribution is used to calculate a p-value — the chance to get an object with the given distance value or larger. The p-value is then compared with the significance level, \(\alpha\), and the probability, \(\pi\), is calculated as follows:

\[\pi = 0.5 (p / \alpha) \]

So if the p-value is the same as the significance level (which happens when the object lies exactly on the acceptance line) the probability is 0.5. If the p-value is e.g. 0.04, \(\pi = 0.4\), or 40%, and the object will be rejected as a stranger (here we assume that \(\alpha = 0.05\)). If the p-value is e.g. 0.06, \(\pi = 0.6\), or 60%, and the object will be accepted as a member of the class. If the p-value is larger than \(2\alpha\), the probability is set to 1.
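
This rule is easy to express in code. Here is a minimal sketch (class_prob is a hypothetical helper, not part of mdatools):

# probability rule: pi = 0.5 * p / alpha, capped at 1
class_prob = function(p, alpha = 0.05) pmin(0.5 * p / alpha, 1)
class_prob(c(0.04, 0.05, 0.06, 0.20))
## [1] 0.4 0.5 0.6 1.0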

In case of a rectangular acceptance area (lim.type = 'jm' or 'chisq') the probability is computed separately for the Q and T2 values and the smaller of the two is taken. In case of a triangular acceptance area (lim.type = 'ddmoments' or 'ddrobust') the probability is calculated for the combination of the distances.
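
For the rectangular case this means applying the rule to the p-value of each distance and taking the smaller probability. A sketch using the class_prob helper from above and hypothetical p-values pQ and pT:

# object is far from the Q limit but extreme for T2
pQ = 0.30; pT = 0.04
min(class_prob(pQ), class_prob(pT))
## [1] 0.4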

Here is how to show the probability values that correspond to the predictions shown in the previous code chunk.

show(res$p.pred[31:40, 1:3, 1])
##    Comp 1 Comp 2     Comp 3
## 62      1      1 1.00000000
## 64      1      1 1.00000000
## 66      1      1 1.00000000
## 68      1      1 0.03884308
## 70      1      1 1.00000000
## 72      1      1 1.00000000
## 74      1      1 0.04593059
## 76      1      1 1.00000000
## 78      1      1 1.00000000
## 80      1      1 1.00000000

It is also possible to show the probability values as a plot with the method plotProbabilities():

par(mfrow = c(2, 1))
plotProbabilities(res, cgroup = c.t$Species)
plotProbabilities(res, ncomp = 2, cgroup = c.t$Species)

The plot can be shown for any SIMCA results (including e.g. calibration set or cross-validated results).

Extreme plot

Extreme plot was proposed as a tool for comparing methods for computing critical limits, as well as for optimizing model parameters, e.g. the number of components. The idea of the plot is as follows. For any given \(i\) we can compute \(\alpha'\) as \(\alpha' = i / n\), where \(n\) is the number of objects in the calibration set, so \(i\) is the expected number of objects outside the acceptance area for the given \(\alpha'\). Then we can count the number of objects, \(m\), in our data that have a p-value below \(\alpha'\).

In the ideal case \(m = i\), but in practice there will be some small random variation of \(m\) around \(i\). This variation can be estimated as \(\pm 2\sqrt{i(1 - \alpha')}\). For example, if the expected number of extreme objects is \(i = 5\) and \(n = 50\), like in our last model, \(\alpha' = 5/50 = 0.1\). So we can expect that \(5 \pm 2\sqrt{5 \times 0.9} \approx 5 \pm 4\) objects may have a p-value smaller than \(0.1\).
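
These expected counts and their tolerance interval are easy to compute directly (a sketch for our n = 50 case):

n = 50
i = 1:n
alpha.prime = i / n
tol = 2 * sqrt(i * (1 - alpha.prime))
# for i = 5 this gives 5 - 4.24 and 5 + 4.24, i.e. roughly 5 +/- 4
cbind(expected = i, low = i - tol, up = i + tol)[5, ]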

The plotExtreme() method counts the number of objects from the calibration set rejected by the model for \(i = 1, 2, \dots, n\) (and the corresponding alpha values) and shows these counts versus \(i\), together with the statistical limits described above. If all points lie within the limits, the classification model works as expected.

Below you will find an example with four extreme plots made for four Versicolor SIMCA models with different methods for estimating the critical limits.

Ve = iris[51:100, 1:4]

m1 = simca(Ve, 'Versicolor', ncomp = 2, lim.type = 'jm')
m2 = simca(Ve, 'Versicolor', ncomp = 2, lim.type = 'chisq')
m3 = simca(Ve, 'Versicolor', ncomp = 2, lim.type = 'ddmoments')
m4 = simca(Ve, 'Versicolor', ncomp = 2, lim.type = 'ddrobust')


par(mfrow = c(2, 2))
plotExtreme(m1, main = 'JM')
plotExtreme(m2, main = 'Chisq')
plotExtreme(m3, main = 'DD moments')
plotExtreme(m4, main = 'DD robust')