Variable selection

PLS calculates several statistics that can be used to select the most important (or remove the least important) variables in order to improve performance and make the model simpler. The first two are VIP scores (Variable Importance in Projection) and selectivity ratio. All details and theory can be found e.g. here.

Both parameters can be shown as plots and obtained as a vector of values for a selected y-variable. Selectivity ratio is calculated for all possible components in a model, but VIP scores (due to computational time) are computed only for the selected number of components and are recalculated every time you change the number of optimal components using the selectCompNum() method. Here are some plots.

par(mfrow = c(2, 2))
plotVIPScores(m1, type = 'h', show.labels = T)
plotSelectivityRatio(m1, type = 'b', show.labels = T)
plotSelectivityRatio(m1, ncomp = 1, type = 'h', show.labels = T)
plotSelectivityRatio(m1, ncomp = 2, type = 'h', show.labels = T)
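
Besides plotting, both statistics can be obtained as vectors of values for a selected y-variable and number of components. Here is a small sketch using the same getVIPScores() and getSelectivityRatio() functions that appear below (the names vip and sr are just for illustration):

# VIP scores and selectivity ratio as vectors for two components
vip = getVIPScores(m1, ncomp = 2)
sr = getSelectivityRatio(m1, ncomp = 2)
show(vip)
show(sr)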

In the example below, I create two other PLS models by excluding variables with a VIP score or selectivity ratio below a threshold (0.5 and 1, respectively) and show the performance of both.

m3 = pls(X.c, y.c, 5, scale = T, cv = 1, exclcols = getVIPScores(m1, ncomp = 2) < 0.5)
summary(m3)
## 
## PLS model (class pls) summary
## 
## Performance and validation:
## Number of selected components: 4
##     X cumexpvar Y cumexpvar  RMSE Slope    Bias  RPD
## Cal       85.29       98.15 0.527  0.98  0.0000 7.50
## CV        80.53       96.34 0.741  0.97 -0.0382 5.34
m4 = pls(X.c, y.c, 5, scale = T, cv = 1, exclcols = getSelectivityRatio(m1, ncomp = 2) < 1)
summary(m4)
## 
## PLS model (class pls) summary
## 
## Performance and validation:
## Number of selected components: 1
##     X cumexpvar Y cumexpvar  RMSE Slope   Bias  RPD
## Cal       86.80       94.97 0.868  0.95 0.0000 4.56
## CV        84.37       93.91 0.955  0.94 0.0034 4.14
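
If you want to check which variables each threshold actually removes, the logical vectors passed to exclcols can be inspected directly (this simply repeats the comparisons used above):

# TRUE means a variable falls below the threshold and will be excluded
show(getVIPScores(m1, ncomp = 2) < 0.5)
show(getSelectivityRatio(m1, ncomp = 2) < 1)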

Another way is to make inference about the regression coefficients and calculate confidence intervals and p-values for each variable. This can be done using the Jack-Knife approach, where the model is cross-validated using a sufficient number of segments (at least ten) and the statistics are calculated from the distribution of regression coefficient values obtained at each step. There are two parameters, coeffs.ci and coeffs.alpha: the first selects the method (so far only Jack-Knife is available, the value is 'jk') and the second is the significance level for computing the confidence intervals (0.1 by default). Here is an example.

mjk = pls(X.c, y.c, 7, scale = T, coeffs.ci = 'jk', coeffs.alpha = 0.05)

If the number of segments is not specified, as in the example above, full cross-validation will be used.
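
To use a particular number of segments instead, it can be given via the cv parameter when fitting the model. A minimal sketch, assuming that cv = 10 gives random cross-validation with ten segments (the name mjk10 is just illustrative):

# Jack-Knife statistics based on cross-validation with ten segments (assumed meaning of cv = 10)
mjk10 = pls(X.c, y.c, 7, scale = T, cv = 10, coeffs.ci = 'jk', coeffs.alpha = 0.05)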

The statistics are calculated for each y-variable and each available number of components. When you show a plot for regression coefficients, the confidence intervals are shown automatically. You can change this by using the parameter show.ci = FALSE.

par(mfrow = c(2, 2))
plotRegcoeffs(mjk, type = 'h', show.labels = T)
plotRegcoeffs(mjk, ncomp = 2, type = 'h', show.labels = T)
plotRegcoeffs(mjk, type = 'h', show.labels = T, show.ci = F)
plotRegcoeffs(mjk, ncomp = 2, type = 'h', show.labels = T, show.ci = F)

Calling the function summary() for the regression coefficients object returns all calculated statistics.

summary(mjk$coeffs, ncomp = 2)
## Regression coefficients for Shoesize
## ------------------------------------
##          Estimated coefficients t-value          SE p-value 95% CI (lo)
## Height               0.19357436    25.6 0.007507308   0.000  0.17804431
## Weight               0.18935489    24.6 0.007629829   0.000  0.17357139
## Hairleng            -0.17730737   -17.3 0.010263098   0.000 -0.19853820
## Age                  0.02508113     0.9 0.026825735   0.358 -0.03041213
## Income               0.01291497     0.4 0.034034881   0.703 -0.05749155
## Beer                 0.11441573     6.3 0.018145181   0.000  0.07687956
## Wine                 0.08278735     2.7 0.030127820   0.012  0.02046321
## Sex                 -0.17730737   -17.3 0.010263098   0.000 -0.19853820
## Swim                 0.19426871    14.6 0.013195834   0.000  0.16697105
## Region               0.02673161     1.1 0.023419425   0.271 -0.02171516
## IQ                  -0.01705852    -0.5 0.032839748   0.603 -0.08499272
##          95% CI (up)
## Height    0.20910441
## Weight    0.20513839
## Hairleng -0.15607653
## Age       0.08057439
## Income    0.08332148
## Beer      0.15195189
## Wine      0.14511150
## Sex      -0.15607653
## Swim      0.22156638
## Region    0.07517838
## IQ        0.05087567
## 
## Degrees of freedom: 23, t-value: 2.07
summary(mjk$coeffs, ncomp = 2, alpha = 0.01)
## Regression coefficients for Shoesize
## ------------------------------------
##          Estimated coefficients t-value          SE p-value  99% CI (lo)
## Height               0.19357436    25.6 0.007507308   0.000  0.172498826
## Weight               0.18935489    24.6 0.007629829   0.000  0.167935399
## Hairleng            -0.17730737   -17.3 0.010263098   0.000 -0.206119327
## Age                  0.02508113     0.9 0.026825735   0.358 -0.050227714
## Income               0.01291497     0.4 0.034034881   0.703 -0.082632366
## Beer                 0.11441573     6.3 0.018145181   0.000  0.063476112
## Wine                 0.08278735     2.7 0.030127820   0.012 -0.001791551
## Sex                 -0.17730737   -17.3 0.010263098   0.000 -0.206119327
## Swim                 0.19426871    14.6 0.013195834   0.000  0.157223579
## Region               0.02673161     1.1 0.023419425   0.271 -0.039014575
## IQ                  -0.01705852    -0.5 0.032839748   0.603 -0.109250717
##          99% CI (up)
## Height    0.21464989
## Weight    0.21077438
## Hairleng -0.14849541
## Age       0.10038997
## Income    0.10846230
## Beer      0.16535534
## Wine      0.16736626
## Sex      -0.14849541
## Swim      0.23131385
## Region    0.09247780
## IQ        0.07513367
## 
## Degrees of freedom: 23, t-value: 2.81

In this case the function getRegcoeffs() may also return the corresponding t-value, standard error, p-value, and confidence interval for each of the coefficients (except the intercept) if the user sets the parameter full. The standard errors and confidence intervals are also computed for the raw, unstandardized, variables (similar to the coefficients).

show(getRegcoeffs(mjk, ncomp = 2, full = T))
##           Estimated coefficients     t-value           SE      p-value
## Intercept           1.342626e+01          NA           NA           NA
## Height              7.456695e-02  25.5886755 2.891897e-03 2.116375e-18
## Weight              4.896328e-02  24.6148053 1.972917e-03 5.012592e-18
## Hairleng           -6.865495e-01 -17.2818695 3.973960e-02 1.137354e-14
## Age                 1.081716e-02   0.9380153 1.156959e-02 3.579832e-01
## Income              5.642220e-06   0.3857258 1.486897e-05 7.032445e-01
## Beer                5.010542e-03   6.2692224 7.946214e-04 2.136646e-06
## Wine                7.511377e-03   2.7258935 2.733526e-03 1.204856e-02
## Sex                -6.865495e-01 -17.2818695 3.973960e-02 1.137354e-14
## Swim                1.010496e-01  14.6146811 6.863861e-03 3.941369e-13
## Region              1.035071e-01   1.1283478 9.068204e-02 2.708059e-01
## IQ                 -5.277164e-03  -0.5268369 1.015919e-02 6.033516e-01
##             95% CI (lo)   95% CI (up)
## Intercept            NA            NA
## Height     6.858461e-02  8.054929e-02
## Weight     4.488199e-02  5.304457e-02
## Hairleng  -7.687571e-01 -6.043419e-01
## Age       -1.311636e-02  3.475068e-02
## Income    -2.511659e-05  3.640103e-05
## Beer       3.366742e-03  6.654342e-03
## Wine       1.856647e-03  1.316611e-02
## Sex       -7.687571e-01 -6.043419e-01
## Swim       8.685061e-02  1.152486e-01
## Region    -8.408298e-02  2.910972e-01
## IQ        -2.629305e-02  1.573872e-02
## attr(,"name")
## [1] "Regression coefficients for  Shoesize"

It is also possible to change the significance level for the confidence intervals.

show(getRegcoeffs(mjk, ncomp = 2, full = T, alpha = 0.01))
##           Estimated coefficients     t-value           SE      p-value
## Intercept           1.342626e+01          NA           NA           NA
## Height              7.456695e-02  25.5886755 2.891897e-03 2.116375e-18
## Weight              4.896328e-02  24.6148053 1.972917e-03 5.012592e-18
## Hairleng           -6.865495e-01 -17.2818695 3.973960e-02 1.137354e-14
## Age                 1.081716e-02   0.9380153 1.156959e-02 3.579832e-01
## Income              5.642220e-06   0.3857258 1.486897e-05 7.032445e-01
## Beer                5.010542e-03   6.2692224 7.946214e-04 2.136646e-06
## Wine                7.511377e-03   2.7258935 2.733526e-03 1.204856e-02
## Sex                -6.865495e-01 -17.2818695 3.973960e-02 1.137354e-14
## Swim                1.010496e-01  14.6146811 6.863861e-03 3.941369e-13
## Region              1.035071e-01   1.1283478 9.068204e-02 2.708059e-01
## IQ                 -5.277164e-03  -0.5268369 1.015919e-02 6.033516e-01
##             99% CI (lo)   99% CI (up)
## Intercept            NA            NA
## Height     6.644843e-02  8.268548e-02
## Weight     4.342464e-02  5.450192e-02
## Hairleng  -7.981119e-01 -5.749871e-01
## Age       -2.166256e-02  4.329689e-02
## Income    -3.609997e-05  4.738441e-05
## Beer       2.779773e-03  7.241311e-03
## Wine      -1.625491e-04  1.518530e-02
## Sex       -7.981119e-01 -5.749871e-01
## Swim       8.178042e-02  1.203188e-01
## Region    -1.510678e-01  3.580821e-01
## IQ        -3.379741e-02  2.324309e-02
## attr(,"name")
## [1] "Regression coefficients for  Shoesize"

The p-values, t-values and standard errors are each stored as a 3-way array, similar to the regression coefficients. Selection can be made by comparing e.g. the p-values with a threshold, similar to what we did with the VIP scores and selectivity ratio.

exclcols = mjk$coeffs$p.values[, 2, 1] > 0.05
show(exclcols)
##   Height   Weight Hairleng      Age   Income     Beer     Wine      Sex 
##    FALSE    FALSE    FALSE     TRUE     TRUE    FALSE    FALSE    FALSE 
##     Swim   Region       IQ 
##    FALSE     TRUE     TRUE

Here p.values[, 2, 1] means the values for all predictors, a model with two components, and the first y-variable.
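
You can verify this structure by checking the dimensions of the array (assuming they follow the order predictors × components × y-variables described above):

# dimensions of the p-values array: predictors x components x y-variables (assumed order)
show(dim(mjk$coeffs$p.values))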

newm = pls(X.c, y.c, 3, scale = T, cv = 1, exclcols = exclcols)
summary(newm)
## 
## PLS model (class pls) summary
## 
## Performance and validation:
## Number of selected components: 3
##     X cumexpvar Y cumexpvar  RMSE Slope    Bias  RPD
## Cal       97.85       97.79 0.575  0.98  0.0000 6.87
## CV        96.64       96.49 0.727  0.97 -0.0161 5.44
show(getRegcoeffs(newm))
##               Shoesize
## Intercept  4.952691046
## Height     0.091079963
## Weight     0.052708955
## Hairleng  -0.315927643
## Age        0.000000000
## Income     0.000000000
## Beer       0.009320427
## Wine       0.018524717
## Sex       -0.315927643
## Swim       0.134354170
## Region     0.000000000
## IQ         0.000000000
## attr(,"exclrows")
##    Age Income Region     IQ 
##      4      5     10     11 
## attr(,"name")
## [1] "Regression coefficients for  Shoesize"

As you can see, the variables Age, Income, Region and IQ have been excluded, as they are not related to Shoesize, which seems to be correct.

Variable selection, as well as everything else described above, can also be carried out for PLS discriminant analysis (PLS-DA), which will be explained in one of the next chapters.