Variable selection

PLS allows to calculate several statistics, which can be used to select most important (or remove least important) variables in order to improve performance and make model simpler. The first two are VIP-scores (variables important for projection) and Selectivity ratio. All details and theory can be found e.g. here.

Both parameters can be shown as plots and as vector of values for a selected y-variable. Take into account that when you make a plot for VIP scores or Selectivity ratio, the corresponding values should be computed first, which can take some time for large datasets.

Here is an example of corresponding plots.

par(mfrow = c(2, 2))
plotVIPScores(m1)
plotVIPScores(m1, ncomp = 2, type = "h", show.labels = TRUE)
plotSelectivityRatio(m1)
plotSelectivityRatio(m1, ncomp = 2, type = "h", show.labels = TRUE)

To compute the values without plotting use vipscores() and selratio() functions. In the example below, I create two other PLS models by excluding variables with VIP score or selectivity ratio below a threshold (I use 1 and 3 correspondingly) and show the performance for both.

vip = vipscores(m1, ncomp = 2)
m3 = pls(Xc, yc, 4, scale = T, cv = 1, exclcols = (vip < 0.5))
summary(m3)
## 
## PLS model (class pls) summary
## -------------------------------
## Info: 
## Number of selected components: 2
## Cross-validation: full (leave one out)
## 
## Response variable: Shoesize
##     X cumexpvar Y cumexpvar    R2  RMSE Slope   Bias
## Cal    93.51369    95.27241 0.953 0.842 0.953 0.0000
## Cv           NA          NA 0.933 1.005 0.935 0.0254
##      RPD
## Cal 4.70
## Cv  3.94
sr = selratio(m1, ncomp = 2)
m4 = pls(Xc, yc, 4, scale = T, cv = 1, exclcols = (sr < 2))
summary(m4)
## 
## PLS model (class pls) summary
## -------------------------------
## Info: 
## Number of selected components: 2
## Cross-validation: full (leave one out)
## 
## Response variable: Shoesize
##     X cumexpvar Y cumexpvar    R2  RMSE Slope  Bias  RPD
## Cal    99.44331    95.79889 0.958 0.794 0.958 0.000 4.98
## Cv           NA          NA 0.946 0.903 0.944 0.014 4.38

Another way is to make an inference about regression coefficients and calculate confidence intervals and p-values for each variable. This can be done using Jack-Knife approach, when model is cross-validated using efficient number of segments (at least ten) and statistics are calculated using the distribution of regression coefficient values obtained for each step. The statistics are automatically computed when you use full cross-validation.

mjk = pls(Xc, yc, 7, scale = TRUE, cv = 1)

The statistics are calculated for each y-variable and each available number of components. When you show a plot for regression coefficients, you can show the confidence intervals by using parameter show.ci = TRUE as shown in examples below.

par(mfrow = c(2, 2))
plotRegcoeffs(mjk, type = "h", show.ci = TRUE, show.labels = TRUE)
plotRegcoeffs(mjk, ncomp = 2, type = "h", show.ci = TRUE, show.labels = TRUE)
plotRegcoeffs(mjk, type = "l", show.ci = TRUE, show.labels = TRUE)
plotRegcoeffs(mjk, ncomp = 2, type = "l", show.ci = TRUE, show.labels = TRUE)

Calling function summary() for regression coefficients allows to get all calculated statistics.

summary(mjk$coeffs, ncomp = 2)
## 
## Regression coefficients for Shoesize (ncomp = 2)
## ------------------------------------------------
##               Coeffs   Std. err. t-value p-value
## Height    0.19357436 0.008270469   23.39   0.000
## Weight    0.18935489 0.005386414   35.12   0.000
## Hairleng -0.17730737 0.010368912  -17.09   0.000
## Age       0.02508113 0.026525148    0.95   0.351
## Income    0.01291497 0.033491494    0.39   0.702
## Beer      0.11441573 0.017851922    6.40   0.000
## Wine      0.08278735 0.030960265    2.68   0.013
## Sex      -0.17730737 0.010368912  -17.09   0.000
## Swim      0.19426871 0.013654151   14.21   0.000
## Region    0.02673161 0.023989628    1.12   0.275
## IQ       -0.01705852 0.032399035   -0.53   0.602
##                 2.5%       97.5%
## Height    0.17646559  0.21068313
## Weight    0.17821224  0.20049754
## Hairleng -0.19875710 -0.15585764
## Age      -0.02979032  0.07995258
## Income   -0.05636746  0.08219740
## Beer      0.07748621  0.15134524
## Wine      0.01874117  0.14683354
## Sex      -0.19875710 -0.15585764
## Swim      0.16602295  0.22251448
## Region   -0.02289471  0.07635794
## IQ       -0.08408103  0.04996399
## 
## Degrees of freedom (Jack-Knifing): 23
summary(mjk$coeffs, ncomp = 2, alpha = 0.01)
## 
## Regression coefficients for Shoesize (ncomp = 2)
## ------------------------------------------------
##               Coeffs   Std. err. t-value p-value
## Height    0.19357436 0.008270469   23.39   0.000
## Weight    0.18935489 0.005386414   35.12   0.000
## Hairleng -0.17730737 0.010368912  -17.09   0.000
## Age       0.02508113 0.026525148    0.95   0.351
## Income    0.01291497 0.033491494    0.39   0.702
## Beer      0.11441573 0.017851922    6.40   0.000
## Wine      0.08278735 0.030960265    2.68   0.013
## Sex      -0.17730737 0.010368912  -17.09   0.000
## Swim      0.19426871 0.013654151   14.21   0.000
## Region    0.02673161 0.023989628    1.12   0.275
## IQ       -0.01705852 0.032399035   -0.53   0.602
##                  0.5%       99.5%
## Height    0.170356376  0.21679234
## Weight    0.174233417  0.20447636
## Hairleng -0.206416384 -0.14819835
## Age      -0.049383866  0.09954612
## Income   -0.081106896  0.10693683
## Beer      0.064299387  0.16453206
## Wine     -0.004128502  0.16970321
## Sex      -0.206416384 -0.14819835
## Swim      0.155936928  0.23260050
## Region   -0.040615325  0.09407855
## IQ       -0.108013488  0.07389645
## 
## Degrees of freedom (Jack-Knifing): 23

Function getRegcoeffs() in this case may also return corresponding t-value, standard error, p-value, and confidence interval for each of the coefficient (except intercept) if user specifies a parameter full. The standard error and confidence intervals are also computed for raw, non-standardized, variables (similar to coefficients).

show(getRegcoeffs(mjk, ncomp = 2, full = TRUE))
##               Estimated    Std. err.     t-value
## Intercept  1.342626e+01           NA          NA
## Height     7.456695e-02 3.185875e-03  23.3854388
## Weight     4.896328e-02 1.392816e-03  35.1242970
## Hairleng  -6.865495e-01 4.014933e-02 -17.0926429
## Age        1.081716e-02 1.143995e-02   0.9524214
## Income     5.642220e-06 1.463158e-05   0.3875343
## Beer       5.010542e-03 7.817789e-04   6.3971862
## Wine       7.511377e-03 2.809055e-03   2.6802915
## Sex       -6.865495e-01 4.014933e-02 -17.0926429
## Swim       1.010496e-01 7.102257e-03  14.2147650
## Region     1.035071e-01 9.288992e-02   1.1174095
## IQ        -5.277164e-03 1.002285e-02  -0.5285419
##                p-value        2.5%       97.5%
## Intercept           NA          NA          NA
## Height    0.000000e+00  0.14310275  0.20905257
## Weight    0.000000e+00  0.14272997  0.20887799
## Hairleng  1.443290e-14 -0.19852297 -0.13073192
## Age       3.507861e-01 -0.03032377  0.12353583
## Income    7.019238e-01 -0.02372318  0.14371942
## Beer      1.579903e-06  0.11003515  0.15623859
## Wine      1.336266e-02 -0.07053100  0.07603415
## Sex       1.443290e-14 -0.19852297 -0.13073192
## Swim      7.018830e-13  0.14236915  0.20510992
## Region    2.753561e-01 -0.10563433  0.04291911
## IQ        6.021868e-01 -0.08281410  0.07610725
## attr(,"name")
## [1] "Regression coefficients for  Shoesize"

It is also possible to change significance level for confidence intervals.

show(getRegcoeffs(mjk, ncomp = 2, full = TRUE, alpha = 0.01))
##               Estimated    Std. err.     t-value
## Intercept  1.342626e+01           NA          NA
## Height     7.456695e-02 3.185875e-03  23.3854388
## Weight     4.896328e-02 1.392816e-03  35.1242970
## Hairleng  -6.865495e-01 4.014933e-02 -17.0926429
## Age        1.081716e-02 1.143995e-02   0.9524214
## Income     5.642220e-06 1.463158e-05   0.3875343
## Beer       5.010542e-03 7.817789e-04   6.3971862
## Wine       7.511377e-03 2.809055e-03   2.6802915
## Sex       -6.865495e-01 4.014933e-02 -17.0926429
## Swim       1.010496e-01 7.102257e-03  14.2147650
## Region     1.035071e-01 9.288992e-02   1.1174095
## IQ        -5.277164e-03 1.002285e-02  -0.5285419
##                p-value        0.5%       99.5%
## Intercept           NA          NA          NA
## Height    0.000000e+00  0.13132804  0.22082728
## Weight    0.000000e+00  0.13091988  0.22068808
## Hairleng  1.443290e-14 -0.21062642 -0.11862847
## Age       3.507861e-01 -0.05779394  0.15100599
## Income    7.019238e-01 -0.05361845  0.17361469
## Beer      1.579903e-06  0.10178596  0.16448777
## Wine      1.336266e-02 -0.09669881  0.10220196
## Sex       1.443290e-14 -0.21062642 -0.11862847
## Swim      7.018830e-13  0.13116738  0.21631168
## Region    2.753561e-01 -0.13215712  0.06944191
## IQ        6.021868e-01 -0.11118799  0.10448113
## attr(,"name")
## [1] "Regression coefficients for  Shoesize"

The p-values, t-values and standard errors are stored each as a 3-way array similar to regression coefficients. The selection can be made by comparing e.g. p-values with a threshold similar to what we have done with VIP-scores and selectivity ratio.

exclcols = mjk$coeffs$p.values[, 2, 1] > 0.05
show(exclcols)
##   Height   Weight Hairleng      Age   Income     Beer 
##    FALSE    FALSE    FALSE     TRUE     TRUE    FALSE 
##     Wine      Sex     Swim   Region       IQ 
##    FALSE    FALSE    FALSE     TRUE     TRUE

Here p.values[, 2, 1] means values for all predictors, model with two components, first y-variable.

newm = pls(Xc, yc, 3, scale = TRUE, cv = 1, exclcols = exclcols)
summary(newm)
## 
## PLS model (class pls) summary
## -------------------------------
## Info: 
## Number of selected components: 3
## Cross-validation: full (leave one out)
## 
## Response variable: Shoesize
##     X cumexpvar Y cumexpvar    R2  RMSE Slope    Bias
## Cal    98.63485    97.79127 0.978 0.575 0.978  0.0000
## Cv           NA          NA 0.965 0.727 0.972 -0.0161
##      RPD
## Cal 6.87
## Cv  5.44
show(getRegcoeffs(newm))
##              Estimated
## Intercept  4.952691046
## Height     0.091079963
## Weight     0.052708955
## Hairleng  -0.315927643
## Age        0.000000000
## Income     0.000000000
## Beer       0.009320427
## Wine       0.018524717
## Sex       -0.315927643
## Swim       0.134354170
## Region     0.000000000
## IQ         0.000000000
## attr(,"exclrows")
##    Age Income Region     IQ 
##      4      5     10     11 
## attr(,"name")
## [1] "Regression coefficients for  Shoesize"

As you can see, the variables Age, Income, Region and IQ have been excluded as they are not related to the Shoesize, which seems to be correct.

Variable selection as well as all described above can be also carried out for PLS discriminant analysis (PLS-DA), which can be explained later in one of the next chapters.