Randomized PCA algorithms

Both SVD and NIPALS are not very efficient when number of rows in dataset is very large (e.g. hundreds of thousands values or even more). Such datasets can be easily obtained in case of for example hyperspectral images. Direct use of the traditional algorithms with such datasets often leads to a lack of memory and long computational time.

One of the solution here is to use probabilistic algorithms, which allow to reduce the number of values needed for estimation of principal components. Starting from 0.9.0 one of the probabilistic approach is also implemented in mdatools. The original idea can be found in this paper and some examples on using the approach for PCA analysis of hyperspectral images are described here.

The approach is based on calculation of matrix $$B$$, which has much smaller number of rows comparing to the original data, but captures the action of the data. To produce proper values for $$B$$, several parameters are used. First of all it is a rank of original data, $$k$$, if it is known. Second, it is oversampling parameter, $$p$$, which is important if rank is underestimated. The idea is to use $$p$$ to overestimate the rank thus making solution more stable. The third parameter, $$q$$, is a number of iterations needed to make $$B$$ more robust. Usually using $$p = 5$$ and $$q = 1$$ will work well on most of the datasets and, at the same time, will take less time for finding a solution comparing with conventional methods. By default, $$k$$ is set to number of components, used for PCA decomposition.

The example below uses two methods (classic SVD and randomized SVD) for PCA decomposition of 100 000 x 300 dataset and compares time and main outcomes (loadings and explained variance).

# create a dataset as a linear combination of three sin curves with random "concentrations"
n = 100000
X = seq(0, 29.9, by = 0.1)
S = cbind(sin(X), sin(10 * X), sin(5 * X))
C = cbind(runif(n, 0, 1), runif(n, 0, 2), runif(n, 0, 3))
D = C %*% t(S)
D = D + matrix(runif(300 * n, 0, 0.5), ncol = 300)
show(dim(D))
##  100000    300
# conventional SVD
t1 = system.time({m1 = pca(D, ncomp = 2)})
show(t1)
##    user  system elapsed
##  52.910   2.816  55.741
# randomized SVD with p = 5 and q = 1
t2 = system.time({m2 = pca(D, ncomp = 2, rand = c(5, 1))})
show(t2)
##     user   system  elapsed
##   29.620    2.668 2310.736
# compare variances
summary(m1)
##
## PCA model (class pca) summary
##
## Info:
##
##        Eigvals Expvar Cumexpvar
## Comp 1 113.696  62.35     62.35
## Comp 2  49.957  27.40     89.74
summary(m2)
##
## PCA model (class pca) summary
##
## Info:
##
##
## Parameters for randomized algorithm: q = 5, p = 1
##        Eigvals Expvar Cumexpvar
## Comp 1 113.696  62.35     62.35
## Comp 2  49.957  27.40     89.74
show(m1$loadings[1:10, ]) ## Comp 1 Comp 2 ## [1,] 2.169611e-05 7.819699e-05 ## [2,] -3.884146e-02 6.928805e-02 ## [3,] -6.831018e-02 7.508661e-02 ## [4,] -8.137611e-02 1.246559e-02 ## [5,] -7.445530e-02 -6.122070e-02 ## [6,] -4.921439e-02 -7.786260e-02 ## [7,] -1.169291e-02 -2.269241e-02 ## [8,] 2.876378e-02 5.345657e-02 ## [9,] 6.195369e-02 8.026104e-02 ## [10,] 7.977416e-02 3.286764e-02 show(m2$loadings[1:10, ])
##              Comp 1        Comp 2
##  [1,] -2.169611e-05  7.819699e-05
##  [2,]  3.884146e-02  6.928805e-02
##  [3,]  6.831018e-02  7.508661e-02
##  [4,]  8.137611e-02  1.246559e-02
##  [5,]  7.445530e-02 -6.122070e-02
##  [6,]  4.921439e-02 -7.786260e-02
##  [7,]  1.169291e-02 -2.269241e-02
##  [8,] -2.876378e-02  5.345657e-02
##  [9,] -6.195369e-02  8.026104e-02
## [10,] -7.977416e-02  3.286764e-02

As you can see the explained variance values, eigenvalues and loadings are identical in the two models and the second method is about twice faster.

It is possible to make PCA decomposition even faster if only loadings and scores are needed. In this case you can use method pca.run() and skip other steps, like calculation of residuals, variances, critical limits and so on. But in this case data matrix must be centered (and scaled if necessary) manually prior to the decomposition. Here is an example using the data generated in previous code.

D = scale(D, center = T, scale = F)

# conventional SVD
t1 = system.time({P1 = pca.run(D, method = 'svd', ncomp = 2)})
show(t1)
##    user  system elapsed
##  28.070   0.257  29.614
# randomized SVD with p = 5 and q = 1
t2 = system.time({P2 = pca.run(D, method = 'svd', ncomp = 2, rand = c(5, 1))})
show(t2)
##    user  system elapsed
##   2.269   0.062   2.332
show(P1$loadings[1:10, ]) ## [,1] [,2] ## [1,] 2.169611e-05 7.819699e-05 ## [2,] -3.884146e-02 6.928805e-02 ## [3,] -6.831018e-02 7.508661e-02 ## [4,] -8.137611e-02 1.246559e-02 ## [5,] -7.445530e-02 -6.122070e-02 ## [6,] -4.921439e-02 -7.786260e-02 ## [7,] -1.169291e-02 -2.269241e-02 ## [8,] 2.876378e-02 5.345657e-02 ## [9,] 6.195369e-02 8.026104e-02 ## [10,] 7.977416e-02 3.286764e-02 show(P2$loadings[1:10, ])
##                [,1]          [,2]
##  [1,] -2.169611e-05  7.819699e-05
##  [2,]  3.884146e-02  6.928805e-02
##  [3,]  6.831018e-02  7.508661e-02
##  [4,]  8.137611e-02  1.246559e-02
##  [5,]  7.445530e-02 -6.122070e-02
##  [6,]  4.921439e-02 -7.786260e-02
##  [7,]  1.169291e-02 -2.269241e-02
##  [8,] -2.876378e-02  5.345657e-02
##  [9,] -6.195369e-02  8.026104e-02
## [10,] -7.977416e-02  3.286764e-02

As you can see the loadings are still the same but the probabilistic algorithm is about 15 times faster.