Unless I am mistaken, this is the first time I see RandNLA operation performed using a GPU. It is not the last.
GPU Accelerated Randomized Singular Value Decomposition and Its Application in Image Compression by Hao Ji and Yaohang Li
From the paper:
In this paper, we present a GPU-accelerated implementation of randomized Singular Value Decomposition (SVD) algorithm on a large matrix to rapidly approximate the top-k dominating singular values and correspondent singular vectors. The fundamental idea of randomized SVD is to condense a large matrix into a small dense matrix by random sampling while keeping the important information. Then performing traditional deterministic SVD on this small dense matrix reveals the top-k dominating singular values/singular vectors approximation. The randomized SVD algorithm is suitable for the GPU architecture; however, our study finds that the key bottleneck lies on the SVD computation of the small matrix. Our solution is to modify the randomized SVD algorithm by applying SVD to a derived small square matrix instead as well as a hybrid GPU-CPU scheme. Our GPU-accelerated randomized SVD implementation is around 6~7 times faster than the corresponding CPU version. Our experimental results demonstrate that the GPU-accelerated randomized SVD implementation can be effectively used in image compression.
From the paper:
The elapsed time spent on each primary computational component in randomized SVD is shown in Figure 2 for a 4,096 x 4,096 random matrix where k is 128 and p is 3. Multiplication between A and a “tall-and-skinny” or “short-and-wide” matrix can be efficiently carried out on the GPU’s SIMT architecture and hence the computational time in generating matrix \omega and performing matrix-matrix multiplications shrinks to nearly negligible. Nevertheless, deterministic SVD, particularly when the target matrix is small, has difficulty in fully taking advantage of GPU architecture, due to a series of sequential Householder transformations need to be applied. As a result, deterministic SVD becomes the main bottleneck and thus this GPU implementation has only 1.65 over that of the CPU.
I wonder what happens when the data becomes larger than 4096 x 4096., i.e; if these ratios still hold.
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.