Unless I am mistaken, this is the first time I see RandNLA operation performed using a GPU. It is not the last.
GPU Accelerated Randomized Singular Value Decomposition and Its Application in Image Compression by Hao Ji and Yaohang Li
From the paper:
In this paper, we present a GPU-accelerated implementation of randomized Singular Value Decomposition (SVD) algorithm on a large matrix to rapidly approximate the top-k dominating singular values and correspondent singular vectors. The fundamental idea of randomized SVD is to condense a large matrix into a small dense matrix by random sampling while keeping the important information. Then performing traditional deterministic SVD on this small dense matrix reveals the top-k dominating singular values/singular vectors approximation. The randomized SVD algorithm is suitable for the GPU architecture; however, our study finds that the key bottleneck lies on the SVD computation of the small matrix. Our solution is to modify the randomized SVD algorithm by applying SVD to a derived small square matrix instead as well as a hybrid GPU-CPU scheme. Our GPU-accelerated randomized SVD implementation is around 6~7 times faster than the corresponding CPU version. Our experimental results demonstrate that the GPU-accelerated randomized SVD implementation can be effectively used in image compression.
From the paper:
The elapsed time spent on each primary computational component in randomized SVD is shown in Figure 2 for a 4,096 x 4,096 random matrix where k is 128 and p is 3. Multiplication between A and a “tall-and-skinny” or “short-and-wide” matrix can be efficiently carried out on the GPU’s SIMT architecture and hence the computational time in generating matrix \omega and performing matrix-matrix multiplications shrinks to nearly negligible. Nevertheless, deterministic SVD, particularly when the target matrix is small, has difficulty in fully taking advantage of GPU architecture, due to a series of sequential Householder transformations need to be applied. As a result, deterministic SVD becomes the main bottleneck and thus this GPU implementation has only 1.65 over that of the CPU.
I wonder what happens when the data becomes larger than 4096 x 4096., i.e; if these ratios still hold.
No comments:
Post a Comment