cuTensor

GPU Tensor for Machine Learning

The cuTensor-tubal library adopts a frequency-domain computation scheme and exploits highly optimized GPU libraries such as cuFFT and cuBLAS. We optimize data transfer and memory access, and provide the following seven key tensor operations on GPUs: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization.

These operations are the key components of basic linear algebra for three-dimensional data. The proposed library fully exploits the separability in the frequency domain and maps tube-wise and slice-wise parallelism onto the single instruction, multiple thread (SIMT) GPU architecture.
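The frequency-domain scheme can be illustrated with a minimal CPU-only sketch of the t-product. It is not the library's GPU code: the function names (dft, t_fft, t_product, t_product_direct) and the nested-list layout T[i][j][k] are assumptions made for this illustration, a naive DFT stands in for cuFFT, and plain loops stand in for cuBLAS. The sketch assumes real-valued inputs. It applies a Fourier transform to every tube, multiplies the frontal slices independently, and transforms back; a direct circular-convolution version is included as a reference for comparison.

```python
import cmath


def dft(tube, inverse=False):
    """Naive O(n^2) discrete Fourier transform of one tube (a list of numbers)."""
    n = len(tube)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(tube[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def t_fft(T, inverse=False):
    """Transform every tube T[i][j][:] independently (tube-wise parallelism)."""
    return [[dft(tube, inverse) for tube in row] for row in T]


def t_product(A, B):
    """t-product of A (n1 x n2 x n3) and B (n2 x n4 x n3) via the frequency domain."""
    n1, n2, n3 = len(A), len(A[0]), len(A[0][0])
    n4 = len(B[0])
    A_hat, B_hat = t_fft(A), t_fft(B)
    # Each frontal slice k is an independent matrix product (slice-wise parallelism).
    C_hat = [[[sum(A_hat[i][p][k] * B_hat[p][j][k] for p in range(n2))
               for k in range(n3)]
              for j in range(n4)]
             for i in range(n1)]
    C = t_fft(C_hat, inverse=True)
    # Real inputs give a real result; drop the round-off imaginary parts.
    return [[[c.real for c in tube] for tube in row] for row in C]


def t_product_direct(A, B):
    """Reference t-product: sum over p of circular convolutions of tubes."""
    n1, n2, n3 = len(A), len(A[0]), len(A[0][0])
    n4 = len(B[0])
    return [[[sum(A[i][p][m] * B[p][j][(k - m) % n3]
                  for p in range(n2) for m in range(n3))
              for k in range(n3)]
             for j in range(n4)]
            for i in range(n1)]


# Small example: a 2 x 2 x 2 tensor times a 2 x 1 x 2 tensor.
A = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
B = [[[1.0, 0.0]], [[0.0, 1.0]]]
C = t_product(A, B)
```

Because every tube transform and every frontal-slice product is independent of the others, these are the loops that a GPU implementation can spread across threads.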

To achieve good performance, we designed batched and streamed parallelization schemes for tensor operations with regular and irregular computation patterns, respectively.

COMPUTING THIRD-ORDER TENSORS ON GPUS

We briefly summarize the concept of the underlying tensor model, along with its basic and key operations, and describe how to compute the third-order tensor operations of this model on GPUs. Throughout this study, we focus on real-valued third-order tensors in the space R

DESIGN OF THE CUTENSOR-TUBAL LIBRARY

We design this library on top of existing highly optimized CUDA libraries including cuFFT [27], cuBLAS [27], cuSolver [27], Magma [28], and KBLAS [29] for efficient vector Fourier transform and matrix computations.

OVERVIEW OF THE LIBRARY

The cuTensor-tubal library consists of five layers: Applications; cuTensor-tubal API; Third-party tensor libraries; CUDA libraries; Hardware platform.

PERFORMANCE EVALUATION

We measure the running time and speedups of the seven key tensor operations. We test tensor operation performance, and further test tensor completion and t-SVD-based video compression performance.
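A speedup figure of this kind is the ratio of a baseline running time to the accelerated running time. The following is a minimal sketch of such a measurement, not the evaluation harness used for the library: the helper names median_runtime and speedup are hypothetical, and the two workloads are placeholders. When timing real GPU code, the device would also need to be synchronized before the timer is stopped, since kernel launches are asynchronous.

```python
import statistics
import time


def median_runtime(fn, repeats=5):
    """Run fn several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_fn, optimized_fn, repeats=5):
    """Speedup = baseline running time / optimized running time."""
    return median_runtime(baseline_fn, repeats) / median_runtime(optimized_fn, repeats)


# Placeholder workloads standing in for a CPU baseline and an accelerated version.
def baseline():
    return sum(range(200_000))


def optimized():
    return sum(range(20_000))


ratio = speedup(baseline, optimized)
```

Using the median of several runs reduces the influence of one-off effects such as warm-up or background load.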

RELATED WORKS

Early works accelerated tensor computations primarily on single machines with multi-core CPUs or on distributed CPU clusters. Later, with the advent of high-performance GPUs, more and more works adopted GPUs to accelerate computationally intensive tensor computations.

CONCLUSION AND FUTURE WORK

We presented the cuTensor-tubal library of common tensor operations for low-tubal-rank tensor decomposition. In the future, we plan to extend the cuTensor-tubal library to include more tensor operations, and to scale the library onto multi-GPU systems.

TensorLet Team

What cuTensor has achieved so far!
The first GPU-based system reached number one on the TOP500 list of supercomputers in fall 2012. Ever since, GPUs have served as the major source of computational power and have driven the AI boom.

By Oak Ridge National Laboratory (Oct. 2012):
18,688 Opteron 16-core CPUs
18,688 NVIDIA Tesla K20 GPUs
17.6 petaFLOPS
Comparing Tesla P100 and V100 GPUs with a CPU (MATLAB R2014b)

For tensor decompositions, our cuTensor library achieves the following speedups:
* 15x for tensor size 2000 x 2000 x 128;
* 11.8x for tensor size 2000 x 2000 x 64;
* 10.5x for tensor size 2000 x 2000 x 32.

Related Publications

[TPDS] T. Zhang, X.-Y. Liu, X. Wang. High performance GPU tensor completion with tubal-sampling pattern. IEEE Transactions on Parallel and Distributed Systems, 2020.
[TPDS] X.-Y. Liu, T. Zhang (co-primary author), X. Wang, A. Walid. cuTensor-tubal: Efficient primitives for tubal-rank tensor learning operations on GPUs. IEEE Transactions on Parallel and Distributed Systems, 2019.

[Access] T. Zhang, H. Lu, X.-Y. Liu. High performance homomorphic matrix completion on multiple GPUs. IEEE Access, 2020.

[HPCC] H. Li, T. Zhang, R. Zhang, X.-Y. Liu. High-performance tensor decoder on GPUs for wireless camera networks in IoT. IEEE HPCC 2019.
[HPCC] H. Lu, T. Zhang, X.-Y. Liu. High-performance homomorphic matrix completion on GPUs. IEEE HPCC 2019.
[ICASSP] X.-Y. Liu, T. Zhang (co-primary author). cuTensor-tubal: Optimized GPU library for low-tubal-rank tensors. IEEE ICASSP, 2019.
[ICCAD] C. Deng, M. Yin, X.-Y. Liu, X. Wang, B. Yuan. High-performance hardware architecture for tensor singular value decomposition (Invited paper). International Conference on Computer-Aided Design (ICCAD), 2019.
[IPCCC] J. Huang, L. Kong, X.-Y. Liu, W. Qu and G. Chen. A C++ library for tensor decomposition. International Performance Computing and Communications Conference (IPCCC), 2019.

Reaching the frontier of AI science!

We are a young team with expertise in GPU tensor computing and deep learning, committed to creating top AI algorithms and solutions for corporations, labs, schools, and communities.