Tensor Learning

High Performance Tensor Computing for Machine Learning

We develop efficient tensor libraries for tensor decompositions and tensor networks, including CP, Tucker, hierarchical Tucker, tensor-train, tensor-ring, and low-tubal-rank tensor decompositions. We provide efficient primitives for the tensor, Hadamard, and Khatri-Rao products, as well as contraction, matricization, tensor-times-matrix (TTM), and matricized tensor times Khatri-Rao product (MTTKRP), on tensor cores. These operations are the key building blocks of tensor algebra.
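To illustrate two of these primitives, here is a minimal NumPy sketch of the Khatri-Rao product and the mode-1 MTTKRP. This is a CPU sketch for clarity only; the function names are illustrative and are not the library's CUDA API.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product: A (I x R), B (J x R) -> (I*J x R)."""
    I, R = A.shape
    J, _ = B.shape
    # For each column r, take the Kronecker product of A[:, r] and B[:, r].
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

def mttkrp(X, B, C):
    """Mode-1 MTTKRP of X (I x J x K) with factors B (J x R) and C (K x R)."""
    I, J, K = X.shape
    # Mode-1 matricization: column index m = j + J*k (column-major fiber ordering).
    X1 = X.reshape(I, J * K, order='F')
    return X1 @ khatri_rao(C, B)
```

Expressing the MTTKRP as one matricized product is what allows a GPU implementation to reduce the bottleneck of CP-ALS iterations to a single large matrix multiplication.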

For example, the cuTensor-tubal library adopts a frequency-domain computation scheme. We optimize data transfer and memory access, and support seven key tensor operations: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. The library fully exploits the separability in the frequency domain and maps the tube-wise and slice-wise parallelism onto the single-instruction, multiple-thread (SIMT) GPU architecture.
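The frequency-domain scheme can be sketched in a few lines of NumPy: an FFT along the third (tube) dimension turns the t-product into independent matrix products, one per frequency slice, which is exactly the slice-wise parallelism a GPU can exploit. This is an illustrative CPU sketch, not the library's implementation.

```python
import numpy as np

def t_product(A, B):
    """t-product of A (n1 x n2 x n3) and B (n2 x n4 x n3) via FFT along the tubes."""
    Af = np.fft.fft(A, axis=2)
    Bf = np.fft.fft(B, axis=2)
    # Independent matrix multiplication per frequency slice (slice-wise parallelism).
    Cf = np.einsum('ijk,jlk->ilk', Af, Bf)
    return np.fft.ifft(Cf, axis=2).real
```

With n3 = 1 the t-product reduces to an ordinary matrix product; a GPU version replaces the loop implied by the einsum with batched GEMMs and batched FFTs.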


We briefly summarize the concept along with its basic and key operations, and introduce how to compute third-order tensor operations of this model on GPUs. Throughout this study, we focus on real-valued third-order tensors in the space R^(n1×n2×n3).
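As an example of such a third-order operation, the t-SVD can be sketched as slice-wise SVDs in the frequency domain. The version below is an illustrative NumPy sketch (a GPU implementation performs the slice SVDs in batch); for a real tensor, only about half of the frequency slices need an SVD, since the rest follow by conjugate symmetry.

```python
import numpy as np

def t_svd(X):
    """t-SVD of a real tensor X (n1 x n2 x n3) via per-slice SVDs in the frequency domain.

    Returns real tensors U (n1 x n1 x n3), S (n1 x n2 x n3), V (n2 x n2 x n3) such that
    in the frequency domain Xf[:, :, k] = Uf[:, :, k] @ Sf[:, :, k] @ Vf[:, :, k].conj().T.
    """
    n1, n2, n3 = X.shape
    Xf = np.fft.fft(X, axis=2)
    Uf = np.zeros((n1, n1, n3), dtype=complex)
    Sf = np.zeros((n1, n2, n3), dtype=complex)
    Vf = np.zeros((n2, n2, n3), dtype=complex)
    m = min(n1, n2)
    # SVDs for the first half of the frequency slices only.
    for k in range(n3 // 2 + 1):
        if k == 0 or (n3 % 2 == 0 and k == n3 // 2):
            # DC (and Nyquist, for even n3) slices of a real tensor are real.
            U, s, Vh = np.linalg.svd(Xf[:, :, k].real)
        else:
            U, s, Vh = np.linalg.svd(Xf[:, :, k])
        Uf[:, :, k] = U
        Sf[np.arange(m), np.arange(m), k] = s
        Vf[:, :, k] = Vh.conj().T
    # Remaining slices follow by conjugate symmetry, since X is real.
    for k in range(n3 // 2 + 1, n3):
        Uf[:, :, k] = Uf[:, :, n3 - k].conj()
        Sf[:, :, k] = Sf[:, :, n3 - k].conj()
        Vf[:, :, k] = Vf[:, :, n3 - k].conj()
    back = lambda T: np.fft.ifft(T, axis=2).real
    return back(Uf), back(Sf), back(Vf)
```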


We design this library on top of existing highly optimized CUDA libraries, including cuFFT [27], cuBLAS [27], cuSolver [27], MAGMA [28], and KBLAS [29], for efficient Fourier transforms and matrix computations.


The cuTensor-tubal library consists of five layers: applications, the cuTensor-tubal API, third-party tensor libraries, CUDA libraries, and the hardware platform.


We measure the running time and speedups of the seven key tensor operations. Beyond tensor-operation performance, we further evaluate tensor completion and t-SVD-based video compression.


Early works accelerated tensor computations primarily on single machines with multi-core CPUs or on distributed CPU clusters. Later, with the advent of high-performance GPUs, more and more works adopted GPUs to accelerate computation-intensive tensor operations.


We presented the cuTensor-tubal library of common tensor operations for low-tubal-rank tensor decomposition. In the future, we plan to extend the cuTensor-tubal library to include more tensor operations, and to scale the library onto multi-GPU systems.

TensorLet Team

Achievements of cuTensor so far:

For tensor decompositions, our cuTensor library achieves speedups xxx.

Related Publications

X.-Y. Liu, T. Zhang, H. Lu, X. Wang, and A. Walid. Efficient GPU primitives for tensor learning with CP and Tucker decompositions. 2020.
[TNRML] X.-Y. Liu, H. Lu, T. Zhang. cuTensor-CP: High performance third-order CP tensor decompositions on GPUs. IJCAI 2020 Workshop on Tensor Network Representations in Machine Learning, 2020.
[TNRML] H. Hong, T. Zhang, X.-Y. Liu. cuTensor-TT/TR: High performance third-order tensor-train and tensor-ring decompositions on GPUs. IJCAI 2020 Workshop on Tensor Network Representations in Machine Learning, 2020.
[TNRML] H. Huang, T. Zhang, X.-Y. Liu. cuTensor-HT: High performance third-order hierarchical Tucker tensor decomposition on GPUs. IJCAI 2020 Workshop on Tensor Network Representations in Machine Learning, 2020.
[TNRML] M. Yin, S. Liao, X.-Y. Liu, X. Wang, B. Yuan. Compressing recurrent neural networks using hierarchical Tucker tensor decomposition. IJCAI 2020 Workshop on Tensor Network Representations in Machine Learning, 2020.
[JPDC] T. Zhang, W. Kan, X.-Y. Liu. High performance GPU primitives for graph-tensor learning operations. (Major Revision) Elsevier Journal of Parallel and Distributed Computing, 2020.
[TPDS] T. Zhang, X.-Y. Liu, X. Wang. High performance GPU tensor completion with tubal-sampling pattern. IEEE Transactions on Parallel and Distributed Systems, 2020.
[TPDS] X.-Y. Liu, T. Zhang (co-primary author), X. Wang, A. Walid. cuTensor-tubal: Efficient primitives for tubal-rank tensor learning operations on GPUs. IEEE Transactions on Parallel and Distributed Systems, 2019.

T. Zhang, H. Lu, X.-Y. Liu. High performance homomorphic matrix completion on multiple GPUs. IEEE Access, 2020.

[HPCC] H. Li, T. Zhang, R. Zhang, X.-Y. Liu. High-performance tensor decoder on GPUs for wireless camera networks in IoT. IEEE HPCC 2019.
[HPCC] H. Lu, T. Zhang, X.-Y. Liu. High-performance homomorphic matrix completion on GPUs. IEEE HPCC 2019.
[ICASSP] X.-Y. Liu, T. Zhang (co-primary author). cuTensor-tubal: Optimized GPU library for low-tubal-rank tensors. IEEE ICASSP, 2019.
[ICCAD] C. Deng, M. Yin, X.-Y. Liu, X. Wang, B. Yuan. High-performance hardware architecture for tensor singular value decomposition (Invited paper). International Conference on Computer-Aided Design (ICCAD), 2019.
[IPCCC] J. Huang, L. Kong, X.-Y. Liu, W. Qu and G. Chen. A C++ library for tensor decomposition. International Performance Computing and Communications Conference (IPCCC), 2019.

Reaching for the top of AI science!

A young team, professional in GPU tensor computing and deep learning technology, committed to creating top AI algorithms and solutions for companies, labs, schools, and communities.