TensorCrypto: High Throughput Acceleration of Lattice-based Cryptography Using Tensor Core on GPU
Tensor core is a newly introduced hardware unit in NVIDIA GPU chips that allows matrix multiplication to be computed much faster than in the integer and floating-point units. In this paper, we show that for the first time, tensor core can be used to accelerate state-of-the-art lattice-based cryptosystems. We employed tensor core to speed up polynomial convolution, which is the most time consuming operation in lattice-based cryptosystems. Towards that aim, several parallel algorithms are
... to allow the tensor core to handle flexible matrix sizes and ephemeral key pairs. Experimental results show that the polynomial convolution computed using the tensor core is at least 2× faster than the version implemented with conventional integer units of the NVIDIA GPU. The proposed tensorcore-based polynomial convolution technique was applied to NTRU, one of the finalists in NIST postquantum cryptography (PQC) standardization. It achieved 2.02×/1.98× (encapsulation) and 1.56×/1.90× (decapsulation) higher throughput on two parameter sets (ntruhps2048509 and ntruhps2048677), compared to the conventional integer-based implementations on a GPU. In particular, the proposed implementation techniques achieved throughput up to 793651 key encapsulations per second and 505051 decapsulations per second on a RTX2060 GPU. To demonstrate the flexibility of the proposed technique, we extend the implementation to other lattice-based cryptosystems that have a small modulus: LAC and two variant parameter sets in FrodoKEM. Considering that the IoT gateway devices and cloud servers need to handle massive connections from the sensor nodes, the proposed high throughput implementation on GPU is very useful in securing the IoT communication.