Fast implementation of DGEMM on Fermi GPU
2011
Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11)
In this paper we present a thorough account of our experience tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance model based on micro-architecture benchmarks. Our optimizations include software pipelining, use of vector memory operations, and instruction scheduling.
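The two-level blocking mentioned in the abstract can be illustrated with a small CPU-side sketch. This is not the paper's GPU kernel: the function names `dgemm_blocked` and `dgemm_naive`, the parameters `tile` and `reg`, and the tile sizes are all hypothetical, chosen only to show the loop structure. The outer tiles stand in for blocks staged in shared memory, and the inner `reg` x `reg` sub-blocks stand in for values held in registers. Matrices are assumed to be non-empty, row-major lists of lists.

```python
def dgemm_naive(alpha, A, B, beta, C):
    """Reference triple loop: C = alpha*A*B + beta*C, updated in place."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += A[i][p] * B[p][j]
            C[i][j] = alpha * s + beta * C[i][j]
    return C


def dgemm_blocked(alpha, A, B, beta, C, tile=4, reg=2):
    """Same result as dgemm_naive, computed with two levels of blocking.

    tile: edge length of the outer tiles (stand-in for shared-memory blocks).
    reg:  edge length of the inner sub-blocks (stand-in for register blocks).
    Both sizes are illustrative assumptions, not values from the paper.
    """
    m, k, n = len(A), len(B), len(B[0])
    for i0 in range(0, m, tile):
        i1 = min(i0 + tile, m)
        for j0 in range(0, n, tile):
            j1 = min(j0 + tile, n)
            rows, cols = i1 - i0, j1 - j0
            # Accumulators for one output tile of C.
            acc = [[0.0] * cols for _ in range(rows)]
            for p0 in range(0, k, tile):
                p1 = min(p0 + tile, k)
                # Stage one tile of A and one tile of B (the "shared memory" step).
                a_tile = [row[p0:p1] for row in A[i0:i1]]
                b_tile = [row[j0:j1] for row in B[p0:p1]]
                # Walk the output tile in reg x reg sub-blocks (the "register" step).
                for ii in range(0, rows, reg):
                    for jj in range(0, cols, reg):
                        for p in range(p1 - p0):
                            for i in range(ii, min(ii + reg, rows)):
                                a = a_tile[i][p]  # reused across the sub-block row
                                for j in range(jj, min(jj + reg, cols)):
                                    acc[i][j] += a * b_tile[p][j]
            # Apply alpha and beta once per output element.
            for i in range(rows):
                for j in range(cols):
                    C[i0 + i][j0 + j] = alpha * acc[i][j] + beta * C[i0 + i][j0 + j]
    return C
```

Calling `dgemm_blocked(1.0, A, B, 0.0, C)` overwrites `C` with the product of `A` and `B`. The point of the structure is data reuse: each staged tile element is read many times from fast storage instead of being refetched from main memory, which is the motivation for blocking at both levels.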
doi:10.1145/2063384.2063431
dblp:conf/sc/TanLTPBS11
fatcat:4v6sakpdyzg5pnx6lyzb2ogrda