Reduced computation cost, inference latency, and memory footprint of neural networks are frequently cited as research motivations for pruning and sparsity. However, operationalizing those benefits and understanding the end-to-end effect of algorithm design and regularization on runtime execution are not often examined in depth. Here we apply structured and unstructured pruning to the attention weights of the transformer blocks of the BERT language model, while also expanding block sparse

arXiv:2106.08846v2
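As a rough illustration of the two pruning regimes named in the abstract (not the paper's actual implementation), the sketch below applies magnitude-based unstructured and structured pruning to the query/key/value projection weights of a Hugging Face `BertModel`, using PyTorch's `torch.nn.utils.prune` utilities; the sparsity amounts and the choice of toolkit are assumptions for the example only.

```python
from torch.nn.utils import prune
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
structured = False  # toggle between the two pruning styles

for layer in model.encoder.layer:
    attn = layer.attention.self  # BertSelfAttention: query/key/value Linear layers
    for proj in (attn.query, attn.key, attn.value):
        if structured:
            # Structured: remove whole output rows with the smallest L2 norm.
            prune.ln_structured(proj, name="weight", amount=0.25, n=2, dim=0)
        else:
            # Unstructured: zero the 50% smallest-magnitude individual weights.
            prune.l1_unstructured(proj, name="weight", amount=0.5)
        # Fold the binary mask into the weight tensor permanently.
        prune.remove(proj, "weight")
```

Note that pruning of this kind only zeroes entries in dense tensors; realizing latency or memory gains at runtime additionally requires sparse storage formats and kernels (or a compiler/runtime that exploits the zero pattern), which is the end-to-end gap the abstract points to.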