The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude (ℓ_2 norm) during training, and its implications for the emergent representations within self-attention layers. Empirically, we document norm growth in the training of transformer language models.

arXiv:2010.09697v4
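As a rough illustration of the quantity the abstract refers to, below is a minimal PyTorch sketch that logs the global ℓ_2 norm of a model's parameters over training steps. The toy model, random data, and dummy objective are placeholders for illustration only, not the paper's experimental setup.

```python
import torch
import torch.nn as nn

def parameter_l2_norm(model: nn.Module) -> float:
    """Global l2 norm over all trainable parameters: sqrt(sum_i ||w_i||^2)."""
    total = torch.zeros(())
    for p in model.parameters():
        if p.requires_grad:
            total = total + p.detach().float().pow(2).sum()
    return total.sqrt().item()

# Toy example: one transformer encoder layer trained on random inputs,
# logging the parameter norm to see whether it grows during training.
model = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    x = torch.randn(8, 16, 32)        # (batch, seq_len, d_model)
    loss = model(x).pow(2).mean()     # dummy objective, for illustration
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 10 == 0:
        print(f"step {step:3d}  ||theta||_2 = {parameter_l2_norm(model):.3f}")
```

Tracking this single scalar per checkpoint is cheap, which is what makes norm growth easy to document empirically across training runs.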