Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
[article] arXiv pre-print, 2020
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch.
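The core idea named in the abstract, intra-layer (tensor) model parallelism, can be illustrated with a short sketch: shard one linear layer's weight along its output (column) dimension and the next along its input (row) dimension, so that a single all-reduce suffices to combine the partial results. The sketch below is an illustrative reconstruction, not the paper's implementation; the class names, shapes, and initialization are assumptions, and it omits the custom autograd functions the real code uses to make the communication step differentiable in the backward pass.

```python
import torch
import torch.nn as nn
import torch.distributed as dist


class ColumnParallelLinear(nn.Module):
    """Shard the weight along the output (column) dimension.

    Each rank holds a slice of shape (out_features // world_size, in_features)
    and produces its slice of the output; no communication is needed in this
    forward pass because the slices are only combined by the next layer.
    (Illustrative sketch, not Megatron-LM's actual class.)
    """

    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0
        self.weight = nn.Parameter(
            torch.empty(out_features // world_size, in_features)
        )
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_features) -> (batch, out_features // world_size)
        return x @ self.weight.t()


class RowParallelLinear(nn.Module):
    """Shard the weight along the input (row) dimension.

    Each rank computes a partial sum from its input shard; one all-reduce
    sums the partials into the full output across model-parallel ranks.
    Requires torch.distributed to be initialized before use.
    """

    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert in_features % world_size == 0
        self.weight = nn.Parameter(
            torch.empty(out_features, in_features // world_size)
        )
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        # x_shard: (batch, in_features // world_size) -> partial (batch, out_features)
        partial = x_shard @ self.weight.t()
        dist.all_reduce(partial)  # sum partial outputs over the model-parallel group
        return partial
```

Pairing a column-sharded layer with a row-sharded one is what lets a two-layer transformer MLP block run with only one all-reduce in the forward pass, which is the kind of "few communication operations" the abstract refers to.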
arXiv:1909.08053v4
fatcat:hqdaodavsfb7fcobxrsodjkqam