Machine learning models made up of millions or billions of parameters are trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, the collective communications used in these applications become a bottleneck. Custom collective algorithms, optimized for both particular network topologies and application-specific communication patterns, can alleviate this bottleneck and help these applications scale. However, correctly and efficiently implementing custom collective algorithms is challenging.

arXiv:2201.11840v3
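To make the idea of a topology-optimized collective concrete, below is a minimal single-process sketch of the classic ring all-reduce, the textbook example of a collective algorithm shaped to a specific topology (a ring of peer-to-peer links). The function name `ring_allreduce` and the pure-Python message "simulation" are illustrative assumptions for this sketch; this is not the paper's algorithm or implementation.

```python
# Sketch: ring all-reduce simulated in one process. Each of n ranks holds a
# buffer; afterwards every rank holds the element-wise sum across all ranks.
# Illustrative only -- not the paper's method.

def ring_allreduce(buffers):
    """Element-wise sum across all ranks; every rank ends with the total."""
    n = len(buffers)                       # number of ranks in the ring
    chunk_len = len(buffers[0]) // n       # buffer split into n equal chunks

    def chunk(rank, i):
        start = (i % n) * chunk_len
        return start, buffers[rank][start:start + chunk_len]  # copy via slice

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) to its
    # right neighbor, which accumulates it. After n - 1 steps, rank r holds
    # the fully reduced chunk (r + 1) mod n.
    for s in range(n - 1):
        sends = [(r, *chunk(r, r - s)) for r in range(n)]   # snapshot sends
        for r, start, data in sends:
            dst = (r + 1) % n
            for k, v in enumerate(data):
                buffers[dst][start + k] += v

    # Phase 2: all-gather. Each fully reduced chunk is forwarded around the
    # ring, overwriting the stale partial copies on the other ranks.
    for s in range(n - 1):
        sends = [(r, *chunk(r, r + 1 - s)) for r in range(n)]
        for r, start, data in sends:
            dst = (r + 1) % n
            buffers[dst][start:start + len(data)] = data
    return buffers

# 4 ranks, 8 elements each: afterwards every rank holds the element-wise sum.
bufs = [[float(4 * r + i) for i in range(8)] for r in range(4)]
expected = [sum(vals) for vals in zip(*bufs)]
ring_allreduce(bufs)
assert all(b == expected for b in bufs)
```

The ring is a good fit for ring-shaped link topologies because each rank transfers only 2(n-1)/n of the buffer in total, independent of the number of ranks; on other topologies or message sizes, differently structured algorithms win, which is the kind of customization the abstract refers to.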