ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning [article]

Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He
2021 arXiv pre-print
However, we are getting close to the GPU memory wall.  ...  In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 GB to 80 GB).  ...  ACKNOWLEDGEMENT We thank Elton Zheng, Reza Yazdani Aminabadi, Arash Ashari for their help on improving various components of the code, and Cheng Li for her help in proof reading the paper.  ... 
arXiv:2104.07857v1 fatcat:leqp6gpjfzdvxdzub3uaimzxyu

M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [article]

Junyang Lin, An Yang, Jinze Bai, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Yong Li, Wei Lin, Jingren Zhou, Hongxia Yang
2021 arXiv pre-print
Recent expeditious developments in deep learning algorithms, distributed training, and even hardware design for large models have enabled training extreme-scale models, say GPT-3 and Switch Transformer  ...  Besides demonstrating the application of Pseudo-to-Real, we also provide a technique, Granular CPU offloading, to manage CPU memory for training large model and maintain high GPU utilities.  ...  Besides the application of Pseudo-to-Real training strategy, we further provide Granular CPU offloading to enhance GPU utility while breaking the GPU memory wall with a cost in efficiency.  ... 
arXiv:2110.03888v3 fatcat:gc57h3wizjhmdmws6vg6kz5uay

SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System [article]

Liang Shen, Zhihua Wu, WeiBao Gong, Hongxiang Hao, Yangfan Bai, HuaChao Wu, Xinxuan Wu, Haoyi Xiong, Dianhai Yu, Yanjun Ma
2022 arXiv pre-print
For scalable inference in a single node, especially when the model size is larger than GPU memory, SE-MoE forms the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation  ...  tasks across the memory sections in a round-robin manner for efficient inference.  ...  MoE Training Design In the field of deep learning, there are two key factors affecting the performance and effect of model training: the model scale and the data size.  ... 
arXiv:2205.10034v1 fatcat:2ekxx4t76nhbdndbt4yssd4ale

Symbolic execution of verification languages and floating-point code

Daniel Simon Liew, Alastair Donaldson, Cristian Cadar, Engineering And Physical Sciences Research Council, ARM (Firm)
...  of symbolic execution for this language.  ...  Third, an investigation into the use of coverage-guided fuzzing as a means for solving constraints over finite data types, inspired by the difficulties associated with solving floating-point constraints  ...  The IEEE-754 binary format contains several classes of data: normal, denormal, zero, infinity and NaN.  ...
doi:10.25560/59705 fatcat:zljitt4ouzeyrnowg222qh5owm

Analysis of On-Chip Inductors and Arithmetic Circuits in the Context of High Performance Computing

Stefan Kosnac
While the on-chip structures have scaled down extremely, the pin count of packages could not be increased accordingly, which is known as pin limitation.  ...  The result is accurate enough to be used for circuit optimization.  ...  The Result Classifier handles operations with special values, such as qNaN, sNaN, zero, infinity, and subnormal numbers.  ... 
doi:10.11588/heidok.00030559 fatcat:otsjvfztdjhpfjizsiwimcpx7e