
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [article]

Ze-Feng Gao, Peiyu Liu, Wayne Xin Zhao, Zhong-Yi Lu, Ji-Rong Wen
2022 arXiv   pre-print
Extensive experiments based on T5 and GPT-2 show improved performance and efficiency of the pre-trained language model (a 27.2x reduction in total parameters for superior model performance, compared  ...  Recently, the Mixture-of-Experts (MoE) architecture has achieved remarkable success in increasing the model capacity of large-scale language models.  ...  Introduction: Large-scale pre-trained language models (PLMs), such as BERT (Devlin et al., 2018) and T5 (Raffel et al., 2020), have become the de facto standard in natural language processing (NLP).  ... 
arXiv:2203.01104v3

A Review of Sparse Expert Models in Deep Learning [article]

William Fedus, Jeff Dean, Barret Zoph
2022 arXiv   pre-print
By doing so, the degree of sparsity decouples the parameter count from the compute per example, allowing for extremely large but efficient models.  ...  This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters  ...  Park, Nan Du, Jason Wei, James Lee-Thorp, and Yanqi Zhou for feedback and comments on our drafts.  ... 
arXiv:2209.01667v1
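The decoupling this abstract describes is easy to see with back-of-the-envelope arithmetic (the layer sizes below are made up for illustration): with top-1 routing, a token touches only one expert's feed-forward block, so per-token compute tracks the active parameters while total capacity grows with the expert count.

```python
# Illustrative numbers only: one Transformer FFN block replaced by an MoE layer.
d_model, d_ff, n_experts = 1024, 4096, 64

ffn_params = 2 * d_model * d_ff        # parameters in a single dense FFN expert
total_params = n_experts * ffn_params  # parameters held by the whole MoE layer
active_params = ffn_params             # parameters one token touches (top-1 routing)

print(total_params // active_params)   # -> 64: 64x the capacity at constant per-token compute
```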

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [article]

William Fedus, Barret Zoph, Noam Shazeer
2022 arXiv   pre-print
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example.  ...  Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.  ...  Blake Hechtman who provided invaluable help in profiling and improving the training performance of our models.  ... 
arXiv:2101.03961v3
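The core mechanism in this abstract, top-1 ("switch") routing, can be sketched in a few lines of NumPy. The shapes, router, and experts here are toy choices of ours, not the paper's implementation:

```python
import numpy as np

# Toy sketch of Switch-style top-1 routing: each token is sent to the single
# expert whose router logit is highest, so compute per token stays constant
# no matter how many experts (parameters) the layer holds.
rng = np.random.default_rng(0)

d_model, n_experts, n_tokens = 8, 4, 6
tokens = rng.normal(size=(n_tokens, d_model))
w_router = rng.normal(size=(d_model, n_experts))            # router weights (illustrative)
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

logits = tokens @ w_router                                  # (n_tokens, n_experts)
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
choice = probs.argmax(-1)                                   # top-1 expert per token

out = np.empty_like(tokens)
for e in range(n_experts):
    mask = choice == e
    # each expert processes only its routed tokens; the output is gated by the
    # router probability so gradients can flow back to the router
    out[mask] = (tokens[mask] @ experts[e]) * probs[mask, e:e + 1]
```

Adding experts grows `experts` (total parameters) without changing the per-token work, which is the scaling property the paper exploits.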

Tricks for Training Sparse Translation Models [article]

Dheeru Dua, Shruti Bhosale, Vedanuj Goswami, James Cross, Mike Lewis, Angela Fan
2021 arXiv   pre-print
and dense pre-training.  ...  Sparse scaling architectures, such as BASE Layers, provide flexible mechanisms for different tasks to have a variable number of parameters, which can be useful to counterbalance skewed data distributions.  ...  Figure 4: Expert distribution for Romanian (low-resource) and French (high-resource) as sparse fine-tuning progresses on a pre-trained dense model for WMT-15 with 16 experts.  ... 
arXiv:2110.08246v1

Towards More Effective and Economic Sparsely-Activated Model [article]

Hao Jiang, Ke Zhan, Jianwei Qu, Yongkang Wu, Zhaoye Fei, Xinyu Zhang, Lei Chen, Zhicheng Dou, Xipeng Qiu, Zikai Guo, Ruofei Lai, Jiawen Wu (+5 others)
2021 arXiv   pre-print
for training and implementing extremely large models.  ...  To increase the number of activated experts without an increase in computational cost, we propose SAM (Switch and Mixture) routing, an efficient hierarchical routing mechanism that activates multiple experts  ...  In order to explore the ability of the model, we test the perplexity for the language model in the pre-training stage. The pre-training corpus will be introduced in the next section.  ... 
arXiv:2110.07431v1
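One way to read "Switch and Mixture" hierarchical routing is a cheap top-1 choice over groups of experts followed by a softmax mixture within the chosen group, so several experts activate without scoring all of them. The sketch below is our guess at that shape, not the paper's algorithm:

```python
import numpy as np

# Hypothetical two-level routing: switch to one group, then mix inside it.
rng = np.random.default_rng(3)
d, n_groups, group_size = 8, 2, 3
w_group = rng.normal(size=(d, n_groups))                 # group-level router
w_inner = rng.normal(size=(n_groups, d, group_size))     # per-group inner router
experts = rng.normal(size=(n_groups, group_size, d, d))  # expert weights

def sam_route(x: np.ndarray) -> np.ndarray:
    g = int((x @ w_group).argmax())               # "switch": pick one group (top-1)
    logits = x @ w_inner[g]
    w = np.exp(logits) / np.exp(logits).sum()     # "mixture": softmax over the group
    return sum(w[i] * (x @ experts[g, i]) for i in range(group_size))

y = sam_route(rng.normal(size=d))
```

Only `group_size` experts run per token, and only `n_groups + group_size` router scores are computed rather than one per expert.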

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [article]

Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Furu Wei
2022 arXiv   pre-print
Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval  ...  Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer.  ...  the proposed model to integrate more modalities (e.g., speech, video, and structured knowledge), supporting general-purpose multimodal pre-training.  ... 
arXiv:2111.02358v2

AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [article]

Yaqing Wang, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, Jianfeng Gao
2022 arXiv   pre-print
Fine-tuning large-scale pre-trained language models to downstream tasks requires updating hundreds of millions of parameters.  ...  By only tuning 0.23% of a pre-trained language model's parameters, our model outperforms the full model fine-tuning performance and several competing methods.  ...  AdaMix for parameter-efficient fine-tuning of large pre-trained language models for NLP tasks.  ... 
arXiv:2205.12410v1

MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [article]

Simiao Zuo, Qingru Zhang, Chen Liang, Pengcheng He, Tuo Zhao, Weizhu Chen
2022 arXiv   pre-print
Pre-trained language models have demonstrated superior performance in various natural language processing tasks.  ...  We initialize MoEBERT by adapting the feed-forward neural networks in a pre-trained model into multiple experts. As such, representation power of the pre-trained model is largely retained.  ...  Conclusion We present MoEBERT, which uses a Mixture-of-Experts structure to distill pre-trained language models.  ... 
arXiv:2204.07675v2

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [article]

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He
2022 arXiv   pre-print
As the training of giant dense models hits the limits of today's hardware availability and capability, Mixture-of-Experts (MoE) models become one of the most promising model architectures  ...  Its training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work along with parallel explorations).  ...  Acknowledgment We thank Olatunji Ruwase from the Microsoft DeepSpeed Team for his contributions on developing, debugging, testing, and releasing the DeepSpeed-MoE software.  ... 
arXiv:2201.05596v2

Efficient Language Modeling with Sparse all-MLP [article]

Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li
2022 arXiv   pre-print
In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions.  ...  The proposed sparse all-MLP improves language modeling perplexity and obtains up to 2× improvement in training efficiency compared to both Transformer-based MoEs (GShard, Switch Transformer, Base Layers  ...  It is worth noting that the GPT-3 model was trained with more pre-training data (GPT-3 used 300 billion tokens for pre-training and our pre-training data contains 100 billion tokens).  ... 
arXiv:2203.06850v3

ST-MoE: Designing Stable and Transferable Sparse Expert Models [article]

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus
2022 arXiv   pre-print
In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models.  ...  We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B).  ...  We also thank the Google Brain Team for useful discussions throughout the course of this work.  ... 
arXiv:2202.08906v2

BASE Layers: Simplifying Training of Large, Sparse Models [article]

Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer
2021 arXiv   pre-print
Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters.  ...  We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers.  ...  Future work should explore more efficient implementations for computing balanced assignments, to further improve training speed.  ... 
arXiv:2103.16716v1
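The balanced-assignment idea can be approximated with a simple greedy pass. The paper solves a linear assignment problem; the capacity-capped greedy below is only a stand-in that shows why hard balance removes the need for auxiliary load-balancing losses:

```python
import numpy as np

# Stand-in for BASE-style balanced routing: assign tokens to high-affinity
# experts subject to a hard per-expert capacity, so every expert receives
# exactly the same number of tokens.
rng = np.random.default_rng(1)
n_tokens, n_experts = 8, 4
capacity = n_tokens // n_experts
scores = rng.normal(size=(n_tokens, n_experts))   # token-expert affinities

assignment = [-1] * n_tokens                      # expert chosen for each token
load = [0] * n_experts                            # tokens assigned to each expert

# visit (token, expert) pairs from highest to lowest affinity
for flat in np.argsort(-scores, axis=None):
    t, e = divmod(int(flat), n_experts)
    if assignment[t] == -1 and load[e] < capacity:
        assignment[t] = e
        load[e] += 1

assert all(n == capacity for n in load)           # perfectly balanced by construction
```

Because total capacity equals the token count, every token is guaranteed a slot, and no expert can be over- or under-utilized.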

Scalable and Efficient MoE Training for Multitask Multilingual Models [article]

Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andres Felipe Cruz Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, Hany Hassan Awadalla
2021 arXiv   pre-print
The Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models that have sublinear compute costs with respect to their parameters.  ...  A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.  ...  training efficiency and an expert pruning strategy to improve inference time (Section 3). 3) Effective training recipes to scale up multitask and multilingual language models with the MoE model architecture  ... 
arXiv:2109.10465v1

Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference [article]

Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, Orhan Firat
2021 arXiv   pre-print
Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation.  ...  On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs.  ...  Scaling Transformers with Mixture-of-Experts The Transformer (Vaswani et al., 2017) architecture is a popular model used for neural machine translation and other natural language understanding/generation  ... 
arXiv:2110.03742v1
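Task-level routing, as described in this abstract, makes one routing decision per task (or language pair) rather than per token, so at inference only the chosen expert's weights need to be served. A minimal sketch, with task embeddings and shapes invented for illustration:

```python
import numpy as np

# Toy task-MoE: the router input is a task embedding, not the token itself,
# so every token in a batch for the same task goes to the same expert.
rng = np.random.default_rng(2)
d_model, n_experts = 8, 4
task_embed = {"en-fr": rng.normal(size=d_model),   # hypothetical task embeddings
              "en-ro": rng.normal(size=d_model)}
w_router = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def route_task(task: str) -> int:
    """Pick one expert per task from the task embedding alone."""
    return int((task_embed[task] @ w_router).argmax())

def moe_layer(tokens: np.ndarray, task: str) -> np.ndarray:
    e = route_task(task)              # one decision for the whole batch
    return tokens @ experts[e]

batch = rng.normal(size=(5, d_model))
out = moe_layer(batch, "en-fr")
```

Serving a single task then requires loading one expert instead of all `n_experts`, which is the inference-efficiency argument the paper makes against token-level routing.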

Do Transformer Modifications Transfer Across Implementations and Applications? [article]

Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li (+4 others)
2021 arXiv   pre-print
In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing.  ...  The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption.  ...  Switch Transformer, mixture of experts, and product key memories all improve performance with significantly more parameters than the baseline model.  ... 
arXiv:2102.11972v2