5 Hits in 2.4 sec

BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance [article]

Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang, Yaohong Jin
2020 arXiv   pre-print
EMD enables effective matching for many-to-many layer mapping.  ...  In this paper, we propose a novel BERT distillation method based on many-to-many layer mapping, which allows each intermediate student layer to learn from any intermediate teacher layer.  ...  Methodology: In this section, we propose a novel BERT compression method based on many-to-many layer mapping and Earth Mover's Distance (called BERT-EMD).  ...
arXiv:2010.06133v1 fatcat:ttbwizqsdfaupdfiztap5z6zf4
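
As a rough sketch of the idea described in the abstract (the notation below is assumed for illustration, not taken from the paper), the many-to-many mapping can be posed as an optimal-transport problem over layer pairs: d_ij is a per-pair distillation cost between student layer i and teacher layer j, and EMD finds the flow f_ij that minimizes the total weighted cost subject to per-layer weight constraints.

```latex
% Hedged sketch of an EMD-style layer-mapping objective; the symbols w^S_i, w^T_j,
% d_{ij}, f_{ij} are assumptions for illustration, not the paper's exact notation.
% d_{ij}: cost of matching student layer i to teacher layer j (e.g., an MSE between hidden states)
% f_{ij}: flow, i.e., how strongly student layer i learns from teacher layer j
\mathrm{EMD}(S, T) = \min_{f_{ij} \ge 0}
  \frac{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}\, d_{ij}}
       {\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}}
\quad \text{s.t.} \quad
  \sum_{j=1}^{N} f_{ij} \le w^{S}_{i}, \qquad
  \sum_{i=1}^{M} f_{ij} \le w^{T}_{j}, \qquad
  \sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij} = \min\!\Big(\sum_{i} w^{S}_{i}, \sum_{j} w^{T}_{j}\Big)
```

This is the standard Earth Mover's Distance formulation; applying it over layer pairs rather than fixing a one-to-one layer alignment is what yields the many-to-many mapping the abstract refers to.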

BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance

Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang, Yaohong Jin
2020 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)   unpublished
EMD enables effective matching for many-to-many layer mapping.  ...  In this paper, we propose a novel BERT distillation method based on many-to-many layer mapping, which allows each intermediate student layer to learn from any intermediate teacher layer.  ...  Methodology: In this section, we propose a novel BERT compression method based on many-to-many layer mapping and Earth Mover's Distance (called BERT-EMD).  ...
doi:10.18653/v1/2020.emnlp-main.242 fatcat:rhbukunegfhkhjjwzrjccntrta

Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing [article]

Haoyu He, Xingjian Shi, Jonas Mueller, Sheng Zha, Mu Li, George Karypis
2021 arXiv   pre-print
with better bias/variance trade-off for estimating the MI between the teacher and the student.  ...  Our experiments reveal the following: 1) the approach used to distill the intermediate representations is the most important factor in KD performance, 2) among different objectives for intermediate distillation  ...  Li et al. (2020) proposed a many-to-many layer mapping function leveraging the Earth Mover's Distance to transfer intermediate knowledge.  ...
arXiv:2109.11105v1 fatcat:z6ngyoqy2bcklhcrepvsoyr3xy
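
The snippet above describes unifying intermediate-representation distillation objectives under a mutual information (MI) objective. As a hedged illustration (not the Distiller code, nor its exact estimator), the sketch below uses an InfoNCE-style contrastive bound, one common way to estimate MI between teacher and student layer representations; all function names, shapes, and hyperparameters here are assumptions.

```python
# Hedged sketch: InfoNCE-style lower bound on MI between teacher and student
# intermediate representations. Illustrative only, not the Distiller implementation.
import torch
import torch.nn.functional as F


def info_nce_distill_loss(student_h, teacher_h, proj, temperature=0.1):
    """student_h, teacher_h: [batch, hidden] pooled layer representations.
    proj: learned linear layer mapping student hidden size to teacher hidden size."""
    s = F.normalize(proj(student_h), dim=-1)   # project and L2-normalize student reps
    t = F.normalize(teacher_h, dim=-1)         # L2-normalize teacher reps
    logits = s @ t.T / temperature             # [batch, batch] similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    # Matching teacher/student pairs (the diagonal) are positives; the other examples
    # in the batch act as negatives, giving a contrastive MI lower bound.
    return F.cross_entropy(logits, targets)


# Example usage with assumed sizes (student hidden 384, teacher hidden 768):
proj = torch.nn.Linear(384, 768)
student_h = torch.randn(16, 384)
teacher_h = torch.randn(16, 768)
loss = info_nce_distill_loss(student_h, teacher_h, proj)
```

A contrastive bound of this kind is only one member of the family of objectives the paper studies; MSE- and KL-based intermediate losses can likewise be read as (cruder) MI surrogates.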

Scene-adaptive Knowledge Distillation for Sequential Recommendation via Differentiable Architecture Search [article]

Lei Chen, Fajie Yuan, Jiaxi Yang, Min Yang, Chengming Li
2022 arXiv   pre-print
In addition, we leverage Earth Mover's Distance (EMD) to realize many-to-many layer mapping during knowledge distillation, which enables each intermediate student layer to learn from other intermediate  ...  Naturally, we argue that compressing the heavy recommendation models into middle- or light-weight neural networks is of great importance for practical production systems.  ...  In addition, we leverage Earth Mover's Distance (EMD) to realize effective many-to-many layer mapping during the distillation process, enabling each intermediate layer of the student to learn from any other  ...
arXiv:2107.07173v2 fatcat:gjveklueevdrrimfwtcbf5ixla
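
This abstract again relies on EMD to decide how much each student layer learns from each teacher layer. A minimal sketch of that step is below, posed as the standard transportation LP; the cost function, the uniform layer weights, and the use of SciPy's linprog solver are assumptions for illustration, not the implementation from the cited papers.

```python
# Hedged sketch: many-to-many layer mapping via Earth Mover's Distance,
# solved as a transportation linear program. Illustrative only.
import numpy as np
from scipy.optimize import linprog


def emd_layer_flow(cost, w_student, w_teacher):
    """cost: [M, N] matrix of per-layer-pair distillation losses (e.g., MSE between
    student layer i and teacher layer j). w_student, w_teacher: layer weights summing to 1.
    Returns the optimal flow f[i, j]: how much student layer i learns from teacher layer j."""
    M, N = cost.shape
    A_eq = []
    for i in range(M):                        # each student layer distributes its weight
        row = np.zeros(M * N)
        row[i * N:(i + 1) * N] = 1.0
        A_eq.append(row)
    for j in range(N):                        # each teacher layer supplies its weight
        col = np.zeros(M * N)
        col[j::N] = 1.0
        A_eq.append(col)
    b_eq = np.concatenate([w_student, w_teacher])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(M, N)


# Example: 4 student layers learning from 12 teacher layers, uniform layer weights.
cost = np.random.rand(4, 12)
flow = emd_layer_flow(cost, np.full(4, 1 / 4), np.full(12, 1 / 12))
distill_loss = (flow * cost).sum()            # EMD-weighted intermediate distillation loss
```

The resulting flow matrix plays the role of soft layer-mapping weights: layers that are cheap to match receive more flow, so the intermediate distillation loss emphasizes the most compatible teacher/student layer pairs.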

Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing

Haoyu He, Xingjian Shi, Jonas Mueller, Sheng Zha, Mu Li, George Karypis
2021 Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing   unpublished
Within Distiller, we unify commonly used objectives for distillation of intermediate representations under a universal mutual information (MI) objective and propose a class of MI-α objective functions with  ...  Our experiments reveal the following: 1) the approach used to distill the intermediate representations is the most important factor in KD performance, 2) among different objectives for intermediate distillation  ...  ..., 2019), with subsequent task-specific distillation. Li et al. (2020) proposed a many-to-many layer mapping function leveraging the Earth Mover's Distance to transfer intermediate knowledge.  ...
doi:10.18653/v1/2021.sustainlp-1.13 fatcat:3dutoum5h5cutfwx34ohighobq