BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance
[article]
2020
arXiv
pre-print
EMD enables effective matching for many-to-many layer mapping. ...
In this paper, we propose a novel BERT distillation method based on many-to-many layer mapping, which allows each intermediate student layer to learn from any intermediate teacher layer. ...
Methodology: In this section, we propose a novel BERT compression method based on many-to-many layer mapping and Earth Mover's Distance (called BERT-EMD). ...
arXiv:2010.06133v1
fatcat:ttbwizqsdfaupdfiztap5z6zf4
BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance
2020
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
unpublished
EMD enables effective matching for many-to-many layer mapping. ...
In this paper, we propose a novel BERT distillation method based on many-to-many layer mapping, which allows each intermediate student layer to learn from any intermediate teacher layer. ...
Methodology: In this section, we propose a novel BERT compression method based on many-to-many layer mapping and Earth Mover's Distance (called BERT-EMD). ...
doi:10.18653/v1/2020.emnlp-main.242
fatcat:rhbukunegfhkhjjwzrjccntrta
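The two BERT-EMD entries above describe the same core idea: rather than fixing a one-to-one alignment between student and teacher layers, the alignment is treated as an optimal-transport problem, and the resulting transport plan weights the layer-wise distillation losses. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' released code: the pooled-representation cost matrix, the Sinkhorn approximation of the transport plan, the uniform layer weights, and all function names are assumptions made for the example.

```python
# Hypothetical sketch of EMD-style many-to-many layer mapping for distillation.
# Assumptions: uniform layer weights, L2 distance between mean-pooled hidden
# states as the ground cost, equal student/teacher hidden sizes, and a Sinkhorn
# approximation of the transport plan (the paper's exact solver and
# layer-weighting scheme are not reproduced here).
import torch

def sinkhorn_plan(cost, n_iters=50, eps=0.1):
    """Approximate an optimal transport plan for a cost matrix with
    uniform marginals using Sinkhorn iterations."""
    n_s, n_t = cost.shape
    mu = torch.full((n_s,), 1.0 / n_s)   # student layer weights
    nu = torch.full((n_t,), 1.0 / n_t)   # teacher layer weights
    K = torch.exp(-cost / eps)           # Gibbs kernel
    u, v = torch.ones(n_s), torch.ones(n_t)
    for _ in range(n_iters):
        u = mu / (K @ v)
        v = nu / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # plan, shape [n_s, n_t]

def emd_layer_loss(student_hidden, teacher_hidden):
    """student_hidden / teacher_hidden: lists of [batch, seq, dim] tensors,
    one per intermediate layer (a projection would be needed if dims differ)."""
    # Pool each layer to a single vector to build the layer-to-layer cost matrix.
    s = torch.stack([h.mean(dim=(0, 1)) for h in student_hidden])   # [n_s, dim]
    t = torch.stack([h.mean(dim=(0, 1)) for h in teacher_hidden])   # [n_t, dim]
    cost = torch.cdist(s, t)                                        # [n_s, n_t]
    plan = sinkhorn_plan(cost.detach())   # solve the mapping with a fixed cost
    return (plan * cost).sum()            # EMD-weighted many-to-many layer loss
```

Detaching the cost before solving keeps the layer mapping fixed within a training step, so gradients flow only through the weighted layer distances; the plan itself lets every student layer draw on several teacher layers at once, which is the many-to-many behavior the abstract refers to.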
Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing
[article]
2021
arXiv
pre-print
... with better bias/variance trade-off for estimating the MI between the teacher and the student. ...
Our experiments reveal the following: 1) the approach used to distill the intermediate representations is the most important factor in KD performance, 2) among different objectives for intermediate distillation ...
Li et al. (2020) proposed a many-to-many layer mapping function leveraging the Earth Mover's Distance to transfer intermediate knowledge. ...
arXiv:2109.11105v1
fatcat:z6ngyoqy2bcklhcrepvsoyr3xy
Scene-adaptive Knowledge Distillation for Sequential Recommendation via Differentiable Architecture Search
[article]
2022
arXiv
pre-print
In addition, we leverage Earth Mover's Distance (EMD) to realize many-to-many layer mapping during knowledge distillation, which enables each intermediate student layer to learn from other intermediate ...
Naturally, we argue that compressing heavy recommendation models into middle- or light-weight neural networks is of great importance for practical production systems. ...
In addition, we leverage Earth Mover's Distance (EMD) to realize effective many-to-many layer mapping during the distillation process, enabling each intermediate student layer to learn from any other ...
arXiv:2107.07173v2
fatcat:gjveklueevdrrimfwtcbf5ixla
Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing
2021
Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing
unpublished
Within Distiller, we unify commonly used objectives for distillation of intermediate representations under a universal mutual information (MI) objective and propose a class of MIα objective functions with ...
Our experiments reveal the following: 1) the approach used to distill the intermediate representations is the most important factor in KD performance, 2) among different objectives for intermediate distillation ...
..., 2019), with subsequent task-specific distillation. Li et al. (2020) proposed a many-to-many layer mapping function leveraging the Earth Mover's Distance to transfer intermediate knowledge. ...
doi:10.18653/v1/2021.sustainlp-1.13
fatcat:3dutoum5h5cutfwx34ohighobq
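Both Distiller entries report the same finding: the way intermediate representations are distilled matters more for KD performance than the other design choices. For reference, the hypothetical sketch below shows the common projection-plus-MSE form of intermediate-representation distillation; the class name, the fixed one-to-one layer mapping, and the projection setup are assumptions for illustration, and this is not the paper's MIα objective or released code.

```python
# Generic sketch of intermediate-representation distillation, the component the
# Distiller study identifies as most important for KD performance. It uses a
# simple MSE objective with learned projections, NOT the paper's MI-alpha
# objectives; it only illustrates the kind of term those objectives generalize.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateDistiller(nn.Module):
    def __init__(self, student_dim, teacher_dim, n_student_layers):
        super().__init__()
        # One linear projection per mapped student layer to match teacher width.
        self.proj = nn.ModuleList(
            nn.Linear(student_dim, teacher_dim) for _ in range(n_student_layers)
        )

    def forward(self, student_hidden, teacher_hidden):
        """student_hidden: list of [batch, seq, student_dim] tensors.
        teacher_hidden: list of matched [batch, seq, teacher_dim] tensors
        (a fixed one-to-one layer mapping is assumed here for simplicity)."""
        loss = 0.0
        for proj, s, t in zip(self.proj, student_hidden, teacher_hidden):
            loss = loss + F.mse_loss(proj(s), t.detach())
        return loss / len(self.proj)
```

As the abstract describes it, the MIα family treats this kind of matching term as part of a universal mutual-information objective between teacher and student representations, with a tunable bias/variance trade-off in the MI estimate rather than a fixed distance function.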