To address this flaw, we propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI), which jointly masks a token n-gram if it exhibits high collocation ..., and random-span masking. ... PMI: FROM BIGRAMS TO n-GRAMS. Our aim is to define a masking strategy that targets correlated sequences of tokens in a principled way. ...
arXiv:2010.01825v1
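As a minimal illustration of the collocation criterion quoted above: the bigram measure PMI(w1, w2) = log [p(w1, w2) / (p(w1) p(w2))] is standard, and the sketch below scores bigrams by it over a toy token stream. The corpus, the min_count cutoff, and the function name are illustrative assumptions, not the paper's exact pipeline; in particular, the paper's n-gram extension is more refined than naively multiplying unigram probabilities.

```python
# Sketch: rank bigrams by pointwise mutual information over a corpus,
# in the spirit of PMI-Masking's collocation scoring (assumptions noted above).
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score each bigram by PMI(w1, w2) = log p(w1, w2) / (p(w1) p(w2))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI estimates for rare pairs are unreliable
        p_joint = count / n_bi
        scores[(w1, w2)] = math.log(
            p_joint / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni))
        )
    return scores

tokens = ("the hong kong stock market rose while the new york "
          "stock market fell in hong kong trading").split()
for bigram, score in sorted(pmi_scores(tokens).items(), key=lambda kv: -kv[1]):
    print(bigram, round(score, 2))  # collocations like ('hong', 'kong') rank highest
```

High-PMI n-grams found this way would then be the units masked jointly, rather than masking each token independently at random.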
Findings of the Association for Computational Linguistics: EMNLP 2021
PMI-masking: Principled masking of correlated spans. In Proc. of ICLR. ... The tendency of VLP models is to predict something that is correlated with the text, or common answers. ...
doi:10.18653/v1/2021.findings-emnlp.259
We theoretically predict the existence of an embedding rank bottleneck that limits the contribution of self-attention width to the Transformer expressivity. ... We empirically demonstrate the existence of this bottleneck and its implications for the depth-to-width interplay of Transformer architectures, linking the architecture variability across domains to the ... Yoav Levine was supported by the Israel Academy of Sciences Adams fellowship. ...
arXiv:2105.03928v2
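A minimal numerical sketch of the linear-algebra fact behind such a rank bottleneck: if the vocabulary embedding matrix has rank r below the model width, every representation downstream of it lies in an r-dimensional subspace, and no subsequent linear map of full width can raise that rank. The dimensions and random matrices here are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch: rank(E @ W) <= rank(E), so a low-rank embedding caps what
# added self-attention width can contribute (illustrative dimensions).
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, r = 1000, 512, 64

# Embedding of rank at most r (e.g., a small embedding projected up to d_model).
E = rng.standard_normal((vocab, r)) @ rng.standard_normal((r, d_model))

# A full-width linear map applied afterwards (e.g., an attention projection).
W = rng.standard_normal((d_model, d_model))

print(np.linalg.matrix_rank(E))      # 64
print(np.linalg.matrix_rank(E @ W))  # still 64, despite width d_model = 512
```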