Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution
Proceedings of the 28th International Conference on Computational Linguistics
Lexical substitution, i.e. the generation of plausible words that can replace a particular target word in a given context, is a powerful technology that can serve as a backbone for various NLP applications, including word sense induction and disambiguation, lexical relation extraction, and data augmentation. In this paper, we present a large-scale comparative study of lexical substitution methods employing both older and the most recent language models and masked language models (LMs and MLMs), such as context2vec, ELMo, BERT, RoBERTa, and XLNet. We show that the already competitive results achieved by SOTA LMs/MLMs can be substantially improved further if information about the target word is injected properly. Several existing and new target word injection methods are compared for each LM/MLM using both intrinsic evaluation on lexical substitution datasets and extrinsic evaluation on word sense induction (WSI) datasets. On two WSI datasets we obtain new SOTA results. In addition, we analyze the types of semantic relations between target words and their substitutes generated by different models or given by annotators.
* Left Samsung.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
... models on more data, what are the other ways to improve lexical substitution, and (iii) what are the generated substitutes semantically. More specifically, the main contributions of the paper are as follows:
• A comparative study of five neural LMs/MLMs applied to lexical substitution, based on both intrinsic and extrinsic evaluation.
• A study of target word injection methods for further improving lexical substitution quality.
• An analysis of the types of semantic relations (synonyms, hypernyms, co-hyponyms, etc.) produced by neural substitution models as well as by human annotators.
Related Work
Solving the lexical substitution task requires finding words that are both appropriate in the given context and related to the target word in some sense (which may vary depending on the application of the generated substitutes). To achieve this, unsupervised substitution models rely heavily on distributional similarity models of words (DSMs) and language models (LMs). Probably the most commonly used DSM is the word2vec model (Mikolov et al., 2013).
It learns word embeddings and context embeddings to be similar when they tend to occur together, resulting in similar embeddings for distributionally similar words. Contexts are either nearby words or syntactically related words (Levy and Goldberg, 2014). Melamud et al. (2015b) proposed several metrics for lexical substitution based on the embedding similarity of substitutes both to the target word and to the words in the given context. Later, Roller and Erk (2016) improved this approach by switching from cosine similarity to the dot product and applying an additional trainable transformation to context word embeddings. A more sophisticated model, context2vec, which produces embeddings for a word in a particular context (contextualized word embeddings), was proposed by Melamud et al. (2016) and was shown to outperform previous models in a ranking scenario where candidate substitutes are given. The training objective is similar to that of word2vec, but the context representation is produced by two LSTMs (a forward one for the left context and a backward one for the right context), whose final outputs are combined by feed-forward layers. For lexical substitution, candidate word embeddings are ranked by their similarity to the given context representation. A similar architecture consisting of a forward and a backward LSTM is employed in ELMo (Peters et al., 2018). However, in ELMo each LSTM was trained with the LM objective instead. To rank candidate substitutes using ELMo, Soler et al. (2019) proposed calculating the cosine similarity between the contextualized ELMo embeddings of the target word and of each candidate substitute. This requires feeding the original example to the model with the target word replaced by one candidate substitute at a time. The average of the outputs at the target timestep from all ELMo layers performed best. However, they found that context2vec performed even better and attributed this to its negative sampling training objective, which is more closely related to the task.
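The ranking scenario described above, in which candidate substitutes are ordered by their similarity to a context representation, can be illustrated with a minimal sketch. The vectors below are hand-made toy values rather than outputs of context2vec or ELMo, and the function names `cosine` and `rank_candidates` are hypothetical helpers introduced only for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, to avoid division by zero
    return dot / (norm_u * norm_v)


def rank_candidates(context_vec, candidate_vecs):
    """Return (word, score) pairs sorted by decreasing similarity to the context."""
    scored = [(word, cosine(context_vec, vec)) for word, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption): a real system would take the context vector from
# a trained context encoder and the candidate vectors from its embedding table.
context_vec = [0.9, 0.1, 0.0]
candidate_vecs = {
    "bright": [0.8, 0.2, 0.1],
    "smart": [0.1, 0.9, 0.2],
    "heavy": [0.0, 0.1, 0.9],
}

ranking = rank_candidates(context_vec, candidate_vecs)
for word, score in ranking:
    print(f"{word}\t{score:.3f}")
```

Replacing `cosine` with a plain dot product would give the variant of the similarity function that Roller and Erk (2016) are reported to have switched to; their additional trainable transformation is not modelled here.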
Recently, Transformer-based models pre-trained on huge corpora with LM or similar objectives have shown SOTA results in various NLP tasks. BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) were trained to restore a word replaced with a special [MASK] token given its full left and right contexts (masked LM objective), while XLNet (Yang et al., 2019) was trained to predict a word at a specified position given only some randomly selected words from its context (permutation LM objective). Zhou et al. (2019) reported that BERT performs poorly for lexical substitution (which is contrary to our experiments) and proposed two improvements to achieve SOTA results with it. Firstly, dropout is applied to the target word embedding before it is shown to the model. Secondly, the similarity between the original contextualized representations of the context words and their representations after replacing the target with one of the possible substitutes is integrated into the ranking metric to ensure minimal changes in the sentence meaning. This approach is computationally expensive: it requires several forward passes of BERT for each input example, with the number of passes depending on the number of possible substitutes. We are not aware of any work applying XLNet to lexical substitution, but our experiments show that it outperforms BERT by a large margin.
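The second improvement described above can be sketched as follows. This is a simplified illustration and not the exact formulation of Zhou et al. (2019): the uniform average over context tokens, the additive combination, the weight `alpha`, and the helper names are all assumptions made for the example, and the representations are toy vectors rather than BERT outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_preservation(original_reprs, substituted_reprs, target_index):
    """Average similarity of context-token representations before and after
    substitution, skipping the target position itself (uniform weights are
    an assumption of this sketch)."""
    sims = [
        cosine(orig, sub)
        for i, (orig, sub) in enumerate(zip(original_reprs, substituted_reprs))
        if i != target_index
    ]
    return sum(sims) / len(sims) if sims else 0.0


def combined_score(proposal_score, preservation, alpha=0.5):
    """Hypothetical ranking metric: model proposal score plus weighted preservation."""
    return proposal_score + alpha * preservation


# Toy per-token representations for a three-token sentence; the target is token 1.
original = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# In a real system each entry below would need its own forward pass of the
# model, which is where the computational cost noted above comes from.
after_substitution = {
    "good_candidate": [[0.9, 0.1], [0.4, 0.6], [0.1, 0.9]],
    "bad_candidate": [[0.0, 1.0], [0.9, 0.1], [1.0, 0.0]],
}

scores = {
    word: combined_score(0.3, context_preservation(original, reprs, target_index=1))
    for word, reprs in after_substitution.items()
}
print(scores)
```

With equal proposal scores, the candidate that leaves the context-token representations nearly unchanged receives the higher combined score, which is the intended effect of the preservation term.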