Time-Aware Word Embeddings for Three Lebanese News Archives

Jad Doughman, Fatima Abu Salem, Shady Elbassuoni
2020 International Conference on Language Resources and Evaluation  
Word embeddings have proven to be an effective method for capturing semantic relations among distinct terms within a large corpus. In this paper, we present a set of word embeddings learnt from three large Lebanese news archives, which collectively consist of 609,386 scanned newspaper images and span a combined total of 151 years of coverage, ranging from 1933 to 2011. The ideological diversity of the news archives, together with the temporal variability of the embeddings, offers a rare glimpse into the variation of word representation across the left-right political spectrum. To train the word embeddings, Google's Tesseract 4.0 OCR engine was employed to transcribe the scanned news archives, and various archive-level as well as decade-level word embeddings were learnt. To evaluate the accuracy of the learnt word embeddings, a benchmark of analogy tasks was used. Finally, we demonstrate an interactive system that allows the end user to visualize, for a given word of interest, the variation of the top-k closest words in the embedding space as a function of time and across news archives, using an animated scatter plot.
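As a rough illustration of the two operations the abstract mentions (looking up the top-k closest words in each decade-level embedding, and scoring an analogy benchmark), the following is a minimal sketch rather than the authors' implementation. It assumes cosine similarity as the closeness measure and the common vector-offset method for analogies; the abstract specifies neither. The dictionary `EMBEDDINGS_BY_DECADE`, its decade labels, its English words, and all vector values are hypothetical toy data, not results from the archives.

```python
import math

# Hypothetical toy embeddings keyed by decade. Real decade-level models would
# be trained on the OCR-transcribed archives; these values are illustrative only.
EMBEDDINGS_BY_DECADE = {
    "1950s": {
        "man": [1.0, 0.0, 0.0],
        "woman": [0.0, 1.0, 0.0],
        "king": [1.0, 0.0, 1.0],
        "queen": [0.0, 1.0, 1.0],
        "paper": [0.1, 0.1, -1.0],
    },
    "1990s": {
        "man": [1.0, 0.0, 0.0],
        "woman": [0.0, 1.0, 0.0],
        "king": [0.2, 1.0, 1.0],
        "queen": [0.0, 1.0, 1.0],
        "paper": [0.1, 0.1, -1.0],
    },
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(embeddings, word, k=2):
    """Return the k words closest to `word` by cosine similarity (assumed measure)."""
    target = embeddings[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


def solve_analogy(embeddings, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method (b - a + c)."""
    query = [vb - va + vc for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))


def analogy_accuracy(embeddings, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(
        1 for a, b, c, expected in questions
        if solve_analogy(embeddings, a, b, c) == expected
    )
    return correct / len(questions)


if __name__ == "__main__":
    # Show how the neighbours of one word shift from decade to decade.
    for decade, embeddings in EMBEDDINGS_BY_DECADE.items():
        print(decade, top_k(embeddings, "king", k=2))
```

Running the same `top_k` query against each decade's model yields the per-decade neighbour lists that a time-animated visualization could plot, and `analogy_accuracy` gives a single score per model that can be compared across archives or decades.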
dblp:conf/lrec/DoughmanSE20 fatcat:gkhr2nhjxfbrrmmh6kptqlrj6a