Short Text Classification Using Deep Representation: A Case Study of Spanish Tweets in Coset Shared Task

Erfaneh Gharavi, Kayvan Bijari
2017 Annual Conference of the Spanish Society for Natural Language Processing  
Topic identification, a specific case of text classification, is one of the primary steps toward extracting knowledge from raw textual data. In such tasks, words are treated as a set of features. Because traditional feature selection methods produce high-dimensional, sparse feature vectors, most text classification methods proposed for this purpose lack performance and accuracy. In tweets, which contain only a limited number of words, the aforementioned problems are more pronounced than ever. To alleviate these issues, we propose a new topic identification method for Spanish tweets based on a deep representation of Spanish words. In the proposed method, each word is replaced with an equivalent multi-dimensional vector computed by a transformation of the raw text data. Averaging is used to aggregate the word vectors into a tweet representation. Our model is trained on this deep vectorized representation of the tweets, and an ensemble of different classifiers is used for Spanish tweet classification. The best result was obtained by a fully connected multi-layer neural network with three hidden layers. The experimental results demonstrate the feasibility and scalability of the proposed method.
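
The averaging step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the toy three-dimensional `EMBEDDINGS` table, the `tweet_vector` helper, whitespace tokenization, and the choice to skip out-of-vocabulary tokens are not taken from the paper, which does not specify these details in the abstract.

```python
from typing import Dict, List

# Hypothetical toy embeddings (3 dimensions). In practice the vectors would be
# learned from a large Spanish corpus; these values are made up for the example.
EMBEDDINGS: Dict[str, List[float]] = {
    "el": [0.0, 1.0, 0.0],
    "gobierno": [1.0, 0.0, 0.0],
    "vota": [1.0, 1.0, 0.0],
}
DIM = 3


def tweet_vector(tokens: List[str],
                 embeddings: Dict[str, List[float]] = EMBEDDINGS,
                 dim: int = DIM) -> List[float]:
    """Average the vectors of known tokens into one fixed-length tweet vector.

    Assumption: tokens missing from the vocabulary are skipped, and a tweet
    with no known tokens maps to the zero vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# "hoy" is not in the toy vocabulary, so only three tokens contribute.
vec = tweet_vector("el gobierno vota hoy".split())
```

The resulting fixed-length vector is what a downstream classifier, such as the fully connected network mentioned in the abstract, would take as input regardless of how many words the tweet contains.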
dblp:conf/sepln/GharaviB17 fatcat:lav3zvizmvby7ptz4q6afzrizq