Statistical Analysis of Classification Algorithms for Predicting Socioeconomic Status of Twitter Users [thesis]

Ying Zhou
The purpose of this study is to compare a series of well-known statistical machine learning techniques that classify users of the online social network (OSN) Twitter by socioeconomic status (upper/middle/lower). These approaches differ in their assumptions, strengths, and weaknesses. In the experiments, five classification algorithms are employed: Logistic Regression, Support Vector Machine (SVM), Naïve Bayes (NB), k-Nearest Neighbors, and Decision Tree. They are applied to a high-dimensional data set extracted from users' platform-based and profile-based behavior on Twitter. These algorithms are theoretically investigated and experimentally evaluated in terms of four performance measures: accuracy, precision, recall, and AUC. Then, ensemble methods, i.e., Bagging and Boosting, are employed to improve the performance of the aforementioned classifiers. Multivariate analysis of variance is employed to examine whether the performance measures of these algorithms differ significantly, and univariate analysis of variance is used to analyze the differences among the classification methods for each performance measure. The analyses indicate a significant difference among these algorithms; both SVM and NB achieve good performance on our high-dimensional OSN data set.
doi:10.22215/etd/2017-11995 fatcat:bxh3473duvgl3py2e3b5w56h4u
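
The four performance measures named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the thesis's implementation: the function names, the toy labels, and the scores below are hypothetical, and macro-averaging for precision and recall plus a one-vs-rest AUC for a single class are assumptions, since the abstract does not say how the measures were aggregated across the three classes.

```python
def accuracy(y_true, y_pred):
    """Fraction of users whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision_recall(y_true, y_pred, labels):
    """Unweighted mean of per-class precision and recall (assumed averaging)."""
    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        # Guard against classes that were never predicted or never observed.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


def binary_auc(is_positive, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six users, three socioeconomic classes.
labels = ["upper", "middle", "lower"]
y_true = ["upper", "middle", "lower", "middle", "lower", "upper"]
y_pred = ["upper", "middle", "middle", "middle", "lower", "lower"]

# Hypothetical classifier scores for the "upper" class (one-vs-rest).
upper_scores = [0.9, 0.2, 0.1, 0.3, 0.2, 0.25]
is_upper = [t == "upper" for t in y_true]

acc = accuracy(y_true, y_pred)
prec, rec = macro_precision_recall(y_true, y_pred, labels)
auc_upper = binary_auc(is_upper, upper_scores)
print(acc, prec, rec, auc_upper)
```

Computing each measure per classifier and per resampling run would give the table of observations that a multivariate or univariate analysis of variance could then compare across algorithms.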