An Efficient SMOTE-based Model for Dyslexia Prediction

Vani Chakraborty, Research Scholar, Garden City University, Meenatchi Sundaram
2021 International Journal of Information Engineering and Electronic Business  
Dyslexia is a learning disability that makes it difficult for an individual to read, write, spell, and perform simple mathematical calculations. It affects almost 10% of the global population, and detecting it early is paramount for handling it effectively. There are many methods for detecting the risk of dyslexia, including assessment tools, handwriting recognition, expert psychological help, and eye movement data recorded while reading. Another convenient and easy way of detecting the risk of dyslexia is to have an individual play a simple game related to phonological awareness, syllabic awareness, auditory discrimination, lexical awareness, visual working memory, and similar skills, and to record the observations. The proposed research work presents an effective way of predicting the risk of dyslexia with high accuracy and reliability. It uses a dataset made available in the Kaggle repository to predict the risk of dyslexia with various machine learning algorithms. The dataset has an unequal distribution of positive and negative cases, so classification accuracy is compromised if the data are used directly. The proposed research work therefore uses three resampling techniques to reduce the imbalance in the dataset: undersampling with the near-miss algorithm, and oversampling with SMOTE and with ADASYN. After applying near-miss undersampling, the best accuracy, 81.63%, was given by the SVC classifier. All the other classifiers used in the experiment produced accuracy in the range of 64% to 79.08%. After applying the oversampling algorithm SMOTE, the classifiers produced very good results on the evaluation metrics of accuracy, CV score, F1 score, and recall. The maximum accuracy, 96.37%, was given by Random Forest, closely followed by XGBoost and Gradient Boosting with an accuracy of 95.14%. Decision Tree, SVC, and AdaBoost achieved accuracies of 91.26%, 93.36%, and 93.48% respectively. The CV score, F1 score, and recall were also considerably high for all these classifiers. After applying the oversampling technique ADASYN, the Random Forest algorithm produced the maximum accuracy of 96.25%. Of the two oversampling techniques, SMOTE produced slightly better evaluation metrics than ADASYN. The proposed system has very high reliability and so can be used effectively for detecting the risk of dyslexia.
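To make the oversampling step concrete, the following is a minimal sketch of the core idea behind SMOTE: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. This is not the authors' implementation, and in practice a library such as imbalanced-learn would normally be used. The function names `smote` and `balance_with_smote`, the toy data, and the choices of `k` and `seed` are all assumptions made for illustration; binary labels are assumed.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by SMOTE-style interpolation.

    minority: list of feature tuples from the minority class.
    Illustrative sketch only; names and defaults are assumptions.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of base (excluding base).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def balance_with_smote(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal counts."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    n_new = max(0, n_majority - len(minority))
    new_rows = smote(minority, n_new, k=k, seed=seed)
    return list(X) + new_rows, list(y) + [minority_label] * len(new_rows)


# Hypothetical toy data: 8 negative and 3 positive samples, two features each.
X_toy = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
    (0.4, 0.1), (0.2, 0.4), (0.0, 0.3), (0.3, 0.0),
    (2.0, 2.0), (2.2, 2.1), (2.1, 2.4),
]
y_toy = [0] * 8 + [1] * 3

X_bal, y_bal = balance_with_smote(X_toy, y_toy, k=2)
print(y_bal.count(0), y_bal.count(1))
```

On the toy data, five synthetic positive samples are added so that both classes end up with eight samples. Resampling of this kind is usually applied only to the training split, so that the evaluation metrics are computed on data with the original class distribution.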
doi:10.5815/ijieeb.2021.06.02 fatcat:o6ax4vb6ofd3rfnpntmldz5roe