Adaptive Neuro-Fuzzy Inference System Approach on Predicting Hard Disk Failures towards Reliable Data Center

Jenniea A. Olalia
2018 International Journal for Research in Applied Science and Engineering Technology  
Despite the high accuracy reported by some studies that predict hard disk failure with a decision tree in the classification process, the accuracy of the decision tree is in question because it exceeds that of competing algorithms by margins ranging from 17.7% to 60.92%. This paper confirms the claim of earlier studies that the decision tree overfits when used with a large amount of data and real-valued, numeric attributes such as SMART attributes. Utilizing the ANFIS algorithm to predict imminent
disk failure surpasses the other algorithms (CHAID, C&R Tree, Neural Network, MLR, and SVM) by 4.2% while, at 86.08%, keeping its distance from the over-fitted decision tree's very high 99.58%. ANFIS also predicted the failure five days before it actually happened.

I. INTRODUCTION

Having files stored and backed up in a data center or a cloud storage platform ensures continuous operation, data protection, and recovery [1]. It is the responsibility of the data center to put in place all the plans and procedures necessary to make this so. Critical factors such as experience, financial stability, security, support, and physical infrastructure need to be maintained to have a reliable data center [2].

In the Philippines, 80% of business enterprises have experienced data loss, costing them around $8 billion [3]. On a larger scale, data loss at companies around the world grew by an average of 400% in just two years, accumulating total losses of $1.7 trillion, 30% of which came from cloud storage. Admittedly, 51% of surveyed companies have no disaster recovery plan [4]. Studies show that the leading causes of data loss are hardware failure, human error, software corruption, computer viruses, and natural disaster, among which hardware failure ranks first at 57% [5]. A recent study conducted by Databarracks shows that hardware failure is still one of the topmost causes of data loss at 25%, with human error leading at 29% [6].

With the help of machine learning, failure can be preempted, thereby avoiding data loss before it happens. Several studies have delved into this problem [7][8][9][10][11][12], each presenting varying results. Among the different studies and algorithms used, the decision tree came out as the most accurate. Similarly, Suchatpong and Bhumkittipich's study compared the decision tree with the neural network, SVM, CHAID, and C&R Tree. The decision tree ranks first at 99.58%, while the neural network is 56.09% accurate, SVM 38.66%, CHAID 50.42%, and C&R Tree 56.93%, all far lower than the decision tree.

While the result of the decision tree is promising, there is an observable irregularity in it: the decision tree is too accurate compared to the other algorithms in predicting hard drive failure. IBM states that if an algorithm has 98% accuracy while other techniques tried have 60% accuracy, it is most probably overfitting, which is exactly the case here [13]. Initial investigation shows that a decision tree fed with a large amount of data tends to overfit [14]. Slight variations in the training data will also make the decision tree unstable [15][16]. Also, when a decision tree is supplied with real-valued attributes like the values in SMART attributes, it will overfit and give each numeric value its own branch, so the tree becomes large [17]. The decision tree also has problems regarding robustness, adaptability, scalability, and height optimization [18].

With the accuracy of the decision tree in question and the low accuracy of the other algorithms, the researcher argues that the problem of predicting hard disk failure is still open and needs innovation. With careful selection of data sets, systematic selection of predictors, and use of a suitable machine learning algorithm, the researcher believes that this work will generate interesting results and produce new knowledge that will benefit not only the data centers but also the owners of the data.
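The overfitting symptom flagged above, near-perfect training accuracy alongside much weaker held-out accuracy, can be checked directly by comparing training and test scores. The following is a minimal Python sketch of that check on an unpruned scikit-learn decision tree; the synthetic feature matrix standing in for raw SMART values is an illustrative assumption, not data from this study.

# Minimal train/test accuracy check that exposes an over-fitted decision tree.
# X stands in for five real-valued SMART attributes; y is a noisy binary
# failure label. Both are synthetic placeholders for illustration only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))                              # stand-in SMART values
y = ((X[:, 0] + rng.normal(scale=3.0, size=5000)) > 2.5).astype(int)  # weak, noisy signal

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unpruned tree
print("train accuracy:", tree.score(X_tr, y_tr))   # usually ~1.0: the tree memorises
print("test  accuracy:", tree.score(X_te, y_te))   # noticeably lower on held-out data

# A large gap between the two scores is the overfitting symptom IBM describes;
# constraining max_depth or min_samples_leaf is one standard mitigation.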
This study aims to verify the truthfulness of other researchers' claim that the decision tree is inappropriate in this area and that this topic is still an open problem. This study introduces the use of the ANFIS algorithm and compares its results with those of the decision tree and the other algorithms used by previous researchers.

II. RELATED LITERATURE

Cloud storage is a service where data is remotely managed and maintained [19]. It can be accessed through the internet from any device. From 2015 to 2016, IDC's CloudView survey found a 137% increase in the utilization of data storage [20], and 95% of the respondents are using the cloud, of which 89% use public clouds, 72% private clouds, and 67% a combination or hybrid [21]. As the need for data storage grows, the need to manage storage devices is expected to grow with it. One of the top data-growth strategies identified in data centers is the replacement of existing hardware such as hard disks, which was also cited as the topmost data center challenge at 31% [22]. The hard disk is still the most commonly used storage system in a data center at 75%, followed by hybrid storage (55%), all-flash (21%), software-defined storage (21%), and hyper-converged infrastructures (16%) [23].

A hard disk, when deployed in a data center, lasts 1.38 years on average with an annualized failure rate of 2.12% [24], or 1,000,000 to 1,500,000 hours at a nominal annual failure rate of at most 0.88% [25]. Other studies put the annual failure rate at 0.7% [26]. Therefore, thousands of hard drives must be constantly monitored, not only to replace failed drives but also to predict pending failures. That said, failing to determine that a hard drive will crash can cause a catastrophic amount of data loss. A study conducted by Emerson and the Ponemon Institute shows that the average cost of unplanned outages for 2016 was $8,851 per minute, or $740,357 per incident, indicating a 38% increase since their 2010 study [27].

Since a hard disk is physical hardware, it is logical to use its physical characteristics or behaviour to predict imminent failure. The interrelationship of temperature, workload, and hard disk drive failure has been studied, showing that temperature correlates more strongly with failure than disk utilization does. However, monitoring drives through their physical characteristics requires special or custom-made monitoring tools just to classify failure. On the other hand, interesting data are available that may serve as alternatives to physical predictors, including SMART attributes, daily logs, and complaint logs. Earlier studies conducted by Google suggest that some SMART attributes are correlated with high failure probabilities [28]. SMART (Self-Monitoring, Analysis, and Reporting Technology) is a monitoring system for computer hard disks that reports reliability indicators [29] and is available on hard disk drives (HDD) and solid state drives (SSD). Backblaze's analysis of nearly 40,000 drives shows that among the 253 defined SMART attributes, five metrics correlate strongly with impending disk failure: SMART 5, 187, 188, 197, and 198 [30]. Using time series analysis of SMART attributes from 30,000 disks from two major manufacturers, Botezatu and Giurgiu identified the combination of SMART 197, 188, 10, 201, and 5 as the cause of failure [31].
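To make the predictor setup concrete, the following is a minimal Python sketch of isolating the five failure-correlated metrics (SMART 5, 187, 188, 197, and 198) from Backblaze-style daily drive records. The column names and the three example rows are assumptions made for illustration, not data from this paper.

# Select the five SMART attributes most associated with impending failure
# from daily drive records. Column names and values are illustrative only.
import pandas as pd

SMART_COLS = ["smart_5_raw", "smart_187_raw", "smart_188_raw",
              "smart_197_raw", "smart_198_raw"]

records = pd.DataFrame({
    "serial_number": ["Z1", "Z1", "Z2"],
    "failure":       [0, 1, 0],          # 1 only on the day the drive fails
    "smart_5_raw":   [0, 16, 0],         # reallocated sectors
    "smart_187_raw": [0, 3, 0],          # reported uncorrectable errors
    "smart_188_raw": [0, 1, 0],          # command timeouts
    "smart_197_raw": [0, 8, 0],          # current pending sectors
    "smart_198_raw": [0, 8, 0],          # offline uncorrectable sectors
})

X = records[SMART_COLS]                  # predictors fed to the classifier
y = records["failure"]                   # heavily imbalanced in real fleets
print(X)
print(y.value_counts())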
Although both studies use SMART attributes as predictors, differences in their data may explain the differences in the attributes they selected. These can also be ascribed to differences in the combination of hard disk brand, model, capacity, working environment, and selection algorithm used. In the classification process, rule-based decision support algorithms were utilized. The algorithms employed include Maximum Likelihood Rules, Classification and Regression Trees, Bayesian Networks, Decision Tree, Support Vector Machine combined with time series and survival analysis, and a multi-instance learning framework using Naïve Bayes, yielding varying and sometimes opposing results. Some studies employed novel or hybrid algorithms such as Gaussian mixture based fault detection [32], feature selection-based Mahalanobis distance [34], and Gradient Boosted Regression Trees [35]. Among all these algorithms, the Adaptive Neuro-Fuzzy Inference System (ANFIS) has not been used.

It can be observed that among the different algorithms, the decision tree exhibited the best classification performance, with accuracies from 98% to 99.58%, while the other algorithms performed far worse. Though the performance of the decision tree is promising, it is also puzzling: the decision tree is too accurate, and the result seems too good to be true. IBM, in its article, states that if an algorithm reaches 98% accuracy while other techniques reach 60%, it is most probably overfitting [36]. Overfitting happens when an algorithm models the training data set too well [37]. Overfitting can also be caused by inappropriate data: studies show that a decision tree fed with a large amount of data tends to overfit [38], which degrades the performance of decision tree learning. Slight variations in the training data also make the decision tree unstable [39][40]. Moreover, the decision tree has issues with real-valued attributes. Since SMART attributes are real numbers, a decision tree trained on them will overfit, giving each numeric value its own branch so that the tree grows large [41]. It can create over-complex trees that do not represent the data well [42]. Additionally, the decision tree works best when values are discrete rather than continuous, which contradicts the nature of the hard disk data set [43]. The decision tree also has problems with robustness, adaptability, scalability, and height optimization [44]. That said, with the decision tree's performance in question and the poor performance of the other algorithms, the researcher believes that the problem in this area remains unresolved and open to innovation. An appropriate algorithm can therefore be identified and used to predict imminent hard disk failure.
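The candidate this paper advances for that role is ANFIS. Below is a minimal Python sketch of the forward pass of a first-order Sugeno fuzzy inference system, the structure whose Gaussian membership parameters and linear consequent coefficients ANFIS tunes during training; the two inputs and the rule parameters shown are hypothetical placeholders, not values from this study.

# Forward pass of a first-order Sugeno fuzzy system (the structure ANFIS trains).
# Rule parameters and inputs are illustrative placeholders only.
import numpy as np

def gauss(x, c, sigma):
    """Gaussian membership degree of x for a fuzzy set centred at c."""
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def sugeno_forward(x1, x2, rules):
    """Each rule: ((c1, s1), (c2, s2), (p, q, r)) with consequent p*x1 + q*x2 + r."""
    strengths, outputs = [], []
    for (c1, s1), (c2, s2), (p, q, r) in rules:
        w = gauss(x1, c1, s1) * gauss(x2, c2, s2)   # layers 1-2: fuzzify, then AND
        strengths.append(w)
        outputs.append(p * x1 + q * x2 + r)          # layer 4: linear consequent
    strengths = np.array(strengths)
    weights = strengths / strengths.sum()            # layer 3: normalise firing strengths
    return float(np.dot(weights, outputs))           # layer 5: weighted sum output

# Two hypothetical rules over two normalised SMART-like inputs
rules = [((0.2, 0.3), (0.8, 0.3), (1.0, -0.5, 0.1)),
         ((0.7, 0.3), (0.3, 0.3), (-0.4, 0.9, 0.2))]
print(sugeno_forward(0.5, 0.6, rules))

In ANFIS proper, the membership centres/widths and the consequent coefficients above are fitted from training data by a hybrid of gradient descent and least squares, producing a continuous failure-risk output rather than the hard splits of a decision tree.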
doi:10.22214/ijraset.2018.3480