An overview of recent distributed algorithms for learning fuzzy models in Big Data classification

Pietro Ducange, Michela Fazzolari, Francesco Marcelloni
2020 Journal of Big Data  
Introduction In the Big Data Era [1], a huge Volume of information is generated at very high speed. In most cases, such data are collected from different sources, may have different formats (Variety), and need to be processed in almost real time (Velocity) [2]. This is the so-called three-Vs model of Big Data, first used by Douglas Laney in 2001 [3] to describe data management in three dimensions. This original three-V paradigm is still valid, but it has been recently enriched by additional Vs. In fact, Big Data may be poorly accurate or truthful (Veracity). Moreover, the added Value that the analysis of Big Data may offer is already exploited in several contexts, such as industrial applications [4], marketing strategies [5], Cloud Computing and Internet of Things [6, 7], and health care [8].

Abstract Nowadays, a huge amount of data is generated, often in very short time intervals and in various formats, by a number of heterogeneous sources such as social networks and media, mobile devices, internet transactions, networked devices and sensors. These data, identified as Big Data in the literature, are characterized by the popular Vs features, namely Value, Veracity, Variety, Velocity and Volume. In particular, Value focuses on the useful knowledge that may be mined from data. Thus, in recent years, a number of data mining and machine learning algorithms have been proposed to extract knowledge from Big Data. These algorithms have generally been implemented using ad-hoc programming paradigms, such as MapReduce, on specific distributed computing frameworks, such as Apache Hadoop and Apache Spark. In the context of Big Data, fuzzy models currently play a significant role, thanks to their capability of handling vague and imprecise data and their inherent interpretability. In this work, we give an overview of the most recent distributed learning algorithms for generating fuzzy classification models for Big Data. In particular, we first show some design and implementation details of these learning algorithms. Thereafter, we compare them in terms of accuracy and interpretability. Finally, we discuss their scalability.
doi:10.1186/s40537-020-00298-6 fatcat:vutg2g544rcbpfhthhleg5sffy