Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets

João Henriques, Filipe Caldeira, Tiago Cruz, Paulo Simões
2020 Electronics  
Second, adopting data processing platforms able to cope with the large datasets and complex data analysis algorithms required for such purposes.  ...  The classified anomaly events produced by our framework can be used, for instance, as candidates for further forensic and compliance auditing analysis in security management.  ...  Consequently, in the data mining research area, time series data mining was classified as one of the ten most challenging problems [10].  ... 
doi:10.3390/electronics9071164 fatcat:z5364l6y7ndfbei6eurhopzlqy

Parallel classification and optimization of telco trouble ticket dataset

Fauzy Bin Che Yayah, Khairil Imran Ghauth, Choo-Yee Ting
2021 TELKOMNIKA (Telecommunication Computing Electronics and Control)  
The large volume of data nowadays demands an efficient method of building machine-learning classifiers to classify big data.  ...  Apache Spark is recommended as the primary data processing framework for the research activities.  ... 
doi:10.12928/telkomnika.v19i3.18159 fatcat:t4gqsos6dbdgphbuqkorz7oshy

BigDataGrapes D4.3 - Models and Tools for Predictive Analytics over Extremely Large Datasets

Nicola Tonellotto, Vinicius Monteiro de Lira, Franco Maria Nardini, Raffaele Perego, Cristina Muntean, Ida Mele, Salvatore Trani
2018 Zenodo  
The first one shows how to train two kinds of regressors, i.e., linear and random forest regressors, to fit synthetically generated data. We p [...]  ...  This accompanying document for deliverable D4.3 (Models and Tools for Predictive Analytics over Extremely Large Datasets) describes the first version of the mechanisms and tools supporting efficient and  ...  and Data Mining.  ... 
doi:10.5281/zenodo.1481800 fatcat:rlqwgvajzre6pfxuiiclmk2r34

Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark

Athanasios Alexopoulos, Georgios Drakopoulos, Andreas Kanavos, Phivos Mylonas, Gerasimos Vonitsanos
2020 Algorithms  
First, a vast number of classifiers is applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is ab ovo performed to the same datasets.  ...  This article extensively relies in two ways on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem.  ...  Random forest.  ... 
doi:10.3390/a13030071 fatcat:pn334kfqxfh4nb7wjfb63oudp4

Classification of Metabric Clinical Dataset using Naive Bayes Classifier

2019 VOLUME-8 ISSUE-10, AUGUST 2019, REGULAR ISSUE  
A huge volume of clinical dataset has been considered and it is analyzed using Naive Bayes Classifier.  ...  Big Data technology is very useful for organizations to take proper decisions to attain their goals and in mounting themselves organization to full fledge.  ...  Various other classifiers like random forest, decision tree, SVM classifiers can also be used to classify the clinical dataset.  ... 
doi:10.35940/ijitee.l3703.1081219 fatcat:lm5owiwhzrgxpgftye6qadzweq

BigDataGrapes D4.3 - Models and Tools for Predictive Analytics over Extremely Large Datasets

Nicola Tonellotto, Vinicius Monteiro de Lira, Franco Maria Nardini, Raffaele Perego, Cristina Muntean, Ida Mele, Salvatore Trani, Matteo Ceneta
2019 Zenodo  
The first one shows how to train two kinds of regressors, i.e., linear and random forest regressors, to fit synthetically gen [...]  ...  This accompanying document for deliverable D4.3 (Models and Tools for Predictive Analytics over Extremely Large Datasets) describes the first version of the mechanisms and tools supporting efficient and  ...  ., random forest, to deal with the complexity of the data.  ... 
doi:10.5281/zenodo.2641952 fatcat:n6ag6qt4gzg6tmnytqs2f7op4u

Hadoop based Feature Selection and Decision Making Models on Big Data

Thulasi Bikku, N. Sambasiva Rao, Ananda Rao Akepogu
2016 Indian Journal of Science and Technology  
It becomes computationally inaccurate to analyze such big data for decision making systems.  ...  Methods/Analysis: Hadoop, which is a working model based on the Map-Reduce framework with efficient computation and processing of Big Data.  ...  Oversampling and undersampling issue in large datasets is handled using Random forest algorithm for classification 6 .  ... 
doi:10.17485/ijst/2016/v9i10/88905 fatcat:2h6kz3u6qzf5xgt2coshk6sgqi

Detecting atmospheric rivers in large climate datasets

Surendra Byna, Prabhat, Michael F. Wehner, Kesheng John Wu
2011 Proceedings of the 2nd international workshop on Petascal data analytics: challenges and opportunities - PDAC '11  
To aid the understanding of this phenomenon, we have developed an efficient detection algorithm suitable for analyzing large amounts of data.  ...  We develop an efficient parallel implementation of the algorithm and demonstrate good weak and strong scaling. We process a 30-year simulation output on 10,000 cores in under 3 seconds.  ...  ; we demonstrate efficient parallel scaling on a large 1TB dataset.  ... 
doi:10.1145/2110205.2110208 fatcat:dy7rllc3hvcxpeoufiffkofzse

An Approach to Data Reduction for Learning from Big Datasets: Integrating Stacking, Rotation, and Agent Population Learning Techniques

Ireneusz Czarnowski, Piotr Jędrzejowicz
2018 Complexity  
Data reduction makes it possible to classify instances belonging to big datasets.  ...  In the paper, several data reduction techniques for machine learning from big datasets are discussed and evaluated.  ...  Among techniques for dealing with massive datasets are different parallel processing approaches aiming at achieving a substantial speed-up of the computation.  ... 
doi:10.1155/2018/7404627 fatcat:zewpwq7hxbap7cgat3pd3x6d24

Spark-based Ensemble Learning for Imbalanced Data Classification

Jiaman Ding
2018 International Journal of Performability Engineering  
After that, it trains several classifiers with random forest in Spark environment by the correlation feature selection means.  ...  use of the efficient computing power of Spark distributed platform in training the massive data.  ...  Literature [3] proposed a parallel stochastic forest algorithm for Spark based big data processing, and a large number of experimental results show its advantage in classification accuracy and efficiency  ... 
doi:10.23940/ijpe.18.05.p14.955964 fatcat:frqxaqjs7jhf5fqevze236maay

A new model for large dataset dimensionality reduction based on teaching learning-based optimization and logistic regression

Hind Raad Ibraheem, Zahraa Faiz Hussain, Sura Mazin Ali, Mohammad Aljanabi, Mostafa Abdulghafoor Mohammed, Tole Sutikno
2020 TELKOMNIKA (Telecommunication Computing Electronics and Control)  
Some of the effective ways of data classification are data mining and classification methods.  ...  This result showed that the projected TLBO is an efficient features optimization technique for sustaining data-based decision-making systems.  ...  [32] proposed a coarse-grained parallel genetic algorithm (CGPGA) for optimizing the features in the dataset and constraints for SVM.  ... 
doi:10.12928/telkomnika.v18i3.13764 fatcat:7ajhvgnktbgf5opamkifnglspy

Methods and Evaluations of Decision Tree Algorithms on GPUs: An Overview

Nesreen Adnan Hamad, Fatima Mousa Quiam, Khalid Mohammad Jaber
2018 ICIC Express Letters  
However, when dealing with massive datasets, the time needed to build the decision tree will increase. Therefore, parallel computing is used to accelerate the construction of the decision tree.  ...  Various methods related to bioinformatics computations have been used to extract, search, integrate, and analyze biological data in efficient ways.  ...  Data mining can be defined as a method for uncovering useful information that is concealed in massive databases.  ... 
doi:10.24507/icicel.12.07.723 fatcat:y7bt7qub75dbpnlkwzlrr3kjdy

ASE: Anomaly Scoring Based Ensemble Learning for Imbalanced Datasets [article]

Xiayu Liang, Ying Gao, Shanrong Xu
2022 arXiv   pre-print
Decision Tree, Multilayer perceptron, KNN) and is more efficient than other existing methods under a wide range of imbalance ratio, data scale and data dimension.  ...  However, in real-life scenarios, positive examples only make up a small part of all instances and our datasets suffer from high imbalance ratio which leads to poor performance of existing classification  ...  Boosted SVM with active learning strategy for imbalanced data. Soft Computing 19, 12 (2015), 3357-3368.  ... 
arXiv:2203.10769v2 fatcat:d4c7q7psx5cyvizda22sd5sscy

Efficient handling of high-dimensional feature spaces by randomized classifier ensembles

Aleksander Kołcz, Xiaomei Sun, Jugal Kalita
2002 Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02  
Handling massive datasets is a difficult problem not only due to prohibitively large numbers of entries but in some cases also due to the very high dimensionality of the data.  ...  In this work we demonstrate how these types of architectures effectively reduce the feature space for submodels and groups of sub-models, which lends itself to efficient sequential and/or parallel implementations  ...  BACKGROUND Ensemble classifiers have been quite popular in many data mining applications due to their high accuracy and potential for efficient parallel implementations.  ... 
doi:10.1145/775047.775093 dblp:conf/kdd/KolczSK02 fatcat:m64gbpf5yje7pmexyaslgu6da4

Student Performance Prediction in Mathematics Course Based on the Random Forest and Simulated Annealing

Shaohai Huang, Junjie Wei, Hangjun Che
2022 Scientific Programming  
algorithm optimization, use the out-of-bag error as the optimization objective function, and then propose the IRFC (improved random forest classifier) algorithm in this paper.  ...  Based on the random forest (RF) and simulated annealing (SA) algorithms, we binary encode the relevant parameters (number of features, tree size, and tree decision weights) as the target variables for  ...  How to combine data mining work with data parallel processing mechanism for processing is also one of the major challenges for data mining work. The combination of data mining algorithm and data parallelism  ... 
doi:10.1155/2022/9340434 fatcat:6wpztn7pezhenoe6br6ewrowwu