A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL.
Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets
2020
Electronics
Second, adopting data processing platforms able to cope with the large datasets and complex data analysis algorithms required for such purposes. ...
The classified anomaly events produced by our framework can be used, for instance, as candidates for further forensic and compliance auditing analysis in security management. ...
Consequently, in the data mining research area, time series data mining was classified as one of the ten most challenging problems [10] . ...
doi:10.3390/electronics9071164
fatcat:z5364l6y7ndfbei6eurhopzlqy
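The snippet above describes a two-stage pipeline: K-Means groups log events, and a boosted-tree classifier then labels them. A minimal single-machine sketch of that idea, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost; the synthetic "log" features, cluster count, and all other parameters are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for log-derived features: normal events plus a small anomalous cluster.
normal = rng.normal(0.0, 1.0, size=(900, 4))
anomalous = rng.normal(4.0, 1.0, size=(100, 4))
X = np.vstack([normal, anomalous])
y = np.array([0] * 900 + [1] * 100)

# Stage 1: the K-Means cluster id is appended as an extra feature.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
X_aug = np.hstack([X, km.labels_.reshape(-1, 1)])

# Stage 2: a boosted-tree classifier labels events as normal/anomalous.
X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, stratify=y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(round(accuracy, 2))
```

On real log datasets the clustering step would run over parsed log templates rather than Gaussian blobs; the combination shown (unsupervised grouping feeding a supervised classifier) is the general pattern the title names.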
Parallel classification and optimization of telco trouble ticket dataset
2021
TELKOMNIKA (Telecommunication Computing Electronics and Control)
The large volume of data nowadays demands an efficient method of building machine-learning classifiers to classify big data. ...
Apache Spark is recommended as the primary data processing framework for the research activities. ...
doi:10.12928/telkomnika.v19i3.18159
fatcat:t4gqsos6dbdgphbuqkorz7oshy
BigDataGrapes D4.3 - Models and Tools for Predictive Analytics over Extremely Large Datasets
2018
Zenodo
The first one shows how to train two kinds of regressors, i.e., linear and random forest regressors, to fit synthetically generated data. We p [...] ...
This accompanying document for deliverable D4.3 (Models and Tools for Predictive Analytics over Extremely Large Datasets) describes the first version of the mechanisms and tools supporting efficient and ...
and Data Mining. ...
doi:10.5281/zenodo.1481800
fatcat:rlqwgvajzre6pfxuiiclmk2r34
Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark
2020
Algorithms
First, a vast number of classifiers are applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is ab ovo performed on the same datasets. ...
This article extensively relies in two ways on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. ...
Random forest. ...
doi:10.3390/a13030071
fatcat:pn334kfqxfh4nb7wjfb63oudp4
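The entry above pairs SVD preprocessing with downstream classification. A small single-machine analogue of that two-step scheme, using scikit-learn rather than the distributed MLlib implementations the paper relies on; the synthetic data, component count, and classifier choice are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic high-dimensional stand-in for Higgs/PAMAP-scale data.
X, y = make_classification(n_samples=2000, n_features=100,
                           n_informative=10, random_state=0)

# Step 1: truncated SVD reduces dimensionality; step 2: a classifier fits the projection.
pipe = make_pipeline(TruncatedSVD(n_components=15, random_state=0),
                     LogisticRegression(max_iter=1000))
pipe.fit(X[:1500], y[:1500])
accuracy = pipe.score(X[1500:], y[1500:])
print(round(accuracy, 2))
```

The design point is that the classifier never sees the raw feature space, only the low-rank projection, which is what makes the scheme attractive for distributed massive datasets.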
Classification of Metabric Clinical Dataset using Naive Bayes Classifier
2019
Volume 8, Issue 10, August 2019, Regular Issue
A huge volume of clinical data has been considered and analyzed using the Naive Bayes Classifier. ...
Big Data technology is very useful for organizations to make proper decisions to attain their goals and to develop into full-fledged organizations. ...
Various other classifiers, such as random forest, decision tree, and SVM, can also be used to classify the clinical dataset. ...
doi:10.35940/ijitee.l3703.1081219
fatcat:lm5owiwhzrgxpgftye6qadzweq
BigDataGrapes D4.3 - Models and Tools for Predictive Analytics over Extremely Large Datasets
2019
Zenodo
The first one shows how to train two kinds of regressors, i.e., linear and random forest regressors, to fit synthetically gen [...] ...
This accompanying document for deliverable D4.3 (Models and Tools for Predictive Analytics over Extremely Large Datasets) describes the first version of the mechanisms and tools supporting efficient and ...
... random forest, to deal with the complexity of the data. ...
doi:10.5281/zenodo.2641952
fatcat:n6ag6qt4gzg6tmnytqs2f7op4u
Hadoop based Feature Selection and Decision Making Models on Big Data
2016
Indian Journal of Science and Technology
It becomes computationally impractical to analyze such big data for decision-making systems. ...
Methods/Analysis: Hadoop is a working model based on the MapReduce framework, providing efficient computation and processing of Big Data. ...
The oversampling and undersampling issue in large datasets is handled using the Random Forest algorithm for classification [6]. ...
doi:10.17485/ijst/2016/v9i10/88905
fatcat:2h6kz3u6qzf5xgt2coshk6sgqi
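The snippet above mentions handling the over/undersampling issue with a Random Forest. One common way this is done (shown here as a sketch; the imbalance ratio, synthetic data, and parameters are assumptions, not the paper's Hadoop setup) is to reweight classes inside the forest instead of resampling the data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 5% minority class.
X, y = make_classification(n_samples=3000, weights=[0.95, 0.05],
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights samples inversely to class frequency,
# an alternative to explicit over/undersampling.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=0).fit(X_tr, y_tr)
bal_acc = balanced_accuracy_score(y_te, rf.predict(X_te))
print(round(bal_acc, 2))
```

Balanced accuracy is used instead of plain accuracy so the 95% majority class cannot mask poor minority-class recall.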
Detecting atmospheric rivers in large climate datasets
2011
Proceedings of the 2nd international workshop on Petascal data analytics: challenges and opportunities - PDAC '11
To aid the understanding of this phenomenon, we have developed an efficient detection algorithm suitable for analyzing large amounts of data. ...
We develop an efficient parallel implementation of the algorithm and demonstrate good weak and strong scaling. We process a 30-year simulation output on 10,000 cores in under 3 seconds. ...
; we demonstrate efficient parallel scaling on a large 1TB dataset. ...
doi:10.1145/2110205.2110208
fatcat:dy7rllc3hvcxpeoufiffkofzse
An Approach to Data Reduction for Learning from Big Datasets: Integrating Stacking, Rotation, and Agent Population Learning Techniques
2018
Complexity
Data reduction makes it possible to classify instances belonging to big datasets. ...
In the paper, several data reduction techniques for machine learning from big datasets are discussed and evaluated. ...
Among techniques for dealing with massive datasets are different parallel processing approaches aiming at achieving a substantial speed-up of the computation. ...
doi:10.1155/2018/7404627
fatcat:zewpwq7hxbap7cgat3pd3x6d24
Spark-based Ensemble Learning for Imbalanced Data Classification
2018
International Journal of Performability Engineering
After that, it trains several classifiers with random forest in the Spark environment, using correlation-based feature selection. ...
use of the efficient computing power of Spark distributed platform in training the massive data. ...
Literature [3] proposed a parallel stochastic forest algorithm for Spark-based big data processing, and a large number of experimental results show its advantages in classification accuracy and efficiency ...
doi:10.23940/ijpe.18.05.p14.955964
fatcat:frqxaqjs7jhf5fqevze236maay
A new model for large dataset dimensionality reduction based on teaching learning-based optimization and logistic regression
2020
TELKOMNIKA (Telecommunication Computing Electronics and Control)
Some of the effective ways of data classification are data mining and classification methods. ...
This result showed that the proposed TLBO is an efficient feature-optimization technique for sustaining data-based decision-making systems. ...
[32] proposed a coarse-grained parallel genetic algorithm (CGPGA) for optimizing the features in the dataset and constraints for SVM. ...
doi:10.12928/telkomnika.v18i3.13764
fatcat:7ajhvgnktbgf5opamkifnglspy
Methods and Evaluations of Decision Tree Algorithms on GPUs: An Overview
2018
ICIC Express Letters
However, when dealing with massive datasets, the time needed to build the decision tree will increase. Therefore, parallel computing is used to accelerate the construction of the decision tree. ...
Various methods related to bioinformatics computations have been used to extract, search, integrate, and analyze biological data in efficient ways. ...
Data mining can be defined as a method for uncovering useful information that is concealed in massive databases. ...
doi:10.24507/icicel.12.07.723
fatcat:y7bt7qub75dbpnlkwzlrr3kjdy
ASE: Anomaly Scoring Based Ensemble Learning for Imbalanced Datasets
2022
arXiv (pre-print)
Decision Tree, Multilayer perceptron, KNN) and is more efficient than other existing methods under a wide range of imbalance ratio, data scale and data dimension. ...
However, in real-life scenarios, positive examples only make up a small part of all instances, and our datasets suffer from a high imbalance ratio, which leads to poor performance of existing classification ...
Boosted SVM with active learning strategy for imbalanced data. Soft Computing 19, 12 (2015), 3357-3368. ...
arXiv:2203.10769v2
fatcat:d4c7q7psx5cyvizda22sd5sscy
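The entry above combines anomaly scoring with base learners (decision tree, MLP, KNN) for imbalanced data. A minimal sketch of the general idea — not the paper's ASE method — where an unsupervised anomaly score is appended as an extra feature before a simple base learner is trained; all data and parameters here are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data: roughly 10% minority class.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Unsupervised anomaly scores become an additional feature for the base learner.
iso = IsolationForest(random_state=0).fit(X_tr)
X_tr_aug = np.hstack([X_tr, iso.score_samples(X_tr).reshape(-1, 1)])
X_te_aug = np.hstack([X_te, iso.score_samples(X_te).reshape(-1, 1)])

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr_aug, y_tr)
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te_aug))
print(round(bal_acc, 2))
```

The intuition is that rare positives often look anomalous to an unsupervised detector, so the score gives the supervised base learner a signal the raw features may not expose directly.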
Efficient handling of high-dimensional feature spaces by randomized classifier ensembles
2002
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02
Handling massive datasets is a difficult problem not only due to prohibitively large numbers of entries but in some cases also due to the very high dimensionality of the data. ...
In this work we demonstrate how these types of architectures effectively reduce the feature space for submodels and groups of sub-models, which lends itself to efficient sequential and/or parallel implementations ...
BACKGROUND Ensemble classifiers have been quite popular in many data mining applications due to their high accuracy and potential for efficient parallel implementations. ...
doi:10.1145/775047.775093
dblp:conf/kdd/KolczSK02
fatcat:m64gbpf5yje7pmexyaslgu6da4
Student Performance Prediction in Mathematics Course Based on the Random Forest and Simulated Annealing
2022
Scientific Programming
algorithm optimization, use the out-of-bag error as the optimization objective function, and then propose the IRFC (improved random forest classifier) algorithm in this paper. ...
Based on the random forest (RF) and simulated annealing (SA) algorithms, we binary encode the relevant parameters (number of features, tree size, and tree decision weights) as the target variables for ...
How to combine data mining with data-parallel processing mechanisms is also one of the major challenges for data mining work. The combination of data mining algorithms and data parallelism ...
doi:10.1155/2022/9340434
fatcat:6wpztn7pezhenoe6br6ewrowwu
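The entry above optimizes random-forest parameters with simulated annealing, using out-of-bag (OOB) error as the objective. A compact sketch of that loop under assumed settings (the search space here is just tree count and feature count, with a made-up cooling schedule — not the paper's IRFC encoding or parameters):

```python
import math
import random

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=15,
                           n_informative=6, random_state=0)
random.seed(0)

def oob_error(params):
    # OOB error serves as a cheap validation-free objective function.
    rf = RandomForestClassifier(n_estimators=params["n_estimators"],
                                max_features=params["max_features"],
                                oob_score=True, bootstrap=True,
                                random_state=0).fit(X, y)
    return 1.0 - rf.oob_score_

# Simulated annealing over (n_estimators, max_features).
state = {"n_estimators": 50, "max_features": 5}
energy = oob_error(state)
temp = 1.0
for step in range(15):
    cand = {"n_estimators": max(30, state["n_estimators"] + random.choice([-10, 10])),
            "max_features": min(15, max(1, state["max_features"] + random.choice([-1, 1])))}
    e = oob_error(cand)
    # Accept improvements always; accept worse moves with Boltzmann probability.
    if e < energy or random.random() < math.exp((energy - e) / temp):
        state, energy = cand, e
    temp *= 0.8  # geometric cooling
print(state, round(1.0 - energy, 2))
```

As the temperature decays, the walk stops accepting worse parameter sets and settles near a local optimum of the OOB objective, which is the core of the annealing-based tuning the title describes.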
Showing results 1 — 15 out of 3,491 results