
A sampling-based framework for parallel data mining

Shengnan Cong, Jiawei Han, Jay Hoeflinger, David Padua
2005 Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '05  
In this paper, we present a framework for parallel mining frequent itemsets and sequential patterns based on the divide-and-conquer strategy of pattern growth.  ...  We implemented parallel versions of both frequent itemsets and sequential pattern mining algorithms following our framework.  ...  In this paper, we propose a framework for parallel mining frequent itemset and sequential patterns.  ... 
doi:10.1145/1065944.1065979 dblp:conf/ppopp/CongHHP05 fatcat:ap3pc5i4v5cgnfoptszlvrv6u4

Transplantation of Data Mining Algorithms to Cloud Computing Platform when Dealing Big Data [article]

Yong Wang, Ya Wei Zhao
2017 arXiv   pre-print
It revealed the Cloud Computing platform based on Map-Reduce cannot solve all the Big Data and data mining problems.  ...  This paper made a short review of Cloud Computing and Big Data, and discussed the portability of general data mining algorithms to Cloud Computing platform.  ...  The main reason is that Map-Reduce framework is based on off-line data processing methods to solve problems, and it is only suitable for the simple computation.  ... 
arXiv:1702.01508v1 fatcat:t5wspkditbhxhcwm6rtzanipeq


Giuseppe Agapito, Mario Cannataro, Pietro Hiram Guzzi, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio
2013 Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics - BCB'13  
The paper presents the design and experimentation of Cloud4SNP, a novel Cloud-based bioinformatics tool for the parallel preprocessing and statistical analysis of pharmacogenomics SNP microarray data.  ...  Due to the large number of samples and the high resolution of instruments, the data to be analyzed can be very huge, requiring high performance computing.  ...  Similarly, the micro-CS project [11] presents a framework for the analysis of microarray data based on a distributed architecture made of different web-services internally parallel for the annotation  ... 
doi:10.1145/2506583.2506605 dblp:conf/bcb/AgapitoCGMTT13 fatcat:dbno35zp3ra6jj3phwhurcbzga


Jitha Janardhanan
2017 International Journal of Advanced Research in Computer Science  
Distributed parallel algorithms for mining frequent itemsets aim to balance load by equally dividing data among a collection of computing nodes.  ...  This comparative study aims to present the deviations among frequent pattern mining techniques in the Hadoop MapReduce concept under the data mining techniques that are in use in large database transactions  ...  PARMA does this by generating multiple tiny random samples of the transactional dataset and running a mining algorithm on the samples separately and in parallel.  ... 
doi:10.26483/ijarcs.v8i7.4499 fatcat:keqnovdtojeblocoi4v5skql3e

Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining [article]

Guowei Xu, Wenbiao Ding, Weiping Fu, Zhongqin Wu, Zitao Liu
2021 arXiv   pre-print
number of simulated samples for optimal performance. 3) To make our model learn noise-invariant representations, a stability loss is employed.  ...  We propose a novel robust training framework which 1) employs simple but effective methods to directly simulate natural OCR noises from clean texts and 2) iteratively mines the hard examples from a large  ...  We propose a hard example mining algorithm that dynamically distinguishes hard and easy samples for each training epoch as follows: The Overall Framework The overall framework is shown in Figure 1  ... 
arXiv:2107.07113v1 fatcat:6x6gunu3azelvkuvqzafncouje

Multiagent Framework for Bio-data Mining [chapter]

Pengyi Yang, Li Tao, Liang Xu, Zili Zhang
2009 Lecture Notes in Computer Science  
Based on the framework, we developed a prototype system to demonstrate how it helps the biologists to perform a comprehensive mining task for answering biological questions.  ...  Followed by that, an initial multiagent based bio-data mining framework is presented.  ...  Conclusion In this proposal, we argue for applying multiagent based data mining framework to biological data analysis.  ... 
doi:10.1007/978-3-642-02962-2_25 fatcat:mwcofeo3uvcftopszxcx76jrkq

A Resource Aware Parallelized Back Propagation Neural Network in Enabling Efficient Large-scale Digital Health Data Processing

Yang Liu, Xianbang Chen, Lixiong Xu, Huaqiang Li, Maozhen Li
2019 IEEE Access  
Therefore this paper presents a Hadoop based parallelized BPNN algorithm which is able to process the large-scale data efficiently.  ...  In order to complement the potential accuracy loss issue for the parallelized data processing, ensemble learning techniques are also involved.  ...  Based on the data separation, the traditional standalone BPNN can be parallelized using the Hadoop framework.  ... 
doi:10.1109/access.2019.2935691 fatcat:6axwh7ohlnb3fblgogflqrbeky

A Cloud-Based Framework for Large-Scale Log Mining through Apache Spark and Elasticsearch

Yun Li, Yongyao Jiang, Juan Gu, Mingyue Lu, Manzhu Yu, Edward Armstrong, Thomas Huang, David Moroni, Lewis McGibbney, Greguska Frank, Chaowei Yang
2019 Applied Sciences  
To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch.  ...  As a proof of concept, oceanographic data search and access logs are chosen to validate performance of the proposed parallel log-mining framework.  ...  Figure 3. Sample FTP log data in FTP format. Figure 4. The framework for log mining.  ... 
doi:10.3390/app9061114 fatcat:3rm2pn3ubfhhvnsvupsdcjwqsq

Seamless Automation and Integration of Machine Learning Capabilities for Big Data Analytics

Amril Nazir
2017 International Journal of Distributed and Parallel systems  
The paper aims at proposing a solution for designing and developing a seamless automation and integration of machine learning capabilities for Big Data with the following requirements: 1) the ability to  ...  for analyzing Big Data datasets based on data characteristics, domain expert inputs, and data pre-processing component; 3) the ability to automatically select the most appropriate libraries and tools  ...  For example, the Data Sampling component may select the algorithm to use based on the data mining algorithm (e.g., Decision Tree/SVM, EM/K-Mean).  ... 
doi:10.5121/ijdps.2017.8301 fatcat:klz25x276ndrffwtnw2yonqmoi

A Survey of Parallel Sequential Pattern Mining [article]

Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, Philip S. Yu
2019 arXiv   pre-print
We review the related work of parallel sequential pattern mining in detail, including partition-based algorithms for PSPM, Apriori-based PSPM, pattern growth based PSPM, and hybrid algorithms for PSPM,  ...  As a fundamental task of data mining, sequential pattern mining (SPM) is used in a wide variety of real-life applications.  ...  ACKNOWLEDGMENT We would like to thank the anonymous reviewers for their detailed comments and constructive suggestions for this paper.  ... 
arXiv:1805.10515v2 fatcat:6bothuniprd7xclmpwx26s6udu

Parallel Implementation of Classification Algorithms Based on Cloud Computing Environment

Lijuan Zhou, Hui Wang, Wenbo Wang
2012 TELKOMNIKA Indonesian Journal of Electrical Engineering  
And it mainly introduces a parallel Naïve Bayes classification algorithm based on MapReduce, which is a simple yet powerful parallel programming technique.  ...  The enlarging volumes of information emerging by the progress of technology and the growing individual needs of data mining, makes classifying of very large scale of data a challenging task.  ...  Therefore, a new cloud computing model of massive data mining includes the pre-processing for huge amounts of data, cloud computing for massive parallel data mining algorithms, the new massive data mining  ... 
doi:10.11591/telkomnika.v10i5.1353 fatcat:6tvh7xpsxvgk7of7wsfhnlliba

A Data Mining Algorithm based on Relevant Vector Machine of Cloud Simulation

Wuqi Gao
2018 International Journal of Performability Engineering  
(RVM), a data mining algorithm that is mainly used on the small sample of data mining with a good effect but a large amount of calculation that is based on an open source distributed storage and computing  ...  Based on the sum of the distribution of small sample data mining law in sequence, in some cases, the algorithm reflects the law of large sample data mining.  ...  Gao for his proof-reading.  ... 
doi:10.23940/ijpe.18.06.p28.13601364 fatcat:lpoeahc5fzfibpxrel767i5xxa

Direct out-of-memory distributed parallel frequent pattern mining

Zheyi Rong, Jeroen De Knijf
2013 Proceedings of the 2nd International Workshop on Big Data, Streams and Heterogeneous Source Mining Algorithms, Systems, Programming Models and Applications - BigMine '13  
This paper extends the direct sampling approach by casting the algorithm into the MapReduce framework, effectively ceasing the memory requirements that the data should fit into main memory.  ...  Frequent itemset mining is a well studied and important problem in the datamining community.  ...  Specifically, we transform the method into the Hadoop MapReduce framework, resulting in a distributed parallel frequent itemset mining algorithm with low memory demands.  ... 
doi:10.1145/2501221.2501229 dblp:conf/kdd/RongK13 fatcat:f7c32mcl5zhbtm3372lcab3lqm

Spark-based data analytics of sequence motifs in large omics data

Oluwafemi A. Sarumi, Carson K. Leung, Adebayo O. Adetunmbi
2018 Procedia Computer Science  
In this article, we present a distributed sequential algorithm, which uses the MapReduce programming model on a cluster of homogeneous distributed-memory systems running on an Apache Spark computing framework, for  ...  Apache Spark framework: Apache Spark has recently become a popular parallel framework for processing high volumes of big data on a distributed system.  ... 
doi:10.1016/j.procs.2018.07.294 fatcat:tljcpsrgabh4bmdhp7rgxcrfby

Big Data Mining using Map Reduce: A Survey Paper

Shital Suryawanshi, Prof. V.S Wadne
2014 IOSR Journal of Computer Engineering  
MR-Cube is a framework (based on MapReduce) used for cube materialization and mining over massive datasets using a holistic measure.  ...  For data processing, Big Data processing frameworks rely on cluster computers and the parallel execution framework provided by MapReduce. Extending cube computation techniques to this paradigm.  ...  MapReduce: MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.  ... 
doi:10.9790/0661-16673740 fatcat:jue4l2eueffclemxwb2aepo73i