
Efficient Data Structures for Massive N-Gram Datasets

Giulio Ermanno Pibiri, Rossano Venturini
2017 Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '17  
In this paper we study the problem of reducing the space required by the representation of such datasets, maintaining the capability of looking up a given N-gram within microseconds.  ...  The efficient indexing of large and sparse N-gram datasets is crucial in several applications in Information Retrieval, Natural Language Processing and Machine Learning.  ...  Each dataset comprises all N-grams for 1 ≤ N ≤ 5 and associated frequency counts. Table 2 shows the basic statistics of the datasets. Compared Indexes.  ... 
doi:10.1145/3077136.3080798 dblp:conf/sigir/PibiriV17 fatcat:cmxkp3lus5hqxivr6j6bkddvae

Efficient n-gram analysis in R with cmscu

David W. Vinson, Jason K. Davis, Suzanne S. Sindi, Rick Dale
2016 Behavior Research Methods  
, Inc. dataset.  ...  ., 2013) modified Kneser-Ney n-gram smoothing algorithm using cmscu as the querying engine.  ...  Here we use a sketch algorithm known for its efficiency in processing massive real-time data (Cormode & Muthukrishnan  ...  Predictions are provided by the full negative binomial model controlling for other variables  ... 
doi:10.3758/s13428-016-0766-5 pmid:27496173 fatcat:7skgmebau5gyxl2baponubpq2i

SSketch: An Automated Framework for Streaming Sketch-Based Analysis of Big Data on FPGA

Bita Darvish Rouhani, Ebrahim M. Songhori, Azalia Mirhoseini, Farinaz Koushanfar
2015 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines  
The stream of input data is used by SSketch for adaptive learning and updating a corresponding ensemble of lower dimensional data structures, a.k.a. a sketch matrix.  ...  This paper proposes SSketch, a novel automated computing framework for FPGA-based online analysis of big data with dense (non-sparse) correlation matrices.  ...  However, their O(m²n) complexity makes it hard to utilize these well-known algorithms for massive datasets.  ... 
doi:10.1109/fccm.2015.56 dblp:conf/fccm/RouhaniSMK15 fatcat:n3jjxkpwbzgvvboez4ywtgik54

ExtDict: Extensible Dictionaries for Data- and Platform-Aware Large-Scale Learning

Azalia Mirhoseini, Bita Darvish Rouhani, Ebrahim Songhori, Farinaz Koushanfar
2017 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)  
This paper proposes ExtDict, a novel data- and platform-aware framework for iterative analysis/learning of massive and dense datasets.  ...  Iterative execution is prohibitively costly for distributed architectures where the cost of moving data is continually growing compared with the cost of arithmetic computing.  ...  Finding methods for efficient and scalable learning of massive and complex datasets is an active area of research.  ... 
doi:10.1109/ipdpsw.2017.171 dblp:conf/ipps/MirhoseiniRSK17 fatcat:m6txzqqg5zhk5liu2s3al5hpse

Layered Higher Order N-grams for Hardening Payload Based Anomaly Intrusion Detection

Neminath Hubballi, Santosh Biswas, Sukumar Nandi
2010 2010 International Conference on Availability, Reliability and Security  
Each such n-gram is a 2-tuple where the first element is the byte values of the n-gram and the second is the frequency of the n-gram in the entire training data.  ...  Since the behavior of every application is not the same, a different model is necessary for each application. Studies have revealed that higher order n-grams are good for capturing the network profile.  ...  But our algorithm uses an efficient tree structure for storage (discussed below) and can accommodate any amount of training data.  ... 
doi:10.1109/ares.2010.31 dblp:conf/IEEEares/HubballiBN10 fatcat:bblvbm4bmnhblffnb7leapijxq

KONG: Kernels for ordered-neighborhood graphs [article]

Moez Draief, Konstantin Kutzkov, Kevin Scaman, Milan Vojnovic
2018 arXiv   pre-print
Graphs with ordered neighborhoods are a natural data representation for evolving graphs where edges are created over time, which induces an order.  ...  For the special case of general graphs, i.e. graphs without ordered neighborhoods, the new graph kernels yield efficient and simple algorithms for the comparison of label distributions between graphs.  ...  Count-Sketch and Tensor-Sketch Sketching is an algorithmic tool for the summarization of massive datasets such that key properties of the data are preserved.  ... 
arXiv:1805.10014v2 fatcat:k4s3bfox7resjilnmvhkvm3csq

XML Structural Similarity Search Using MapReduce [chapter]

Peisen Yuan, Chaofeng Sha, Xiaoling Wang, Bin Yang, Aoying Zhou, Su Yang
2010 Lecture Notes in Computer Science  
XML is a de facto standard for web data exchange and information representation. Efficient management of these large volumes of XML data brings challenges to conventional techniques.  ...  In this paper, an efficient and scalable framework is proposed for XML structural similarity search on a large cluster with MapReduce.  ...  Extensive experiments on real datasets show that our framework is efficient and scales well in terms of the size of the corpus for structural similarity searching for XML data.  ... 
doi:10.1007/978-3-642-14246-8_19 fatcat:y44tsi6bpjhqhehfa5l6qagxv4

TIDM: Topic-Specific Information Detection Model

Wen Xu, Jing He, Bo Mao, Youtao Li, Peiqun Liu, Zhiwang Zhang, Jie Cao
2017 Procedia Computer Science  
Unfortunately, due to informal expressions, detecting topic-specific information in the massive data on the internet is a big challenge for traditional text mining methods such as Topic Model.  ...  For training the words and idiomatic phrases, we adopt a supervised learning technique: manually constructing a specific Semantic Dataset for training our model.  ...  Thus, it is very important and urgent to deploy a method to help us detect and control the massive social data automatically.  ... 
doi:10.1016/j.procs.2017.11.365 fatcat:ryoqtr36tzbydho4xi6vsk7gcu

Semantic N-Gram Topic Modeling

Pooja Kherwa, Poonam Bansal
2018 EAI Endorsed Transactions on Scalable Information Systems  
particular topic are calculated and the best semantic N-gram phrases and terms are considered for further topic modeling.  ...  Results are evaluated and it was found that perplexity is drastically improved, with significant improvement in coherence score, specifically for short text data sets like movie reviews and political  ...  Parameter Setting for Experiments: In this experiment, after pre-processing, a collocation model is constructed for learning phrases in the dataset up to N-grams.  ... 
doi:10.4108/eai.13-7-2018.163131 fatcat:cnk3rkd6wzcpln4nus6ritvn4q

GRU based Convolutional Neural Network with Initialized Filters for Text Classification

Linhong Weng, Qing Li, Ding Xuehai
2019 Australian Journal of Intelligent Information Processing Systems  
Besides, for reducing the time of parameter adjustment in the convolutional layer, training data is used to initialize filters.  ...  Text classification is a classical task of natural language processing, which can quickly find corresponding categories from massive amounts of data.  ...  (5)𝑦, L_y, W_y for each class y, which are initialized to be empty 2: for each sentence s in S, obtain n-grams from s and then add the n-grams to N_y, where y is the label of s 3: for each class  ... 
dblp:journals/ajiips/WengLX19 fatcat:v7vmp23ua5ajdmlsi4jt2fiawu

N-gram language models for massively parallel devices

Nikolay Bogoychev, Adam Lopez
2016 Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)  
For many applications, the query speed of N-gram language models is a computational bottleneck.  ...  Although massively parallel hardware like GPUs offers a potential solution to this bottleneck, exploiting this hardware requires a careful rethinking of basic algorithms and data structures.  ...  We thank Kenneth Heafield, Ulrich Germann, Rico Sennrich, Hieu Hoang, Federico Fancellu, Nathan Schneider, Naomi Saphra, Sorcha Gilroy, Clara Vania and the anonymous reviewers for productive discussion  ... 
doi:10.18653/v1/p16-1183 dblp:conf/acl/BogoychevL16 fatcat:zxogg47trnc3tceikutdhvftp4

Comparison of Distributed Computing Approaches to Complexity of n-gram Extraction

Sanzhar Aubakirov, Paulo Trigo, Darhan Ahmed-Zaki
2016 Proceedings of the 5th International Conference on Data Management Technologies and Applications  
We address the task of n-gram text extraction, which is computationally demanding given a large amount of textual data to process.  ...  Each experiment uses both datasets and is repeated for a set of different file sizes. We compared performance and efficiency among MPJ Express, Apache Hadoop and Apache Spark.  ...  We have concluded that: • for dataset type A, for all data sizes, MPJ Express shows the best speedup and efficiency; • for dataset type B, for dataset sizes of 64 Mb to 2048 Mb, MPJ Express shows 3 times better  ... 
doi:10.5220/0005943000250030 dblp:conf/data/AubakirovTA16 fatcat:4kif42bixvh2xmeyaxmk26bxvu

Syllabification Model of Indonesian Language Named-Entity Using Syntactic n-Gram

Ahmad Muammar Fanani, Suyanto Suyanto
2021 Procedia Computer Science  
There are two main ways for automatic syllabification, namely rule-based and data-driven.  ...  Massive data augmentation methods may also be exploited to boost the n-gram model [18].  ... 
doi:10.1016/j.procs.2021.01.058 fatcat:ak6uhecstfgytmjeoxd5ps2jf4

Modelling Student Behavior using Granular Large Scale Action Data from a MOOC [article]

Steven Tang, Joshua C. Peterson, Zachary A. Pardos
2016 arXiv   pre-print
In the field of language modelling, traditional n-gram techniques and modern recurrent neural network (RNN) approaches have been applied to algorithmically find structure in language and predict the next  ...  We find that simply following the syllabus (built-in structure of the course) gives on average 23% accuracy in making this prediction, followed by the n-gram method with 70.4%, and RNN based methods with  ...  Acknowledgement This work was supported by a grant from the National Science Foundation (IIS: BIG-DATA 1547055).  ... 
arXiv:1608.04789v1 fatcat:4unssptxt5epzdzvpf5h2oznom

FNG-IE: an improved graph-based method for keyword extraction from scholarly big-data

Noman Tahir, Muhammad Asif, Shahbaz Ahmad, Muhammad Sheraz Arshad Malik, Hanan Aljuaid, Muhammad Arif Butt, Mobashar Rehman
2021 PeerJ Computer Science  
This research first converted the handcrafted dataset, collected from impact factor journals, into n-gram combinations ranging from unigram to pentagram, and also enhanced traditional graph-based approaches  ...  The research community is drowning in data and starving for information. Keywords are the few words that precisely describe the theme of the whole document.  ...  The third phase comprises n-gram generation and building the network of graphs to prepare data to be used for graph-based techniques.  ... 
doi:10.7717/peerj-cs.389 pmid:33817035 pmcid:PMC7959634 fatcat:pdl5azghxnfknfpjlotgd7xlf4