Efficient Data Structures for Massive N-Gram Datasets
2017
Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '17
In this paper we study the problem of reducing the space required by the representation of such datasets, while maintaining the capability of looking up a given N-gram within microseconds. ...
The efficient indexing of large and sparse N-gram datasets is crucial in several applications in Information Retrieval, Natural Language Processing and Machine Learning. ...
Each dataset comprises all N-grams for 1 ≤ N ≤ 5 and associated frequency counts. Table 2 shows the basic statistics of the datasets. Compared Indexes. ...
doi:10.1145/3077136.3080798
dblp:conf/sigir/PibiriV17
fatcat:cmxkp3lus5hqxivr6j6bkddvae
Efficient n-gram analysis in R with cmscu
2016
Behavior Research Methods
, Inc. dataset. ...
., 2013) modified Kneser-Ney n-gram smoothing algorithm using cmscu as the querying engine. ...
Predictions are provided by the full negative binomial model controlling for other variables. ... Here we use a sketch algorithm known for its efficiency in processing massive real-time data (Cormode & Muthukrishnan ...
doi:10.3758/s13428-016-0766-5
pmid:27496173
fatcat:7skgmebau5gyxl2baponubpq2i
SSketch: An Automated Framework for Streaming Sketch-Based Analysis of Big Data on FPGA
2015
2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines
The stream of input data is used by SSketch for adaptive learning and updating a corresponding ensemble of lower dimensional data structures, a.k.a., a sketch matrix. ...
This paper proposes SSketch, a novel automated computing framework for FPGA-based online analysis of big data with dense (non-sparse) correlation matrices. ...
However, their O(m²n) complexity makes it hard to utilize these well-known algorithms for massive datasets. ...
doi:10.1109/fccm.2015.56
dblp:conf/fccm/RouhaniSMK15
fatcat:n3jjxkpwbzgvvboez4ywtgik54
ExtDict: Extensible Dictionaries for Data- and Platform-Aware Large-Scale Learning
2017
2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
This paper proposes ExtDict, a novel data- and platform-aware framework for iterative analysis/learning of massive and dense datasets. ...
Iterative execution is prohibitively costly for distributed architectures where the cost of moving data is continually growing compared with the cost of arithmetic computing. ...
Finding methods for efficient and scalable learning of massive and complex datasets is an active area of research. ...
doi:10.1109/ipdpsw.2017.171
dblp:conf/ipps/MirhoseiniRSK17
fatcat:m6txzqqg5zhk5liu2s3al5hpse
Layered Higher Order N-grams for Hardening Payload Based Anomaly Intrusion Detection
2010
2010 International Conference on Availability, Reliability and Security
Each such n-gram is a 2-tuple where the first element is the byte values of the n-gram and the second is the frequency of the n-gram in the entire training data. ...
Since the behavior of every application is not the same, a different model is necessary for each application. Studies have revealed that higher order n-grams are good for capturing the network profile. ...
But our algorithm uses an efficient tree structure for storage (discussed below) and can accommodate any amount of training data. ...
doi:10.1109/ares.2010.31
dblp:conf/IEEEares/HubballiBN10
fatcat:bblvbm4bmnhblffnb7leapijxq
KONG: Kernels for ordered-neighborhood graphs
[article]
2018
arXiv
pre-print
Graphs with ordered neighborhoods are a natural data representation for evolving graphs where edges are created over time, which induces an order. ...
For the special case of general graphs, i.e. graphs without ordered neighborhoods, the new graph kernels yield efficient and simple algorithms for the comparison of label distributions between graphs. ...
Count-Sketch and Tensor-Sketch. Sketching is an algorithmic tool for the summarization of massive datasets such that key properties of the data are preserved. ...
arXiv:1805.10014v2
fatcat:k4s3bfox7resjilnmvhkvm3csq
XML Structural Similarity Search Using MapReduce
[chapter]
2010
Lecture Notes in Computer Science
XML is a de facto standard for web data exchange and information representation. Efficient management of these large volumes of XML data brings challenges to conventional techniques. ...
In this paper, an efficient and scalable framework is proposed for XML structural similarity search on a large cluster with MapReduce. ...
Extensive experiments on real datasets show that our framework is efficient and scales well in terms of the size of the corpus for structural similarity searching for XML data. ...
doi:10.1007/978-3-642-14246-8_19
fatcat:y44tsi6bpjhqhehfa5l6qagxv4
TIDM: Topic-Specific Information Detection Model
2017
Procedia Computer Science
Unfortunately, due to informal expressions, detecting massive data on the internet is a big challenge for traditional text mining methods such as topic models. ...
For training the words and idiomatic phrases, we adopt a supervised learning technique: manually constructing a specific Semantic Dataset for training our model. ...
Thus, it's very important and urgent to deploy a method to help us detect and control the massive social data automatically. ...
doi:10.1016/j.procs.2017.11.365
fatcat:ryoqtr36tzbydho4xi6vsk7gcu
Semantic N-Gram Topic Modeling
2018
EAI Endorsed Transactions on Scalable Information Systems
particular topic are calculated and best considerable semantic N-Gram phrases and terms are considered for further topic modeling. ...
Results are evaluated, and it was found that perplexity is drastically improved, with a significant improvement in coherence score specifically for short text datasets like movie reviews and political ...
Parameter Setting for Experiments: In this experiment, after pre-processing, a collocation model is constructed for learning phrases in the dataset up to N-grams. ...
doi:10.4108/eai.13-7-2018.163131
fatcat:cnk3rkd6wzcpln4nus6ritvn4q
GRU based Convolutional Neural Network with Initialized Filters for Text Classification
2019
Australian Journal of Intelligent Information Processing Systems
In addition, to reduce the time of parameter adjustment in the convolutional layer, training data is used to initialize filters. ...
Text classification is a classical task of natural language processing, which can quickly find corresponding categories from massive amounts of data. ...
(5)𝑦, 𝐿_𝑦, 𝑊_𝑦 for each class y, which are initialized to be empty. 2: For each sentence s in S, obtain n-grams from s and then add the n-grams to N_𝑦, where y is the label of s. 3: For each class ...
dblp:journals/ajiips/WengLX19
fatcat:v7vmp23ua5ajdmlsi4jt2fiawu
N-gram language models for massively parallel devices
2016
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
For many applications, the query speed of N-gram language models is a computational bottleneck. ...
Although massively parallel hardware like GPUs offers a potential solution to this bottleneck, exploiting this hardware requires a careful rethinking of basic algorithms and data structures. ...
We thank Kenneth Heafield, Ulrich Germann, Rico Sennrich, Hieu Hoang, Federico Fancellu, Nathan Schneider, Naomi Saphra, Sorcha Gilroy, Clara Vania and the anonymous reviewers for productive discussion ...
doi:10.18653/v1/p16-1183
dblp:conf/acl/BogoychevL16
fatcat:zxogg47trnc3tceikutdhvftp4
Comparison of Distributed Computing Approaches to Complexity of n-gram Extraction
2016
Proceedings of the 5th International Conference on Data Management Technologies and Applications
We address the task of n-gram text extraction, which is computationally demanding given a large amount of textual data to process. ...
Each experiment uses both datasets and the experiment repeats for a set of different file sizes. We compared performance and efficiency among MPJ Express, Apache Hadoop and Apache Spark. ...
We have concluded that: • for dataset type A for all data sizes MPJ Express shows the best speedup and efficiency • for dataset type B for dataset sizes of 64 Mb to 2048 Mb MPJ Express shows 3 times better ...
doi:10.5220/0005943000250030
dblp:conf/data/AubakirovTA16
fatcat:4kif42bixvh2xmeyaxmk26bxvu
Syllabification Model of Indonesian Language Named-Entity Using Syntactic n-Gram
2021
Procedia Computer Science
There are two main ways for automatic syllabification, namely rule-based and data-driven. ...
Massive data augmentation methods may also be exploited to boost the n-gram model [18]. ...
doi:10.1016/j.procs.2021.01.058
fatcat:ak6uhecstfgytmjeoxd5ps2jf4
Modelling Student Behavior using Granular Large Scale Action Data from a MOOC
[article]
2016
arXiv
pre-print
In the field of language modelling, traditional n-gram techniques and modern recurrent neural network (RNN) approaches have been applied to algorithmically find structure in language and predict the next ...
We find that simply following the syllabus (built-in structure of the course) gives on average 23% accuracy in making this prediction, followed by the n-gram method with 70.4%, and RNN based methods with ...
Acknowledgement This work was supported by a grant from the National Science Foundation (IIS: BIG-DATA 1547055). ...
arXiv:1608.04789v1
fatcat:4unssptxt5epzdzvpf5h2oznom
FNG-IE: an improved graph-based method for keyword extraction from scholarly big-data
2021
PeerJ Computer Science
This research first converted the handcrafted dataset, collected from impact factor journals, into n-gram combinations ranging from unigram to pentagram, and also enhanced traditional graph-based approaches ...
The research community is drowning in data and starving for information. The keywords are the words that describe the theme of the whole document in a precise way by consisting of just a few words. ...
The third phase comprises n-grams generation and building the network of graphs to prepare data to be used for graph-based techniques. ...
doi:10.7717/peerj-cs.389
pmid:33817035
pmcid:PMC7959634
fatcat:pdl5azghxnfknfpjlotgd7xlf4