160,620 Hits in 8.2 sec

On the Distribution of the Number of Missing Words in Random Texts

Sven Rahmann, Eric Rivals
2003 Combinatorics, probability & computing  
We study a generalization: Given a finite alphabet of size σ and a word length q, what is the distribution of the number X of words (of length q) that do not occur in a random text of length n+q−1 over ... For q ≥ 2, X is related to the number Y of empty urns with σ^q urns and n balls, but the law of X is more complicated because successive words in the text overlap. ... Introduction Let X(n, σ, q) be the random number of missing words of length q (also called q-grams) in a random text of length n+q−1 over an alphabet Σ of size σ. ...
doi:10.1017/s0963548302005473 fatcat:5si7exyplnbv5jglp464i7rj7u
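The urn relation quoted above gives a quick first-order baseline: with σ^q urns and n balls, the expected number of empty urns is σ^q (1 − σ^(−q))^n. A minimal sketch of that baseline (helper name is ours; the paper's point is that the exact law of X differs because successive q-grams overlap):

```python
# Expected number of empty urns: a first-order proxy for the expected
# number of missing q-grams in a random text of length n+q-1.
def expected_empty_urns(sigma: int, q: int, n: int) -> float:
    urns = sigma ** q                      # one urn per possible q-gram
    return urns * (1.0 - 1.0 / urns) ** n  # E[Y] for n balls into `urns` urns

# Example: binary alphabet, q = 8, n = 1000 overlapping windows.
print(expected_empty_urns(sigma=2, q=8, n=1000))  # ~5.1 expected empty urns
```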

Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts [chapter]

Sven Rahmann, Eric Rivals
2000 Lecture Notes in Computer Science  
The number of missing words (NMW) of length q in a text, and the number of common words (NCW) of two texts are useful text statistics.  ...  Knowing the distribution of the NMW in a random text is essential for the construction of so-called monkey tests for pseudorandom number generators.  ...  The referees' comments have led to substantial improvements in the presentation of this work. E. R. was supported by a grant of the Deutsches Humangenomprojekt and is now supported by the CNRS.  ... 
doi:10.1007/3-540-45123-4_31 fatcat:6zmtsau2bjdjtpbdwnn5464ppu

Asynchronous Training of Word Embeddings for Large Text Corpora

Avishek Anand, Megha Khosla, Jaspreet Singh, Jan-Hendrik Zab, Zijian Zhang
2019 Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining - WSDM '19  
In this paper, we propose a scalable approach to train word embeddings by partitioning the input space, in order to scale to massive text corpora without sacrificing the performance of the embeddings ... Finally, we also show that we are robust to missing words in sub-models and are able to effectively reconstruct word representations. ... probability distribution of word-context pairs; P(w) and P(c) are the probability distributions for word and context, respectively, in the given text corpora. ...
doi:10.1145/3289600.3291011 dblp:conf/wsdm/AnandKSZZ19 fatcat:aryzqx4g6jdejfbwyvabv2xgwu
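The snippet mentions reconstructing representations for words missing from individual sub-models. A hypothetical sketch of one plausible merge step, assuming each sub-model exposes a word-to-vector map (the names and the averaging rule are illustrative, not the paper's actual method):

```python
import numpy as np

def merge_submodels(submodels: list[dict[str, np.ndarray]]) -> dict[str, np.ndarray]:
    """Combine partial vocabularies: average the vectors a word received
    across all sub-models that learned it."""
    pooled: dict[str, list[np.ndarray]] = {}
    for model in submodels:
        for word, vec in model.items():
            pooled.setdefault(word, []).append(vec)
    return {w: np.mean(vs, axis=0) for w, vs in pooled.items()}
```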

Testing randomness via aperiodic words

Andrew L. Rukhin, Zeev Volkovich
2008 Journal of Statistical Computation and Simulation  
Volkovich is also affiliated with the Department of Mathematics and Statistics, University of Maryland at Baltimore County. A. L. Rukhin's research was supported by a grant no.  ...  MSPF-02G-068 from the National Security Agency. The authors are grateful to the referee for his helpful comments and to J. Soto and A. Roginsky for their interesting discussion.  ...  One of the most important applications of this distribution is in testing for randomness of the underlying text. A number of classic tests of randomness are reviewed in ref. [1] .  ... 
doi:10.1080/10629360600864142 fatcat:xbdiiogb3zb2fgke3ty3te36va

Toward the optimized crowdsourcing strategy for OCR post-correction

Omri Suissa, Avshalom Elmalech, Maayan Zhitomirsky-Geffet
2019 Aslib Journal of Information Management  
In terms of efficiency, the best results were obtained when using longer text in the single-stage structure with no image. ... Findings The analysis suggests that in terms of accuracy, the optimal text length is medium (paragraph-size) and the optimal structure of the experiment is two-phase with a scanned image. ... (Eq. 5) Proofing-Miss = |OCRE ∩ FE| / Errors, where FE is the number of word errors in the Fixed text, OCRE is the number of word errors in the OCRed text, Errors is the number of ...
doi:10.1108/ajim-07-2019-0189 fatcat:d25luvzk6rgi5eyoo53imzzq5y
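Reading Eq. 5 as reconstructed above, Proofing-Miss is the fraction of OCR word errors that survive into the crowd-fixed text. A small sketch under that reading (the exact definition of the Errors denominator is truncated in the excerpt, so it is passed in explicitly):

```python
def proofing_miss(ocr_errors: set, fixed_errors: set, n_errors: int) -> float:
    # |OCRE ∩ FE| / Errors: word errors present in both the OCRed and the fixed text.
    return len(ocr_errors & fixed_errors) / n_errors

# Example with toy error sets (hypothetical data).
print(proofing_miss({"teh", "wrold", "adn"}, {"teh", "adn"}, n_errors=3))  # 0.666...
```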

A Robust Text Coverless Information Hiding Based on Multi-Index Method

Lin Xiang, Jiaohua Qin, Xuyu Xiang, Yun Tan, Neal N. Xiong
2021 Intelligent Automation and Soft Computing  
Then, search all texts containing the keyword ID in the big data text, and use the robust text search algorithm to find multiple texts. ... Secondly, we transform keywords into keyword IDs using the word index table and introduce a random increment factor for control. ... The experiment results are given in Section 4. Finally, Section 5 provides the conclusion. ...
doi:10.32604/iasc.2021.017720 fatcat:fsssqr3gzjarpcvsrwumwsv4iu

A Big Data Text Coverless Information Hiding Based on Topic Distribution and TF-IDF

Jiaohua Qin, Zhuo Zhou, Yun Tan, Xuyu Xiang, Zhibin He
2021 International Journal of Digital Crime and Forensics  
At the same time, random numbers are introduced to control the keyword order of secret information. ... However, since text coverless hiding has relatively low hiding capacity, this paper proposed a big data text coverless information hiding method based on LDA (latent Dirichlet allocation) topic distribution ... [Figure: words index] Figure 4: the text index. Algorithm 2: control increasing random factor (Input: initial random number R; Output: the random number of w_i; Parameter: branch number N) ...
doi:10.4018/ijdcf.20210701.oa4 fatcat:k2cbkfchnff6fesg2lidii5hty
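The excerpt names only the interface of Algorithm 2 (input R, output a random number per keyword w_i, branch number N). A loose guess at what a "control increasing random factor" could look like, offered strictly as an assumption from those inputs and outputs, not the paper's algorithm:

```python
import random

def increasing_random_factors(R: int, N: int, num_keywords: int) -> list[int]:
    """Assign each keyword a strictly increasing random number, starting
    from the initial random number R, each step bounded by the branch number N."""
    values, current = [], R
    for _ in range(num_keywords):
        current += random.randint(1, N)  # random increment controlled by N
        values.append(current)
    return values
```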

Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs [chapter]

Andreas Wagner, Veli Bicer, Thanh Tran, Rudi Studer
2014 Lecture Notes in Computer Science  
In experiments on real-world data, TopGuess allowed for great improvements in estimation accuracy, without sacrificing efficiency.  ...  Previous works on selectivity estimation, however, suffer from inherent drawbacks, which are reflected in efficiency and effectiveness issues.  ...  Yet, TopGuess resolves the issue of missing words completely: the TopGuess parameters (stored on disk) capture all words in the vocabulary.  ... 
doi:10.1007/978-3-319-11915-1_7 fatcat:ufz44u3z4zdedooeusolv2qthi

Artificial intelligence-based Multimodal Risk Assessment Model for Surgical site infection (AMRAMS): a development and validation study (Preprint)

Weijia Chen, Zhijun Lu, Lijue You, Lingling Zhou, Jie Xu, Ken Chen
2020 JMIR Medical Informatics  
The AUROCs of LASSO, random forest, and GBDT models using text embeddings were statistically higher than the AUROCs of models not using text embeddings (P<.001).  ...  We used word-embedding techniques to encode text information, and we trained the LASSO (least absolute shrinkage and selection operator) model, random forest model, gradient boosting decision tree (GBDT  ...  Here, n was the padding length decided by the upper boundary of the 1.5-IQR rule based on the distribution of word sequence lengths in the development set.  ... 
doi:10.2196/18186 pmid:32538798 fatcat:23guhp6itrdljkp5qpbchgr2ma
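The padding rule quoted above is concrete enough to sketch: n is the upper boundary of the 1.5-IQR rule over word-sequence lengths in the development set. A minimal version (helper name is ours):

```python
import numpy as np

def padding_length(seq_lengths: list) -> int:
    """Upper boundary of the 1.5-IQR rule: Q3 + 1.5 * (Q3 - Q1)."""
    q1, q3 = np.percentile(seq_lengths, [25, 75])
    return int(q3 + 1.5 * (q3 - q1))

print(padding_length([12, 30, 45, 60, 88, 120, 300]))  # sequences longer than this get truncated
```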

Word2vec Feature Extraction in Traveler Comments Using Machine Learning in Imbalance Data

2020 International Journal of Emerging Trends in Engineering Research  
The opinion given is in the form of comments and ratings on a topic. This research was conducted to classify user comments on tourist attractions into ratings on a scale of 1 to 5. ... The dataset used is user opinion data from the tripadvisor application, with 17,675 records in total. Word2vec is used to extract semantic features from the words in the data. ... Table 1 (Number of data by tourist type): 1 Temple 5,874; 2 Mountain 5,587; 3 Beach ... The rating data in the obtained dataset is distributed as shown in ...
doi:10.30534/ijeter/2020/1218102020 fatcat:i7fxuc6e4zac3mptuzliktzc64
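The abstract does not spell out how word2vec vectors become per-comment features; a common recipe, offered here only as an assumption, is to average the vectors of a comment's in-vocabulary words:

```python
import numpy as np

def comment_vector(tokens: list, embeddings: dict, dim: int) -> np.ndarray:
    """Average word2vec vectors; zero vector if no token is in vocabulary."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```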

Noise-aware Missing Shipment Return Comment Classification in e-Commerce

Avijit Saha, Vishal Kakkar, T. Ravindra Babu
2018 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval  
E-Commerce companies face a number of challenges in return requests. ... Claims of missing items are one such challenge, where a customer claims through return comments that the main product is missing from the shipment. ... In Filtering, stop words are removed. Then, the preprocessed text is converted into feature vectors. One of the widely used models for feature generation is the bag-of-words (BOW) model [18]. ...
dblp:conf/sigir/SahaKB18 fatcat:ai56yqmoufer5jwlxxzpbb757i
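The preprocessing the snippet describes (stop-word filtering, then bag-of-words features) maps directly onto a standard vectorizer; a minimal sketch with scikit-learn, using invented example comments:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english")  # stop-word filtering + BOW in one step
X = vectorizer.fit_transform([
    "main product is missing from the shipment",
    "received an empty box, item missing",
])
print(X.shape)  # (2, vocabulary size)
```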

Text Analysis and Automatic Triage of Posts in a Mental Health Forum

Ehsaneddin Asgari, Soroush Nasiriany, Mohammad R.K. Mofrad
2016 Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology  
In addition, we perform feature importance analysis to characterize key features in the identification of critical posts. ... We present an approach for automatic triage of message posts in the ReachOut.com mental health forum, which was a shared task in the 2016 Computational Linguistics and Clinical Psychology (CLPsych) workshop. ... number of times a post is viewed by the forum users (1); Body: tf-idf representation of the text in the body of the post (55,758); Subject: tf-idf representation of the text in the subject of the post (3,690); Embedded-Body ...
doi:10.18653/v1/w16-0318 dblp:conf/naacl/AsgariNM16 fatcat:k4elltvz55cgbbptzz4mrv5uuq
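The feature fragment above pairs each text field with its own tf-idf space (body: 55,758 dimensions; subject: 3,690 on their corpus). One plausible way to assemble such features, sketched with scikit-learn on made-up posts:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

bodies = ["I feel much better after talking here", "nothing seems to help anymore"]
subjects = ["checking in", "need help"]

# Separate tf-idf vocabularies per field, concatenated into one feature matrix.
X = hstack([
    TfidfVectorizer().fit_transform(bodies),
    TfidfVectorizer().fit_transform(subjects),
])
print(X.shape)
```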

Fast Text Classification Using Sequential Sampling Processes [chapter]

Michael D. Lee
2001 Lecture Notes in Computer Science  
These algorithms make extremely fast decisions, because they need to examine only a small number of words in each text document. ... A central problem in information retrieval is the automated classification of text documents. ... Table 1: Mean number of words examined, mean percentage of words examined, and mean percentage error of the forced-choice random walk and accumulator text classifiers. ...
doi:10.1007/3-540-45656-2_27 fatcat:gm3nbzqeunfatlpheiem622uh4
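A sequential-sampling classifier stops as soon as accumulated evidence crosses a threshold, which is why only a few words per document need to be examined. A sketch of the random-walk variant under that reading (the per-class word probability tables p_a and p_b are hypothetical inputs, not the paper's implementation):

```python
import math

def random_walk_classify(words, p_a, p_b, threshold=5.0):
    """Accumulate log-likelihood evidence word by word; stop early
    once |evidence| crosses the decision threshold."""
    evidence = 0.0
    for i, w in enumerate(words, 1):
        evidence += math.log(p_a.get(w, 1e-6) / p_b.get(w, 1e-6))
        if abs(evidence) >= threshold:
            return ("A" if evidence > 0 else "B"), i  # (class, words examined)
    return ("A" if evidence > 0 else "B"), len(words)
```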

Learning cross-modality similarity for multinomial data

Yangqing Jia, Mathieu Salzmann, Trevor Darrell
2011 2011 International Conference on Computer Vision  
In this paper, we propose a model that addresses both these challenges. Our model can be seen as a Markov random field of topic models, which connects the documents based on their similarity. ... In order to leverage the information present in all the modalities, one must model the relationships between them. ... As in LDA, we generate θ_d from a Dirichlet prior. However, in addition to this prior, the topic distribution also depends on the random field. ...
doi:10.1109/iccv.2011.6126524 dblp:conf/iccv/JiaSD11 fatcat:qqjsdmxwjncp5e3bz5cswujdii
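The generative step named in the snippet, drawing each document's topic distribution θ_d from a Dirichlet prior, is one line with NumPy (a symmetric prior with concentration α is assumed here; the paper then couples these draws through the random field):

```python
import numpy as np

alpha, num_topics = 0.1, 20
theta_d = np.random.dirichlet(np.full(num_topics, alpha))  # θ_d on the topic simplex
print(theta_d.sum())  # ≈ 1.0
```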

LONG TERMS AND READABILITY OF PHYSICS SCHOOL TEXT

Ivana Škorecová, Aba Teleki, Ľubomír Zelenický
2017 CBU International Conference Proceedings  
The difference between the probability distribution of the compared texts corresponds with the differences between the appropriate survival functions, where random fluctuations in the frequency of terms ... The results show a strong correlation between the test scores and the probability distributions of terms used in the school texts. ... showing the approximation of the χ² distribution defined in ... Table 1 (Number of words and terms of each text analyzed): Full-text 7,604 words; Terms 1,521. The ...
doi:10.12955/cbup.v5.1031 fatcat:sdwuvrlfu5bgrkhuo6hwwlfw24
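The survival functions being compared can be made concrete: S(k) is the fraction of terms occurring at least k times in a text, so two texts are compared curve against curve. A small sketch of that computation (our formulation of what the snippet describes):

```python
import numpy as np

def survival_function(term_counts, ks):
    """S(k) = fraction of terms with frequency >= k, for each k in ks."""
    counts = np.asarray(term_counts)
    return np.array([(counts >= k).mean() for k in ks])

print(survival_function([1, 1, 2, 3, 5, 8], ks=np.arange(1, 6)))
```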
Showing results 1 — 15 out of 160,620 results