Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models

Wei Xu, Joel R. Tetreault, Martin Chodorow, Ralph Grishman, Le Zhao
2011 Conference on Empirical Methods in Natural Language Processing  
The syntactic and distributional information provides extra evidence in addition to that provided by a web-scale n-gram corpus and especially helps with data sparsity problems.  ...  We propose a novel way of incorporating dependency parse and word co-occurrence information into a state-of-the-art web-scale n-gram model for spelling correction.  ...  Acknowledgments We wish to thank Michael Flor of Educational Testing Service for his TrendStream tool, which provides fast access and easy manipulation of the Google N-gram Corpus.  ...
dblp:conf/emnlp/XuTCGZ11 fatcat:h6mdnhi34jgirmgicujujbji6i

OCR Context-Sensitive Error Correction Based on Google Web 1T 5-Gram Data Set [article]

Youssef Bassil, Mohammad Alwani
2012 arXiv   pre-print
The core of the proposed solution is a combination of three algorithms: The error detection, candidate spellings generator, and error correction algorithms, which all exploit information extracted from  ...  Google Web 1T 5-gram data set.  ...  Acknowledgment This research was funded by the Lebanese Association for Computational Sciences (LACSC), Beirut, Lebanon under the "Web-Scale OCR Research Project - WSORP2011".  ...
arXiv:1204.0188v1 fatcat:rfudthyyk5hujiromjlru5hu5e

OCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion [article]

Youssef Bassil, Mohammad Alwani
2012 arXiv   pre-print
This paper proposes a post-processing context-based error correction algorithm for detecting and correcting OCR non-word and real-word errors.  ...  The proposed algorithm is based on Google's online spelling suggestion which harnesses an internal database containing a huge collection of terms and word sequences gathered from all over the web, convenient  ...  ACKNOWLEDGMENTS This research was funded by the Lebanese Association for Computational Sciences (LACSC), Beirut, Lebanon under the "Web-Scale OCR Research Project - WSORP2011".  ...
arXiv:1204.0191v1 fatcat:azx563fb4fhsliys7cvuw6jbay

Learning to Rank Answers to Non-Factoid Questions from Web Collections

Mihai Surdeanu, Massimiliano Ciaramita, Hugo Zaragoza
2011 Computational Linguistics  
such as word senses and semantic roles can have a significant impact on large-scale information retrieval tasks.  ...  We show that it is possible to exploit existing large collections of question-answer pairs (from online social Question Answering sites) to extract such features and train ranking models which combine  ...  Where applicable, we show within parentheses the text representation for the corresponding feature: W for words, N for n-grams, D for syntactic dependencies, and R for semantic roles.  ... 
doi:10.1162/coli_a_00051 fatcat:l6eao4y535hljip2jr3auze44m

Exploiting Syntactic Similarities for Preposition Error Corrections on Indonesian Sentences Written by Second Language Learner

Budi Irmawati, Hiroyuki Shindo, Yuji Matsumoto
2016 Procedia Computer Science  
Experimental results show that the preposition error correction model trained on the artificial data resulted from our method outperforms the correction model trained on the similar size of native data  ...  Our method copies a preposition error from a learner sentence to a native sentence by firstly calculating a syntactic similarity score between the native sentence and the learners' sentence.  ...  Acknowledgements We acknowledge the anonymous reviewers for their constructive comments.  ... 
doi:10.1016/j.procs.2016.04.052 fatcat:facnhefb3remhdpos5qxirvxgy

Reranking Bilingually Extracted Paraphrases Using Monolingual Distributional Similarity

Tsz Ping Chan, Chris Callison-Burch, Benjamin Van Durme
2011 GEometrical Models of Natural Language Semantics  
Raw monolingual data provides a complementary and orthogonal source of information that lessens the commonly observed errors in bilingual pivot-based methods.  ...  The results also show that monolingual distribution similarity can serve as a threshold for high precision paraphrase selection.  ...  Here we calculate distributional similarity using a web-scale n-gram corpus (Brants and Franz, 2006; Lin et al., 2010).  ...
dblp:conf/acl-gems/ChanCD11 fatcat:jjm4yboo6nhf3iuafqiupvweh4

Web-based models for natural language processing

Mirella Lapata, Frank Keller
2005 ACM Transactions on Speech and Language Processing  
and a wider range of n-grams and parts of speech than have been previously explored.  ...  For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the web rather than from a large corpus.  ...  ACKNOWLEDGMENTS We thank the anonymous referees for helpful comments and suggestions.  ... 
doi:10.1145/1075389.1075392 dblp:journals/tslp/LapataK05 fatcat:vn4lwddytvbkbcs2nqntxexdom

Coreference Semantics from Web Features

Mohit Bansal, Dan Klein
2012 Annual Meeting of the Association for Computational Linguistics  
To address semantic ambiguities in coreference resolution, we use Web n-gram features that capture a range of world knowledge in a diffuse but robust way.  ...  Specifically, we exploit short-distance cues to hypernymy, semantic compatibility, and semantic context, as well as general lexical co-occurrence.  ...  Acknowledgments We would like to thank Nathan Gilbert, Adam Pauls, and the anonymous reviewers for their helpful suggestions.  ... 
dblp:conf/acl/BansalK12 fatcat:j5kaj2nxtrgsdaol4mtxptsxry

A Spell Checking Web Service API for Smart City Communication Platforms

Vita S. Barletta, Danilo Caivano, Antonella Nannavecchia, Michele Scalera
2019 Open Journal of Applied Sciences  
This work was the first step of a wider project aimed at providing a Spell Checking Web Service API for Smart City communication platforms able to automatically select, among the large availability of  ...  in order to provide information and online services in real time through platform systems rather than by means of humans, using Artificial Intelligence and Natural Language Processing techniques.  ...  Acknowledgements The authors are very grateful for the collaboration received from SER&Practices-Software Engineering Research & Practices, Spin-off of the University of Bari "Aldo Moro".  ... 
doi:10.4236/ojapps.2019.912066 fatcat:syqnosawpfh5fbq4qazepaoz2m

Syntactic complexity of Web search queries through the lenses of language models, networks and users

Rishiraj Saha Roy, Smith Agarwal, Niloy Ganguly, Monojit Choudhury
2016 Information Processing & Management  
The three complementary studies show that the syntactic structure of Web queries is more complex than what n-grams can capture, but simpler than NL.  ...  We then use complex network analysis for a comparative analysis of the topological properties of queries issued by real Web users and those generated by statistical models.  ...  spelling correction [20] and information retrieval (IR) [21].  ...
doi:10.1016/j.ipm.2016.04.002 fatcat:eqskitwxkjgxjkcewatoxuzb4i

Representation models for text classification

George Giannakopoulos, Petra Mavridi, Georgios Paliouras, George Papadakis, Konstantinos Tserpes
2012 Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics - WIMS '12  
We claim and support that each category calls for different classification settings with respect to the representation model.  ...  Accuracy is increased due to the contextual information that is encapsulated in the edges of the n-gram graphs; efficiency, on the other hand, is boosted by reducing the feature space to a limited set  ...  and Virtualisation under contract no. 257774.  ...
doi:10.1145/2254129.2254148 dblp:conf/wims/GiannakopoulosMPPT12 fatcat:u4igklrzabelroardrqoxtldcy

What makes a good biography?

Lucie Flekova, Oliver Ferschke, Iryna Gurevych
2014 Proceedings of the 23rd international conference on World wide web - WWW '14  
Additionally, we study the classification performance and differences for the biographies of living and dead people as well as those for men and women.  ...  Can we provide meaningful support for quality assurance with automated text processing techniques?  ...  GU 798/14/1, and by the Hessian research excellence program "Landes-Offensive zur Entwicklung Wissenschaftlich-Ökonomischer Exzellenz" as part of the research center "Digital Humanities".  ... 
doi:10.1145/2566486.2567972 dblp:conf/www/FlekovaFG14 fatcat:qny75ak3kjgjrcjaw32a3oec2a

Detecting English Writing Styles For Non Native Speakers [article]

Yanqing Chen, Rami Al-Rfou', Yejin Choi
2017 arXiv   pre-print
We believe such sources of data are crucial to generate robust solutions for the web with high accuracy and easy to deploy in practice.  ...  This paper presents the first attempt, up to our knowledge, to classify English writing styles on this scale with the challenge of classifying day to day language written by writers with different backgrounds  ...  We are also indebted to the NLTK and the Sklearn teams for producing excellent NLP and machine learning resources.  ... 
arXiv:1704.07441v1 fatcat:kxvknm7gt5fv7m4tzchzo6jmsu

Finding Parallel Passages in Cultural Heritage Archives

Martyn Harris, Mark Levene, Dell Zhang, Dan Levene
2018 ACM Journal on Computing and Cultural Heritage  
The key to such a domain-independent and language-independent digital infrastructure is a novel combination of a character-based n-gram language model, space-optimised suffix tree, generalised edit distance  ...  It is at the core of our Samtla (Search And Mining Tools with Linguistic Analysis) system designed in collaboration with historians and linguists.  ...  The interpolation method incorporates information from lower order n-grams, which have more stable counts for approximating the probability for each given n-gram.  ... 
doi:10.1145/3195727 fatcat:plgfe4irgnbgjizo5d3lg6hurq

Statistical machine translation enhancements through linguistic levels

Marta R. Costa-Jussà, Mireia Farrús
2014 ACM Computing Surveys  
However, with this basic approach, there are some issues at each written linguistic level (i.e., orthographic, morphological, lexical, syntactic and semantic) that remain unsolved.  ...  , and linguists.  ...  For other MT systems, there are works like that integrate Maximum Entropy models considering n-gram features from the source sentence.  ... 
doi:10.1145/2518130 fatcat:cy6cud32tjgvjjsgiiv5aj65zi