Combining probability models and web mining models: a framework for proper name transliteration

Yilu Zhou, Feng Huang, Hsinchun Chen
2007 Journal of Special Topics in Information Technology and Management  
a Web mining model that uses word frequency of occurrence information from the Web.  ...  Our results show promise for using transliteration techniques to improve multilingual Web retrieval.  ...  Experiment methodology We used the 10-fold cross validation method, commonly used in testing data mining algorithms and models, to test system accuracy.  ... 
doi:10.1007/s10799-007-0031-9 fatcat:diqx2xyhwzealptfce6shwh3zq

Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval [article]

Wessel Kraaij, Jian-Yun Nie, Michel Simard
2003 arXiv   pre-print
The Web provides a vast resource for the automatic construction of parallel corpora which can be used to train statistical translation models automatically.  ...  In this paper, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process.  ...  We would like to thank Xerox Research Center Europe (XRCE) for making their Xelda toolkit available to us.  ... 
arXiv:cs/0312008v1 fatcat:hztoxce3frcgpbsmegftpg4rdu

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus [article]

Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna
2020 arXiv   pre-print
These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.  ...  We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around  ...  We propose alternative evaluation metrics that better estimate the quality of LangID models from the perspective of web-mining (Section 5) and perform a deep, 600-language web-crawl (Section 6) This work  ... 
arXiv:2010.14571v2 fatcat:qcm4knca6fd4finqbqq4fk6dwu

Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval

Wessel Kraaij, Jian-Yun Nie, Michel Simard
2003 Computational Linguistics  
The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models automatically.  ...  In this article, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process.  ...  We would like to thank Xerox Research Center Europe (XRCE) for making its Xelda toolkit available to us.  ... 
doi:10.1162/089120103322711587 fatcat:dkxidh7b3vdszodokvwhjd4nre

Automatic Evaluation of Search Ontologies in the Entertainment Domain Using Natural Language Processing [chapter]

Michael Elhadad, David Gabay, Yael Netzer
2011 Applied Semantic Web Technologies  
We automatically construct a domain corpus from a set of movie individuals by crawling the Web for movie reviews.  ...  On the basis of this mapping, we evaluate the adequacy of the ontology by translating ontology properties into properties over the textual corpora, which can be empirically tested using natural language  ...  Acknowledgments This research is supported by Deutsche Telekom at the BGU T-Lab laboratories of Ben-Gurion University.  ... 
doi:10.1201/b11085-14 fatcat:ec3r7pr4xndozmellvyc6gi3su

A phonetic similarity model for automatic extraction of transliteration pairs

Jin-Shea Kuo, Haizhou Li, Ying-Kuei Yang
2007 ACM Transactions on Asian Language Information Processing  
The unsupervised learning approach works almost as well as the supervised one, thus allowing us to deploy automatic extraction of transliteration pairs in the Web space.  ...  Then, in the validation process, we qualify the transliteration pair candidates with a hypothesis test.  ...  We also thank Yu Chen at the Institute for Infocomm Research, Singapore, for her efforts in improving the manuscript; Wen-Hsiang Lu at the National Cheng-Kung University for providing hyperlink and Web  ... 
doi:10.1145/1282080.1282081 fatcat:cabttqaf6vd6la4xfh46pxtbcu

Human Languages in Source Code: Auto-Translation for Localized Instruction [article]

Chris Piech, Sami Abu-El-Haija
2019 arXiv   pre-print
Our translations have already been used in classrooms around the world, and represent a first step in an important open CS-education problem.  ...  The study is to the best of our knowledge the first on human-language in code and covers 2.9 million Java repositories.  ...  We also thank the WWW teachers for educating students around the world in their local language.  ... 
arXiv:1909.04556v1 fatcat:b6idol37efdshiyt365ialldi4

Security improvements Zone Routing Protocol in Mobile Ad Hoc Network

Mahsa Seyyedtaj, Mohammad Ali Jabraeil Jamali
2014 International Journal of Computer Applications Technology and Research  
A hybrid routing protocol should use a mixture of both proactive and reactive approaches. Hence, in the recent years, several hybrid routing protocols are proposed like ZRP [5].  ...  The attractive features of ad-hoc networks such as dynamic topology, absence of central authorities and distributed cooperation hold the promise of revolutionizing the ad-hoc networks across a range of  ...  The validation was carried out using ten folds of the training sets.  ... 
doi:10.7753/ijcatr0309.1001 fatcat:n7yb26a6zbgwnpvfmvka3cnpoq

No Language Left Behind: Scaling Human-Centered Machine Translation [article]

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun (+27 others)
2022 arXiv   pre-print
Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering  ...  More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource  ...  We thank the Wikimedia Foundation staff and Wikimedia volunteers who worked with us and provided feedback to our model. We thank Vishrav Chaudhary for help with the data pipeline.  ... 
arXiv:2207.04672v2 fatcat:gsbt3imt4bgodpmubpaq53onnm

Variations on language modeling for information retrieval

Wessel Kraaij
2005 SIGIR Forum  
The next two subsections describe the process of mining a probabilistic dictionary from the Web. The first step in this process is to find parallel texts on the Web. 5.3.1.1. Mining parallel pages.  ...  , word counts.  ...  Instead, papers which use the SMART IR system use the smartinternal encoding which has never been published. This encoding has a different semantics for the letter n, thus giving rise to confusion.  ... 
doi:10.1145/1067268.1067291 fatcat:h23lp5aqfvfu5iecwnihfme244

Open challenges for data stream mining research

Georg Krempl, Myra Spiliopoulou, Jerzy Stefanowski, Indre Žliobaite, Dariusz Brzeziński, Eyke Hüllermeier, Mark Last, Vincent Lemaire, Tino Noack, Ammar Shaker, Sonja Sievi
2014 SIGKDD Explorations  
Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive.  ...  of complex data, and evaluation of stream mining algorithms.  ...  on the challenges in stream mining.  ... 
doi:10.1145/2674026.2674028 fatcat:y3bozzeohveibgxb5wmiwfcogm

Message from the general chair

Benjamin C. Lee
2015 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)  
We explore ways of using the resulting grounding to boost the performance of a state-of-the-art co-reference resolution system.  ...  To inject knowledge, we use a state-of-the-art system which cross-links (or "grounds") expressions in free text to Wikipedia.  ...  We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models.  ... 
doi:10.1109/ispass.2015.7095776 dblp:conf/ispass/Lee15 fatcat:ehbed6nl6barfgs6pzwcvwxria

A Manual for Web Corpus Crawling of Low Resource Languages

Armin Hoenen, Cemre Koc, Marc Daniel Rahn
2019
vs. convenient, annotated vs. raw, small vs. big are only some antonyms that can be used to describe the range of possible corpora that can be and have been created.  ...  Since the seminal publication of "Web as Corpus" [1], the potential of creating corpora from the web has been realized for good for the creation of both online and offline corpora: noisy vs. clean, balanced  ...  Conclusion We have presented a guideline to searches for content in LRLs on the web which sprang from the experiences made and resources gathered during a course in 2019, the concept of which we had presented  ... 
doi:10.6092/issn.2532-8816/9931 fatcat:4z2hqoaotrf5ndbpjqoewbvmru

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna
2020 Proceedings of the 28th International Conference on Computational Linguistics   unpublished
These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.  ...  We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around  ...  We propose alternative evaluation metrics that better estimate the quality of LangID models from the perspective of web-mining (Section 5) and perform a deep, 600-language web-crawl (Section 6) This work  ... 
doi:10.18653/v1/2020.coling-main.579 fatcat:e5wzlagpozbatjv2vxepvv4mde

User Behavior Analysis on Social Web with Knowledge Discovery Techniques

Δέσποινα Δ. Χατζάκου
2021
Such discovered knowledge can empower new web services and applications with easily interpretable and comprehensible conclusions for the end users.  ...  The emergence of social media platforms changed drastically the way that people communicate.  ...  For all the experiments presented next, the WEKA data mining toolkit is used and a repeated (10 times) 10-fold cross validation [Kim, 2009], providing the relevant standard deviation (STD).  ... 
doi:10.26262/heal.auth.ir.295524 fatcat:m5dqoztvh5b5pnmoigcboeblpu