Combining probability models and web mining models: a framework for proper name transliteration
2007
Journal of Special Topics in Information Technology and Management
a Web mining model that uses word frequency of occurrence information from the Web. ...
Our results show promise for using transliteration techniques to improve multilingual Web retrieval. ...
Experiment methodology We used the 10-fold cross validation method, commonly used in testing data mining algorithms and models, to test system accuracy. ...
doi:10.1007/s10799-007-0031-9
fatcat:diqx2xyhwzealptfce6shwh3zq
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
[article]
2003
arXiv
pre-print
The Web provides a vast resource for the automatic construction of parallel corpora which can be used to train statistical translation models automatically. ...
In this paper, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. ...
We would like to thank Xerox Research Center Europe (XRCE) for making their Xelda toolkit available to us. ...
arXiv:cs/0312008v1
fatcat:hztoxce3frcgpbsmegftpg4rdu
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
[article]
2020
arXiv
pre-print
These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus. ...
We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around ...
We propose alternative evaluation metrics that better estimate the quality of LangID models from the perspective of web-mining (Section 5) and perform a deep, 600-language web-crawl (Section 6). This work ...
arXiv:2010.14571v2
fatcat:qcm4knca6fd4finqbqq4fk6dwu
Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval
2003
Computational Linguistics
The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models automatically. ...
In this article, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. ...
We would like to thank Xerox Research Center Europe (XRCE) for making its Xelda toolkit available to us. ...
doi:10.1162/089120103322711587
fatcat:dkxidh7b3vdszodokvwhjd4nre
Automatic Evaluation of Search Ontologies in the Entertainment Domain Using Natural Language Processing
[chapter]
2011
Applied Semantic Web Technologies
We automatically construct a domain corpus from a set of movie individuals by crawling the Web for movie reviews. ...
On the basis of this mapping, we evaluate the adequacy of the ontology by translating ontology properties into properties over the textual corpora, which can be empirically tested using natural language ...
Acknowledgments This research is supported by Deutsche Telekom at the BGU T-Lab laboratories of Ben-Gurion University. ...
doi:10.1201/b11085-14
fatcat:ec3r7pr4xndozmellvyc6gi3su
A phonetic similarity model for automatic extraction of transliteration pairs
2007
ACM Transactions on Asian Language Information Processing
The unsupervised learning approach works almost as well as the supervised one, thus allowing us to deploy automatic extraction of transliteration pairs in the Web space. ...
Then, in the validation process, we qualify the transliteration pair candidates with a hypothesis test. ...
We also thank Yu Chen at the Institute for Infocomm Research, Singapore, for her efforts in improving the manuscript; Wen-Hsiang Lu at the National Cheng-Kung University for providing hyperlink and Web ...
doi:10.1145/1282080.1282081
fatcat:cabttqaf6vd6la4xfh46pxtbcu
Human Languages in Source Code: Auto-Translation for Localized Instruction
[article]
2019
arXiv
pre-print
Our translations have already been used in classrooms around the world, and represent a first step in an important open CS-education problem. ...
The study is to the best of our knowledge the first on human-language in code and covers 2.9 million Java repositories. ...
We also thank the WWW teachers for educating students around the world in their local language. ...
arXiv:1909.04556v1
fatcat:b6idol37efdshiyt365ialldi4
Security improvements Zone Routing Protocol in Mobile Ad Hoc Network
2014
International Journal of Computer Applications Technology and Research
A hybrid routing protocol should use a mixture of both proactive and reactive approaches. Hence, in recent years, several hybrid routing protocols have been proposed, such as ZRP [5]. ...
The attractive features of ad-hoc networks such as dynamic topology, absence of central authorities and distributed cooperation hold the promise of revolutionizing the ad-hoc networks across a range of ...
The validation was carried out using ten folds of the training sets. ...
doi:10.7753/ijcatr0309.1001
fatcat:n7yb26a6zbgwnpvfmvka3cnpoq
No Language Left Behind: Scaling Human-Centered Machine Translation
[article]
2022
arXiv
pre-print
Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering ...
More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource ...
We thank the Wikimedia Foundation staff and Wikimedia volunteers who worked with us and provided feedback to our model. We thank Vishrav Chaudhary for help with the data pipeline. ...
arXiv:2207.04672v2
fatcat:gsbt3imt4bgodpmubpaq53onnm
Variations on language modeling for information retrieval
2005
SIGIR Forum
The next two subsections describe the process of mining a probabilistic dictionary from the Web. The first step in this process is to find parallel texts on the Web. 5.3.1.1. Mining parallel pages. ...
, word counts. ...
Instead, papers which use the SMART IR system use the smartinternal encoding, which has never been published. This encoding has a different semantics for the letter n, thus giving rise to confusion. ...
doi:10.1145/1067268.1067291
fatcat:h23lp5aqfvfu5iecwnihfme244
Open challenges for data stream mining research
2014
SIGKDD Explorations
Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. ...
of complex data, and evaluation of stream mining algorithms. ...
on the challenges in stream mining. ...
doi:10.1145/2674026.2674028
fatcat:y3bozzeohveibgxb5wmiwfcogm
Message from the general chair
2015
2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
We explore ways of using the resulting grounding to boost the performance of a state-of-the-art co-reference resolution system. ...
To inject knowledge, we use a state-of-the-art system which cross-links (or "grounds") expressions in free text to Wikipedia. ...
We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. ...
doi:10.1109/ispass.2015.7095776
dblp:conf/ispass/Lee15
fatcat:ehbed6nl6barfgs6pzwcvwxria
A Manual for Web Corpus Crawling of Low Resource Languages
2019
vs. convenient, annotated vs. raw, small vs. big are only some antonyms that can be used to describe the range of possible corpora that can be and have been created. ...
Since the seminal publication of "Web as Corpus" [1], the potential of creating corpora from the web has been realized for good for the creation of both online and offline corpora: noisy vs. clean, balanced ...
Conclusion We have presented a guideline to searches for content in LRLs on the web which sprang from the experiences made and resources gathered during a course in 2019, the concept of which we had presented ...
doi:10.6092/issn.2532-8816/9931
fatcat:4z2hqoaotrf5ndbpjqoewbvmru
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
2020
Proceedings of the 28th International Conference on Computational Linguistics
unpublished
These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus. ...
We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around ...
We propose alternative evaluation metrics that better estimate the quality of LangID models from the perspective of web-mining (Section 5) and perform a deep, 600-language web-crawl (Section 6). This work ...
doi:10.18653/v1/2020.coling-main.579
fatcat:e5wzlagpozbatjv2vxepvv4mde
User Behavior Analysis on Social Web with Knowledge Discovery Techniques
2021
Such discovered knowledge can empower new web services and applications with easily interpretable and comprehensible conclusions for the end users. ...
The emergence of social media platforms changed drastically the way that people communicate. ...
For all the experiments presented next, the WEKA data mining toolkit is used and a repeated (10 times) 10-fold cross validation [Kim, 2009], providing the relevant standard deviation (STD). ...
doi:10.26262/heal.auth.ir.295524
fatcat:m5dqoztvh5b5pnmoigcboeblpu