IRLbot

Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, Dmitri Loguinov
2009 ACM Transactions on the Web  
In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate  ...  This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance.  ...  This transition likely occurs for N between 100 billion and 10 trillion, where E[V ] jumps from 271 to 2.7 million pages. When IRLbot reaches this scale, we will consider increasing its hash size.  ... 
doi:10.1145/1541822.1541823 fatcat:i2ullqmlync5llct2udyggge3a

Agnostic topology-based spam avoidance in large-scale web crawls

Clint Sparkman, Hsin-Tsang Lee, Dmitri Loguinov
2011 2011 Proceedings IEEE INFOCOM  
To shed light on Internet-wide spam avoidance, we study the domain-level graph from a 6.3B-page web crawl and compare several agnostic topology-based ranking algorithms on this dataset.  ...  limited resources and allocate the majority of bandwidth to reputable sites.  ...  Our thrust to overcome these challenges has led to a high-performance crawler called IRLbot [24] that can perform multi-billion-page web exploration using a single server, which is accomplished using  ... 
doi:10.1109/infcom.2011.5935303 dblp:conf/infocom/SparkmanLL11 fatcat:f63v5z3bpzgylfw5qbumfhicu4

BUbiNG

Paolo Boldi, Andrea Marino, Massimo Santini, Sebastiano Vigna
2018 ACM Transactions on the Web  
at the same time scales linearly with the amount of resources available.  ...  BUbiNG is an open-source Java fully distributed crawler (no central coordination), and single BUbiNG agents using sizeable hardware can crawl several thousands pages (per agent) per second respecting strict  ...  Finally, we thank Domenico Dato and Renato Soru for providing the hardware and bandwidth for the iStella experiment.  ... 
doi:10.1145/3160017 fatcat:aw7b6gkaqvbrbgut2kbptdipfe

BUbiNG

Paolo Boldi, Andrea Marino, Massimo Santini, Sebastiano Vigna
2014 Proceedings of the 23rd International Conference on World Wide Web - WWW '14 Companion  
at the same time scales linearly with the amount of resources available.  ...  BUbiNG is an open-source Java fully distributed crawler (no central coordination), and single BUbiNG agents using sizeable hardware can crawl several thousands pages (per agent) per second respecting strict  ...  Finally, we thank Domenico Dato and Renato Soru for providing the hardware and bandwidth for the iStella experiment.  ... 
doi:10.1145/2567948.2577304 dblp:conf/www/BoldiMSV14 fatcat:vijton7wx5aspik4alt25p2zly

BUbiNG: Massive Crawling for the Masses [article]

Paolo Boldi, Andrea Marino, Massimo Santini, Sebastiano Vigna
2016 arXiv   pre-print
at the same time scales linearly with the amount of resources available.  ...  and IP-based.  ...  Finally, we thank Domenico Dato and Renato Soru for providing the hardware and bandwidth for the iStella experiments.  ... 
arXiv:1601.06919v1 fatcat:wzmejm75hvf2jdl6suei4dh3dq

Web Crawling

Christopher Olston, Marc Najork
2010 Foundations and Trends in Information Retrieval  
data structures to theoretical questions such as how often to revisit evolving content sources.  ...  This is a survey of the science and practice of web crawling.  ...  It supports distributed operation and should therefore be suitable for very large crawls; but as of the writing of [81] it has not been scaled beyond 100 million pages.  ... 
doi:10.1561/1500000017 fatcat:rjc3oe77c5bipoikqrkwmy3ed4

A Brief History of Web Crawlers

Seyed M. Mirtaheri, Mustafa Emre Dinçktürk, Salman Hooshmand, Gregor V. Bochmann, Guy-Vincent Jourdan, Iosif Viorel Onut
2014 arXiv   pre-print
Web crawlers visit internet applications, collect data, and learn about new web pages from visited pages. Web crawlers have a long and interesting history.  ...  In addition to collecting statistics about the web and indexing the applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on the application.  ...  In 2008, an extremely scalable web crawler called IRLbot ran for 41.27 days on a quad-CPU AMD Opteron 2.6 GHz server and it crawled over 6.38 billion web pages [20].  ... 
arXiv:1405.0749v1 fatcat:rd4chhesdrg4zn2llxklrcnm6m

Probabilistic near-duplicate detection using simhash

Sadhan Sood, Dmitri Loguinov
2011 Proceedings of the 20th ACM international conference on Information and knowledge management - CIKM '11  
We show that with 95% recall compared to deterministic search of prior work [16], our method exhibits 4-14 times faster lookup and requires 2-10 times less RAM on our collection of 70M web pages.  ...  This paper offers a novel look at using a dimensionality-reduction technique called simhash [8] to detect similar document pairs in large-scale collections.  ...  Dataset All our experiments involve a set of 100M web pages crawled by IRLbot [15] in April 2008.  ... 
doi:10.1145/2063576.2063737 dblp:conf/cikm/SoodL11 fatcat:hd4n6qrprfebxoqs4wyfu3tv4m

Clouds and Continuous Analytics Enabling Social Networks for Massively Multiplayer Online Games [chapter]

Alexandru Iosup, Adrian Lăscăteu
2011 Studies in Computational Intelligence  
IRLbot [22] is a generic web crawler designed to scale web crawling to billions of pages using limited resources.  ...  IRLbot was used to crawl over 6 billion valid HTML pages while sustaining an average download rate of over 300 MB/s and almost 2,000 pages/s during an experiment that lasted 41 days.  ... 
doi:10.1007/978-3-642-20344-2_12 fatcat:hc4fwlvyefclncnxjshxbsh4oa

Searching and Browsing Linked Data with SWSE: The Semantic Web Search Engine

Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker
2011 Social Science Research Network  
Throughout, we offer evaluation and complementary argumentation to support our design choices, and also offer discussion on future directions and open research questions.  ...  In particular, many challenges exist in adopting Semantic Web technologies for Web data: the unique challenges of the Web - in terms of scale, unreliability, inconsistency and noise - are largely overlooked  ...  Acknowledgements We would like to thank the anonymous reviewers and the editors for their feedback which helped to improve this paper.  ... 
doi:10.2139/ssrn.3199532 fatcat:ob2ko5yfbzcqpg3fgbrysqstzi

Searching and browsing Linked Data with SWSE: The Semantic Web Search Engine

Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker
2011 Journal of Web Semantics  
Throughout, we offer evaluation and complementary argumentation to support our design choices, and also offer discussion on future directions and open research questions.  ...  In particular, many challenges exist in adopting Semantic Web technologies for Web data: the unique challenges of the Web - in terms of scale, unreliability, inconsistency and noise - are largely overlooked  ...  Acknowledgements We would like to thank the anonymous reviewers and the editors for their feedback which helped to improve this paper.  ... 
doi:10.1016/j.websem.2011.06.004 fatcat:lteloasxhvgbhp3256ehrv5wf4

New approaches for de-novo motif discovery using phylogenetic footprinting - from data acquisition to motif visualization ; [kumulative Dissertation] [article]

Arthur Martin Nettling, Universitäts- Und Landesbibliothek Sachsen-Anhalt, Martin-Luther Universität, Ivo Grosse, Peter Stadler
2018
In this thesis, my colleagues and I have addressed six limitations in three related fields. First, we proposed miRGen and DRUMS, two approaches to improve "data acquisition and data preparation."  ...  We found that all three approaches lead to an improved prediction of transcription factor binding sites. Third, we proposed DiffLogo to improve the "visualization of sequence motifs."  ...  Furthermore, we thank Unister GmbH for the opportunity to develop and publish the software as open source project.  ... 
doi:10.25673/2002 fatcat:crtvubgk7vdlxnq67rn7h4jwxu