Crawling the Web [chapter]

Gautam Pant, Padmini Srinivasan, Filippo Menczer
2004 Web Dynamics  
The large size and the dynamic nature of the Web make it necessary to continually maintain Web based information retrieval systems.  ...  While some systems rely on crawlers that exhaustively crawl the Web, others incorporate "focus" within their crawlers to harvest application- or topic-specific collections.  ...  Acknowledgments The authors would like to thank the anonymous referees for their valuable suggestions. This work is funded in part by NSF CAREER Grant No. IIS-0133124 to FM.  ... 
doi:10.1007/978-3-662-10874-1_7 fatcat:wz2wsoi3d5h2vebem2rof2jv2e

Crawling the Hidden Web

Sriram Raghavan, Hector Garcia-Molina
2001 Very Large Data Bases Conference  
Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of Web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization  ...  In this paper, we address the problem of designing a crawler capable of extracting content from this hidden Web.  ...  We proposed an application/task specific approach to hidden Web crawling.  ... 
dblp:conf/vldb/RaghavanG01 fatcat:kgcaiplixrdybb2pivlvdg6oua

Crawling the Infinite Web

Ricardo A. Baeza-Yates, Carlos Castillo
2007 Journal of Web Engineering  
This poses a problem for the crawlers of Web search engines, as the network and storage resources required for indexing Web pages are neither infinite nor free.  ...  We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited.  ...  We also thank Luc Devroye for pointing out the average return time result in Markov chains.  ... 
dblp:journals/jwe/Baeza-YatesC07 fatcat:hxlw6akktbepjf5musimjbpwf4

Explorations On The Web Crawling Algorithms

2017 International Journal of Recent Trends in Engineering and Research  
This paper surveys and analyzes web crawling algorithms to investigate which performs best.  ...  Crawling algorithms are thus important in selecting the pages that satisfy the user's needs.  ...  BASICS OF WEB CRAWLING A web crawler is an Internet system that gathers pages from the WWW, normally for the purpose of indexing them properly.  ... 
doi:10.23883/ijrter.2017.3516.rmznt fatcat:eait4icldzfwhfg6u3a3ex4qyu

Crawling the Web Surface Databases

Vidushi Singhal, Sachin Sharma
2012 International Journal of Computer Applications  
The World Wide Web is growing at a rapid rate. A web crawler is a computer program which independently browses the World Wide Web. The size of the web as of February 2007 was 29 billion pages.  ...  One of the most important uses of web pages is indexing and keeping web pages up to date, which search engines can use to serve end-user queries.  ...  The second is that multiple crawlers can redundantly crawl the same regions of the web.  ... 
doi:10.5120/8309-1827 fatcat:n4pnskxznbesxoz5mrqqoor47u

Crawling the web for structured documents

Julián Urbano, Juan Loréns, Yorgos Andreadakis, Mónica Marrero
2010 Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM '10  
documents off the Web.  ...  Although XML documents are the immediate choice, the Web is filled with several other types of structured information, which account for millions of other documents.  ...  ACKNOWLEDGEMENTS We acknowledge the Spanish National Plan of Scientific Research, Development and Technological Innovation, which has funded this work through the research project TIN2007-67153.  ... 
doi:10.1145/1871437.1871773 dblp:conf/cikm/UrbanoLAM10 fatcat:3gqugbw6pventfkfeo5di5fhbu

Focused crawling for the hidden web

Panagiotis Liakos, Alexandros Ntoulas, Alexandros Labrinidis, Alex Delis
2015 World wide web (Bussum)  
Web site with the minimum cost.  ...  The Hidden Web comprises all these information sources that conventional web crawlers are incapable of discovering.  ...  A preliminary version of the work appeared in the Proc. of the 13th Int. Conf. on Web Information Systems Engineering [15] .  ... 
doi:10.1007/s11280-015-0349-x fatcat:nycbax6khbaytgjo7zowb5rswi

DATABASE: Spiders Crawl Onto the Web

2005 Science  
DATABASE Spiders Crawl Onto the Web  ...  bombs," partly molten lava globs.  ...  Web to survey people.  ... 
doi:10.1126/science.309.5744.2141e fatcat:4wf4eiowsrftvkctjgguq4duha

Web-collaborative filtering: recommending music by crawling the Web

William W Cohen, Wei Fan
2000 Computer Networks  
In CF, entities are recommended to a new user based on the stated preferences of other, similar users. We describe a CF spider that collects from the Web lists of semantically related entities.  ...  Instead, the CF spider uses commercial Web-search engines to find pages likely to contain lists in the domain of interest, and then applies previously-proposed heuristics [Cohen, 1999] to extract lists  ...  index the Web.  ... 
doi:10.1016/s1389-1286(00)00057-8 fatcat:7nn6o4imwrdlbh2fqt4xbrbwju

On the Stability of Web Crawling and Web Search [chapter]

Reid Anderson, Christian Borgs, Jennifer Chayes, John Hopcroft, Vahab Mirrokni, Shang-Hua Teng
2008 Lecture Notes in Computer Science  
We introduce a notion of stable cores, which is the set of web pages that are usually contained in the crawling buffer when the buffer size is smaller than the total number of web pages.  ...  In this paper, we analyze a graph-theoretic property motivated by web crawling.  ...  Web crawling can be viewed as a dynamic process over the entire web graph.  ... 
doi:10.1007/978-3-540-92182-0_60 fatcat:sc3jfu2prnboxc3ukbtsfw7i44

Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce

Alex Stolz, Martin Hepp
2015 International Semantic Web Conference  
In this paper, we conduct a small-sized experiment where we compare the Web pages from a popular Web crawler, Common Crawl, with the URLs in sitemap files of respective Web sites.  ...  In the recent years, the publication of structured data inside HTML content of Web sites has become a mainstream feature of commercial Web sites.  ...  Common Crawl with respect to e-commerce on the Web.  ... 
dblp:conf/semweb/StolzH15 fatcat:63nt55huivehxkjygqg7gwwxsi

Crawling the Hidden Web: An Approach to Dynamic Web Indexing

Moumie Soulemane, Mohammad Rafiuzzaman, Hasan Mahmud
2012 International Journal of Computer Applications  
With the ever-growing quantity of such hidden web pages, this issue continues to raise diverse opinions among researchers and practitioners in the web mining communities.  ...  General Terms Web content mining, hidden web indexing, elimination of duplicate URLs, Hadoop MapReduce for index updating.  ...  An interesting observation to be made here is the advantage of crawling the hidden web over crawling the surface web alone.  ... 
doi:10.5120/8717-7290 fatcat:2kgprpkhbrgvxaoaco7fbbmifu

Using sensors in the web crawling process [article]

Ilya Zemskov
2003 arXiv pre-print
This paper offers a short description of an Internet information field monitoring system, which places a special module-sensor on the side of the Web-server to detect changes in information resources and  ...  subsequently reindexes only the resources signalized by the corresponding sensor.  ...  the module in the new versions of installation package for the Web-server software).  ... 
arXiv:cs/0312033v1 fatcat:u6rmkkqxkrdo3jesbwlovx2vte

Avoiding Useless Content While Crawling the Web

Olaf Behrendt, Alexander Hierle
2021 Zenodo  
INTRODUCTION Crawling the web is a challenging endeavor and as such a hurdle for start-ups of web search engines (SE).  ...  ADL is defined as the maximum allowed number of pages for a single domain. When starting a new crawl, all DLs are set to a low number like 2000.  ... 
doi:10.5281/zenodo.5884441 fatcat:witqdu6yjbgzjlxfh3uaui4bcu

On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl [article]

Sebastian Schelter, Jérôme Kunegis
2016 arXiv pre-print
We perform a large-scale analysis of third-party trackers on the World Wide Web from more than 3.5 billion web pages of the CommonCrawl 2012 corpus.  ...  To the best of our knowledge, this constitutes the largest web tracking dataset collected so far, and exceeds related studies by more than an order of magnitude in the number of domains and web pages analyzed  ...  to the crawling strategy employed in latter corpora.  ... 
arXiv:1607.07403v2 fatcat:dbshae6rmfhdnlugsvpapldenm