
Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts [chapter]

Thaer Samar, Myriam C. Traub, Jacco van Ossenbruggen, Arjen P. de Vries
2016 Lecture Notes in Computer Science  
We use link anchor text of two Web crawls created with different crawling strategies in order to compare their coverage of past popular topics.  ...  Using simple exact string matching between anchor texts and popular topics from the three different sources, we found that the breadth-first crawl covered more topics than the depth-first crawl.  ...  This research was funded by the Netherlands Organization for Scientific Research (NWO CATCH program, WebART project), and Dutch COMMIT/ program (SEALINCMedia project).  ... 
doi:10.1007/978-3-319-43997-6_11 fatcat:odndkvcgzbdpjcq7pzznn6yvda

A cross-language focused crawling algorithm based on multiple relevance prediction strategies

Zhumin Chen, Jun Ma, Jingsheng Lei, Bo Yuan, Li Lian, Ling Song
2009 Computers and Mathematics with Applications  
For cross-language crawling, we first introduce a hierarchical taxonomy to describe topics in both English and Chinese.  ...  We then present a formal description of the relevance predicting process and discuss four strategies that make use of page contents, anchor texts, URL addresses and link types of Web pages, respectively.  ...  To compare these algorithms fairly, we use a new function (9) to compute the relevance of crawled pages to the given topic, in which v_c and v_t are the vectors of the current page content and the topic.  ... 
doi:10.1016/j.camwa.2008.09.021 fatcat:ijsqf4olbfdjtknycza5rombca
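The snippet above refers to a relevance function over content and topic vectors but does not reproduce equation (9); a common choice for such a page-to-topic score is the cosine similarity between the two term-weight vectors. A minimal sketch, assuming v_c and v_t are represented as term-to-weight dicts (the function name and representation are illustrative, not taken from the paper):

```python
from math import sqrt

def cosine_relevance(v_c, v_t):
    """Cosine similarity between a page-content vector v_c and a topic
    vector v_t, both given as term -> weight dicts. Returns a score in
    [0, 1] for non-negative weights; 0.0 when either vector is empty."""
    dot = sum(w * v_t.get(term, 0.0) for term, w in v_c.items())
    norm_c = sqrt(sum(w * w for w in v_c.values()))
    norm_t = sqrt(sum(w * w for w in v_t.values()))
    if norm_c == 0.0 or norm_t == 0.0:
        return 0.0
    return dot / (norm_c * norm_t)
```

A focused crawler would compute this score for each fetched page and use it to decide whether the page's outlinks are worth queueing.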

Distributed Ontology-Driven Focused Crawling

R. Campos, O. Rojas, M. Marin, M. Mendoza
2013 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing  
Using information gathered at running time, focused crawlers explore the web by following promising hyperlinks and fetching only pages that appear to be relevant.  ...  In this article, we introduce an efficient focused crawling strategy that coordinates a number of distributed focused crawlers, each retrieving pages relevant to a given knowledge domain.  ...  In a third experiment we compared the performance of our scheduler against state-of-the-art crawling strategies such as breadth-first and depth-first.  ... 
doi:10.1109/pdp.2013.23 dblp:conf/pdp/CamposRMM13 fatcat:rp4isqkegbhmra54rbrt3wqa3q

Finding Thai Web Pages in Foreign Web Spaces

K. Somboonviwat, T. Tamura, M. Kitsuregawa
2006 22nd International Conference on Data Engineering Workshops (ICDEW'06)  
Then, the LSWC strategy is evaluated on a crawling simulator with a large dataset.  ...  This paper proposes language-specific web crawling (LSWC) as a method of creating Web archives for countries with linguistic identities, such as Thailand.  ...  Figures 4 and 5 respectively show the traces of the coverage and the harvest rate for each strategy. First, let us consider the coverage trace in Figure 4.  ... 
doi:10.1109/icdew.2006.60 dblp:conf/icde/SomboonviwatTK06 fatcat:5eeogtc5fze35ecv4p5vswwyse

What's there and what's not?

Ziming Zhuang, Rohit Wagle, C. Lee Giles
2005 Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries - JCDL '05  
Some large scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives, university and author homepages, and authors' self-submissions.  ...  We propose to use alternative online resources and techniques that maximally exploit other resources to build the complete document collection of any given publication venue.  ...  Mitra and the anonymous reviewers for their comments, I. Councill and P. Teregowda for their work on the CiteSeer metadata, and E. Maldonado and D. Hellar for the crawl list.  ... 
doi:10.1145/1065385.1065455 dblp:conf/jcdl/ZhuangWG05 fatcat:wsui5zxwmbe73mdlxnavv7utq4

Effective Concentrated Web Crawling Approach Path for Google

Ashwani Kumar, Anuj Kumar, Rahul Mishra
2017 International Journal of Advanced Research in Computer Science and Software Engineering  
The resemblance of web pages to the topic keywords is checked, and the priority of each extracted link is computed.  ...  A concentrated crawler traverses the World Wide Web, picking out pages applicable to a predefined topic and ignoring those outside its concern.  ...  If the crawler is indexing several hosts, this approach spreads the load quickly, so we implement parallel processing. 2) Depth-First Crawling [13]: Depth-first crawling follows all links  ... 
doi:10.23956/ijarcsse.v7i11.459 fatcat:77zpoffqtfcwln6dsee67ohv2i
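The breadth-first versus depth-first distinction drawn in this and several other entries comes down to how the crawl frontier is ordered: a FIFO queue visits pages level by level, while a LIFO stack follows each chain of links to its end before backtracking. A minimal sketch of both strategies over an in-memory link graph (the `crawl` and `get_links` names are illustrative, not from any of the listed papers):

```python
from collections import deque

def crawl(seeds, get_links, limit=100, strategy="breadth"):
    """Traverse a link graph starting from seed URLs.

    strategy="breadth" pops from the front of the frontier (FIFO),
    visiting pages level by level; strategy="depth" pops from the back
    (LIFO), following each link chain as far as it goes.
    get_links(url) must return the URLs linked from `url`.
    Returns the list of URLs in visit order, capped at `limit`."""
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        url = frontier.popleft() if strategy == "breadth" else frontier.pop()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

In a real crawler `get_links` would fetch and parse the page; a focused crawler would additionally score each link (for instance with a topic-relevance function) and use a priority queue instead of a plain FIFO or LIFO frontier.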

The Modified Concept based Focused Crawling using Ontology

S. Thenmalar, T. V. Geetha
2014 Journal of Web Engineering  
The pages are ranked by comparing concept vectors at each depth, across depths, and against the overall topic-indicating concept vector.  ...  However, in this work, we determine and rank the seed page set from the seed URLs. We rank and filter the page sets at the succeeding depths of the crawl.  ...  The crawling strategy involves the breadth-first strategy configured via online page importance scoring [23].  ... 
dblp:journals/jwe/ThenmalarG14 fatcat:ieryzmqjrfehlmqzuik7ks47vm

Finding pages on the unarchived Web

Hugo C. Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar, Arjen P. de Vries
2014 IEEE/ACM Joint Conference on Digital Libraries  
First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive.  ...  In this paper, we propose an approach to recover significant parts of the unarchived Web, by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment  ...  Acknowledgments Part of this paper is based on an initial report on uncovering and characterizing unarchived pages, published as [25] .  ... 
doi:10.1109/jcdl.2014.6970188 dblp:conf/jcdl/HuurdemanBKSV14 fatcat:rya7otftlvdqhlqp7rtb7kp3u4
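The reconstruction idea described in this entry, deriving a surrogate description of an uncrawled page from the links and anchors that point at it, can be sketched by aggregating anchor texts per link target. The data layout below is an assumption for illustration, not the paper's actual format:

```python
from collections import defaultdict

def build_anchor_representations(crawled_pages):
    """Aggregate anchor texts by link target.

    crawled_pages maps a source URL to a list of (target_url, anchor_text)
    pairs extracted from that page. The result maps each target URL to the
    anchor texts pointing at it, which can serve as a surrogate description
    of a page even when the page itself was never crawled or archived."""
    reps = defaultdict(list)
    for source, links in crawled_pages.items():
        for target, anchor in links:
            if anchor:  # skip empty anchors (e.g. image links without alt text)
                reps[target].append(anchor)
    return dict(reps)
```

As the companion journal paper below notes, such representations are highly skewed: popular targets accumulate many anchors, while most unarchived pages are described by only a handful of terms.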

A review of web crawling approaches

Elda Xhumari, Izaura Xhumari
2021 International Conference on Recent Trends and Applications in Computer Science and Information Technology  
This study presents web crawler methodology: the first steps of development, how a crawler works, the different types of web crawlers, and the benefits of using and comparing their operating methods, which are the  ...  Websites are getting richer and richer with information in different formats.  ...  Web crawling algorithms i. Breadth First Search It starts with a small set of pages and then explores other pages by following their links in breadth-first order.  ... 
dblp:conf/rtacsit/XhumariX21 fatcat:daq53pbvv5czjokacqk3k4i3we

SmartCrawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces

Feng Zhao, Jingyu Zhou, Chang Nie, Heqing Huang, Hai Jin
2016 IEEE Transactions on Services Computing  
To achieve more accurate results for a focused crawl, SmartCrawler ranks websites to prioritize highly relevant ones for a given topic.  ...  However, due to the large volume of web resources and the dynamic nature of deep web, achieving wide coverage and high efficiency is a challenging issue.  ...  Thus, in-site searching is performed in breadth-first fashion to achieve broader coverage of web directories.  ... 
doi:10.1109/tsc.2015.2414931 fatcat:2g3oqslxhzcw7kahsg5kvgatbi

Lost but not forgotten: finding pages on the unarchived web

Hugo C. Huurdeman, Jaap Kamps, Thaer Samar, Arjen P. de Vries, Anat Ben-David, Richard A. Rogers
2015 International Journal on Digital Libraries  
Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity.  ...  Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly  ... 
doi:10.1007/s00799-015-0153-3 fatcat:f5yhxhrdxjduznbnxamlcjvacm

Exploration versus Exploitation in Topic Driven Crawlers

Gautam Pant, Padmini Srinivasan, Filippo Menczer
2002 The Web Conference  
Using a framework and a number of quality metrics developed to evaluate topic driven crawling algorithms in a fair way, we find that a mix of exploitation and exploration is essential for both tasks, in  ...  spite of a penalty in the early stage of the crawl.  ...  Third, we concatenate the text descriptions and anchor text of the target URLs (written by DMOZ human editors) to form a topic description.  ... 
dblp:conf/www/PantSM02 fatcat:kixxir4eijhhbgkfrvardl3gce

SIMHAR - Smart distributed web crawler for the hidden web using SIM+Hash and Redis Server

Sawroop Kaur, G. Geetha
2020 IEEE Access  
In distributed crawling, crawling agents are given the task of fetching and downloading web pages. The number and the heterogeneity of web pages are increasing rapidly.  ...  Duplication detection is based on a hybrid technique using Redis hash-maps and Sim+Hash.  ...  They also supported breadth-first search for complete coverage. Performance is compared with up to 64 active threads crawling a two-page application and a medium-sized application. Xu et al.  ... 
doi:10.1109/access.2020.3004756 fatcat:iqbi3gq7pjc6vl2e3r457atfbq

Crawling the Web [chapter]

Gautam Pant, Padmini Srinivasan, Filippo Menczer
2004 Web Dynamics  
This is followed by a review of several topical crawling algorithms, and evaluation metrics that may be used to judge their performance.  ...  Crawlers facilitate this process by following hyperlinks in Web pages to automatically download new and updated Web pages.  ...  This work is funded in part by NSF CAREER Grant No. IIS-0133124 to FM.  ... 
doi:10.1007/978-3-662-10874-1_7 fatcat:wz2wsoi3d5h2vebem2rof2jv2e

Effective Focused Crawling Based on Content and Link Structure Analysis [article]

Anshika Pal, Deepak Singh Tomar, S.C. Shrivastava
2009 arXiv   pre-print
The proposed work also uses a method for traversing the irrelevant pages met during crawling to improve the coverage of a specific topic.  ...  In this paper, a technique of effective focused crawling is implemented to improve the quality of web navigation.  ...  We wish to express our gratitude to all the people who helped turn the World-Wide Web into the useful and popular distributed hypertext it is.  ... 
arXiv:0906.5034v1 fatcat:2eof54e5f5g2rk4xvenm3knd44
Showing results 1 — 15 out of 501 results