195 Hits in 6.8 sec

Do not crawl in the dust

Ziv Bar-Yossef, Idit Keidar, Uri Schonfeld
2007 Proceedings of the 16th international conference on World Wide Web - WWW '07  
We consider the problem of dust: Different URLs with Similar Text.  ...  Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests.  ...  We thank Tal Cohen and the forum site team, and Greg Pendler and the http://ee.technion. admins for providing us with access to web logs and for technical assistance.  ... 
doi:10.1145/1242572.1242588 dblp:conf/www/Bar-YossefKS07 fatcat:zoh5lyvcnzehhptpzvhhykhs4a
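The core abstraction in the Bar-Yossef et al. paper is a DUST rule: a substring substitution mapping one URL form to another, applied without fetching page content. As a minimal illustrative sketch of how a learned rule would be applied (the rule and URLs below are hypothetical, and this is not the paper's DustBuster mining algorithm):

```python
# A DUST rule is modeled here as a substring substitution alpha -> beta:
# two URLs that differ only by such a substitution often serve the same page.
def apply_dust_rule(url: str, alpha: str, beta: str) -> str:
    """Rewrite `url` by replacing the first occurrence of `alpha` with `beta`."""
    return url.replace(alpha, beta, 1)

# Hypothetical rule mined from web logs: "/story?id=" -> "/story_"
print(apply_dust_rule("http://example.com/story?id=42", "/story?id=", "/story_"))
# http://example.com/story_42
```

In the paper, rules like this are mined from web server logs and validated before use; the sketch only shows the application step.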

A Data Mining Approach to Topic-Specific Web Resource Discovery

Lei Xiang, Xin Meng
2009 2009 Second International Conference on Intelligent Computation Technology and Automation  
In fact, an estimated 29% of web pages are duplicates. Such URLs, commonly named DUST, represent an important problem for search engines.  ...  An alignment strategy can lead to a 54% larger reduction in the number of duplicate URLs.  ...  Rules are selected if they have large support, they do not come from large groups, and the URLs they match have similar sketches or compatible sizes in the training log.  ... 
doi:10.1109/icicta.2009.378 fatcat:gcodvqjjrvdznfrqs26xfypwsi

Design of a Migrating Crawler Based on a Novel URL Scheduling Mechanism using AHP

Deepika Punj, Ashutosh Dixit
2017 International Journal of Rough Sets and Data Analysis  
The proposed ordering technique is based on URL structure, which plays a crucial role in utilizing the web efficiently. Scheduling ensures that URLs go to the optimum agent for downloading.  ...  In this paper, an architecture for a migrating crawler is proposed which is based on URL ordering, URL scheduling, and a document redundancy elimination mechanism.  ...  This problem is designated as DUST (Schonfeld et al., 2007), i.e., different URLs with similar text. DUST affects the whole working of search engines, i.e., crawling, indexing, ranking, etc.  ... 
doi:10.4018/ijrsda.2017010106 fatcat:43k7w3lknjbvpjroufwhemti3a

Automatic Extraction of Top-K Lists from Web

Ashish N. Patil, Shital N. Kadam
2017 IARJSET  
Web pages are in structured, unstructured, and semi-structured formats. The approach also gives results in less time.  ...  Sometimes these links may contain audio, video, and Twitter and Facebook comments, which are not useful for users.  ...  Deshpande for valuable suggestions in carrying out our research work. We also take the opportunity to thank our friends for their support.  ... 
doi:10.17148/iarjset/nciarcse.2017.42 fatcat:vwpwv7vd5rdp7heewsfwxp6ebi

URL normalization for de-duplication of web pages

Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, Amit Sasturkar
2009 Proceeding of the 18th ACM conference on Information and knowledge management - CIKM '09  
Our technique is composed of mining the crawl logs and utilizing clusters of similar pages to extract specific rules from URLs belonging to each cluster.  ...  Presence of duplicate documents in the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search.  ...  [2] call the problem, "DUST: Different URLs with Similar Text" and propose a technique to uncover URLs pointing to similar pages.  ... 
doi:10.1145/1645953.1646283 dblp:conf/cikm/AgarwalKLCGGHRS09 fatcat:qqph5ytrqjhhnc6ennbmtctiwu
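The entry above mines normalization rules from crawl logs and clusters of similar pages. As a hand-written stand-in for what such normalization does (the session-parameter names and rules below are assumptions for illustration, not rules mined by the paper), URL canonicalization can be sketched as:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rule set: query parameters that usually do not affect content.
SESSION_PARAMS = {"sessionid", "sid", "utm_source"}

def normalize(url: str) -> str:
    """Canonicalize a URL: lowercase scheme/host, drop the fragment and
    session-style parameters, and sort the remaining query parameters so
    that equivalent URLs compare equal as strings."""
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k.lower() not in SESSION_PARAMS)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))

print(normalize("HTTP://Example.COM/a?b=2&a=1&sid=xyz"))
# http://example.com/a?a=1&b=2
```

De-duplication then reduces to comparing normalized strings; the papers' contribution is learning site-specific rules automatically rather than hard-coding a list like `SESSION_PARAMS`.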

Finding and Classifying Near-Duplicate Pages based on Identical Sentences Detection

Tomohide Shibata, Naun Kang, Sadao Kurohashi
2010 Transactions of the Japanese society for artificial intelligence  
First, in each page, the content region is extracted, since sentences in a non-content region tend not to be useful for similar-page detection.  ...  Next, similar pages are classified based on several kinds of information, such as the overlap ratio, the number of inlinks/outlinks, and URL similarity.  ...  [Bar-Yossef 07] Bar-Yossef, Z., Keidar, I., and Schonfeld, U.: Do Not Crawl in the DUST: Different URLs with Similar Text, in Proceedings of WWW2007, pp. 111-120 (2007)  ... 
doi:10.1527/tjsai.25.224 fatcat:cm57qzelbrbqhk7om2akbbymsm

Mapping the Blogosphere--Towards a Universal and Scalable Blog-Crawler

Philipp Berger, Patrick Hennig, Justus Bross, Christoph Meinel
2011 2011 IEEE Third Int'l Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third Int'l Conference on Social Computing  
Modeling and mining this vast pool of data to extract and describe meaningful knowledge, in order to leverage (content-related) structures and dynamics of emerging networks within the blogosphere, is the  ...  While the concept of our tailor-made feed-crawler was already discussed in two earlier publications, this paper focuses on our approach to extend the earlier feed-crawler to a more universal and highly scalable  ...  The idea behind this is to analyze URLs and predict, for example, other URLs where archives of posts can be found. A very important part is to crawl the posts not listed on the first web page.  ... 
doi:10.1109/passat/socialcom.2011.57 dblp:conf/socialcom/BergerHBM11 fatcat:uilhg2gisjfwdg7hiey7eycepa

iRobot: An Intelligent Crawler for Web Forums

Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, Lei Zhang
2008 Proceeding of the 17th international conference on World Wide Web - WWW '08  
crawling routings among different kinds of pages.  ...  However, Web forum crawling is not a trivial problem, due to the in-depth link structures, the large number of duplicate pages, and the many invalid pages caused by login-failure issues.  ...  There is also some recent work discussing URL-based duplicate detection, which tries to mine rules for different URLs with similar text (DUST) [6].  ... 
doi:10.1145/1367497.1367558 dblp:conf/www/CaiYLWZ08 fatcat:uwn5a624xnfkplxotp7kj5lrla

Learning URL patterns for webpage de-duplication

Hema Swetha Koppula, Krishna P. Leela, Amit Agarwal, Krishna Prasad Chitrapura, Sachin Garg, Amit Sasturkar
2010 Proceedings of the third ACM international conference on Web search and data mining - WSDM '10  
Our technique is composed of mining the crawl logs and utilizing clusters of similar pages to extract transformation rules, which are used to normalize URLs belonging to each cluster.  ...  The rule extraction techniques are robust against web-site specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication.  ...  [4] call the problem, "DUST: Different URLs with Similar Text" and propose a technique to uncover URLs pointing to similar pages.  ... 
doi:10.1145/1718487.1718535 dblp:conf/wsdm/KoppulaLACGS10 fatcat:2z4jhswpofc6rjxujshkjlnnuq

Web Crawling

Christopher Olston, Marc Najork
2010 Foundations and Trends in Information Retrieval  
This is a survey of the science and practice of web crawling.  ...  While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large  ...  Schonfeld et al. proposed the "duplicate URL with similar text" (DUST) algorithm [12] to detect this form of aliasing, and to infer rules for normalizing URLs into a canonical form. Dasgupta et al.  ... 
doi:10.1561/1500000017 fatcat:rjc3oe77c5bipoikqrkwmy3ed4

Application of webometrics methods for analysis and enhancement of academic site structure based on page value criterion

Anthony M. Nwohiri, University of Lagos, Andrey A. Pechnikov, Institute of Applied Mathematical Research of the Karelian Research Centre, Russian Academy of Sciences
2019 Vestnik of Saint Petersburg University Applied Mathematics Computer Science Control Processes  
One such important reason is the problem of DUST: a situation whereby different URLs have similar text.  ...  For other publicly accessible web crawlers, the authors do not know how the DUST problem is resolved, or whether it is resolved at all.  ... 
doi:10.21638/11702/spbu10.2019.304 fatcat:fl4d5lvlrbgcfiwrr3oxvs74ue

Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling

VA Narayana, P Premchand, A Govardhan
2012 International Journal of Electrical and Computer Engineering (IJECE)  
In this research, we have presented an efficient approach for the detection of near-duplicate web pages in web crawling, which uses keywords and a distance measure. Besides that, G.S. Manku et al.'  ...  Then we implemented both approaches and conducted an extensive comparative study between our similarity-score-based approach and G.S. Manku et al.'s fingerprint-based approach.  ...  [14] have mentioned the issue of DUST: Different URLs with Similar Text.  ... 
doi:10.11591/ijece.v2i6.1746 fatcat:tbvknlh2onhs7ckxgo2pnwl6zm

Removing Dust Using Sequence Alignment and Content Matching

Priyanka Khopkar, D Bhosale
International Research Journal of Engineering and Technology (IRJET)   unpublished
Some of the content related to a search query collected by web crawlers includes pages with duplicate information. Different URLs with similar contents are known as DUST.  ...  In the proposed system, a URL normalization process is used which identifies DUST by fetching the content of the URLs.  ...  1. INTRODUCTION: URLs which have similar content are called DUST (Duplicate URLs with Similar Text). Syntactically these URLs are different, but they have similar content.  ... 

Improved Data Partition in Web-URL Hadoop Cluster Using Dust Removing LDA-CRATFS Technique

V Manochitra, N Vijayalakshmi
International Journal of Electrical Electronics & Computer Science Engineering   unpublished
In our evaluation, this method achieved larger reductions in the number of duplicate URLs than our best baseline, with gains of 85 to 150.76 percent in two different web collections.  ...  Incorporating the similarity metric and the Locality-Sensitive Hashing technique, the proposed VUK (Valid Unique Key) DUST-removing technique LDA-CRATS uses mined data to run this approach  ...  Do Not Crawl in the DUST: Different URLs with Similar Text: We focus on URLs with similar contents rather than identical ones, since different versions of the same document are not always identical; they  ... 

Detection of Distinct URL and Removing DUST Using Multiple Alignments of Sequences

Sandhya Shinde, Ms Rutuja Bidkar, Nisha Deore, Nikita Salunke, Neelay Shivsharan
International Research Journal of Engineering and Technology   unpublished
By evaluating this method, we observed that it achieved larger reductions in duplicate URLs than our best baseline, with gains of 82% and 140.74% in two different web collections.  ...  Before rule generation takes place, we demonstrate that a full multiple sequence alignment of URLs with duplicated content can lead to the deployment of very effective rules.  ...  Hence these duplicate URLs are commonly known as DUST (Duplicate URLs with Similar Text).  ... 
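The alignment idea in this entry can be shown with a toy sketch: align the token sequences of URLs known to be duplicates and generalize the positions where they disagree into a wildcard, yielding a candidate rewrite pattern. (A real multiple sequence alignment handles insertions and deletions; this sketch assumes the duplicate URLs tokenize to equal-length lists, which is a deliberate simplification, and the URLs are made up.)

```python
def align_tokens(urls):
    """Column-wise alignment of slash-delimited URL tokens: positions where
    all URLs agree are kept literally, disagreeing positions become '*'."""
    token_lists = [u.split("/") for u in urls]
    # Toy restriction: only handle URLs with the same number of tokens.
    assert len({len(t) for t in token_lists}) == 1, "equal token count required"
    pattern = []
    for column in zip(*token_lists):
        pattern.append(column[0] if len(set(column)) == 1 else "*")
    return "/".join(pattern)

# Duplicates differing only in a session token collapse to one pattern:
print(align_tokens(["http://ex.com/a/s1/x", "http://ex.com/a/s2/x"]))
# http://ex.com/a/*/x
```

A crawler would then treat all URLs matching `http://ex.com/a/*/x` as one document, which is the kind of rule the alignment-based papers above derive at scale.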
Showing results 1 — 15 out of 195 results