Do Not Crawl in the DUST: Different URLs with Similar Text
2007
Proceedings of the 16th international conference on World Wide Web - WWW '07
We consider the problem of dust: Different URLs with Similar Text. ...
Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. ...
We thank Tal Cohen and the forum site team, and Greg Pendler and the http://ee.technion.ac.il admins for providing us with access to web logs and for technical assistance. ...
doi:10.1145/1242572.1242588
dblp:conf/www/Bar-YossefKS07
fatcat:zoh5lyvcnzehhptpzvhhykhs4a
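The alias-and-redirection duplication described in this abstract can be pictured with a minimal sketch of applying DUST-style substring-substitution rules; the rules and URLs below are hypothetical examples, not the ones mined in the paper:

```python
# Hypothetical DUST-style rules: each maps a substring to its replacement.
DUST_RULES = [
    ("/index.html", "/"),        # default-document alias
    ("http://www.", "http://"),  # host alias
]

def canonicalize(url, rules=DUST_RULES):
    """Apply each substitution rule once, left to right."""
    for src, dst in rules:
        if src in url:
            url = url.replace(src, dst, 1)
    return url

# Two aliases of the same page collapse to one canonical form.
a = canonicalize("http://www.example.com/news/index.html")
b = canonicalize("http://example.com/news/")
assert a == b == "http://example.com/news/"
```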
A Data Mining Approach to Topic-Specific Web Resource Discovery
2009
2009 Second International Conference on Intelligent Computation Technology and Automation
In fact, an estimated 29% of web pages are duplicates. Such URLs, commonly named DUST, represent an important problem for search engines. ...
The alignment strategy can lead to a 54% larger reduction in the number of duplicate URLs. ...
Rules are selected if they have large support, they do not come from large groups and URLs matched by them have similar sketches or compatible sizes in the training log. ...
doi:10.1109/icicta.2009.378
fatcat:gcodvqjjrvdznfrqs26xfypwsi
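The selection criteria quoted above (large support, and matched URLs with similar sketches or compatible sizes) can be sketched as a filter; the thresholds and the set-based sketch representation here are illustrative assumptions:

```python
def sketch_similarity(sketch_a, sketch_b):
    """Jaccard overlap of two shingle sketches (illustrative)."""
    if not sketch_a or not sketch_b:
        return 0.0
    return len(sketch_a & sketch_b) / len(sketch_a | sketch_b)

def keep_rule(matched_pairs, min_support=3, min_sim=0.8, size_tol=0.1):
    """Keep a candidate rule if enough URL pairs match it and every pair
    has similar sketches or compatible page sizes (assumed thresholds).
    matched_pairs: [(sketch_a, size_a, sketch_b, size_b), ...]"""
    if len(matched_pairs) < min_support:
        return False
    for sa, na, sb, nb in matched_pairs:
        similar = sketch_similarity(sa, sb) >= min_sim
        compatible = abs(na - nb) <= size_tol * max(na, nb)
        if not (similar or compatible):
            return False
    return True

assert keep_rule([({1, 2, 3}, 1000, {1, 2, 3}, 1005)] * 3)
assert not keep_rule([({1}, 100, {2}, 500)] * 3)
```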
Design of a Migrating Crawler Based on a Novel URL Scheduling Mechanism using AHP
2017
International Journal of Rough Sets and Data Analysis
The proposed ordering technique is based on URL structure, which plays a crucial role in utilizing the web efficiently. Scheduling ensures that URLs go to the optimum agent for downloading. ...
In this paper, an architecture for a migrating crawler is proposed, based on URL ordering, URL scheduling, and a document redundancy elimination mechanism. ...
This problem is designated as DUST (Schonfeld et al., 2007), i.e., different URLs with similar text. DUST affects the whole working of search engines, i.e., crawling, indexing, ranking, etc. ...
doi:10.4018/ijrsda.2017010106
fatcat:43k7w3lknjbvpjroufwhemti3a
Automatic Extraction of Top-K Lists from Web
2017
IARJSET
Web pages come in structured, unstructured, and semi-structured formats. The approach also gives results in less time. ...
Sometimes these links may contain audios, videos, and Twitter and Facebook comments, which are not useful for users. ...
Deshpande for valuable suggestions in carrying out our research work. We also take the opportunity to thank our friends for their support. ...
doi:10.17148/iarjset/nciarcse.2017.42
fatcat:vwpwv7vd5rdp7heewsfwxp6ebi
URL normalization for de-duplication of web pages
2009
Proceeding of the 18th ACM conference on Information and knowledge management - CIKM '09
Our technique is composed of mining the crawl logs and utilizing clusters of similar pages to extract specific rules from URLs belonging to each cluster. ...
Presence of duplicate documents in the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. ...
[2] call the problem, "DUST: Different URLs with Similar Text" and propose a technique to uncover URLs pointing to similar pages. ...
doi:10.1145/1645953.1646283
dblp:conf/cikm/AgarwalKLCGGHRS09
fatcat:qqph5ytrqjhhnc6ennbmtctiwu
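The cluster-mining idea in this abstract can be illustrated with a toy rule miner that looks for query parameters whose removal collapses a cluster of duplicate URLs to a single form; this is a simplification for illustration, not the paper's algorithm:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def drop_param(url, param):
    """Rebuild the URL without the given query parameter."""
    parts = urlsplit(url)
    qs = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunsplit(parts._replace(query=urlencode(qs)))

def mine_irrelevant_params(cluster):
    """Return parameters whose removal collapses the cluster to one URL."""
    params = {k for u in cluster for k, _ in parse_qsl(urlsplit(u).query)}
    return {p for p in params
            if len({drop_param(u, p) for u in cluster}) == 1}

# A cluster of URLs known (from the crawl log) to show the same page:
cluster = [
    "http://example.com/item?id=7&sessionid=abc",
    "http://example.com/item?id=7&sessionid=xyz",
]
assert mine_irrelevant_params(cluster) == {"sessionid"}
```

The mined parameter then becomes a normalization rule ("drop sessionid") applicable to every URL matching the cluster's pattern.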
Finding and Classifying Near-Duplicate Pages based on Identical Sentences Detection
2010
Transactions of the Japanese society for artificial intelligence
First, in each page, its content region is extracted since sentences in a non-content region do not tend to be utilized for the similar page detection. ...
Next, similar pages are classified based on several kinds of information, such as the overlap ratio, the number of inlinks/outlinks, and the URL similarity. ...
[BarYossef 07] Bar-Yossef, Z., Keidar, I., and Schonfeld, U.: Do Not Crawl in the DUST: Different URLs with Similar Text, in Proceedings of WWW2007, pp. 111-120 (2007) ...
doi:10.1527/tjsai.25.224
fatcat:cm57qzelbrbqhk7om2akbbymsm
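The overlap-ratio classification mentioned above might be sketched as follows; the naive sentence splitter and the exact ratio definition are assumptions for illustration, not the paper's procedure:

```python
def split_sentences(text):
    """Naive sentence splitter (assumption: '.' delimits sentences)."""
    return {s.strip() for s in text.split(".") if s.strip()}

def overlap_ratio(page_a, page_b):
    """Share of identical sentences relative to the shorter page."""
    sa, sb = split_sentences(page_a), split_sentences(page_b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))

a = "The cat sat. It rained. Totally new line."
b = "The cat sat. It rained. Another footer."
assert abs(overlap_ratio(a, b) - 2 / 3) < 1e-9  # two of three sentences shared
```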
Mapping the Blogosphere--Towards a Universal and Scalable Blog-Crawler
2011
2011 IEEE Third Int'l Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third Int'l Conference on Social Computing
Modeling and mining this vast pool of data to extract and describe meaningful knowledge in order to leverage (content-related) structures and dynamics of emerging networks within the blogosphere is the ...
While the concept of our tailor-made feed-crawler was already discussed in two earlier publications, this paper focuses on our approach to extend the earlier feed-crawler to a more universal and highly scalable ...
The idea behind this is to analyze URLs and predict for example other URLs where archives of posts can be found. A very important part is to crawl the posts not listed on the first web page. ...
doi:10.1109/passat/socialcom.2011.57
dblp:conf/socialcom/BergerHBM11
fatcat:uilhg2gisjfwdg7hiey7eycepa
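The archive-prediction idea in the snippet above can be sketched naively: assuming a conventional year/month archive layout (an assumption for illustration, not the paper's actual predictor), one known blog lets the crawler enumerate the monthly archive pages where older posts live:

```python
# Assumed blog layout: one archive page per month under /YYYY/MM/.
def predict_archives(year, base="http://blog.example.com/{y}/{m:02d}/"):
    """Enumerate the twelve predicted monthly archive URLs for a year."""
    return [base.format(y=year, m=m) for m in range(1, 13)]

urls = predict_archives(2011)
assert urls[0] == "http://blog.example.com/2011/01/"
assert len(urls) == 12
```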
crawling routings among different kinds of pages. ...
However, Web forum crawling is not a trivial problem due to the in-depth link structures, the large amount of duplicate pages, as well as many invalid pages caused by login failure issues. ...
There is also some recent work discussing URL-based duplicate detection, which tries to mine rules for different URLs with similar text (DUST) [6]. ...
doi:10.1145/1367497.1367558
dblp:conf/www/CaiYLWZ08
fatcat:uwn5a624xnfkplxotp7kj5lrla
Learning URL patterns for webpage de-duplication
2010
Proceedings of the third ACM international conference on Web search and data mining - WSDM '10
Our technique is composed of mining the crawl logs and utilizing clusters of similar pages to extract transformation rules, which are used to normalize URLs belonging to each cluster. ...
The rule extraction techniques are robust against web-site specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication. ...
[4] call the problem, "DUST: Different URLs with Similar Text" and propose a technique to uncover URLs pointing to similar pages. ...
doi:10.1145/1718487.1718535
dblp:conf/wsdm/KoppulaLACGS10
fatcat:2z4jhswpofc6rjxujshkjlnnuq
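One simple way to picture a learned URL pattern is to generalize the token positions that vary within a cluster into wildcards; this toy version assumes equal-length token lists and is not the paper's rule-extraction technique:

```python
def url_tokens(url):
    return url.rstrip("/").split("/")

def learn_pattern(cluster):
    """All URLs must tokenize to the same length; slots whose tokens
    vary across the cluster become a '*' wildcard."""
    rows = [url_tokens(u) for u in cluster]
    assert len({len(r) for r in rows}) == 1
    return "/".join(col[0] if len(set(col)) == 1 else "*"
                    for col in zip(*rows))

cluster = [
    "http://example.com/2010/print/a1.html",
    "http://example.com/2010/print/b2.html",
]
assert learn_pattern(cluster) == "http://example.com/2010/print/*"
```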
Web Crawling
2010
Foundations and Trends in Information Retrieval
This is a survey of the science and practice of web crawling. ...
While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large ...
Schonfeld et al. proposed the "duplicate URL with similar text" (DUST) algorithm [12] to detect this form of aliasing, and to infer rules for normalizing URLs into a canonical form. Dasgupta et al. ...
doi:10.1561/1500000017
fatcat:rjc3oe77c5bipoikqrkwmy3ed4
Application of webometrics methods for analysis and enhancement of academic site structure based on page value criterion
2019
Vestnik of Saint Petersburg University Applied Mathematics Computer Science Control Processes
One such important reason is the problem of DUST: a situation whereby different URLs have similar text. ...
For other publicly accessible web crawlers, the authors do not know how the DUST problem is resolved, or whether it is resolved at all. ...
doi:10.21638/11702/spbu10.2019.304
fatcat:fl4d5lvlrbgcfiwrr3oxvs74ue
Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling
2012
International Journal of Electrical and Computer Engineering (IJECE)
In this research, we have presented an efficient approach for the detection of near duplicate web pages in web crawling which uses keywords and the distance measure. Besides that, G.S. Manku et al.' ...
Then we have implemented both the approaches and conducted an extensive comparative study between our similarity score based approach and G.S. Manku et al.'s fingerprint based approach. ...
[14] have mentioned the issue of dust: Different URLs with Similar Text. ...
doi:10.11591/ijece.v2i6.1746
fatcat:tbvknlh2onhs7ckxgo2pnwl6zm
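The fingerprint-based side of this comparison can be illustrated with a compact simhash-style function in the spirit of Manku et al.'s approach; the 64-bit size, MD5 word hashing, and word-level features are illustrative choices, not theirs:

```python
import hashlib

def simhash(text, bits=64):
    """Sum per-bit votes over word hashes; the sign of each vote
    becomes one bit of the fingerprint."""
    v = [0] * bits
    for word in text.lower().split():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

d1 = "web crawling finds many duplicate pages on the web"
d2 = "web crawling finds many duplicate pages on the web today"
# Near-identical texts tend to land within a small Hamming distance,
# while unrelated texts differ in roughly half the bits.
assert hamming(simhash(d1), simhash(d1)) == 0
```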
Removing Dust Using Sequence Alignment and Content Matching
International Research Journal of Engineering and Technology (IRJET)
unpublished
Some of the contents related to search query collected by the web crawlers include pages with duplicate information. Different URLs with Similar contents are known as DUST. ...
In the proposed system, the URL normalization process is used which identifies DUST with fetching the content of the URLs. ...
1.INTRODUCTION The URLs which are having similar content are called as DUST (Duplicate URLs with Similar Text). Syntactically these URLs are different but having similar content. ...
fatcat:x26f4sem6zdwnlz6a6ft3uykoa
Improved Data Partition in Web-URL Hadoop Cluster Using Dust Removing LDA-CRATFS Technique
International Journal of Electrical Electronics & Computer Science Engineering
unpublished
In our evaluation, we observed that this method achieved larger reductions in the number of duplicate URLs than our best baseline, with gains of 85 to 150.76 percent in two different web collections. ...
Incorporating the similarity metric and the Locality-Sensitive Hashing technique, the proposed model uses the VUK (Valid Unique Key) DUST-removing technique on LDA-CRATS-mined data to run this approach ...
Do Not Crawl in the DUST: Different URLs with Similar Text: We focus on URLs with similar contents rather than identical ones, since different versions of the same document are not always identical; they ...
fatcat:ots2vw2jw5df7nhnv4bpz63pee
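The Locality-Sensitive Hashing mentioned above can be sketched with MinHash signatures bucketed in bands; the band/row parameters and MD5-based hash family are illustrative assumptions, and high-Jaccard pairs share a bucket only with high probability:

```python
import hashlib

def _h(seed, token):
    """One member of a seeded hash family (illustrative: truncated MD5)."""
    data = f"{seed}:{token}".encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def minhash(tokens, num_perm=16):
    """One minimum per seeded hash function over the token set."""
    return [min(_h(seed, t) for t in tokens) for seed in range(num_perm)]

def lsh_buckets(docs, bands=4, rows=4):
    """Hash each signature band; docs sharing any bucket are candidates."""
    buckets = {}
    for doc_id, tokens in docs.items():
        sig = minhash(tokens, num_perm=bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    return buckets

docs = {
    "a": {"different", "urls", "with", "similar", "text"},
    "b": {"different", "urls", "with", "similar", "text", "dust"},
    "c": {"totally", "unrelated", "page", "about", "cats"},
}
# High-Jaccard pairs like (a, b) are likely, not guaranteed, to collide.
candidates = {frozenset(s) for s in lsh_buckets(docs).values() if len(s) > 1}
```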
Detection of Distinct URL and Removing DUST Using Multiple Alignments of Sequences
International Research Journal of Engineering and Technology
unpublished
By evaluating this method, we observed that it achieved larger reductions in the number of duplicate URLs than our best baseline, with gains of 82% and 140.74% in two different web collections. ...
Before the generation of rules takes place, we demonstrate that a full multiple-sequence alignment of URLs with duplicated content can lead to the deployment of very effective rules. ...
Hence, these duplicate URLs are commonly known as DUST (Duplicate URLs with Similar Text). ...
fatcat:hkfmimz625hlba3ujw7bljlo74
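The alignment idea can be illustrated with a pairwise (rather than full multiple-sequence) alignment of URL token lists, reading candidate substitution rules off the non-matching regions; the crude tokenizer and rule shape are simplifications, not the paper's method:

```python
from difflib import SequenceMatcher

def tokenize(url):
    """Crude tokenizer: treat path and query separators alike."""
    return url.replace("?", "/").replace("&", "/").split("/")

def candidate_rules(url_a, url_b):
    """Align two duplicate URLs' tokens; each non-matching region
    yields a candidate (from-tokens, to-tokens) substitution rule."""
    a, b = tokenize(url_a), tokenize(url_b)
    ops = SequenceMatcher(a=a, b=b).get_opcodes()
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]

rules = candidate_rules("http://example.com/story.php?id=42",
                        "http://example.com/story/42")
assert rules == [(["story.php", "id=42"], ["story", "42"])]
```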