5,436 Hits in 5.8 sec

URL normalization for de-duplication of web pages

Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, Amit Sasturkar
2009. Proceeding of the 18th ACM conference on Information and knowledge management - CIKM '09. ACM Press.
In this paper, we present a set of techniques to mine rules from URLs and utilize these learnt rules for de-duplication using just URL strings without fetching the content explicitly.  ...  Preserving each mined rule for de-duplication is not efficient due to the large number of specific rules.  ...  As duplicate URLs have specific patterns which can be utilized for de-duplication of web-pages, in this paper we will focus on the problem of de-duplication using just URLs without fetching the content  ... 
doi:10.1145/1645953.1646283 dblp:conf/cikm/AgarwalKLCGGHRS09 fatcat:qqph5ytrqjhhnc6ennbmtctiwu

A Data Mining Approach to Topic-Specific Web Resource Discovery

Lei Xiang, Xin Meng
2009. 2009 Second International Conference on Intelligent Computation Technology and Automation. IEEE.
A well-known problem faced by web crawlers is the existence of a large fraction of distinct URLs that correspond to pages with duplicate or near-duplicate content.  ...  In fact, an estimated 29% of web pages are duplicates. Such URLs, commonly named dust, represent an important problem for search engines.  ...  Generally, duplication of content is due to the generation of dynamic web pages that are invoked by the web crawler. On the web there is large-scale de-duplication of documents.  ... 
doi:10.1109/icicta.2009.378 fatcat:gcodvqjjrvdznfrqs26xfypwsi

Removing Duplicate URLs based on URL Normalization and Query Parameter

Kavita Goel, Jay Shankar Prasad, Saba Hilal
2018. International Journal of Engineering & Technology. Science Publishing Corporation.
This paper proposes a web crawler that crawls within a particular category to remove irrelevant URLs and implements URL normalization to remove duplicate URLs within that category.  ...  Results are analyzed on the basis of total URLs fetched, duplicate URLs, and query execution time.  ...  Lay-Ki Soon [6] suggests a method of identifying equivalent URLs by using metadata (page size and body text) of web pages along with basic URL normalization.  ... 
doi:10.14419/ijet.v7i3.12.16107 fatcat:wrcwrsnm3fclbkot6kt3wwtuya

Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results

Kavita Garg, Jayshankar Prasad, Saba Hilal
2017. International Journal of Computer Applications. Foundation of Computer Science.
General Terms: Duplicate URLs Identification.  ...  The study of identification of near duplicate content involves identifying search categories which generate the same URL in a query result.  ...  In [3], Rekha V R and Resmy V R use a machine learning technique to identify the patterns of the URLs. URL patterns are used to develop a framework for de-duplicating the web pages.  ... 
doi:10.5120/ijca2017913526 fatcat:suuukil5irbqjp4nkir4armz4e

A pattern tree-based approach to learning URL normalization rules

Tao Lei, Rui Cai, Jiang-Ming Yang, Yan Ke, Xiaodong Fan, Lei Zhang
2010. Proceedings of the 19th international conference on World wide web - WWW '10. ACM Press.
URL normalization is to transform duplicate URLs to a canonical form using a set of rewrite rules.  ...  To deal with a large scale of websites, automatic approaches are highly desired to learn rewrite rules for various kinds of duplicate URLs.  ...  on the content of the retrieved web page.  ... 
doi:10.1145/1772690.1772753 dblp:conf/www/LeiCYKFZ10 fatcat:2ofezyd2abaxvhmlo6plflnhme

Learning URL patterns for webpage de-duplication

Hema Swetha Koppula, Krishna P. Leela, Amit Agarwal, Krishna Prasad Chitrapura, Sachin Garg, Amit Sasturkar
2010. Proceedings of the third ACM international conference on Web search and data mining - WSDM '10. ACM Press.
The rule extraction techniques are robust against web-site specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication.  ...  Our technique is composed of mining the crawl logs and utilizing clusters of similar pages to extract transformation rules, which are used to normalize URLs belonging to each cluster.  ...  As duplicate URLs have specific patterns which can be utilized for de-duplication, in this paper we focus on the problem of de-duplication of web-pages using just URLs without fetching the content.  ... 
doi:10.1145/1718487.1718535 dblp:conf/wsdm/KoppulaLACGS10 fatcat:2z4jhswpofc6rjxujshkjlnnuq

Web Forum Crawling Techniques

Namrata H.S.Bamrah, B. S Satpute, Pramod Patil
2014. International Journal of Computer Applications. Foundation of Computer Science.
The paper also gives the overview of web crawling and web forums.  ...  The main goal of this paper is to focus on the web forum crawling techniques. In this paper, the various techniques of web forum crawler and challenges of crawling are discussed.  ...  FoCUS adopts a simple URL string de-duplication technique. The main advantage of FoCUS is that it can avoid duplicates without duplicate detection.  ... 
doi:10.5120/14936-3506 fatcat:np2esgz4rbcujeu5kvqtr6gyr4

Priority Queue Based Estimation of Importance of Web Pages for Web Crawlers

Mohammed Rashad Baker, M. Ali Akcayol
2017. International Journal of Computer and Electrical Engineering. International Academy Publishing (IAP).
Thus, there is a need for an efficient web crawler that deals with most web pages. Most web crawlers do not have the ability to visit and parse pages using URLs.  ...  Hundreds of new web pages are added daily to web directories. Web crawlers are developing at the same time as the number of web pages grows rapidly.  ...  Fig. 10. Size of crawled web pages. Fig. 11. Size of crawled web pages. Fig. 12. Duplicated URL counts for every 30 minutes of the crawling process. Fig.  ... 
doi:10.17706/ijcee.2017.9.1.330-342 fatcat:3znrbkrcend5fnn4m6tlpikwdm

Evolutionary Study of Web Spam: Webb Spam Corpus 2011 versus Webb Spam Corpus 2006

De Wang, Danesh Irani, Calton Pu
2012. Proceedings of the 8th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing. IEEE.
Although large corpora of legitimate web pages are available to researchers, the same cannot be said about web spam or spam web pages.  ...  The corpus contains web pages crawled from links found in over 6.3 million spam emails. We analyze multiple aspects of this corpus including redirection, HTTP headers and web page content.  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding  ... 
doi:10.4108/icst.collaboratecom.2012.250689 dblp:conf/colcom/WangIP12 fatcat:i2a3e32a3jcx5dfs322laybcue

Linking wikipedia to the web

Rianne Kaptein, Pavel Serdyukov, Jaap Kamps
2010. Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval - SIGIR '10. ACM Press.
We investigate the task of finding links from Wikipedia pages to external web pages.  ...  Such external links significantly extend the information in Wikipedia with information from the Web at large, while retaining the encyclopedic organization of Wikipedia.  ...  This research was supported by the Netherlands Organization for Scientific Research (NWO, under project # 612.066.513). We thank Arjen de Vries for useful discussions on related issues.  ... 
doi:10.1145/1835449.1835642 dblp:conf/sigir/KapteinSK10 fatcat:bu5u7ukmarefvj7iwujej7uxcq

Improvised Architecture for Distributed Web Crawling

Tilak Patidar, Aditya Ambasth
2016. International Journal of Computer Applications. Foundation of Computer Science.
A web crawler interacts with millions of hosts, fetches millions of pages per second and updates these pages into a database, creating a need for maintaining I/O performance, network resources within OS  ...  Web crawlers are programs designed to fetch web pages for information retrieval systems.  ...  The de-duplication test is carried out in the Bloom filter.  ... 
doi:10.5120/ijca2016911857 fatcat:j3m4zwhtovcf7bsjdpncyirvum

Temporal Evolution of the UK Web

Ilaria Bordino, Paolo Boldi, Debora Donato, Massimo Santini, Sebastiano Vigna
2008. 2008 IEEE International Conference on Data Mining Workshops. IEEE.
Recently, a new temporal dataset has been made public: it is made of a series of twelve 100M pages snapshots of the .uk domain [2] .  ...  on appearance and disappearance of pages and links, or on the crawler behaviour).  ...  Acknowledgements We are really thankful to Ricardo Baeza-Yates, Carlos Castillo, Aristides Gionis and Stefano Leonardi for several helpful discussions.  ... 
doi:10.1109/icdmw.2008.88 dblp:conf/icdm/BordinoBDSV08 fatcat:t3d7gkaxerbe7gz5tvr3trgtga

Crawling the Hidden Web: An Approach to Dynamic Web Indexing

Moumie Soulemane, Mohammad Rafiuzzaman, Hasan Mahmud
2012. International Journal of Computer Applications. Foundation of Computer Science.
General Terms: Web content mining, hidden web indexing, elimination of duplicate URLs, Hadoop MapReduce for index updating.  ...  With the ever-growing quantity of such hidden web pages, this issue continues to raise diverse opinions among researchers and practitioners in the web mining communities.  ...  The Jaccard index of the text content of the retrieved web pages is calculated to detect any eventual redundancy, then the duplicates are deleted.  ... 
doi:10.5120/8717-7290 fatcat:2kgprpkhbrgvxaoaco7fbbmifu

iRobot

Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, Lei Zhang
2008. Proceeding of the 17th international conference on World Wide Web - WWW '08. ACM Press.
However, Web forum crawling is not a trivial problem due to the in-depth link structures, the large amount of duplicate pages, as well as many invalid pages caused by login failure issues.  ...  pages from a forum site; and 3) Long threads that are divided into multiple pages can be re-concatenated and archived as a whole thread, which is of great help for further indexing and data mining.  ...  However, content-based de-dup can only be carried out offline after the Web pages have been downloaded.  ... 
doi:10.1145/1367497.1367558 dblp:conf/www/CaiYLWZ08 fatcat:uwn5a624xnfkplxotp7kj5lrla

DeDuSERP: De-duplication in search engine result page

Naresh Sharma, Priti Dimri
2018. International Journal of Engineering & Technology. Science Publishing Corporation.
The purpose of this research is to identify a subtype of de-duplication. DeDuSERP is de-duplication in search engine result page.  ...  It restricts the showcasing of URLs with duplicate or similar data and hence enhances the search result experience of any client.  ...  It contains separate functions for web crawling, i.e. crawl(), to check the duplicates, i.e. DeDuSERP, and to index pages, i.e. indexer(), to get the URLs of the files already present on the cloud.  ... 
doi:10.14419/ijet.v7i2.8.10475 fatcat:l24ays4tp5hnngdzir7qrrog54
Showing results 1-15 out of 5,436 results