Efficient, automatic web resource harvesting

Michael L. Nelson, Joan A. Smith, Ignacio Garcia del Campo
<span title="">2006</span> <i title="ACM Press"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/qs7va5zauzbk5g5bngeon7yfoy" style="color: black;">Proceedings of the eighth ACM international workshop on Web information and data management - WIDM &#39;06</a> </i> &nbsp;
There are two problems associated with conventional web crawling techniques: a crawler cannot know whether all resources at a non-trivial web site have been discovered and crawled ("the counting problem"), and the human-readable format of the resources is not always suitable for machine processing ("the representation problem"). We introduce an approach that solves these two problems by implementing support for both the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and MPEG-21 Digital Item Declaration Language (DIDL) in the web server itself. We present the Apache module "mod_oai", which can be used to address the counting problem by listing all valid URIs at a web server and efficiently discovering updates and additions on subsequent crawls. Our experiments indicated comparable performance for initial crawls and dramatic increases in update speed. mod_oai can also be used to address the representation problem by providing "preservation ready" versions of web resources aggregated with their respective forensic metadata in MPEG-21 DIDL format.
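As a rough illustration of how an OAI-PMH harvester might approach the counting problem, the sketch below builds a ListIdentifiers request (optionally restricted by a `from` datestamp for incremental updates) and extracts identifiers from a response. The verb and its `metadataPrefix`, `from`, and `resumptionToken` arguments come from the OAI-PMH 2.0 specification; the base URL `http://www.example.org/modoai`, the helper names, and the sample response are assumptions made for illustration and are not taken from the paper.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by OAI-PMH 2.0.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc",
                         from_date=None, resumption_token=None):
    """Build a ListIdentifiers request URL (hypothetical helper)."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: only the verb accompanies it.
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Restrict the list to records changed on or after this UTC datestamp.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_identifiers(xml_text):
    """Return ([(identifier, datestamp), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = [
        (h.findtext("oai:identifier", namespaces=OAI_NS),
         h.findtext("oai:datestamp", namespaces=OAI_NS))
        for h in root.findall(".//oai:header", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    # A missing or empty token means the list is complete.
    return records, (token or None)


# Illustrative response only; identifiers and datestamps are made up.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.example.org/index.html</identifier>
      <datestamp>2006-05-01T12:00:00Z</datestamp>
    </header>
    <header>
      <identifier>http://www.example.org/papers/a.pdf</identifier>
      <datestamp>2006-05-03T08:30:00Z</datestamp>
    </header>
    <resumptionToken/>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_identifiers_url("http://www.example.org/modoai",
                               from_date="2006-05-01"))
    print(parse_identifiers(SAMPLE))
```

A harvester following this pattern would issue one request without `from` for the initial crawl, follow resumption tokens until none is returned, and then repeat with `from` set to the time of its previous harvest so that only changed resources are listed.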
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1145/1183550.1183560">doi:10.1145/1183550.1183560</a> <a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/widm/NelsonSC06.html">dblp:conf/widm/NelsonSC06</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/k2b5z36gsncitp4cwvtpt6vrma">fatcat:k2b5z36gsncitp4cwvtpt6vrma</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20070328134728/http://public.lanl.gov:80/herbertv/papers/f140-nelson.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">Web Archive [PDF]</a>