
Crawling the web for structured documents

Julián Urbano, Juan Loréns, Yorgos Andreadakis, Mónica Marrero
2010 Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM '10  
Although XML is the immediate choice for structured documents, the Web is filled with several other types of structured information, which account for millions of other documents.  ...  documents off the Web.  ...  ACKNOWLEDGEMENTS We acknowledge the Spanish National Plan of Scientific Research, Development and Technological Innovation, which has funded this work through the research project TIN2007-67153.  ... 
doi:10.1145/1871437.1871773 dblp:conf/cikm/UrbanoLAM10 fatcat:3gqugbw6pventfkfeo5di5fhbu

Hadoop-based Crawling and Detection of New HTML5 Vulnerabilities on Public Institutions' Web Sites

In-A Kim, Kyu-Hyun Cho, Hyung-Jun Yim, Hwan-Kuk Kim, Kyu-Chul Lee
2015 Indian Journal of Science and Technology  
By applying distributed parallel processing to the crawling and detecting processes, we were able to improve their performance for a large number of web documents  ...  HTML5 is a recent version of HTML, the markup language for web documents. It was developed to solve the problems of previous HTML versions.  ...  Nutch has a structure for handling a large number of web documents based on Hadoop.  ... 
doi:10.17485/ijst/2015/v8i27/87068 fatcat:g2zkslmkjnfcjb7kq4jd3itghy

Not so creepy crawler

Franziska von dem Bussche, Klara Weiand, Benedikt Linse, Tim Furche, François Bry
2010 Proceedings of the 19th international conference on World wide web - WWW '10  
In these cases, pages are far more uniformly structured than in the general Web and thus crawlers can use the structure of Web pages for more precise data extraction and more expressive analysis.  ...  Customizing crawlers just means writing (declarative) XML queries that can access the currently crawled document as well as the metadata of the crawl process.  ...  In this demonstration, we introduce the "Not so Creepy Crawler" (nc²), a novel approach to structure-based crawling that combines crawling with standard Web query technology for data extraction and aggregation  ... 
doi:10.1145/1772690.1772908 dblp:conf/www/BusscheWLFB10 fatcat:q3vh4ks27rd63nruuynnwutrpe
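
The declarative, query-based extraction described in this snippet can be illustrated with a minimal sketch using Python's standard library. The `extract` function and the query syntax here are illustrative assumptions, not nc²'s actual API:

```python
import xml.etree.ElementTree as ET

def extract(document_xml, query):
    # Run a simplified XPath-style query against a crawled document;
    # a real structure-based crawler would also expose crawl metadata
    # (fetch time, depth, parent URL) to the same query layer.
    root = ET.fromstring(document_xml)
    return [el.text for el in root.findall(query)]

page = "<page><item>a</item><item>b</item></page>"
print(extract(page, ".//item"))  # -> ['a', 'b']
```

In this spirit, customizing the crawler means changing only the query string, not the crawling code.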

Web-crawling reliability

Viv Cothey
2004 Journal of the American Society for Information Science and Technology  
The investigation includes a critical examination of the practice of Web crawling and contrasts the results of content crawling with the results of link crawling.  ...  In this article, I investigate the reliability, in the social science sense, of collecting informetric data about the World Wide Web by Web crawling.  ...  It is part of the Web indicators for scientific, technological, and innovation research (WISER) project, (Contract HPV2- CT-2002-00015).  ... 
doi:10.1002/asi.20078 fatcat:k23re4dlsng3jbewquom2sv5ny

ARCOMEM Crawling Architecture

Vassilis Plachouras, Florent Carpentier, Muhammad Faheem, Julien Masanès, Thomas Risse, Pierre Senellart, Patrick Siehndel, Yannis Stavrakas
2014 Future Internet  
We introduce the overall architecture and we describe its modules, such as the online analysis module, which computes a priority for the Web pages to be crawled, and the Application-Aware Helper, which takes into account the type of Web sites and applications to extract structure from crawled content. We also describe a large-scale distributed crawler that has been developed, as well as the  ...  Conflicts of Interest: Thomas Risse is co-editor of the special issue on Archiving Community Memories.  ... 
doi:10.3390/fi6030518 fatcat:ng3b7vqbv5dadowfowj2edkfae
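
The idea of an online analysis module assigning crawl priorities to pages can be sketched as a priority-queue frontier. This is a hypothetical illustration of the general pattern, not ARCOMEM's actual module:

```python
import heapq
import itertools

class PriorityFrontier:
    # Highest-score-first crawl frontier. heapq is a min-heap,
    # so scores are negated on insertion.
    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # stable order for equal scores

    def push(self, url, score):
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

frontier = PriorityFrontier()
frontier.push("http://example.org/a", 0.2)
frontier.push("http://example.org/b", 0.9)
print(frontier.pop())  # -> http://example.org/b
```

The analysis module would call `push` with its computed relevance score; the fetcher simply calls `pop` for the next page.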

Chinese Automatic Documents Classification System

Ji-Rui Li, Kai Yang
2010 2010 3rd International Conference on Computer Science and Information Technology  
Chinese Web automatic document classification is one of the core technologies in Chinese information retrieval.  ...  Web spider technology is the key to automatic classification of Chinese Web documents; this work addresses Web information exploration, a cutting-edge research area, combined with the overall requirements  ...  [2] Breadth-first means the web spider crawls all the Web documents that the start document links to, then selects one of the linked documents and continues to crawl all the documents that it links to  ... 
doi:10.1109/iccsit.2010.5565018 fatcat:d36ppxbzznhvlnnpnsfadvjtfi
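
The breadth-first strategy described in this snippet can be sketched in a few lines. This is a minimal illustration, assuming a `fetch_links` callback that returns the outgoing links of a page:

```python
from collections import deque

def bfs_crawl(start_url, fetch_links, max_pages=100):
    # Breadth-first crawl: visit every document the start page links to
    # before descending a level deeper into the link graph.
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Toy link graph standing in for real HTTP fetches
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs_crawl("a", lambda u: graph.get(u, [])))  # -> ['a', 'b', 'c', 'd']
```

The `seen` set prevents re-crawling a document reachable through multiple links, which is essential on real Web graphs.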

Ontology-focused crawling of Web documents

Marc Ehrig, Alexander Maedche
2003 Proceedings of the 2003 ACM symposium on Applied computing - SAC '03  
This paper proposes an approach for document discovery, building on a comprehensive framework for ontology-focused crawling of Web documents.  ...  The Web, the world's largest unstructured database, has greatly improved access to documents. However, documents on the Web are largely disorganized.  ...  ACKNOWLEDGEMENTS The research presented in this paper would not have been possible without our colleagues, at the Institute AIFB, University of Karlsruhe, and FZI, Karlsruhe. We thank L.  ... 
doi:10.1145/952532.952761 dblp:conf/sac/EhrigM03 fatcat:ywth4flttbhorjtquylech2bzi

An Ontological Crawling Approach for Improving Information Aggregation over eGovernment Websites

Heru Agus Santoso, Junta Zeniarja, Ardytha Luthfiarta, Bima Jati Wijaya
2016 Journal of Computer Science  
Data in the form of HTML Web document text, metadata, hyperlinks, and other rich content are effectively crawled.  ...  For example, the use of information integration for Web portal content is still very limited.  ...  Acknowledgement We thank the Government of Indonesia for the funding and the referees for useful suggestions. Bima Jati Wijaya: Drafting the article.  ... 
doi:10.3844/jcssp.2016.455.463 fatcat:3mex44vg3baj3inf2bd4q5s6pm

Replicating Web Structure in Small-Scale Test Collections

Cathal Gurrin, Alan F. Smeaton
2004 Information retrieval (Boston)  
linkage-based retrieval by examining the linkage structure of the WWW.  ...  Based on these requirements we report on methodologies for synthesising such a test collection.  ...  Thanks goes to Mark Sanderson & Hideo Joho at the University of Sheffield for aiding our access to the collection.  ... 
doi:10.1023/b:inrt.0000011206.23588.ab fatcat:g2ymwh63gzcllfbm3dzpexagmq

Maintaining the search engine freshness using mobile agent

Marwa Badawi, Ammar Mohamed, Ahmed Hussein, Mervat Gheith
2013 Egyptian Informatics Journal  
In this paper, we suggest a document-index-based change detection technique and distributed indexing using mobile agents.  ...  We are therefore interested in detecting the significant changes in web pages that are reflected in the search engine's index, while minimizing the network load.  ...  A copy of this document index is also saved at the web server for use in upcoming crawling cycles.  ... 
doi:10.1016/j.eij.2012.11.001 fatcat:fbesa2uwufhotpswwujirt3v2e
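
Change detection against a stored document index, as this snippet describes, can be sketched with content digests. The index layout (`url -> digest`) and function names are illustrative assumptions, not the paper's actual technique:

```python
import hashlib

def digest(text):
    # Content fingerprint for a fetched document.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_changes(index, fetched):
    # Compare freshly crawled documents against stored digests and
    # return the URLs whose content changed; the index is updated
    # in place for use in the next crawling cycle.
    changed = []
    for url, text in fetched.items():
        d = digest(text)
        if index.get(url) != d:
            changed.append(url)
            index[url] = d
    return changed
```

Only the changed URLs then need to be re-indexed, which is what keeps the search engine fresh without re-processing every page.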

DSphere: A Source-Centric Approach to Crawling, Indexing and Searching the World Wide Web

Bhuvan Bamba, Ling Liu, James Caverlee, Vaibhav Padliya, Mudhakar Srivatsa, Tushar Bansal, Mahesh Palekar, Joseph Patrao, Suiyang Li, Aameek Singh
2007 2007 IEEE 23rd International Conference on Data Engineering  
We describe DSPHERE, a decentralized system for crawling, indexing, searching and ranking of documents in the World Wide Web.  ...  Unlike most of the existing search technologies that depend heavily on a page-centric view of the Web, we advocate a source-centric view of the Web and propose a decentralized architecture for crawling  ...  For example, a peer may be assigned the responsibility of crawling all or a subset of the documents in the www.cc.gatech.edu domain.  ... 
doi:10.1109/icde.2007.369060 dblp:conf/icde/BambaLCPSBPPLS07 fatcat:bjl4o23mqjfifkwgeu7u5ddx3a

Do TREC web collections look like the web?

Ian Soboroff
2002 SIGIR Forum  
This is not an idle question; characteristics of the web, such as power law relationships, diameter, and connected components have all been observed within the scope of general web crawls, constructed  ...  The .GOV collection is a fairly straightforward 18GB crawl of sites in the .gov domain.  ...  Acknowledgments We are grateful to the attendees of SIGIR 2002, and in particular David Hawking and Andrei Broder, for their insightful and illuminating comments.  ... 
doi:10.1145/792550.792554 fatcat:yvltwab6gnaxxnggaafkecudiu

Crawler with Search Engine based Simple Web Application System for Forum Mining

M. Maheswari, N. Tharminie
2014 IOSR Journal of Computer Engineering  
Web mining is an important means of managing data from the Web, and is categorized into structure, content, and usage mining.  ...  The designed crawler performs two functions: URL crawling (structure mining) by page classification and content crawling (content mining) by pattern clustering.  ...  These links add depth to the document, providing the multi-dimensionality that characterizes the web. Mining this link structure is the second area of web mining.  ... 
doi:10.9790/0661-16287982 fatcat:mc47cagwmneivpkaszcsn4gwou

Finding Thai Web Pages in Foreign Web Spaces

K. Somboonviwat, T. Tamura, M. Kitsuregawa
2006 22nd International Conference on Data Engineering Workshops (ICDEW'06)  
The LSWC strategy for selectively gathering Thai web pages from virtually anywhere on the Web is derived from static analyses of the Thai Web graph.  ...  This paper proposes language-specific web crawling (LSWC) as a method of creating Web archives for countries with linguistic identities, such as Thailand.  ...  The LSWC crawler for Thai web pages, which incorporates the Thai language classifier and the knowledge about the Thai Web graph structure, is evaluated using a web crawling simulator (proposed in [8]  ... 
doi:10.1109/icdew.2006.60 dblp:conf/icde/SomboonviwatTK06 fatcat:5eeogtc5fze35ecv4p5vswwyse
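
Language-specific crawling of the kind this entry describes can be sketched as a selective crawler that only follows links out of pages a classifier labels as the target language. The classifier and fetch callbacks here are illustrative stand-ins, not the paper's actual components:

```python
def selective_crawl(frontier, fetch, classify, is_target, limit=50):
    # Keep and expand only pages the (pluggable) classifier labels as
    # the target language; links from off-target pages are not followed.
    seen = set(frontier)
    queue = list(frontier)
    kept = []
    while queue and len(kept) < limit:
        url = queue.pop(0)
        text, links = fetch(url)
        if is_target(classify(text)):
            kept.append(url)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return kept

# Toy pages standing in for real fetches: url -> (text, outgoing links)
pages = {"a": ("thai text", ["b", "c"]), "b": ("english text", ["d"]), "c": ("thai text", [])}
result = selective_crawl(["a"], lambda u: pages.get(u, ("", [])),
                         lambda t: "th" if "thai" in t else "en",
                         lambda lab: lab == "th")
print(result)  # -> ['a', 'c']
```

Note that page `d` is never fetched because its only parent was classified off-target; tuning this pruning against recall is the central trade-off in language-specific crawling.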
Showing results 1 — 15 out of 19,861 results