Sampling, information extraction and summarisation of Hidden Web databases

Yih-Ling Hedley, Muhammad Younas, Anne James, Mark Sanderson
2006 Data & Knowledge Engineering  
Hidden Web databases maintain a collection of specialised documents, which are dynamically generated in response to users' queries. The majority of these documents are generated through Web page templates, which contain information that is often irrelevant to queries. In this paper, we present a system designed to detect and extract query-related information from documents sampled from databases. The proposed system, 2PS, is based on a two-phase framework for the sampling, extraction and summarisation of Hidden Web documents. In the first phase, 2PS queries databases with random terms selected from those contained in their search interface pages and the subsequently retrieved documents; this phase retrieves a pre-determined number of sampled documents. In the second phase, it detects Web page templates from the sampled documents in order to extract information relevant to the respective queries, from which a content summary is generated. 2PS is validated through the implementation of a prototype system. Its evaluation is performed through experiments on a number of real-world Hidden Web databases. The experimental results demonstrate that 2PS effectively eliminates irrelevant information contained in Web page templates and generates terms and frequencies with improved accuracy.
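The two-phase idea can be illustrated with a minimal Python sketch. It is not the 2PS implementation: the function names, the in-memory `TOY_DB`, and the `toy_search` interface are all hypothetical, and the template detector is a deliberately crude stand-in (a line counts as template text if it appears in every sampled document), whereas the abstract does not say how 2PS detects templates.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def sample_documents(search, seed_terms, target, rng, max_queries=1000):
    """Phase 1 (sketch): query with random terms until `target` docs are sampled.

    `search` is an assumed callable: term -> iterable of (doc_id, text).
    """
    pool = list(dict.fromkeys(seed_terms))  # candidate query terms, in order
    known = set(pool)
    sampled = {}
    for _ in range(max_queries):
        if len(sampled) >= target or not pool:
            break
        term = rng.choice(pool)
        for doc_id, text in search(term):
            if doc_id in sampled or len(sampled) >= target:
                continue
            sampled[doc_id] = text
            # Terms from retrieved documents become new candidate query terms.
            for t in tokenize(text):
                if t not in known:
                    known.add(t)
                    pool.append(t)
    return sampled


def detect_template_lines(docs):
    """Phase 2a (simplified assumption): lines shared by all docs are template."""
    line_sets = [set(d.splitlines()) for d in docs]
    return set.intersection(*line_sets) if line_sets else set()


def summarise(docs):
    """Phase 2b: count term frequencies over non-template lines only."""
    template = detect_template_lines(docs)
    counts = Counter()
    for d in docs:
        for line in d.splitlines():
            if line not in template:
                counts.update(tokenize(line))
    return counts


# Hypothetical toy database: every page shares a header and a footer line.
TOY_DB = {
    1: "Site Header\nheart disease treatment\nCopyright Footer",
    2: "Site Header\ndiabetes treatment options\nCopyright Footer",
    3: "Site Header\nheart surgery recovery\nCopyright Footer",
}


def toy_search(term):
    """Return (doc_id, text) pairs whose text contains `term`."""
    return [(i, t) for i, t in TOY_DB.items() if term in tokenize(t)]


sampled = sample_documents(toy_search, ["site"], target=3, rng=random.Random(0))
summary = summarise(list(sampled.values()))
```

In this toy run the shared header and footer lines are dropped before counting, so `summary` holds only content terms such as `treatment` and `heart`. The all-documents rule is fragile: with a single sampled document every line would be classified as template, and templates that vary slightly between pages would be missed.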
doi:10.1016/j.datak.2006.01.009 fatcat:ftdqsklu6zaglgpkjg2tyc75pq