Site-Level Web Template Extraction Based on DOM Analysis [chapter]

Julián Alarte, David Insa, Josep Silva, Salvador Tamarit
2016 Lecture Notes in Computer Science  
In this work we propose a novel method for automatic web template extraction that is based on similarity analysis between the DOM trees of a collection of webpages that are detected using a hyperlink  ...  It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks.  ...  This justifies the importance of template removal [21, 19] for web mining and search. Our approach to template extraction is based on the DOM [8] structures that represent webpages.  ... 
doi:10.1007/978-3-319-41579-6_4 fatcat:bsitcaesyjecvecvcbxacvzpki

The volume and evolution of web page templates

David Gibson, Kunal Punera, Andrew Tomkins
2005 Special interest tracks and posters of the 14th international conference on World Wide Web - WWW '05  
We study the nature, evolution, and prevalence of these templates on the web.  ...  Our results show that 40-50% of the content on the web is template content. Over the last eight years, the fraction of template content has doubled, and the growth shows no sign of abating.  ...  We consider two algorithms, one based on the DOM structure of the web page, and the other based on syntactic sequences of characters.  ... 
doi:10.1145/1062745.1062763 dblp:conf/www/GibsonPT05 fatcat:ftd5n634tvdfli4cp4otbppjhe

Web Data Extraction from Scientific Publishers' Website Using Heuristic Algorithm

Umamageswari Kumaresan, Kalpana Ramanujam
2017 International Journal of Intelligent Systems and Applications  
Data analytics and data mining applications depend on data from deep web pages, and automatic extraction of data from the deep web is cumbersome due to the diverse structure of web pages.  ...  The WWW is a huge repository of information, and the amount of information available on the web is growing exponentially day by day.  ...  The proposed approach is based on the observation that the journal home pages linked to publishers' web sites are well structured and are generated using the same server-side template.  ... 
doi:10.5815/ijisa.2017.10.04 fatcat:qsp4lttx2fekxnns5v4lcdhzgy

Detection on Large Amount of Web Pages Based on a Highly Active Way of Site-Level Template

Xiangang Zuo, Zhixia Zhang, Jianping Xie
2015 International Journal of Hybrid Information Technology  
Generally, the theories of web computing are all based on the DOM, including those based on node statistical features, on root-to-leaf chain matching, and on minimal editing  ...  Therefore, we design an algorithm to detect the two kinds of templates from samples of a particular site, and to extract the main content from the noise using the template.  ...  Conclusions: In this paper, we propose a method to detect and remove the site-level template, i.e. the site style template, from a large number of web pages.  ... 
doi:10.14257/ijhit.2015.8.4.28 fatcat:i4poalnoefgvtecmummy4jjj7y

VB-PTC: Visual Block Multi-Record Text Extraction Based on Sensor Network Page Type Conversion

Jibing Gong, Hekai Zhang, Weixia Du, Huanhuan Li, Hongnian Wen
2020 IEEE Access  
This method uses a combination of site-level noise reduction based on hashtree and page-level noise reduction based on linked clusters to eliminate noise in web articles, and it successfully converts multi-record  ...  In this paper, we propose a visual block construction method based on page type conversion (VB-PTC).  ...  Then, the web data record is extracted based on the visual law of the web page.  ... 
doi:10.1109/access.2020.3024194 fatcat:x7p3qcvys5fkfp6coicuueycwm

Reappearance Layout based Web Page Segmentation for Small Screen Devices

V. Kalaivani, K. Rajkumar
2012 International Journal of Computer Applications  
Keywords: DOM (Document Object Model), Layout based segmentation (LSE), Reappearance based segmentation (RSE), RLSE (Reappearance Layout based Segmentation).  ...  If the key pattern contains a reappearance tag, the page is segmented using reappearance based segmentation. Otherwise it is segmented based on web layout information.  ...  Suhit Gupta et al. [10] proposed a DOM-based content extraction technique that extracts content from web pages by exploring the DOM tree representation of the web page.  ... 
doi:10.5120/7884-0801 fatcat:23zua2mqjzgnlnio7gj2hnfbse

Novel Web Data Extraction Using Template Extraction and Filtering Non Information

2015 International Journal of Science and Research (IJSR)  
Template matching will be based upon depth and data similarity, and the non-information part of the web pages will be removed by filtering.  ...  In our proposed system data is extracted using template extraction.  ...  We take multiple URLs from different sites as input documents. After processing these URLs, the system generates a DOM tree for each one. It then shows the extracted data.  ... 
doi:10.21275/v4i12.nov152454 fatcat:t6vgthd245cnhj3wwzy4sb3qla

Extraction of Template using Clustering from Heterogeneous Web Documents

Rashmi D Thakare, Manisha R Patil
2015 International Journal of Computer Applications  
This has practical importance in applications like data analysis based on web-log, text and market-base.  ...  These template detection methods, which operate at site level, look promising but have limitations. First, only a small percentage of web templates consist of this site-level template [3].  ... 
doi:10.5120/21112-3906 fatcat:vksbqx55rjc5tohadufqt2uy34

Implementation of Web Scraping on News Sites Using the Supervised Learning Method

2021 Elementary Education Online  
Basic web scraping requires knowing the DOM patterns and the XPath structure used as a data model or selector at each site.  ...  interfere with reader's comfort. Given these problems, this study aims to implement web scraping techniques with supervised learning methods and to analyze the form of the DOM tree and XPath of news sites.  ...  display an error such as Figure  ...  Conclusion: Based on the results and previous discussion, the results of this study are in the form of DOM tree and XPath pattern analysis of the news sites studied  ... 
doi:10.17051/ilkonline.2021.03.43 fatcat:kp4f4mivqfg77leqndpjody2ay

Extracting informative textual parts from web pages containing user-generated content

Nikolaos Pappas, Georgios Katsimpras, Efstathios Stamatatos
2012 Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies - i-KNOW '12  
Based on a human annotated corpus consisting of diverse topics, domains and templates, we demonstrate the learning abilities of our algorithm, we examine its effectiveness in extracting the informative  ...  textual parts and its usage as a rule-based classifier for web page type detection in a realistic web setting.  ...  [16] proposed a template removal method for web pages that uses a small set sample of pages per site for the template detection. D. Cao et al. S. Nam et al.  ... 
doi:10.1145/2362456.2362462 dblp:conf/iknow/PappasKS12 fatcat:rl4rfjtfc5c3ddxp3mhnalntzi

Web document text and images extraction using DOM analysis and natural language processing

Parag Mulendra Joshi, Sam Liu
2009 Proceedings of the 9th ACM symposium on Document engineering - DocEng '09  
There are some recent publications on web article extraction based on DOM analysis [9], [4].  ...  For example, all paragraphs might not be on the same level of a DOM subtree; instead they might be in different subtrees at different levels of the DOM.  ... 
doi:10.1145/1600193.1600241 dblp:conf/doceng/JoshiL09 fatcat:x4cnms2otrfptolt6vheyz4sx4

Automatically extracting user reviews from forum sites

Wei Liu, Hualiang Yan, Jianguo Xiao
2011 Computers and Mathematics with Applications  
The review records are extracted from web pages based on the proposed level-weighted tree similarity algorithm first, and then the review contents in records are extracted exactly by measuring the node  ...  Our experimental results based on 20 forum sites indicate that WeRE can achieve high extraction accuracy.  ...  Wrapper-based extraction needs to generate three wrappers for each web site.  ... 
doi:10.1016/j.camwa.2011.07.044 fatcat:7ny2eihq2jhobaqndxutt6vhae

A Survey of Web Information Extraction Tools

Noha Negm, Passent ElKafrawy, Abdel Badea Salem
2012 International Journal of Computer Applications  
Based on this survey, we can decide which suitable Web IE tool will be integrated in our future work in Web Text Mining.  ...  This has resulted in the need for automated Web Information Extraction (IE) tools that analyze the Web pages and harvest useful information from noisy content for any further analysis.  ...  Fig. 1: A DOM tree example [29]. Fig. 4: F-measure-based comparison in the three websites. Table 2: Analysis of each Web IE tool based on input task and techniques used. Table 4: Block level  ... 
doi:10.5120/6115-8296 fatcat:2ijvncas7zbv5nwsonovfeodc4

A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling

Umamageswari Kumaresan, Kalpana Ramanujam
2022 International Journal of Information Retrieval Research  
The experiments conducted on real-world web sites prove the effectiveness and versatility of the proposed approach.  ...  Most of the automated extraction techniques in the literature capture repeated patterns among a set of similarly structured web pages, thereby deducing the template used for the generation of those web  ...  FiVaTech (Kayed & Chang, 2010) is a page-level extraction system based on tree merging and schema deduction.  ... 
doi:10.4018/ijirr.290830 fatcat:cc2z22fix5awnozjfbkzdvrpj4

A layout-similarity-based approach for detecting phishing pages

Angelo P. E. Rosiello, Engin Kirda, Christopher Kruegel, Fabrizio Ferrandi
2007 2007 Third International Conference on Security and Privacy in Communications Networks and the Workshops - SecureComm 2007  
In a phishing attack, the attacker persuades the victim to reveal confidential information by using web site spoofing techniques.  ...  In previous work, we have developed AntiPhish, a phishing protection system that prevents sensitive user information from being entered on phishing sites.  ...  Acknowledgements This work was supported by the Austrian Science Foundation (FWF) under grants P18368 (Omnis) and P18764 (Web-Defense), and by the Secure Business Austria competence center.  ... 
doi:10.1109/seccom.2007.4550367 dblp:conf/securecomm/RosielloKKF07 fatcat:j3vjojkvzbbe3o23575qz5c5vm