3,380 Hits in 6.4 sec

Exploiting PageRank at Different Block Level [chapter]

Xue-Mei Jiang, Gui-Rong Xue, Wen-Guan Song, Hua-Jun Zeng, Zheng Chen, Wei-Ying Ma
2004 Lecture Notes in Computer Science  
Furthermore, based on different block level, inter-hyperlink and intra-hyperlink can be two relative concepts. Thus which level should be optimal to distinguish the intra- or inter-hyperlink?  ...  In recent years, information retrieval methods focusing on the link analysis have been developed; the PageRank and HITS are two typical ones. According to the hierarchical organization of Web pages, we  ...  Modified PageRank: After obtaining the block-based Web structure, we apply a link analysis algorithm similar to PageRank to re-rank the web-pages. We construct a matrix to describe the graph.  ... 
doi:10.1007/978-3-540-30480-7_26 fatcat:5gu4jf4diza7peopyuh3ks3f7u

Similarity based Dynamic Web Data Extraction and Integration System from Search Engine Result Pages for Web Content Mining [article]

Srikantaiah K C, Suraj M, Venugopal K R, L M Patnaik
2013 arXiv pre-print
Web Content Mining is one of the techniques that help users to extract useful information from these SERPs.  ...  In this paper, we propose two similarity based mechanisms; WDES, to extract desired SERPs and store them in the local depository for offline browsing and WDICS, to integrate the requested contents and  ...  Extraction algorithm is used to crawl the relevant pages and stores in local repository. Integration algorithm is used to integrate the similar data in various records based on cosine similarity.  ... 
arXiv:1303.5867v1 fatcat:xjja67gp35cc7jgyffooe6kz7y

Extracting Related Words from Anchor Text Clusters by Focusing on the Page Designer's Intention [chapter]

Jianquan Liu, Hanxiong Chen, Kazutaka Furuse, Nobuo Ohbo
2009 Lecture Notes in Computer Science  
Our approach is based on the idea that the web page designers usually make the correlative hyperlinks appear in close zone on the browser.  ...  We developed a browser-based crawler to collect "geographically" near hyperlinks, then by clustering these hyperlinks based on their pixel coordinates, we extract related words which can well reflect the  ...  [10] tells us that the link blocks can help users to identify relevant zones on a multi-topic page.  ... 
doi:10.1007/978-3-642-03573-9_39 fatcat:rbc3wi2jbnhathqhch4t7m54zu

Clustering and searching WWW images using link and page layout analysis

Xiaofei He, Deng Cai, Ji-Rong Wen, Wei-Ying Ma, Hong-Jiang Zhang
2007 ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)  
By using a vision-based page segmentation algorithm, a Web page is partitioned into blocks, and the textual and link information of an image can be accurately extracted from the block containing that image  ...  By extracting the page-to-block, block-to-image, block-to-page relationships through link structure and page layout analysis, we construct an image graph.  ...  The image graph was constructed based on the traditional perspective that the hyperlinks are considered from page to page.  ... 
doi:10.1145/1230812.1230816 fatcat:t4e3mzpgsndulgumqn66pfoxum

The Data Records Extraction from Web Pages

Nwe Nwe Hlaing, Thi Thi Soe Nyunt, Myat Thet Nyo
2019 Zenodo  
The explosive growth and popularity of the world wide web has resulted in a huge number of information sources on the Internet.  ...  There are four levels of information extraction from the World Wide Web such as free text level, record level, page level and site level.  ...  Filter the noisy blocks based on heuristic rules. 6. Cluster the remaining blocks based on their appearance similarity. 7. Labeling the data attributes for extracted data record.  ... 
doi:10.5281/zenodo.3591282 fatcat:4crasiwehjcetdqsja56u6qeb4

Optimized Focused Web Crawler with Natural Language Processing Based Relevance Measure in Bioinformatics Web Sources

S. R. Mani Sekhar, G. M. Siddesh, Sunilkumar S. Manvi, K. G. Srinivasa
2019 Cybernetics and Information Technologies  
A solution for predicting the page relevance, which is based on Natural Language Processing, is proposed in the paper.  ...  The frequency of the keywords on the top ranked sentences of the page determines the relevance of the pages within genomics sources.  ...  Focused Web Crawler Focused crawler is an automated mechanism to efficiently find web pages relevant to a topic on the web.  ... 
doi:10.2478/cait-2019-0021 fatcat:r3mmk4rxofbyhlmv7qjzvirks4

Web document text and images extraction using DOM analysis and natural language processing

Parag Mulendra Joshi, Sam Liu
2009 Proceedings of the 9th ACM symposium on Document engineering - DocEng '09  
Finally, our semantic similarity algorithm based on NLP tries to associate relevant images with main text content based on captions around images.  ...  From that it derives a cosine similarity based on the two normalized frequency distributions from An example web page is shown in Figure 4.  ... 
doi:10.1145/1600193.1600241 dblp:conf/doceng/JoshiL09 fatcat:x4cnms2otrfptolt6vheyz4sx4

Challenges and Issues in Adapting Web Contents on Small Screen Devices [article]

Krishna Murthy A., Suresha, Anil Kumar K. M
2014 arXiv pre-print
These proposed methods involve segment the Web page based on its semantic structure, followed by noise removal based on block features and to utilize the hierarchy of the content element to regenerate  ...  There are many approaches have been proposed in literature to regenerate HTML Web pages suitable for browsing on SSDs.  ...  The method is based on the basic idea of Case Based Reasoning (CBR) to find noise pattern in current Web page by matching similar noise pattern kept in Case-Based.  ... 
arXiv:1408.4067v1 fatcat:nx2zhhggn5gf3nrnsab5fguiby

Web2Text: Deep Structured Boilerplate Removal [article]

Thijs Vogels, Octavian-Eugen Ganea, Carsten Eickhoff
2018 arXiv pre-print
To address this issue, we introduce a novel model that performs sequence labeling to collectively classify all text blocks in an HTML page as either boilerplate or main content.  ...  Web pages are a valuable source of information for many natural language processing and information retrieval tasks.  ...  The leaves of the Collapsed DOM tree of a Web page form an ordered sequence of blocks to be labeled. For each block, we extract a number of DOM tree-based features.  ... 
arXiv:1801.02607v3 fatcat:wha5oi5hubcurnpddtggzegziy

Dynamic Web content filtering based on user's knowledge

N. Churcharoenkrung, Y.S. Kim, B.H. Kang
2005 International Conference on Information Technology: Coding and Computing (ITCC'05) - Volume II  
This paper focuses on the development of a maintainable information filtering system. The simple and efficient solution to this problem is to block the Web sites by URL, including IP address.  ...  However, it is not efficient for unknown Web sites and it is difficult to obtain complete block list.  ...  However, not all information on the Web is useful or relevant to users.  ... 
doi:10.1109/itcc.2005.137 dblp:conf/itcc/ChurcharoenkrungKK05 fatcat:lwfbi4nxqrfhxlxadiqkssfdfy

Cleaning Web Pages for Effective Web Content Mining [chapter]

Jing Li, C. I. Ezeife
2006 Lecture Notes in Computer Science  
Key-word based search engines can return a ranked list of web pages including all relevant documents, as well as many non-relevant or uninterested contents.  ...  The basic idea of web page cleaning is first to segment web pages into a set of blocks, then, calculate the block importance based on its frequency of appearance in these web pages.  ... 
doi:10.1007/11827405_55 fatcat:hsfexgg63vdppkckymcup4qste

Stylistic and lexical co-training for web block classification

Chee How Lee, Min-Yen Kan, Sandra Lai
2004 Proceedings of the 6th annual ACM international workshop on Web information and data management - WIDM '04  
In addition to table-based layout, the system handles real-world pages which feature layout based on divisions and spans as well as stylistic inference for pages using cascaded style sheets.  ...  As such, web page division into blocks and the subsequent block classification have become a preprocessing step.  ...  To apply co-training to web block classification, we use two separate views based on the stylistic and lexical properties of blocks, as shown in Figure 1.  ... 
doi:10.1145/1031453.1031478 dblp:conf/widm/LeeKL04 fatcat:vhxkxpl2yfc5liwdtjfub2hfem

Multi-model similarity propagation and its application for web image retrieval

Xin-Jing Wang, Wei-Ying Ma, Gui-Rong Xue, Xing Li
2004 Proceedings of the 12th annual ACM international conference on Multimedia - MULTIMEDIA '04  
Our experiments based on 10,628 images crawled from the Web show that our proposed approach can significantly improve Web image retrieval performance.  ...  The basic idea is that if two objects of the same type are both related to one object of another type, these two objects are similar; likewise, if two objects of the same type are related to two different  ...  ACKNOWLEDGEMENTS Special thanks should be given to Deng Cai, Xuemei Jiang and Shen Huang for their sincerely helps.  ... 
doi:10.1145/1027527.1027746 dblp:conf/mm/WangMXL04 fatcat:2dqpqm3ogjbjjkjszwp2euw62q

Combining Browsing Behaviors and Page Contents for Finding User Interests [chapter]

Fang Li, Yihong Li, Yanchen Wu, Kai Zhou, Feng Li, Xingguang Wang, Benjamin Liu
2008 Autonomous Systems – Self-Organization, Management, and Control  
The calculation for the interested degree is based on Gaussian process regression model which captures the relationship between a user's browsing behaviors and his interest to a web page.  ...  An advanced client browser plug-in is implemented to track the user browsing behaviors and collect the information about the web pages that he has viewed.  ...  Finally we want to express our sincere thanks to Prof. Bo Yuan for English language correction. This research is supported by the Intel China Lt.  ... 
doi:10.1007/978-1-4020-8889-6_16 dblp:conf/jtb/LiLWZLWL08 fatcat:zcf6zfyrera5dowot6p7ggkzqy

Learning Web Page Block Functions using Roles of Images

Xin Yang, Yuanchun Shi
2008 2008 Third International Conference on Pervasive Computing and Applications  
We regard image as a strong indicator of Web page blocks with various functions and propose to learn block functions using roles of images as part of block features.  ...  We experiment on 140 Web pages and demonstrate that utilizing roles of images can significantly improve the classification quality of learning Web page block functions.  ...  [7] classified Web content into "clutter" and "useful and relevant" content, and proposed a DOM-based method to extract the latter from Web pages using heuristics. Yi et al.  ... 
doi:10.1109/icpca.2008.4783565 fatcat:k3fvi3mlsjgztcu4jrbz5aho3q