A comparative study of citations and links in document classification

Thierson Couto, Marco Cristo, Marcos André Gonçalves, Pável Calado, Nivio Ziviani, Edleno Moura, Berthier Ribeiro-Neto
2006 Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries - JCDL '06  
In this work we present a comparative study of digital library citations and Web links, in the context of automatic text classification.  ...  In our reference collections, measures based on co-citation tend to perform better for pages in the Web directory, with gains up to 37% over text based classifiers, while measures based on bibliographic  ...  All of these allowed us to reach better conclusions regarding the use of citation-link based similarity measures in document classification DATASETS In this section we describe the citation and link  ... 
doi:10.1145/1141753.1141766 dblp:conf/jcdl/CoutoCGCZMR06 fatcat:u4ex4pdlwbf2rij54wqmt3jl3a

Classifying documents with link-based bibliometric measures

T. Couto, N. Ziviani, P. Calado, M. Cristo, M. Gonçalves, E. S. de Moura, W. Brandão
2009 Information retrieval (Boston)  
Automatic document classification can be used to organize documents in a digital library, construct on-line directories, improve the precision of web searching, or help the interactions between user and  ...  In one of the test collections used, we obtained improvements of up to 69.8% of macro-averaged F1 over the traditional text-based kNN classifier, considered as the baseline measure in our experiments.  ...  Conclusions In this work we studied the usage of classifiers based on bibliometric similarity measures for classifying web collections.  ... 
doi:10.1007/s10791-009-9119-7 fatcat:ycujmpakhjdzvmx3x2s3m43qia

An Ontology-Based Webpage Classification Approach for the Knowledge Grid Environment

Hai Dong, Farookh Khadeer Hussain, Elizabeth Chang
2009 2009 Fifth International Conference on Semantics, Knowledge and Grid  
In order to solve the above issue, in this paper, we present a novel ontology-based webpage classification method for the Knowledge Grid environment, which utilizes generated metadata from webpages as  ...  With the rapid growth of the amount of information available in the Web, webpage classification technologies are widely employed by many search engines in order to formulate user queries and make users  ...  ACKNOWLEDGMENT We would like to express our gratitude for the assistance of all relevant DEBII staff, especially to our programmer Wei Liu who took responsibility for implementing the Webpage Classification  ... 
doi:10.1109/skg.2009.69 dblp:conf/skg/DongHC09 fatcat:tnpidumjdff4fhoby63bnrtjq4

Focused Web Crawling Using Decay Concept And Genetic Programming

Mahdi Bazarganigilani
2011 Zenodo  
The ongoing rapid growth of web information is a theme of research in many papers. In this paper, we introduce a new optimized method for web crawling.  ...  Using genetic programming enhances the accuracy of similarity measurement. This measurement applies to different parts of the web pages including the title and the body.  ...  Such content-based similarity measures have been applied to the content of Web.  ... 
doi:10.5281/zenodo.1248240 fatcat:22tnpwuzjvhevbyohimmagljiy

An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation

Houqing Lu, Donghui Zhan, Lei Zhou, Dengchao He
2016 Mathematical Problems in Engineering  
In addition, this paper introduces an evaluation approach of the link, link priority evaluation (LPE), which combines web page content block partition algorithm and the strategy of joint feature evaluation  ...  However, the performance of the current focused crawling can easily suffer the impact of the environments of web pages and multiple topic web pages.  ...  First, the current web page is partitioned into many content blocks based on CBP. Then, we compute the relevance of content blocks with the topic using the method of similarity measure.  ... 
doi:10.1155/2016/6406901 fatcat:qljrqgvhaneuxa4f4lfnxbjygi

A comparison of implicit and explicit links for web page classification

Dou Shen, Jian-Tao Sun, Qiang Yang, Zheng Chen
2006 Proceedings of the 15th international conference on World Wide Web - WWW '06  
We provide an approach for automatically building the implicit links between Web pages using Web query logs, together with a thorough comparison between the uses of implicit and explicit links in Web page  ...  the Macro-F1 measurement.  ...  Through the two different classification approaches, the link-based method and the content-based method, we compared the contribution of the implicit links and the explicit links for Web page classification  ... 
doi:10.1145/1135777.1135871 dblp:conf/www/ShenSYC06 fatcat:77r7dkfdejcpdbbfb6miaofupe

A machine learning approach to web page filtering using content and structure analysis

Michael Chau, Hsinchun Chen
2008 Decision Support Systems  
We represent each Web page by a set of content-based and link-based features, which can be used as the input for various machine learning algorithms.  ...  However, developers of topic-specific search engines need to address two issues: how to locate relevant documents (URLs) on the Web and how to filter out irrelevant documents from a set of documents collected  ...  Acknowledgements This project has been supported in part by the following grants: We would like to thank the National Library of Medicine for making UMLS freely available to researchers, and the medical  ... 
doi:10.1016/j.dss.2007.06.002 fatcat:h7eeg35b65htlj6fejwc2khvy4

Linked latent Dirichlet allocation in web spam filtering

István Bíró, Dávid Siklósi, Jácint Szabó, András A. Benczúr
2009 Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web - AIRWeb '09  
In this paper we apply an extension of LDA for web spam classification.  ...  The addition of this method to a log-odds based combination of strong link and content baseline classifiers results in a 3% improvement in AUC.  ...  Table 3: Classification accuracy measured in AUC by combining the classifications of Tables 1 and 2 with a log-odds based random forest.  ... 
doi:10.1145/1531914.1531922 dblp:conf/airweb/BiroSSB09 fatcat:exyu4ajvarad3nnuoqkhh4ueoy

Web page classification

Xiaoguang Qi, Brian D. Davison
2009 ACM Computing Surveys  
The uncontrolled nature of web content presents additional challenges to web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides  ...  As we review work in web page classification, we note the importance of these web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the  ...  Acknowledgments This material is based upon work supported by the National Science Foundation under Grant No. IIS-0328825.  ... 
doi:10.1145/1459352.1459357 fatcat:octyi2gsvngndjrtuodocrtffa

Building Web Annotation Stickies based on Bidirectional Links

Hiroyuki Sano, Taiki Ito, Tadachika Ozono, Toramatsu Shintani
2008 Semantic Web Applications and Perspectives  
We have implemented a positioning method based on the Document Object Model (DOM), as well as a new method for placing stickies which depends on web contents related to the information referenced by the  ...  The stickies allow for important parts of a web page which contains large amounts of data to be highlighted.  ...  The similarity between documents in classifying stickies is calculated by using a cosine measure based on the Vector Space Model.  ... 
dblp:conf/swap/SanoIOS08 fatcat:mbvqtqx2hjesdj7n63jqrazgba

Hypertext Classification Using Tensor Space Model and Rough Set Based Ensemble Classifier [chapter]

Suman Saha, C. A. Murthy, Sankar K. Pal
2009 Lecture Notes in Computer Science  
Tensor similarity measure is defined. We have demonstrated the use of rough set based ensemble classifier on proposed tensor space model.  ...  Instead of using the text on a page for representing features in a vector space model, we have used features on the page and neighborhood features to represent a hypertext document in a tensor space model  ...  K-NN classification has been performed using tensor similarity measure. In the second method ensemble classification has been performed.  ... 
doi:10.1007/978-3-642-11164-8_34 fatcat:ldakbasrtbfmfaravxsjzsredm

Web Page Structure Enhanced Feature Selection for Classification of Web Pages

B. LeelaDevi, A. Sankar
2013 International Journal of Computer Applications  
In this paper, a model for the exploitation of semantic-based feature selection is proposed to improve search and retrieval of web pages over large document repositories.  ...  Semantic search motivates Semantic Web from inception for classification and retrieval processes.  ...  Similarity measures calculate similarity values between keywords and document. Ranking is based on similarity values. The first step is keywords identification for a document set.  ... 
doi:10.5120/11818-7494 fatcat:ol7zrmg645fpjp3mv3uhu7tbie

Automated subject classification of textual web documents

Koraljka Golub
2006 Journal of Documentation  
Findings - Provides major similarities and differences between the three approaches: document pre-processing and utilization of web-specific document characteristics is common to all the approaches; major  ...  Design/methodology/approach - A range of works dealing with automated classification of full-text web documents are discussed.  ...  In the narrower focus of this paper is automated classification of textual web documents into subject categories for browsing.  ... 
doi:10.1108/00220410610666501 fatcat:cdlhrejd7jfmzlxn7q646xk3ya

Looking into the past to better classify web spam

Na Dai, Brian D. Davison, Xiaoguang Qi
2009 Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web - AIRWeb '09  
In this paper, we use content features from historical versions of web pages to improve spam classification.  ...  Experiments on the WEBSPAM-UK2007 dataset show that our approach improves spam classification F-measure performance by 30% compared to a baseline classifier which only considers current page content.  ...  We also thank Jian Wang and Liangjie Hong for helpful discussions.  ... 
doi:10.1145/1531914.1531916 dblp:conf/airweb/DaiDQ09 fatcat:erixqsx6k5eh7bom6tikyim3nq

Using neighborhood information for automated categorization of Web pages

Nadejda Panteleeva
2003 International United Information Systems Conference  
We present the approach to automated pruning of linking Web pages.  ...  In this paper we discuss several issues related to the influence of expansion of a Web document representation on quality of topical categorization of Web pages.  ...  These samples correspond to the Base representation of a Web document - one document stands for one .html file.  ... 
dblp:conf/ista/Panteleeva03 fatcat:gdggxy3dszeghccfvgjoy34wvy