Hardware Implementation of Web Based Arabic Optical Character Recognition Units
Journal of Emerging Technologies in Web Intelligence
Web page classification has many applications and plays a vital role in web mining and the semantic web. Web pages contain much irrelevant information that does not reflect their categories or topics and that acts as noise during classification, especially when a text classifier is used. Information from related web pages can therefore help to overcome the problem of noisy content and yield better classification results. Web pages are linked either directly by
... links or indirectly by users' intuitive judgment. In this work, we suggest a post-classification corrective method that uses the query-log to build an implicit neighborhood and collectively propagates classes over the web pages of that neighborhood. This collective propagation helps improve text classifier results by correcting wrongly assigned categories. Our technique operates in four steps. In the first step, it builds a weighted graph, called the initial graph, whose vertices are web pages and whose edges are implicit links. In the second step, it uses a text classifier to determine the classes of all web pages represented by vertices in the initial graph. In the third step, it constructs clusters of web pages using Formal Concept Analysis and then applies a first adjustment of classes called Internal Propagation of Categories (IPC). In the final step, it performs a second adjustment of classes called External Propagation of Categories (EPC). This adjustment leads to significant improvements over the results provided by the text classifier. We conduct our experiments on four subsets of ODP (Open Directory Project) using five classifiers: SVM (Support Vector Machine), NB (Naïve Bayes), KNN (K Nearest Neighbors), ICA (Iterative Classification Algorithm) based on SVM, and ICA based on NB. We also compare our approach to Classification using Linked Neighborhood (CLN), considered the closest algorithm to EPC. Results show that: (1) when applied after SVM, NB, KNN or ICA classification, IPC followed by EPC improves the results; (2) F1 scores obtained by our approach with any of the five classifiers are significantly better than those obtained by CLN; (3) the performance of our approach grows in proportion to the size of the query-log and to the density of the weighted graph.
Index Terms: formal concept analysis, centrality degree, semantic web, web page classification, query-log.
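To make the idea of an implicit neighborhood concrete, the following is a minimal sketch, not the method evaluated in this work. It assumes that two pages are implicitly linked when they were clicked for the same query, and that the edge weight is the number of such shared queries. The correction step is a plain weighted neighbor vote standing in for the propagation idea. It is not IPC or EPC, whose definitions (and the Formal Concept Analysis clustering) are not given in the abstract. The function names `build_implicit_graph` and `propagate_once`, the log format, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_implicit_graph(query_log):
    """Build a weighted graph from (query, clicked_url) pairs.

    Assumption: two pages share an implicit link when both were clicked
    for the same query; the weight counts the distinct shared queries.
    Returns a dict mapping a sorted (u, v) pair to its weight.
    """
    pages_by_query = defaultdict(set)
    for query, url in query_log:
        pages_by_query[query].add(url)

    weights = Counter()
    for urls in pages_by_query.values():
        for u, v in combinations(sorted(urls), 2):
            weights[(u, v)] += 1
    return dict(weights)


def propagate_once(weights, labels):
    """One round of weighted neighbor voting (a stand-in, not IPC/EPC).

    Each neighbor votes for its own label with the edge weight; a page
    also votes for its current label with weight 1 (an assumption) and
    keeps that label on ties.
    """
    votes = defaultdict(Counter)
    for (u, v), w in weights.items():
        votes[u][labels[v]] += w
        votes[v][labels[u]] += w

    corrected = {}
    for page, label in labels.items():
        tally = votes[page].copy()
        tally[label] += 1  # the page's own vote
        best = max(tally.values())
        if tally[label] == best:
            corrected[page] = label
        else:
            corrected[page] = max(tally, key=tally.get)
    return corrected


# Hypothetical query-log and hypothetical text-classifier output.
sample_log = [
    ("python tutorial", "a"), ("python tutorial", "b"), ("python tutorial", "c"),
    ("learn python", "a"), ("learn python", "b"), ("learn python", "c"),
    ("snake care", "d"),
]
sample_labels = {"a": "programming", "b": "programming",
                 "c": "animals", "d": "animals"}

sample_graph = build_implicit_graph(sample_log)
sample_corrected = propagate_once(sample_graph, sample_labels)
print(sample_graph)
print(sample_corrected)
```

In this toy data, page `c` is relabeled because both of its neighbors carry a different label and outweigh its own vote, while the isolated page `d` keeps the classifier's label. A sparser graph gives fewer pages any neighbors to vote, which is one intuitive reading of why graph density would matter.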