Application on Web Page Filtering Technology

Bo Shen, Lei Li, Ning-wei Wang
2014 International Journal of Multimedia and Ubiquitous Engineering  
Web page filtering technology aims to filter out the large amount of repeated and theme-unrelated noise information and retain the useful information. Some web filtering methods cannot make full use of layout and visual features. In view of the now-mainstream "DIV+CSS" design style of modern commercial web sites, this paper observes that elements lying in the same DIV block share common semantic features and proposes a DIV_FOREST model to represent web pages. In combination
with the Vision-based Page Segmentation Algorithm, a DVPS algorithm that considers both layout features and visual features is proposed to improve web page filtering efficiency.

Another is based on the latter method. By setting the upper left corner of the screen as the coordinate origin, Kovacevic [7] established a reference coordinate system for locating the relative positions of HTML objects on the screen. In actual page data extraction, the two methods each have advantages and disadvantages, depending on the type of data set. The first method suits the case where the pages come from one or a few websites: by using multiple pages from the same site to extract their template, it can quickly distinguish topic information from noise information in all pages of that site. Its efficiency is high, but its applicability is poor. The second method makes up for the lack of flexibility of the first and can effectively handle the situation in which most web pages are not generated from the same template. However, the diversity of the modern web and the complexity of its development bring new challenges to page data filtering schemes based on this idea.

Based on using DIV tags to divide the page into content blocks, this paper proposes a new data filtering scheme, the DVPS algorithm. By determining whether the block size factor Doc has reached the threshold, the algorithm decides how the page is divided into blocks, where each block is composed of several DIV sub-trees and corresponds to a visual block of the web page at the macro level.

Copyright ⓒ 2014 SERSC

2.2. Text Pretreatment

2.2.1. Regular Expression: Regular expressions [12] provide a mechanism for searching for a specific string within a set of characters. A regular expression consists of uppercase and lowercase letters, numbers, and metacharacters, and can match a class of strings.
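To make the idea of "matching a class of strings" concrete, the following is a minimal Java sketch using the standard java.util.regex package. The pattern itself, the class name RegexSketch, and the helper names findTags and stripTags are hypothetical and chosen only for illustration; the paper does not state which expressions it uses during text pretreatment. The sketch assumes the goal is to locate simple HTML tags in a page fragment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSketch {
    // Illustrative pattern (an assumption, not taken from the paper):
    // matches simple opening or closing HTML tags such as <div class="x"> or </p>.
    private static final Pattern TAG =
            Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*[^>]*>");

    /** Collects every substring of the input that matches the tag pattern. */
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html); // each call gets its own Matcher
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    /** Returns the input with every matched tag removed. */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<div class=\"news\"><p>Hello</p></div>";
        System.out.println(findTags(html));
        System.out.println(stripTags(html));
    }
}
```

The single compiled Pattern is reused by both helpers, while each call creates a separate Matcher that holds the state of its own scan. A pattern this simple does not handle comments, scripts, or malformed markup, so it should be read as a demonstration of the mechanism only.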
Users can build a string-matching pattern from an expression and then compare it against data files, web pages, and other target objects. In the Java language, a regular expression string must first be compiled into an instance of the Pattern class, which is then used to create a Matcher object. A regular expression can be written to match any string. The state involved in the matching process is stored in the Matcher, so multiple matchers can share the same pattern.

Chinese Word Segmentation: Chinese word segmentation is the process of dividing a sequence of characters into separate words or characters. The most common methods are based on statistics, on dictionaries, and on understanding.

In a text, the more often adjacent characters appear together, the more likely they are to constitute a word. The co-occurrence frequency of adjacent characters therefore reflects well the likelihood that they form a word. Statistics-based Chinese word segmentation uses this idea and does not rely on a word dictionary.

The dictionary-based approach is also called the string matching method. It matches the Chinese character string against the entries of a machine dictionary. Matching principles include Maximum Matching, Minimum Matching, and Best Matching.

Methods based on understanding have the computer simulate human understanding in order to identify words. Such a method usually includes three parts: a semantic system, a segmentation system, and a general controller. It uses semantic and syntactic analysis to eliminate ambiguity, and includes artificial neural network word segmentation and expert system word segmentation.

... of this design approach is that the designer can put logically and semantically related content into the same DIV block, in order to control the page style by using CSS.
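The Maximum Matching principle named in the dictionary-based approach can be illustrated with a minimal Java sketch of forward maximum matching. This is one common variant and is not necessarily the one used in the paper. The class name MaxMatchSegmenter, the tiny dictionary, and the sample sentence 网页过滤技术 ("web page filtering technology") are assumptions made for the example. The strings are written as Unicode escapes so that the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MaxMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    MaxMatchSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int longest = 1;
        for (String word : dictionary) {
            longest = Math.max(longest, word.length());
        }
        this.maxWordLength = longest;
    }

    /**
     * Forward maximum matching: at each position, take the longest
     * dictionary word that starts there; if none matches, emit a single
     * character. Assumes characters in the Basic Multilingual Plane,
     * so one Java char corresponds to one Chinese character.
     */
    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxWordLength);
            // Shrink the candidate until it is a dictionary word
            // or only one character remains.
            while (end > start + 1
                    && !dictionary.contains(text.substring(start, end))) {
                end--;
            }
            words.add(text.substring(start, end));
            start = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary: "web page", "filtering", "technology",
        // and the compound "web page filtering".
        Set<String> dict = new HashSet<>(Arrays.asList(
                "\u7f51\u9875", "\u8fc7\u6ee4", "\u6280\u672f",
                "\u7f51\u9875\u8fc7\u6ee4"));
        MaxMatchSegmenter segmenter = new MaxMatchSegmenter(dict);
        System.out.println(segmenter.segment("\u7f51\u9875\u8fc7\u6ee4\u6280\u672f"));
    }
}
```

With this dictionary, the sample sentence is split into 网页过滤 and 技术, because the longer compound is preferred over the shorter word 网页. A Minimum Matching variant would shrink from the shortest candidate upward instead.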
doi:10.14257/ijmue.2014.9.12.35 fatcat:tdapvuf5azgz7js2yifg42rxzi