Multilingual Focused Crawler System based on Web Content Extraction and Path Configuration

Jie Wang, Sanhong Deng, Lijuan Wang
2019 IOP Conference Series: Materials Science and Engineering  
Then, it uses path configuration information or web content extraction algorithm based on the distribution line block to get webpage content, and adopts rules or configuration information to acquire new  ...  The multilingual focused crawler system combines web content extraction with path configuration to make use of their advantages and achieve automatic collection of network information in multiple languages  ...  (2) Getting new links from web pages to achieve circular crawling. (3) Extracting web content, title and published time. (4) Filtering information by keywords.  ... 
doi:10.1088/1757-899x/569/5/052030 fatcat:xtvc6x2qg5cfvlzfjcdxwvw4ze

A Layout-Independent Web News Article Contents Extraction Method Based on Relevance Analysis [chapter]

Hao Han, Takehiro Tokuda
2009 Lecture Notes in Computer Science  
In this paper, we propose a relevance-based analysis method to extract the news article contents from the news pages without the analysis of news page layouts before extraction.  ...  The traditional Web news article contents extraction methods are time-costly and need much maintenance because they analyze the layout of news pages to generate the wrappers manually or automatically.  ...  Experiment 2 We extract and analyze the topic-based Web news articles from news site databases to observe the difference in the various topics. We select the countries and leaders as our test topics.  ... 
doi:10.1007/978-3-642-02818-2_37 fatcat:erbnmaozbndxpobf5wzaesr6qe

Using linguistic features to automatically extract web page title

Najlah Gali, Radu Mariescu-Istodor, Pasi Fränti
2017 Expert systems with applications  
Abstract Existing methods for extracting titles from HTML web page mostly rely on visual and structural features.  ...  Using annotated English corpus, we learn the morphosyntactic characteristics of known titles and define a part-of-speech tag patterns that help to extract candidate phrases from the web page.  ...  ; Wang et al., 2009; Xue et al., 2007) However, extracting a title from the body of the web page is not an easy task, as roughly half of a page's content is irrelevant text (Gibson, Punera, & Tomkins  ... 
doi:10.1016/j.eswa.2017.02.045 fatcat:iru5ti3dgfbn7hkqdyme36s5sq

Enriching an Authority File of Scientific Conferences with Information Extracted from the Web

Heider Alvarenga de Jesus, Denilson Alves Pereira
2017 Journal of Computer Science  
This paper proposes an approach for the enrichment of a publication venue authority file by extracting information on conferences from their web pages.  ...  Our approach includes the steps for querying a web search engine, classifying documents obtained in the result sets and extracting information from the relevant pages.  ...  Acknowledgement and Funding Information This work was partially supported by the FAPEMIG grant CEX-APQ-01834-14, CNPq grant 200828/2015-0 and an individual scholarship from UFLA.  ... 
doi:10.3844/jcssp.2017.68.77 fatcat:ikx2yue6fjfatloo2gfhmxr56q

HisTrace: A system for mining on news-related articles instead of web pages

Lian'en Huang, Xiaoming Li
2010 2010 IEEE 2nd Symposium on Web Society  
Anchor texts are firstly used to extract titles from HTML bodies and then contents are extracted right after titles.  ...  In this paper we propose a system to enable mining on news-related articles instead of raw web pages.  ...  Title and Content Extraction First of all, we shall extract news-related articles (titles and contents) out from web pages.  ... 
doi:10.1109/sws.2010.5607481 fatcat:u7m4p5iagjggdf6vcej7aeojmi

A system for extracting top-K lists from the web

Zhixian Zhang, Kenny Qili Zhu, Haixun Wang
2012 Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '12  
We present an efficient system that extracts the target lists from web pages with high accuracy.  ...  List data is an important source of structured data on the web. This paper is concerned with "top-k" pages, which are web pages that describe a list of k instances of a particular topic or concept.  ...  which makes up most of the web content.  ... 
doi:10.1145/2339530.2339780 dblp:conf/kdd/ZhangZW12 fatcat:q5fkjzsyxfbgdkcrsd4x4vbs6y

Can we learn a template-independent wrapper for news article extraction from a single training site?

Junfeng Wang, Chun Chen, Can Wang, Jian Pei, Jiajun Bu, Ziyu Guan, Wei Vivian Zhang
2009 Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09  
Automatic news extraction from news pages is important in many Web applications such as news aggregation.  ...  We formalize Web news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site.  ...  Furthermore, the previous template-based wrappers such as TED can only extract plain texts from news pages.  ... 
doi:10.1145/1557019.1557163 dblp:conf/kdd/WangCWPBGZ09 fatcat:l7amo3exbfa5jajklewqfdjegm

Boilerplate Removal and Content Extraction from Dynamic Web Pages

Pan Ei San
2014 International Journal of Computer Science Engineering and Applications  
To ensure the high quality of web page, a good boilerplate removal algorithm is needed to extract only the relevant contents from web page.  ...  The system classifies the noise or content from HTML web page. Content Extraction algorithm describes to get high performance without parsing DOM trees.  ...  Extracting useful or relevant information from Web pages thus becomes an important task. Also irrelevant information is contained in these Web pages.  ... 
doi:10.5121/ijcsea.2014.4603 fatcat:c625vi3bjngntguh6mc3ajjwkq

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages

Bamber, Micah Jason
2017 Bonfring International Journal of Data Mining  
In order to test this system, Web news pages with core hints (which are the subject keywords presented by the news authors) are selected from the 163 website (  ...  Experimental results show that this method can correctly recognize Web news pages with a rate of better than 96 percent.  ...  After the automatic recognition and filtering, our system uses a new key-phrase extraction method from Web news content based on semantic relation.  ... 
doi:10.9756/bijdm.8339 fatcat:7wmbt5eumnb4jf2mjgvzgmurda

Efficient Extraction of Top-k Instances from Web

Prof. Sayali Shinde, Tejaswi Shewale
2017 IARJSET  
Extraction of top-k list depends on 1] Extracting web URLs and its titles 2] Removing dust from web URLs 3] Using extraction algorithm extract exact top-k list.  ...  This paper work on information extraction from top-k web pages which contains top-k instances for open domain knowledge based. For example-"Top 10 IT companies in India".  ...  Zhu, Haixun Wang [3] defines extraction of general lists and tables from the web. It is based on recognize, extract and understand top-k list content from web pages.  ... 
doi:10.17148/iarjset/nciarcse.2017.01 fatcat:y7kt2catdbgwjcv6ablp5tlesa

Research and Innovative Design of Search Engine for Banking Industry Decision-makers

Huaihai Hui, Des McLernon, Ali Zaidi
2018 Proceedings of the 2018 10th International Conference on Information Management and Engineering - ICIME 2018  
This article only gives the implementation method and workflow of typical functions of web search.  ...  Based on the actual needs of the banking industry, this paper designs and develops an innovative Search Engine for Banking Industry Decision-makers (SEfBIDm).  ...  WEB SEARCH DESIGN Web search uses a method of tag-based web page analysis. This method extracts content based on the unique tags of each part of the web page.  ... 
doi:10.1145/3285957.3285978 fatcat:apu6qh3ecfb3tdp4j3ww4civ3i

Extracting informative textual parts from web pages containing user-generated content

Nikolaos Pappas, Georgios Katsimpras, Efstathios Stamatatos
2012 Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies - i-KNOW '12  
The proposed algorithm takes into account visual and non-visual characteristics of a web page and is able to remove noisy parts from three major categories of pages which contain user-generated content  ...  The vast amount of user-generated content on the Web has increased the need for handling the problem of automatically processing content in web pages.  ...  The extraction of the valuable information from the web pages can be performed by the usage of application specific APIs, which give machine-readable access to the contents directly from the source.  ... 
doi:10.1145/2362456.2362462 dblp:conf/iknow/PappasKS12 fatcat:rl4rfjtfc5c3ddxp3mhnalntzi

Automatic extraction of top-k lists from the web

Zhixian Zhang, K. Q. Zhu, Haixun Wang, Hongsong Li
2013 2013 IEEE 29th International Conference on Data Engineering (ICDE)  
This paper is concerned with information extraction from top-k web pages, which are web pages that describe top k instances of a topic which is of general interest.  ...  In this paper, we present an efficient method that extracts top-k lists from web pages with high performance.  ...  We extracted 1.7 million top-k lists from a web corpus that contains 1.6 billion web pages.  ... 
doi:10.1109/icde.2013.6544897 dblp:conf/icde/ZhangZWL13 fatcat:pnjgttbxhvg7npfqwbudttleae

An Approach of Information Extraction Based on Dom Tree and Weight Value

Haitao Wang, Shufen Liu
2016 International Journal of Grid and Distributed Computing  
Eliminating noisy information and extracting information content from web pages are increasing to become an important research issue in information retrieval field.  ...  In this paper, we present an approach of information extraction based on Dom tree and weight value calculation, which contains the following steps, parse the web page to construct the Dom tree, extract  ...  of API to obtain or process the operation data. (2) Extract the title and key word of web page body firstly, extract the content of title tag in web page, that is, the title information of body, then  ... 
doi:10.14257/ijgdc.2016.9.10.28 fatcat:mlerk3xrprfrnde3dz3iey7z2q

The Design of Intelligence Collection System Based on Internet

Xiaojun Liu
2011 Procedia Engineering  
Based on the features of collection and retrieval of public intelligence, an intelligence collection system using knowledge base and user interest model is developed.  ...  Nonetheless, the extracted content from a web page may vary greatly with the set subject. This phenomenon will affect the extraction accuracy of subject page information.  ...  Therefore, after the web page is downloaded by subject search engine, the title and the anchor are extracted from each HTML document and taken as the indexes.  ... 
doi:10.1016/j.proeng.2011.08.573 fatcat:rq2gwjwemjefjeycy6rakj7aiy