Multilingual Focused Crawler System based on Web Content Extraction and Path Configuration
2019
IOP Conference Series: Materials Science and Engineering
Then, it uses path configuration information or web content extraction algorithm based on the distribution line block to get webpage content, and adopts rules or configuration information to acquire new ...
The multilingual focused crawler system combines web content extraction with path configuration to make use of their advantages and achieve automatic collection of network information in multiple languages ...
(2) Getting new links from web pages to achieve circular crawling. (3) Extracting web content, title and published time. (4) Filtering information by keywords. ...
doi:10.1088/1757-899x/569/5/052030
fatcat:xtvc6x2qg5cfvlzfjcdxwvw4ze
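Steps (2) to (4) listed in the snippet above can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the paper's implementation: the names `LinkAndTitleParser`, `parse_page`, and `matches_keywords` are hypothetical, the page title is assumed to come from the `<title>` tag, and published-time extraction and the line-block content algorithm are not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects absolute link targets of <a> tags and the text of <title>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html, base_url):
    """Return (title, links) for one downloaded page."""
    parser = LinkAndTitleParser(base_url)
    parser.feed(html)
    return parser.title.strip(), parser.links


def matches_keywords(text, keywords):
    """Return True if any keyword occurs in text (case-insensitive)."""
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)
```

In a crawl loop, the returned links would be queued for the next round, and only pages whose title or content passes `matches_keywords` would be kept.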
A Layout-Independent Web News Article Contents Extraction Method Based on Relevance Analysis
[chapter]
2009
Lecture Notes in Computer Science
In this paper, we propose a relevance-based analysis method to extract the news article contents from the news pages without the analysis of news page layouts before extraction. ...
The traditional Web news article contents extraction methods are time-costly and need much maintenance because they analyze the layout of news pages to generate the wrappers manually or automatically. ...
Experiment 2 We extract and analyze the topic-based Web news articles from news site databases to observe the difference in the various topics. We select the countries and leaders as our test topics. ...
doi:10.1007/978-3-642-02818-2_37
fatcat:erbnmaozbndxpobf5wzaesr6qe
Using linguistic features to automatically extract web page title
2017
Expert systems with applications
Existing methods for extracting titles from HTML web page mostly rely on visual and structural features. ...
Using annotated English corpus, we learn the morphosyntactic characteristics of known titles and define a part-of-speech tag patterns that help to extract candidate phrases from the web page. ...
; Wang et al., 2009; Xue et al., 2007) However, extracting a title from the body of the web page is not an easy task, as roughly half of a page's content is irrelevant text (Gibson, Punera, & Tomkins ...
doi:10.1016/j.eswa.2017.02.045
fatcat:iru5ti3dgfbn7hkqdyme36s5sq
Enriching an Authority File of Scientific Conferences with Information Extracted from the Web
2017
Journal of Computer Science
This paper proposes an approach for the enrichment of a publication venue authority file by extracting information on conferences from their web pages. ...
Our approach includes the steps for querying a web search engine, classifying documents obtained in the result sets and extracting information from the relevant pages. ...
Acknowledgement and Funding Information This work was partially supported by the FAPEMIG grant CEX-APQ-01834-14, CNPq grant 200828/2015-0 and an individual scholarship from UFLA. ...
doi:10.3844/jcssp.2017.68.77
fatcat:ikx2yue6fjfatloo2gfhmxr56q
HisTrace: A system for mining on news-related articles instead of web pages
2010
2010 IEEE 2nd Symposium on Web Society
Anchor texts are firstly used to extract titles from HTML bodies and then contents are extracted right after titles. ...
In this paper we propose a system to enable mining on news-related articles instead of raw web pages. ...
Title and Content Extraction: First of all, we shall extract news-related articles (titles and contents) out from web pages. ...
doi:10.1109/sws.2010.5607481
fatcat:u7m4p5iagjggdf6vcej7aeojmi
A system for extracting top-K lists from the web
2012
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '12
We present an efficient system that extracts the target lists from web pages with high accuracy. ...
List data is an important source of structured data on the web. This paper is concerned with "top-k" pages, which are web pages that describe a list of k instances of a particular topic or concept. ...
which makes up most of the web content. ...
doi:10.1145/2339530.2339780
dblp:conf/kdd/ZhangZW12
fatcat:q5fkjzsyxfbgdkcrsd4x4vbs6y
Can we learn a template-independent wrapper for news article extraction from a single training site?
2009
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09
Automatic news extraction from news pages is important in many Web applications such as news aggregation. ...
We formalize Web news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. ...
Furthermore, the previous template-based wrappers such as TED can only extract plain texts from news pages. ...
doi:10.1145/1557019.1557163
dblp:conf/kdd/WangCWPBGZ09
fatcat:l7amo3exbfa5jajklewqfdjegm
Boilerplate Removal and Content Extraction from Dynamic Web Pages
2014
International Journal of Computer Science Engineering and Applications
To ensure the high quality of web page, a good boilerplate removal algorithm is needed to extract only the relevant contents from web page. ...
The system classifies the noise or content from HTML web page. Content Extraction algorithm describes to get high performance without parsing DOM trees. ...
Extracting useful or relevant information from Web pages thus becomes an important task. Also irrelevant information is contained in these Web pages. ...
doi:10.5121/ijcsea.2014.4603
fatcat:c625vi3bjngntguh6mc3ajjwkq
News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages
2017
Bonfring International Journal of Data Mining
In order to test this system, Web news pages with core hints (which are the subject keywords presented by the news authors) are selected from the 163 website (www.163.com). ...
Experimental results show that this method can correctly recognize Web news pages with a rate of better than 96 percent. ...
After the automatic recognition and filtering, our system uses a new key-phrase extraction method from Web news content based on semantic relation. ...
doi:10.9756/bijdm.8339
fatcat:7wmbt5eumnb4jf2mjgvzgmurda
Efficient Extraction of Top-k Instances from Web
2017
IARJSET
Extraction of top-k list depends on 1] Extracting web URLs and its titles 2] Removing dust from web URLs 3] Using extraction algorithm extract exact top-k list. ...
This paper work on information extraction from top-k web pages which contains top-k instances for open domain knowledge based. For example-"Top 10 IT companies in India". ...
Zhu, Haixun Wang [3] defines extraction of general lists and tables from the web. It is based on recognize, extract and understand top-k list content from web pages. ...
doi:10.17148/iarjset/nciarcse.2017.01
fatcat:y7kt2catdbgwjcv6ablp5tlesa
Research and Innovative Design of Search Engine for Banking Industry Decision-makers
2018
Proceedings of the 2018 10th International Conference on Information Management and Engineering - ICIME 2018
This article only gives the implementation method and workflow of typical functions of web search. ...
Based on the actual needs of the banking industry, this paper designs and develops an innovative Search Engine for Banking Industry Decision-makers (SEfBIDm). ...
WEB SEARCH DESIGN: Web search uses a method of tag-based web page analysis. This method extracts content based on the unique tags of each part of the web page. ...
doi:10.1145/3285957.3285978
fatcat:apu6qh3ecfb3tdp4j3ww4civ3i
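The tag-based extraction mentioned in the snippet above might look like the following sketch. It is an assumption-laden illustration, not the SEfBIDm implementation: it treats "unique tag" as an element identified by tag name plus `id` attribute, assumes reasonably well-formed HTML, and the names `TagContentParser`, `extract_by_tag`, and `VOID_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagContentParser(HTMLParser):
    """Collects the text found inside one element chosen by tag name and id."""

    def __init__(self, target_tag, target_id):
        super().__init__()
        self.target_tag = target_tag
        self.target_id = target_id
        self.depth = 0  # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1  # nested element inside the target
        elif tag == self.target_tag and dict(attrs).get("id") == self.target_id:
            self.depth = 1  # entered the target element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_by_tag(html, target_tag, target_id):
    """Return the whitespace-joined text of the selected element."""
    parser = TagContentParser(target_tag, target_id)
    parser.feed(html)
    return " ".join(parser.chunks)
```

A per-site configuration could then map each page part (title, body, date) to its own tag and id pair and call `extract_by_tag` once per part.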
Extracting informative textual parts from web pages containing user-generated content
2012
Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies - i-KNOW '12
The proposed algorithm takes into account visual and non-visual characteristics of a web page and is able to remove noisy parts from three major categories of pages which contain user-generated content ...
The vast amount of user-generated content on the Web has increased the need for handling the problem of automatically processing content in web pages. ...
The extraction of the valuable information from the web pages can be performed by the usage of application specific APIs, which give machine-readable access to the contents directly from the source. ...
doi:10.1145/2362456.2362462
dblp:conf/iknow/PappasKS12
fatcat:rl4rfjtfc5c3ddxp3mhnalntzi
Automatic extraction of top-k lists from the web
2013
2013 IEEE 29th International Conference on Data Engineering (ICDE)
This paper is concerned with information extraction from top-k web pages, which are web pages that describe top k instances of a topic which is of general interest. ...
In this paper, we present an efficient method that extracts top-k lists from web pages with high performance. ...
We extracted 1.7 million top-k lists from a web corpus that contains 1.6 billion web pages. ...
doi:10.1109/icde.2013.6544897
dblp:conf/icde/ZhangZWL13
fatcat:pnjgttbxhvg7npfqwbudttleae
An Approach of Information Extraction Based on Dom Tree and Weight Value
2016
International Journal of Grid and Distributed Computing
Eliminating noisy information and extracting information content from web pages are increasing to become an important research issue in information retrieval field. ...
In this paper, we present an approach of information extraction based on Dom tree and weight value calculation, which contains the following steps, parse the web page to construct the Dom tree, extract ...
of API to obtain or process the operation data. (2) Extract the title and key word of web page body firstly, extract the content of title tag in web page, that is, the title information of body, then ...
doi:10.14257/ijgdc.2016.9.10.28
fatcat:mlerk3xrprfrnde3dz3iey7z2q
The Design of Intelligence Collection System Based on Internet
2011
Procedia Engineering
Based on the features of collection and retrieval of public intelligence, an intelligence collection system using knowledge base and user interest model is developed. ...
Nonetheless, the extracted content from a web page may vary greatly with the set subject. This phenomenon will affect the extraction accuracy of subject page information. ...
Therefore, after the web page is downloaded by subject search engine, the title and the anchor are extracted from each HTML document and taken as the indexes. ...
doi:10.1016/j.proeng.2011.08.573
fatcat:rq2gwjwemjefjeycy6rakj7aiy