
Application of Distributed Web Crawlers in Information Management System

Bo Wen
2018 Informatica (Ljubljana, Tiskana izd.)  
systems.  ...  The simulation experiment verified that the system could operate stably in information management system, which offers a reference for the application of distributed web crawlers in information management  ...  Conclusion In conclusion, distributed network crawlers based information management system could precisely satisfy the requirements of web crawling, with a high performance and expandability.  ... 
dblp:journals/informaticaSI/Wen18 fatcat:gyrizkk3gngunmg3f2z7ek43sa

Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtualization

Hussein Al-Bahadili, Hamzah Qtishat, Reyadh S. Naoum
2013 International Journal on Web Service Computing  
So that the crawling process should be a continuous process performed from time-to-time to maintain up-to-date crawled data.  ...  This paper develops and investigates the performance of a new approach to speed up the crawling process on a multi-core processor through virtualization.  ...  cost-effective high speed crawling system.  ... 
doi:10.5121/ijwsc.2013.4102 fatcat:h3i4nps4cfgjzm37tzqbdnzxui


Felix Hamborg, Norman Meuschke, Corinna Breitinger, Bela Gipp
2017 International Symposium of Information Science  
Our system allows crawling arbitrary news websites and extracting the major elements of news articles on those websites, i.e., title, lead paragraph, main content, publication date, author, and main image  ...  However, large scale collection of news data is cumbersome due to a lack of generic tools for crawling and extracting such data.  ...  These systems typically achieve high precision and recall for their extraction task, but require significant initial setup effort in order to customize the extractors to a set of specific news websites  ... 
doi:10.18452/1447 dblp:conf/isiwi/HamborgMBG17 fatcat:763h7ckq6rf2hlyqp6t46s4pku

news-please: A Generic News Crawler and Extractor

Felix Hamborg, Norman Meuschke, Corinna Breitinger, Bela Gipp
2017 Zenodo  
Our system allows crawling arbitrary news websites and extracting the major elements of news articles on those websites, i.e., title, lead paragraph, main content, publication date, author, and main image  ...  However, large scale collection of news data is cumbersome due to a lack of generic tools for crawling and extracting such data.  ...  Web Crawling. news-please performs two sub-tasks in this phase. (1) The crawler downloads articles' HTML, using the scrapy framework. (2) To find all articles published by the news outlet, the system supports  ... 
doi:10.5281/zenodo.4120316 fatcat:ubvtewe25zgy5c47kfe3pjkgim

An Extended Model for Effective Migrating Parallel Web Crawling with Domain Specific and Incremental Crawling

Md. Faizan Farooqui
2012 International Journal on Web Service Computing  
In this paper we propose the architecture for Effective Migrating Parallel Web Crawling approach with domain specific and incremental crawling strategy that makes web crawling system more effective and  ...  Domain specific crawling will yield high quality pages. The crawling process will migrate to host or server with specific domain and start downloading pages within specific domain.  ...  High quality of pages will be downloaded as crawling processes are performing in breadth first manner. Breadth first crawling improves the quality of downloaded pages.  ... 
doi:10.5121/ijwsc.2012.3308 fatcat:5p43j3aevnfttk6et4kdu6lzxu

Current Challenges in Web Crawling [chapter]

Denis Shestakov
2013 Lecture Notes in Computer Science  
In this tutorial, we will introduce the audience to five topics: architecture and implementation of high-performance web crawler, collaborative web crawling, crawling the deep Web, crawling multimedia  ...  Web crawling, a process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents starting from a simple program for website  ...  -Architecture and implementation of high-performance web crawler.  ... 
doi:10.1007/978-3-642-39200-9_49 fatcat:igaskwpugrdvpapxbwxg5imyge

High-Performance Web Crawling [chapter]

Marc Najork, Allan Heydon
2002 Massive Computing  
Abstract High-performance web crawlers are an important component of many web services.  ...  This chapter describes our experience building and operating such a high-performance crawler.  ...  High performance.  ... 
doi:10.1007/978-1-4615-0005-6_2 fatcat:axqtctlvfzfdhpywjmqmi7taye

Capturing Connectivity Graphs of a Large-Scale P2P Overlay Network

Hani Salah, Thorsten Strufe
2013 2013 IEEE 33rd International Conference on Distributed Computing Systems Workshops  
The results show that the crawler is fast and captures highly accurate graph snapshots.  ...  Measuring accurate graph snapshots of peer-to-peer (P2P) overlay networks is essential to understand these systems.  ...  They also thank Moritz Steiner for his cooperation through performing tests on Blizzard, and thank the G-lab administrators for their help.  ... 
doi:10.1109/icdcsw.2013.35 dblp:conf/icdcsw/SalahS13 fatcat:me56h6y2c5dx5apfhi75yaqxj4

Unsupervised Parallel Corpus Mining on Web Data [article]

Guokun Lai, Zihang Dai, Yiming Yang
2020 arXiv   pre-print
With a large amount of parallel data, neural machine translation systems are able to deliver human-level performance for sentence-level translation.  ...  On the WMT'16 English-Romanian and Romanian-English benchmarks, our system produces new state-of-the-art results, 39.81 and 38.95 BLEU scores, even compared with supervised approaches.  ...  In our experiment, we show that the machine translation system trained with crawled parallel data from our system is able to achieve a similar or even superior performance compared to fully supervised  ... 
arXiv:2009.08595v1 fatcat:jwfgwptdkzfipbmr6vdhl23zjy

An improved topic relevance algorithm for focused crawling

Hong-Wei Hao, Cui-Xia Mu, Xu-Cheng Yin, Shen Li, Zhi-Bin Wang
2011 2011 IEEE International Conference on Systems, Man, and Cybernetics  
Third, in real crawling experiments on the prototype system, the crawler using TF-IDF has high performance with the accumulated topic relevance increasing quickly at the beginning of crawling, however  ...  Last, the crawler using TFIDF+LSI performs the same crawl task and demonstrates the combination advantage of TF-IDF and LSI.  ...  accumulated topic relevance increases steadily and slowly throughout the whole crawling. • Secondly, at the beginning of crawling, the crawler using TF-IDF has high performance with the accumulated topic  ... 
doi:10.1109/icsmc.2011.6083759 dblp:conf/smc/HaoMYLW11 fatcat:qex6mjeitvcghltaujii3biycq

Using web text to improve keyword spotting in speech

Ankur Gandhe, Long Qin, Florian Metze, Alexander Rudnicky, Ian Lane, Matthias Eck
2013 2013 IEEE Workshop on Automatic Speech Recognition and Understanding  
In this paper, we investigate the use of online text resources to improve the performance of speech recognition specifically for the task of keyword spotting.  ...  By integrating the web text into our systems, we observed significant improvements in keyword spotting accuracy for four out of the five languages.  ...  The most gain was obtained in Turkish, where the LimitedLP system has a rather high OOV rate.  ... 
doi:10.1109/asru.2013.6707768 dblp:conf/asru/GandheQMRLE13 fatcat:e23pdljlzzdkddheqllq3owp5u

SNES: Social-Network-Oriented Public Opinion Monitoring Platform Based on ElasticSearch

Chuiju You, Dongjie Zhu, Yundong Sun, Anshan Ye, Gangshan Wu, Ning Cao, Jinming Qiu, Helen Min Zhou
2019 Computers Materials & Continua  
However, these platforms cannot perform well in scalability, fault tolerance, and real-time performance.  ...  A great number of empirical experiments prove that the platform can adapt well to the social network with highly real-time data and has good performance in public opinion monitoring.  ...  Kafka (a high throughput distributed publish and subscribe message system) and Spark Streaming (a real-time streaming computing framework).  ... 
doi:10.32604/cmc.2019.06133 fatcat:pczrzpwrkzen7nz4bblcorebhq

Experimental Study of Military Crawl as a Special Type of Human Quadripedal Automatic Locomotion

Dmitry Skvortsov, Victor Anisimov, Alina Aizenshtein
2021 Applied Sciences  
Progressive and propulsive motions are characterized as normal; additional right–left side motions—with high degree of reciprocity.  ...  Eight healthy adults aged 15–31 (four women and four men) were examined by means of a 3D kinematic analysis with Optitrack optical motion-capture system which consists of 12 Flex 13 cameras.  ...  Military Crawling A biomechanical analysis of motions used in military crawling was performed.  ... 
doi:10.3390/app11167666 fatcat:hwnebynfyvfdlppsl74cgtxwqm

A Parametric Layered Approach to Perform Web Page Ranking

Ratika Goel, Anchal Garg
2013 International Journal of Computer Applications  
The presented work will provide a recommendation based web page indexing so that effective web crawling will be performed.  ...  Web crawling is the foremost step to perform the effective and efficient web content search so that the user will get the specific web pages initially in an indexed form.  ...  Author presented an architecture for the system with the performance bottleneck and to drive the high performance based association search over the web [7]. The author has defined the work under the capabilities  ... 
doi:10.5120/11467-7251 fatcat:p4wl56r6vrcydk3qa4pezsb3da

Ontology Property-based Adaptive Crawler for Linked Data (OPAC)

Jihoon An, Younggi Kim, Minseok Lee, Younghee Lee
2013 2013 Fourth International Conference on the Network of the Future (NoF)  
Performance evaluation shows that this system can reduce overhead costs by more than 70% while maintaining a high freshness of data.  ...  Frequent crawling is required for dynamic data to meet the high freshness requirement of real time applications. Crawling large datasets may cause serious scalability problems.  ...  Performance evaluation shows that most of the data can maintain high freshness with much lower overhead. The paper is organized as follows.  ... 
doi:10.1109/nof.2013.6724500 dblp:conf/nof/AnKLL13 fatcat:g6zjc2dukbb6rhpsnk4yzkc3ui