Open-source Corpora: Using the net to fish for linguistic data
2006
International Journal of Corpus Linguistics
Experiments with a variety of languages show that Internet-derived corpora can be successfully used in the absence of large representative corpora that are rare and expensive to build. ...
The paper proposes a methodology for collecting "open-source" corpora, i.e. corpora that are automatically collected from the Internet and distributed in the form of a list of links with open-source software ...
The present study experimented with the development of Internet corpora for Chinese, English, German, Romanian, Russian and Ukrainian. ...
doi:10.1075/ijcl.11.4.05sha
fatcat:pwp6r4upavevtitvryz2lpuwmm
Internet-based biosurveillance methods for vector-borne diseases: Are they novel public health tools or just novelties?
2017
PLoS Neglected Tropical Diseases
Internet-based surveillance methods for vector-borne diseases (VBDs) using "big data" sources such as Google, Twitter, and internet newswire scraping have recently been developed, yet reviews on such " ...
The fundamental features, advantages, and drawbacks of each internet big data source are presented for those with varying familiarity of "digital epidemiology." ...
Title 17 U.S.C. § 105 provides that "copyright protection under this title is not available for any work of the United States Government." Title 17 U. ...
doi:10.1371/journal.pntd.0005871
pmid:29190281
pmcid:PMC5708615
fatcat:fnzh5femvzcczg4l76kdqe772u
On sparsity and drift for effective real-time filtering in microblogs
2013
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management - CIKM '13
Through experiments using the TREC Microblog track 2012, we show that our approach is effective for a number of common filtering metrics such as the user's utility, and that it compares favourably with ...
state-of-the-art news filtering baselines. ...
The newswire stream was crawled from major global news sources: BBC, CNN, Google News, New York Times, Guardian, Reuters, The Register and Wired. ...
doi:10.1145/2505515.2505709
dblp:conf/cikm/AlbakourMO13
fatcat:zgrlkbktzvdvvdfbpuqzrby6sq
Evaluating Elements of Web-Based Data Enrichment for Pseudo-relevance Feedback Retrieval
[chapter]
2021
Lecture Notes in Computer Science
By enriching topics with text data from web search engine result pages and linked contents, we train topic-specific and cost-efficient classifiers that can be used to search test collections for relevant ...
Overall, the analyzed method is robust in terms of average retrieval performance and a promising way to use web content for the data enrichment of relevance feedback methods. ...
Furthermore, we analyzed the experiments in different contexts with other newswire test collections. ...
doi:10.1007/978-3-030-85251-1_5
fatcat:5brtpjy4gvbbnpfb4ycjjx2iqq
Technological Impediments to B2C Electronic Commerce: An Update
2005
Communications of the Association for Information Systems
As a result, new features added to the Internet do not consider all relevant factors, and are thus sub-optimal. ...
We identify how advances in technology both partially resolve concerns with the original technological impediments, and inhibit their full resolution. ...
ACKNOWLEDGEMENTS The authors acknowledge the assistance of Ray Tan Hang Kiang, Woohyeok Lim, Thanawat Sorawatanakam, Ananda Yetukuri, Xu Dan, Alan Chin, and Glen Martin who helped to obtain the practitioner articles ...
doi:10.17705/1cais.01607
fatcat:f2n4dqi6mbf2jhl2h4u6c273qi
Ord i Dag: Mining Norwegian Daily Newswire
[chapter]
2006
Lecture Notes in Computer Science
Describing the complete process, we provide an entirely disclosed method for media monitoring and news summarization. ...
For keyword extraction, a reference corpus serves as background about average language use, which is contrasted with the current day's word frequencies. ...
For each sentence, a source reference contains a backlink to the original newswire article. ...
doi:10.1007/11816508_51
fatcat:53oszuslfbgp7osat6cj6o2ugq
News vertical search using user-generated content
2012
SIGIR Forum
for a newswire article. ...
These two datasets come with pre-provided newswire article importance assessments for each topic day. ...
Appendix A: Parameter Analysis for Top Events Identification. In Chapter 6 of this thesis, we proposed a variety of unlearned ranking models for identifying top news events, represented by newswire articles ...
doi:10.1145/2492189.2492202
fatcat:wuha3gotmnffnbqhrdltooys5m
How law and computer science can work together to improve the information society
2017
Communications of the ACM
That is not new: it applied to the tabloid newspapers' methods of 'yellow' journalism, radio news and telegraph-supplied newswires 100 years ago. ...
Fake News: 'Fake news' is the heartfelt cry of politicians who feel wronged by the online media. Ad blocking and filter bubbles have made consumers and voters harder to reach. ...
doi:10.1145/3163907
fatcat:odxsjenj2bdctipfndpi2rnehm
Generating Semantic Snapshots of Newscasts Using Entity Expansion
[chapter]
2015
Lecture Notes in Computer Science
Results of the experiments show the robustness of the approach holding an Average Normalized Discounted Cumulative Gain of 66.6%. ...
paper we propose an approach that retrieves and analyzes related documents in the Web to automatically generate semantic annotations that provide viewers and experts comprehensive information about the news ...
With this objective in mind, we selected 5 news videos and manually extracted entities from the subtitles; video image; text contained in the video; articles related to the subject of the video; and entities ...
doi:10.1007/978-3-319-19890-3_26
fatcat:napz7jwqgjgtlgcv2hbajdw7wu
You had me at hello: How phrasing affects memorability
[article]
2012
arXiv
pre-print
To this end, we develop an analysis framework and build a corpus of movie quotes, annotated with memorability information, in which we are able to control for both the speaker and the setting of the quotes ...
Another is that memorable quotes tend to be more general in ways that make them easy to apply in new contexts --- that is, more portable. ...
This paper is based upon work supported in part by NSF grants IIS-0910664, IIS-1016099, Google, and Yahoo! ...
arXiv:1203.6360v2
fatcat:puy5k6guhbbljazmb5pof32wra
Predicting Fine-grained Social Roles with Selectional Preferences
2014
Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science
First person uses of verbs that select for a given social role as subject (e.g. I teach ... for teacher) are used to quickly build up binary classifiers for that role. ...
Selectional preferences, the tendencies of predicates to select for certain semantic classes of arguments, have been successfully applied to a number of tasks in computational linguistics including word ...
For example, if we read in a news article that an artist drew ..., we can take a tweet saying I drew ... as potential evidence that the author bears the artist social role. ...
doi:10.3115/v1/w14-2515
dblp:conf/acl/BellerHD14
fatcat:nlfgargitbge3bsescv7zwecjm
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
[article]
2022
arXiv
pre-print
Using a new dataset of U.S. high school newspaper articles -- written by students from across the country -- we investigate whose language is preferred by the quality filter used for GPT-3. ...
As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering ...
For example, documents entirely about Trump and the presidential election have quality scores 35 percentage ... (Footnote 12: Here, the general newswire are articles from popular online news sources; see §4 for data details.)
arXiv:2201.10474v2
fatcat:3v3dqozhljemxahn7ut6u7xhru
News Article Teaser Tweets and How to Generate Them
[article]
2019
arXiv
pre-print
A teaser is a short reading suggestion for an article that is illustrative and includes curiosity-arousing elements to entice potential readers to read particular news items. ...
Teasers are one of the main vehicles for transmitting news to social media users. ...
Acknowledgments We thank Siemens CT members and the anonymous reviewers for valuable feedback. This research was supported by Bundeswirtschaftsministerium (bmwi.de), grant 01MD15010A (Smart Data Web). ...
arXiv:1807.11535v2
fatcat:6djcdpiyrbf5dbn3ge4xliuqhq
Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies
2018
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications. ...
In particular, the summaries combine abstractive and extractive strategies, borrowing words and phrases from articles at varying rates. ...
Acknowledgements This work is funded by Oath as part of the Connected Experiences Laboratory and by a Google Research Award. We thank the anonymous reviewers for their feedback. ...
doi:10.18653/v1/n18-1065
dblp:conf/naacl/GruskyNA18
fatcat:xx2jqc62tjfufecbjrbpheoipi
Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies
[article]
2020
arXiv
pre-print
We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications. ...
In particular, the summaries combine abstractive and extractive strategies, borrowing words and phrases from articles at varying rates. ...
Acknowledgements This work is funded by Oath as part of the Connected Experiences Laboratory and by a Google Research Award. We thank the anonymous reviewers for their feedback. ...
arXiv:1804.11283v2
fatcat:d6egjf6axvax7pr6ku6mzma6we
Showing results 1 — 15 out of 848 results