
Open-source Corpora: Using the net to fish for linguistic data

Serge Sharoff
2006 International Journal of Corpus Linguistics  
Experiments with a variety of languages show that Internet-derived corpora can be successfully used in the absence of large representative corpora that are rare and expensive to build.  ...  The paper proposes a methodology for collecting "open-source" corpora, i.e. corpora that are automatically collected from the Internet and distributed in the form of a list of links with open-source software  ...  The present study experimented with the development of Internet corpora for Chinese, English, German, Romanian, Russian and Ukrainian.  ... 
doi:10.1075/ijcl.11.4.05sha fatcat:pwp6r4upavevtitvryz2lpuwmm

Internet-based biosurveillance methods for vector-borne diseases: Are they novel public health tools or just novelties?

Simon Pollett, Benjamin M. Althouse, Brett Forshey, George W. Rutherford, Richard G. Jarman, Robert C Reiner
2017 PLoS Neglected Tropical Diseases  
Internet-based surveillance methods for vector-borne diseases (VBDs) using "big data" sources such as Google, Twitter, and internet newswire scraping have recently been developed, yet reviews on such "  ...  The fundamental features, advantages, and drawbacks of each internet big data source are presented for those with varying familiarity of "digital epidemiology."  ...  Title 17 U.S.C. § 105 provides that "copyright protection under this title is not available for any work of the United States Government." Title 17 U.  ... 
doi:10.1371/journal.pntd.0005871 pmid:29190281 pmcid:PMC5708615 fatcat:fnzh5femvzcczg4l76kdqe772u

On sparsity and drift for effective real-time filtering in microblogs

M-Dyaa Albakour, Craig Macdonald, Iadh Ounis
2013 Proceedings of the 22nd ACM International Conference on Information & Knowledge Management - CIKM '13  
Through experiments using the TREC Microblog track 2012, we show that our approach is effective for a number of common filtering metrics such as the user's utility, and that it compares favourably with  ...  state-of-the-art news filtering baselines.  ...  The newswire stream was crawled from major global news sources: BBC, CNN, Google News, New York Times, Guardian, Reuters, The Register and Wired.  ... 
doi:10.1145/2505515.2505709 dblp:conf/cikm/AlbakourMO13 fatcat:zgrlkbktzvdvvdfbpuqzrby6sq

Evaluating Elements of Web-Based Data Enrichment for Pseudo-relevance Feedback Retrieval [chapter]

Timo Breuer, Melanie Pest, Philipp Schaer
2021 Lecture Notes in Computer Science  
By enriching topics with text data from web search engine result pages and linked contents, we train topic-specific and cost-efficient classifiers that can be used to search test collections for relevant  ...  Overall, the analyzed method is robust in terms of average retrieval performance and a promising way to use web content for the data enrichment of relevance feedback methods.  ...  Furthermore, we analyzed the experiments in different contexts with other newswire test collections.  ... 
doi:10.1007/978-3-030-85251-1_5 fatcat:5brtpjy4gvbbnpfb4ycjjx2iqq

Technological Impediments to B2C Electronic Commerce: An Update

Cecil Chua, Gregory Rose, Huoy Min Khoo, Detmar Straub
2005 Communications of the Association for Information Systems  
As a result, new features added to the Internet do not consider all relevant factors, and are thus sub-optimal.  ...  We identify how advances in technology both partially resolve concerns with the original technological impediments, and inhibit their full resolution.  ...  ACKNOWLEDGEMENTS The authors acknowledge the assistance of Ray Tan Hang Kiang, Woohyeok Lim, Thanawat Sorawatanakam, Ananda Yetukuri, Xu Dan, Alan Chin, and Glen Martin who helped to obtain the practitioner articles  ... 
doi:10.17705/1cais.01607 fatcat:f2n4dqi6mbf2jhl2h4u6c273qi

Ord i Dag: Mining Norwegian Daily Newswire [chapter]

Unni Cathrine Eiken, Anja Therese Liseth, Hans Friedrich Witschel, Matthias Richter, Chris Biemann
2006 Lecture Notes in Computer Science  
Describing the complete process, we provide an entirely disclosed method for media monitoring and news summarization.  ...  For keyword extraction, a reference corpus serves as background about average language use, which is contrasted with the current day's word frequencies.  ...  For each sentence, a source reference contains a backlink to the original newswire article.  ... 
doi:10.1007/11816508_51 fatcat:53oszuslfbgp7osat6cj6o2ugq

News vertical search using user-generated content

Richard McCreadie
2012 SIGIR Forum  
for a newswire article.  ...  These two datasets come with pre-provided newswire article importance assessments for each topic day.  ...  Appendix A Parameter Analysis for Top Events Identification In Chapter 6 of this thesis, we proposed a variety of unlearned ranking models for identifying top news events, represented by newswire articles  ... 
doi:10.1145/2492189.2492202 fatcat:wuha3gotmnffnbqhrdltooys5m

How law and computer science can work together to improve the information society

Chris Marsden
2017 Communications of the ACM  
That is not new: it applied to the tabloid newspapers' methods of 'yellow' journalism, radio news and telegraph-supplied newswires 100 years ago.  ...  Fake News: 'Fake news' is the heartfelt cry of politicians who feel wronged by the online media. Ad blocking and filter bubbles have made consumers and voters harder to reach.  ... 
doi:10.1145/3163907 fatcat:odxsjenj2bdctipfndpi2rnehm

Generating Semantic Snapshots of Newscasts Using Entity Expansion [chapter]

José Luis Redondo García, Giuseppe Rizzo, Lilia Perez Romero, Michiel Hildebrand, Raphaël Troncy
2015 Lecture Notes in Computer Science  
Results of the experiments show the robustness of the approach holding an Average Normalized Discounted Cumulative Gain of 66.6%.  ...  paper we propose an approach that retrieves and analyzes related documents in the Web to automatically generate semantic annotations that provide viewers and experts comprehensive information about the news  ...  With this objective in mind, we selected 5 news videos and manually extracted entities from the subtitles; video image; text contained in the video; articles related to the subject of the video; and entities  ... 
doi:10.1007/978-3-319-19890-3_26 fatcat:napz7jwqgjgtlgcv2hbajdw7wu

You had me at hello: How phrasing affects memorability [article]

Cristian Danescu-Niculescu-Mizil, Justin Cheng, Jon Kleinberg, Lillian Lee
2012 arXiv   pre-print
To this end, we develop an analysis framework and build a corpus of movie quotes, annotated with memorability information, in which we are able to control for both the speaker and the setting of the quotes  ...  Another is that memorable quotes tend to be more general in ways that make them easy to apply in new contexts --- that is, more portable.  ...  This paper is based upon work supported in part by NSF grants IIS-0910664, IIS-1016099, Google, and Yahoo!  ... 
arXiv:1203.6360v2 fatcat:puy5k6guhbbljazmb5pof32wra

Predicting Fine-grained Social Roles with Selectional Preferences

Charley Beller, Craig Harman, Benjamin Van Durme
2014 Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science  
First person uses of verbs that select for a given social role as subject (e.g. I teach ... for teacher) are used to quickly build up binary classifiers for that role.  ...  Selectional preferences, the tendencies of predicates to select for certain semantic classes of arguments, have been successfully applied to a number of tasks in computational linguistics including word  ...  For example, if we read in a news article that an artist drew ..., we can take a tweet saying I drew ... as potential evidence that the author bears the artist social role.  ... 
doi:10.3115/v1/w14-2515 dblp:conf/acl/BellerHD14 fatcat:nlfgargitbge3bsescv7zwecjm

Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection [article]

Suchin Gururangan, Dallas Card, Sarah K. Dreier, Emily K. Gade, Leroy Z. Wang, Zeyu Wang, Luke Zettlemoyer, Noah A. Smith
2022 arXiv   pre-print
Using a new dataset of U.S. high school newspaper articles -- written by students from across the country -- we investigate whose language is preferred by the quality filter used for GPT-3.  ...  As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering  ...  For example, documents entirely about Trump and the presidential election have quality scores 35 percentage  ...  [Footnote 12: Here, the general newswire are articles from popular online news sources; see §4 for data details.]
arXiv:2201.10474v2 fatcat:3v3dqozhljemxahn7ut6u7xhru

News Article Teaser Tweets and How to Generate Them [article]

Sanjeev Kumar Karn, Mark Buckley, Ulli Waltinger, Hinrich Schütze
2019 arXiv   pre-print
A teaser is a short reading suggestion for an article that is illustrative and includes curiosity-arousing elements to entice potential readers to read particular news items.  ...  Teasers are one of the main vehicles for transmitting news to social media users.  ...  Acknowledgments We thank Siemens CT members and the anonymous reviewers for valuable feedback. This research was supported by Bundeswirtschaftsministerium (bmwi.de), grant 01MD15010A (Smart Data Web).  ... 
arXiv:1807.11535v2 fatcat:6djcdpiyrbf5dbn3ge4xliuqhq

Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

Max Grusky, Mor Naaman, Yoav Artzi
2018 Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)  
We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications.  ...  In particular, the summaries combine abstractive and extractive strategies, borrowing words and phrases from articles at varying rates.  ...  Acknowledgements This work is funded by Oath as part of the Connected Experiences Laboratory and by a Google Research Award. We thank the anonymous reviewers for their feedback.  ... 
doi:10.18653/v1/n18-1065 dblp:conf/naacl/GruskyNA18 fatcat:xx2jqc62tjfufecbjrbpheoipi

Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies [article]

Max Grusky, Mor Naaman, Yoav Artzi
2020 arXiv   pre-print
We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications.  ...  In particular, the summaries combine abstractive and extractive strategies, borrowing words and phrases from articles at varying rates.  ...  Acknowledgements This work is funded by Oath as part of the Connected Experiences Laboratory and by a Google Research Award. We thank the anonymous reviewers for their feedback.  ... 
arXiv:1804.11283v2 fatcat:d6egjf6axvax7pr6ku6mzma6we