
Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies [article]

Mukund Srinath, Shomir Wilson, C. Lee Giles
2020 arXiv   pre-print
We design a corpus creation pipeline which consists of crawling the web followed by filtering documents using language detection, document classification, duplicate and near-duplication removal, and content  ...  Organisations disclose their privacy practices by posting privacy policies on their website.  ...  Common Crawl has also been releasing a domain-level webgraph from which the harmonic centrality of the crawled domains are calculated.  ... 
arXiv:2004.11131v1 fatcat:gbiyqrhfsbbtfmasbfe5cmz7bu

Bilingual web page and site readability assessment

Tak Pang Lau, Irwin King
2006 Proceedings of the 15th international conference on World Wide Web - WWW '06  
Furthermore, we can obtain the overall content distribution in a Web site by studying the variation of its readability.  ...  In this paper, we investigate the applications of readability assessment in Web development, such that users can retrieve information which is appropriate to their levels.  ...  crawling.  ... 
doi:10.1145/1135777.1135981 dblp:conf/www/LauK06 fatcat:mjqe6ljfd5dc5nxyod5jjc7sim

Statistical Estimation of Word Acquisition With Application to Readability Prediction

Paul Kidwell, Guy Lebanon, Kevyn Collins-Thompson
2011 Journal of the American Statistical Association  
compare the estimated distributions with word acquisition data from existing oral studies, revealing interesting historical trends as well as differences between oral and written word acquisition grade levels  ...  We use this model to estimate the distributions of word acquisition ages from empirical readability data collected from the web.  ...  EXPERIMENTAL RESULTS In our experiments we used three readability datasets. The corpora were compiled by crawling web pages containing documents authored for audiences of specific grade levels.  ... 
doi:10.1198/jasa.2010.ap09318 fatcat:njm6g2kbcvgbtahdnk6hqtkuve

An Unsupervised Technical Readability Ranking Model by Building a Conceptual Terrain in LSI

Shoaib Jameel, Xiaojun Qian
2012 2012 Eighth International Conference on Semantics, Knowledge and Grids  
Our model has achieved significant improvement in ranking documents by technical readability.  ...  Moreover, readability methods cannot address the issue in domain-specific IR ranking because they fail to give precise prediction when applied on web pages.  ...  By crawling web pages from different resources available online we are able to collect technical contents which fit the understanding level and difficulty for diverse backgrounds of people.  ... 
doi:10.1109/skg.2012.20 dblp:conf/skg/JameelQ12 fatcat:vvkqihxx4vfkrept4nyeacr7dm

A Simple Post-Processing Technique for Improving Readability Assessment of Texts using Word Mover's Distance [article]

Joseph Marvin Imperial, Ethel Ong
2021 arXiv   pre-print
further ground the difficulty level given by a model.  ...  In this study, we improve the conventional methodology of automatic readability assessment by incorporating the Word Mover's Distance (WMD) of ranked texts as an additional post-processing technique to  ...  The word embeddings in various languages were trained from Common Crawl and Wikipedia datasets by Grave et al. (2019).  ... 
arXiv:2103.07277v2 fatcat:tlz5tz7xzbeuhhmsfd6cpz3ihu


Austin D. Chen, Qing Zhao Ruan, Alexandra Bucknor, Patrick P. Bletsis, Anmol S. Chattha, Bernard T. Lee, Samuel J. Lin
2017 Plastic and Reconstructive Surgery, Global Open  
Average readability of plastic surgeon and non-plastic surgeon posted articles attained mean reading grade level of 14.5 and 15.3, respectively (p<0.001).  ...  Of the total unique articles, 128 articles (55%) were posted by plastic surgeons and 106 (45%) were posted by non-plastic surgeons.  ...  An acceptable reading level is defined as no higher than the sixth-grade reading level by the National Institutes of Health and the American Medical Association.  ... 
doi:10.1097/01.gox.0000526220.41088.3d fatcat:2ukuzvoaqfd3voxo7g7zvsvrcm

Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset [article]

Ryan Amos, Gunes Acar, Elena Lucherini, Mihir Kshirsagar, Arvind Narayanan, Jonathan Mayer
2020 arXiv   pre-print
We find that, over the last twenty years, privacy policies have more than doubled in length and the median reading level, while already challenging, has increased modestly.  ...  The average readability score has risen from around 12 to around 13 over 20 years, corresponding to a college level. More popular websites have less readable policies.  ...  Document-level trends In order to understand how privacy policies have changed at a macroscopic level, we examine how length, readability, and updates have changed over time.  ... 
arXiv:2008.09159v2 fatcat:oevueqefcreojazb3qbz2rtnje

Obfuscated malicious javascript detection using classification techniques

Peter Likarish, Eunjin Jung, Insoon Jo
2009 2009 4th International Conference on Malicious and Unwanted Software (MALWARE)  
The crawls typically resulted between 5 and 7 megabytes of data, although by the time of the crawl, most of the exploit code had been removed.  ...  In order to validate this claim on the Internet, we used a new crawl of all domains blacklisted at, and extracted all scripts from the crawl that were unique by MD5.  ... 
doi:10.1109/malware.2009.5403020 dblp:conf/malware/LikarishJJ09 fatcat:zn2wif2uzrbj5f5p2fjl7nuahu

Feature Extraction Techniques Using Semantic Based Crawler For Search Engine

Poonam P. Doshi, Dr. Emmanual M
2018 Zenodo  
Seed URLs - Seed URLs are the input URLs from the end user which the user wants to crawl. - Web Layer: Crawling - Actual crawling of those URLs given by end users will be happening in this module.  ...  Ontology learning based techniques can be used to solve the issue of semantic focused crawling, by learning new knowledge from crawled documents and integrating the new knowledge with ontologies in order  ... 
doi:10.5281/zenodo.1468203 fatcat:reisnwuc7bh23emjacrkpd7op4

SpeedReader: Reader Mode Made Fast and Private [article]

Mohammad Ghasemisharif, Peter Snyder, Andrius Aucinas, Benjamin Livshits
2018 arXiv   pre-print
Most popular web browsers include "reader modes" that improve the user experience by removing un-useful page elements.  ...  For Twitter, we extracted shared links from the top 10 worldwide Twitter trends by crawling and extracting shared links from their Tweets. RSS / feed readers.  ...  We build a list of RSS-shared content by crawling the Alexa 1K, identifying websites that included RSS feeds, and fetching the five most recent pages of content in each RSS feed.  ... 
arXiv:1811.03661v1 fatcat:msy6uawepbf2hgaedula7zghgq

Evaluating Web Site Structure Based on Navigation Profiles and Site Topology [chapter]

Alberto Simões, Anália Lourenço, José João Almeida
2013 Advances in Intelligent Systems and Computing  
Acknowledgments This work is funded by ERDF - European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT - Fundação  ...  Depending on previous visits and Web site indexing, visits can be initiated at any level and may look into different (related or non-related) levels of contents.  ...  These parameters specialise broad crawling in order to prevent crawling into irrelevant or problematic areas. For example, Concurrent Version System (CVS) access points and mailing list archives.  ... 
doi:10.1007/978-3-642-36981-0_29 fatcat:25vtv4odgvc4xfmj65nybhtxdm

On Gobbledygook and Mood of the Philippine Senate: An Exploratory Study on the Readability and Sentiment of Selected Philippine Senators' Microposts [article]

Fatima M. Moncada, Jaderick P. Pabico
2015 arXiv   pre-print
This could mean that a senator's tweet sentiment is affected by specific Philippine-based events.  ...  This paper presents the findings of a readability assessment and sentiment analysis of selected six Philippine senators' microposts over the popular Twitter microblog.  ...  This research effort is funded partly by and was conducted at the Research Collaboratory for Advanced Intelligent Systems, Institute of Computer Science, University of the Philippines Los Baños, College  ... 
arXiv:1508.01321v1 fatcat:c3k5szxwwzforibzrd4dk4kpdm

Crawling Twitter Data [chapter]

Shamanth Kumar, Fred Morstatter, Huan Liu
2013 SpringerBriefs in Computer Science  
The way to collect SNS data as well as tweets is handled by crawlers. Twitter crawler has recently emerged as a great tool to crawl Twitter data as well as tweets.  ...  We also develop crawling strategies to efficiently extract tweets in terms of time and amount.  ...  Fig. 6. Crawling tweets by keyword. Fig. 7. Extracting tweets from database by keyword. Fig. 8. Deleting tweets from a database by keyword.  ... 
doi:10.1007/978-1-4614-9372-3_2 fatcat:jedr5eybd5gzbmkjermtub26bu

Ontological Paradigm for Focused Crawling based on Lexical Analysis

Nidhi Sharma, Atul Srivastava
2013 International Journal of Computer Applications  
The semantic web is a synergetic movement led by the international standards body, the WWW Consortium (W3C). It aims at converting the current web dominated by unstructured and semi-structured documents into  ...  Here two techniques of semantic web crawling are reviewed: one is ontology based and the other is based on a lexical database. For this, an architecture has been proposed which is a combination of the above two techniques  ...  web pages by inserting machine-readable metadata about pages and how they are related to each other, enabling automated agents to access the Web more intelligently and perform tasks on behalf of users  ... 
doi:10.5120/10086-4715 fatcat:jgtfwf6ql5ac7py35nsk26ezja

The weaponization of web archives: Data craft and COVID-19 publics

Amelia Acker, Mitch Chaiet
2020 Harvard Kennedy School Misinformation Review  
After identifying conspiracy content that has been archived by human actors with the Wayback Machine, we report on user patterns of "screensampling," where images of archived misinformation are spread  ...  information even after it has been moderated and fact-checked, for some individuals, will give health misinformation and conspiracy theories more traction because it has been labeled as specious content by  ...  Funding This research was made possible by a grant from the Institute of Museum and Library Services. The project's grant number is RE-07-18-0008-18.  ... 
doi:10.37016/mr-2020-41 fatcat:ixylc7g5v5diffk2cfqyhtuxse