
Lazy preservation

Frank McCown, Joan A. Smith, Michael L. Nelson
2006 Proceedings of the Eighth ACM International Workshop on Web Information and Data Management - WIDM '06
We evaluate the effectiveness of lazy preservation by reconstructing 24 websites of varying sizes and composition using Warrick, a web-repository crawler.  ...  We introduce "lazy preservation": digital preservation performed as a result of the normal operation of web crawlers and caches.  ...  We explore the effectiveness of lazy preservation by downloading 24 websites of various sizes and subject matter and reconstructing them using a web-repository crawler named Warrick, which recovers missing  ...
doi:10.1145/1183550.1183564 dblp:conf/widm/McCownSN06 fatcat:n6wwumbgyvgq5ivrvkfz74tziu

Reconstructing Websites for the Lazy Webmaster [article]

Frank McCown, Joan A. Smith, Michael L. Nelson, Johan Bollen
2005 arXiv   pre-print
We introduce the concept of "lazy preservation": digital preservation performed as a result of the normal operations of the Web infrastructure (search engines and caches).  ...  In the face of complete website loss, "lazy" webmasters or concerned third parties may be able to recover some of their website from the Internet Archive.  ...  If a web repository will not or cannot crawl it, Warrick cannot recover it. More significantly, Warrick can only reconstruct the external view of a website as viewed by a web crawler.  ...
arXiv:cs/0512069v1 fatcat:nkqs7egz3fcg7bia7egzycbiom

Using the web infrastructure to preserve web pages

Michael L. Nelson, Frank McCown, Joan A. Smith, Martin Klein
2007 International Journal on Digital Libraries  
To date, most of the focus regarding digital preservation has been on replicating copies of the resources to be preserved from the "living web" and placing them in an archive for controlled curation.  ...  The common characteristic of the projects is that they creatively employ the web infrastructure to provide shallow but broad preservation capability for all web pages.  ...  Acknowledgements Johan Bollen (Los Alamos National Laboratory) contributed to the initial development of the "lazy" and "just-in-time" preservation models.  ...
doi:10.1007/s00799-007-0012-y fatcat:5ufnwywctfbrrheo65zxqrz22q

PrivacyMeter: Designing and Developing a Privacy-Preserving Browser Extension [chapter]

Oleksii Starov, Nick Nikiforakis
2018 Lecture Notes in Computer Science  
...  score for any website that a user is visiting.  ...  This score is computed based on each website's privacy practices and how these compare to the privacy practices of other pre-analyzed websites.  ...  the last time that the site was crawled.  ...
doi:10.1007/978-3-319-94496-8_6 fatcat:mixyiqh7a5fyjm2pkx6kiar6fe

Site Design Impact on Robots

Joan A. Smith, Michael L. Nelson
2008 D-Lib Magazine  
Reconstructing Websites for the Lazy Webmaster. In Proceedings of the Eighth ACM International Workshop on Web Information and Data Management (WIDM'06), November 2006, pages 67-74.  ...  Even those who have their own established customer base may want to examine ways to increase crawler penetration, since search engines cache much of their content (providing a type of "lazy preservation  ...
doi:10.1045/march2008-smith fatcat:enxc2zujprcubbpwj5fe4mhnqm

Usage analysis of a public website reconstruction tool

Frank McCown, Michael L. Nelson
2008 Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries - JCDL '08  
Over the last six months, 520 individuals have reconstructed more than 700 websites with 800K resources from the Web Infrastructure.  ...  The Web is increasingly the medium by which information is published today, but due to its ephemeral nature, web pages and sometimes entire websites are often "lost" due to server crashes, viruses, hackers  ...  We call this preservation service lazy preservation. Lazy preservation requires no work from the content producer, but it offers no quality-of-service guarantees.  ... 
doi:10.1145/1378889.1378955 dblp:conf/jcdl/McCownN08a fatcat:q46xbxg36zachlfcoopcxxaaee

A framework for describing web repositories

Frank McCown, Michael L. Nelson
2009 Proceedings of the 2009 joint international conference on Digital libraries - JCDL '09  
In prior work we have demonstrated that search engine caches and archiving projects like the Internet Archive's Wayback Machine can be used to "lazily preserve" websites and reconstruct them when they  ...  We use the term "web repositories" for collections of automatically refreshed and migrated content, and collectively we refer to these repositories as the "web infrastructure".  ...  Because the archive's goal is to preserve the Web as it was found, web archives are useful when reconstructing lost websites.  ... 
doi:10.1145/1555400.1555456 dblp:conf/jcdl/McCownN09a fatcat:nfeba62pxnaethyhteopv3kpo4

Why web sites are lost (and how they're sometimes found)

Frank McCown, Catherine C. Marshall, Michael L. Nelson
2009 Communications of the ACM  
The survey results provide several implications for personal digital preservation, for the WI, and for lazy preservation tools like Warrick.  ...  We call this after-loss recovery "lazy preservation". Warrick can only recover what is accessible to the WI, namely the crawlable Web.  ... 
doi:10.1145/1592761.1592794 fatcat:65avx4gd2ve4dgh2f7gu7u2juy

Research on the Methods and Key Techniques of Web Archive Oriented Social Media Information Collection

Xinping Huang
2021 Journal of Web Engineering  
Social media information collection and preservation is a hot issue in the field of Web Archive.  ...  To address the restrictions that social websites impose on API call frequency, the paper provides solutions, for example using a multiplexing mechanism and a naive Bayesian algorithm  ...  Acknowledgments This research is funded by the National Social Science Foundation of China, grant number 18CTQ040.  ...
doi:10.13052/jwe1540-9589.20812 fatcat:jyyztztuz5ehtnv22nidjq55lu

ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation

Helge Holzmann, Vinay Goel, Avishek Anand
2016 Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries - JCDL '16  
Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing.  ...  held by Web archiving institutions.  ...  As an example, CDX files are available for the crawls provided by the Common Crawl initiative.  ...
doi:10.1145/2910896.2910902 dblp:conf/jcdl/HolzmannGA16 fatcat:qx5alzzaufhu5mjqauf3fjvrcq

Preserving the Quality of Architectural Tactics in Source Code

Mehdi Mirakhorli, Jane Cleland-Huang
In any complex software system, strong interdependencies exist between requirements and software architecture.  ...  Requirements drive architectural choices while also being constrained by the existing architecture and by what is economically feasible.  ...  These parameters are reported at (WEBSITE).  ...

On Identifying the Bounds of an Internet Resource

Faryaneh Poursardar, Frank Shipman
2016 Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval - CHIIR '16  
For URLs that point to the first page of a document that has been broken up over multiple pages, users are likely to consider the  ...  How does the archiving site need to prepare the website for preservation? How can metadata be derived for Web resources?  ...
doi:10.1145/2854946.2854982 dblp:conf/chiir/PoursardarS16 fatcat:ib2o4ou6wrgb3d6justzg5tj3i

Concepts and tools for the effective and efficient use of web archives [article]

Helge Holzmann
We address both angles: 1. by presenting a retrospective analysis of crawl metadata on the size, age and growth of a Web dataset, 2. by proposing a programming framework for efficiently processing archival  ...  The third perspective is what we call the graph-centric view. Here, websites, pages or extracted facts are considered nodes in a graph. Links among pages or the extracted information [...]  ...  Most likely, many universities even had a website before 1996, but only got picked up by the crawlers later.  ...
doi:10.15488/4436 fatcat:rgjuppdyjrea7jqhifxfzdvy6m

Directed test generation and analysis for web applications [article]

Amin Milani Fard
We evaluated the presented techniques by conducting various empirical studies and comparisons. The evaluation results point to the effectiveness of [...]  ...  The work presented in this dissertation has focused on advancing the state-of-the-art in testing and maintaining web applications by proposing a new set of techniques and tools.  ...  In our work we did not intend to increase the code coverage by considering the problem of input generation for the crawler and only focused on improving the crawling strategy.  ... 
doi:10.14288/1.0340953 fatcat:sr433fn2obczbm2hzz437qojsi

Building Machine Translation Systems for the Next Thousand Languages [article]

Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod (+12 others)
2022 arXiv   pre-print
...  and monolingual datasets for an additional 1000+ languages; and (iii) Studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT  ...  We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven  ...  translation website for Ethiopian languages.  ...
arXiv:2205.03983v3 fatcat:65fva7qvpbaapemrrmovgkpac4
Showing results 1 — 15 out of 44 results