Donag: Generating Efficient Patches and Diffs for Compressed Archives

by Michael J. May

Published in ACM Transactions on Storage by Association for Computing Machinery (ACM).

2022  

Abstract

Differencing between compressed archives is a common task in file management and synchronization. Applications include source code distribution, application updates, and document synchronization. General purpose binary differencing tools can create and apply patches to compressed archives, but don't consider the internal structure of the compressed archive or the file lifecycle. Therefore, they miss opportunities to save space based on the archive's internal structure and metadata. To address the gap, we develop a content-aware, format independent theory for differencing on compressed archives and propose a canonical form and digest for compressed archives. Based on them, we present Donag, a content-aware differencing and patching algorithm that produces smaller patches than general purpose binary differencing tools on versioned archives by exploiting the compressed archives' internal structure. Donag uses the VCDiff and BSDiff engines internally. We compare Donag's patches to ones produced by bsdiff, xdelta3, and Delta++ on three classes of compressed archives: open-source code repositories, large and small applications, and office productivity documents (DOCX, XLSX, PPTX). Donag's patches are typically 10% to 89% smaller than those produced by bsdiff, xdelta3, and Delta++, with reasonable memory overhead and throughput on commodity hardware. In the worst case, Donag's patches are negligibly larger.

Archived Files and Locations

application/pdf   2.8 MB
file_af5ryddu7bhvfpzkcozesobefe
dl.acm.org (publisher)
web.archive.org (webarchive)
Preserved and Accessible
Type  article-journal
Stage   published
Date   2022-07-27
Language   en
Container Metadata
Not in DOAJ
In Keepers Registry
ISSN-L:  1553-3077
Work Entity
Access all versions, variants, and formats of this work (e.g., pre-prints)
Catalog Record
Revision: 6d71064b-f14e-4ffc-8c96-3bb55d513bad