Donag: Generating Efficient Patches and Diffs for Compressed Archives

Michael J. May
2022 ACM Transactions on Storage  
Differencing between compressed archives is a common task in file management and synchronization. Applications include source code distribution, application updates, and document synchronization. General-purpose binary differencing tools can create and apply patches to compressed archives, but they do not consider the internal structure of the compressed archive or the file lifecycle. Therefore, they miss opportunities to save space based on the archive's internal structure and metadata. To address the gap, we develop a content-aware, format-independent theory for differencing on compressed archives and propose a canonical form and digest for compressed archives. Based on them, we present Donag, a content-aware differencing and patching algorithm that produces smaller patches than general-purpose binary differencing tools on versioned archives by exploiting the compressed archives' internal structure. Donag uses the VCDiff and BSDiff engines internally. We compare Donag's patches to ones produced by bsdiff, xdelta3, and Delta++ on three classes of compressed archives: open-source code repositories, large and small applications, and office productivity documents (DOCX, XLSX, PPTX). Donag's patches are typically 10% to 89% smaller than those produced by bsdiff, xdelta3, and Delta++, with reasonable memory overhead and throughput on commodity hardware. In the worst case, Donag's patches are negligibly larger.
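
To illustrate why an archive's internal structure matters for differencing, here is a minimal Python sketch. It is not Donag's algorithm, canonical form, or digest, none of which the abstract specifies. It assumes ZIP archives and hashes each member's decompressed content, so that unchanged members can be recognized even when their compressed bytes differ. The names `member_digests`, `classify_members`, and `make_zip` are hypothetical helpers introduced for this example.

```python
import hashlib
import io
import zipfile


def member_digests(archive_bytes):
    """Map each member name to the SHA-256 of its decompressed content."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {
            info.filename: hashlib.sha256(zf.read(info)).hexdigest()
            for info in zf.infolist()
            if not info.is_dir()
        }


def classify_members(old_bytes, new_bytes):
    """Sort member names into added, removed, changed, and unchanged."""
    old = member_digests(old_bytes)
    new = member_digests(new_bytes)
    common = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in common if old[n] != new[n]),
        "unchanged": sorted(n for n in common if old[n] == new[n]),
    }


def make_zip(files):
    """Build an in-memory ZIP from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()
```

In a scheme like this, only the members reported as changed or added would need to be passed to a binary delta engine; the rest could be referenced by name. A general-purpose binary differ sees only the compressed byte stream and cannot make that distinction directly.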
doi:10.1145/3507919 fatcat:cqwdyisjejhwzk5y3f3aa35vym