Decoupling algorithms from schedules for easy optimization of image processing pipelines

Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, Frédo Durand
2012, ACM Transactions on Graphics
Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm, with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism. We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions. Imaging pipelines are compositions of functions. Programmers separately specify scheduling strategies for the various functions composing the algorithm, which allows them to efficiently explore different optimizations without changing the algorithmic code. We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain specific language called Halide, and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations.

Computational photography algorithms require highly efficient implementations to be used in practice, especially on power-constrained mobile devices. This is not a simple matter of programming in a low-level language like C. The performance difference between naive C and highly optimized C is often an order of magnitude. Unfortunately, optimization usually comes at the cost of programmer pain and code complexity, as computation must be reorganized to achieve memory efficiency and parallelism.
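The separation between algorithm and schedule described above can be illustrated without Halide itself. The following is a minimal sketch in plain C++, assuming a two-stage 3x3 box blur as the algorithm; the names `Fn`, `blur_x`, `blur_y`, `realize_inline`, and `realize_stored` are hypothetical and are not part of Halide's API. The algorithm is written once as a composition of pure functions over the integer grid, and two different "schedules" (recompute the intermediate stage at every use, or store it in a buffer first) evaluate it to the same result.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical illustration only; this is not Halide's API.
// An image is modeled as a pure function over the infinite integer domain.
using Fn = std::function<int(int, int)>;

// Algorithm: says what is computed, with no storage or loop order.
inline Fn blur_x(Fn in) {
    return [in](int x, int y) { return (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3; };
}
inline Fn blur_y(Fn in) {
    return [in](int x, int y) { return (in(x, y - 1) + in(x, y) + in(x, y + 1)) / 3; };
}

// Schedule A: recompute blur_x at every use (no intermediate storage).
inline std::vector<int> realize_inline(Fn in, int w, int h) {
    assert(w > 0 && h > 0);
    Fn out = blur_y(blur_x(in));
    std::vector<int> result(w * h);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            result[y * w + x] = out(x, y);
    return result;
}

// Schedule B: compute blur_x once into a buffer covering rows -1..h,
// then let blur_y read from that buffer.
inline std::vector<int> realize_stored(Fn in, int w, int h) {
    assert(w > 0 && h > 0);
    Fn bx = blur_x(in);
    std::vector<int> tmp(w * (h + 2));
    for (int y = -1; y <= h; y++)
        for (int x = 0; x < w; x++)
            tmp[(y + 1) * w + x] = bx(x, y);
    Fn stored = [&tmp, w](int x, int y) { return tmp[(y + 1) * w + x]; };
    Fn out = blur_y(stored);
    std::vector<int> result(w * h);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            result[y * w + x] = out(x, y);
    return result;
}
```

Both functions take the same algorithm definition and differ only in where the intermediate values live. Schedule A does roughly three times the work on the first stage, while schedule B trades that for a buffer of `w * (h + 2)` values. Calling both on the same input function and comparing the returned vectors is a quick way to check that a change of schedule did not change the result.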
(a) Clean C++: 9.94 ms per megapixel

void blur(const Image &in, Image &blurred) {
  Image tmp(in.width(), in.height());

  for (int y = 0; y < in.height(); y++)
    for (int x = 0; x < in.width(); x++)
      tmp(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y))/3;

  for (int y = 0; y < in.height(); y++)
    for (int x = 0; x < in.width(); x++)
      blurred(x, y) = (tmp(x, y-1) + tmp(x, y) + tmp(x, y+1))/3;
}

(b) Fast C++ (for x86): 0.90 ms per megapixel

void fast_blur(const Image &in, Image &blurred) {
  __m128i one_third = _mm_set1_epi16(21846);
  #pragma omp parallel for
  for (int yTile = 0; yTile < in.height(); yTile += 32) {
    __m128i a, b, c, sum, avg;
    __m128i tmp[(256/8) * (32+2)];
    for (int xTile = 0; xTile < in.width(); xTile += 256) {
      __m128i *tmpPtr = tmp;
      for (int y = -1; y < 32+1; y++) {
        const uint16_t *inPtr = &(in(xTile, yTile+y));
        for (int x = 0; x < 256; x += 8) {
          a = _mm_loadu_si128((__m128i *)(inPtr-1));
          b = _mm_loadu_si128((__m128i *)(inPtr+1));
          c = _mm_load_si128((__m128i *)(inPtr));
          sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
          avg = _mm_mulhi_epi16(sum, one_third);
          _mm_store_si128(tmpPtr++, avg);
          inPtr += 8;
        }
      }
      tmpPtr = tmp;
      for (int y = 0; y < 32; y++) {
        __m128i *outPtr = (__m128i *)(&(blurred(xTile, yTile+y)));
        for (int x = 0; x < 256; x += 8) {
          a = _mm_load_si128(tmpPtr+(2 * 256)/8);
          b = _mm_load_si128(tmpPtr+256/8);
          c = _mm_load_si128(tmpPtr++);
          sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
          avg = _mm_mulhi_epi16(sum, one_third);
          _mm_store_si128(outPtr++, avg);
        }
      }
    }
  }
}

(c) Halide: 0.90 ms per megapixel

Func halide_blur(Func in) {
  Func tmp, blurred;
  Var x, y, xi, yi;
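The constant 21846 in listing (b) stands in for the division by 3 of listing (a). `_mm_mulhi_epi16` returns the high 16 bits of each signed 16x16-bit product, so multiplying by 21846 = (65536 + 2)/3 and keeping the high half approximates `sum / 3`. The sketch below is a scalar model of a single lane, with the hypothetical helper names `mulhi16` and `div3_fixed`. It assumes the sum fits in a non-negative `int16_t` (0 to 32767), the range in which the result matches integer division exactly.

```cpp
#include <cstdint>

// Scalar model of one lane of _mm_mulhi_epi16: the high 16 bits of a
// signed 16x16-bit product. Helper names are illustrative only.
inline int16_t mulhi16(int16_t a, int16_t b) {
    int32_t product = static_cast<int32_t>(a) * static_cast<int32_t>(b);
    // For the non-negative inputs assumed here, >> 16 is a plain division by 65536.
    return static_cast<int16_t>(product >> 16);
}

// Fixed-point division by 3: 21846 / 65536 is slightly more than 1/3,
// which makes the truncated result equal to sum / 3 for 0 <= sum <= 32767.
inline int16_t div3_fixed(int16_t sum) {
    return mulhi16(sum, 21846);
}
```

Writing `sum = 3q + r`, the product is `65536q + 2q + 21846r`, and the low-order term `2q + 21846r` stays below 65536 for every `sum` in the assumed range, so the high half is exactly `q`. Negative sums, or sums that wrap around in 16-bit addition, fall outside this sketch.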
doi:10.1145/2185520.2185528