297 Hits in 3.1 sec

Vectorization-aware loop unrolling with seed forwarding

Rodrigo C. O. Rocha, Vasileios Porpodas, Pavlos Petoumenos, Luís F. W. Góes, Zheng Wang, Murray Cole, Hugh Leather
2020 Proceedings of the 29th International Conference on Compiler Construction  
VALU also forwards the vectorizable code to SLP, allowing it to bypass its greedy search for vectorizable seed instructions, exposing more vectorization opportunities.  ...  Loop unrolling is a widely adopted loop transformation, commonly used for enabling subsequent optimizations. Straight-line code vectorization (SLP) is an optimization that benefits from unrolling.  ...  Vectorization-Aware Loop Unrolling In this section, we describe our vectorization-aware loop unrolling (VALU).  ... 
doi:10.1145/3377555.3377890 dblp:conf/cc/RochaPPG0CL20 fatcat:urkgtdgxfzfjzcwudfkfdn4uz4
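The interaction this snippet describes — unrolling a loop so that the resulting group of isomorphic scalar statements becomes an obvious SLP pack — can be sketched as follows. This is an illustrative Python sketch only, not VALU's actual implementation; the `saxpy` kernel and the unroll factor of 4 are assumptions.

```python
def saxpy_rolled(a, x, y):
    # original scalar loop: one statement per iteration, no obvious SLP seed
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def saxpy_unrolled4(a, x, y):
    # unrolled by 4: the four isomorphic statements per iteration form the
    # kind of seed group an SLP vectorizer can pack into one 4-wide operation
    n = len(x)
    out = [0.0] * n
    i = 0
    while i + 4 <= n:
        out[i]     = a * x[i]     + y[i]
        out[i + 1] = a * x[i + 1] + y[i + 1]
        out[i + 2] = a * x[i + 2] + y[i + 2]
        out[i + 3] = a * x[i + 3] + y[i + 3]
        i += 4
    while i < n:  # scalar epilogue for leftover iterations
        out[i] = a * x[i] + y[i]
        i += 1
    return out
```

Both versions compute the same result; the unrolled form simply exposes the packing opportunity that VALU, per the abstract, forwards to SLP directly.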

Insufficient Vectorization: A New Method to Exploit Superword Level Parallelism

Wei GAO, Lin HAN, Rongcai ZHAO, Yingying LI, Jian LIU
2017 IEICE transactions on information and systems  
Second, the methods of computing inter-iteration and intra-iteration SIMD parallelism for loops are put forward.  ...  Because all the slots which the vector register provides must be used, the chances of vectorizing programs with low SIMD parallelism are abandoned by the sufficient vectorization method.  ...  We test loop-aware, VMSP, and VMSP with unrolling. Unroll counts are generated heuristically by the compiler.  ... 
doi:10.1587/transinf.2016edp7236 fatcat:h74uyavipjd5tfbkkeviisnjsq

A Source Transformation via Operator Overloading Method for the Automatic Differentiation of Mathematical Functions in MATLAB

Matthew J. Weinstein, Anil V. Rao
2016 ACM Transactions on Mathematical Software  
The approach is demonstrated on several examples and is found to be highly efficient when compared with well-known MATLAB automatic differentiation programs.  ...  ADiMat was supplied the compressed seed matrix in both the scalar forward and non-overloaded vector forward modes, and MAD was used in the compressed forward mode and the sparse forward mode.  ...  [flattened table fragment: program sizes (kB) and CPU(Jf)/CPU(f) ratios with rolled vs. unrolled loops for various N]  ... 
doi:10.1145/2699456 fatcat:j5lyehiwbncbhld6syu66vnvma

Pseudo-Random Number Generator Verification: A Case Study [chapter]

Felix Dörre, Vladimir Klebanov
2016 Lecture Notes in Computer Science  
We show how to specify PRNG seeding with information flow contracts from KeY's extension to the Java Modeling Language (JML) and report our experiences in verifying the actual implementation.  ...  A programming error affecting the information flow in the seeding code of the generator has weakened the security of the cryptographic protocol behind bitcoin transactions.  ...  For loop- and recursion-free programs, symbolic execution is performed in a fully automated manner. Loops can either be unrolled or abstracted by a user-provided loop invariant.  ... 
doi:10.1007/978-3-319-29613-5_4 fatcat:mvoovrzasfb6tgay353hel6rsy

Halide

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe
2013 SIGPLAN notices  
and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution.  ...  They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns.  ...  Vectorization and unrolling passes replace loops of constant size k scheduled as vectorized or unrolled with the corresponding k-wide vector code or k copies of the loop body.  ... 
doi:10.1145/2499370.2462176 fatcat:afs2mud2unentdmcazyg2qhiqq

Halide

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe
2013 Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation - PLDI '13  
and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution.  ...  They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns.  ...  Vectorization and unrolling passes replace loops of constant size k scheduled as vectorized or unrolled with the corresponding k-wide vector code or k copies of the loop body.  ... 
doi:10.1145/2491956.2462176 dblp:conf/pldi/Ragan-KelleyBAPDA13 fatcat:tr3fzvh5arbbbo4nn2iqpivdaa
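The lowering described in the Halide snippets — a loop of constant size k scheduled as vectorized or unrolled becomes one k-wide vector statement or k copies of the body — can be mimicked with a toy pass. This is a sketch only: the statement-template mechanism is an assumption, not Halide's IR, and `ramp(0,1,k)` merely borrows Halide's dense-vector index notation.

```python
def lower_constant_loop(k, body_stmt, mode):
    """Toy lowering pass for a loop of constant extent k.

    'unroll'    -> k copies of the body, one per loop index
    'vectorize' -> a single statement over a k-wide vector index
    """
    if mode == "unroll":
        return [body_stmt.format(i=i) for i in range(k)]
    if mode == "vectorize":
        # ramp(base, stride, lanes) denotes the dense vector of indices
        return [body_stmt.format(i="ramp(0,1,%d)" % k)]
    raise ValueError("unknown schedule: %r" % mode)
```

For example, unrolling emits four scalar statements while vectorizing emits one 4-wide statement over the same body template.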

IR2Vec: LLVM IR based Scalable Program Embeddings [article]

S. VenkataKeerthy, Rohit Aggarwal, Shalini Jain, Maunendra Sankar Desarkar, Ramakrishna Upadrasta, Y. N. Srikant
2020 arXiv   pre-print
Symbolic encodings are obtained from the seed embedding vocabulary, and Flow-Aware encodings are obtained by augmenting the Symbolic encodings with the flow information.  ...  Using this infrastructure, we propose two incremental encodings: Symbolic and Flow-Aware.  ...  Using these learned seed embeddings, hierarchical vectors for the new programs are formed. To represent Instruction vectors, we propose two flavors of encodings: Symbolic and Flow-Aware.  ... 
arXiv:1909.06228v3 fatcat:nmrwcya6ejfp7cj23kbtbltl5y

Efficient SIMD code generation for irregular kernels

Seonggun Kim, Hwansoo Han
2012 SIGPLAN notices  
Due to those challenges, existing SIMD compilers have excluded loops with array indirection from their candidate loops for SIMD vectorization.  ...  In this work, we propose a method to generate efficient SIMD code for loops containing indirected memory references.  ...  After the DFG is finalized, the loop is unrolled as many times as vector length by replicating all nodes except vector nodes.  ... 
doi:10.1145/2370036.2145824 fatcat:xgx5rhqc3ra2vmxfmukmu2a4se
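The strategy in this snippet — unrolling the loop as many times as the vector length so the replicated indirect loads can be packed into a vector gather — can be sketched in pure Python. Illustrative only: the kernel shape, `vl`, and all names are assumptions, not the paper's code.

```python
def indirect_sum_scalar(x, idx, y):
    # classic irregular kernel: the load of x is indirected through idx
    return [y[i] + x[idx[i]] for i in range(len(idx))]

def indirect_sum_unrolled(x, idx, y, vl=4):
    # unrolled vl times: the vl replicated loads map to one vector gather,
    # and the vl adds map to one vector add
    n = len(idx)
    out = []
    i = 0
    while i + vl <= n:
        gathered = [x[idx[i + l]] for l in range(vl)]            # would become a vector gather
        out.extend(y[i + l] + gathered[l] for l in range(vl))    # would become a vector add
        i += vl
    while i < n:  # scalar remainder loop
        out.append(y[i] + x[idx[i]])
        i += 1
    return out
```

The two versions agree element for element; the unrolled form just groups the indirect accesses into vector-length chunks.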

FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA

Shehzeen Hussain, Mojan Javaheripi, Paarth Neekhara, Ryan Kastner, Farinaz Koushanfar
2019 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)  
Our model uses a fully parameterized parallel architecture for fast matrix-vector multiplication that enables per-layer customized latency fine-tuning for further throughput improvement.  ...  We use loop unrolling factor = 8 for the inner loop of our dot product and also the queue update operations.  ...  After some experimentation, we found that Loop Unrolling outperforms pipelining in terms of both resource utilization and throughput for fixed point data-types.  ... 
doi:10.1109/iccad45719.2019.8942122 dblp:conf/iccad/HussainJNKK19 fatcat:s6jpod255jdjjef73dqjoe6vf4

Recurrent Pixel Embedding for Instance Grouping

Shu Kong, Charless Fowlkes
2018 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition  
We unroll the recurrent grouping module into T loops, and accumulate the same loss function at the unrolled loop-t: $\ell_t = \sum_{k=1}^{N} \sum_{i,j \in S_k} \frac{w_i^k w_j^k}{|S_k|} \big[ \mathbf{1}_{\{y_i = y_j\}} (1 - s_{ij}^t) + \mathbf{1}_{\{y_i \neq y_j\}} [s_{ij}^t - \alpha]_+ \big]$. We note that gradient magnitudes grow with the iteration  ...  Figure 5: We compare the embedding vector gradients backpropagated through zero or one iteration of mean shift grouping.  ... 
doi:10.1109/cvpr.2018.00940 dblp:conf/cvpr/KongF18a fatcat:v5rgn7rtl5bhdpggrdzetkm7e4
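The pairwise loss this snippet accumulates at each unrolled loop can be sketched in plain Python. This is an illustrative reimplementation under assumed data structures — segments S_k as index lists, per-segment weights w, labels y, a symmetric similarity matrix s, and margin alpha — not the authors' code.

```python
def unrolled_loop_loss(S, w, y, s, alpha):
    """Loss at one unrolled loop t: attraction (1 - s_ij) for same-label
    pairs within a segment S_k, hinged repulsion [s_ij - alpha]_+ for
    different-label pairs, each weighted by w_i^k * w_j^k / |S_k|."""
    total = 0.0
    for k, segment in enumerate(S):
        for i in segment:
            for j in segment:
                weight = w[k][i] * w[k][j] / len(segment)
                if y[i] == y[j]:
                    total += weight * (1.0 - s[i][j])            # pull similar pairs together
                else:
                    total += weight * max(s[i][j] - alpha, 0.0)  # push dissimilar pairs below alpha
    return total
```

Accumulating this loss at every unrolled loop t = 1..T is what supervises each iteration of the recurrent grouping module.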

Co-synthesis of FPGA-based application-specific floating point simd accelerators

Andrei Hagiescu, Weng-Fai Wong
2011 Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays - FPGA '11  
We describe an automated co-design toolchain that generates code and application-specific platform extensions that implement SIMD instructions with a parameterizable number of vector elements.  ...  The parallelism exposed by encapsulating computation in vector instructions is matched to an adjustable pool of execution units.  ...  In this model, the unrolled version takes N · (t_M + t_I) cycles to complete, while a single iteration of the loop with longer vectors takes N · t_M + t_I cycles.  ... 
doi:10.1145/1950413.1950459 dblp:conf/fpga/HagiescuW11 fatcat:6n4yqb4fafaqjfwxulv3qgw7am
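The two cycle counts quoted in this snippet can be written out as a small model (a sketch; t_M and t_I stand for the excerpt's per-element memory and instruction latencies):

```python
def cycles_unrolled(N, t_m, t_i):
    # unrolled version: each of the N elements pays both the memory
    # latency t_m and the instruction latency t_i
    return N * (t_m + t_i)

def cycles_long_vector(N, t_m, t_i):
    # single loop iteration over a longer vector: the instruction
    # latency t_i is paid once and amortized across all N elements
    return N * t_m + t_i
```

The long-vector form is cheaper whenever N > 1, since (N - 1) · t_i of instruction latency is amortized away.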

Instruction-level parallel processing: History, overview, and perspective

B. Ramakrishna Rau, Joseph A. Fisher
1993 Journal of Supercomputing  
... over the unrolled loop body.  ...  This is important for trace scheduling unrolled loops as well [142].  ... 
doi:10.1007/bf01205181 fatcat:v7uhz4km5ndxzhr7baybks2bn4

Instruction-Level Parallel Processing: History, Overview, and Perspective [chapter]

B. Ramakrishna Rau, Joseph A. Fisher
1993 Instruction-Level Parallelism  
... over the unrolled loop body.  ...  This is important for trace scheduling unrolled loops as well [142].  ... 
doi:10.1007/978-1-4615-3200-2_3 fatcat:eg7nutqurffxfj2y62g5lfc57m

Recurrent Pixel Embedding for Instance Grouping [article]

Shu Kong, Charless Fowlkes
2017 arXiv   pre-print
We unroll the recurrent grouping module into T loops, and accumulate the same loss function at the unrolled loop-t: $\ell_t = \sum_{k=1}^{M} \sum_{i,j \in S_k} \frac{w_i^k w_j^k}{|S_k|} \big[ \mathbf{1}_{\{y_i = y_j\}} (1 - s_{ij}^t) + \mathbf{1}_{\{y_i \neq y_j\}} [s_{ij}^t - \alpha]_+ \big]$  ...  Details about the derivation  ...  Figure 5: To analyze the recurrent mean shift grouping module, we compare the embedding vector gradients with and without one loop of grouping.  ...  A single loss with more loops of GBMS provides a greater gradient than that with fewer loops to update the data, as seen in (g).  ... 
arXiv:1712.08273v1 fatcat:77ohxblx3vgpjnwr5loalgqeam

BasicBlocker: ISA Redesign to Make Spectre-Immune CPUs Faster [article]

Jan Philipp Thoma, Jakob Feldtkeller, Markus Krausz, Tim Güneysu, Daniel J. Bernstein
2021 arXiv   pre-print
Finally, each of these loops is marked in st-opt with an explicit UNROLL(4) or UNROLL(2), where UNROLL uses existing compiler features to control the amount of unrolling.  ...  inlining, loop unrolling).  ... 
arXiv:2007.15919v2 fatcat:d2ejjqtr7rhuhfyn4gtjcf45hq
Showing results 1 — 15 out of 297 results