

Tomokatsu Takahashi, Hiroaki Shiokawa, Hiroyuki Kitagawa
2017 Proceedings of the 2nd International Workshop on Network Data Analytics - NDA'17  
Second, SCAN-XP effectively exploits 512 bit SIMD instructions implemented in the Intel Xeon Phi to speed up the density evaluations.  ...  In this paper, so as to address the above problem, we present a novel algorithm SCAN-XP that performs over Intel Xeon Phi.  ...  KNL is able to perform as a host CPU by using up to 72 physical cores; each core shows 1.3-1.5 GHz clock frequency with AVX-512 SIMD instruction set.  ... 
doi:10.1145/3068943.3068949 dblp:conf/sigmod/TakahashiSK17 fatcat:mqfleftsdrglvjefrzwxmm3wde

Polygonization of Implicit Surfaces on Multi-Core Architectures with SIMD Instructions [article]

Pourya Shirazian, Brian Wyvill, Jean-Luc Duprat
2012 Eurographics Symposium on Parallel Graphics and Visualization  
that organizes all these in a scene graph data structure called BlobTree.  ...  In this research we tackle the problem of rendering complex models which are created using implicit primitives, blending operators, affine transformations and constructive solid geometry in a design environment  ...  different SIMD instruction sets.  ... 
doi:10.2312/egpgv/egpgv12/089-098 fatcat:tdzfppatgvcjthasqbdknpgvpq

EmptyHeaded: A Relational Engine for Graph Processing [article]

Christopher R. Aberger, Susan Tu, Kunle Olukotun, Christopher Ré
2017 arXiv   pre-print
High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines.  ...  To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and data layouts that leverage single-instruction multiple data (SIMD) parallelism.  ...  C.2 Memory Usage We utilize a small amount of the available memory (1TB RAM) for the datasets run in this paper.  ... 
arXiv:1503.02368v7 fatcat:hlbgwo66wbe7bmavodpli3xfb4


Christopher R. Aberger, Susan Tu, Kunle Olukotun, Christopher Ré
2016 Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16  
High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines.  ...  To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and data layouts that leverage single-instruction multiple data (SIMD) parallelism.  ...  We use this as a means to speed up set intersection, which is the core operation in our approach to join processing.  ... 
doi:10.1145/2882903.2915213 pmid:28077912 pmcid:PMC5221635 dblp:conf/sigmod/AbergerTOR16 fatcat:gtq53m7ytzas7ixlqf7kkofbha

Efficient ray sorting for the tracing of incoherent rays

Jae-Ho Nah, Yun-Hye Jung, Woo-Chan Park, Tack-Don Han
2012 IEICE Electronics Express  
In this method, we use ray origin buckets and ray direction grids to reorder rays quickly. We implemented our approach on the Manta interactive ray tracer and achieved up to a 1.48× speedup.  ...  In order to accelerate the tracing of incoherent rays, we propose a simple sorting method to increase ray coherence.  ...  This scheme achieved an order-of-magnitude speed-up in terms of primary visibility because it substitutes per-ray-packet operations for expensive per-ray operations.  ... 
doi:10.1587/elex.9.849 fatcat:7efzr7f7zvdepb2weiyfp2en4y

SIMD^2: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM [article]

Yunan Zhang and Po-An Tsai and Hung-Wei Tseng
2022 arXiv   pre-print
SIMD^2 instructions accelerate eight more types of matrix operations, in addition to matrix multiplications.  ...  We find that many algorithms share the same structure and differ in only the core operation; for example, using add-minimum instead of multiply-add.  ...  This work was also supported by new faculty start-up funds from University of California, Riverside.  ... 
arXiv:2205.01252v2 fatcat:6r2fwmrwtfc2vhed6d3pwvxgm4

Algorithm optimizations and mapping scheme for interactive ray tracing on a reconfigurable architecture

Marcos Sanchez-Elez, Haitao Du, Nozar Tabrizi, Yun Long, Nader Bagherzadeh, Milagros Fernandez
2003 Computers & graphics  
We apply an SIMD octree traversal algorithm that supports ray traversals of any origins and directions.  ...  This paper presents a mapping scheme of an optimized octree-based ray tracing algorithm and its implementation on a SIMD reconfigurable architecture, MorphoSys, with appropriate hardware incorporated.  ...  It uses space partitioning structures such as octrees in order to speed up the ray traversal algorithm by reducing the set of objects tested (from complexity O(N) to O(logN), where N is the number of objects  ... 
doi:10.1016/s0097-8493(03)00143-2 fatcat:efcy6zpr5jdmhpu3l73dt5jtge

Accelerating mesh-based Monte Carlo method on modern CPU architectures

Qianqian Fang, David R. Kaeli
2012 Biomedical Optics Express  
Single Instruction Multiple Data (SIMD) based computation and branch-less design are exploited to accelerate ray-tetrahedron intersection tests and yield a 2-fold speed-up for ray-tracing calculations  ...  The combination of these techniques achieved an overall improvement of 22% in simulation speed as compared to using a non-SIMD implementation.  ...  SSE-accelerated ray-tetrahedron intersection tests We implemented two SSE-accelerated ray-triangle intersection methods in our software: an SSE ray-tracer based on the H&H method [17] and an SSE partial  ... 
doi:10.1364/boe.3.003223 pmid:23243572 pmcid:PMC3521306 fatcat:ud7tqbtx2rbcpb527zbxuchnui

Generating SIMD Instructions for Cerebras CS-1 using Polyhedral Compilation Techniques

Sven Verdoolaege, Manjunath Kudlur, Rob Schreiber, Harinath Kamepalli
2020 Zenodo  
In this intermediate code, the use of SIMD instructions is made explicit. The main focus of the paper is the generation of these CS-1 SIMD instructions for convolution style algorithms.  ...  In order to achieve optimal performance, it is crucial to use SIMD instructions as much as possible.  ...  Many of the test cases with lower speed-up use the size enumeration of Section 5.4, which has some overhead in selecting the SIMD configuration.  ... 
doi:10.5281/zenodo.4295954 fatcat:kc6e37mgmreb5pf4xbiaswfvpi

High throughput SAFT for an experimental USCT system as MATLAB implementation with use of SIMD CPU instructions

M. Zapf, G. F. Schwarzenberg, N. V. Ruiter, Stephen A. McAleavey, Jan D'hooge
2008 Medical Imaging 2008: Ultrasonic Imaging and Signal Processing  
The fastest found solution uses an SIMD enhanced assembler code wrapped in the C-interface of MATLAB. Additionally a 10 % speed up is gained by reducing the function call overhead.  ...  With 3.5 millions of acquired raw data sets and up to one billion voxels for an image, a reconstruction may last up to months. In this work a performance optimized SAFT algorithm is developed.  ...  Intel SSE4, 17 and • supporting 64 bit CPU capabilities, e.g. extended register sets with a potential speed up of factor two.  ... 
doi:10.1117/12.770443 fatcat:pc5t3jkl35a55pya7uwxtqopnu

Analysis of Cache Behavior and Performance of Different BVH Memory Layouts for Tracing Incoherent Rays [article]

Dominik Wodniok, Andre Schulz, Sven Widmer, Michael Goesele
2013 Eurographics Symposium on Parallel Graphics and Visualization  
We optimize the BVH layout using information gathered in a pre-processing pass applying a number of different BVH reordering techniques.  ...  While parallelization is trivial in theory, properties of real hardware make efficient parallelization difficult, especially when tracing incoherent rays.  ...  Using a packet size larger than the native SIMD width and different optimizations, they reported 3.3-10.7× speed-ups over the native SIMD packet size. Purcell et al.  ... 
doi:10.2312/egpgv/egpgv13/057-064 fatcat:5o2euacr4vbjhlgd26srr6zjga

Performance Measures for Evaluating Algorithms for SIMD Machines

L.J. Siegel, H.J. Siegel, P.H. Swain
1982 IEEE Transactions on Software Engineering  
The measures discussed and compared include execution time, speed, parallel efficiency, overhead ratio, processor utilization, redundancy, cost effectiveness, speed-up of the parallel algorithm over the  ...  This paper examines measures for evaluating the performance of algorithms for single instruction stream - multiple data stream (SIMD) machines.  ...  The complexity of SIMD algorithms is, in general, a function of the problem size (number of elements in the data set to be processed), machine size (number of PE's), and the interconnection network used  ... 
doi:10.1109/tse.1982.235426 fatcat:3663rpdas5h4zc47azjev24nxu


Jianguo Wang, Chunbin Lin, Ruining He, Moojin Chae, Yannis Papakonstantinou, Steven Swanson
2017 Proceedings of the VLDB Endowment  
We compare MILC with 12 recent compression algorithms and experimentally show that MILC improves the query performance by up to 13.2× and reduces the space overhead by up to 4.7×.  ...  In this work, we set out to bridge this performance gap for the first time by proposing a new compression scheme, namely, MILC (memory inverted list compression).  ...  The CPU is based on Haswell microarchitecture which supports AVX2 instruction set. We use the -mavx2 optimization flag for the SIMD optimization.  ... 
doi:10.14778/3090163.3090164 fatcat:k33od6iqhjftvoelltoz3jbldm

A Paradigm for the Design of Parallel Algorithms with Applications

I.V. Ramakrishnan, J.C. Browne
1983 IEEE Transactions on Software Engineering  
In order to conveniently display that linear speed-up in the number of processors is obtained, we write the algorithm where A and B are of size n and m, respectively, and k processors are used.  ...  Set Intersection The two sets of elements are contained in arrays A and B. Array A is of size n and array B of size m (n > m). Let G.  ... 
doi:10.1109/tse.1983.234777 fatcat:cajjzzzaj5djxipgom2qeuvium


Stephen. J. Guy, Jatin Chhugani, Changkyu Kim, Nadathur Satish, Ming Lin, Dinesh Manocha, Pradeep Dubey
2009 Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation - SCA '09  
We use a discrete optimization method to efficiently compute the motion of each agent. This resulting algorithm can be parallelized by exploiting data-parallelism and thread-level parallelism.  ...  Our new parallel collision avoidance algorithm, P-ClearPath can efficiently perform local collision avoidance for all agents in such tight packed simulations at 550 FPS on Intel quad-core Xeon (3.14 GHz  ...  Acknowledgements This research is supported in part by ARO Contract W911NF-04-1-0088, NSF award 0636208, DARPA/RDECOM Contracts N61339-04-C-0043 and WR91CRB-08-C-0137, Intel, and Microsoft.  ... 
doi:10.1145/1599470.1599494 dblp:conf/sca/GuyCKSLMD09 fatcat:dcxaagyrerc33hzex3byzjxy3m