Scatter-Add in Data Parallel Architectures

Jung Ho Ahn, M. Erez, W.J. Dally
11th International Symposium on High-Performance Computer Architecture  
We detail the micro-architecture of a scatter-add implementation on a stream architecture, which requires less than 2% increase in die area yet shows performance speedups ranging from 1.45 to over 11 on  ...  The scatter-add mechanism scatters a set of data values to a set of memory addresses and adds each data value to each referenced memory location instead of overwriting it.  ...  Conclusion In this paper we introduced the hardware scatter-add operation for data parallel SIMD architectures.  ... 
doi:10.1109/hpca.2005.30 dblp:conf/hpca/AhnED05 fatcat:3p46csgdyfclpegiud3m4qjc44
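
The scatter-add semantics described in this snippet (add each value into its target location instead of overwriting it) can be sketched in software as follows. This is a minimal sequential illustration only, not the paper's hardware mechanism; the function names `scatter_add` and `scatter_overwrite` and the list-based "memory" are assumptions made for the example.

```python
def scatter_add(memory, addresses, values):
    """Add each value into memory at its address; repeated addresses accumulate."""
    if len(addresses) != len(values):
        raise ValueError("addresses and values must have the same length")
    for addr, val in zip(addresses, values):
        memory[addr] += val  # accumulate rather than overwrite
    return memory


def scatter_overwrite(memory, addresses, values):
    """Plain scatter for contrast: the last write to an address wins."""
    if len(addresses) != len(values):
        raise ValueError("addresses and values must have the same length")
    for addr, val in zip(addresses, values):
        memory[addr] = val
    return memory


# Histogram-style use: three of the four updates collide on address 1.
bins = scatter_add([0, 0, 0, 0], [1, 3, 1, 1], [1, 1, 1, 1])
```

With colliding addresses, a plain scatter keeps only one of the writes, whereas scatter-add keeps the sum of all of them, which is why it suits reductions such as histogram construction.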

CircusTent: A Benchmark Suite for Atomic Memory Operations

Brody Williams, John Leidel, Xi Wang, David Donofrio, Yong Chen
2020 The International Symposium on Memory Systems  
A paradigm shift is currently taking place in the field of computer architecture.  ...  Parallel processing and corresponding programming models, already ubiquitous to high performance computing, will play a crucial role in these systems.  ...  ACKNOWLEDGMENTS Research reported in this publication was supported by the U.S. Department of Defense under Contract FA8075−14−D−0002.  ... 
doi:10.1145/3422575.3422789 fatcat:wnwa2piq2nacpj74nhnf76pjxm

Atomic Vector Operations on Chip Multiprocessors

Sanjeev Kumar, Daehyun Kim, Mikhail Smelyanskiy, Yen-Kuang Chen, Jatin Chhugani, Christopher J. Hughes, Changkyu Kim, Victor W. Lee, Anthony D. Nguyen
2008 2008 International Symposium on Computer Architecture  
Hardware to scatter and gather sparse data has previously been proposed to enable vector execution for these codes.  ...  In particular, code with sparse data access patterns cannot easily utilize the vector/SIMD instructions of mainstream processors.  ...  Scatter-add [8] extends the fetch-and-add mechanism to support parallel reductions on data parallel architectures.  ... 
doi:10.1109/isca.2008.38 dblp:conf/isca/KumarKSCCHKLN08 fatcat:ej2tf7hhrzbc7n4s6tmhp44zmq

Atomic Vector Operations on Chip Multiprocessors

Sanjeev Kumar, Daehyun Kim, Mikhail Smelyanskiy, Yen-Kuang Chen, Jatin Chhugani, Christopher J. Hughes, Changkyu Kim, Victor W. Lee, Anthony D. Nguyen
2008 SIGARCH Computer Architecture News  
Hardware to scatter and gather sparse data has previously been proposed to enable vector execution for these codes.  ...  In particular, code with sparse data access patterns cannot easily utilize the vector/SIMD instructions of mainstream processors.  ...  Scatter-add [8] extends the fetch-and-add mechanism to support parallel reductions on data parallel architectures.  ... 
doi:10.1145/1394608.1382154 fatcat:q7srk2z5kfettdyc5qlyfx63na

Parallel computing for electromagnetic field computation

C. Vollaire, L. Nicolas, A. Nicolas
1998 IEEE transactions on magnetics  
Shared memory and distributed memory architectures are presented, with their implication in the development of parallel numerical algorithms.  ...  This paper deals with parallel computation in electrical engineering.  ...  Parallel algorithms have to be adapted to suit the architecture of the computer in order to obtain the best parallel performance.  ... 
doi:10.1109/20.717805 fatcat:d5q6yilk2rafhdnehfxbfhou4i

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation [chapter]

Michael Lange, Gerard Gorman, Michèle Weiland, Lawrence Mitchell, James Southern
2013 Lecture Notes in Computer Science  
In order to achieve efficient scalability on massively parallel systems scientific software must evolve across the entire stack to exploit the multiple levels of parallelism exposed in modern architectures  ...  In this paper we demonstrate the use of hybrid MPI/OpenMP parallelisation to optimise parallel sparse matrix-vector multiplication in PETSc, a widely used scientific library for the scalable solution of  ...  Copy data to/from buffer Initial greedy allocation Local diffusion algorithm Architecture Overview Cray XE6 (HECToR) NUMA architecture 32 cores per node 4 NUMA domains, 8 cores each Fujitsu  ... 
doi:10.1007/978-3-642-38750-0_8 fatcat:hxs4xjbdenbhrfnkatd4zfgggy

A hardware complete detection mechanism for an energy efficient reconfigurable accelerator CMA

Akihito Tsusaka, Mai Izawa, Rie Uno, Nobuyuki Ozaki, Hideharu Amano
2013 2013 23rd International Conference on Field programmable Logic and Applications  
The completion time in the PE array has been calculated from the delay table and mapping results manually, and specified in the micro-code.  ...  In order to reduce the power for storing intermediate results and clock tree, the PE array is consisting of combinatorial circuits.  ...  This work is also supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Cadence Design Systems, Inc.  ... 
doi:10.1109/fpl.2013.6645594 dblp:conf/fpl/TsusakaIUOA13 fatcat:ysc3tauiurfappdpxxuewwffta

Debunking the 100X GPU vs. CPU myth

Victor W. Lee, Per Hammarlund, Ronak Singhal, Pradeep Dubey, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty
2010 Proceedings of the 37th annual international symposium on Computer architecture - ISCA '10  
Recent advances in computing have led to an explosion in the amount of data being generated.  ...  Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications.  ...  Exploiting SIMD parallelism is however challenging due to the presence of gather/scatter operations required to gather/scatter object data (position, velocity) for different objects.  ... 
doi:10.1145/1815961.1816021 dblp:conf/isca/LeeKCDKNSSCHSD10 fatcat:7dgqdsykarcwhp22t7oxgawwza

Debunking the 100X GPU vs. CPU myth

Victor W. Lee, Per Hammarlund, Ronak Singhal, Pradeep Dubey, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty
2010 SIGARCH Computer Architecture News  
Recent advances in computing have led to an explosion in the amount of data being generated.  ...  Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications.  ...  Exploiting SIMD parallelism is however challenging due to the presence of gather/scatter operations required to gather/scatter object data (position, velocity) for different objects.  ... 
doi:10.1145/1816038.1816021 fatcat:pxizpaiizrdq7gmsfs45obdqwy

Parameterized Characterization of Bioinformatics Workload on SIMD Architecture

Naeem Z. Azeemi, A. Sultan, A Arshad Muhammad
2006 2006 International Conference on Information and Automation  
Bioinformatics applications expression profile is a critical performance metric in high end genomic data processing.  ...  The temporal and spatial localities in genomic data are also discussed. Experimental results are measured at a high end SIMD (single instruction stream over multiple data stream) processor.  ...  From a programmer's viewpoint, SIMD (single instruction stream over multiple data stream) architectures are a lucrative choice for exploiting spatial parallelism rather than temporal parallelism.  ... 
doi:10.1109/icinfa.2006.374110 fatcat:m3enylpalzas5dheluuvtnv7ou

Computer Architecture and Design [chapter]

Siamack Haghighi, Jean-Luc Gaudiot, Manoj Franklin, Bruce Jacob, Lejla Batina, Binu Mathew, Krste Asanović, Kazuo Sakiyama, Ingrid Verbauwhede, Donna Quammen
2008 Physics and Applications of Negative Refractive Index Materials  
There are two main classes of data parallel architectures: distributed memory SIMD (single instruction, multiple data [1]) architecture and shared memory vector architecture.  ...  Conclusions Data parallel instructions have appeared in many forms in high-performance computer architectures over the last 30 years.  ... 
doi:10.1201/9781420068764.sec1 fatcat:6mkbmjhjibgttex7w3yoqalhuy

Accelerating temporal action proposal generation via high performance computing [article]

Tian Wang, Shiye Lei, Youyou Jiang, Choi Chang, Hichem Snoussi, Guangcun Shan
2020 arXiv   pre-print
Remarkably, the total data transmission is reduced by adding a connection between multiple computing load in the newly developed architecture.  ...  In this work, one novel high performance ring parallel architecture based on Message Passing Interface (MPI) is further introduced into temporal action proposal generation, which is a reliable communication  ...  Different from scatter, GPUs don't need to add but replace its own block by the block received.  ... 
arXiv:1906.06496v4 fatcat:wv4ycnrvo5dife5b4fyjecooay

Versatile and scalable parallel histogram construction

Wookeun Jung, Jongsoo Park, Jaejin Lee
2014 Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14  
However, it is challenging to efficiently utilize abundant parallel resources in modern processors for histogram construction.  ...  Histograms are used in various fields to quickly profile the distribution of a large amount of data.  ...  To address this difficulty, architectural features such as scatter-add [6] and gather-linked-and-scatter-conditional [20] have been proposed, but they are yet to be implemented in production hardware  ... 
doi:10.1145/2628071.2628108 dblp:conf/IEEEpact/JungPL14 fatcat:maaugr6st5e3rkmyz6k2udscwe

Tuning HipGISAXS on Multi and Many Core Supercomputers [chapter]

Abhinav Sarje, Xiaoye S. Li, Alexander Hexemer
2014 Lecture Notes in Computer Science  
In this paper, we present optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code [1], on various massively-parallel state-of-the-art supercomputers based on multi and many-core  ...  With the continual development of multi and manycore architectures, there is a constant need for architecture-specific tuning of application-codes in order to realize high computational performance and  ...  In this paper, we consider one such application code developed by us recently, HipGISAXS, which is a massively parallel X-ray scattering simulation code [1], [2].  ... 
doi:10.1007/978-3-319-10214-6_11 fatcat:qgmhhdjlc5gslfxpqkq263ri2y

MapGraph

Zhisong Fu, Michael Personick, Bryan Thompson
2014 Proceedings of Workshop on GRAph Data management Experiences and Systems - GRADES'14  
However, the SIMT architecture used in GPUs places particular constraints on both the design and implementation of the algorithms and data structures, making the development of such programs difficult  ...  architectures.  ...  Finally, in the Scatter stage, vertex vi checks if it is changed and if so adds its out-edge neighbors to the new frontier.  ... 
doi:10.1145/2621934.2621936 dblp:conf/sigmod/FuTP14 fatcat:273pk6stsjhizjdh3aq65qlhse