Atomic Vector Operations on Chip Multiprocessors

Sanjeev Kumar, Daehyun Kim, Mikhail Smelyanskiy, Yen-Kuang Chen, Jatin Chhugani, Christopher J. Hughes, Changkyu Kim, Victor W. Lee, Anthony D. Nguyen
2008 2008 International Symposium on Computer Architecture  
Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors).  ...  However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using previously proposed schemes.  ...  Atomic Operations Shared memory multiprocessor architectures usually include hardware support for performing scalar atomic read-modify-write operations on memory.  ... 
doi:10.1109/isca.2008.38 dblp:conf/isca/KumarKSCCHKLN08 fatcat:ej2tf7hhrzbc7n4s6tmhp44zmq

Atomic Vector Operations on Chip Multiprocessors

Sanjeev Kumar, Daehyun Kim, Mikhail Smelyanskiy, Yen-Kuang Chen, Jatin Chhugani, Christopher J. Hughes, Changkyu Kim, Victor W. Lee, Anthony D. Nguyen
2008 SIGARCH Computer Architecture News  
Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors).  ...  However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using previously proposed schemes.  ...  Atomic Operations Shared memory multiprocessor architectures usually include hardware support for performing scalar atomic read-modify-write operations on memory.  ... 
doi:10.1145/1394608.1382154 fatcat:q7srk2z5kfettdyc5qlyfx63na

Parallel Computation of Non-Bonded Interactions in Drug Discovery: Nvidia GPUs vs. Intel Xeon Phi

Jianbin Fang, Ana Lucia Varbanescu, Baldomero Imbernon, José M. Cecilia, Horacio Emilio Pérez Sánchez
2014 International Work-Conference on Bioinformatics and Biomedical Engineering  
In this work, we discuss the effective parallelization of the non-bonded electrostatic computation for VS, and evaluate its performance on these two architectures.  ...  These are computationally intensive operations, and massively parallel in nature, so they perfectly fit in the new landscape of high performance computing, dominated by massively parallel architectures  ...  Consequently, the same operations of calculating the distance between atoms are performed on data elements within a single cache-line.  ... 
dblp:conf/iwbbio/FangVICS14 fatcat:3ibxgwmpdveuzphxo3bm43dacq

Parallel buffers for chip multiprocessors

John Cieslewicz, Kenneth A. Ross, Ioannis Giannakakis
2007 Proceedings of the 3rd international workshop on Data management on new hardware - DaMoN '07  
Chip multiprocessors (CMPs) present new opportunities for improving database performance on large queries.  ...  In this paper we propose and evaluate a parallel buffer that enables intra-operator parallelism on CMPs by avoiding contention between hardware threads that need to concurrently read or write to the same  ...  The T1 is a chip multiprocessor with eight cores and four hardware threads per core for a total of 32 threads on one chip.  ... 
doi:10.1145/1363189.1363192 dblp:conf/damon/CieslewiczRG07 fatcat:edyyqeykhzhq7mjdsacxeyzziu

The GPU Computing Era

John Nickolls, William J Dally
2010 IEEE Micro  
Acknowledgments We thank Jen-Hsun Huang of NVIDIA for his Hot Chips 21 keynote 23 that inspired this article, and the entire NVIDIA team that brings GPU computing to market.  ...  instructions focus on scalar (rather than vector) operations to match standard scalar programming languages.  ...  vector operations.  ... 
doi:10.1109/mm.2010.41 fatcat:tmcgmo7v5zasbpakpqk37anni4

A Fast GPU Implementation for Solving Sparse Ill-Posed Linear Equation Systems [chapter]

Florian Stock, Andreas Koch
2010 Lecture Notes in Computer Science  
Solvers for these equation systems (not restricted to image reconstruction) spend most of their time in sparse matrix-vector multiplications (SpMV).  ...  In this paper we will present a GPU-accelerated scheme for a Conjugate Gradient (CG) solver, with focus on the SpMV.  ...  A block is assigned atomically for execution on a multiprocessor, which then processes the contained threads in parallel in a SIMD manner.  ... 
doi:10.1007/978-3-642-14390-8_48 fatcat:4ybhuquhhre43gshr5tvpub7iq

Sparcle: an evolutionary processor design for large-scale multiprocessors

A. Agarwal, J. Kubiatowicz, D. Kranz, B.H. Lim, D. Yeung, G. D'Souza, M. Parkin
1993 IEEE Micro  
Sparcle is a processor chip developed jointly by MIT, LSI Logic, and SUN Microsystems, by evolving an existing RISC architecture towards a processor suited for large-scale multiprocessors.  ...  Sparcle supports three multiprocessor mechanisms: fast context switching, fast, user-level message handling, and fine-grain synchronization.  ...  Our design was influenced by Halstead's work on multithreaded processors.  ... 
doi:10.1109/40.216748 fatcat:5f63o6sgj5aadbi4wkibrmdcli

High performance predictable histogramming on GPUs

Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, Bart Mesman
2011 Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units - GPGPU-4  
Histogramming has been mapped on a GPU prior to this work.  ...  However, outside a warp, a complete virtual atomic operation could be performed in one of the stages of another thread's virtual atomic operation, causing potentially incorrect behaviour.  ...  First, data is read coalesced from off-chip memory and stored in on-chip shared memory.  ... 
doi:10.1145/1964179.1964181 dblp:conf/asplos/NugterenBCM11 fatcat:o2xtvsjrz5hn5og6ua6jcjh25u

CUDA ARCHITECTURE ANALYSIS AS THE DRIVING FORCE OF PARALLEL CALCULATION ORGANIZATION [chapter]

Andriy Dudnik, Taras Shevchenko National University of Kyiv, Ukraine, Tetiana Domkiv, National Aviation University, Ukraine
2020 Innovative scientific researches: European development trends and regional aspect  
chip.  ...  For example, in Nvidia video chips, the main unit is a multiprocessor with eight to ten cores and hundreds of ALUs in general, several thousand registers and a small amount of shared memory.  ...  Even the element-wise addition of two vectors will require drawing the figure on the screen or in an off-screen buffer.  ... 
doi:10.30525/978-9934-588-38-9-59 fatcat:kd37fedkvzbldcwe7fpu32pexa

A multiprocessor using protocol-based programming primitives

Erik P. DeBenedictis
1987 International journal of parallel programming  
If necessary, these operations are done in a critical region to assure they are atomic.  ...  The input function operates on an input message and a state vector, altering the state vector. The output function operates on a state vector, altering the vector, and perhaps sending a message.  ... 
doi:10.1007/bf01408174 fatcat:pfkixlpoanbxtakpwrfpddef4e

RTTM

Martin Schoeberl, Florian Brandner, Jan Vitek
2010 Proceedings of the 2010 ACM Symposium on Applied Computing - SAC '10  
The proposed RTTM is evaluated with a simulation of a Java chip-multiprocessor.  ...  Hardware transactional memory is a promising synchronization technology for chip-multiprocessors.  ...  a simulation of a chip-multiprocessor (CMP).  ... 
doi:10.1145/1774088.1774158 dblp:conf/sac/SchoeberlBV10 fatcat:i4r7mr3tinds7k2vd3er42r3gy

Creating HW/SW co-designed MPSoPC's from high level programming models

Eugene Cartwright, Sen Ma, David Andrews, Miaoqing Huang
2011 2011 International Conference on High Performance Computing & Simulation  
FPGA densities have continued to follow Moore's law and can now support a complete multiprocessor system on programmable chip.  ...  In this paper we outline a new approach that allows users to drive the generation of a complete hardware/software co-designed multiprocessor system on programmable chip from an unaltered standard high  ...  atomic operations [16].  ... 
doi:10.1109/hpcsim.2011.5999874 dblp:conf/ieeehpcs/CartwrightMAH11 fatcat:5z3445b6xvbmjlz7tc5lziwiv4

Effective Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs [chapter]

Ginés D. Guerrero, Horacio Pérez-Sánchez, Wolfgang Wenzel, José M. Cecilia, José M. García
2011 Advances in Intelligent and Soft Computing  
The Tesla C1060 is based on scalable processor array which has 240 streaming processors (SPs) cores organized as 30 streaming multiprocessors (SMs) and 4GB off-chip GDDR3 memory called device memory.  ...  and capable of vector processing reaching a theoretical peak performance of around 230 GFLOPS.  ... 
doi:10.1007/978-3-642-19914-1_9 dblp:conf/pacbb/GuerreroSWCG11 fatcat:5o4iio45nje3zhtqoe3wfhdygu

Synchronization and communication in the T3E multiprocessor

Steven L. Scott
1996 ACM SIGOPS Operating Systems Review  
Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility.  ...  This paper discusses the Cray T3E multiprocessor, which is based on the DEC Alpha 21164 microprocessor.  ...  Thanks to Bill Dally for his consultation on this project. Karl Feind and Al Rivers were particularly helpful in obtaining performance measurements for this paper.  ... 
doi:10.1145/248208.237144 fatcat:lsx5ybe7qnaxxieylpqjk6m4jy

Synchronization and communication in the T3E multiprocessor

Steven L. Scott
1996 Proceedings of the seventh international conference on Architectural support for programming languages and operating systems - ASPLOS-VII  
Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility.  ...  This paper discusses the Cray T3E multiprocessor, which is based on the DEC Alpha 21164 microprocessor.  ...  Thanks to Bill Dally for his consultation on this project. Karl Feind and Al Rivers were particularly helpful in obtaining performance measurements for this paper.  ... 
doi:10.1145/237090.237144 dblp:conf/asplos/Scott96 fatcat:bs5zz7ivcjeyjatcic6rifvw3i