Atomic Vector Operations on Chip Multiprocessors
2008
2008 International Symposium on Computer Architecture
Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors). ...
However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using previously proposed schemes. ...
Atomic Operations Shared memory multiprocessor architectures usually include hardware support for performing scalar atomic read-modify-write operations on memory. ...
doi:10.1109/isca.2008.38
dblp:conf/isca/KumarKSCCHKLN08
fatcat:ej2tf7hhrzbc7n4s6tmhp44zmq
Atomic Vector Operations on Chip Multiprocessors
2008
SIGARCH Computer Architecture News
Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors). ...
However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using previously proposed schemes. ...
Atomic Operations Shared memory multiprocessor architectures usually include hardware support for performing scalar atomic read-modify-write operations on memory. ...
doi:10.1145/1394608.1382154
fatcat:q7srk2z5kfettdyc5qlyfx63na
Parallel Computation of Non-Bonded Interactions in Drug Discovery: Nvidia GPUs vs. Intel Xeon Phi
2014
International Work-Conference on Bioinformatics and Biomedical Engineering
In this work, we discuss the effective parallelization of the non-bonded electrostatic computation for VS, and evaluate its performance on these two architectures. ...
These are computationally intensive operations, and massively parallel in nature, so they perfectly fit in the new landscape of high performance computing, dominated by massively parallel architectures ...
Consequently, the same operations of calculating the distance between atoms are performed on data elements within a single cache-line. ...
dblp:conf/iwbbio/FangVICS14
fatcat:3ibxgwmpdveuzphxo3bm43dacq
Parallel buffers for chip multiprocessors
2007
Proceedings of the 3rd international workshop on Data management on new hardware - DaMoN '07
Chip multiprocessors (CMPs) present new opportunities for improving database performance on large queries. ...
In this paper we propose and evaluate a parallel buffer that enables intra-operator parallelism on CMPs by avoiding contention between hardware threads that need to concurrently read or write to the same ...
The T1 is a chip multiprocessor with eight cores and four hardware threads per core for a total of 32 threads on one chip. ...
doi:10.1145/1363189.1363192
dblp:conf/damon/CieslewiczRG07
fatcat:edyyqeykhzhq7mjdsacxeyzziu
The GPU Computing Era
2010
IEEE Micro
Acknowledgments We thank Jen-Hsun Huang of NVIDIA for his Hot Chips 21 keynote 23 that inspired this article, and the entire NVIDIA team that brings GPU computing to market. ...
HOT CHIPS instructions focus on scalar (rather than vector) operations to match standard scalar programming languages. ...
vector operations. ...
doi:10.1109/mm.2010.41
fatcat:tmcgmo7v5zasbpakpqk37anni4
A Fast GPU Implementation for Solving Sparse Ill-Posed Linear Equation Systems
[chapter]
2010
Lecture Notes in Computer Science
Solvers for these equation systems (not restricted to image reconstruction) spend most of their time in sparse matrix-vector multiplications (SpMV). ...
In this paper we will present a GPU-accelerated scheme for a Conjugate Gradient (CG) solver, with focus on the SpMV. ...
A block is assigned atomically for execution on a multiprocessor, which then processes the contained threads in parallel in a SIMD manner. ...
doi:10.1007/978-3-642-14390-8_48
fatcat:4ybhuquhhre43gshr5tvpub7iq
Sparcle: an evolutionary processor design for large-scale multiprocessors
1993
IEEE Micro
Sparcle is a processor chip developed jointly by MIT, LSI Logic, and SUN Microsystems, by evolving an existing RISC architecture towards a processor suited for large-scale multiprocessors. ...
Sparcle supports three multiprocessor mechanisms: fast context switching, fast, user-level message handling, and fine-grain synchronization. ...
Our design was influenced by Halstead's work on multithreaded processors. ...
doi:10.1109/40.216748
fatcat:5f63o6sgj5aadbi4wkibrmdcli
High performance predictable histogramming on GPUs
2011
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units - GPGPU-4
Histogramming has been mapped on a GPU prior to this work. ...
However, outside a warp, a complete virtual atomic operation could be performed in one of the stages of another thread's virtual atomic operation, causing potentially incorrect behaviour. ...
First, data is read coalesced from off-chip memory and stored in on-chip shared memory. ...
doi:10.1145/1964179.1964181
dblp:conf/asplos/NugterenBCM11
fatcat:o2xtvsjrz5hn5og6ua6jcjh25u
CUDA ARCHITECTURE ANALYSIS AS THE DRIVING FORCE OF PARALLEL CALCULATION ORGANIZATION
[chapter]
2020
Innovative scientific researches: European development trends and regional aspect
chip. ...
For example, in Nvidia video chips, the main unit is a multiprocessor with eight to ten cores and hundreds of ALUs in general, several thousand registers and a small amount of shared memory. ...
Even the element-wise addition of two vectors will require drawing the figure on the screen or in an off-screen buffer. ...
doi:10.30525/978-9934-588-38-9-59
fatcat:kd37fedkvzbldcwe7fpu32pexa
A multiprocessor using protocol-based programming primitives
1987
International journal of parallel programming
If necessary, these operations are done in a critical region to assure they are atomic. ...
The input function operates on an input message and a state vector, altering the state vector. The output function operates on a state vector, altering the vector, and perhaps sending a message. ...
doi:10.1007/bf01408174
fatcat:pfkixlpoanbxtakpwrfpddef4e
The proposed RTTM is evaluated with a simulation of a Java chip-multiprocessor. ...
Hardware transactional memory is a promising synchronization technology for chip-multiprocessors. ...
a simulation of a chip-multiprocessor (CMP). ...
doi:10.1145/1774088.1774158
dblp:conf/sac/SchoeberlBV10
fatcat:i4r7mr3tinds7k2vd3er42r3gy
Creating HW/SW co-designed MPSoPC's from high level programming models
2011
2011 International Conference on High Performance Computing & Simulation
FPGA densities have continued to follow Moore's law and can now support a complete multiprocessor system on programmable chip. ...
In this paper we outline a new approach that allows users to drive the generation of a complete hardware/software co-designed multiprocessor system on programmable chip from an unaltered standard high ...
atomic operations [16]. ...
doi:10.1109/hpcsim.2011.5999874
dblp:conf/ieeehpcs/CartwrightMAH11
fatcat:5z3445b6xvbmjlz7tc5lziwiv4
Effective Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs
[chapter]
2011
Advances in Intelligent and Soft Computing
The Tesla C1060 is based on scalable processor array which has 240 streaming processors (SPs) cores organized as 30 streaming multiprocessors (SMs) and 4GB off-chip GDDR3 memory called device memory. ...
and capable of vector processing reaching a theoretical peak performance of around 230 GFLOPS. ...
doi:10.1007/978-3-642-19914-1_9
dblp:conf/pacbb/GuerreroSWCG11
fatcat:5o4iio45nje3zhtqoe3wfhdygu
Synchronization and communication in the T3E multiprocessor
1996
ACM SIGOPS Operating Systems Review
Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility. ...
This paper discusses the Cray T3E multiprocessor, which is based on the DEC Alpha 21164 microprocessor. ...
Thanks to Bill Dally for his consultation on this project. Karl Feind and Al Rivers were particularly helpful in obtaining performance measurements for this paper. ...
doi:10.1145/248208.237144
fatcat:lsx5ybe7qnaxxieylpqjk6m4jy
Synchronization and communication in the T3E multiprocessor
1996
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems - ASPLOS-VII
Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility. ...
This paper discusses the Cray T3E multiprocessor, which is based on the DEC Alpha 21164 microprocessor. ...
Thanks to Bill Dally for his consultation on this project. Karl Feind and Al Rivers were particularly helpful in obtaining performance measurements for this paper. ...
doi:10.1145/237090.237144
dblp:conf/asplos/Scott96
fatcat:bs5zz7ivcjeyjatcic6rifvw3i