1,063 Hits in 6.8 sec

ILP versus TLP on SMT

Nicholas Mitchell, Larry Carter, Jeanne Ferrante, Dean Tullsen
1999 Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '99  
By sharing processor resources among threads at a very fine granularity, a simultaneous multithreading processor (SMT) renders thread-level parallelism (TLP) and instruction-level parallelism (ILP) operationally  ...  In this paper, we define the performance characteristics that divide codes into one of these three circumstances. We present evidence from three codes to support the factors involved in the model.  ...  TLP does not share this problem. Figure 5: In integer sort, the best bucket size depends on number of threads.  ... 
doi:10.1145/331532.331569 dblp:conf/sc/MitchellCFT99 fatcat:jhfm53cfgrempgfv7h6zgug73q
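The bucket-size observation quoted above is the usual tension in a parallel counting sort: more buckets reduce contention between threads, while fewer buckets keep each thread's working set small. A minimal OpenMP sketch of the pattern, not the authors' code; N, MAXKEY and NBUCKETS are illustrative values whose balance against the thread count is exactly the knob Figure 5 refers to.

```c
/* Minimal OpenMP bucket-count sketch: each thread builds a private
 * histogram, then the histograms are merged into a shared one.
 * N, MAXKEY and NBUCKETS are illustrative assumptions. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N        (1 << 20)
#define MAXKEY   (1 << 16)
#define NBUCKETS 1024            /* bucket width = MAXKEY / NBUCKETS */

int main(void) {
    int *keys = malloc(N * sizeof *keys);
    static long hist[NBUCKETS];
    for (int i = 0; i < N; i++) keys[i] = rand() % MAXKEY;

    #pragma omp parallel
    {
        long local[NBUCKETS] = {0};          /* private per-thread counts */
        #pragma omp for
        for (int i = 0; i < N; i++)
            local[keys[i] / (MAXKEY / NBUCKETS)]++;
        for (int b = 0; b < NBUCKETS; b++) { /* merge into the shared histogram */
            #pragma omp atomic
            hist[b] += local[b];
        }
    }
    printf("bucket 0 holds %ld keys\n", hist[0]);
    free(keys);
    return 0;
}
```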

Multivariate Polynomial Multiplication on GPU

Diana Andreea Popescu, Rogelio Tomas Garcia
2016 Procedia Computer Science  
We obtain very good speedups over another multivariate polynomial multiplication library for GPUs (up to 548x), and over the implementation of our algorithm for multi-core machines using OpenMP (up to  ...  These works focus on multi-core machines, only [22] and [12] mentioning algorithms for GPU.  ...  Using shared memory assures much faster access to the data than global memory access, because it is located on chip.  ... 
doi:10.1016/j.procs.2016.05.306 fatcat:dpkukgn5vnh4vceby475ybd44y
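Setting the GPU specifics aside, the core of polynomial multiplication is pairwise multiplication of terms followed by combining like monomials. The serial C sketch below uses sparse univariate polynomials (one exponent per term) as a simplification of the multivariate case and is not the paper's GPU algorithm.

```c
/* Serial term-by-term polynomial multiplication over exponent/coefficient
 * pairs.  Univariate for brevity; the multivariate case packs several
 * exponents per term but follows the same pattern. */
#include <stdio.h>

typedef struct { int exp; long coef; } Term;

/* c must have room for (na * nb) terms; returns the number written. */
static int poly_mul(const Term *a, int na, const Term *b, int nb, Term *c) {
    int nc = 0;
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++) {
            int  e = a[i].exp + b[j].exp;
            long v = a[i].coef * b[j].coef;
            int  k = 0;
            while (k < nc && c[k].exp != e) k++;   /* combine like terms */
            if (k == nc) { c[nc].exp = e; c[nc].coef = 0; nc++; }
            c[k].coef += v;
        }
    return nc;
}

int main(void) {
    Term a[] = {{0, 1}, {1, 2}};          /* 1 + 2x   */
    Term b[] = {{1, 3}, {2, 1}};          /* 3x + x^2 */
    Term c[4];
    int nc = poly_mul(a, 2, b, 2, c);
    for (int k = 0; k < nc; k++)
        printf("%+ldx^%d ", c[k].coef, c[k].exp);
    printf("\n");                          /* 3x + 7x^2 + 2x^3 */
    return 0;
}
```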

Simple: A Methodology for Programming High Performance Algorithms on Clusters of Symmetric Multiprocessors (SMPs)

David A. Bader, Joseph JáJá
1999 Journal of Parallel and Distributed Computing  
We illustrate the power of our methodology by presenting experimental results for sorting integers, two-dimensional fast Fourier transforms (FFT), and constraint-satisfied searching.  ...  The SMP cluster programming methodology is based on a small prototype kernel (SIMPLE) of collective communication primitives that make efficient use of the hybrid shared and message passing environment  ...  Note that the NAS IS benchmark requires that the integers be ranked and not necessarily placed in sorted order.  ... 
doi:10.1006/jpdc.1999.1541 fatcat:l6rspgbrjfckbbezpjvwrh5eca
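The NAS IS requirement mentioned in the snippet (keys must be ranked, not necessarily placed in sorted order) is commonly met by merging per-process histograms and prefix-summing them. Below is a hedged MPI sketch of that ranking step in C, not the SIMPLE library's own primitives; NLOCAL, MAXKEY and the random key distribution are assumptions for illustration.

```c
/* Ranking integers with a globally merged histogram, in the spirit of the
 * NAS IS kernel: each rank counts its local keys, MPI_Allreduce combines
 * the counts, and a prefix sum yields each key's global rank. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 4096
#define MAXKEY 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int  keys[NLOCAL];
    long local[MAXKEY] = {0}, global[MAXKEY];
    srand(rank + 1);
    for (int i = 0; i < NLOCAL; i++) {
        keys[i] = rand() % MAXKEY;
        local[keys[i]]++;
    }
    MPI_Allreduce(local, global, MAXKEY, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    long ranks[MAXKEY], sum = 0;          /* rank of key k = #keys < k */
    for (int k = 0; k < MAXKEY; k++) { ranks[k] = sum; sum += global[k]; }

    if (rank == 0)
        printf("key 0 starts at rank %ld of %ld keys\n", ranks[0], sum);
    MPI_Finalize();
    return 0;
}
```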

Towards using and improving the NAS parallel benchmarks

Vivek Kale
2010 Proceedings of the 2010 Workshop on Parallel Programming Patterns - ParaPLoP '10  
, and simultaneous multi-threading.  ...  In order to benchmark large-scale parallel machines, the NAS parallel benchmarks must perform properly on a large number of nodes.  ... 
doi:10.1145/1953611.1953623 fatcat:5pw75ahajfhrfhpowid42d5kwi

On the performance and energy efficiency of the PGAS programming model on multicore architectures

Jeremie Lagraviere, Johannes Langguth, Mohammed Sourouri, Phuong H. Ha, Xing Cai
2016 2016 International Conference on High Performance Computing & Simulation (HPCS)  
On the multi-node platform we used the hardware measurement solution called High Definition Energy Efficiency Monitoring tool in order to measure energy.  ...  detail the communication time, cache hit/miss ratio and memory usage.  ...  OpenMP offers ease of programming for shared memory machines, while MPI offers high performance on distributed memory supercomputers.  ... 
doi:10.1109/hpcsim.2016.7568416 dblp:conf/ieeehpcs/LagraviereLSHC16 fatcat:xaganptxkfbmbbrttdx5tr7hwm
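The contrast drawn in the snippet (shared-memory ease versus message-passing control) is what PGAS models try to bridge: every process owns a partition of a global address space and can access remote partitions with one-sided reads and writes. The sketch below illustrates that style with plain MPI-3 RMA in C rather than a PGAS language such as UPC, so it only conveys the flavour of the model, not the setup evaluated in the paper.

```c
/* PGAS-flavoured one-sided access sketched with MPI RMA: every rank owns a
 * partition of a "global" array and reads remote partitions with a get
 * instead of matched send/receive pairs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *part;                               /* this rank's partition */
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &part, &win);
    *part = rank * 10;

    MPI_Win_fence(0, win);
    int neighbour = (rank + 1) % size, remote;
    MPI_Get(&remote, 1, MPI_INT, neighbour, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("rank %d read %d from rank %d\n", rank, remote, neighbour);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```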

Characterization of Shared-Memory Multi-Core Applications

Mohammed Mohammed, Gheith Abandah
2016 Jordanian Journal of Computers and Information Technology  
KEYWORDS: Multi-core processor, On-the-fly analysis, Shared memory applications, Communication patterns, Performance evaluation.  ...  Almost all the sharing in Radix, FFT and Blackscholes is with only one thread. In Fluidanimate and Swaptions, there is about 23% of sharing with two threads.  ...  Figure 1. Methodology used to characterize multi-core applications.  ... 
doi:10.5455/jjcit.71-1448574289 fatcat:ish3odn2g5go7fuqhr6urrbdxy
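The sharing percentages quoted above come from tracking which threads touch the same data. Below is a toy C version of such a sharing (communication) matrix, driven by a hard-coded access trace instead of real instrumentation; the block granularity and the trace itself are assumptions, and actual tools feed instrumented memory traces into the same kind of matrix.

```c
/* Toy sharing-matrix analysis: remember the last thread that touched each
 * memory block and count an interaction whenever a different thread
 * touches it next. */
#include <stdio.h>

#define NTHREADS 4
#define NBLOCKS  8

static int  last_owner[NBLOCKS];
static long comm[NTHREADS][NTHREADS];

static void record_access(int thread, int block) {
    int prev = last_owner[block];
    if (prev >= 0 && prev != thread)
        comm[thread][prev]++;          /* thread interacted with prev */
    last_owner[block] = thread;
}

int main(void) {
    for (int b = 0; b < NBLOCKS; b++) last_owner[b] = -1;

    /* A made-up access trace of (thread, block) pairs. */
    int trace[][2] = {{0,0},{1,0},{0,0},{2,3},{3,3},{1,1},{1,1},{2,1}};
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        record_access(trace[i][0], trace[i][1]);

    for (int i = 0; i < NTHREADS; i++) {
        for (int j = 0; j < NTHREADS; j++) printf("%3ld", comm[i][j]);
        printf("\n");
    }
    return 0;
}
```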

Memory-aware Thread and Data Mapping for Hierarchical Multi-core Platforms

Eduardo Henrique Molina da Cruz, Marco Antonio Zanata Alves, Alexandre Carissimi, Philippe Olivier Alexandre Navaux, Christiane Pousa Ribeiro, Jean-François Méhaut
2012 International Journal of Networking and Computing  
In order to evaluate our proposal, we use the NAS Parallel Benchmarks (NPB) running on two modern multi-core NUMA machines.  ...  The problem is even more important in multi-core machines with NUMA characteristics, since the remote access imposes high overhead, making them more sensitive to thread and data mapping.  ...  In order to evaluate our proposal, we have performed experiments on two NUMA multi-core machines using NAS Parallel Benchmarks.  ... 
doi:10.15803/ijnc.2.1_97 fatcat:pcbmir2eirc4dbobn47efmplcq
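Whatever policy computes the mapping, applying it on Linux ultimately comes down to pinning each thread to a core close to its data. A minimal, Linux-specific C sketch of that mechanism follows; pthread_setaffinity_np is a GNU extension, and the core numbers here are illustrative rather than a computed NUMA-aware mapping.

```c
/* Pin each worker thread to a chosen core so it stays near its data. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    long core = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    printf("worker pinned to core %ld now runs on core %d\n",
           core, sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long c = 0; c < 2; c++)
        pthread_create(&t[c], NULL, worker, (void *)c);
    for (int c = 0; c < 2; c++)
        pthread_join(t[c], NULL);
    return 0;
}
```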

Productivity and performance using partitioned global address space languages

Katherine Yelick, Parry Husbands, Costin Iancu, Amir Kamil, Rajesh Nishtala, Jimmy Su, Michael Welcome, Tong Wen, Dan Bonachea, Wei-Yu Chen, Phillip Colella, Kaushik Datta (+4 others)
2007 Proceedings of the 2007 international workshop on Parallel symbolic computation - PASCO '07  
The result is portable high-performance compilers that run on a large variety of shared and distributed memory multiprocessors.  ...  Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing.  ...  threads and their associated data to processors: on a distributed memory machine, the local memory of a processor holds both the thread's private data and the shared data with affinity to that thread.  ... 
doi:10.1145/1278177.1278183 dblp:conf/issac/YelickBCCDDGHHHIKNSWW07 fatcat:hpedjb24vvfkbpi7fbawt6xf4u

Towards portable message passing in Java: Binding MPI [chapter]

Sava Mintchev, Vladimir Getov
1997 Lecture Notes in Computer Science  
To evaluate the resulting combination we have run a Java version of the NAS parallel IS benchmark on a distributed-memory IBM SP2 machine.  ...  One way of employing Java in high performance computing is to utilize the potential of Java concurrent threads for programming parallel shared-memory machines [13]. A very interesting related theme is the  ...  implementation on the IBM POWERparallel System SP machine of a Java run-time system with parallel threads [10], using message passing to emulate shared memory.  ... 
doi:10.1007/3-540-63697-8_79 fatcat:frx7soj5fneajkracd4trqyi64

Optimizing MapReduce for GPUs with effective shared memory usage

Linchuan Chen, Gagan Agrawal
2012 Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing - HPDC '12  
To support a general and efficient implementation, we support the following features: a memory hierarchy for maintaining the reduction object, a multi-group scheme in shared memory to trade-off space requirements  ...  In this paper, we propose a new implementation of MapReduce for GPUs, which is very effective in utilizing shared memory, a small programmable cache on modern GPUs.  ...  We divide the threads in each block into many groups, which is similar to the multi-group scheme in the shared memory.  ... 
doi:10.1145/2287076.2287109 dblp:conf/hpdc/ChenA12 fatcat:oi75yvs57zhqpjx5iaio4pa4d4
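The reduction-object idea mentioned in the abstract folds each emitted key/value pair into a fixed-size table immediately, instead of storing and shuffling all intermediate pairs. The CPU-side OpenMP sketch below illustrates that idea in C and is not the authors' GPU implementation; the key function and table size are made up, and the paper's version keeps such tables in per-block shared memory.

```c
/* "Reduction object" in miniature: map emits (key, 1) pairs that are
 * folded into a fixed-size table as soon as they are produced. */
#include <omp.h>
#include <stdio.h>

#define NKEYS  16
#define NITEMS 100000

int main(void) {
    long table[NKEYS] = {0};            /* the reduction object */

    #pragma omp parallel for
    for (int i = 0; i < NITEMS; i++) {
        int key = (i * 2654435761u) % NKEYS;   /* "map": derive a key */
        #pragma omp atomic                     /* "reduce": fold it in */
        table[key]++;
    }

    for (int k = 0; k < NKEYS; k++)
        printf("key %2d -> %ld\n", k, table[k]);
    return 0;
}
```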

Back to the futures

James Swaine, Kevin Tew, Peter Dinda, Robert Bruce Findler, Matthew Flatt
2010 Proceedings of the ACM international conference on Object oriented programming systems languages and applications - OOPSLA '10  
We find that incremental parallelization can provide useful, scalable parallelism on commodity multicore processors at a fraction of the effort required to implement conventional parallel threads.  ...  This approach can be applied inexpensively to many sequential runtime systems, and we demonstrate its effectiveness in the Racket runtime system and Parrot virtual machine.  ...  Thanks also to Nikos Hardavellas for discussions on the work in general and on multi-core architectures specifically.  ... 
doi:10.1145/1869459.1869507 dblp:conf/oopsla/SwaineTDFF10 fatcat:qaymm3qjzvd3zpipndj6ignlsa

Back to the futures

James Swaine, Kevin Tew, Peter Dinda, Robert Bruce Findler, Matthew Flatt
2010 SIGPLAN notices  
We find that incremental parallelization can provide useful, scalable parallelism on commodity multicore processors at a fraction of the effort required to implement conventional parallel threads.  ...  This approach can be applied inexpensively to many sequential runtime systems, and we demonstrate its effectiveness in the Racket runtime system and Parrot virtual machine.  ...  Thanks also to Nikos Hardavellas for discussions on the work in general and on multi-core architectures specifically.  ... 
doi:10.1145/1932682.1869507 fatcat:jndou4fktbhezeemeszyq46p2e

Places

Kevin Tew, James Swaine, Matthew Flatt, Robert Bruce Findler, Peter Dinda
2011 Proceedings of the 7th symposium on Dynamic languages - DLS '11  
The fork-join form on line 5 creates (processor-count) places and records the configuration in a communicator group cg.  ...  The ([N n]) part binds the size n from the original place to N.  ...  Acknowledgments: Thanks to Jay McCarthy for access to the 12-core machine we used to run our experiments and the anonymous DLS reviewers for  ...  The NAS Parallel Benchmarks consist of seven benchmarks. Integer Sort (IS) is a simple histogram integer sort. Fourier Transform (FT) is a 3-D fast Fourier transform.  ... 
doi:10.1145/2047849.2047860 dblp:conf/dls/TewSFFD11 fatcat:6f7o7walsbdujn455jokb2jscy
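For comparison with the places API described in the snippet, the same fork-join shape looks roughly like this in C with POSIX threads, where sysconf(_SC_NPROCESSORS_ONLN) plays the role of (processor-count). This is only a structural analogue: Racket places are isolated heaps that communicate by messages, which plain threads are not.

```c
/* Fork one worker per online processor, then join them all. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *work(void *arg) {
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);   /* analogue of (processor-count) */
    pthread_t tid[64];
    if (n > 64) n = 64;

    for (long i = 0; i < n; i++)              /* fork */
        pthread_create(&tid[i], NULL, work, (void *)i);
    for (long i = 0; i < n; i++)              /* join */
        pthread_join(tid[i], NULL);
    return 0;
}
```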

Studying the impact of application-level optimizations on the power consumption of multi-core architectures

Shah Mohammad Faizur Rahman, Jichi Guo, Akshatha Bhat, Carlos Garcia, Majedul Haque Sujon, Qing Yi, Chunhua Liao, Daniel Quinlan
2012 Proceedings of the 9th conference on Computing Frontiers - CF '12  
and multi-threaded benchmarks using varying compiler optimization settings and runtime configurations.  ...  This paper studies the overall system power variations of two multi-core architectures, an 8-core Intel and a 32-core AMD workstation, while using these machines to execute a wide variety of sequential  ...  PARSEC benchmarks [3]: The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite focusing on next-generation shared-memory programs for chip-multiprocessors.  ... 
doi:10.1145/2212908.2212927 dblp:conf/cf/RahmanGBGSYLQ12 fatcat:qw6nsbdsi5ek7bitl3p5iwbcse
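One widely used way to observe the energy effect of an optimization on Intel machines is the RAPL package-energy counter exposed through sysfs. The sketch below is an assumption-laden illustration rather than the instrumentation used in the paper: the sysfs path varies across systems and may require elevated permissions, and the busy loop merely stands in for a benchmark run.

```c
/* Read the RAPL package-energy counter before and after a region of
 * interest.  The sysfs path is a common default, not a guarantee. */
#include <stdio.h>

static long long read_energy_uj(void) {
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    long long uj = -1;
    if (f) { fscanf(f, "%lld", &uj); fclose(f); }
    return uj;
}

int main(void) {
    long long before = read_energy_uj();

    volatile double x = 0.0;            /* stand-in for the workload under study */
    for (long i = 0; i < 100000000L; i++) x += i * 1e-9;

    long long after = read_energy_uj();
    if (before >= 0 && after >= 0)
        printf("package energy: %lld uJ\n", after - before);
    else
        printf("RAPL counter not available on this system\n");
    return 0;
}
```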

Feedback-directed thread scheduling with memory considerations

Fengguang Song, Shirley Moore, Jack Dongarra
2007 Proceedings of the 16th international symposium on High performance distributed computing - HPDC '07  
This paper describes a novel approach to generate an optimized schedule to run threads on distributed shared memory (DSM) systems.  ...  The approach relies upon a binary instrumentation tool to automatically acquire the memory sharing relationship between user-level threads by analyzing their memory trace.  ...  The authors would like to thank HPDC'07 reviewers for their valuable comments and suggestions on the initial draft of the paper.  ... 
doi:10.1145/1272366.1272380 dblp:conf/hpdc/SongMD07 fatcat:mkkhtlxcajgllpz4bw5zuro5ma
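Once the memory-sharing relationship between threads is known, a schedule can be derived by grouping the threads that share the most data. Below is a small C sketch of a greedy pairing pass over such a matrix; the matrix values are invented, and the paper's actual approach derives them from binary-instrumented memory traces and may group threads more elaborately than simple pairing.

```c
/* Greedily pair the two unscheduled threads that share the most data so
 * they can be co-scheduled on the same node. */
#include <stdio.h>

#define NT 4

int main(void) {
    long share[NT][NT] = {            /* share[i][j]: data touched by both */
        { 0, 40,  5,  1},
        {40,  0,  2,  3},
        { 5,  2,  0, 30},
        { 1,  3, 30,  0},
    };
    int paired[NT] = {0};

    for (int round = 0; round < NT / 2; round++) {
        int bi = -1, bj = -1; long best = -1;
        for (int i = 0; i < NT; i++)
            for (int j = i + 1; j < NT; j++)
                if (!paired[i] && !paired[j] && share[i][j] > best) {
                    best = share[i][j]; bi = i; bj = j;
                }
        paired[bi] = paired[bj] = 1;
        printf("co-schedule threads %d and %d (sharing %ld)\n", bi, bj, best);
    }
    return 0;
}
```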
Showing results 1–15 out of 1,063 results