59 Hits in 4.8 sec

The Wisconsin Wind Tunnel project

Mark D. Hill, James R. Larus, David A. Wood
1994 SIGARCH Computer Architecture News  
CC-NUMA and S-COMA, for our benchmarks and base system assumptions.  ...  This paper evaluates protocol scheduling policies for a software DSM running on an SMP cluster.  ... 
doi:10.1145/192537.192543 fatcat:rvtgkgeonnba3cdbociaiglrdq

On the Energy-Efficiency of Byte-Addressable Non-Volatile Memory

Hans Vandierendonck, Ahmad Hassan, Dimitrios S. Nikolopoulos
2015 IEEE computer architecture letters  
An Efficient Kernel-Level Scheduling Methodology for Multiprogrammed Shared Memory Multiprocessors.  ...  Enhancing the Performance of Autoscheduling with Locality-Based Partitioning on Distributed Shared Memory Multiprocessors.  ... 
doi:10.1109/lca.2014.2355195 fatcat:35mkkiczcnd5thic5aqpwodiry

Optimization of the Load Balancing Policy for Tiled Many-Core Processors

Ye Liu, Shinpei Kato, Masato Edahiro
2019 IEEE Access  
and more threads sharing the same tile (processing core), and the contention for memory controllers due to cache misses.  ...  ., KNL and the TILE-Gx72 processor), on which processing cores are fitted onto a single chip and cores are interconnected via mesh-based networks, are different from the traditional many-core systems.  ...  shared-memory multi-threaded applications for chip multiprocessors.  ... 
doi:10.1109/access.2018.2883415 fatcat:uju5yyhserbvrkuqv3pgjyux4a

Scheduler-Activated Dynamic Page Migration for Multiprogrammed DSM Multiprocessors

Dimitrios S. Nikolopoulos, Constantine D. Polychronopoulos, Theodore S. Papatheodorou, Jesús Labarta, Eduard Ayguadé
2002 Journal of Parallel and Distributed Computing  
On cache-coherent distributed shared-memory (DSM) multiprocessors, such scheduler interventions tend to increase the rate of remote memory accesses.  ...  The performance of multiprogrammed shared-memory multiprocessors suffers often from scheduler interventions that neglect data locality.  ...  ACKNOWLEDGMENT We thank the journal referees for their insightful comments, which helped us improve the paper considerably.  ... 
doi:10.1006/jpdc.2001.1817 fatcat:mm4g6niwc5e4dn4adqub77grbu

Realistic Workload Scheduling Policies for Taming the Memory Bandwidth Bottleneck of SMPs [chapter]

Christos D. Antonopoulos, Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou
2004 Lecture Notes in Computer Science  
Therefore, we present and evaluate two realistic scheduling policies which treat memory bandwidth as a first-class resource.  ...  Scheduling algorithms usually attempt to maximize performance of memory intensive applications by optimally exploiting the cache hierarchy.  ...  Introduction Conventional schedulers for shared-memory multiprocessors are practically organized around the well-known UNIX multilevel priority queue mechanism, with limited extensions for support of multiprocessor  ... 
doi:10.1007/978-3-540-30474-6_33 fatcat:elyq5hocivhazeb54xkn4imndi

Adapt or become extinct!

Georgios Goumas, Sally A. McKee, Magnus Själander, Thomas R. Gross, Sven Karlsson, Christian W. Probst, Lixin Zhang
2011 Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era - EXADAPT '11  
This environment also presents a number of hard boundaries (walls) for applications which limit software development (parallel programming wall), performance (memory wall, communication wall) and viability  ...  We consider specialization based on dynamic information like user input, architectural characteristics such as the memory hierarchy organization, and the execution profile of the application as obtained  ...  NUMA-based multicore processors integrate one (or more) memory controller(s) with each processor, and the physical memory space is divided between processors.  ... 
doi:10.1145/2000417.2000422 fatcat:h5gnh4twjzcetarsdj2cnfp2yu

Scheduler-conscious synchronization

Leonidas I. Kontothanassis, Robert W. Wisniewski, Michael L. Scott
1997 ACM Transactions on Computer Systems  
Efficient synchronization is important for achieving good performance in parallel programs, especially on large-scale multiprocessors.  ...  We show that these problems are particularly severe for scalable synchronization algorithms based on distributed data structures.  ...  for pushing us a little when we needed it.  ... 
doi:10.1145/244764.244765 fatcat:dggjnw6zxfgvtj43stqeh76vpq

In-Memory Big Data Management and Processing: A Survey

Hao Zhang, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, Meihui Zhang
2015 IEEE Transactions on Knowledge and Data Engineering  
However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bounded disk-based systems.  ...  Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating disk I/O bottleneck, it is now possible to support interactive data analytics.  ...  We would like to thank the anonymous reviewers, and also Bingsheng He, Eric Lo and Bogdan Marius Tudor, for their insightful comments and suggestions.  ... 
doi:10.1109/tkde.2015.2427795 fatcat:u7r3rtvhxbainfeazfduxcdwrm

Scaling Non-Regular Shared-Memory Codes by Reusing Custom Loop Schedules

Dimitrios S. Nikolopoulos, Ernest Artiaga, Eduard Ayguadé, Jesús Labarta
2003 Scientific Programming  
as possible along the execution of the program for better memory access locality.  ...  In this paper we explore the idea of customizing and reusing loop schedules to improve the scalability of non-regular numerical codes in shared-memory architectures with non-uniform memory access latency  ...  Acknowledgements We are grateful to the ECMWF and Siegfried Benkner for providing us with the irregular kernels.  ... 
doi:10.1155/2003/379739 fatcat:hq64p5sconblroyahayyt2qiae

Fast synchronization on shared-memory multiprocessors: An architectural approach

Zhen Fang, Lixin Zhang, John B. Carter, Liqun Cheng, Michael Parker
2005 Journal of Parallel and Distributed Computing  
Second, we present an architectural innovation called active memory that enables very fast atomic operations in a shared-memory multiprocessor.  ...  To the best of our knowledge, synchronization based on active memory outperforms all existing spinlock and non-hardwired barrier implementations by a large margin.  ...  We would also like to thank Allan Gottlieb for his feedback on this work.  ... 
doi:10.1016/j.jpdc.2005.04.013 fatcat:3dj627j3r5ekzamur5epdzgdo4

Evaluation of OpenMP for the Cyclops Multithreaded Architecture [chapter]

George Almasi, Eduard Ayguadé, Călin Caşcaval, José Castaños, Jesús Labarta, Francisco Martínez, Xavier Martorell, José Moreira
2003 Lecture Notes in Computer Science  
Multithreaded architectures have the potential of tolerating large memory and functional unit latencies and increase resource utilization.  ...  Programming such applications for this unconventional design requires a significant porting effort when using the basic built-in mechanisms for thread management and synchronization.  ...  Multiprocessors systems-on-a-chip based on the replication of multithreaded cores offer a complexity-conscious alternative to future chip designs.  ... 
doi:10.1007/3-540-45009-2_6 fatcat:dod3k6rf5vffbh4vj33t7syl4y

Enhancing Programmability, Portability, and Performance with Rich Cross-Layer Abstractions [article]

Nandita Vijaykumar
2019 arXiv   pre-print
In doing so, they enable a rich space of hardware-software cooperative mechanisms to optimize for performance.  ...  This thesis makes the case for rich low-overhead cross-layer abstractions as a highly effective means to address the above challenges.  ...  We use Algorithm 1 to form CTA clusters, and schedule each formed cluster at the same SM in a non-NUMA system. In a NUMA system, we first partition the CTAs across the different NUMA zones (see §3.4.6),  ... 
arXiv:1911.05660v1 fatcat:w5f3g4isqbcphm2jjfzjtvrjnq

Region templates: Data representation and management for high-throughput image analysis

George Teodoro, Tony Pan, Tahsin Kurc, Jun Kong, Lee Cooper, Scott Klasky, Joel Saltz
2014 Parallel Computing  
The execution of the application is coordinated by a runtime system that implements optimizations for hybrid machines, including performance-aware scheduling for maximizing the utilization of computing  ...  Finally, a processing rate of 11,730 4K×4K tiles per minute was achieved for the microscopy imaging application on a cluster with 100 nodes (300 GPUs and 1,200 CPU cores).  ...  platforms with Non-Uniform Memory Access (NUMA).  ... 
doi:10.1016/j.parco.2014.09.003 pmid:26139953 pmcid:PMC4484879 fatcat:4miblqmyyzad5bdcvmzxngv2oy

Hardware-Conscious Stream Processing: A Survey [article]

Shuhao Zhang, Feng Zhang, Yingjun Wu, Bingsheng He, Paul Johns
2020 arXiv   pre-print
Witnessing the recent great achievements in the computer architecture community, researchers and practitioners have investigated the potential of adopting hardware-conscious stream processing by better  ...  The authors would like to thank the anonymous reviewer and the associate editor, Pınar Tözün, for their insightful comments on improving this manuscript.  ...  [8] proposed a cache conscious scheduling algorithm for mapping stream application on multicore processors.  ... 
arXiv:2001.05667v1 fatcat:hga7siyyzvbavilpxvxjofvtii

Region Templates: Data Representation and Management for Large-Scale Image Analysis [article]

George Teodoro, Tony Pan, Tahsin Kurc, Jun Kong, Lee Cooper, Scott Klasky, Joel Saltz
2014 arXiv   pre-print
A number of optimizations for hybrid machines are available in our runtime system, including performance-aware scheduling for maximizing utilization of computing devices and techniques to reduce impact  ...  In this paper, we introduce a region template abstraction for the efficient management of common data types used in analysis of large datasets of high resolution images on clusters of hybrid computing  ...  platforms with Non-Uniform Memory Access (NUMA).  ... 
arXiv:1405.7958v1 fatcat:doemcxj4djhmnplhvhkimy2x3q