
Optimizing virtual machine scheduling in NUMA multicore systems

Jia Rao, Kun Wang, Xiaobo Zhou, Cheng-Zhong Xu
2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)
However, the complex interplay among data locality, contention on shared on-chip memory resources, and cross-node data sharing overhead makes the delivery of an optimal and predictable program performance ... An increasing number of new multicore systems use the Non-Uniform Memory Access architecture due to its scalable memory performance. ... Acknowledgements: We are grateful to the anonymous reviewers for their constructive comments. This research was supported in part by the U.S. ...
doi:10.1109/hpca.2013.6522328 dblp:conf/hpca/RaoWZX13 fatcat:jz2yyayw4jgzbozgdvvj7hsivm
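
NUMA-aware scheduling decisions of the kind this paper targets start from the machine's node count and inter-node distances. As a point of reference only (my own sketch, not code from the paper), libnuma exposes this information on Linux:

```c
/* Sketch: inspect NUMA topology with libnuma (Linux).
 * Build with: gcc numa_topo.c -lnuma
 * Illustrative only; not taken from the paper above. */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();
    printf("configured NUMA nodes: %d\n", nodes);

    /* numa_distance() reports the ACPI SLIT distance (10 = local node). */
    for (int i = 0; i < nodes; i++) {
        for (int j = 0; j < nodes; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}
```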

A Survey on Hardware and Software Support for Thread Level Parallelism [article]

Somnath Mazumdar, Roberto Giorgi
2016 arXiv   pre-print
We also discuss software support for threads, mainly to increase deterministic behavior at runtime. ... Hardware support at execution time is crucial to system performance, so different types of hardware support for threads also exist or have been proposed, primarily based on widely used ... At runtime, memory is used by concurrent threads, so achieving high system throughput and effective resource allocation entails optimizing memory access. ...
arXiv:1603.09274v3 fatcat:75isdvgp5zbhplocook6273sq4
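
As one concrete example of runtime software support for thread placement (my illustration, not taken from the survey), Linux lets a program pin a thread to a core through the pthread affinity API:

```c
/* Sketch: pin the calling thread to logical CPU 0 (Linux, glibc).
 * Build with: gcc -pthread pin.c
 * Illustrative example only. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                 /* restrict the thread to CPU 0 */

    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return 1;
    }
    printf("thread pinned to CPU 0\n");
    return 0;
}
```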

Efficient Embedded Software Migration towards Clusterized Distributed-Memory Architectures

Rafael Garibotti, Anastasiia Butko, Luciano Ost, Abdoulaye Gamatie, Gilles Sassatelli, Chris Adeniyi-Jones
2016 IEEE transactions on computers  
This paper proposes a solution tailored for the efficient execution of applications defined with shared-memory programming models onto on-chip distributed-memory multicore architectures. ... However, with the growing number of cores in modern manycore embedded architectures, they present a bottleneck related to their centralized memory accesses. ... By concentrating all re-design efforts at the runtime software and hardware levels, it facilitates the execution of shared-memory-oriented embedded applications on distributed-memory multicore on-chip systems ...
doi:10.1109/tc.2015.2485202 fatcat:v2lbaqig5zd6zdr3h5afrhit3u

Runtime 3-D stacked cache data management for energy minimization of 3-D chip-multiprocessors

Seunghan Lee, Kyungsu Kang, Jongpil Jung, Chong-Min Kyung
2014 Fifteenth International Symposium on Quality Electronic Design  
In a 3-D processor-memory system, multiple cache dies can be stacked onto a multi-core die to reduce the latency and power of the on-chip wires connecting the cores and the cache, which finally increases the ... The proposed method considers both the temperature distribution and the memory traffic of 3-D CMPs. ... systems [2], when considering that on-chip SRAM cache often consumes almost half of the total energy in a microprocessor system [3][4].
doi:10.1109/isqed.2014.6783325 dblp:conf/isqed/LeeKJK14 fatcat:4fhwsqh66faxnmgrr2yi77niy4

ARS: an adaptive runtime system for locality optimization

Jie Tao, Martin Schulz, Wolfgang Karl
2003 Future generations computer systems  
A solution, called the Adaptive Runtime System (ARS), is presented in this paper. ARS is designed to adjust the data distribution at runtime through automatic page migrations.  ...  Shared memory programs running on Non-Uniform Memory Access (NUMA) machines usually face inherent performance problems stemming from excessive remote memory accesses.  ...  The RAHM (Remote Access Histories Mechanism) [8, 16] is a technique that uses remote access histories for thread migration in order to improve the locality of memory references in distributed shared  ... 
doi:10.1016/s0167-739x(02)00183-8 fatcat:fqfd7va3ovhdjon3vsrp73z25e
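
The automatic page migration ARS performs can be approximated in user space with the Linux move_pages(2) interface. The sketch below is my own illustration of that mechanism, not ARS code; the target node is an assumption.

```c
/* Sketch: migrate one page of a buffer to NUMA node 1 with move_pages(2).
 * Build with: gcc migrate.c -lnuma
 * Illustrative only; node 1 is assumed to exist on the test machine. */
#include <numaif.h>      /* move_pages, MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *buf = aligned_alloc(page, page);
    if (buf == NULL)
        return 1;
    buf[0] = 1;                          /* touch so the page gets allocated */

    void *pages[1]  = { buf };
    int   nodes[1]  = { 1 };             /* desired target node */
    int   status[1] = { -1 };

    long rc = move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
    if (rc < 0) {
        perror("move_pages");
        return 1;
    }
    printf("page status after migration: %d\n", status[0]);  /* node id or -errno */
    free(buf);
    return 0;
}
```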

A library for portable and composable data locality optimizations for NUMA systems

Zoltan Majo, Thomas R. Gross
2015 Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP 2015  
Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations  ...  Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses.  ...  Acknowledgments We thank Michael Stumm, Frank Müller, Yves Geissbühler, Albert Noll, and the anonymous referees for their helpful comments and acknowledge computing resources provided by SNF grant 206021  ... 
doi:10.1145/2688500.2688509 dblp:conf/ppopp/MajoG15 fatcat:qyf5f3uwd5astlg26tkbblfutq
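
The missing language-level NUMA support that the abstract points to is usually worked around with explicit placement calls. A minimal libnuma sketch follows (my example; the paper's library provides a higher-level, composable interface on top of mechanisms like this):

```c
/* Sketch: allocate an array on a specific NUMA node with libnuma.
 * Build with: gcc alloc_onnode.c -lnuma
 * Illustrative only; node 0 and the buffer size are arbitrary choices. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;

    size_t bytes = 1 << 20;                      /* 1 MiB */
    double *a = numa_alloc_onnode(bytes, 0);     /* pages bound to node 0 */
    if (a == NULL)
        return 1;

    /* The binding holds regardless of which thread touches the pages. */
    for (size_t i = 0; i < bytes / sizeof(double); i++)
        a[i] = 0.0;

    numa_free(a, bytes);
    return 0;
}
```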

A Library for Portable and Composable Data Locality Optimizations for NUMA Systems

Zoltan Majo, Thomas R. Gross
2017 ACM Transactions on Parallel Computing  
Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations  ...  Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses.  ...  Acknowledgments We thank Michael Stumm, Frank Müller, Yves Geissbühler, Albert Noll, and the anonymous referees for their helpful comments and acknowledge computing resources provided by SNF grant 206021  ... 
doi:10.1145/3040222 fatcat:2cjl3cpvtzhfdbxvkre3zijxse

A library for portable and composable data locality optimizations for NUMA systems

Zoltan Majo, Thomas R. Gross
2015 SIGPLAN notices  
Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations  ...  Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses.  ...  Acknowledgments We thank Michael Stumm, Frank Müller, Yves Geissbühler, Albert Noll, and the anonymous referees for their helpful comments and acknowledge computing resources provided by SNF grant 206021  ... 
doi:10.1145/2858788.2688509 fatcat:xoevnbyw55auxawir3lpjcpacm

Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs [chapter]

Balazs Gerofi, Masamichi Takagi, Yutaka Ishikawa
2014 Lecture Notes in Computer Science  
At the same time, these architectures come with complex on-chip networks for inter-core communication and multiple memory controllers for accessing off-chip RAM modules.  ...  Although the chip provides Uniform Memory Access (UMA), we find that there are substantial (as high as 60%) differences in access latencies for different memory blocks depending on which CPU core issues  ...  Acknowledgment This work has been partially supported by the CREST project of the Japan Science and Technology Agency (JST) and by the National Project of MEXT called Feasibility Study on Advanced and  ... 
doi:10.1007/978-3-319-14313-2_21 fatcat:ldnb6unhwnddnm4mc5jg4liqym
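
Latency differences of this kind are typically exposed with a pointer-chasing microbenchmark, in which every load depends on the previous one so prefetching cannot hide memory latency. The sketch below is my own; the buffer size and iteration count are assumptions, and per-core comparisons would additionally require pinning the thread (as in the affinity sketch earlier).

```c
/* Sketch: pointer-chasing load-latency microbenchmark.
 * A single random cycle (Sattolo's algorithm) is chased so each load
 * depends on the previous one and covers the whole buffer. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (1 << 22)      /* 4M slots, ~32 MiB: larger than typical LLC */
#define ITERS  (1 << 24)

int main(void)
{
    size_t *ring = malloc(N * sizeof(size_t));
    if (ring == NULL)
        return 1;

    /* Sattolo's algorithm: random permutation with exactly one cycle. */
    for (size_t i = 0; i < N; i++) ring[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = ring[i]; ring[i] = ring[j]; ring[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long i = 0; i < ITERS; i++)
        p = ring[p];                       /* serialized, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.1f ns (sink=%zu)\n", ns / ITERS, p);
    free(ring);
    return 0;
}
```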

Performance Impact of Task Mapping on the Cell BE Multicore Processor [chapter]

Jörg Keller, Ana Lucia Varbanescu
2011 Lecture Notes in Computer Science  
We find that low-level tricks for static mapping do not necessarily achieve optimal performance.  ...  We report on our experiments to map a simple application with communication in a ring to SPEs of a Cell BE processor such that performance is optimized.  ...  Platonov for running part of the experiments.  ... 
doi:10.1007/978-3-642-24322-6_2 fatcat:pix6sob4nzeefdrolisrq46l74

ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors [chapter]

Mohammad Hammoud, Sangyeun Cho, Rami Melhem
2009 Lecture Notes in Computer Science  
This paper proposes and studies a hardware-based adaptive controlled migration strategy for managing distributed L2 caches in chip multiprocessors. ... Building on an area-efficient shared cache design, the proposed scheme dynamically migrates cache blocks to cache banks that best minimize the average L2 access latency. ... The proposed mechanism optimizes the L2 miss rate by maintaining the uniqueness of cache blocks on chip. ...
doi:10.1007/978-3-540-92990-1_26 fatcat:yutlnwo5lbew7cqw53wrgm3vti
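
The migration policy described above amounts to choosing, for each cache block, the bank that minimizes access latency averaged over the cores that use it. The toy model below is my own interpretation with invented latency and access-count values, not the paper's hardware design.

```c
/* Sketch: pick the L2 bank that minimizes the access-count-weighted
 * latency for one cache block.  Toy model; all numbers are invented. */
#include <stdio.h>

#define CORES 4
#define BANKS 4

/* lat[c][b]: cycles for core c to reach bank b (assumed NUCA latencies). */
static const int lat[CORES][BANKS] = {
    {  5,  9, 13, 17 },
    {  9,  5,  9, 13 },
    { 13,  9,  5,  9 },
    { 17, 13,  9,  5 },
};

/* Return the bank with the lowest weighted latency for this block. */
static int best_bank(const long acc[CORES])
{
    int best = 0;
    long best_cost = -1;
    for (int b = 0; b < BANKS; b++) {
        long cost = 0;
        for (int c = 0; c < CORES; c++)
            cost += acc[c] * lat[c][b];
        if (best_cost < 0 || cost < best_cost) {
            best_cost = cost;
            best = b;
        }
    }
    return best;
}

int main(void)
{
    long acc[CORES] = { 100, 10, 0, 400 };   /* per-core accesses to one block */
    printf("migrate block to bank %d\n", best_bank(acc));
    return 0;
}
```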

OpenMP extension to SMP clusters

Yang-Suk Kee
2006 IEEE potentials  
The major obstacles are thread-unsafe memory access, slow inter-process synchronization, and excessive remote page accesses, which stem from the page-based memory consistency mechanisms of the traditional ... This paper discusses approaches to applying the OpenMP programming model to SMP (Symmetric Multi-Processor) clusters using SDSM (Software Distributed Shared Memory). ... Some notable examples are the studies on OpenMP for multi-processors on a chip in the embedded-system community and OpenMP for computational Grids in the high-performance distributed computing community ...
doi:10.1109/mp.2006.1657761 fatcat:5th5gj37wzbqbd5febzq7hoavq
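
The programming model being mapped onto SDSM clusters here is ordinary OpenMP; a minimal example of the kind of shared-memory loop such a system must handle (generic OpenMP, not the paper's extension) is:

```c
/* Sketch: a plain OpenMP shared-memory loop.  On an SDSM-based cluster
 * OpenMP, iterations run on different nodes and the runtime keeps the
 * shared arrays consistent at page granularity.
 * Build with: gcc -fopenmp saxpy.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];           /* saxpy on shared arrays */

    printf("y[0] = %f (max threads: %d)\n", y[0], omp_get_max_threads());
    return 0;
}
```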

Ecoscale: Reconfigurable Computing And Runtime System For Future Exascale Systems

Iakovos Mavroidis, Ioannis Papaefstathiou, Luciano Lavagno, Dimitrios Nikolopoulos, Dirk Koch, John Goodacre, Ioannis Sourdis, Vassilis Papaefstathiou, Marcello Coppola, Manuel Palomino
2016 Zenodo  
... Unit with coherent memory access. ... ECOSCALE introduces a novel heterogeneous energy-efficient hierarchical architecture, as well as a hybrid many-core+OpenCL programming environment and runtime system. ... intelligent runtime system and middleware; and hardware support for sharing distributed and reconfigurable accelerators. ...
doi:10.5281/zenodo.34893 fatcat:ocwfndo4vjei3hqucmndj22xu4

Locality-information-based scheduling in shared-memory multiprocessors [chapter]

Frank Bellosa
1996 Lecture Notes in Computer Science  
All data gathered at runtime are transformed into affinity values inside a metric space, so that threads migrate near to their (sub)optimal operation points defined by location and timing of execution.  ...  This paper examines the performance implications of locality information usage in thread scheduling algorithms for scalable shared-memory multiprocessors.  ...  Special thanks to Martin Steckermeier for designing and implementing the Mthreads runtime system.  ... 
doi:10.1007/bfb0022298 fatcat:a2tqyztdcjapzmupo4sbvnq77e
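
The affinity values mentioned in the abstract can be pictured as per-node scores derived from runtime counters. The sketch below is my own interpretation with an invented metric, not Bellosa's Mthreads implementation; it moves a thread to a new node only when the gain clears a hysteresis margin.

```c
/* Sketch: locality-driven node selection with hysteresis.
 * cached[n] stands in for a thread's affinity value on node n
 * (an invented metric, e.g. estimated resident cache footprint). */
#include <stdio.h>

#define NODES 4

/* Return the node the thread should run on next: the best-scoring node,
 * but only if it beats the current node by a margin, so small
 * fluctuations do not bounce the thread between nodes. */
static int next_node(int cur, const long cached[NODES], double margin)
{
    int best = cur;
    for (int n = 0; n < NODES; n++)
        if (cached[n] > cached[best])
            best = n;
    if (best != cur && (double)cached[best] > (double)cached[cur] * (1.0 + margin))
        return best;
    return cur;
}

int main(void)
{
    long cached[NODES] = { 120, 3400, 80, 15 };   /* invented counters */
    printf("run thread on node %d\n", next_node(0, cached, 0.25));
    return 0;
}
```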

CODA

Hyojong Kim, Ramyad Hadidi, Lifeng Nai, Hyesoon Kim, Nuwan Jayasena, Yasuko Eckert, Onur Kayiran, Gabriel Loh
2018 ACM Transactions on Architecture and Code Optimization (TACO)  
... contiguously on individual memory modules (as is desirable for NDP private data), and (2) decide whether to localize or distribute each memory object based on its anticipated access pattern and steer ... interfaces by distributing the memory traffic. ...
doi:10.1145/3232521 fatcat:vrmsepasrfgadanadruj6bvuoq
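
CODA's localize-or-distribute decision can be caricatured as a threshold test on an object's anticipated access profile. The sketch below is a toy model of that idea, not the mechanism from the TACO paper; the profile and threshold are invented.

```c
/* Sketch: decide placement for one memory object across NDP memory stacks.
 * Toy localize-vs-distribute policy; access counts and the threshold are
 * assumptions, not values from the paper. */
#include <stdio.h>

#define STACKS 4

typedef enum { LOCALIZE, DISTRIBUTE } placement_t;

static placement_t place(const long acc[STACKS], int *home, double threshold)
{
    long total = 0, max = 0;
    *home = 0;
    for (int s = 0; s < STACKS; s++) {
        total += acc[s];
        if (acc[s] > max) { max = acc[s]; *home = s; }
    }
    /* Mostly accessed from one stack: keep it there so traffic stays local.
       Otherwise interleave it so traffic spreads over all memory interfaces. */
    return (total > 0 && (double)max / (double)total >= threshold)
               ? LOCALIZE : DISTRIBUTE;
}

int main(void)
{
    long acc[STACKS] = { 9500, 200, 150, 150 };  /* anticipated accesses */
    int home;
    if (place(acc, &home, 0.8) == LOCALIZE)
        printf("localize object on stack %d\n", home);
    else
        printf("distribute object across all stacks\n");
    return 0;
}
```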
Showing results 1–15 out of 979 results