Optimizing virtual machine scheduling in NUMA multicore systems
2013
2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)
However, the complex interplay among data locality, contention on shared on-chip memory resources, and cross-node data sharing overhead makes the delivery of optimal and predictable program performance ...
An increasing number of new multicore systems use the Non-Uniform Memory Access architecture due to its scalable memory performance. ...
Acknowledgements We are grateful to the anonymous reviewers for their constructive comments. This research was supported in part by the U.S. ...
doi:10.1109/hpca.2013.6522328
dblp:conf/hpca/RaoWZX13
fatcat:jz2yyayw4jgzbozgdvvj7hsivm
A Survey on Hardware and Software Support for Thread Level Parallelism
[article]
2016
arXiv
pre-print
We further discuss software support for threads, mainly to increase deterministic behavior at runtime. ...
Hardware support at execution time is very crucial to the performance of the system, thus different types of hardware support for threads also exist or have been proposed, primarily based on widely used ...
At runtime, memory is used by concurrent threads, so achieving high system throughput, and effective resource allocation entails optimizing memory access. ...
arXiv:1603.09274v3
fatcat:75isdvgp5zbhplocook6273sq4
Efficient Embedded Software Migration towards Clusterized Distributed-Memory Architectures
2016
IEEE transactions on computers
This paper proposes a solution tailored for an efficient execution of applications defined with shared-memory programming models onto on-chip distributed-memory multicore architectures. ...
However with the growing number of cores in modern manycore embedded architectures, they present a bottleneck related to their centralized memory accesses. ...
By concentrating all re-design efforts at runtime software and hardware levels, it facilitates the execution of shared-memory oriented embedded applications on distributed-memory multicore on-chip systems ...
doi:10.1109/tc.2015.2485202
fatcat:v2lbaqig5zd6zdr3h5afrhit3u
Runtime 3-D stacked cache data management for energy minimization of 3-D chip-multiprocessors
2014
Fifteenth International Symposium on Quality Electronic Design
In a 3-D processor-memory system, multiple cache dies can be stacked onto multi-core die to reduce latency and power of the on-chip wires connecting the cores and the cache, which finally increases the ...
The proposed method considers both temperature distribution and memory traffic of 3-D CMPs. ...
systems [2], when considering that on-chip SRAM cache often consumes almost half of total energy in a microprocessor system [3] [4]. ...
doi:10.1109/isqed.2014.6783325
dblp:conf/isqed/LeeKJK14
fatcat:4fhwsqh66faxnmgrr2yi77niy4
ARS: an adaptive runtime system for locality optimization
2003
Future generations computer systems
A solution, called the Adaptive Runtime System (ARS), is presented in this paper. ARS is designed to adjust the data distribution at runtime through automatic page migrations. ...
Shared memory programs running on Non-Uniform Memory Access (NUMA) machines usually face inherent performance problems stemming from excessive remote memory accesses. ...
The RAHM (Remote Access Histories Mechanism) [8, 16] is a technique that uses remote access histories for thread migration in order to improve the locality of memory references in distributed shared ...
doi:10.1016/s0167-739x(02)00183-8
fatcat:fqfd7va3ovhdjon3vsrp73z25e
A library for portable and composable data locality optimizations for NUMA systems
2015
Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP 2015
Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations ...
Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. ...
Acknowledgments We thank Michael Stumm, Frank Müller, Yves Geissbühler, Albert Noll, and the anonymous referees for their helpful comments and acknowledge computing resources provided by SNF grant 206021 ...
doi:10.1145/2688500.2688509
dblp:conf/ppopp/MajoG15
fatcat:qyf5f3uwd5astlg26tkbblfutq
A Library for Portable and Composable Data Locality Optimizations for NUMA Systems
2017
ACM Transactions on Parallel Computing
Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations ...
Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. ...
Acknowledgments We thank Michael Stumm, Frank Müller, Yves Geissbühler, Albert Noll, and the anonymous referees for their helpful comments and acknowledge computing resources provided by SNF grant 206021 ...
doi:10.1145/3040222
fatcat:2cjl3cpvtzhfdbxvkre3zijxse
A library for portable and composable data locality optimizations for NUMA systems
2015
SIGPLAN notices
Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations ...
Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. ...
Acknowledgments We thank Michael Stumm, Frank Müller, Yves Geissbühler, Albert Noll, and the anonymous referees for their helpful comments and acknowledge computing resources provided by SNF grant 206021 ...
doi:10.1145/2858788.2688509
fatcat:xoevnbyw55auxawir3lpjcpacm
Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs
[chapter]
2014
Lecture Notes in Computer Science
At the same time, these architectures come with complex on-chip networks for inter-core communication and multiple memory controllers for accessing off-chip RAM modules. ...
Although the chip provides Uniform Memory Access (UMA), we find that there are substantial (as high as 60%) differences in access latencies for different memory blocks depending on which CPU core issues ...
Acknowledgment This work has been partially supported by the CREST project of the Japan Science and Technology Agency (JST) and by the National Project of MEXT called Feasibility Study on Advanced and ...
doi:10.1007/978-3-319-14313-2_21
fatcat:ldnb6unhwnddnm4mc5jg4liqym
Performance Impact of Task Mapping on the Cell BE Multicore Processor
[chapter]
2011
Lecture Notes in Computer Science
We find that low-level tricks for static mapping do not necessarily achieve optimal performance. ...
We report on our experiments to map a simple application with communication in a ring to SPEs of a Cell BE processor such that performance is optimized. ...
Platonov for running part of the experiments. ...
doi:10.1007/978-3-642-24322-6_2
fatcat:pix6sob4nzeefdrolisrq46l74
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors
[chapter]
2009
Lecture Notes in Computer Science
This paper proposes and studies a hardware-based adaptive controlled migration strategy for managing distributed L2 caches in chip multiprocessors. ...
Building on an area-efficient shared cache design, the proposed scheme dynamically migrates cache blocks to cache banks that best minimize the average L2 access latency. ...
The proposed mechanism optimizes the L2 miss rate via maintaining the uniqueness of cache blocks on chip. ...
doi:10.1007/978-3-540-92990-1_26
fatcat:yutlnwo5lbew7cqw53wrgm3vti
OpenMP extension to SMP clusters
2006
IEEE potentials
The major obstacles are thread-unsafe memory access, slow inter-process synchronization, and excessive remote page accesses, which stem from the page-based memory consistency mechanisms of the traditional ...
This paper discusses approaches to applying the OpenMP programming model to SMP (Symmetric Multi-Processor) clusters using SDSM (Software Distributed Shared Memory). ...
Some noticeable examples are the studies on OpenMP for multi-processors on a chip in the embedded system community and OpenMP for computational Grids in the high performance distributed computing community ...
doi:10.1109/mp.2006.1657761
fatcat:5th5gj37wzbqbd5febzq7hoavq
Ecoscale: Reconfigurable Computing And Runtime System For Future Exascale Systems
2016
Zenodo
Unit with coherent memory access. ...
ECOSCALE introduces a novel heterogeneous energy-efficient hierarchical architecture, as well as a hybrid many-core+OpenCL programming environment and runtime system. ...
intelligent runtime system and middleware; and hardware support for sharing distributed and reconfigurable accelerators. ...
doi:10.5281/zenodo.34893
fatcat:ocwfndo4vjei3hqucmndj22xu4
Locality-information-based scheduling in shared-memory multiprocessors
[chapter]
1996
Lecture Notes in Computer Science
All data gathered at runtime are transformed into affinity values inside a metric space, so that threads migrate near to their (sub)optimal operation points defined by location and timing of execution. ...
This paper examines the performance implications of locality information usage in thread scheduling algorithms for scalable shared-memory multiprocessors. ...
Special thanks to Martin Steckermeier for designing and implementing the Mthreads runtime system. ...
doi:10.1007/bfb0022298
fatcat:a2tqyztdcjapzmupo4sbvnq77e
contiguously on individual memory modules (as is desirable for NDP private data), and (2) decide whether to localize or distribute each memory object based on its anticipated access pattern and steer ...
interfaces by distributing the memory traffic. ...
doi:10.1145/3232521
fatcat:vrmsepasrfgadanadruj6bvuoq