Shared-memory performance profiling
1997
SIGPLAN notices
distributed shared memory system. ...
As a demonstration, Paradyn helped us improve the performance of a new shared-memory application program by a factor of four. ...
ACKNOWLEDGMENTS Thanks to Sang Tae Kim and Atipat Rojnuckarin for providing the application code and the insight and effort for tuning its performance. ...
doi:10.1145/263767.263796
fatcat:6tgmda63wvf4jpwfhyuwuzepli
Shared-memory performance profiling
1997
Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPOPP '97
distributed shared memory system. ...
As a demonstration, Paradyn helped us improve the performance of a new shared-memory application program by a factor of four. ...
ACKNOWLEDGMENTS Thanks to Sang Tae Kim and Atipat Rojnuckarin for providing the application code and the insight and effort for tuning its performance. ...
doi:10.1145/263764.263796
dblp:conf/ppopp/XuLM97
fatcat:ymcsip25sbhojhcnovfrbopg7q
On Using Incremental Profiling for the Performance Analysis of Shared Memory Parallel Applications
[chapter]
Lecture Notes in Computer Science
Profiling is often the method of choice for performance analysis of parallel applications due to its low overhead and easily comprehensible results. ...
of performance data. ...
Conclusion and Future Work We have presented a study on the utility of incremental profiling for performance analysis of shared memory parallel applications. ...
doi:10.1007/978-3-540-74466-5_8
fatcat:jwb3xpww45gpzh464yctij7o54
GMProf: A low-overhead, fine-grained profiling approach for GPU programs
2012
2012 19th International Conference on High Performance Computing
Our experimental results show that GMProf, with all optimizations, incurs a moderate overhead, e.g., 1.36 times on average for shared memory profiling. ...
use of shared memory, but also helped tune the implementations. ...
Fig. 1. Sample code showing the use of shared memory.
1) Runtime Overhead for Shared Memory Profiling: To measure the efficiency of shared memory profiling and the performance contribution of the ...
doi:10.1109/hipc.2012.6507475
dblp:conf/hipc/ZhengRMQA12
fatcat:57nz4e23crdn7djx5bjmhhzj3m
Ensuring the Fairness of program's performance on CMP
2017
MATEC Web of Conferences
The share resource CPU and memory are included in our study. Meanwhile, the Linux resource management tool Cgroups is used to realize our idea. ...
By the resource we profiled and the tool support, we realize a reasonable resource dividing method to ensure the fairness of program's performance. ...
Among all kinds of shared resources, CPU and memory have the greatest impact on program's performance. ...
doi:10.1051/matecconf/201712804016
fatcat:3uwpkfheobdwtjghavfczx66ya
Modeling Shared Cache Performance of OpenMP Programs using Reuse Distance
[article]
2019
arXiv
pre-print
We present a Scalable Analytical Shared Memory Model to predict the performance of parallel applications that runs on a multicore computer and shares the same level of cache in the hierarchy. ...
Performance modeling of parallel applications on multicore computers remains a challenge in computational co-design due to the complex design of multicore processors including private and shared memory ...
Several recent works have focused on CRD profile and performance prediction of the shared cache [26]-[30]. ...
arXiv:1907.12666v1
fatcat:cbjoytjwgbgpbaxm7mjtzdctae
PPT-SASMM: Scalable Analytical Shared Memory Model: Predicting the Performance of Multicore Caches from a Single-Threaded Execution Trace
[article]
2021
arXiv
pre-print
Multicores include complex private and shared memory hierarchies. We present a Scalable Analytical Shared Memory Model (SASMM). ...
The profiles are calculated from the memory traces of applications that run sequentially rather than using multi-threaded traces. ...
Several recent works have focused on CRD profile and performance prediction of the shared cache [9, 18, 20, 44, 48]. ...
arXiv:2103.10635v1
fatcat:rijjcrwmhjhwxph5d7y5zzw5sy
NumaPerf: Predictive and Full NUMA Profiling
[article]
2021
arXiv
pre-print
To achieve this, NumaPerf focuses on memory sharing patterns between threads, instead of real remote accesses. ...
However, existing NUMA-profiling tools share some similar shortcomings, such as portability, effectiveness, and helpfulness issues. ...
It separates cache false sharing issues from true sharing and page sharing so that users can use the padding to achieve better performance. ...
arXiv:2102.05204v1
fatcat:imxguscybbeuhj4gbovg6zxsdm
PPT-Multicore: Performance Prediction of OpenMP applications using Reuse Profiles and Analytical Modeling
[article]
2021
arXiv
pre-print
We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) to predict parallel application performance running on a multicore processor. ...
We use a probabilistic and computationally efficient reuse profile to predict the cache hit rates and runtimes of OpenMP programs' parallel sections. ...
Several recent works have focused on CRD profiles for predicting the performance of shared cache [21, 33, 76, 79, 89]. ...
arXiv:2104.05102v1
fatcat:mtcyzxf5g5hslogtp5mra5z7wa
Studying multicore processor scaling via reuse distance analysis
2013
Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13
The key to realizing the potential of LCMPs is the cache hierarchy, so studying how memory performance will scale is crucial. Reuse distance (RD) analysis can help architects do this. ...
In particular, recent work has developed concurrent reuse distance (CRD) and private reuse distance (PRD) profiles to enable analysis of shared and private caches. ...
RD analysis measures a program's memory reuse distance histogram, or RD profile, capturing the application-level locality responsible for cache performance. ...
doi:10.1145/2485922.2485965
dblp:conf/isca/WuZY13
fatcat:g6p2y66rjndv7natn4a5fv3atq
Studying multicore processor scaling via reuse distance analysis
2013
SIGARCH Computer Architecture News
The key to realizing the potential of LCMPs is the cache hierarchy, so studying how memory performance will scale is crucial. Reuse distance (RD) analysis can help architects do this. ...
In particular, recent work has developed concurrent reuse distance (CRD) and private reuse distance (PRD) profiles to enable analysis of shared and private caches. ...
RD analysis measures a program's memory reuse distance histogram, or RD profile, capturing the application-level locality responsible for cache performance. ...
doi:10.1145/2508148.2485965
fatcat:vjuzzdw2rrekpf7ofcd76nyo2i
Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis
2012
Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness - MSPC '12
In today's hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. ...
Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. ...
Schuff predicts shared (private) cache performance using CRD (PRD) profiles. In subsequent work, Schuff speeds up profile acquisition via sampling and parallelization techniques [16]. ...
doi:10.1145/2247684.2247687
dblp:conf/pldi/WuY12
fatcat:bdtfalx5lnbuxcgqgxupjqadse
Preliminary evaluation of dynamic load balancing using loop re-partitioning on Omni/SCASH
2003
CCGrid 2003. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings.
In this paper, we report our ongoing work on dynamic load balancing extension to Omni/SCASH which is an implementation of OpenMP on Software Distributed Shared Memory, SCASH. ...
Such a commodity cluster environment, there may be incremental upgrade due to several reasons, such as rapid progress in processor technologies, or user needs and it may cause the performance heterogeneity ...
In the ERC memory model, the consistency of a shared memory area is maintained on each synchronization called the memory barrier synchronization point. ...
doi:10.1109/ccgrid.2003.1199402
dblp:conf/ccgrid/SakaeSMH03
fatcat:n77y5o66lbenhb4iccj2oko2sa
Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs
2013
ACM Transactions on Computer Systems
And fourth, we apply CRD and PRD profiles to analyze multicore cache performance. ...
The predicted profiles can then be used to predict cache performance for the scaled CPUs. ...
RD analysis measures a program's memory reuse distance histogram, or RD profile, capturing the application-level locality responsible for cache performance. ...
doi:10.1145/2427631.2427632
fatcat:rwqdtcyp6jbsfn5ptcas22qolu
Matching memory access patterns and data placement for NUMA systems
2012
Proceedings of the Tenth International Symposium on Code Generation and Optimization - CGO '12
Many recent multicore multiprocessors are based on a nonuniform memory architecture (NUMA). ...
To alleviate this problem we describe a small set of language-level primitives for memory allocation and loop scheduling. ...
By eliminating sharing we obtain performance improvements also when profile-based allocation does not. ...
doi:10.1145/2259016.2259046
dblp:conf/cgo/MajoG12
fatcat:ljm3tb3to5dpfexns4uqbviooi