Shared-memory performance profiling

Zhichen Xu, James R. Larus, Barton P. Miller
1997 SIGPLAN Notices  
distributed shared memory system.  ...  As a demonstration, Paradyn helped us improve the performance of a new shared-memory application program by a factor of four.  ...  ACKNOWLEDGMENTS Thanks to Sang Tae Kim and Atipat Rojnuckarin for providing the application code and the insight and effort for tuning its performance.  ... 
doi:10.1145/263767.263796 fatcat:6tgmda63wvf4jpwfhyuwuzepli

Shared-memory performance profiling

Zhichen Xu, James R. Larus, Barton P. Miller
1997 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPOPP '97  
distributed shared memory system.  ...  As a demonstration, Paradyn helped us improve the performance of a new shared-memory application program by a factor of four.  ...  ACKNOWLEDGMENTS Thanks to Sang Tae Kim and Atipat Rojnuckarin for providing the application code and the insight and effort for tuning its performance.  ... 
doi:10.1145/263764.263796 dblp:conf/ppopp/XuLM97 fatcat:ymcsip25sbhojhcnovfrbopg7q

On Using Incremental Profiling for the Performance Analysis of Shared Memory Parallel Applications [chapter]

Karl Fuerlinger, Michael Gerndt, Jack Dongarra
Lecture Notes in Computer Science  
Profiling is often the method of choice for performance analysis of parallel applications due to its low overhead and easily comprehensible results.  ...  of performance data.  ...  Conclusion and Future Work We have presented a study on the utility of incremental profiling for performance analysis of shared memory parallel applications.  ... 
doi:10.1007/978-3-540-74466-5_8 fatcat:jwb3xpww45gpzh464yctij7o54

GMProf: A low-overhead, fine-grained profiling approach for GPU programs

Mai Zheng, Vignesh T. Ravi, Wenjing Ma, Feng Qin, Gagan Agrawal
2012 2012 19th International Conference on High Performance Computing  
Our experimental results show that GMProf, with all optimizations, incurs a moderate overhead, e.g., 1.36 times on average for shared memory profiling.  ...  use of shared memory, but also helped tune the implementations.  ...  Fig. 1. Sample code showing the use of shared memory. 1) Runtime Overhead for Shared Memory Profiling: To measure the efficiency of shared memory profiling and the performance contribution of the  ... 
doi:10.1109/hipc.2012.6507475 dblp:conf/hipc/ZhengRMQA12 fatcat:57nz4e23crdn7djx5bjmhhzj3m

Ensuring the Fairness of program's performance on CMP

Qilong Wang, Jun Gao, Guangsong Hou, Hongkui Li, Ke Xu, Yansong Wang
2017 MATEC Web of Conferences  
The share resource CPU and memory are included in our study. Meanwhile, the Linux resource management tool Cgroups is used to realize our idea.  ...  By the resource we profiled and the tool support, we realize a reasonable resource dividing method to ensure the fairness of program's performance.  ...  Among all kinds of shared resources, CPU and memory have the greatest impact on program's performance.  ... 
doi:10.1051/matecconf/201712804016 fatcat:3uwpkfheobdwtjghavfczx66ya

Modeling Shared Cache Performance of OpenMP Programs using Reuse Distance [article]

Atanu Barai and Gopinath Chennupati and Nandakishore Santhi and Abdel-Hameed A. Badawy and Stephan Eidenbenz
2019 arXiv   pre-print
We present a Scalable Analytical Shared Memory Model to predict the performance of parallel applications that runs on a multicore computer and shares the same level of cache in the hierarchy.  ...  Performance modeling of parallel applications on multicore computers remains a challenge in computational co-design due to the complex design of multicore processors including private and shared memory  ...  Several recent works have focused on CRD profile and performance prediction of the shared cache [26]-[30].  ... 
arXiv:1907.12666v1 fatcat:cbjoytjwgbgpbaxm7mjtzdctae

PPT-SASMM: Scalable Analytical Shared Memory Model: Predicting the Performance of Multicore Caches from a Single-Threaded Execution Trace [article]

Atanu Barai, Gopinath Chennupati, Nandakishore Santhi, Abdel-Hameed Badawy, Yehia Arafa, Stephan Eidenbenz
2021 arXiv   pre-print
Multicores include complex private and shared memory hierarchies. We present a Scalable Analytical Shared Memory Model (SASMM).  ...  The profiles are calculated from the memory traces of applications that run sequentially rather than using multi-threaded traces.  ...  Several recent works have focused on CRD profile and performance prediction of the shared cache [9, 18, 20, 44, 48].  ... 
arXiv:2103.10635v1 fatcat:rijjcrwmhjhwxph5d7y5zzw5sy

NumaPerf: Predictive and Full NUMA Profiling [article]

Xin Zhao
2021 arXiv   pre-print
To achieve this, NumaPerf focuses on memory sharing patterns between threads, instead of real remote accesses.  ...  However, existing NUMA-profiling tools share some similar shortcomings, such as portability, effectiveness, and helpfulness issues.  ...  It separates cache false sharing issues from true sharing and page sharing so that users can use the padding to achieve better performance.  ... 
arXiv:2102.05204v1 fatcat:imxguscybbeuhj4gbovg6zxsdm

PPT-Multicore: Performance Prediction of OpenMP applications using Reuse Profiles and Analytical Modeling [article]

Atanu Barai and Yehia Arafa and Abdel-Hameed Badawy and Gopinath Chennupati and Nandakishore Santhi and Stephan Eidenbenz
2021 arXiv   pre-print
We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) to predict parallel application performance running on a multicore processor.  ...  We use a probabilistic and computationally efficient reuse profile to predict the cache hit rates and runtimes of OpenMP programs' parallel sections.  ...  Several recent works have focused on CRD profiles for predicting the performance of shared cache [21, 33, 76, 79, 89].  ... 
arXiv:2104.05102v1 fatcat:mtcyzxf5g5hslogtp5mra5z7wa

Studying multicore processor scaling via reuse distance analysis

Meng-Ju Wu, Minshu Zhao, Donald Yeung
2013 Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13  
The key to realizing the potential of LCMPs is the cache hierarchy, so studying how memory performance will scale is crucial. Reuse distance (RD) analysis can help architects do this.  ...  In particular, recent work has developed concurrent reuse distance (CRD) and private reuse distance (PRD) profiles to enable analysis of shared and private caches.  ...  RD analysis measures a program's memory reuse distance histogram, or RD profile, capturing the application-level locality responsible for cache performance.  ... 
doi:10.1145/2485922.2485965 dblp:conf/isca/WuZY13 fatcat:g6p2y66rjndv7natn4a5fv3atq

Studying multicore processor scaling via reuse distance analysis

Meng-Ju Wu, Minshu Zhao, Donald Yeung
2013 SIGARCH Computer Architecture News  
The key to realizing the potential of LCMPs is the cache hierarchy, so studying how memory performance will scale is crucial. Reuse distance (RD) analysis can help architects do this.  ...  In particular, recent work has developed concurrent reuse distance (CRD) and private reuse distance (PRD) profiles to enable analysis of shared and private caches.  ...  RD analysis measures a program's memory reuse distance histogram, or RD profile, capturing the application-level locality responsible for cache performance.  ... 
doi:10.1145/2508148.2485965 fatcat:vjuzzdw2rrekpf7ofcd76nyo2i

Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis

Meng-Ju Wu, Donald Yeung
2012 Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness - MSPC '12  
In today's hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches.  ...  Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs.  ...  Schuff predicts shared (private) cache performance using CRD (PRD) profiles. In subsequent work, Schuff speeds up profile acquisition via sampling and parallelization techniques [16] .  ... 
doi:10.1145/2247684.2247687 dblp:conf/pldi/WuY12 fatcat:bdtfalx5lnbuxcgqgxupjqadse

Preliminary evaluation of dynamic load balancing using loop re-partitioning on Omni/SCASH

Y. Sakae, S. Matsuoka, M. Sato, H. Harada
2003 CCGrid 2003. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings.  
In this paper, we report our ongoing work on dynamic load balancing extension to Omni/SCASH which is an implementation of OpenMP on Software Distributed Shared Memory, SCASH.  ...  Such a commodity cluster environment, there may be incremental upgrade due to several reasons, such as rapid progress in processor technologies, or user needs and it may cause the performance heterogeneity  ...  In the ERC memory model, the consistency of a shared memory area is maintained on each synchronization called the memory barrier synchronization point.  ... 
doi:10.1109/ccgrid.2003.1199402 dblp:conf/ccgrid/SakaeSMH03 fatcat:n77y5o66lbenhb4iccj2oko2sa

Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs

Meng-Ju Wu, Donald Yeung
2013 ACM Transactions on Computer Systems  
And fourth, we apply CRD and PRD profiles to analyze multicore cache performance.  ...  The predicted profiles can then be used to predict cache performance for the scaled CPUs.  ...  RD analysis measures a program's memory reuse distance histogram, or RD profile, capturing the application-level locality responsible for cache performance.  ... 
doi:10.1145/2427631.2427632 fatcat:rwqdtcyp6jbsfn5ptcas22qolu

Matching memory access patterns and data placement for NUMA systems

Zoltan Majo, Thomas R. Gross
2012 Proceedings of the Tenth International Symposium on Code Generation and Optimization - CGO '12  
Many recent multicore multiprocessors are based on a nonuniform memory architecture (NUMA).  ...  To alleviate this problem we describe a small set of language-level primitives for memory allocation and loop scheduling.  ...  By eliminating sharing we obtain performance improvements also when profile-based allocation does not.  ... 
doi:10.1145/2259016.2259046 dblp:conf/cgo/MajoG12 fatcat:ljm3tb3to5dpfexns4uqbviooi