6,171 Hits in 4.2 sec

A fast analytical model of fully associative caches

Tobias Gysi, Tobias Grosser, Laurin Brandner, Torsten Hoefler
2019 Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI 2019  
Our cache model often computes the results within seconds and contrary to simulation the execution time is mostly problem size independent.  ...  Existing cache models and simulators provide the missing information but are computationally expensive.  ...  We also would like to thank the Swiss National Supercomputing Center for providing the computing resources.  ... 
doi:10.1145/3314221.3314606 dblp:conf/pldi/GysiGBH19 fatcat:v53rr5ds3fdtpnsnmgnjl7cfmi

Efficient threads mapping on multicore architecture

Iulian Nita, Adrian Rapan, Vasile Lazarescu, Tiberiu Seceleanu
2010 2010 8th International Conference on Communications  
We've realized a comparison between the parallel computing with an efficient mapping algorithm of threads to specific cores and parallel computing with threads mapping maintained by Linux kernel process  ...  In our simulation we used Kubuntu Linux operating system, a system with Intel Core 2 Duo processor and another system with an Intel Quad Core.  ...  However, since the two virtual CPUs compete for essentially all computing, cache, and memory resources, it would typically be more efficient to dispatch the process to a different core or CPU if one is  ... 
doi:10.1109/iccomm.2010.5508993 fatcat:t5ujqiwahfbjld2ip3kqcriub4

AMP: An Affinity-Based Metadata Prefetching Scheme in Large-Scale Distributed Storage Systems

Lin Lin, Xuemin Li, Hong Jiang, Yifeng Zhu, Lei Tian
2008 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)  
Compared with LRU and some of the latest file prefetching algorithms such as Nexus and C-Miner, our trace-driven simulations show that AMP can improve buffer cache hit rates by up to 12%, 4.5% and 4% respectively  ...  Through mining useful information about metadata accesses from past history, AMP can discover metadata file affinities accurately and intelligently for prefetching.  ...  Conventional data prefetching algorithms are usually very conservative and only prefetch one or two files upon each cache miss. They are not efficient for metadata prefetching.  ... 
doi:10.1109/ccgrid.2008.117 dblp:conf/ccgrid/LinLJZT08 fatcat:nyumvi3rdvf3nfkg2e7stqsjeu

An efficient profile-analysis framework for data-layout optimizations

Shai Rubin, Rastislav Bodík, Trishul Chilimbi
2002 SIGPLAN notices  
We propose a parameterizable framework for data-layout optimization of general-purpose applications.  ...  To make the search process practical, we develop space-reduction heuristics and optimize the expensive simulation via memoization.  ...  Assume that we have already computed (using cache simulation starting with an empty cache) the above summary values for these two sub-traces.  ... 
doi:10.1145/565816.503287 fatcat:oyncsxtkmvck5p6cspuooe767e

An efficient profile-analysis framework for data-layout optimizations

Shai Rubin, Rastislav Bodík, Trishul Chilimbi
2002 Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages - POPL '02  
We propose a parameterizable framework for data-layout optimization of general-purpose applications.  ...  To make the search process practical, we develop space-reduction heuristics and optimize the expensive simulation via memoization.  ...  Assume that we have already computed (using cache simulation starting with an empty cache) the above summary values for these two sub-traces.  ... 
doi:10.1145/503272.503287 dblp:conf/popl/RubinBC02 fatcat:jldcejcfonhdvld34luzzsan2e

hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications

François Broquedis, Jerome Clet-Ortega, Stephanie Moreaud, Nathalie Furmento, Brice Goglin, Guillaume Mercier, Samuel Thibault, Raymond Namyst
2010 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing  
High-performance computing applications now have to carefully adapt their placement and behavior according to the underlying hierarchy of hardware resources and their software affinities.  ...  The increasing numbers of cores, shared caches and memory nodes within machines introduces a complex hardware topology.  ...  For instance, it is a common practice to reduce the computation time vs. simulation accuracy dilemma by dynamically refining the simulation space only in the parts of the domain where accuracy is needed  ... 
doi:10.1109/pdp.2010.67 dblp:conf/pdp/BroquedisCMFGMTN10 fatcat:dh6xcoke6rffxhddhk4vrpftme

Microarchitectural mechanisms to exploit value structure in SIMT architectures

Ji Kim, Christopher Torng, Shreesha Srinath, Derek Lockhart, Christopher Batten
2013 SIGARCH Computer Architecture News  
compute-focused data-parallel accelerators.  ...  When compared to a baseline without compact affine execution, our approach can improve GP-SIMT cycle-level performance by 4-17% and can improve FG-SIMT absolute performance by 20-65% and energy efficiency  ...  In FG-SIMT, we can reuse the CP's standard functional unit for base computations, but still must add an extra functional unit for stride computations.  ... 
doi:10.1145/2508148.2485934 fatcat:2a4t3xfgyrcjrguwkkrzg5w5pq

Microarchitectural mechanisms to exploit value structure in SIMT architectures

Ji Kim, Christopher Torng, Shreesha Srinath, Derek Lockhart, Christopher Batten
2013 Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13  
compute-focused data-parallel accelerators.  ...  When compared to a baseline without compact affine execution, our approach can improve GP-SIMT cycle-level performance by 4-17% and can improve FG-SIMT absolute performance by 20-65% and energy efficiency  ...  In FG-SIMT, we can reuse the CP's standard functional unit for base computations, but still must add an extra functional unit for stride computations.  ... 
doi:10.1145/2485922.2485934 dblp:conf/isca/KimTSLB13 fatcat:nw23iplrp5ajxftz4vstmq2r3i

Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors

Qingda Lu, Christophe Alias, Uday Bondhugula, Thomas Henretty, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, P. Sadayappan, Yongjian Chen, Haibo Lin, Tin-fook Ngai
2009 2009 18th International Conference on Parallel Architectures and Compilation Techniques  
localizable computations.  ...  Simulation-based results on a 16-core 2D tiled CMP demonstrate the effectiveness of the approach.  ...  The authors would also like to thank the anonymous reviewers for their comments on the earlier version of this paper.  ... 
doi:10.1109/pact.2009.36 dblp:conf/IEEEpact/LuABHKRRSCLN09 fatcat:vsvqsjncfna6bkfn4caivb2ghe

Multiple global affine motion model for H.264 video coding with low bit rate

Xiaohuan Li, Joel R. Jackson, Aggelos K. Katsaggelos, Russell M. Mersereau, Amir Said, John G. Apostolopoulos
2005 Image and Video Communications and Processing 2005  
Simulation shows that about 20-40% of the MB's choose one of the affine modes.  ...  The affine motion models for multiple MOs are estimated and coded in the frame header.  ...  Second, the affine model for each segment is computed from the MV's within that segment by (7) .  ... 
doi:10.1117/12.587328 dblp:conf/eiivcp/LiJKM05 fatcat:hpiw7nubyzh6fpb4swevmnofxa

On the parallelization and performance analysis of Barnes–Hut algorithm using Java parallel platforms

Badri Munier, Muhammad Aleem, Majid Khan, Muhammad Arshad Islam, Muhammad Azhar Iqbal, Muhammad Kamran Khattak
2020 SN Applied Sciences  
Conventionally, the applications for high-performance computing (HPC) are written in native (programming) languages.  ...  Multi-core processors provide time-efficient and cost-effective solutions to execute the algorithms for complex physical systems.  ...  The JS (Affinity) has outperformed both the JMT and the JS for all executions based on a different number of the simulated particles.  ... 
doi:10.1007/s42452-020-2386-z fatcat:plwzqn73sba33o7tyj6va3eiiq

Clustered affinity scheduling on large-scale NUMA multiprocessors

Yi-Min Wang, Hsiao-Hsi Wang, Ruei-Chuan Chang
1997 Journal of Systems and Software  
The overheads include remote reads to the queues for the indices information, synchronous writes to the queues for migrating iterations, and the time in loading data into cache.  ...  We confirm our idea by running many applications under a realistic hierarchy memory simulator.  ...  So the affinity effect is lighter and the cache miss ratios for those algorithms are also lower. Figure 5 shows the cache miss ratios for various algorithms.  ... 
doi:10.1016/s0164-1212(96)00163-x fatcat:tncqtflchnhpzdyg7bpywlwply

Exploiting inter-thread temporal locality for chip multithreading

Jiayuan Meng, Jeremy W Sheaffer, Kevin Skadron
2010 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)  
Simulations on M5 achieve an average speedup of 1.69× and 36% energy savings over conventional scheduling techniques that are oblivious to whether threads share a cache.  ...  While this has been studied for concurrent threads with disjoint working sets, the problem has not been addressed for multi-threaded data-parallel workloads in which threads can be scheduled or constructed  ...  Marino for their helpful comments.  ... 
doi:10.1109/ipdps.2010.5470465 dblp:conf/ipps/MengSS10 fatcat:6b33ba2lmzcnzjo24mlogswgza

Variable-based multi-module data caches for clustered VLIW processors

E. Gibert, J. Abella, J. Sanchez, X. Vera, A. Gonzalez
2005 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05)  
We propose to divide the L1 data cache into two cache modules for a clustered VLIW processor consisting of two clusters.  ...  In addition, we also explore a reconfigurable distributed cache, where the cache can be reconfigured on a context switch.  ...  Note that the affinity computed previously for other instructions is not recomputed.  ... 
doi:10.1109/pact.2005.40 dblp:conf/IEEEpact/GibertASVG05 fatcat:y5qzvgjnozadte6jlwlnyagv6e

Optimization of the N-Body Simulation on Intel's Architectures Based on AVX-512 Instruction Set [chapter]

Enzo Rucci, Ezequiel Moreno, Adrián Pousa, Franco Chichizola
2020 Communications in Computer and Information Science  
The N-body simulations have become a powerful tool to test the gravitational interaction among particles, ranging from a few bodies to complete galaxies.  ...  This paper optimizes the all-pairs N-body simulation on both current Intel platforms supporting AVX-512 extensions: a Xeon Phi KNL node and a server equipped with a dual CKL processor.  ...  computing a single simulation step.  ... 
doi:10.1007/978-3-030-48325-8_3 fatcat:4kgeck3x6fhw3cvnx6cxndrd5y