Automated empirical tuning of scientific codes for performance and power consumption

Shah Faizur Rahman, Jichi Guo, Qing Yi
2011 Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers - HiPEAC '11  
In particular, we extensively parameterize the configuration of a large number of compiler optimizations, including loop parallelization, blocking, unroll-and-jam, array copying, scalar replacement, strength  ...  Automatic empirical tuning of compiler optimizations has been widely used to achieve portable high performance for scientific applications.  ...  The configuration of scalar replacement is similar to array copying. • Loop unrolling, where an innermost loop is unrolled by a number of iterations to create a larger loop body.  ... 
doi:10.1145/1944862.1944880 dblp:conf/hipeac/RahmanGY11 fatcat:hsdyzdsojjezto4tyfjtjdmljq

Tile Reduction: The First Step towards Tile Aware Parallelization in OpenMP [chapter]

Ge Gan, Xu Wang, Joseph Manzano, Guang R. Gao
2009 Lecture Notes in Computer Science  
a set of benchmarks.  ...  We propose tile reduction, an OpenMP tile aware parallelization technique that allows reduction to be performed on multi-dimensional arrays.  ...  We thank all the members of CAPSL group at University of Delaware. We thank Jason Lin and Lei Huang for their valuable comments and feedback.  ... 
doi:10.1007/978-3-642-02303-3_12 fatcat:swjuq4mivnd5vl6nfbipc6mkmq

Can We Trust Edge Computing Simulations? An Experimental Assessment

Gonçalo Carvalho, Filipe Magalhães, Bruno Cabral, Vasco Pereira, Jorge Bernardino
2022 Computers  
This paper compares the execution of the EdgeBench benchmark in a real-world environment and in a simulation environment using FogComputingSim, an EC simulator.  ...  There are some works about simulation environments in Edge Computing (EC), but there is a gap of studies that state the validity of these simulators.  ...  Data Availability Statement: EdgeBench datasets @ https://github.com/CityBench/Benchmark, accessed on 1 April 2022. Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/computers11060090 fatcat:e54uit5waveetmktxgwnoeaf3y

Improving data cache performance with integrated use of split caches, victim cache and stream buffers

Afrin Naz, Mehran Rezaei, Krishna Kavi, Philip Sweany
2004 Proceedings of the 2004 workshop on MEmory performance DEaling with Applications , systems and architecture - MEDEA '04  
Since significant amounts of compulsory and conflict misses are avoided, the size of each cache (i.e., array and scalar), as well as the combined cache capacity can be reduced.  ...  The work showed that using separate (data) caches for indexed or stream data and scalar data items could lead to substantial improvements in terms of cache misses.  ...  In an attempt to evaluate the optimal configuration of the integrated approach, a variety of cache sizes, block sizes, associativity and replacement methods were examined for each of array, scalar, victim  ... 
doi:10.1145/1152922.1101876 fatcat:e5hol53y2zgfrna3os2stzoihe

Improving data cache performance with integrated use of split caches, victim cache and stream buffers

Afrin Naz, Mehran Rezaei, Krishna Kavi, Philip Sweany
2005 SIGARCH Computer Architecture News  
Since significant amounts of compulsory and conflict misses are avoided, the size of each cache (i.e., array and scalar), as well as the combined cache capacity can be reduced.  ...  The work showed that using separate (data) caches for indexed or stream data and scalar data items could lead to substantial improvements in terms of cache misses.  ...  In an attempt to evaluate the optimal configuration of the integrated approach, a variety of cache sizes, block sizes, associativity and replacement methods were examined for each of array, scalar, victim  ... 
doi:10.1145/1101868.1101876 fatcat:rdcgf6yxcbavppdyjbljkdecli

Exposing Tunable Parameters in Multi-threaded Numerical Code [chapter]

Apan Qasem, Jichi Guo, Faizur Rahman, Qing Yi
2010 Lecture Notes in Computer Science  
A series of experiments on two scientific benchmarks illustrates the non-orthogonality of the transformation search space and reiterates the need for integrated transformation heuristics for achieving  ...  Achieving high performance on today's architectures requires careful orchestration of many optimization parameters.  ...  Our experimental results illustrate the non-orthogonality of the search spaces and reinforce the need for application tuning through integrated transformation heuristics.  ... 
doi:10.1007/978-3-642-15672-4_6 fatcat:b5pgdieb4va7nlw6vrpktpikvm

Recycled Error Bits: Energy-Efficient Architectural Support for Floating Point Accuracy

Ralph Nathan, Bryan Anthonio, Shih-Lien Lu, Helia Naeimi, Daniel J. Sorin, Xiaobai Sun
2014 SC14: International Conference for High Performance Computing, Networking, Storage and Analysis  
Experimental results on physical hardware show that software that exploits architecturally recycled error bits can (a) achieve accuracy comparable to a 64-bit FPU with performance and energy that are comparable  ...  to a 32-bit FPU, and (b) achieve accuracy comparable to an all-software scheme for 128-bit accuracy with far better performance and energy usage.  ...  ACKNOWLEDGMENTS This material is based on work supported by the National Science Foundation under grant CCF-111-5367. We thank our shepherd, Mike O'Connor, for his advice in improving this work.  ... 
doi:10.1109/sc.2014.15 dblp:conf/sc/NathanALNSS14 fatcat:axrthxpeefg5jehl7ytaa3m66i

Studying the impact of application-level optimizations on the power consumption of multi-core architectures

Shah Mohammad Faizur Rahman, Jichi Guo, Akshatha Bhat, Carlos Garcia, Majedul Haque Sujon, Qing Yi, Chunhua Liao, Daniel Quinlan
2012 Proceedings of the 9th conference on Computing Frontiers - CF '12  
Our extensive experimental study provides insights for answering two questions: 1) what degrees of impact can application level optimizations have on reducing the overall system power consumption of modern  ...  This paper studies the overall system power variations of two multi-core architectures, an 8-core Intel and a 32-core AMD workstation, while using these machines to execute a wide variety of sequential  ...  Section 3 presents the benchmarks used in the evaluation and our experimental methodology.  ... 
doi:10.1145/2212908.2212927 dblp:conf/cf/RahmanGBGSYLQ12 fatcat:qw6nsbdsi5ek7bitl3p5iwbcse

MATLAB Parallelization through Scalarization

Chun-Yu Shei, Adarsh Yoga, Madhav Ramesh, Arun Chauhan
2011 2011 15th Workshop on Interaction between Compilers and Computer Architectures  
Evaluation results on a set of benchmarks selected from diverse domains show speed improvements ranging from 1.5x to 16x on an eight-core Intel Core 2 Duo machine.  ...  In both cases, it is possible to generate fused loops and replace array temporaries by scalars, thus reducing the memory bandwidth pressure.  ...  We evaluated our algorithm on a diverse set of benchmarks on a multi-core machine as well as a GPU card.  ... 
doi:10.1109/interact.2011.18 dblp:conf/IEEEinteract/SheiYRC11 fatcat:hmnywoccd5cjxawmkb6cw6hq7u

Towards Resiliency Evaluation of Vector Programs

Vishal Chandra Sharma, Ganesh Gopalakrishnan, Sriram Krishnamoorthy
2016 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)  
Using VULFI, we conduct a resiliency study of nine real-world vector benchmarks using Intel's AVX and SSE extensions as the target vector instruction sets, and offer the first reported understanding of  ...  systems resilience research community has developed methods to manually insert additional source-program level assertions to trap errors, and also devised tools to conduct fault injection studies for scalar  ...  Experimental Setup: The experiments described in the paper were carried out on an Intel Core™ i7 4770 system running the 64-bit Ubuntu 12.04 operating system with 16GB of main memory.  ... 
doi:10.1109/ipdpsw.2016.187 dblp:conf/ipps/SharmaGK16 fatcat:fcu76rhn7vhdxpqdgyrfxfvusu

AdaptMemBench: Application-Specific Memory Subsystem Benchmarking [article]

Mahesh Lakshminarasimhan, Catherine Olschanowsky
2018 arXiv   pre-print
A benchmark framework that explores the performance in an application-specific manner is essential to characterize memory performance and at the same time inform memory-efficient coding practices.  ...  Optimizing scientific applications to take full advantage of modern memory subsystems is a continual challenge for application and compiler developers.  ...  Our results agree with previous experimental evaluation showing no performance gain [10].  ... 
arXiv:1812.07778v1 fatcat:53h5gupql5a2vnvcbsulphfyya

Extensive Parameterization And Tuning of Architecture-Sensitive Optimizations

Qing Yi, Jichi Guo
2011 Procedia Computer Science  
We have used our framework to apply 6 highly interactive optimizations, parallelization via OpenMP, cache blocking, array copying, unroll-and-jam, scalar replacement, and loop unrolling, and present results  ...  The complexity of modern architectures requires compilers to apply an increasingly large collection of architecture-sensitive optimizations, e.g., parallelization and cache optimizations, which interact  ...  The configuration of scalar replacement is similar to array copying. • Loop unrolling, where an innermost loop is unrolled by a number of iterations to create a larger loop body.  ... 
doi:10.1016/j.procs.2011.04.236 fatcat:bxovcxilibhl5n6ntdh4wsoqdy
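
The snippet above describes loop unrolling as replicating the body of an innermost loop a fixed number of times to create a larger loop body. The following is a minimal hand-written sketch of that transformation, not the authors' framework or its output. The function names and the unroll factor of 4 are assumptions chosen for illustration, and in Python the sketch shows only the shape of the transformation, not a performance gain.

```python
def sum_rolled(values):
    """Reference loop: one addition per iteration."""
    total = 0.0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_unrolled_by_4(values):
    """Same reduction with the loop body replicated 4 times.

    The unroll factor (4) is an illustrative assumption; an empirical
    tuner would treat it as a parameter to search over.
    """
    n = len(values)
    total = 0.0
    main_end = n - (n % 4)  # largest multiple of 4 not exceeding n

    # Unrolled main loop: four elements per iteration, added in the
    # same order as the rolled loop so the result is identical.
    for i in range(0, main_end, 4):
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]

    # Remainder loop: handles the last n % 4 elements.
    for i in range(main_end, n):
        total += values[i]
    return total
```

The remainder loop is what keeps the unrolled version correct when the trip count is not a multiple of the unroll factor.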

Task sampling: computer architecture simulation in the many-core era

Majedul Haque Sujon, R. Clint Whaley, Qing Yi
2013 Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques  
Modern architectures increasingly rely on SIMD vectorization to improve performance for floating point intensive scientific applications.  ...  We have integrated our technique in an iterative optimizing compiler and have employed empirical tuning to select the profitable paths for speculation.  ...  ACKNOWLEDGMENT This work was supported in part by the NSF CAREER grant# OCI-1149303, CCF-1261778, and CCF-1261811, and by the Department of Energy grant# DE-SC0001770.  ... 
doi:10.1109/pact.2013.6618831 dblp:conf/IEEEpact/SujonWY13 fatcat:zkqzet7ldbd25ct4rl3llealoq

An Evaluation of Vectorizing Compilers

Saeed Maleki, Yaoqing Gao, Maria J. Garzarán, Tommy Wong, David A. Padua
2011 2011 International Conference on Parallel Architectures and Compilation Techniques  
This paper evaluates how well compilers vectorize a synthetic benchmark consisting of 151 loops, two applications from Petascale Application Collaboration Teams (PACT), and eight applications from Media  ...  compilers we evaluated.  ...  ACKNOWLEDGMENT This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (award number OCI 07-25070) and the state of Illinois  ... 
doi:10.1109/pact.2011.68 dblp:conf/IEEEpact/MalekiGGWP11 fatcat:2kz7mvgeefg3romnfmruk6ej4m

Multifactorial Cellular Genetic Algorithm (MFCGA): Algorithmic Design, Performance Comparison and Genetic Transferability Analysis [article]

Eneko Osaba, Aritz D. Martinez, Jesus L. Lobo, Javier Del Ser and Francisco Herrera
2020 arXiv   pre-print
A further contribution of this analysis beyond performance benchmarking is a quantitative examination of the genetic transferability among the problem instances, eliciting an empirical demonstration of  ...  We conduct an extensive performance analysis of the proposed MFCGA and compare it to the canonical MFEA under the same algorithmic conditions and over 15 different multitasking setups (encompassing different  ...  On the contrary, they replace the current individual upon the fulfillment of a given criterion (for example, an improvement in the fitness function).  ... 
arXiv:2003.10768v1 fatcat:prwwqklz3ncbthiq6jf7gtvms4