33 Hits in 0.6 sec

Autotuning OpenCL Workgroup Size for Stencil Patterns [article]

Chris Cummins, Pavlos Petoumenos, Michel Steuwer, Hugh Leather
2016 arXiv   pre-print
Selecting an appropriate workgroup size is critical for the performance of OpenCL kernels, and requires knowledge of the underlying hardware, the data being operated on, and the implementation of the kernel. This makes portable performance of OpenCL programs a challenging goal, since simple heuristics and statically chosen values fail to exploit the available performance. To address this, we propose the use of machine learning-enabled autotuning to automatically predict workgroup sizes for
more » ... il patterns on CPUs and multi-GPUs. We present three methodologies for predicting workgroup sizes. The first, using classifiers to select the optimal workgroup size. The second and third proposed methodologies employ the novel use of regressors for performing classification by predicting the runtime of kernels and the relative performance of different workgroup sizes, respectively. We evaluate the effectiveness of each technique in an empirical study of 429 combinations of architecture, kernel, and dataset, comparing an average of 629 different workgroup sizes for each. We find that autotuning provides a median 3.79x speedup over the best possible fixed workgroup size, achieving 94% of the maximum performance.
arXiv:1511.02490v3 fatcat:y4ezbmneh5etpe2fi6qa4gdgpq

Fast Automatic Heuristic Construction Using Active Learning [chapter]

William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather
2015 Lecture Notes in Computer Science  
Building effective optimization heuristics is a challenging task which often takes developers several months if not years to complete. Predictive modelling has recently emerged as a promising solution, automatically constructing heuristics from training data. However, obtaining this data can take months per platform. This is becoming an ever more critical problem and if no solution is found we shall be left with out of date heuristics which cannot extract the best performance from modern
more » ... s. In this work, we present a low-cost predictive modelling approach for automatic heuristic construction which significantly reduces this training overhead. Typically in supervised learning the training instances are randomly selected to evaluate regardless of how much useful information they carry. This wastes effort on parts of the space that contribute little to the quality of the produced heuristic. Our approach, on the other hand, uses active learning to select and only focus on the most useful training examples. We demonstrate this technique by automatically constructing a model to determine on which device to execute four parallel programs at differing problem dimensions for a representative Cpu-Gpu based heterogeneous system. Our methodology is remarkably simple and yet effective, making it a strong candidate for wide adoption. At high levels of classification accuracy the average learning speed-up is 3x, as compared to the stateof-the-art.
doi:10.1007/978-3-319-17473-0_10 fatcat:gwcklt44kvcgfhgpirsh76lvne

Cache replacement based on reuse-distance prediction

Georgios Keramidas, Pavlos Petoumenos, Stefanos Kaxiras
2007 2007 25th International Conference on Computer Design  
Georgios Keramidas is supported by a PENED 2003 grant; Pavlos Petoumenos is supported by a "K. Karatheodoris" grant.  ... 
doi:10.1109/iccd.2007.4601909 dblp:conf/iccd/KeramidasPK07 fatcat:3tqipqswarfrhcavcylri3hayu

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU [article]

Genghan Zhang, Yuetong Zhao, Yanting Tao, Zhongming Yu, Guohao Dai, Sitao Huang, Yuan Wen, Pavlos Petoumenos, Yu Wang
2022 arXiv   pre-print
Sparse compiler is a promising solution for sparse tensor algebra optimization. In compiler implementation, reduction in sparse-dense hybrid algebra plays a key role in performance. Though GPU provides various reduction semantics that can better utilize the parallel computing and memory bandwidth capacity, the central question is: how to elevate the flexible reduction semantics to sparse compilation theory that assumes serial execution. Specifically, we have to tackle two main challenges: (1)
more » ... ere are wasted parallelism by adopting static synchronization granularity (2) static reduction strategy limits optimization space exploration. We propose Sgap: segment group and atomic parallelism to solve these problems. Atomic parallelism captures the flexible reduction semantics to systematically analyze the optimization space of sparse-dense hybrid algebra on GPU. It is a new optimization technique beyond current compiler-based and open-source runtime libraries. Segment group elevates the flexible reduction semantics to suitable levels of abstraction in the sparse compilation theory. It adopts changeable group size and user-defined reduction strategy to solve challenge (1) and (2), respectively. Finally, we use GPU sparse matrix-matrix multiplication (SpMM) on the TACO compiler as a use case to demonstrate the effectiveness of segment group in reduction semantics elevation. We achieve up to 1.2x speedup over the original TACO's SpMM kernels. We also apply new optimization techniques found by atomic parallelism to an open-source state-of-the-art SpMM library dgSPARSE. We achieve 1.6x - 2.3x speedup on the algorithm tuned with atomic parallelism.
arXiv:2209.02882v1 fatcat:b5ozokczovg5pfnqoufwlzetvq

BenchPress: A Deep Active Benchmark Generator [article]

Foivos Tsimpourlas, Pavlos Petoumenos, Min Xu, Chris Cummins, Kim Hazelwood, Ajitha Rajan, Hugh Leather
2022 arXiv   pre-print
We develop BenchPress, the first ML benchmark generator for compilers that is steerable within feature space representations of source code. BenchPress synthesizes compiling functions by adding new code in any part of an empty or existing sequence by jointly observing its left and right context, achieving excellent compilation rate. BenchPress steers benchmark generation towards desired target features that has been impossible for state of the art synthesizers (or indeed humans) to reach. It
more » ... forms better in targeting the features of Rodinia benchmarks in 3 different feature spaces compared with (a) CLgen - a state of the art ML synthesizer, (b) CLSmith fuzzer, (c) SRCIROR mutator or even (d) human-written code from GitHub. BenchPress is the first generator to search the feature space with active learning in order to generate benchmarks that will improve a downstream task. We show how using BenchPress, Grewe's et al. CPU vs GPU heuristic model can obtain a higher speedup when trained on BenchPress's benchmarks compared to other techniques. BenchPress is a powerful code generator: Its generated samples compile at a rate of 86%, compared to CLgen's 2.33%. Starting from an empty fixed input, BenchPress produces 10x more unique, compiling OpenCL benchmarks than CLgen, which are significantly larger and more feature diverse.
arXiv:2208.06555v2 fatcat:6cxqwyvlzvcezakmlj2zq4bp6u

Where replacement algorithms fail

Georgios Keramidas, Pavlos Petoumenos, Stefanos Kaxiras
2010 Proceedings of the 7th ACM international conference on Computing frontiers - CF '10  
Petoumenos et al.  ... 
doi:10.1145/1787275.1787316 dblp:conf/cf/KeramidasPK10 fatcat:uofbg5td4jdm5annb72ifvbjki

Iterative compilation on mobile devices [article]

Paschalis Mpeis, Pavlos Petoumenos, Hugh Leather
2016 arXiv   pre-print
The abundance of poorly optimized mobile applications coupled with their increasing centrality in our digital lives make a framework for mobile app optimization an imperative. While tuning strategies for desktop and server applications have a long history, it is difficult to adapt them for use on mobile phones. Reference inputs which trigger behavior similar to a mobile application's typical are hard to construct. For many classes of applications the very concept of typical behavior is
more » ... nt, each user interacting with the application in very different ways. In contexts like this, optimization strategies need to evaluate their effectiveness against real user input, but doing so online runs the risk of user dissatisfaction when suboptimal optimizations are evaluated. In this paper we present an iterative compiler which employs a novel capture and replay technique in order to collect real user input and use it later to evaluate different transformations offline. The proposed mechanism identifies and stores only the set of memory pages needed to replay the most heavily used functions of the application. At idle periods, this minimal state is combined with different binaries of the application, each one build with different optimizations enabled. Replaying the targeted functions allows us to evaluate the effectiveness of each set of optimizations for the actual way the user interacts with the application. For the BEEBS benchmark suite, our approach was able to improve performance by up to 57%, while keeping the slowdown experienced by the user on average at 0.8%. By focusing only on heavily used functions, we are able to conserve storage space by between two and three orders of magnitude compared to typical capture and replay implementations.
arXiv:1511.02603v4 fatcat:b6xbzindujddti2amqbz5edsou


Lev Mukhanov, Pavlos Petoumenos, Zheng Wang, Nikos Parasyris, Dimitrios S. Nikolopoulos, Bronis R. De Supinski, Hugh Leather
2017 ACM Transactions on Architecture and Code Optimization (TACO)  
doi:10.1145/3050436 fatcat:hqzuvpy46fbz3kkvlgtlpxyvem

Modeling Cache Sharing on Chip Multiprocessor Architectures

Pavlos Petoumenos, Georgios Keramidas, Hakan Zeffer, Stefanos Kaxiras, Erik Hagersten
2006 2006 IEEE International Symposium on Workload Characterization  
As CMPs are emerging as the dominant architecture for a wide range of platforms (from embedded systems and game consoles, to PCs, and to servers) the need to manage on-chip resources, such as shared caches, becomes a necessity. In this paper we propose a new statistical model of a CMP shared cache which not only describes cache sharing but also its management via a novel fine-grain mechanism. Our model, called StatShare, accurately describes the behavior of the sharing threads using run-time
more » ... ormation (reuse-distance information for memory accesses) and helps us understand how effectively each thread uses its space. The mechanism to manage the cache at the cache-line granularity is inspired by Cache Decay, but contains important differences. Decayed cache-lines are not turned-off to save leakage but are rather "available for replacement." Decay modifies the underlying replacement policy (random, LRU) to control sharing but in a very flexible and non-strict way which makes it superior to strict cache partitioning schemes (both fine and coarse grained). The statistical model allows us to assess a thread's cache behavior under decay. Detailed CMP simulations show that: i) StatShare accurately predicts the thread behavior in a shared cache, ii) managing sharing via decay (in combination with the StatShare run time information) can be used to enforce external QoS requirements or various high-level fairness policies.
doi:10.1109/iiswc.2006.302740 dblp:conf/iiswc/PetoumenosKZKH06 fatcat:23oy7ih23nhgrgmv45c37gosyy

Synthesizing benchmarks for predictive modeling

Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather
2017 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)  
Predictive modeling using machine learning is an effective method for building compiler heuristics, but there is a shortage of benchmarks. Typical machine learning experiments outside of the compilation field train over thousands or millions of examples. In machine learning for compilers, however, there are typically only a few dozen common benchmarks available. This limits the quality of learned models, as they have very sparse training data for what are often high-dimensional feature spaces.
more » ... hat is needed is a way to generate an unbounded number of training programs that finely cover the feature space. At the same time the generated programs must be similar to the types of programs that human developers actually write, otherwise the learning will target the wrong parts of the feature space. We mine open source repositories for program fragments and apply deep learning techniques to automatically construct models for how humans write programs. We sample these models to generate an unbounded number of runnable training programs. The quality of the programs is such that even human developers struggle to distinguish our generated programs from hand-written code. We use our generator for OpenCL programs, CLgen, to automatically synthesize thousands of programs and show that learning over these improves the performance of a state of the art predictive model by 1.27×. In addition, the fine covering of the feature space automatically exposes weaknesses in the feature design which are invisible with the sparse training examples from existing benchmark suites. Correcting these weaknesses further increases performance by 4.30×.
doi:10.1109/cgo.2017.7863731 fatcat:4attl72gpfhz5bagpz3ny7rp6i

Compiler fuzzing through deep learning

Chris Cummins, Pavlos Petoumenos, Alastair Murray, Hugh Leather
2018 Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis - ISSTA 2018  
Random program generation -fuzzing -is an effective technique for discovering bugs in compilers but successful fuzzers require extensive development effort for every language supported by the compiler, and often leave parts of the language space untested. We introduce DeepSmith, a novel machine learning approach to accelerating compiler validation through the inference of generative models for compiler inputs. Our approach infers a learned model of the structure of real world code based on a
more » ... ge corpus of open source code. Then, it uses the model to automatically generate tens of thousands of realistic programs. Finally, we apply established differential testing methodologies on them to expose bugs in compilers. We apply our approach to the OpenCL programming language, automatically exposing bugs with little effort on our side. In 1,000 hours of automated testing of commercial and open source compilers, we discover bugs in all of them, submitting 67 bug reports. Our test cases are on average two orders of magnitude smaller than the state-of-the-art, require 3.03× less time to generate and evaluate, and expose bugs which the state-of-the-art cannot. Our random program generator, comprising only 500 lines of code, took 12 hours to train for OpenCL versus the state-of-the-art taking 9 man months to port from a generator for C and 50,000 lines of code. With 18 lines of code we extended our program generator to a second language, uncovering crashes in Solidity compilers in 12 hours of automated testing. CCS CONCEPTS • Software and its engineering → Software testing and debugging;
doi:10.1145/3213846.3213848 dblp:conf/issta/CumminsPML18 fatcat:frdnak7jmrbfhjrdavjvhr76vm

MLP-Aware Instruction Queue Resizing: The Key to Power-Efficient Performance [chapter]

Pavlos Petoumenos, Georgia Psychou, Stefanos Kaxiras, Juan Manuel Cebrian Gonzalez, Juan Luis Aragon
2010 Lecture Notes in Computer Science  
Several techniques aiming to improve power-efficiency (measured as EDP) in out-of-order cores trade energy with performance. Prime examples are the techniques to resize the instruction queue (IQ). While most of them produce good results, they fail to take into account that changing the timing of memory accesses can have significant consequences on the memory-level parallelism (MLP) of the application and thus incur disproportional performance degradation. We propose a novel mechanism that deals
more » ... with this realization by collecting fine-grain information about the maximum IQ resizing that does not affect the MLP of the program. This information is used to override the resizing enforced by feedback mechanisms when this resizing might reduce MLP. We compare our technique to a previously proposed non-MLP-aware management technique and our results show a significant increase in EDP savings for most benchmarks of the SPEC2000 suite. 1. Sequentially from the end of the IQ in [1] or independently of position in [3] .
doi:10.1007/978-3-642-11950-7_11 fatcat:sovpsn4ezffd7nzkeaac4ekj5u

Power Capping: What Works, What Does Not

Pavlos Petoumenos, Lev Mukhanov, Zheng Wang, Hugh Leather, Dimitrios S. Nikolopoulos
2015 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)  
Peak power consumption is the first order design constraint of data centers. Though peak power consumption is rarely, if ever, observed, the entire data center facility must prepare for it, leading to inefficient usage of its resources. The most prominent way for addressing this issue is to limit the power consumption of the data center IT facility far below its theoretical peak value. Many approaches have been proposed to achieve that, based on the same small set of enforcement mechanisms, but
more » ... there has been no corresponding work on systematically examining the advantages and disadvantages of each such mechanism. In the absence of such a study, it is unclear what is the optimal mechanism for a given computing environment, which can lead to unnecessarily poor performance if an inappropriate scheme is used. This paper fills this gap by comparing for the first time five widely used power capping mechanisms under the same hardware/software setting. We also explore possible alternative power capping mechanisms beyond what has been previously proposed and evaluate them under the same setup. We systematically analyze the strengths and weaknesses of each mechanism, in terms of energy efficiency, overhead, and predictable behavior. We show how these mechanisms can be combined in order to implement an optimal power capping mechanism which reduces the slowdown compared to the most widely used mechanism by up to 88%. Our results provide interesting insights regarding the different trade-offs of power capping techniques, which will be useful for designing and implementing highly efficient power capping in the future.
doi:10.1109/icpads.2015.72 dblp:conf/icpads/PetoumenosMWLN15 fatcat:hb5qce4rtjdvbpp6titl63sfsi

End-to-End Deep Learning of Optimization Heuristics

Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather
2017 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)  
Accurate automatic optimization heuristics are necessary for dealing with the complexity and diversity of modern hardware and software. Machine learning is a proven technique for learning such heuristics, but its success is bound by the quality of the features used. These features must be hand crafted by developers through a combination of expert domain knowledge and trial and error. This makes the quality of the final model directly dependent on the skill and available time of the system
more » ... ect. Our work introduces a better way for building heuristics. We develop a deep neural network that learns heuristics over raw code, entirely without using code features. The neural network simultaneously constructs appropriate representations of the code and learns how best to optimize, removing the need for manual feature creation. Further, we show that our neural nets can transfer learning from one optimization problem to another, improving the accuracy of new models, without the help of human experts. We compare the effectiveness of our automatically generated heuristics against ones with features hand-picked by experts. We examine two challenging tasks: predicting optimal mapping for heterogeneous parallelism and GPU thread coarsening factors. In 89% of the cases, the quality of our fully automatic heuristics matches or surpasses that of state-of-theart predictive models using hand-crafted features, providing on average 14% and 12% more performance with no human effort expended on designing features.
doi:10.1109/pact.2017.24 dblp:conf/IEEEpact/CumminsP0L17 fatcat:p6sckrwd6rcxbpuqh64gcow3li

Function Merging by Sequence Alignment

Rodrigo C. O. Rocha, Pavlos Petoumenos, Zheng Wang, Murray Cole, Hugh Leather
2019 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)  
Resource-constrained devices for embedded systems are becoming increasingly important. In such systems, memory is highly restrictive, making code size in most cases even more important than performance. Compared to more traditional platforms, memory is a larger part of the cost and code occupies much of it. Despite that, compilers make little effort to reduce code size. One key technique attempts to merge the bodies of similar functions. However, production compilers only apply this
more » ... to identical functions, while research compilers improve on that by merging the few functions with identical control-flow graphs and signatures. Overall, existing solutions are insufficient and we end up having to either increase cost by adding more memory or remove functionality from programs. We introduce a novel technique that can merge arbitrary functions through sequence alignment, a bioinformatics algorithm for identifying regions of similarity between sequences. We combine this technique with an intelligent exploration mechanism to direct the search towards the most promising function pairs. Our approach is more than 2.4x better than the state-of-the-art, reducing code size by up to 25%, with an overall average of 6%, while introducing an average compilation-time overhead of only 15%. When aided by profiling information, this optimization can be deployed without any significant impact on the performance of the generated code.
doi:10.1109/cgo.2019.8661174 dblp:conf/cgo/RochaP0CL19 fatcat:rpprdk7r4zesngerwcjisselty
« Previous Showing results 1 — 15 out of 33 results