Automatic loop kernel analysis and performance modeling with Kerncraft

Julian Hammer, Georg Hager, Jan Eitzinger, Gerhard Wellein
2015 Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems - PMBS '15  
processors using the Roofline or the Execution-Cache-Memory (ECM) model.  ...  Starting from a validated performance model one can infer the relevant hardware bottlenecks and promising optimization opportunities.  ...  Acknowledgments Discussions with Johannes Hofmann are gratefully acknowledged.  ... 
doi:10.1145/2832087.2832092 dblp:conf/sc/HammerHEW15 fatcat:vsyvvnthjne27di4tzt4qkcb6m

Modeling Large Compute Nodes with Heterogeneous Memories with Cache-Aware Roofline Model [chapter]

Nicolas Denoyelle, Brice Goglin, Aleksandar Ilic, Emmanuel Jeannot, Leonel Sousa
2017 Lecture Notes in Computer Science  
The Cache-Aware Roofline Model (CARM) is an insightful, yet simple model designed to address this issue.  ...  Finally, we show the model's ability to exhibit several bottlenecks of such systems, which were not supported by CARM.  ...  Some experiments presented in this paper were carried out using the PLAFRIM experimental testbed, being developed under the Inria PlaFRIM development action with support from Bordeaux INP, LaBRI and IMB  ... 
doi:10.1007/978-3-319-72971-8_5 fatcat:k4hhzhxhvbdrxctjqsc2oxhwqq

Roofline Model for UAVs: A Bottleneck Analysis Tool for Onboard Compute Characterization of Autonomous Unmanned Aerial Vehicles [article]

Srivatsan Krishnan, Zishen Wan, Kshitij Bhardwaj, Ninad Jadhav, Aleksandra Faust, Vijay Janapa Reddi
2022 arXiv   pre-print
We introduce an early-phase bottleneck analysis and characterization model called the F-1 for designing computing systems that target autonomous Unmanned Aerial Vehicles (UAVs).  ...  The model is experimentally validated using real UAVs, and the error is between 5.1% to 9.5% compared to real-world flight tests.  ...  ACKNOWLEDGEMENTS The authors thank Magnus Själander and the other anonymous reviewers for their valuable feedback. The work was sponsored in part by IARPA award 2022-21100600004.  ... 
arXiv:2204.10898v1 fatcat:wbgo6hrbdjf5tnznzl4ocm4kmm

Analytically Modeling Application Execution for Software-Hardware Co-design

Jichi Guo, Jiayuan Meng, Qing Yi, Vitali Morozov, Kalyan Kumaran
2014 2014 IEEE 28th International Parallel and Distributed Processing Symposium  
In fact, our technique's analysis time does not increase with the input data size.  ...  This requires a high level understanding about the full applications' potential behavior on a future system, e.g. the most time-consuming regions, the performance bottlenecks for these regions, etc.  ...  ACKNOWLEDGMENT The authors thank the ALCF application and operations support staff for their help.  ... 
doi:10.1109/ipdps.2014.56 dblp:conf/ipps/GuoMYMK14 fatcat:w32x6ppuyrhulbtzgetpibikve

Execution-Cache-Memory Performance Model: Introduction and Validation [article]

Johannes Hofmann, Jan Eitzinger, Dietmar Fey
2017 arXiv   pre-print
This report serves two purposes: To introduce and validate the Execution-Cache-Memory (ECM) performance model and to provide a thorough analysis of current Intel processor architectures with a special  ...  The architectural analysis and model predictions are showcased and validated using a set of elementary microbenchmarks.  ...  Apart from upgrading the memory from DDR3 used in the previous Sandy and Ivy Bridge microarchitectures to DDR4 to increase the peak bandwidth, the efficiency of the memory interface has been improved  ... 
arXiv:1509.03118v3 fatcat:ve3pzfdyebebpo6wb42j72jx64

A Parametric Microarchitecture Model for Accurate Basic Block Throughput Prediction on Recent Intel CPUs [article]

Andreas Abel, Jan Reineke
2022 arXiv   pre-print
Surprisingly, this model is already competitive with the state of the art, indicating that there is significant potential for improvement.  ...  Performance models that statically predict the steady-state throughput of basic blocks on particular microarchitectures, such as IACA, Ithemal, llvm-mca, OSACA, or DiffTune, can guide optimizing compilers  ...  A basic block throughput predictor may be one component of tools and methodologies to determine whether code is actually compute bound, such as the Roofline model [48] or the Execution-Cache-Memory model  ... 
arXiv:2107.14210v2 fatcat:agnmtmwlr5fithzhyaofi56fce

ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX [article]

Christie Alappat and Nils Meyer and Jan Laukemann and Thomas Gruber and Georg Hager and Gerhard Wellein and Tilo Wettig
2021 arXiv   pre-print
We present an architectural analysis of the A64FX used in the Fujitsu FX1000 supercomputer at a level of detail that allows for the construction of Execution-Cache-Memory (ECM) performance models for steady-state  ...  A comparison with state-of-the-art high-end Intel Cascade Lake AP and Nvidia V100 systems puts the capabilities of the A64FX into perspective.  ...  ACKNOWLEDGMENTS We thank Daniel Richtmann for providing us with the V100 benchmarks of the DW kernel and Julian Hammer for useful discussions regarding cache modeling.  ... 
arXiv:2103.03013v2 fatcat:654bqrqianci7n3vtjmzd2pz7q

POSE: A mathematical and visual modelling tool to guide energy aware code optimisation

Stephen Roberts, Steven Wright, David Lecomber, Christopher January, Jonathan Byrd, Xavier Oro, Stephen Jarvis
2015 2015 Sixth International Green and Sustainable Computing Conference (IGSC)  
Tools such as Wattch [8] and McPAT [9] extend performance simulators with models of power draw.  ...  A further example of this approach is the Roofline model [15], which frames application performance in terms of its operational intensity and two system bottlenecks: off-chip memory bandwidth and floating  ... 
doi:10.1109/igcc.2015.7393705 dblp:conf/green/RobertsWLJBOJ15 fatcat:apqjhshs6jdy3obrzo5pz4mt3y

Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick
2009 Journal of Parallel and Distributed Computing  
Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.  ...  The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels.  ...  In addition, our detailed analysis reveals the performance bottlenecks for LBMHD in each system.  ... 
doi:10.1016/j.jpdc.2009.04.002 fatcat:q26fu5e3tfezlbdpbond4bdglq

An analysis of core- and chip-level architectural features in four generations of Intel server processors [article]

Johannes Hofmann and Georg Hager and Gerhard Wellein and Dietmar Fey
2017 arXiv   pre-print
Using microbenchmarks we study the influence of these factors on code performance. This insight can then serve as input for analytic performance models.  ...  This paper presents a survey of architectural features among four generations of Intel server processors (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell) with a focus on performance with floating point  ...  In contrast to the Roofline model it drops the assumption of a single bottleneck for the steady-state execution of a loop.  ... 
arXiv:1702.07554v1 fatcat:4sacm6w6vnfqlkdhxnrxkk23xq

Analytic Performance Modeling and Analysis of Detailed Neuron Simulations [article]

Francesco Cremonesi, Georg Hager, Gerhard Wellein, Felix Schürmann
2019 arXiv   pre-print
The gained insight is used to identify the main governing mechanisms underlying performance bottlenecks in the simulation.  ...  Big science initiatives are trying to reconstruct and model the brain by attempting to simulate brain tissue at larger scales and with increasingly more biological detail than previously thought possible  ...  Acknowledgments This work has been funded by the EPFL Blue Brain Project (funded by the Swiss ETH board).  ... 
arXiv:1901.05344v1 fatcat:b4tfyvyqqvh3pn3qj7xacvjclm

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks [article]

Geraldo F. Oliveira and Juan Gómez-Luna and Lois Orosa and Saugata Ghose and Nandita Vijaykumar and Ivan Fernandez and Mohammad Sadrosadati and Onur Mutlu
2022 arXiv   pre-print
We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks.  ...  With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that  ...  Acknowledgments We thank the SAFARI Research Group members for valuable feedback and the stimulating intellectual environment they provide.  ... 
arXiv:2105.03725v5 fatcat:ulzati7agrdxxn27pjg6yf4gf4

Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors

2020 Supercomputing Frontiers and Innovations  
Both single- and multicore analysis shows that the model exhibits average and maximum relative errors of 5 % and 10 %. Deviations from the model and insights gained are discussed in detail.  ...  Moreover, new first principles underlying the model's estimates are derived from common microarchitectural features implemented by today's server processors to make the model more architecture independent  ...  provided the original work is properly cited.  ... 
doi:10.14529/jsfi200204 fatcat:ed2qa525pnghffqnuzz2xjvayi

Portable multi- and many-core performance for finite-difference or finite-element codes – application to the free-surface component of NEMO (NEMOLite2D 1.0)

Andrew R. Porter, Jeremy Appleyard, Mike Ashworth, Rupert W. Ford, Jason Holt, Hedong Liu, Graham D. Riley
2018 Geoscientific Model Development  
In quantifying whether or not the obtained performance is "good" we also consider the limitations of the basic roofline model and improve on it by generating kernel-specific CPU ceilings.  ...  We have taken the free-surface part of the NEMO ocean model and created a new shallow-water model named NEMOLite2D.  ...  and Engineering South Consortium operated in partnership with the STFC Rutherford Appleton Laboratory (last access: June 2017).  ... 
doi:10.5194/gmd-11-3447-2018 fatcat:x25n5seearbadbkf5mshp2rmfu

Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI [article]

Matheus Cavalcante, Fabian Schuiki, Florian Zaruba, Michael Schaffner, Luca Benini
2019 arXiv   pre-print
An analysis on several vectorizable linear algebra computation kernels for a range of different matrix and vector sizes gives insight into performance limitations and bottlenecks for vector processors  ...  Ara's microarchitecture is scalable, as it is composed of a set of identical lanes, each containing part of the processor's vector register file and functional units.  ...  ACKNOWLEDGMENTS We would like to thank Frank Gürkaynak and Francesco Conti for the helpful discussions and insights.  ... 
arXiv:1906.00478v3 fatcat:h7zn4tkpqjf6xd35iuacpkre2a