Memory organization and data layout for instruction set extensions with architecturally visible storage

Panagiotis Athanasopoulos, Philip Brisk, Yusuf Leblebici, Paolo Ienne
2009 Proceedings of the 2009 International Conference on Computer-Aided Design - ICCAD '09  
Present application-specific embedded systems tend to choose instruction set extensions (ISEs) based on limitations imposed by the available data bandwidth to custom functional units (CFUs).  ...  In this paper we propose a novel methodology for laying out data in memories, generating high-bandwidth memory systems by making use of existing low-bandwidth low-cost ones and designing custom functional  ...  EXTENSIONS AND FUTURE WORK One potential area for future work is to generate memories and data layouts for a set of ISEs with conflicting access patterns that operate on the same data.  ... 
doi:10.1145/1687399.1687527 dblp:conf/iccad/AthanasopoulosBLI09 fatcat:zrq22had2nedhciqpxjbmoh6sy

Column Scan Optimization by Increasing Intra-Instruction Parallelism

Nusrat Jahan Lisa, Annett Ungethüm, Dirk Habich, Nguyen Duy Anh Tuan, Akash Kumar, Wolfgang Lehner
2018 Proceedings of the 7th International Conference on Data Science, Technology and Applications  
To satisfy these requirements for analytical query workloads, in-memory column store database systems are state-of-the-art.  ...  For this reason, we investigated the optimization of a well-known scan technique using SIMD (Single Instruction Multiple Data) vectorization as well as using Field Programmable Gate Arrays (FPGA).  ...  On the one hand, Single Instruction Multiple Data (SIMD) instruction set extensions such as Intel's SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) have been available in modern processors  ... 
doi:10.5220/0006897003440353 dblp:conf/data/LisaUHN0L18 fatcat:3cb7eu4phjd25p4y5crrwu4dfe

Application Design Considerations [chapter]

Alexander Supalov, Andrey Semin, Michael Klemm, Christopher Dahnken
2014 Optimizing HPC Applications with Intel® Cluster Tools  
In Chapters 5 to 7 we reviewed the methods, tools, and techniques for application tuning, explained by using examples of HPC applications and benchmarks.  ...  The blueprint analysis of platform capabilities and system-level tuning considerations were provided in Chapter 4, based on several system architecture metrics discussed in Chapter 2.  ...  A data organization in memory that is beneficial for one computer architecture may end up not being the best for another.  ... 
doi:10.1007/978-1-4302-6497-2_8 fatcat:z2zifl6lo5hihkqdmzjnh4633u

MaxSim: A simulation platform for managed applications

Andrey Rodchenko, Christos Kotselidis, Andy Nisbet, Antoniu Pop, Mikel Lujan
2017 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)  
MaxSim is able to simulate fast and accurately managed workloads running on top of Maxine VM and its capabilities are showcased with novel simulation techniques for: 1) low-intrusive microarchitectural  ...  Furthermore, we demonstrate a hardware/software co-designed optimization that performs dynamic load elimination for array length retrieval achieving up to 14% L1 data cache loads reduction and up to 4%  ...  The profiling is performed during memory access operations, and collected events are associated with triplets of an instruction pointer, a pointer tag, and a memory address offset.  ... 
doi:10.1109/ispass.2017.7975286 dblp:conf/ispass/RodchenkoKNPL17 fatcat:umwzxfynwzd5ne2wbm47dmtpeq

Evolution of the PowerPC architecture

K. Diefendorff, R. Oehler, R. Hochsprung
1994 IEEE Micro  
For compatibility with existing software, the developers retained POWER's basic instruction set, opcode assignments, and programming model. Some time ago, Apple, IBM, and Motorola decided to develop a common  ...  the notion of superscalar operation in the instruction set architecture, improving the architecture as a target for compilers, reducing instruction path lengths, and including floating-point as a first-class  ...  Acknowledgments We give special recognition of Cathy May, Ed Silha, and Hank Warren for their long hours of work on the PowerPC architecture.  ... 
doi:10.1109/40.272836 fatcat:nojrko6qsbdtvcna5gx6jrdfmy

Instruction fetch architectures and code layout optimizations

A. Ramirez, J.L. Larriba-Pey, M. Valero
2001 Proceedings of the IEEE  
We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described.  ...  This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors, to the more aggressive wide issue superscalars.  ...  ACKNOWLEDGMENT The authors also want to thank the reviewers for their insightful comments.  ... 
doi:10.1109/5.964440 fatcat:yp3a5e42wbfjtfkqsyfr5dkrcq

Introduction [chapter]

2013 Computer Organization, Design, and Architecture, Fifth Edition  
These machines had separate storage for data and instructions.  ...  Current Harvard architectures do not use separate storage for data and instructions but have separate paths and buffers to access data and instructions simultaneously.  ...  ASC organization, instruction set, assembly-language programming, and details of an assembler are provided along with an introduction to program linking and loading.  ... 
doi:10.1201/b16435-2 fatcat:s4xa2hmduncynfiarbb4trujvy

On the Design of a Register Queue Based Processor Architecture (FaRM-rq) [chapter]

Ben A. Abderazek, Soichi Shigeta, Tsutomu Yoshinaga, Masahiro Sowa
2003 Lecture Notes in Computer Science  
(FRM) - when switched for register based instructions support, and (2) Q-mode (FQM) - when switched for Queue based instructions support.  ...  The above processor, which is named Functional Assignment Register Microprocessor (FaRM-rq) supports queue and register based instruction set architecture and functions into different modes: (1) R-mode  ...  ; Data/Address Register Instructions The instruction set are designed with four data registers (d0∼d3) and four address (a0∼a4) registers.  ... 
doi:10.1007/3-540-37619-4_26 fatcat:2d2fibl3kbaz5enbx2g47dn3ii

Emerging Database Systems in Support of Scientific Data [chapter]

Per Svensson, Peter Boncz, Milena Ivanova, Martin Kersten, Niels Nes, Doron Rotem
2009 Scientific Data Management  
The topics discussed in this chapter include the evolution of storage structures from the 1970s till now, data compression techniques, and query processing techniques for single- and multi-variable queries  ...  This is followed by an example of using MonetDB for the SkyServer data, and the query processing improvements it offers.  ...  Rather, the MonetDB architecture was based on other considerations given in the original Decomposition Storage Model (DSM) [CK85] paper, namely it focused on data storage layout and query algebra, with  ... 
doi:10.1201/9781420069815-c7 fatcat:ft3mckhzr5agfhopo6awmhwk7e

A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units

Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, Alan R. Bishop
2014 SIAM Journal on Scientific Computing  
Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures.  ...  SELL-C-sigma comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed.  ...  We are indebted to Intel Germany and Nvidia for providing test systems for benchmarking.  ... 
doi:10.1137/130930352 fatcat:4diqhkbvsfaylcaxypkphjjwdy

XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Network on RISC-V based IoT End Nodes [article]

Angelo Garofalo, Giuseppe Tagliavini, Francesco Conti, Luca Benini, Davide Rossi
2020 arXiv   pre-print
By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we are able to show near-linear speedup with respect to higher precision integer computation on the key kernels for QNN computation  ...  QNN convolution kernels on a parallel cluster implementing the proposed extension run 6× and 8× faster when considering 4- and 2-bit data operands, respectively, compared to a baseline processing cluster  ...  Instruction Set Architecture (ISA).  ... 
arXiv:2011.14325v1 fatcat:tuawnqq5gngqneli5u2vzvmvem

A Technology-Scalable Architecture for Fast Clocks and High ILP [chapter]

Karthikeyan Sankaralingam, Ramadass Nagarajan, Doug Burger, Stephen W. Keckler
2001 Interaction between Compilers and Computer Architectures  
For the mapped window of execution, instructions execute in a dataflow-like manner, with each ALU forwarding its result along short wires to the consumers of the result.  ...  We describe our studies of program behavior and a preliminary evaluation that show that this architecture has the potential for both high clock speeds and high ILP, and may offer the best of both the VLIW  ...  Acknowledgements Many thanks to the anonymous reviewers and the CART group members for their feedback on early versions of this paper.  ... 
doi:10.1007/978-1-4757-3337-2_7 fatcat:yibv6xtijjdhfcdlqve62kqnmy


The ULTRAVIS system

Gunter Knittel
2000 2000 IEEE Symposium on Volume Visualization (VV 2000)  
The system was specifically designed for Pentium III CPUs, and makes extensive use of MMX and Streaming SIMD instructions.  ...  This paper describes architecture and implementation of the ULTRAVIS system, a pure software solution for versatile and fast volume rendering.  ...  Seen from the cache, the memory is organized as a set of consecutive pages, equal in size to the cache. The cache memory itself is organized in lines (32 bytes for the Pentium III).  ... 
doi:10.1109/vv.2000.10014 fatcat:agjoknbuwzet7pe6sqipxlcaey

The ULTRAVIS system

Gunter Knittel
2000 Proceedings of the 2000 IEEE symposium on Volume visualization - VVS '00  
The system was specifically designed for Pentium III CPUs, and makes extensive use of MMX and Streaming SIMD instructions.  ...  This paper describes architecture and implementation of the ULTRAVIS system, a pure software solution for versatile and fast volume rendering.  ...  Seen from the cache, the memory is organized as a set of consecutive pages, equal in size to the cache. The cache memory itself is organized in lines (32 bytes for the Pentium III).  ... 
doi:10.1145/353888.353901 dblp:conf/vvs/Knittel00 fatcat:px5f76pvtzcftltqouc637u7la


CUBA: an architecture for efficient CPU/co-processor data communication

Isaac Gelado, John H. Kelm, Shane Ryoo, Steven S. Lumetta, Nacho Navarro, Wen-mei W. Hwu
2008 Proceedings of the 22nd annual international conference on Supercomputing - ICS '08  
The mapping in CUBA preserves the original layout of the shared data structures hosted in the co-processor local memory.  ...  The mapping renders the data marshalling process unnecessary and reduces the need for code changes in order to use the co-processors.  ...  Patt, and Alex Ramirez for their insightful comments.  ... 
doi:10.1145/1375527.1375571 dblp:conf/ics/GeladoKRLNH08 fatcat:wfrgmqmrkrcfhaygva4wxkhqmy