Memory organization and data layout for instruction set extensions with architecturally visible storage
2009
Proceedings of the 2009 International Conference on Computer-Aided Design - ICCAD '09
Present application-specific embedded systems tend to choose instruction set extensions (ISEs) based on limitations imposed by the available data bandwidth to custom functional units (CFUs). ...
In this paper we propose a novel methodology for laying out data in memories, generating high-bandwidth memory systems by making use of existing low-bandwidth, low-cost ones and designing custom functional ...
EXTENSIONS AND FUTURE WORK: One potential area for future work is to generate memories and data layouts for a set of ISEs with conflicting access patterns that operate on the same data. ...
doi:10.1145/1687399.1687527
dblp:conf/iccad/AthanasopoulosBLI09
fatcat:zrq22had2nedhciqpxjbmoh6sy
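The record above mentions building high-bandwidth memory systems out of existing low-bandwidth ones. The paper's own layout method is not reproduced here. As a generic illustration only, the sketch below assumes plain low-order interleaving across B single-ported banks (element i lives in bank i mod B), which is a swapped-in textbook technique rather than the authors' approach. The helper names `bank_and_row` and `cycles_needed` are hypothetical.

```python
from collections import Counter


def bank_and_row(index: int, banks: int) -> tuple[int, int]:
    """Assumed low-order interleaving: element i lives in bank i % banks, row i // banks."""
    return index % banks, index // banks


def cycles_needed(indices: list[int], banks: int) -> int:
    """Cycles to fetch all indices when each single-ported bank serves one access per cycle."""
    per_bank = Counter(i % banks for i in indices)
    # The most heavily loaded bank determines how long the whole access takes.
    return max(per_bank.values()) if per_bank else 0
```

With 4 banks, `cycles_needed([8, 9, 10, 11], 4)` is 1 because the four elements sit in different banks, while the strided pattern `[0, 4, 8, 12]` needs 4 cycles because every element falls in bank 0. Conflicting access patterns of this kind are what a custom data layout tries to avoid.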
Column Scan Optimization by Increasing Intra-Instruction Parallelism
2018
Proceedings of the 7th International Conference on Data Science, Technology and Applications
To satisfy these requirements for analytical query workloads, in-memory column-store database systems are the state of the art. ...
For this reason, we investigated the optimization of a well-known scan technique using SIMD (Single Instruction Multiple Data) vectorization as well as using Field Programmable Gate Arrays (FPGA). ...
On the one hand, Single Instruction Multiple Data (SIMD) instruction set extensions such as Intel's SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) have been available in modern processors ...
doi:10.5220/0006897003440353
dblp:conf/data/LisaUHN0L18
fatcat:3cb7eu4phjd25p4y5crrwu4dfe
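The scan record above is about processing more column values per instruction. As a portable stand-in for SIMD registers (and not the specific scan technique evaluated in the paper), the sketch below packs eight 8-bit dictionary codes into one 64-bit word and tests all eight lanes against a search key with a few word-wide operations, an approach often called SWAR (SIMD within a register). The 8-bit code width, the lane order and the names `pack_codes`, `swar_equal` and `scan_equal` are assumptions for illustration.

```python
LANES = 8                       # assumed: eight 8-bit codes per 64-bit word
LOW7 = 0x7F7F7F7F7F7F7F7F       # low 7 bits of every byte lane
HIGH = 0x8080808080808080       # top bit of every byte lane
BROADCAST = 0x0101010101010101  # multiplying a byte by this copies it into all lanes


def pack_codes(codes: list[int]) -> int:
    """Pack up to eight 8-bit codes into one word; lane i occupies byte i."""
    word = 0
    for i, code in enumerate(codes):
        if not 0 <= code <= 0xFF:
            raise ValueError("codes must fit in 8 bits")
        word |= code << (8 * i)
    return word


def swar_equal(word: int, key: int) -> int:
    """Return a word with 0x80 set in exactly the byte lanes equal to key."""
    if not 0 <= key <= 0xFF:
        raise ValueError("key must fit in 8 bits")
    v = word ^ (key * BROADCAST)   # matching lanes become 0x00
    t = ((v & LOW7) + LOW7) | v    # top bit set iff lane is non-zero; no carry crosses lanes
    return ~(t | LOW7) & HIGH      # invert: top bit set iff the lane was zero


def scan_equal(column: list[int], key: int) -> list[int]:
    """Positions of all codes equal to key, testing eight lanes per word operation."""
    matches = []
    for base in range(0, len(column), LANES):
        chunk = column[base:base + LANES]
        hits = swar_equal(pack_codes(chunk), key)
        # Inspect only real lanes: zero padding in a short final chunk must not match key 0.
        for i in range(len(chunk)):
            if (hits >> (8 * i + 7)) & 1:
                matches.append(base + i)
    return matches
```

For `column = [3, 7, 3, 0, 255, 3, 1, 3, 9, 3]`, `scan_equal(column, 3)` returns `[0, 2, 5, 7, 9]`. SSE and AVX offer wider registers and direct byte-wise compare instructions, so the carry-avoiding trick is needed only in this scalar emulation.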
Application Design Considerations
[chapter]
2014
Optimizing HPC Applications with Intel® Cluster Tools
In Chapters 5 to 7 we reviewed the methods, tools, and techniques for application tuning, explained by using examples of HPC applications and benchmarks. ...
The blueprint analysis of platform capabilities and system-level tuning considerations were provided in Chapter 4, based on several system architecture metrics discussed in Chapter 2. ...
A data organization in memory that is beneficial for one computer architecture may end up not being the best for another. ...
doi:10.1007/978-1-4302-6497-2_8
fatcat:z2zifl6lo5hihkqdmzjnh4633u
MaxSim: A simulation platform for managed applications
2017
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
MaxSim can simulate managed workloads running on top of Maxine VM quickly and accurately, and its capabilities are showcased with novel simulation techniques for: 1) low-intrusive microarchitectural ...
Furthermore, we demonstrate a hardware/software co-designed optimization that performs dynamic load elimination for array length retrieval, achieving up to a 14% reduction in L1 data cache loads and up to 4% ...
The profiling is performed during memory access operations, and collected events are associated with triplets of an instruction pointer, a pointer tag, and a memory address offset. ...
doi:10.1109/ispass.2017.7975286
dblp:conf/ispass/RodchenkoKNPL17
fatcat:umwzxfynwzd5ne2wbm47dmtpeq
Evolution of the PowerPC architecture
1994
IEEE Micro
For compatibility with existing software, the developers retained POWER's basic instruction set, opcode assignments, and programming model. ... Some time ago, Apple, IBM, and Motorola decided to develop a common ...
the notion of superscalar operation in the instruction set architecture, improving the architecture as a target for compilers, reducing instruction path lengths, and including floating-point as a first-class ...
Acknowledgments We give special recognition to Cathy May, Ed Silha, and Hank Warren for their long hours of work on the PowerPC architecture. ...
doi:10.1109/40.272836
fatcat:nojrko6qsbdtvcna5gx6jrdfmy
Instruction fetch architectures and code layout optimizations
2001
Proceedings of the IEEE
We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described. ...
This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors to the more aggressive wide-issue superscalars. ...
ACKNOWLEDGMENT The authors also want to thank the reviewers for their insightful comments. ...
doi:10.1109/5.964440
fatcat:yp3a5e42wbfjtfkqsyfr5dkrcq
Introduction
[chapter]
2013
Computer Organization, Design, and Architecture, Fifth Edition
These machines had separate storage for data and instructions. ...
Current Harvard architectures do not use separate storage for data and instructions but have separate paths and buffers to access data and instructions simultaneously. ...
ASC organization, instruction set, assembly-language programming, and details of an assembler are provided along with an introduction to program linking and loading. ...
doi:10.1201/b16435-2
fatcat:s4xa2hmduncynfiarbb4trujvy
On the Design of a Register Queue Based Processor Architecture (FaRM-rq)
[chapter]
2003
Lecture Notes in Computer Science
The above processor, named Functional Assignment Register Microprocessor (FaRM-rq), supports a queue- and register-based instruction set architecture and functions in two modes: (1) R-mode (FRM), when switched for register-based instruction support, and (2) Q-mode (FQM), when switched for queue-based instruction support. ...
Data/Address Register Instructions: The instruction set is designed with four data registers (d0∼d3) and four address registers (a0∼a3). ...
doi:10.1007/3-540-37619-4_26
fatcat:2d2fibl3kbaz5enbx2g47dn3ii
Emerging Database Systems in Support of Scientific Data
[chapter]
2009
Scientific Data Management
The topics discussed in this chapter include the evolution of storage structures from the 1970s until now, data compression techniques, and query processing techniques for single- and multi-variable queries ...
This is followed by an example of using MonetDB for the SkyServer data, and the query processing improvements it offers. ...
Rather, the MonetDB architecture was based on other considerations given in the original Decomposition Storage Model (DSM) [CK85] paper, namely it focused on data storage layout and query algebra, with ...
doi:10.1201/9781420069815-c7
fatcat:ft3mckhzr5agfhopo6awmhwk7e
A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units
2014
SIAM Journal on Scientific Computing
Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. ...
SELL-C-sigma comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. ...
We are indebted to Intel Germany and Nvidia for providing test systems for benchmarking. ...
doi:10.1137/130930352
fatcat:4diqhkbvsfaylcaxypkphjjwdy
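The abstract names SELL-C-σ but does not spell out its layout. The sketch below follows the commonly described construction rather than the paper's code: rows are sorted by length only within windows of σ rows, grouped into chunks of C rows, padded to the longest row in each chunk, and stored column-wise inside the chunk so that C rows can be processed in C SIMD lanes. Padding entries (column 0, value 0.0), the list-of-pairs input and the names `build_sell_c_sigma` and `sell_spmv` are assumptions for illustration.

```python
from typing import List, Tuple

Row = List[Tuple[int, float]]  # (column index, value) pairs of one sparse row


def build_sell_c_sigma(rows: List[Row], C: int, sigma: int):
    """Build a SELL-C-sigma-style layout (sketch; padding uses column 0, value 0.0)."""
    n = len(rows)
    # 1) Sort rows by descending length, but only within windows of sigma rows.
    perm: List[int] = []
    for start in range(0, n, sigma):
        window = list(range(start, min(start + sigma, n)))
        window.sort(key=lambda r: len(rows[r]), reverse=True)
        perm.extend(window)
    chunk_ptr: List[int] = []  # offset of each chunk in col_idx/vals
    chunk_len: List[int] = []  # padded width of each chunk
    col_idx: List[int] = []
    vals: List[float] = []
    # 2) Cut the permuted rows into chunks of C rows, padded to the chunk's longest row.
    for c0 in range(0, n, C):
        chunk = perm[c0:c0 + C]
        width = max(len(rows[r]) for r in chunk)
        chunk_ptr.append(len(vals))
        chunk_len.append(width)
        # 3) Column-major inside the chunk: entry j of all C rows is contiguous.
        for j in range(width):
            for i in range(C):
                if i < len(chunk) and j < len(rows[chunk[i]]):
                    col, v = rows[chunk[i]][j]
                else:
                    col, v = 0, 0.0  # padding entry, contributes nothing
                col_idx.append(col)
                vals.append(v)
    return perm, chunk_ptr, chunk_len, col_idx, vals


def sell_spmv(fmt, x: List[float], C: int, n: int) -> List[float]:
    """y = A x for the layout above; results are scattered back through perm."""
    perm, chunk_ptr, chunk_len, col_idx, vals = fmt
    y = [0.0] * n
    for k, (ptr, width) in enumerate(zip(chunk_ptr, chunk_len)):
        for j in range(width):
            for i in range(C):  # the loop a vectorizing compiler would map to C lanes
                row = k * C + i
                if row < n:
                    idx = ptr + j * C + i
                    y[perm[row]] += vals[idx] * x[col_idx[idx]]
    return y
```

With C = 1 every chunk is a single unpadded row, so the layout reduces to row-wise CSR-like storage (in permuted row order when σ > 1). A larger σ typically reduces padding at the cost of a wider row permutation, which this sketch undoes by writing each result to `y[perm[row]]`.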
XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Network on RISC-V based IoT End Nodes
[article]
2020
arXiv
pre-print
By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we are able to show near-linear speedup with respect to higher precision integer computation on the key kernels for QNN computation ...
QNN convolution kernels on a parallel cluster implementing the proposed extension run 6× and 8× faster when considering 4- and 2-bit data operands, respectively, compared to a baseline processing cluster ...
Instruction Set Architecture (ISA). ...
arXiv:2011.14325v1
fatcat:tuawnqq5gngqneli5u2vzvmvem
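To make the nibble (4-bit) and crumb (2-bit) terminology concrete, the sketch below emulates in software what a packed dot product over one 32-bit word computes: 8 nibble lanes or 16 crumb lanes per word, versus 4 lanes for 8-bit data. It rests on stated assumptions (unsigned operands, lane 0 in the least significant bits, a 32-bit word) and does not reproduce the XpulpNN instruction encoding or semantics. The names `pack`, `unpack` and `packed_dot` are hypothetical.

```python
WORD_BITS = 32  # assumed register width


def _lanes_and_mask(bits: int) -> tuple[int, int]:
    if WORD_BITS % bits:
        raise ValueError("bits must divide the word width")
    return WORD_BITS // bits, (1 << bits) - 1


def pack(values: list[int], bits: int) -> int:
    """Pack unsigned values into one word; lane 0 sits in the least significant bits."""
    lanes, mask = _lanes_and_mask(bits)
    if len(values) != lanes:
        raise ValueError(f"expected {lanes} values for {bits}-bit lanes")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v <= mask:
            raise ValueError("value does not fit in a lane")
        word |= v << (i * bits)
    return word


def unpack(word: int, bits: int) -> list[int]:
    """Split a word back into its unsigned lanes."""
    lanes, mask = _lanes_and_mask(bits)
    return [(word >> (i * bits)) & mask for i in range(lanes)]


def packed_dot(a: int, b: int, bits: int) -> int:
    """Sum of lane-wise products: what one packed dot-product step would accumulate."""
    return sum(x * y for x, y in zip(unpack(a, bits), unpack(b, bits)))
```

`packed_dot(pack([1, 2, 3, 4, 5, 6, 7, 8], 4), pack([1] * 8, 4), 4)` gives 36. With 2-bit operands a single word covers 16 multiply-accumulates, which is why narrower operands can raise the work done per instruction when the hardware provides such lanes.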
A Technology-Scalable Architecture for Fast Clocks and High ILP
[chapter]
2001
Interaction between Compilers and Computer Architectures
For the mapped window of execution, instructions execute in a dataflow-like manner, with each ALU forwarding its result along short wires to the consumers of the result. ...
We describe our studies of program behavior and a preliminary evaluation that show that this architecture has the potential for both high clock speeds and high ILP, and may offer the best of both the VLIW ...
Acknowledgements Many thanks to the anonymous reviewers and the CART group members for their feedback on early versions of this paper. ...
doi:10.1007/978-1-4757-3337-2_7
fatcat:yibv6xtijjdhfcdlqve62kqnmy
The ULTRAVIS System
2000
2000 IEEE Symposium on Volume Visualization (VV 2000)
The system was specifically designed for Pentium III CPUs, and makes extensive use of MMX and Streaming SIMD instructions. ...
This paper describes architecture and implementation of the ULTRAVIS system, a pure software solution for versatile and fast volume rendering. ...
Seen from the cache, the memory is organized as a set of consecutive pages, equal in size to the cache. The cache memory itself is organized in lines (32 bytes for the Pentium III). ...
doi:10.1109/vv.2000.10014
fatcat:agjoknbuwzet7pe6sqipxlcaey
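The ULTRAVIS snippet describes memory, as seen from the cache, as consecutive pages the size of the cache, each divided into 32-byte lines. The sketch below splits a byte address along that view. The 16 KiB cache size in the example is an assumption chosen for illustration, and the model ignores associativity, so it is a simplification rather than a description of the actual Pentium III cache. The name `decompose` is hypothetical.

```python
LINE_SIZE = 32  # bytes per cache line on the Pentium III, as stated in the snippet


def decompose(address: int, cache_size: int) -> tuple[int, int, int]:
    """Split a byte address into (page, line, offset), where one page is cache_size bytes."""
    if cache_size <= 0 or cache_size % LINE_SIZE:
        raise ValueError("cache_size must be a positive multiple of the line size")
    page, within_page = divmod(address, cache_size)
    line, offset = divmod(within_page, LINE_SIZE)
    return page, line, offset
```

`decompose(0x4A27, 16 * 1024)` returns `(1, 81, 7)`: page 1, line 81, byte 7. In this simplified view, addresses on different pages with the same line index compete for the same cache location, so data laid out with a stride of exactly the cache size can keep evicting itself.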
The ULTRAVIS system
2000
Proceedings of the 2000 IEEE symposium on Volume visualization - VVS '00
The system was specifically designed for Pentium III CPUs, and makes extensive use of MMX and Streaming SIMD instructions. ...
This paper describes architecture and implementation of the ULTRAVIS system, a pure software solution for versatile and fast volume rendering. ...
Seen from the cache, the memory is organized as a set of consecutive pages, equal in size to the cache. The cache memory itself is organized in lines (32 bytes for the Pentium III). ...
doi:10.1145/353888.353901
dblp:conf/vvs/Knittel00
fatcat:px5f76pvtzcftltqouc637u7la
The mapping in CUBA preserves the original layout of the shared data structures hosted in the co-processor local memory. ...
The mapping renders the data marshalling process unnecessary and reduces the need for code changes in order to use the co-processors. ...
Patt, and Alex Ramirez for their insightful comments. ...
doi:10.1145/1375527.1375571
dblp:conf/ics/GeladoKRLNH08
fatcat:wfrgmqmrkrcfhaygva4wxkhqmy
Showing results 1 — 15 out of 6,706 results