Streaming Computations with Region-Based State on SIMD Architectures
[article]
2020
arXiv
pre-print
This work describes mechanisms to implement such computations efficiently on a SIMD-parallel architecture such as a GPU. ...
Finally, we study an implementation of our ideas as part of the MERCATOR system for irregular streaming computations on GPUs, investigating how the frequency of region boundaries in a stream impacts SIMD ...
Streaming Computations with Region-Based State on SIMD Architectures ...
arXiv:2006.07478v1
fatcat:vw6vzsvf4ndyje3xrwtk2rorby
Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration
[article]
2020
arXiv
pre-print
We propose Simultaneous Multi-mode Architecture (SMA), a novel architecture design and execution model that offers general-purpose programmability on DNN accelerators in order to accelerate end-to-end ...
The key to SMA is the temporal integration of the systolic execution model with the GPU-like SIMD execution model. ...
This dataflow is more SIMD-friendly and enables the seamless integration on a SIMD substrate. ...
arXiv:2002.08326v2
fatcat:3asj3sqruncz7czbtkxbookt2u
Stream-Dataflow Acceleration
2017
Proceedings of the 44th Annual International Symposium on Computer Architecture - ISCA '17
The dataflow component of this architecture enables high concurrency, and the stream component enables communication and coordination at very-low power and area overhead. ...
SIMD, GPGPUs) are insufficient, as evidenced by the order-of-magnitude improvements and industry adoption of application and domain-specific accelerators in important areas like machine learning, computer ...
Because the vector length is fixed and relatively short, short-vector SIMD processors constantly rely on the general purpose core to dynamically schedule parallel instructions. ...
doi:10.1145/3079856.3080255
dblp:conf/isca/NowatzkiGAS17
fatcat:xm36xv6cbfevveabvmpafgjtli
Stream-Dataflow Acceleration
2017
SIGARCH Computer Architecture News
The dataflow component of this architecture enables high concurrency, and the stream component enables communication and coordination at very-low power and area overhead. ...
SIMD, GPGPUs) are insufficient, as evidenced by the order-of-magnitude improvements and industry adoption of application and domain-specific accelerators in important areas like machine learning, computer ...
Because the vector length is fixed and relatively short, short-vector SIMD processors constantly rely on the general purpose core to dynamically schedule parallel instructions. ...
doi:10.1145/3140659.3080255
fatcat:g5spj35pyvh7jlr6i3qr5ertlq
A Comparative Study of Scheduling Techniques for Multimedia Applications on SIMD Pipelines
[article]
2015
arXiv
pre-print
Parallel architectures are essential in order to take advantage of the parallelism inherent in streaming applications. One particular branch of these employs hardware SIMD pipelines. ...
architecture design. ...
In particular, the target architecture we focus on is a generic architecture that employs a SIMD pipeline. ...
arXiv:1502.07447v1
fatcat:3xd2psbxprezrgd4eyp7rejyni
Exploring the potential of heterogeneous von neumann/dataflow execution models
2015
SIGARCH Computer Architecture News
However, even after decades of research, dataflow architectures have yet to come into prominence as a solution. ...
This paper makes the observation that if both out-of-order and explicit-dataflow were available in one processor, many types of GPP cores can benefit from dynamically switching during certain phases of ...
is that if very-high performance on irregular code is necessary, dataflow is not an alternative to building big OOOs. ...
doi:10.1145/2872887.2750380
fatcat:f7i5ox5p6vgq5eqd65isiyhe2a
Analyzing Behavior Specialized Acceleration
2016
ACM SIGOPS Operating Systems Review
Acknowledgments We thank Venkatraman Govindaraju for his help in creating the initial TDG models and validation for DySER and SIMD. ...
Offloaded instructions require two additional edges to enforce accelerator pipelining: one for the pipeline depth between computation instances, and one for in-order completion. ...
Here, nodes represent pipeline stages, and edges represent dependencies to enforce architectural constraints. ...
doi:10.1145/2954680.2872412
fatcat:66uy7l3ggbh6ze2mp33wtgmbtm
SPA-GCN: Efficient and Flexible GCN Accelerator with an Application for Graph Similarity Computation
[article]
2021
arXiv
pre-print
The architecture is specialized for dealing with many small graphs since the graph size has a significant impact on design considerations. ...
The unique characteristics of graphs, such as the irregular memory access and dynamic parallelism, impose several challenges when the algorithm is mapped to a CPU or GPU. ...
However, on the FPGA side, we can exploit a deep pipeline across the phases by enabling a dataflow architecture. ...
arXiv:2111.05936v1
fatcat:lu6lwxjatnfetpcrv3pdai44km
Exploring the potential of heterogeneous von neumann/dataflow execution models
2015
Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA '15
designs: A pessimistic view is that if very-high performance on irregular code is necessary, dataflow is not ...
Figure 4 shows our view on how different
dataflow-based squashing. ...
doi:10.1145/2749469.2750380
dblp:conf/isca/NowatzkiGS15
fatcat:hql7xymzgjch3jv4dk5mvbesji
Computational models and resource allocation for supercomputers
1989
Proceedings of the IEEE
There are several different architectures used in supercomputers, with differing computational models. These different models present a variety of resource allocation problems that must be solved. ...
Implementing the dataflow model of computation on a non-dataflow architecture requires careful handling of resources in order to control the possibility of too much parallelism, which could cause resources ...
Large- and medium-grain dataflow models [7], [18]-[20] take processes consisting of many operations and execute them in dataflow fashion.
4) Multiple SIMD Machines: Many newer supercomputers offer ...
doi:10.1109/5.48828
fatcat:vhlqy2v3trfwblcqpsol2zivpi
Efficient Spatial Processing Element Control via Triggered Instructions
2014
IEEE Micro
In a classic dataflow architecture, multiple pipeline stages are devoted to marshaling tokens, distributing tokens, and scoreboarding which instructions are ready. ...
This reduces scheduler implementation cost and removes the token-related pipeline stages. ...
Rachid Rayess is a silicon architecture engineer in the MMDC group at Intel. His research focuses on memory architecture and memory design automation. ...
doi:10.1109/mm.2014.14
fatcat:idejsg2kovdmhoune77bqhgi5m
Multiple-Morphs Adaptive Stream Architecture
2005
Journal of Computer Science and Technology
This paper presents the definition of regular stream and irregular stream, and then describes MASA (Multiple-morphs Adaptive Stream Architecture) prototype system which supports different execution models ...
In modern VLSI technology, hundreds of thousands of arithmetic units fit on a 1-cm² chip. The challenge is supplying them with instructions and data. ...
During the entire period of pipeline, the stream architecture performs 35 memory references as stated in Figure 4. ...
doi:10.1007/s11390-005-0635-7
fatcat:dov76qs23vhpdna536e2xapqmu
A Survey of Coarse-Grained Reconfigurable Architecture and Design
2019
ACM Computing Surveys
This article reviews the architecture and design of CGRAs thoroughly for the purpose of exploiting their full potential. First, a novel multidimensional taxonomy is proposed. ...
As general-purpose processors have hit the power wall and chip fabrication cost escalates alarmingly, coarse-grained reconfigurable architectures (CGRAs) are attracting increasing interest from both academia ...
This model can be implemented on the dynamic-scheduling dynamic dataflow execution model. ...
doi:10.1145/3357375
fatcat:pqi4d33i6bg45a6llswhwd44qi
Elastic pipeline
2011
Proceedings of the 8th ACM International Conference on Computing Frontiers - CF '11
Simulation results show that our proposed elastic pipeline together with the co-designed bank-conflict aware warp scheduling reduces the pipeline stalls by up to 64.0% (with 42.3% on average) and improves ...
Based on this observation, we investigate and propose a novel elastic pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput, by decoupling bank conflicts ...
Effect on Pipeline Stall Reduction: On the other hand, the number of pipeline stalls due to warp scheduling failures is increased for some kernels. ...
doi:10.1145/2016604.2016608
dblp:conf/cf/GouG11
fatcat:5y7cfhv6nbfdrigvoxw7t767dy
Triggered instructions
2013
Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13
over-serialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, which each require distinct hardware mechanisms in a traditional sequential architecture ...
These architectures are either purely systolic [16], statically map only one operation per ALU [12], or schedule operations onto the ALUs in strict dataflow order [4]. ...
These architectures rely on being able to transform control flow graphs into predicated dataflow graphs. ...
doi:10.1145/2485922.2485935
dblp:conf/isca/ParasharPAACLPZGJARME13
fatcat:2euggxike5evxoj3pptumxiu4e