278 Hits in 3.1 sec

Streaming Computations with Region-Based State on SIMD Architectures [article]

Stephen Timcheck, Jeremy Buhler
2020 arXiv   pre-print
This work describes mechanisms to implement such computations efficiently on a SIMD-parallel architecture such as a GPU.  ...  Finally, we study an implementation of our ideas as part of the MERCATOR system for irregular streaming computations on GPUs, investigating how the frequency of region boundaries in a stream impacts SIMD  ...  Streaming Computations with Region-Based State on SIMD Architectures  ... 
arXiv:2006.07478v1 fatcat:vw6vzsvf4ndyje3xrwtk2rorby

Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration [article]

Cong Guo, Yangjie Zhou, Jingwen Leng, Yuhao Zhu, Zidong Du, Quan Chen, Chao Li, Bin Yao, Minyi Guo
2020 arXiv   pre-print
We propose Simultaneous Multi-mode Architecture (SMA), a novel architecture design and execution model that offers general-purpose programmability on DNN accelerators in order to accelerate end-to-end  ...  The key to SMA is the temporal integration of the systolic execution model with the GPU-like SIMD execution model.  ...  This dataflow is more SIMD-friendly and enables the seamless integration on a SIMD substrate.  ... 
arXiv:2002.08326v2 fatcat:3asj3sqruncz7czbtkxbookt2u

Stream-Dataflow Acceleration

Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, Karthikeyan Sankaralingam
2017 Proceedings of the 44th Annual International Symposium on Computer Architecture - ISCA '17  
The dataflow component of this architecture enables high concurrency, and the stream component enables communication and coordination at very-low power and area overhead.  ...  SIMD, GPGPUs) are insufficient, as evidenced by the order-of-magnitude improvements and industry adoption of application and domain-specific accelerators in important areas like machine learning, computer  ...  Because the vector length is fixed and relatively short, short-vector SIMD processors constantly rely on the general purpose core to dynamically schedule parallel instructions.  ... 
doi:10.1145/3079856.3080255 dblp:conf/isca/NowatzkiGAS17 fatcat:xm36xv6cbfevveabvmpafgjtli

Stream-Dataflow Acceleration

Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, Karthikeyan Sankaralingam
2017 SIGARCH Computer Architecture News  
The dataflow component of this architecture enables high concurrency, and the stream component enables communication and coordination at very-low power and area overhead.  ...  SIMD, GPGPUs) are insufficient, as evidenced by the order-of-magnitude improvements and industry adoption of application and domain-specific accelerators in important areas like machine learning, computer  ...  Because the vector length is fixed and relatively short, short-vector SIMD processors constantly rely on the general purpose core to dynamically schedule parallel instructions.  ... 
doi:10.1145/3140659.3080255 fatcat:g5spj35pyvh7jlr6i3qr5ertlq

A Comparative Study of Scheduling Techniques for Multimedia Applications on SIMD Pipelines [article]

Mehmet Ali Arslan, Flavius Gruian, Krzysztof Kuchcinski
2015 arXiv   pre-print
Parallel architectures are essential in order to take advantage of the parallelism inherent in streaming applications. One particular branch of these employ hardware SIMD pipelines.  ...  architecture design.  ...  In particular, the target architecture we focus on is a generic architecture that employs a SIMD pipeline.  ... 
arXiv:1502.07447v1 fatcat:3xd2psbxprezrgd4eyp7rejyni

Exploring the potential of heterogeneous von neumann/dataflow execution models

Tony Nowatzki, Vinay Gangadhar, Karthikeyan Sankaralingam
2015 SIGARCH Computer Architecture News  
However, even after decades of research, dataflow architectures have yet to come into prominence as a solution.  ...  This paper makes the observation that if both out-of-order and explicit-dataflow were available in one processor, many types of GPP cores can benefit from dynamically switching during certain phases of  ...  is that if very-high performance on irregular code is necessary, dataflow is not an alternative to building big OOOs.  ... 
doi:10.1145/2872887.2750380 fatcat:f7i5ox5p6vgq5eqd65isiyhe2a

Analyzing Behavior Specialized Acceleration

Tony Nowatzki, Karthikeyan Sankaralingam
2016 ACM SIGOPS Operating Systems Review  
Acknowledgments We thank Venkatraman Govindaraju for his help in creating the initial TDG models and validation for DySER and SIMD.  ...  Offloaded instructions require two additional edges to enforce accelerator pipelining: one for the pipeline depth between computation instances, and one for in-order completion.  ...  Here, nodes represent pipeline stages, and edges represent dependencies to enforce architectural constraints.  ... 
doi:10.1145/2954680.2872412 fatcat:66uy7l3ggbh6ze2mp33wtgmbtm

SPA-GCN: Efficient and Flexible GCN Accelerator with an Application for Graph Similarity Computation [article]

Atefeh Sohrabizadeh, Yuze Chi, Jason Cong
2021 arXiv   pre-print
The architecture is specialized for dealing with many small graphs since the graph size has a significant impact on design considerations.  ...  The unique characteristics of graphs, such as the irregular memory access and dynamic parallelism, impose several challenges when the algorithm is mapped to a CPU or GPU.  ...  However, on the FPGA side, we can exploit a deep pipeline across the phases by enabling a dataflow architecture.  ... 
arXiv:2111.05936v1 fatcat:lu6lwxjatnfetpcrv3pdai44km

Exploring the potential of heterogeneous von neumann/dataflow execution models

Tony Nowatzki, Vinay Gangadhar, Karthikeyan Sankaralingam
2015 Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA '15  
designs: A pessimistic view is that if very-high performance on irregular code is necessary, dataflow is not  ...  Figure 4 shows our view on how different dataflow-based squashing.  ... 
doi:10.1145/2749469.2750380 dblp:conf/isca/NowatzkiGS15 fatcat:hql7xymzgjch3jv4dk5mvbesji

Computational models and resource allocation for supercomputers

J. Mauney, D.P. Agrawal, Y.K. Choe, E.A. Harcourt, S. Kim, W.J. Staats
1989 Proceedings of the IEEE  
There are several different architectures used in supercomputers, with differing computational models. These different models present a variety of resource allocation problems that must be solved.  ...  Implementing the dataflow model of computation on a non-dataflow architecture requires careful handling of resources in order to control the possibility of too much parallelism, which could cause resources  ...  Large- and medium-grain dataflow models [7], [18]-[20] take processes consisting of many operations and execute them in dataflow fashion. 4) Multiple SIMD Machines: Many newer supercomputers offer  ... 
doi:10.1109/5.48828 fatcat:vhlqy2v3trfwblcqpsol2zivpi

Efficient Spatial Processing Element Control via Triggered Instructions

Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess (+2 others)
2014 IEEE Micro  
In a classic dataflow architecture, multiple pipeline stages are devoted to marshaling tokens, distributing tokens, and scoreboarding which instructions are ready.  ...  This reduces scheduler implementation cost and removes the token-related pipeline stages.  ...  Rachid Rayess is a silicon architecture engineer in the MMDC group at Intel. His research focuses on memory architecture and memory design automation.  ... 
doi:10.1109/mm.2014.14 fatcat:idejsg2kovdmhoune77bqhgi5m

Multiple-Morphs Adaptive Stream Architecture

Mei Wen, Nan Wu, Hai-Yan Li, Chun-Yuan Zhang
2005 Journal of Computer Science and Technology  
This paper presents the definition of regular stream and irregular stream, and then describes MASA (Multiple-morphs Adaptive Stream Architecture) prototype system which supports different execution models  ...  In modern VLSI technology, hundreds of thousands of arithmetic units fit on a 1-cm² chip. The challenge is supplying them with instructions and data.  ...  During the entire period of pipeline, the stream architecture performs 35 memory references as stated in Figure 4.  ... 
doi:10.1007/s11390-005-0635-7 fatcat:dov76qs23vhpdna536e2xapqmu

A Survey of Coarse-Grained Reconfigurable Architecture and Design

Leibo Liu, Jianfeng Zhu, Zhaoshi Li, Yanan Lu, Yangdong Deng, Jie Han, Shouyi Yin, Shaojun Wei
2019 ACM Computing Surveys  
This article reviews the architecture and design of CGRAs thoroughly for the purpose of exploiting their full potential. First, a novel multidimensional taxonomy is proposed.  ...  As general-purpose processors have hit the power wall and chip fabrication cost escalates alarmingly, coarse-grained reconfigurable architectures (CGRAs) are attracting increasing interest from both academia  ...  This model can be implemented on the dynamic-scheduling dynamic dataflow execution model.  ... 
doi:10.1145/3357375 fatcat:pqi4d33i6bg45a6llswhwd44qi

Elastic pipeline

Chunyang Gou, Georgi N. Gaydadjiev
2011 Proceedings of the 8th ACM International Conference on Computing Frontiers - CF '11  
Simulation results show that our proposed elastic pipeline together with the co-designed bank-conflict aware warp scheduling reduces the pipeline stalls by up to 64.0% (with 42.3% on average) and improves  ...  Based on this observation, we investigate and propose a novel elastic pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput, by decoupling bank conflicts  ...  Effect on Pipeline Stall Reduction On the other hand, the number of pipeline stalls due to warp scheduling failures are increased for some kernels.  ... 
doi:10.1145/2016604.2016608 dblp:conf/cf/GouG11 fatcat:5y7cfhv6nbfdrigvoxw7t767dy

Triggered instructions

Angshuman Parashar, Aamer Jaleel, Randy Allmon, Rachid Rayess, Stephen Maresh, Joel Emer, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov (+2 others)
2013 Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13  
over-serialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, which each require distinct hardware mechanisms in a traditional sequential architecture  ...  These architectures are either purely systolic [16] , statically map only one operation per ALU [12] , or schedule operations onto the ALUs in strict dataflow order [4] .  ...  These architectures rely on being able to transform control flow graphs into predicated dataflow graphs.  ... 
doi:10.1145/2485922.2485935 dblp:conf/isca/ParasharPAACLPZGJARME13 fatcat:2euggxike5evxoj3pptumxiu4e