
Batch Codes through Dense Graphs without Short Cycles [article]

Alexandros G. Dimakis, Anna Gal, Ankit Singh Rawat, Zhao Song
2014 arXiv   pre-print
Our main technical innovation is a graph-theoretic method of designing multiset batch codes using dense bipartite graphs with no small cycles.  ...  We modify prior graph constructions of dense, high-girth graphs to obtain our batch code results. We achieve close to optimal tradeoffs between the parameters for bipartite graph based batch codes.  ...  Techniques: Our main technical innovation is a graph-theoretic method of designing multiset batch codes using dense bipartite graphs with no short cycles.  ... 
arXiv:1410.2920v1 fatcat:ggrixfxvnzarbhavhwv2a2hqau

Blind Neural Belief Propagation Decoder for Linear Block Codes

Guillaume Larue, Louis-Adrien Dufrene, Quentin Lampin, Paul Chollet, Hadi Ghauch, Ghaya Rekaya
2021 2021 Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit)  
codes.  ...  of the coding scheme used by the encoder.  ...  without knowledge of the code.  ... 
doi:10.1109/eucnc/6gsummit51104.2021.9482479 fatcat:4macrykp4zbqvoka34vxqce3z4

Enforcing Deadlines for Skeleton-based Parallel Programming

Paul Metzger, Murray Cole, Christian Fensch, Marco Aldinucci, Enrico Bini
2020 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS)  
Directed Acyclic Graphs (DAGs) are a popular and very general application model that can capture any possible interaction among threads.  ...  Our experimental results demonstrate that batching reduces the minimum sustainable period by up to 22%, leading to a reduced number of required cores.  ...  The execution time share spent in communication decreases with higher application code WCETs and so maximum achievable improvements through batching decrease. C.  ... 
doi:10.1109/rtas48715.2020.000-7 dblp:conf/rtas/MetzgerCFAB20 fatcat:sykdkapyxjd37ert2tgpjtuh6e

Deep Learning with Dynamic Computation Graphs [article]

Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, Peter Norvig
2017 arXiv   pre-print
We introduce a technique called dynamic batching, which not only batches together operations between different input graphs of dissimilar shape, but also between different nodes within a single input graph  ...  However, since the computation graph has a different shape and size for every input, such networks do not directly support batched training or inference.  ...  The long downward arrows are the pass-throughs.  ... 
arXiv:1702.02181v2 fatcat:wlhqjcgoofgehpkwa626q3gyoa

Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks [article]

Charith Mendis, Alex Renda, Saman Amarasinghe, Michael Carbin
2019 arXiv   pre-print
We show that Ithemal is more accurate than state-of-the-art hand-written tools currently used in compiler backends and static machine code analyzers.  ...  Predicting the number of clock cycles a processor takes to execute a block of assembly instructions in steady state (the throughput) is important for both compiler designers and performance engineers.  ...  Systems such as STOKE and LLVM need to search through many code blocks before emitting the faster versions of a given instruction sequence.  ... 
arXiv:1808.07412v2 fatcat:rpebyqwf5jdf5nqvzfypmkotzi

Whale: Efficient Giant Model Training over Heterogeneous GPUs [article]

Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, Wei Lin
2022 arXiv   pre-print
The Whale runtime utilizes those annotations and performs graph optimizations to transform a local deep learning DAG graph for distributed multi-GPU execution.  ...  If we scale the dense 10B model to the dense 10T model linearly without considering overhead, we need at least 256,000 NVIDIA V100 GPUs.  ...  To assist the analysis of the user model without modifying the user code, Whale inspects and overwrites TensorFlow built-in functions to capture augmented information.  ... 
arXiv:2011.09208v3 fatcat:zetefqb6o5htlhgp7gajhubxuy

revised submission [article]

XX
2021 Zenodo  
cannot be detected during the code generation phase.  ...  Performance measured on a cycle-accurate simulator. (b) The density of the graphs measured by the ratio of the number of edges to the maximum possible edges.  ... 
doi:10.5281/zenodo.5758581 fatcat:6wjhp7qelbhpbmfjcjoi4yjgzy

revised submission [article]

XXX
2021 Zenodo  
a relatively short time.  ...  cannot be detected during the code generation phase.  ... 
doi:10.5281/zenodo.5758517 fatcat:ie7ga7f4krdprdizkdl4ekaqya

A Configurable Cloud-Scale DNN Processor for Real-Time AI

Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel (+8 others)
2018 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)  
., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI.  ...  The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1.  ...  Sub-graphs of a large DNN model can be encoded through atomic instruction chains (without named storage between instructions) that efficiently capture explicit communication between graph edges, simplifying  ... 
doi:10.1109/isca.2018.00012 dblp:conf/isca/FowersOPMLLAHAG18 fatcat:qalwazqx7jcctkndjrqmhszccq

Kimera-Multi: a System for Distributed Multi-Robot Metric-Semantic Simultaneous Localization and Mapping [article]

Yun Chang, Yulun Tian, Jonathan P. How, Luca Carlone
2020 arXiv   pre-print
We present the first fully distributed multi-robot system for dense metric-semantic Simultaneous Localization and Mapping (SLAM).  ...  Then, when two robots are within communication range, they initiate a distributed place recognition and robust pose graph optimization protocol with a novel incremental maximum clique outlier rejection  ...  ; deformation graphs are typically used for 3D animations, where one wants to animate a 3D object while ensuring it moves smoothly and without artifacts [52].  ... 
arXiv:2011.04087v1 fatcat:xh2rxeka4fgbvj3xpwxsarlbie

PEFP: Efficient k-hop Constrained s-t Simple Path Enumeration on FPGA [article]

Zhengmin Lai, You Peng, Shiyu Yang, Xuemin Lin, Wenjie Zhang
2021 arXiv   pre-print
On the FPGA side in PEFP, we propose a novel DFS-based batching technique to save on-chip memory efficiently.  ...  Moreover, using hardware devices like FPGA to accelerate graph computation has become popular.  ...  In addition, the acceleration ratio of Baidu is less competitive than Skitter's, suggesting that PEFP tends to have a greater speedup in sparse graphs than in dense graphs.  ... 
arXiv:2012.11128v2 fatcat:wlwbhd7d6fdrpjoc4g3tklmq5e

Capstan: A Vector RDA for Sparsity [article]

Alexander Rucker, Matthew Vilim, Tian Zhao, Yaqi Zhang, Raghu Prabhakar, Kunle Olukotun
2021 arXiv   pre-print
For sparse applications that can be mapped to Plasticine, a recent dense RDA, Capstan is 7.6x to 365x faster and only 13% larger.  ...  This paper proposes Capstan: a scalable, parallel-patterns-based, reconfigurable-dataflow accelerator (RDA) for sparse and dense tensor applications.  ...  The allocator's decisions travel through the pipeline over multiple cycles: the decision made for crossbar traversal in cycle n will control reads in cycle n + 1 and writes and the output crossbar in cycle  ... 
arXiv:2104.12760v1 fatcat:k7s6dsgikvgixcip2xyrd7eriu

SAHA: A String Adaptive Hash Table for Analytical Databases

Tianqi Zheng, Zhibin Zhang, Xueqi Cheng
2020 Applied Sciences  
values for long strings only; (2) it uses special memory loading techniques to do quick dispatching and hashing computations; and (3) it utilizes vectorized processing to batch hashing operations.  ...  We designed a new hash table, SAHA, which is tightly integrated with modern analytical databases and optimized for string data with the following advantages: (1) it inlines short strings and saves hash  ...  Acknowledgments: We thank the Yandex ClickHouse team for reviewing the SAHA code and helping merge it to the ClickHouse code base.  ... 
doi:10.3390/app10061915 fatcat:7yw3swcdnvaazpasfohabzzbbi

Efficient conditional operations for data-parallel architectures

Ujval J. Kapasi, William J. Dally, Scott Rixner, Peter R. Mattson, John D. Owens, Brucek Khailany
2000 Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture - MICRO 33  
The largest batch size TRADITIONAL can process without spilling any intermediate streams to memory is 40 triangles.  ...  On almost all vector processors, the only way to compress or expand a vector is through gather/scatter instructions that cycle data through memory.  ... 
doi:10.1145/360128.360145 fatcat:3ey64huyd5dhvhuart3iwypit4

Distributed subgraph matching on timely dataflow

Longbin Lai, Wenjie Zhang, Ying Zhang, Zhengping Qian, Jingren Zhou, Zhu Qing, Zhengyi Yang, Xin Jin, Zhengmin Lai, Ran Wang, Kongzhang Hao, Xuemin Lin (+1 others)
2019 Proceedings of the VLDB Endowment  
namely "without Batching", "without TrIndexing" and "without Compression".  ...  TrIndexing precomputes and indexes the triangles (3-cycles) of the graph to facilitate pruning.  ... 
doi:10.14778/3339490.3339494 fatcat:ln7xhyl5h5g2fa6pjbpe5gfseu